Failover support #232
Comments
Just linking here the issue I created some time ago that has some insights: #147. I guess we should close one or the other as they seem like duplicates of each other |
@mswiderski I wanted a place to specifically discuss failover |
Many years ago all of our EAPs were deployed as 2-node clusters, without a load balancer in front. Here is a description of the approach we took, which hopefully will help the discussion. Let's say that for the remote service we have 3 urls:

This is an important property: we always keep the nok instances in the list as a last resort. A nok instance is considered good again if:
Continuing from the previous example, we had: initial list=2-3-1, ok=3-1 and nok=2, and final list=3-1-2. An important aspect of the whole mechanism is recognizing when you can retry, and when you must not. This actually depends on the protocol used. For REST, we used the idempotency property of the HTTP verbs defined in the spec. For EJB/HTTP, since all calls go through a POST, we created a dedicated

In addition to these protocol-specific conditions, we would recognize certain situations where we knew for sure that the call did not happen (including exceptions specific to libs we were using underneath):
We would retry also on

The fact that client side load balancing handled a number of retries automatically relieved the developers from having to deal with this at the app resilience level (e.g. MP Fault Tolerance). So for instance developers did not have to add

They knew that the obvious errors would be handled underneath, and they needed to add fault tolerance only for the extra situations that needed some app knowledge. The other benefit was that they did not have to know how many retries to do to make sure they touched all nodes in the cluster, which they could not know at the app level. The client side load balancing knows how many instances there are. So in a 3-node cluster, there is no need to retry an idempotent method 50 times; 3 is enough. Similarly, if we know we have a 3-node cluster, it is worth trying 3 times rather than 2. The application may still need/want to do some extra retries if it suspects the issue is not with the endpoint itself but with one of its dependencies (e.g. a db), but that is a different level of retry. Part of this design has been influenced by the retry mechanism in Apache HTTP client:

The algorithm has been running for roughly 10 years, and started way before there were resiliency libs available. It provided us with a lot of stability, even through crashes, restarts and redeploys. As we are moving to k8s, the need for client side load balancing will probably diminish, but we still have to deal with our legacy infrastructure, so this component is going to stay critical for many more years I suspect. I think Stork is very well positioned for us considering our situation and the investment we are making in Quarkus. I am very excited about using it. Happy to discuss it further. |
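A minimal sketch of the ok/nok list reordering described in the comment above, assuming instances are identified by simple keys; the class and method names are hypothetical and not the proprietary implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Keeps known-bad ("nok") instances at the end of the candidate list,
// so they are only tried as a last resort.
class FailoverList<T> {
    private final List<T> ok = new ArrayList<>();
    private final List<T> nok = new ArrayList<>();

    FailoverList(List<T> initial) {
        ok.addAll(initial);
    }

    // e.g. initial=[2,3,1], markBad(2) -> candidates()=[3,1,2]
    synchronized void markBad(T instance) {
        if (ok.remove(instance)) {
            nok.add(instance);
        }
    }

    // A nok instance becomes good again, e.g. after a successful call or probe.
    synchronized void markGood(T instance) {
        if (nok.remove(instance)) {
            ok.add(instance);
        }
    }

    // Full ordered list: good instances first, bad ones as a last resort.
    synchronized List<T> candidates() {
        List<T> result = new ArrayList<>(ok);
        result.addAll(nok);
        return result;
    }
}
```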
This is the culprit from my point of view. The load balancer should automatically handle connectivity errors based on the state of the instances it is aware of, and only propagate errors that are not related to connectivity, as it cannot figure out on its own whether a particular error should be retried or not. |
We have to recognize that awareness is a weak property here: there are instances that you think are good, but are not, and some instances that you think are bad, but are not.
It can actually, in some cases. The obvious example is idempotent HTTP verbs in REST, for instance in Apache HttpClient the retryRequest() method relying on the isIdempotent() definition. In some cases it may not be appropriate for a particular protocol; one could for instance argue that a

In a flexible design, there should be a retry strategy interface (similar in spirit to HttpRequestRetryStrategy) with a default implementation. Some context on the type of call we are making should be passed to this strategy (at the very least the HTTP verb, but ideally some reference to the Java business method invoked on the proxy, if any, so that we can look at the annotations), so that the custom retry strategy can make intelligent decisions. |
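A hedged sketch of such a retry strategy contract, with hypothetical names loosely inspired by Apache HttpClient's HttpRequestRetryStrategy; this is not an actual Stork API:

```java
import java.lang.reflect.Method;

// Hypothetical context describing the call for which a retry decision is needed.
class CallContext {
    String httpVerb;          // e.g. GET, POST
    Method businessMethod;    // the proxied Java method, if any (to inspect annotations)
    Throwable failure;        // the error that triggered the retry decision
    int executionCount;       // how many attempts have been made so far
}

// Hypothetical strategy interface: applications plug in protocol-specific rules.
interface RetryStrategy {
    boolean shouldRetry(CallContext context);
}

// A possible default implementation: retry only idempotent HTTP verbs,
// as the HTTP spec defines them.
class IdempotentVerbRetryStrategy implements RetryStrategy {
    @Override
    public boolean shouldRetry(CallContext context) {
        return "GET".equals(context.httpVerb)
                || "HEAD".equals(context.httpVerb)
                || "PUT".equals(context.httpVerb)
                || "DELETE".equals(context.httpVerb);
    }
}
```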
Agreed, but there must be something to start with, and I believe relying on connection-related errors is a good start; it certainly does not have to end there.
Alright, it can find certain things to rely on, but again (and as you said later) this can be hidden behind various types of retry strategy implementations, with a reasonable default if needed. |
Oh yes, sure. gRPC also added support for idempotent operations a few years ago using a method-level idempotency option, as discussed here:

Calls that are marked as idempotent may be sent multiple times.

This means that a client side load balancing (and failover) library would be well inclined to retry idempotent calls if it thought the call might succeed on other service instances. If an API designer specified that an operation was idempotent, a client user should not have to re-state that the call should be retried at the Fault Tolerance level. |
for me, it's outside of the scope of Stork. Why do you think it should be handled by the load balancer? |
Mainly because the load balancer is completely hidden from client code. In my case I use MP Rest Client (with the Quarkus RESTEasy Reactive client), so I have no access to the load balancer, even though I have configured multiple endpoints to make sure they will be tried in case of failures related to connectivity. |
But there's also the client library, which knows the concurrency model better (knows which thread/event loop should handle a retry), knows which operations are idempotent, etc. @vsevel can the retry mechanism you described also be implemented as follows:
The difference between this and iterating over a list of all service instances is that it will take into account failures from many calls, not only the current one. The situation of having many instances down is unlikely; they would probably be removed on the next refresh of service discovery. Creating an ordered list of service instances may be inefficient, depending on the load balancing strategy. |
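A rough sketch of the failure-aware selection being discussed: failures reported by earlier calls temporarily steer selection away from an instance. The types and names here are illustrative assumptions, not Stork's actual API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Remembers recent failures across calls and avoids recently failed instances,
// falling back to them only if nothing else is available.
class FailureAwareSelector {
    private final Map<String, Instant> lastFailure = new ConcurrentHashMap<>();
    private final Duration blacklistPeriod = Duration.ofSeconds(30); // assumed value

    // Assumes a non-empty list of candidate instances.
    String select(List<String> instances) {
        Instant now = Instant.now();
        for (String instance : instances) {
            Instant failedAt = lastFailure.get(instance);
            if (failedAt == null || failedAt.plus(blacklistPeriod).isBefore(now)) {
                return instance;
            }
        }
        // All instances failed recently: last resort, return the first one anyway.
        return instances.get(0);
    }

    // Called by the client when a call fails with a connectivity-type error.
    void reportFailure(String instance) {
        lastFailure.put(instance, Instant.now());
    }
}
```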
Part of the discussion is deciding if your load balancer is acting as an intermediate between the client and the service (that is the traditional definition), or as a name resolver (you give me a name and I give you an address you can call). The proprietary implementation I did was closer to a client side load balancer (with no health checks), since it would serve as an intermediate between the client and the service, and was able to make informed failover decisions, because it was topology aware and used client-provided strategies to help figure out retriability for different protocols. It is somewhat similar to what the Apache HTTP client is doing with HttpRequestRetryExec, but limited to a single host. In my case I took what they were doing and applied it to a situation where the retries could be done on different hosts, rather than on the same one. I looked at Ribbon. It seems to be what I would call a client side load balancer, since it handles the call and supports retry options; check DefaultLoadBalancerRetryHandler and the different retry options such as

I am not a

I am wondering if Stork is closer to Consul from that perspective, where the name you resolve will point to a real service that is supposed to be up, according to Consul's health checks. But if that is not the case, it is the responsibility of the client to ask for a new address and retry. The issue with that approach is that each client (e.g. grpc, rest client) needs to implement auto retry on common situations (e.g.

I am not saying this is wrong. However it must be understood that the retriability needs to be moved upward into the clients (e.g. rest client, grpc). So for instance for grpc, some code will have to figure out that the first try failed with a retriable error, in a situation where a second call may actually pass (it would have to know that a

As stated previously, an easy way to do this is to push that back to the application developer by asking them to use MP Fault Tolerance for this. But that is a burden they should not have to endure, since we know that some errors are always retriable, and some of the protocols have already defined rules about retriability (e.g. idempotent verbs for REST, the idempotency annotation for gRPC). So the developer should not have to re-state it. So where does that leave us? With the current approach, I guess you need to make sure all client stacks implement auto-retry, with some of the logic being duplicated (e.g. reacting to connection errors). In my experience, we took a different approach where this concern was handled centrally. |
When we wrote our client side load balancing component, we also wrote a dedicated test suite to validate the behavior.
After the backend had been stopped, we would restart it, make sure load balancing had resumed normally on both server nodes, and play another incident. The client side would make calls on 2 types of endpoints:
The client application was not using any resilience lib (e.g.

We would let the test run for tens of minutes (hundreds of thousands of calls), and assess at the end the following conditions for success. For crash situations:
For graceful shutdown situations:
Watching the types of errors during the tests (for instance the errors the client would receive while the backend was restarting) allowed us to categorize them as either retriable or non-retriable, depending on the protocol, and to adjust the failure conditions on the client side for the automatic retries. Eventually we got complete coverage of all possible errors. I finished integrating Stork in my pilot application. Please let me know how you see Stork going forward. |
Interesting reading from the gRPC project: Transparent Retries. gRPC already distinguishes retries that it will do automatically (aka transparent retries) from retries that are governed by the retry policy. Three types of failures are defined:
In the first case, the call is retried multiple times until the deadline passes. So for gRPC, this may already work well. I looked at the RESTEasy Reactive |
I had my first experience with Stork via Quarkus, and it was a disappointment for me to get a ProcessingException when I shut down one of the services I was calling. I can accept my load balancer returning an exception if no instance of the service I'm calling is available; otherwise, my expectation is that it points me to an accessible instance. I think this expectation is fully in line with the nature of load balancing. |
@hakdogan while ideal, there is no way to be sure that the selected instance is healthy. Your service discovery may or may not use health data (Eureka does, DNS does not), but even with that, the state can change between the selection and the call. Also note that the health check may not provide the right data (an instance returning ready because the server is up, while some of its requirements are actually not there). One of the patterns I recommend, when retry is possible, is:

```java
@Retry
Uni<String> callMyRemoteService();
```

With the round-robin (default), random, power-of-two-choices, or least-response-time strategies, it will pick another instance during the next call (so the retry). What can be improved from the Stork POV would be to capture the failing instances and blacklist them for some time. The least-response-time strategy is already doing this. |
Also, Stork has no idea if the operation you are going to call is idempotent or not; only the user and sometimes the transport layer know. |
Also (yes, a second one), not all failures are equal. So we would have a list of:
However, @Retry already has such a configuration, if I'm not mistaken. |
Yes it does; you can choose which exceptions you want to retry on |
@cescoffier Thank you for your detailed explanation. From what you wrote, I understand and respect that you hold the position that @michalszynkiewicz expressed in his first comment. Stork excites me when I think of its easy integration with Quarkus. I will keep watching the development/evolution of it. |
So when the protocol knows if it can retry, do you see an opportunity to implement those retries directly at this level (e.g. in the resteasy proxy), rather than pushing it to the app level? Do you see |
At this point, I see |
I know almost nothing about Stork, but I find this discussion very interesting, as it is closely related to the topic of fault tolerance. For a load balancer to be able to handle failures, it must be aware of the underlying protocol (know the Java classes of exceptions that may be thrown, be able to read status codes from responses, things like that), which is something I guess the Stork authors are not very keen on (understandably). Here's a probably-silly idea that might be worth exploring. SmallRye Fault Tolerance now has a programmatic API, whose core is this interface:

```java
interface FaultTolerance<T> {
    T call(Callable<T> action) throws Exception;
}
```

An instance of this interface may be created using a builder that allows specifying everything that MicroProfile Fault Tolerance allows specifying using annotations. Maybe Stork could accept an instance of |
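For illustration, a hedged sketch of how such an instance might be built with the programmatic API; the builder method names are as I recall them from the SmallRye Fault Tolerance documentation and may differ between versions, so treat the exact shape as an assumption:

```java
import io.smallrye.faulttolerance.api.FaultTolerance;

public class GuardedCall {

    public String callRemote() throws Exception {
        // Guard: retry up to 3 times, then fall back to a default value.
        FaultTolerance<String> guarded = FaultTolerance.<String>create()
                .withRetry().maxRetries(3).done()
                .withFallback().handler(() -> "fallback").done()
                .build();

        // The guarded action could be a Stork-resolved REST or gRPC call.
        return guarded.call(() -> "call the remote service here");
    }
}
```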
@Ladicek what gains do you see in that? There would have to be another annotation to mark an operation as idempotent/retriable, right? There is one thing I'm afraid the MP FT + Stork won't be sufficient for. The exception thrown on failure may need to be analyzed (to e.g. get the exact status code) before making a decision about retry. That can be worked around by a custom exception mapper though. |
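As an aside, a minimal sketch of the exception-mapper workaround mentioned above, using the MicroProfile REST Client ResponseExceptionMapper SPI. The exception type, status handling, and the need to register the mapper on the client (e.g. via @RegisterProvider) are illustrative assumptions:

```java
import javax.ws.rs.core.MultivaluedMap;
import javax.ws.rs.core.Response;
import org.eclipse.microprofile.rest.client.ext.ResponseExceptionMapper;

// Hypothetical exception carrying the status code, so a retry predicate
// can later decide whether the failure is retriable (e.g. 503).
class RemoteServiceException extends RuntimeException {
    final int status;

    RemoteServiceException(int status) {
        super("remote call failed with status " + status);
        this.status = status;
    }
}

public class StatusAwareExceptionMapper implements ResponseExceptionMapper<RemoteServiceException> {

    @Override
    public boolean handles(int status, MultivaluedMap<String, Object> headers) {
        // Only map server-side errors; other responses are left to default handling.
        return status >= 500;
    }

    @Override
    public RemoteServiceException toThrowable(Response response) {
        return new RemoteServiceException(response.getStatus());
    }
}
```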
I was thinking the programmatic API is somewhat more expressive than the declarative Fault Tolerance API. But if all Stork usage is also declarative, then there's indeed no gain possible. |
Stork itself is purely programmatic, but users typically won't use it directly. |
Gotcha, in that case probably just ignore me :-) That said, this discussion actually reminded me of one idea I had recently: to be able to configure a set of fault tolerance strategies in one place (most likely programmatically, but configuration could be doable too) and then apply it to various methods with a single annotation. That would allow centralizing the fault tolerance related understanding of protocol exceptions and probably more. I'm thinking something like:

```java
@Produces
@Identifier("my-fault-tolerance")
static final FaultTolerance<Object> myFaultTolerance = ...;

...

@ApplyFaultTolerance("my-fault-tolerance")
public void doSomething() {
    ...
}
```

It's probably time I filed an issue in SmallRye Fault Tolerance for this (EDIT: smallrye/smallrye-fault-tolerance#589). |
If failover is addressed by MP FT, and not in Stork or the rest client, then @Ladicek's proposition is going to help a lot. What I would like to be able to do, for instance, is actually one step further: define a standard retry policy where we retry all |
We would have to experiment to check how doable it is, but: would a (Quarkiverse?) extension that alters all the clients to apply retries according to defined criteria work? |
And with @Ladicek's proposal we could even assemble the fault tolerance programmatically :) |
That's my feeling too. What @vsevel described is somewhat related to an implicit "ambassador" pattern. It can only be assumed if the interactions with the services are opinionated (in a rigorous way). It can be hard to generalize this approach because some HTTP verbs may be idempotent in some contexts and not in others (I've seen

I would consider this as related but outside of the scope of Stork per se. Stork was initially thought of as doing just discovery and selection in a customizable and embeddable way (see the discussion around the programmatic API - quarkusio/quarkus#237). One of the use cases (not yet done, but I'm sure @geoand will soon look into it) is related to API gateways. So Stork should provide everything to enable this, but not necessarily do it internally. The separation from fault tolerance is a crucial aspect, because fault tolerance is a complex problem on its own (so let's delegate that to someone else :-)). With the programmatic API of FT and (soon) Stork + the Quarkus extension model, we will be able to assemble everything to implement that implicit ambassador pattern (any fan of Lego?). That being said, it's a fantastic idea, and I can't wait to see how we can enable and implement such an approach. It would be a blueprint for many other extensions. |
Good idea. This is definitely worth trying.
But then it is a mistake by the application developer. A layer should not restrain itself from doing transparent retries when the spec says that it can, just because some developers lack awareness of what idempotent and safe mean. |
I actually added complex retry conditions (as well as circuit breaker and fallback conditions) yesterday: smallrye/smallrye-fault-tolerance#591. This is still limited to inspecting exceptions, and non-exceptional results are always treated as success, but if you need to be able to inspect non-exceptional results too, that should be possible to add. |
I suppose we will need this as well. For instance if the endpoint is returning a |
Fair enough, though I'd expect that error responses would typically still be represented as exceptions instead of |
IIRC, even if you return a |
Followed up by quarkusio/quarkus-upstream-roadmap#3. |
I started some experiments with a fault tolerant client here: https://github.com/michalszynkiewicz/fault-tolerant-client This test illustrates what works now: https://github.com/michalszynkiewicz/fault-tolerant-client/blob/main/deployment/src/test/java/io/quarkiverse/fault/tolerant/rest/client/reactive/deployment/DefaultFaultToleranceTest.java The thing is, it only works for clients injected with CDI (it requires interceptors to work). @vsevel would that work for your use case? |
This sounds interesting. We happened to implement something similar, with some differences. If a

We also set a default max retry and a default delay, overridable at build time. This works well in our initial tests. There are shortcomings, however:
Issue 1 is directly related to doing retries at the app level, rather than at the resteasy/stork level, which is where you know for sure that you have multiple target addresses. There is nothing we can do about issue 2. Issue 3 is annoying. I see your solution is more advanced, so maybe you do not have the same limitation. Besides the tests and the impl you provided, could you describe what it does (and does not do), and how it works? |
Right now it treats all operations but POST as idempotent, with a possibility to override it with

My plan is to move it to Quarkiverse. Can you and your team contribute to open source, to join forces? |
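A trivial sketch of that default rule; the override annotation below is hypothetical (the actual annotation referenced in the comment above is not shown in this thread):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

// Hypothetical override annotation for marking a method's idempotency explicitly.
@Retention(RetentionPolicy.RUNTIME)
@interface Idempotent {
    boolean value() default true;
}

class IdempotencyDefaults {

    // Default rule described above: everything but POST is considered idempotent,
    // unless the Java method carries an explicit override.
    static boolean isIdempotent(String httpMethod, Method javaMethod) {
        Idempotent override = javaMethod.getAnnotation(Idempotent.class);
        if (override != null) {
            return override.value();
        }
        return !"POST".equalsIgnoreCase(httpMethod);
    }
}
```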
I'm wondering if moving the integration lower (generated client code instead of interceptors) wouldn't be better. |
Not sure about this. I do not think it is appropriate
Non-idempotent operations can be retried when we know for sure that the operation could not be executed. That is the situation we have when we receive those exceptions:

Idempotent operations can be retried, generally speaking, on
I would say yes, tentatively. We are working on a governance model that would allow employees to contribute more easily to open source.
It is a valid argument. We did not attempt to cover this use case at this point. I suppose your

I am not sure about:

This seems to be the same code.

Why don't you let the exception bubble up in

How does it work if the contract also adds a

I am surprised to see |
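A hedged sketch of the retriability rules discussed a few comments above (non-idempotent operations retried only when the call demonstrably never happened; idempotent operations retried more broadly). The concrete exception classes are assumptions chosen as typical "the request never reached the server" errors; the original comment's exact list is not visible in this thread:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

class RetriabilityClassifier {

    // Safe for any operation: we know the request was never executed on the server.
    static boolean callNeverHappened(Throwable failure) {
        return failure instanceof ConnectException
                || failure instanceof NoRouteToHostException
                || failure instanceof UnknownHostException;
    }

    static boolean canRetry(boolean idempotent, Throwable failure) {
        if (callNeverHappened(failure)) {
            return true; // retriable even for non-idempotent operations
        }
        // For idempotent operations, broader I/O failures are also retriable,
        // since replaying the call cannot change the outcome on the server.
        return idempotent && failure instanceof IOException;
    }
}
```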
With my SmallRye Fault Tolerance maintainer hat on, I'm in touch with @michalszynkiewicz and I'm aware of this and something like |
It's just a PoC; I'm sharing it to get some initial feedback, especially on whether not having it for programmatically created clients is a blocker for you, @vsevel. |
Hi @michalszynkiewicz, @cescoffier suggested that we share what we had done with FT. I am not convinced it has big value for you, since your impl seemed more advanced. I am showing the code that processes rest client interfaces. We have reused the logic to also treat the proxies that we generate for an old HTTP-based proprietary protocol that we are still using, which I am not showing. One difference with your impl is the way we define the retry conditions:

I wish we could retry on some specific HTTP codes (e.g.

So I am not sure you are going to learn a lot, but here it is anyway. Let me know if you have questions.
|
This might qualify as the most interesting/informative discussion I've read around here. Cheers to all! We are looking for a solution that sounds very close to what @vsevel has described, and I'm wondering if anything more has come of this yet? Specifically in the way of an extension that uses the programmatic APIs of both libraries to achieve an "ambassador" (using @cescoffier's term). Our use case is actually both REST and gRPC. The REST API(s) follow an exact definition of idempotence based on the HTTP verb and status code. So it would seem we have the right situation for its use. |
Hello, we mostly did what I described in #232 (comment), improving it a little bit using |
We had a few asks for failover support in Stork.
Currently, we stand at the position that Stork should only provide service instances and not be involved in making calls, but maybe we should revisit it?
Currently, the way to solve it is to use a failure-aware load balancer, such as least-response-time, and combine it e.g. with the MicroProfile Fault Tolerance @Retry annotation.
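A minimal sketch of that combination with a Quarkus REST Client interface. The service name and retry settings are illustrative; it assumes the client is injected via CDI (so the Fault Tolerance interceptor applies) and that the base URI is resolved by Stork, configured with the least-response-time strategy:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;

// Hypothetical client: "my-service" is assumed to be a Stork-managed service
// (stork:// base URI), with least-response-time selected in the Stork configuration.
@RegisterRestClient(baseUri = "stork://my-service")
@Path("/hello")
public interface MyServiceClient {

    // MP Fault Tolerance retries the call; on each retry Stork selects an
    // instance again, so a failing instance can be avoided on the next attempt.
    @GET
    @Retry(maxRetries = 3)
    String hello();
}
```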