Failover support #232
Comments
Just linking here the issue I created some time ago that has some insights: #147. I guess we should close one or the other as they seem like duplicates of each other |
@mswiderski I wanted a place to specifically discuss failover |
Many years ago all of our EAPs were deployed as 2-node clusters, without a load balancer in front. Here is a description of the approach we took, which hopefully will help the discussion. Let's say that for the remote service we have 3 urls:

This is an important property: we always keep the nok instances in the list as a last resort. A nok instance is considered good again if:
Continuing from the previous example, we had: initial list=2-3-1, ok=3-1 and nok=2, and final list=3-1-2. An important aspect of the whole mechanism is recognizing when you can retry, and when you must not. This actually depends on the protocol used. For REST, we used the idempotency property of the HTTP verbs defined in the spec. For EJB/HTTP, since all calls go through a POST, we created a dedicated

In addition to these protocol-specific conditions, we would recognize certain situations where we knew for sure that the call did not happen (including exceptions specific to libs we were using underneath):
We would retry also on

The fact that client side load balancing handled a number of retries automatically relieved the developers from having to deal with this at the app resilience level (e.g. MP Fault Tolerance). So for instance developers did not have to add

They knew that the obvious errors would be handled underneath, and they needed to add fault tolerance only for the extra situations that needed some app knowledge. The other benefit was that they did not have to know how many retries to do to make sure they touched all nodes in the cluster, which they could not know at the app level. The client side load balancing knows how many instances there are. So in a 3-node cluster, there is no need to retry an idempotent method 50 times; 3 is enough. Similarly, if we know we have a 3-node cluster, it is worth trying 3 times rather than 2. The application may still need/want to do some extra retries if it suspects the issue is not with the endpoint itself but with one of its dependencies (e.g. a db), but that is a different level of retry. Part of this design has been influenced by the retry mechanism in Apache HTTP client:

The algorithm has been running for roughly 10 years, and started way before there were resiliency libs available. It provided us with a lot of stability, even through crashes, restarts and redeploys. As we are moving to k8s, the need for client side load balancing will probably diminish, but we still have to deal with our legacy infrastructure, so this component is going to stay critical for many more years I suspect. I think Stork is very well positioned for us considering our situation and the investment we are making in Quarkus. I am very excited about using it. Happy to discuss it further. |
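A minimal sketch of the ok/nok list reordering described in the comment above, assuming instances are identified by simple keys; the class and method names are hypothetical and not the proprietary implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Keeps known-bad ("nok") instances at the end of the candidate list,
// so they are only tried as a last resort.
class FailoverList<T> {
    private final List<T> ok = new ArrayList<>();
    private final List<T> nok = new ArrayList<>();

    FailoverList(List<T> initial) {
        ok.addAll(initial);
    }

    // e.g. initial=[2,3,1], markBad(2) -> candidates()=[3,1,2]
    synchronized void markBad(T instance) {
        if (ok.remove(instance)) {
            nok.add(instance);
        }
    }

    // A nok instance becomes good again, e.g. after a successful call or probe.
    synchronized void markGood(T instance) {
        if (nok.remove(instance)) {
            ok.add(instance);
        }
    }

    // Full ordered list: good instances first, bad ones as a last resort.
    synchronized List<T> candidates() {
        List<T> result = new ArrayList<>(ok);
        result.addAll(nok);
        return result;
    }
}
```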
This is the culprit from my point of view. The load balancer should automatically handle connectivity errors based on the state of the instances it is aware of, and only propagate errors that are not related to connectivity, as it cannot figure out on its own whether a particular error should be retried or not. |
We have to recognize that awareness is a weak property here: there are instances that you think are good, but are not, and some instances that you think are bad, but are not.
It can actually, in some cases. The obvious example is idempotent HTTP verbs in REST, for instance in Apache HttpClient the retryRequest() method relying on the isIdempotent() definition. In some cases it may not be appropriate for a particular protocol; one could for instance argue that a

In a flexible design, there should be a retry strategy interface (similar in spirit to HttpRequestRetryStrategy) with a default implementation. Some context on the type of call we are making should be passed to this strategy (at the very least the HTTP verb, but ideally some reference to the Java business method invoked on the proxy, if any, so that we can look at the annotations), so that the custom retry strategy can make intelligent decisions. |
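A hedged sketch of such a retry strategy contract, with hypothetical names loosely inspired by Apache HttpClient's HttpRequestRetryStrategy; this is not an actual Stork API:

```java
import java.lang.reflect.Method;

// Hypothetical context describing the call for which a retry decision is needed.
class CallContext {
    String httpVerb;          // e.g. GET, POST
    Method businessMethod;    // the proxied Java method, if any (to inspect annotations)
    Throwable failure;        // the error that triggered the retry decision
    int executionCount;       // how many attempts have been made so far
}

// Hypothetical strategy interface: applications plug in protocol-specific rules.
interface RetryStrategy {
    boolean shouldRetry(CallContext context);
}

// A possible default implementation: retry only idempotent HTTP verbs,
// as the HTTP spec defines them.
class IdempotentVerbRetryStrategy implements RetryStrategy {
    @Override
    public boolean shouldRetry(CallContext context) {
        return "GET".equals(context.httpVerb)
                || "HEAD".equals(context.httpVerb)
                || "PUT".equals(context.httpVerb)
                || "DELETE".equals(context.httpVerb);
    }
}
```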
Agreed, but there must be something to start with, and I believe relying on connection-related errors is a good start; it certainly does not have to end there.
Alright, it can find certain things to rely on, but again (and as you said later) this can be hidden behind various types of retry strategy implementations, with a reasonable default if needed. |
Oh yes, sure. gRPC also added support for idempotent operations a few years ago using a method-level idempotency option, as discussed here:

Calls that are marked as idempotent may be sent multiple times.

This means that a client side load balancing (and failover) library would be well inclined to retry idempotent calls if it thought the call might succeed on other service instances. If an API designer specified that an operation was idempotent, a client user should not have to re-state that the call should be retried at the Fault Tolerance level. |
for me, it's outside of the scope of Stork. Why do you think it should be handled by the load balancer? |
Mainly because the load balancer is completely hidden from client code. In my case I use MP Rest Client (with the Quarkus RESTEasy Reactive client), so I have no access to the load balancer, even though I have configured multiple endpoints to make sure they will be tried in case of failures related to connectivity. |
But there's also the client library, which knows the concurrency model better (knows which thread/event loop should handle a retry), knows which operations are idempotent, etc. @vsevel can the retry mechanism you described also be implemented as follows:
The difference between this and iterating over a list of all service instances is that it will take into account failures from many calls, not only the current one. The situation of having many instances down is unlikely; they would probably be removed on the next refresh of service discovery. Creating an ordered list of service instances may be inefficient, depending on the load balancing strategy. |
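A rough sketch of the failure-aware selection being discussed: failures reported by earlier calls temporarily steer selection away from an instance. The types and names here are illustrative assumptions, not Stork's actual API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Remembers recent failures across calls and avoids recently failed instances,
// falling back to them only if nothing else is available.
class FailureAwareSelector {
    private final Map<String, Instant> lastFailure = new ConcurrentHashMap<>();
    private final Duration blacklistPeriod = Duration.ofSeconds(30); // assumed value

    // Assumes a non-empty list of candidate instances.
    String select(List<String> instances) {
        Instant now = Instant.now();
        for (String instance : instances) {
            Instant failedAt = lastFailure.get(instance);
            if (failedAt == null || failedAt.plus(blacklistPeriod).isBefore(now)) {
                return instance;
            }
        }
        // All instances failed recently: last resort, return the first one anyway.
        return instances.get(0);
    }

    // Called by the client when a call fails with a connectivity-type error.
    void reportFailure(String instance) {
        lastFailure.put(instance, Instant.now());
    }
}
```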
Part of the discussion is deciding if your load balancer is acting as an intermediate between the client and the service (that is the traditional definition), or as a name resolver (you give me a name and I give you an address you can call). The proprietary implementation I did was closer to a client side load balancer (with no health checks), since it would serve as an intermediate between the client and the service, and was able to make informed failover decisions, because it was topology aware and used client-provided strategies to help figure out retriability for different protocols. It is somewhat similar to what the Apache HTTP client is doing with HttpRequestRetryExec, but limited to a single host. In my case I took what they were doing and applied it to a situation where the retries could be done on different hosts, rather than on the same one. I looked at Ribbon. It seems to be what I would call a client side load balancer, since it handles the call and supports retry options; check DefaultLoadBalancerRetryHandler and the different retry options such as

I am not a

I am wondering if Stork is closer to Consul from that perspective, where the name you resolve will point to a real service that is supposed to be up, according to Consul's health checks. But if that is not the case, it is the responsibility of the client to ask for a new address and retry. The issue with that approach is that each client (e.g. grpc, rest client) needs to implement auto retry on common situations (e.g.

I am not saying this is wrong. However it must be understood that the retriability needs to be moved upward into the clients (e.g. rest client, grpc). So for instance for grpc, some code will have to figure out that the first try failed with a retriable error, in a situation where a second call may actually pass (it would have to know that a

As stated previously, an easy way to do this is to push that back to the application developer by asking them to use MP Fault Tolerance for this. But that is a burden they should not have to endure, since we know that some errors are always retriable, and some of the protocols have already defined rules about retriability (e.g. idempotent verbs for REST, the idempotency annotation for gRPC). So the developer should not have to re-state it. So where does that leave us? With the current approach, I guess you need to make sure all client stacks implement auto-retry, with some of the logic being duplicated (e.g. reacting to connection errors). In my experience, we took a different approach where this concern was handled centrally. |
When we wrote our client side load balancing component, we also wrote a dedicated test suite to validate the behavior.
After the backend had been stopped, we would restart it, make sure load balancing had resumed normally on both server nodes, and play another incident. The client side would make calls on 2 types of endpoints:
The client application was not using any resilience lib (e.g.

We would let the test run for tens of minutes (hundreds of thousands of calls), and assess at the end the following conditions for success. For crash situations:
For graceful shutdown situations:
Watching the types of errors during the tests (for instance the errors the client would receive while the backend was restarting) allowed us to categorize them as either retriable or non-retriable, depending on the protocol, and to adjust the failure conditions on the client side for the automatic retries. Eventually we got complete coverage of all possible errors. I finished integrating Stork in my pilot application. Please let me know how you see Stork going forward. |
Interesting reading from the gRPC project: Transparent Retries. gRPC already distinguishes retries that it will do automatically (aka transparent retries) from retries that are governed by the retry policy. Three types of failures are defined:
In the first case, the call is retried multiple times until the deadline passes. So for gRPC, this may already work well. I looked at the RESTEasy Reactive |
I had my first experience with Stork via Quarkus, and it was a disappointment for me to get a ProcessingException when I shut down one of the services I was calling. I can accept my load balancer returning an exception if no instance of the service I'm calling is available; otherwise, my expectation is that it points me to an accessible instance. I think this expectation is fully in line with the nature of load balancing. |
@hakdogan while ideal, there is no way to be sure that the selected instance is healthy. Your service discovery may or may not use health data (Eureka does, DNS does not), but even with that, the state can change between the selection and the call. Also note that the health check may not provide the right data (an instance returning ready because the server is up, while some of its requirements are actually not there). One of the patterns I recommend, when retry is possible, is:

```java
@Retry
Uni<String> callMyRemoteService();
```

With the round-robin (default), random, power-of-two-choices, or least-response-time strategies, it will pick another instance during the next call (so the retry). What can be improved from the Stork POV would be to capture the failing instances and blacklist them for some time. The least-response-time strategy is already doing this. |
Also, Stork has no idea if the operation you are going to call is idempotent or not; only the user and sometimes the transport layer know. |
Also (yes, a second one), not all failures are equal. So we would have a list of:
However, @Retry already has such a configuration, if I'm not mistaken. |
Yes it does; you can choose which exceptions you want to retry on |
@cescoffier Thank you for your detailed explanation. From what you wrote, I understand and respect that you hold the position that @michalszynkiewicz expressed in his first comment. Stork excites me when I think of its easy integration with Quarkus. I will keep watching the development/evolution of it. |
So when the protocol knows if it can retry, do you see an opportunity to implement those retries directly at this level (e.g. in the resteasy proxy), rather than pushing it to the app level? Do you see |
At this point, I see |
I know almost nothing about Stork, but I find this discussion very interesting, as it is closely related to the topic of fault tolerance. For a load balancer to be able to handle failures, it must be aware of the underlying protocol (know the Java classes of exceptions that may be thrown, be able to read status codes from responses, things like that), which is something I guess the Stork authors are not very keen on (understandably). Here's a probably-silly idea that might be worth exploring. SmallRye Fault Tolerance now has a programmatic API, whose core is this interface:

```java
interface FaultTolerance<T> {
    T call(Callable<T> action) throws Exception;
}
```

An instance of this interface may be created using a builder that allows specifying everything that MicroProfile Fault Tolerance allows specifying using annotations. Maybe Stork could accept an instance of |
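For illustration, a hedged sketch of how such an instance might be built with the programmatic API; the builder method names are as I recall them from the SmallRye Fault Tolerance documentation and may differ between versions, so treat the exact shape as an assumption:

```java
import io.smallrye.faulttolerance.api.FaultTolerance;

public class GuardedCall {

    public String callRemote() throws Exception {
        // Guard: retry up to 3 times, then fall back to a default value.
        FaultTolerance<String> guarded = FaultTolerance.<String>create()
                .withRetry().maxRetries(3).done()
                .withFallback().handler(() -> "fallback").done()
                .build();

        // The guarded action could be a Stork-resolved REST or gRPC call.
        return guarded.call(() -> "call the remote service here");
    }
}
```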
@Ladicek what gains do you see in that? There would have to be another annotation to mark an operation as idempotent/retriable, right? There is one thing I'm afraid the MP FT + Stork won't be sufficient for. The exception thrown on failure may need to be analyzed (to e.g. get the exact status code) before making a decision about retry. That can be worked around by a custom exception mapper though. |
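As an aside, a minimal sketch of the exception-mapper workaround mentioned above, using the MicroProfile REST Client ResponseExceptionMapper SPI. The exception type, status handling, and the need to register the mapper on the client (e.g. via @RegisterProvider) are illustrative assumptions:

```java
import javax.ws.rs.core.MultivaluedMap;
import javax.ws.rs.core.Response;
import org.eclipse.microprofile.rest.client.ext.ResponseExceptionMapper;

// Hypothetical exception carrying the status code, so a retry predicate
// can later decide whether the failure is retriable (e.g. 503).
class RemoteServiceException extends RuntimeException {
    final int status;

    RemoteServiceException(int status) {
        super("remote call failed with status " + status);
        this.status = status;
    }
}

public class StatusAwareExceptionMapper implements ResponseExceptionMapper<RemoteServiceException> {

    @Override
    public boolean handles(int status, MultivaluedMap<String, Object> headers) {
        // Only map server-side errors; other responses are left to default handling.
        return status >= 500;
    }

    @Override
    public RemoteServiceException toThrowable(Response response) {
        return new RemoteServiceException(response.getStatus());
    }
}
```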
I was thinking the programmatic API is somewhat more expressive than the declarative Fault Tolerance API. But if all Stork usage is also declarative, then there's indeed no gain possible. |
Stork itself is purely programmatic, but users typically won't use it directly. |
Gotcha, in that case probably just ignore me :-) That said, this discussion actually reminded me of one idea I had recently: to be able to configure a set of fault tolerance strategies in one place (most likely programmatically, but configuration could be doable too) and then apply it to various methods with a single annotation. That would allow centralizing the fault tolerance related understanding of protocol exceptions and probably more. I'm thinking something like:

```java
@Produces
@Identifier("my-fault-tolerance")
static final FaultTolerance<Object> myFaultTolerance = ...;

...

@ApplyFaultTolerance("my-fault-tolerance")
public void doSomething() {
    ...
}
```

It's probably time I filed an issue in SmallRye Fault Tolerance for this (EDIT: smallrye/smallrye-fault-tolerance#589). |
If failover is addressed by MP FT, and not in Stork or the rest client, then @Ladicek's proposition is going to help a lot. What I would like to be able to do, for instance, is actually one step further: define a standard retry policy where we retry all |
We would have to experiment to check how doable it is, but: would a (Quarkiverse?) extension that alters all the clients to apply retries according to defined criteria work? |
And with @Ladicek's proposal we could even assemble the fault tolerance programmatically :) |
That's my feeling too. What @vsevel described is somewhat related to an implicit "ambassador" pattern. It can only be assumed if the interactions with the services are opinionated (in a rigorous way). It can be hard to generalize this approach because some HTTP verbs may be idempotent in some contexts and not in others (I've seen

I would consider this as related but outside of the scope of Stork per se. Stork was initially thought of as doing just discovery and selection in a customizable and embeddable way (see the discussion around the programmatic API - quarkusio/quarkus#237). One of the use cases (not yet done, but I'm sure @geoand will soon look into it) is related to API gateways. So Stork should provide everything to enable this, but not necessarily do it internally. The separation from fault tolerance is a crucial aspect, because fault tolerance is a complex problem on its own (so let's delegate that to someone else :-)). With the programmatic API of FT and (soon) Stork + the Quarkus extension model, we will be able to assemble everything to implement that implicit ambassador pattern (any fan of Lego?). That being said, it's a fantastic idea, and I can't wait to see how we can enable and implement such an approach. It would be a blueprint for many other extensions. |
Good idea. This is definitely worth trying.
But then it is a mistake by the application developer. A layer should not restrain itself from doing transparent retries when the spec says that it can, just because some developers lack awareness of what idempotent and safe mean. |
I actually added complex retry conditions (as well as circuit breaker and fallback conditions) yesterday: smallrye/smallrye-fault-tolerance#591. This is still limited to inspecting exceptions, and non-exceptional results are always treated as success, but if you need to be able to inspect non-exceptional results too, that should be possible to add. |
I suppose we will need this as well. For instance if the endpoint is returning a |
Fair enough, though I'd expect that error responses would typically still be represented as exceptions instead of |
IIRC, even if you return a |
Followed up by quarkusio/quarkus-upstream-roadmap#3. |
I started some experiments with a fault tolerant client here: https://github.com/michalszynkiewicz/fault-tolerant-client This test illustrates what works now: https://github.com/michalszynkiewicz/fault-tolerant-client/blob/main/deployment/src/test/java/io/quarkiverse/fault/tolerant/rest/client/reactive/deployment/DefaultFaultToleranceTest.java The thing is, it only works for clients injected with CDI (it requires interceptors to work). @vsevel would that work for your use case? |
This sounds interesting. We happened to implement something similar, with some differences. If a

We also set a default max retry and a default delay, overridable at build time. This works well in our initial tests. There are shortcomings, however:
Issue 1 is directly related to doing retries at the app level, rather than at the resteasy/stork level, which is where you know for sure that you have multiple target addresses. There is nothing we can do about issue 2. Issue 3 is annoying. I see your solution is more advanced, so maybe you do not have the same limitation. Besides the tests and the impl you provided, could you describe what it does (and does not do), and how it works? |
Right now it treats all operations but POST as idempotent, with a possibility to override it with

My plan is to move it to Quarkiverse. Can you and your team contribute to open source, to join forces? |
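A trivial sketch of that default rule; the override annotation below is hypothetical (the actual annotation referenced in the comment above is not shown in this thread):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

// Hypothetical override annotation for marking a method's idempotency explicitly.
@Retention(RetentionPolicy.RUNTIME)
@interface Idempotent {
    boolean value() default true;
}

class IdempotencyDefaults {

    // Default rule described above: everything but POST is considered idempotent,
    // unless the Java method carries an explicit override.
    static boolean isIdempotent(String httpMethod, Method javaMethod) {
        Idempotent override = javaMethod.getAnnotation(Idempotent.class);
        if (override != null) {
            return override.value();
        }
        return !"POST".equalsIgnoreCase(httpMethod);
    }
}
```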
I'm wondering if moving the integration lower (generated client code instead of interceptors) wouldn't be better. |
Not sure about this. I do not think it is appropriate
Non-idempotent operations can be retried when we know for sure that the operation could not be executed. That is the situation we have when we receive those exceptions:

Idempotent operations can be retried, generally speaking, on
I would say yes, tentatively. We are working on a governance model that would allow employees to contribute more easily to open source.
It is a valid argument. We did not attempt to cover this use case at this point. I suppose your

I am not sure about:

This seems to be the same code.

Why don't you let the exception bubble up in

How does it work if the contract also adds a

I am surprised to see |
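A hedged sketch of the retriability rules discussed a few comments above (non-idempotent operations retried only when the call demonstrably never happened; idempotent operations retried more broadly). The concrete exception classes are assumptions chosen as typical "the request never reached the server" errors; the original comment's exact list is not visible in this thread:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

class RetriabilityClassifier {

    // Safe for any operation: we know the request was never executed on the server.
    static boolean callNeverHappened(Throwable failure) {
        return failure instanceof ConnectException
                || failure instanceof NoRouteToHostException
                || failure instanceof UnknownHostException;
    }

    static boolean canRetry(boolean idempotent, Throwable failure) {
        if (callNeverHappened(failure)) {
            return true; // retriable even for non-idempotent operations
        }
        // For idempotent operations, broader I/O failures are also retriable,
        // since replaying the call cannot change the outcome on the server.
        return idempotent && failure instanceof IOException;
    }
}
```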
With my SmallRye Fault Tolerance maintainer hat on, I'm in touch with @michalszynkiewicz and I'm aware of this and something like |
It's just a PoC; I'm sharing it to get some initial feedback, especially on whether not having it for programmatically created clients is a blocker for you, @vsevel. |
Hi @michalszynkiewicz, @cescoffier suggested that we share what we had done with FT. I am not convinced it has big value for you, since your impl seemed more advanced. I am showing the code that processes rest client interfaces. We have reused the logic to also treat the proxies that we generate for an old HTTP-based proprietary protocol that we are still using, which I am not showing. One difference with your impl is the way we define the retry conditions:

I wish we could retry on some specific HTTP codes (e.g.

So I am not sure you are going to learn a lot, but here it is anyway. Let me know if you have questions.
|
This might qualify as the most interesting/informative discussion I've read around here. Cheers to all! We are looking for a solution that sounds very close to what @vsevel has described, and I'm wondering if anything more has come of this yet? Specifically in the way of an extension that uses the programmatic APIs of both libraries to achieve an "ambassador" (using @cescoffier's term). Our use case is actually both REST and gRPC. The REST API(s) follow an exact definition of idempotence based on the HTTP verb and status code. So it would seem we have the right situation for its use. |
Hello, we mostly did what I described in #232 (comment), improving it a little bit using |
We had a few asks for failover support in Stork.
Currently, we stand at the position that Stork should only provide service instances and not be involved in making calls, but maybe we should revisit it?
Currently, the way to solve it is to use a failure-aware load balancer, such as least-response-time, and combine it e.g. with the MicroProfile Fault Tolerance @Retry annotation.
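A minimal sketch of that combination with a Quarkus REST Client interface. The service name and retry settings are illustrative; it assumes the client is injected via CDI (so the Fault Tolerance interceptor applies) and that the base URI is resolved by Stork, configured with the least-response-time strategy:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;

// Hypothetical client: "my-service" is assumed to be a Stork-managed service
// (stork:// base URI), with least-response-time selected in the Stork configuration.
@RegisterRestClient(baseUri = "stork://my-service")
@Path("/hello")
public interface MyServiceClient {

    // MP Fault Tolerance retries the call; on each retry Stork selects an
    // instance again, so a failing instance can be avoided on the next attempt.
    @GET
    @Retry(maxRetries = 3)
    String hello();
}
```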