Zuul won't stop retry after hystrix is timed out #1772
Comments
I found this issue in the Hystrix project that is particularly interesting. It seems like the application thread should try to respond to a thread interruption (when using an isolation strategy of thread). Might need to dig into this a little bit more to understand it better. |
Because of the way the |
@spencergibb thank you for pointing out it's hystrix isolation issue, I tried to set
but it seems not working. |
What does "but it seems not working." mean? |
Hystrix uses RxJava, and when a command should fail because of a timeout, Hystrix unsubscribes from the command observable (see com.netflix.hystrix.AbstractCommand.HystrixObservableTimeoutOperator). If the THREAD isolation level is used, the command thread can be interrupted. But @ryanjbaxter found an issue: for the SEMAPHORE isolation level, which is the default for Zuul, the command thread is not interrupted. Right now we have Spring's AbstractRibbonCommand, which extends HystrixCommand and implements the run method. We can rewrite AbstractRibbonCommand to extend HystrixObservableCommand and implement the construct method, which returns an observable. This gives unsubscription propagation. I tried to do this, see my commit. I am using Ribbon's LoadBalancerCommand, which also returns an observable. But I faced another problem: retryable clients. When a retryable client is used, Ribbon's retry logic is ignored and the retry is performed in the client. I don't know the best way to make it work with retryable clients. Add something like toObservable to the client, and implement execution in a reactive way? Anyway, see my branch. It has tests for retry abort. If you revert all changes and keep only the tests, they will fail. |
My gut tells me that we can probably do something in |
I think what I did in my branch is wrong. Retry logic should be only in the client.
Yeah, but you will need to ask Hystrix somehow whether the command has timed out. I have another question. What was the point of adding those retryable clients? To be able to write a custom LoadBalancedRetryPolicy? |
Beginning in Camden we no longer used the Netflix Can you perhaps submit a PR with your changes so we can take an easier look at the code changes? |
Deprecated New retryable clients also extend Comparing those implementations
What is the contract for client? |
Yes |
So I did some investigation into how Hystrix communicates that a timeout has occurred to the wrapped operation and it looks like it is just interrupting the thread, so all we should have to do is to check
I think you should be able to apply this to the retryable clients as well. |
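The interruption check discussed in the comment above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Spring Cloud implementation: `InterruptAwareRetry` and `execute` are hypothetical names, and the check only helps when Hystrix actually interrupts the command thread, which is the THREAD isolation case.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of Spring Cloud, Ribbon, or Hystrix.
final class InterruptAwareRetry {

    private InterruptAwareRetry() {}

    /**
     * Runs the action up to maxAttempts times, but refuses to start a new
     * attempt once the current thread has been interrupted (for example by
     * a Hystrix timeout under THREAD isolation).
     */
    static <T> T execute(Callable<T> action, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Check the interrupt flag before every attempt, including retries.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException("interrupted before attempt " + attempt);
            }
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Restore the flag so callers can still observe the interruption.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed without an interruption
    }
}
```

With SEMAPHORE isolation the calling thread is not interrupted, so a loop like this would keep retrying; it would need some other signal that the command has timed out.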
Thank you for the clarifications! Unfortunately, I am trying to refactor the clients right now to make a more general fix. Long story short, the plan is to add
I meant to add a note to my previous comment stating that it would only work for thread isolation, sorry about that. I think the |
I updated my pull request #1793 with the latest changes. Is it ok to remove |
No, it is not OK. We want those clients so we can use Spring Retry to retry the failed requests and not rely on the deprecated Netflix client to do so. |
Well, I was confused. Old and new retryable clients extend Netflix's Anyway, see my updated pull request. All clients now behave the same as before. Basically, I just fixed the issue. |
I'd be very uncomfortable merging the changes you have made. |
No problem. At least I showed one possible way to fix the issue. Thanks. |
I am going to close this issue for now. If we find there are others who want retries to stop after a hystrix timeout, we can come back and address this. |
I was following your discussion as this issue is actually happening for our service in Production. As soon as services behind Zuul begin to respond slower, Zuul retries requests until some of our nodes die. :( Thanks |
Are you saying the retries from Zuul that don't stop are causing your apps to crash? |
@spencergibb even when ribbon-isolation-strategy is set to THREAD, the retry can't be interrupted by the hystrix timeout. |
@ts-huyao yes, because we are not checking whether the thread has been interrupted |
@Aloren and @rterentiev we talked about this issue on our team call today and we came to the consensus that the proper way to configure the Hystrix timeout is to make sure you account for the total timeout of all retries. This means, for example, that if you have the Ribbon timeout configured to be 1 second and you are going to retry the request 3 times, then the Hystrix timeout should be at least 3 seconds. Is there a reason why you have not configured your Hystrix timeout this way? |
If I recall correctly, the hystrix timeout should be slightly greater than the max expected duration of the underlying command. So if the http client is configured with a 1s connect+read timeout and 2 potential retries, the hystrix timeout should be slightly more than 3s (to account for jvm overhead). I'm trying to find where I read that. |
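The arithmetic in the two comments above can be written down as a small sketch. The helper below is hypothetical (it is not a Spring Cloud, Ribbon, or Hystrix API), and the `marginMs` parameter is an assumption standing in for "slightly more ... to account for jvm overhead"; the thread does not say how large that margin should be.

```java
// Hypothetical helper for illustration; not part of Spring Cloud, Ribbon, or Hystrix.
final class HystrixTimeoutSketch {

    private HystrixTimeoutSketch() {}

    /**
     * Lower bound for the Hystrix timeout: the per-attempt client timeout
     * (connect + read) multiplied by the total number of attempts
     * (the first try plus the retries), plus an assumed safety margin.
     */
    static long minimumHystrixTimeoutMs(long perAttemptTimeoutMs, int retries, long marginMs) {
        if (perAttemptTimeoutMs < 0 || retries < 0 || marginMs < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        long totalAttempts = retries + 1L; // first try + retries
        return perAttemptTimeoutMs * totalAttempts + marginMs;
    }

    public static void main(String[] args) {
        // 1s connect+read timeout, 2 potential retries, 100 ms assumed margin
        System.out.println(minimumHystrixTimeoutMs(1000, 2, 100)); // prints 3100
    }
}
```

With a margin of zero, the example from the comment (1s timeout, 2 potential retries) gives exactly 3000 ms, which matches the "slightly more than 3s" guidance once any margin is added.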
@ryanjbaxter, then the defaults in the spring cloud configs need to be adjusted as well; 5 secs for zuul hystrix was set to override the 1 sec default. WDYT? |
If that is the case then yes, we should make them match. However, as soon as a second instance of a service is added, that timeout needs to be adjusted. I wonder if we could automatically calculate the hystrix timeout based on the number of instances and the configured ribbon timeout.... |
We don't currently set the defaults for hystrix in zuul except for the max number of semaphores. |
Here is the list of settings which need to match:
By default read timeout is 5000ms and auto retry is 1... |
I'm not sure we should be setting those values automatically. |
There is no point in a hystrix timeout if the command can't be interrupted. I would just disable the hystrix timeout by default. Maybe log a warning message if it was enabled. |
@rterentiev why do you want Hystrix to interrupt the command if it hasn't finished retrying the requests? |
Because Hystrix will return a timeout error even if the command was executed successfully after retrying. In high-load applications, this will also add additional load and cause the problems described by @Aloren. I am not pushing you to fix this issue. I was able to work around it by implementing a custom command and command factory. But if you are not going to fix it for now, I would disable the hystrix timeout by default. It just makes things more consistent: the hystrix timeout is not working and is disabled by default. |
@rterentiev you wouldn't have those problems if the hystrix timeout were > the combined potential timeouts of all retries, correct? |
Right, so why not configure Hystrix so it at least waits until all retries complete?
This could potentially be dealt with by implementing some kind of backoff policy, it would be an enhancement. |
@spencergibb, correct, but default setup is broken. There is 1 sec of hystrix timeout and 5 secs of ribbon timeout. |
I've opened #1827 to see about adjusting the ribbon timeouts for the default apache httpclient and the okhttp client. I've opened #1828 to document hystrix/retry/timeout relationships. BTW, I've chatted with a netflix engineer and I am correct in those relationships. As a side note, even they struggle with this some. |
You will need to keep this problem in mind. Each time you update the ribbon configuration, you need to recalculate the hystrix timeout value. It's much simpler to just disable the hystrix timeout. The behavior is the same with the hystrix timeout disabled as when the hystrix timeout > the combined potential timeouts of all retries. |
Disabling the hystrix timeout seems like a bad idea. May as well remove hystrix at that point. |
Hi, I've been following this thread but am slightly confused about which settings work and don't work with Zuul, Ribbon and Hystrix. We are using the Spring Cloud Edgeware release with all the default values. We are connecting Zuul to our services using Ribbon connected to Eureka. Some of our services are a little slow, around 10 seconds. How do I increase the various timeouts so that this normal use case does not time out? I've set these; which stopped some of the timeouts but I'm still left with the Hystrix timeouts. Many thanks Matt |
@matt-shaw please don't ask the same question on multiple issues |
I'm on Spring Cloud dependencies Camden.SR6, with Zuul set up using the following config
and I put a simple resource behind Zuul which sleeps 4000ms when invoked.
I then use Postman to call the getthensleep resource so that it sleeps for 4000ms. I expect Zuul's RibbonRoutingFilter to time out after 3000ms and retry until Hystrix times out, so 2 separate calls should be made from Zuul. However, after Hystrix times out and Zuul returns with code 500, another call is made to the getthensleep resource. In total 3 calls are made by Zuul. I think Zuul should stop retrying when Hystrix times out, and it shouldn't make another attempt after that.