
Allow RibbonLoadBalancedRetryPolicy to participate in the ServerStats connection failure count #1878

Closed
nithril opened this issue Apr 21, 2017 · 12 comments

@nithril

nithril commented Apr 21, 2017

It appears that org.springframework.cloud.netflix.ribbon.RibbonLoadBalancedRetryPolicy#registerThrowable does not participate in the ServerStats connection failure count. As a result, com.netflix.loadbalancer.BaseLoadBalancer#chooseServer cannot circuit-break a server with too many failures according to the niws.loadbalancer.default.connectionFailureCountThreshold property.
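
For reference, a rough sketch of what participating in those stats could look like. This is illustrative only: the class and method names below are made up, and only the Ribbon ServerStats calls are real.

```java
import com.netflix.loadbalancer.AbstractLoadBalancer;
import com.netflix.loadbalancer.Server;
import com.netflix.loadbalancer.ServerStats;

// Illustrative only: these are the counters that the availability-based
// circuit breaker (governed by niws.loadbalancer.default.connectionFailureCountThreshold)
// consults when deciding whether to skip a server.
public final class ServerStatsUpdateSketch {

    public static void recordConnectionFailure(AbstractLoadBalancer loadBalancer, Server server) {
        // LoadBalancerStats keeps one ServerStats instance per known server.
        ServerStats stats = loadBalancer.getLoadBalancerStats().getSingleServerStat(server);
        // Successive connection failures are what trip the per-server circuit breaker.
        stats.incrementSuccessiveConnectionFailureCount();
        // The overall failure count feeds the server's general metrics.
        stats.addToFailureCount();
    }
}
```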

Does it make sense?

Thanks

@ryanjbaxter
Contributor

Potentially, we will have to look at it.

@nithril
Author

nithril commented Apr 24, 2017

I wonder: what is the advantage of using Spring Retry over the native Ribbon retry?

@ryanjbaxter
Contributor

There was too much inconsistency between the various ways requests were made in Spring Cloud Netflix, so we centralized on Spring Retry across all of them. For a good summary of the history behind this, see #1290.

When I added the feature, I did not try using the AvailabilityFilteringRule (which I assume you are using). Have you tried using this rule without Spring Retry?
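
For anyone following along, a per-client rule can be set up along these lines. This is a minimal sketch: the configuration class name is hypothetical, and only the Spring and Ribbon types are real.

```java
import com.netflix.client.config.IClientConfig;
import com.netflix.loadbalancer.AvailabilityFilteringRule;
import com.netflix.loadbalancer.IRule;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical per-client Ribbon configuration class.
@Configuration
public class AvailabilityRuleConfig {

    @Bean
    public IRule ribbonRule(IClientConfig config) {
        // Filters out servers that Ribbon considers tripped or overloaded,
        // based on the connection-failure statistics kept in ServerStats.
        return new AvailabilityFilteringRule();
    }
}
```

It would be attached to a client with a @RibbonClient annotation that points at this configuration class; the service name used there is whatever your client is called.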

@nithril
Author

nithril commented Apr 24, 2017

Without Spring Retry, the AvailabilityFilteringRule works as intended. Note that this rule is added by default.

There was too much inconsistency between various ways requests were made in Spring Cloud Netflix

Have you considered putting the retry logic into a custom LoadBalancerCommand?

I have looked at how to use ServerStats in the Spring Cloud retry logic, but it does not seem straightforward, as Ribbon is not aware of the subsequent chooseServer calls.
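
For comparison, Ribbon's native retry path looks roughly like the sketch below, adapted from memory of the Ribbon reactive examples (treat the builder method names as assumptions). The point is that LoadBalancerCommand owns both the server choice and the ServerStats bookkeeping for every attempt, which is what keeps the stats accurate on that path.

```java
import com.netflix.client.DefaultLoadBalancerRetryHandler;
import com.netflix.client.RetryHandler;
import com.netflix.loadbalancer.ILoadBalancer;
import com.netflix.loadbalancer.Server;
import com.netflix.loadbalancer.reactive.LoadBalancerCommand;
import com.netflix.loadbalancer.reactive.ServerOperation;
import rx.Observable;

// Sketch of Ribbon's native retry path: the command chooses the server,
// applies the RetryHandler, and records the outcome of each attempt
// in ServerStats, which the Spring Retry path currently bypasses.
public final class NativeRetrySketch {

    public static String call(ILoadBalancer loadBalancer) {
        // Retry once on the same server, then once on another server,
        // and let connect failures count as circuit-tripping exceptions.
        RetryHandler retryHandler = new DefaultLoadBalancerRetryHandler(1, 1, true);
        return LoadBalancerCommand.<String>builder()
                .withLoadBalancer(loadBalancer)
                .withRetryHandler(retryHandler)
                .build()
                .submit(new ServerOperation<String>() {
                    @Override
                    public Observable<String> call(Server server) {
                        // A real operation would issue the HTTP request here;
                        // this placeholder just returns the chosen host and port.
                        return Observable.just(server.getHostPort());
                    }
                })
                .toBlocking()
                .first();
    }
}
```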

@ryanjbaxter
Contributor

I need to spend some time looking at how this works; I am not too familiar with it. I will hopefully post some updates here tomorrow.

@ryanjbaxter
Contributor

Are you testing this by proxying requests through Zuul?

@nithril
Author

nithril commented Apr 27, 2017

Yes.

@pway99
Contributor

pway99 commented Sep 20, 2017

Has there been any progress on this issue? We are experiencing unnecessary client-side exceptions while attempting to gracefully shut down a service instance. Since the server stats are only updated after the retry logic exhausts its retry limit, it is rare for a circuit breaker to open and common for an unavailable server to be chosen by the LoadBalancer.
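
To make the effect visible, here is a small diagnostic sketch (illustrative only, with a made-up class name) that dumps the counters the availability filter consults. In the scenario above, the successive connection failure count tends to stay at zero during retries, so isCircuitBreakerTripped() never returns true.

```java
import com.netflix.loadbalancer.AbstractLoadBalancer;
import com.netflix.loadbalancer.Server;
import com.netflix.loadbalancer.ServerStats;

// Diagnostic sketch: print the per-server failure counters so you can see
// whether the availability filter would ever skip the instance being shut down.
public final class ServerStatsDiagnostics {

    public static void logCircuitState(AbstractLoadBalancer loadBalancer) {
        for (Server server : loadBalancer.getAllServers()) {
            ServerStats stats = loadBalancer.getLoadBalancerStats().getSingleServerStat(server);
            System.out.printf("%s successiveFailures=%d tripped=%s%n",
                    server.getHostPort(),
                    stats.getSuccessiveConnectionFailureCount(),
                    stats.isCircuitBreakerTripped());
        }
    }
}
```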

@ryanjbaxter
Contributor

Sorry this fell off my radar. I will try to carve out some time to look at it.

@pway99
Contributor

pway99 commented Sep 29, 2017

@ryanjbaxter I took a stab at fixing this issue; see the pull request I posted above. Thanks!

@tkvangorder

@ryanjbaxter @spencergibb I know you two are busy, but I was hoping to get your eyes on the updated pull request that @pway99 submitted.

Some background on this: we are moving our legacy deployment infrastructure to a continuous deployment model. Our whole stack is still running in a private data center, and we are not yet on something like Cloud Foundry. We are very close to getting rolling deployments working, but we are having one issue: when we gracefully shut down our services, we still get a few client-side errors. Those clients are using spring-cloud-sidecar (Dalston.SR3), and Pat reworked his pull request after your feedback.

The root cause was that in Zuul/Sidecar, the Feign calls were not updating the load-balancing stats when using Spring Retry plus a retry policy. This was fixed in RibbonLoadBalancedRetryPolicy; the second part of the fix was in AbstractRibbonCommand. The code now checks request.isRetriable() to determine whether the Spring code or the Netflix code will be used for the retry and stats update. This seems to handle all of the use cases we could think of, but we wanted to verify with you.
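
Roughly, the dispatch described above amounts to the following simplified sketch (stand-in names and a boolean flag instead of the actual request.isRetriable() check; see the pull request for the real change):

```java
import com.netflix.client.AbstractLoadBalancerAwareClient;
import com.netflix.client.ClientRequest;
import com.netflix.client.IResponse;
import com.netflix.client.config.IClientConfig;

// Simplified sketch of the branching described above, not the actual
// AbstractRibbonCommand code: when Spring Retry is in play, let its path
// drive retries and (with the fix) the ServerStats updates; otherwise let
// Ribbon's executeWithLoadBalancer() handle retries and stats itself.
public final class RetryDispatchSketch {

    static <RQ extends ClientRequest, RS extends IResponse> RS execute(
            AbstractLoadBalancerAwareClient<RQ, RS> client,
            RQ request,
            IClientConfig config,
            boolean springRetryEnabled) throws Exception {
        if (springRetryEnabled) {
            // Spring Retry path: retries handled by RetryTemplate, stats updated
            // through RibbonLoadBalancedRetryPolicy#registerThrowable after the fix.
            return client.execute(request, config);
        }
        // Native Ribbon path: LoadBalancerCommand handles retries and ServerStats.
        return client.executeWithLoadBalancer(request, config);
    }
}
```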

We have confirmed that this is now working correctly in our testing environment. We were able to spin up four service nodes and four clients, run a load test, and then shut down three of those service nodes while the test was running. No client errors were reported.

Thanks for your time!

@tkvangorder

Adding a reference to the new pull request.

#2390
