Connection timeouts #37
I'm considering a new approach to the DNS timeout problem. The design looks something like the diagram below. What I have in mind is a new [...]
@conradludgate (or anyone else) what do you think?
I'm going to take some time to fully digest what ginepro/tower/tonic are currently doing under the hood. Then I'll have a think about how this DNS timeout fits in.
Ok, just to confirm that I have a clear picture of what the issue is:

Let's say we have 3 connections in our balance queue, but 1 of them has gone offline. We don't need to wait for DNS in this case. After a few attempts of using the dead connection we can remove it and continue using the 2 remaining connections (see https://docs.rs/tower/0.4.13/tower/ready_cache/cache/struct.ReadyCache.html).

Let's say those 2 connections go down; we now have 0 connections in our balance queue. Therefore we need to wait for DNS to get new connections. This is also the init case. When this happens, we want to time out early while waiting for DNS, which would be a different timeout from the unary request round trip.

Of course, for DNS local to a Kubernetes cluster, I would hope those DNS latencies would be low, so you could set the timeout for the DNS to be low but the timeout for the connection to be higher.

Is that your understanding too?
Agreed!
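To make the two agreed cases concrete, here is a minimal illustrative sketch (plain Rust, not ginepro's actual types), assuming the proposed design where DNS gets its own timeout separate from the per-request timeout:

```rust
use std::time::Duration;

/// Illustrative only: the two balancer situations described above.
enum BalancerState {
    /// At least one usable connection: dead endpoints are evicted after a few
    /// failed attempts and we keep serving from the rest. No DNS wait is
    /// needed in the request path.
    Serving { live_connections: usize },
    /// Zero connections (also the init case): we must wait for DNS to
    /// repopulate the balance queue, bounded by a DNS-specific timeout that
    /// can be much shorter than the unary request timeout.
    AwaitingDns { dns_timeout: Duration },
}

fn dns_wait_budget(state: &BalancerState) -> Option<Duration> {
    match state {
        BalancerState::Serving { .. } => None, // never block on DNS
        BalancerState::AwaitingDns { dns_timeout } => Some(*dns_timeout),
    }
}
```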
I had to have a look, but I'm guessing this is because of tonic's [...]. So, services can be removed from the balancer for one of two reasons: the Discover stream can remove them, or they can fail a readiness check and get evicted from the ready cache.
Cool. So yeah, either way we can end up with nothing to load balance across.
Yep. I mean, how long DNS vs TCP + TLS vs actual requests will take can vary a lot, which is why separate timeouts are useful. E.g. we might be happy to wait for a really long request which can take 10s at P99, but we wouldn't want to wait that long just for DNS or TCP + TLS. We already have (well, will soon have) a timeout for the TCP connection, so DNS seems like the last piece of the puzzle.
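As a sketch of where this could end up, the three budgets might be configured independently on the channel builder. Note that `connect_timeout` and `dns_timeout` are hypothetical builder methods here (the former is what #38 is adding, the latter is the subject of this issue); `timeout` is the existing per-request budget:

```rust
use std::time::Duration;
use ginepro::LoadBalancedChannel;

async fn build_channel() -> LoadBalancedChannel {
    LoadBalancedChannel::builder(("my-service.default.svc.cluster.local", 8080))
        // Whole-request budget: generous, because some RPCs are legitimately slow.
        .timeout(Duration::from_secs(10))
        // TCP + TLS budget only (hypothetical: being added in #38).
        .connect_timeout(Duration::from_secs(2))
        // DNS budget only (hypothetical: the subject of this issue). Low,
        // because in-cluster DNS should resolve quickly.
        .dns_timeout(Duration::from_millis(500))
        .channel()
        .await
        .expect("failed to construct LoadBalancedChannel")
}
```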
Cool. I'm all caught up then. I'll have a think about your solution and some others. In the end I think we might already have to re-implement some of tonic for #38, so I think we can move the DNS timeout further down the chain (into the Discover layer). But we'll still somehow have to surface the number of services registered to this layer.
Bug description
Symptoms
The `connection_timeout_is_not_fatal` test takes ~75 seconds to finish.

Well-configured timeouts are important for system stability. Requests which take too long can hog resources and block other work from happening.
Causes
I can see two separate timeout problems:

1. When `ResolutionStrategy::Lazy` is used, there is currently no way to apply a timeout just for DNS resolution. If DNS never resolves, requests never complete.
2. `connect_timeout` can be set, but tonic doesn't use it!

Even though we're setting our own fairly short timeouts around the overall request, I've seen some strange behaviour where requests are hanging for a long time. I think there's still something else going on that I don't understand, but I expect addressing the two points above will be generally helpful anyway.
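For the first problem, the missing capability is just a bounded wait around name resolution. A minimal sketch of the idea using tokio directly (ginepro's internal probe is what would actually need this treatment):

```rust
use std::io;
use std::net::SocketAddr;
use std::time::Duration;
use tokio::net::lookup_host;
use tokio::time::timeout;

/// Resolve `host` (must include a port, e.g. "my-service:8080"), giving up
/// after `budget` instead of hanging forever when DNS never answers.
async fn resolve_with_timeout(host: &str, budget: Duration) -> io::Result<Vec<SocketAddr>> {
    match timeout(budget, lookup_host(host)).await {
        Ok(addrs) => Ok(addrs?.collect()),
        Err(_elapsed) => Err(io::Error::new(
            io::ErrorKind::TimedOut,
            "DNS resolution timed out",
        )),
    }
}
```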
To Reproduce
For the TCP connection timeout, just run the tests. I'll supply a test for lazy DNS resolution timeouts in a separate PR.
Expected behavior
Ability to control timeouts for TCP connections and DNS resolution.
Environment
Additional context
Solutions
The TCP connection timeout is simpler to solve (though I will admit it took me a long time to find): we just need to set `connect_timeout` in the right places. First, tonic doesn't respect `connect_timeout`, which will be fixed by hyperium/tonic#1215. When that is merged, we can create our own `connect_timeout` option on top of it in #38.
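For reference, this is where the setting lives on tonic's `Endpoint` (the method exists today; hyperium/tonic#1215 is about actually honouring it during connection establishment):

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint};

async fn connect() -> Result<Channel, tonic::transport::Error> {
    Endpoint::from_static("http://my-service:8080")
        // TCP connection budget: the setting this issue is about.
        .connect_timeout(Duration::from_secs(2))
        // Per-request budget, independent of connection establishment.
        .timeout(Duration::from_secs(10))
        .connect()
        .await
}
```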
DNS resolution is harder. There are currently two options:

1. `ResolutionStrategy::Lazy` (the current default): resolution happens passively in the background on a schedule, so there is nothing in the request path to put a timeout around.
2. `ResolutionStrategy::Eager`: resolution happens, with a timeout, before the `LoadBalancedChannel` is created. This might be a good thing, preventing services from successfully starting when DNS would never resolve.

Of the two, I wonder if we should favour Eager resolution, and consider changing the default to this.
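A sketch of what opting into Eager resolution looks like, per ginepro's `ResolutionStrategy` as I understand it (treat the exact shape as approximate):

```rust
use std::time::Duration;
use ginepro::{LoadBalancedChannel, ResolutionStrategy};

async fn eager_channel() -> LoadBalancedChannel {
    LoadBalancedChannel::builder(("my-service", 8080))
        // Resolve DNS up front, failing channel construction (and therefore
        // service startup) if nothing resolves within the timeout.
        .resolution_strategy(ResolutionStrategy::Eager {
            timeout: Duration::from_secs(5),
        })
        .channel()
        .await
        .expect("DNS did not resolve within the eager-resolution timeout")
}
```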
However, we might want a third option: active lazy resolution (for want of a better name). Lazy resolution is currently passive, as in it happens in the background on a schedule. It is never actively called in the request flow, which is why it's hard to put a timeout around it. Instead, could we implement something which actively calls `probe_once()` (with a timeout!) as part of the first request (or alternatively whenever `GrpcServiceProbe.endpoints` is empty)? This could give us lazy DNS resolution, but with timeouts.
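The "active lazy" idea amounts to something like the following sketch. `probe_once` stands in for ginepro's internal `GrpcServiceProbe::probe_once`; all the types here are illustrative, not ginepro's actual API:

```rust
use std::net::SocketAddr;
use std::time::Duration;
use tokio::time::timeout;

/// Illustrative stand-in for ginepro's internal probe state.
struct ActiveLazyProbe {
    endpoints: Vec<SocketAddr>,
    dns_timeout: Duration,
}

impl ActiveLazyProbe {
    /// Stand-in for GrpcServiceProbe::probe_once: one DNS pass that
    /// repopulates `self.endpoints`.
    async fn probe_once(&mut self) -> Result<(), String> {
        // ...a real implementation would query DNS and update self.endpoints...
        Ok(())
    }

    /// Called in the request path: only blocks when there is nothing to load
    /// balance across, and even then only up to `dns_timeout`.
    async fn ensure_endpoints(&mut self) -> Result<(), String> {
        if self.endpoints.is_empty() {
            let budget = self.dns_timeout;
            timeout(budget, self.probe_once())
                .await
                .map_err(|_| "timed out waiting for DNS".to_string())??;
        }
        Ok(())
    }
}
```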
Scratch that, I took a different approach: tower-rs/tower#715. EDIT: Nope, that hasn't worked out. Back to the drawing board.