Inactive memcached node remains stuck in `ProbeOpen` state, when `EjectFailedHost` is configured #524
Comments
@maheshkelkar Thanks for the detailed bug report! I think you're right and we should transition to `ProbeClosed` in
👍 to this issue. I have observed this while writing an integration test using 2 memcached containers: killing one, killing both, and then bringing them back up. Thanks for the detailed report on the findings.
Hello @roanta, I will submit a PR on Monday.
I implemented a simple solution
So, every 30 seconds, we lose at least 1-4 operations to probing. Do we have an option of invoking a probe, i.e. a self-probe, on one of these objects? I thought FailFast does the very same thing. But when I tried it, it doesn't seem to probe.
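The probe cycle being discussed can be sketched as a tiny state machine. This is an illustrative toy model, not Finagle's internal API; names like `ProbeOpen` mirror the issue's terminology. The key point is that after `markDeadFor` elapses, the next real request is spent as the probe, which is why probing costs live operations.

```scala
object ProbeSketch {
  sealed trait State
  case object Alive     extends State
  case object Dead      extends State // waiting out markDeadFor
  case object ProbeOpen extends State // the next real request acts as the probe

  final class FailureAccrual(failuresToMarkDead: Int) {
    private var state: State = Alive
    private var failures = 0

    def currentState: State = state

    // Called when the markDeadFor window elapses.
    def markDeadForExpired(): Unit =
      if (state == Dead) state = ProbeOpen

    // Every real request reports its outcome; in ProbeOpen it doubles as the probe.
    def onRequest(succeeded: Boolean): Unit = state match {
      case ProbeOpen =>
        if (succeeded) { state = Alive; failures = 0 }
        else state = Dead // probe failed: back to Dead for another markDeadFor
      case Alive =>
        if (succeeded) failures = 0
        else {
          failures += 1
          if (failures >= failuresToMarkDead) state = Dead
        }
      case Dead => () // a dead node should not be receiving requests
    }
  }
}
```

Under this model, a down host that never comes back simply cycles Dead → ProbeOpen → Dead, burning one real request per cycle.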
@roanta: Thanks for the response.
But that's not the real problem. IMO the problem is that probing is causing me at least 1 failure. I wonder:
Not per-thread; Finagle multiplexes sessions over threads. Each session that your client connects to gets a new
Where are you seeing four? You might want to see how many nodes the client's load balancer sees, it should help clarify: https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/loadbalancer/Balancer.scala#L97. One thing to note is that "localhost" can resolve to multiple network interfaces which are treated as separate endpoints by the client.
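The multi-interface point above can be checked directly with the standard JDK resolver; "localhost" commonly resolves to both an IPv4 and an IPv6 address, each of which a client can treat as a separate endpoint (`LocalhostResolution` is an illustrative helper name, not a Finagle API):

```scala
import java.net.InetAddress

object LocalhostResolution {
  // Returns every address a hostname resolves to, e.g. 127.0.0.1 and ::1
  // for "localhost" on many systems.
  def addresses(host: String): Seq[String] =
    InetAddress.getAllByName(host).map(_.getHostAddress).toSeq
}
```

If this returns more than one address for "localhost", the load balancer will see more than one endpoint for what looks like a single server.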
I'm not sure I understand what you mean. What behavior are you referring to?
We have circuit breakers in Finagle that work out-of-band of requests, but they require protocol support (e.g. https://github.com/twitter/finagle/blob/develop/finagle-mux/src/main/scala/com/twitter/finagle/mux/ThresholdFailureDetector.scala).
This is generally the behavior with other Finagle clients (the load balancer guarantees to avoid
@roanta: Thanks for the response. Here are the diffs that I am adding for this PR: https://gist.github.com/maheshkelkar/75b87741008adf7d912cd9a36e40c862
I see 4 instances marked dead for
After 30 seconds, i.e. the `markDeadFor` timeout, I see probing started on 4 instances:
In this test, I have enabled FailFastFactory. I am executing 10 GET operations every 60 seconds. After 60 seconds, I see 2 requests picked up by 2 of the FAF instances, and they fail. As a result, I think 1-4 concurrent requests may fail every 60 seconds: First 2 requests received
2 requests picked up by 2 FailureAccrualFactory instances; I'm not sure which ones. I don't see any FailFastFactory messages either.
Note that the remaining 8 requests are channeled to
I see what's happening here. We replicate, and load balance over, a single endpoint to avoid head-of-line blocking when we pipeline memcached [1]. The problem is that we create a
I'm not yet sure what the appropriate fix is for this; let me think about it a bit more. In the meantime, let's fix the probe state issue since it's orthogonal.
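The cost of that replication can be modeled with a toy sketch (illustrative only, not Finagle code): each replica of the endpoint carries its own failure-accrual counter, so a single dead host must be re-discovered once per replica before it is fully ejected.

```scala
object ReplicationSketch {
  final class Replica { var failures = 0; var dead = false }

  // Counts how many failed requests it takes before every replica of one
  // dead endpoint has independently accrued enough failures to be marked dead.
  def simulate(connectionsPerEndpoint: Int, failuresToMarkDead: Int): Int = {
    val replicas = Seq.fill(connectionsPerEndpoint)(new Replica)
    var requests = 0
    var i = 0
    // Round-robin requests at the dead host until every replica is marked dead.
    while (replicas.exists(!_.dead)) {
      val r = replicas(i % connectionsPerEndpoint)
      if (!r.dead) {
        requests += 1 // this request fails against the dead host
        r.failures += 1
        if (r.failures >= failuresToMarkDead) r.dead = true
      }
      i += 1
    }
    requests // total failed requests before the host is fully ejected
  }
}
```

With 4 replicas and a threshold of 5 consecutive failures, 20 requests fail before ejection is complete, versus 5 with a single replica, which matches the "4 instances marked dead" observation above.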
Hi @roanta, thank you. Setting `connectionsPerEndpoint` to 1 did solve that problem, and I can clearly see that only 1 FAF instance is used. So I will go ahead and commit the PR for this issue. I also noted this behavior: if I try 10 requests at a time, some of them (5 in this example below) get queued to be processed by the failed FAF. Even after marking that FAF instance Dead, we still go through the queue and eventually fail them. Note that in this example
At this point the FAF is dead, but we still continue processing requests and failing them
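That queue-draining behavior can be modeled with a toy sketch (illustrative, not Finagle code): requests already enqueued keep being processed, and failing, even after the failure threshold marks the endpoint dead mid-drain.

```scala
object QueueDrainSketch {
  // Drains a queue of requests against a down host. Returns
  // (totalFailures, requestsProcessedAfterDeadMark).
  def drain(queued: Int, failuresToMarkDead: Int): (Int, Int) = {
    var failures = 0
    var dead = false
    var processedAfterDead = 0
    for (_ <- 1 to queued) {
      if (dead) processedAfterDead += 1 // already marked dead, still draining
      failures += 1                     // every request fails against the down host
      if (failures >= failuresToMarkDead) dead = true
    }
    (failures, processedAfterDead)
  }
}
```

With 10 queued requests and a threshold of 5, the endpoint is marked dead after the 5th failure, yet the remaining 5 queued requests are still processed and fail, matching the "5 in this example" observation above.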
@maheshkelkar fixed this in #527, thanks a ton!!!
Inactive memcached node remains stuck in `ProbeOpen` state, when `EjectFailedHost` is configured.

Expected behavior
- `localhost:11212` is down the entire time (because NO server is listening at that port)
- When `EjectFailedHost` and strict `FailureAccrualPolicy` are configured, every 30 seconds, all 10 operations succeed, by going to memcached node `localhost:11211`
- `localhost:11212` stays `Dead` and out of the ring the whole time.

Actual behavior
- Operations go to `localhost:11211`. The other node, `localhost:11212`, is marked as `Dead` (the log message print confirms this). So this is good, as per my expectation.
- Requests still go to `localhost:11212` and fail. `.configured(FailFastFactory.FailFast(true))`, but that didn't help either.
, but that didn't help eitherSteps to reproduce the behavior
Debugging Observations
hits the following exception: