
Inactive memcached node remains stuck in ProbeOpen state, when EjectFailedHost is configured #524

Closed
maheshkelkar opened this issue Jul 8, 2016 · 11 comments


@maheshkelkar
Contributor

maheshkelkar commented Jul 8, 2016

Inactive memcached node remains stuck in ProbeOpen state, when EjectFailedHost is configured.

Expected behavior

  • I am using Finagle 6.35.0, with the Memcached client configured as follows:
Memcached.client
  .configured(FailureAccrualFactory.Param(1, () => Duration.fromSeconds(10)))
  .configured(Memcached.param.EjectFailedHost(true))
  .newRichClient("localhost:11211,localhost:11212")
  • Every 30 seconds, I perform get and/or set operations with 10 different key-value pairs
  • Node localhost:11212 is down the entire time (no server is listening on that port)
  • I expect that, since EjectFailedHost and a strict FailureAccrualPolicy are configured, all 10 operations succeed every 30 seconds by going to memcached node localhost:11211
  • I expect that memcached node localhost:11212 stays Dead and out of the ring the whole time.

Actual behavior

  • On the first attempt (say time-0), all 10 operations succeed, all of them going to the active node localhost:11211. The other node, localhost:11212, is marked Dead (the log message confirms this). So far this matches my expectation.
  • On the 2nd and successive attempts, some (say 4) of the 10 operations try to connect to the inactive node localhost:11212 and fail.
  • I tried configuring .configured(FailFastFactory.FailFast(true)), but that didn't help either.

Steps to reproduce the behavior

  • Create a Memcached client with the code above (2 memcached nodes, 1 active, the other inactive)
  • Every 30 seconds, try 10 set operations with different key-value pairs
  • On the 2nd and successive attempts, some operations fail

Debugging Observations

  • The inactive node gets ejected after 1st failure (i.e. 1st attempt)
  • After markDeadFor timeout (10 seconds in this case), on startProbing (i.e. state=probeOpen), the node is added back to ring.
  • When service/probe is invoked on this node, it tries to process underlying(conn).map and
    hits the following exception:
com.twitter.finagle.ChannelWriteException: java.net.ConnectException: Connection refused: localhost/0:0:0:0:0:0:0:1:11213. Remote Info: Upstream Address: /127.0.0.1:54459, Upstream Client Id: Not Available, Downstream Address: localhost/0:0:0:0:0:0:0:1:11213, Downstream Client Id: fixedinet!localhost:11213, Trace Id: 88c6d00fd9f95341.8f24651fa01698bc<:f700f541942f60f2
  • The exception is caught by onServiceAcquisitionFailure, which calls didFail while state=ProbeOpen. But didFail lacks a check for the ProbeOpen state and fails to put the node back into state=Dead.
  • Because of the exception, it skips this code block.
  • Thus, the node remains stuck in the ring, causing failures (see the sketch below).
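
To make the fix described above concrete, here is a minimal illustrative sketch (class and member names are simplified stand-ins, not the actual FailureAccrualFactory source): a service-acquisition failure observed while the state is ProbeOpen should exit the probe state so the node can be marked Dead again instead of staying in the ring.

// Illustrative sketch only; the real FailureAccrualFactory has more states and
// accrual logic. The key point is that onServiceAcquisitionFailure must handle
// the ProbeOpen case rather than leaving the factory stuck probing.
sealed trait State
case object Alive extends State
case object Dead extends State
case object ProbeOpen extends State
case object ProbeClosed extends State

final class FailureAccrualSketch {
  @volatile private[this] var state: State = Alive

  def onServiceAcquisitionFailure(): Unit = {
    if (state == ProbeOpen) stopProbing() // the missing step described in this report
    didFail()
  }

  private def stopProbing(): Unit =
    state = ProbeClosed

  private def didFail(): Unit =
    // accrual policy elided; after enough failures the node is marked Dead
    state = Dead
}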
@roanta
Contributor

roanta commented Jul 8, 2016

@maheshkelkar Thanks for the detailed bug report! I think you're right and we should transition to ProbeClosed in onServiceAcquisitionFailure. Would you like to submit a PR for this?

@argha-c

argha-c commented Jul 8, 2016

👍 to this issue. I have observed this while writing an integration test using 2 memcached containers: killing one, killing both, and then bringing them back up.

Thanks for the detailed report on the findings.

@maheshkelkar
Contributor Author

Hello @roanta, I will submit a PR on Monday.

@maheshkelkar
Contributor Author

maheshkelkar commented Jul 12, 2016

@roanta @argha-c @kevinoliver

I implemented a simple solution, stopProbing, that gets executed whether we succeed, fail, or hit an exception. However, what I notice is:

  1. There are 4 instances of FailureAccrualFactory. Not sure why.
  2. After the markDeadFor timeout, all 4 objects start probing.
  3. If one or more requests arrive at the same time, each of those objects accepts a request, fails, and goes back to the dead state. If requests are sparse, then only 1 request fails, marking all of these objects dead.

So, every 30 seconds, we lose at least 1-4 operations in probing.

Do we have an option of invoking a probe (i.e. a self-probe) on one of these objects? I thought FailFast does that very thing, but I tried it and it doesn't seem to probe.

@roanta
Contributor

roanta commented Jul 12, 2016

FailureAccrualFactory and FailFastFactory both operate per session, so there should be one instance per memcached host that your client talks to. Your observations sound like correct behavior to me given the respective hosts are actually down and your client has concurrent requests.

@maheshkelkar
Contributor Author

@roanta: Thanks for the response.

  • Are you suggesting that we have 1 FailureAccrualFactory per thread, i.e. a session per thread?
  • I see 4 FailureAccrualFactory instances for localhost:11212, whether I configure FailFastFactory or not.

But that's not the real problem. IMO the problem is that probing causes at least 1 failure after the markDeadFor timeout.

I wonder:

  • why don't I see this on the very first attempt, when all 10 ops go to the active node localhost:11211
  • why don't we use auto-triggered probing, instead of wasting a request
  • if the request fails, why not attempt it again on a different memcached node (I wonder if this can be controlled by retries)

@roanta
Contributor

roanta commented Jul 13, 2016

Are you suggesting that we have 1 FailureAccrualFactory per thread? i.e. we have a session per thread?

Not per-thread; Finagle multiplexes sessions over threads. Each session that your client connects to gets a new FailureAccrualFactory or FailFastFactory (if enabled). Take a look at http://twitter.github.io/finagle/guide/Clients.html -- the branching after the load balancer represents the per-session stack.

I see 4 FailureAccrualFactory instances for localhost:11212, whether I configure FailFastFactory or not.

Where are you seeing four? You might want to check how many nodes the client's load balancer sees; it should help clarify: https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/loadbalancer/Balancer.scala#L97. One thing to note is that "localhost" can resolve to multiple network interfaces, which are treated as separate endpoints by the client.
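
One quick way to check the "localhost can resolve to multiple interfaces" point is to print the resolved addresses directly (a plain JDK call, not Finagle-specific); a minimal sketch:

// Prints every address "localhost" resolves to; each distinct address can
// show up as a separate endpoint to the client.
import java.net.InetAddress

object LocalhostCheck extends App {
  InetAddress.getAllByName("localhost").foreach(println)
}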

why don't I see this on the very first attempt, when all 10 ops goto active node localhost:11211

I'm not sure I understand what you mean. What behavior are you referring to?

why do we not use auto-triggering for probing, instead of wasting a request

We have circuit breakers in Finagle that work out-of-band of requests, but they require protocol support (e.g. https://github.com/twitter/finagle/blob/develop/finagle-mux/src/main/scala/com/twitter/finagle/mux/ThresholdFailureDetector.scala). FailureAccrualFactory is designed to be protocol-agnostic and in-band with requests, which has its advantages.

if we fail the request, why not attept it again over the different memcached node (I wonder if this can be controlled by retry)

This is generally the behavior with other Finagle clients (the load balancer guarantees to avoid Busy nodes). However, the memcached client is special in that it is partitioned, so Finagle doesn't have as much freedom (https://github.com/twitter/finagle/blob/develop/finagle-memcached/src/main/scala/com/twitter/finagle/Memcached.scala#L82-L87).

@maheshkelkar
Contributor Author

@roanta: Thanks for the response

Here are the diffs that I am adding for this PR: https://gist.github.com/maheshkelkar/75b87741008adf7d912cd9a36e40c862

I see 4 FailureAccrualFactory instances for localhost:11212, whether I configure FailFastFactory or not.
Where are you seeing four?

I see 4 instances marked dead for localhost:11212. See concurrent_lb_replica with values of 0,1,2,3 at the end.

D 0714 15:01:38.714 THREAD234 TraceId:a0a70e136dac97d4: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11212,Map(concurrent_lb_replica -> 2))
D 0714 15:01:38.739 THREAD234 TraceId:797cb15c29c1f414: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11212,Map(concurrent_lb_replica -> 0))
D 0714 15:01:38.745 THREAD234 TraceId:b4769a43446a65e5: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11212,Map(concurrent_lb_replica -> 1))
D 0714 15:01:38.750 THREAD234 TraceId:a0a70e136dac97d4: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11212,Map(concurrent_lb_replica -> 3))

After 30 seconds, i.e. the markDeadFor timeout, I see probing start on 4 instances:

W 0714 15:02:08.726 THREAD200 TraceId:a0a70e136dac97d4: ***FailureAccrualFactory: startProbing for ""
W 0714 15:02:08.748 THREAD200 TraceId:a0a70e136dac97d4: ***FailureAccrualFactory: startProbing for ""
W 0714 15:02:08.748 THREAD200 TraceId:7b2279e7c3c1cace: ***FailureAccrualFactory: startProbing for ""
W 0714 15:02:08.757 THREAD200 TraceId:b4769a43446a65e5: ***FailureAccrualFactory: startProbing for ""

why don't I see this on the very first attempt, when all 10 ops goto active node localhost:11211
I'm not sure I understand what you mean. What behavior are you referring to?

In this test, I have enabled FailFastFactory. I am executing 10 GET operations every 60 seconds. After 60 seconds, I see 2 requests picked up by 2 of the FAF instances, and they fail. As a result, I think 1-4 concurrent requests may fail every 60 seconds:

First, 2 requests are received:

D 0714 15:02:41.307 THREAD209 TraceId:b82a8065f1050e8a: Processing: Request(GET, mtp.localhost:9090/les with CustomerId: mtp
D 0714 15:02:41.307 THREAD210 TraceId:aa3f6e285c203e9a: Processing: Request(GET, mtp.localhost:9090/les with CustomerId: mtp

The 2 requests are picked up by 2 FailureAccrualFactory instances (not sure which ones). I don't see any FailFastFactory messages either.

W 0714 15:02:41.308 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 15:02:41.308 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 15:02:41.308 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory: stopProbing for ""
W 0714 15:02:41.308 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory: didFail for ""
W 0714 15:02:41.308 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory: stopProbing for ""
W 0714 15:02:41.309 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory: DEAD for ""
W 0714 15:02:41.308 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory: didFail for ""
W 0714 15:02:41.309 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory marking connection to "" as dead. Remote Address: Failed(com.twitter.finagle.Address$$anon$1: failing)
W 0714 15:02:41.309 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory marking connection to "" as dead. Remote Address: Failed(com.twitter.finagle.Address$$anon$1: failing)
W 0714 15:02:41.309 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory: DEAD for ""

Note that the remaining 8 requests are channeled to localhost:11211 and succeed. I tried configuring retries but that didn't work; I don't see any retry occurring. I wonder if it's protocol-specific.

        SessionStores.MemcachedStore(
          Memcached.client
            .configured(FailureAccrualFactory.Param(1, () => Duration.fromSeconds(30)))
            .configured(Memcached.param.EjectFailedHost(true))
            .configured(FailFastFactory.FailFast(true))
            .withRetryBudget(RetryBudget.Infinite)
            .newRichClient(s"memcached=${hosts}"))

@roanta
Contributor

roanta commented Jul 14, 2016

I see what's happening here. We replicate, and load balance over, a single endpoint to avoid head-of-line blocking when we pipeline memcached [1]. The problem is that we create a FailureAccrualFactory for each replica even though they share the same endpoint address. This would definitely make it less effective. Try setting the parameter connectionsPerEndpoint to 1 to verify we get the behavior we want w.r.t. FailureAccrualFactory [2].

I'm not yet sure what the appropriate fix is for this, let me think about it a bit more. In the meantime, let's fix the probe state issue since it's orthogonal.

[1] https://github.com/twitter/finagle/blob/develop/finagle-memcached/src/main/scala/com/twitter/finagle/Memcached.scala#L187-L196

[2] https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/param/ConcurrentLoadBalancingParams.scala#L28
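
For reference, a minimal sketch of a client configured this way (the connectionsPerEndpoint method on Memcached.client is assumed here; on 6.35.0 the equivalent may be the stack param from [2], set via .configured):

// Sketch under the assumption that Memcached.client exposes connectionsPerEndpoint;
// the other settings follow the configuration shown earlier in this issue.
import com.twitter.finagle.Memcached
import com.twitter.finagle.service.FailureAccrualFactory
import com.twitter.util.Duration

val client = Memcached.client
  .configured(FailureAccrualFactory.Param(1, () => Duration.fromSeconds(10)))
  .configured(Memcached.param.EjectFailedHost(true))
  .connectionsPerEndpoint(1) // one replica => one FailureAccrualFactory per host
  .newRichClient("localhost:11211,localhost:11212")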

@maheshkelkar
Contributor Author

Hi @roanta, thank you. Setting the connectionsPerEndpoint to 1 did solve that problem. And I can clearly see that only 1 FAF instance is used.

So I will go ahead and submit the PR for this issue.

I also noted this behavior: if I try 10 requests at a time, some of them (5 in the example below) get queued to be processed by the failed FAF. Even after that FAF instance is marked DEAD, we still go through the queue and eventually fail them.

Note that in this example 11211 is inactive.

D 0714 18:07:21.716 THREAD232 TraceId:3217119f247dc776: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11211,Map(concurrent_lb_replica -> 0))
W 0714 18:07:21.719 THREAD232 TraceId:3217119f247dc776: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.720 THREAD232 TraceId:3217119f247dc776: ***FailureAccrualFactory: didFail for ""
W 0714 18:07:21.720 THREAD232 TraceId:3217119f247dc776: ***FailureAccrualFactory: DEAD for ""
W 0714 18:07:21.721 THREAD232 TraceId:3217119f247dc776: ***FailureAccrualFactory marking connection to "" as dead. Remote Address: Failed(com.twitter.finagle.Address$$anon$1: failing)

At this point the FAF is dead, but we still continue processing the queued requests, and they fail:

W 0714 18:07:21.726 THREAD232 TraceId:aabdb26547896feb: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.726 THREAD232 TraceId:aabdb26547896feb: ***FailureAccrualFactory: didFail for ""
W 0714 18:07:21.726 THREAD232 TraceId:cd0eedac5bf41047: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.726 THREAD232 TraceId:cd0eedac5bf41047: ***FailureAccrualFactory: didFail for ""
W 0714 18:07:21.726 THREAD232 TraceId:41c7e1288d7ccc07: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.726 THREAD232 TraceId:41c7e1288d7ccc07: ***FailureAccrualFactory: didFail for ""
W 0714 18:07:21.726 THREAD232 TraceId:e66ba0ff875da440: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.726 THREAD232 TraceId:e66ba0ff875da440: ***FailureAccrualFactory: didFail for ""

@mosesn
Contributor

mosesn commented Jul 25, 2016

@maheshkelkar fixed this in #527, thanks a ton!!!

mosesn closed this as completed Jul 25, 2016