
Inactive memcached node remains stuck in ProbeOpen state, when EjectFailedHost is configured #524

Closed
maheshkelkar opened this issue Jul 8, 2016 · 11 comments


@maheshkelkar
Contributor

maheshkelkar commented Jul 8, 2016

Inactive memcached node remains stuck in ProbeOpen state, when EjectFailedHost is configured.

Expected behavior

  • I am using Finagle 6.35.0, with the Memcached client configured as follows:
Memcached.client
  .configured(FailureAccrualFactory.Param(1, () => Duration.fromSeconds(10)))
  .configured(Memcached.param.EjectFailedHost(true))
  .newRichClient("localhost:11211,localhost:11212")
  • Every 30 seconds, I perform get and/or set operations with 10 different key-value pairs
  • Node localhost:11212 is down the entire time (no server is listening on that port)
  • I expect that, since EjectFailedHost and a strict FailureAccrualPolicy are configured, all 10 operations succeed every 30 seconds by going to memcached node localhost:11211
  • I expect that memcached node localhost:11212 stays Dead and out of the ring the whole time.

Actual behavior

  • On the first attempt (say time-0), all 10 operations succeed, all of them going to the active node localhost:11211. The other node, localhost:11212, is marked Dead (the log message confirms this). So far this matches my expectation.
  • On the 2nd and successive attempts, some (say 4) of the 10 operations try to connect to the inactive node localhost:11212 and fail.
  • I tried configuring .configured(FailFastFactory.FailFast(true)), but that didn't help either.

Steps to reproduce the behavior

  • Create a Memcached client with the code above (2 memcached nodes, 1 active, the other inactive)
  • Every 30 seconds, try 10 set operations with different key-value pairs
  • On the 2nd and successive attempts, some operations fail

Debugging Observations

  • The inactive node gets ejected after 1st failure (i.e. 1st attempt)
  • After markDeadFor timeout (10 seconds in this case), on startProbing (i.e. state=probeOpen), the node is added back to ring.
  • When service/probe is invoked on this node, it tries to process underlying(conn).map and
    hits the following exception:
com.twitter.finagle.ChannelWriteException: java.net.ConnectException: Connection refused: localhost/0:0:0:0:0:0:0:1:11213. Remote Info: Upstream Address: /127.0.0.1:54459, Upstream Client Id: Not Available, Downstream Address: localhost/0:0:0:0:0:0:0:1:11213, Downstream Client Id: fixedinet!localhost:11213, Trace Id: 88c6d00fd9f95341.8f24651fa01698bc<:f700f541942f60f2
  • The exception is caught by onServiceAcquisitionFailure, which calls didFail while state=ProbeOpen. But didFail lacks a check for the ProbeOpen state and fails to put the node back into state=Dead.
  • Because of the exception, it skips this code block.
  • Thus, the node remains stuck in the ring, causing failures (see the sketch below).
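
To make the fix described above concrete, here is a minimal illustrative sketch (class and member names are simplified stand-ins, not the actual FailureAccrualFactory source): a service-acquisition failure observed while the state is ProbeOpen should exit the probe state so the node can be marked Dead again instead of staying in the ring.

// Illustrative sketch only; the real FailureAccrualFactory has more states and
// accrual logic. The key point is that onServiceAcquisitionFailure must handle
// the ProbeOpen case rather than leaving the factory stuck probing.
sealed trait State
case object Alive extends State
case object Dead extends State
case object ProbeOpen extends State
case object ProbeClosed extends State

final class FailureAccrualSketch {
  @volatile private[this] var state: State = Alive

  def onServiceAcquisitionFailure(): Unit = {
    if (state == ProbeOpen) stopProbing() // the missing step described in this report
    didFail()
  }

  private def stopProbing(): Unit =
    state = ProbeClosed

  private def didFail(): Unit =
    // accrual policy elided; after enough failures the node is marked Dead
    state = Dead
}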
@roanta
Contributor

roanta commented Jul 8, 2016

@maheshkelkar Thanks for the detailed bug report! I think you're right and we should transition to ProbeClosed in onServiceAcquisitionFailure. Would you like to submit a PR for this?

@argha-c

argha-c commented Jul 8, 2016

👍 to this issue. I have observed this while writing an integration test using 2 memcached containers: killing one, killing both, and then bringing them back up.

Thanks for the detailed report on the findings.

@maheshkelkar
Contributor Author

Hello @roanta, I will submit a PR on Monday.

@maheshkelkar
Contributor Author

maheshkelkar commented Jul 12, 2016

@roanta @argha-c @kevinoliver

I implemented a simple solution, stopProbing, that gets executed whether we succeed, fail, or hit an exception. However, what I notice is:

  1. There are 4 instances of FailureAccrualFactory. Not sure why.
  2. After the markDeadFor timeout, all 4 objects start probing.
  3. If one or more requests arrive at the same time, each of those objects accepts a request, fails, and goes back to the dead state. If requests are sparse, then only 1 request fails, marking all of these objects dead.

So, every 30 seconds, we lose at least 1-4 operations in probing.

Do we have an option of invoking a probe (i.e. a self-probe) on one of these objects? I thought FailFast does that very thing, but I tried it and it doesn't seem to probe.

@roanta
Contributor

roanta commented Jul 12, 2016

FailureAccrualFactory and FailFastFactory both operate per session, so there should be one instance per memcached host that your client talks to. Your observations sound like correct behavior to me given the respective hosts are actually down and your client has concurrent requests.

@maheshkelkar
Contributor Author

@roanta: Thanks for the response.

  • Are you suggesting that we have 1 FailureAccrualFactory per thread, i.e. a session per thread?
  • I see 4 FailureAccrualFactory instances for localhost:11212, whether I configure FailFastFactory or not.

But that's not the real problem. IMO the problem is that probing causes at least 1 failure after the markDeadFor timeout.

I wonder:

  • why don't I see this on the very first attempt, when all 10 ops go to the active node localhost:11211
  • why don't we use auto-triggered probing, instead of wasting a request
  • if the request fails, why not attempt it again on a different memcached node (I wonder if this can be controlled by retries)

@roanta
Contributor

roanta commented Jul 13, 2016

Are you suggesting that we have 1 FailureAccrualFactory per thread? i.e. we have a session per thread?

Not per-thread; Finagle multiplexes sessions over threads. Each session that your client connects to gets a new FailureAccrualFactory or FailFastFactory (if enabled). Take a look at http://twitter.github.io/finagle/guide/Clients.html -- the branching after the load balancer represents the per-session stack.

I see 4 FailureAccrualFactory instances for localhost:11212, whether I configure FailFastFactory or not.

Where are you seeing four? You might want to check how many nodes the client's load balancer sees; it should help clarify: https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/loadbalancer/Balancer.scala#L97. One thing to note is that "localhost" can resolve to multiple network interfaces, which are treated as separate endpoints by the client.
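
One quick way to check the "localhost can resolve to multiple interfaces" point is to print the resolved addresses directly (a plain JDK call, not Finagle-specific); a minimal sketch:

// Prints every address "localhost" resolves to; each distinct address can
// show up as a separate endpoint to the client.
import java.net.InetAddress

object LocalhostCheck extends App {
  InetAddress.getAllByName("localhost").foreach(println)
}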

why don't I see this on the very first attempt, when all 10 ops goto active node localhost:11211

I'm not sure I understand what you mean. What behavior are you referring to?

why do we not use auto-triggering for probing, instead of wasting a request

We have circuit breakers in Finagle that work out-of-band of requests, but they require protocol support (e.g. https://github.com/twitter/finagle/blob/develop/finagle-mux/src/main/scala/com/twitter/finagle/mux/ThresholdFailureDetector.scala). FailureAccrualFactory is designed to be protocol-agnostic and in-band with requests, which has its advantages.

if we fail the request, why not attept it again over the different memcached node (I wonder if this can be controlled by retry)

This is generally the behavior with other Finagle clients (the load balancer guarantees to avoid Busy nodes). However, the memcached client is special in that it is partitioned, so Finagle doesn't have as much freedom (https://github.com/twitter/finagle/blob/develop/finagle-memcached/src/main/scala/com/twitter/finagle/Memcached.scala#L82-L87).

@maheshkelkar
Contributor Author

@roanta: Thanks for the response

Here are the diffs that I am adding for this PR: https://gist.github.com/maheshkelkar/75b87741008adf7d912cd9a36e40c862

I see 4 FailureAccrualFactory instances for localhost:11212, whether I configure FailFastFactory or not.
Where are you seeing four?

I see 4 instances marked dead for localhost:11212. See concurrent_lb_replica with values of 0,1,2,3 at the end.

D 0714 15:01:38.714 THREAD234 TraceId:a0a70e136dac97d4: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11212,Map(concurrent_lb_replica -> 2))
D 0714 15:01:38.739 THREAD234 TraceId:797cb15c29c1f414: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11212,Map(concurrent_lb_replica -> 0))
D 0714 15:01:38.745 THREAD234 TraceId:b4769a43446a65e5: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11212,Map(concurrent_lb_replica -> 1))
D 0714 15:01:38.750 THREAD234 TraceId:a0a70e136dac97d4: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11212,Map(concurrent_lb_replica -> 3))

After 30 seconds, i.e. the markDeadFor timeout, I see probing start on 4 instances:

W 0714 15:02:08.726 THREAD200 TraceId:a0a70e136dac97d4: ***FailureAccrualFactory: startProbing for ""
W 0714 15:02:08.748 THREAD200 TraceId:a0a70e136dac97d4: ***FailureAccrualFactory: startProbing for ""
W 0714 15:02:08.748 THREAD200 TraceId:7b2279e7c3c1cace: ***FailureAccrualFactory: startProbing for ""
W 0714 15:02:08.757 THREAD200 TraceId:b4769a43446a65e5: ***FailureAccrualFactory: startProbing for ""

why don't I see this on the very first attempt, when all 10 ops goto active node localhost:11211
I'm not sure I understand what you mean. What behavior are you referring to?

In this test, I have enabled FailFastFactory. I am executing 10 GET operations every 60 seconds. After 60 seconds, I see 2 requests picked up by 2 of the FAF instances, and they fail. As a result, I think 1-4 concurrent requests may fail every 60 seconds:

First, 2 requests are received:

D 0714 15:02:41.307 THREAD209 TraceId:b82a8065f1050e8a: Processing: Request(GET, mtp.localhost:9090/les with CustomerId: mtp
D 0714 15:02:41.307 THREAD210 TraceId:aa3f6e285c203e9a: Processing: Request(GET, mtp.localhost:9090/les with CustomerId: mtp

The 2 requests are picked up by 2 FailureAccrualFactory instances (not sure which ones). I don't see any FailFastFactory messages either.

W 0714 15:02:41.308 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 15:02:41.308 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 15:02:41.308 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory: stopProbing for ""
W 0714 15:02:41.308 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory: didFail for ""
W 0714 15:02:41.308 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory: stopProbing for ""
W 0714 15:02:41.309 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory: DEAD for ""
W 0714 15:02:41.308 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory: didFail for ""
W 0714 15:02:41.309 THREAD209 TraceId:b82a8065f1050e8a: ***FailureAccrualFactory marking connection to "" as dead. Remote Address: Failed(com.twitter.finagle.Address$$anon$1: failing)
W 0714 15:02:41.309 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory marking connection to "" as dead. Remote Address: Failed(com.twitter.finagle.Address$$anon$1: failing)
W 0714 15:02:41.309 THREAD210 TraceId:aa3f6e285c203e9a: ***FailureAccrualFactory: DEAD for ""

Note that the remaining 8 requests are channeled to localhost:11211 and succeed. I tried configuring retries but that didn't work; I don't see any retry occurring. I wonder if it's protocol-specific.

        SessionStores.MemcachedStore(
          Memcached.client
            .configured(FailureAccrualFactory.Param(1, () => Duration.fromSeconds(30)))
            .configured(Memcached.param.EjectFailedHost(true))
            .configured(FailFastFactory.FailFast(true))
            .withRetryBudget(RetryBudget.Infinite)
            .newRichClient(s"memcached=${hosts}"))

@roanta
Contributor

roanta commented Jul 14, 2016

I see what's happening here. We replicate, and load balance over, a single endpoint to avoid head-of-line blocking when we pipeline memcached [1]. The problem is that we create a FailureAccrualFactory for each replica even though they share the same endpoint address. This would definitely make it less effective. Try setting the parameter connectionsPerEndpoint to 1 to verify we get the behavior we want w.r.t. FailureAccrualFactory [2].

I'm not yet sure what the appropriate fix is for this, let me think about it a bit more. In the meantime, let's fix the probe state issue since it's orthogonal.

[1] https://github.com/twitter/finagle/blob/develop/finagle-memcached/src/main/scala/com/twitter/finagle/Memcached.scala#L187-L196

[2] https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/param/ConcurrentLoadBalancingParams.scala#L28
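
For reference, a minimal sketch of a client configured this way (the connectionsPerEndpoint method on Memcached.client is assumed here; on 6.35.0 the equivalent may be the stack param from [2], set via .configured):

// Sketch under the assumption that Memcached.client exposes connectionsPerEndpoint;
// the other settings follow the configuration shown earlier in this issue.
import com.twitter.finagle.Memcached
import com.twitter.finagle.service.FailureAccrualFactory
import com.twitter.util.Duration

val client = Memcached.client
  .configured(FailureAccrualFactory.Param(1, () => Duration.fromSeconds(10)))
  .configured(Memcached.param.EjectFailedHost(true))
  .connectionsPerEndpoint(1) // one replica => one FailureAccrualFactory per host
  .newRichClient("localhost:11211,localhost:11212")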

@maheshkelkar
Contributor Author

Hi @roanta, thank you. Setting the connectionsPerEndpoint to 1 did solve that problem. And I can clearly see that only 1 FAF instance is used.

So I will go ahead and submit the PR for this issue.

I also noted this behavior: if I try 10 requests at a time, some of them (5 in the example below) get queued to be processed by the failed FAF. Even after that FAF instance is marked DEAD, we still go through the queue and eventually fail them.

Note that in this example 11211 is inactive.

D 0714 18:07:21.716 THREAD232 TraceId:3217119f247dc776: FailFastFactory marking connection to "memcached" as dead. Remote Address: Inet(localhost/127.0.0.1:11211,Map(concurrent_lb_replica -> 0))
W 0714 18:07:21.719 THREAD232 TraceId:3217119f247dc776: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.720 THREAD232 TraceId:3217119f247dc776: ***FailureAccrualFactory: didFail for ""
W 0714 18:07:21.720 THREAD232 TraceId:3217119f247dc776: ***FailureAccrualFactory: DEAD for ""
W 0714 18:07:21.721 THREAD232 TraceId:3217119f247dc776: ***FailureAccrualFactory marking connection to "" as dead. Remote Address: Failed(com.twitter.finagle.Address$$anon$1: failing)

At this point the FAF is dead, but we still continue processing the queued requests, and they fail:

W 0714 18:07:21.726 THREAD232 TraceId:aabdb26547896feb: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.726 THREAD232 TraceId:aabdb26547896feb: ***FailureAccrualFactory: didFail for ""
W 0714 18:07:21.726 THREAD232 TraceId:cd0eedac5bf41047: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.726 THREAD232 TraceId:cd0eedac5bf41047: ***FailureAccrualFactory: didFail for ""
W 0714 18:07:21.726 THREAD232 TraceId:41c7e1288d7ccc07: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.726 THREAD232 TraceId:41c7e1288d7ccc07: ***FailureAccrualFactory: didFail for ""
W 0714 18:07:21.726 THREAD232 TraceId:e66ba0ff875da440: ***FailureAccrualFactory: onServiceAcquisitionFailure for ""
W 0714 18:07:21.726 THREAD232 TraceId:e66ba0ff875da440: ***FailureAccrualFactory: didFail for ""

@mosesn
Contributor

mosesn commented Jul 25, 2016

@maheshkelkar fixed this in #527, thanks a ton!!!

mosesn closed this as completed Jul 25, 2016