Client possibly leaking state under high load #507

Closed
fw42 opened this issue Feb 11, 2015 · 17 comments
@fw42
Collaborator

fw42 commented Feb 11, 2015

We are running into some problems that look very much related to #502 and #501.

We have code that looks like this:

result = redis.pipelined do
  redis.incrby(key_name, 1)
  redis.expire(key_name, 1.day)
end

and we occasionally get an exception that says:
NoMethodError: undefined method `each_slice' for nil:NilClass

The backtrace looks as follows:

[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:2579 → block in _hashify
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:120 → call
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:120 → _set
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:67 → block in finish
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → each
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → each_with_index
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → each
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → map
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → finish
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/client.rb:149 → block in call_pipeline
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/client.rb:279 → with_reconnect
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/client.rb:147 → call_pipeline
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:2131 → block in pipelined
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:37 → block in synchronize
/usr/lib/shopify-ruby/2.1.5-shopify1/lib/ruby/2.1.0/monitor.rb:211 → mon_synchronize
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:37 → synchronize
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:2127 → pipelined
(now our code)

Looking at the redis-rb source code, this seems very odd, given that the only command that appears to use _hashify for parsing replies is hgetall. As you can see, neither our code nor the backtrace shows any usage of the hgetall command. We do use hgetall in other areas of our code, though.
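
For context, _hashify in redis-rb of that vintage is roughly the following helper inside the Redis class (paraphrased from memory, not an exact copy of the source), which is why a nil reply blows up with exactly the each_slice error shown above:

# Paraphrased approximation of redis-rb 3.x, not an exact copy: the lambda
# used to post-process HGETALL replies. If the reply handed to it is nil,
# the call to each_slice raises the NoMethodError from the report.
def _hashify
  lambda { |array|
    hash = Hash.new
    array.each_slice(2) do |field, value|
      hash[field] = value
    end
    hash
  }
end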

The above exception happens only intermittently, and we haven't yet figured out how to reproduce it locally. We can, however, consistently trigger it in production by enabling new code that uses hgetall (in a different codepath) through the same Redis client object as the code above.

We use redis-rb in a single-threaded Ruby on Rails (4.1.8) web application running on the Unicorn application server (4.8.2). Our redis-rb version is 26a4309, with hiredis 0.6.0 and MRI Ruby 2.1.5, and we use the default timeout values for the redis-rb client configuration.

Our current assumption is that the pipelining code might somehow be leaking state across requests (maybe a Future for hgetall is left in the pipeline, and the client tries to parse it when reading the result of the incrby+expire above).

Any tips on how to debug this would be greatly appreciated.

Thanks!

Ping @djanowski @badboy, since you worked on the PR that fixed the issue we thought was the cause of this (#502).

cc @camilo @jasonhl @byroot @csfrancis @tjoyal

@fw42
Collaborator Author

fw42 commented Feb 11, 2015

FWIW, a similar issue occurs when using multi instead of pipelined.
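
For reference, the multi variant of the snippet from the issue description would look essentially the same; only the entry point changes (a minimal sketch):

# Same commands as in the description, but wrapped in MULTI/EXEC
# instead of a plain pipeline.
result = redis.multi do
  redis.incrby(key_name, 1)
  redis.expire(key_name, 1.day)
end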

@fw42
Collaborator Author

fw42 commented Feb 11, 2015

The Redis server version is 2.8.17 by the way.

Redis server v=2.8.17 sha=00000000:0 malloc=jemalloc-3.6.0 bits=64 build=194f28f25c0b44fb

@djanowski
Collaborator

Thanks for the detailed report. Will look into it.

@djanowski
Collaborator

@fw42 Both @badboy and I spent some time trying to reproduce the issue, without luck. A couple of questions:

  • Is the production environment Linux? (If so, is your development environment, where you couldn't reproduce, non-Linux?)
  • By "high load", do you mean that Redis itself is under high load, the web server handling requests, or both?

@fw42
Collaborator Author

fw42 commented Feb 11, 2015

Yes, Linux (both locally and in production), kernel 3.8.x.

Sorry, I guess "high load" was a bit vague. I can't really quantify it precisely, but if you think it helps I can try to get some metrics (if you tell me exactly what you are looking for). Most of our server-side graphs look normal, though, and Redis itself is not under particularly high load at the time. Our application handles about 315,000 HTTP requests per minute and sends about 5,000-7,000 requests per minute to the Redis server in question, but that's a "normal" level for us.

The reason I said "under load" is just that we start seeing problems as soon as we enable a certain new codepath that sends a higher number of requests to Redis than before (about 1,000 requests/minute more). Sorry for the confusion.

And thanks for looking into this!

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

We just tried switching from the hiredis connection driver to the Ruby driver. Same problem.
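
(For context, the connection driver is just a client option in redis-rb; a minimal sketch, with a placeholder URL:)

# Selecting the connection driver in redis-rb; the URL is a placeholder.
redis = Redis.new(url: "redis://127.0.0.1:6379/0", driver: :ruby)      # pure-Ruby driver
# redis = Redis.new(url: "redis://127.0.0.1:6379/0", driver: :hiredis) # hiredis driver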

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

I just added additional logging to the Redis::Pipeline#finish method that logs the contents of the futures and replies arrays when an exception occurs. As we suspected, at the time this exception is raised and bubbles up, futures contains three elements (incrby, expire, and the mysterious hgetall, in that order), but the replies array contains only two values ([1, 1]). The hgetall we see in that pipeline is the one coming from the new code we are adding.
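
A rough sketch of the kind of instrumentation described above; the signature of Redis::Pipeline#finish and the futures reader are assumptions based on the redis-rb source of that era, and Rails.logger stands in for whatever logger the application uses:

# Rough sketch only: wrap Redis::Pipeline#finish to log the queued
# commands and the raw replies whenever reply-processing raises.
class Redis
  class Pipeline
    alias_method :finish_without_logging, :finish

    def finish(replies, &blk)
      finish_without_logging(replies, &blk)
    rescue StandardError => e
      # On failure, record which commands were queued and what actually came back.
      Rails.logger.error("#{e.class} in pipeline: futures=#{futures.inspect} replies=#{replies.inspect}")
      raise
    end
  end
end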

@badboy
Contributor

badboy commented Feb 12, 2015

Thanks for investigating further. I'm going to read the code once more; maybe I can find the problem (or even reproduce it).

One question: is the hgetall executed as a single command or also in some form of pipeline or transaction?

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

As far as I can tell, the hgetall is not explicitly executed in a pipeline; at least it's not supposed to be. I will investigate whether there is some other codepath I'm not aware of that still wraps it in a pipeline.

@badboy
Contributor

badboy commented Feb 12, 2015

You said you were using Unicorn, which means you have forks.
At what point do you initialize the Redis client? Might it be that the forked workers inherit the already-opened connection, which in turn would lead to some nasty problems?
Also see the unicorn example.

Update: I just took a deeper look. First, Redis.current.quit makes no sense here: it will first use the existing connection, reconnecting if necessary, and then quit again. Not what we want.
Second, redis-rb has checks in place to raise/reconnect on its own if it detects a PID change.

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

We reconnect all Redis clients in Unicorn's after_fork hook.

@badboy
Contributor

badboy commented Feb 12, 2015

How do you reconnect? Do you have one global Redis instance and reconnect that, or do you create a new one per fork?

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

We just call #disconnect on the Redis::Client, which will then automatically call #connect on the next command.
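
In config/unicorn.rb terms, that pattern looks roughly like the following sketch; $redis is a stand-in for whatever global Redis instance(s) the application actually holds, and the client accessor is assumed from redis-rb 3.x:

# Sketch of the reconnect-on-fork pattern described above.
after_fork do |server, worker|
  # Drop the socket inherited from the Unicorn master; redis-rb will
  # reconnect lazily on the next command this worker issues.
  $redis.client.disconnect
end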

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

So yeah, we don't instantiate a new object in after_fork (that's why we think there might be state leaking across requests). Changing it so that we do use a new object seems like a somewhat hacky workaround, although I guess it might actually work. I can try that next to confirm or rule out your suspicion that this is a forking-related issue.

Note, though, that this bug does not occur close in time to Unicorn forking, so I don't think it's related.

@badboy
Contributor

badboy commented Feb 12, 2015

Thinking about it, the fork theory shouldn't cause this issue, especially since you're doing the right thing and reconnecting.
My other theory only really works if you have threads accessing the same connection. So right now I don't have a clue what's happening.

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

Ok, I feel like an idiot now :-)

The bug is in our code, not in redis-rb. Please stop investigating. Thanks guys for trying to help!

I will add another comment later explaining what exactly caused this. I'm sure you are curious now :-)

@badboy: Next time I'm in Aachen, I'll buy you a beer ;-)

fw42 closed this as completed Feb 12, 2015
@badboy
Contributor

badboy commented Feb 12, 2015

\o/
You're not an idiot. We're just humans who have to deal with computers, and computers are the idiots. Glad we got this figured out; I'm very interested in what the real cause was.

putting German hat on: Sounds great :)
