Client possibly leaking state under high load #507

Closed
fw42 opened this issue Feb 11, 2015 · 17 comments
@fw42
Collaborator

fw42 commented Feb 11, 2015

We are running into some problems that look very much related to #502 and #501.

We have code that looks like this:

result = redis.pipelined do
  redis.incrby(key_name, 1)
  redis.expire(key_name, 1.day)
end

and we occasionally get an exception that says:
NoMethodError: undefined method `each_slice' for nil:NilClass

The backtrace looks as follows:

[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:2579 → block in _hashify
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:120 → call
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:120 → _set
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:67 → block in finish
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → each
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → each_with_index
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → each
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → map
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/pipeline.rb:66 → finish
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/client.rb:149 → block in call_pipeline
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/client.rb:279 → with_reconnect
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis/client.rb:147 → call_pipeline
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:2131 → block in pipelined
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:37 → block in synchronize
/usr/lib/shopify-ruby/2.1.5-shopify1/lib/ruby/2.1.0/monitor.rb:211 → mon_synchronize
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:37 → synchronize
[GEM_ROOT]/bundler/gems/redis-rb-26a4309f30b8/lib/redis.rb:2127 → pipelined
(now our code)

Looking at the redis-rb source code, this seems very odd, given that the only command that appears to use _hashify for parsing replies is hgetall. As you can see, neither our code nor the backtrace shows any usage of the hgetall command. We do use hgetall in other areas of our code, though.
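
For context, _hashify in redis-rb of that vintage is roughly the following helper inside the Redis class (paraphrased from memory, not an exact copy of the source), which is why a nil reply blows up with exactly the each_slice error shown above:

# Paraphrased approximation of redis-rb 3.x, not an exact copy: the lambda
# used to post-process HGETALL replies. If the reply handed to it is nil,
# the call to each_slice raises the NoMethodError from the report.
def _hashify
  lambda { |array|
    hash = Hash.new
    array.each_slice(2) do |field, value|
      hash[field] = value
    end
    hash
  }
end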

The above exception happens only intermittently, and we haven't yet figured out how to reproduce it locally. We can, however, consistently trigger it in production by enabling new code that uses hgetall (in a different codepath) through the same Redis client object as the code above.

We use redis-rb in a single-threaded Ruby on Rails (4.1.8) web application running on the Unicorn application server (4.8.2). Our redis-rb version is 26a4309, with hiredis 0.6.0 and MRI Ruby 2.1.5, and we use the default timeout values for the redis-rb client configuration.

Our current assumption is that the pipelining code might somehow be leaking state across requests (maybe a Future for hgetall is left in the pipeline, and the client tries to parse it when reading the result of the incrby+expire above).

Any tips on how to debug this would be greatly appreciated.

Thanks!

Ping @djanowski @badboy, since you worked on the PR that fixed the issue we thought was the cause of this (#502).

cc @camilo @jasonhl @byroot @csfrancis @tjoyal

@fw42
Collaborator Author

fw42 commented Feb 11, 2015

FWIW, a similar issue occurs when using multi instead of pipelined.
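
For reference, the multi variant of the snippet from the issue description would look essentially the same; only the entry point changes (a minimal sketch):

# Same commands as in the description, but wrapped in MULTI/EXEC
# instead of a plain pipeline.
result = redis.multi do
  redis.incrby(key_name, 1)
  redis.expire(key_name, 1.day)
end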

@fw42
Collaborator Author

fw42 commented Feb 11, 2015

The Redis server version is 2.8.17 by the way.

Redis server v=2.8.17 sha=00000000:0 malloc=jemalloc-3.6.0 bits=64 build=194f28f25c0b44fb

@djanowski
Collaborator

Thanks for the detailed report. Will look into it.

@djanowski
Collaborator

@fw42 Both @badboy and I spent some time trying to reproduce the issue, without luck. A couple of questions:

  • Is the production environment Linux? (If so, is your development environment, where you couldn't reproduce, non-Linux?)
  • By "high load", do you mean that Redis itself is under high load, the web server handling requests, or both?

@fw42
Collaborator Author

fw42 commented Feb 11, 2015

Yes, Linux (both locally and in production), kernel 3.8.x.

Sorry, I guess "high load" was a bit vague. I can't really quantify it precisely, but if you think it helps I can try to get some metrics (if you tell me exactly what you are looking for). Most of our server-side graphs look normal, though, and Redis itself is not under particularly high load at the time. Our application handles about 315,000 HTTP requests per minute and sends about 5,000-7,000 requests per minute to the Redis server in question, but that's a "normal" level for us.

The reason I said "under load" is just that we start seeing problems as soon as we enable a certain new codepath that sends a higher number of requests to Redis than before (about 1,000 requests/minute more). Sorry for the confusion.

And thanks for looking into this!

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

We just tried switching from the hiredis connection driver to the Ruby driver. Same problem.
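
(For context, the connection driver is just a client option in redis-rb; a minimal sketch, with a placeholder URL:)

# Selecting the connection driver in redis-rb; the URL is a placeholder.
redis = Redis.new(url: "redis://127.0.0.1:6379/0", driver: :ruby)      # pure-Ruby driver
# redis = Redis.new(url: "redis://127.0.0.1:6379/0", driver: :hiredis) # hiredis driver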

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

I just added additional logging to the Redis::Pipeline#finish method that logs the contents of the futures and replies arrays when an exception occurs. As we suspected, at the time this exception is raised and bubbles up, futures contains three elements (incrby, expire, and the mysterious hgetall, in that order), but the replies array contains only two values ([1, 1]). The hgetall we see in that pipeline is the one coming from the new code we are adding.
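
A rough sketch of the kind of instrumentation described above; the signature of Redis::Pipeline#finish and the futures reader are assumptions based on the redis-rb source of that era, and Rails.logger stands in for whatever logger the application uses:

# Rough sketch only: wrap Redis::Pipeline#finish to log the queued
# commands and the raw replies whenever reply-processing raises.
class Redis
  class Pipeline
    alias_method :finish_without_logging, :finish

    def finish(replies, &blk)
      finish_without_logging(replies, &blk)
    rescue StandardError => e
      # On failure, record which commands were queued and what actually came back.
      Rails.logger.error("#{e.class} in pipeline: futures=#{futures.inspect} replies=#{replies.inspect}")
      raise
    end
  end
end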

@badboy
Contributor

badboy commented Feb 12, 2015

Thanks for investigating further. I'm going to read the code once more; maybe I can find the problem (or even reproduce it).

One question: is the hgetall executed as a single command or also in some form of pipeline or transaction?

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

As far as I can tell, the hgetall is not explicitly executed in a pipeline; at least it's not supposed to be. I will investigate whether there is some other codepath I'm not aware of that still wraps it in a pipeline.

@badboy
Contributor

badboy commented Feb 12, 2015

You said you were using Unicorn, which means you have forks.
At what point do you initialize the Redis client? Might it be that the forked workers inherit the already-opened connection, which in turn would lead to some nasty problems?
Also see the unicorn example.

Update: I just took a deeper look. First, Redis.current.quit makes no sense here: it will first use the existing connection, reconnecting if necessary, and then quit again. Not what we want.
Second, redis-rb has checks in place to raise/reconnect on its own if it detects a PID change.

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

We reconnect all Redis clients in Unicorn's after_fork hook.

@badboy
Contributor

badboy commented Feb 12, 2015

How do you reconnect? Do you have one global Redis instance and reconnect that, or do you create a new one per fork?

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

We just call #disconnect on the Redis::Client, which will then automatically call #connect on the next command.
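
In config/unicorn.rb terms, that pattern looks roughly like the following sketch; $redis is a stand-in for whatever global Redis instance(s) the application actually holds, and the client accessor is assumed from redis-rb 3.x:

# Sketch of the reconnect-on-fork pattern described above.
after_fork do |server, worker|
  # Drop the socket inherited from the Unicorn master; redis-rb will
  # reconnect lazily on the next command this worker issues.
  $redis.client.disconnect
end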

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

So yeah, we don't instantiate a new object in after_fork (that's why we think there might be state leaking across requests). Changing it so that we do use a new object seems like a somewhat hacky workaround, although I guess it might actually work. I can try that next to confirm or rule out your suspicion that this is a forking-related issue.

Note, though, that this bug does not occur close in time to Unicorn forking, so I don't think it's related.

@badboy
Contributor

badboy commented Feb 12, 2015

Thinking about it, the fork theory shouldn't cause this issue, especially since you're doing the right thing and reconnecting.
My other theory only really works if you have threads accessing the same connection. So right now I don't have a clue what's happening.

@fw42
Collaborator Author

fw42 commented Feb 12, 2015

Ok, I feel like an idiot now :-)

The bug is in our code, not in redis-rb. Please stop investigating. Thanks guys for trying to help!

I will add another comment later explaining what exactly caused this. I'm sure you are curious now :-)

@badboy: Next time I'm in Aachen, I'll buy you a beer ;-)

fw42 closed this as completed Feb 12, 2015
@badboy
Contributor

badboy commented Feb 12, 2015

\o/
You're not an idiot. We're just humans who have to deal with computers, and computers are the idiots. Glad we got this figured out; I'm very interested in what the real cause was.

putting German hat on: Sounds great :)
