Thread hangs on exit #50

Open
tsilen opened this Issue Sep 25, 2012 · 26 comments

Contributor

tsilen commented Sep 25, 2012

Recently we've run into an issue where a process using ZK and redis_failover sometimes fails to die on exit and instead hangs in an infinite sched_yield() loop, taking all CPU. A SIGKILL is required to get rid of it. I guess it has to do with the program exiting without properly closing the connection first, but it should still die cleanly.

gdb output gives this:

#0  0x00007fb7e6765a67 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fb7e6bd86f7 in gvl_yield (vm=0x1bfdf20, th=<optimized out>) at thread_pthread.c:125
#2  0x00007fb7e6bdac16 in rb_thread_schedule_limits (limits_us=0) at thread.c:1025
#3  rb_thread_schedule_limits (limits_us=0) at thread.c:1033
#4  rb_thread_schedule () at thread.c:1035
#5  0x00007fb7e6bdad5f in rb_thread_terminate_all () at thread.c:375
#6  0x00007fb7e6abf89e in ruby_cleanup (ex=0) at eval.c:140
#7  0x00007fb7e6abfa25 in ruby_run_node (n=0x24f0428) at eval.c:244
#8  0x00000000004007fb in main (argc=3, argv=0x7fff7725e948) at main.c:38

After adding some debug code to the Ruby side to get a backtrace when the process is hung, I was able to get this:

Thread TID-t26i0
ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
shared/bundle/ruby/1.9.1/gems/zk-1.7.1/lib/zk/threadpool.rb:268:in `worker_thread_body'

When trying to reproduce it without redis_failover I was able to get it to hang in a similar way, but in a different place:

Thread TID-ccaag
ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common/queue_with_pipe.rb:59:in `pop'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common.rb:56:in `get_next_event'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common.rb:94:in `dispatch_thread_body'

and

Thread TID-alg44
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:110:in `wait'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:122:in `wait_while'
shared/bundle/ruby/1.9.1/gems/zk-1.7.1/lib/zk/client/threaded.rb:533:in `block in reconnect_thread_body'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:211:in `mon_synchronize'

The code in all of these is somewhat similar (@cond.wait).

Any ideas? Ruby is 1.9.3-p194, but it also happens at least on 1.9.3-p0. ZK 1.7.1, zookeeper 1.3.0. OS: Linux (Ubuntu 12.04).

At least in Ruby 1.9.3 you can give a timeout to ConditionVariable's wait(); maybe that would help.
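A minimal sketch of that timed-wait idea, using a plain Mutex/ConditionVariable pair (nothing ZK-specific; the 0.2-second timeout is just an illustrative value):

```ruby
require 'thread'

mutex = Mutex.new
cond  = ConditionVariable.new

waiter = Thread.new do
  mutex.synchronize do
    # On 1.9.3 the second argument is a timeout in seconds, so the
    # thread wakes up on its own even if nobody ever signals cond.
    cond.wait(mutex, 0.2)
  end
  :woke_up
end

waiter.join
```

A worker loop built on a timed wait can then re-check its shutdown flag each time it wakes, instead of blocking forever.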

With this I was able to get it to hang in a similar way quite often:

#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
$stdout.sync = true
@zkservers = "localhost:2181"

trap 'TTIN' do
  Thread.list.each do |thread|
    puts "Thread TID-#{thread.object_id.to_s(36)}"
    puts thread.backtrace.join("\n")
  end
end

def do_something
  zk = ZK.new(@zkservers)
  puts zk.children('/').inspect 
  sleep 1
end

puts "Pid: #{$$}"
count = 50
stack = []
(0..count).each do |i|
  stack << Thread.new { do_something }
end
sleep rand(0)

Run it in a while true; do ./test.rb; done loop until it gets stuck; then kill -TTIN prints the backtraces of the threads that are still alive.

Contributor

tsilen commented Sep 26, 2012

One workaround would be calling close! in at_exit { }
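Sketched out, with a stub standing in for the real client so it runs without a ZooKeeper server (with the zk gem this would be zk = ZK.new(@zkservers)):

```ruby
# StubZKClient is a stand-in for a real ZK client; only close! matters here.
class StubZKClient
  def initialize; @closed = false; end
  def close!;     @closed = true;  end
  def closed?;    @closed;         end
end

$zk = StubZKClient.new

# The workaround: close the client from an at_exit hook, so its
# internal threads are torn down before the interpreter's own
# end-of-process cleanup runs.
at_exit { $zk.close! unless $zk.closed? }
```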

Contributor

slyphon commented Sep 26, 2012

Just wanted to let you know I got this. I'm not working today, but I'll try
to get a fix together by the EOW.

Jonathan Simms
(sent from my mobile device)


Member

ryanlecompte commented Oct 8, 2012

Any update on this one?

+1

Contributor

slyphon commented Oct 8, 2012

Hey guys, sorry, been bogged down with holidays and nonsense. I'll try to get to this one this week.

Contributor

slyphon commented Oct 8, 2012

After a quick look, it seems the problem is (as always) 1.8.7 compatibility (yes, I know, talk to @eric), where only Monitor's condvar takes a timeout. I'll have to review it a little more closely. I like the at_exit { zk.close! } idea; I'm just not sure whether that will cause a memory leak, since it keeps a reference to the zk instance. It shouldn't matter, given that most people only use one or two zk instances, but still.

I'll try to review the places where wait() is being called and at least get a branch together that people can test against.

Member

ryanlecompte commented Oct 8, 2012

Thanks for taking a look! Regarding most people not having many ZK instances: that might not always be true if you're using redis_failover with a connection pool (for example, as used in Sidekiq). Sidekiq lets you create a connection pool with, e.g., 10 or 20 Redis clients. Each RedisFailover::Client instance will contain an underlying ZK instance.

Member

ryanlecompte commented Oct 8, 2012

By the way, I just updated (in master) RedisFailover::Client so that you can pass in an existing ZK client instance. You would use it like this:

  zk = ZK.new('localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185')
  pool = ConnectionPool.new(:size => 20) { RedisFailover::Client.new(:zk => zk) }
  pool.with { |client| client.get('foo') }  
Contributor

slyphon commented Oct 8, 2012

this is probably how it should be done, just in terms of resource usage
(open file handles, etc).


Contributor

slyphon commented Oct 8, 2012

Ok, so I think the test @tsilen came up with exposes a bug in Ruby. I've run into issues before with at-exit GC and thread primitives (in 1.8, but still). If you modify the above torture test:

#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
require 'logger'

$stdout.sync = true
$log = Logger.new($stderr).tap { |n| n.level = Logger::DEBUG }

@zkservers = "localhost:2181"

trap 'TTIN' do
  Thread.list.each do |thread|
    puts "Thread TID-#{thread.object_id.to_s(36)}"
    puts thread.backtrace.join("\n")
  end
end

def do_something
  ZK.open(@zkservers) do |zk|
    thr_id = Thread.current.object_id.to_s(36)
    $log.debug { "#{thr_id} #{zk.children('/').inspect}" }
  end
  sleep 1
end

puts "Pid: #{$$}"
count = 50
stack = []
(0..count).each do |i|
  stack << Thread.new { do_something }
end
# sleep rand(0)

stack.each do |th|
  begin
    th.join(3)
  rescue Exception => e
    puts "kill: #{e.class} #{e.message}"
  end
end

By waiting for the threads to exit before "falling off the end of the script" I can't get it to hang.
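The join-before-exit pattern in isolation (plain threads, nothing ZK-specific; the workloads and values here are arbitrary):

```ruby
# Spawn a few workers, then explicitly wait for each one (with a
# bounded join) before the main thread falls off the end of the script.
threads = 5.times.map do |i|
  Thread.new { sleep 0.01; i * 2 }
end

results = threads.map do |t|
  t.join(3)   # bounded wait so a stuck worker can't hang us forever
  t.value
end
```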

Contributor

tsilen commented Oct 9, 2012

Yes, the issue seems to happen when a thread is alive and in some specific state at the end of the script, so waiting for them to exit should also do the trick. I did some quick testing with the updated script and didn't get it to hang either, but I'll test a bit more tomorrow.

Contributor

slyphon commented Oct 9, 2012

Ok, cool. I'm like 99% sure that this is a bug in the way ruby does ConditionVariable cleanup.

Contributor

tsilen commented Oct 26, 2012

Looks like close! in at_exit indeed doesn't work in all cases. This is quite a big problem, because, for example, Rails (Rack) processes often hang when they should exit after being released from e.g. Passenger's pool, and keep eating a lot of resources. I think it's something that has to be handled within ZK.

Contributor

slyphon commented Oct 26, 2012

Yeah, you need to close your connections before "falling off the end of the script" as above. You cannot rely on Ruby to shut down these clients properly, because it doesn't clean up POSIX thread primitives properly. I tested in an at_exit and it didn't work; I had to wait for all of the clients to finish closing and join all threads before reaching the end.

I'm not sure how to handle this in terms of passenger, but I'd imagine that there's some kind of shutdown you can hook into.

Again, there's really nothing I can do on the ZK end, as I don't control the main thread. The only other thing you can do is do your work inside a ZK.open() block, which ensures the client is shut down when the block returns.
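The guarantee ZK.open gives can be sketched with a hypothetical with_client helper and a stub client (the names here are mine, not ZK's API): the ensure clause closes the client even when the block raises.

```ruby
# Stub standing in for a real ZK client so the sketch runs anywhere.
class StubClient
  def initialize; @closed = false; end
  def close!;     @closed = true;  end
  def closed?;    @closed;         end
end

# Hypothetical helper mirroring the block form: yield the client,
# then always close it, even if the block raised.
def with_client(client)
  yield client
ensure
  client.close!
end

happy = StubClient.new
with_client(happy) { |c| c }       # normal path: closed on return

sad = StubClient.new
begin
  with_client(sad) { raise "boom" } # error path: still closed
rescue RuntimeError
end
```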

cheald referenced this issue in mperham/sidekiq Dec 7, 2012

Closed

sidekiq randomly(?) hangs. #547

mperham commented Dec 7, 2012

Can you modify the code to use 1.9's condvar api when running in 1.9?

mperham commented Dec 7, 2012

We are seeing sidekiq hang occasionally in production on exit and yep we're running redis_failover.

Contributor

slyphon commented Dec 7, 2012

If you close ZK before the process exits, the hang doesn't happen, because all the underlying POSIX stuff gets cleaned up. Clean shutdown is key.

The reason I have to use Monitor is that @eric, who's been a major contributor, is still running 1.8, and I've agreed to maintain backwards compatibility. 1.8's ConditionVariable doesn't take a timeout, which makes this really difficult.


Member

eric commented Dec 7, 2012

Looking at the 1.9 code for Monitor, it looks like it's just using a standard ConditionVariable and Mutex under the covers. There shouldn't be any real difference in how it behaves on 1.9.

Contributor

slyphon commented Dec 7, 2012

Except that ConditionVariable and Mutex in 1.9 have real POSIX thread primitives backing them, and 1.9 doesn't clean them up effectively before exiting, so it deadlocks.


mperham commented Dec 7, 2012

If you're suggesting there's a bug in Ruby 1.9, that's quite possible. I found a bug in ConditionVariable last year, made a reproducible test case, and had it fixed in head the next morning.

Member

eric commented Dec 7, 2012

Yes, the consensus is that there is a bug in the 1.9 Mutex/ConditionVariable implementation that causes the VM to hang if there are threads that are waiting on a ConditionVariable when the process exits.

Does sidekiq do graceful shutdowns? If so, it's likely you could work around this by calling close! on the ZK object before exiting.

mperham commented Dec 7, 2012

Where's the redmine ticket for the bug?

Sidekiq knows nothing about ZK. It uses the redis client, which is wrapped by the redis_failover client. Sidekiq does not bother to close sockets/clients before shutting down, and redis_failover does not provide a close/shutdown operation.

Member

eric commented Dec 7, 2012

Someone will have to pipe up if they've isolated the problem into a reproducible test case, but from reading the thread, you and @tsilen have described it as an "occasional" issue, which leads me to believe neither of you has isolated it.

We're all volunteers here. I run Ruby 1.8 and haven't run into this specific issue, but hopefully one of the self-motivated people on this thread who runs 1.9 and has been impacted by this issue can find the time to isolate it and submit a redmine ticket.

In the meantime, it may be worth investigating a mechanism in Sidekiq that lets users of the library register callbacks for a clean shutdown, which would allow a graceful shutdown of the redis_failover or ZK client.

Member

ryanlecompte commented Dec 7, 2012

FYI, redis_failover does provide the RedisFailover::Client#shutdown method, which closes the ZK connection and the underlying Redis clients.


Contributor

slyphon commented Dec 7, 2012

"Where's the redmine ticket for the bug?"

You're kidding.


mperham commented Dec 7, 2012

Ryan, FYI, it's not listed on the page linked from your readme: http://rubydoc.info/github/ryanlecompte/redis_failover/master/RedisFailover/Client

arohter referenced this issue in ryanlecompte/redis_failover Jan 8, 2014

Merged

Connection handling improvements. #66
