Work around read/close race (x2) #26714

Merged
merged 3 commits into from Oct 6, 2016

Projects

None yet

3 participants

@matthewd
Member
matthewd commented Oct 6, 2016

It is very very not-good if one thread calls close on an IO at the same time that another calls read.

The segfaults we've recently seen on Travis appear to be attributable to websocket-client-simple, but our real close method seems subject to the race too.

Branch is PRed here: shokai/websocket-client-simple#25

*** glibc detected *** /home/travis/.rvm/rubies/ruby-2.3.1/bin/ruby: corrupted double-linked list: 0x00007fdd500155d0 ***

valgrind --tool=memcheck --max-stackframe=8382448 --track-origins=yes ruby … output: https://gist.github.com/matthewd/0d584378aa3c083fa3581c72f15e292a

Huge thanks to @tenderlove, who spent a lot of time on this with me -- and without whom I'd still just be staring blankly at Valgrind documentation 🍻


I suspect the upstream fix is to wait inside rb_thread_fd_close until there are no threads where th->waiting_fd == fd. But I'll defer to @ko1 and @nobu on that 😅

@matthewd matthewd Close the IO from the read loop thread
IO#close and IO#read across threads don't get along so well:

After T1 enters #read and releases the GVL, T2 can call #close on the
IO, thereby both closing the fd and freeing the buffer while T1 is using
them.
4a7c568
@matthewd matthewd added this to the 5.0.1 milestone Oct 6, 2016
@matthewd matthewd self-assigned this Oct 6, 2016
matthewd added some commits Oct 6, 2016
@matthewd matthewd Use a branch of websocket-client-simple, to work around read/close race acd7ba7
@matthewd matthewd Wait for the socket to be closed asynchronously
1fe967d
@matthewd matthewd merged commit 3a11c5b into rails:master Oct 6, 2016

1 of 2 checks passed

codeclimate Code Climate is analyzing this code.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@matthewd
Member
matthewd commented Oct 6, 2016

Partially backported as e6b435d (websocket-client-simple is only used on master)

@nobu nobu added a commit to nobu/ruby that referenced this pull request Oct 7, 2016
@nobu nobu io.c: defer freeing buffers
* io.c (fptr_finalize): defer freeing buffers not to invalidate
  the buffer which is in use to read, in the case of read/close
  race.  rails/rails#26714

* thread.c (rb_thread_fd_close): return TRUE if the fd is in use.
b8f16cc
@nobu
Contributor
nobu commented Oct 7, 2016

I can't reproduce it with simple code.
Do you have any code, and does it fixed by this patch?
ruby/ruby@trunk...nobu:bug/io-close-race

@matthewd
Member
matthewd commented Oct 8, 2016

@nobu I'm not sure 😕

I've run it for a while, and haven't seen the previous segfault. But I have seen some errors like NameError: uninitialized constant WebSocket::Handshake, which don't normally occur.

Before, I got a segfault every ~20 runs.

With this PR, I ran 1500 times without any problems.

Without this PR and with ruby/ruby@b8f16cc, I've seen two (different) missing constant errors and one test failure, in just 300 runs.


If it helps, here's a recipe to reproduce the original problem:

On Linux (I'm using Ubuntu 16.04 on AWS; 4.4.0-38-generic x86_64) -- I could not reproduce on Mac OS X.

I built ruby/ruby@54b8015 (tip of ruby_2_3 at the time I set it up)

Clone rails/rails@39fb306

gem install bundler; bundle install

cd actioncable/

i=0; while ..../bin/ruby -rbundler/setup -Itest test/client_test.rb -v; do i=$(( $i+1 )); echo -e "\n\n$i :: `date`\n"; done

For me, that will segfault and stop at about loop 20-30.

@matthewd
Member
matthewd commented Oct 8, 2016 edited

I've also tried to get a simple script to reproduce it. This crashes in the same area, but it has a different error:

Edit: This still crashes with ruby/ruby@54b8015 + a cherry-pick of ruby/ruby@b8f16cc -- gist showing both a read into freed memory, and a read into 0x0

100.times do
  60.times.map do |i|
    Thread.new do
      x = i.to_s
      s1, s2 = IO.pipe
      t2 = Thread.new { 100.times { s2 << (x * 1000) } }
      t = Thread.new { loop { s1.getc } }
      Thread.new { sleep 0.1; s1.close }.join
      # sleep 0.1; s1.close 
      begin
        t.join
      rescue IOError
      end
      print "."
    end
  end.map(&:join)
  print "\n"
end
==5076== Invalid read of size 1
==5076==    at 0x1626CA: io_getc (io.c:3535)
==5076==    by 0x163116: rb_io_getc (io.c:3778)
==5076==    by 0x26B2B1: call_cfunc_0 (vm_insnhelper.c:1469)
==5076==    by 0x26BD26: vm_call_cfunc_with_frame (vm_insnhelper.c:1642)
==5076==    by 0x26BE46: vm_call_cfunc (vm_insnhelper.c:1737)
==5076==    by 0x270D51: vm_exec_core (insns.def:994)
==5076==    by 0x282499: vm_exec (vm.c:1650)
==5076==    by 0x2803D4: invoke_block (vm.c:921)
==5076==    by 0x280796: invoke_block_from_c_0 (vm.c:971)
==5076==    by 0x28083A: invoke_block_from_c_splattable (vm.c:988)
==5076==    by 0x280986: vm_yield (vm.c:1023)
==5076==    by 0x27C2CB: rb_yield_0 (vm_eval.c:1010)
==5076==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==5076== 
./test.rb:7: [BUG] Segmentation fault at 0x00000000000000
ruby 2.3.2p193 (2016-10-04 revision 56331) [x86_64-linux]

Also, I can't reproduce it now 😅 , but when I was trying different things in this test script earlier, at one time I saw Valgrind complain about calls to read with an fd of -1. As we set fptr->fd = -1 in fptr_finalize, there may be a memory-read race here too?

@nobu nobu added a commit to nobu/ruby that referenced this pull request Oct 8, 2016
@nobu nobu io.c: defer freeing buffers
* io.c (fptr_finalize): defer freeing buffers not to invalidate
  the buffer which is in use to read, in the case of read/close
  race.  rails/rails#26714

* thread.c (rb_thread_fd_close): return TRUE if the fd is in use.
647e951
@nobu
Contributor
nobu commented Oct 8, 2016

Thank you, I've updated the branch.

@nobu
Contributor
nobu commented Oct 8, 2016

It caused unexpected IOErrors in rubyspec.
Give me time a little.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment