New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around read/close race (x2) #26714

Merged
merged 3 commits into from Oct 6, 2016

Conversation

Projects
None yet
3 participants
@matthewd
Member

matthewd commented Oct 6, 2016

It is very very not-good if one thread calls close on an IO at the same time that another calls read.

The segfaults we've recently seen on Travis appear to be attributable to websocket-client-simple, but our real close method seems subject to the race too.

Branch is PRed here: shokai/websocket-client-simple#25

*** glibc detected *** /home/travis/.rvm/rubies/ruby-2.3.1/bin/ruby: corrupted double-linked list: 0x00007fdd500155d0 ***

valgrind --tool=memcheck --max-stackframe=8382448 --track-origins=yes ruby … output: https://gist.github.com/matthewd/0d584378aa3c083fa3581c72f15e292a

Huge thanks to @tenderlove, who spent a lot of time on this with me -- and without whom I'd still just be staring blankly at Valgrind documentation 🍻


I suspect the upstream fix is to wait inside rb_thread_fd_close until there are no threads where th->waiting_fd == fd. But I'll defer to @ko1 and @nobu on that πŸ˜…

Close the IO from the read loop thread
IO#close and IO#read across threads don't get along so well:

After T1 enters #read and releases the GVL, T2 can call #close on the
IO, thereby both closing the fd and freeing the buffer while T1 is using
them.

@matthewd matthewd added this to the 5.0.1 milestone Oct 6, 2016

@matthewd matthewd self-assigned this Oct 6, 2016

@matthewd matthewd force-pushed the matthewd:close-race branch to 1fe967d Oct 6, 2016

@matthewd matthewd merged commit 3a11c5b into rails:master Oct 6, 2016

1 of 2 checks passed

codeclimate Code Climate is analyzing this code.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

matthewd added a commit that referenced this pull request Oct 6, 2016

Merge pull request #26714 from matthewd/close-race
Work around read/close race (x2)

(partial backport only)
@matthewd

This comment has been minimized.

Member

matthewd commented Oct 6, 2016

Partially backported as e6b435d (websocket-client-simple is only used on master)

nobu added a commit to nobu/ruby that referenced this pull request Oct 7, 2016

io.c: defer freeing buffers
* io.c (fptr_finalize): defer freeing buffers not to invalidate
  the buffer which is in use to read, in the case of read/close
  race.  rails/rails#26714

* thread.c (rb_thread_fd_close): return TRUE if the fd is in use.
@nobu

This comment has been minimized.

Contributor

nobu commented Oct 7, 2016

I can't reproduce it with simple code.
Do you have any code, and does it fixed by this patch?
ruby/ruby@trunk...nobu:bug/io-close-race

@matthewd

This comment has been minimized.

Member

matthewd commented Oct 8, 2016

@nobu I'm not sure πŸ˜•

I've run it for a while, and haven't seen the previous segfault. But I have seen some errors like NameError: uninitialized constant WebSocket::Handshake, which don't normally occur.

Before, I got a segfault every ~20 runs.

With this PR, I ran 1500 times without any problems.

Without this PR and with ruby/ruby@b8f16cc, I've seen two (different) missing constant errors and one test failure, in just 300 runs.


If it helps, here's a recipe to reproduce the original problem:

On Linux (I'm using Ubuntu 16.04 on AWS; 4.4.0-38-generic x86_64) -- I could not reproduce on Mac OS X.

I built ruby/ruby@54b8015 (tip of ruby_2_3 at the time I set it up)

Clone 39fb306

gem install bundler; bundle install

cd actioncable/

i=0; while ..../bin/ruby -rbundler/setup -Itest test/client_test.rb -v; do i=$(( $i+1 )); echo -e "\n\n$i :: `date`\n"; done

For me, that will segfault and stop at about loop 20-30.

@matthewd

This comment has been minimized.

Member

matthewd commented Oct 8, 2016

I've also tried to get a simple script to reproduce it. This crashes in the same area, but it has a different error:

Edit: This still crashes with ruby/ruby@54b8015 + a cherry-pick of ruby/ruby@b8f16cc -- gist showing both a read into freed memory, and a read into 0x0

100.times do
  60.times.map do |i|
    Thread.new do
      x = i.to_s
      s1, s2 = IO.pipe
      t2 = Thread.new { 100.times { s2 << (x * 1000) } }
      t = Thread.new { loop { s1.getc } }
      Thread.new { sleep 0.1; s1.close }.join
      # sleep 0.1; s1.close 
      begin
        t.join
      rescue IOError
      end
      print "."
    end
  end.map(&:join)
  print "\n"
end
==5076== Invalid read of size 1
==5076==    at 0x1626CA: io_getc (io.c:3535)
==5076==    by 0x163116: rb_io_getc (io.c:3778)
==5076==    by 0x26B2B1: call_cfunc_0 (vm_insnhelper.c:1469)
==5076==    by 0x26BD26: vm_call_cfunc_with_frame (vm_insnhelper.c:1642)
==5076==    by 0x26BE46: vm_call_cfunc (vm_insnhelper.c:1737)
==5076==    by 0x270D51: vm_exec_core (insns.def:994)
==5076==    by 0x282499: vm_exec (vm.c:1650)
==5076==    by 0x2803D4: invoke_block (vm.c:921)
==5076==    by 0x280796: invoke_block_from_c_0 (vm.c:971)
==5076==    by 0x28083A: invoke_block_from_c_splattable (vm.c:988)
==5076==    by 0x280986: vm_yield (vm.c:1023)
==5076==    by 0x27C2CB: rb_yield_0 (vm_eval.c:1010)
==5076==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==5076== 
./test.rb:7: [BUG] Segmentation fault at 0x00000000000000
ruby 2.3.2p193 (2016-10-04 revision 56331) [x86_64-linux]

Also, I can't reproduce it now πŸ˜… , but when I was trying different things in this test script earlier, at one time I saw Valgrind complain about calls to read with an fd of -1. As we set fptr->fd = -1 in fptr_finalize, there may be a memory-read race here too?

nobu added a commit to nobu/ruby that referenced this pull request Oct 8, 2016

io.c: defer freeing buffers
* io.c (fptr_finalize): defer freeing buffers not to invalidate
  the buffer which is in use to read, in the case of read/close
  race.  rails/rails#26714

* thread.c (rb_thread_fd_close): return TRUE if the fd is in use.
@nobu

This comment has been minimized.

Contributor

nobu commented Oct 8, 2016

Thank you, I've updated the branch.

@nobu

This comment has been minimized.

Contributor

nobu commented Oct 8, 2016

It caused unexpected IOErrors in rubyspec.
Give me time a little.

@matthewd matthewd deleted the matthewd:close-race branch Jun 11, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment