-
-
Notifications
You must be signed in to change notification settings - Fork 763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid bisect command to get stuck #2669
Conversation
Just my 2c. Wondering why
With this in mind, do we really need to check that information that Out-of-context reproduction: def experiment(n)
rd, wr = IO.pipe
pid = fork { rd.close; sleep 1; wr.write('a' * n); sleep 1; wr.close }
Process.waitpid(pid)
wr.close
puts 'reading'
puts rd.read.size
puts 'done'
end > experiment(65536)
reading
65536
done
> experiment(65537) the second experiment hangs forever. However, if you comment out
With this in mind, what do you think guys of removing Great in-depth research by the way! 💯 |
Thanks a lot @pirj for this very interesting feedback. I think I thought about removing it at first but was not sure about unexpected behaviors. I redo all my test and removing |
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
0fdec07
to
8590b53
Compare
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
8590b53
to
1f5dcf9
Compare
Are we happy this fixes the problem? Have we got a way to check for a regression to this in our tests? (It's green its just not gotten to github) |
I am quite ok with the result even if I would love to test it on a linux and windows.
One test we could have is to run bisect for a lot's of test. If you want I can look at this. |
I think we need something... |
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
1f5dcf9
to
2ff60d3
Compare
Sorry for the delay. I added a test on integration side. It was the easiest to implement in my case. I was able to reproduce the lock without my patch, then test pass with |
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
2ff60d3
to
f746452
Compare
66ec8ac
to
f746452
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great work you two, thanks for digging into this so much @benoittgt, as a test combo this looks sound, I'd just like some minor tweaks and this should be good to go :)
Thanks for the review. I am not about the clarity of my comment. Open for reviews. It was a pleasure to work on this subject. |
40c768b
to
7c79e17
Compare
Thanks @JonRowe. It's much better and shorter with your proposal. |
Happy for you to merge @benoittgt don't forget the change log :) |
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
41c7442
to
4c0e507
Compare
4c0e507
to
f902067
Compare
Hey guys @benoittgt is there a special way to run this command and take advantage of this implementation? I am running into the same issue, so wanted to double-check with you. By the way, which version is this one? Thanks. |
This is the master version. We did not release a new version with the patch. You could point in your gemfile rspec-core to master. |
Thanks man @benoittgt |
Avoid bisect command to get stuck
@andrykonchin brought up an interesting point that by removing A quick search showed that to mitigate this we can use Maybe @ioquatix can advise here? |
This probably means that we have not fixed the problem, merely ignored it. |
Ouhao. Very nice article. 👏 This is a google translation but this is written:
Mentioned in
So after few tests, we have probably zombie processes when the bisect commands run. It is not clean while the bisect command is running but it is cleaned after.
I did this test, because I wanted to understand exactly what was happening and also try @pirj proposal:
@read_io, @write_io = IO.pipe
# write into pipe some data
def run_specs(i)
packet = '*' * i
@write_io.write("#{packet.bytesize}\n#{packet}")
end
count = 0
(65500..69500).each do |i|
count += 1
# create a child process
pid = fork { run_specs(i) }
puts "interval: #{i}, pid is: #{pid} and parent pid is: #{Process.ppid}, count: #{count}"
# wait for its terminating
# Process.waitpid(pid)
sleep 0.1
# read result
packet_size = Integer(@read_io.gets)
packet = @read_io.read(packet_size)
puts "packet size: #{packet.size}"
end
$ sysctl kern.maxprocperuid
kern.maxprocperuid: 1418
while true; do pstree -u <username> | wc -l ; sleep 1; clear; done You can also monitor zombie process count
|
If do not `waitpid` or `detach` the process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. `detach` is a good idea. From the Ruby doc: > Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. `detach` is a good idea. From the Ruby doc: > Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. `detach` is a good idea. From the Ruby doc: > Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. `detach` is a good idea. From the Ruby doc: > Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: rspec#2637 - PR discussion: rspec#2669
Avoid bisect command to get stuck
If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. `detach` is a good idea. From the Ruby doc: > Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate. Related: - rspec#2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
Avoid bisect command to get stuck --- This commit was imported from rspec/rspec-core@1c31fc4.
This commit was imported from rspec/rspec-core@968fd99.
Fix: #2637
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1
First, I've tried to play with the number of specs which led to the
interesting conclusion: the process hangs only at 1548+ specs.
Try to run with
N=1547
andN=1548
.Seems suspicious, right?
Let's add
pry-byebug
to the equation (or Gemfile).In order it to work we need to tweak our runner code a bit:
After a bit of
puts
debugging I localized the problem:@channel.send
.Channel#send
callsIO#write
herehttps://github.com/rspec/rspec-core/blob/7b6b9c3/lib/rspec/core/bisect/utilities.rb#L41:
Do you know, what is the
packet.bytesize
forN=1548
? It's 65548.This number is very important: the pipe size is only 65536 on MacOS
(see docs for
IO#write_nonblock
for more).
That makes
@write_io.write
hangs forever, because no one reads thebuffer: we call
Channel#receive
only afterProcess.waitpid(pid)
,thus waiting for the write operation to complete.
A basic proposal is to use WNOHANG. From waitpid doc:
To validate this proposal on OSX we run just before running bisect:
lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'
This will give us in loop the PIPE sizes of Ruby processes. Without our
patch we see that quickly we hit 65536 bytes on two pipes, with the patch
we keep pipes at the right size.
Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.
Thanks to Palkan, julik and Samuel Williams for the help. 🙇🏻♂️