Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test/fiber/test_queue.rb: Make the stuck test fail. #8791

Merged

Conversation

junaruga
Copy link
Member

@junaruga junaruga commented Oct 27, 2023

This PR is to make the stuck tests fail immediately in a better way.

I observed the 2 tests in the test/fiber/test_queue.rb getting stuck in some GCC compilers in ppc64le Ubuntu, even when the timeout queue.pop(timeout: 0.0001) is set in the code. This commit is to make the test fail rather than getting stuck.

The issue can happen in the current RubyCI ppc64le server with GCC 11.4.0. We also observed the issue an older GCC compiler in the past in Ubuntu focal ppc64le.
#8739 (comment)

Here is the reproducer for the issue.
https://github.com/junaruga/report-ruby-fiber-hung_up-tests

+ make -s test-all TESTS=../test/fiber/test_queue.rb RUBYOPT=-w
generating enc.mk
making enc
making srcs under enc
generating transdb.h
transdb.h unchanged
making trans
making encs
generating makefiles ext/configure-ext.mk
ext/configure-ext.mk updated
generating makefile exts.mk
exts.mk unchanged
linking shared-object -test-/array/resize.so
Run options:
  --seed=47121
  "--ruby=./miniruby -I../lib -I. -I.ext/common  ../tool/runruby.rb --extout=.ext  -- --disable-gems"
  --excludes-dir=../test/.excludes
  --name=!/memory_leak/
  -j9
  -q
  --tty=no

# Running tests:

[1/1] 3507820=test_queue
^C
^C
...

After applying this PR, when the test hit the issue, the result is below. I checked it on RubyCI ppc64le server.

jaruga@ruby1:~/git/ruby/ruby/build$ make -s test-all TESTS="../test/fiber/test_queue.rb" RUBYOPT="-w"
generating enc.mk
making enc
making srcs under enc
generating transdb.h
transdb.h unchanged
making trans
making encs
generating makefiles ext/configure-ext.mk
ext/configure-ext.mk unchanged
generating makefile exts.mk
exts.mk unchanged
Run options:
  --seed=26297
  "--ruby=./miniruby -I../lib -I. -I.ext/common  ../tool/runruby.rb --extout=.ext  -- --disable-gems"
  --excludes-dir=../test/.excludes
  --name=!/memory_leak/

# Running tests:

[1/2] TestFiberQueue#test_pop_with_timeout = 1.00 s
  1) Failure:
TestFiberQueue#test_pop_with_timeout [/home/jaruga/git/ruby/ruby/test/fiber/test_queue.rb:11]:
The test gets stuck due to a possible compiler bug.
Exception raised:
<#<Timeout::Error: execution expired>>
Backtrace:
  /home/jaruga/git/ruby/ruby/lib/timeout.rb:43:in `rescue in handle_timeout'
  /home/jaruga/git/ruby/ruby/lib/timeout.rb:40:in `handle_timeout'
  /home/jaruga/git/ruby/ruby/lib/timeout.rb:195:in `timeout'
  /home/jaruga/git/ruby/ruby/test/fiber/test_queue.rb:51:in `block in assert_nothing_stuck'.

Leaked file descriptor: TestFiberQueue#test_pop_with_timeout: 5 : #<IO:fd 5>
Leaked file descriptor: TestFiberQueue#test_pop_with_timeout: 6 : #<IO:fd 6>
COMMAND     PID   USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
ruby    3669508 jaruga    5r  FIFO   0,13      0t0 61598837 pipe
ruby    3669508 jaruga    6w  FIFO   0,13      0t0 61598837 pipe
Leaked thread: TestFiberQueue#test_pop_with_timeout: #<Thread:0x00007eb168d5b630 /home/jaruga/git/ruby/ruby/test/fiber/test_queue.rb:12 sleep>
Finished tests in 1.066764s, 1.8748 tests/s, 6.5619 assertions/s.
2 tests, 7 assertions, 1 failures, 0 errors, 0 skips

ruby -v: ruby 3.3.0dev (2023-10-27T19:09:03Z master 4aee6931c3) [powerpc64le-linux]
make: *** [uncommon.mk:942: yes-test-all] Error 1

@ioquatix
Copy link
Member

ioquatix commented Oct 27, 2023

Thanks for working on this.

I don't think using Timeout.timeout here is a good idea.

If you want to limit the thread join, provide an argument to join, e.g.

thread = Thread.new{...}
kill_count = 0

until thread.join(2) # 2 seconds maximum timeout until we kill it.  
  kill_count += 1
  thread.kill
end

assert(kill_count, 0)

I hope we can figure out a better way to fail the test if the thread is killed.

@junaruga
Copy link
Member Author

OK. Thanks for review. I will modify on your suggested way.

@junaruga
Copy link
Member Author

I thought a boolean variable is better than number variable kill_count in this case.

@junaruga junaruga force-pushed the wip/test-fiber-queue-assert-nothing-stuck branch from 2b5f63c to 1611261 Compare October 27, 2023 22:46
@junaruga
Copy link
Member Author

I rebased the PR on your way. I used the boolean value timeout instead of kill_count. So, the result in the failing case is below. What do you think?

jaruga@ruby1:~/git/ruby/ruby/build$ make -s test-all TESTS="../test/fiber/test_queue.rb" RUBYOPT="-w"
generating enc.mk
making enc
making srcs under enc
generating transdb.h
transdb.h unchanged
making trans
making encs
generating makefiles ext/configure-ext.mk
ext/configure-ext.mk unchanged
generating makefile exts.mk
exts.mk unchanged
Run options:
  --seed=4799
  "--ruby=./miniruby -I../lib -I. -I.ext/common  ../tool/runruby.rb --extout=.ext  -- --disable-gems"
  --excludes-dir=../test/.excludes
  --name=!/memory_leak/

# Running tests:

[1/2] TestFiberQueue#test_pop_with_timeout = 2.00 s
  1) Failure:
TestFiberQueue#test_pop_with_timeout [/home/jaruga/git/ruby/ruby/test/fiber/test_queue.rb:26]:
Getting stuck due to a possible compiler bug.
Expected true to be false.

Leaked file descriptor: TestFiberQueue#test_pop_with_timeout: 5 : #<IO:fd 5>
Leaked file descriptor: TestFiberQueue#test_pop_with_timeout: 6 : #<IO:fd 6>
COMMAND     PID   USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
ruby    3693354 jaruga    5r  FIFO   0,13      0t0 61644190 pipe
ruby    3693354 jaruga    6w  FIFO   0,13      0t0 61644190 pipe
Finished tests in 2.067022s, 0.9676 tests/s, 3.3865 assertions/s.
2 tests, 7 assertions, 1 failures, 0 errors, 0 skips

ruby -v: ruby 3.3.0dev (2023-10-27T19:09:03Z master 4aee6931c3) [powerpc64le-linux]
make: *** [uncommon.mk:942: yes-test-all] Error 1

@junaruga
Copy link
Member Author

junaruga commented Oct 27, 2023

Travis CI ppc64le is failing with another issue. This is not related to this PR.
https://app.travis-ci.com/github/ruby/ruby/jobs/612375819#L2906

  1) Failure:
TestRDocGeneratorJsonIndex#test_generate [/home/travis/build/ruby/ruby/test/rdoc/test_rdoc_generator_json_index.rb:108]:
.js files should be the same timestamp of original

But I noticed the ppc64le job finished with total time 14:08. It is really faster than the current master's time 21:59. Perhaps this PR's change affected it?
https://app.travis-ci.com/github/ruby/ruby/builds/266896832

@junaruga junaruga marked this pull request as ready for review October 27, 2023 23:04
@junaruga
Copy link
Member Author

I rebased the PR on your way. I used the boolean value timeout instead of kill_count. So, the result in the failing case is below. What do you think?

Maybe I think the kill (true/false) is better than timeout (true/false). Let me modify again.

We observed the 2 tests in the `test/fiber/test_queue.rb` getting stuck
in some GCC compilers in Ubuntu ppc64le focal/jammy, even when the timeout
`queue.pop(timeout: 0.0001)` is set in the code.

This commit is to make the tests fail rather than getting stuck.
@junaruga junaruga force-pushed the wip/test-fiber-queue-assert-nothing-stuck branch from 1611261 to bb1c33e Compare October 28, 2023 01:11
@junaruga
Copy link
Member Author

OK. Now rebased with kill (true/false). the result on the RubyCI Ubuntu ppc64le jammy server is below.

jaruga@ruby1:~/git/ruby/ruby/build$ make -s test-all TESTS="../test/fiber/test_queue.rb" RUBYOPT="-w"
generating enc.mk
making enc
making srcs under enc
generating transdb.h
transdb.h unchanged
making trans
making encs
generating makefiles ext/configure-ext.mk
ext/configure-ext.mk unchanged
generating makefile exts.mk
exts.mk unchanged
Run options:
  --seed=12531
  "--ruby=./miniruby -I../lib -I. -I.ext/common  ../tool/runruby.rb --extout=.ext  -- --disable-gems"
  --excludes-dir=../test/.excludes
  --name=!/memory_leak/

# Running tests:

[2/2] TestFiberQueue#test_pop_with_timeout = 2.00 s
  1) Failure:
TestFiberQueue#test_pop_with_timeout [/home/jaruga/git/ruby/ruby/test/fiber/test_queue.rb:26]:
Getting stuck due to a possible compiler bug.
Expected true to be false.

Leaked file descriptor: TestFiberQueue#test_pop_with_timeout: 5 : #<IO:fd 5>
Leaked file descriptor: TestFiberQueue#test_pop_with_timeout: 6 : #<IO:fd 6>
COMMAND     PID   USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
ruby    3855819 jaruga    5r  FIFO   0,13      0t0 61744853 pipe
ruby    3855819 jaruga    6w  FIFO   0,13      0t0 61744853 pipe
Finished tests in 2.066826s, 0.9677 tests/s, 3.3868 assertions/s.
2 tests, 7 assertions, 1 failures, 0 errors, 0 skips

ruby -v: ruby 3.3.0dev (2023-10-27T19:09:03Z master 4aee6931c3) [powerpc64le-linux]
make: *** [uncommon.mk:942: yes-test-all] Error 1

@ioquatix
Copy link
Member

This PR may fix the stuck/hang: #8393 - do you want to try applying it and see if it helps?

@junaruga
Copy link
Member Author

junaruga commented Oct 28, 2023

This PR may fix the stuck/hang: #8393 - do you want to try applying it and see if it helps?

I don't understand what your idea is. Could you explain more about it?

@ioquatix
Copy link
Member

ioquatix commented Oct 28, 2023

It causes the compiler to generate different code which doesn't hang. I don't know if the mutex code path is used for queue but it may be.

@junaruga
Copy link
Member Author

junaruga commented Oct 28, 2023

It causes the compiler to generate different code which doesn't hang. I don't know if the mutex code path is used for queue but it may be.

OK. Maybe I understood your idea. The PR #8393 is an implementation to fix stuck/hang. However, we don't know if implementation is used for Thread::Queue that causes the stuck/hang in your opinion. So, you are asking me if adding the implementation for the Thread::Queue is a good idea.

Yes, I think so. That's a good idea. Because while we avoid the stuck/hang to the tests in test_queue.rb in this PR, users may face stuck/hang in their real use cases.

I don't want to do it by myself. It's great if you do it. You have a reproducer to test stuck/hang in RubyCI ppc64le server.
https://github.com/junaruga/report-ruby-fiber-hung_up-tests

I am going to merge this PR now. I think keeping this PR's assertion in the test is still useful even after you implement for the Thread::Queue. It's a kind of safety net.

Thanks for your review!

@junaruga junaruga merged commit 3eaae72 into ruby:master Oct 28, 2023
97 checks passed
@junaruga junaruga deleted the wip/test-fiber-queue-assert-nothing-stuck branch October 28, 2023 09:10
@ioquatix
Copy link
Member

That all seems reasonable to me.

matzbot pushed a commit that referenced this pull request Nov 19, 2023
	test/fiber/test_queue.rb: Make the stuck test fail. (#8791)

	test/fiber/test_queue.rb: Make the stuck tests fail.

	We observed the 2 tests in the `test/fiber/test_queue.rb` getting stuck
	in some GCC compilers in Ubuntu ppc64le focal/jammy, even when the timeout
	`queue.pop(timeout: 0.0001)` is set in the code.

	This commit is to make the tests fail rather than getting stuck.
	---
	 test/fiber/test_queue.rb | 20 ++++++++++++++++----
	 1 file changed, 16 insertions(+), 4 deletions(-)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants