Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminate the timeout thread at exit #21

Closed
wants to merge 3 commits into from

Conversation

headius
Copy link

@headius headius commented Aug 31, 2022

The main thread created by the timeout library, after 0.3.0, does not provide any mechanism to cleanly shut down. This leads to failures in a JRuby test that launches a JRuby instance in an existing JVM process and confirms no additional threads remain alive after the instance is terminated. This behavior also differs from the old timeout library, which always terminated every thread it created (i.e. there were no background threads since a new thread was created and subsequently terminated for each timeout operation).

JRuby, like CRuby, does attempt to shut down all Ruby threads in a given instance before "exit", but it does this as a best-effort very late in the shutdown process, possibly not giving the timeout thread enough time to finalize at a native level. JRuby's previous timeout implementation was built as an extension using JDK concurrency APIs like Executor, and included a mechanism for shutting down the timeout Executor and its threads (using a mechanism similar to at_exit).

The following patch adds an at_exit shutdown process for the timeout thread, to explicitly kill and wait for it to die. With this change, the test in question appears to pass more reliably: every test run out of 50 passed locally versus failing most of the time before.

@headius
Copy link
Author

headius commented Aug 31, 2022

I am not sure how to test this, given that it only affects runtimes that can spin up multiple instances in a single process. CRuby does not have this capability so there's no opportunity to test this.

lib/timeout.rb Show resolved Hide resolved
lib/timeout.rb Outdated Show resolved Hide resolved
lib/timeout.rb Outdated Show resolved Hide resolved
In some runtimes, the process may not go away but the Ruby runtime
should be shut down. This thread does not clean itself up, which
leads to such runtimes leaning on a best-effort kill+join of the
dangling threads late in the shutdown process. As a result these
threads may linger slightly after shutdown and fail expectations
that they will have completed and finalized.

This patch adds an `at_exit` for the timeout library to terminate
its own thread, which would happen earlier in the shutdown process
(`at_exit` execution is considered part of the "user" shutdown
phase in JRuby). This appears to fix issues running JRuby's
embedding API tests that check for timeout threads to have been
cleanly terminated and finalized.
@headius
Copy link
Author

headius commented Sep 6, 2022

I updated this based on review comments. It now triggers shutdown by closing the timeout queue and then waits for the thread to complete. The thread itself sees the queue shutdown and terminates itself. There may be some small delay in shutting down the runtime as the timeout thread will need to schedule and run to completion, but it is not doing a hard kill from outside the timeout thread anymore.

@headius
Copy link
Author

headius commented Sep 6, 2022

We would like to get this merged and released since the dangling thread causes a number of our JRuby master test suites to fail.

@headius
Copy link
Author

headius commented Sep 6, 2022

@hsbt If I can push timeout gem to RG.org I will push a 0.3.1.pre1 with this and other unreleased changes on master, so we can test against JRuby and CRuby master before full release.

@hsbt @eregon Please review and let me know if we can merge this.

@eregon
Copy link
Member

eregon commented Sep 7, 2022

Actually this approach seems problematic, because there might be code running inside another at_exit hook which uses Timeout.timeout.
In fact, unit tests (test-unit & probably minitest too) run inside at_exit.

That's also the case for unit tests of this gem.
And it only "works" by luck because of the order of at_exit registration and Timeout.imeout being only used after the test-unit at_exit.
So if I apply this change on this branch:

diff --git a/test/test_timeout.rb b/test/test_timeout.rb
index 76de389..62e3861 100644
--- a/test/test_timeout.rb
+++ b/test/test_timeout.rb
@@ -1,6 +1,7 @@
 # frozen_string_literal: false
 require 'test/unit'
 require 'timeout'
+Timeout.timeout(1){}
 
 class TestTimeout < Test::Unit::TestCase
 

Then all/most tests fail.

I think the real solution here is fixing JRuby so it can reliably kill Ruby-created threads when the instance/context is being closed. This is already what TruffleRuby does.
While that may be hard for some threads for the timeout thread it should be straightforward as it's always in Ruby code or blocking in the Queue#pop/other Ruby sync, there is no native/java code involved which could make it hard to interrupt.
The instance close should then wait that the underlying Java thread is joined.

@eregon
Copy link
Member

eregon commented Sep 7, 2022

I consider this an upstream/JRuby issue so I close this.
JRuby could do the same as thread.kill; thread.join in its instance/context close for threads created by that instance/context (just like TruffleRuby already does) and that would work without needing changes here.

@eregon eregon closed this Sep 7, 2022
@eregon
Copy link
Member

eregon commented Sep 7, 2022

There might be changes needed to mark this thread as kind of a daemon thread, which is something Ruby doesn't have currently, so I'm open to other changes in general to improve dealing with that daemon thread.

For this specific case here though I don't think there is anything needed to change here.

@headius
Copy link
Author

headius commented Sep 7, 2022

I disagree with your assessment and would like to debate this more.

Does TruffleRuby have tests for repeatedly spinning up a runtime, running a timeout, and then confirming all threads are terminated immediately after that runtime is terminated? Is it even possible to terminate a TruffleRuby runtime instance and start another in the same process?

Your argument is questionable anyway, since threads shutting down may need to timeout calls in their ensure blocks on the way out, and killing the timeout thread earlier on will potentially break this in the same way.

Another aspect of this: your version already changed behavior, by adding a daemon thread that is now visible to the rest of the runtime and still executing when the runtime proceeds to shut down. Ergo, you introduced a behavioral change that affects any implementation that may run multiple runtimes in a given process, since there's now an expectation that implicitly-started threads from this library must also be cleaned up. I could argue that you are the one that introduce the bug here, by not cleaning up resources on shutdown.

JRuby solved this problem when we used the JDK version of this logic by adding an explicit shutdown of the executor thread before we proceed with the rest of runtime teardown. The test we added to confirm that shutdown happened is the test that's failing now with the new version of the timeout library. That shutdown is what I'm asking for here.

The native JRuby version did not interfere with Ruby behavior because the executor thread was considered separate from the JRuby runtime and threads spawned by Ruby code. All shutdown logic for userspace code and resources ran first, and then the timeout executor was terminated as part of JRuby's internal teardown. Unfortunately we don't have a way to mark library resources as being internal or system-level, so there will always be potential conflict when trying to release those resources.

And in the end you not are the sole maintainer or consumer of this library, and I do not think it is appropriate for you to unilaterally close this without discussion. Remember you already had several standard libraries incorporate patches to ignore the gem contents and use the versions TruffleRuby ships. That is a TruffleRuby issue, so I propose we also remove those changes if we are not in the business of making these libraries accommodate non-CRuby runtimes.

@headius headius reopened this Sep 7, 2022
If the shutdown at_exit runs before other at_exit or termination
logic that still depends on the timeout thread, those subsequent
uses of timeout may fail. This additional patch allows the timeout
thread and queue to be restarted after the at_exit has fired, with
the side effect that it will not try to shutdown a second time.

This addresses review comments from ruby#21.
@headius
Copy link
Author

headius commented Sep 7, 2022

I have reopened this PR and added a commit that makes it possible for timeout to be restarted, for the rare cases where the at_exit shutdown occurs before other at_exit hooks that still require timeout to be functional. This balances the need to shut down the timeout thread with the few cases where timeout is also used in runtime-level shutdown hooks. We make a best effort to shut down on the way out, but if you are still using timeout during shutdown it will still work and you will have to deal with the resources yourself.

Your modification to the test suite passes as before with the additional patch in place.

[] ~/work/timeout $ git diff
diff --git a/test/test_timeout.rb b/test/test_timeout.rb
index 76de389..62e3861 100644
--- a/test/test_timeout.rb
+++ b/test/test_timeout.rb
@@ -1,6 +1,7 @@
 # frozen_string_literal: false
 require 'test/unit'
 require 'timeout'
+Timeout.timeout(1){}
 
 class TestTimeout < Test::Unit::TestCase
 

[] ~/work/timeout $ rake test
Loaded suite /home/headius/.rvm/rubies/ruby-3.1.2/lib/ruby/gems/3.1.0/gems/rake-13.0.6/lib/rake/rake_test_loader
Started
..............
Finished in 0.399320627 seconds.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
14 tests, 25 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
35.06 tests/s, 62.61 assertions/s

@headius
Copy link
Author

headius commented Sep 7, 2022

I'm working on refining the shutdown. The locking gets tricky when we have parallel-executing implementations, which is causing some JRuby and TruffleRuby runs to hang.

This allows more atomic shutdown; either the queue close or the
thread terminating indicates the executor has been shut down, and
a new one can be created and put into action without interfering
with the shutdown of a previous one.
@headius
Copy link
Author

headius commented Sep 7, 2022

I've committed a different version that only requires a single lock to manage a new Executor instance; when it is shut down, either via the queue being closed or the thread terminating, it will be seen as not alive and a new one instantiated. This allows CRuby to pass all tests (even with the Timeout.timeout(1){} hack from @eregon above) and the parallel-threaded implementations like JRuby and TruffleRuby should run correctly.

@headius
Copy link
Author

headius commented Sep 7, 2022

As noted in #17, an alternative to a "restartable" executor would be to just fall back on the old thread-per-timeout logic when the executor has been shutdown (or when we are running in a context that disallows mutex usage).

There are many different ways to make this shutdown logic work cleanly. There's no valid reason to close this PR without discussing those options.

headius added a commit to jruby/jruby that referenced this pull request Sep 7, 2022
See #7349 and the fix in ruby/timeout#21 which will address it.
@eregon
Copy link
Member

eregon commented Sep 8, 2022

Does TruffleRuby have tests for repeatedly spinning up a runtime, running a timeout, and then confirming all threads are terminated immediately after that runtime is terminated? Is it even possible to terminate a TruffleRuby runtime instance and start another in the same process?

Yes. An instance is called a Context, and for example tests here https://github.com/oracle/truffleruby/tree/master/src/test/java/org/truffleruby create multiple Contexts in the same process.
And Truffle itself ensures all threads created by the Truffle languages are joined when the Context is closed, and throws an error otherwise, so this check happens on every Context#close, and so it's already tested by every single truffleruby execution and those tests.

And since TruffleRuby does that, I'm pretty sure JRuby can too, at very least for a Ruby-created thread only running Ruby code like the timeout thread.

Your argument is questionable anyway, since threads shutting down may need to timeout calls in their ensure blocks on the way out, and killing the timeout thread earlier on will potentially break this in the same way.

Thread#kill should not execute ensure, just like it doesn't on CRuby.
But even if it does, it's 1) very unlikely to have a timeout in an ensure (literally playing with fire to only execute part of the ensure), 2) doesn't matter that exception will kill the thread too.

I could argue that you are the one that introduce the bug here, by not cleaning up resources on shutdown.

Good luck with that given the behavior of killing other threads on context close (or process shutdown) is standard for Ruby since many years.

if we are not in the business of making these libraries accommodate non-CRuby runtimes.

The solution is clear, JRuby needs to kill other threads on context close, and join them. This is compatible with what CRuby does and it is already what TruffleRuby does.
I guess you're just missing the join in JRuby here given you already have the kill part from what you wrote in the description.
Hence it seems very clearly a JRuby bug, and workarounds here seem inappropriate.
If you need to identity this thread specifically it already has a name so that should be enough.

@eregon
Copy link
Member

eregon commented Sep 8, 2022

I closed this PR because I do not want alternatives.
The current diff adds significant complexity and solves no problem besides working around a JRuby bug.
The bug is in JRuby and needs to be solved there, and I know well enough because I implemented similar logic in TruffleRuby.

Literally, JRuby instance shutdown must do:

Thread.list.each do |t|
  next if t == Thread.main
  t.kill
  t.join
end

And if it doesn't JRuby leaks threads, and that's then JRuby's problem.
It needs to do this anyway for compatibility with CRuby's shutdown.

@headius What is the issue with doing that in JRuby?

@eregon
Copy link
Member

eregon commented Sep 8, 2022

FWIW, this test passes on TruffleRuby:

    @Test
    public void testKillThread() {
        for (int i = 0; i < 100; i++) {
            System.err.println(i);

            try (Context context = Context.newBuilder("ruby")
                    .allowCreateThread(true)
                    .allowExperimentalOptions(true)
                    .option("ruby.single-threaded", "false")
                    .build()) {
                final Value value = context.eval(
                        "ruby",
                        "Thread.new { sleep }; sleep 0.1; 42");
                assertEquals(42, value.asInt());
            }

            for (Thread t : Thread.getAllStackTraces().keySet()) {
                Assert.assertFalse(t.getName(), t.getName().startsWith("Ruby"));
            }
        }
    }

and is basically the same as the JRuby test in jruby/jruby@ebc0557

@headius
Copy link
Author

headius commented Sep 8, 2022

Thread#kill should not execute ensure, just like it doesn't on CRuby.

You are incorrect about the behavior of Thread#kill. It always runs ensure blocks:

$ ruby -v -e 't = Thread.new { begin; sleep; ensure; p 1; end }; sleep 1; t.kill; t.join'
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
1

I suspect you were confusing it with kill! but you did not use that method in your examples. It may be that CRuby does use kill! internally for shutting down threads at termination time. I am looking into a possible bug in our thread shutdown logic that may explain why the timeout thread does not always get cleaned up, and will also look at using kill! logic to avoid ensures, since that appears to match CRuby.

However, my primary point with this PR is that libraries that start up threads and hold resources should always make a best effort to tidy up those threads and resources at shutdown, given the impact to multi-runtime embedding use cases. Your changes mean that there's now one more dangling thread at termination time that must be forcibly killed rather than gracefully shutting itself down. It's untidy and unnecessary.

The implementation changes I provided are also no more complicated than what you had.

  • I compartmentalized the entire Executor concept, avoiding multiple private methods and constants on the Timeout module.
  • I introduced a mechanism to shut down the executor at exit time, with a fallback to allow it to be created again if an unusual library uses timeout during the exit sequence.
  • It passes all tests on CRuby, JRuby, and TruffleRuby, including with your Timeout.timeout(1) {} test hack, and under all typical use cases the timeout thread will now shut down gracefully rather than being forcibly killed.

These are good improvements to the library.

I closed this PR because I do not want alternatives.

That's not your decision to make.

I could have committed the changes directly, but I wanted to have a dialog with other maintainers about the best way to implement the shutdown logic. This is a shared project, used by all implementations and pretty much every Rubyist out there. We need to come to some agreement before accepting or rejecting reasonable changes.

I still believe that a graceful shutdown is appropriate here, along with other refactoring. It may also be useful for users to be able to trigger the shutdown themselves, as part of reloaders and continuous deployment setups.

@eregon
Copy link
Member

eregon commented Sep 8, 2022

You are incorrect about the behavior of Thread#kill. It always runs ensure blocks:

Right, I was confusing with killing Fibers, which does not run ensure blocks.
CRuby runs ensure of Threads when killing threads and on context close:

$ ruby -v -e 'Fiber.new { begin; Fiber.yield; ensure; p :fiber; end }.resume; Thread.new { begin; sleep; ensure; p :thread; end }; sleep 0.1'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-linux]
:thread

@headius
Copy link
Author

headius commented Sep 8, 2022

I suspect you were confusing it with kill!

I guess I lost track of this because Thread#kill! was removed some time ago. There's no way I know of to trigger thread death that skips ensures from Ruby code.

I'm trying to determine if CRuby uses something similar to shut down threads, ensures do not seem to fire for a sleeping thread at shutdown. Either they are not actively killing the thread, or they are doing it in a way that skips ensures.

Do you recall where this happens in CRuby code? I thought I found it years ago, but can't locate the logic anymore.

@eregon
Copy link
Member

eregon commented Sep 8, 2022

ensures do not seem to fire for a sleeping thread at shutdown

It does fire on CRuby, see my command and output above.

@headius
Copy link
Author

headius commented Sep 8, 2022

It does fire on CRuby, see my command and output above.

You are correct, my local test was not giving the thread enough time to start and sleep.

This means that threads may encounter a case where the timeout thread has been killed, but an ensure wants to run a timeout.

[] ~/work/timeout $ rvm ruby-3.1 do ruby -v -Ilib -rtimeout -e 't = Thread.new { begin; sleep; ensure; Timeout.timeout(1) {p 1}; end; }; Timeout.timeout(1) {}; t = Thread.new { begin; sleep; ensure; Timeout.timeout(1) {p 1}; end; }; sleep 1'
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
#<Thread:0x00007f07e2a19258 -e:1 aborting> terminated with exception (report_on_exception is true):
/home/headius/work/timeout/lib/timeout.rb:101:in `new': can't alloc thread (ThreadError)
	from /home/headius/work/timeout/lib/timeout.rb:101:in `create_timeout_thread'
	from /home/headius/work/timeout/lib/timeout.rb:133:in `block in ensure_timeout_thread_created'
	from /home/headius/work/timeout/lib/timeout.rb:131:in `synchronize'
	from /home/headius/work/timeout/lib/timeout.rb:131:in `ensure_timeout_thread_created'
	from /home/headius/work/timeout/lib/timeout.rb:180:in `timeout'
	from -e:1:in `ensure in block in <main>'
	from -e:1:in `block in <main>'
#<Thread:0x00007f07e2a1f8d8 -e:1 aborting> terminated with exception (report_on_exception is true):
/home/headius/work/timeout/lib/timeout.rb:101:in `new': can't alloc thread (ThreadError)
	from /home/headius/work/timeout/lib/timeout.rb:101:in `create_timeout_thread'
	from /home/headius/work/timeout/lib/timeout.rb:133:in `block in ensure_timeout_thread_created'
	from /home/headius/work/timeout/lib/timeout.rb:131:in `synchronize'
	from /home/headius/work/timeout/lib/timeout.rb:131:in `ensure_timeout_thread_created'
	from /home/headius/work/timeout/lib/timeout.rb:180:in `timeout'
	from -e:1:in `ensure in block in <main>'
	from -e:1:in `block in <main>'

[] ~/work/timeout $ rvm ruby-3.1 do ruby -v -Ilib -rtimeout -e 't = Thread.new { begin; sleep; ensure; Timeout.timeout(1) {p 1}; end; }; Timeout.timeout(1) {}; t = Thread.new { begin; sleep; ensure; Timeout.timeout(1) {p 1}; end; }; sleep 1'
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
1
1

[] ~/work/timeout $ rvm ruby-3.1 do ruby -v -Ilib -rtimeout -e 't = Thread.new { begin; sleep; ensure; Timeout.timeout(1) {p 1}; end; }; Timeout.timeout(1) {}; t = Thread.new { begin; sleep; ensure; Timeout.timeout(1) {p 1}; end; }; sleep 1'
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
1
#<Thread:0x00007fa15ede75b8 -e:1 aborting> terminated with exception (report_on_exception is true):
/home/headius/work/timeout/lib/timeout.rb:101:in `new': can't alloc thread (ThreadError)
	from /home/headius/work/timeout/lib/timeout.rb:101:in `create_timeout_thread'
	from /home/headius/work/timeout/lib/timeout.rb:133:in `block in ensure_timeout_thread_created'
	from /home/headius/work/timeout/lib/timeout.rb:131:in `synchronize'
	from /home/headius/work/timeout/lib/timeout.rb:131:in `ensure_timeout_thread_created'
	from /home/headius/work/timeout/lib/timeout.rb:180:in `timeout'
	from -e:1:in `ensure in block in <main>'
	from -e:1:in `block in <main>'

Depending on the shutdown sequence, a thread containing an ensure that does a timeout (for example, attempt to clean up some system resource but don't wait forever) may output an error and fail to run the timeout block, as I originally suggested. Unfortunately there's no way to fix this case, since CRuby disallows thread creation during this phase of shutdown.

@eregon
Copy link
Member

eregon commented Sep 28, 2022

I think jruby/jruby#7351 solves this and makes any change here unnecessary, so I think we should close this.
We could add a method to explicitly shutdown the timeout thread if that's useful, as that should be fairly simple, but I'm against at_exit stuff as it's quite complex (we cannot know when it will fire) and the timeout thread might end up getting killed and restarted multiple times for no good reason.

@eregon
Copy link
Member

eregon commented Sep 28, 2022

Regarding the potential issue during killing other threads in the shutdown sequence,
I believe the solution to that is to not run any ensure (or any Ruby code really) while killing threads, just like it's already the case for Fibers, or at least a possibility to do it that way. It's the same the JVM does in its shutdown sequence BTW.
At that time everything can become inconsistent, the only real fix then is to not run any Ruby code anymore.

@eregon
Copy link
Member

eregon commented Feb 11, 2023

Let's close this, it seems unnecessary and would add a non-trivial amount of complexity to solve problems that don't seem to happen in practice.

@eregon eregon closed this Feb 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants