[Threaded actor] Fix threaded actor race condition #19751

Merged
merged 5 commits into from
Oct 26, 2021

@rkooo567 rkooo567 commented Oct 26, 2021

Why are these changes needed?

This PR fixes the root cause of the threaded actor SIGSEGV issue by properly joining the thread pool.

Threaded actors are executed in this way:

io_service (HandlePushTask) (post to)-> task_execution_service (post to)-> thread_pool

But our shutdown hook only stops the task_execution_service, which is why tasks can still be running in the thread pool and accessing core worker instances (e.g., process & threads) during shutdown.

We now stop the thread pool and join it before we stop the task execution loop. This way, we properly wait until all thread pool operations have terminated.
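
The ordering above can be sketched with the C++ standard library alone. This is an illustration under stated assumptions, not the code in this PR: `SimpleThreadPool`, `StopAndJoin`, and `RunAndShutdown` are hypothetical names, and the atomic counter stands in for whatever core worker state the pooled tasks touch.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the thread pool that runs threaded actor tasks.
class SimpleThreadPool {
 public:
  explicit SimpleThreadPool(std::size_t num_threads) {
    for (std::size_t i = 0; i < num_threads; ++i) {
      threads_.emplace_back([this] { Run(); });
    }
  }

  ~SimpleThreadPool() { StopAndJoin(); }

  // Queues a task. Returns false if the pool has already been stopped.
  bool Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (stopped_) return false;
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
    return true;
  }

  // Stops accepting work, lets queued tasks drain, and blocks until every
  // pool thread has exited. Safe to call more than once from one thread.
  void StopAndJoin() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopped_ = true;
    }
    cv_.notify_all();
    for (auto &t : threads_) {
      if (t.joinable()) t.join();
    }
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // Stopped and nothing left to run.
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // Run outside the lock so other threads can make progress.
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> threads_;
  bool stopped_ = false;
};

// Posts num_tasks tasks, then shuts down in the order described above:
// join the pool first, and only afterwards let the shared state go away.
// Returns how many tasks had finished by the time the join returned.
int RunAndShutdown(int num_tasks) {
  std::atomic<int> completed{0};  // Stand-in for state the tasks access.
  SimpleThreadPool pool(4);
  for (int i = 0; i < num_tasks; ++i) {
    pool.Post([&completed] { completed.fetch_add(1); });
  }
  pool.StopAndJoin();  // After this returns, no task can touch `completed`.
  return completed.load();
}
```

If the join were skipped, a pool thread could still be running a task while the state it references is being destroyed, which is the kind of access this PR is meant to rule out.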

I verified this fixes the segfault issues from #19746.

Related issue number

Closes #19748

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jovany-wang jovany-wang self-assigned this Oct 26, 2021
@rkooo567 rkooo567 assigned jovany-wang and unassigned ericl, scv119 and jovany-wang Oct 26, 2021
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 26, 2021
@rkooo567
Contributor Author

Let me fix test failures first

@rkooo567 rkooo567 assigned scv119 and jovany-wang and unassigned jovany-wang Oct 26, 2021
@@ -768,6 +774,7 @@ CoreWorker::CoreWorker(const CoreWorkerOptions &options, const WorkerID &worker_
 void CoreWorker::Shutdown() {
   io_service_.stop();
   if (options_.worker_type == WorkerType::WORKER) {
     direct_task_receiver_->Stop();
Contributor

so this is the fix?

Contributor Author

yes that's correct!

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 26, 2021
Development

Successfully merging this pull request may close these issues.

[Bug] Threaded actor stress test invokes SIGSEGV
4 participants