
[core] Use 1 thread for all fibers for an actor scheduling queue. #37949

Merged (19 commits, Aug 9, 2023)

Conversation

rynewang (Contributor) commented Jul 31, 2023

Currently we create one thread per submitter worker per actor to run the fibers that submit that actor's tasks. Because we never stop these threads (there is no way to stop boost fibers), an actor process can accumulate an unbounded number of threads.

This PR runs all fibers for an actor on a single shared thread, so the number of threads in an actor process stays bounded.

Also changed fiber_stopped_event to use a std::condition_variable and std::mutex.

Fixes #33957.
Fixes #38240.
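The design above (one executor thread shared by all of an actor's fibers, instead of one thread per submitting worker) can be sketched with a plain thread draining a task queue. This is an illustrative Python stand-in, not Ray's actual C++ fiber scheduler; all names here are made up:

```python
import queue
import threading

class SingleThreadExecutor:
    """Runs tasks submitted by any number of callers on one shared
    thread, so the thread count stays bounded no matter how many
    workers submit tasks (stand-in for the shared fiber thread)."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: stop the executor thread
                break
            task()

    def submit(self, fn):
        self._tasks.put(fn)

    def stop(self):
        self._tasks.put(None)
        self._thread.join()

results = []
executor = SingleThreadExecutor()
# Many "submitter workers" share the same executor thread.
for i in range(5):
    executor.submit(lambda i=i: results.append(i))
executor.stop()
print(results)  # [0, 1, 2, 3, 4]
```

Tasks run in FIFO order on the single thread, which is what bounds the per-actor thread count.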

@rynewang rynewang changed the title Use 1 thread for all fibers for an actor scheduling queue. [core] Use 1 thread for all fibers for an actor scheduling queue. Jul 31, 2023
@@ -95,7 +94,6 @@ void ConcurrencyGroupManager<ExecutorType>::Stop() {
}
}

template class ConcurrencyGroupManager<FiberState>;
Contributor

Am I missing something? Shouldn't we have only one FiberState (one thread) per concurrency group? What do you mean by one thread per submitter worker?

Contributor

Oh, we have an actor scheduling queue per caller worker.

rynewang (Contributor, Author) commented Aug 2, 2023 via email


rkooo567 (Contributor) left a comment

Can we add a test in the Python layer with the repro script?

@rynewang rynewang force-pushed the single-thread-for-fibers-per-actor branch from 6ed925b to 30738e2 on August 2, 2023 18:24
jjyao (Contributor) left a comment

Nice.

src/ray/core_worker/transport/direct_actor_transport.cc (outdated review thread, resolved)
@@ -18,6 +18,8 @@
#include <chrono>

#include "ray/util/logging.h"
#include "ray/util/macros.h"
Contributor

Is this needed?

Contributor (Author)

Needed for RAY_UNUSED. Not sure why the compiler did not complain before.

python/ray/tests/test_actor_bounded_threads.py (outdated review thread, resolved)
src/ray/core_worker/test/fiber_state_test.cc (outdated review thread, resolved)
… rate limiters

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>

…added python tests.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang force-pushed the single-thread-for-fibers-per-actor branch from 16a6333 to 3dcf8ad on August 7, 2023 14:44
rynewang (Contributor, Author) commented Aug 7, 2023

There's an ASAN & TSAN error that I believe originates from boostorg/fiber#214. Trying to add flags to work around it...

rynewang (Contributor, Author) commented Aug 8, 2023

Turns out the ASAN error is not from a Boost issue but from our own variable lifetime management. Fixed.
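The class of lifetime bug described here — a background thread reaching back into an object that is already being destructed — is usually fixed by having the thread capture shared ownership of only the state it needs, analogous to a C++ lambda capturing a shared_ptr. A minimal illustrative Python sketch (names are made up, not Ray's code):

```python
import threading

def start_worker(shared_done):
    # The thread captures only `shared_done` (akin to a lambda
    # capturing a shared_ptr), never the enclosing object, so there is
    # no dangling access if the owner is torn down while the thread
    # still runs.
    def run():
        shared_done.set()
    t = threading.Thread(target=run)
    t.start()
    return t

done = threading.Event()
t = start_worker(done)
t.join()
print(done.is_set())  # True
```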

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang force-pushed the single-thread-for-fibers-per-actor branch from 18bd679 to 88c30d0 on August 9, 2023 00:49
src/ray/core_worker/fiber.h (outdated review thread, resolved)
@@ -124,7 +134,7 @@ class FiberState {
// no fibers can run after this point as we don't yield here.
// This makes sure this thread won't accidentally
// access being destructed core worker.
fiber_stopped_event_.Notify();
fiber_stopped_event->Notify();
Contributor

Let's add a fiber_stopped_event->clear() to explicitly free the pointer.

Contributor (Author)

Can't do that because the lambda captures the shared_ptr as const. We could mark the lambda mutable, but I think that's overkill.
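The pattern under discussion — a stopped event backed by a condition variable and mutex, with the worker holding its own shared reference so it can still signal during teardown — looks roughly like this. An illustrative Python stand-in for the std::condition_variable/std::mutex pair; the class and method names are made up:

```python
import threading

class StoppedEvent:
    """One-shot event built from a condition variable and mutex,
    mirroring the std::condition_variable / std::mutex pair the PR
    switches to (illustrative, not Ray's actual class)."""

    def __init__(self):
        self._cv = threading.Condition()
        self._stopped = False

    def notify(self):
        with self._cv:
            self._stopped = True
            self._cv.notify_all()

    def wait(self):
        with self._cv:
            self._cv.wait_for(lambda: self._stopped)

event = StoppedEvent()  # shared by the owner and the worker thread

def worker(evt):
    # The worker holds its own reference to the event (analogous to
    # the lambda capturing the shared_ptr), so signaling stays safe
    # even while the owner is tearing down.
    evt.notify()

t = threading.Thread(target=worker, args=(event,))
t.start()
event.wait()  # returns once the worker has signaled
t.join()
print("stopped")
```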

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@jjyao jjyao merged commit 0d6126f into ray-project:master Aug 9, 2023
5 of 11 checks passed
@rynewang rynewang deleted the single-thread-for-fibers-per-actor branch August 9, 2023 20:53
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Labels: None yet
Projects: None yet
3 participants