[core] Fix performance regression in single_client_tasks_and_get_batch #39362
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Running the microbenchmark here: https://buildkite.com/ray-project/release-tests-pr/builds/52417. @vitsai, can you let me know the number from this test for this PR once it has completed?
Looks good to me. Nit comments
@@ -254,6 +252,15 @@ def _remote(self, args=None, kwargs=None, **task_options):
        worker = ray._private.worker.global_worker
        worker.check_connected()

        # We cannot do this when the function is first defined, because we need
Does this mean that before this change, it was already broken? I think when the remote function is first created, Ray should automatically call ray.init
ray calls ray.init when the function is first invoked, not defined
python/ray/remote_function.py
Outdated
        if not self._injected_tracing:
            self._function = _inject_tracing_into_function(self._function)
            self._function_signature = ray._private.signature.extract_signature(
                self._function
            )
            self._injected_tracing = True
The Ray API needs to be thread-safe. Is there any concern of a race condition here if multiple threads invoke a function concurrently?
Added a mutex, and custom getstate/setstate for pickling (lock is not picklable).
Since simple assignments of variables are atomic in CPython and these two functions are idempotent, I believe it would have been mostly fine except for the dictionary-mutating part inside _inject_tracing_into_function. Also learned that Python doesn't provide any kind of CAS? ChatGPT suggested tuple swap x, y = y, x but seems like that is not actually atomic either
Actually no, _inject is not idempotent so
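For readers following the thread, here is a rough sketch of the pattern described above: a lock guarding a one-time, non-idempotent step, plus custom `__getstate__`/`__setstate__` so the lock is left out of the pickled state. The class `LazyInjected`, its `_inject` method, and the `inject_count` counter are hypothetical stand-ins, not the code in `python/ray/remote_function.py`.

```python
import threading


class LazyInjected:
    """Hypothetical wrapper that must patch its function exactly once."""

    def __init__(self, function):
        self._function = function
        self._injected = False
        self._inject_lock = threading.Lock()
        self.inject_count = 0  # illustration only: counts how often _inject ran

    def _inject(self):
        # Stand-in for a non-idempotent step that must not run twice.
        self.inject_count += 1

    def call(self, *args, **kwargs):
        # Check and update the flag under the lock so concurrent callers
        # cannot both observe `_injected == False`.
        with self._inject_lock:
            if not self._injected:
                self._inject()
                self._injected = True
        return self._function(*args, **kwargs)

    def __getstate__(self):
        # Locks cannot be pickled, so drop the lock from the saved state.
        state = self.__dict__.copy()
        del state["_inject_lock"]
        return state

    def __setstate__(self, state):
        # Restore the attributes and create a fresh lock for the new object.
        self.__dict__.update(state)
        self._inject_lock = threading.Lock()


wrapper = LazyInjected(abs)
threads = [threading.Thread(target=wrapper.call, args=(-1,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(wrapper.inject_count)
```

Taking the lock on every call keeps the sketch simple; a real implementation might check the flag first and only take the lock when injection is still pending.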
One small comment otherwise lgtm
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Lmk when tests pass! Seems good to merge
Tests look good, seems like microbenchmark ran into some infra failure so running it again now. On my machine, I did see the throughput go back up to ~13k/s from ~11k/s on the latest commit, so personally don't think we have to wait.
Your machine has a different environment from the microbenchmark, so we shouldn't rely on that result. Let me rerun tests
Trying again here: https://buildkite.com/ray-project/release-tests-pr/builds/52670
wait, actually I realized it succeeded? https://buildkite.com/ray-project/release-tests-pr/builds/52417#018a6f01-015c-4d80-8f8c-f37dc1a0a686
single_client_tasks_and_get_batch = [11.222801689347772, 0.4976784102185062]
(probably because I retried at the time of failure in the last comment)
ray-project#39362) The single_client_tasks_and_get_batch benchmark saw a ~0.5-1k tasks/s average regression (2k tasks/s on a local machine) due to ray-project#38323, which changed some tracing logic to unconditionally change the signature of every remote function to accommodate tracing during _inject_tracing_into_function. Make the signature change conditional again, but move it to the execution portion of RemoteFunction rather than the definition. Also make sure the injection only happens once even when the remote function is executed multiple times.
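The shape of the change described in this commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Ray code: the `TRACING` flag, `add_tracing_kwarg`, `DeferredRemote`, and the `_trace_ctx` keyword are all hypothetical names, and the real check for whether tracing is enabled may live elsewhere.

```python
import inspect

# Hypothetical switch standing in for "is tracing enabled?".
TRACING = {"enabled": False}


def add_tracing_kwarg(function):
    """Hypothetical wrapper whose signature gains an extra `_trace_ctx` keyword."""

    def traced(*args, _trace_ctx=None, **kwargs):
        return function(*args, **kwargs)

    return traced


class DeferredRemote:
    def __init__(self, function):
        # Definition time: record the function and its plain signature only.
        self._function = function
        self._signature = inspect.signature(function)
        self._injected = False

    def remote(self, *args, **kwargs):
        # Execution time: change the signature only if tracing is on,
        # and only on the first call that sees it enabled.
        if TRACING["enabled"] and not self._injected:
            self._function = add_tracing_kwarg(self._function)
            self._signature = inspect.signature(self._function)
            self._injected = True
        return self._function(*args, **kwargs)


add = DeferredRemote(lambda a, b: a + b)
print(add.remote(1, 2), list(add._signature.parameters))
```

With tracing off, repeated calls never touch the signature, which is the cost the regression was attributed to; with tracing on, the wrapper is applied once and reused.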
The single_client_tasks_and_get_batch benchmark saw a ~0.5-1k tasks/s average regression (2k tasks/s on a local machine) due to #38323, which changed some tracing logic to unconditionally change the signature of every remote function to accommodate tracing during _inject_tracing_into_function.
Make the signature change conditional again, but move it to the execution portion of RemoteFunction rather than the definition. Also make sure the injection only happens once even when the remote function is executed multiple times.
Why are these changes needed?
Related issue number
#39259