Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Fix performance regression in single_client_tasks_and_get_batch #39362

Merged
merged 6 commits into from
Sep 8, 2023

Conversation

vitsai
Copy link
Contributor

@vitsai vitsai commented Sep 7, 2023

The single_client_tasks_and_get_batch benchmark saw a ~0.5-1k tasks/s average regression (2k tasks/s on a local machine) due to #38323, which changed some tracing logic to unconditionally change the signature of every remote function to accomodate tracing during _inject_tracing_into_function.

Make the signature change conditional again, but move it to the execution portion of RemoteFunction rather than the definition. Also make sure the injection only happens once even when the remote function is executed multiple times.

Why are these changes needed?

Related issue number

#39259

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@rkooo567
Copy link
Contributor

rkooo567 commented Sep 7, 2023

Running microbenchmark here https://buildkite.com/ray-project/release-tests-pr/builds/52417.

@vitsai can you let me know the number after this PR from this test after it is completed?

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Nit comments

python/ray/util/tracing/tracing_helper.py Show resolved Hide resolved
python/ray/remote_function.py Outdated Show resolved Hide resolved
@@ -254,6 +252,15 @@ def _remote(self, args=None, kwargs=None, **task_options):
worker = ray._private.worker.global_worker
worker.check_connected()

# We cannot do this when the function is first defined, because we need
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the before this change, it was already broken? I think when the remote function is first created, ray should automatically call ray.init

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ray calls ray.init when the function is first invoked, not defined

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 7, 2023
Comment on lines 257 to 262
if not self._injected_tracing:
self._function = _inject_tracing_into_function(self._function)
self._function_signature = ray._private.signature.extract_signature(
self._function
)
self._injected_tracing = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ray API needs to be thread-safe, is there any concern of a race condition here if multiple threads invoke a function concurrently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a mutex, and custom getstate/setstate for pickling (lock is not picklable).

Since simple assignments of variables are atomic in CPython and these two functions are idempotent I believe it would have been mostly fine except for the dictionary mutating part inside _inject_tracing_into_function. Also learned that Python doesn't provide any kind of CAS? ChatGPT suggested tuple swap x, y = y, x but seems like that is not actually atomic either

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no _inject is not idempotent so

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@vitsai vitsai removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 7, 2023
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@vitsai
Copy link
Contributor Author

vitsai commented Sep 7, 2023

single_client_tasks_and_get_batch = [12.225312658308761, 0.38310197760588016] for microbenchmark release test after addressing comments (does not include latest commit, which only changes 1 line of test logic)

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment otherwise lgtm

python/ray/remote_function.py Outdated Show resolved Hide resolved
python/ray/remote_function.py Outdated Show resolved Hide resolved
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@rkooo567
Copy link
Contributor

rkooo567 commented Sep 8, 2023

Lmk when tests pass! Seems good to merge

@vitsai
Copy link
Contributor Author

vitsai commented Sep 8, 2023

Tests look good, seems like microbenchmark ran into some infra failure so running it again now. On my machine, I did see the throughput go back up to ~13k/s from ~11k/s on the latest commit, so personally don't think we have to wait.

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 8, 2023

You machine has a different environment from microbenchmark, so we shouldn't rely on that result. Let me rerun tests

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 8, 2023

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 8, 2023

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 8, 2023

=

single_client_tasks_and_get_batch = [11.222801689347772, 0.4976784102185062]
 

<br class="Apple-interchange-newline

single_client_tasks_and_get_batch = [11.222801689347772, 0.4976784102185062]
 


@rkooo567 rkooo567 merged commit b6edccf into ray-project:master Sep 8, 2023
88 of 96 checks passed
@vitsai
Copy link
Contributor Author

vitsai commented Sep 8, 2023

(probably because I retried at the time of failure in the last comment)

vitsai added a commit to vitsai/ray that referenced this pull request Sep 8, 2023
ray-project#39362)

The single_client_tasks_and_get_batch benchmark saw a ~0.5-1k tasks/s average regression (2k tasks/s on a local machine) due to ray-project#38323, which changed some tracing logic to unconditionally change the signature of every remote function to accomodate tracing during _inject_tracing_into_function.

Make the signature change conditional again, but move it to the execution portion of RemoteFunction rather than the definition. Also make sure the injection only happens once even when the remote function is executed multiple times.
GeneDer pushed a commit that referenced this pull request Sep 8, 2023
#39362) (#39429)

The single_client_tasks_and_get_batch benchmark saw a ~0.5-1k tasks/s average regression (2k tasks/s on a local machine) due to #38323, which changed some tracing logic to unconditionally change the signature of every remote function to accomodate tracing during _inject_tracing_into_function.

Make the signature change conditional again, but move it to the execution portion of RemoteFunction rather than the definition. Also make sure the injection only happens once even when the remote function is executed multiple times.
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
ray-project#39362)

The single_client_tasks_and_get_batch benchmark saw a ~0.5-1k tasks/s average regression (2k tasks/s on a local machine) due to ray-project#38323, which changed some tracing logic to unconditionally change the signature of every remote function to accomodate tracing during _inject_tracing_into_function.

Make the signature change conditional again, but move it to the execution portion of RemoteFunction rather than the definition. Also make sure the injection only happens once even when the remote function is executed multiple times.

Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
ray-project#39362)

The single_client_tasks_and_get_batch benchmark saw a ~0.5-1k tasks/s average regression (2k tasks/s on a local machine) due to ray-project#38323, which changed some tracing logic to unconditionally change the signature of every remote function to accomodate tracing during _inject_tracing_into_function.

Make the signature change conditional again, but move it to the execution portion of RemoteFunction rather than the definition. Also make sure the injection only happens once even when the remote function is executed multiple times.

Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants