[Core] Optimize next/anext performance for streaming generator #41270

Merged: 11 commits into ray-project:master on Nov 22, 2023

Conversation

@rkooo567 rkooo567 (Contributor) commented Nov 20, 2023

Why are these changes needed?

Currently, when a worker does a lot of yield and next in a generator, it is bottlenecked by the performance of yield and next themselves.

This PR optimizes the performance of next. next was slow because we always called ray.wait, which is fairly expensive.

wait is not necessary when an object is already ready, and when it is not yet ready, the overhead of wait is trivial. This PR removes the overhead by not calling ray.wait if the object is already ready when we peek at it (we can tell because we know when the object was reported, which means it is ready).

After this PR, the throughput of the tests is doubled.

After this, yield becomes the bottleneck. Unfortunately, yield is hard to optimize because its cost comes from serialization. We will deal with that in the longer term.
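
As an illustration (not code from this PR), here is roughly the consumer-side hot path this change targets. It assumes the 2.8-era API where a generator actor method returns a streaming object ref generator by default (older versions may need num_returns="streaming"); the Streamer actor and consume() helper are hypothetical names.

import asyncio

import ray


@ray.remote
class Streamer:
    async def stream(self, n: int):
        for i in range(n):
            # Each yield reports one object back to the caller as it is produced.
            yield i


async def consume() -> int:
    streamer = Streamer.remote()
    total = 0
    # Every iteration calls __anext__ on the streaming generator. Before this
    # PR each step issued a ray.wait(); after it, objects that were already
    # reported as ready are returned without waiting.
    async for ref in streamer.stream.remote(10_000):
        total += await ref
    return total


if __name__ == "__main__":
    ray.init()
    print(asyncio.run(consume()))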

Related issue number

Closes #39643

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

SangBin Cho added 3 commits November 1, 2023 07:31
@@ -332,16 +345,18 @@ class StreamingObjectRefGenerator:
timeout_s: If the next object is not ready within
this timeout, it returns the nil object ref.
"""
self.worker.check_connected()
rkooo567 (Contributor Author) commented:

this is kind of unnecessary and has high overhead

self._generator_ref)
ready, unready = ray.wait(
A reviewer (Contributor) commented:

Could you explain why ray.wait() on an already-ready object is expensive even when fetch_local is false?

@rkooo567 rkooo567 (Contributor Author) commented Nov 21, 2023:

Do you mean add comments in the code?

It is because the router itself is under very high load (tens of thousands of iterations/s). ray.wait itself only adds 200-500 us of overhead, but that becomes expensive when you run tens of thousands of tasks per second. If we skip it, we can do this step within 10-20 us.
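
To illustrate the idea, here is a hypothetical sketch (not the actual StreamingObjectRefGenerator code): track which refs the executor has already reported, return those immediately, and only fall back to ray.wait for refs that have not been reported yet.

import ray


class PeekThenWait:
    """Hypothetical sketch of the fast path described above."""

    def __init__(self, refs_with_reported_flag, timeout_s=None):
        # Iterator of (ObjectRef, already_reported) pairs, standing in for the
        # generator's internal bookkeeping of which objects were reported ready.
        self._refs = iter(refs_with_reported_flag)
        self._timeout_s = timeout_s

    def __next__(self):
        ref, already_reported = next(self._refs)
        if already_reported:
            # Fast path: the object was reported when it became ready, so the
            # ~200-500 us ray.wait() round trip can be skipped entirely.
            return ref
        # Slow path: wait up to timeout_s; fetch_local=False avoids pulling data.
        ready, _ = ray.wait([ref], timeout=self._timeout_s, fetch_local=False)
        if ready:
            return ready[0]
        # The real generator returns a nil object ref on timeout; use None here.
        return None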

src/ray/core_worker/task_manager.h (outdated review thread, resolved)
@rkooo567 rkooo567 merged commit 3b032fa into ray-project:master Nov 22, 2023
10 of 15 checks passed
@edoakes edoakes (Contributor) commented Nov 23, 2023

@rkooo567 can you please document the change in performance on the Serve streaming microbenchmarks on this PR for posterity?

@alexeykudinkin alexeykudinkin (Contributor) commented Nov 28, 2023

I'm adding a few more benchmarks here. Here are the results:

Before

# Core Streaming
Core Actors streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 12384.79 +- 106.67 tokens/s
(CallerActor pid=66926) Individual request quantiles:
(CallerActor pid=66926) 	P50=807.6792080000002
(CallerActor pid=66926) 	P75=857.1882605000001
(CallerActor pid=66926) 	P99=903.21980678

# Serve Handle Streaming
DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 10981.91 +- 280.73 tokens/s
(ServeReplica:default:CallerDeployment pid=71672) Individual request quantiles:
(ServeReplica:default:CallerDeployment pid=71672) 	P50=905.3096459999992
(ServeReplica:default:CallerDeployment pid=71672) 	P75=957.0047495000003
(ServeReplica:default:CallerDeployment pid=71672) 	P99=993.9941300000013

After

# Core Streaming
Core Actors streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 19252.14 +- 330.47 tokens/s

P50=519.7334164999993
P75=585.4596035
P99=595.3468529799989

# Serve Handle Streaming
DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 17393.68 +- 644.61 tokens/s

P50=583.9117500000004
P75=633.7538224999998
P99=653.4384627099994

Great job @rkooo567 !
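
For reference, a minimal sketch (illustrative only, not the actual release test) of how a streaming-throughput microbenchmark like the above can be put together with Ray async actors; the constants mirror tokens_per_request=1000 and batch_size=10, and the TokenStreamer/one_request names are made up.

import asyncio
import time

import ray

TOKENS_PER_REQUEST = 1000
BATCH_SIZE = 10


@ray.remote
class TokenStreamer:
    async def stream(self, n: int):
        for _ in range(n):
            yield b"x" * 8  # tiny payload per "token"


async def one_request(streamer) -> int:
    count = 0
    async for ref in streamer.stream.remote(TOKENS_PER_REQUEST):
        await ref
        count += 1
    return count


async def main():
    streamer = TokenStreamer.remote()
    start = time.perf_counter()
    counts = await asyncio.gather(*[one_request(streamer) for _ in range(BATCH_SIZE)])
    elapsed = time.perf_counter() - start
    print(f"throughput: {sum(counts) / elapsed:.2f} tokens/s")


if __name__ == "__main__":
    ray.init()
    asyncio.run(main())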

@rkooo567 rkooo567 (Contributor Author) commented:

Nice! This aligns with my results. Let's keep your benchmark and start tracking the performance stats.

ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
…roject#41270)

Development

Successfully merging this pull request may close these issues.

[serve] Very high CPU usage for proxying streaming requests
4 participants