[Core] Optimize next/anext performance for streaming generator #41270

Merged: 11 commits into ray-project:master on Nov 22, 2023

Conversation

@rkooo567 rkooo567 (Contributor) commented Nov 20, 2023

Why are these changes needed?

Currently, when a worker does a lot of yield and next in a generator, it is bottlenecked by the performance of yield and next themselves.

This PR optimizes the performance of next. next was slow because we always called ray.wait, which is fairly expensive.

wait is not necessary when an object is already ready, and when it is not yet ready, the overhead of wait is trivial. This PR removes the overhead by not calling ray.wait if the object is already ready when we peek at it (we can tell because we know when the object was reported, which means it is ready).

After this PR, the throughput of the tests is doubled.

After this, yield becomes the bottleneck. Unfortunately, yield is hard to optimize because its cost comes from serialization. We will deal with that in the longer term.
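
As an illustration (not code from this PR), here is roughly the consumer-side hot path this change targets. It assumes the 2.8-era API where a generator actor method returns a streaming object ref generator by default (older versions may need num_returns="streaming"); the Streamer actor and consume() helper are hypothetical names.

import asyncio

import ray


@ray.remote
class Streamer:
    async def stream(self, n: int):
        for i in range(n):
            # Each yield reports one object back to the caller as it is produced.
            yield i


async def consume() -> int:
    streamer = Streamer.remote()
    total = 0
    # Every iteration calls __anext__ on the streaming generator. Before this
    # PR each step issued a ray.wait(); after it, objects that were already
    # reported as ready are returned without waiting.
    async for ref in streamer.stream.remote(10_000):
        total += await ref
    return total


if __name__ == "__main__":
    ray.init()
    print(asyncio.run(consume()))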

Related issue number

Closes #39643

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

SangBin Cho added 3 commits November 1, 2023 07:31
@@ -332,16 +345,18 @@ class StreamingObjectRefGenerator:
timeout_s: If the next object is not ready within
this timeout, it returns the nil object ref.
"""
self.worker.check_connected()
rkooo567 (Contributor Author) commented:

this is kind of unnecessary and has high overhead

self._generator_ref)
ready, unready = ray.wait(
A reviewer (Contributor) commented:

Could you explain why ray.wait() on an already-ready object is expensive even when fetch_local is false?

@rkooo567 rkooo567 (Contributor Author) commented Nov 21, 2023:

Do you mean add comments in the code?

It is because the router itself is under very high load (tens of thousands of iterations/s). ray.wait itself only adds 200-500 us of overhead, but that becomes expensive when you run tens of thousands of tasks per second. If we skip it, we can do this step within 10-20 us.
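
To illustrate the idea, here is a hypothetical sketch (not the actual StreamingObjectRefGenerator code): track which refs the executor has already reported, return those immediately, and only fall back to ray.wait for refs that have not been reported yet.

import ray


class PeekThenWait:
    """Hypothetical sketch of the fast path described above."""

    def __init__(self, refs_with_reported_flag, timeout_s=None):
        # Iterator of (ObjectRef, already_reported) pairs, standing in for the
        # generator's internal bookkeeping of which objects were reported ready.
        self._refs = iter(refs_with_reported_flag)
        self._timeout_s = timeout_s

    def __next__(self):
        ref, already_reported = next(self._refs)
        if already_reported:
            # Fast path: the object was reported when it became ready, so the
            # ~200-500 us ray.wait() round trip can be skipped entirely.
            return ref
        # Slow path: wait up to timeout_s; fetch_local=False avoids pulling data.
        ready, _ = ray.wait([ref], timeout=self._timeout_s, fetch_local=False)
        if ready:
            return ready[0]
        # The real generator returns a nil object ref on timeout; use None here.
        return None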

src/ray/core_worker/task_manager.h (outdated review thread, resolved)
@rkooo567 rkooo567 merged commit 3b032fa into ray-project:master Nov 22, 2023
10 of 15 checks passed
@edoakes edoakes (Contributor) commented Nov 23, 2023

@rkooo567 can you please document the change in performance on the Serve streaming microbenchmarks on this PR for posterity?

@alexeykudinkin alexeykudinkin (Contributor) commented Nov 28, 2023

I'm adding a few more benchmarks here. Here are the results:

Before

# Core Streaming
Core Actors streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 12384.79 +- 106.67 tokens/s
(CallerActor pid=66926) Individual request quantiles:
(CallerActor pid=66926) 	P50=807.6792080000002
(CallerActor pid=66926) 	P75=857.1882605000001
(CallerActor pid=66926) 	P99=903.21980678

# Serve Handle Streaming
DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 10981.91 +- 280.73 tokens/s
(ServeReplica:default:CallerDeployment pid=71672) Individual request quantiles:
(ServeReplica:default:CallerDeployment pid=71672) 	P50=905.3096459999992
(ServeReplica:default:CallerDeployment pid=71672) 	P75=957.0047495000003
(ServeReplica:default:CallerDeployment pid=71672) 	P99=993.9941300000013

After

# Core Streaming
Core Actors streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 19252.14 +- 330.47 tokens/s

P50=519.7334164999993
P75=585.4596035
P99=595.3468529799989

# Serve Handle Streaming
DeploymentHandle streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 17393.68 +- 644.61 tokens/s

P50=583.9117500000004
P75=633.7538224999998
P99=653.4384627099994

Great job @rkooo567 !
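
For reference, a minimal sketch (illustrative only, not the actual release test) of how a streaming-throughput microbenchmark like the above can be put together with Ray async actors; the constants mirror tokens_per_request=1000 and batch_size=10, and the TokenStreamer/one_request names are made up.

import asyncio
import time

import ray

TOKENS_PER_REQUEST = 1000
BATCH_SIZE = 10


@ray.remote
class TokenStreamer:
    async def stream(self, n: int):
        for _ in range(n):
            yield b"x" * 8  # tiny payload per "token"


async def one_request(streamer) -> int:
    count = 0
    async for ref in streamer.stream.remote(TOKENS_PER_REQUEST):
        await ref
        count += 1
    return count


async def main():
    streamer = TokenStreamer.remote()
    start = time.perf_counter()
    counts = await asyncio.gather(*[one_request(streamer) for _ in range(BATCH_SIZE)])
    elapsed = time.perf_counter() - start
    print(f"throughput: {sum(counts) / elapsed:.2f} tokens/s")


if __name__ == "__main__":
    ray.init()
    asyncio.run(main())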

@rkooo567 rkooo567 (Contributor Author) commented:

Nice! This aligns with my results. Let's keep your benchmark and start tracking the performance stats.

ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
…roject#41270)

Development

Successfully merging this pull request may close these issues.

[serve] Very high CPU usage for proxying streaming requests
4 participants