[Core] Optimize next/anext performance for streaming generator #41270
Conversation
@@ -332,16 +345,18 @@ class StreamingObjectRefGenerator:
            timeout_s: If the next object is not ready within
                this timeout, it returns the nil object ref.
        """
        self.worker.check_connected()
this is kind of unnecessary and has high overhead
            self._generator_ref)
        ready, unready = ray.wait(
Could you explain why ray.wait() on a ready object is expensive even when fetch_local is false?
You mean add comments to the code?
It is because the router itself has a very high load (tens of thousands of iterations/s). ray.wait itself only has 200-500us of overhead, but that becomes expensive when you have tens of thousands of tasks. If we skip it, we can do this process within 10-20 us.
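The fast path being discussed can be sketched in plain Python. This is an illustrative sketch, not Ray's actual implementation: the class, method, and attribute names are hypothetical, and the expensive wait is simulated with a sleep standing in for the 200-500us ray.wait overhead mentioned above.

```python
import time


class StreamingGeneratorSketch:
    """Hypothetical sketch of the skip-the-wait optimization."""

    def __init__(self):
        self._ready_refs = set()  # refs the worker has already reported
        self._values = {}         # ref -> resolved value

    def report_ready(self, ref, value):
        # When an object is reported by the worker, it is by definition
        # ready, so we record that fact for the fast path.
        self._values[ref] = value
        self._ready_refs.add(ref)

    def _expensive_wait(self, ref):
        # Stand-in for ray.wait(): simulate ~300us of fixed overhead.
        time.sleep(0.0003)
        return ref in self._ready_refs

    def next(self, ref):
        # Fast path: the object was already reported, so it must be
        # ready; skip the expensive wait entirely.
        if ref in self._ready_refs:
            return self._values[ref]
        # Slow path: fall back to the expensive wait.
        if self._expensive_wait(ref):
            return self._values[ref]
        raise TimeoutError(ref)
```

Under high load (tens of thousands of next() calls per second), the fast path avoids tens of thousands of wait calls, which is where the throughput gain in this PR comes from.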
@rkooo567 can you please document the change in performance on the Serve streaming microbenchmarks on this PR for posterity?
I'm adding a few more benchmarks here. Here are the results: Before
After
Great job @rkooo567 !
nice! this aligns with my result. Let's keep your benchmark and start tracking the performance stats
Why are these changes needed?
Currently, when the worker has many yields and nexts in a generator, it is bottlenecked by the performance of yield and next themselves.
This PR optimizes the performance of next. next was slow because we always called ray.wait, which is pretty expensive.
wait is not necessary when an object is already ready; if it is not ready yet, the overhead of wait is trivial. This PR removes the performance overhead by not calling ray.wait if the object is already ready when we peek it (we can tell because we know when the object is reported, meaning it is ready).
After this PR, the throughput of the tests is doubled.
After this, yield becomes the bottleneck. Unfortunately, yield is hard to optimize because its cost comes from serialization. We will deal with it in the longer term.
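The throughput effect described above can be illustrated with a small self-contained microbenchmark. This is a hedged sketch, not the Serve streaming benchmark from this PR: the wait overhead is an assumed fixed 300us sleep, and both functions are hypothetical stand-ins for next() with and without the unconditional wait.

```python
import time

WAIT_OVERHEAD_S = 0.0003  # assumed ~300us per wait, per the discussion above
N = 200                   # number of simulated next() calls


def next_always_wait(ready):
    # Old behavior: pay the wait overhead on every call, even when the
    # object is already ready.
    time.sleep(WAIT_OVERHEAD_S)
    return ready


def next_skip_wait(ready):
    # New behavior: skip the wait entirely when the object is known to
    # be ready; only fall back to waiting otherwise.
    if ready:
        return True
    time.sleep(WAIT_OVERHEAD_S)
    return ready


start = time.perf_counter()
for _ in range(N):
    next_always_wait(True)
t_always = time.perf_counter() - start

start = time.perf_counter()
for _ in range(N):
    next_skip_wait(True)
t_skip = time.perf_counter() - start

print(f"always-wait: {t_always:.4f}s, skip-wait: {t_skip:.4f}s")
```

With all objects ready, the skip-wait variant does no sleeping at all, so its loop finishes far faster; this mirrors why the optimized next() roughly doubles throughput until yield/serialization becomes the new bottleneck.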
Related issue number
Closes #39643
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.