[Feature] Code implementation of Async Scheduler #924
Merged
+479
−5
feat(scheduler): Implement asynchronous scheduler to reduce CPU wait overhead
Problem:
The existing synchronous scheduler blocks the CPU while waiting for TPU
results. This wait time creates significant overhead, especially for
small models or decode-heavy scenarios where the scheduler is invoked
frequently.
Solution:
This commit introduces an asynchronous scheduler that decouples the CPU
scheduler from TPU execution. The CPU no longer blocks on TPU results
and can continue scheduling while the TPU executes, which removes most
of the wait-time overhead.
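The overlap described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `schedule`, `execute_on_device`, `run_sync`, and `run_async` are hypothetical names, the TPU is simulated with a worker thread and `time.sleep`, and the sketch assumes the next step can be scheduled without the previous step's result.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def schedule(step):
    """Hypothetical CPU-side scheduling work: choose the batch for `step`."""
    time.sleep(0.005)  # simulated scheduler cost
    return f"batch-{step}"


def execute_on_device(batch):
    """Stand-in for device execution (simulated; no real TPU is used)."""
    time.sleep(0.005)  # simulated device latency
    return f"result({batch})"


def run_sync(num_steps):
    """Synchronous loop: the CPU waits for each step before scheduling the next."""
    results = []
    for step in range(num_steps):
        batch = schedule(step)
        results.append(execute_on_device(batch))  # CPU blocks here
    return results


def run_async(num_steps):
    """Asynchronous loop: schedule step N+1 while step N is still executing."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as device:
        in_flight = None
        for step in range(num_steps):
            # This scheduling work overlaps with the in-flight device step.
            batch = schedule(step)
            if in_flight is not None:
                results.append(in_flight.result())  # collect the previous step
            in_flight = device.submit(execute_on_device, batch)
        if in_flight is not None:
            results.append(in_flight.result())  # drain the last step
    return results


if __name__ == "__main__":
    for name, fn in (("sync", run_sync), ("async", run_async)):
        start = time.perf_counter()
        fn(50)
        print(f"{name}: {time.perf_counter() - start:.3f}s")
```

Both loops return the same results in the same order; only the wall-clock time differs, because the asynchronous loop hides scheduling time behind device time. The single-worker executor keeps device steps strictly ordered, which is one simple way to preserve step ordering in a sketch like this.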
Impact:
Observed a ~39% throughput increase on Llama3.2-1B (6.38 -> 8.89 req/s,
per the benchmark below). The gain is largest for workloads that invoke
the scheduler frequently.
Benchmark (Llama3.2-1B):
Before: 6.38 req/s (Sync @ cychiuak@8c7e7bb)
After: 8.89 req/s (Async)
How to Reproduce: