Description
🚀 The feature, motivation and pitch
We propose reducing the framework’s overhead with two improvements: an asynchronous scheduler and a multi-step approach. Both require only minimal code modifications.
- Asynchronous Scheduler
For the async scheduler, we suggest the following design:

[Figure: proposed async scheduler design]

This solution requires changes only in EngineCore without modifying other modules. By incorporating an update_schedule module, the framework can also seamlessly support speculative decoding.
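The overlap described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not vLLM's actual EngineCore or scheduler code: `ToyScheduler`, `Request`, `fake_forward`, and `run_engine_core` are hypothetical names, and the only identifier taken from the proposal is `update_schedule`. The idea shown is that step N+1 is scheduled optimistically while step N is still executing, and `update_schedule` reconciles the scheduler state once the real outputs arrive.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    max_tokens: int
    tokens: list = field(default_factory=list)
    in_flight: int = 0  # tokens scheduled but not yet returned by the model


class ToyScheduler:
    """Hypothetical scheduler for illustration; not vLLM's Scheduler class."""

    def __init__(self, requests):
        self.running = {r.req_id: r for r in requests}

    def schedule(self):
        # Count in-flight tokens so a request is never over-scheduled
        # while an earlier step is still executing.
        batch = []
        for r in self.running.values():
            if len(r.tokens) + r.in_flight < r.max_tokens:
                r.in_flight += 1
                batch.append(r.req_id)
        return batch

    def update_schedule(self, outputs):
        # Reconcile the optimistic state with the real model outputs.
        finished = []
        for req_id, token in outputs.items():
            r = self.running[req_id]
            r.in_flight -= 1
            r.tokens.append(token)
            if len(r.tokens) >= r.max_tokens:
                finished.append(self.running.pop(req_id))
        return finished


def fake_forward(batch, step):
    # Stand-in for the model executor: one token per scheduled request.
    return {req_id: step for req_id in batch}


def run_engine_core(requests):
    scheduler = ToyScheduler(requests)
    finished = []
    step = 0
    with ThreadPoolExecutor(max_workers=1) as executor:
        batch = scheduler.schedule()
        while batch:
            future = executor.submit(fake_forward, batch, step)
            # Overlap: schedule step N+1 while step N is still running.
            next_batch = scheduler.schedule()
            finished += scheduler.update_schedule(future.result())
            batch = next_batch
            step += 1
    return finished
```

In this sketch, a speculative-decoding variant would presumably let `update_schedule` accept a variable number of accepted tokens per request and correct `in_flight` accordingly; that extension is an assumption and is not shown.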
- Multi-Step Approach
Although v1’s preprocessing and postprocessing are lighter than v0’s, we still observe notable inefficiencies on some platforms (e.g., ARM + XPU). In particular, there is a significant gap between finishing input preparation and launching the model forward pass, and the device-to-host (D2H) copy of each model output further increases overall latency.
To address these issues, we introduce a multi-step strategy that differs from v0 in several key ways:
• We propose a simple_prepare_input function to reduce unnecessary CPU operations.
• We defer the D2H communication to avoid excessive stream synchronizations.
• We integrate the multi-step process with the asynchronous scheduler, thereby alleviating the scheduler’s load when handling multiple outputs.
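The effect of deferring D2H copies can be illustrated with a small simulation. This is only a sketch under assumptions: `FakeDevice` is a hypothetical stand-in that counts host synchronizations, and the body of `simple_prepare_input` here is a guess at the intent (reuse the previous device-side output rather than rebuilding the batch on the CPU), not the proposed implementation.

```python
class FakeDevice:
    """Hypothetical device stand-in that counts host synchronizations."""

    def __init__(self):
        self.sync_count = 0

    def forward(self, token):
        # Pretend the model maps a token to the next one; the result
        # is treated as staying on the device.
        return token + 1

    def copy_to_host(self, device_values):
        # Each D2H transfer is modeled as one stream synchronization.
        self.sync_count += 1
        return list(device_values)


def simple_prepare_input(prev_output):
    # Assumed fast path: feed the previous device-side output straight
    # into the next step instead of rebuilding inputs on the CPU.
    return prev_output


def run_single_step(device, first_token, num_steps):
    tokens, current = [], first_token
    for _ in range(num_steps):
        out = device.forward(current)
        tokens += device.copy_to_host([out])  # synchronizes every step
        current = tokens[-1]
    return tokens


def run_multi_step(device, first_token, num_steps):
    device_outputs, current = [], first_token
    for _ in range(num_steps):
        current = device.forward(simple_prepare_input(current))
        device_outputs.append(current)  # kept on device, no sync yet
    return device.copy_to_host(device_outputs)  # one deferred D2H copy
```

Both loops produce the same tokens, but the multi-step version synchronizes once per group of steps instead of once per step. The scheduler then receives several outputs per request at once, which is where the integration with the asynchronous scheduler comes in.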

Together, these changes aim to reduce scheduling and host-side overhead while keeping code modifications small.