You’re running into a classic issue: prefill is compute-heavy, especially with long contexts, and can starve decode batches if the scheduler isn’t tuned. We’ve hit this in production with LLMs on similar hardware. Here’s what’s worked for us:

1. Adjust --schedule-conservativeness 0.8
2. Tweak --max-prefill-tokens 2048
3. Use
4. Read the scheduling code: SGLang’s batch scheduler prioritization is configurable, but the default is “prefill first, decode second.” If you need even more control, you can patch the scheduler logic to check decode batch size and force decode prioritization for streaming scenarios. We’ve done similar mods in vLLM.
5. Monitor with

Summary Settings Example:

--schedule-conservativeness 0.8 \
--chunked-prefill-size 2048 \
--max-prefill-tokens 2048 \
--max-running-requests 12

Let me know if any of these tweaks help. If decode is still getting starved, you might need to patch the scheduler or split the deployment: one node for prefill-heavy requests, one for streaming. That’s how we’ve handled this at scale.
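To make the chunk-size trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a long prompt is prefilled in pieces of at most `chunk_size` tokens and that the cost of a chunk grows linearly with its token count, which is a simplification. The helper names and the `ms_per_1k_tokens` figure are hypothetical and are not taken from SGLang or from any measurement.

```python
import math


def prefill_chunks(prompt_tokens: int, chunk_size: int) -> int:
    """Number of pieces a prompt is split into under chunked prefill."""
    if prompt_tokens < 0 or chunk_size <= 0:
        raise ValueError("prompt_tokens must be >= 0 and chunk_size > 0")
    return math.ceil(prompt_tokens / chunk_size)


def worst_chunk_ms(prompt_tokens: int, chunk_size: int,
                   ms_per_1k_tokens: float) -> float:
    """Rough time the largest single chunk occupies the GPU.

    Assumes cost is linear in tokens (a simplification).
    """
    largest_chunk = min(prompt_tokens, chunk_size)
    return largest_chunk / 1000 * ms_per_1k_tokens


if __name__ == "__main__":
    PROMPT_TOKENS = 128_000   # matches --context-length in the question
    MS_PER_1K = 100.0         # hypothetical prefill cost, not measured

    for chunk_size in (4096, 2048):
        chunks = prefill_chunks(PROMPT_TOKENS, chunk_size)
        stall = worst_chunk_ms(PROMPT_TOKENS, chunk_size, MS_PER_1K)
        print(f"chunk_size={chunk_size}: {chunks} chunks, "
              f"about {stall:.1f} ms per chunk")
```

Under these assumptions, halving the chunk size roughly halves the longest stretch a single prefill chunk can hold up decode, at the cost of about twice as many scheduler iterations to finish the prompt. Replace the hypothetical timing with numbers from your own hardware before drawing conclusions.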
Hi everyone,
I'm deploying the GLM-4.7 model on H20 using SGLang and need some help with throughput optimization.
Current Setup
python -m sglang.launch_server \
--model-path /ai/models/glm-4.7-fp8 \
--api-key sk-1024 \
--context-length 128000 \
--chunked-prefill-size 4096 \
--max-prefill-tokens 4096 \
--schedule-conservativeness 1.1 \
--max-running-requests 16 \
--max-queued-requests 30 \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--served-model-name glm-4.7 \
--mem-frac 0.85 \
--tp 8 \
--enable-metrics \
--enable-mixed-chunk
My Use Case
Current Behavior
To prioritize decode, I've set:
However, when new requests arrive, prefill still consumes most of the scheduling priority. While decode continues, its performance drops significantly, causing noticeable latency for users with ongoing requests.
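One way to put numbers on this drop is to look at the gaps between streamed tokens. The following is a minimal sketch, assuming you record a wall-clock timestamp on the client for each token as it arrives (for example with `time.perf_counter()`). The function names are illustrative and are not part of SGLang.

```python
import math
import statistics


def inter_token_gaps(timestamps: list[float]) -> list[float]:
    """Seconds elapsed between each pair of consecutive tokens."""
    return [later - earlier
            for earlier, later in zip(timestamps, timestamps[1:])]


def summarize_gaps(timestamps: list[float]) -> dict[str, float]:
    """Median, p99 and max inter-token gap for one streamed response."""
    gaps = sorted(inter_token_gaps(timestamps))
    if not gaps:
        raise ValueError("need at least two token timestamps")
    # Nearest-rank p99: smallest gap with at least 99% of gaps at or below it.
    p99_index = max(0, math.ceil(0.99 * len(gaps)) - 1)
    return {
        "median": statistics.median(gaps),
        "p99": gaps[p99_index],
        "max": gaps[-1],
    }


if __name__ == "__main__":
    # Hypothetical arrival times: steady decode, then one long stall
    # of the kind a competing prefill might cause.
    arrivals = [0.00, 0.05, 0.10, 0.60, 0.65]
    print(summarize_gaps(arrivals))
```

Comparing the median against the p99 or max gap, before and after a flag change, shows whether ongoing streams are actually stalling less rather than relying on overall throughput figures.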
The Question
How can I configure SGLang to:
Are there specific parameters or scheduling strategies I should be using? Any guidance would be helpful.