How to prioritize decode batches over prefill in SGLang? (GLM-4.7 deployment) #27762

coach00 · 2026-06-10T04:01:19Z


Hi everyone,

I'm deploying the GLM-4.7 model on H20 using SGLang and need some help with throughput optimization.

Current Setup

python -m sglang.launch_server \
--model-path /ai/models/glm-4.7-fp8 \
--api-key sk-1024 \
--context-length 128000 \
--chunked-prefill-size 4096 \
--max-prefill-tokens 4096 \
--schedule-conservativeness 1.1 \
--max-running-requests 16 \
--max-queued-requests 30 \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--served-model-name glm-4.7 \
--mem-frac 0.85 \
--tp 8 \
--enable-metrics \
--enable-mixed-chunk

My Use Case

Concurrent requests: ~10 requests at any given time
Request type: Primarily long-context requests (like Claude Code conversations)
Goal: Ensure smooth decode throughput for ongoing requests

Current Behavior

To prioritize decode, I've set:

--chunked-prefill-size 4096
--enable-mixed-chunk

However, when new requests arrive, prefill still dominates scheduling. Decode keeps running, but its throughput drops sharply, and users with ongoing requests see noticeable latency.
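A rough way to see the scale of the problem: with --chunked-prefill-size 4096, a long prompt is ingested over many chunked steps, and ongoing decodes have to share or wait behind each one. The sketch below only counts those steps. The 100,000-token prompt is an illustrative assumption, not a measurement from this deployment, and prefill_chunks is a hypothetical helper, not an SGLang function.

```python
import math


def prefill_chunks(prompt_tokens: int, chunk_size: int) -> int:
    """Number of chunked-prefill steps needed to ingest one prompt."""
    if prompt_tokens < 0 or chunk_size <= 0:
        raise ValueError("prompt_tokens must be >= 0 and chunk_size > 0")
    return math.ceil(prompt_tokens / chunk_size)


# Hypothetical 100,000-token prompt at three candidate chunk sizes.
for chunk_size in (4096, 2048, 1024):
    steps = prefill_chunks(100_000, chunk_size)
    print(f"chunk size {chunk_size}: {steps} prefill steps")
```

Smaller chunks do not reduce the total prefill work. They split it into more, shorter steps, which is why lowering the chunk size tends to trade time-to-first-token on the new request for smoother decode on the existing ones.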

The Question

How can I configure SGLang to:

Prioritize decode batches over prefill
Avoid severe blocking of ongoing decode requests when new requests enter
Maintain better throughput for long-context, token-streaming scenarios?

Are there specific parameters or scheduling strategies I should be using? Any guidance would be helpful.

smqd19 · 2026-06-11T12:21:25Z


You’re running into a classic issue: prefill is compute-heavy, especially with long contexts, and can starve decode batches if the scheduler isn’t tuned. We’ve hit this in production with LLMs on similar hardware. Here’s what’s worked for us:

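1. Adjust --schedule-conservativeness: This sets how cautiously the scheduler admits new requests into the running batch. Note the direction: the SGLang docs describe larger values as more conservative, so lower values (e.g., 0.9 or even 0.7) admit new requests, and their prefill, more aggressively. That shortens queueing for new requests but can add pressure on ongoing streams, so try values on both sides of 1.0 and compare decode latency before settling.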

--schedule-conservativeness 0.8

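2. Tweak --max-prefill-tokens: If your context windows are huge, prefill can hog resources. Try dropping this to 2048 or even 1024 to cap how many tokens go into one prefill batch, so decode steps can interleave more often. Lower --chunked-prefill-size along with it, since that is the flag that actually splits a long prompt into chunks.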

--max-prefill-tokens 2048

3. Use --max-running-requests: If you expect ~10 concurrent requests, set --max-running-requests to 10-12. This prevents queue flooding, giving decode batches more headroom.

4. Read the scheduling code: SGLang’s batch scheduler prioritization is configurable, but the default is “prefill first, decode second.” If you need even more control, you can patch the scheduler logic to check decode batch size and force decode prioritization for streaming scenarios. We’ve done similar mods in vLLM.

5. Monitor with --enable-metrics: Look at prefill vs decode times in the metrics output; this’ll show if your tweaks are actually shifting priority.
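For the monitoring step, a minimal sketch of comparing runs from the metrics text is below. It assumes the server exposes Prometheus text format once --enable-metrics is set, and that latency histograms appear with _sum and _count suffixes. The metric names in the usage note are assumptions, so check your own metrics output for the exact names. parse_prometheus_text and histogram_mean are hypothetical helpers, not part of SGLang.

```python
from collections import defaultdict
from typing import Dict, Optional


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Sum sample values per metric name, across all label sets."""
    totals: Dict[str, float] = defaultdict(float)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if not rest:
            continue  # malformed sample line, no value
        totals[name] += float(rest[0])  # rest[1], if present, is a timestamp
    return dict(totals)


def histogram_mean(metrics: Dict[str, float], base: str) -> Optional[float]:
    """Mean of a histogram from its _sum and _count series, or None."""
    count = metrics.get(base + "_count", 0.0)
    if count <= 0:
        return None
    return metrics.get(base + "_sum", 0.0) / count
```

Usage, under the assumptions above: fetch the metrics text before and after a config change, then compare something like histogram_mean(m, "sglang:time_per_output_token_seconds") against histogram_mean(m, "sglang:time_to_first_token_seconds"). If per-token latency improves while time-to-first-token grows, the change is shifting priority toward decode.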

Summary Settings Example:

--schedule-conservativeness 0.8 \
--chunked-prefill-size 2048 \
--max-prefill-tokens 2048 \
--max-running-requests 12

Let me know if any of these tweaks help. If decode is still getting starved, you might need to patch the scheduler or split the deployment: one node for prefill-heavy requests, one for streaming. That’s how we’ve handled this at scale.
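The split-deployment idea can be sketched as a thin routing rule in front of two servers. Everything here is hypothetical: the two URLs, the 8,000-token threshold, and the assumption that the caller already has a prompt token estimate. It is not an SGLang feature, only an illustration of routing by prompt length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str


# Hypothetical endpoints; replace with your own nodes.
LONG_PREFILL = Backend("prefill-heavy", "http://node-a:30000")
STREAMING = Backend("streaming", "http://node-b:30000")


def pick_backend(prompt_tokens: int, threshold: int = 8000) -> Backend:
    """Send prompts at or above the threshold to the prefill-heavy node."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be >= 0")
    return LONG_PREFILL if prompt_tokens >= threshold else STREAMING


for tokens in (500, 8000, 60_000):
    print(tokens, "->", pick_backend(tokens).name)
```

The threshold is the knob to tune: set it too low and the streaming node sits idle, too high and long prefills land next to latency-sensitive streams again.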

1 reply

coach00 Jun 12, 2026
Author

Thank you very much for your help!

I will try to make adjustments based on your suggestions and monitor the subsequent performance.
