Description
This issue tracks the optimization gained by skipping _generate; for context, see the parent issue.
During prompt calibration, after prefilling the prompt tokens into the KV cache, _generate runs an autoregressive loop:

    while total_token_list[-1] != tokenizer.eos_id and num_tokens < max_seq_len:
        # generate one token per forward pass

For a 16-token prompt with max_seq_len=1024, this produces 546 forward passes before hitting EOS. All of them are wasted work, since the quantization observers already have sufficient activation statistics from the prefill pass.
Note: task calibration is unaffected because wikitext chunks fill the context window (1023 of 1024 tokens), so _generate would exit after at most 1 step.
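To make the loop's cost concrete, here is a minimal sketch that only counts how many forward passes the exit condition allows. It is not the actual `_generate` implementation: the function name `count_generate_passes` and the `steps_until_eos` parameter are hypothetical, and the sketch assumes `num_tokens` grows by one per decoded token.

```python
def count_generate_passes(prompt_len, max_seq_len, steps_until_eos, skip_generate=False):
    """Count decode-loop forward passes for one calibration sample.

    Hypothetical helper: steps_until_eos models how many decode steps the
    model takes before emitting EOS (546 in the run described above).
    """
    if skip_generate:
        # Observers only see the prefill pass; the decode loop never runs.
        return 0

    num_tokens = prompt_len
    passes = 0
    emitted_eos = False
    # Same exit condition as the loop above: stop on EOS or a full context.
    while not emitted_eos and num_tokens < max_seq_len:
        passes += 1      # one forward pass per generated token
        num_tokens += 1  # assumption: each pass appends exactly one token
        emitted_eos = passes >= steps_until_eos
    return passes


prompt_passes = count_generate_passes(16, 1024, steps_until_eos=546)
task_passes = count_generate_passes(1023, 1024, steps_until_eos=546)
skipped_passes = count_generate_passes(16, 1024, steps_until_eos=546, skip_generate=True)
print(prompt_passes, task_passes, skipped_passes)
```

Under these assumptions the short prompt pays for all 546 passes, the wikitext chunk exits after a single step because the context is already full, and `skip_generate=True` removes the loop entirely.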
Both runs used max_seq_len=1024, prefill_ar_len=128, and --tasks wikitext --limit 1, plus a user prompt.
| Phase | Baseline (min) | skip_generate=True (min) | Saved |
|---|---|---|---|
| DECODE | | | |
| calibration (tasks) | 122.6 | 89.1 | — (variance) |
| calibration (prompts) (546 _generate fwd passes) | 37.1 | 0.5 | 36.6 min |
| PREFILL | | | |
| calibration (tasks) | 5.8 | 5.4 | — |
| calibration (prompts) (546 _generate fwd passes) | 152.0 | 0.3 | 151.7 min |
| Lowering + QNN Compile | | | |
| qnn_manager.Compile | 113.3 | 102.4 | — (variance) |
| Total end-to-end | 484 min (8.1h) | 235 min (3.9h) | 249 min (4.2h) |
Prompt calibration savings: 188.3 min (3.1h), from 189.1 min down to 0.8 min.
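The summary figure can be re-derived from the two prompt-calibration rows of the table. The sketch below only repeats that arithmetic; the dictionary name and layout are illustrative, and the numbers are copied from the table.

```python
# Minutes from the table above: (baseline, skip_generate=True).
prompt_calibration = {
    "DECODE": (37.1, 0.5),
    "PREFILL": (152.0, 0.3),
}

before = sum(baseline for baseline, _ in prompt_calibration.values())
after = sum(skipped for _, skipped in prompt_calibration.values())
saved = before - after

# End-to-end totals from the last row of the table.
total_saved = 484 - 235
share = saved / total_saved  # fraction of the end-to-end gain from prompts

print(f"{before:.1f} -> {after:.1f} min, saved {saved:.1f} min ({saved / 60:.1f}h)")
print(f"prompt calibration share of end-to-end savings: {share:.0%}")
```

On these numbers, prompt calibration accounts for roughly three quarters of the 249 min end-to-end difference; the table marks the remaining rows as run-to-run variance.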
Projected savings for qwen3-1_7b
| Phase | Before (min) | After (min) | Saved |
|---|---|---|---|
| DECODE prompt calibration | 40.3 | ~0.8 | 39.5 |
| PREFILL prompt calibration | 280.5 | ~0.2 | 280.3 |
| Total | 653 (10.9h) | ~333 (5.6h) | ~320 (5.3h) |
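As a quick consistency check on the projection, the per-phase savings should add up to the difference between the two totals, within rounding. The variable names are illustrative; the values are copied from the projected table.

```python
# Projected minutes for qwen3-1_7b, copied from the table above.
phase_saved = {"DECODE prompt calibration": 39.5, "PREFILL prompt calibration": 280.3}
total_before, total_after = 653, 333

sum_of_phases = sum(phase_saved.values())
total_diff = total_before - total_after
reduction = total_diff / total_before  # fraction of end-to-end time removed

print(f"sum of phases: {sum_of_phases:.1f} min, total difference: {total_diff} min")
print(f"projected end-to-end reduction: {reduction:.0%}")
```

The two figures agree to within the rounding implied by the table's `~` markers, and the projected run is about half as long as before.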
cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin