Skip calibrating with generated tokens in the calibration loop.

This issue tracks the optimization gained by skipping `_generate` during prompt calibration. For context, see the parent issue.

During prompt calibration, after prefilling the prompt tokens into the KV cache, `_generate` runs an autoregressive loop:

```python
while total_token_list[-1] != tokenizer.eos_id and num_tokens < max_seq_len:
    # generate one token per forward pass
```

For a 16-token prompt with `max_seq_len=1024`, this produced 546 forward passes before hitting EOS. All of that is wasted work, since the quantization observers already have sufficient activation statistics from the prefill pass.
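To make the cost concrete, here is a minimal sketch of a prompt-calibration step with an opt-out flag. Only `skip_generate`, `max_seq_len`, and the loop condition come from this issue; `calibrate_prompt`, `make_toy_forward`, and the single-pass prefill are hypothetical simplifications, not the actual implementation.

```python
from typing import Callable, List


def make_toy_forward(eos_id: int, eos_at_len: int) -> Callable[[List[int]], int]:
    """Hypothetical stand-in for a model: emits EOS once the sequence would reach eos_at_len."""
    def forward(tokens: List[int]) -> int:
        return eos_id if len(tokens) + 1 >= eos_at_len else 1
    return forward


def calibrate_prompt(
    forward: Callable[[List[int]], int],
    prompt_tokens: List[int],
    eos_id: int,
    max_seq_len: int,
    skip_generate: bool = False,
) -> int:
    """Return the number of forward passes spent calibrating on one prompt."""
    tokens = list(prompt_tokens)

    # Prefill: observers see the prompt activations here.
    # Assumption: modeled as a single pass; a real prefill may be chunked.
    next_token = forward(tokens)
    passes = 1

    if skip_generate:
        # Stop once the prompt has been observed.
        return passes

    # Autoregressive loop with the same exit condition as quoted above.
    tokens.append(next_token)
    while tokens[-1] != eos_id and len(tokens) < max_seq_len:
        tokens.append(forward(tokens))
        passes += 1
    return passes


toy = make_toy_forward(eos_id=0, eos_at_len=20)
with_generate = calibrate_prompt(toy, [1] * 16, eos_id=0, max_seq_len=1024)
without_generate = calibrate_prompt(toy, [1] * 16, eos_id=0, max_seq_len=1024, skip_generate=True)
print(with_generate, without_generate)
```

In this toy setup the extra passes scale with how long the model takes to emit EOS, while the skipped variant always costs just the prefill.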

Note: Task calibration is unaffected. Wikitext chunks already fill the context window (1023 of 1024 tokens), so `_generate` would exit after at most 1 step.
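The bound follows directly from the loop condition `num_tokens < max_seq_len`. A small sketch, assuming `num_tokens` starts at the input length and grows by one per forward pass (the helper name `max_generate_steps` is hypothetical):

```python
def max_generate_steps(num_input_tokens: int, max_seq_len: int) -> int:
    """Upper bound on autoregressive steps before the length check stops the loop."""
    return max(0, max_seq_len - num_input_tokens)


task_bound = max_generate_steps(1023, 1024)   # wikitext chunk
prompt_bound = max_generate_steps(16, 1024)   # 16-token user prompt
print(task_bound, prompt_bound)
```

The 546 passes observed for the 16-token prompt sit below its bound of 1008 because EOS was reached first.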

Both runs used `max_seq_len=1024`, `prefill_ar_len=128`, and `--tasks wikitext --limit 1` plus a user prompt.

| Phase | Baseline (min) | skip_generate=True (min) | Saved (min) |
|---|---|---|---|
| **DECODE** | | | |
| calibration (tasks) | 122.6 | 89.1 | n/a (variance) |
| calibration (prompts) (546 `_generate` fwd passes) | **37.1** | **0.5** | **36.6** |
| **PREFILL** | | | |
| calibration (tasks) | 5.8 | 5.4 | n/a |
| calibration (prompts) (546 `_generate` fwd passes) | **152.0** | **0.3** | **151.7** |
| **Lowering + QNN Compile** | | | |
| qnn_manager.Compile | 113.3 | 102.4 | n/a (variance) |
| **Total end-to-end** | **484 (8.1h)** | **235 (3.9h)** | **249 (4.2h)** |

Prompt calibration savings: **188.3 min (3.1h)**, from 189.1 min (37.1 + 152.0) down to 0.8 min (0.5 + 0.3).
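As a quick cross-check, the prompt-calibration figures can be recomputed from the table rows. The numbers are copied from the measurements above; the variable names are only illustrative.

```python
# Prompt-calibration minutes per phase, copied from the table above.
baseline = {"decode": 37.1, "prefill": 152.0}
skipped = {"decode": 0.5, "prefill": 0.3}

# Per-phase and combined savings, rounded to the table's precision.
saved = {phase: round(baseline[phase] - skipped[phase], 1) for phase in baseline}
total_before = round(sum(baseline.values()), 1)
total_after = round(sum(skipped.values()), 1)
total_saved = round(total_before - total_after, 1)
total_saved_hours = round(total_saved / 60, 1)

print(saved, total_before, total_after, total_saved, total_saved_hours)
```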

### Projected savings for qwen3-1_7b

| Phase | Before (min) | After (min) | Saved |
|---|---|---|---|
| DECODE prompt calibration | 40.3 | ~0.8 | 39.5 |
| PREFILL prompt calibration | 280.5 | ~0.2 | 280.3 |
| **Total** | **653 (10.9h)** | **~333 (5.6h)** | **~320 (5.3h)** |

---


cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin

Skip calibrating with generated tokens in the calibration loop. #17785
