Skip calibrating with generated tokens in the calibration loop. #17785

@abhinaykukkadapu

This issue tracks the calibration-time savings from skipping _generate. For context, see the parent issue.

During prompt calibration, after prefilling the prompt tokens into the KV cache, _generate runs an autoregressive loop:

while total_token_list[-1] != tokenizer.eos_id and num_tokens < max_seq_len:
    # generate one token per forward pass

For a 16-token prompt with max_seq_len=1024, this produces 546 forward passes before hitting EOS. All of that is wasted work, since the quantization observers already have sufficient activation statistics from the prefill pass.
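
To make the cost concrete, here is a minimal, self-contained sketch of the control flow described above. It is not the actual calibration code: `CountingModel`, `calibrate_prompt`, `EOS_ID`, and the chunked prefill are hypothetical stand-ins, and the sketch assumes `num_tokens` counts the tokens already in the KV cache. Only the loop condition and the `skip_generate` idea come from this issue.

```python
EOS_ID = 0  # hypothetical end-of-sequence token id


class CountingModel:
    """Toy stand-in for the observed model; it only counts forward passes."""

    def __init__(self, eos_after_decode_steps):
        self.eos_after = eos_after_decode_steps
        self.forward_passes = 0
        self.decode_steps = 0

    def forward(self, tokens, decoding=False):
        self.forward_passes += 1
        if decoding:
            self.decode_steps += 1
            if self.decode_steps >= self.eos_after:
                return EOS_ID
        return 1  # any non-EOS token id


def calibrate_prompt(model, prompt_tokens, max_seq_len, ar_len, skip_generate):
    # Prefill: push the prompt into the KV cache in chunks of ar_len tokens.
    next_token = None
    for start in range(0, len(prompt_tokens), ar_len):
        next_token = model.forward(prompt_tokens[start:start + ar_len])

    total_token_list = list(prompt_tokens)
    if skip_generate:
        # Observers have already seen the prefill activations; stop here.
        return total_token_list

    # Assumption: num_tokens is the number of tokens already in the KV cache.
    num_tokens = len(prompt_tokens)
    total_token_list.append(next_token)
    while total_token_list[-1] != EOS_ID and num_tokens < max_seq_len:
        # Generate one token per forward pass.
        total_token_list.append(model.forward([total_token_list[-1]], decoding=True))
        num_tokens += 1
    return total_token_list


prompt = list(range(1, 17))  # 16 non-EOS token ids

baseline = CountingModel(eos_after_decode_steps=546)
calibrate_prompt(baseline, prompt, max_seq_len=1024, ar_len=128, skip_generate=False)

skipped = CountingModel(eos_after_decode_steps=546)
calibrate_prompt(skipped, prompt, max_seq_len=1024, ar_len=128, skip_generate=True)

print(baseline.forward_passes, skipped.forward_passes)  # 547 1
```

In this toy setup the baseline pays one prefill pass plus 546 decode passes, while the skipped run pays only the prefill pass.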

Note: Task calibration is unaffected. Wikitext chunks fill the context window (1023 of 1024 tokens), so _generate exits after at most 1 step.
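
The bound behind this note follows directly from the loop condition. A small sketch, assuming `num_tokens` starts at the number of prefilled tokens and grows by one per iteration (`max_generate_steps` is a hypothetical helper, not part of the codebase):

```python
def max_generate_steps(num_prefilled_tokens: int, max_seq_len: int) -> int:
    """Upper bound on iterations of `while ... and num_tokens < max_seq_len`.

    Assumes num_tokens starts at the prefilled length and increases by one
    per generated token; hitting EOS earlier can only reduce the count.
    """
    return max(0, max_seq_len - num_prefilled_tokens)


print(max_generate_steps(1023, 1024))  # 1    (wikitext chunk)
print(max_generate_steps(16, 1024))    # 1008 (16-token user prompt)
```

Under that assumption a wikitext chunk leaves room for a single step, whereas a 16-token prompt can run for up to 1008 steps; the 546 passes reported above are where EOS happened to end the loop.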

Both runs: max_seq_len=1024, prefill_ar_len=128, --tasks wikitext --limit 1 + user prompt.

| Phase | Baseline (min) | skip_generate=True (min) | Saved |
| --- | --- | --- | --- |
| DECODE | | | |
| calibration (tasks) | 122.6 | 89.1 | - (variance) |
| calibration (prompts) (546 _generate fwd passes) | 37.1 | 0.5 | 36.6 min |
| PREFILL | | | |
| calibration (tasks) | 5.8 | 5.4 | |
| calibration (prompts) (546 _generate fwd passes) | 152.0 | 0.3 | 151.7 min |
| Lowering + QNN Compile | | | |
| qnn_manager.Compile | 113.3 | 102.4 | - (variance) |
| Total end-to-end | 484 min (8.1h) | 235 min (3.9h) | 249 min (4.2h) |

Prompt calibration savings: 188.3 min (3.1h), from 189.1 min down to 0.8 min.
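
The figure above can be reproduced from the prompt-calibration rows of the measurements. A short sketch of the arithmetic; the numbers are the reported minutes, and the variable names are illustrative only:

```python
# Prompt-calibration minutes as reported in this issue.
baseline_min = {"decode": 37.1, "prefill": 152.0}
skip_generate_min = {"decode": 0.5, "prefill": 0.3}

before = sum(baseline_min.values())
after = sum(skip_generate_min.values())
saved = before - after

# Reported end-to-end saving (484 min -> 235 min).
end_to_end_saved = 484 - 235
share = saved / end_to_end_saved

print(f"before={before:.1f} min, after={after:.1f} min")
print(f"saved={saved:.1f} min ({saved / 60:.1f}h)")
print(f"share of end-to-end saving: {share:.0%}")
```

The remainder of the 249-minute end-to-end difference comes from phases where the gap between the two runs is attributed to variance rather than to skipping _generate.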

Projected savings for qwen3-1_7b

| Phase | Before (min) | After (min) | Saved |
| --- | --- | --- | --- |
| DECODE prompt calibration | 40.3 | ~0.8 | 39.5 |
| PREFILL prompt calibration | 280.5 | ~0.2 | 280.3 |
| Total | 653 (10.9h) | ~333 (5.6h) | ~320 (5.3h) |

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin

Metadata

Labels

module: qnn (Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/)
partner: qualcomm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm)

Projects

Status

In progress
