Optimize decode loop in calibration #18065

@abhinaykukkadapu

Description

This issue tracks ways to optimize the decode bottleneck, which accounts for ~35% of the total end-to-end time, or ~50% of the calibration time.

| Phase | Time | % of total |
|---|---|---|
| torch.export | 25s | 0.2% |
| prepare_pt2e | 36s | 0.3% |
| SeqMSE grid search | 7,867s (2h 11m) | 59% |
| Remaining calibration | 5,145s | 39% |
| convert_pt2e | 18s | 0.1% |

Observations:

SeqMSE alone took 2+ hours, more than the rest of the pipeline combined. The flame graph confirms that _find_best_candidate dominates wall time (see fig. 1).

Possible optimizations:

SeqMSE time for Llama3.2-1B: ~23 min with coarse-to-fine vs ~90+ min estimated for the 1000-point sweep (~4x speedup); for Qwen3-0.6B, 2h 11m drops to ~20 min (~6.5x).

| Model | Baseline (1000-point brute force) | Coarse-to-fine (150 evals) |
|---|---|---|
| Qwen3-0.6B | 2h 11m | ~20 min |
| Llama3.2-1B | ~90 min | ~23 min |

1. Coarse + Fine approach:

  • Loss curves are smooth and monotonically decreasing toward the minimum: no local minima, no spikes.
  • First run 100 coarse steps with a step size of 0.01 each, then 50 fine steps around the best candidate found in the coarse pass; see the loss curves below for why this works.
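
Under the assumption above (a single smooth loss basin), the two-pass search can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual SeqMSE implementation; `eval_loss` stands in for the per-layer candidate scoring:

```python
# Sketch of the coarse-to-fine candidate search (hypothetical API;
# the real SeqMSE code lives in the quantizer).
import numpy as np

def coarse_to_fine_search(eval_loss, lo=0.0, hi=1.0,
                          coarse_steps=100, fine_steps=50):
    """Find the candidate minimizing eval_loss with ~150 evals
    instead of a 1000-point brute-force sweep."""
    # Pass 1: 100 coarse candidates across the full range (step ~0.01).
    coarse = np.linspace(lo, hi, coarse_steps)
    losses = np.array([eval_loss(c) for c in coarse])
    best = int(np.argmin(losses))
    # Pass 2: 50 fine candidates in the bracket around the coarse winner.
    left = coarse[max(best - 1, 0)]
    right = coarse[min(best + 1, coarse_steps - 1)]
    fine = np.linspace(left, right, fine_steps)
    fine_losses = np.array([eval_loss(c) for c in fine])
    return fine[int(np.argmin(fine_losses))]
```

This only works because the curves are unimodal: if the coarse pass could land in a spurious local dip, the fine pass would refine the wrong bracket.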

Llama 3.2 1B

[loss curve images]

Qwen 2.5 1B

[loss curve images]

2. Parallelize the computation

(Yet to explore)
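
One possible shape for this (a sketch, not a concrete plan): candidate losses are independent of each other, so the per-candidate evaluations can be fanned out to workers. Threads help when `eval_loss` releases the GIL (e.g. dispatches into a native backend); for pure-Python losses a `ProcessPoolExecutor` would be needed instead. Names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def parallel_grid_search(eval_loss, candidates, max_workers=8):
    """Evaluate all candidates concurrently and return the best one.

    eval_loss and candidates are placeholders for the per-layer loss
    and the scale grid in the real SeqMSE loop.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(eval_loss, candidates))
    return candidates[int(np.argmin(losses))]
```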

3. Vectorize the computation for multiple steps

(Yet to explore)
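
A rough numpy sketch of what vectorization could look like (the real code would operate on torch tensors): score every candidate scale in one broadcasted operation instead of a Python loop over candidates. The quantization details (symmetric int8 fake-quant) are assumptions for illustration:

```python
import numpy as np

def batched_quant_mse(weight, scales, qmin=-128, qmax=127):
    """Per-candidate quantization MSE for all scales at once.

    weight: (out, in) float array; scales: (C,) candidate scales.
    Returns a (C,) array of reconstruction MSEs.
    """
    s = scales[:, None, None]                      # (C, 1, 1) for broadcasting
    q = np.clip(np.round(weight / s), qmin, qmax)  # fake-quantize under each scale
    dq = q * s                                     # dequantize, shape (C, out, in)
    return ((dq - weight) ** 2).mean(axis=(1, 2))
```

The trade-off is memory: the intermediate has shape (C, out, in), so large layers may need the candidate axis chunked.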

Flamegraph of the decode loop (fig 1)

[flamegraph image]

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin

Metadata

Labels

module: qnn — Issues related to Qualcomm's QNN delegate and code under backends/qualcomm
partner: qualcomm — For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm
