Open
Labels
- module: qnn (Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/)
- partner: qualcomm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm)
Description
This task is to discuss ways to optimize the decode bottleneck, which accounts for ~35% of the total end-to-end time, or ~50% of the calibration time.
| Phase | Time | % of total |
|---|---|---|
| torch.export | 25s | 0.2% |
| prepare_pt2e | 36s | 0.3% |
| SeqMSE grid search | 7,867s (2h 11m) | 59% |
| Remaining calibration | 5,145s | 39% |
| convert_pt2e | 18s | 0.1% |
Observations:
SeqMSE alone took 2+ hours, more than the rest of the pipeline combined. The flame graph (fig 1) confirms that _find_best_candidate dominates wall time.
Possible optimizations:
SeqMSE time for Llama3.2-1B: ~23 min with coarse-to-fine vs. an estimated ~90+ min at 1000 points. For Qwen3-0.6B, 2h 11m drops to ~20 min, a ~6.5x speedup.
| Model | Baseline (1000-point brute force) | Coarse-to-fine (150 evals) |
|---|---|---|
| Qwen3-0.6B | 2h 11m | ~20 min |
| Llama3.2-1B | ~90 min | ~23 min |
1. Coarse + Fine approach:
- Loss curves are smooth and monotonically decreasing, no local minima, no spikes.
- First we run 100 coarse steps with a step size of 0.01, then 50 fine steps around the best candidate found in the coarse pass. See the loss curves below for why this works.
Loss curve: Llama 3.2 1B
Loss curve: Qwen 2.5 1B
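The coarse + fine search described above could be sketched roughly as follows. This is a minimal illustration, not the code in backends/qualcomm/: `coarse_to_fine_search`, `loss_fn`, and `toy_loss` are hypothetical names, and the fine window (one coarse step on either side of the coarse winner, sampled at 50 evenly spaced points) is an assumption, since the exact window is not specified here. It relies on the loss curve having a single smooth minimum, as observed above.

```python
from typing import Callable, Tuple


def coarse_to_fine_search(
    loss_fn: Callable[[float], float],
    coarse_steps: int = 100,
    coarse_step: float = 0.01,
    fine_steps: int = 50,
) -> Tuple[float, float, int]:
    """Return (best_candidate, best_loss, number_of_loss_evaluations)."""
    # Coarse pass: candidates 0.01, 0.02, ..., 1.00 with the defaults.
    coarse = [coarse_step * (i + 1) for i in range(coarse_steps)]
    best_loss, best = min((loss_fn(c), c) for c in coarse)
    n_evals = coarse_steps

    # Fine pass (assumed window): one coarse step on either side of the
    # coarse winner, clamped to the range covered by the coarse grid.
    lo = max(best - coarse_step, coarse[0])
    hi = min(best + coarse_step, coarse[-1])
    for i in range(fine_steps):
        c = lo + (hi - lo) * i / (fine_steps - 1)
        loss = loss_fn(c)
        n_evals += 1
        if loss < best_loss:
            best_loss, best = loss, c
    return best, best_loss, n_evals


# Toy stand-in for the per-candidate reconstruction loss: one smooth minimum.
def toy_loss(c: float) -> float:
    return (c - 0.7334) ** 2


best, best_loss, n_evals = coarse_to_fine_search(toy_loss)
print(best, best_loss, n_evals)
```

With the defaults this performs 100 + 50 = 150 loss evaluations instead of 1000. If the real loss curve had several minima, the coarse pass could lock onto the wrong basin, so the smoothness observation is what makes the shortcut safe.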
2. Parallelize the computation
< Yet to explore>
3. Vectorize the computation for multiple steps
< Yet to explore>
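One possible shape for option 3, offered as an unvalidated sketch: evaluate many candidate steps in a single batched operation instead of one Python loop iteration per candidate. The example uses NumPy and a simple symmetric fake-quantizer for illustration; `fake_quant`, `losses_loop`, and `losses_vectorized` are hypothetical helpers and do not reflect how _find_best_candidate is actually implemented.

```python
import numpy as np


def fake_quant(w, max_abs, n_bits=8):
    """Symmetric fake-quantization; max_abs may be a scalar or broadcastable array."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def losses_loop(w, x, ratios):
    """Baseline: one quantize + matmul per candidate clipping ratio."""
    ref = x @ w.T
    base = np.abs(w).max()
    return np.array(
        [np.mean((x @ fake_quant(w, base * r).T - ref) ** 2) for r in ratios]
    )


def losses_vectorized(w, x, ratios):
    """Same losses, with all K candidates evaluated in one batched matmul."""
    ref = x @ w.T                                # (N, out)
    base = np.abs(w).max()
    max_abs = (base * ratios)[:, None, None]     # (K, 1, 1)
    wq = fake_quant(w[None], max_abs)            # (K, out, in)
    out = x @ np.swapaxes(wq, 1, 2)              # (K, N, out)
    return np.mean((out - ref) ** 2, axis=(1, 2))


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))    # toy weight: (out_features, in_features)
x = rng.standard_normal((4, 16))    # toy calibration inputs
ratios = np.linspace(0.01, 1.0, 100)

loop_losses = losses_loop(w, x, ratios)
vec_losses = losses_vectorized(w, x, ratios)
best_ratio = ratios[np.argmin(vec_losses)]
```

The trade-off is memory: the batched version materializes K quantized copies of the weight at once, so for large layers the candidates would likely need to be processed in chunks rather than all together.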
Flamegraph of the decode loop (fig 1)
cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin
Status: In progress