
Conversation

@DannyYuyang-quic (Contributor) commented Oct 8, 2025

Summary

change the llama tutorial to static llama version

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @cbilgin


pytorch-bot bot commented Oct 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14887

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 3c15e00 with merge base 400b2a5:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the `CLA Signed` label Oct 8, 2025
@DannyYuyang-quic (Contributor Author)

@pytorchbot label "release notes: qualcomm"

@pytorch-bot bot added the `release notes: qualcomm` label Oct 8, 2025
@DannyYuyang-quic (Contributor Author)

Hi @cccclai, we’re considering updating the Llama 8B tutorial to Llama 3B Instruct with the static llama version, since we’re currently validating the static Llama 3B Instruct setup.
What’s your perspective on this?

Thanks!!

cc: @haowhsu-quic

@mergennachin added the `partner: qualcomm` label Oct 8, 2025
# The enabled Llama3_2 model should be the instruct variant; however, Llama's
# tokenizer does not provide a utility to apply the chat template.
instruct_model = False

# One shard suffices for the smaller models; the 8B model uses num_sharding = 4 (see the thread below).
num_sharding = 1
Contributor

Should num_sharding here be 4, according to the steps above?

Contributor

It's for the 8B model only.

Contributor Author

I mistakenly pasted the config for the 1B model. It should be for the 3B model instead. I'll update the config accordingly.

Contributor Author

updated the config.


If you encounter any issues while reproducing the tutorial, please file a GitHub
issue on the ExecuTorch repo and tag it with the `#qcom_aisw` tag.


3. SeqMSE Quantization: optimizes the parameter encodings of each layer of a model individually to minimize the difference between the layer's original and quantized outputs. SeqMSE uses a search-based approach with `seq_mse_candidates` = 1000. (Implementation details: [SeqMSE pass](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/_passes/seq_mse.py); an illustrative sketch follows this excerpt.)

4. Model Sharding: Set `num_sharding` = 4 to shard the model into sub-parts; the appropriate number of shards may differ depending on the model size. This helps reduce memory pressure and improve performance during on-device inference. (A toy sketch also follows this excerpt.)
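
To make the SeqMSE idea concrete, here is a minimal, illustrative Python sketch, not the actual [SeqMSE pass](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/_passes/seq_mse.py) implementation: for one linear layer it sweeps candidate quantization scales and keeps the one whose quantized output stays closest (in MSE) to the original output. The candidate sweep, the symmetric int8 scheme, and the names (`seq_mse_scale`, `num_candidates`) are assumptions for illustration only.

```python
import numpy as np

def seq_mse_scale(weight: np.ndarray, activation: np.ndarray,
                  num_candidates: int = 1000) -> float:
    """Pick the weight-quantization scale for one linear layer that
    minimizes MSE between the original and quantized layer outputs."""
    original_out = activation @ weight
    max_abs = np.abs(weight).max()
    best_scale, best_mse = max_abs / 127.0, np.inf
    # Sweep num_candidates scales from 20% to 100% of the full weight range.
    for frac in np.linspace(0.2, 1.0, num_candidates):
        scale = frac * max_abs / 127.0          # symmetric int8 grid (assumed)
        q_weight = np.clip(np.round(weight / scale), -127, 127) * scale
        mse = np.mean((activation @ q_weight - original_out) ** 2)
        if mse < best_mse:
            best_scale, best_mse = scale, mse
    return best_scale
```

And a toy sketch of what `num_sharding` does conceptually; the real pass partitions the exported graph, not a Python list:

```python
def shard_layers(layers, num_sharding=4):
    """Partition decoder layers into num_sharding contiguous groups,
    so each sub-model fits more comfortably in memory."""
    per_shard = -(-len(layers) // num_sharding)  # ceiling division
    return [layers[i:i + per_shard] for i in range(0, len(layers), per_shard)]
```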
Contributor

qualify this comment to suggest # of shards might be different depending on the model size

Contributor Author

Thanks for the suggestion! I’ll add the comment.

-- artifact/
└── llama_qnn.pte

**3.3 Upload model, tokenizer and llama runner binary to phone**
Contributor

Why were these instructions removed?

Contributor Author

Because the upload step is already included in the script.
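
For readers following the tutorial, a minimal sketch of what that upload step might look like, assuming `adb` is on the PATH; the artifact names besides `llama_qnn.pte` and the on-device directory are assumptions, and the example script in the repo remains the source of truth:

```python
import subprocess

DEVICE_DIR = "/data/local/tmp/llama"  # assumed on-device working directory

# Hypothetical artifact paths; the actual script derives these itself.
artifacts = [
    "artifact/llama_qnn.pte",  # compiled model, matching the tree above
    "tokenizer.model",         # tokenizer file (name assumed)
    "qnn_llama_runner",        # llama runner binary (name assumed)
]

subprocess.run(["adb", "shell", "mkdir", "-p", DEVICE_DIR], check=True)
for path in artifacts:
    # adb push copies each artifact onto the device.
    subprocess.run(["adb", "push", path, DEVICE_DIR], check=True)
```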

@cccclai merged commit bdc526b into pytorch:main Oct 9, 2025
133 of 135 checks passed
@cccclai (Contributor) commented Oct 9, 2025

@pytorchbot cherry-pick --onto release/1.0 -c docs

pytorchbot pushed a commit that referenced this pull request Oct 9, 2025
… version (#14887)

### Summary
change the llama tutorial to static llama version

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @cbilgin

(cherry picked from commit bdc526b)
@pytorchbot (Collaborator)

Cherry picking #14887

The cherry pick PR is at #14949. The following tracker issues are updated:

Details for Dev Infra team (raised by workflow job).
