Qualcomm AI Engine Direct - change the llama tutorial to static llama version #14887
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14887
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures as of commit 3c15e00 with merge base 400b2a5. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Hi @cccclai, we’re considering updating the Llama 8B tutorial to Llama 3B Instruct with the static llama version, since we’re currently validating the static Llama 3B Instruct setup. Thanks!! cc: @haowhsu-quic
# The Llama3_2 model enabled here should be the instruct variant; however,
# Llama's tokenizer does not provide a utility to apply the chat template.
instruct_model = False

num_sharding = 1
Should num_sharding here be 4, according to the steps above?
It's for the 8B model only.
I mistakenly pasted the config for the 1B model. It should be for the 3B model instead. I’ll update the config accordingly.
Updated the config.
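Regarding the `instruct_model = False` comment in the excerpt above: since the tokenizer offers no chat-template utility, the instruct prompt format would have to be applied by hand. A minimal sketch of that wrapping for the published Llama 3 instruct prompt format (the helper name is illustrative and is not provided by the tutorial or this PR):

```python
# Sketch: manually wrap a prompt in the Llama 3 instruct chat template.
# The special tokens follow the published Llama 3 prompt format; this helper
# is illustrative and not something the tutorial or this PR provides.
def apply_llama3_chat_template(user_prompt: str, system_prompt: str = "") -> str:
    parts = ["<|begin_of_text|>"]
    if system_prompt:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>"
        )
    parts.append(
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_prompt}<|eot_id|>"
    )
    # Leave the assistant header open so generation continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(apply_llama3_chat_template("What is the capital of France?"))
```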
docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md
If you encounter any issues while reproducing the tutorial, please file a GitHub issue on the ExecuTorch repo and use the `#qcom_aisw` tag.
Add a link to https://github.com/pytorch/executorch/issues
3. SeqMSE Quantization: optimizes the parameter encodings of each layer of a model individually to minimize the difference between the layer’s original and quantized outputs. SeqMSE uses a search-based approach with `seq_mse_candidates` = 1000. (Implementation details: [SeqMSE pass](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/_passes/seq_mse.py))
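The description above amounts to a per-layer search over candidate quantization scales. A toy NumPy sketch of that idea (the names and the uniform candidate grid are illustrative assumptions; the actual pass linked above operates on the exported graph, not raw arrays):

```python
import numpy as np

# Toy SeqMSE-style search: for one layer, try many candidate quantization
# scales and keep the one whose quantized output is closest (in MSE) to the
# float output. Illustrative stand-in for the linked SeqMSE pass.
def seq_mse_search(weight, activations, seq_mse_candidates=1000, bits=8):
    qmax = 2 ** (bits - 1) - 1
    float_out = activations @ weight
    max_abs = np.abs(weight).max()
    best_scale, best_mse = None, np.inf
    for i in range(1, seq_mse_candidates + 1):
        scale = max_abs * i / (seq_mse_candidates * qmax)
        # Quantize-dequantize the weights with this candidate scale.
        q_weight = np.clip(np.round(weight / scale), -qmax - 1, qmax) * scale
        mse = np.mean((activations @ q_weight - float_out) ** 2)
        if mse < best_mse:
            best_scale, best_mse = scale, mse
    return best_scale, best_mse


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
x = rng.normal(size=(8, 64))
scale, mse = seq_mse_search(w, x)
print(f"best scale={scale:.6f}, mse={mse:.6e}")
```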
4. Model Sharding: Set `num_sharding` = 4 to shard the model into sub-parts. This helps reduce memory pressure and improve performance during on-device inference.
qualify this comment to suggest # of shards might be different depending on the model size
Thanks for the suggestion! I’ll add the comment.
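To make the sharding knob concrete, here is a toy illustration of splitting a stack of decoder layers into `num_sharding` contiguous groups, each of which would be lowered as its own sub-model (the layer count and names are illustrative; the real partitioning is done inside the Qualcomm backend):

```python
# Toy illustration of model sharding: split decoder layers into contiguous
# groups; each group would be compiled and executed as a separate sub-model.
def shard_layers(layers, num_sharding):
    per_shard = -(-len(layers) // num_sharding)  # ceiling division
    return [layers[i:i + per_shard] for i in range(0, len(layers), per_shard)]


layers = [f"decoder_layer_{i}" for i in range(28)]  # illustrative layer count
for idx, shard in enumerate(shard_layers(layers, num_sharding=4)):
    print(f"shard {idx}: {shard[0]} .. {shard[-1]} ({len(shard)} layers)")
```

As the review thread notes, the shard count depends on the model size: the smaller configs above use `num_sharding = 1`, while the 8B tutorial uses 4.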
artifact/
└── llama_qnn.pte

**3.3 Upload model, tokenizer and llama runner binary to phone**
Why were these instructions removed?
Because the upload step is already included in the script.
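For readers who want to perform the upload manually anyway, it boils down to pushing the artifacts to the device, e.g. via `adb push`. A minimal sketch (the device directory and file list are illustrative assumptions, not taken from the script):

```python
import subprocess

# Manual equivalent of the upload step the script already performs: push the
# compiled model and tokenizer to the device. Paths are illustrative.
DEVICE_DIR = "/data/local/tmp/llama"  # assumed working directory on device

for local_path in ["artifact/llama_qnn.pte", "tokenizer.model"]:
    subprocess.run(["adb", "push", local_path, DEVICE_DIR], check=True)
```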
@pytorchbot cherry-pick --onto release/1.0 -c docs
Qualcomm AI Engine Direct - change the llama tutorial to static llama version (#14887)

### Summary
Change the llama tutorial to the static llama version.

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @cbilgin

(cherry picked from commit bdc526b)
Cherry picking #14887: the cherry-pick PR is at #14949. The following tracker issues are updated:
Summary
Change the llama tutorial to the static llama version.
cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @cbilgin