Qualcomm AI Engine Direct - Add the tutorial to deploy llama3 8B Instruct #5335
Conversation
✅ No failures as of commit 2876d16 with merge base 9256b4a.
Hi @cccclai, this PR adds a document showing how to export and run Llama 3 8B Instruct. Thanks :)
cccclai left a comment:
Thanks!
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
- Follow [the README for executorch llama](https://github.com/pytorch/executorch/tree/main/examples/models/llama2) to learn how to run a llama model on mobile via ExecuTorch.
- A Qualcomm device with 16GB RAM
  - We are continuing to optimize our memory usage to ensure compatibility with lower-memory devices.
- The version of the [Qualcomm AI Engine Direct SDK](https://developer.qualcomm.com/software/qualcomm-ai-engine-direct-sdk) is 2.25.0 or above.
Might want to point to the 2.26 version, or convolutions in your case (converted from linear with no bias) will fail to lower.
Thanks for your reminder!
- [QNN 2.25.0](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.25.0.240728.zip)
- [QNN 2.24.0](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.24.0.240626.zip)
- [QNN 2.23.0](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.23.0.24.06.24.zip)
- Note that the convolution op might fail for QNN 2.25.0.
Does it mean 2.26 is the preferred version?
Yes, I think it is the preferred version, because we found the convolution failure in QNN 2.25.
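For readers following along, here is a minimal sanity-check sketch for the installed SDK version. It is not part of the tutorial; the `QNN_SDK_ROOT` environment variable, the `sdk.yaml` filename, and its `version:` field are assumptions about the SDK layout, so adjust it to your install.

```python
import os
import re

# Hypothetical sanity check: warn if the QNN SDK pointed to by QNN_SDK_ROOT looks
# older than 2.26. The sdk.yaml filename and its "version:" field are assumptions
# about the SDK layout; adjust to your install.
sdk_root = os.environ.get("QNN_SDK_ROOT", "")
version = None
yaml_path = os.path.join(sdk_root, "sdk.yaml")
if os.path.isfile(yaml_path):
    with open(yaml_path) as f:
        match = re.search(r"version:\s*([\d.]+)", f.read())
        version = match.group(1) if match else None

if version is None:
    print("Could not determine QNN SDK version; expecting 2.26 or newer.")
elif tuple(int(x) for x in version.split(".")[:2]) < (2, 26):
    print(f"Warning: QNN SDK {version} detected; 2.26+ is recommended (convolution fails on 2.25).")
else:
    print(f"QNN SDK {version} detected; OK.")
```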
## What is coming?

- [llama2 and llama3](https://github.com/pytorch/executorch/pull/4030). Note that at the time of writing, we still suffer from quantization issues in the llama2-7B and llama3-8B cases; only storiesllama works well.
- Improve the performance for llama3-8B-Instruct and support bert-mode.
bert-mode isn't a common term; batch prefill is probably a better name.
Thanks for your advice.
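To illustrate what batch prefill means here, the sketch below contrasts it with the token-by-token decode loop. It is illustrative only and not the ExecuTorch runner API; `model` and its `(tokens, kv_cache)` call signature are hypothetical placeholders.

```python
import torch

# Illustrative sketch only (not the ExecuTorch runner API): "batch prefill" runs
# one forward pass over the whole prompt to fill the KV cache, instead of feeding
# prompt tokens one at a time. `model` and its call signature are hypothetical.
def generate(model, prompt_tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Batch prefill: a single forward pass over all prompt tokens at once.
    logits, kv_cache = model(prompt_tokens, kv_cache=None)
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(max_new_tokens - 1):
        # Decode: one token per forward pass, reusing the KV cache.
        logits, kv_cache = model(next_token, kv_cache=kv_cache)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
    return torch.cat(generated, dim=1)
```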
I'm hitting a missing rms norm op on the main branch. Does it mean we fail to lower it somehow?
Could you please check whether there is any error about rms norm op validation? It should work with QNN 2.26.
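As a generic way to spot this, a minimal sketch that lists the operators left un-delegated after lowering is shown below. It assumes `edge_program` is the `EdgeProgramManager` produced by the usual ExecuTorch export flow after partitioning; the helper name is illustrative and not part of the tutorial.

```python
from executorch.exir import EdgeProgramManager

def list_non_delegated_ops(edge_program: EdgeProgramManager) -> list[str]:
    """Return operator targets left on CPU (not lowered to the QNN delegate).

    Minimal sketch, assuming `edge_program` comes from the usual ExecuTorch
    export flow after partitioning; the function name is illustrative.
    """
    graph = edge_program.exported_program().graph
    return sorted(
        {
            str(node.target)
            for node in graph.nodes
            if node.op == "call_function"
            and "executorch_call_delegate" not in str(node.target)
        }
    )
```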
I actually had the same problem with QNN 2.23, but 2.25 doesn't have this issue. However, for 2.25, as I mentioned in the other thread, the export didn't actually quantize the model; instead it just upcast it to fp32.
If possible, could you try again with QNN 2.26?
Confirmed QNN 2.26 works and the model exported at the expected size.