7 changes: 6 additions & 1 deletion examples/models/llama/README.md
@@ -136,6 +136,8 @@ Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus
</em>
</p>

[Please visit this section to try it on non-CPU backends, including CoreML, MPS, Qualcomm HTP, and MediaTek](non_cpu_backends.md).

# Instructions

## Tested on
@@ -242,6 +244,9 @@ You can export and run the original Llama 3 8B instruct model.

Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.


If you're interested in deploying on non-CPU backends, [please refer to the non-CPU backends section](non_cpu_backends.md).

## Step 3: Run on your computer to validate

1. Build ExecuTorch with optimized CPU performance as follows. Build options are available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
@@ -261,7 +266,7 @@ You can export and run the original Llama 3 8B instruct model.

cmake --build cmake-out -j16 --target install --config Release
```
-   Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the session of Common Issues and Mitigations below for solutions.
+   Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the section of Common Issues and Mitigations below for solutions.

2. Build llama runner.
```
1 change: 1 addition & 0 deletions examples/models/llama/UTILS.md
@@ -37,6 +37,7 @@ For CoreML, there are 2 additional optional arguments:
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though)
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML
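As a quick sketch, the two flags can be combined in a single export invocation (the module path and the placeholder checkpoint/params names here are assumptions; the full 8B recipes are linked below):

```
# Sketch: CoreML export targeting iOS 18 with per-block 4-bit weight-only quantization.
# <checkpoint.pth> and <params.json> are placeholders for your model files.
python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
```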

To deploy the large 8B model on the above backends, [please visit this section](non_cpu_backends.md).

## Download models from Hugging Face and convert from safetensor format to state dict

24 changes: 24 additions & 0 deletions examples/models/llama/non_cpu_backends.md
@@ -0,0 +1,24 @@

# Running Llama 3/3.1 8B on non-CPU backends

### QNN
Please follow [the instructions](https://pytorch.org/executorch/stable/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.html) to deploy Llama 3 8B to an Android smartphone with Qualcomm SoCs.
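For orientation only, here is a rough sketch of what the export step looks like, patterned on the MPS and CoreML commands below; the `--qnn` and `--pt2e_quantize qnn_16a4w` flags are assumptions on our part, so take the authoritative command from the linked tutorial:

```
# Rough sketch; see the linked tutorial for the exact flags.
# --qnn targets the Qualcomm backend; qnn_16a4w denotes 16-bit activation / 4-bit weight quantization.
python -m examples.models.llama.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w
```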

### MPS
Export:
```
python -m examples.models.llama.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
```

After exporting the MPS model .pte file, the [iOS LLAMA](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) app can run the model. `--embedding-quantize 4,32` is an optional argument that quantizes the embeddings to further reduce the model size.
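To sanity-check the exported file on a Mac before moving to the app, one option is the desktop llama runner, assuming it was built with the MPS backend enabled (`-DEXECUTORCH_BUILD_MPS=ON`); the binary location and file names below are placeholders and may vary by build:

```
# Sketch: validating the MPS-delegated .pte with the desktop llama runner.
# Assumes a runner build configured with -DEXECUTORCH_BUILD_MPS=ON.
cmake-out/examples/models/llama/llama_main --model_path=<llama3_mps.pte> --tokenizer_path=<tokenizer.model> --prompt="Once upon a time"
```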

### CoreML
Export:
```
python -m examples.models.llama.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
```

After exporting the CoreML model .pte file, please [follow the instructions to build the llama runner](https://github.com/pytorch/executorch/tree/main/examples/models/llama#step-3-run-on-your-computer-to-validate) with the CoreML flags enabled, as described there.
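As a sketch, the configure step with the CoreML backend switched on might look like the following (the `EXECUTORCH_BUILD_COREML` option name is taken from the top-level CMakeLists.txt; other options mirror Step 3 of the README):

```
# Sketch: enable the CoreML backend when configuring the runner build.
cmake -DCMAKE_BUILD_TYPE=Release -DEXECUTORCH_BUILD_COREML=ON -Bcmake-out .
cmake --build cmake-out -j16 --target install --config Release
```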

### MTK
Please [follow the instructions](https://github.com/pytorch/executorch/tree/main/examples/mediatek#llama-example-instructions) to deploy Llama 3 8B to an Android phone with a MediaTek chip.