7 changes: 6 additions & 1 deletion examples/models/llama/README.md
@@ -136,6 +136,8 @@ Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus
</em>
</p>

[Please visit this section to try it on non-CPU backends, including CoreML, MPS, Qualcomm HTP, and MediaTek](non_cpu_backends.md).

# Instructions

## Tested on
@@ -242,6 +244,9 @@ You can export and run the original Llama 3 8B instruct model.

Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.


If you're interested in deploying on non-CPU backends, [please refer to the non-CPU backends section](non_cpu_backends.md).

## Step 3: Run on your computer to validate

1. Build ExecuTorch with optimized CPU performance as follows. Build options are available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
@@ -261,7 +266,7 @@ You can export and run the original Llama 3 8B instruct model.

cmake --build cmake-out -j16 --target install --config Release
```
-   Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the session of Common Issues and Mitigations below for solutions.
+   Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the section of Common Issues and Mitigations below for solutions.

2. Build llama runner.
```
1 change: 1 addition & 0 deletions examples/models/llama/UTILS.md
@@ -37,6 +37,7 @@ For CoreML, there are 2 additional optional arguments:
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though)
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML
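As a quick sketch, the two flags can be combined in a single export invocation (the module path and the placeholder checkpoint/params names here are assumptions; the full 8B recipes are linked below):

```
# Sketch: CoreML export targeting iOS 18 with per-block 4-bit weight-only quantization.
# <checkpoint.pth> and <params.json> are placeholders for your model files.
python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
```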

To deploy the large 8B model on the above backends, [please visit this section](non_cpu_backends.md).

## Download models from Hugging Face and convert from safetensor format to state dict

24 changes: 24 additions & 0 deletions examples/models/llama/non_cpu_backends.md
@@ -0,0 +1,24 @@

# Running Llama 3/3.1 8B on non-CPU backends

### QNN
Please follow [the instructions](https://pytorch.org/executorch/stable/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.html) to deploy Llama 3 8B to an Android smartphone with Qualcomm SoCs.
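For orientation only, here is a rough sketch of what the export step looks like, patterned on the MPS and CoreML commands below; the `--qnn` and `--pt2e_quantize qnn_16a4w` flags are assumptions on our part, so take the authoritative command from the linked tutorial:

```
# Rough sketch; see the linked tutorial for the exact flags.
# --qnn targets the Qualcomm backend; qnn_16a4w denotes 16-bit activation / 4-bit weight quantization.
python -m examples.models.llama.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w
```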

### MPS
Export:
```
python -m examples.models.llama.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
```

After exporting the MPS model .pte file, the [iOS LLAMA](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) app can run the model. `--embedding-quantize 4,32` is an optional argument that quantizes the embeddings to further reduce the model size.
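To sanity-check the exported file on a Mac before moving to the app, one option is the desktop llama runner, assuming it was built with the MPS backend enabled (`-DEXECUTORCH_BUILD_MPS=ON`); the binary location and file names below are placeholders and may vary by build:

```
# Sketch: validating the MPS-delegated .pte with the desktop llama runner.
# Assumes a runner build configured with -DEXECUTORCH_BUILD_MPS=ON.
cmake-out/examples/models/llama/llama_main --model_path=<llama3_mps.pte> --tokenizer_path=<tokenizer.model> --prompt="Once upon a time"
```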

### CoreML
Export:
```
python -m examples.models.llama.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
```

After exporting the CoreML model .pte file, please [follow the instructions to build the llama runner](https://github.com/pytorch/executorch/tree/main/examples/models/llama#step-3-run-on-your-computer-to-validate) with the CoreML flags enabled, as described there.
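As a sketch, the configure step with the CoreML backend switched on might look like the following (the `EXECUTORCH_BUILD_COREML` option name is taken from the top-level CMakeLists.txt; other options mirror Step 3 of the README):

```
# Sketch: enable the CoreML backend when configuring the runner build.
cmake -DCMAKE_BUILD_TYPE=Release -DEXECUTORCH_BUILD_COREML=ON -Bcmake-out .
cmake --build cmake-out -j16 --target install --config Release
```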

### MTK
Please [follow the instructions](https://github.com/pytorch/executorch/tree/main/examples/mediatek#llama-example-instructions) to deploy Llama 3 8B to an Android phone with a MediaTek chip.