From 19ea439140fd7db43c11d3639c8f3a1d4c259274 Mon Sep 17 00:00:00 2001
From: Lucy Qiu
Date: Wed, 24 Apr 2024 12:45:41 -0700
Subject: [PATCH] llama2 readme (#3315)

Summary:
- add note for embedding quantize, for llama3
- re-order export args to be the same as llama2, group_size missing `--`

Pull Request resolved: https://github.com/pytorch/executorch/pull/3315

Reviewed By: cccclai

Differential Revision: D56528535

Pulled By: lucylq

fbshipit-source-id: 4453070339ebdb3d782b45f96fe43d28c7006092
(cherry picked from commit 34f59edd8670516774c94b5e3cf4dee26f96dc70)
---
 examples/models/llama2/README.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/examples/models/llama2/README.md b/examples/models/llama2/README.md
index 6bc7dd0fa0b..99f42f9da10 100644
--- a/examples/models/llama2/README.md
+++ b/examples/models/llama2/README.md
@@ -100,6 +100,18 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
     python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
     ```
 
+### Option C: Download and export Llama3 8B model
+
+You can export and run the original Llama3 8B model.
+
+1. Llama3 pretrained parameters can be downloaded from [Meta's official llama3 repository](https://github.com/meta-llama/llama3/).
+
+2. Export the model and generate a `.pte` file:
+    ```
+    python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+    ```
+
+    Due to the larger vocabulary size of Llama3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` to further reduce the model size.
 
 ## (Optional) Finetuning
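
As context for the `--embedding-quantize 4,32` flag added above, the following is a minimal sketch, not the ExecuTorch implementation, of group-wise 4-bit quantization of an embedding table with group size 32. The function names and demo tensor sizes are illustrative assumptions; the sketch only shows why storing 4-bit codes plus one scale per 32-element group shrinks a large embedding table such as Llama3's.

```python
# Illustrative sketch of group-wise 4-bit embedding quantization
# (hypothetical helpers, not the ExecuTorch API).
import torch

def quantize_embedding_groupwise(weight: torch.Tensor, bits: int = 4, group_size: int = 32):
    """Quantize each row of `weight` in contiguous groups of `group_size` values."""
    vocab_size, dim = weight.shape
    assert dim % group_size == 0, "embedding dim must be divisible by the group size"
    qmax = 2 ** (bits - 1) - 1   # 7 for signed 4-bit values
    qmin = -(2 ** (bits - 1))    # -8

    groups = weight.reshape(vocab_size, dim // group_size, group_size)
    # One scale per group, chosen so the largest magnitude maps to qmax.
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / qmax
    q = torch.clamp(torch.round(groups / scales), qmin, qmax).to(torch.int8)
    return q.reshape(vocab_size, dim), scales.squeeze(-1)

def dequantize_embedding_groupwise(q: torch.Tensor, scales: torch.Tensor, group_size: int = 32):
    """Reconstruct an approximate fp32 embedding table from the codes and scales."""
    vocab_size, dim = q.shape
    groups = q.reshape(vocab_size, dim // group_size, group_size).to(torch.float32)
    return (groups * scales.unsqueeze(-1)).reshape(vocab_size, dim)

# Small demo table; the real Llama3 embedding is roughly 128k rows x 4096 columns.
emb = torch.randn(1024, 4096)
q, scales = quantize_embedding_groupwise(emb)
approx = dequantize_embedding_groupwise(q, scales)
print("mean abs reconstruction error:", (emb - approx).abs().mean().item())
```

At Llama3's scale (a vocabulary of roughly 128k entries, consistent with the `get_bos_id`/`get_eos_id` values of 128000/128001 in the metadata above), replacing fp32 embedding values with 4-bit codes plus per-group scales cuts embedding storage by roughly 6-8x depending on how the scales are stored, which is why the README recommends the flag for Llama3.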