From 43a69ba255678fae9775a2dc55d61d7120ddcd8c Mon Sep 17 00:00:00 2001
From: Jack Zhang
Date: Mon, 7 Oct 2024 17:49:48 -0700
Subject: [PATCH] Update Llama README.md for Stories110M tokenizer (#5960)

Summary:
The tokenizer from `wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"` is a TikToken tokenizer, so we do not need to generate a `tokenizer.bin` and can instead use the `tokenizer.model` as is.

Pull Request resolved: https://github.com/pytorch/executorch/pull/5960

Reviewed By: tarun292

Differential Revision: D64014160

Pulled By: dvorjackz

fbshipit-source-id: 16474a73ed77192f58a5bb9e07426ba58216351e
(cherry picked from commit 12cb9ca065f02530218da9485f2e4b6daac60932)
---
 examples/models/llama2/README.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/examples/models/llama2/README.md b/examples/models/llama2/README.md
index be1d15b3927..9280c8f0a9f 100644
--- a/examples/models/llama2/README.md
+++ b/examples/models/llama2/README.md
@@ -113,11 +113,6 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
    ```
    python -m examples.models.llama2.export_llama -c stories110M.pt -p params.json -X -kv
    ```
-4. Create tokenizer.bin.
-
-   ```
-   python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
-   ```
 
 ### Option C: Download and export Llama 3 8B instruct model
 
@@ -127,7 +122,11 @@ You can export and run the original Llama 3 8B instruct model.
 
 2. Export model and generate `.pte` file
    ```
-   python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+   python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+   ```
+4. Create tokenizer.bin.
+   ```
+   python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
    ```
 
 Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.
@@ -187,7 +186,7 @@ tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
 
 Using the same arguments from above
 ```
-python -m examples.models.llama2.eval_llama -c <checkpoint.pth> -p <params.json> -t <tokenizer.model> -d fp32 --max_seq_len <max sequence length> --limit <number of samples>
+python -m examples.models.llama2.eval_llama -c <checkpoint.pth> -p <params.json> -t <tokenizer.model/bin> -d fp32 --max_seq_len <max sequence length> --limit <number of samples>
 ```
 
 The Wikitext results generated above used: `{max_seq_len: 2048, limit: 1000}`
@@ -233,7 +232,7 @@ Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the
 ```
 cmake-out/examples/models/llama2/llama_main --model_path=<model pte file> --tokenizer_path=<tokenizer.bin> --prompt=<prompt>
 ```
 
-For Llama3, you can pass the original `tokenizer.model` (without converting to `.bin` file).
+For Llama2 models, pass the converted `tokenizer.bin` file instead of `tokenizer.model`.
 
 To build for CoreML backend and validate on Mac, replace `-DEXECUTORCH_BUILD_XNNPACK=ON` with `-DEXECUTORCH_BUILD_COREML=ON`
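Note: the Summary's claim is easy to sanity-check locally. The sketch below is not part of the patch (`check_tokenizer.py` is a hypothetical name); it distinguishes the two formats involved: a TikToken-style BPE export is a plain-text file of `<base64 token> <rank>` pairs, while the Llama 2 / stories110M `tokenizer.model` that `extension.llm.tokenizer.tokenizer` converts is a serialized SentencePiece model. It assumes the `sentencepiece` pip package is installed.

```
# check_tokenizer.py -- hypothetical helper, not part of the patch.
# Distinguishes a TikToken-style BPE export (plain text, one
# "<base64 token> <rank>" pair per line) from a SentencePiece model
# (a serialized protobuf blob). Assumes `pip install sentencepiece`.
import base64
import sys


def is_tiktoken_export(path: str) -> bool:
    """True if every non-empty line parses as '<base64 token> <int rank>'."""
    try:
        with open(path, "rb") as f:
            lines = [ln for ln in f.read().splitlines() if ln.strip()]
        if not lines:
            return False
        for ln in lines:
            token_b64, rank = ln.split()
            base64.b64decode(token_b64, validate=True)  # raises on non-base64
            int(rank)  # raises on non-integer rank
        return True
    except ValueError:
        # Covers failed unpacking, base64 errors, and bad integer parses.
        return False


def is_sentencepiece_model(path: str) -> bool:
    """True if the file loads as a SentencePiece model."""
    import sentencepiece as spm

    try:
        spm.SentencePieceProcessor(model_file=path)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    p = sys.argv[1]  # e.g. ./tokenizer.model
    if is_tiktoken_export(p):
        print("TikToken-style BPE export: usable as-is")
    elif is_sentencepiece_model(p):
        print("SentencePiece model: convert with extension.llm.tokenizer.tokenizer")
    else:
        print("unrecognized tokenizer format")
```

Per the README text above, a file that parses as a TikToken export can be passed to `llama_main` directly, while a SentencePiece model needs the `tokenizer.bin` conversion step shown in the diff.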
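Note: the `--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'` argument in the export command bakes Llama 3's special-token ids into the exported `.pte`. A small illustrative snippet (not from the patch) showing where those numbers come from and how to emit an equivalent JSON string:

```
import json

# Llama 3 special-token ids referenced by the --metadata flag:
#   128000 -> <|begin_of_text|>  (BOS)
#   128001 -> <|end_of_text|>    (EOS)
#   128009 -> <|eot_id|>         (end of a chat turn, also treated as EOS)
metadata = {"get_bos_id": 128000, "get_eos_ids": [128009, 128001]}

# Prints a JSON string equivalent to the one passed to --metadata above.
print(json.dumps(metadata))
```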