diff --git a/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md b/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md
index da67e84adc2..9a8b86b8a50 100644
--- a/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md
+++ b/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md
@@ -71,7 +71,7 @@ We have supported BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1
 
 * Export Llama model and generate .pte file as below:
 ```
-python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
+python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
 ```
 
 * Rename tokenizer for Llama 3.2 with command: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support tokenizer in original format directly.
diff --git a/examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md b/examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md
index 64811ee774f..d7a76da6434 100644
--- a/examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md
+++ b/examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md
@@ -55,7 +55,7 @@ We have supported BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1
 
 * Export Llama model and generate .pte file as below:
 ```
-python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
+python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
 ```
 
 For more detail using Llama 3.2 lightweight models including prompt template, please go to our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
diff --git a/examples/models/llama2/Android3_2_1B_bf16.gif b/examples/models/llama2/Android3_2_1B_bf16.gif
index abe2b8278f6..d40a8c2db97 100644
Binary files a/examples/models/llama2/Android3_2_1B_bf16.gif and b/examples/models/llama2/Android3_2_1B_bf16.gif differ
diff --git a/examples/models/llama2/README.md b/examples/models/llama2/README.md
index bcca1b82ba4..f5686eccd95 100644
--- a/examples/models/llama2/README.md
+++ b/examples/models/llama2/README.md
@@ -85,10 +85,10 @@ We have verified running Llama 2 7B [mobile applications](#step-6-build-mobile-a
 ### Llama 3.2 1B and 3B
 Llama 3.2 1B and 3B performance was measured on the OnePlus 12 device. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on) for generating 128 tokens.
 
-|Model | 4bit(*) via SpinQuant
-|--------| ---------------
-|1B | 53.41 tokens/second |
-|3B | 22.98 tokens/second |
+|Model | bf16 | 4bit(*) via SpinQuant
+|--------| ---------------------- | ---------------
+|1B | 19.4 tokens/second | 53.41 tokens/second |
+|3B | 7.76 tokens/second | 22.98 tokens/second |
 
 (*) With SpinQuant, we currently quantize 4-bit groupwise (with groupsize 32) weight, 8bit dynamic activation of all the linear layers of the model, except embedding and output layers.
 The embedding and output layers are quantized as 8-bit per-channel weight and 8-bit dynamic activation.
@@ -142,7 +142,9 @@ LLAMA_PARAMS=path/to/params.json
 python -m examples.models.llama2.export_llama \
   --checkpoint "${LLAMA_CHECKPOINT:?}" \
   --params "${LLAMA_PARAMS:?}" \
-  -kv -X \
+  -kv \
+  --use_sdpa_with_kv_cache \
+  -X \
   -d bf16 \
   --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_n_bos": 0, "get_n_eos": 0}' \
   --output_name="llama3_2.pte"
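For readers of the README hunk above, here is a minimal sketch of what "4-bit groupwise (with groupsize 32) weight" quantization means numerically: each row of a linear layer's weight is split into groups of 32 values, and each group gets its own scale mapping it onto the 16 signed 4-bit levels [-8, 7]. This is an illustration only, not ExecuTorch's or SpinQuant's actual quantizer; the function names are invented for this example.

```
# Hypothetical sketch of symmetric 4-bit groupwise weight quantization.
import torch

def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 32):
    """Quantize a 2D weight matrix to int4 codes with one scale per group of 32."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, chosen so the max-magnitude value maps to +/-7.
    scales = groups.abs().amax(dim=-1, keepdim=True) / 7.0
    scales = scales.clamp(min=1e-9)  # avoid division by zero for all-zero groups
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize_4bit_groupwise(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate float weight matrix from int4 codes and scales."""
    return (q.to(torch.float32) * scales).reshape(q.shape[0], -1)

if __name__ == "__main__":
    w = torch.randn(8, 64)  # toy linear-layer weight
    q, s = quantize_4bit_groupwise(w)
    w_hat = dequantize_4bit_groupwise(q, s)
    print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```

The small reconstruction error this prints is the accuracy cost that the SpinQuant column in the benchmark table trades for roughly 2.7-3x more tokens per second than bf16 on the same device.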