From 0beadb62a0e79cd9eb348b7fc8a68bd6b13708ed Mon Sep 17 00:00:00 2001
From: Lunwen He
Date: Fri, 11 Oct 2024 11:40:43 -0700
Subject: [PATCH] add instructions about getting mmlu score for instruct models

[ghstack-poisoned]
---
 examples/models/llama2/README.md | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/examples/models/llama2/README.md b/examples/models/llama2/README.md
index 023955ab32f..10eccc79755 100644
--- a/examples/models/llama2/README.md
+++ b/examples/models/llama2/README.md
@@ -49,7 +49,7 @@ We employed 4-bit groupwise per token dynamic quantization of all the linear lay
 
 We evaluated WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). Please note that LM Eval reports perplexity normalized by word count instead of token count. You may see different perplexity for WikiText from other sources if they implement it differntly. More details could be found [here](https://github.com/EleutherAI/lm-evaluation-harness/issues/2301).
 
-Below are the results for two different groupsizes, with max_seq_len 2048, and 1000 samples.
+Below are the results for two different groupsizes, with max_seq_length 2048 and limit 1000.
 
 |Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256)
 |--------|-----------------| ---------------------- | ---------------
@@ -280,12 +280,32 @@ tokenizer.path=/tokenizer.model
 
 > Forewarning: Model evaluation without a GPU may take a long time, especially on larger models.
 
-Using the same arguments from above
+We use [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate model accuracy.
+
+Use the following example command to calculate the model's perplexity on WikiText.
 ```
-python -m examples.models.llama2.eval_llama -c <checkpoint.pth> -p <params.json> -t <tokenizer.model> -d fp32 --max_seq_len <max sequence length> --limit <number of samples>
+python -m examples.models.llama2.eval_llama \
+  -c <checkpoint.pth> \
+  -p <params.json> \
+  -t <tokenizer.model> \
+  -kv \
+  -d <checkpoint dtype> \
+  --max_seq_len <max sequence length> \
+  --limit <number of samples>
 ```
 
-The Wikitext results generated above used: `{max_seq_len: 2048, limit: 1000}`
+For instruct models, you can use the following example command to calculate the model's MMLU score.
+```
+python -m examples.models.llama2.eval_llama \
+  -c <checkpoint.pth> \
+  -p <params.json> \
+  -t <tokenizer.model> \
+  -kv \
+  -d <checkpoint dtype> \
+  --tasks mmlu \
+  --num_fewshot 5 \
+  --max_seq_len <max sequence length>
+```
 
 ## Step 4: Run on your computer to validate
 
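For readers applying this patch, a filled-in MMLU invocation might look like the sketch below. It only restates the command added above; every path, the checkpoint dtype, and the sequence length are hypothetical stand-ins for the README's placeholders, not values taken from this PR. The WikiText perplexity command from the patch differs only in dropping `--tasks mmlu --num_fewshot 5` and passing `--limit` instead.

```
# Hypothetical example of the MMLU command added in this patch.
# All paths, the dtype, and max_seq_len below are illustrative stand-ins;
# substitute your own checkpoint, params.json, and tokenizer files.
python -m examples.models.llama2.eval_llama \
  -c /data/llama3-8b-instruct/consolidated.00.pth \
  -p /data/llama3-8b-instruct/params.json \
  -t /data/llama3-8b-instruct/tokenizer.model \
  -kv \
  -d fp32 \
  --tasks mmlu \
  --num_fewshot 5 \
  --max_seq_len 2048
```

The 5-shot setting (`--num_fewshot 5`) matches how MMLU results are commonly reported, which is why the patch pins it in the example command.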