Congratulations on the release of MergeBench!
While testing the model on gsm8k_cot, I noticed that the math ability of Llama-3.1-8B-Instruct_math appears to be quite poor.
The following evaluation was run using lm_eval:

```shell
lm_eval \
    --model_args pretrained="MergeBench/Llama-3.1-8B-Instruct_math",dtype='bfloat16',parallelize=True \
    --apply_chat_template \
    --tasks gsm8k_cot \
    --batch_size 8
```
Results:
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|---------|------------------|--------|--------------|---------|----------|
| gsm8k_cot | 3 | flexible-extract | 8 | exact_match | 0.1698 | ± 0.0103 |
| | | strict-match | 8 | exact_match | 0.0000 | ± 0.0000 |
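For what it's worth, a strict-match score of exactly 0.0 alongside a nonzero flexible-extract score often indicates an answer-format mismatch (the model never emits the exact final-answer pattern the strict filter expects) rather than a total absence of math ability, though 0.17 flexible-extract is still very low for an 8B instruct model. A sketch of how one could inspect the raw generations to check this, using lm_eval's standard `--log_samples` and `--output_path` flags (same environment assumed as above):

```shell
# Re-run with per-sample logging so the raw model outputs can be inspected
# against both the strict and flexible answer-extraction filters.
lm_eval \
    --model_args pretrained="MergeBench/Llama-3.1-8B-Instruct_math",dtype='bfloat16',parallelize=True \
    --apply_chat_template \
    --tasks gsm8k_cot \
    --batch_size 8 \
    --log_samples \
    --output_path results/
```

The logged samples under `results/` would show whether the model is producing correct answers in an unexpected format or genuinely failing the problems.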