Congratulations on the release of MergeBench!
While testing the model on gsm8k_cot, I noticed that the math ability of Llama-3.1-8B-Instruct_math appears to be quite poor.
The following evaluation was run using lm_eval:

```shell
lm_eval \
    --model_args pretrained="MergeBench/Llama-3.1-8B-Instruct_math",dtype='bfloat16',parallelize=True \
    --apply_chat_template \
    --tasks gsm8k_cot \
    --batch_size 8
```
Results:
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|---------|------------------|--------|--------------|---------|----------|
| gsm8k_cot | 3 | flexible-extract | 8 | exact_match | 0.1698 | ± 0.0103 |
| | | strict-match | 8 | exact_match | 0.0000 | ± 0.0000 |
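For what it's worth, a strict-match score of exactly 0.0 alongside a nonzero flexible-extract score often indicates an answer-format mismatch (the model never emits the exact final-answer pattern the strict filter expects) rather than a total absence of math ability, though 0.17 flexible-extract is still very low for an 8B instruct model. A sketch of how one could inspect the raw generations to check this, using lm_eval's standard `--log_samples` and `--output_path` flags (same environment assumed as above):

```shell
# Re-run with per-sample logging so the raw model outputs can be inspected
# against both the strict and flexible answer-extraction filters.
lm_eval \
    --model_args pretrained="MergeBench/Llama-3.1-8B-Instruct_math",dtype='bfloat16',parallelize=True \
    --apply_chat_template \
    --tasks gsm8k_cot \
    --batch_size 8 \
    --log_samples \
    --output_path results/
```

The logged samples under `results/` would show whether the model is producing correct answers in an unexpected format or genuinely failing the problems.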