Meta-Llama-3-8B-Instruct
In this section, we report the results for Llama 3 models on standard automatic benchmarks. For all evaluations, we use our internal evaluations library.
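The benchmark tables below use n-shot evaluation: the model is shown n solved examples before the target question (e.g. MMLU is 5-shot). As a minimal sketch of how such a prompt is assembled, assuming a simple question/answer format (the record fields and formatting here are illustrative, not Meta's internal evaluations library):

```python
# Illustrative sketch of k-shot prompt construction for benchmarks such as
# MMLU (5-shot). The demo records and Q/A template are assumptions for
# demonstration purposes only.

def build_few_shot_prompt(examples, question, k=5):
    """Concatenate k solved examples ahead of the target question."""
    shots = []
    for ex in examples[:k]:
        shots.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    # The final question is left unanswered; the model completes it.
    shots.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(shots)

demo = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
prompt = build_few_shot_prompt(demo, "What is 3 * 3?", k=2)
print(prompt)
```

The model's completion after the final "Answer:" is then scored against the reference answer.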
Base pretrained models

| Category | Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
|---|---|---|---|---|---|---|
| General | MMLU (5-shot) | 66.6 | 45.7 | 53.8 | 79.5 | 69.7 |
| | AGIEval English (3-5 shot) | 45.9 | 28.8 | 38.7 | 63.0 | 54.8 |
| | CommonSenseQA (7-shot) | 72.6 | 57.6 | 67.6 | 83.8 | 78.7 |
| | Winogrande (5-shot) | 76.1 | 73.3 | 75.4 | 83.1 | 81.8 |
| | BIG-Bench Hard (3-shot, CoT) | 61.1 | 38.1 | 47.0 | 81.3 | 65.7 |
| | ARC-Challenge (25-shot) | 78.6 | 53.7 | 67.6 | 93.0 | 85.3 |
| Knowledge reasoning | TriviaQA-Wiki (5-shot) | 78.5 | 72.1 | 79.6 | 89.7 | 87.5 |
| Reading comprehension | SQuAD (1-shot) | 76.4 | 72.2 | 72.1 | 85.6 | 82.6 |
| | QuAC (1-shot, F1) | 44.4 | 39.6 | 44.9 | 51.1 | 49.4 |
| | BoolQ (0-shot) | 75.7 | 65.5 | 66.9 | 79.0 | 73.1 |
| | DROP (3-shot, F1) | 58.4 | 37.9 | 49.8 | 79.7 | 70.2 |
Instruction tuned models

| Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
|---|---|---|---|---|---|
| MMLU (5-shot) | 68.4 | 34.1 | 47.8 | 82.0 | 52.9 |
| GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21.0 |
| HumanEval (0-shot) | 62.2 | 7.9 | 14.0 | 81.7 | 25.6 |
| GSM-8K (8-shot, CoT) | 79.6 | 25.7 | 77.4 | 93.0 | 57.5 |
| MATH (4-shot, CoT) | 30.0 | 3.8 | 6.7 | 50.4 | 11.6 |
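For chain-of-thought benchmarks such as GSM-8K (8-shot, CoT), the model writes out its reasoning and the scorer extracts only the final numeric answer to compare against the reference. A minimal sketch of that extraction step, assuming the common "last number in the completion" heuristic (this is an illustrative scorer, not Meta's internal one):

```python
import re

def extract_final_answer(completion: str):
    """Return the last number appearing in a chain-of-thought completion,
    or None if the completion contains no number. Thousands separators
    are stripped so '1,250' parses as 1250."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None

cot = ("Natalia sold 48 clips in April and half as many in May. "
       "48 / 2 = 24, so 48 + 24 = 72. The answer is 72.")
print(extract_final_answer(cot))  # → 72
```

Exact-match accuracy is then the fraction of problems where the extracted answer equals the reference answer.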
About
A versatile 8-billion-parameter model optimized for dialogue and text-generation tasks.
Context: 8k input · 4k output
Training date: Mar 2023
Languages: English