OpenAI o1-mini
The following page is an extract from the OpenAI o1-mini model announcement. Please refer to the original source for the full benchmark report.
Large language models such as o1 are pre-trained on vast text datasets. While these high-capacity models have broad world knowledge, they can be expensive and slow for real-world applications. In contrast, o1-mini is a smaller model optimized for STEM reasoning during pretraining. After training with the same high-compute reinforcement learning (RL) pipeline as o1, o1-mini achieves comparable performance on many useful reasoning tasks, while being significantly more cost efficient.
Task | Dataset | Metric | GPT-4o | o1-mini | o1-preview
---|---|---|---|---|---
Coding | Codeforces | Elo | 900 | 1650 | 1258
Coding | HumanEval | Accuracy | 90.2% | 92.4% | 92.4%
Coding | Cybersecurity CTFs | Accuracy (Pass@12) | 20.0% | 28.7% | 43.0%
STEM | MMLU (0-shot CoT) | Accuracy | 88.7% | 85.2% | 90.8%
STEM | GPQA (Diamond, 0-shot CoT) | Accuracy | 53.6% | 60.0% | 73.3%
STEM | MATH-500 (0-shot CoT) | Accuracy | 60.3% | 90.0% | 85.5%
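The Cybersecurity CTFs row reports Pass@12, i.e. the probability that at least one of 12 attempts succeeds. A minimal sketch of the standard unbiased pass@k estimator (from the HumanEval paper; not part of this announcement) shows how such a number is typically computed from n sampled attempts of which c succeeded:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n samples (c correct), succeeds.
    Computed as 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 5 correct, k = 1 → pass@1 = 0.5
print(pass_at_k(10, 5, 1))
```

Averaging this estimate over all tasks gives the reported Pass@12 figure.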
Metric | GPT-4o | o1-mini
---|---|---
% Safe completions refusal on harmful prompts (standard) | 0.99 | 0.99
% Safe completions on harmful prompts (challenging: jailbreaks & edge cases) | 0.714 | 0.932
% Compliance on benign edge cases ("not over-refusal") | 0.91 | 0.923
Goodness@0.1 StrongREJECT jailbreak eval (Souly et al. 2024) | 0.22 | 0.83
Human-sourced jailbreak eval | 0.77 | 0.95
About
Smaller, faster, and 80% cheaper than o1-preview; performs well at code generation and small-context operations.
Context
128k input · 66k output
Training date
Oct 2023
Languages
(27) English, Italian, Afrikaans, Spanish, German, French, Indonesian, Russian, Polish, Ukrainian, Greek, Latvian