
The following page is an extract from OpenAI's o1-mini model announcement. Please refer to the original source for the full benchmark report.

Large language models such as o1 are pre-trained on vast text datasets. While these high-capacity models have broad world knowledge, they can be expensive and slow for real-world applications. In contrast, o1-mini is a smaller model optimized for STEM reasoning during pretraining. After training with the same high-compute reinforcement learning (RL) pipeline as o1, o1-mini achieves comparable performance on many useful reasoning tasks, while being significantly more cost-efficient.

Evals

Task          | Dataset                    | Metric             | GPT-4o | o1-mini | o1-preview
Coding        | Codeforces                 | Elo                | 900    | 1650    | 1258
Coding        | HumanEval                  | Accuracy           | 90.2%  | 92.4%   | 92.4%
Cybersecurity | CTFs                       | Accuracy (Pass@12) | 20.0%  | 28.7%   | 43.0%
STEM          | MMLU (0-shot CoT)          | Accuracy           | 88.7%  | 85.2%   | 90.8%
STEM          | GPQA (Diamond, 0-shot CoT) | Accuracy           | 53.6%  | 60.0%   | 73.3%
STEM          | MATH-500 (0-shot CoT)      | Accuracy           | 60.3%  | 90.0%   | 85.5%
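
The CTF row reports Pass@12, i.e. the probability that at least one of 12 sampled attempts solves the task. The announcement does not spell out how this is estimated, so as an assumption here is a minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per task
    c: samples that solved the task
    k: attempts the metric allows (12 for the CTF row)
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k draws must include a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers for illustration only:
print(round(pass_at_k(n=20, c=3, k=12), 3))
```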

Safety

Metric                                                                       | GPT-4o | o1-mini
% Safe completions refusal on harmful prompts (standard)                     | 0.99   | 0.99
% Safe completions on harmful prompts (challenging: jailbreaks & edge cases) | 0.714  | 0.932
% Compliance on benign edge cases (“not over-refusal”)                       | 0.91   | 0.923
Goodness@0.1 StrongREJECT jailbreak eval (Souly et al. 2024)                 | 0.22   | 0.83
Human-sourced jailbreak eval                                                 | 0.77   | 0.95
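
Goodness@0.1 on StrongREJECT (Souly et al. 2024) is described in OpenAI's system cards as the model's safety when attacked by the top 10% of jailbreak techniques per prompt. A minimal sketch under that reading; the function name and the scores matrix are hypothetical, with higher scores meaning safer completions:

```python
import numpy as np

def goodness_at_q(scores: np.ndarray, q: float = 0.1) -> float:
    """Mean safety score against the strongest q fraction of jailbreaks.

    scores: shape (n_prompts, n_jailbreaks); higher = safer completion.
    """
    k = max(1, int(round(q * scores.shape[1])))
    worst = np.sort(scores, axis=1)[:, :k]  # lowest scores = most effective attacks
    return float(worst.mean())

# Hypothetical 3 prompts x 10 jailbreak techniques:
rng = np.random.default_rng(0)
print(round(goodness_at_q(rng.uniform(size=(3, 10))), 3))
```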

About

Smaller, faster, and 80% cheaper than o1-preview, o1-mini performs well at code generation and small-context operations.
Context: 128k input · 66k output
Training date: Oct 2023
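
The context limits above map directly onto request parameters. A minimal usage sketch, assuming access through the official OpenAI Python SDK (the page itself shows no API calls; the prompt and token cap are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# o1-mini takes up to ~128k input tokens and emits up to ~66k output
# tokens; max_completion_tokens caps the output side of that budget.
response = client.chat.completions.create(
    model="o1-mini",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    max_completion_tokens=4096,
)
print(response.choices[0].message.content)
```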

Languages (27)
English, Italian, Afrikaans, Spanish, German, French, Indonesian, Russian, Polish, Ukrainian, Greek, Latvian