OpenAI o1-preview
The following page is an extract from Learning to Reason with LLMs, OpenAI blog, Sept 2024. Please refer to the original source for a full benchmark report.
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.
Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.
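The post does not release training or inference code, so the following is only a toy sketch of the test-time-compute behavior described above: a simulated model whose sampled chains of thought become more reliable as the reasoning-token budget grows, aggregated with a simple majority vote. The function names, probabilities, and evaluation set are all invented for illustration and are not OpenAI's model or API.

```python
import random
from collections import Counter

# Toy simulation only: NOT OpenAI's model, training code, or API.
# It illustrates the qualitative claim that accuracy grows as the model
# is given more test-time compute (a larger chain-of-thought budget).
def sample_answer(gold: str, reasoning_tokens: int) -> str:
    """One sampled chain of thought; larger budgets are more likely to
    land on the correct answer in this simulation."""
    p_correct = min(0.95, 0.30 + 0.10 * (reasoning_tokens // 1024))
    if random.random() < p_correct:
        return gold
    return str(random.randint(0, 10**6))  # an arbitrary wrong answer

def accuracy(golds, reasoning_tokens, samples=8):
    """Self-consistency accuracy: majority vote over several sampled chains."""
    hits = 0
    for gold in golds:
        votes = Counter(sample_answer(gold, reasoning_tokens) for _ in range(samples))
        hits += votes.most_common(1)[0][0] == gold
    return hits / len(golds)

golds = [str(i) for i in range(500)]        # placeholder evaluation set
for budget in (1024, 4096, 16384):          # sweep of test-time compute
    print(f"{budget:>6} reasoning tokens -> accuracy {accuracy(golds, budget):.2f}")
```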
| Dataset | Metric | gpt-4o | o1-preview |
|---|---|---|---|
| Competition Math, AIME (2024) | cons@64 | 13.4 | 56.7 |
| Competition Math, AIME (2024) | pass@1 | 9.3 | 44.6 |
| Competition Code, Codeforces | Elo | 808 | 1,258 |
| Competition Code, Codeforces | Percentile | 11.0 | 62.0 |
| GPQA Diamond | cons@64 | 56.1 | 78.3 |
| GPQA Diamond | pass@1 | 50.6 | 73.3 |
| Biology | cons@64 | 63.2 | 73.7 |
| Biology | pass@1 | 61.6 | 65.9 |
| Chemistry | cons@64 | 43.0 | 60.2 |
| Chemistry | pass@1 | 40.2 | 59.9 |
| Physics | cons@64 | 68.6 | 89.5 |
| Physics | pass@1 | 59.5 | 89.4 |
| MATH | pass@1 | 60.3 | 85.5 |
| MMLU | pass@1 | 88.0 | 92.3 |
| MMMU (val) | pass@1 | 69.1 | n/a |
| MathVista (testmini) | pass@1 | 63.8 | n/a |
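The metric names in the table above follow common usage: pass@1 is the average accuracy of individually sampled answers, and cons@64 (consensus) scores the majority answer among 64 samples per problem. The exact scoring code is not published, so the snippet below is an assumed, conventional reading of those two metrics rather than OpenAI's implementation.

```python
from collections import Counter

def pass_at_1(samples_per_problem, golds):
    """Mean accuracy over all individual samples (one list of answers per problem)."""
    total = sum(len(answers) for answers in samples_per_problem)
    correct = sum(ans == gold
                  for answers, gold in zip(samples_per_problem, golds)
                  for ans in answers)
    return correct / total

def cons_at_k(samples_per_problem, golds):
    """Accuracy of the majority-vote answer across the k samples for each problem."""
    hits = 0
    for answers, gold in zip(samples_per_problem, golds):
        majority, _ = Counter(answers).most_common(1)[0]
        hits += majority == gold
    return hits / len(golds)

# Example: 2 problems with a handful of sampled answers each (64 in the real eval).
samples = [["17", "17", "21", "17"], ["5", "8", "8", "8"]]
golds = ["17", "8"]
print(pass_at_1(samples, golds))  # 0.75  (6 of 8 individual samples correct)
print(cons_at_k(samples, golds))  # 1.0   (majority answer correct for both problems)
```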
| Metric | GPT-4o | o1-preview |
|---|---|---|
| % Safe completions on harmful prompts (standard) | 0.990 | 0.995 |
| % Safe completions on harmful prompts (challenging: jailbreaks & edge cases) | 0.714 | 0.934 |
| ↳ Harassment (severe) | 0.845 | 0.900 |
| ↳ Exploitative sexual content | 0.483 | 0.949 |
| ↳ Sexual content involving minors | 0.707 | 0.931 |
| ↳ Advice about non-violent wrongdoing | 0.688 | 0.961 |
| ↳ Advice about violent wrongdoing | 0.778 | 0.963 |
| % Safe completions for top 200 prompts with highest Moderation API scores per category in WildChat (Zhao et al., 2024) | 0.945 | 0.971 |
| Goodness@0.1, StrongREJECT jailbreak eval (Souly et al., 2024) | 0.220 | 0.840 |
| Human-sourced jailbreak eval | 0.770 | 0.960 |
| % Compliance on internal benign edge cases ("not over-refusal") | 0.910 | 0.930 |
| % Compliance on benign edge cases in XSTest (Röttger et al., 2023) | 0.924 | 0.976 |