OpenAI o1-preview
The following page is an extract from Learning to Reason with LLMs, OpenAI blog, Sept 2024. Please refer to the original source for a full benchmark report.
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.
Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.
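The post does not release training or inference code, so the following is only a toy sketch of the test-time-compute behavior described above: a simulated model whose sampled chains of thought become more reliable as the reasoning-token budget grows, aggregated with a simple majority vote. The function names, probabilities, and evaluation set are all invented for illustration and are not OpenAI's model or API.

```python
import random
from collections import Counter

# Toy simulation only: NOT OpenAI's model, training code, or API.
# It illustrates the qualitative claim that accuracy grows as the model
# is given more test-time compute (a larger chain-of-thought budget).
def sample_answer(gold: str, reasoning_tokens: int) -> str:
    """One sampled chain of thought; larger budgets are more likely to
    land on the correct answer in this simulation."""
    p_correct = min(0.95, 0.30 + 0.10 * (reasoning_tokens // 1024))
    if random.random() < p_correct:
        return gold
    return str(random.randint(0, 10**6))  # an arbitrary wrong answer

def accuracy(golds, reasoning_tokens, samples=8):
    """Self-consistency accuracy: majority vote over several sampled chains."""
    hits = 0
    for gold in golds:
        votes = Counter(sample_answer(gold, reasoning_tokens) for _ in range(samples))
        hits += votes.most_common(1)[0][0] == gold
    return hits / len(golds)

golds = [str(i) for i in range(500)]        # placeholder evaluation set
for budget in (1024, 4096, 16384):          # sweep of test-time compute
    print(f"{budget:>6} reasoning tokens -> accuracy {accuracy(golds, budget):.2f}")
```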
| Dataset | Metric | gpt-4o | o1-preview |
|---|---|---|---|
| Competition Math, AIME (2024) | cons@64 | 13.4 | 56.7 |
| Competition Math, AIME (2024) | pass@1 | 9.3 | 44.6 |
| Competition Code, Codeforces | Elo | 808 | 1,258 |
| Competition Code, Codeforces | Percentile | 11.0 | 62.0 |
| GPQA Diamond | cons@64 | 56.1 | 78.3 |
| GPQA Diamond | pass@1 | 50.6 | 73.3 |
| Biology | cons@64 | 63.2 | 73.7 |
| Biology | pass@1 | 61.6 | 65.9 |
| Chemistry | cons@64 | 43.0 | 60.2 |
| Chemistry | pass@1 | 40.2 | 59.9 |
| Physics | cons@64 | 68.6 | 89.5 |
| Physics | pass@1 | 59.5 | 89.4 |
| MATH | pass@1 | 60.3 | 85.5 |
| MMLU | pass@1 | 88.0 | 92.3 |
| MMMU (val) | pass@1 | 69.1 | n/a |
| MathVista (testmini) | pass@1 | 63.8 | n/a |
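The metric names in the table above follow common usage: pass@1 is the average accuracy of individually sampled answers, and cons@64 (consensus) scores the majority answer among 64 samples per problem. The exact scoring code is not published, so the snippet below is an assumed, conventional reading of those two metrics rather than OpenAI's implementation.

```python
from collections import Counter

def pass_at_1(samples_per_problem, golds):
    """Mean accuracy over all individual samples (one list of answers per problem)."""
    total = sum(len(answers) for answers in samples_per_problem)
    correct = sum(ans == gold
                  for answers, gold in zip(samples_per_problem, golds)
                  for ans in answers)
    return correct / total

def cons_at_k(samples_per_problem, golds):
    """Accuracy of the majority-vote answer across the k samples for each problem."""
    hits = 0
    for answers, gold in zip(samples_per_problem, golds):
        majority, _ = Counter(answers).most_common(1)[0]
        hits += majority == gold
    return hits / len(golds)

# Example: 2 problems with a handful of sampled answers each (64 in the real eval).
samples = [["17", "17", "21", "17"], ["5", "8", "8", "8"]]
golds = ["17", "8"]
print(pass_at_1(samples, golds))  # 0.75  (6 of 8 individual samples correct)
print(cons_at_k(samples, golds))  # 1.0   (majority answer correct for both problems)
```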
| Metric | GPT-4o | o1-preview |
|---|---|---|
| % Safe completions on harmful prompts (standard) | 0.990 | 0.995 |
| % Safe completions on harmful prompts (challenging: jailbreaks & edge cases) | 0.714 | 0.934 |
| ↳ Harassment (severe) | 0.845 | 0.900 |
| ↳ Exploitative sexual content | 0.483 | 0.949 |
| ↳ Sexual content involving minors | 0.707 | 0.931 |
| ↳ Advice about non-violent wrongdoing | 0.688 | 0.961 |
| ↳ Advice about violent wrongdoing | 0.778 | 0.963 |
| % Safe completions for top 200 prompts with highest Moderation API scores per category in WildChat (Zhao et al., 2024) | 0.945 | 0.971 |
| Goodness@0.1, StrongREJECT jailbreak eval (Souly et al., 2024) | 0.220 | 0.840 |
| Human-sourced jailbreak eval | 0.770 | 0.960 |
| % Compliance on internal benign edge cases ("not over-refusal") | 0.910 | 0.930 |
| % Compliance on benign edge cases in XSTest (Röttger et al., 2023) | 0.924 | 0.976 |