We evaluated phi-4 using OpenAI’s simple-evals framework and our own internal benchmarks to understand the model’s capabilities. The benchmarks are:

  • MMLU: Popular aggregated dataset for multitask language understanding.

  • MATH: Challenging competition math problems.

  • GPQA: Complex, graduate-level science questions.

  • DROP: Complex comprehension and reasoning.

  • MGSM: Multi-lingual grade-school math.

  • HumanEval: Functional code generation.

  • SimpleQA: Factual responses.

To put these results in context, we compare phi-4 with a set of other models on OpenAI’s simple-evals benchmarks.

The table below gives a high-level overview of model quality on representative benchmarks. Higher numbers indicate better performance:

| Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o |
|---|---|---|---|---|---|---|---|---|
| Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
| Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
| Math | MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4 |
| Math | MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3* | 80.0 | 74.6 |
| Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9* | 80.4 | 90.6 |
| Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 |
| Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |

* These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.
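
To make the comparison easier to explore, the scores can be loaded into a small script. The sketch below is illustrative only and is not part of the evaluation framework. It copies the numbers from the table above, and the shortened model labels and the helper names `rank` and `benchmarks_led_by` are our own assumptions. It ranks the models on one benchmark and lists the benchmarks on which a given model has the highest score.

```python
# Scores copied from the benchmark table above (higher is better).
# Model labels are shortened for readability; they are not official identifiers.
MODELS = [
    "phi-4", "phi-3", "Qwen2.5-14B", "GPT-4o-mini",
    "Llama-3.3-70B", "Qwen2.5-72B", "GPT-4o",
]

SCORES = {
    "MMLU":      [84.8, 77.9, 79.9, 81.8, 86.3, 85.3, 88.1],
    "GPQA":      [56.1, 31.2, 42.9, 40.9, 49.1, 49.0, 50.6],
    "MGSM":      [80.6, 53.5, 79.6, 86.5, 89.1, 87.3, 90.4],
    "MATH":      [80.4, 44.6, 75.6, 73.0, 66.3, 80.0, 74.6],
    "HumanEval": [82.6, 67.8, 72.1, 86.2, 78.9, 80.4, 90.6],
    "SimpleQA":  [3.0, 7.6, 5.4, 9.9, 20.9, 10.2, 39.4],
    "DROP":      [75.5, 68.3, 85.5, 79.3, 90.2, 76.7, 80.9],
}


def rank(benchmark):
    """Return (model, score) pairs for one benchmark, best score first."""
    pairs = zip(MODELS, SCORES[benchmark])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def benchmarks_led_by(model):
    """Return the benchmarks on which `model` has the highest score."""
    return [name for name in SCORES if rank(name)[0][0] == model]


if __name__ == "__main__":
    print(benchmarks_led_by("phi-4"))  # ['GPQA', 'MATH']
    for model, score in rank("MATH"):
        print(f"{model:15s} {score:5.1f}")
```

On these numbers, phi-4 has the top score on GPQA and MATH, while the larger models lead on the remaining benchmarks. The Llama-3.3 MATH and HumanEval entries are the simple-evals figures from the table, not the higher figures reported by Meta.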

About

Phi-4 14B, a highly capable model for low latency scenarios.
Context: 16k input · 16k output
Training date: Jun 2024
Rate limit tier
Provider support

Languages (45)
English, Arabic, Bangla, Czech, Danish, German, Greek, Spanish, Persian, Finnish, French, Gujarati