Phi-4-mini-instruct

Safety Approach

The Phi-4 family of models has adopted a robust safety post-training approach that leverages a variety of open-source and in-house generated datasets. Safety alignment combines supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF), using human-labeled and synthetic English-language datasets. These include publicly available datasets focused on helpfulness and harmlessness, as well as question-and-answer sets targeting multiple safety categories. For non-English languages, the existing datasets were extended via machine translation.
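As a concrete illustration of one stage of this pipeline, the minimal sketch below shows preference alignment with DPO via Hugging Face TRL. The dataset id and hyperparameters are illustrative assumptions, not the actual Phi-4 training configuration.

```python
# Minimal sketch of DPO-style safety alignment with Hugging Face TRL.
# Dataset id and hyperparameters are illustrative, not Phi-4's.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data with "prompt", "chosen", and "rejected" columns,
# e.g. helpfulness/harmlessness pairs. This dataset id is hypothetical.
train_dataset = load_dataset("my-org/safety-preference-pairs", split="train")

args = DPOConfig(
    output_dir="phi4-mini-dpo",
    beta=0.1,  # strength of the preference regularizer toward the reference model
    per_device_train_batch_size=2,
)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```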

Safety Evaluation and Red-Teaming

Various evaluation techniques, including red teaming, adversarial conversation simulation, and multilingual safety-evaluation benchmark datasets, were used to assess the Phi-4 models' propensity to produce undesirable outputs across multiple languages and risk categories; several approaches were combined to compensate for the limitations of any one of them. Findings across these evaluation methods indicate that the safety post-training described in the Phi-3 Safety Post-Training paper had a positive impact across multiple languages and risk categories, as observed in refusal rates (refusals to produce undesirable outputs) and robustness to jailbreak techniques. Details on prior red-team evaluations across Phi models can be found in that paper.

For this release, initial insights from red teaming indicate that the models may at times be mistaken about which company created them; ad-hoc training data were added to correct this behavior. Another insight was that in function-calling scenarios the models could sometimes hallucinate function names or URLs. The models may also be more susceptible to long multi-turn jailbreak techniques in both English and non-English languages. These findings highlight the need for industry-wide investment in high-quality safety-evaluation datasets that span multiple languages, including low-resource ones, and risk areas that account for the cultural nuances of where those languages are spoken.
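One pragmatic mitigation for hallucinated function names and URLs is to validate every model-emitted tool call against an explicit registry before executing it. The sketch below is a generic guard, not part of the Phi-4 release; all tool and host names are hypothetical.

```python
# Guard against hallucinated tool calls: execute only registered
# functions, and reject URLs outside an allow-list. Names are illustrative.
import json
from urllib.parse import urlparse

REGISTERED_TOOLS = {"get_weather", "search_flights"}  # tools actually exposed
ALLOWED_HOSTS = {"api.example.com"}                   # hosts we trust

def validate_tool_call(raw: str) -> dict:
    """Parse a model-emitted tool call and refuse anything unregistered."""
    call = json.loads(raw)
    if call.get("name") not in REGISTERED_TOOLS:
        raise ValueError(f"Unknown tool: {call.get('name')!r}")
    url = call.get("arguments", {}).get("url")
    if url is not None and urlparse(url).hostname not in ALLOWED_HOSTS:
        raise ValueError(f"URL not on allow-list: {url}")
    return call

# Example: rejected, because the model invented a misspelled tool name.
# validate_tool_call('{"name": "get_weather_forecastt", "arguments": {}}')
```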

Model Quality

To understand its capabilities, the 3.8B-parameter Phi-4-Mini model was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for benchmark methodology). A high-level overview of model quality follows:

Popular Aggregated Benchmarks

| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Arena Hard | 32.8 | 34.4 | 17 | 26.9 | 32 | 55.5 | 37.3 | 25.7 | 42.7 | 43.7 | 75 |
| BigBench Hard CoT (0-shot) | 70.4 | 63.1 | 55.4 | 51.2 | 56.2 | 72.4 | 53.3 | 63.4 | 55.5 | 65.7 | 80.4 |
| MMLU (5-shot) | 67.3 | 65.5 | 61.8 | 60.8 | 65 | 72.6 | 63 | 68.1 | 65 | 71.3 | 77.2 |
| MMLU-Pro (0-shot, CoT) | 52.8 | 34.4 | 39.2 | 35.3 | 44.7 | 56.2 | 36.6 | 44 | 40.9 | 50.1 | 62.8 |

Reasoning

| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ARC Challenge (10-shot) | 83.7 | 47.4 | 76.1 | 80.3 | 82.6 | 90.1 | 82.7 | 83.1 | 79.4 | 89.8 | 93.5 |
| BoolQ (2-shot) | 81.2 | 84.6 | 71.4 | 79.4 | 65.4 | 80 | 80.5 | 82.8 | 79 | 85.7 | 88.7 |
| GPQA (0-shot, CoT) | 30.4 | 77.7 | 26.6 | 24.3 | 24.3 | 30.6 | 26.3 | 26.3 | 29.9 | 31 | 41.1 |
| HellaSwag (5-shot) | 69.1 | 25.2 | 69 | 77.2 | 74.6 | 80.1 | 80.9 | 73.5 | 72.8 | 80.9 | 87.1 |
| OpenBook QA (10-shot) | 79.2 | 72.2 | 72.6 | 79.8 | 77.6 | 86 | 80.2 | 84.8 | 79.8 | 89.6 | 90 |
| PIQA (5-shot) | 77.6 | 81.2 | 68.2 | 78.3 | 77.2 | 80.8 | 76.2 | 81.2 | 83.2 | 83.7 | 88.7 |
| Social IQA (5-shot) | 72.5 | 78.2 | 68.3 | 73.9 | 75.3 | 75.3 | 77.6 | 71.8 | 73.4 | 74.7 | 82.9 |
| TruthfulQA (MC2) (10-shot) | 66.4 | 75.1 | 59.2 | 62.9 | 64.3 | 69.4 | 63 | 69.2 | 64.1 | 76.6 | 78.2 |
| WinoGrande (5-shot) | 67 | 65.6 | 53.2 | 59.8 | 63.3 | 71.1 | 63.1 | 64.7 | 65.4 | 74 | 76.9 |

Multilingual

| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Multilingual MMLU (5-shot) | 49.3 | 51.8 | 48.1 | 46.4 | 55.9 | 64.4 | 53.7 | 56.2 | 54.5 | 63.8 | 72.9 |
| MGSM (0-shot, CoT) | 63.9 | 47 | 49.6 | 44.6 | 53.5 | 64.5 | 58.3 | 56.7 | 58.6 | 75.1 | 81.7 |

Math

| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM8K (8-shot, CoT) | 88.6 | 76.9 | 75.6 | 80.1 | 80.6 | 88.7 | 81.9 | 82.4 | 84.3 | 84.9 | 91.3 |
| MATH (0-shot, CoT) | 64 | 49.8 | 46.7 | 41.8 | 61.7 | 60.4 | 41.6 | 47.6 | 46.1 | 51.3 | 70.2 |

Long Context

| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qasper | 40.4 | 41.9 | 33.4 | 35.3 | 32.1 | 38.1 | 37.4 | 37.2 | 35.4 | 13.9 | 39.8 |
| SQuALITY | 22.8 | 25.3 | 25.7 | 25.5 | 25.3 | 23.8 | 24.9 | 26.2 | 26.7 | 23.6 | 23.8 |

Instruction Following

| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IFEval | 70.1 | 50.6 | 68 | 47.5 | 59 | 69.5 | 52.5 | 74.1 | 77.3 | 73.2 | 80.1 |

Function Calling

| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BFCL | 70.3 | 66.1 | 78.6 | 61.4 | 74.2 | 81.3 | 74 | 77 | 59.4 | 59.9 | 83.3 |
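For context on what BFCL exercises, the sketch below issues a tool-augmented request through transformers' generic chat-template API. Passing Python functions via `tools=` is a standard transformers feature; whether Phi-4-mini's bundled chat template renders tools exactly this way is an assumption, so the model card's documented tool format should take precedence.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    ...  # the JSON schema is extracted from the signature and docstring

messages = [{"role": "user", "content": "What's the weather in Paris right now?"}]
# tools= accepts plain Python functions; transformers converts them to schemas.
inputs = tok.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True, return_tensors="pt"
)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```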

Code Generation

| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HumanEval (0-shot) | 74.4 | 70.1 | 62.8 | 72 | 72 | 75 | 70.7 | 66.5 | 62.8 | 63.4 | 86.6 |
| MBPP (3-shot) | 65.3 | 70 | 67.2 | 65.1 | 65.3 | 76.3 | 68.9 | 69.4 | 63.9 | 69.6 | 84.1 |
| **Overall (average across all benchmarks)** | 63.5 | 60.5 | 56.2 | 56.9 | 60.1 | 67.9 | 60.2 | 62.3 | 60.9 | 65.0 | 75.5 |
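The scores above come from an internal benchmark platform (see Appendix A). As a hedged point of comparison, a public re-measurement of a single row, e.g. MMLU (5-shot), could be run with EleutherAI's lm-evaluation-harness as sketched below; API details vary by harness version, and public-harness numbers are not expected to match the internal platform exactly.

```python
# Hypothetical re-measurement of the MMLU (5-shot) row with the open
# lm-evaluation-harness; prompt-formatting differences mean results
# will not exactly reproduce the internal platform's numbers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-4-mini-instruct,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,  # matches the table's MMLU (5-shot) setting
)
print(results["results"]["mmlu"])
```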

Overall, with only 3.8B parameters, the model achieves a level of multilingual language understanding and reasoning comparable to much larger models. It remains fundamentally limited by its size on certain tasks, however: it simply lacks the capacity to store large amounts of factual knowledge, so users may encounter factual errors. This weakness can be mitigated by augmenting Phi-4-Mini with a search engine, particularly when the model is used in retrieval-augmented generation (RAG) settings.
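A minimal sketch of that RAG mitigation, assuming an external retriever: fetch supporting passages first, then have the model answer only from that context. The `retrieve` function is a hypothetical stand-in for any search engine or vector store.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/Phi-4-mini-instruct")

def retrieve(query: str, k: int = 3) -> list[str]:
    """Hypothetical hook into a search engine or vector store."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    # Chat-style pipelines return the whole conversation; take the last reply.
    return generator(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
```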

About

A 3.8B-parameter small language model that outperforms larger models in reasoning, math, coding, and function calling.
Context: 128k input · 4k output
Training date: Jun 2024

Languages

(23) Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian