The Phi-4 family of models has adopted a robust safety post-training approach that leverages a variety of open-source and in-house generated datasets. Safety alignment is performed with a combination of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback), using human-labeled and synthetic English-language datasets, including publicly available datasets focused on helpfulness and harmlessness as well as questions and answers targeting multiple safety categories. For non-English languages, the existing datasets were extended via machine translation.
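As a rough illustration of the preference-optimization stage, the sketch below implements the standard DPO objective; the β value, tensor names, and default are illustrative assumptions, not Phi-4's actual training configuration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over per-sequence log-probabilities.
    beta=0.1 is a common default, not the (undisclosed) Phi-4 setting."""
    # Implicit reward: beta-scaled log-ratio of the policy vs. a frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the preferred (safe/helpful) and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```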
Safety Evaluation and Red-Teaming
Various evaluation techniques, including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets, were leveraged to evaluate Phi-4 models' propensity to produce undesirable outputs across multiple languages and risk categories. Multiple approaches were used to compensate for the limitations of any single approach. Findings across these evaluation methods indicate that the safety post-training detailed in the Phi-3 Safety Post-Training paper had a positive impact across multiple languages and risk categories, as observed through refusal rates (refusals to produce undesirable outputs) and robustness to jailbreak techniques. Details on prior red-team evaluations across Phi models can be found in the Phi-3 Safety Post-Training paper.
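For intuition, here is a minimal sketch of how a refusal-rate metric of this kind can be computed; the `generate` call, prompt set, and keyword heuristic are placeholders, not the internal evaluation platform described above (production evaluations typically use an LLM judge or a trained classifier rather than string matching).

```python
# Hypothetical refusal-rate harness; every name here is a placeholder.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(response: str) -> bool:
    # Crude stand-in for an LLM-judge or classifier-based refusal detector.
    return response.lower().startswith(REFUSAL_MARKERS)

def refusal_rate(model, harmful_prompts: list[str]) -> float:
    # Fraction of undesirable prompts the model refuses to answer.
    refusals = sum(is_refusal(model.generate(p)) for p in harmful_prompts)
    return refusals / len(harmful_prompts)
```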
For this release, initial insights from red teaming indicate that the models may at times be mistaken about which company created them; ad-hoc training data were added to correct this behavior. Another insight was that, in function-calling scenarios, the models could sometimes hallucinate function names or URLs. The models may also be more susceptible to longer multi-turn jailbreak techniques across both English and non-English languages. These findings highlight the need for industry-wide investment in high-quality safety evaluation datasets across multiple languages, including low-resource languages, and across risk areas that account for cultural nuances where those languages are spoken.
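One common application-side mitigation for hallucinated function names is to validate every generated tool call against the schema the application actually declared; the sketch below assumes a `{"name": ..., "arguments": {...}}` JSON call format and a simple schema layout, both of which are hypothetical conventions rather than part of the Phi-4 API.

```python
import json

def validate_tool_call(raw_call: str, declared_tools: dict) -> bool:
    """Accept a model-generated call only if its function name and all of
    its argument names exist in the declared tool schema. The JSON call
    shape and schema layout are illustrative assumptions."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False  # malformed call
    tool = declared_tools.get(call.get("name"))
    if tool is None:
        return False  # hallucinated function name
    allowed = set(tool["parameters"])
    return set(call.get("arguments", {})) <= allowed  # no invented arguments
```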
To understand its capabilities, the 3.8B-parameter Phi-4-Mini model was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for the benchmark methodology). A high-level overview of model quality follows:
Popular Aggregated Benchmarks
| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Arena Hard | 32.8 | 34.4 | 17.0 | 26.9 | 32.0 | 55.5 | 37.3 | 25.7 | 42.7 | 43.7 | 75.0 |
| BigBench Hard CoT (0-shot) | 70.4 | 63.1 | 55.4 | 51.2 | 56.2 | 72.4 | 53.3 | 63.4 | 55.5 | 65.7 | 80.4 |
| MMLU (5-shot) | 67.3 | 65.5 | 61.8 | 60.8 | 65.0 | 72.6 | 63.0 | 68.1 | 65.0 | 71.3 | 77.2 |
| MMLU-Pro (0-shot, CoT) | 52.8 | 34.4 | 39.2 | 35.3 | 44.7 | 56.2 | 36.6 | 44.0 | 40.9 | 50.1 | 62.8 |
Reasoning
| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ARC Challenge (10-shot) | 83.7 | 47.4 | 76.1 | 80.3 | 82.6 | 90.1 | 82.7 | 83.1 | 79.4 | 89.8 | 93.5 |
| BoolQ (2-shot) | 81.2 | 84.6 | 71.4 | 79.4 | 65.4 | 80.0 | 80.5 | 82.8 | 79.0 | 85.7 | 88.7 |
| GPQA (0-shot, CoT) | 30.4 | 77.7 | 26.6 | 24.3 | 24.3 | 30.6 | 26.3 | 26.3 | 29.9 | 31.0 | 41.1 |
| HellaSwag (5-shot) | 69.1 | 25.2 | 69.0 | 77.2 | 74.6 | 80.1 | 80.9 | 73.5 | 72.8 | 80.9 | 87.1 |
| OpenBook QA (10-shot) | 79.2 | 72.2 | 72.6 | 79.8 | 77.6 | 86.0 | 80.2 | 84.8 | 79.8 | 89.6 | 90.0 |
| PIQA (5-shot) | 77.6 | 81.2 | 68.2 | 78.3 | 77.2 | 80.8 | 76.2 | 81.2 | 83.2 | 83.7 | 88.7 |
| Social IQA (5-shot) | 72.5 | 78.2 | 68.3 | 73.9 | 75.3 | 75.3 | 77.6 | 71.8 | 73.4 | 74.7 | 82.9 |
| TruthfulQA (MC2) (10-shot) | 66.4 | 75.1 | 59.2 | 62.9 | 64.3 | 69.4 | 63.0 | 69.2 | 64.1 | 76.6 | 78.2 |
| WinoGrande (5-shot) | 67.0 | 65.6 | 53.2 | 59.8 | 63.3 | 71.1 | 63.1 | 64.7 | 65.4 | 74.0 | 76.9 |
Multilingual
| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Multilingual MMLU (5-shot) | 49.3 | 51.8 | 48.1 | 46.4 | 55.9 | 64.4 | 53.7 | 56.2 | 54.5 | 63.8 | 72.9 |
| MGSM (0-shot, CoT) | 63.9 | 47.0 | 49.6 | 44.6 | 53.5 | 64.5 | 58.3 | 56.7 | 58.6 | 75.1 | 81.7 |
Math
| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM8K (8-shot, CoT) | 88.6 | 76.9 | 75.6 | 80.1 | 80.6 | 88.7 | 81.9 | 82.4 | 84.3 | 84.9 | 91.3 |
| MATH (0-shot, CoT) | 64.0 | 49.8 | 46.7 | 41.8 | 61.7 | 60.4 | 41.6 | 47.6 | 46.1 | 51.3 | 70.2 |
Long Context
| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qasper | 40.4 | 41.9 | 33.4 | 35.3 | 32.1 | 38.1 | 37.4 | 37.2 | 35.4 | 13.9 | 39.8 |
| SQuALITY | 22.8 | 25.3 | 25.7 | 25.5 | 25.3 | 23.8 | 24.9 | 26.2 | 26.7 | 23.6 | 23.8 |
Instruction Following
| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IFEval | 70.1 | 50.6 | 68.0 | 47.5 | 59.0 | 69.5 | 52.5 | 74.1 | 77.3 | 73.2 | 80.1 |
Function Calling
| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BFCL | 70.3 | 66.1 | 78.6 | 61.4 | 74.2 | 81.3 | 74.0 | 77.0 | 59.4 | 59.9 | 83.3 |
Coding
| Benchmark | Phi-4 Mini-Ins | Phi-3.5-Mini-Ins | Llama-3.2-3B-Ins | Ministral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Ministral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma 2-9B-It | GPT-4o-mini-2024-07-18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HumanEval (0-shot) | 74.4 | 70.1 | 62.8 | 72.0 | 72.0 | 75.0 | 70.7 | 66.5 | 62.8 | 63.4 | 86.6 |
| MBPP (3-shot) | 65.3 | 70.0 | 67.2 | 65.1 | 65.3 | 76.3 | 68.9 | 69.4 | 63.9 | 69.6 | 84.1 |
| Overall | 63.5 | 60.5 | 56.2 | 56.9 | 60.1 | 67.9 | 60.2 | 62.3 | 60.9 | 65.0 | 75.5 |
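The Overall row is consistent with an unweighted mean over the 23 individual benchmark scores; the quick check below reproduces the Phi-4 Mini-Ins figure under that simple-averaging assumption (see Appendix A for the authoritative methodology).

```python
# Sanity check: the Overall score matches the unweighted mean of the
# 23 benchmark scores, shown here for the Phi-4 Mini-Ins column.
phi4_mini_scores = [
    32.8, 70.4, 67.3, 52.8,                    # popular aggregated benchmarks
    83.7, 81.2, 30.4, 69.1, 79.2,              # reasoning
    77.6, 72.5, 66.4, 67.0,
    49.3, 63.9,                                # multilingual
    88.6, 64.0,                                # math
    40.4, 22.8,                                # long context
    70.1,                                      # instruction following (IFEval)
    70.3,                                      # function calling (BFCL)
    74.4, 65.3,                                # coding
]
print(round(sum(phi4_mini_scores) / len(phi4_mini_scores), 1))  # -> 63.5
```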
Overall, with only 3.8B parameters, the model achieves a level of multilingual language understanding and reasoning ability similar to that of much larger models. However, it is still fundamentally limited by its size for certain tasks: the model simply does not have the capacity to store extensive factual knowledge, so users may encounter factual inaccuracies. This weakness may be mitigated by augmenting Phi-4-Mini with a search engine, particularly when using the model under RAG (retrieval-augmented generation) settings.
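The sketch below illustrates the RAG pattern mentioned above: retrieved passages are prepended to the prompt so the model can ground its answer in external text rather than parametric knowledge. The `search` retriever, `generate` call, and prompt template are hypothetical placeholders, not a prescribed Phi-4-Mini integration.

```python
# Minimal RAG sketch; search() could be a web or vector-store query.
def answer_with_rag(model, search, question: str, k: int = 3) -> str:
    passages = search(question, top_k=k)   # retrieve supporting evidence
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return model.generate(prompt)          # answer grounded in retrieved text
```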