Phi-4-multimodal-instruct
The Phi-4 family of models has adopted a robust safety post-training approach that leverages a variety of both open-source and in-house generated datasets. Safety alignment combines SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback), utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focused on helpfulness and harmlessness as well as questions and answers targeting multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech Safety datasets were generated by running the Text Safety datasets through the Azure TTS (Text-To-Speech) service, for both English and non-English languages. Vision (text & images) Safety datasets were created to cover harm categories identified in both public and internal multi-modal RAI datasets.
Various evaluation techniques, including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets, were leveraged to evaluate the Phi-4 models' propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of any single approach. Findings across the various evaluation methods indicate that the safety post-training detailed in the Phi 3 Safety Post-Training paper had a positive impact across multiple languages and risk categories, as observed through refusal rates (refusal to output undesirable content) and robustness to jailbreak techniques. Details on prior red team evaluations across Phi models can be found in the Phi 3 Safety Post-Training paper. For this release, the red teaming effort focused on the newest Audio input modality and on the following safety areas: harmful content, self-injury risks, and exploits. The model was found to be more susceptible to providing undesirable outputs when attacked with context manipulation or persuasive techniques. These findings applied to all languages, with the persuasive techniques mostly affecting French and Italian. This highlights the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low-resource languages.
To assess model safety in scenarios involving both text and images, Microsoft's Azure AI Evaluation SDK was utilized. This tool facilitates the simulation of single-turn conversations with the target model by providing prompt text and images designed to incite harmful responses. The target model's responses are subsequently evaluated by a capable model across multiple harm categories, including violence, sexual content, self-harm, and hateful and unfair content, with each response scored based on the severity of the harm identified. The evaluation results were compared with those of Phi-3.5-Vision and open-source models of comparable size. In addition, we ran both an internal and the public RTVLM and VLGuard multi-modal (text & vision) RAI benchmarks, once again comparing scores with Phi-3.5-Vision and open-source models of comparable size. However, the model may remain susceptible to attack prompts tailored to specific languages and cultural contexts.
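As an illustration of this text-and-image safety scoring flow, the sketch below uses the publicly available azure-ai-evaluation package. The project wiring, placeholder credentials, and the choice of ContentSafetyEvaluator are assumptions for illustration; the internal pipeline described above may differ, and exact class or argument names may vary by SDK version.

```python
# Minimal sketch of harm scoring in the style described above, using the public
# azure-ai-evaluation package. Project details and the evaluated query/response
# pair are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ContentSafetyEvaluator

azure_ai_project = {
    "subscription_id": "<subscription-id>",      # placeholder
    "resource_group_name": "<resource-group>",   # placeholder
    "project_name": "<ai-project-name>",         # placeholder
}

# ContentSafetyEvaluator covers the harm categories listed above:
# violence, sexual content, self-harm, and hateful/unfair content.
safety_evaluator = ContentSafetyEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

# In the actual evaluation, the query/response pair comes from a simulated
# single-turn conversation in which the prompt (text plus an image) is
# designed to elicit a harmful response from the target model.
scores = safety_evaluator(
    query="Describe what is happening in this image.",
    response="<target model output to be scored>",
)
print(scores)  # per-category severity labels and scores
```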
In addition to extensive red teaming, the safety of the model was assessed through three distinct evaluations. First, as with the text and vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to Speech prompts. Second, Microsoft's Speech Fairness evaluation was run to verify that Speech-To-Text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, medical condition, etc.) from the voice of a user.
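The exact system message used for this mitigation is not reproduced here; the sketch below only shows the general shape of such an instruction prepended to a speech prompt. The wording of the instruction and the use of a system turn with `<|system|>`/`<|end|>` tokens are illustrative assumptions that should be checked against the model's published chat format.

```python
# Hypothetical sketch of a system-message mitigation: instruct the model not to
# infer sensitive attributes from a speaker's voice. The instruction text and
# the system-turn formatting are illustrative assumptions, not the exact
# mitigation evaluated above.
mitigation_system_message = (
    "You are a helpful assistant. Do not infer or comment on sensitive "
    "attributes of the speaker (such as gender, sexual orientation, "
    "profession, or medical condition) based on their voice. If asked to do "
    "so, politely decline."
)

user_question = "Transcribe the audio and answer the question it contains."

# Assumed Phi-4-multimodal-style chat formatting with an <|audio_1|> placeholder;
# verify against the model's chat template before use.
prompt = (
    f"<|system|>{mitigation_system_message}<|end|>"
    f"<|user|><|audio_1|>{user_question}<|end|>"
    f"<|assistant|>"
)
```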
To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform. Users can refer to the Phi-4-Mini model card for details of language benchmarks. Below is a high-level overview of the model quality on representative speech and vision benchmarks:
Phi-4-multimodal-instruct demonstrated strong performance in speech tasks (a minimal ASR inference sketch follows this list):
- Surpassed the expert ASR model WhisperV3 and the ST model SeamlessM4T-v2-Large on automatic speech recognition (ASR) and speech translation (ST).
- Ranked number 1 on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, compared with 6.5% for the next-best model, as of February 18, 2025.
- First open-source model capable of performing speech summarization, with performance close to GPT-4o.
- Exhibited a gap relative to closed models such as Gemini-2.0-Flash and GPT-4o-realtime-preview on the speech QA task. Efforts are ongoing to improve this capability in future iterations.
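For reference, a minimal ASR inference sketch is shown below, adapted from the published usage pattern for this model. The audio file path, the `audios=[(array, sampling_rate)]` keyword, and the generation settings are assumptions to verify against the model card's own example code.

```python
# Minimal ASR sketch for Phi-4-multimodal-instruct via transformers.
# The audio path and some processor/model arguments are assumptions; the
# official example additionally sets _attn_implementation='flash_attention_2'.
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
generation_config = GenerationConfig.from_pretrained(model_path)

# Chat format with an audio placeholder, as used in the model's examples.
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

audio, sample_rate = sf.read("example_speech.wav")  # placeholder audio file
inputs = processor(
    text=prompt,
    audios=[(audio, sample_rate)],
    return_tensors="pt",
).to(model.device)

generate_ids = model.generate(
    **inputs, max_new_tokens=512, generation_config=generation_config
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(transcript)

# Word error rate against a reference transcript (requires `pip install jiwer`):
# import jiwer; print(jiwer.wer(reference_text, transcript))
```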
Phi-4-multimodal-instruct can process both image and audio together. The table below shows the model quality on chart/table understanding and document reasoning tasks when the query about the vision content is provided as synthetic speech. Compared to other state-of-the-art omni models, Phi-4-multimodal-instruct achieves stronger performance on multiple benchmarks.
Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
---|---|---|---|---|---|
s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
Average | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |
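To mirror the speech-query-over-vision setup evaluated in the table above, the following sketch passes an image and a spoken question together. It assumes `model`, `processor`, and `generation_config` are loaded as in the ASR sketch earlier; the image and audio file names are placeholders.

```python
# Vision + speech sketch: a chart image plus a spoken question about it.
# Assumes `model`, `processor`, and `generation_config` from the ASR sketch
# above; file names are placeholders.
import soundfile as sf
from PIL import Image

prompt = "<|user|><|image_1|><|audio_1|><|end|><|assistant|>"

image = Image.open("example_chart.png")              # placeholder chart image
audio, sample_rate = sf.read("spoken_question.wav")  # placeholder spoken query

inputs = processor(
    text=prompt,
    images=image,
    audios=[(audio, sample_rate)],
    return_tensors="pt",
).to(model.device)

generate_ids = model.generate(
    **inputs, max_new_tokens=512, generation_config=generation_config
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
answer = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(answer)
```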
To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. Below is a high-level overview of the model quality on representative benchmarks:
Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen 2.5-VL-3B-ins | InternVL 2.5-4B | Qwen 2.5-VL-7B-ins | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | GPT-4o-2024-11-20 |
---|---|---|---|---|---|---|---|---|---|---|
Popular aggregated benchmark | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
MMMU-Pro (std / vision) | 38.5 | 21.8 | 29.9 | 32.4 | 38.7 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
Video MME (16 frames) | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
Average | 72.0 | 60.9 | 68.7 | 68.8 | 73.3 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |
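As a complement to the zero-shot vision results above, a minimal single-image query sketch is shown below, again assuming `model`, `processor`, and `generation_config` are loaded as in the earlier ASR sketch; the image file and question are placeholders.

```python
# Single-image vision sketch (e.g., a ChartQA/DocVQA-style query).
# Assumes `model`, `processor`, and `generation_config` from the ASR sketch.
from PIL import Image

prompt = "<|user|><|image_1|>What is the highest value shown in this chart?<|end|><|assistant|>"

image = Image.open("example_chart.png")  # placeholder image

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
generate_ids = model.generate(
    **inputs, max_new_tokens=256, generation_config=generation_config
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```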
Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much larger models on multi-frame capabilities. BLINK is an aggregated benchmark of 14 visual tasks that humans can solve very quickly but that remain hard for current multimodal LLMs.
Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | GPT-4o-2024-11-20 |
---|---|---|---|---|---|---|---|---|---|
Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
Object Localization | 52.5 | 55.7 | 53.3 | 55.7 | 58.2 | 63.9 | 67.2 | 58.2 | 65.6 |
Relative Depth | 69.4 | 68.5 | 68.5 | 80.6 | 76.6 | 81.5 | 72.6 | 66.1 | 73.4 |
Relative Reflectance | 26.9 | 38.8 | 38.8 | 32.8 | 38.8 | 33.6 | 34.3 | 38.1 | 38.1 |
Semantic Correspondence | 52.5 | 32.4 | 33.8 | 28.8 | 24.5 | 56.1 | 55.4 | 43.9 | 47.5 |
Spatial Relation | 72.7 | 80.4 | 86.0 | 88.8 | 86.7 | 74.1 | 79.0 | 74.8 | 83.2 |
Visual Correspondence | 67.4 | 28.5 | 39.5 | 50.0 | 44.2 | 84.9 | 91.3 | 72.7 | 82.6 |
Visual Similarity | 86.7 | 67.4 | 88.1 | 87.4 | 85.2 | 87.4 | 80.7 | 79.3 | 83.0 |
Overall | 61.3 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
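For the multi-image setting evaluated above, the sketch below passes two images with numbered placeholders in a single prompt. It assumes `model`, `processor`, and `generation_config` from the ASR sketch; passing a list of images with `<|image_1|>`/`<|image_2|>` tags is an assumption to verify against the model's official multi-frame example.

```python
# Multi-image sketch: compare two images in a single prompt.
# Assumes `model`, `processor`, and `generation_config` from the ASR sketch;
# the list-of-images input and numbered <|image_k|> tags are assumptions.
from PIL import Image

prompt = (
    "<|user|><|image_1|><|image_2|>"
    "What are the main differences between these two images?"
    "<|end|><|assistant|>"
)

images = [Image.open("frame_1.png"), Image.open("frame_2.png")]  # placeholders

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
generate_ids = model.generate(
    **inputs, max_new_tokens=256, generation_config=generation_config
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```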