Phi-4-multimodal-instruct

Safety Approach

The Phi-4 family of models has adopted a robust safety post-training approach that leverages a variety of open-source and in-house generated datasets. Safety alignment combines SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback), using human-labeled and synthetic English-language datasets, including publicly available datasets focused on helpfulness and harmlessness as well as question-and-answer sets targeting multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech safety datasets were generated by running the text safety datasets through the Azure TTS (Text-To-Speech) service, for both English and non-English languages. Vision (text & image) safety datasets were created to cover harm categories identified in both public and internal multi-modal RAI datasets.
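
For reference, the DPO component of such a recipe optimizes a preference objective of the standard form below. This is the generic DPO loss, not a Phi-specific formulation: y_w and y_l are the preferred (safer) and rejected responses for a prompt x drawn from the preference dataset D, pi_ref is the frozen reference policy, and beta controls how far the tuned policy may drift from it.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```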

Safety Evaluation and Red-Teaming

Various evaluation techniques, including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets, were leveraged to evaluate Phi-4 models' propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of any one approach alone. Findings across the various evaluation methods indicate that the safety post-training described in the Phi-3 Safety Post-Training paper had a positive impact across multiple languages and risk categories, as observed in refusal rates (refusals to produce undesirable outputs) and robustness to jailbreak techniques. Details on prior red-team evaluations across Phi models can be found in the Phi-3 Safety Post-Training paper. For this release, the red-teaming effort focused on the newest audio input modality and on the following safety areas: harmful content, self-injury risks, and exploits. The model was found to be more susceptible to producing undesirable outputs when attacked with context manipulation or persuasive techniques. These findings applied across all languages, with the persuasive techniques mostly affecting French and Italian. This highlights the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low-resource languages.

Vision Safety Evaluation

To assess model safety in scenarios involving both text and images, Microsoft's Azure AI Evaluation SDK was utilized. This tool facilitates the simulation of single-turn conversations with the target model by providing prompt text and images designed to incite harmful responses. The target model's responses are subsequently evaluated by a capable model across multiple harm categories, including violence, sexual content, self-harm, and hateful and unfair content, with each response scored based on the severity of the harm identified. The evaluation results were compared with those of Phi-3.5-Vision and open-source models of comparable size. In addition, we ran both an internal benchmark and the public RTVLM and VLGuard multi-modal (text & vision) RAI benchmarks, once again comparing scores with Phi-3.5-Vision and open-source models of comparable size. However, the model may be susceptible to language-specific attack prompts and cultural context.
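
For illustration, below is a minimal sketch of this kind of single-turn harm scoring, assuming the public azure-ai-evaluation Python package and a hypothetical query_phi4_multimodal helper standing in for the deployed target model. Class and argument names follow the public SDK but may differ by version; this is not the exact internal pipeline described above.

```python
# Minimal sketch of single-turn harm scoring with Microsoft's azure-ai-evaluation
# package (pip install azure-ai-evaluation azure-identity). Class and argument
# names follow the public SDK but may differ by version.
from azure.ai.evaluation import ContentSafetyEvaluator
from azure.identity import DefaultAzureCredential

# Azure AI project that hosts the safety evaluation service (placeholders).
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<ai-project-name>",
}

# Scores a response for violence, sexual content, self-harm, and
# hate/unfairness, each with a severity level -- the harm categories above.
content_safety = ContentSafetyEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

def query_phi4_multimodal(prompt_text: str, image_path: str) -> str:
    """Hypothetical helper: send a text + image prompt to the deployed
    target model and return its text response. Replace with a real client."""
    raise NotImplementedError

def score_response(prompt_text: str, image_path: str) -> dict:
    response = query_phi4_multimodal(prompt_text, image_path)
    # Simplification: only the text portion of the prompt is passed as the
    # query; the real pipeline also feeds the adversarial image to the model.
    return content_safety(query=prompt_text, response=response)
```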

Audio Safety Evaluation

In addition to extensive red teaming, the safety of the model was assessed through three distinct evaluations. First, as with text and vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to speech prompts. Second, Microsoft's Speech Fairness evaluation was run to verify that speech-to-text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, or medical condition) from the voice of a user.
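
For illustration, the sketch below shows how such a system-message mitigation could be applied to an audio turn, assuming the Hugging Face transformers checkpoint and its prompt tokens (<|system|>, <|user|>, <|audio_1|>, <|end|>, <|assistant|>). The exact system-message wording used in the evaluation is not published, so the text here is only an example.

```python
# Sketch of the system-message mitigation on an audio turn, assuming the
# Hugging Face checkpoint microsoft/Phi-4-multimodal-instruct. Token names,
# processor arguments, and the system-message wording are assumptions.
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)
generation_config = GenerationConfig.from_pretrained(model_path)

# Illustrative mitigation: instruct the model not to infer sensitive
# attributes from the speaker's voice.
system_msg = (
    "Do not infer or comment on the speaker's gender, age, ethnicity, health, "
    "profession, or other personal attributes from their voice; answer only "
    "the question that is asked."
)

audio, sr = sf.read("user_question.wav")  # any mono speech clip
prompt = (
    f"<|system|>{system_msg}<|end|>"
    f"<|user|><|audio_1|>Please answer the question in the audio.<|end|>"
    f"<|assistant|>"
)
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, generation_config=generation_config)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```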

Model Quality

To understand its capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform. Users can refer to the Phi-4-Mini model card for details of the language benchmarks. Below is a high-level overview of model quality on representative speech and vision benchmarks:

Speech Benchmarks

Phi-4-multimodal-instruct demonstrated strong performance in speech tasks:

  • Surpassed the expert ASR model WhisperV3 and the ST model SeamlessM4T-v2-Large in automatic speech recognition (ASR) and speech translation (ST).
  • Ranked number 1 on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, compared to 6.5% for the next best model, as of February 18, 2025 (a short WER example follows this list).
  • First open-source model capable of performing speech summarization, with performance close to GPT-4o.
  • Trails closed models such as Gemini-2.0-Flash and GPT-4o-realtime-preview on the speech QA task. Efforts are ongoing to improve this capability in future iterations.
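
For context, word error rate is the edit distance between a hypothesis transcript and the reference, normalized by the number of reference words. The sketch below reproduces the metric with the jiwer package; it is illustrative only and not the leaderboard's own scoring code.

```python
# Illustrative word-error-rate (WER) computation with the jiwer package
# (pip install jiwer); not the OpenASR leaderboard's scoring pipeline.
# WER = (substitutions + deletions + insertions) / words in the reference.
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # 2 substitutions / 9 reference words
print(f"WER = {wer:.2%}")               # -> WER = 22.22%
```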

Vision Benchmarks

Vision-Speech Tasks

Phi-4-multimodal-instruct can process image and audio inputs together. The table below shows model quality on chart/table understanding and document reasoning tasks when the query about the visual content is provided as synthetic speech. Compared to other state-of-the-art omni models, Phi-4-multimodal-instruct achieves stronger performance on multiple benchmarks.

| Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
|---|---|---|---|---|---|
| s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
| s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
| s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
| s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
| Average | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |

Vision Tasks

To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. Below is a high-level overview of the model quality on representative benchmarks:

| Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen2.5-VL-3B-ins | InternVL 2.5-4B | Qwen2.5-VL-7B-ins | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-0205 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Popular aggregated benchmark | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
| MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
| MMMU-Pro (std / vision) | 38.5 | 21.8 | 29.9 | 32.4 | 38.7 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
| ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
| MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
| InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
| AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
| ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
| DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
| InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
| TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
| OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
| POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
| BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
| Video MME (16 frames) | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
| Average | 72.0 | 60.9 | 68.7 | 68.8 | 73.3 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |

Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much bigger models on multi-frame capabilities. BLINK is an aggregated benchmark of 14 visual tasks that humans can solve very quickly but that remain hard for current multimodal LLMs.

| Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|
| Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
| Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
| Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
| Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
| IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
| Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
| Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
| Object Localization | 52.5 | 55.7 | 53.3 | 55.7 | 58.2 | 63.9 | 67.2 | 58.2 | 65.6 |
| Relative Depth | 69.4 | 68.5 | 68.5 | 80.6 | 76.6 | 81.5 | 72.6 | 66.1 | 73.4 |
| Relative Reflectance | 26.9 | 38.8 | 38.8 | 32.8 | 38.8 | 33.6 | 34.3 | 38.1 | 38.1 |
| Semantic Correspondence | 52.5 | 32.4 | 33.8 | 28.8 | 24.5 | 56.1 | 55.4 | 43.9 | 47.5 |
| Spatial Relation | 72.7 | 80.4 | 86.0 | 88.8 | 86.7 | 74.1 | 79.0 | 74.8 | 83.2 |
| Visual Correspondence | 67.4 | 28.5 | 39.5 | 50.0 | 44.2 | 84.9 | 91.3 | 72.7 | 82.6 |
| Visual Similarity | 86.7 | 67.4 | 88.1 | 87.4 | 85.2 | 87.4 | 80.7 | 79.3 | 83.0 |
| Overall | 61.3 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |

About

First small multimodal model to have 3 modality inputs (text, audio, image), excelling in quality and efficiency
Context: 128k input · 4k output
Training date: Jun 2024

Languages (23): Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian