Phi-4-multimodal-instruct

Safety Approach

The Phi-4 family of models has adopted a robust safety post-training approach that leverages a variety of open-source and in-house generated datasets. Safety alignment combines SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback), using human-labeled and synthetic English-language datasets, including publicly available datasets focused on helpfulness and harmlessness as well as question-and-answer sets targeting multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech safety datasets were generated by running the text safety datasets through the Azure TTS (Text-To-Speech) service, for both English and non-English languages. Vision (text & image) safety datasets were created to cover harm categories identified in both public and internal multi-modal RAI datasets.
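
For reference, the DPO component of such a recipe optimizes a preference objective of the standard form below. This is the generic DPO loss, not a Phi-specific formulation: y_w and y_l are the preferred (safer) and rejected responses for a prompt x drawn from the preference dataset D, pi_ref is the frozen reference policy, and beta controls how far the tuned policy may drift from it.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```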

Safety Evaluation and Red-Teaming

Various evaluation techniques, including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets, were leveraged to evaluate Phi-4 models' propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of any one approach alone. Findings across the various evaluation methods indicate that the safety post-training described in the Phi-3 Safety Post-Training paper had a positive impact across multiple languages and risk categories, as observed in refusal rates (refusals to produce undesirable outputs) and robustness to jailbreak techniques. Details on prior red-team evaluations across Phi models can be found in the Phi-3 Safety Post-Training paper. For this release, the red-teaming effort focused on the newest audio input modality and on the following safety areas: harmful content, self-injury risks, and exploits. The model was found to be more susceptible to producing undesirable outputs when attacked with context manipulation or persuasive techniques. These findings applied across all languages, with the persuasive techniques mostly affecting French and Italian. This highlights the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low-resource languages.

Vision Safety Evaluation

To assess model safety in scenarios involving both text and images, Microsoft's Azure AI Evaluation SDK was utilized. This tool facilitates the simulation of single-turn conversations with the target model by providing prompt text and images designed to incite harmful responses. The target model's responses are subsequently evaluated by a capable model across multiple harm categories, including violence, sexual content, self-harm, and hateful and unfair content, with each response scored based on the severity of the harm identified. The evaluation results were compared with those of Phi-3.5-Vision and open-source models of comparable size. In addition, we ran both an internal benchmark and the public RTVLM and VLGuard multi-modal (text & vision) RAI benchmarks, once again comparing scores with Phi-3.5-Vision and open-source models of comparable size. However, the model may be susceptible to language-specific attack prompts and cultural context.
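
For illustration, below is a minimal sketch of this kind of single-turn harm scoring, assuming the public azure-ai-evaluation Python package and a hypothetical query_phi4_multimodal helper standing in for the deployed target model. Class and argument names follow the public SDK but may differ by version; this is not the exact internal pipeline described above.

```python
# Minimal sketch of single-turn harm scoring with Microsoft's azure-ai-evaluation
# package (pip install azure-ai-evaluation azure-identity). Class and argument
# names follow the public SDK but may differ by version.
from azure.ai.evaluation import ContentSafetyEvaluator
from azure.identity import DefaultAzureCredential

# Azure AI project that hosts the safety evaluation service (placeholders).
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<ai-project-name>",
}

# Scores a response for violence, sexual content, self-harm, and
# hate/unfairness, each with a severity level -- the harm categories above.
content_safety = ContentSafetyEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

def query_phi4_multimodal(prompt_text: str, image_path: str) -> str:
    """Hypothetical helper: send a text + image prompt to the deployed
    target model and return its text response. Replace with a real client."""
    raise NotImplementedError

def score_response(prompt_text: str, image_path: str) -> dict:
    response = query_phi4_multimodal(prompt_text, image_path)
    # Simplification: only the text portion of the prompt is passed as the
    # query; the real pipeline also feeds the adversarial image to the model.
    return content_safety(query=prompt_text, response=response)
```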

Audio Safety Evaluation

In addition to extensive red teaming, the safety of the model was assessed through three distinct evaluations. First, as with text and vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to speech prompts. Second, Microsoft's Speech Fairness evaluation was run to verify that speech-to-text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, or medical condition) from the voice of a user.
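
For illustration, the sketch below shows how such a system-message mitigation could be applied to an audio turn, assuming the Hugging Face transformers checkpoint and its prompt tokens (<|system|>, <|user|>, <|audio_1|>, <|end|>, <|assistant|>). The exact system-message wording used in the evaluation is not published, so the text here is only an example.

```python
# Sketch of the system-message mitigation on an audio turn, assuming the
# Hugging Face checkpoint microsoft/Phi-4-multimodal-instruct. Token names,
# processor arguments, and the system-message wording are assumptions.
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)
generation_config = GenerationConfig.from_pretrained(model_path)

# Illustrative mitigation: instruct the model not to infer sensitive
# attributes from the speaker's voice.
system_msg = (
    "Do not infer or comment on the speaker's gender, age, ethnicity, health, "
    "profession, or other personal attributes from their voice; answer only "
    "the question that is asked."
)

audio, sr = sf.read("user_question.wav")  # any mono speech clip
prompt = (
    f"<|system|>{system_msg}<|end|>"
    f"<|user|><|audio_1|>Please answer the question in the audio.<|end|>"
    f"<|assistant|>"
)
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, generation_config=generation_config)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```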

Model Quality

To understand its capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform. Users can refer to the Phi-4-Mini model card for details of the language benchmarks. Below is a high-level overview of model quality on representative speech and vision benchmarks:

Speech Benchmarks

Phi-4-multimodal-instruct demonstrated strong performance in speech tasks:

  • Surpassed the expert ASR model WhisperV3 and the ST model SeamlessM4T-v2-Large in automatic speech recognition (ASR) and speech translation (ST).
  • Ranked number 1 on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, compared to 6.5% for the next best model, as of February 18, 2025 (a short WER example follows this list).
  • First open-source model capable of performing speech summarization, with performance close to GPT-4o.
  • Trails closed models such as Gemini-2.0-Flash and GPT-4o-realtime-preview on the speech QA task. Efforts are ongoing to improve this capability in future iterations.
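
For context, word error rate is the edit distance between a hypothesis transcript and the reference, normalized by the number of reference words. The sketch below reproduces the metric with the jiwer package; it is illustrative only and not the leaderboard's own scoring code.

```python
# Illustrative word-error-rate (WER) computation with the jiwer package
# (pip install jiwer); not the OpenASR leaderboard's scoring pipeline.
# WER = (substitutions + deletions + insertions) / words in the reference.
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # 2 substitutions / 9 reference words
print(f"WER = {wer:.2%}")               # -> WER = 22.22%
```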

Vision Benchmarks

Vision-Speech Tasks

Phi-4-multimodal-instruct can process image and audio inputs together. The table below shows model quality on chart/table understanding and document reasoning tasks when the query about the visual content is provided as synthetic speech. Compared to other state-of-the-art omni models, Phi-4-multimodal-instruct achieves stronger performance on multiple benchmarks.

| Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
|---|---|---|---|---|---|
| s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
| s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
| s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
| s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
| Average | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |

Vision Tasks

To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. Below is a high-level overview of the model quality on representative benchmarks:

| Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen2.5-VL-3B-ins | InternVL 2.5-4B | Qwen2.5-VL-7B-ins | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-0205 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Popular aggregated benchmark | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
| MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
| MMMU-Pro (std / vision) | 38.5 | 21.8 | 29.9 | 32.4 | 38.7 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
| ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
| MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
| InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
| AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
| ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
| DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
| InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
| TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
| OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
| POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
| BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
| Video MME (16 frames) | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
| Average | 72.0 | 60.9 | 68.7 | 68.8 | 73.3 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |

Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much bigger models on multi-frame capabilities. BLINK is an aggregated benchmark of 14 visual tasks that humans can solve very quickly but that remain hard for current multimodal LLMs.

| Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|
| Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
| Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
| Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
| Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
| IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
| Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
| Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
| Object Localization | 52.5 | 55.7 | 53.3 | 55.7 | 58.2 | 63.9 | 67.2 | 58.2 | 65.6 |
| Relative Depth | 69.4 | 68.5 | 68.5 | 80.6 | 76.6 | 81.5 | 72.6 | 66.1 | 73.4 |
| Relative Reflectance | 26.9 | 38.8 | 38.8 | 32.8 | 38.8 | 33.6 | 34.3 | 38.1 | 38.1 |
| Semantic Correspondence | 52.5 | 32.4 | 33.8 | 28.8 | 24.5 | 56.1 | 55.4 | 43.9 | 47.5 |
| Spatial Relation | 72.7 | 80.4 | 86.0 | 88.8 | 86.7 | 74.1 | 79.0 | 74.8 | 83.2 |
| Visual Correspondence | 67.4 | 28.5 | 39.5 | 50.0 | 44.2 | 84.9 | 91.3 | 72.7 | 82.6 |
| Visual Similarity | 86.7 | 67.4 | 88.1 | 87.4 | 85.2 | 87.4 | 80.7 | 79.3 | 83.0 |
| Overall | 61.3 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |

About

First small multimodal model to have 3 modality inputs (text, audio, image), excelling in quality and efficiency
Context: 128k input · 4k output
Training date: Jun 2024

Languages (23): Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian