Phi-4-multimodal-instruct

Microsoft

Phi-4-multimodal-instruct is a lightweight, open multimodal foundation model that builds on the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4.0 models. The model processes text, image, and audio inputs and generates text outputs, with a 128K-token context length. It was further refined through supervised fine-tuning and direct preference optimization to support precise instruction adherence and safety measures.

Phi-4-multimodal-instruct is a 5.6B-parameter multimodal transformer. It uses the pretrained Phi-4-mini as the backbone language model, extended with vision and speech encoders and adapters. The model was trained on 5T text tokens, 2.3M hours of speech, and 1.1T image-text tokens. It is a static model trained on offline datasets, with a cutoff date of June 2024 for publicly available data. The supported languages for each modality are:

  • Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
  • Image: English
  • Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
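As a rough illustration of how the text and image modalities combine in a single prompt, here is a minimal inference sketch using Hugging Face transformers. The model id, the chat-style prompt tokens, and the image placeholder are assumptions based on this card rather than a verified recipe; consult the official model repository for the exact prompt format and processor usage.

```python
# Minimal sketch: image + text input, text output.
# Assumptions: model id "microsoft/Phi-4-multimodal-instruct" and the
# <|user|>/<|image_1|>/<|end|>/<|assistant|> prompt tokens; verify against
# the official model card before relying on this.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed Hugging Face model id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Assumed chat format: an image placeholder followed by the user question.
prompt = "<|user|><|image_1|>Describe this picture.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

Audio-only or audio-plus-image prompts follow the same pattern, with the audio waveform passed to the processor alongside the text; keep generation within the 4K output-token limit noted below.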

About

First small multimodal model to accept three input modalities (text, audio, image), excelling in quality and efficiency.

Context: 128K input tokens · 4K output tokens
Training date: June 2024

Languages (23): Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian