Phi-4-multimodal-instruct

Microsoft

Phi-4-multimodal-instruct is a lightweight, open multimodal foundation model that builds on the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4.0 models. The model processes text, image, and audio inputs and generates text outputs, with a 128K-token context length. It was further refined through supervised fine-tuning and direct preference optimization to support precise instruction adherence and safety measures.

Phi-4-multimodal-instruct is a 5.6B-parameter multimodal transformer. It uses the pretrained Phi-4-mini as the backbone language model, extended with vision and speech encoders and adapters. The model was trained on 5T text tokens, 2.3M hours of speech, and 1.1T image-text tokens. It is a static model trained on offline datasets, with a cutoff date of June 2024 for publicly available data. The supported languages for each modality are:

  • Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
  • Image: English
  • Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
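As a rough illustration of how the text and image modalities combine in a single prompt, here is a minimal inference sketch using Hugging Face transformers. The model id, the chat-style prompt tokens, and the image placeholder are assumptions based on this card rather than a verified recipe; consult the official model repository for the exact prompt format and processor usage.

```python
# Minimal sketch: image + text input, text output.
# Assumptions: model id "microsoft/Phi-4-multimodal-instruct" and the
# <|user|>/<|image_1|>/<|end|>/<|assistant|> prompt tokens; verify against
# the official model card before relying on this.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed Hugging Face model id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Assumed chat format: an image placeholder followed by the user question.
prompt = "<|user|><|image_1|>Describe this picture.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

Audio-only or audio-plus-image prompts follow the same pattern, with the audio waveform passed to the processor alongside the text; keep generation within the 4K output-token limit noted below.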

About

First small multimodal model to accept three input modalities (text, audio, image), excelling in quality and efficiency.

Context: 128K input tokens · 4K output tokens
Training date: June 2024

Languages (23): Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian