Llama-3.2-90B-Vision-Instruct

Llama 3.2-Vision is a collection of pretrained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. They outperform many of the available open-source and closed multimodal models on common industry benchmarks.
Model Developer: Meta
Llama 3.2-Vision is built on top of the Llama 3.1 text-only model, an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, Llama 3.2-Vision uses a separately trained vision adapter that integrates with the pretrained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
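The sketch below shows one way to run the instruction-tuned model on an image+text prompt, assuming the Hugging Face transformers Mllama integration (transformers >= 4.45); the image URL is a placeholder and multiple GPUs are assumed for the 90B weights.

```python
# Minimal sketch: image+text inference via the Hugging Face transformers
# Mllama integration (assumed available in transformers >= 4.45).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard the 90B weights across available GPUs
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; substitute your own.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Interleave the image with a text prompt using the chat template.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image, input_text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```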
| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2-Vision | (Image, text) pairs | 11B (10.6) | Text + Image | Text | 128k* | Yes | 6B (image, text) pairs | December 2023 |
| Llama 3.2-Vision | (Image, text) pairs | 90B (88.8) | Text + Image | Text | 128k* | Yes | 6B (image, text) pairs | December 2023 |
* Note: Serverless APIs on Azure AI currently only support 8K context length.
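For a serverless deployment on Azure AI, requests follow the standard chat-completions pattern. Below is a minimal sketch assuming the azure-ai-inference Python package; the endpoint URL, API key, and image path are placeholders, and prompt plus completion should stay within the 8K limit noted above.

```python
# Minimal sketch: querying a Llama 3.2-Vision serverless deployment on Azure AI.
# Assumes the azure-ai-inference package; endpoint, key, and image are placeholders.
import base64
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
    ImageContentItem, ImageUrl, TextContentItem, UserMessage,
)
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.inference.ai.azure.com",  # placeholder
    credential=AzureKeyCredential("<your-api-key>"),              # placeholder
)

# Encode a local image as a data URL for the image content item.
with open("photo.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.complete(
    messages=[
        UserMessage(content=[
            TextContentItem(text="Describe this image in one sentence."),
            ImageContentItem(image_url=ImageUrl(url=data_url)),
        ]),
    ],
    max_tokens=256,  # keep prompt + completion within the 8K serverless limit
)
print(response.choices[0].message.content)
```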
Supported Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these eight supported languages. Note that for image+text applications, English is the only supported language.
Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.
Overview: Llama 3.2-Vision was pretrained on 6B image and text pairs. The instruction tuning data includes publicly available vision instruction datasets, as well as over 3M synthetically generated examples.
Data Freshness: The pretraining data has a cutoff of December 2023.