Llama-3.2-90B-Vision-Instruct

Llama 3.2-Vision is a collection of pretrained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. They outperform many of the available open-source and closed multimodal models on common industry benchmarks.
Model Developer: Meta
Llama 3.2-Vision is built on top of the Llama 3.1 text-only model, an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, Llama 3.2-Vision uses a separately trained vision adapter that integrates with the pretrained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
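The sketch below shows one way to run the instruction-tuned model on an image+text prompt, assuming the Hugging Face transformers Mllama integration (transformers >= 4.45); the image URL is a placeholder and multiple GPUs are assumed for the 90B weights.

```python
# Minimal sketch: image+text inference via the Hugging Face transformers
# Mllama integration (assumed available in transformers >= 4.45).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard the 90B weights across available GPUs
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; substitute your own.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Interleave the image with a text prompt using the chat template.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image, input_text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```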
| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2-Vision | (Image, text) pairs | 11B (10.6) | Text + Image | Text | 128k* | Yes | 6B (image, text) pairs | December 2023 |
| Llama 3.2-Vision | (Image, text) pairs | 90B (88.8) | Text + Image | Text | 128k* | Yes | 6B (image, text) pairs | December 2023 |
* Note: Serverless APIs on Azure AI currently only support 8K context length.
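For a serverless deployment on Azure AI, requests follow the standard chat-completions pattern. Below is a minimal sketch assuming the azure-ai-inference Python package; the endpoint URL, API key, and image path are placeholders, and prompt plus completion should stay within the 8K limit noted above.

```python
# Minimal sketch: querying a Llama 3.2-Vision serverless deployment on Azure AI.
# Assumes the azure-ai-inference package; endpoint, key, and image are placeholders.
import base64
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
    ImageContentItem, ImageUrl, TextContentItem, UserMessage,
)
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.inference.ai.azure.com",  # placeholder
    credential=AzureKeyCredential("<your-api-key>"),              # placeholder
)

# Encode a local image as a data URL for the image content item.
with open("photo.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.complete(
    messages=[
        UserMessage(content=[
            TextContentItem(text="Describe this image in one sentence."),
            ImageContentItem(image_url=ImageUrl(url=data_url)),
        ]),
    ],
    max_tokens=256,  # keep prompt + completion within the 8K serverless limit
)
print(response.choices[0].message.content)
```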
Supported Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these eight supported languages. Note that for image+text applications, English is the only supported language.
Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.
Overview: Llama 3.2-Vision was pretrained on 6B image and text pairs. The instruction tuning data includes publicly available vision instruction datasets, as well as over 3M synthetically generated examples.
Data Freshness: The pretraining data has a cutoff of December 2023.