# Chapter 5: Multimodal Models

## What Are Multimodal Models?

Multimodal models are AI systems designed to process and generate data across multiple modalities, such as text, images, and audio. These models integrate different types of data to perform tasks that require understanding across modalities.

### Examples:
1. **DALL-E**: Generates images from textual descriptions.
2. **CLIP**: Embeds images and text in a shared space, enabling tasks like zero-shot classification and image-text retrieval.
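
To make the idea of linking images and text concrete, here is a minimal sketch of how a CLIP-style model could compare an image with candidate captions once both have been turned into vectors. The three-dimensional embeddings are made-up values for illustration only, and `cosine_similarity` is a helper defined here rather than a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; real CLIP embeddings have hundreds of dimensions
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a cat": [0.8, 0.2, 0.1],
    "a dog": [0.1, 0.9, 0.3],
}

# Compare the image with every caption in the shared embedding space
similarities = {
    caption: cosine_similarity(image_embedding, vector)
    for caption, vector in text_embeddings.items()
}
best_caption = max(similarities, key=similarities.get)

print(similarities)
print("Best match:", best_caption)
```

The caption whose vector points in nearly the same direction as the image vector gets the highest similarity, which is the basis for zero-shot classification.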

---

## Concept Sketch
Below is a sketch illustrating how multimodal models combine text and images:

![Multimodal Models](https://upload.wikimedia.org/wikipedia/commons/2/28/AI_multimodal_examples.svg)

This diagram shows how models like CLIP and DALL-E connect images and text.

---

## Code Examples

### Example 1: Zero-Shot Image Classification with CLIP
```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load an image and prepare inputs
image = Image.open("example.jpg")
inputs = processor(text=["a cat", "a dog"], images=image, return_tensors="pt", padding=True)

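# Run the model; logits_per_image holds one image-text similarity score per candidate label
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
# Softmax over the labels turns the scores into probabilities that sum to 1
probs = logits_per_image.softmax(dim=1)

print("Probabilities:", probs)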
```
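
The printed tensor is easier to interpret once each probability is paired with its label. The sketch below shows one way to do that with plain Python lists; `rank_labels` is a hypothetical helper, and the probability values are placeholders standing in for the model output, not real results.

```python
def rank_labels(labels, probabilities):
    """Pair each label with its probability and sort from most to least likely."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    return sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)

# Placeholder values standing in for the probabilities of a single image
labels = ["a cat", "a dog"]
example_probs = [0.92, 0.08]

ranking = rank_labels(labels, example_probs)
for label, p in ranking:
    print(f"{label}: {p:.2%}")
```

With the tensors from the example above, `probs[0].detach().tolist()` should yield a list that can be passed in place of `example_probs`.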

### Example 2: Scoring a Stable Diffusion Image with CLIP
```python
from diffusers import StableDiffusionPipeline
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load Stable Diffusion and CLIP models
sd_pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Generate an image with Stable Diffusion
prompt = "A futuristic cityscape"
image = sd_pipeline(prompt).images[0]

# Evaluate the generated image with CLIP
inputs = clip_processor(text=["a futuristic cityscape", "a forest"], images=image, return_tensors="pt", padding=True)
outputs = clip_model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

print("CLIP Evaluation Probabilities:", probs)
```
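
One caveat when reading these probabilities: softmax only compares the candidate prompts with each other, so it is not an absolute measure of image quality. The sketch below illustrates this with made-up logits; the numbers are assumptions chosen for the demonstration, not CLIP outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw logits for the prompts ["a futuristic cityscape", "a forest"]
strong_match = [30.0, 22.0]  # image scores high against the first prompt
weak_match = [18.0, 10.0]    # image scores low against both prompts, same gap

probs_strong = softmax(strong_match)
probs_weak = softmax(weak_match)

# Both lists are identical because softmax depends only on the differences between logits
print(probs_strong)
print(probs_weak)
```

When comparing several generated images against the same prompt, the raw `logits_per_image` values are therefore likely a more informative signal than the softmax probabilities.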

---

## Quiz

1. Which model is commonly used for generating images from text?
   - A. CLIP
   - B. DALL-E
   - C. GPT-3

2. What is the main function of CLIP in multimodal tasks?
   - A. Generate text from images.
   - B. Link images and text for classification and retrieval.
   - C. Train generative models.

---

### Answers:
1. **B**: DALL-E
2. **B**: Link images and text for classification and retrieval.

---

## Exercise

### Task:
1. Use CLIP to select the best-matching caption for each image in a set of custom images, choosing from a fixed list of candidate captions.
2. Compare the results for images of different categories (e.g., animals, landscapes).

---

### Example Solution:
```python
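from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP (same checkpoint as in the examples above)
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Custom images and the candidate captions to choose from
images = ["cat.jpg", "forest.jpg", "car.jpg"]
captions = ["a cat", "a forest", "a car"]

# Score every candidate caption against each image
for img_path in images:
    image = Image.open(img_path)
    inputs = clip_processor(text=captions, images=image, return_tensors="pt", padding=True)
    outputs = clip_model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)
    best = captions[probs.argmax(dim=1).item()]
    print(f"{img_path}: best caption = {best!r}, probabilities = {probs.tolist()}")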
```
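
For the second part of the task, it can help to summarize how often the top-ranked caption matches the expected one in each category. The following is a minimal sketch using only the standard library; `accuracy_by_category` is a hypothetical helper, and the `results` list contains invented outcomes for illustration rather than measured ones.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """Compute the fraction of correct predictions per expected label.

    `results` is a list of (expected_label, predicted_label) pairs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for expected, predicted in results:
        total[expected] += 1
        if expected == predicted:
            correct[expected] += 1
    return {label: correct[label] / total[label] for label in total}

# Invented outcomes: (expected label, label CLIP ranked highest)
results = [
    ("a cat", "a cat"),
    ("a cat", "a car"),
    ("a forest", "a forest"),
    ("a car", "a car"),
]

summary = accuracy_by_category(results)
print(summary)
```

Categories with a low score are good candidates for closer inspection, for example by trying more descriptive candidate captions.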
