# Advanced Multimodal Models — CLIP & DALL·E

This notebook explores **multimodal AI models**, focusing on **CLIP** and **DALL·E** — two foundational models that connect **language** and **vision**.

---

## What You’ll Learn
1. What multimodal models are and why they matter  
2. How CLIP connects text and images  
3. How DALL·E generates images from text  
4. How to use Hugging Face & OpenAI APIs for multimodal AI  

---

## Requirements
- Python 3.8+
- Install the following packages:


In [2]:
#!pip install torch torchvision transformers openai pillow matplotlib

## What is Multimodal AI?

**Multimodal AI** integrates multiple types of data — such as **text, image, audio, and video** — to enable richer understanding and generation.

Examples:
- **CLIP (Contrastive Language–Image Pretraining):** connects images and text.
- **DALL·E:** generates images from text prompts.
- **Whisper:** converts audio to text.
- **BLIP / Flamingo / Gemini:** combine vision and language for reasoning.

Multimodal systems are essential for:
- Image captioning  
- Visual question answering (VQA)  
- Text-to-image generation  
- AI-powered search and recommendation systems  

## Understanding CLIP (Contrastive Language–Image Pretraining)

CLIP, developed by OpenAI, learns to connect **text** and **images** by training on image–caption pairs.

It learns embeddings for both modalities:
- Image → visual embedding
- Text → textual embedding  

During training, it maximizes the cosine similarity between the embeddings of matching image–caption pairs and minimizes it for the mismatched pairs in the same batch.
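
One common way to write this objective is a symmetric cross-entropy over the image-text similarity matrix. The sketch below is a toy, standard-library illustration of that idea, not CLIP's training code. The function names, the two-dimensional embeddings, and the `temperature` value of 0.07 are assumptions chosen for readability.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k])."""
    n = len(image_embs)
    # logits[i][j] = scaled similarity between image i and text j
    logits = [
        [cosine_similarity(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Each image should pick out its own caption (rows of the matrix) ...
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # ... and each caption should pick out its own image (columns).
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Made-up toy embeddings: two images and two captions
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # captions point the same way as their images
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # captions paired with the wrong images

loss_aligned = contrastive_loss(image_embs, aligned_text)
loss_swapped = contrastive_loss(image_embs, swapped_text)
print("aligned pairs:", loss_aligned)
print("swapped pairs:", loss_swapped)
```

The loss is small when each image is most similar to its own caption and large when the pairs are swapped, which is the behavior the training objective rewards.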

This makes CLIP useful for:
- Zero-shot image classification  
- Text–image similarity  
- Image search engines  
- A building block for image captioning systems


In [None]:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests
import torch

# Load CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load an example image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
image = Image.open(requests.get(url, stream=True).raw)

# Define candidate texts
texts = ["a photo of a cat", "a photo of a dog", "a drawing of a cat"]

# Preprocess inputs
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Compute similarity scores (inference only, so gradients are not needed)
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # shape: (num_images, num_texts)
probs = logits_per_image.softmax(dim=1)      # probabilities over the candidate texts

print("Texts:", texts)
print("Similarity probabilities:", probs)



## Interpreting CLIP Results

The model computes **similarity scores** between the image and each text.

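- A higher score means the image matches that text more closely.
- After the softmax, the scores sum to 1 across the candidate texts, so each probability is relative to the other candidates rather than an absolute confidence.
- CLIP can act as a **zero-shot classifier**: no task-specific training is required.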

You can replace the texts with your own descriptions or labels (e.g., “a car”, “a person”, “a mountain”).
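
Because the probabilities come from a softmax over the candidate list, changing the candidates changes the numbers. The sketch below illustrates this with made-up logits (not real CLIP outputs); `softmax` and `rank_labels` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Turn raw similarity logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, logits):
    """Return (label, probability) pairs, best match first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Made-up logits for one image against two candidate texts
two_labels = rank_labels(
    ["a photo of a cat", "a photo of a dog"],
    [25.0, 20.0],
)

# Same image and same logits, plus one more plausible candidate
three_labels = rank_labels(
    ["a photo of a cat", "a photo of a dog", "a drawing of a cat"],
    [25.0, 20.0, 24.0],
)

for label, prob in two_labels:
    print(f"{prob:.3f}  {label}")
print()
for label, prob in three_labels:
    print(f"{prob:.3f}  {label}")
```

The top label stays the same in both runs, but its probability drops once a close competitor is added, even though its raw logit did not change.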

## DALL·E: Text-to-Image Generation

**DALL·E**, also by OpenAI, is a **generative model** that can create novel images from textual descriptions.

It extends the idea of multimodal representation by learning **to generate pixels conditioned on text**.

Example:  
> Prompt: “An astronaut riding a horse in a photorealistic style.”

Models in this family include:
- **DALL·E Mini / Craiyon:** a lightweight open-source model from an independent team, not an OpenAI release  
- **DALL·E 2 / DALL·E 3:** OpenAI models with higher image fidelity  

We’ll generate images using the OpenAI API.

In [None]:
import os
import openai
from IPython.display import Image as IPyImage, display

# Read the API key from an environment variable instead of hard-coding it
openai.api_key = os.environ["OPENAI_API_KEY"]

# Define your text prompt
prompt = "A futuristic cityscape with flying cars and neon lights"

# Generate an image using DALL·E 3
# (DALL·E 3 accepts 1024x1024, 1792x1024, or 1024x1792; it does not support 512x512)
response = openai.images.generate(
    model="dall-e-3",
    prompt=prompt,
    size="1024x1024"
)

# Display generated image
image_url = response.data[0].url
display(IPyImage(url=image_url))


## Tips for Better Prompts

1. **Be descriptive:**  
   “A sunset over the mountains” → “A vivid sunset over snowy mountains reflected in a lake.”

2. **Specify style:**  
   Add “in watercolor style,” “as a digital painting,” or “in Pixar 3D style.”

3. **Add composition details:**  
   “Wide-angle,” “portrait view,” “top-down perspective.”

4. **Use modifiers:**  
   “Realistic,” “abstract,” “fantasy,” “cinematic lighting,” “high resolution.”

Prompt quality directly affects image quality.
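
The tips above can be applied mechanically by assembling a prompt from its parts. The helper below is a minimal sketch; `build_prompt` and its parameter names are hypothetical and not part of any API.

```python
def build_prompt(subject, style=None, composition=None, modifiers=()):
    """Join a subject with optional style, composition, and modifier phrases."""
    parts = [subject.strip()]
    if style:
        parts.append(style.strip())
    if composition:
        parts.append(composition.strip())
    # Skip empty modifier strings so the prompt has no stray commas
    parts.extend(m.strip() for m in modifiers if m.strip())
    return ", ".join(parts)

prompt = build_prompt(
    "A vivid sunset over snowy mountains reflected in a lake",
    style="in watercolor style",
    composition="wide-angle",
    modifiers=["cinematic lighting", "high resolution"],
)
print(prompt)
```

Keeping the parts separate makes it easy to vary one element at a time, such as the style, and compare the resulting images.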


## CLIP + DALL·E: Text-Image Alignment

Together, CLIP and DALL·E can:
- Rank generated images by relevance using CLIP embeddings
- Improve prompt adherence by discarding images that score poorly against the prompt
- Support retrieval and generation in one pipeline that shares a single text–image similarity measure

**Workflow Example:**
1. Generate 5 images using DALL·E  
2. Use CLIP to compute similarity with the prompt  
3. Select the top-ranked image  


In [None]:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests
import torch

# Suppose you have multiple image URLs (generated or downloaded).
# The example.com URLs below are placeholders; replace them before running.
urls = [
    "https://example.com/image1.png",
    "https://example.com/image2.png",
    "https://example.com/image3.png"
]

# Load CLIP
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load and process images
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]
prompt = "A futuristic cityscape with flying cars and neon lights"
inputs = clip_processor(text=[prompt], images=images, return_tensors="pt", padding=True)

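# Compute similarity (inference only, so gradients are not needed)
with torch.no_grad():
    outputs = clip_model(**inputs)
# logits_per_text has shape (1, num_images); softmax over dim=1 compares the images
# with each other. (Softmax over logits_per_image along dim=1 would normalize over
# the single prompt and return 1.0 for every image.)
scores = outputs.logits_per_text.softmax(dim=1)
print("Image relevance scores:", scores)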
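
The last workflow step, selecting the top-ranked image, could look like the sketch below. `select_top_image` is a hypothetical helper, and the URLs and scores are made-up placeholders rather than real model output.

```python
def select_top_image(urls, scores):
    """Return the (url, score) pair with the highest relevance score."""
    if not urls:
        raise ValueError("urls must not be empty")
    if len(urls) != len(scores):
        raise ValueError("urls and scores must have the same length")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return urls[best_index], scores[best_index]

# Placeholder URLs and made-up relevance scores for illustration only
candidate_urls = [
    "https://example.com/image1.png",
    "https://example.com/image2.png",
    "https://example.com/image3.png",
]
candidate_scores = [0.21, 0.64, 0.15]

best_url, best_score = select_top_image(candidate_urls, candidate_scores)
print("Best image:", best_url, "score:", best_score)
```

With the tensor produced by the cell above, something like `scores.squeeze().tolist()` should give a plain list that can be passed as `scores`.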


## Summary

In this notebook, you learned:
- The fundamentals of **multimodal AI**
- How **CLIP** links vision and language via embeddings
- How **DALL·E** generates images from text
- How to combine both models for search and generation tasks

---

### Next Steps
- Try **BLIP** or **Flamingo** for captioning and visual question answering.  
- Experiment with **OpenCLIP** and **Stable Diffusion** for local generation.  
- Integrate CLIP with **vector search (FAISS, Pinecone)** for multimodal retrieval systems.  
