# Image Captioning using CLIP (Multimodal System)

## Aim
To implement a basic multimodal system for image captioning using CLIP.

## Objective
To understand how images and text can be processed together using a pre-trained multimodal model.

## Introduction 

**Image Captioning** is a multimodal task where a system generates a textual description for a given image.

- Uses both **Computer Vision** and **Natural Language Processing**
- CLIP is a popular multimodal model developed by OpenAI
- CLIP learns image–text relationships

## Model Used

- **CLIP (Contrastive Language–Image Pre-training)**
- Pre-trained on image–text pairs
- Maps images and text into a common embedding space

## Working Principle (Pipeline)

1. The input image is encoded by the image encoder
2. Each candidate caption is encoded by the text encoder
3. The similarity between the image embedding and each text embedding is computed
4. The caption with the highest similarity is selected
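
The four steps above can be sketched without the real model. This is a minimal illustration, assuming the encoders have already produced embeddings: the 3-dimensional vectors and the helper names `l2_normalize` and `pick_caption` are hypothetical and are not part of CLIP.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def pick_caption(image_embedding, caption_embeddings):
    """Return (best_caption, scores) using cosine similarity to the image."""
    img = l2_normalize(image_embedding)
    scores = {}
    for caption, emb in caption_embeddings.items():
        txt = l2_normalize(emb)
        scores[caption] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy embeddings standing in for real encoder outputs.
toy_image = [0.9, 0.1, 0.2]
toy_captions = {
    "a photo of a cat": [0.1, 0.9, 0.1],
    "a photo of a person": [0.8, 0.2, 0.3],
}

best, scores = pick_caption(toy_image, toy_captions)
print(best)
```

Normalizing both sides first means the comparison depends only on the direction of the vectors, not on their length.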

## Step 1: Install Required Libraries

In [1]:
!pip install torch torchvision ftfy regex tqdm pillow --quiet
!pip install git+https://github.com/openai/CLIP.git --quiet
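
Before moving on, it can help to confirm that the modules this notebook imports are actually available. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and the list of names is an assumption based on the packages named in the install step (`PIL` is the import name of the `pillow` package).

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Import names assumed from the install step above.
required = ["torch", "torchvision", "clip", "PIL", "ftfy", "regex", "tqdm"]
print("Missing modules:", missing_modules(required))
```

An empty list suggests the environment is ready; any name that appears points to a package that still needs to be installed.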


## Step 2: Import Libraries

In [2]:
import torch
import clip
from PIL import Image
import numpy as np

## Step 3: Load CLIP Model

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

print("CLIP model loaded successfully")

100%|███████████████████████████████████████| 338M/338M [00:50<00:00, 6.97MiB/s]


CLIP model loaded successfully


## Step 4: Load Input Image

In [5]:
# Use any sample image available on your system
image = preprocess(Image.open("download.jpeg")).unsqueeze(0).to(device)
image

tensor([[[[-0.6682, -0.8288, -0.7704,  ...,  1.9157,  1.9157,  1.9157],
          [-0.6682, -0.8288, -0.7704,  ...,  1.9157,  1.9157,  1.9157],
          [-0.6390, -0.8142, -0.7558,  ...,  1.9157,  1.9157,  1.9157],
          ...,
          [ 1.7844,  1.7844,  1.7844,  ...,  1.7552,  1.7552,  1.7552],
          [ 1.7844,  1.7844,  1.7844,  ...,  1.7552,  1.7552,  1.7552],
          [ 1.7844,  1.7844,  1.7844,  ...,  1.7698,  1.7552,  1.7552]],

         ... (remaining output truncated)
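
The values in this tensor are not raw pixel intensities: `preprocess` resizes and crops the image, then normalizes each colour channel. The sketch below shows only the normalization step. The mean and standard deviation constants are the ones CLIP's preprocessing is understood to use and should be checked against the installed package; the pixel value and the helper name `normalize_pixel` are hypothetical.

```python
# Per-channel statistics assumed for CLIP's preprocessing (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to the normalized values the model receives."""
    return tuple((value / 255.0 - mean) / std
                 for value, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD))

# Hypothetical bright pixel chosen for illustration.
r, g, b = normalize_pixel((254, 248, 222))
print(round(r, 4), round(g, 4), round(b, 4))
```

Under these assumptions, a bright pixel maps to values close to 1.9 in the red and green channels, which is consistent with entries such as 1.9157 in the printed tensor.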

## Step 5: Define Candidate Captions

Since CLIP is not a generative model, we select the best caption from predefined text options.

In [6]:
captions = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a person",
    "a photo of a car"
]

text_tokens = clip.tokenize(captions).to(device)
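
When the candidate set grows, the caption list can be generated from plain labels instead of typed by hand. This is a small sketch; the helper `build_captions` and its `template` parameter are hypothetical names introduced here.

```python
def build_captions(labels, template="a photo of a {}"):
    """Turn plain class labels into caption-style prompts."""
    return [template.format(label) for label in labels]

labels = ["cat", "dog", "person", "car"]
captions = build_captions(labels)
print(captions)
```

The resulting list is identical to the hand-written one and can be passed to `clip.tokenize` in the same way.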

## Step 6: Image Caption Prediction

In [7]:
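with torch.no_grad():
    # Encode the image and the candidate captions into the shared embedding space
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # L2-normalize so the dot product below is the cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale by 100 before softmax so the probabilities are not nearly uniform;
    # the argmax (and hence the selected caption) is unaffected
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Pick the caption with the highest probability
best_caption = captions[similarity.argmax().item()]
print("Predicted Caption:", best_caption)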

Predicted Caption: a photo of a person
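
Seeing the score of every caption, not just the winner, shows how decisive the match is. Cosine similarities lie in [-1, 1], so a softmax over the raw values is close to uniform; the zero-shot example in the CLIP README multiplies them by 100 first. The sketch below illustrates this with plain Python. The similarity values are hypothetical (not taken from the run above), and `softmax` and `rank_captions` are helper names introduced here.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rank_captions(captions, cosine_scores, scale=100.0):
    """Pair each caption with a softmax probability, highest first."""
    probs = softmax([scale * s for s in cosine_scores])
    return sorted(zip(captions, probs), key=lambda pair: pair[1], reverse=True)

# Hypothetical cosine similarities for illustration only.
captions = ["a photo of a cat", "a photo of a dog",
            "a photo of a person", "a photo of a car"]
cosine_scores = [0.19, 0.20, 0.26, 0.18]

ranked = rank_captions(captions, cosine_scores)
for caption, prob in ranked:
    print(f"{prob:.3f}  {caption}")

# Same scores without scaling: the order is the same, the probabilities are flat.
unscaled = rank_captions(captions, cosine_scores, scale=1.0)
```

The scale factor changes how peaked the probabilities are but never the ranking, so the selected caption is the same either way.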


## Observations (Exam Ready Points)

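- CLIP matched the input image to the most relevant candidate caption ("a photo of a person")
- The caption is selected by the highest image–text similarity score
- The experiment demonstrates multimodal learning: images and text are compared in one embedding space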

## Applications

- Image captioning systems
- Image search and retrieval
- Accessibility tools for visually impaired users
- Multimodal AI systems

## Advantages and Limitations

**Advantages:**
- No task-specific training required (the pre-trained model is used as-is)
- Handles both images and text with a single model

**Limitations:**
- Cannot generate new captions
- Depends on predefined text options

## Conclusion (One-Line Exam Answer)

CLIP enables image captioning by selecting, from a set of candidate captions, the text most similar to the image in a shared multimodal embedding space.