<a href="https://colab.research.google.com/github/subhashpolisetti/AutoGluon_ML_End-to-End_Implementations_Part-2/blob/main/7_Zero_Shot_Image_Classification_with_CLIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Zero-Shot Image Classification with CLIP**

This notebook demonstrates **zero-shot image classification** using OpenAI's **CLIP (Contrastive Language–Image Pretraining)**. It uses a pre-trained model to classify images into custom categories without additional training.

---

## **Key Objectives**
1. Perform **zero-shot classification** using a pre-trained CLIP model.
2. Classify an image into user-defined categories based on natural language prompts.
3. Compute probabilities for each category to determine the best match.

---

## **Steps Covered**

### **1. Installation and Setup**
- Install required libraries:
  - `torch`, `torchvision`, and `transformers` for loading and running the CLIP model.
  - `Pillow` (imported as `PIL`) for image processing.

### **2. Load the CLIP Model**
- Use Hugging Face's `CLIPModel` and `CLIPProcessor` to load the pre-trained **CLIP ViT-B/32** model and processor.

### **3. Define Text Prompts**
- Specify custom categories as natural language prompts, e.g., `"This is a dog"`, `"This is a cat"`, etc.

### **4. Image Preprocessing**
- Load the input image using **Pillow** and preprocess it along with the text prompts using the `CLIPProcessor`.

### **5. Model Inference**
- Pass the preprocessed inputs to the CLIP model.
- Compute **logits** for image-text similarity.
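
The image-text logits in this step can be read as scaled cosine similarities between the image embedding and each text embedding. The following is a minimal sketch in plain Python, assuming tiny made-up 3-dimensional embeddings and a hypothetical scale of `100.0` (the real model produces much larger embeddings and learns its scale during pre-training). The helper names `normalize` and `similarity_logits` are illustrative and are not part of `transformers`.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (L2 norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embed, text_embeds, scale=100.0):
    """Return one logit per text embedding: scale * cosine similarity."""
    img = normalize(image_embed)
    logits = []
    for text_embed in text_embeds:
        txt = normalize(text_embed)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(scale * cosine)
    return logits


# Made-up embeddings for illustration only (not produced by CLIP).
image_embed = [1.0, 2.0, 2.0]
text_embeds = [
    [1.0, 2.0, 2.0],   # same direction as the image -> cosine 1
    [2.0, -1.0, 0.0],  # orthogonal to the image -> cosine 0
]
logits = similarity_logits(image_embed, text_embeds)
```

A prompt whose embedding points in the same direction as the image embedding gets the largest logit, and an unrelated (orthogonal) prompt gets a logit near zero.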

### **6. Compute Probabilities**
- Apply the **softmax function** to the logits to calculate probabilities for each text prompt.
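
To make the softmax step concrete, here is a small sketch in plain Python. The logits are hypothetical values chosen for illustration, not actual model output, and `softmax` is a hand-written helper rather than the `torch` method used later in the notebook.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four prompts (not actual model output).
example_logits = [24.0, 20.0, 17.0, 18.5]
example_probs = softmax(example_logits)
```

Because softmax depends only on the differences between logits, a gap of a few units is enough for one prompt to take most of the probability mass.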

### **7. Display Results**
- Output probabilities for each text prompt, showing the likelihood of the image matching each category.

---

## **Key Features**
- **Zero-Shot Learning**: Classify images without training the model on the specified categories.
- **Custom Categories**: Define any text prompt for classification, enabling high flexibility.
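- **Probabilistic Output**: Softmax scores over the supplied prompts sum to 1, so each score is relative to the other candidate categories rather than an absolute confidence.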

---

## **Example Applications**
- **Image Search**: Match images to descriptions without retraining.
- **Content Moderation**: Detect inappropriate content using descriptive prompts.
- **Image Annotation**: Assign labels to images by choosing among candidate descriptions.

---

This notebook is a starting point for exploring **CLIP** in zero-shot image classification and how it generalizes across diverse categories.


In [None]:
!pip install torch torchvision transformers



In [None]:
pip install autogluon


In [None]:
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

In [None]:
# Load the CLIP model and processor from Hugging Face
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


In [None]:
# Load an image
image_path = '/content/image.webp'
image = Image.open(image_path)

In [None]:
# Define the labels you want to classify the image into
text_prompts = ["This is a dog", "This is a cat", "This is a car", "This is a house"]
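
The prompt list can also be generated from bare class names with a template. This is a minimal sketch, assuming a hypothetical helper `build_prompts`. Its default template mirrors the sentence form used above, and other wordings (for example `"a photo of a {label}"`) can be tried by passing a different `template`.

```python
def build_prompts(labels, template="This is a {label}"):
    """Fill each class name into the prompt template."""
    return [template.format(label=label) for label in labels]


labels = ["dog", "cat", "car", "house"]
generated_prompts = build_prompts(labels)
```

With the default template, `generated_prompts` is the same list as `text_prompts`, so it can be passed to the processor in the same way.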

In [None]:
# Preprocess image and text
inputs = processor(text=text_prompts, images=image, return_tensors="pt", padding=True)

In [None]:
# Get the model output (logits)
with torch.no_grad():
    outputs = model(**inputs)

# The logits for the image-text similarity
logits_per_image = outputs.logits_per_image  # Shape: [1, number of prompts]

# Compute probabilities by applying softmax to logits
probs = logits_per_image.softmax(dim=1)  # Shape: [1, number of prompts]

In [None]:
# Display probabilities for each text prompt
for i, prompt in enumerate(text_prompts):
    print(f"{prompt}: {probs[0][i].item():.4f}")

This is a dog: 0.9805
This is a cat: 0.0141
This is a car: 0.0010
This is a house: 0.0043
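
To turn the per-prompt scores into a single prediction, one option is to pick the prompt with the highest probability. The sketch below assumes a hypothetical helper `best_match` and uses the values copied from the printed output above.

```python
def best_match(prompts, scores):
    """Return the (prompt, score) pair with the highest score."""
    if len(prompts) != len(scores):
        raise ValueError("prompts and scores must have the same length")
    return max(zip(prompts, scores), key=lambda pair: pair[1])


# Values copied from the printed output above.
prompts = ["This is a dog", "This is a cat", "This is a car", "This is a house"]
scores = [0.9805, 0.0141, 0.0010, 0.0043]

top_prompt, top_score = best_match(prompts, scores)
print(f"Best match: {top_prompt} ({top_score:.4f})")
```

In the notebook itself, the same helper could be called as `best_match(text_prompts, probs[0].tolist())`.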
