<a href="https://colab.research.google.com/github/sdgroeve/Machine_Learning_course_UGent_D012554_2025/blob/main/notebooks/CLIP_image_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zero-Shot Image Classification with CLIP

This notebook demonstrates how to use OpenAI's CLIP model for zero-shot image classification. CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a large collection of image-text pairs to embed images and text in a shared space. Because it scores how well an image matches any text description, it can classify new images into categories it was never explicitly trained on.

We'll use the Hugging Face Transformers library to load the model and perform predictions on a local image.

First, let's import the necessary libraries:

In [None]:
import torch
from PIL import Image
import matplotlib.pyplot as plt
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

## Load Model and Processor

We'll use the `clip-vit-base-patch32` model from OpenAI, which is available through Hugging Face's model hub.

In [None]:
# Load model and processor
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-base-patch32")

print("Model and processor loaded successfully!")

## Define Candidate Labels

In zero-shot classification, we need to provide the model with potential class descriptions. The model will then determine which description best matches the image.

In [None]:
# Define candidate labels (zero-shot prompts)
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
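
The labels above all follow the same "a photo of a ..." pattern. As a minimal sketch, the prompts could be generated from bare class names. The helper `build_prompts` and its `template` parameter are hypothetical names introduced here for illustration, not part of the Transformers API.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn bare class names into full-sentence prompts (hypothetical helper)."""
    return [template.format(name) for name in class_names]

# Class names matching the candidate labels used in this notebook
class_names = ["cat", "dog", "bird"]
prompts = build_prompts(class_names)
print(prompts)
```

Keeping the class names separate from the template makes it easy to try a different wording for all labels at once.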

## Define Classification Function

Let's create a function that can classify local images using our model and candidate labels.

In [None]:
def classify_local_image(image_path, labels):
    """
    Classify a local image using the CLIP model with zero-shot learning.

    Args:
        image_path (str): Path to the local image file
        labels (list): List of text descriptions for zero-shot classification

    Returns:
        tuple: (predicted_label, probabilities) where predicted_label is the most likely
               label and probabilities is a tensor of shape (1, len(labels)) holding
               the softmax distribution over all labels
    """
    # Open the local image file
    image = Image.open(image_path).convert("RGB")

    # Process the image and the text inputs
    inputs = processor(images=image, text=labels, return_tensors="pt", padding=True)

    # Run inference
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = model(**inputs)

    # Get the image-text similarity logits, shape (1, len(labels))
    logits = outputs.logits_per_image

    # Convert to probabilities using softmax
    probs = logits.softmax(dim=1)

    # Find the label with the highest probability
    predicted_idx = torch.argmax(probs, dim=1).item()

    return labels[predicted_idx], probs
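
To make the logits-to-probabilities step more concrete, here is a rough pure-Python sketch of the idea behind it: each logit is a scaled cosine similarity between the image embedding and one text embedding, and a softmax over the labels turns those scores into probabilities. The 3-dimensional vectors are made up for illustration, and `toy_zero_shot_probs` and the `logit_scale` value of 100.0 are assumptions chosen for this sketch, not values read from the loaded model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_zero_shot_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Scaled cosine similarities followed by softmax (illustrative only)."""
    logits = [logit_scale * cosine_similarity(image_embedding, t)
              for t in text_embeddings]
    return softmax(logits)

# Made-up embeddings: one image and three candidate text prompts
image_vec = [0.9, 0.1, 0.2]
text_vecs = [[0.1, 0.9, 0.1], [0.8, 0.2, 0.3], [0.2, 0.1, 0.9]]
toy_probs = toy_zero_shot_probs(image_vec, text_vecs)
print(toy_probs)
```

Because the softmax runs only over the labels supplied, the probabilities always sum to 1 across the candidates, even when none of them describes the image well.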

## Visualize the Image

Let's display the image before we classify it.

In [None]:
# Replace with the path to your local image
local_image_path = "dog.jpg"

# Display the image
image = Image.open(local_image_path)
plt.figure(figsize=(6, 6))
plt.imshow(image)
plt.axis('off')
plt.title('Image to classify')
plt.show()

## Classify the Image

Now, let's classify the image and see the results.

In [None]:
# Classify the local image
predicted_label_local, probabilities_local = classify_local_image(local_image_path, candidate_labels)

print("Predicted Label:", predicted_label_local)
print("Probabilities:", probabilities_local)
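
Printing the raw tensor can be hard to read. As a small sketch, the labels can be paired with their probabilities and sorted from most to least likely. The helper `rank_labels` is a hypothetical name, and the probabilities below are made-up numbers rather than real model output; with the notebook's result, a flat list could be obtained via `probabilities_local.squeeze().tolist()`.

```python
def rank_labels(labels, probs, top_k=None):
    """Return (label, probability) pairs sorted by descending probability."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Made-up probabilities for illustration only
example_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
example_probs = [0.05, 0.92, 0.03]

for label, prob in rank_labels(example_labels, example_probs):
    print(f"{label}: {prob:.1%}")
```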

## Visualize the Results

Let's create a bar chart to visualize the probabilities for each label.

In [None]:
# Convert probabilities to a format suitable for plotting
probs_list = probabilities_local.squeeze().tolist()

# Create a bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(candidate_labels, probs_list, color='skyblue')

# Highlight the predicted label
predicted_idx = candidate_labels.index(predicted_label_local)
bars[predicted_idx].set_color('navy')

# Add labels and title
plt.xlabel('Labels')
plt.ylabel('Probability')
plt.title('Zero-Shot Classification Results')
plt.xticks(rotation=15, ha='right')

# Add probability values on top of bars
for i, v in enumerate(probs_list):
    plt.text(i, v + 0.02, f'{v:.2f}', ha='center')

# Leave headroom so the value above a bar near 1.0 is not clipped
plt.ylim(0, 1.1)

plt.tight_layout()
plt.show()

## Conclusion

In this notebook, we demonstrated how to use the CLIP model for zero-shot image classification. This approach lets us classify images into arbitrary categories without task-specific training, because the model scores how well each text description matches the image.

Key benefits of this approach:
- No need to train or fine-tune models for specific classification tasks
- Flexibility to define custom categories on the fly
- Works reasonably well for common objects and scenes

Limitations:
- Performance may not match specialized models trained for specific tasks
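- Results depend heavily on prompt wording: the closer a description is to the kind of captions the model saw during training, the more reliable the match
- More computationally intensive than small task-specific classifiers, since each prediction here runs both an image encoder and a text encoder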