<a href="https://colab.research.google.com/github/zhiqiwang59/DL/blob/main/2_vlm/vlm_tutorial_practical_2_students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# M2LS 2025: Vision-Language Models -- Practical 2: From Theory to Application 🚀
---
- Alexandre Galashov (agalashov@google.com)
- Petra Bevandic (Petra.Bevandic@fer.hr)
<br>

Now it's time to put our knowledge into action! With the core concepts of **ViT** from Practical 1 under our belts, we're ready to embed images and text into a shared space to solve real-world problems.

In this hands-on session, we will focus on understanding CLIP and using it in the following applications:

* **Zero-Shot Image Classification**

* **Anomaly Detection**

* **Image Search**

We will also look at failure cases on the SugarCrepe benchmark.

Let's see what these models can really do!

**Disclaimer**: You will mainly be required to complete code blocks which we noted as **"Your code here"**. We took care of most of the boilerplate code for you. However, please also feel free to deviate from the code which we prepared and code things in the way you feel is right!

---

## Preliminary Setup (Hugging Face account)
---

1. Make a Hugging Face account if you don't already have one (Sign Up).

2. Create an access token in Hugging Face (if you have not done so already).

3. Either set the `HF_TOKEN` secret in Colab secrets or fill in `MANUALLY_ENTERED_HF_TOKEN` below.

In [None]:
from google.colab import userdata
from huggingface_hub import login

MANUALLY_ENTERED_HF_TOKEN = '' # If the `HF_TOKEN` secret is not set, enter your token here.

try:
  HF_TOKEN = userdata.get('HF_TOKEN')
except userdata.SecretNotFoundError:
  HF_TOKEN = MANUALLY_ENTERED_HF_TOKEN

login(token=HF_TOKEN)

## Understanding CLIP and basics of text-image encoding

In this chapter, we'll explore the groundbreaking CLIP model and the art of aligning images with text. By the end of this section, you'll be able to answer:

* **🧠 How does CLIP actually learn?** We'll dive deep into the mechanics of its powerful contrastive loss function.

* **🚀 How can we use it in practice?** We'll go hands-on, applying pre-trained CLIP embeddings to solve a real-world problem.

* **🤔 What are its limitations?** We'll investigate the common pitfalls and failure modes of contrastive learning to build a more complete understanding.

[Contrastive Language-Image Pre-training (CLIP)](https://arxiv.org/pdf/2103.00020) is a method that uses large-scale image-text datasets to learn a shared embedding space. In this space, the representations of images and their corresponding text descriptions are close together, while unrelated pairs are pushed far apart.

The CLIP model architecture consists of **two encoders**: a **text encoder** and an **image encoder**. These encoders are used to generate representations for text and images, respectively. During training, the model's objective is to predict which image-text pairs within a batch are correctly matched. It achieves this by maximizing the similarity between the embeddings of positive (matching) pairs and minimizing the similarity of negative (mismatching) pairs.

CLIP's training approach enables it to perform remarkably well on various image-related tasks, particularly in zero-shot settings where it can classify new images without needing to be fine-tuned on a specific dataset.

<img src="https://drive.google.com/uc?export=view&id=1bxFxZX7Amwdn4JCyrgZBy87fQDI1yCWM" height="300" width="500">

## Exercise 1: Implement the CLIP Loss

In this exercise, you will implement the core loss function used in CLIP training. This loss function is crucial for teaching CLIP to align image and text representations effectively. The CLIP loss is defined as follows for a batch of $B$ examples:

$$L =   - \frac{1}{2B} \sum_{i=1}^{B} \left( \log \frac{  \exp(\phi(v_i, t_i) / \tau)}{\sum_{j=1}^B\exp(\phi(v_i, t_j) / \tau)} \right. +  \left. \log \frac{\exp(\phi( t_i, v_i) / \tau)}{\sum_{j=1}^B\exp(\phi(t_i, v_j) / \tau)} \right),$$

where $\phi(v_i, t_j) = \tfrac{v_i}{\| v_i \|_2} \cdot \tfrac{t_j}{\| t_j \|_2}$ computes the cosine similarity between the image representation ($v_i$) and the text representation ($t_j$), and $\tau$ is the temperature parameter, which controls the sharpness of the softmax distribution.
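
To get a feel for the role of $\tau$, the following framework-free sketch applies a softmax to two made-up cosine similarities at two temperatures. The similarity values and the helper name `softmax_with_temperature` are illustrative assumptions, not part of any CLIP codebase.

```python
import math

def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities / temperature (shifted by the max for stability)."""
    scaled = [s / temperature for s in similarities]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up cosine similarities of one image to its matching caption (0.30)
# and to a non-matching caption (0.20).
similarities = [0.30, 0.20]

soft = softmax_with_temperature(similarities, temperature=1.0)
sharp = softmax_with_temperature(similarities, temperature=0.07)

print(f"tau = 1.00 -> {soft}")
print(f"tau = 0.07 -> {sharp}")
```

Because cosine similarities live in $[-1, 1]$, a temperature of 1 leaves the softmax close to uniform, while a small temperature turns the same small gap into a much more confident distribution.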

Your task is to write a function that calculates the CLIP loss given:
 - **Image embeddings** (a tensor of shape [batch_size, embedding_dim])
 - **Text embeddings** (a tensor of shape [batch_size, embedding_dim])

The loss function should:

1.  **L2 normalize the image embeddings and text embeddings** (this ensures that both image and text embeddings have a unit norm and enables the loss to focus on aligning the semantic directions of images and text, regardless of their initial scales).
2.  **Compute the cosine similarity** between each image embedding and each text embedding.
3. **Divide the similarity matrix by a temperature scaling factor** (this helps control the sharpness of the softmax distribution).
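4. **Compute the cross-entropy loss** between the similarity matrix and the target matrix (where the target matrix has 1s along the diagonal, indicating correct image-text pairs), once along the rows (image-to-text) and once along the columns (text-to-image), matching the two terms in the formula above.
5. Return the **average of the two losses**, each taken as the mean across the batch.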
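
As a sanity check on the formula, here is a worked example under the assumption of a batch of $B = 2$ with orthonormal toy embeddings (not real CLIP features). Take $v_1 = t_1 = (1, 0)$ and $v_2 = t_2 = (0, 1)$. Then $\phi(v_i, t_i) = 1$ and $\phi(v_i, t_j) = 0$ for $i \neq j$, so all $2B = 4$ log terms are identical and

$$L = -\log \frac{\exp(1/\tau)}{\exp(1/\tau) + \exp(0)} = \log\left(1 + e^{-1/\tau}\right) \approx 6 \times 10^{-7} \quad \text{for } \tau = 0.07.$$

If the captions are swapped, $t_1 = (0, 1)$ and $t_2 = (1, 0)$, the two similarities exchange roles and

$$L = -\log \frac{\exp(0)}{\exp(0) + \exp(1/\tau)} = \log\left(1 + e^{1/\tau}\right) \approx 14.3 \quad \text{for } \tau = 0.07.$$

A correct implementation should therefore return a loss very close to 0 for perfectly matched pairs and a loss well above 1 for fully mismatched pairs.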

Once you have implemented the CLIP loss, you can use the following test to check your implementation.

In [None]:
import torch
import torch.nn.functional as F

In [None]:
# Example of CLIP loss implementation

def clip_loss(image_features, text_features, temperature=0.07):
    """
    Computes the CLIP loss given image and text features.

    Args:
        image_features (torch.Tensor): Image embeddings of shape (batch_size, embedding_dim).
        text_features (torch.Tensor): Text embeddings of shape (batch_size, embedding_dim).
        temperature (float, optional): Temperature scaling factor. Defaults to 0.07.

    Returns:
        torch.Tensor: The CLIP loss.
    """

    ############################################################################
    # Your code here
    # You will need to:
    # - Normalize embeddings to unit sphere (L2 normalization)
    # - Compute logits: cosine similarities (dot product)
    # - Create ground truth labels (positive pairs along the diagonal)
    # - Compute cross-entropy loss (symmetrically for both image and text)
    # - Average the two losses
    ############################################################################
    ...

    return total_loss

In [None]:
def test_clip_loss():
  # Test that positive pairs have low loss.
  image_features = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
  text_features = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
  loss = clip_loss(image_features, text_features, temperature=0.07)
  assert loss < 0.1  # Small loss for matching pairs

  # Test that negative pairs have high loss.
  image_features = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
  text_features = torch.tensor([[0.0, 1.0], [1.0, 0.0]])  # Mismatched
  loss = clip_loss(image_features, text_features, temperature=0.07)
  assert loss > 1.0  # High loss for mismatched pairs

  return "Congratulations! Your CLIP loss implementation passes the test."
test_clip_loss()

CLIP is typically pre-trained on a large dataset of image-text pairs sourced from the internet. This could include images with captions, descriptions, or surrounding text. The initial CLIP model (from OpenAI), for instance, was trained on 400 million image-text pairs.

After training, CLIP has learned to map images and text into a shared embedding space where similar concepts are close together and dissimilar concepts are far apart. This allows for zero-shot transfer, where the CLIP model can relate images and text even for classes it was not explicitly trained to recognize. We will explore this capability of pre-trained CLIP in the following exercises.

## Exercise 2 - Using CLIP for Semantic Image Search

You are now going to build a simple semantic image search tool using a pre-trained CLIP model. This will allow you to enter a text query (e.g., "birds in the water") and retrieve the most relevant images from an image collection.

Hugging Face's Model Hub provides several pre-trained CLIP models that you can easily download and use.

In [None]:
!pip install ipyplot

In [None]:
from io import BytesIO
from PIL import Image
import numpy as np
import requests
import ipyplot
import torch

In [None]:
from transformers import CLIPProcessor, CLIPModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

The CLIP model consists of a `text_model` that computes the text embeddings for the input texts and a `vision_model` that computes the image embeddings for the input images.

In [None]:
clip_model

The clip_processor does all of the necessary image pre-processing (normalization, rescaling, cropping, etc.) and text pre-processing (e.g. tokenization) needed for giving images and text as input to the CLIP model.

In [None]:
clip_processor

### 2. Download an image collection from Pixabay

We'll now download a set of images that will serve as our image collection for semantic image search. For this purpose, we'll use Pixabay and download images of birds (feel free to change `pixabay_search_keyword` if you want to download a different image collection).

In [None]:
original_api = "https://pixabay.com/api/?key="
pixabay_api_key = "22176616-358d1b190a298ff59f96b35a1"

pixabay_search_keyword = "Birds" #@param {type:"string"}

no_to_retrieve = 50
pixabay_api = original_api+pixabay_api_key+"&q="+pixabay_search_keyword.lower()+"&image_type=photo&safesearch=true&per_page="+str(no_to_retrieve)
response = requests.get(pixabay_api)
output = response.json()

image_collection =[]
for each in output["hits"]:
    imageurl = each["webformatURL"]
    response = requests.get(imageurl)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    image_collection.append(image)

print("Total no of images retrieved: ", len(image_collection))

ipyplot.plot_images(image_collection,max_images=50,img_width=150)

### **Task 1**: Perform Semantic Search

Implement a function that uses the pre-trained CLIP model to retrieve the top-k images that are most semantically similar to a text query.
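
The ranking idea behind such a search can be sketched without any model: score every candidate by cosine similarity to the query and keep the indices of the k best. The 3-dimensional vectors below are hypothetical stand-ins for CLIP embeddings, and `cosine_similarity` and `top_k_indices` are illustrative helper names, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query_emb, candidate_embs, k):
    """Indices of the k candidates most similar to the query, best first."""
    scores = [cosine_similarity(query_emb, c) for c in candidate_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Hypothetical embeddings: one text query and four images.
toy_text_emb = [1.0, 0.0, 0.0]
toy_image_embs = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # almost aligned with the query
    [0.5, 0.5, 0.0],   # partially aligned
    [-1.0, 0.0, 0.0],  # opposite direction
]

best = top_k_indices(toy_text_emb, toy_image_embs, k=2)
print(best)
```

With real embeddings stored as tensors, the same ranking is usually done in one batched matrix product followed by `torch.topk`.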

In [None]:
def semantic_image_search(text_query, image_collection, top_k=3, clip_processor=clip_processor, clip_model=clip_model):
    """
    Performs semantic image search using CLIP.

    Args:
        text_query: Text query used to retrieve images.
        image_collection: A list of images used for the image search.
        top_k: The number of top results to return.
        clip_processor: Optional pre-trained CLIPProcessor for pre-processing the input image and text.
        clip_model: Optional pre-trained CLIPModel for obtaining image and text embeddings.

    Returns:
      top_k most similar images for the text_query
    """

    with torch.no_grad():
      inputs = clip_processor(text=[text_query], images=image_collection, return_tensors="pt", padding=True)
      outputs = clip_model(**inputs)
    text_emb = outputs.text_embeds
    image_emb = outputs.image_embeds

    top_k_images = []
    ############################################################################
    # Your code here
    ############################################################################
    ...

    return top_k_images

In [None]:
text_query = 'bird in the water' #@param {type:"string"}
top_k = 2 # @param {type:"integer"}

top_k_images = semantic_image_search(text_query, image_collection, top_k=top_k)

ipyplot.plot_images(top_k_images, img_width=300)

Note that semantic-image search represents a zero-shot capability for CLIP as the model has not been directly trained for this task.

## Exercise 3: Performing Zero-Shot Classification with CLIP

In this exercise, we will implement a zero-shot classification pipeline using a pre-trained CLIP model.

Core Concept: **Zero-Shot Classification**

**Zero-shot classification** refers to the ability of a model to classify data into new categories that it was not explicitly trained on. This powerful technique allows us to bypass the need for task-specific fine-tuning, leveraging the rich, generalized knowledge of a large pre-trained model.

Essentially, we will use CLIP "as-is" to determine the best-matching category for an image from a list of text labels, without any further training.



### Zero-Shot Classification with CLIP

Pre-trained **CLIP (Contrastive Language-Image Pre-training)** can be used for zero-shot classification by leveraging its ability to understand the relationship between images and text descriptions. It was trained on a vast dataset of image-text pairs from the internet to learn a shared representation space where images and their corresponding text descriptions are close together.

***

#### How It Works

1.  **Prepare Candidate Labels**: Instead of training on labeled images, you provide a list of potential class names as text (e.g., "a photo of a cat", "a photo of a dog", "a photo of a bird").
2.  **Encode Text & Image**: The **text encoder** transforms each of these text labels into a vector. Simultaneously, the **image encoder** converts the input image into a vector in the same space.
3.  **Calculate Similarity**: The model then compares the image vector to each of the text vectors using a similarity metric, like cosine similarity.
4.  **Predict Class**: The class with the highest similarity score is chosen as the final prediction for the image.
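
The four steps above can be sketched with toy numbers. The vectors below are hypothetical stand-ins for what the text and image encoders would return; only the compare-and-pick logic is meant to carry over.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Step 1: candidate labels written as text.
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

# Step 2: hypothetical encoder outputs (one vector per label, one for the image).
toy_text_embs = [
    [1.0, 0.2, 0.0],
    [0.1, 1.0, 0.1],
    [0.0, 0.3, 1.0],
]
toy_image_emb = [0.2, 0.9, 0.3]

# Step 3: similarity of the image to every label.
scores = [cosine_similarity(toy_image_emb, t) for t in toy_text_embs]

# Step 4: the label with the highest similarity is the prediction.
predicted_index = max(range(len(scores)), key=lambda i: scores[i])
predicted_label = candidate_labels[predicted_index]
print(predicted_label, scores)
```

Changing the set of classes only requires editing `candidate_labels` and re-encoding the text, which is what makes the approach zero-shot.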

***

#### Diagram

<img src="https://lh3.googleusercontent.com/d/1QQETNcX6bYL-CTOqzA4H9dTflA7xabOx">


#### Why It's Powerful

**Zero-shot classification with CLIP** is so effective because it doesn't require any retraining or fine-tuning. You can classify images into categories the model has never seen before, as long as you can describe them in text. This makes it incredibly flexible and adaptable for a wide range of tasks.

#### Let's do it!


In [None]:
#@title Necessary imports
import torch
import torchvision
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import matplotlib.pyplot as plt
import numpy as np
from torchvision import datasets, transforms
from datasets import load_dataset

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

#### Loading CLIP model

Choose CLIP model size.

If you choose `SMOL_MODEL=False`, it will load [OpenAI CLIP model](https://huggingface.co/openai/clip-vit-base-patch16) with 150M parameters.

If you choose `SMOL_MODEL=True`, it will load [TinyCLIP model](https://arxiv.org/abs/2309.12314), with 25M parameters.

You can expect that the big model performs better, but for fast iteration, we recommend using a small model.

In [None]:
SMOL_MODEL = True #@param

In [None]:
print("Loading CLIP model and processor...")
if SMOL_MODEL:
  MODEL_ID = "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"
else:
  MODEL_ID = "openai/clip-vit-base-patch16"
model = CLIPModel.from_pretrained(MODEL_ID).to(DEVICE)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
print("Model loaded successfully.")

pytorch_total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in the model: {pytorch_total_params}")

EMBEDDING_DIM = 512

Inspect the model

In [None]:
model

**Question:** What is the final embedding dimension?

#### Loading the dataset

We will use the [`Imagenette`](https://huggingface.co/datasets/frgfm/imagenette) dataset, which is a small subset of ImageNet.

In [None]:
loading_transform = transforms.Compose([
    processor.image_processor,
    lambda x: x['pixel_values'][0]
])

test_dataset = datasets.Imagenette(
    root='./data',
    split='val',
    size='160px',
    download=True,
    transform=loading_transform,
)

print('Imagenette has the following classes: ')
imagenette_classes = []
for c in test_dataset.classes:
  print(f"{c[0]}")
  imagenette_classes.append(f"{c[0]}")

NUM_CLASSES = len(imagenette_classes)

In [None]:
#@title Visualize an image and a class
sample_image, sample_class = test_dataset[0]
sample_image = torch.Tensor(sample_image).to(DEVICE)

image_std = torch.Tensor(processor.image_processor.image_std).reshape((3, 1, 1)).to(DEVICE)
image_mean = torch.Tensor(processor.image_processor.image_mean).reshape((3, 1, 1)).to(DEVICE)
image_to_plot = sample_image * image_std + image_mean

print(f'Class label = `{imagenette_classes[sample_class]}`')
plt.imshow(np.transpose(image_to_plot.cpu(), (1, 2, 0)))
plt.xticks([])
plt.yticks([])
plt.show()

In [None]:
#@title Extract fixed set of test images and labels
test_batch_size = 16 #@param
max_test_batches = 30 #@param
dataset_loader = torch.utils.data.DataLoader(test_dataset, batch_size=test_batch_size, shuffle=True)

torch.manual_seed(100)
test_images = []
test_labels = []
for idx, (images, labels) in enumerate(dataset_loader):
  test_images.append(images)
  test_labels.append(labels)
  if idx == max_test_batches - 1:
    break

### **Task 1**: Produce prompts for the classes

The CLIP model takes text as input, so we will assign a prompt to each class. Think about what a good prompt for a class would look like.

The following function should produce a list of prompts with the same length as the number of classes in the dataset.

In [None]:
def create_class_prompts(class_labels):
  class_prompts = []
  ##############################################################################
  # YOUR CODE HERE
  ##############################################################################
  ...
  return class_prompts

In [None]:
class_prompts = create_class_prompts(test_dataset.classes)
assert len(class_prompts) == NUM_CLASSES
print('You have created the class prompts')

### **Task 2**: Produce prompt embeddings for all the class prompts for cosine similarity.

For every class prompt you have produced, you need to create a corresponding embedding with the CLIP model, to be used later for cosine similarity.

**Note**: You can use `model.get_text_features` function to produce text embeddings.

**Note**: Think about what needs to be done for the embeddings to be used for **cosine similarity**.

**Note**: You can use `processor` to put `class_prompts` into a correct format. We have already done it for you.
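
As a reminder of why scale matters here, the small sketch below (plain Python, toy 2-d vectors chosen for easy arithmetic) checks that the dot product of two L2-normalized vectors equals the cosine similarity of the original vectors. The helper names are illustrative.

```python
import math

def dot(a, b):
    """Plain dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector):
    """Rescale a vector to unit L2 norm."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity computed from its definition.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Dot product after normalizing both vectors to unit length.
a_unit, b_unit = l2_normalize(a), l2_normalize(b)
dot_of_units = dot(a_unit, b_unit)

print(cosine, dot_of_units)
```

For batched tensors, `torch.nn.functional.normalize(x, dim=-1)` performs the same rescaling row by row.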

In [None]:
def embed_class_prompts_for_cosine_similarity(class_prompts, model, processor):
  with torch.no_grad():
    text_inputs = processor(text=class_prompts, padding=True, return_tensors="pt").to(DEVICE)
    ############################################################################
    # YOUR CODE HERE
    ############################################################################
    class_emb = ...
  return class_emb

In [None]:
# Auxiliary test function
def check_embedding_for_cosine_similarity(embedding):
  num_embeddings = embedding.shape[0]
  assert torch.sum(torch.abs(embedding.norm(dim=-1) - 1.0) < 1e-5) == num_embeddings

In [None]:
print("Encoding class text prompts...")
class_emb_test = embed_class_prompts_for_cosine_similarity(class_prompts, model, processor)
print("Text prompts encoded.")

assert class_emb_test.shape == (NUM_CLASSES, EMBEDDING_DIM)
check_embedding_for_cosine_similarity(class_emb_test)
print('You have correctly embedded the classes for cosine similarity.')

### **Task 3**: Implement image embedding function

Now you need to implement a function which embeds images to be later used for cosine similarity.

**Note**: You can use `model.get_image_features` for producing image embeddings, it receives images as `pixel_values`.

In [None]:
def embed_images_for_cosine_similarity(images, model):
  pixel_values = images.to(DEVICE)
  with torch.no_grad():
    ############################################################################
    # YOUR CODE HERE
    ############################################################################
    image_emb = ...
  return image_emb

In [None]:
batch_size = 17
fake_images = sample_image.reshape((1, 3, 224, 224)).expand((batch_size, -1, -1, -1))
image_emb = embed_images_for_cosine_similarity(fake_images, model)

assert image_emb.shape == (batch_size, EMBEDDING_DIM)
check_embedding_for_cosine_similarity(image_emb)
print('You have correctly embedded the images')

### **Task 4**: Implement function predicting labels

Your next task is to implement the classification function. This function will receive two tensors: `image_emb` (with shape `<NUM_IMAGES, EMBEDDING_DIM>`) and `class_emb` (with shape `<NUM_CLASSES, EMBEDDING_DIM>`).

Inside the function, you must first compute the similarity between each image and all of the possible class labels. The result will be a **similarity matrix**. After that, your logic should find the most likely class for every image by identifying which class embedding was the most similar to it. The function should then return the predicted class indices.

In [None]:
def predict_labels(image_emb, class_emb):
  ############################################################################
  # YOUR CODE HERE
  ############################################################################
  ...
  return predicted

### Putting things together

In [None]:
def compute_zero_shot_clip_accuracy(test_images, test_labels, model, processor, class_prompts):
  class_emb = embed_class_prompts_for_cosine_similarity(class_prompts, model, processor)
  correct = total = 0
  for images, labels in zip(test_images, test_labels):
    labels = labels.to(DEVICE)
    # Embed current images
    image_emb = embed_images_for_cosine_similarity(images, model)
    predicted = predict_labels(image_emb, class_emb)
    correct += (predicted == labels).sum().item()
    total += labels.size(0)
  accuracy = 100 * correct / total
  return accuracy

In [None]:
accuracy = compute_zero_shot_clip_accuracy(test_images, test_labels, model, processor, class_prompts)
print(f"The accuracy is {accuracy}%")

### **Bonus Task**: Improve Your Classifier

Great work so far! If you want to push your results further, here are two key strategies to try:

**Upgrade Your CLIP Model**: The performance of your classifier is directly tied to the power of the pre-trained checkpoint. Try loading a larger, more accurate CLIP model and see how it affects your results.

**Refine Your Class Prompts**: The default prompt of just the class name (e.g., "dog") can be improved. Since CLIP is used to seeing descriptive text, wrapping your labels in a simple template like "a photo of a dog" often leads to a significant accuracy boost. Experiment with different templates to find the best one for your dataset.
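
One lightweight way to compare templates is to keep the class names fixed and swap only the template string. The helper `apply_template` and the three class names below are illustrative assumptions; adapt them to however you extract names from the dataset.

```python
def apply_template(class_names, template="a photo of a {}"):
    """Wrap each class name in a prompt template containing one '{}' slot."""
    return [template.format(name) for name in class_names]

# A few example class names (assumed here for illustration).
toy_class_names = ["tench", "church", "parachute"]

prompts_plain = apply_template(toy_class_names, template="{}")
prompts_photo = apply_template(toy_class_names)
prompts_detailed = apply_template(toy_class_names, template="a close-up photo of a {}")

for prompts in (prompts_plain, prompts_photo, prompts_detailed):
    print(prompts)
```

Each resulting list can be embedded and scored in exactly the same way, so differences in accuracy can be attributed to the template alone.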

In [None]:
def create_class_prompts_strategy_a(class_labels):
  class_prompts = []
  ##############################################################################
  # YOUR CODE HERE
  ##############################################################################
  ...
  return class_prompts

def create_class_prompts_strategy_b(class_labels):
  class_prompts = []
  ##############################################################################
  # YOUR CODE HERE
  ##############################################################################
  ...
  return class_prompts

In [None]:
class_prompts_a = create_class_prompts_strategy_a(test_dataset.classes)
class_prompts_b = create_class_prompts_strategy_b(test_dataset.classes)
accuracy_a = compute_zero_shot_clip_accuracy(test_images, test_labels, model, processor, class_prompts_a)
accuracy_b = compute_zero_shot_clip_accuracy(test_images, test_labels, model, processor, class_prompts_b)

In [None]:
results = {
    'Strategy A': (class_prompts_a, accuracy_a),
    'Strategy B': (class_prompts_b, accuracy_b),
}
for key, (class_prompts, accuracy) in results.items():
  print(f'{key}: {accuracy}%')
  print(f'Class prompts: {[c for c in class_prompts]}')

Clean up the memory

In [None]:
del model
torch.cuda.empty_cache()

## Exercise 4 - Out-of-distribution detection with CLIP

**Out-of-distribution (OOD) detection** is the task of detecting samples that fall outside the training taxonomy. It is critical for the safe deployment of machine learning models in the real world, where a model might encounter events that the training data did not anticipate. For example, road driving datasets focus on labeling classes that are usually found in traffic scenes (road, vehicles, traffic signs). We would, however, want these models to be able to react to unusual obstacles that could feasibly appear on the road (animals, toys, etc.).


Previously, varied negative data was used to improve the ability of deep models to deal with unknown situations. Vision-language models and their large-scale pre-training offer a way to build a high-quality feature space that can be used to extend the abilities of specialized models.

### Zero-shot anomaly detection in the CLIP embedding space

In this section, we will implement an anomaly detection system based on a simple and effective technique using CLIP. The goal is to assign a high anomaly score to images that are out-of-distribution (OOD).


One simple approach, proposed in [Ming et al., Delving into Out-of-Distribution Detection, NeurIPS 2022](https://arxiv.org/pdf/2211.13445), embeds inlier prompts using CLIP. The anomaly score may then be expressed as the distance from an image embedding to the closest inlier prompt. It works as follows:

* **Define Inlier Prompts**: We create a set of text prompts describing our "inlier" (normal) data categories.

* **Embed Prompts**: We use CLIP's text encoder to get an embedding for each inlier prompt.

* **Calculate Anomaly Score**: For any given test image, we compute its image embedding. The anomaly score is then defined as the distance of this image embedding to the nearest inlier prompt embedding.

For this exercise, we will use **MNIST digits as our inlier data** and **FashionMNIST clothing items** as our outliers.


In [None]:
#@title Necessary imports
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, roc_auc_score

from torch.utils.data import Subset

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

#### Loading CLIP model

Choose CLIP model size.

If you choose `SMOL_MODEL=False`, it will load [OpenAI CLIP model](https://huggingface.co/openai/clip-vit-base-patch16) with 150M parameters.

If you choose `SMOL_MODEL=True`, it will load [TinyCLIP model](https://arxiv.org/abs/2309.12314), with 81M parameters. We use a larger TinyCLIP model here than in zero-shot classification because the smaller variant does not work well for this task.

You can expect that the big model performs better, but for fast iteration, we recommend using a small model.

In [None]:
SMOL_MODEL = True #@param

In [None]:
print("Loading CLIP model and processor...")
if SMOL_MODEL:
  MODEL_ID = "wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M"
else:
  MODEL_ID = "openai/clip-vit-base-patch16"
model = CLIPModel.from_pretrained(MODEL_ID).to(DEVICE)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
print("Model loaded successfully.")

pytorch_total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in the model: {pytorch_total_params}")

EMBEDDING_DIM = 512

#### Loading the MNIST and FashionMNIST datasets

In [None]:
# Select a subset size.
subset_size = 1_000 #@param
batch_size = 16 #@param

loading_transform = transforms.Compose([
    processor.image_processor,
    lambda x: x['pixel_values'][0]
])

print("Loading MNIST and Fashion-MNIST test datasets...")
# MNIST is our 'normal' or 'inlier' dataset (label 0)

subset_indices = range(subset_size)
mnist_test = datasets.MNIST(root='./data', train=False, download=True, transform=loading_transform)
mnist_test = Subset(mnist_test, subset_indices)
mnist_loader = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=True)
# Fashion-MNIST is our 'anomalous' or 'outlier' dataset (label 1)
fashion_mnist_test = datasets.FashionMNIST(root='./data', train=False, download=True, transform=loading_transform)
fashion_mnist_test = Subset(fashion_mnist_test, subset_indices)
fashion_mnist_loader = torch.utils.data.DataLoader(fashion_mnist_test, batch_size=batch_size, shuffle=True)
print("Datasets loaded.")

### **Task 1:** Turn the MNIST class labels into inlier prompts

Think about how to correctly embed the MNIST class labels into inlier prompts. Use your insights from CLIP Zero-shot classification.


In [None]:
################################################################################
# YOUR CODE HERE
################################################################################
inlier_prompts = ...

In [None]:
inlier_emb = embed_class_prompts_for_cosine_similarity(inlier_prompts, model, processor)

### **Task 2:** Calculate anomaly scores for a given set of image embeddings.

Your next task is to collect anomaly scores for a given dataset.


Similarly to the previous exercise, you will need to embed images into the CLIP feature space, where these embeddings can be compared to the textual descriptions of inlier classes.


Unlike the previous exercise, we are not interested in the classification result, but in the confidence/uncertainty of the model in its prediction. This is because lower confidence might indicate that a sample is out-of-distribution.


We may model prediction confidence by looking at the maximum cosine similarity between a sample and inlier class descriptions. Anomaly score can then be calculated as `1 - prediction_confidence`.
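
The scoring rule can be illustrated on toy 2-d vectors. The embeddings below are hypothetical, and `anomaly_score` handles a single image with plain Python lists, so it is a sketch of the rule rather than the batched function you are asked to write.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def anomaly_score(image_emb, inlier_text_embs):
    """1 - (highest cosine similarity to any inlier prompt embedding)."""
    max_similarity = max(cosine_similarity(image_emb, t) for t in inlier_text_embs)
    return 1.0 - max_similarity

# Hypothetical embeddings of two inlier prompts.
toy_inlier_text_embs = [[1.0, 0.0], [0.0, 1.0]]

# One image close to an inlier prompt, one far from both.
toy_inlier_image = [0.9, 0.1]
toy_outlier_image = [-0.7, -0.7]

inlier_score = anomaly_score(toy_inlier_image, toy_inlier_text_embs)
outlier_score = anomaly_score(toy_outlier_image, toy_inlier_text_embs)
print(inlier_score, outlier_score)
```

Since cosine similarity lies in $[-1, 1]$, this score lies in $[0, 2]$, with larger values meaning that no inlier prompt describes the image well.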

In [None]:
# --- Calculate Anomaly Scores ---
# Anomaly score = 1 - max_cosine_similarity(image_embedding, inlier_text_embeddings) [3]
def calculate_anomaly_scores(data_loader, model, text_embeddings):
  all_anomaly_scores = []
  # Setting up the manual seed.
  torch.manual_seed(100)
  for i, (images, _) in enumerate(data_loader):
    image_emb = embed_images_for_cosine_similarity(images, model)

    ############################################################################
    # YOUR CODE HERE
    # You will need to:
    # - Calculate cosine similarity between each image embedding and all text embeddings
    # - Find the maximum similarity for each image across all inlier text prompts
    # - Anomaly score is 1 - max_similarity (higher score means more anomalous) [3]
    ############################################################################
    ...

    assert anomaly_scores.shape == (images.shape[0],)
    all_anomaly_scores.append(anomaly_scores)
  all_anomaly_scores = torch.cat(all_anomaly_scores, dim=0)
  return all_anomaly_scores.cpu().numpy()

In [None]:
print("\nCalculating anomaly scores for MNIST(inliers)...")
mnist_anomaly_scores = calculate_anomaly_scores(mnist_loader, model, inlier_emb)
print("\nCalculating anomaly scores for FashionMNIST (outliers)...")
fashion_mnist_anomaly_scores = calculate_anomaly_scores(fashion_mnist_loader, model, inlier_emb)

### **Task 3:** Evaluate zero-shot anomaly detection performance.

Anomaly detection may be viewed as binary classification between anomalies and inlier samples. This makes binary classification metrics such as [Average Precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) and [Area Under Receiver Operating Characteristic Curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) suitable to evaluate anomaly detection methods.
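
To build intuition for AUROC, it helps to know that it equals the probability that a randomly chosen outlier receives a higher anomaly score than a randomly chosen inlier, with ties counted as one half. The sketch below computes it that way on made-up scores. The function name `auroc_by_pairs` is illustrative, and the quadratic loop is meant for explanation only; for real evaluation the scikit-learn functions linked above are the natural choice.

```python
def auroc_by_pairs(inlier_scores, outlier_scores):
    """Fraction of (inlier, outlier) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for outlier_score in outlier_scores:
        for inlier_score in inlier_scores:
            if outlier_score > inlier_score:
                wins += 1.0
            elif outlier_score == inlier_score:
                wins += 0.5
    return wins / (len(inlier_scores) * len(outlier_scores))

# Made-up anomaly scores; one outlier (0.30) scores below one inlier (0.35).
toy_inlier_scores = [0.10, 0.20, 0.35]
toy_outlier_scores = [0.30, 0.60, 0.80]

toy_auroc = auroc_by_pairs(toy_inlier_scores, toy_outlier_scores)
print(f"AUROC on toy scores: {toy_auroc:.4f}")
```

A value of 1.0 means every outlier scores above every inlier, while 0.5 corresponds to scores that carry no information about the label.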

Your next task is to evaluate the zero-shot performance of CLIP for anomaly detection on MNIST.

In [None]:
# --- Evaluation ---
# Calculate AUC-ROC and AUPRC [4, 5]
def compute_ood_eval_metrics(inlier_anomaly_scores, outlier_anomaly_scores):
  ##############################################################################
  # YOUR CODE: START
  # Combine scores and create true labels
  # 0 for inliers (MNIST), 1 for outliers (FashionMNIST)
  ##############################################################################
  ...

  return auc_roc, auprc

In [None]:
auc_roc, auprc = compute_ood_eval_metrics(mnist_anomaly_scores, fashion_mnist_anomaly_scores)
print(f"\n--- Anomaly Detection Performance ---")
print(f"AUC-ROC: {auc_roc:.4f}")
print(f"AUPRC: {auprc:.4f}")

In [None]:
# --- Visualization (Optional) ---
plt.figure(figsize=(10, 6))
plt.hist(mnist_anomaly_scores, bins=50, alpha=0.7, label='MNIST (Inliers)', color='blue', density=True)
plt.hist(fashion_mnist_anomaly_scores, bins=50, alpha=0.7, label='Fashion MNIST (Outliers)', color='red', density=True)
plt.title('Distribution of Anomaly Scores')
plt.xlabel('Anomaly Score (1 - Max Cosine Similarity)')
plt.ylabel('Density')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

If the red distribution (outliers) is shifted to the right of the blue distribution (inliers), you have done everything correctly!

### Bonus Task: Improve OOD detection

Great work so far! If you want to push your results further, here are three strategies to try:

**Upgrade Your CLIP Model**: The performance of your classifier is directly tied to the power of the pre-trained checkpoint. Try loading a larger, more accurate CLIP model and see how it affects your results.

**Refine Your Inlier Prompts**: Try different ways of encoding the inlier prompts.

**Different Anomaly Score**: Try different ways of computing the anomaly score.
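
As one possible starting point for the third strategy, the sketch below averages the `k` largest similarities per image instead of taking only the maximum. This is just one variant among many, the function name and toy values are hypothetical, and whether it helps on MNIST versus FashionMNIST is something to check empirically with the metrics from Task 3.

```python
import numpy as np

def topk_mean_anomaly_scores(sims, k=1):
    # sims: (num_images, num_prompts) cosine similarities.
    k = min(k, sims.shape[1])
    # Sort each row in ascending order and keep the k largest similarities.
    topk = np.sort(sims, axis=1)[:, -k:]
    return 1.0 - topk.mean(axis=1)

# Toy similarity matrix (hypothetical values): 2 images, 3 inlier prompts.
toy_sims = np.array([
    [0.9, 0.2, 0.1],
    [0.5, 0.5, 0.4],
])
scores_k1 = topk_mean_anomaly_scores(toy_sims, k=1)  # identical to 1 - max similarity
scores_k2 = topk_mean_anomaly_scores(toy_sims, k=2)
print(scores_k1, scores_k2)
```

With `k=1` the function reduces to the baseline score, which makes it easy to compare the two settings side by side.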

In [None]:
################################################################################
# YOUR CODE HERE
################################################################################
inlier_prompts_new = ... # TO FILL

inlier_emb_new = embed_class_prompts_for_cosine_similarity(inlier_prompts_new, model, processor)

In [None]:
def calculate_anomaly_scores_new(data_loader, model, text_embeddings):
  all_anomaly_scores = []
  # Setting up the manual seed.
  torch.manual_seed(100)
  for i, (images, _) in enumerate(data_loader):
    image_emb = embed_images_for_cosine_similarity(images, model)
    ############################################################################
    # YOUR CODE: START
    # You will need to:
    # - Calculate cosine similarity between each image embedding and all text embeddings
    # - Find the maximum similarity for each image across all inlier text prompts
    # - Anomaly score is 1 - max_similarity (higher score means more anomalous) [3]
    ############################################################################
    ...

    assert anomaly_scores.shape == (images.shape[0],)
    all_anomaly_scores.append(anomaly_scores)
  all_anomaly_scores = torch.cat(all_anomaly_scores, dim=0)
  return all_anomaly_scores.cpu().numpy()

In [None]:
print("\nCalculating anomaly scores for MNIST (inliers)...")
mnist_anomaly_scores_new = calculate_anomaly_scores_new(mnist_loader, model, inlier_emb_new)
print("\nCalculating anomaly scores for FashionMNIST (outliers)...")
fashion_mnist_anomaly_scores_new = calculate_anomaly_scores_new(fashion_mnist_loader, model, inlier_emb_new)

In [None]:
auc_roc_new, auprc_new = compute_ood_eval_metrics(mnist_anomaly_scores_new, fashion_mnist_anomaly_scores_new)
print(f"\n--- Anomaly Detection Performance (New) ---")
print(f"AUC-ROC: {auc_roc_new:.4f}")
print(f"AUPRC: {auprc_new:.4f}")

print(f"\n--- Anomaly Detection Performance (Old) ---")
print(f"AUC-ROC: {auc_roc:.4f}")
print(f"AUPRC: {auprc:.4f}")

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(mnist_anomaly_scores, bins=50, alpha=0.7, label='MNIST (Inliers) -- old', color='blue', density=True)
plt.hist(mnist_anomaly_scores_new, bins=50, alpha=0.9, label='MNIST (Inliers) -- new', color='green', density=True)
plt.hist(fashion_mnist_anomaly_scores, bins=50, alpha=0.7, label='Fashion MNIST (Outliers) -- old', color='red', density=True)
plt.hist(fashion_mnist_anomaly_scores_new, bins=50, alpha=0.9, label='Fashion MNIST (Outliers) -- new', color='yellow', density=True)
plt.title('Distribution of Anomaly Scores')
plt.xlabel('Anomaly Score (1 - Max Cosine Similarity)')
plt.ylabel('Density')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

Great job!

## Exercise 5: CLIP failure cases, SugarCrepe benchmark

We will now use the [SugarCrepe benchmark](https://arxiv.org/pdf/2306.14610) to dive deeper into the CLIP model and better understand its failure cases.

In [None]:
%%capture
!pip install datasets --upgrade

In [None]:
from datasets import load_dataset
dataset = load_dataset("HuggingFaceM4/SugarCrepe_swap_att")

Let's now inspect the images and text in this benchmark.

In [None]:
all_images = dataset['test']['image']
all_candidate_captions = dataset['test']['tested_labels'] # For each image we have 2 candidate captions.

In [None]:
ipyplot.plot_images(all_images,max_images=10,img_width=150)

In [None]:
all_candidate_captions[0:10]

### Task 1: Rank the candidate text captions for each image


Each image in this dataset has 2 potential caption candidates. Use a CLIP pre-trained model to rank the caption candidates. Return the ranked captions and their scores.
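
The ranking step itself can be sketched independently of CLIP: compute the cosine similarity between one image vector and each caption vector, then sort in descending order. The example below uses NumPy and made-up two-dimensional vectors; the function name, captions and vectors are hypothetical. It normalises explicitly, so it does not depend on whether the model outputs are already unit length.

```python
import numpy as np

def rank_by_cosine_similarity(image_vec, caption_vecs, captions):
    # Normalise so that dot products are cosine similarities.
    image_vec = image_vec / np.linalg.norm(image_vec)
    caption_vecs = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    sims = caption_vecs @ image_vec  # (num_captions,)
    order = np.argsort(-sims)        # indices from most to least similar
    return [captions[i] for i in order], sims[order]

# Toy inputs (hypothetical): one image vector and two caption vectors.
toy_captions = ["caption A", "caption B"]
toy_image_vec = np.array([1.0, 0.0])
toy_caption_vecs = np.array([[0.6, 0.8], [0.8, 0.6]])
ranked, ranked_sims = rank_by_cosine_similarity(toy_image_vec, toy_caption_vecs, toy_captions)
print(ranked, ranked_sims)  # "caption B" comes first with similarity 0.8
```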

In [None]:
def rank_image_captions(image, candidate_captions, clip_processor, clip_model):
  """
    Rank candidate captions for an image using CLIP.

    Args:
        image: Input image.
        candidate_captions: A list of possible captions for the image.
        clip_processor: Pre-trained CLIPProcessor for pre-processing the input image and text.
        clip_model: Pre-trained CLIPModel for obtaining image and text embeddings.

    Returns:
        ranked_captions: Candidate captions ranked by the similarity with the image.
        ranked_scores: Ranked similarity scores.
    """
  with torch.no_grad():
    inputs = clip_processor(text=candidate_captions, images=[image], return_tensors="pt", padding=True)
    outputs = clip_model(**inputs)
  text_emb = outputs.text_embeds
  image_emb = outputs.image_embeds
  ##############################################################################
  # Your code here
  ##############################################################################
  ...
  return ranked_captions, ranked_scores

Using your implementation, let's now visualize which caption has the highest score for each image.

In [None]:
from matplotlib import pyplot as plt

max_examples = 10

for image, candidate_captions in zip(all_images[:max_examples], all_candidate_captions[:max_examples]):
  ranked_captions, scores = rank_image_captions(image, candidate_captions, clip_processor, clip_model)

  plt.imshow(image)
  plt.show()

  print(f'{ranked_captions[0]} Score: {scores[0]}')
  print(f'{ranked_captions[1]} Score: {scores[1]}')

What do you notice here? Is the most similar caption (i.e. the highest scoring caption) the correct one?

### Discuss: CLIP failure case

By aligning a global image representation with a global text representation, the CLIP model often fails to distinguish between different object attributes (e.g. color) and other fine-grained details in the image or text. This is because such details are often not needed to minimize the contrastive loss. To address this shortcoming, additional losses (e.g. captioning) need to be added to the vision-language model.
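
To see why attribute swaps are hard for a representation that mostly captures which concepts are present, consider the extreme case of an order-free bag of words. The sketch below is only a caricature: CLIP's text encoder is a transformer and is not literally order-free, and the two captions are made up rather than taken from the benchmark.

```python
from collections import Counter

def bag_of_words(caption):
    # Order-free representation: only word counts survive.
    return Counter(caption.lower().split())

# Hypothetical captions that differ only in which object owns which attribute.
original = "a red car next to a blue bike"
swapped = "a blue car next to a red bike"

same_bag = bag_of_words(original) == bag_of_words(swapped)
print(same_bag)  # True: the two captions are indistinguishable in this representation
```

The closer a model's text representation behaves to this extreme, the less it can tell the two captions apart, which matches the kind of failure the attribute-swap captions are designed to expose.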