# 2. Image Embeddings with CLIP

In this notebook, we'll generate image embeddings for the MNIST test set using a pre-trained CLIP model. Embeddings are vector representations that capture the semantic content of images. We will then use two dimensionality reduction techniques, PCA and UMAP, to project these high-dimensional vectors into 2D so that we can visually explore the structure of our dataset.

**Key concepts covered:**
*   Loading pre-trained models from the FiftyOne Model Zoo
*   Computing image embeddings with CLIP
*   Assigning embeddings to dataset samples
*   Dimensionality reduction: PCA and UMAP
*   Visualizing embedding plots in FiftyOne

## Setup

First, let's add our imports and load the test dataset we created in the previous step.

In [None]:
import os
import numpy as np
import torch
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

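# Load the test dataset created in the previous notebook.
# Fail early if it is missing, since every later cell depends on `test_dataset`.
if "mnist-test-set" not in fo.list_datasets():
    raise RuntimeError(
        "Test dataset not found. Please run '1_explore_mnist.ipynb' first."
    )
test_dataset = fo.load_dataset("mnist-test-set")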

session = fo.launch_app(test_dataset, auto=False)
print(session.url)

## Creating Image Embeddings with CLIP

Image embeddings are high-dimensional vectors that encode visual concepts in a form that can be compared numerically. Similar images tend to have embedding vectors that lie close together, for example under cosine similarity.
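To make "similar vectors" concrete, here is a minimal sketch of cosine similarity in NumPy. The three 4-dimensional vectors are made-up toy values, not real CLIP output, and `cosine_similarity` is a helper defined here for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (assumed values, chosen only to illustrate the idea)
seven_a = np.array([0.9, 0.1, 0.0, 0.2])
seven_b = np.array([0.8, 0.2, 0.1, 0.2])
zero = np.array([0.0, 0.9, 0.8, 0.1])

sim_same = cosine_similarity(seven_a, seven_b)
sim_diff = cosine_similarity(seven_a, zero)
print(f"7 vs 7: {sim_same:.3f}")
print(f"7 vs 0: {sim_diff:.3f}")
```

The two vectors standing in for the same digit score much higher than the pair standing in for different digits, which is the property the rest of this notebook relies on.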

We will use a pre-trained CLIP model from the FiftyOne Model Zoo to generate these embeddings. FiftyOne makes this process simple with the `compute_embeddings()` method.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = foz.load_zoo_model("clip-vit-base32-torch",
                                device=device)
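# Note: `_device` and `_model` are private attributes of the zoo model wrapper
# and may change between FiftyOne versions.
print(f"The model is loaded on {clip_model._device}")

# Count the parameters of the underlying PyTorch module
total_params = sum(p.numel() for p in clip_model._model.parameters())
print(f"The CLIP model has {total_params:,} parameters.")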

In [None]:
# This will take about 3 min on a Google Colab instance with GPU enabled
clip_embeddings = test_dataset.compute_embeddings(
    model=clip_model,
    batch_size=512,
    num_workers=2,
)

The result is a NumPy array where each row is a 512-dimensional vector representing an image. Now, we'll attach each embedding to its corresponding sample in the FiftyOne dataset.
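Because the embeddings are attached row by row, it can be worth checking the array before writing it to the dataset. The sketch below uses a hypothetical helper, `check_embeddings`, which is not part of FiftyOne, and runs it on a zero-filled stand-in array; the default `dim=512` is taken from the embedding size stated above.

```python
import numpy as np

def check_embeddings(embeddings, num_samples, dim=512):
    """Hypothetical helper: raise if the array cannot be attached row by row."""
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2D array, got {embeddings.ndim}D")
    if embeddings.shape != (num_samples, dim):
        raise ValueError(
            f"expected shape {(num_samples, dim)}, got {embeddings.shape}"
        )
    if not np.isfinite(embeddings).all():
        raise ValueError("embeddings contain NaN or infinite values")

# Stand-in array for illustration; real embeddings come from the model
toy = np.zeros((5, 512), dtype=np.float32)
check_embeddings(toy, num_samples=5)
print("Embeddings look consistent.")
```

In this notebook the equivalent call would be `check_embeddings(clip_embeddings, len(test_dataset))`.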

In [None]:
# Check the type and shape of the embeddings array
print(type(clip_embeddings), clip_embeddings.shape)

test_dataset.add_sample_field("clip_embeddings", fo.VectorField)
test_dataset.set_values("clip_embeddings", clip_embeddings)
test_dataset.save()

## Creating a Similarity Index

A **similarity index** stores the embeddings in a structure that can be queried for nearest neighbors, so you can retrieve the samples most similar to a query without recomputing distances by hand each time. We'll build one on our new CLIP embeddings, using the `sklearn` backend with the cosine metric.
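For intuition, the kind of query a similarity index answers can be written as a brute-force search in NumPy. This is a sketch on random stand-in data, not how FiftyOne implements its backends, and `top_k_cosine` is a name introduced here for illustration.

```python
import numpy as np

def top_k_cosine(embeddings, query_index, k):
    """Return indices of the k rows most similar to row `query_index`."""
    # Normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit[query_index]
    # Sort by descending similarity; the query itself ranks first
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(100, 512))  # stand-in for real embeddings
neighbors = top_k_cosine(toy_embeddings, query_index=0, k=10)
print(neighbors)
```

Every query here compares against all rows, so the cost grows with the size of the dataset.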

In [None]:
similarity_index = fob.compute_similarity(
    test_dataset,
    embeddings="clip_embeddings",
    brain_key="clip_cosine_similarity_index",
    backend="sklearn",
    metric="cosine"
)

print("Similarity index computed successfully!")

Now we can easily find the most similar images to any given sample.

In [None]:
query_sample = test_dataset.first()
print(f"Querying for images similar to sample: {query_sample.id} with label {query_sample.ground_truth.label}")

similar_view = test_dataset.sort_by_similarity(
    query_sample.id,
    brain_key="clip_cosine_similarity_index",
    k=10
)

session.view = similar_view
session.refresh()
print(f"Found {len(similar_view)} most similar samples. Check the App: {session.url}")

## Creating a 2D Projection of the Embeddings

Our 512-dimensional embeddings cannot be plotted directly. We use dimensionality reduction techniques like **PCA** and **UMAP** to project them into 2D space. The projection is lossy, since most of the original dimensions are discarded, but it helps us visually identify clusters, outliers, and patterns in the data.

- **PCA (Principal Component Analysis)**: A linear method that projects the data onto the directions of greatest variance, which tends to preserve global structure.
- **UMAP (Uniform Manifold Approximation and Projection)**: A non-linear method that excels at preserving local neighborhood structures and revealing clusters.
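As a rough illustration of the linear case, PCA can be sketched in a few lines of NumPy using the singular value decomposition. This is not FiftyOne's implementation, the input is random stand-in data, and `pca_project` is a helper named here for illustration.

```python
import numpy as np

def pca_project(X, num_dims=2):
    """Project the rows of X onto their top `num_dims` principal components."""
    X = np.asarray(X, dtype=float)
    # PCA operates on mean-centered data
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_dims].T

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 512))  # stand-in for real embeddings
points_2d = pca_project(toy)
print(points_2d.shape)
```

UMAP has no comparably short formulation, so it is left to the library call in the next cell.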

In [None]:
pca_visualization = fob.compute_visualization(test_dataset,
                                              method="pca",
                                              embeddings="clip_embeddings",
                                              num_dims=2,
                                              brain_key="pca_visualization_clip_embeds")

umap_visualization = fob.compute_visualization(test_dataset,
                                               method="umap",
                                               embeddings="clip_embeddings",
                                               num_dims=2,
                                               brain_key="umap_visualization_clip_embeds")
print("PCA and UMAP visualizations computed.")

### Visualizing Embeddings in the FiftyOne App

To see your new 2D projections:

1. In the App, click the **`+`** icon next to "Samples".
2. Select **Embeddings**.
3. Choose `pca_visualization_clip_embeds` or `umap_visualization_clip_embeds` from the dropdown.

You can now interact with the plot. Try coloring the points by `ground_truth.label` to see if CLIP's embeddings naturally separate the digits into clusters.

![](https://github.com/andandandand/practical-computer-vision/blob/main/images/image_embeddings_zero_cluster.gif?raw=true)

In [None]:
session.refresh()

## Next Steps

With our dataset enriched with CLIP embeddings and visualizations, we're ready to use these features for a classification task.

Proceed to `3_zero_shot_classification.ipynb` to perform zero-shot classification with CLIP.