# 2. Image Embeddings with CLIP

In this notebook, we'll generate image embeddings for the MNIST test set using a pre-trained CLIP model. Embeddings are powerful vector representations that capture the semantic content of images. We will then use dimensionality reduction techniques (PCA and UMAP) to visualize these high-dimensional vectors in 2D, allowing us to visually explore the structure of our dataset.

**Key concepts covered:**
*   Loading pre-trained models from the FiftyOne Model Zoo
*   Computing image embeddings with CLIP
*   Assigning embeddings to dataset samples
*   Dimensionality reduction: PCA and UMAP
*   Visualizing embedding plots in FiftyOne

## Setup

First, let's add our imports and load the test dataset we created in the previous step.

In [1]:
import os
import numpy as np
import torch
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

# Ensure the test dataset exists; stop here if it does not,
# since every later cell depends on `test_dataset`
if "mnist-test-set" in fo.list_datasets():
    test_dataset = fo.load_dataset("mnist-test-set")
else:
    raise RuntimeError("Test dataset not found. Please run '1_explore_mnist.ipynb' first.")

session = fo.launch_app(test_dataset, auto=False)
print(session.url)

Connected to FiftyOne on port 5151 at 0.0.0.0.
If you are not connecting to a remote session, you may need to start a new session and specify a port
Session launched. Run `session.show()` to open the App in a cell output.
http://0.0.0.0:5151/


## Creating Image Embeddings with CLIP

Image embeddings are high-dimensional vectors that translate visual concepts into a format that machine learning models can compare. Similar images will have similar embedding vectors.
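
To make "similar" concrete, here is a minimal sketch of cosine similarity, the metric used later in this notebook. The 4-dimensional vectors and their names are made up for illustration; they are not real CLIP outputs.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy "embeddings"; real CLIP vectors here have 512 dimensions
seven_a = np.array([0.9, 0.1, 0.0, 0.2])
seven_b = np.array([0.8, 0.2, 0.1, 0.3])
zero = np.array([0.0, 0.9, 0.8, 0.1])

sim_same = cosine_similarity(seven_a, seven_b)
sim_diff = cosine_similarity(seven_a, zero)
print(f"seven vs. seven: {sim_same:.3f}")
print(f"seven vs. zero:  {sim_diff:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score much lower.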

We will use a pre-trained CLIP model from the FiftyOne Model Zoo to generate these embeddings. FiftyOne makes this process simple with the `compute_embeddings()` method.

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = foz.load_zoo_model("clip-vit-base32-torch",
                                device=device)
print(f"The model is loaded on {clip_model._device}")

total_params = sum(p.numel() for p in clip_model._model.parameters())
print(f"The CLIP model has {total_params:,} parameters.")

Downloading model from 'https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt'...
 100% |██████|    2.6Gb/2.6Gb [29.9s elapsed, 0s remaining, 100.9Mb/s]     
Downloading CLIP tokenizer...
 100% |█████|   10.4Mb/10.4Mb [207.0ms elapsed, 0s remaining, 50.0Mb/s]      
The model is loaded on cuda
The CLIP model has 151,277,313 parameters.


In [3]:
# This can take a few minutes on a Google Colab instance with GPU enabled
clip_embeddings = test_dataset.compute_embeddings(
    model=clip_model,
    batch_size=512,
    num_workers=2,
)

 100% |█████████████| 10000/10000 [25.3s elapsed, 0s remaining, 474.4 samples/s]      


The result is a NumPy array where each row is a 512-dimensional vector representing an image. Now, we'll attach each embedding to its corresponding sample in the FiftyOne dataset.

In [4]:
# Check the type and shape of the embeddings array.
# We should have 10000 embeddings, each with 512 dimensions.
print(type(clip_embeddings), clip_embeddings.shape)

# Declare a vector field, then store one embedding per sample
test_dataset.add_sample_field("clip_embeddings", fo.VectorField)
test_dataset.set_values("clip_embeddings", clip_embeddings)
# set_values() already writes to the database; save() persists any dataset-level changes
test_dataset.save()

<class 'numpy.ndarray'> (10000, 512)


## Creating a Similarity Index

A **similarity index** lets us search for similar samples based on their embeddings. We register the embeddings once under a `brain_key` and can then run nearest-neighbor queries against them by name. We'll build one on our new CLIP embeddings, using the `sklearn` backend and the cosine metric.

In [14]:
similarity_index = fob.compute_similarity(
    test_dataset,
    model="clip-vit-base32-torch",
    embeddings="clip_embeddings",
    brain_key="clip_cosine_similarity_index",
    backend="sklearn",
    metric="cosine"
)

print("Similarity index computed successfully!")

Similarity index computed successfully!
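
For intuition, the ranking behind a cosine-similarity query can be reproduced by brute force in NumPy. This is only a sketch on random stand-in data, not how FiftyOne implements its index; `top_k_cosine` and `toy_embeddings` are hypothetical names introduced for this example.

```python
import numpy as np


def top_k_cosine(embeddings: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k rows most similar to `query`, best first."""
    # L2-normalize so that a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = normed @ q
    # argsort is ascending, so negate the scores to rank the highest first
    return np.argsort(-scores)[:k]


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(100, 512))  # random stand-in for CLIP embeddings

# Query with row 0; a vector is always most similar to itself
nearest = top_k_cosine(toy_embeddings, toy_embeddings[0], k=10)
print(nearest)
```

The query row comes back as its own nearest neighbor, which is why the first result of a similarity search by sample ID is typically the query sample itself.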


Now we can easily find the most similar images to any given sample.

In [6]:
query_sample = test_dataset.first()
print(f"Querying for images similar to sample: {query_sample.id} with label {query_sample.ground_truth.label}")

similar_view = test_dataset.sort_by_similarity(
    query_sample.id,
    brain_key="clip_cosine_similarity_index",
    k=10
)

session.view = similar_view
session.refresh()
print(f"Found {len(similar_view)} most similar samples. Check the App: {session.url}")

Querying for images similar to sample: 686921ec5484ada2e34aea0c with label 7 - seven
Found 10 most similar samples. Check the App: http://0.0.0.0:5151/


In [None]:
# We can save individual views with descriptive names and query them in the app
test_dataset.save_view(f"Images similar to {query_sample.id}", similar_view)
session.refresh()

![](https://github.com/andandandand/fiftyone/blob/develop/docs/source/getting_started_experiences/Classification/assets/similar_to_query_id.webp?raw=True)

The [CLIP model](https://docs.voxel51.com/model_zoo/models.html#clip-vit-base32-torch) has an important feature: it embeds text and images in the same vector space, so we can also search for images with a text query. We will go deeper into this in the next notebook.

In [None]:
text_query = "the digit five"

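# Name the index explicitly so the query does not depend on which indexes exist
similar_to_5_view = test_dataset.sort_by_similarity(
    text_query,
    brain_key="clip_cosine_similarity_index",
    k=5
)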
session.view = similar_to_5_view
print(session.url)

http://0.0.0.0:5151/


![](https://github.com/andandandand/fiftyone/blob/develop/docs/source/getting_started_experiences/Classification/assets/text_query_five.webp?raw=1)

## Creating a 2D Projection of the Embedding Space

Our 512-dimensional embeddings cannot be plotted directly, so we use dimensionality reduction techniques like **PCA** and **UMAP** to project them into 2D. The projection is lossy, but it helps us visually identify clusters, outliers, and patterns in the data.

- **PCA (Principal Component Analysis)**: A linear method that projects the data onto the directions of greatest variance, which tends to preserve global structure.
- **UMAP (Uniform Manifold Approximation and Projection)**: A non-linear method that excels at preserving local neighborhood structures and revealing clusters.
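
To illustrate the linear case, PCA can be sketched in a few lines of NumPy using the singular value decomposition. This is a minimal sketch on random stand-in data, not the implementation FiftyOne uses; `pca_project` and `toy_embeddings` are hypothetical names introduced for this example.

```python
import numpy as np


def pca_project(x: np.ndarray, num_dims: int = 2) -> np.ndarray:
    """Project the rows of `x` onto their top `num_dims` principal components."""
    # PCA operates on mean-centered data
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_dims].T


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(200, 512))  # random stand-in for CLIP embeddings

points = pca_project(toy_embeddings)
print(points.shape)  # (200, 2)
```

Each row of `points` is a 2D coordinate that could be drawn in a scatter plot. UMAP has no comparably short closed form, which is one reason it takes noticeably longer to compute.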

In [7]:
pca_visualization = fob.compute_visualization(
    test_dataset,
    method="pca",
    embeddings="clip_embeddings",
    num_dims=2,
    brain_key="pca_visualization_clip_embeds",
)

umap_visualization = fob.compute_visualization(
    test_dataset,
    method="umap",
    embeddings="clip_embeddings",
    num_dims=2,
    brain_key="umap_visualization_clip_embeds",
)

# Refresh the session to show the new visualizations
session.refresh()
print("PCA and UMAP visualizations computed.")
print(session.url)

Generating visualization...
Generating visualization...
UMAP( verbose=True)
Sat Jul  5 14:06:58 2025 Construct fuzzy simplicial set
Sat Jul  5 14:06:58 2025 Finding Nearest Neighbors
Sat Jul  5 14:06:58 2025 Building RP forest with 10 trees
Sat Jul  5 14:07:02 2025 NN descent for 13 iterations
	 1  /  13
	 2  /  13
	 3  /  13
	 4  /  13
	Stopping threshold met -- exiting after 4 iterations
Sat Jul  5 14:07:14 2025 Finished Nearest Neighbor Search
Sat Jul  5 14:07:16 2025 Construct embedding


Epochs completed:   0%|            0/500 [00:00]

	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Sat Jul  5 14:08:16 2025 Finished embedding
PCA and UMAP visualizations computed.


### Visualizing Embeddings in the FiftyOne App

To see your new 2D projections:

1. In the App, click the **`+`** icon next to "Samples".
2. Select **Embeddings**.
3. Choose `pca_visualization_clip_embeds` or `umap_visualization_clip_embeds` from the dropdown.

You can now interact with the plot. Try coloring the points by `ground_truth.label` to see if CLIP's embeddings naturally separate the digits into clusters.

![](https://github.com/andandandand/practical-computer-vision/blob/main/images/image_embeddings_zero_cluster.gif?raw=true)

## Exercise

Compute the t-SNE visualization of the embeddings with:

```python
tsne_visualization = fob.compute_visualization(test_dataset,
                                              method="tsne",
                                              embeddings="clip_embeddings",
                                              num_dims=2,
                                              brain_key="tsne_visualization_clip_embeds")
```
and compare the results with the PCA and UMAP plots in the FiftyOne App. Refer to the [documentation](https://docs.voxel51.com/api/fiftyone.brain.html#fiftyone.brain.compute_visualization) for more information.

## Next Steps

With our dataset enriched with CLIP embeddings and visualizations, we're ready to use these features for a classification task.

Proceed to `3_zero_shot_classification.ipynb` to perform zero-shot classification with CLIP.