# CLIP Vector Search Feasibility Study

This notebook evaluates the practicality of using CLIP embeddings for the media search workflow outlined in `docs/mvp_backend_design.md`. We focus on text-to-image retrieval with CPU-only inference to mirror the hackathon deployment constraints.


## Setup Checklist

- Use CPU-only execution to stay aligned with the MVP constraints.
- Reuse dependencies from `requirements.txt` where possible; install extras inline if needed.
- Demonstrate cosine-similarity search across CLIP embeddings for a small image gallery.
- Capture observations about latency, memory footprint, and qualitative retrieval quality.


In [1]:
%pip install --quiet sentence-transformers pillow matplotlib requests scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
import io
import time
from dataclasses import dataclass
from typing import List, Sequence, Tuple

import matplotlib.pyplot as plt
import numpy as np
import requests
import torch
from PIL import Image, ImageOps
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

DEVICE = "cpu"  # CLIP inference remains CPU-only per MVP constraints




## Sample Media Gallery

The MVP needs semantically distinct media to validate CLIP retrieval performance. We curate four public-domain images sourced from Wikimedia Commons and load them from the local `images/` directory. Each sample carries a title and a few keywords we expect CLIP to understand.


In [None]:
@dataclass
class MediaSample:
    title: str
    filepath: str
    keywords: Sequence[str]


import os

# __file__ is undefined inside a notebook kernel, so resolve paths from the working directory.
NOTEBOOK_DIR = os.getcwd()
IMAGES_DIR = os.path.join(NOTEBOOK_DIR, "images")

GALLERY: List[MediaSample] = [
    MediaSample(
        title="Tabby Cat",
        filepath=os.path.join(IMAGES_DIR, "cat.jpg"),
        keywords=["cat", "feline", "pet", "animal"],
    ),
    MediaSample(
        title="Strawberry",
        filepath=os.path.join(IMAGES_DIR, "strawberry.jpg"),
        keywords=["strawberry", "fruit", "food"],
    ),
    MediaSample(
        title="Golden Retriever",
        filepath=os.path.join(IMAGES_DIR, "dog.jpeg"),
        keywords=["golden retriever", "dog", "animal"],
    ),
    MediaSample(
        title="Red Ferrari",
        filepath=os.path.join(IMAGES_DIR, "red_ferrari.jpg"),
        keywords=["red ferrari", "car", "sport"],
    ),
]

images: List[Image.Image] = []
for sample in GALLERY:
    img = Image.open(sample.filepath).convert("RGB")
    images.append(img)
    print(f"Loaded {sample.title} - size {img.size}")

FileNotFoundError: [Errno 2] No such file or directory: '/content/images/cat.jpg'

In [None]:
fig, axes = plt.subplots(1, len(GALLERY), figsize=(16, 4))
for ax, sample, image in zip(axes, GALLERY, images):
    ax.imshow(image)
    ax.axis("off")
    ax.set_title(sample.title)
fig.suptitle("Sample Gallery", fontsize=14)
plt.tight_layout()


## Augment Gallery For Clustering

For clustering we simulate richer categories by creating mirrored variants of each asset. This mimics multiple uploads with shared semantics (e.g., different angles of the same subject) without depending on additional downloads.


In [None]:
cluster_samples: List[MediaSample] = []
cluster_images: List[Image.Image] = []

for sample, image in zip(GALLERY, images):
    cluster_samples.append(sample)
    cluster_images.append(image)

    mirrored = ImageOps.mirror(image)
    mirrored_sample = MediaSample(
        title=f"{sample.title} (mirrored)",
        filepath=sample.filepath,  # MediaSample has no `url` field; reuse the source path
        keywords=sample.keywords,
    )
    cluster_samples.append(mirrored_sample)
    cluster_images.append(mirrored)

print(f"Clustering corpus size: {len(cluster_samples)} images")


## Load CLIP Model

We leverage the `clip-ViT-B-32` checkpoint from `sentence-transformers`, which provides a CPU-friendly wrapper for CLIP. The model encodes both images and text into a shared 512-dimensional space.


In [None]:
start = time.perf_counter()
model = SentenceTransformer("clip-ViT-B-32", device=DEVICE)
load_duration = time.perf_counter() - start
print(f"Model loaded on {DEVICE} in {load_duration:.2f}s")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")


### Zero-Shot Capability

`clip-ViT-B-32` arrives pre-trained on hundreds of millions of image–text pairs (OpenAI's CLIP, trained on the WebImageText dataset and packaged by `sentence-transformers`). This allows zero-shot encoding of unlabeled photographs: the image encoder maps each upload into a semantic embedding without requiring dataset-specific training.


In [None]:
start = time.perf_counter()
image_embeddings = model.encode(
    images,
    batch_size=len(images),  # encode the whole gallery in a single batch
    convert_to_tensor=True,
    device=DEVICE,
    show_progress_bar=False,
    normalize_embeddings=True,
)
image_encode_duration = time.perf_counter() - start
print(f"Encoded {len(images)} images in {image_encode_duration:.2f}s")

## Automatic Clustering Prototype

We now cluster the extended gallery using agglomerative clustering with average linkage over cosine distance (1 - cosine similarity). This is a batch approximation of the MVP behavior, which assigns each asset to the nearest centroid or spawns a new cluster when similarity drops below a configurable threshold.
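
To make the merge rule concrete, here is a dependency-free sketch of average-linkage clustering with a cosine threshold on toy 2-D vectors. It is not scikit-learn's implementation; the function names `cosine_distance` and `average_linkage_clusters` and the toy vectors are illustrative assumptions. Two groups merge only while their mean pairwise cosine distance stays below `1 - sim_threshold`.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage_clusters(vectors, sim_threshold):
    """Naive O(n^3) agglomerative clustering; returns lists of member indices."""
    distance_threshold = 1.0 - sim_threshold
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None  # (mean distance, index of first cluster, index of second cluster)
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            mean_dist = sum(
                cosine_distance(vectors[a], vectors[b]) for a in ci for b in cj
            ) / (len(ci) * len(cj))
            if best is None or mean_dist < best[0]:
                best = (mean_dist, i, j)
        mean_dist, i, j = best
        if mean_dist >= distance_threshold:
            break  # the closest pair is already too far apart, so stop merging
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy vectors: two near-horizontal and two near-vertical directions.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
groups = average_linkage_clusters(toy, sim_threshold=0.82)
print(groups)  # [[0, 1], [2, 3]]
```

Raising `sim_threshold` toward 1.0 shrinks the allowed distance, so fewer merges happen and the gallery splits into more, tighter groups.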


In [None]:
cluster_image_embeddings = model.encode(
    cluster_images,
    batch_size=4,
    convert_to_tensor=True,
    device=DEVICE,
    show_progress_bar=False,
    normalize_embeddings=True,
)
print(f"Cluster embedding matrix shape: {cluster_image_embeddings.shape}")


In [None]:
CLUSTER_SIM_THRESHOLD = 0.82  # mirrors design doc default of ~0.8 cosine
COSINE_DISTANCE_THRESHOLD = 1 - CLUSTER_SIM_THRESHOLD

clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=COSINE_DISTANCE_THRESHOLD,
    metric="cosine",
    linkage="average",
)
cluster_ids = clustering.fit_predict(cluster_image_embeddings.cpu().numpy())

print(f"Detected {cluster_ids.max() + 1} clusters with cosine ≥ {CLUSTER_SIM_THRESHOLD}")


In [None]:
clusters = {}
for idx, cluster_id in enumerate(cluster_ids):
    clusters.setdefault(cluster_id, []).append((cluster_samples[idx], cluster_images[idx]))

for cluster_id, items in clusters.items():
    titles = ", ".join(sample.title for sample, _ in items)
    print(f"Cluster {cluster_id}: {titles}")


In [None]:
rows = len(clusters)
cols = max(len(items) for items in clusters.values())
# squeeze=False keeps `axes` two-dimensional even when there is a single row or column.
fig, axes = plt.subplots(rows, cols, figsize=(4 * cols, 3 * rows), squeeze=False)

for row_idx, (cluster_id, items) in enumerate(clusters.items()):
    for col_idx in range(cols):
        ax = axes[row_idx][col_idx]
        if col_idx < len(items):
            sample, img = items[col_idx]
            ax.imshow(img)
            ax.set_title(f"Cluster {cluster_id}\n{sample.title}")
        ax.axis("off")

plt.tight_layout()
fig.suptitle("Clustered Media Groups", fontsize=16, y=1.02)
plt.show()


In [None]:
cluster_centroids = {}
for cluster_id in clusters:
    # Select member rows by cluster label rather than searching the sample list by equality.
    indices = np.flatnonzero(cluster_ids == cluster_id).tolist()
    vectors = cluster_image_embeddings[indices]
    # Re-normalize the mean so the centroid is a unit vector and dot products remain cosine similarities.
    centroid = torch.nn.functional.normalize(vectors.mean(dim=0, keepdim=True), p=2.0, dim=1)
    cluster_centroids[cluster_id] = centroid
    print(f"Cluster {cluster_id} centroid norm: {centroid.norm().item():.3f}")


### Integration Notes

- **Threshold tuning:** `CLUSTER_SIM_THRESHOLD` aligns with the 0.8 cosine guidance in `docs/mvp_backend_design.md`. Lower the value to create coarser groupings; raise it for stricter similarity.
- **Centroid persistence:** Store `cluster_centroids[cluster_id]` in Postgres (`cluster.centroid`) for ANN lookups. The normalized centroid lets us reuse cosine similarity in pgvector.
- **Dynamic assignment:** During ingest, compare the new asset's embedding to existing centroids. Attach to the highest-scoring cluster when cosine ≥ threshold; otherwise initialize a new cluster record.
- **UI grouping:** The `clusters` dict mirrors the payload the admin UI can render—`cluster_id`, `representative_thumbnail`, and asset list. Use `MediaSample.keywords` (or downstream tags) to derive display labels.
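
The dynamic-assignment note above could be sketched roughly as follows. This is a minimal, dependency-free illustration, not the backend's implementation: the in-memory `centroids` and `counts` dicts stand in for the Postgres `cluster` table, the function name `assign_to_cluster` is hypothetical, and the running-mean centroid update is an assumption (the design doc may specify a different update rule).

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def assign_to_cluster(embedding, centroids, counts, sim_threshold=0.82):
    """Attach an embedding to the best-matching centroid or open a new cluster.

    centroids: dict mapping cluster id -> unit-norm centroid (mutated in place)
    counts:    dict mapping cluster id -> number of members (mutated in place)
    """
    embedding = normalize(embedding)
    best_id, best_sim = None, -1.0
    for cluster_id, centroid in centroids.items():
        # Both vectors are unit length, so the dot product is the cosine similarity.
        sim = sum(x * y for x, y in zip(embedding, centroid))
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim

    if best_id is not None and best_sim >= sim_threshold:
        n = counts[best_id]
        # Assumed update rule: weighted running mean, then re-normalize.
        merged = [(n * c + e) / (n + 1) for c, e in zip(centroids[best_id], embedding)]
        centroids[best_id] = normalize(merged)
        counts[best_id] = n + 1
        return best_id

    new_id = max(centroids, default=-1) + 1
    centroids[new_id] = embedding
    counts[new_id] = 1
    return new_id


centroids, counts = {}, {}
ids = [
    assign_to_cluster(vec, centroids, counts)
    for vec in ([1.0, 0.0], [0.99, 0.1], [0.0, 1.0])
]
print(ids, counts)  # [0, 0, 1] {0: 2, 1: 1}
```

With real data, the linear scan over `centroids` would be replaced by the pgvector nearest-neighbor query, while the threshold check and the new-cluster branch stay the same.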


## Findings

- CLIP embeddings deliver stable zero-shot matches between free-form queries and unlabeled uploads.
- CPU-only inference keeps per-item latency in the tens of milliseconds for small batches; model load dominates cold-start time.
- Cosine-based agglomerative clustering groups mirrored variants into the same cluster at the configured threshold, matching the MVP's centroid assignment strategy.
- Persisting normalized centroids enables fast ANN lookup via pgvector and cleanly drives the admin UI's "similar category" views.


## Text-to-Image Retrieval Demo

To test semantic retrieval, we evaluate natural-language queries that map to the gallery items. We expect the cosine similarity between query embeddings and image embeddings to surface meaningful matches.


In [None]:
queries = [
    "fresh red fruit",
    "sleepy striped house cat",
    "golden retriever dog",  # targets the Golden Retriever image
    "red sports car",  # targets the Red Ferrari image
]

start = time.perf_counter()
text_embeddings = model.encode(
    queries,
    convert_to_tensor=True,
    device=DEVICE,
    normalize_embeddings=True,
)
text_encode_duration = time.perf_counter() - start
print(f"Encoded {len(queries)} queries in {text_encode_duration:.2f}s")

In [None]:
def search(query_embedding: torch.Tensor, top_k: int = 3) -> List[Tuple[float, MediaSample]]:
    similarities = torch.matmul(image_embeddings, query_embedding)
    # Ensure k does not exceed the number of images
    k = min(top_k, len(GALLERY))
    top_scores, top_indices = torch.topk(similarities, k=k)
    return [(score.item(), GALLERY[idx]) for score, idx in zip(top_scores, top_indices)]


for query, embedding in zip(queries, text_embeddings):
    print("\nQuery:", query)
    for rank, (score, sample) in enumerate(search(embedding), start=1):
        print(f"  {rank}. {sample.title:15s} — cosine={score:.3f}")
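
The `search` helper scores with a plain matrix product, which equals cosine similarity only because both sides were encoded with `normalize_embeddings=True`. The toy sketch below shows the same ranking logic without the model. The 3-D vectors and the names `l2_normalize` and `top_k` are made up for illustration and are not real CLIP embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query, gallery, k=3):
    """Rank gallery items by cosine similarity to the query.

    gallery: dict mapping title -> raw (unnormalized) vector
    Returns up to k (score, title) pairs, best match first.
    """
    q = l2_normalize(query)
    scored = []
    for title, vector in gallery.items():
        v = l2_normalize(vector)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, v)), title))
    scored.sort(reverse=True)
    return scored[: min(k, len(scored))]


# Hypothetical stand-ins for image embeddings.
toy_gallery = {
    "cat": [0.9, 0.1, 0.0],
    "strawberry": [0.0, 0.2, 0.9],
    "car": [0.1, 0.9, 0.2],
}
for score, title in top_k([1.0, 0.2, 0.0], toy_gallery, k=2):
    print(f"{title:12s} cosine={score:.3f}")
```

If either side is left unnormalized, the dot product also reflects vector length, so scores are no longer comparable against a fixed cosine threshold.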

## Embedding Latency Snapshot

Low latency is critical for the worker pipeline. The cell below aggregates the timing measurements captured during the run.


In [None]:
latency_metrics = {
    "model_load_s": load_duration,
    "image_batch_encode_s": image_encode_duration,
    "text_batch_encode_s": text_encode_duration,
    "image_per_item_ms": (image_encode_duration / len(images)) * 1000,
    "text_per_query_ms": (text_encode_duration / len(queries)) * 1000,
}

for name, value in latency_metrics.items():
    # Each key already carries its unit as a suffix (_s or _ms), so print the value unscaled.
    unit = "ms" if name.endswith("_ms") else "s"
    print(f"{name:24s}: {value:8.2f} {unit}")
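
The snapshot above times each step once, so the first encode call may include one-off warm-up cost. A possible refinement, sketched here with only the standard library, is to discard a warm-up run and report the median over several repeats. The helper name `time_call` and its parameters are illustrative assumptions, not part of the notebook.

```python
import statistics
import time


def time_call(fn, *, repeats=5, warmup=1):
    """Call fn repeatedly and summarize wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000)
    return {
        "runs": repeats,
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


# Stand-in workload; in the notebook this would wrap a model.encode(...) call.
stats = time_call(lambda: sum(range(10_000)), repeats=5)
print(stats)
```

In the notebook, the same helper could wrap the text or image encode call in a `lambda`; the median is less sensitive than a single run to background noise on a shared CPU.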
