# Notebook: CheXpert + ChestXray14 Preprocessing and Image Embedding

This notebook performs all image-side preprocessing and indexing for the multimodal chest X-ray retrieval pipeline. It computes embeddings for roughly 335,000 X-ray images across two datasets, **CheXpert** and **ChestXray14**, and saves them in a FAISS index for fast nearest-neighbor retrieval. Images are read from the Colab SSD and embedded on an A100 GPU to avoid Google Drive I/O limitations.


---


## Background: What Are Embeddings and FAISS?

### What is an Image Embedding?
An **embedding** is a dense vector representation of an image (or text) that captures its semantic meaning in a high-dimensional space. In this notebook, we use a pretrained **BiomedCLIP** vision encoder to convert each chest X-ray image into a 512-dimensional vector that encodes its visual features in a way that is meaningful for similarity comparisons.

These vectors allow us to compare images not by raw pixel values, but by how semantically similar they are—e.g., images with similar medical findings are mapped closer together.
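
To make "closer together" concrete, here is a minimal sketch of cosine similarity on toy 3-dimensional vectors. The vectors are invented for illustration and `cosine_similarity` is a hypothetical helper; real BiomedCLIP embeddings have 512 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (illustrative values only).
query = [0.9, 0.1, 0.0]
similar = [0.8, 0.2, 0.1]
dissimilar = [0.0, 0.1, 0.9]

print(cosine_similarity(query, similar))     # higher score
print(cosine_similarity(query, dissimilar))  # lower score
```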

### What is FAISS?
**FAISS** (Facebook AI Similarity Search) is a library for fast, scalable nearest-neighbor search on dense vectors. It supports indexing millions of high-dimensional embeddings and querying them efficiently using metrics such as inner product or Euclidean (L2) distance. Cosine similarity is obtained by L2-normalizing the vectors and then searching by inner product.

In this project, we:
- Normalize each image embedding to unit length
- Use **FAISS's IndexFlatIP** (inner product) to perform similarity search over the normalized vectors
- Store the index and aligned UUIDs for real-time retrieval in the Radiology Assistant

This enables us to later retrieve the most visually similar medical images to a user-uploaded query in real-time, without needing to store or access the full image dataset during inference.
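
The steps listed above can be sketched without FAISS itself. The following stand-in uses plain NumPy to run the same kind of exact inner-product search that a flat index performs. The random vectors, the `img-N` identifiers, and both helper functions are assumptions made for illustration, not part of the pipeline.

```python
import numpy as np

def normalize_rows(x):
    """L2-normalize each row so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def top_k_inner_product(database, query, k):
    """Exact top-k search by inner product (brute force)."""
    scores = database @ query          # one score per database row
    order = np.argsort(-scores)[:k]    # indices of the k highest scores
    return order, scores[order]

rng = np.random.default_rng(0)
database = normalize_rows(rng.normal(size=(1000, 512)).astype("float32"))
uuids = [f"img-{i}" for i in range(len(database))]  # placeholder IDs

# Query with a slightly perturbed copy of row 42; it should come back first.
query = database[42] + 0.01 * rng.normal(size=512).astype("float32")
query = query / np.linalg.norm(query)

ids, scores = top_k_inner_product(database, query, k=3)
print([uuids[i] for i in ids], scores)
```

With FAISS, `index.search(queries, k)` takes the place of `top_k_inner_product` and returns scores and row indices for a batch of queries.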

---

## Workflow Overview

### **Step 1 – Environment Setup**
- Verified GPU (A100 or T4) and selected appropriate compute backend (`torch.device`).
- Installed required libraries (e.g., `open_clip_torch`, `faiss-cpu`).

### **Step 2 – Load Preprocessed CheXpert Data**
- Copied `chexpert_flat.zip` to the Colab SSD and unzipped it into `/content/chexpert_flat/`.
- The flattened `.jpg` files are converted to 224×224 grayscale `.png` files in `/content/chexpert/` by the preprocessing step below.

### **Step 3 – Define Paths**
- Defined fixed path constants for:
  - CheXpert directory: `/content/chexpert/`
  - Chest14 directory: `/content/images-224/images-224/`
  - Drive-relative paths for UUID mapping and FAISS save locations.

### **Step 4 – Define Transforms**
- Defined a basic torchvision transform (`Resize((224, 224)) → ToTensor`). At embedding time, the `preprocess` transform returned alongside the BiomedCLIP model (resize, tensor conversion, normalization) is applied to each image.

### **Step 5 – Launch Preprocessing (CheXpert Only)**
- Re-ran the `process_one` preprocessing routine over the CheXpert `.jpg` files in parallel, so that the results live on the SSD rather than on Drive.
- The resulting `.png` files in `/content/chexpert/` are the ones used for metadata generation and embedding.

### **Step 6 – Generate Image Metadata**
- Created `image_metadata` list of dictionaries with:
  - `uuid`: unique image ID
  - `path`: relative Drive path for later lookup
  - `dataset`: either `"chexpert"` or `"chest14"`

### **Step 7 – Save Manifest**
- Wrote `image_metadata.jsonl` (≈45 MB), one JSON object per line via `json.dumps`.
- Manifest persisted to Drive: `./data/indexes/image_metadata.jsonl`

### **Step 8 – Load BiomedCLIP Vision Encoder**
- Loaded `microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224` using `open_clip.create_model_from_pretrained(...)`.
- Only `encode_image()` was used; model weights remained frozen.

### **Step 9 – Embed Images in Batches**
- Iterated over all entries in `image_metadata.jsonl`.
- Loaded images from **SSD**, preprocessed, and passed through BiomedCLIP.
- Appended resulting 512-D float32 embeddings and UUIDs to memory.

### **Step 10 – Build FAISS Index**
- L2-normalized the embeddings so that inner product equals cosine similarity.
- Built a `faiss.IndexFlatIP` index of dimension 512 and saved two files:
  - the binary index: `./data/indexes/image_faiss.bin`
  - the aligned UUID list: `./data/indexes/image_uuids.json`

---

## Final Output Files

| Filename               | Purpose                                      | Size    |
|------------------------|----------------------------------------------|---------|
| `image_faiss.bin`      | FAISS index for ~335k images                 | ~655 MB |
| `image_uuids.json`     | One-to-one mapping from index position to UUID | ~12 MB  |
| `image_metadata.jsonl` | Full manifest with path + dataset info       | ~45 MB  |
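
As a rough cross-check of the index size, a flat index stores every vector uncompressed, so its size should be close to `N × 512 × 4` bytes. The sketch below assumes `N = 335,534` (the count shown by the embedding progress bar later in this notebook) and ignores the small file header.

```python
# Back-of-the-envelope size of a flat float32 index.
num_vectors = 335_534      # images embedded in Step 9 (from the progress bar)
dim = 512                  # BiomedCLIP embedding dimension
bytes_per_float32 = 4

raw_bytes = num_vectors * dim * bytes_per_float32
print(f"{raw_bytes:,} bytes")
print(f"{raw_bytes / 1e6:.0f} MB (decimal)")
print(f"{raw_bytes / 2**20:.0f} MiB (binary)")
```

The binary-unit figure lines up with the ~655 MB listed for `image_faiss.bin`, which suggests the sizes in the table are reported in binary units.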

## Step 0: Mounting Google Drive and Importing Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/multimodal-xray-agent
!ls

In [None]:
!pip install open_clip_torch faiss-cpu -q

In [None]:
import torch
import faiss
import numpy as np
import glob, random
import itertools, pprint
import os, uuid, json, shutil

from tqdm import tqdm
from PIL import Image
from pathlib import Path
from torchvision import transforms
from concurrent.futures import ThreadPoolExecutor
from open_clip import create_model_from_pretrained, get_tokenizer

from src.chexpert_preprocessing import process_one

## Step 1: Verifying GPU and Environment

In [None]:
# Device-agnostic setup
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    device = torch.device("cuda")
    print(f"GPU detected: {device_name}")
else:
    device = torch.device("cpu")
    print("GPU not detected. Falling back to CPU.")

print(f"Running on device: {device}")

GPU detected: NVIDIA A100-SXM4-40GB
Running on device: cuda


## Step 2: Loading CheXpert Data to Local SSD

In [None]:
!cp "./data/images_sample/chexpert_flat.zip" /content/

In [None]:
!unzip -q /content/chexpert_flat.zip -d /content

In [None]:
!find /content/chexpert_flat -type f | head -n 3

/content/chexpert_flat/patient36133_study1_view1_frontal.jpg
/content/chexpert_flat/patient24375_study8_view2_lateral.jpg
/content/chexpert_flat/patient28388_study17_view2_frontal.jpg


## Step 3: Defining Paths

In [None]:
IN_DIR  = Path("/content/chexpert_flat").resolve()
OUT_DIR = Path("/content/chexpert").resolve()
OUT_DIR.mkdir(parents=True, exist_ok=True)

## Step 4: Define Transforms

In [None]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),          # [0,1]  float32
])

## Step 5: Launch Parallel Processing (Preprocessing)

This is essentially the same step we used in the earlier notebook to preprocess the dataset. We repeat it here so that the files are loaded from Colab's SSD rather than Google Drive, which causes I/O errors at this dataset's volume. I am sure there are more efficient ways to solve this problem, but this was the best approach I could find in the time available.

In [None]:
# Find all .jpg files in the input directory and store their paths in a list
image_paths = list(IN_DIR.glob("*.jpg"))  
print(f"Discovered {len(image_paths):,} images") 

Discovered 223,416 images


In [None]:
# Use a thread pool to process images in parallel (4 threads)
with ThreadPoolExecutor(max_workers=4) as pool:
    # Apply process_one to each image path, showing progress with tqdm
    results = list(tqdm(pool.map(process_one, image_paths), total=len(image_paths)))

print(f"\n Preprocessed {sum(results):,} / {len(image_paths):,} images into {OUT_DIR}")

## Step 6: Generate Image Metadata

In [None]:
image_metadata = []

# Iterate over all files in the Chexpert output directory
for fname in tqdm(os.listdir(OUT_DIR), desc="Build Chexpert image metadata from SSD"):
    # Only process files that are PNG images
    if fname.endswith(".png"):
        image_metadata.append({
            "uuid": str(uuid.uuid4()),  # Generate a unique identifier for each image
            "path": f"data/images_sample/chexpert/{fname}",  # Path relative to Drive, not SSD
            "dataset": "chexpert"  # Indicate the dataset source
        })

Build Chexpert image metadata from SSD: 100%|██████████| 223414/223414 [00:00<00:00, 263144.35it/s]


In [None]:
# Now doing the same for the Chest14 dataset
!cp "./data/images_sample/chest14.zip" /content/

In [None]:
!unzip -q /content/chest14.zip -d /content

In [None]:
!find /content/images-224 -type f | head -n 3

/content/images-224/images-224/00000001_000.png
/content/images-224/images-224/00000001_001.png
/content/images-224/images-224/00000001_002.png


In [None]:
# Define the path to the Chest14 images on Colab SSD
CHEST14_DIR = Path("/content/images-224/images-224").resolve()

# Iterate over all files in the Chest14 directory
for fname in tqdm(os.listdir(CHEST14_DIR), desc="Build Chest14 image metadata"):
    # Only process PNG files (preprocessed images)
    if fname.endswith(".png"):
        # Append metadata for each image to the image_metadata list
        image_metadata.append({
            "uuid": str(uuid.uuid4()),  # Generate a unique identifier for the image
            "path": f"data/images_sample/chest14/{fname}",  # Path relative to Drive, not SSD
            "dataset": "chest14"  # Mark the dataset source
        })

Build Chest14 image metadata: 100%|██████████| 112120/112120 [00:00<00:00, 271400.47it/s]


## Step 7: Saving the Metadata File as jsonl

In [None]:
# Save locally and copy to Drive
jsonl_path = "/content/image_metadata.jsonl"
with open(jsonl_path, "w") as f:
    for entry in image_metadata:
        f.write(json.dumps(entry) + "\n")

In [None]:
!cp /content/image_metadata.jsonl /content/drive/MyDrive/multimodal-xray-agent/data/indexes/image_metadata.jsonl
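
Before relying on the manifest downstream, a quick round-trip check can catch truncated writes or duplicate IDs. The sketch below is self-contained and uses toy records in the same shape as `image_metadata`; the file names and the `write_jsonl` / `read_jsonl` helpers are placeholders, not part of the project code.

```python
import json
import tempfile
import uuid
from pathlib import Path

def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read a JSONL file back, skipping blank lines."""
    with open(path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]

# Toy records in the same shape as image_metadata (file names are placeholders).
records = [
    {"uuid": str(uuid.uuid4()), "path": "data/images_sample/chexpert/a.png", "dataset": "chexpert"},
    {"uuid": str(uuid.uuid4()), "path": "data/images_sample/chest14/b.png", "dataset": "chest14"},
]

with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "image_metadata.jsonl"
    write_jsonl(records, manifest)
    loaded = read_jsonl(manifest)

# Basic integrity checks: lossless round trip and no duplicate UUIDs.
assert loaded == records
assert len({r["uuid"] for r in loaded}) == len(loaded)
print(f"{len(loaded)} records verified")
```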

## Step 8: Load Vision Encoder: BiomedCLIP (OpenCLIP ViT-B/16)

We load `microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224`, a domain-specific vision-language model pretrained on PMC-15M, a collection of about 15 million figure-caption pairs extracted from PubMed Central articles. The vision backbone is a ViT-B/16 transformer, loaded through the `open_clip` package (installed as `open_clip_torch`).

**Why this model:**
- **Biomedical-domain pretraining**: Unlike generic CLIP, BiomedCLIP was trained on biomedical image-text pairs, which include radiology figures such as chest X-rays.
- **Pretrained ViT encoder**: Delivers strong performance with no additional fine-tuning required.
- **Frozen weights**: The model is used only for inference (`eval()` mode), ensuring stable and reproducible feature extraction.

**Purpose in pipeline:**
- Converts each 224×224 grayscale X-ray image into a 512-dimensional float32 embedding using `.encode_image(...)`.
- These embeddings are later indexed with FAISS for similarity-based image retrieval.

The `preprocess` transform returned by the model includes resizing, normalization, and tensor conversion, ensuring input compatibility with the pretrained ViT-B/16 backbone.

In [None]:
ROOT_DIR = "/content/drive/MyDrive/multimodal-xray-agent"
IMG_DIR_CHEXPERT = os.path.join(ROOT_DIR, "data/images_sample/chexpert")
IMG_DIR_CHEST14 = os.path.join(ROOT_DIR, "data/images_sample/chest14")
INDEX_OUT_DIR = os.path.join(ROOT_DIR, "data/indexes")
META_OUT_PATH = os.path.join(ROOT_DIR, "data/indexes/image_metadata.jsonl") # This is where the image metadata is stored

os.makedirs(INDEX_OUT_DIR, exist_ok=True)

In [None]:
# Load model and preprocessing from Hugging Face Hub
hf_repo = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(hf_repo)

In [None]:
model = model.to(device).eval()

In [None]:
print("Output shape (dummy):", model.encode_image(preprocess(Image.new("RGB", (224, 224))).unsqueeze(0).to(device)).shape)

Output shape (dummy): torch.Size([1, 512])


## Step 9: Embed All Images into Vector Space

This step encodes each preprocessed chest X-ray into a fixed-length vector using the BiomedCLIP vision transformer.

**Process:**
- Loads `image_metadata.jsonl` from local SSD, which contains a UUID and path for each image.
- Depending on the dataset (`chexpert` or `chest14`), constructs the correct absolute file path on SSD.
- Each image is:
  1. Loaded via PIL and converted to RGB (as expected by ViT).
  2. Preprocessed using BiomedCLIP’s `preprocess(...)` transform (resize, normalize, tensorize).
  3. Passed to `model.encode_image(...)` under `torch.no_grad()` to generate a 512-dimensional float32 embedding.

**Output:**
- `all_embeddings`: a list of NumPy arrays, one per image, each of shape `(1, 512)`.
- `all_uuids`: a parallel list of UUIDs for indexing and retrieval linkage.


In [None]:
# Load manifest from SSD (not Drive)
with open("/content/image_metadata.jsonl", "r") as f:
    image_metadata = [json.loads(line) for line in f]

In [None]:
all_embeddings = []
all_uuids = []

for entry in tqdm(image_metadata, desc="Embedding images from SSD"):
    fname = Path(entry["path"]).name

    if entry["dataset"] == "chexpert":
        actual_path = Path("/content/chexpert") / fname
    elif entry["dataset"] == "chest14":
        actual_path = Path("/content/images-224/images-224") / fname
    else:
        continue

    try:
        img = Image.open(actual_path).convert("RGB")
        img_tensor = preprocess(img).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = model.encode_image(img_tensor).cpu().numpy()  # Generate 512-d embedding for image
        all_embeddings.append(emb)  # Store embedding
        all_uuids.append(entry["uuid"])  # Store UUID for indexing
    except Exception as e:
        print(f"[ERROR] {actual_path}: {e}")

Embedding images from SSD: 100%|██████████| 335534/335534 [44:39<00:00, 125.23it/s]
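
The loop above runs one forward pass per image. Grouping entries into batches is one way to make better use of the GPU. The sketch below shows only the grouping logic: `batched` is a hypothetical helper, the batch size is arbitrary, and the entries are placeholders standing in for `image_metadata` records.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Placeholder entries standing in for image_metadata records.
entries = [{"uuid": f"id-{i}"} for i in range(10)]

for batch in batched(entries, batch_size=4):
    # In the real loop, each batch would be loaded, preprocessed, stacked with
    # torch.stack(...), and passed through model.encode_image(...) in one call.
    print([e["uuid"] for e in batch])
```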


## Step 10: Build FAISS Index & Persist Outputs

This step constructs an image similarity index using FAISS to enable cosine-based nearest-neighbor search over the full corpus of 335,534 embedded images.

**Process:**
- **Embedding Flattening:** All 512-dimensional image vectors (`all_embeddings`) are vertically stacked into a single NumPy array of shape `(N, 512)`.
- **Normalization:** Each vector is L2-normalized so that cosine similarity reduces to inner product (dot product) in FAISS.
- **FAISS Index:** Uses `IndexFlatIP` to build an exact inner-product search index over the normalized vectors.
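
The stacking and normalization steps can be checked on toy data. The sketch below adds one safeguard that is an assumption on my part rather than something the notebook does: a small `eps` floor on the norm, so that an all-zero embedding stays zero instead of turning into NaN. The `l2_normalize` helper and the 3-dimensional vectors are illustrative only.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Return a row-wise L2-normalized copy; all-zero rows stay zero."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Toy stand-in for all_embeddings, shrunk to 3 dimensions for readability.
# Each item has shape (1, d), like the output of encode_image for one image.
stacked = np.vstack([
    np.array([[3.0, 4.0, 0.0]], dtype="float32"),
    np.array([[0.0, 0.0, 2.0]], dtype="float32"),
    np.array([[0.0, 0.0, 0.0]], dtype="float32"),  # degenerate all-zero row
])
unit = l2_normalize(stacked)

print(unit)
print(np.linalg.norm(unit, axis=1))
```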

**Storage:**
- The FAISS binary index is written to `image_faiss.bin` (≈ 655 MB).
- The corresponding `image_uuids.json` file stores the aligned UUIDs for post-retrieval lookup and captioning.

**Why It Matters:**
- Enables efficient sub-50ms inference-time retrieval over hundreds of thousands of medical images.
- Index is self-contained: downstream modules (agents, FastAPI) use only this file + UUIDs for retrieval—raw images are not needed.

In [None]:
# Flatten and normalize embeddings
embeddings = np.vstack(all_embeddings).astype("float32")  # Stack all embedding arrays into a single (N, 512) float32 array
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # L2-normalize each embedding vector for cosine similarity

In [None]:
# Initialize FAISS index for Inner Product (cosine similarity via normalized vectors)
index = faiss.IndexFlatIP(embeddings.shape[1])  # Create a flat (exact) inner product index with dimension 512
index.add(embeddings)  # Add all normalized image embeddings to the FAISS index

In [None]:
INDEX_OUT_DIR = os.path.join(ROOT_DIR, "data/indexes")
INDEX_PATH = os.path.join(INDEX_OUT_DIR, "image_faiss.bin")
UUIDS_PATH = os.path.join(INDEX_OUT_DIR, "image_uuids.json")
os.makedirs(INDEX_OUT_DIR, exist_ok=True)

In [None]:
# Save index and UUIDs
faiss.write_index(index, INDEX_PATH)
with open(UUIDS_PATH, "w") as f:
    json.dump(all_uuids, f)

print(f"FAISS index saved to: {INDEX_PATH}")
print(f"UUID list saved to: {UUIDS_PATH}")

FAISS index saved to: /content/drive/MyDrive/multimodal-xray-agent/data/indexes/image_faiss.bin
UUID list saved to: /content/drive/MyDrive/multimodal-xray-agent/data/indexes/image_uuids.json
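
At retrieval time, the row ids returned by the index have to be mapped back to UUIDs and then to manifest records. The sketch below shows that join on placeholder data. The helper names and the sample ids are hypothetical, and the `-1` entry mimics the padding value FAISS uses when it returns fewer than `k` results.

```python
import json

def build_uuid_lookup(metadata_records):
    """Map each UUID to its manifest record."""
    return {record["uuid"]: record for record in metadata_records}

def resolve_hits(row_ids, uuids, lookup):
    """Translate index row ids into manifest records, skipping -1 padding."""
    return [lookup[uuids[i]] for i in row_ids if i >= 0]

# Placeholder data in the shape of image_uuids.json and image_metadata.jsonl.
uuids = ["u-0", "u-1", "u-2"]
metadata = [
    {"uuid": "u-0", "path": "data/images_sample/chexpert/a.png", "dataset": "chexpert"},
    {"uuid": "u-1", "path": "data/images_sample/chest14/b.png", "dataset": "chest14"},
    {"uuid": "u-2", "path": "data/images_sample/chest14/c.png", "dataset": "chest14"},
]

lookup = build_uuid_lookup(metadata)
hits = resolve_hits([2, 0, -1], uuids, lookup)  # -1 mimics an empty result slot
print(json.dumps(hits, indent=2))
```

This only works if `image_uuids.json` keeps the exact order in which vectors were added to the index, which is why the notebook appends to `all_embeddings` and `all_uuids` together.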
