# Notebook: Processing the IU-Xray Dataset

This notebook prepares the **Indiana University Chest X-Ray Collection** as a stand-in for MIMIC-CXR, so that we can build a caption bank and fine-tuning corpus for our multimodal X-ray agent pipeline.

### Objectives
- Download, preprocess, and parse image–report pairs from the IU-Xray dataset.
- Generate two aligned metadata files:
  - `iu_impr.jsonl`: Maps each UUID to its radiologist-written impression.
  - `iu_uuids.jsonl`: Maps each UUID to the corresponding preprocessed image path.
- Generate 512-dimensional **BiomedCLIP** embeddings for all IU images and save as `iu_vecs.npy`.

---

### Workflow Overview

#### 1. Load Dataset from Hugging Face
- We downloaded all 4 Parquet shards from [`ayyuce/Indiana_University_Chest_X-ray_Collection`](https://huggingface.co/datasets/ayyuce/Indiana_University_Chest_X-ray_Collection).
- Merged them into a single `pandas` DataFrame with 7,430 image-report entries.

#### 2. Save and Normalize All Images
- Extracted base64-encoded image bytes from each row.
- Converted to grayscale, resized to `224×224`, and saved to `data/iu_xray/images/iu_XXXX.png`.
- Ensured deterministic UUID-style filenames: `iu_0000.png` → `iu_7429.png` (7,430 files, zero-indexed).

#### 3. Save Impressions to JSONL
- Read the `report` field, stripped surrounding whitespace, and stored the text under the `impression` key (the stored text can include the `FINDINGS:` section).
- Saved as `iu_impr.jsonl`, aligned one-to-one with the images.

#### 4. Save UUID Metadata
- Verified and enumerated the saved PNGs.
- Created `iu_uuids.jsonl` to store `{uuid, path}` records.

#### 5. BiomedCLIP Embedding
- Loaded `microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224` via `open_clip`.
- Embedded all 7,430 IU-Xray images in batches of 32 using GPU.
- Saved result as `iu_vecs.npy` with shape `(7430, 512)`.

#### 6. Sanity Check
- Confirmed:
  - UUID–impression–embedding alignment
  - Non-zero vector norms
  - Consistent indexing and vector dimensionality

---

### Final Output Directory
```bash
data/iu_xray/
├── raw/                         # Original parquet shards
├── images/                      # 224×224 grayscale PNGs (7430)
├── iu_impr.jsonl                # Textual impressions (UUID-mapped)
├── iu_uuids.jsonl               # Image UUID + path metadata
└── iu_vecs.npy                  # BiomedCLIP embeddings (7430 × 512)
```

## Step 0: Mounting Google Drive and Importing Libraries

In [2]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/multimodal-xray-agent
!ls

Mounted at /content/drive
/content/drive/MyDrive/multimodal-xray-agent
app	      data	  LICENSE  notebooks	   README.md	     scripts
chexpert.zip  deployment  models   PROJECT_LOG.md  requirements.txt  src


In [None]:
!pip install open_clip_torch -q

In [3]:
import json
import uuid
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
from io import BytesIO

from PIL import Image
from pathlib import Path
from datasets import load_dataset
from torchvision import transforms
from open_clip import create_model_from_pretrained, get_tokenizer

from src.image_utils import save_parquet_images

## Step 1: Verifying GPU and Environment

In [4]:
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    device = torch.device("cuda")
    print(f"GPU detected: {device_name}")
else:
    device = torch.device("cpu")
    print("GPU not detected. Falling back to CPU.")

print(f"Running on device: {device}")

GPU detected: NVIDIA L4
Running on device: cuda


## Step 2: Extract Impressions from Uploaded .parquet Files


In [5]:
# Define paths
PROJECT_ROOT = Path("/content/drive/MyDrive/multimodal-xray-agent")

RAW_PARQUET_DIR = PROJECT_ROOT / "data" / "iu_xray" / "raw"
IMAGE_DIR = PROJECT_ROOT / "data" / "iu_xray" / "images"
OUTPUT_JSONL_PATH = PROJECT_ROOT / "data" / "iu_xray" / "iu_impr.jsonl"
UUID_JSONL_PATH = PROJECT_ROOT / "data" / "iu_xray" / "iu_uuids.jsonl"
EMBEDDING_SAVE_PATH = PROJECT_ROOT / "data" / "iu_xray" / "iu_vecs.npy"

IMAGE_DIR.mkdir(parents=True, exist_ok=True)
UUID_JSONL_PATH.parent.mkdir(parents=True, exist_ok=True)
OUTPUT_JSONL_PATH.parent.mkdir(parents=True, exist_ok=True)
EMBEDDING_SAVE_PATH.parent.mkdir(parents=True, exist_ok=True)

In [6]:
# Load all train + test shards
parquet_files = [
    RAW_PARQUET_DIR / "train-00000-of-00003.parquet",
    RAW_PARQUET_DIR / "train-00001-of-00003.parquet",
    RAW_PARQUET_DIR / "train-00002-of-00003.parquet",
    RAW_PARQUET_DIR / "test-00000-of-00001.parquet"
]

In [7]:
df_list = [pd.read_parquet(str(p)) for p in parquet_files]

In [8]:
df = pd.concat(df_list, ignore_index=True)

print(f"Loaded {len(df):,} total samples from train + test.")

Loaded 7,430 total samples from train + test.


The next cell iterates over the DataFrame rows, strips surrounding whitespace from each report, and, for every non-empty report, appends a record containing an index-based UUID (`iu_0000`, `iu_0001`, ...) and the report text to the `records` list.

In [9]:
# Extract and format impression records
records = []

for i, row in df.iterrows():
    impression = row["report"].strip()
    if impression:
        records.append({
            "uuid": f"iu_{i:04d}",
            "impression": impression
        })
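
Note that the cell above stores the whole stripped report, so the `impression` field can begin with `FINDINGS:` (as the sample output in the sanity check shows). If only the impression section is wanted, one option is to split on section labels. The following is a minimal sketch, assuming reports mark their sections with `FINDINGS:` and `IMPRESSION:` labels; the `extract_impression` helper is hypothetical, and the label format has not been verified against the full dataset.

```python
import re

# Assumed section labels; the dataset's actual labels may differ.
_SECTION_RE = re.compile(r"\b(FINDINGS|IMPRESSION)\s*:", re.IGNORECASE)

def extract_impression(report: str) -> str:
    """Return the text after an 'IMPRESSION:' label, or the whole report if absent."""
    text = report.strip()
    matches = list(_SECTION_RE.finditer(text))
    for idx, m in enumerate(matches):
        if m.group(1).upper() == "IMPRESSION":
            # The section ends at the next label, or at the end of the report.
            end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
            return text[m.end():end].strip()
    return text  # fall back to the full report when no label is found

sample = "FINDINGS: Lungs are clear. IMPRESSION: No acute disease."
print(extract_impression(sample))
```

Falling back to the full report keeps every row populated, which preserves the one-to-one alignment with the images.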

In [10]:
# Write to JSONL
with open(OUTPUT_JSONL_PATH, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

print(f"Saved {len(records):,} impressions to {OUTPUT_JSONL_PATH}")

Saved 7,430 impressions to /content/drive/MyDrive/multimodal-xray-agent/data/iu_xray/iu_impr.jsonl
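
JSONL stores one JSON object per line, so the file can be read back line by line without loading a single large JSON array. As a small illustration of the round trip, here is a sketch using a temporary file; `write_jsonl` and `read_jsonl` are hypothetical helper names, not functions defined in this project.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

demo = [
    {"uuid": "iu_0000", "impression": "Lungs are clear."},
    {"uuid": "iu_0001", "impression": "Borderline heart size."},
]

# Round-trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    write_jsonl(demo_path, demo)
    loaded = read_jsonl(demo_path)

print(loaded == demo)
```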


## Step 3: Extracting and Saving Images

In [20]:
saved_filenames = save_parquet_images(df, IMAGE_DIR)

Saving IU images: 100%|██████████| 7430/7430 [02:15<00:00, 54.72it/s]

Saved 7,430 images to: /content/drive/MyDrive/multimodal-xray-agent/data/iu_xray/images





## Step 3b: Generate `iu_uuids.jsonl`

In [22]:
# Enumerate valid PNGs in sorted order
image_files = sorted(IMAGE_DIR.glob("*.png"))
records = []

for i, img_path in enumerate(image_files):
    try:
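        # Verify the file is a readable image; the context manager closes the file handle
        with Image.open(img_path) as img:
            img.verify()

        record = {
            # Take the UUID from the filename stem (e.g. "iu_0000") so it always matches the file
            "uuid": img_path.stem,
            "path": str(img_path)
        }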
        records.append(record)

    except Exception as e:
        print(f"Skipping unreadable image: {img_path.name} ({e})")
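
`sorted()` compares path names as strings, so the files come back in numeric order only because their indices are zero-padded to a fixed width. The sketch below illustrates the difference and shows one way to recover the index from a filename. `index_from_name` is a hypothetical helper, and the `iu_XXXX.png` pattern is taken from the naming scheme described in the workflow overview.

```python
import re

def index_from_name(name: str) -> int:
    """Parse the integer index out of a name like 'iu_0042.png'."""
    match = re.fullmatch(r"iu_(\d+)\.png", name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    return int(match.group(1))

padded = [f"iu_{i:04d}.png" for i in (10, 2, 1)]
unpadded = [f"iu_{i}.png" for i in (10, 2, 1)]

# Zero-padded names sort in numeric order; unpadded names do not.
padded_order = [index_from_name(n) for n in sorted(padded)]
unpadded_order = [index_from_name(n) for n in sorted(unpadded)]
print(padded_order, unpadded_order)
```

With four digits the ordering holds for up to 10,000 files, which covers the 7,430 images here.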

In [23]:
# Write to JSONL
with open(UUID_JSONL_PATH, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

print(f"Saved {len(records):,} UUID entries to {UUID_JSONL_PATH}")

Saved 7,430 UUID entries to /content/drive/MyDrive/multimodal-xray-agent/data/iu_xray/iu_uuids.jsonl


## Step 4: Generate `iu_vecs.npy` using BiomedCLIP Image Encoder

In this step, we compute a 512-dimensional embedding for each IU-Xray image using the **BiomedCLIP** image encoder (a vision transformer). These embeddings will later be used by our **caption bank retrieval** pipeline.

---

#### Embedding Overview

- **Model**: `microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224`
- **Embedding Size**: 512 dimensions per image
- **Preprocessing**: Applied `open_clip`’s model-specific transform (resize, normalize, etc.)
- **Batching**: 32 images per batch using `torch.no_grad()` for inference
- **Hardware**: Optimized for GPU (`cuda`), with fallback to CPU

---

#### Code Summary

1. **Paths**:
   - `iu_uuids.jsonl`: Input file with UUID and image path.
   - `iu_vecs.npy`: Output file storing all image embeddings.

2. **Model Loading**:
   - Used `create_model_from_pretrained()` to load BiomedCLIP.
   - Moved to device and set to `eval()` mode.

3. **Batch Inference Loop**:
   - Iterated over all images from `iu_uuids.jsonl`.
   - Opened image → applied BiomedCLIP preprocessing → added to batch.
   - Every 32 images:
     - Ran `model.encode_image()` to get 512-d embeddings.
     - Stored outputs on CPU to conserve GPU memory.
   - Handled final leftover batch.

4. **Saving**:
   - Concatenated all batches and saved to:
     - `data/iu_xray/iu_vecs.npy`
   - Final shape: `(7430, 512)`
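
The batching pattern summarized above can be sketched independently of the model. The version below is an illustration under stated assumptions, not the notebook's code: `load` and `encode_batch` are hypothetical stand-ins for the image loading/preprocessing step and for `model.encode_image`. It also records which UUIDs were actually embedded, because a loop that skips unreadable images would otherwise shift later rows out of alignment with `iu_uuids.jsonl`.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

def embed_in_batches(
    records: Iterable[dict],
    load: Callable[[dict], object],
    encode_batch: Callable[[Sequence[object]], List[list]],
    batch_size: int = 32,
) -> Tuple[List[str], List[list]]:
    """Embed records in fixed-size batches; returns (uuids, vectors) in matching order."""
    kept_uuids: List[str] = []
    vectors: List[list] = []
    batch_items: List[object] = []
    batch_uuids: List[str] = []

    def flush() -> None:
        # Encode whatever is buffered, then reset the buffers.
        if batch_items:
            vectors.extend(encode_batch(batch_items))
            kept_uuids.extend(batch_uuids)
            batch_items.clear()
            batch_uuids.clear()

    for record in records:
        try:
            item = load(record)
        except Exception as exc:  # skip unreadable inputs without losing track of order
            print(f"Skipping {record['uuid']}: {exc}")
            continue
        batch_items.append(item)
        batch_uuids.append(record["uuid"])
        if len(batch_items) == batch_size:
            flush()
    flush()  # final partial batch
    return kept_uuids, vectors

# Toy usage: "images" are integers and the fake encoder maps each to a 2-d vector.
def toy_load(record):
    if record["value"] is None:
        raise ValueError("unreadable")
    return record["value"]

toy_records = [{"uuid": f"iu_{i:04d}", "value": None if i == 2 else i} for i in range(5)]
uuids, vecs = embed_in_batches(
    toy_records, toy_load, lambda xs: [[x, x * 2] for x in xs], batch_size=2
)
print(uuids)
```

Saving the returned UUID list next to the embedding matrix would make the row-to-image mapping explicit rather than positional.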

In [None]:
# Load the BiomedCLIP model and its image preprocessing transform
hf_repo = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"

model, preprocess = create_model_from_pretrained(hf_repo)

In [28]:
model = model.to(device).eval()

In [30]:
# Load UUIDs
with open(UUID_JSONL_PATH, "r") as f:
    uuid_records = [json.loads(line) for line in f]

In [None]:
# Batch-wise inference
all_embeddings = []
batch_size = 32
batch_images = []

# Iterate over all image records for embedding
for record in tqdm(uuid_records, desc="Embedding IU images"):
    img_path = Path(record["path"])
    try:
        # Open image and convert to RGB
        image = Image.open(img_path).convert("RGB")
        # Apply BiomedCLIP preprocessing
        tensor_img = preprocess(image)  # open_clip preprocessing
        batch_images.append(tensor_img)

        # If batch is full, run inference
        if len(batch_images) == batch_size:
            batch_tensor = torch.stack(batch_images).to(device)
            with torch.no_grad():
                features = model.encode_image(batch_tensor)  # Get embeddings
            all_embeddings.append(features.cpu())  # Move to CPU to save memory
            batch_images = []

    except Exception as e:
        print(f"Error with image {record['uuid']}: {e}")

# Handle any leftover images in the last batch
if batch_images:
    batch_tensor = torch.stack(batch_images).to(device)
    with torch.no_grad():
        features = model.encode_image(batch_tensor)
    all_embeddings.append(features.cpu())

Embedding IU images: 100%|██████████| 7430/7430 [00:59<00:00, 125.63it/s]


In [32]:
# Save
all_vecs = torch.cat(all_embeddings, dim=0).numpy()
np.save(EMBEDDING_SAVE_PATH, all_vecs)
print(f"Saved {all_vecs.shape[0]} BiomedCLIP embeddings to: {EMBEDDING_SAVE_PATH}")

Saved 7430 BiomedCLIP embeddings to: /content/drive/MyDrive/multimodal-xray-agent/data/iu_xray/iu_vecs.npy
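
The vector norms printed in the sanity check (roughly 79 to 88) indicate that the saved embeddings are not unit-length. If the caption bank is later searched by cosine similarity, one common approach is to L2-normalize the rows first, so that a dot product equals the cosine similarity. The following is a minimal NumPy sketch on random stand-in data. `l2_normalize` and `top_k_cosine` are hypothetical helpers, and cosine retrieval is an assumption, since this notebook does not specify the retrieval method.

```python
import numpy as np

def l2_normalize(vecs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, eps)

def top_k_cosine(query: np.ndarray, bank: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k bank rows most similar to the query, best first."""
    bank_unit = l2_normalize(bank)
    query_unit = l2_normalize(query[None, :])[0]
    scores = bank_unit @ query_unit  # cosine similarity per row
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 512)).astype(np.float32)  # stand-in for iu_vecs.npy
query = bank[7] * 80.0  # same direction as row 7, different magnitude
top = top_k_cosine(query, bank, k=3)
print(top[0])
```

Because the query is a scaled copy of row 7, normalization makes that row the best match regardless of magnitude.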


## Step 5: Sanity Check

In [33]:
vecs = np.load(PROJECT_ROOT / "data/iu_xray/iu_vecs.npy")

In [34]:
with open(PROJECT_ROOT / "data/iu_xray/iu_uuids.jsonl") as f:
    uuid_records = [json.loads(line) for line in f]

In [35]:
with open(PROJECT_ROOT / "data/iu_xray/iu_impr.jsonl") as f:
    impr_records = [json.loads(line) for line in f]

In [36]:
# Check sizes
print("Vecs shape:", vecs.shape)
print("# UUIDs:", len(uuid_records))
print("# Impressions:", len(impr_records))

Vecs shape: (7430, 512)
# UUIDs: 7430
# Impressions: 7430


In [37]:
# Sample entries
for i in range(3):
    print(f"--- UUID: {uuid_records[i]['uuid']}")
    print(f"Impression: {impr_records[i]['impression'][:80]}...")
    print(f"Vec norm: {np.linalg.norm(vecs[i]):.4f}\n")

--- UUID: iu_0000
Impression: FINDINGS: Lungs are clear. No pleural effusions or pneumothoraces. Heart and med...
Vec norm: 83.8474

--- UUID: iu_0001
Impression: FINDINGS: Hyperinflated lungs with mildly flattened posterior diaphragm. No foca...
Vec norm: 87.9998

--- UUID: iu_0002
Impression: FINDINGS: Borderline heart size. The lungs are hyperexpanded and hyperlucent com...
Vec norm: 78.7909
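
The checks above compare counts and print a few samples. A stricter check would compare the UUID sequences of the two JSONL files row by row. The following is a sketch, with `find_misaligned` as a hypothetical helper name:

```python
def find_misaligned(uuid_records, impr_records, n_vectors):
    """Return a list of human-readable problems; an empty list means everything lines up."""
    problems = []
    if not (len(uuid_records) == len(impr_records) == n_vectors):
        problems.append(
            f"length mismatch: {len(uuid_records)} uuids, "
            f"{len(impr_records)} impressions, {n_vectors} vectors"
        )
    # Compare UUIDs position by position across the two files.
    for row, (u, r) in enumerate(zip(uuid_records, impr_records)):
        if u["uuid"] != r["uuid"]:
            problems.append(f"row {row}: {u['uuid']} != {r['uuid']}")
    return problems

ok_uuids = [{"uuid": "iu_0000", "path": "a.png"}, {"uuid": "iu_0001", "path": "b.png"}]
ok_imprs = [{"uuid": "iu_0000", "impression": "x"}, {"uuid": "iu_0001", "impression": "y"}]
bad_imprs = [{"uuid": "iu_0000", "impression": "x"}, {"uuid": "iu_0002", "impression": "y"}]

print(find_misaligned(ok_uuids, ok_imprs, 2))
print(find_misaligned(ok_uuids, bad_imprs, 2))
```

In this notebook it could be called as `find_misaligned(uuid_records, impr_records, vecs.shape[0])`.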

