# Notebook: Building the PubMed Text Embedding and Retrieval Index

This notebook prepares the semantic text-retrieval module for our multimodal radiology agent. It builds the **PubMed abstract index** that powers evidence-grounded citation retrieval via FAISS and SPECTER2. This module enables the agent to return relevant biomedical literature when answering user queries or interpreting X-ray captions.

---
## Background

To support high-quality evidence generation, our multimodal radiology assistant includes a **text-retrieval module** that returns semantically relevant biomedical literature in response to user queries or image-derived captions. This module improves factual grounding, citation transparency, and interpretability, all of which are critical for clinical-facing AI systems.

We use the **`MedRAG/pubmed`** dataset hosted on Hugging Face, a cleaned subset of the PubMed Open Access collection. It contains over 2.2 million biomedical papers, each with:

- A record identifier in the `id` field (values such as `pubmed23n0001_4982`), which this notebook stores under the key `pmid`
- A structured `title`
- An abstract stored in the `content` field

These papers span a wide range of clinical and research topics, making them ideal for embedding-based semantic search.

In this notebook, we take the first 30,000 records and build a **FAISS-based text retrieval index** using **SPECTER2**, a transformer model trained to embed scientific documents. The index retrieves relevant PubMed papers for either a user query or an automatically generated caption from a medical image.

This forms the **text tower** of our dual-retrieval system and powers the citation-generating component of our multi-agent architecture.

---
## Workflow Overview

### **Step 1 – Environment Setup**
- Checked GPU availability (this run used an NVIDIA L4) and fell back to CPU when no GPU was present.
- Installed required libraries:
  - `sentence-transformers` for SPECTER2 embedding
  - `datasets` for Hugging Face dataset loading
  - `faiss-cpu` for building a fast similarity index

### **Step 2 – Sample and Export PubMed Subset**
- Loaded the `MedRAG/pubmed` dataset from Hugging Face (2.2M entries).
- Sampled the first **30,000 records** for prototype-scale text retrieval.
- Exported two files to `data/pubmed/`:
  - `raw_abstracts.jsonl` – Full records with `pmid`, `title`, `abstract`
  - `text_metadata.json` – Lightweight manifest for runtime lookup

### **Step 3 – Embed with SPECTER2**
- Loaded the `allenai/specter2_base` model (768-D output).
- Concatenated `title + abstract` → embedded in batches of 64 with normalization.
- Saved full embedding matrix as:
  - `text_vectors.npy` (float32, 30k × 768)

### **Step 4 – Build and Save FAISS Index**
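- Created a cosine-similarity FAISS index using `faiss.IndexFlatIP(768)` (inner product over L2-normalized vectors).
- Added all SPECTER2 embeddings to the index.
- Persisted the index to disk as:
  - `text_faiss.bin` (≈88 MiB, about 92 MB)
  - The flat index performs an exact brute-force search, so each query scans all 30,000 vectors; this is fast at this scale, but query cost grows linearly with corpus size.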
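
The size quoted for `text_faiss.bin` can be sanity-checked from the matrix shape alone. The sketch below assumes the flat index stores the raw float32 vectors plus a small header, which is not counted here, so the figure is an estimate of the payload only.

```python
# Back-of-the-envelope size of 30,000 float32 vectors of dimension 768.
n_vectors = 30_000
dim = 768
bytes_per_float32 = 4

payload_bytes = n_vectors * dim * bytes_per_float32
payload_mb = payload_bytes / 1_000_000      # decimal megabytes
payload_mib = payload_bytes / (1024 ** 2)   # binary mebibytes

print(f"{payload_bytes:,} bytes = {payload_mb:.2f} MB = {payload_mib:.2f} MiB")
# prints: 92,160,000 bytes = 92.16 MB = 87.89 MiB
```

The same arithmetic applies to `text_vectors.npy`, and it scales linearly if the sample is enlarged.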

### **Step 5 – Sanity Check: Query the Index**
- Queried the index with two test prompts:
  - `"chest x-ray findings in pulmonary embolism"`
  - `"case reports involving lung nodules in pediatric patients"`
- Checked that returned titles fell within respiratory and thoracic medicine; the match to each individual query was loose (see the outputs in Step 5).
- Confirmed that the end-to-end retrieval stack runs and is reproducible.

---

### **Final Output Directory**
All artifacts are saved to the following directory for downstream use in the RAG pipeline:

```
data/pubmed/
├── raw_abstracts.jsonl       # Full 30k sample
├── text_metadata.json        # Lightweight manifest
├── text_vectors.npy          # 768-D SPECTER2 embeddings
└── text_faiss.bin            # FAISS cosine index (768-D)
```

This module completes the text-retrieval half of our citation-augmented agent. The next phase is MIMIC-QA curation and BioGPT-LoRA fine-tuning.

## Step 0: Mounting Google Drive and Importing Libraries

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/multimodal-xray-agent
!ls

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/multimodal-xray-agent
app	      data	  LICENSE  notebooks	   README.md	     scripts
chexpert.zip  deployment  models   PROJECT_LOG.md  requirements.txt  src


In [None]:
!pip install --upgrade datasets huggingface_hub fsspec -q

In [None]:
!pip install faiss-cpu -q

In [14]:
import os, json
import requests
import torch
from datasets import load_dataset
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from tqdm import tqdm

In [None]:
from src.text_search import query_text_faiss

## Step 1: Verifying GPU and Environment

In [3]:
# Device-agnostic setup
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    device = torch.device("cuda")
    print(f"GPU detected: {device_name}")
else:
    device = torch.device("cpu")
    print("GPU not detected. Falling back to CPU.")

print(f"Running on device: {device}")

GPU detected: NVIDIA L4
Running on device: cuda


## Step 2: Sampling and Exporting PubMed Abstracts for Semantic Indexing

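- Selected the first **30,000 records** of the `MedRAG/pubmed` dataset on Hugging Face (a head slice rather than a random sample). Each exported record includes:
  - `title`: paper title
  - `abstract`: main summary text, read from the dataset's `content` field
  - `pmid`: the dataset's `id` field, which holds values such as `pubmed23n0001_4982`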

- Defined output paths and directory structure:
  - `data/pubmed/raw_abstracts.jsonl`: line-delimited JSON containing full `title + abstract + pmid` entries.
  - `data/pubmed/text_metadata.json`: lightweight metadata file containing only `pmid` and `title` for retrieval post-indexing.

- Exported both files by iterating over the sampled records:
  - Trimmed whitespace and combined relevant fields.
  - Ensured all JSON was UTF-8 encoded and human-readable (indent=2 for metadata).

- These files serve as the **input corpus for downstream SPECTER2 embedding and FAISS indexing**.
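
Because retrieval later maps row positions in the index back to entries in the metadata list, the two exported files need to stay in the same order. The following is a minimal sketch of such a check, assuming the file layout described above; `check_alignment` is a hypothetical helper, not part of the project code.

```python
import json

def check_alignment(jsonl_path, meta_path):
    """Return the record count if row i of both files carries the same pmid."""
    with open(meta_path, "r", encoding="utf-8") as f:
        metadata = json.load(f)

    count = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for row, line in enumerate(f):
            record = json.loads(line)
            # Fail fast on an extra JSONL row or a mismatched identifier.
            if row >= len(metadata) or metadata[row]["pmid"] != record["pmid"]:
                raise ValueError(f"row {row}: metadata and JSONL disagree")
            count += 1

    if count != len(metadata):
        raise ValueError(f"{len(metadata)} metadata entries but {count} JSONL rows")
    return count
```

Calling `check_alignment("data/pubmed/raw_abstracts.jsonl", "data/pubmed/text_metadata.json")` after the export should return 30000 for this notebook's sample; the same count can then be compared against `index.ntotal` once the FAISS index is built.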

In [7]:
# Define destination path
PUBMED_DIR = "data/pubmed"
os.makedirs(PUBMED_DIR, exist_ok=True)

In [None]:
# Load a manageable chunk: the first 30,000 records
print("Loading the first 30,000 records from MedRAG/pubmed...")
dataset = load_dataset("MedRAG/pubmed", split="train[:30000]")

In [6]:
len(dataset)

30000

In [8]:
# Define output paths
RAW_JSONL_PATH = os.path.join(PUBMED_DIR, "raw_abstracts.jsonl")
META_JSON_PATH = os.path.join(PUBMED_DIR, "text_metadata.json")
FAISS_INDEX_PATH = os.path.join(PUBMED_DIR, "text_faiss.bin")
EMBEDDINGS_NPY_PATH = os.path.join(PUBMED_DIR, "text_vectors.npy")

This code block performs two main tasks: writing the sampled PubMed records to a `.jsonl` file and extracting minimal metadata for quick lookup.

- Opens two output files:
  - `raw_abstracts.jsonl` (line-delimited JSON): stores full records (PMID, title, abstract)
  - `text_metadata.json`: stores only the `pmid` and `title` for lightweight indexing

- Iterates through each entry in the dataset:
  - Strips whitespace from `title` and `abstract`
  - Writes a JSON line to `raw_abstracts.jsonl` with all key fields
  - Appends a compact metadata dict (`pmid`, `title`) to an in-memory list

- After the loop:
  - Dumps the metadata list to `text_metadata.json` using `json.dump(...)`

In [11]:
# Write abstracts to JSONL and minimal metadata to JSON
with open(RAW_JSONL_PATH, "w", encoding="utf-8") as f_jsonl, \
     open(META_JSON_PATH, "w", encoding="utf-8") as f_meta:

    metadata = []
    for entry in dataset:
        title = entry["title"].strip()
        abstract = entry["content"].strip()
        pmid = entry["id"]

        # Save full abstract line
        json_line = json.dumps({
            "pmid": pmid,
            "title": title,
            "abstract": abstract
        })
        f_jsonl.write(json_line + "\n")

        # Save metadata for quick lookup
        metadata.append({"pmid": pmid, "title": title})

    json.dump(metadata, f_meta, indent=2)

print(f"Saved: {RAW_JSONL_PATH}")
print(f"Saved: {META_JSON_PATH}")

Saved: data/pubmed/raw_abstracts.jsonl
Saved: data/pubmed/text_metadata.json


## Step 3: Embedding Title + Abstracts using SPECTER2

- Loaded the `raw_abstracts.jsonl` file line-by-line and concatenated each paper's `title` and `abstract` into a single string.
- Used `allenai/specter2_base`, a transformer model trained for scientific document similarity, loaded via `SentenceTransformer` (which wraps the checkpoint with mean pooling, as the model printout in Step 5 shows).
- Batched the input texts and encoded them into dense vector representations:
  - Batch size = 64
  - Output = normalized embeddings (`float32`, shape: `[N, 768]`)
- Persisted the resulting matrix to disk as `text_vectors.npy` for reproducibility and future use.
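
The properties listed above (shape `[N, 768]`, `float32`, unit-length rows) are easy to verify before the matrix is indexed. The sketch below is one way to do that; `check_embeddings` and its tolerance are illustrative assumptions rather than part of the notebook.

```python
import numpy as np

def check_embeddings(vectors, expected_dim=768, atol=1e-3):
    """Return a list of problems found in an embedding matrix (empty if none)."""
    problems = []
    arr = np.asarray(vectors)

    # Shape must be (N, expected_dim); stop early if it is not.
    if arr.ndim != 2 or arr.shape[1] != expected_dim:
        problems.append(f"unexpected shape {arr.shape}")
        return problems

    if arr.dtype != np.float32:
        problems.append(f"dtype is {arr.dtype}, expected float32")
    if not np.isfinite(arr).all():
        problems.append("matrix contains NaN or inf")

    # Normalized embeddings should have an L2 norm of 1 per row.
    norms = np.linalg.norm(arr, axis=1)
    if not np.allclose(norms, 1.0, atol=atol):
        problems.append("rows are not unit-length")
    return problems
```

For example, `check_embeddings(np.load("data/pubmed/text_vectors.npy"))` should return an empty list if the matrix was saved as described.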

In [16]:
# Load abstracts
print("Reading abstracts for embedding...")
texts = []
with open(RAW_JSONL_PATH, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        text = f"{record['title']} {record['abstract']}"
        texts.append(text.strip())

Reading abstracts for embedding...


In [None]:
# Load SPECTER2 encoder
model = SentenceTransformer("allenai/specter2_base")

In [None]:
# Generate embeddings
embeddings = model.encode(
    texts,
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True
)

In [19]:
# Save embeddings to disk (optional)
np.save(EMBEDDINGS_NPY_PATH, embeddings)

## Step 4: Build and Save FAISS Index

- Constructed a dense vector index using **FAISS** with inner product, which equals cosine similarity because the vectors are L2-normalized.
- FAISS index type: `faiss.IndexFlatIP(768)`, a flat (brute-force) index that performs exact nearest-neighbor retrieval.
- Added all `SPECTER2` embeddings to the index as `float32` numpy arrays.
- Saved the resulting binary index to disk as `text_faiss.bin` for fast semantic retrieval at inference time.
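
To illustrate why an inner-product index over normalized vectors yields cosine similarity, here is a small NumPy-only sketch of the exact search a flat index performs. It uses random vectors of dimension 8 rather than SPECTER2 embeddings, and `l2_normalize` and `brute_force_top_k` are hypothetical helper names, not FAISS functions.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)

def brute_force_top_k(query_vecs, corpus_vecs, k):
    """Exact inner-product search: score every corpus row, keep the k best."""
    scores = query_vecs @ corpus_vecs.T            # shape (n_queries, n_corpus)
    order = np.argsort(-scores, axis=1)[:, :k]     # highest score first
    return np.take_along_axis(scores, order, axis=1), order

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 8)).astype(np.float32)
queries = rng.normal(size=(2, 8)).astype(np.float32)

# Cosine similarity computed directly from the unnormalized vectors.
cosine = (queries @ corpus.T) / (
    np.linalg.norm(queries, axis=1, keepdims=True) * np.linalg.norm(corpus, axis=1)
)

# Inner product after normalization gives the same numbers.
scores, ids = brute_force_top_k(l2_normalize(queries), l2_normalize(corpus), k=5)
assert np.allclose(scores, np.take_along_axis(cosine, ids, axis=1), atol=1e-5)
```

Every query is scored against every corpus row, which is why a flat index is exact and why its per-query cost grows with the number of indexed vectors.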

In [20]:
# Build FAISS index (cosine similarity)
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(np.asarray(embeddings, dtype=np.float32))
faiss.write_index(index, FAISS_INDEX_PATH)

print(f"FAISS index saved to: {FAISS_INDEX_PATH}")
print(f"Embeddings saved to: {EMBEDDINGS_NPY_PATH}")

FAISS index saved to: data/pubmed/text_faiss.bin
Embeddings saved to: data/pubmed/text_vectors.npy


## Step 5: Sanity Check (Query Text Index)

- Loaded the saved `FAISS` index (`text_faiss.bin`) and its corresponding metadata (`text_metadata.json`) into memory.
- Defined a simple semantic search function `query_text_faiss()` that:
  - Encodes an input query using the same `SPECTER2` model.
  - Performs top-`k` nearest neighbor search using the FAISS index.
  - Returns the most semantically similar papers based on cosine similarity scores.
- Ran two test queries to verify semantic alignment:
  1. `"chest x-ray findings in pulmonary embolism"`
  2. `"case reports involving lung nodules in pediatric patients"`
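- Inspected the top-ranked titles: they fall within respiratory and thoracic medicine, but the match to each query is loose (the pulmonary-embolism paper ranks fifth for the first query, and both queries share the same top hit). This check therefore confirms that the embedding and indexing pipeline runs end to end; it does not establish that ranking quality is high.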
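
The implementation of `query_text_faiss` lives in `src/text_search.py` and is not shown in this notebook. The sketch below is a hypothetical stand-in that follows the three steps listed above. It assumes that `model.encode` accepts `normalize_embeddings`, that `index.search` returns `(scores, row_ids)` as FAISS indexes do, and that each hit exposes the `score`, `title`, and `pmid` keys used when results are printed.

```python
import numpy as np

def query_text_faiss_sketch(query, model, index, metadata, top_k=5):
    """Hypothetical stand-in for src.text_search.query_text_faiss."""
    # Encode the query the same way the corpus was encoded (L2-normalized).
    q = model.encode([query], normalize_embeddings=True)
    q = np.asarray(q, dtype=np.float32).reshape(1, -1)

    # A FAISS search returns two arrays of shape (1, top_k): scores and row ids.
    scores, row_ids = index.search(q, top_k)

    hits = []
    for score, row in zip(scores[0], row_ids[0]):
        if row < 0:
            # FAISS pads with -1 when fewer than top_k results are available.
            continue
        item = metadata[int(row)]  # row i of the index maps to metadata[i]
        hits.append({"pmid": item["pmid"], "title": item["title"], "score": float(score)})
    return hits
```

The positional lookup `metadata[int(row)]` only works if the metadata list is in the same order as the vectors added to the index, which holds here because both are derived from `raw_abstracts.jsonl` in sequence.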

In [21]:
# Load FAISS index and metadata
index = faiss.read_index(FAISS_INDEX_PATH)

with open(META_JSON_PATH, "r", encoding="utf-8") as f:
    metadata = json.load(f)

In [22]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [24]:
query = "chest x-ray findings in pulmonary embolism"

hits = query_text_faiss(query, model, index, metadata, top_k=5)

print(f"\n Query: {query}\n")
for i, hit in enumerate(hits, 1):
    print(f"{i}. [{hit['score']:.3f}] {hit['title']} (PMID: {hit['pmid']})")


 Query: chest x-ray findings in pulmonary embolism

1. [0.830] [Corticoids in respiratory pathology]. (PMID: pubmed23n0001_4982)
2. [0.825] [Thoracic injury and fat embolism]. (PMID: pubmed23n0001_9345)
3. [0.824] Chest roentgenography as a window to the diagnosis of Takayasu's arteritis. (PMID: pubmed23n0001_1038)
4. [0.821] Sulphasalazine lung. (PMID: pubmed23n0001_3473)
5. [0.806] [Pulmonary embolism in the elderly patient (author's transl)]. (PMID: pubmed23n0002_5963)


In [25]:
query = "case reports involving lung nodules in pediatric patients"

hits = query_text_faiss(query, model, index, metadata, top_k=5)

print(f"\n Query: {query}\n")
for i, hit in enumerate(hits, 1):
    print(f"{i}. [{hit['score']:.3f}] {hit['title']} (PMID: {hit['pmid']})")


 Query: case reports involving lung nodules in pediatric patients

1. [0.871] [Corticoids in respiratory pathology]. (PMID: pubmed23n0001_4982)
2. [0.845] Sulphasalazine lung. (PMID: pubmed23n0001_3473)
3. [0.841] [Palliative surgery in upper thoracic venous obstructions]. (PMID: pubmed23n0003_401)
4. [0.826] [Emphysema and "emphysembronchitis" in the age and their treatment (author's transl)]. (PMID: pubmed23n0002_5961)
5. [0.821] The differential diagnosis of bronchial asthma in children. (PMID: pubmed23n0001_4422)
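
One quick way to quantify how much the two result lists above overlap is a Jaccard score over the returned identifiers. This is only an illustrative diagnostic, not part of the original sanity check; the IDs are copied from the two outputs, and `jaccard` is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard overlap of two ID collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Top-5 identifiers copied from the two query outputs above.
embolism_hits = [
    "pubmed23n0001_4982", "pubmed23n0001_9345", "pubmed23n0001_1038",
    "pubmed23n0001_3473", "pubmed23n0002_5963",
]
nodule_hits = [
    "pubmed23n0001_4982", "pubmed23n0001_3473", "pubmed23n0003_401",
    "pubmed23n0002_5961", "pubmed23n0001_4422",
]

overlap = jaccard(embolism_hits, nodule_hits)
shared = sorted(set(embolism_hits) & set(nodule_hits))
same_top_hit = embolism_hits[0] == nodule_hits[0]
```

Here `overlap` evaluates to 0.25 with two shared records, and `same_top_hit` is `True`. Substantial overlap between topically different queries suggests the scores separate documents only weakly, which may be worth investigating before relying on the index for citations.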
