![i2b2 Logo](images/transmart-logo.png)

# Using ChromaDB + Embeddings to Search Patient Notes (RAG)

This notebook demonstrates how to implement a **Retrieval-Augmented Generation (RAG)** pipeline using **local embeddings** and **ChromaDB** to search and analyze clinical notes stored in an i2b2-like format. It walks through decoding, embedding, storing, retrieving, and generating responses with a local LLM.

You will work through a complete pipeline:
- Decode notes from BinHex format
- Embed clinical notes using the MiniLM model
- Store the notes and metadata in a persistent ChromaDB vector store
- Perform semantic similarity search and inspect cosine-based scores
- Apply **Maximal Marginal Relevance (MMR)** to promote diversity and reduce redundancy
- Retrieve only the **most recent note per patient** for clarity and accuracy
- Inject results into a reusable prompt template
- Generate structured, clinically meaningful responses using a local LLM (e.g., Qwen or LLaMA 3 via Ollama)

### 🔍 Key Concepts Covered

- Decoding i2b2-formatted BinHex clinical notes
- Embedding full notes for semantic similarity search
- Persistent storage and retrieval using ChromaDB
- Score-based filtering and MMR ranking
- Structured prompt engineering for LLMs
- Zero-shot summarization with a local model

Each cell builds on the last to demonstrate a complete RAG workflow, tailored to **clinical informatics and patient note analysis**.

> This notebook is part of the workshop: _Using LLMs to Search and Summarize Patient Notes_.


## 1. Prepare Data for Embedding

Before we can perform semantic search or generate responses using clinical notes, we must first prepare the data.

- **1.1**: Load a simulated i2b2 `visit_dimension` table that contains BinHex-encoded clinical notes and decode them into readable plain text.

This step ensures that each clinical note is in a usable format for embedding in the next section using a local MiniLM model.
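
As a minimal sketch of what decoding involves, the round trip below encodes a made-up note and decodes it again. It assumes the blobs are plain hexadecimal strings with an optional `0x` prefix, which is the form `binascii.unhexlify` accepts once the prefix is removed. The sample text and the `encode_note` helper are hypothetical and exist only for this illustration.

```python
import binascii

def encode_note(text):
    """Hypothetical helper: hex-encode text with a leading 0x, mimicking the stored blobs."""
    return "0x" + binascii.hexlify(text.encode("utf-8")).decode("ascii")

def decode_note(hex_blob):
    """Strip the optional 0x prefix and turn the hex digits back into UTF-8 text."""
    hex_str = hex_blob.replace("0x", "")
    return binascii.unhexlify(hex_str).decode("utf-8", errors="ignore")

sample = "Patient reports wheezing; albuterol PRN."  # made-up note text
blob = encode_note(sample)

print(blob[:20])          # start of the hex-encoded blob
print(decode_note(blob))  # the original text again
```

Note that `errors="ignore"` silently drops bytes that are not valid UTF-8, so a damaged blob yields shortened text rather than an exception.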


In [None]:
# -----------------------------------------------------------
# 1.1. Load and Decode Clinical Notes from i2b2-Mimicking CSV
# -----------------------------------------------------------
# This cell loads visit-level data from a CSV file that mimics the i2b2 `visit_dimension` table
# and decodes the BinHex-encoded clinical notes into readable plain text.

# Each record includes:
# - encounter_num: Unique visit ID
# - patient_num: Patient identifier
# - start_date, end_date: Visit dates
# - location_cd, location_path: Clinic metadata
# - visit_blob: BinHex-encoded clinical note text
# - note_text: Decoded BinHex clinical note text
# -----------------------------------------------------------


import pandas as pd
import binascii
from IPython.display import display, Markdown

# Load the i2b2-style data
csv_path = "datafiles/i2b2_encounter_table.csv"
df = pd.read_csv(csv_path)

# Function to decode the encoded notes
# (the "BinHex" blobs here are plain hexadecimal strings with an optional "0x" prefix)
def decode_note(hex_blob):
    hex_str = hex_blob.replace("0x", "")
    return binascii.unhexlify(hex_str).decode("utf-8", errors="ignore")

# Decode all notes into a new column
df["note_text"] = df["visit_blob"].apply(decode_note)

# Preview the first 10 records with decoded notes
display(Markdown("### Preview of Decoded Notes"))
display(df.head(10))


## 2. Embed and Store Entire Clinical Notes in ChromaDB

In this step, we process and store each clinical note as a **complete document** in a ChromaDB vector store. This approach preserves full patient context, making it ideal for semantic search and downstream clinical reasoning.

### Step 2.1: Embed and Store Notes
1. **Embed Full Notes**
   Each note is converted into a semantic vector using a lightweight transformer model (`MiniLM`).

2. **Store in ChromaDB**
   The embedded vector and its associated metadata (e.g., patient ID, encounter number, visit date) are stored together in a persistent ChromaDB directory.

### Why This Approach?

Storing full notes is especially useful when:
- You want to retrieve the complete narrative for clinical context
- Your downstream task (e.g., summarization or decision support) depends on comprehensive input
- Each note fits within the input limit of a single LLM call

This setup supports more faithful summarization and reasoning than chunk-based approaches when notes are relatively short and self-contained.
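
One caveat worth checking before embedding whole notes: according to its model card, `all-MiniLM-L6-v2` truncates input longer than 256 word pieces by default, so the tail of a long note may not influence its vector. The sketch below is a rough screen for such notes. The `flag_long_notes` helper and the `max_words=200` limit are hypothetical, and whitespace-separated words only approximate word pieces, so treat the counts as a hint rather than an exact token count.

```python
def flag_long_notes(notes, max_words=200):
    """Return (index, word_count) pairs for notes whose whitespace word count exceeds max_words.

    max_words is an assumed, conservative stand-in for the model's word-piece limit.
    """
    flagged = []
    for idx, note in enumerate(notes):
        word_count = len(note.split())
        if word_count > max_words:
            flagged.append((idx, word_count))
    return flagged

# Made-up examples: one short note and one artificially long note
notes = ["short note about asthma follow-up", "word " * 300]
print(flag_long_notes(notes))
```

In the notebook this could be applied as `flag_long_notes(df["note_text"].tolist())`; notes that are flagged may be better served by a chunk-based approach.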

<img src="./images/rag_full.png" alt="RAG Full" width="900">


In [None]:
# -----------------------------------------------------------
# 2.1. Embed Clinical Notes Using Local MiniLM Embeddings and Store in ChromaDB
# -----------------------------------------------------------
# This cell encodes each clinical note into a semantic vector using a local MiniLM model
# and stores the results along with metadata in a ChromaDB vector store for retrieval.

# Embedding Model: sentence-transformers/all-MiniLM-L6-v2
# - Optimized for fast local execution
# - Produces 384-dimensional vectors for semantic similarity
# -----------------------------------------------------------


from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Create or connect to the persistent ChromaDB directory
vectorstore = Chroma(
    persist_directory="./datafiles/chroma_db_notes",
    embedding_function=embedding_model
)

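# Prepare documents and metadata
documents = df["note_text"].tolist()
metadata = df[["patient_num", "encounter_num", "start_date"]].to_dict(orient="records")

# Use the unique encounter number as a stable document ID so that re-running
# this cell updates existing entries instead of storing duplicate notes
ids = df["encounter_num"].astype(str).tolist()

# Add text, metadata, and IDs to the ChromaDB vector store
vectorstore.add_texts(texts=documents, metadatas=metadata, ids=ids)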

print(f"✅ Successfully embedded and stored {len(documents)} clinical notes using MiniLM and ChromaDB.")


## 3. Retrieving Clinical Notes with Similarity Score (RAG Retrieval with ChromaDB)

In this section, we retrieve relevant clinical notes from a ChromaDB vector store using vector similarity techniques. We explore both standard and advanced retrieval strategies to improve the relevance and diversity of retrieved context for downstream LLM generation.

### Key Retrieval Strategies

1. **Define a Clinical Query (Step 3.1)**
   - The user provides a natural language question (e.g., "Who has asthma and is taking Fluticasone or Albuterol?").

2. **Similarity Search with Scores (Step 3.2)**
   - Retrieves the top-k notes ranked by semantic similarity to the query.
   - Returns cosine-based relevance scores for each match.

3. **Score Threshold Filtering (Step 3.3)**
   - Filters out low-confidence matches using a minimum similarity score.
   - Retains only notes that meet a defined relevance threshold (e.g., ≥ 0.6).

4. **Maximal Marginal Relevance (MMR) Search (Step 3.4)**
   - Balances similarity and diversity.
   - Reduces redundancy by selecting a diverse subset of highly relevant notes.

### Why Use These Strategies?

High-quality retrieval is critical to the success of RAG workflows. These techniques help:
- Improve the contextual relevance of inputs to the LLM
- Filter out irrelevant or low-confidence documents
- Encourage diverse results to reduce bias and improve robustness
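
To make the MMR idea concrete, here is a minimal pure-Python sketch of the usual greedy formulation: at each step, pick the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. This is an illustration under those assumptions, not the code the vector store runs; `mmr_select` and the toy 2-D vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick up to k indices into doc_vecs using MMR."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], -math.inf
        for i in candidates:
            # Redundancy: highest similarity to any document already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

query = [1.0, 0.3]
docs = [[1.0, 0.0], [0.999, 0.04], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates

print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # [1, 2]: the near-duplicate is skipped
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [1, 0]: pure similarity ranking
```

With `lambda_mult=0.5` the second pick is the dissimilar document, because the near-duplicate is penalized for its similarity to the first pick. With `lambda_mult=1.0` the penalty vanishes and the result matches a plain top-k similarity ranking.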

<img src="./images/rag_retrieval.png" alt="RAG Retrieval" width="900">


In [None]:
# -----------------------------------------------------------
# 3.1. Define the Query for Clinical Note Retrieval
# -----------------------------------------------------------
# This cell defines a natural language query to search the embedded clinical notes
# stored in ChromaDB using semantic similarity.

# Concept:
# The query will be *** automatically embedded *** by the vectorstore before performing
# a semantic comparison against stored note vectors.
# -----------------------------------------------------------

# Example Clinical Query:
query = "Who has asthma and is taking Fluticasone or Albuterol?"

# Display the query for reference
display(Markdown(f"### 🔍 Query: `{query}`"))


In [None]:
# -----------------------------------------------------------
# 3.2. Perform Similarity Search with Relevance Scores
# -----------------------------------------------------------
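# This cell retrieves the top-k clinical notes most semantically similar to the input query.
# Each result includes a relevance score (intended to fall between 0 and 1) that LangChain's
# Chroma wrapper derives from the collection's distance metric. Chroma uses squared L2 distance
# by default; the score is cosine-based only if the collection is configured for cosine distance.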

# Function used:
# - vectorstore.similarity_search_with_relevance_scores(query, k=10)
#   Returns a list of (Document, score) tuples.

# Relevance Score Interpretation (heuristic only):
# - 0.90 – 1.00 → Highly relevant
# - 0.70 – 0.90 → Strong relevance
# - 0.50 – 0.70 → Moderate relevance
# - 0.30 – 0.50 → Low relevance
# - 0.00 – 0.30 → Minimal or no relevance
# -----------------------------------------------------------

# Perform the similarity search
results = vectorstore.similarity_search_with_relevance_scores(query, k=10)

# Display the results
display(Markdown("### 🔍 Retrieved Clinical Notes with Relevance Scores"))

for idx, (doc, score) in enumerate(results, 1):
    patient = doc.metadata.get("patient_num", "N/A")
    encounter = doc.metadata.get("encounter_num", "N/A")
    date = doc.metadata.get("start_date", "N/A")
    doc_id = getattr(doc, "id", "N/A")
    excerpt = doc.page_content[:1000].replace("\n", " ")

    display(Markdown(
        f"---\n**Result {idx}**  \n"
        f"- **Relevance Score:** `{score:.4f}`  \n"
        f"- **Patient Num:** `{patient}`  \n"
        f"- **Encounter Num:** `{encounter}`  \n"
        f"- **Start Date:** `{date}`  \n"
        f"- **Document ID:** `{doc_id}`  \n\n"
        f"**Excerpt:**\n```text\n{excerpt}...\n```"
    ))


In [None]:
# -----------------------------------------------------------
# 3.3. Use a Retriever with a Similarity Score Threshold
# -----------------------------------------------------------
# This cell configures a retriever that only returns documents whose
# relevance scores meet or exceed a predefined threshold.

# Parameters:
# - search_type="similarity_score_threshold": Enables threshold-based filtering.
# - search_kwargs={"k": top_k, "score_threshold": score_threshold}
#     - k: Maximum number of documents to *evaluate* (not necessarily return)
#     - score_threshold: Minimum relevance score (0–1) required for inclusion (set to 0.59 below)

# Purpose:
# Improves retrieval precision by excluding weak semantic matches, which is
# especially important for clinical reasoning and safety-critical applications.
# -----------------------------------------------------------

# Define threshold and top-k limit
score_threshold = 0.59
top_k = 10

# Create the retriever with filtering
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": top_k, "score_threshold": score_threshold}
)

# Execute the retrieval
results = retriever.invoke(query)

# Display results
display(Markdown(f"### 🔎 Retrieved Clinical Notes (Score ≥ {score_threshold})"))

for idx, doc in enumerate(results, 1):
    patient = doc.metadata.get("patient_num", "N/A")
    date = doc.metadata.get("start_date", "N/A")
    doc_id = getattr(doc, "id", "N/A")
    excerpt = doc.page_content[:1000].replace("\n", " ")

    display(Markdown(
        f"**Document {idx}**  \n"
        f"- **Patient Num:** `{patient}`  \n"
        f"- **Start Date:** `{date}`  \n"
        f"- **Document ID:** `{doc_id}`  \n\n"
        f"**Excerpt:**\n```text\n{excerpt}...\n```"
    ))

display(Markdown(f"**✅ Total relevant results:** `{len(results)}`"))


In [None]:
# -----------------------------------------------------------
# 3.4. Perform Maximal Marginal Relevance (MMR) Search
# -----------------------------------------------------------
# This cell retrieves clinical notes using Maximal Marginal Relevance (MMR),
# an approach that balances query relevance with result diversity.

# Parameters:
# - fetch_k = 500: Number of top-ranked candidates to evaluate before reranking
# - k = 5: Final number of diverse, relevant documents to return
# - lambda_mult:
#     - 0.0 → prioritize diversity (minimal redundancy)
#     - 1.0 → prioritize similarity to the query
#     - 0.5 → balanced trade-off between diversity and relevance

# Purpose:
# MMR reduces redundancy by penalizing near-duplicate documents, while still
# prioritizing clinical notes that are highly relevant to the query.
# This is especially useful in settings where multiple perspectives or
# treatment variations are valuable (e.g., different asthma management plans).
# -----------------------------------------------------------

# Perform MMR search
results = vectorstore.max_marginal_relevance_search(
    query=query,
    k=5,
    fetch_k=500,
    lambda_mult=0.5
)

# Display results
display(Markdown("### 📘 Retrieved Clinical Notes Using Maximal Marginal Relevance (MMR)"))

for idx, doc in enumerate(results, 1):
    patient = doc.metadata.get("patient_num", "N/A")
    date = doc.metadata.get("start_date", "N/A")
    doc_id = getattr(doc, "id", "N/A")
    excerpt = doc.page_content[:1000].replace("\n", " ")

    display(Markdown(
        f"**Document {idx}**  \n"
        f"- **Patient Num:** `{patient}`  \n"
        f"- **Start Date:** `{date}`  \n"
        f"- **Document ID:** `{doc_id}`  \n\n"
        f"**Excerpt:**\n```text\n{excerpt}...\n```"
    ))

display(Markdown(f"**✅ Total MMR Results Returned:** `{len(results)}`"))


## 4. Generating Structured Responses with an LLM (RAG Generation)

In this final step, we use a Large Language Model (LLM) to analyze the clinical notes retrieved in the previous section. By injecting these relevant notes into a structured prompt, we enable the LLM to generate clinically useful, structured responses.

This completes the Retrieval-Augmented Generation (RAG) pipeline, connecting search results to intelligent language generation.

### Key Steps

1. **Create a Prompt Template (Step 4.1)**
   - Defines a reusable prompt structure that includes placeholders for both the query and the retrieved clinical context.
   - Specifies a structured output format, including metadata fields such as patient ID, encounter date, and a clinical summary.

2. **Invoke the LLM with Retrieved Context (Step 4.2)**
   - Fills the prompt with retrieved notes and the clinical query.
   - Sends the prompt to a local LLM (e.g., Qwen2 or LLaMA 3 via Ollama).
   - Returns a structured, context-aware response to the clinical question.

### Why It Matters

This generation step demonstrates the power of combining semantic search with generative AI:
- Produces context-rich answers grounded in patient data
- Supports clinical summarization and decision support
- Enables zero-shot reasoning over real-world clinical notes
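
The prompt in Step 4.1 asks the model to use only the most recent note per patient and to report metadata such as patient number and visit date. One way to make that more reliable is to deduplicate and label the notes in code before filling the prompt, rather than leaving it to the LLM. The sketch below shows the idea under several assumptions: `latest_note_per_patient` and `build_context` are hypothetical helpers, `SimpleNamespace` objects stand in for retrieved documents (anything with `.metadata` and `.page_content` would work), and the `%Y-%m-%d` date format is a guess at how `start_date` is stored.

```python
from datetime import datetime
from types import SimpleNamespace

def latest_note_per_patient(docs, date_key="start_date", date_format="%Y-%m-%d"):
    """Keep only the most recent document for each patient_num."""
    latest = {}
    for doc in docs:
        patient = doc.metadata.get("patient_num")
        visit_date = datetime.strptime(str(doc.metadata[date_key]), date_format)
        if patient not in latest or visit_date > latest[patient][0]:
            latest[patient] = (visit_date, doc)
    return [doc for _, doc in latest.values()]

def build_context(docs):
    """Join notes into one context string, prefixing each with its metadata."""
    blocks = []
    for doc in docs:
        header = (
            f"Patient Num: {doc.metadata.get('patient_num', 'N/A')} | "
            f"Encounter Num: {doc.metadata.get('encounter_num', 'N/A')} | "
            f"Start Date: {doc.metadata.get('start_date', 'N/A')}"
        )
        blocks.append(f"{header}\n{doc.page_content}")
    return "\n\n---\n\n".join(blocks)

# Made-up stand-ins for retrieved documents
retrieved = [
    SimpleNamespace(page_content="Initial asthma diagnosis.",
                    metadata={"patient_num": 1, "encounter_num": 10, "start_date": "2021-02-01"}),
    SimpleNamespace(page_content="Annual physical, no complaints.",
                    metadata={"patient_num": 2, "encounter_num": 20, "start_date": "2022-06-15"}),
    SimpleNamespace(page_content="Asthma follow-up, albuterol PRN.",
                    metadata={"patient_num": 1, "encounter_num": 11, "start_date": "2023-03-09"}),
]

context = build_context(latest_note_per_patient(retrieved))
print(context)
```

The resulting string could be passed as `retrieved_docs` when the prompt is filled, so the model sees one labeled note per patient instead of having to infer identifiers and recency from the note text alone.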

<img src="./images/rag_generation.png" alt="RAG Generation" width="1250">


In [None]:
# -----------------------------------------------------------
# 4.1. Create a Prompt Template for LLM Querying
# -----------------------------------------------------------
# This chat prompt guides the LLM to generate structured clinical summaries
# from notes retrieved via similarity search.

# Placeholders:
# - {retrieved_docs}: Injects top-matching clinical notes
# - {query}: A user-defined clinical question

# Output Expectations:
# - One structured summary per patient
# - Metadata included for traceability
# - Response based only on the most recent note per patient
# -----------------------------------------------------------

from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_messages([
    ("system",
     "You are an advanced medical documentation assistant. Your role is to extract structured clinical details and summarize findings from retrieved patient notes."),

    ("human",
    """Based on the following retrieved clinical records:

{retrieved_docs}

Answer the clinical question: {query}

For each relevant patient, provide:

1. **Patient Demographics**: Patient Num, Gender, Age, Race
2. **Visit Date**: Use the most recent date only
3. **Chief Complaints**
4. **Current Medications**
5. **Has Asthma?** (Yes/No)
   - Only answer 'Yes' if asthma is explicitly stated in the diagnosis OR if the patient is currently prescribed a known asthma medication (e.g., albuterol, fluticasone).
   - If asthma is not clearly mentioned or supported by medication, answer 'No'. Do not infer.

6. **Summary**: One paragraph summarizing the patient's condition and addressing the query.

**Instructions:**
- Format your output using clear bullet points.
- Include only one entry per patient, using their most recent clinical note.
- Avoid including duplicate patients.
""")
])

print("✅ Prompt created and ready to use.")


In [None]:
# -----------------------------------------------------------
# 4.2. Use Retrieved Context to Invoke LLM and Generate Response
# -----------------------------------------------------------
# This step completes the RAG workflow by injecting the retrieved clinical notes
# into a structured chat prompt and using an LLM (via Ollama) to generate a response.
# -----------------------------------------------------------

from langchain_ollama import ChatOllama

# Initialize the local Ollama model (e.g., Qwen2, LLaMA 3, etc.)
model = ChatOllama(model="qwen2")

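# Prepare the context text by combining page_content from the `results` list
# (`results` holds the output of whichever retrieval cell ran last: MMR from 3.4 when cells run in order)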
retrieved_context = "\n\n---\n\n".join([doc.page_content for doc in results])

# Format chat messages using the ChatPromptTemplate
messages = prompt_template.format_messages(
    retrieved_docs=retrieved_context,
    query=query
)

# Generate a structured response using the LLM
response = model.invoke(messages)

# Display the LLM-generated output
display(Markdown("### 🧠 LLM-Generated Response"))
display(Markdown(response.content))
