![MGB Logo](./images/mgb-logo.png)


In [None]:
# -----------------------------------------------------------
# 1. Importing Required Libraries
# -----------------------------------------------------------
# Demonstrates the necessary imports for setting up Azure OpenAI authentication, embeddings, and model interaction.

# Key Components:
#   - os: Provides functions for interacting with the operating system, such as accessing environment variables.
#   - load_dotenv(): Loads environment variables from a .env file so that credentials stay out of the source code.
#   - datetime: Parses and reformats the visit dates embedded in clinical note filenames.
#   - DefaultAzureCredential: Handles authentication with Azure services using various credential methods.
#   - get_bearer_token_provider(): Returns a callable that supplies bearer tokens for Azure OpenAI requests.
#   - AzureOpenAIEmbeddings: Enables text embedding generation using Azure OpenAI.
#   - AzureChatOpenAI: A LangChain wrapper for interacting with Azure-hosted OpenAI chat models.

# Purpose:
# These libraries facilitate secure authentication and seamless interaction with Azure OpenAI,
# enabling applications to leverage AI-powered text embeddings and chat models for various use cases.


import os
from dotenv import load_dotenv
from datetime import datetime
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI

print("\n=== Required Libraries Loaded ===")

In [None]:
# -----------------------------------------------------------
# 2. Setting Up Azure OpenAI Embeddings and Chat Model
# -----------------------------------------------------------
# Demonstrates how to configure and authenticate Azure OpenAI embeddings and chat models for AI-powered applications.

# Key Components:
#   - load_dotenv(): Loads environment variables from a .env file to securely store API credentials.
#   - DefaultAzureCredential(): Tries a chain of credential sources (environment variables, managed identity, developer tool logins) and uses the first that succeeds.
#   - get_bearer_token_provider(): Returns a callable that supplies bearer tokens for the given scope.
#   - AzureOpenAIEmbeddings(): Initializes the Azure-hosted embeddings model for text vectorization.
#       - model: Specifies the embedding model to use.
#       - azure_deployment: Identifies the specific Azure deployment for embeddings.
#       - api_version: Defines the API version to be used.
#       - azure_endpoint: Specifies the Azure endpoint for API access.
#       - azure_ad_token_provider: Supplies authentication tokens.
#       - timeout: Ensures requests never time out (None).
#       - max_retries: Sets the number of retries in case of failure (2).
#   - AzureChatOpenAI(): Initializes the Azure-hosted OpenAI chat model.
#       - openai_api_version: Specifies the API version.
#       - azure_deployment: Identifies the specific chat model deployment.
#       - azure_endpoint: Defines the endpoint for API access.
#       - azure_ad_token_provider: Supplies authentication tokens.

# Purpose:
# This setup ensures secure and efficient access to Azure OpenAI services for both text embeddings and conversational AI.
# It enables scalable and robust AI-powered applications by integrating vector-based retrieval and LLM interactions.

load_dotenv()

# Set up Azure credentials and token provider
azure_credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    azure_credential, "https://cognitiveservices.azure.com/.default"
)

# Initialize the AzureOpenAIEmbeddings model using environment variables
embedding_model = AzureOpenAIEmbeddings(
    model=os.getenv("AZURE_EMBEDDING_MODEL"),
    azure_deployment=os.getenv("AZURE_EMBEDDING_DEPLOYMENT"),
    api_version=os.getenv("AZURE_EMBEDDING_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_EMBEDDING_ENDPOINT"),
    azure_ad_token_provider=token_provider,
    timeout=None,  # never timeout
    max_retries=2,  # try again twice
)

# Initialize the AzureChatOpenAI model using environment variables
model = AzureChatOpenAI(
    openai_api_version=os.getenv("AZURE_OPENAI_VERSION"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_ad_token_provider=token_provider
)

print("\n=== LLM and Embedding Models Loaded ===")


## 3. Option 1: Reading, Appending, Splitting Notes in Chunks

This approach processes clinical notes by reading them from files, storing them in memory, and then splitting them into smaller chunks for optimized retrieval.

**Key Steps:**
 1. **Reading and Appending Clinical Notes (Item 3.1)**:
    - Loads clinical notes from a directory and stores them in a list.
    - Extracts metadata, such as the patient identifier, for later reference.

 2. **Splitting Clinical Notes into Chunks (Item 3.2)**:
    - Uses a text splitter to divide documents into smaller sections.
    - Maintains context overlap between chunks to ensure coherence.

 3. **Storing Chunked Clinical Notes in ChromaDB (Item 3.3)**:
    - Converts text chunks into embeddings and stores them in ChromaDB.
    - Enables efficient vector search for fast and relevant retrieval.

**Purpose:**
This method enhances document retrieval accuracy by breaking down large patient notes into smaller, contextually rich segments that can be efficiently searched and analyzed.

![RAG Chunks](images/rag_chunks.png)
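
To make the roles of chunk size and overlap concrete, here is a minimal sketch that splits a string with a fixed-width sliding window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name `chunk_text` and the sample string are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


note = "abcdefghijklmnopqrstuvwxyz"  # stand-in for a clinical note
chunks = chunk_text(note, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is the same idea the notebook relies on to keep context across chunk boundaries.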

In [None]:
# -----------------------------------------------------------
# 3.1. Reading and Appending Clinical Notes (OPTION 1)
# -----------------------------------------------------------
# Demonstrates how to read and store clinical notes from a directory for further processing.

# Key Components:
#   - clinical_notes_dir: Specifies the directory where patient notes are stored.
#   - documents: A list that stores the text content of each clinical note.
#   - metadata: A list that stores metadata (patient identifier and visit date) associated with each note.
#   - sorted(os.listdir(clinical_notes_dir)): Retrieves and sorts all filenames in the directory.
#   - filename.endswith(".txt"): Ensures only text files are processed.
#   - patient_num: Extracts the patient identifier from the filename (second underscore-separated field).
#   - visit_date: Parses the trailing YYYYMMDD field of the filename and reformats it as MM/DD/YYYY.
#   - open(file_path, "r", encoding="utf-8"): Reads the file contents while preserving character encoding.
#   - documents.append(text): Stores the full text of each clinical note.
#   - metadata.append({"patient_num": patient_num, "visit_date": visit_date}): Associates each note with its patient ID and visit date.

# Purpose:
# This step prepares clinical notes for further processing, such as embedding for retrieval-augmented generation (RAG).
# It ensures that patient records are correctly loaded and structured before being indexed in a vector store.

clinical_notes_dir = 'data_prep/patient_notes'

# Prepare documents list
documents = []
metadata = []

# Read and process each clinical note
for filename in sorted(os.listdir(clinical_notes_dir)):
    if filename.endswith(".txt"):
        # Extract patient identifier
        parts = filename.split("_")
        patient_num = parts[1]
        
        # Extract visit date
        latest_fact = parts[-1].replace(".txt", "")
        visit_date = datetime.strptime(latest_fact, '%Y%m%d').strftime('%m/%d/%Y')
        
        # Load text content
        file_path = os.path.join(clinical_notes_dir, filename)

        with open(file_path, "r", encoding="utf-8") as file:
            text = file.read()

        # Store document with metadata
        documents.append(text)
        metadata.append({"patient_num": patient_num, "visit_date": visit_date})

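# Preview: number of notes loaded and the first 500 characters of the first note
print(f"Loaded {len(documents)} clinical notes")
if documents:
    print(documents[0][:500])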


In [None]:
# -----------------------------------------------------------
# 3.2. Splitting Clinical Notes into Chunks
# -----------------------------------------------------------
# Demonstrates how to split clinical notes into manageable chunks to optimize retrieval performance.

# Key Components:
#   - RecursiveCharacterTextSplitter: A text-splitting utility that ensures chunks are broken at logical points.
#   - chunk_size=1000: Specifies the maximum size (in characters) of each text chunk.
#   - chunk_overlap=350: Allows up to 350 characters of overlap between consecutive chunks to maintain context continuity.
#   - text_splitter.create_documents(documents, metadatas=metadata): Splits the documents while preserving metadata.
#   - The final loop prints the metadata and text of the first five chunks for verification.

# Purpose:
# This step enhances document retrieval by ensuring that AI models process information in smaller, contextually rich segments.
# It improves search accuracy and relevance in retrieval-augmented generation (RAG) applications.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=350)
split_docs = text_splitter.create_documents(documents, metadatas=metadata)

# print(split_docs[:500])

# Print structured chunked documents with excerpts
print("\n=== Split Document Excerpts ===")
for idx, doc in enumerate(split_docs[:5], 1):  
    print(f"Chunk {idx}:")
    print(f"  Patient Num: {doc.metadata.get('patient_num', 'N/A')}")
    print(f"  Visit Date: {doc.metadata.get('visit_date', 'N/A')}")
    print(f"  Excerpt: {doc.page_content[:2000]}...")  
    print("-" * 100)  


In [None]:
# -----------------------------------------------------------
# 3.3. Storing Chunked Clinical Notes in ChromaDB
# -----------------------------------------------------------
# Demonstrates how to store and index chunked clinical notes in ChromaDB for efficient retrieval.

# Key Components:
#   - Chroma: A vector database optimized for storing and retrieving text embeddings.
#   - Chroma.from_documents(): Creates a vector store from the split clinical notes.
#       - split_docs: The chunked documents containing patient notes.
#       - embedding_model: The embedding model used to convert text into vector representations.
#       - persist_directory="./databases/chroma_db_chunks": Specifies the storage directory for the vector database.
#   - print("Vector store created and loaded successfully!"): Confirms successful creation and storage of embeddings.

# Purpose:
# This step enables fast and accurate retrieval of clinical information by storing vector embeddings of chunked text.
# It allows AI-powered applications to efficiently search and retrieve relevant medical notes based on similarity.

from langchain_chroma import Chroma

vector_store_chunks = Chroma.from_documents(split_docs, embedding_model, persist_directory="./databases/chroma_db_chunks")

print("Vector store created and loaded successfully!")



## 4. Option 2: Storing Entire Clinical Notes in ChromaDB
This approach processes clinical notes by reading them from files and storing them **as whole documents** in ChromaDB, preserving full patient context.

**Key Steps (Item 4.1):**
1. **Reading and Extracting Metadata**:
   - Loads each clinical note from a directory.
   - Extracts the patient identifier and visit date for metadata tracking.

2. **Embedding and Storing Full Documents**:
   - Converts the entire text of each clinical note into vector embeddings.
   - Stores them in ChromaDB, maintaining patient-level document integrity.

**Purpose:**
This method ensures that full patient records remain intact, allowing retrieval of complete medical histories instead of fragmented sections. It is ideal for cases where full context is necessary for decision-making.
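
The metadata extraction step can be sketched in isolation. The snippet below assumes filenames of the form `<prefix>_<patient_num>_..._<YYYYMMDD>.txt`, which is inferred from how the notebook indexes the split filename; the helper name `parse_note_filename` and the example filename are hypothetical.

```python
from datetime import datetime


def parse_note_filename(filename: str) -> dict:
    """Extract patient_num and visit_date from '<prefix>_<patient_num>_..._<YYYYMMDD>.txt'."""
    stem = filename.removesuffix(".txt")  # requires Python 3.9+
    parts = stem.split("_")
    if len(parts) < 3:
        # Assumption: at least a prefix, a patient number, and a date are present
        raise ValueError(f"Unexpected filename format: {filename}")
    # strptime raises ValueError if the last field is not a valid YYYYMMDD date
    visit_date = datetime.strptime(parts[-1], "%Y%m%d").strftime("%m/%d/%Y")
    return {"patient_num": parts[1], "visit_date": visit_date}


# Hypothetical filename, used only for illustration
example = parse_note_filename("patient_1001_20240115.txt")
print(example)
```

Validating the filename up front means a misnamed file fails with a clear error before anything is embedded, rather than producing metadata that silently points at the wrong patient or date.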

<img src="./images/rag_full.png" alt="RAG Full" width="900">



In [None]:
# -----------------------------------------------------------
# 4.1. Storing Entire Clinical Notes in ChromaDB
# -----------------------------------------------------------
# Demonstrates how to store full clinical notes in ChromaDB, preserving complete patient context for retrieval.

# Key Components:
#   - Chroma: A vector database for efficiently storing and retrieving text embeddings.
#   - Chroma(persist_directory="./databases/chroma_db_full", embedding_function=embedding_model): 
#     Initializes the vector store for storing full clinical notes.
#   - sorted(os.listdir(clinical_notes_dir)): Retrieves and sorts all patient note files for processing.
#   - patient_num, visit_date: Extract the patient identifier and visit date from the filename.
#   - open(file_path, "r", encoding="utf-8"): Reads the full content of each clinical note.
#   - vector_store_full.add_texts(): Embeds and inserts the entire document into ChromaDB, associating it with metadata (patient_num, visit_date).
#   - print("All clinical notes have been embedded and stored successfully!"): Confirms successful indexing of documents.

# Purpose:
# This method preserves the full clinical context of each note, allowing AI models to retrieve and analyze complete patient records.
# It enhances retrieval accuracy by avoiding text chunking, ensuring that related medical details remain together.


# Initialize ChromaDB vector store
vector_store_full = Chroma(persist_directory="./databases/chroma_db_full", embedding_function=embedding_model)

# Process and embed each clinical note individually
for filename in sorted(os.listdir(clinical_notes_dir)):
    if filename.endswith(".txt"):
        # Extract patient identifier
        parts = filename.split("_")
        patient_num = parts[1]
        
        # Extract visit date
        latest_fact = parts[-1].replace(".txt", "")
        visit_date = datetime.strptime(latest_fact, '%Y%m%d').strftime('%m/%d/%Y')
        
        file_path = os.path.join(clinical_notes_dir, filename)

        # Load text content
        with open(file_path, "r", encoding="utf-8") as file:
            text = file.read()
        
        # Embed and insert document into ChromaDB immediately
        vector_store_full.add_texts([text], metadatas=[{"patient_num": patient_num, "visit_date": visit_date}])

print("All clinical notes have been embedded and stored successfully!")


## 5. Defining the Query for Clinical Note Retrieval


In [None]:
# -----------------------------------------------------------
# 5. Defining the Query for Clinical Note Retrieval
# -----------------------------------------------------------
# Demonstrates how to define a natural language query for retrieving relevant clinical notes.

# Key Components:
#   - query: A user-defined question that will be used to search the vector store.
#   - "Who has asthma and is taking Fluticasone?": The query aims to retrieve clinical notes of patients diagnosed with asthma and prescribed Fluticasone.

# Purpose:
# This query enables the retrieval of relevant patient records from the vector store, 
# allowing AI-powered applications to find medical cases that match specific conditions and medications.


query = "Who has asthma and is taking Fluticasone?"




## 6. Retrieving Clinical Notes with Similarity and MMR Search
This approach focuses on retrieving relevant clinical notes using different search methods, including similarity-based retrieval and Maximal Marginal Relevance (MMR).

**Key Steps:**
1. **Similarity Search (Item 6.1)**:
   - Finds the most relevant clinical notes based on vector similarity.
   - Retrieves documents that closely match the given query.

2. **Similarity Search with Relevance Scores (Item 6.2)**:
   - Retrieves relevant documents along with their similarity scores.
   - Enables ranking and filtering based on the confidence of relevance.

3. **Using a Retriever with Score Threshold (Item 6.3)**:
   - Configures a retriever that automatically filters documents based on a minimum similarity score.
   - Returns only the most relevant clinical notes.

4. **Maximal Marginal Relevance (MMR) Search (Item 6.4)**:
   - Balances relevance and diversity in search results.
   - Ensures retrieval of a broad yet relevant set of documents to avoid redundancy.

**Purpose:**
This retrieval strategy ensures that clinical notes are not only highly relevant to the query but also diverse enough to provide a well-rounded perspective. It improves search accuracy and enhances AI-driven medical analysis.
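
As a rough illustration of top-k retrieval combined with a score threshold, the sketch below ranks toy two-dimensional vectors by cosine similarity and drops anything under the threshold. It is not what Chroma or LangChain compute internally (their scores depend on the configured distance metric and normalization); the vectors, names, and the helper `top_k_above_threshold` are invented for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_above_threshold(query_vec, doc_vecs, k, score_threshold):
    """Rank documents by similarity, keep the best k, then drop low scores."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in scored[:k] if score >= score_threshold]


# Toy 2-dimensional "embeddings", invented for illustration
doc_vecs = {
    "note_a": [1.0, 0.0],
    "note_b": [0.8, 0.6],
    "note_c": [0.0, 1.0],
}
query_vec = [1.0, 0.0]

hits = top_k_above_threshold(query_vec, doc_vecs, k=3, score_threshold=0.45)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

Even with `k=3`, only two notes are returned, because the third falls below the threshold. This mirrors how a threshold retriever can return fewer than `k` documents.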

<img src="./images/rag_retrieval.png" alt="RAG Retrieval" width="800">
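
The trade-off that MMR makes can be shown with a textbook greedy selection: each step picks the candidate with the best `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. This is a minimal sketch under that definition and may differ in detail from LangChain's implementation; the vectors and the function name `mmr_select` are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy MMR: trade relevance to the query against similarity to prior picks."""
    candidates = dict(doc_vecs)
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(name):
            relevance = cosine_similarity(query_vec, candidates[name])
            redundancy = max(
                (cosine_similarity(candidates[name], doc_vecs[s]) for s in selected),
                default=0.0,  # nothing selected yet, so nothing to be redundant with
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        del candidates[best]
    return selected


# Toy vectors, invented for illustration: note_b is a near-duplicate of note_a
doc_vecs = {
    "note_a": [0.9, 0.3],
    "note_b": [0.9, 0.31],
    "note_c": [0.9, -0.4],
}
query_vec = [1.0, 0.0]

balanced = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print("lambda_mult=0.5:", balanced)
print("lambda_mult=1.0:", relevance_only)
```

With `lambda_mult=1.0` the near-duplicate is picked second because only relevance counts; with `lambda_mult=0.5` the less similar note is preferred, which is the redundancy reduction described above.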



In [None]:
# -----------------------------------------------------------
# 6.1. Performing Similarity Search
# -----------------------------------------------------------
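# Demonstrates how to retrieve clinical notes by vector similarity between the query embedding and the stored embeddings.
# Note: the distance metric depends on how the Chroma collection is configured; it is cosine only if set up that way.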

# Key Components:
#   - similarity_search(query, k=10): 
#     Retrieves the top k (10 in this case) most similar documents to the query.
#   - vector_store_chunks.similarity_search(): 
#     Searches in the chunked vector store, retrieving smaller document segments.
#   - vector_store_full.similarity_search(): 
#     (Alternative, commented out) Searches in the full document vector store for broader context.
#   - The final loop prints the metadata, ID, and a 500-character excerpt of each retrieved document.

# Purpose:
# This method enables retrieval of the most relevant clinical notes based on semantic similarity,
# allowing AI models to analyze and process medical cases that closely match the query.

results = vector_store_chunks.similarity_search(query, k=10)
# results = vector_store_full.similarity_search(query, k=5)
# print(results)


print("\n=== Retrieved Clinical Notes ===\n")

for idx, doc in enumerate(results, 1):
    print(f"Document {idx}:")
    print(f"  Patient Num: {doc.metadata.get('patient_num', 'N/A')}")
    print(f"  Visit Date: {doc.metadata.get('visit_date', 'N/A')}")
    print(f"  Document ID: {doc.id}")
    print(f"  Excerpt: {doc.page_content[:500]}...")  
    print("-" * 100) 


In [None]:
# -----------------------------------------------------------
# 6.2. Performing Similarity Search with Relevance Scores
# -----------------------------------------------------------
# Demonstrates how to retrieve clinical notes along with their similarity scores, 
# allowing for more precise filtering and ranking of results.

# Key Components:
#   - similarity_search_with_relevance_scores(query, k=10): 
#     Retrieves the top k (10 in this case) most similar documents along with their relevance scores.
#   - vector_store_chunks.similarity_search_with_relevance_scores(): 
#     Searches within the chunked vector store, returning segment-level matches.
#   - vector_store_full.similarity_search_with_relevance_scores(): 
#     (Alternative, commented out) Searches within the full document vector store for broader context.
#   - The final loop prints each document's relevance score, metadata, ID, and a 500-character excerpt.

# Score Interpretation (rough rule of thumb; actual values depend on the embedding model and distance metric):
#   - 0.9 - 1.0: Highly relevant match
#   - 0.7 - 0.9: Strong relevance
#   - 0.5 - 0.7: Moderate relevance
#   - 0.3 - 0.5: Low relevance
#   - 0.0 - 0.3: Minimal or no relevance

# Purpose:
# This method provides greater transparency in retrieval by returning similarity scores,
# enabling fine-tuned filtering to ensure only highly relevant clinical notes are used for AI analysis.


results = vector_store_chunks.similarity_search_with_relevance_scores(query, k=10)
# results = vector_store_full.similarity_search_with_relevance_scores(query, k=5)

# Print retrieved results with relevance scores in a structured format
print("\n=== Retrieved Clinical Notes with Relevance Scores ===\n")

for idx, (doc, score) in enumerate(results, 1):
    print(f"Document {idx}:")
    print(f"  Relevance Score: {score:.6f}")  
    print(f"  Patient Num: {doc.metadata.get('patient_num', 'N/A')}")
    print(f"  Visit Date: {doc.metadata.get('visit_date', 'N/A')}")
    print(f"  Document ID: {doc.id}")
    print(f"  Excerpt: {doc.page_content[:500]}...")  
    print("-" * 100)  

    

In [None]:
# -----------------------------------------------------------
# 6.3. Using a Retriever with a Score Threshold
# -----------------------------------------------------------
# Demonstrates how to configure a retriever to only return documents that meet a minimum relevance score.

# Key Components:
#   - search_type="similarity_score_threshold": 
#     Specifies that the retriever should apply a similarity score filter when retrieving documents.
#   - search_kwargs={"k": 10, "score_threshold": 0.45}: 
#     - k: Maximum number of results to return.
#     - score_threshold: Minimum relevance score required for a document to be included.
#   - retriever.invoke(query): Retrieves documents that meet the score threshold criteria.
#   - The final loop prints the metadata, ID, and excerpt of each document that passed the threshold.
#   - print(f"Total relevant results: {len(results)}"): Outputs the count of retrieved documents.

# Purpose:
# This method optimizes search precision by ensuring only documents with high relevance scores are retrieved,
# making it particularly useful for medical applications requiring accurate and relevant clinical information.


retriever = vector_store_chunks.as_retriever(
# retriever = vector_store_full.as_retriever(
    search_type="similarity_score_threshold", 
    search_kwargs={"k": 10, 
                   "score_threshold": 0.45
                   }
)
results = retriever.invoke(query)

# Print the documents that passed the score threshold (the retriever does not return the scores themselves)
print("\n=== Retrieved Clinical Notes with Score Threshold ===\n")

for idx, doc in enumerate(results, 1):
    print(f"Document {idx}:")
    print(f"  Patient Num: {doc.metadata.get('patient_num', 'N/A')}")
    print(f"  Visit Date: {doc.metadata.get('visit_date', 'N/A')}")
    print(f"  Document ID: {doc.id}")
    print(f"  Excerpt: {doc.page_content[:500]}...")  
    print("-" * 100)  

print(f"Total relevant results: {len(results)}")


In [None]:
# -----------------------------------------------------------
# 6.4. Performing Maximal Marginal Relevance (MMR) Search
# -----------------------------------------------------------
# Demonstrates how to retrieve clinical notes using MMR, balancing relevance and diversity.

# Key Components:
#   - max_marginal_relevance_search(): Retrieves results that maximize relevance while reducing redundancy.
#   - fetch_k=100: Specifies the number of documents to fetch before applying MMR selection.
#   - k=10: Specifies the number of final documents to return.
#   - lambda_mult=0.5: Controls the balance between relevance (1) and diversity (0). 
#     - 0: Maximizes diversity in results.
#     - 1: Prioritizes relevance, potentially leading to similar documents.
#   - The final loop prints the metadata, ID, and excerpt of each selected document.

# Purpose:
# MMR ensures that the retrieved clinical notes are not only relevant but also diverse,
# reducing redundancy and covering a broader range of information. 
# This is especially useful in medical applications where multiple perspectives on a condition or treatment are needed.


results = vector_store_chunks.max_marginal_relevance_search(
# results = vector_store_full.max_marginal_relevance_search(
    query, 
    k=10, 
    fetch_k=100, 
    lambda_mult=0.5)

# Print the MMR-selected documents in a structured format
print("\n=== Retrieved Clinical Notes with MMR Search ===\n")

for idx, doc in enumerate(results, 1):
    print(f"Document {idx}:")
    print(f"  Patient Num: {doc.metadata.get('patient_num', 'N/A')}")
    print(f"  Visit Date: {doc.metadata.get('visit_date', 'N/A')}")
    print(f"  Document ID: {doc.id}")
    print(f"  Excerpt: {doc.page_content[:500]}...")  
    print("-" * 100)  

print(f"Total relevant results: {len(results)}")


## 7. Generation

**Key Steps:**
1. **Creating a Prompt Template for LLM Querying (Item 7.1)**:
   - Demonstrates how to structure a prompt for an AI model to analyze clinical notes.
2. **Invoking AzureChatOpenAI with Retrieved Context (Item 7.2)**:
   - Demonstrates how to pass retrieved clinical notes into the AI model for generating structured responses.
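
One way to think about the generation step is as plain string assembly: render each retrieved note together with its metadata, join the notes, and fill the template placeholders. The sketch below uses only the standard library; the `Note` dataclass is a hypothetical stand-in for a retrieved document, the template is a shortened version of the prompt used in this notebook, and the note text is synthetic.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


TEMPLATE = (
    "You are a medical assistant analyzing clinical notes. Based on the following records:\n\n"
    "{retrieved_docs}\n\n"
    "Answer the question: {query}"
)


def build_prompt(notes: list[Note], query: str) -> str:
    """Render each note with its metadata, then fill the template."""
    context = "\n\n".join(
        f"[Patient Num: {n.metadata.get('patient_num', 'N/A')}, "
        f"Visit Date: {n.metadata.get('visit_date', 'N/A')}]\n{n.page_content}"
        for n in notes
    )
    return TEMPLATE.format(retrieved_docs=context, query=query)


# Synthetic note, invented for illustration
notes = [Note("Synthetic note text.", {"patient_num": "1001", "visit_date": "01/15/2024"})]
prompt = build_prompt(notes, "Who has asthma and is taking Fluticasone?")
print(prompt)
```

Rendering the metadata explicitly gives the model the patient number and visit date in a predictable position, instead of relying on the default string form of the document objects.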

<img src="./images/rag_generation.png" alt="RAG Generation" width="1250">


In [None]:
# -----------------------------------------------------------
# 7.1. Creating a Prompt Template for LLM Querying
# -----------------------------------------------------------
# Demonstrates how to structure a prompt for an AI model to analyze clinical notes.

# Key Components:
#   - PromptTemplate.from_template(): Creates a dynamic prompt template for AI interaction.
#   - {retrieved_docs}: Placeholder for retrieved clinical notes that will provide context for the LLM.
#   - {query}: Placeholder for the user’s query, which the AI will answer based on the provided context.
#   - Structured Output:
#     - Patient Num, Gender, Age, Race, and Visit Date fields ensure that the response is structured and complete.
#     - Summary: One paragraph summarizing the patient note and one paragraph answering the query.

# Purpose:
# This prompt template ensures that the AI model generates structured, clear, and relevant responses
# when analyzing clinical notes, making it suitable for automated medical documentation and decision support.

from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
    "You are a medical assistant analyzing clinical notes. Based on the following records:\n\n"
    "{retrieved_docs}\n\n"
    "Answer the question: {query} using the following structure:\n"
    "   - Patient Num: patient_num, Gender: , Age: , Race: \n"
    "   - Visit Date: visit_date\n"
    "   - Summary: One paragraph summarizing the patient note and one paragraph answering the question"
)


In [None]:
# -----------------------------------------------------------
# 7.2. Invoking AzureChatOpenAI with Retrieved Context
# -----------------------------------------------------------
# Demonstrates how to pass retrieved clinical notes into the AI model for generating structured responses.

# Key Components:
#   - final_prompt = prompt_template.format(retrieved_docs=results, query=query): 
#     - Populates the prompt template with the retrieved clinical notes (retrieved_docs) and the user query.
#   - model.invoke(final_prompt): 
#     - Sends the structured prompt to the Azure OpenAI model for processing.
#   - print(response.content): 
#     - Displays the AI-generated response.

# Purpose:
# This step completes the RAG (Retrieval-Augmented Generation) workflow by allowing the LLM to analyze relevant 
# clinical notes and generate structured, insightful answers to medical queries.

# Format the final prompt
final_prompt = prompt_template.format(retrieved_docs=results, query=query)

# Invoke Azure OpenAI model with the RAG-enhanced prompt
response = model.invoke(final_prompt)

print(response.content)

