# Building a Retrieval Augmented Generation (RAG) System with LangChain and ICD-11 Data

This Colab notebook demonstrates how to build a RAG system using LangChain to answer questions based on the International Classification of Diseases, 11th Revision (ICD-11) data.

**Objectives:**
* Understand the concept of Retrieval Augmented Generation (RAG).
* Learn how to acquire and preprocess external knowledge (ICD-11 data).
* Utilize LangChain to integrate various components: Document Loaders, Text Splitters, Embedding Models, Vector Stores, and Language Models.
* Build and query a simple RAG chain.


## Prerequisites

To follow along, you will need:

**Google API Key:**
1. Go to [Google AI Studio](https://aistudio.google.com/app/apikey)
2. Click "Get API Key" or "Create API Key in new project"
3. Copy the generated API key

**ICD-11 Chapter 6 Dataset:**
1. Download `icdchapter6.csv` from [Google Drive](https://drive.google.com/file/d/1ThIsNf1iuns9wlMZmBHOWRI9E6FiVgjQ/view?usp=drive_link).
2. Upload it to Colab: open the Files panel in the left sidebar and add the file to the root folder.

## CODE 1 - Setting up libraries

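We use `!pip` to install LangChain and its related packages, including the integrations for Google's AI models and the Chroma vector store. The cell then prompts for your Google API key and stores it in the `GOOGLE_API_KEY` environment variable.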

In [None]:
!pip install requests
!pip install langchain langchain-community
!pip install pandas
!pip install langchain-chroma
!pip install langchain-google-genai

import os
import getpass

os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API Key: ")

## CODE 2 - Loading external data

We will load the ICD-11 data directly from the provided `icdchapter6.csv` file. This file contains ICD-11 codes, titles, and definitions, separated by semicolons.
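
As a rough illustration of the semicolon-separated layout, the sketch below parses a tiny in-memory sample with Python's standard `csv` module. The two rows are made-up placeholders, not real ICD-11 entries, and the helper name `parse_semicolon_csv` is hypothetical; the notebook itself reads the real file with pandas.

```python
import csv
import io

# Made-up placeholder rows in the same layout as the CSV (not real ICD-11 data).
SAMPLE_CSV = (
    "Code;Title;Definition\n"
    "X001;Example disorder A;An illustrative definition, with a comma.\n"
    "X002;Example disorder B;Another illustrative definition.\n"
)

def parse_semicolon_csv(text: str) -> list[dict]:
    """Parse semicolon-delimited CSV text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [dict(row) for row in reader]

rows = parse_semicolon_csv(SAMPLE_CSV)
for row in rows:
    print(row["Code"], "|", row["Title"])
```

Because the delimiter is a semicolon, commas inside a definition do not split the field.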

In [None]:
import pandas as pd
from langchain_core.documents import Document

def load_icd11_from_csv(file_path: str) -> list[Document]:
    """
    Loads ICD-11 data from a CSV file and converts it into LangChain Document objects.
    Assumes the CSV has columns like 'Code', 'Title', 'Definition'.

    Args:
        file_path (str): The path to the CSV file.

    Returns:
        list[Document]: A list of LangChain Document objects.
    """
    print(f"Loading ICD-11 data from CSV: '{file_path}'...")
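    try:
        # Read the CSV file using pandas, specifying the semicolon delimiter.
        # Assumes the CSV has the headers 'Code', 'Title', and 'Definition'.
        df = pd.read_csv(file_path, sep=';')

        icd_documents = []
        for _, row in df.iterrows():
            # Missing definitions are read as NaN; replace them with an empty string.
            definition = row['Definition'] if pd.notna(row['Definition']) else ""
            # One possible layout (an assumption, adjust as needed): the full entry
            # goes into page_content, and the code and title are kept as metadata.
            icd_documents.append(
                Document(
                    page_content=f"Code: {row['Code']}\nTitle: {row['Title']}\nDefinition: {definition}",
                    metadata={"code": str(row['Code']), "title": str(row['Title'])},
                )
            )

        print(f"Successfully loaded {len(icd_documents)} ICD-11 entries from CSV.")
        return icd_documents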
    except FileNotFoundError:
        print(f"Error: CSV file not found at '{file_path}'. Please ensure the file is uploaded.")
        return []
    except KeyError as e:
        print(f"Error: Missing expected column in CSV: {e}. Please check CSV headers.")
        print("Expected columns: 'Code', 'Title', 'Definition'")
        return []
    except Exception as e:
        print(f"An unexpected error occurred while reading the CSV: {e}")
        return []

# Load ICD-11 documents from the uploaded CSV file
# Ensure 'icdchapter6.csv' is uploaded to your Colab environment
icd11_documents = load_icd11_from_csv('icdchapter6.csv')

if icd11_documents:
    print("\nSample ICD-11 Document (first 100 characters):")
    print(icd11_documents[0].page_content[:100], "...")
    print("\nMetadata example:")
    print(icd11_documents[0].metadata)
else:
    print("Failed to load ICD-11 documents from CSV. Please check the file path and content.")

## CODE - Processing Documents: No Text Splitting

As each row in the CSV represents a complete disorder entry and should not be split, we will treat each loaded document as a single chunk. This means we will not apply any further text splitting.

In [None]:
# Each document (one CSV row, i.e. one disorder) is treated as a single chunk.
icd11_chunks = icd11_documents

print(f"\nOriginal documents count (now also chunk count): {len(icd11_documents)}")
print("Each document is now treated as a single chunk.")

if icd11_chunks:
    print("\nSample chunk (first 100 characters):")
    print(icd11_chunks[0].page_content[:100], "...")
    print("\nSample chunk metadata:")
    print(icd11_chunks[0].metadata)
else:
    print("No chunks available. Please ensure ICD-11 documents were loaded correctly.")

## CODE 3 - Generating Embeddings

Embeddings are numerical representations of text that capture semantic meaning. We'll use these embeddings to find semantically similar chunks when a user asks a question. `GoogleGenerativeAIEmbeddings` will convert our text chunks into vectors.
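
To make "semantically similar" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors are hand-written toy values, not output of the embedding model, and the vector store may use a different distance metric by default.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written toy vectors (real embeddings have hundreds of dimensions).
query_vec = [1.0, 0.0, 1.0]
similar_vec = [0.9, 0.1, 0.8]
unrelated_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, similar_vec))    # close to 1: similar direction
print(cosine_similarity(query_vec, unrelated_vec))  # 0.0: orthogonal vectors
```

A score near 1 means the two vectors point in almost the same direction, which is how a retriever decides that a chunk is relevant to a query.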

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Initialize the embedding model
# The model name 'models/embedding-001' is suitable for generating text embeddings.
# It uses the API key set in the GOOGLE_API_KEY environment variable.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

print("Embedding model initialized.")
# You can optionally test an embedding
# sample_embedding = embeddings.embed_query("What is hypertension?")
# print(f"Sample embedding dimension: {len(sample_embedding)}")


## CODE 4 - Creating or Loading a Persistent Vector Store

A vector store (or vector database) stores the embeddings and allows for efficient similarity search.
We'll use `Chroma` and configure it to persist data to disk. This means if you run this notebook again,
it will load the existing vector store instead of re-embedding all documents, saving time.
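
The "load if it exists, otherwise build and save" idea can be sketched without Chroma. The example below is only an analogy under stated assumptions: it persists a toy index to a JSON file, and `load_or_build_index` and `toy_embed` are hypothetical names standing in for the vector store and the embedding model.

```python
import json
import os
import tempfile

def load_or_build_index(path: str, texts: list[str], embed) -> tuple[dict, bool]:
    """Return (index, loaded_from_disk); build and save the index only if no file exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f), True
    index = {text: embed(text) for text in texts}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index, False

def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: text length and vowel count."""
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text.lower()))]

with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "index.json")
    texts = ["first entry", "second entry"]
    # First call: nothing on disk yet, so the index is built and saved.
    _, loaded_first = load_or_build_index(index_path, texts, toy_embed)
    # Second call: the file exists, so embedding is skipped entirely.
    index, loaded_second = load_or_build_index(index_path, texts, toy_embed)
    print(loaded_first, loaded_second)
```

The second call never invokes `toy_embed`, which mirrors how a persisted vector store avoids paying for embeddings twice.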

In [None]:
from langchain_chroma import Chroma
import os

# Define the directory where the vector store will be persisted
persist_directory = "./chroma_db"

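# Check if the vector store already exists
if os.path.exists(persist_directory) and os.listdir(persist_directory):
    # Load the existing vector store from disk instead of re-embedding.
    print(f"Loading existing vector store from '{persist_directory}'...")
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings,
    )
else:
    # Embed all chunks and persist the new vector store to disk.
    print(f"Creating new vector store in '{persist_directory}'...")
    vectorstore = Chroma.from_documents(
        documents=icd11_chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
    )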

# To make the vectorstore a retriever that can be used in the RAG chain:
retriever = vectorstore.as_retriever()
print("Retriever created from the vector store.")

## CODE 5 - Building the RAG Chain

Now, we'll assemble the RAG chain using LangChain. The chain will perform the following steps:
1.  **Retrieve:** Given a user query, use the `retriever` to find the most relevant chunks from our ICD-11 vector store.
2.  **Stuff:** Combine these retrieved chunks with the original query into a single prompt for the Language Model.
3.  **Generate:** The Language Model generates a response based on the combined prompt.

We'll use LangChain Expression Language (LCEL) for a clear and flexible chain construction.
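
Before wiring up the real components, the three steps can be sketched in plain Python. Everything here is a stand-in chosen for illustration: `retrieve` ranks by shared words instead of embeddings, `generate` is a stub instead of a language model, and the corpus holds placeholder strings, not ICD-11 entries.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank texts by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda text: len(query_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def stuff(query: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt_text: str) -> str:
    """Stub for the language model: reports the prompt size instead of answering."""
    return f"[model would answer from a prompt of {len(prompt_text)} characters]"

corpus = [
    "placeholder entry about sleep",
    "placeholder entry about mood",
    "placeholder entry about anxiety and fear",
]
question = "what about anxiety"
chunks = retrieve(question, corpus, k=1)
prompt_text = stuff(question, chunks)
print(generate(prompt_text))
```

The LCEL chain follows the same shape, with each function replaced by a LangChain component.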

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize the Language Model (LLM)
# Uses the 'gemma-3-1b-it' model.
llm = ChatGoogleGenerativeAI(model="gemma-3-1b-it", temperature=0.7)

# Define the prompt template for the LLM
# The prompt instructs the LLM to act as a helpful assistant and answer questions
# based on the provided context, stating if it doesn't know.
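template = """You are a helpful assistant that answers questions about ICD-11 entries.
Answer the question using only the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question:
{question}

Answer:"""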
prompt = ChatPromptTemplate.from_template(template)

# Construct the RAG chain using LCEL
# The chain flows from:
# 1. 'question' and 'context' inputs. 'context' is populated by the retriever.
# 2. These are passed to the prompt.
# 3. The prompt is passed to the LLM.
# 4. The LLM's output is parsed into a string.
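def format_docs(docs):
    """Joins the retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)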

print("RAG chain constructed.")

## CODE 6 - Demonstration & Evaluation

Now, let's test our RAG system with some queries related to medical conditions.

In [None]:
def ask_rag_system(query: str):
    """
    Asks a question to the RAG system and prints the answer.
    """
    print(f"\nQuestion: {query}")
    answer = rag_chain.invoke(query)
    print(f"Answer: {answer}")


# Example Queries
ask_rag_system("What is the ICD-11 code and definition for 'Dissociative identity disorder'?")
ask_rag_system("What is the capital of France?") # This query should ideally result in "I don't know" as it's outside the ICD-11 context.
ask_rag_system("What is the ICD-11 code for 'Anxiety'?") # A broad term that may match several entries, so the answer depends on which chunks are retrieved.

## Conclusion and Next Steps

You have successfully built a basic RAG system using LangChain, drawing information from ICD-11 data.

**Key Learnings:**
* How to load and prepare external data (here, a semicolon-separated CSV file) for RAG.
* The role of embeddings, and why each ICD-11 entry was kept as a single chunk instead of being split.
* Creating and using a vector store (Chroma).
* Connecting an LLM and retriever to form a RAG chain.

**Potential Improvements and Further Exploration:**
* **Larger Dataset:** Integrate with the full WHO ICD-11 API (requires authentication and more robust data handling) or download a larger pre-processed dataset if available publicly.
* **Advanced Text Splitting:** Experiment with different `TextSplitter` strategies or add metadata to chunks for improved retrieval.
* **Hybrid Search:** Combine vector similarity search with keyword-based search for better retrieval performance.
* **Reranking:** Implement a reranking step after initial retrieval to ensure the most relevant documents are passed to the LLM.
* **Evaluation Metrics:** Set up evaluation metrics to assess the performance of your RAG system (e.g., faithfulness, relevance).
* **User Interface:** Build a simple web interface (e.g., using Streamlit or Gradio) to interact with the RAG system.
* **Chat History:** Extend the RAG chain to incorporate conversational history for more coherent multi-turn interactions.
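
As one possible starting point for the evaluation idea above, the sketch below computes a hit rate at k: the fraction of test queries whose expected code appears among the top-k retrieved codes. This measures retrieval relevance only, not faithfulness. All queries and codes are hypothetical placeholders, and `hit_rate_at_k` is an illustrative helper, not part of LangChain.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of queries whose expected code appears in the top-k retrieved codes."""
    hits = sum(
        1 for query, code in expected.items() if code in results.get(query, [])[:k]
    )
    return hits / len(expected)

# Hypothetical retrieval output: query -> retrieved codes in ranked order.
retrieved = {
    "query one": ["X001", "X007", "X003"],
    "query two": ["X004", "X002", "X009"],
    "query three": ["X005", "X006", "X008"],
}
# Hypothetical hand-labelled answers: query -> the code that should be retrieved.
expected = {"query one": "X001", "query two": "X002", "query three": "X010"}

print(hit_rate_at_k(retrieved, expected, k=1))  # one of three queries hits at rank 1
print(hit_rate_at_k(retrieved, expected, k=3))  # two of three queries hit within the top 3
```

In practice the `retrieved` mapping would be filled from the retriever's output and the `expected` mapping from a small hand-labelled test set.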