<a href="https://colab.research.google.com/github/rushabh31/genai-tutorial/blob/main/GenAI_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**What this tutorial covers:**
- Installing dependencies.
- Loading and converting a document to text using `docling`.
- Splitting text into chunks.
- Creating embeddings and storing them in a vector store.
- Loading an LLM from Hugging Face.
- Constructing a retrieval-augmented chain.
- Asking queries and seeing the retrieved context.
- Creating a Gradio UI to upload a file and interact with the RAG system.
- Finally, highlighting medical entities in the text with named entity recognition (NER).

**Prerequisites:**
- A Hugging Face Hub token. Get one from [huggingface.co](https://huggingface.co/settings/tokens).
- Acceptance of the Llama 3 license on Hugging Face.
- A GPU runtime in Colab (Runtime > Change Runtime Type > GPU).

---

## **Step 1: Install Dependencies**

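- `langchain`, `langchain-core`, `langchain-community`: Provide the tools to build LLM apps with retrieval capabilities.
- `langchain-text-splitters`, `langchain-huggingface`: Text splitting and the Hugging Face (embeddings and pipeline) integrations for LangChain.
- `sentence-transformers`: Embedding models; it also installs `transformers`, which we use to load the LLM and its tokenizer.
- `chromadb`: A vector database for embeddings.
- `bitsandbytes`: Quantization backend that `transformers` needs to load bitsandbytes-quantized checkpoints.
- `huggingface_hub` (pulled in as a dependency): Integration with the Hugging Face Hub, including login.
- `gradio`: For building a web-based UI.
- `docling`: To convert documents into structured text easily.
- `gliner`, `gliner-spacy`: GLiNER entity recognition, packaged as a spaCy pipeline component for the NER section.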

---

In [None]:
# requirements for this example:
%pip install -qq docling==2.10.0 docling-core==2.9.0 python-dotenv==1.0.1 langchain-text-splitters==0.3.2 langchain-huggingface==0.1.2 langchain-milvus==0.1.7 langchain==0.3.10  langchain-community==0.3.10 langchain-core==0.3.23 sentence-transformers==3.2.1 chromadb==0.5.23 gradio==5.8.0 bitsandbytes==0.45 gliner==0.2.13 gliner-spacy==0.0.10



## **Step 2: Import necessary modules**

- We import the classes and functions for embeddings, LLMs, retrieval chains, prompts, documents, and the docling converter.
- `Document` is the standard format LangChain uses to represent a piece of text (plus optional metadata).
- `DocumentConverter` from docling transforms DOCX and PDF files into Markdown text.

---

In [None]:
import os
import textwrap
from typing import Iterator

import torch
# LangChain 0.3 moved integrations into separate packages; the old
# langchain.vectorstores / langchain.embeddings / langchain.llms paths are deprecated
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain_text_splitters import RecursiveCharacterTextSplitter
from docling.document_converter import DocumentConverter
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline



## **Step 3: Set Hugging Face Credentials**

- Log in with your Hugging Face token so the notebook can download the Llama 3.2-based model used in Step 10.

---

In [None]:
from huggingface_hub import notebook_login
notebook_login()



## **Step 4: Define a Docling-based Document Loader**

- `DoclingLoader` takes a file path (or a list of file paths), converts each document into docling's internal structure, and then exports it as Markdown text.
- We wrap the text in a `Document` object that LangChain understands.

GitHub: https://github.com/DS4SD/docling

---

In [None]:
class DoclingLoader:
    """
    Loader that uses docling to convert PDFs to Markdown text.
    Returns LangChain Document objects with all text extracted.
    """
    def __init__(self, file_path: str | list[str]) -> None:
        self._file_paths = file_path if isinstance(file_path, list) else [file_path]
        self._converter = DocumentConverter()

    def lazy_load(self) -> Iterator[Document]:
        for source in self._file_paths:
            dl_doc = self._converter.convert(source).document
            text = dl_doc.export_to_markdown()
            yield Document(page_content=text)



## **Step 5: Upload and Load a Document with Docling**

**In a Colab environment, we can use the file upload widget.**  
After running the cell, select a DOCX or PDF file from your local machine.


**Explanation:**  
- We upload a DOCX or PDF file and store its path.
- Next, we’ll load it using our `DoclingLoader`.

---

In [None]:
from google.colab import files
uploaded = files.upload()
doc_path = list(uploaded.keys())[0]
print("Uploaded Document:", doc_path)


Saving sample_note.docx to sample_note.docx
Uploaded Document: sample_note.docx


## **Step 6: Convert the DOCX/PDF to Markdown Text**


**Explanation:**  
- We create a loader instance and convert the document.
- `raw_documents` should now contain one `Document` with the file's entire content as Markdown text.
- We print the first 500 characters to verify that docling worked correctly.

---

In [None]:
loader = DoclingLoader(doc_path)
raw_documents = list(loader.lazy_load())

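# The loader yields one Document per file, so also check that the text is non-empty
if not raw_documents or not raw_documents[0].page_content.strip():
    raise ValueError("No text could be extracted from the document.")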

# Check the first 500 characters of the extracted text
print("Extracted Text (first 500 chars):")
print(raw_documents[0].page_content[:500])


Extracted Text (first 500 chars):
Your Diagnosis
Metastatic pancreatic cancer
Acute hyponatremia
Pleural effusion (right side)
Chronic pain due to malignancy
Syncope (likely metabolic or medication-related)

What to do next
You May Need to Schedule the Following Appointments
Follow up with Dr. John Doe, MD
When: Within 3 to 5 days
Where: 12345 ABC Center, Suite 999
Tempe, AZ 85284
Phone: (800) 123-456

Follow up with Dr. John A Doe, MD (Palliative Care)
When: Within 1 to 3 days
Where: 4321 Wellness Way, Suite 210
Mesa, AZ 85202



## **Step 7: Split the Text into Chunks**
### **What is Recursive Splitting?**
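Recursive splitting divides large text into smaller chunks by trying coarse separators first (such as newlines) and falling back to finer ones only when a piece is still too large, which keeps related text together where possible.

#### **Implementation**

- **`chunk_size=1000`**: Each chunk contains at most 1000 characters.
- **`chunk_overlap=100`**: Neighboring chunks can share up to 100 characters, so text near a boundary appears in both and context carries over.
- **Separators**: Tried in order: newline (`"\n"`), then space (`" "`), then `""` (splitting between individual characters) as a last resort. This avoids cutting through lines and words whenever possible.

#### **Why Recursive Splitting?**
- Long documents can exceed the input limits of the embedding model and the LLM.
- Overlapping regions preserve continuity across chunk boundaries.
- Smaller, focused chunks make retrieval more precise, because each embedding represents a narrower piece of the text.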

#### **Connection to Workflow**:
Splitting prepares the document for embedding creation, ensuring each chunk is optimized for downstream retrieval.
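The recursive idea can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: it splits on the first separator, greedily merges pieces back up to `chunk_size`, and recurses with the remaining separators on any piece that is still too long. It assumes `""` is the last separator and `chunk_size >= 1`, and it omits `chunk_overlap` for brevity.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Simplified recursive splitter (illustration only, no chunk overlap).

    "" as the last separator means "split into individual characters".
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *finer = separators
    parts = list(text) if sep == "" else text.split(sep)

    chunks: list[str] = []
    current = ""
    for part in parts:
        if len(part) > chunk_size:
            # Still too long: flush the pending chunk, then split with finer separators
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # the piece fits: keep growing the current chunk
        else:
            chunks.append(current)  # chunk is full: start a new one with this piece
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\ngamma delta epsilon\nzeta"
for chunk in recursive_split(sample, chunk_size=12, separators=["\n", " ", ""]):
    print(repr(chunk))
```

The line `"gamma delta epsilon"` is too long for one chunk, so only that line falls back to splitting on spaces. LangChain's `RecursiveCharacterTextSplitter` additionally carries `chunk_overlap` characters between neighboring chunks and uses its own merge rules, so its boundaries will not match this sketch exactly.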

---

In [None]:
text_content = raw_documents[0].page_content
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n", " ", ""]
)
documents = text_splitter.create_documents([text_content])

print(f"Number of chunks created: {len(documents)}")
print("Sample chunk:\n", documents[0].page_content[:500])


Number of chunks created: 2
Sample chunk:
 Your Diagnosis
Metastatic pancreatic cancer
Acute hyponatremia
Pleural effusion (right side)
Chronic pain due to malignancy
Syncope (likely metabolic or medication-related)

What to do next
You May Need to Schedule the Following Appointments
Follow up with Dr. John Doe, MD
When: Within 3 to 5 days
Where: 12345 ABC Center, Suite 999
Tempe, AZ 85284
Phone: (800) 123-456

Follow up with Dr. John A Doe, MD (Palliative Care)
When: Within 1 to 3 days
Where: 4321 Wellness Way, Suite 210
Mesa, AZ 85202



## **Step 8: Create Embeddings and Build a Vector Store**

### **What are Embeddings?**
- Numerical representations of text data that encode semantic meaning.
- Example: The phrases “heart attack” and “myocardial infarction” should have similar embeddings even though they share no words.

#### **Implementation**
```python
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
```
- **Model**: A pre-trained embedding model (`all-MiniLM-L6-v2`) from the Sentence Transformers library; it maps each chunk to a 384-dimensional vector.
- **Purpose**: Converts text chunks into vectors for similarity search.

#### **Why Embeddings?**
- Enables **semantic search**, retrieving chunks based on meaning rather than exact matches.
- Helps the retriever identify the most relevant parts of the document.

#### **Connection**:
Embeddings are the foundation for building the **vector store** and performing retrieval.
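Semantic search typically compares embeddings with cosine similarity. The sketch below uses made-up 3-dimensional toy vectors, not real `all-MiniLM-L6-v2` outputs; it only shows how the score ranks a close paraphrase above an unrelated phrase.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (illustrative values only)
toy_embeddings = {
    "heart attack": [0.9, 0.1, 0.0],
    "myocardial infarction": [0.85, 0.15, 0.05],
    "parking garage": [0.0, 0.2, 0.95],
}

query = toy_embeddings["heart attack"]
for phrase in ("myocardial infarction", "parking garage"):
    print(f"{phrase}: {cosine_similarity(query, toy_embeddings[phrase]):.3f}")
```

Cosine similarity ignores vector length and compares only direction, which is why it is a common choice for comparing sentence embeddings.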

---

In [None]:
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
print("Embeddings created with model:", embedding_model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## **Step 9: Building the Vectorstore**
### **What is a Vectorstore?**
A database that stores text chunks along with their embeddings, enabling efficient similarity search.

#### **Implementation**
```python
vectorstore = Chroma.from_documents(documents, embeddings, collection_name="doc_collection")
```
- **Chroma**: A lightweight vector database that indexes the embeddings.
- **Collection**: Groups embeddings under a specific name (e.g., `"doc_collection"`).

#### **Why a Vectorstore?**
- Quickly finds the most relevant chunks when answering questions.
- Indexes the vectors (Chroma uses an approximate nearest-neighbor index), so lookups stay fast as the collection grows.

#### **Connection**:
- The vector store enables the retriever to perform similarity-based chunk retrieval for any user question.
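Conceptually, a vector store keeps (text, vector) pairs and returns the entries closest to a query vector. The brute-force class below is a hypothetical illustration of that idea with invented 2-D vectors; Chroma adds indexing, persistence, and metadata handling on top of it.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class TinyVectorStore:
    """Hypothetical in-memory store: brute-force similarity search over stored vectors."""
    texts: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, text: str, vector: list[float]) -> None:
        self.texts.append(text)
        self.vectors.append(vector)

    def similarity_search(self, query_vector: list[float], k: int = 1) -> list[str]:
        # Score every stored vector against the query, then keep the k best texts
        scored = sorted(
            zip(self.texts, self.vectors),
            key=lambda pair: cosine_similarity(query_vector, pair[1]),
            reverse=True,
        )
        return [text for text, _ in scored[:k]]


store = TinyVectorStore()
store.add("Follow up with Dr. John Doe within 3 to 5 days", [0.9, 0.1])
store.add("Hydromorphone 2 mg every 4 hours as needed", [0.1, 0.9])
print(store.similarity_search([0.8, 0.2], k=1))
```

The `k` argument plays the same role as `search_kwargs={"k": 1}` when the retriever is created in Step 11.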

In [None]:
vectorstore = Chroma.from_documents(documents, embeddings, collection_name="doc_collection")

print("Vector store created with embedded documents.")
print("Number of documents stored:", len(documents))

## **Step 10: Load the LLM Model from Hugging Face**

### **What is a Language Model?**
A model that generates text by repeatedly predicting the most likely next token (a word or word piece) given the input and everything generated so far.


This section of code demonstrates how to load and prepare a **large language model (LLM)**, configure it for text generation, and wrap it for integration with the **LangChain framework**. Below is a step-by-step explanation:

---

### **1. Define Model Name**
```python
model_name = "Shaleen123/llama3.2-3b-medical"
```
- **Purpose**: Specifies the pre-trained model to use for text generation.
- **Model Details**:
  - `Shaleen123/llama3.2-3b-medical` is a community fine-tune of the Llama 3.2 3B model, aimed at **medical use cases**.
  - As the name suggests, it is intended to generate medically relevant text; check its model card on Hugging Face for the actual training data and limitations.
- **Why a Custom Model**: A domain-specific fine-tune can improve answers for applications like clinical question answering, but its outputs still need expert review.

---

### **2. Load the Tokenizer**
```python
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
- **What is Tokenization?**
  - Tokenization is the process of breaking down text into smaller units (tokens), such as words or subwords, that the model can understand.
  - Example: The sentence "Patient has diabetes" might be tokenized as `[Patient, has, diabetes]`.
- **Purpose of Tokenizer**:
  - Maps text to numerical representations (input IDs) for the model.
  - Ensures compatibility between the input text and the model's vocabulary.
- **Why AutoTokenizer?**
  - Automatically downloads and configures a tokenizer compatible with the specified `model_name`.
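To make "maps text to input IDs" concrete, here is a toy word-level tokenizer with a hand-made vocabulary. This is a deliberate simplification: Llama tokenizers use a learned subword (BPE) vocabulary, so real splits and IDs will differ.

```python
# Hypothetical toy vocabulary; real subword vocabularies are far larger
VOCAB = {"<unk>": 0, "patient": 1, "has": 2, "diabetes": 3}


def encode(text: str) -> list[int]:
    """Lower-case, split on whitespace, and map each word to its ID (unknown words -> 0)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def decode(ids: list[int]) -> str:
    """Invert the vocabulary to turn IDs back into words."""
    id_to_word = {i: w for w, i in VOCAB.items()}
    return " ".join(id_to_word[i] for i in ids)


ids = encode("Patient has diabetes")
print(ids)
print(decode(ids))
```

Subword tokenizers avoid the `<unk>` problem shown here by breaking unfamiliar words into smaller known pieces.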

---

### **3. Load the Model**
```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
```
- **What is AutoModelForCausalLM?**
  - A class for causal language models used in **generative tasks**, such as completing sentences or answering questions.
  - Causal models predict the next token in a sequence, making them ideal for text generation.

- **Key Parameters**:
  - `device_map="auto"`: Automatically allocates model components (layers) across available hardware (e.g., GPU, CPU).
  - `torch_dtype=torch.float16`: Uses 16-bit precision to reduce memory usage and improve inference speed.

- **Why Specify Hardware?**
  - Large models like Llama 3.2 can be computationally expensive.
  - Using `auto` ensures the model leverages available GPUs for faster performance.
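A back-of-the-envelope calculation shows why 16-bit precision matters. Assuming roughly 3.2 billion parameters for a Llama 3.2 3B model (an approximation), weight memory is parameters × bytes per parameter.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 3.2e9  # assumption: approximate parameter count of a Llama 3.2 3B model

for dtype, nbytes in [("float32", 4), ("float16", 2)]:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, nbytes):.1f} GB")
```

This counts weights only; activations and the KV cache need additional memory during generation.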

---

### **4. Create a Text Generation Pipeline**
```python
llm_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=False,
    return_full_text=False,
)
```
#### **Pipeline**:
A **pipeline** is a wrapper provided by Hugging Face for performing tasks like text generation through a simple interface.

#### **Parameters Explained**:
- **`"text-generation"`**:
  - Specifies the task type (text generation).
  - Ensures the pipeline uses the model in a causal language generation mode.
  
- **`max_new_tokens=512`**:
  - Limits the number of tokens generated in the output to prevent excessively long responses.

- **`do_sample=False`**:
  - Uses greedy decoding: at each step the model picks the single highest-probability token, so the same input always yields the same output.
  - Sampling settings such as `temperature` only apply when `do_sample=True` (low values like 0.1 make sampling focused, higher values make it more varied). With greedy decoding they are ignored and `transformers` warns about them, so `temperature` is left out here.

- **`return_full_text=False`**:
  - Returns only the newly generated text. By default, the text-generation pipeline returns the prompt followed by the completion, so the chain's "answer" would begin with the entire prompt.
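The difference between greedy decoding and sampling can be sketched on a single next-token step. The candidate tokens and logits are invented for illustration; `transformers` performs the equivalent operations over the model's full vocabulary at every generation step.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates for the next token and their logits
tokens = ["3", "5", "7"]
logits = [2.0, 1.5, 0.5]

# Greedy decoding (do_sample=False): always pick the highest-scoring token
greedy = tokens[max(range(len(logits)), key=lambda i: logits[i])]

# Sampling (do_sample=True): draw from the temperature-scaled distribution
rng = random.Random(0)
for t in (0.1, 1.0):
    probs = softmax(logits, temperature=t)
    sample = rng.choices(tokens, weights=probs, k=1)[0]
    print(f"T={t}: probs={[round(p, 3) for p in probs]}, sampled={sample}")
print("greedy:", greedy)
```

At `T=0.1` almost all probability mass sits on the top token, so sampling behaves nearly like greedy decoding; at `T=1.0` the other tokens get a real chance of being picked.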

---

### **5. Wrap the Pipeline in LangChain-Compatible Wrapper**
```python
llm = HuggingFacePipeline(pipeline=llm_pipeline)
```
- **What is HuggingFacePipeline?**
  - A LangChain wrapper that allows Hugging Face pipelines to be used seamlessly within LangChain workflows.
  - Provides a standardized interface for text generation.

- **Purpose**:
  - Integrates the pipeline into a **Retrieval-Augmented Generation (RAG)** or other LangChain-supported frameworks.
  - Enables features like chaining, memory, and custom prompts.

---

#### **How Each Step Connects**
1. **Tokenizer and Model**:
   - The tokenizer prepares input text for the model.
   - The model generates predictions based on the tokenized input.

2. **Pipeline**:
   - Combines the tokenizer and model into a single interface for easy text generation.

3. **LangChain Wrapper**:
   - Prepares the pipeline for use in applications like question-answering, summarization, or document analysis.
---

In [None]:
model_name = "Shaleen123/llama3.2-3b-medical"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Create a text generation pipeline (greedy decoding; return only the generated answer)
llm_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=False,
    return_full_text=False,
)

# Wrap pipeline in HuggingFacePipeline for LangChain
llm = HuggingFacePipeline(pipeline=llm_pipeline)



Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.





## **Step 11: Retrieval-Augmented Generation (RAG) Chain**
### **What is RAG?**
Combines retrieval (finding relevant document chunks) and generation (answering questions) to create a seamless Q&A system.

#### **Implementation**
```python
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k":1})
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt}
)
```
- **Retriever**: Returns the top `k` most similar chunks for a query (here `k=1`).
- **PromptTemplate**: Specifies how the retrieved context and user question are formatted for the LLM.
- **`chain_type="stuff"`**: Inserts ("stuffs") all retrieved chunks into the prompt's `{context}` slot and calls the LLM once.
- **RetrievalQA Chain**: A pipeline that integrates retrieval and generation.

#### **Why RAG?**
- Grounds answers in the uploaded document, which reduces (but does not eliminate) unsupported answers.
- Combines semantic search with the power of generative models.

#### **Connection**:
Links the vector store, retriever, and LLM into a unified pipeline for answering questions.
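The `"stuff"` strategy can be sketched without LangChain: join the retrieved chunks, drop them into `{context}`, and send a single prompt. The `"\n\n"` separator between chunks and the model output below are assumptions for illustration, not LangChain internals.

```python
# Same template as the notebook's RAG prompt
TEMPLATE = """You are a helpful Clinical AI assistant. Use the following context to answer the question. If you don't know the answer, just say "I don't know" and do not repeat yourself.

Context:
{context}

Question:
{question}

Answer:"""


def stuff_prompt(chunks: list[str], question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    return TEMPLATE.format(context=separator.join(chunks), question=question)


retrieved = ["Follow up with Dr. John Doe, MD\nWhen: Within 3 to 5 days"]
prompt_text = stuff_prompt(retrieved, "When should the patient follow up with Dr. John Doe, MD?")
print(prompt_text)

# A text-generation pipeline that returns the full text gives back prompt + completion;
# the answer is whatever follows the prompt prefix.
generated = prompt_text + " Within 3 to 5 days."  # hypothetical model output
answer = generated[len(prompt_text):].strip()
print(answer)
```

Because everything goes into one prompt, "stuff" is simple but limited by the LLM's context window; with `k=1` that is not a concern here.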

---

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k":1})

# Define a prompt that instructs the model to use retrieved context
template = """You are a helpful Clinical AI assistant. Use the following context to answer the question. If you don't know the answer, just say "I don't know" and do not repeat yourself.

Context:
{context}

Question:
{question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt}
)

print("RetrievalQA chain created. Ready for queries!")



## **Step 12: Test a Query**


- We run a test query against our retrieval-augmented system.
- The LLM will produce an answer based on the retrieved chunks.
- Print the answer to verify the system works.

---

In [None]:
test_query = "What is the patient's primary diagnosis?"
answer = chain.invoke({"query": test_query})["result"]
print("Question:", test_query)
print("Answer:", answer)


In [None]:
test_query = "When should the patient follow up with Dr. John Doe, MD?"
answer = chain.invoke({"query": test_query})["result"]
print("Question:", test_query)
print("Answer:", answer)



Question: When should the patient follow up with Dr. John Doe, MD?
Answer: You are a helpful Clinical AI assistant. Use the following context to answer the question. If you don't know the answer, just say "I don't know" and do not repeat yourself.

Context:
Your Diagnosis
Metastatic pancreatic cancer
Acute hyponatremia
Pleural effusion (right side)
Chronic pain due to malignancy
Syncope (likely metabolic or medication-related)

What to do next
You May Need to Schedule the Following Appointments
Follow up with Dr. John Doe, MD
When: Within 3 to 5 days
Where: 12345 ABC Center, Suite 999
Tempe, AZ 85284
Phone: (800) 123-456

Follow up with Dr. John A Doe, MD (Palliative Care)
When: Within 1 to 3 days
Where: 4321 Wellness Way, Suite 210
Mesa, AZ 85202
Phone: (800) 987-654

Medications
New hydromorphone (Dilaudid 2 mg oral tablet) 1 tab(s) By mouth Every 4 hours as needed for pain
New sodium chloride (0.9% saline oral solution) 100 mL By mouth Twice daily for 7 days
New ondansetron (Zofran 

In [None]:
test_query = "What are the precribed medications?"
answer = chain.run(test_query)
# print("Question:", test_query)
print("Answer:", answer)



Answer: You are a helpful Clinical AI assistant. Use the following context to answer the question. If you don't know the answer, just say "I don't know" and do not repeat yourself.

Context:
Your Diagnosis
Metastatic pancreatic cancer
Acute hyponatremia
Pleural effusion (right side)
Chronic pain due to malignancy
Syncope (likely metabolic or medication-related)

What to do next
You May Need to Schedule the Following Appointments
Follow up with Dr. John Doe, MD
When: Within 3 to 5 days
Where: 12345 ABC Center, Suite 999
Tempe, AZ 85284
Phone: (800) 123-456

Follow up with Dr. John A Doe, MD (Palliative Care)
When: Within 1 to 3 days
Where: 4321 Wellness Way, Suite 210
Mesa, AZ 85202
Phone: (800) 987-654

Medications
New hydromorphone (Dilaudid 2 mg oral tablet) 1 tab(s) By mouth Every 4 hours as needed for pain
New sodium chloride (0.9% saline oral solution) 100 mL By mouth Twice daily for 7 days
New ondansetron (Zofran 8 mg oral tablet) 1 tab(s) By mouth Every 8 hours as needed for nau


## **Step 13: Inspect the Retrieved Context**

- We manually retrieve the chunks that the RAG system considered for the previous query.
- This step is for transparency and debugging: you can see what text influenced the model’s answer.

---

In [None]:
retrieved_docs = retriever.invoke(test_query)
for i, doc in enumerate(retrieved_docs, start=1):
    print(f"--- Retrieved Chunk {i} ---")
    print(textwrap.fill(doc.page_content, width=80))
    print()


--- Retrieved Chunk 1 ---
Your Diagnosis Metastatic pancreatic cancer Acute hyponatremia Pleural effusion
(right side) Chronic pain due to malignancy Syncope (likely metabolic or
medication-related)  What to do next You May Need to Schedule the Following
Appointments Follow up with Dr. John Doe, MD When: Within 3 to 5 days Where:
12345 ABC Center, Suite 999 Tempe, AZ 85284 Phone: (800) 123-456  Follow up with
Dr. John A Doe, MD (Palliative Care) When: Within 1 to 3 days Where: 4321
Wellness Way, Suite 210 Mesa, AZ 85202 Phone: (800) 987-654  Medications New
hydromorphone (Dilaudid 2 mg oral tablet) 1 tab(s) By mouth Every 4 hours as
needed for pain New sodium chloride (0.9% saline oral solution) 100 mL By mouth
Twice daily for 7 days New ondansetron (Zofran 8 mg oral tablet) 1 tab(s) By
mouth Every 8 hours as needed for nausea  Pharmacy Information Professional
Pharmacy: 6789 Health Top, Tempe, AZ Phone: (800) 555-12354  Discharge Orders
Discharge Today, Home with Home Health





## **Step 14: Wrap Everything in a Gradio App**

### **What is Gradio?**
A Python library for building user-friendly interfaces for machine learning models.

#### **Implementation**
- **File Upload**: Users can upload documents for processing.
- **Process Button**: Initiates document processing.
- **Question Input**: Accepts user questions for the system to answer.
- **Answer Display**: Shows generated answers and retrieved context.

#### **Why Gradio?**
- Simplifies interaction with the RAG system.
- Turns Python functions into an interactive web UI with a few lines of code, no front-end development required.

#### **Connection**:
Links the backend RAG logic to a user-facing interface for seamless interaction.

---

## **End-to-End Workflow**
1. **Upload Document**: A user uploads a document (PDF/DOCX).
2. **Process Document**: The system extracts, splits, embeds, and stores the document.
3. **Ask Questions**: The user enters a question.
4. **Retrieve Context**: The retriever finds the most relevant chunks.
5. **Generate Answer**: The LLM generates an answer grounded in the retrieved context.

---



In [None]:
import uuid

import gradio as gr

# Globals for chain/retriever after processing
chain = None
retriever = None
llm = None
model_name = "Shaleen123/llama3.2-3b-medical"


def process_document(file):
    if file is None:
        return "Please upload a document first."

    # Gradio 4+ passes the upload as a file path (str) by default; older versions pass a tempfile object
    file_path = file if isinstance(file, str) else file.name

    # Load the document with docling
    loader = DoclingLoader(file_path)
    raw_docs = list(loader.lazy_load())
    if not raw_docs or not raw_docs[0].page_content.strip():
        return "No text extracted."

    text_content = raw_docs[0].page_content

    # Split text
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        separators=["\n", " ", ""]
    )
    docs = text_splitter.create_documents([text_content])

    # Create embeddings and a vector store; a unique collection name per upload
    # keeps chunks from previously processed documents out of this index
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = Chroma.from_documents(docs, embeddings, collection_name=f"doc_{uuid.uuid4().hex}")

    # Load the model once per session; later uploads reuse the same pipeline
    global llm
    if llm is None:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model_local = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype=torch.float16,
        )
        # Greedy decoding; return only the generated answer, not the prompt
        llm_pipeline = pipeline(
            "text-generation",
            model=model_local,
            tokenizer=tokenizer,
            max_new_tokens=128,
            do_sample=False,
            return_full_text=False,
        )
        llm = HuggingFacePipeline(pipeline=llm_pipeline)

    # Create retriever and chain
    global chain, retriever
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k":1})
    template = """You are a helpful Clinical AI assistant. Use the following context to answer the question. If you don't know the answer, just say "I don't know" and do not repeat yourself.
    Your response should be in one line.
    Context:
    {context}

    Question:
    {question}

    Answer:"""
    prompt = PromptTemplate(template=template, input_variables=["context", "question"])
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt}
    )

    return "Document processed successfully! Now you can ask questions."

def answer_question(user_question):
    if chain is None or retriever is None:
        return "Please upload and process a document first.", ""

    ans = chain.invoke({"query": user_question})["result"]
    retrieved_docs = retriever.invoke(user_question)
    context_str = ""
    for i, d in enumerate(retrieved_docs, start=1):
        context_str += f"\n--- Retrieved Chunk {i} ---\n"
        context_str += textwrap.fill(d.page_content, width=80)
        context_str += "\n"
    return ans, context_str

with gr.Blocks() as demo:
    gr.Markdown("# RAG System with Llama 3.2 (Loaded Locally) and Docling")
    gr.Markdown("Upload a PDF or DOCX file, process it, then ask questions.")

    with gr.Row():
        pdf_input = gr.File(label="Upload Document", file_types=[".docx",".pdf"])
        process_btn = gr.Button("Process Document")

    status = gr.Textbox(label="Status", interactive=False)

    with gr.Row():
        question = gr.Textbox(label="Your Question")
        ask_btn = gr.Button("Ask")

    answer_output = gr.Textbox(label="Answer", lines=4)
    context_output = gr.Textbox(label="Retrieved Context (Source Chunks)", lines=10)

    process_btn.click(process_document, inputs=pdf_input, outputs=status)
    ask_btn.click(answer_question, inputs=question, outputs=[answer_output, context_output])

demo.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://1c9879b0ae2de0eb1b.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




### **What is Named Entity Recognition (NER)?**

Named Entity Recognition (NER) is a subfield of Natural Language Processing (NLP) that focuses on identifying and categorizing specific entities in a text into predefined categories. Examples of these entities include:
- **Diagnoses**: e.g., "Metastatic pancreatic cancer"
- **Medications**: e.g., "hydromorphone"
- **Persons**: e.g., "Dr. John Doe"
- **Locations**: e.g., "Tempe, AZ"

### **How Does NER Work?**

NER systems analyze text and assign "labels" to certain spans of words based on their context and meaning. The process typically involves several steps:

---

#### **1. Tokenization**
The text is split into smaller units (tokens), such as words or subwords, to allow the model to analyze each part individually.
- Example:
  Text: `"Metastatic pancreatic cancer"`
  Tokens: `["Metastatic", "pancreatic", "cancer"]`

---

#### **2. Text Analysis and Feature Extraction**
NER models evaluate the context and meaning of tokens. Pre-trained models (like `gliner_small-v2.5` in our case) learn language patterns from large datasets.
- Signals such models rely on:
  - Word embeddings (numerical representations of words).
  - The surrounding context of each word.
  - Part-of-speech tags (mainly in classical, feature-based systems; transformer models such as GLiNER learn contextual features directly from the text).

---

#### **3. Entity Detection**
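The model assigns labels to sequences of tokens. Token-level taggers commonly use the BIO scheme below; GLiNER, the model used in this notebook, instead scores candidate spans directly against the label names you supply, but the output is the same kind of labeled span.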
- **BIO Tagging** (Begin-Inside-Outside):
  - `B-diagnoses`: Beginning of a diagnosis entity.
  - `I-diagnoses`: Inside a diagnosis entity.
  - `O`: Outside any entity.

Example:
```
Text: "Metastatic pancreatic cancer"
Labels: [B-diagnoses, I-diagnoses, I-diagnoses]
```
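Turning BIO tags back into entity spans takes only a small loop. The tokens and tags below are hand-written for illustration. GLiNER, the model used later in this notebook, predicts spans directly rather than BIO tags, so this decoder applies to token-level BIO taggers in general.

```python
def bio_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, label) pairs; 'O' tokens are skipped."""
    spans: list[tuple[str, str]] = []
    current_tokens: list[str] = []
    current_label = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open entity
            continue
        if current_tokens:  # any other tag closes the open entity
            spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if tag.startswith(("B-", "I-")):
            # Start a new entity (a stray I- without a matching B- is treated as a start)
            current_tokens, current_label = [token], tag[2:]
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["Metastatic", "pancreatic", "cancer", "and", "hydromorphone"]
tags = ["B-diagnoses", "I-diagnoses", "I-diagnoses", "O", "B-medication"]
print(bio_to_spans(tokens, tags))
```

A new `B-` tag always closes the previous entity, which is how two adjacent entities of the same type stay separate.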

---

#### **4. Categorization**
The system maps detected entities to one of the predefined categories:
- **diagnoses**: Medical conditions.
- **medication**: Drugs or prescriptions.
- **phone number**: Contact numbers.

---

### **How NER Fits Into Our Code?**

Our specific code performs NER in a medical context, with additional support for contact and address-related entities. Here's how each component works:

---

#### **1. Loading the Model**
```python
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)
```
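- **Model**: We use `gliner_small-v2.5`, a general-purpose GLiNER model. Rather than being trained on a fixed label set, it takes the entity types (medical terms, phone numbers, street addresses, and so on) as plain-text labels at inference time.
- **Pipeline**: `spacy.blank("en")` creates an empty English pipeline (just a tokenizer), and `add_pipe("gliner_spacy", ...)` adds GLiNER as its entity recognizer.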

---

#### **2. Defining Predefined Entity Categories**
```python
"labels": ["diagnoses", "medication", "symptoms", "person", "labs", "city", "state", "phone number", "street address"]
```
These categories specify the types of entities the model will detect:
- **Medical categories**: `diagnoses`, `medication`, `symptoms`.
- **Contact categories**: `phone number`, `street address`.
- **Location categories**: `city`, `state`.
- **Other categories**: `person`, `labs`.

---

#### **3. Processing the Text**
```python
doc = nlp(text)
```
- **Input**: Text containing medical information, appointments, medications, and contact details.
- **Output**: A spaCy `doc` object that stores:
  - **Entities**: Recognized spans of text (e.g., "Dr. John Doe").
  - **Labels**: The categories assigned to those spans (e.g., `person`).

---

#### **4. Visualizing the Results**
```python
displacy.render(doc, style="ent", options=options, jupyter=True)
```
- **Goal**: Highlight detected entities in the text for easy interpretation.
- **Customization**: We assign light matte colors to different entity types for clarity.

---

### **Why Are We Using NER in This Context?**

1. **Extracting Relevant Medical Information**
   - Automatically identify key medical terms like diagnoses, medications, and symptoms.

2. **Improving Readability**
   - Highlighted entities make it easier to understand the text.

3. **Facilitating Structured Data Analysis**
   - Extracted entities can be converted into structured data formats (e.g., a database of diagnoses and medications).

4. **Domain-Specific Adaptation**
   - Because GLiNER accepts arbitrary labels, we can target medical entity types simply by listing them, without training a dedicated medical NER model.
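To illustrate point 3, the sketch below loads extracted entities into an in-memory SQLite table using Python's built-in `sqlite3` module and then queries it like any other table. The table name, column layout, and sample rows (including the scores) are made-up assumptions for the example; with the real pipeline, each row would be `(ent.text, ent.label_, ent._.score)` for an `ent` in `doc.ents`.

```python
import sqlite3

# Hypothetical rows standing in for (ent.text, ent.label_, ent._.score).
rows = [
    ("Metastatic pancreatic cancer", "diagnoses", 0.95),
    ("Acute hyponatremia", "diagnoses", 0.91),
    ("hydromorphone", "medication", 0.88),
    ("ondansetron", "medication", 0.86),
    ("Tempe", "city", 0.80),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (text TEXT, label TEXT, score REAL)")
conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", rows)

# Every medication, highest-confidence first.
medications = [
    text for (text,) in conn.execute(
        "SELECT text FROM entities WHERE label = ? ORDER BY score DESC",
        ("medication",),
    )
]
print(medications)

# Number of entities found per category.
counts = dict(conn.execute("SELECT label, COUNT(*) FROM entities GROUP BY label"))
print(counts)
conn.close()
```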

---

### **How NER Improves Usability in Our Code**
- **Simplifies Information Retrieval**: Users don’t need to manually parse long medical reports; the system highlights key details.
- **Adaptable to Other Domains**: This setup targets healthcare, but changing the `labels` list adapts it to other fields such as finance or law.
- **Supports Automated Workflows**: The extracted entities can integrate with downstream tasks, such as generating summaries or reports.
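As a sketch of the "generating summaries or reports" idea, the hypothetical function below turns a mapping from category to extracted mentions (for example, entities grouped by label) into a short plain-text report. The section titles and layout are arbitrary choices made for illustration, and the sample mentions are taken from the discharge note used in this section.

```python
def format_report(entities_by_label, titles):
    """Render {label: [mentions]} as a plain-text report.

    `titles` maps each label to a readable heading and fixes the section
    order; labels with no mentions are listed as "(none found)".
    """
    lines = []
    for label, title in titles.items():
        mentions = entities_by_label.get(label, [])
        lines.append(f"{title}:")
        if mentions:
            lines.extend(f"  - {mention}" for mention in mentions)
        else:
            lines.append("  (none found)")
    return "\n".join(lines)


extracted = {
    "diagnoses": ["Metastatic pancreatic cancer", "Acute hyponatremia"],
    "medication": ["hydromorphone", "sodium chloride", "ondansetron"],
}
report = format_report(extracted, {"diagnoses": "Diagnoses",
                                   "medication": "Medications",
                                   "symptoms": "Symptoms"})
print(report)
```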

### **Conclusion**
Named Entity Recognition is a foundational NLP task that turns unstructured text into structured data. In this project, we apply NER to highlight medical and contact-related entities, aiding both readability and further data processing.

In [None]:
import spacy
from spacy import displacy

# Custom spaCy configuration
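custom_spacy_config = {
    "gliner_model": "gliner-community/gliner_small-v2.5",  # GLiNER checkpoint on the Hugging Face Hub
    "chunk_size": 250,  # long texts are split into chunks of this size before inference
    # Entity types the model should look for (free-text labels)
    "labels": ["diagnoses", "medication", "symptoms", "person", "labs", "city", "state", "phone number", "street address"],
    "style": "ent",  # store predictions as doc.ents so displaCy's "ent" style can render them
    "map_location": "cuda"  # run on the GPU; use "cpu" if no GPU runtime is available
}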

# Load spaCy pipeline
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Input text
text = """Your Diagnosis
Metastatic pancreatic cancer
Acute hyponatremia
Pleural effusion (right side)
Chronic pain due to malignancy
Syncope (likely metabolic or medication-related)
What to do next
You May Need to Schedule the Following Appointments
Follow up with Dr. John Doe, MD
When: Within 3 to 5 days
Where: 12345 ABC Center, Suite 999
Tempe, AZ 85284
Phone: (800) 123-456
Follow up with Dr. John A Doe, MD (Palliative Care)
When: Within 1 to 3 days
Where: 4321 Wellness Way, Suite 210
Mesa, AZ 85202
Phone: (800) 987-654
Medications
New hydromorphone (Dilaudid 2 mg oral tablet) 1 tab(s) By mouth Every 4 hours as needed for pain
New sodium chloride (0.9% saline oral solution) 100 mL By mouth Twice daily for 7 days
New ondansetron (Zofran 8 mg oral tablet) 1 tab(s) By mouth Every 8 hours as needed for nausea
Pharmacy Information
Professional Pharmacy: 6789 Health Top, Tempe, AZ
Phone: (800) 555-12354
Discharge Orders
Discharge Today, Home with Home Health
DME Supply DME Ordered: Other: See Special Instructions, Special Instructions: Transport wheelchair provided, Orthopedic/Neuro Needs: N/A, Laterality: Not Applicable, Diagnosis: Metastatic pancreatic cancer
Case Management Instructions
Final Transition Plan
Additional Discharge Details: HealthCare Partners Network (800) 555-0000 will arrange home health services for skilled nursing and palliative support. If they have not contacted you within 24-48 hours, please call them for updates.
Your medical equipment provider is HomeCare Medical Supply (800) 555-1111, who has issued a transport wheelchair for mobility assistance. For any technical issues or repairs, contact them directly.
"""

# Process the text
doc = nlp(text)

# Background color for each entity label (light pastel hex codes)
colors = {
    "diagnoses": "#f9d5d3",        # Soft pink
    "medication": "#d3e5f9",      # Light blue
    "symptoms": "#d9f9d3",        # Light green
    "person": "#f9f4d3",          # Pale yellow
    "labs": "#e2d3f9",            # Lavender
    "city": "#f9e4d3",            # Peach
    "state": "#d3f9e8",           # Mint green
    "phone number": "#f9d3e6",    # Light rose
    "street address": "#d3f3f9"   # Aqua
}

# Create options for displaCy visualization
options = {
    "colors": colors
}

# Visualize entities with custom colors
displacy.render(doc, style="ent", options=options, jupyter=True)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [None]:
# Print extracted entities
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.score)
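Because each prediction carries a confidence score, a simple post-processing step can keep only the confident ones before they reach downstream code. The sketch below filters `(text, label, score)` tuples against a cutoff and, when the same text was tagged with more than one label, keeps the highest-scoring one. `filter_entities`, the 0.5 cutoff, and the sample predictions are illustrative assumptions, not values from this notebook's run.

```python
def filter_entities(entities, min_score=0.5):
    """Keep confident predictions, with one (best-scoring) label per text."""
    best = {}
    for text, label, score in entities:
        if score < min_score:
            continue  # drop low-confidence predictions
        if text not in best or score > best[text][1]:
            best[text] = (label, score)
    return [(text, label, score) for text, (label, score) in best.items()]


# Made-up predictions in the same shape as (ent.text, ent.label_, ent._.score).
preds = [
    ("Tempe", "city", 0.82),
    ("Tempe", "state", 0.41),
    ("Syncope", "symptoms", 0.55),
    ("Syncope", "diagnoses", 0.71),
    ("Suite 999", "street address", 0.32),
]
print(filter_entities(preds))
# → [('Tempe', 'city', 0.82), ('Syncope', 'diagnoses', 0.71)]
```

With the pipeline above, the input would be `((ent.text, ent.label_, ent._.score) for ent in doc.ents)`.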