In [1]:
!git clone https://github.com/pyyas-star/colab.git
%cd colab

Cloning into 'colab'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 36 (delta 4), reused 25 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (36/36), 46.00 KiB | 11.50 MiB/s, done.
Resolving deltas: 100% (4/4), done.
/content/colab


# Domain-Specific Medical LLM Q&A Agent (RAG + LLM)

**A Retrieval-Augmented Generation System for Safe, Evidence-Based Medical Question Answering**

---

## 📋 Project Overview

This notebook implements a **Domain-Specific Medical Q&A Agent** using **Retrieval-Augmented Generation (RAG)** combined with an **LLM** (`google/flan-t5-base` in the default configuration). The system is designed to answer medical questions using **only trusted, publicly available medical guidelines** (e.g., WHO, CDC), making it suitable for health-tech applications.

### Key Features:
- ✅ **RAG Architecture**: Combines retrieval with LLM generation for accurate, evidence-based answers
- ✅ **Medical Domain Focus**: Uses only trusted medical guidelines (WHO, CDC)
- ✅ **Safety First**: Built-in safety checks and medical disclaimers
- ✅ **Citation Support**: All answers include source citations
- ✅ **Production-Ready**: Modular, well-documented, and deployment-ready

---

## 🏗️ Architecture Overview

### What is RAG?

**Retrieval-Augmented Generation (RAG)** is a technique that enhances LLM responses by:

1. **Retrieving** relevant documents from a knowledge base
2. **Augmenting** the LLM prompt with retrieved context
3. **Generating** answers based on the provided context

### Why RAG for Medical Applications?

- **Reduces Hallucinations**: LLMs alone can generate plausible but incorrect medical information
- **Evidence-Based**: Answers are grounded in actual medical guidelines
- **Up-to-Date**: Knowledge base can be updated without retraining the model
- **Transparency**: Citations show where information comes from
- **Safety**: Built-in validation helps catch potentially harmful responses

### Pipeline Flow:

```
Medical Documents → Chunking → Embeddings → Vector Store
                                                 ↓
User Query → Embedding → Vector Search → Retrieve Top-K Docs
                                                 ↓
Retrieved Context + Query → LLM Prompt → Generated Answer + Citations
```
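
The flow above can be sketched end to end with a toy example. This is a minimal, dependency-free illustration and not the repository's `RAGPipeline`: the word-overlap scorer stands in for embedding search, `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names, the three documents are made up, and the final LLM call is left out.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, docs: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by word overlap with the query (toy scorer)."""
    query_tokens = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: List[str]) -> str:
    """Augment the query with numbered context passages."""
    numbered = "\n".join(f"[{i}] {passage}" for i, passage in enumerate(context, 1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Made-up knowledge base for illustration only
docs = [
    "Malaria symptoms include fever, headaches, and chills.",
    "Insecticide-treated nets help prevent malaria.",
    "Tuberculosis is treated with a course of antibiotics.",
]
query = "What are the symptoms of malaria?"

context = retrieve(query, docs, top_k=2)  # 1. retrieve
prompt = build_prompt(query, context)     # 2. augment
print(prompt)                             # 3. this prompt would be sent to the LLM
```

The symptoms passage ranks first because it shares the most words with the query; an embedding model plays the same role in the real pipeline but matches on meaning rather than exact words.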

---

## ⚠️ Medical Disclaimer

**IMPORTANT**: This system provides general medical information for educational purposes only. It is **NOT** a substitute for professional medical advice, diagnosis, or treatment. Always consult qualified healthcare providers for personal medical concerns.


## 🔧 Section 1: Setup & Installation

### Google Colab Setup

If running in Google Colab, enable GPU for faster processing:
- Runtime → Change runtime type → GPU (T4 or better)

### Install Dependencies


In [2]:
# Install required packages
%pip install -q torch transformers sentence-transformers faiss-cpu accelerate numpy pandas pyyaml requests tqdm



In [3]:
# Check GPU availability
import torch
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("Using CPU (slower but works)")


CUDA Available: True
GPU Device: Tesla T4
GPU Memory: 15.83 GB


In [4]:
# Import all required libraries
import sys
from pathlib import Path
import json
import yaml
from typing import List, Dict, Optional
import numpy as np
import pandas as pd
from tqdm import tqdm

# Add utils to path
sys.path.append(str(Path.cwd()))

# Import utility modules
from utils import (
    load_data, clean_text, structure_documents,
    MedicalChunker, extract_metadata,
    EmbeddingGenerator,
    VectorStore, build_vector_store,
    RetrievalEngine,
    RAGPipeline,
    RAGEvaluator,
    MedicalSafetyChecker
)

print("✅ All imports successful!")


✅ All imports successful!


## 📥 Section 2: Data Ingestion

### Why Preprocessing Matters for Medical Accuracy

Medical documents often contain:
- Formatting inconsistencies
- Special characters and symbols
- Multiple languages or translations
- Structured sections (symptoms, treatment, prevention)

Proper preprocessing ensures:
- **Better Retrieval**: Clean text improves embedding quality
- **Reduced Noise**: Removes irrelevant formatting
- **Consistent Structure**: Makes chunking more effective
- **Metadata Extraction**: Identifies diseases, sections, sources
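
As a rough illustration of the cleaning and metadata steps, the sketch below collapses whitespace and pulls the title, source, and date out of a header. It is not the repository's `clean_text` or `extract_metadata`: `normalize_whitespace`, `parse_header`, and `HEADER_RE` are hypothetical names, and the `Title: ... Source: ... Date: ...` header layout is an assumption based on the sample text this notebook prints.

```python
import re
from typing import Dict


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Assumed header layout: "Title: ... Source: ... Date: YYYY"
HEADER_RE = re.compile(
    r"Title:\s*(?P<title>.+?)\s+Source:\s*(?P<source>.+?)\s+Date:\s*(?P<date>\d{4})"
)


def parse_header(text: str) -> Dict[str, str]:
    """Return title/source/date from the header, or an empty dict if absent."""
    match = HEADER_RE.search(text)
    return match.groupdict() if match else {}


raw = """Title: Malaria Prevention and Treatment Guidelines
Source: World Health Organization (WHO)
Date: 2024
---
OVERVIEW
Malaria is   preventable and curable."""

cleaned = normalize_whitespace(raw)
metadata = parse_header(cleaned)
print(metadata)
```

Parsing the header before chunking lets every chunk carry its source, which is what makes a specific citation possible later.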


In [5]:
# Load configuration
config_path = Path("config/config.yaml")
if config_path.exists():
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
else:
    # Default configuration
    config = {
        'chunking': {'chunk_size': 1000, 'chunk_overlap': 200},
        'embedding': {'model_name': 'sentence-transformers/all-MiniLM-L6-v2', 'batch_size': 32},
        'retrieval': {'top_k': 5},
        'llm': {'model_name': 'google/flan-t5-base', 'max_length': 512}
    }

print("Configuration loaded:")
print(json.dumps(config, indent=2))


Configuration loaded:
{
  "embedding": {
    "model_name": "sentence-transformers/all-MiniLM-L6-v2",
    "batch_size": 32,
    "cache_dir": "./cache/embeddings"
  },
  "chunking": {
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "min_chunk_size": 100
  },
  "vector_store": {
    "index_type": "flat",
    "save_path": "./data/vector_store"
  },
  "retrieval": {
    "top_k": 5,
    "score_threshold": 0.0,
    "rerank": false,
    "rerank_method": "similarity"
  },
  "llm": {
    "model_name": "google/flan-t5-base",
    "use_api": false,
    "api_key": null,
    "max_length": 512,
    "temperature": 0.3
  },
  "safety": {
    "enable_validation": true,
    "add_disclaimer": true,
    "emergency_detection": true
  },
  "data": {
    "sample_file": "./data/sample_medical_guidelines.txt",
    "test_questions_file": "./data/test_questions.json"
  },
  "evaluation": {
    "benchmark_output": "./results/benchmark_results.json",
    "compare_with_baseline": true
  }
}


In [6]:
# Load medical documents
data_file = Path("data/sample_medical_guidelines.txt")

if data_file.exists():
    raw_documents = load_data(data_file, source_type="file")
    print(f"✅ Loaded {len(raw_documents)} document(s)")
    print(f"Total characters: {sum(len(doc) for doc in raw_documents):,}")
else:
    print("⚠️ Sample data file not found. Using empty list.")
    raw_documents = []


✅ Loaded 1 document(s)
Total characters: 6,867


In [7]:
# Clean and preprocess documents
cleaned_documents = [clean_text(doc) for doc in raw_documents]

print(f"✅ Cleaned {len(cleaned_documents)} document(s)")
print(f"\nSample cleaned text (first 500 chars):")
if cleaned_documents:
    print(cleaned_documents[0][:500] + "...")


✅ Cleaned 1 document(s)

Sample cleaned text (first 500 chars):
Title: Malaria Prevention and Treatment Guidelines Source: World Health Organization (WHO) Date: 2024 --- OVERVIEW Malaria is a life-threatening disease caused by parasites that are transmitted to people through the bites of infected female Anopheles mosquitoes. It is preventable and curable. In 2022, there were an estimated 249 million malaria cases worldwide. SYMPTOMS The first symptoms of malaria are usually very similar to flu  they include: - A high temperature (fever) - Headaches - Sweats ...


In [8]:
# Structure documents with metadata
structured_docs = structure_documents(cleaned_documents)

print(f"✅ Structured {len(structured_docs)} document(s)")
print(f"\nSample document structure:")
if structured_docs:
    sample = structured_docs[0]
    print(f"ID: {sample['id']}")
    print(f"Title: {sample.get('title', 'N/A')}")
    print(f"Source: {sample.get('source', 'N/A')}")
    print(f"Text length: {len(sample['text'])} characters")


✅ Structured 1 document(s)

Sample document structure:
ID: doc_0
Title: 
Source: unknown
Text length: 6728 characters


## ✂️ Section 3: Text Chunking & Preprocessing

### Why Chunk Size Affects Retrieval Accuracy

**Chunk Size Trade-offs:**
- **Too Small**: Loses context, fragments information
- **Too Large**: Includes irrelevant information, reduces precision
- **Reasonable default**: 500-1000 characters is a common starting point that keeps enough context without diluting precision

**Overlap Importance:**
- Prevents information loss at chunk boundaries
- Ensures continuity for multi-chunk answers
- Typical overlap: 10-20% of chunk size
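
To make the size and overlap trade-off concrete, here is a minimal sliding-window chunker. It is a sketch only, not the repository's `MedicalChunker`, which may split on sections or sentences instead: `chunk_by_chars` is a hypothetical helper that cuts purely by character count.

```python
from typing import List


def chunk_by_chars(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Small example so the overlap is visible
print(chunk_by_chars("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']

# Same settings as the notebook's configuration
chunks = chunk_by_chars("x" * 6728, chunk_size=1000, chunk_overlap=200)
print(len(chunks), len(chunks[-1]))  # 9 328
```

With a 1000-character window and 200 characters of overlap, a 6,728-character text yields nine windows: eight full ones and a final 328-character remainder.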


In [9]:
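# Initialize chunker (all three settings come from the config, with defaults)
chunking_config = config.get('chunking', {})
chunk_size = chunking_config.get('chunk_size', 1000)
chunk_overlap = chunking_config.get('chunk_overlap', 200)
min_chunk_size = chunking_config.get('min_chunk_size', 100)

chunker = MedicalChunker(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    min_chunk_size=min_chunk_size
)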

print(f"✅ Chunker initialized:")
print(f"  - Chunk size: {chunk_size} characters")
print(f"  - Overlap: {chunk_overlap} characters")


✅ Chunker initialized:
  - Chunk size: 1000 characters
  - Overlap: 200 characters


In [10]:
# Chunk all documents
all_chunks = []

for doc in structured_docs:
    # Extract metadata
    enhanced_metadata = extract_metadata(doc['text'], doc.get('metadata', {}))

    # Chunk document
    chunks = chunker.chunk_text(
        doc['text'],
        document_id=doc['id'],
        metadata=enhanced_metadata
    )

    # Convert to dict format
    for chunk in chunks:
        all_chunks.append({
            'text': chunk.text,
            'chunk_id': chunk.chunk_id,
            'document_id': chunk.document_id,
            'metadata': chunk.metadata
        })

print(f"✅ Created {len(all_chunks)} chunks from {len(structured_docs)} documents")
print(f"\nChunk statistics:")
chunk_lengths = [len(c['text']) for c in all_chunks]
print(f"  - Average length: {np.mean(chunk_lengths):.0f} characters")
print(f"  - Min length: {min(chunk_lengths)} characters")
print(f"  - Max length: {max(chunk_lengths)} characters")


✅ Created 1 chunks from 1 documents

Chunk statistics:
  - Average length: 6728 characters
  - Min length: 6728 characters
  - Max length: 6728 characters


## 🔢 Section 4: Embedding & Vector Store

### How Embeddings Represent Medical Concepts

**Embeddings** convert text into numerical vectors that capture semantic meaning:
- Similar medical concepts have similar vectors
- Enables mathematical similarity search
- Preserves relationships between terms

**Why Vector Search is Required:**
- **Speed**: Document embeddings are computed once, so each query needs only one embedding and a nearest-neighbor lookup
- **Semantic Understanding**: Finds conceptually similar content
- **Scalability**: Handles large knowledge bases efficiently
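
The idea that "similar concepts have similar vectors" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and the values here are not produced by any model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (hand-picked, not model output)
fever = [0.9, 0.1, 0.0]
high_temperature = [0.8, 0.2, 0.1]
mosquito_net = [0.0, 0.2, 0.9]

print(f"fever vs high temperature: {cosine_similarity(fever, high_temperature):.2f}")
print(f"fever vs mosquito net:     {cosine_similarity(fever, mosquito_net):.2f}")
```

The two symptom-like vectors point in nearly the same direction and score close to 1, while the unrelated pair scores close to 0. A vector store runs this kind of comparison between the query and every stored chunk.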


In [11]:
# Initialize embedding generator
model_name = config.get('embedding', {}).get('model_name', 'sentence-transformers/all-MiniLM-L6-v2')
batch_size = config.get('embedding', {}).get('batch_size', 32)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

embedding_generator = EmbeddingGenerator(
    model_name=model_name,
    device=device
)

print(f"✅ Embedding generator initialized")
print(f"  - Model: {model_name}")
print(f"  - Embedding dimension: {embedding_generator.embedding_dim}")


Using device: cuda


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Embedding generator initialized
  - Model: sentence-transformers/all-MiniLM-L6-v2
  - Embedding dimension: 384


In [12]:
# Generate embeddings for all chunks
chunk_texts = [chunk['text'] for chunk in all_chunks]

print(f"Generating embeddings for {len(chunk_texts)} chunks...")
embeddings = embedding_generator.generate_embeddings(
    chunk_texts,
    batch_size=batch_size,
    show_progress=True
)

print(f"\n✅ Generated embeddings")
print(f"  - Shape: {embeddings.shape}")
print(f"  - Data type: {embeddings.dtype}")


Generating embeddings for 1 chunks...



✅ Generated embeddings
  - Shape: (1, 384)
  - Data type: float32


In [13]:
# Build vector store
vector_store_path = Path("data/vector_store")

vector_store = build_vector_store(
    embeddings=embeddings,
    chunks=all_chunks,
    save_path=vector_store_path
)

print(f"✅ Vector store built and saved")
print(f"  - Total vectors: {vector_store.get_stats()['total_vectors']}")
print(f"  - Saved to: {vector_store_path}")


✅ Vector store built and saved
  - Total vectors: 1
  - Saved to: data/vector_store


## 🔍 Section 5: Retrieval Pipeline

### Why Retrieval Reduces Hallucinations

**Without Retrieval (LLM-only):**
- LLM relies on training data (may be outdated)
- Can generate plausible but incorrect information
- No way to verify sources

**With Retrieval (RAG):**
- Answers grounded in actual documents
- Can cite specific sources
- Knowledge base can be updated independently
- Reduces fabrication of medical facts
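
Retrieval as described above comes down to scoring chunks against the query and keeping the best few. The following sketch shows only that selection step, assuming similarity scores have already been computed; `select_top_k` is a hypothetical helper, not the `RetrievalEngine` API, and the chunk IDs and scores are invented.

```python
import heapq
from typing import List, Tuple


def select_top_k(
    scored: List[Tuple[str, float]],
    top_k: int = 5,
    score_threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Drop chunks below the threshold, then return the top_k highest-scoring ones."""
    kept = [item for item in scored if item[1] >= score_threshold]
    return heapq.nlargest(top_k, kept, key=lambda item: item[1])


# Invented (chunk_id, similarity) pairs for illustration
scored_chunks = [
    ("chunk_0", 0.82),
    ("chunk_1", 0.15),
    ("chunk_2", 0.64),
    ("chunk_3", -0.05),
]

print(select_top_k(scored_chunks, top_k=2, score_threshold=0.2))
# [('chunk_0', 0.82), ('chunk_2', 0.64)]
```

`top_k` is an upper bound, not a guarantee: when the store holds fewer vectors than `top_k`, or the threshold filters some out, fewer chunks come back.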


In [14]:
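# Initialize retrieval engine (settings come from the config, with defaults)
retrieval_config = config.get('retrieval', {})
top_k = retrieval_config.get('top_k', 5)
score_threshold = retrieval_config.get('score_threshold', 0.0)

retrieval_engine = RetrievalEngine(
    vector_store=vector_store,
    embedding_generator=embedding_generator,
    top_k=top_k,
    score_threshold=score_threshold
)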

print(f"✅ Retrieval engine initialized")
print(f"  - Top-K: {top_k}")


✅ Retrieval engine initialized
  - Top-K: 5


## 🤖 Section 6: LLM Answer Generation (RAG)

### How the LLM Uses Context from Guidelines

**Prompt Construction:**
1. Retrieved documents are formatted as context
2. User query is added
3. Instructions guide the LLM to use only the context
4. Citation requirements are specified

**Why Prompting is Critical:**
- Steers the LLM away from relying on possibly outdated training data
- Ensures answers are evidence-based
- Enforces citation requirements
- Maintains medical safety standards
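
The four prompt-construction steps can be sketched as a template plus a formatter. This is an illustrative layout, not the prompt used inside `RAGPipeline`: `PROMPT_TEMPLATE`, `format_context`, and `build_rag_prompt` are hypothetical names, and the wording of the instructions is an assumption.

```python
from typing import Dict, List

# Hypothetical template: instructions, numbered context, then the question
PROMPT_TEMPLATE = """You are a medical information assistant.
Answer the question using ONLY the numbered context passages below.
Cite the passages you used as [1], [2], and so on.
If the context does not contain the answer, say that you do not know.

Context:
{context}

Question: {question}
Answer:"""


def format_context(passages: List[Dict[str, str]]) -> str:
    """Number each passage and prefix it with its source so it can be cited."""
    return "\n".join(
        f"[{i}] ({p.get('source', 'unknown')}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    )


def build_rag_prompt(question: str, passages: List[Dict[str, str]]) -> str:
    """Combine instructions, retrieved context, and the user question."""
    return PROMPT_TEMPLATE.format(context=format_context(passages), question=question)


passages = [
    {"source": "WHO", "text": "The first symptoms of malaria include fever and headaches."},
    {"text": "Symptoms usually appear between 7 and 18 days after being bitten."},
]
rag_prompt = build_rag_prompt("What are the symptoms of malaria?", passages)
print(rag_prompt)
```

Numbering the passages gives the model a stable handle for citations, and the explicit "say that you do not know" instruction gives it an alternative to guessing when retrieval comes back empty or off-topic.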


In [15]:
# Initialize RAG pipeline
llm_model_name = config.get('llm', {}).get('model_name', 'google/flan-t5-base')
use_api = config.get('llm', {}).get('use_api', False)
api_key = config.get('llm', {}).get('api_key')

print(f"Initializing RAG Pipeline...")
print(f"  - LLM Model: {llm_model_name}")
print(f"  - Use API: {use_api}")

rag_pipeline = RAGPipeline(
    retrieval_engine=retrieval_engine,
    llm_model_name=llm_model_name,
    use_api=use_api,
    api_key=api_key
)

print(f"\n✅ RAG Pipeline initialized")


Initializing RAG Pipeline...
  - LLM Model: google/flan-t5-base
  - Use API: False


`torch_dtype` is deprecated! Use `dtype` instead!
ERROR:utils.rag_pipeline:Error loading LLM: Unrecognized configuration class <class 'transformers.models.t5.configuration_t5.T5Config'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of ApertusConfig, ArceeConfig, AriaTextConfig, BambaConfig, BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BitNetConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, BltConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, Cohere2Config, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, DeepseekV2Config, DeepseekV3Config, DiffLlamaConfig, DogeConfig, Dots1Config, ElectraConfig, Emu3Config, ErnieConfig, Ernie4_5Config, Ernie4_5_MoeConfig, Exaone4Config, FalconConfig, FalconH1Config, FalconMambaConfig, FlexOlmoConfig, FuyuConfig, GemmaConfig, Gemma2Config, Gemma3Config, Gemma3TextConfig, Gemma3nConfig, Gemma3nTextConfig, GitConfig, GlmConfig, Glm4Config

Device set to use cuda:0



✅ RAG Pipeline initialized
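
The error logged above occurs because `AutoModelForCausalLM` does not accept `T5Config`: FLAN-T5 is an encoder-decoder model, and in `transformers` such models load through `AutoModelForSeq2SeqLM`. One way to avoid the failed first attempt is to pick the Auto class from the model's config. The sketch below is an assumption about how that could look, not what `utils.rag_pipeline` actually does; `auto_class_name` and `load_llm` are hypothetical helpers.

```python
def auto_class_name(is_encoder_decoder: bool) -> str:
    """Name of the transformers Auto class that matches the architecture."""
    return "AutoModelForSeq2SeqLM" if is_encoder_decoder else "AutoModelForCausalLM"


def load_llm(model_name: str):
    """Load tokenizer and model with the Auto class chosen from the config.

    Hypothetical helper; transformers is imported lazily so the selection
    logic above can be used without the library installed.
    """
    import transformers

    config = transformers.AutoConfig.from_pretrained(model_name)
    class_name = auto_class_name(getattr(config, "is_encoder_decoder", False))
    model_class = getattr(transformers, class_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    model = model_class.from_pretrained(model_name)
    return tokenizer, model


# T5-family configs are encoder-decoder, so the seq2seq class is selected
print(auto_class_name(True))
```

Calling `load_llm("google/flan-t5-base")` would download the model weights, so it is left out of the example run.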


In [22]:
# Test the complete RAG pipeline
test_query = "What are the symptoms of malaria?"

print(f"Query: {test_query}\n")
print("Generating answer...\n")

result = rag_pipeline.generate_answer(
    query=test_query,
    top_k=5,
    include_citations=True
)

print("=" * 80)
print("ANSWER:")
print(result['answer'])
print("=" * 80)
# print("\n" + "=" * 80)
# print(f"\nSources: {len(result['sources'])}")
print(f"Valid: {result['is_valid']}")


Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Query: What are the symptoms of malaria?

Generating answer...

ANSWER:
A high temperature (fever) Headaches Sweats and chills Muscle aches or pains Vomiting and/or diarrhoea Tiredness and loss of energy Symptoms usually appear between 7 and 18 days after being bitten, but in some cases can take up to a year or occasionally even longer. If not treated promptly, malaria can cause severe complications, including:

**Sources:**
[1] Medical Guidelines


---
**Medical Disclaimer**: This information is for educational purposes only and is not a substitute for professional medical advice, diagnosis, or treatment. Always seek the advice of your physician or other qualified health provider with any questions you may have regarding a medical condition. Never disregard professional medical advice or delay in seeking it because of something you have read here.
Valid: True


## 🧪 Section 7: Testing & Evaluation

### Real-World Performance Demonstration

This section demonstrates:
- Answer quality on medical questions
- Citation accuracy
- Comparison with baseline (LLM-only)
- Evaluation metrics
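
As one example of an evaluation metric, the sketch below computes keyword recall: the fraction of expected terms that appear in an answer. It is a simple illustration, not the repository's `RAGEvaluator`. `keyword_recall` is a hypothetical helper, and the expected keywords are hand-written here because the schema of `test_questions.json` beyond the `query` field is not shown in this notebook.

```python
from typing import Iterable


def keyword_recall(answer: str, expected_keywords: Iterable[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    keywords = [k.lower() for k in expected_keywords]
    if not keywords:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for k in keywords if k in answer_lower)
    return hits / len(keywords)


# Shortened answer text and hand-picked keywords, for illustration only
answer_text = "A high temperature (fever) Headaches Sweats and chills Muscle aches or pains"
recall = keyword_recall(answer_text, ["fever", "headaches", "chills", "vomiting"])
print(recall)  # 0.75
```

Substring matching is crude: it misses synonyms and rewards answers that merely mention a term, so a score like this is best read alongside manual review of the answers.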


In [17]:
# Load test questions
test_questions_file = Path("data/test_questions.json")

if test_questions_file.exists():
    with open(test_questions_file, 'r') as f:
        test_questions = json.load(f)
    print(f"✅ Loaded {len(test_questions)} test questions")
else:
    print("⚠️ Test questions file not found. Using sample questions.")
    test_questions = [
        {"query": "What are the symptoms of malaria?"},
        {"query": "How is tuberculosis treated?"}
    ]


✅ Loaded 8 test questions


In [20]:
# Test on multiple questions
print("Testing RAG System on Medical Questions:\n")
print("=" * 80)

for i, test_case in enumerate(test_questions[:3], 1):  # Test first 3
    query = test_case['query']
    print(f"\n[{i}] Query: {query}")
    print("-" * 80)

    result = rag_pipeline.generate_answer(query, top_k=3)

    print(f"Answer: {result['answer'][:300]}...")
    print(f"\nSources: {len(result['sources'])}")
    if result.get('safety_warning'):
        print(f"Safety Warning: {result['safety_warning']}")
    print("=" * 80)


Testing RAG System on Medical Questions:


[1] Query: What are the symptoms of malaria?
--------------------------------------------------------------------------------


Answer: A high temperature (fever) Headaches Sweats and chills Muscle aches or pains Vomiting and/or diarrhoea Tiredness and loss of energy Symptoms usually appear between 7 and 18 days after being bitten, but in some cases can take up to a year or occasionally even longer. If not treated promptly, malaria ...

Sources: 1

[2] Query: What is the first-line treatment for uncomplicated malaria?
--------------------------------------------------------------------------------


Answer: Artemisinin-based combination therapies (ACTs) are recommended --- Common ACTs include artemether-lumefantrine, artesunate-amodiaquine, and dihydroartemisinin-piperaquine --- Title: Malaria Prevention and Treatment Guidelines Source: World Health Organization (WHO) Date: 2024 --- OVERVIEW Malaria is...

Sources: 1

[3] Query: How can malaria be prevented?
--------------------------------------------------------------------------------
Answer: Vector control: Use of insecticide-treated mosquito nets (ITNs) and indoor residual spraying (IRS) 2. Chemoprevention: Antimalarial drugs for high-risk groups (pregnant women, infants, travelers) 3. Environmental management: Eliminating mosquito breeding sites 4. Personal protection: Wear long-sleev...

Sources: 1


## 🚀 Section 8: Making it App-Ready (Bonus)

### Deployment Possibilities

This notebook structure can easily be converted into:

1. **FastAPI Endpoint**: REST API for medical Q&A
2. **Streamlit App**: Interactive web interface
3. **Chatbot Interface**: Conversational medical assistant
4. **Mobile App Backend**: API for mobile health apps

### Next Steps for Production

1. **Add Authentication**: Secure API access
2. **Rate Limiting**: Prevent abuse
3. **Logging**: Track usage and errors
4. **Monitoring**: Health checks and metrics
5. **Caching**: Cache common queries
6. **Database**: Store query history
7. **CI/CD**: Automated testing and deployment


## 📊 Summary & Next Steps

### What We Built

- ✅ **Complete RAG Pipeline**: From data ingestion to answer generation
- ✅ **Medical Domain Focus**: Safety checks and evidence-based answers
- ✅ **Production-Ready Code**: Modular, documented, and scalable
- ✅ **Evaluation Framework**: Metrics and benchmarking tools

### Key Learnings

1. **RAG Architecture**: How retrieval enhances LLM responses
2. **Medical Safety**: Importance of validation and disclaimers
3. **Vector Search**: Efficient semantic similarity search
4. **Prompt Engineering**: Critical for medical accuracy

### Future Enhancements

- **Larger Knowledge Base**: Add more medical guidelines
- **Better Embeddings**: Fine-tune on medical text
- **Reranking**: Improve retrieval quality
- **Multi-turn Conversations**: Context-aware follow-ups
- **Multilingual Support**: Answer in multiple languages

---

**Thank you for using the Medical RAG Q&A Agent!**

Remember: This system is for educational purposes only. Always consult healthcare professionals for medical advice.
