# Domain-Specific RAG: A Practical Tutorial



## Learning Objectives

By the end of this tutorial, you will be able to:
 - Understand the basic concepts of Retrieval-Augmented Generation (RAG).
 - Implement a simple RAG pipeline using pre-built indexes and models.
 - Apply domain-specific prompting and evaluation metrics to tailor RAG to different use cases.
 - Explore the impact of different retrieval and generation parameters on the performance of a RAG system.

## Exercises and Challenges (Optional)

 - **Experiment with different datasets:** Try using other pre-built indexes or creating your own indexes from a dataset of your choice.
 - **Fine-tune the retrieval model:** Explore fine-tuning the bi-encoder or cross-encoder to improve retrieval accuracy for your domain.
 - **Evaluate different LLMs:** Compare the performance of different LLMs for answer generation in your use case.
 - **Build a user interface:** Develop a simple web application or chatbot that integrates your RAG pipeline.

## 🖥️ System Requirements & Setup

### System Requirements

This tutorial uses deep learning models and vector similarity search that benefit from:

- **Compute Resources:**
  - **RAM:** Minimum 8GB, recommended 16GB+
  - **GPU:** Recommended for faster processing (especially when using cross-encoders)
  - **Storage:** ~2GB for pre-built indexes and models
  - **Google Colab:** This notebook runs well on a free Colab instance with a T4 GPU

### 🧰 Tools & Libraries We'll Use

This tutorial leverages several key libraries:

- **[sentence-transformers](https://www.sbert.net/)**: For embedding generation and cross-encoding
- **[FAISS](https://github.com/facebookresearch/faiss)**: For efficient similarity search
- **[ir_datasets](https://ir-datasets.com/)**: For accessing benchmark datasets
- **[Mistral AI](https://mistral.ai/)**: For answer generation using LLMs
- **[Hugging Face Hub](https://huggingface.co/docs/hub/index)**: For accessing pre-built indexes

### 🔑 Mistral API Setup

To use the Mistral AI API for answer generation:

1. Create a Mistral AI account at [https://console.mistral.ai/](https://console.mistral.ai/)
2. Generate an API key from your account dashboard
3. Add your key to this notebook:
   ```python
   import os

   os.environ["MISTRAL_API_KEY"] = "YOUR_API_KEY_HERE"  # Replace with your actual key
   ```

## 🏗️ RAG Architecture Overview

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant information from external knowledge sources before generating responses. Grounding the model in retrieved text can improve factual accuracy and lets LLMs draw on domain-specific information without retraining.

![RAG Architecture Overview](https://drive.google.com/uc?export=view&id=1U28sDU5JAE-UfIfT9-BZ_4FEHktpuiah)

### Core Components of RAG:

1. **Document Processing Pipeline**
   - **Collection**: Gathering domain-specific documents, articles, or knowledge bases
   - **Chunking**: Breaking documents into manageable segments (paragraphs or sections)
   - **Embedding**: Converting text chunks into dense vector representations

2. **Retrieval System**
   - **Query Processing**: Converting user questions into the same vector space
   - **Vector Search**: Finding relevant document chunks using similarity measures
   - **Reranking**: Further refining results using more sophisticated models

3. **Generation System**
   - **Context Assembly**: Combining retrieved information into a prompt
   - **LLM Integration**: Sending the enriched prompt to an LLM
   - **Response Generation**: Creating coherent, accurate, and contextual answers
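
The chunking step can be sketched in a few lines. This is a minimal illustration, not the method used to build this tutorial's indexes: the function name `chunk_text` and the `max_words` budget are assumptions chosen for the example, and paragraphs are assumed to be separated by blank lines.

```python
def chunk_text(text, max_words=50):
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    Hypothetical helper for illustration. A single paragraph longer than
    max_words is kept as its own oversized chunk rather than being cut.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i\n\nj"
for chunk in chunk_text(sample, max_words=5):
    print(repr(chunk))
```

Splitting on paragraph boundaries keeps each chunk readable on its own, at the cost of uneven chunk sizes.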

### Why Domain-Specific RAG Matters

Different domains require specialized knowledge and specific response formats:
- Medical advice needs evidence-based information and appropriate caveats
- Technical support requires precise, actionable instructions
- Educational content should be structured for different learning levels

In this tutorial, we'll customize each component of the RAG pipeline for specific domains.

### 🧩 Domain-Specific RAG Architecture

Standard RAG pipelines can be enhanced by domain-specific customizations at each stage:

### Domain Customization Points:

1. **Data Sources & Preprocessing**
   - Scientific: Research papers, clinical trials, structured abstracts
   - Technical: Documentation, forums, code repositories
   - Educational: Textbooks, lecture notes, graded materials

2. **Embedding & Retrieval**
   - Domain-specific models (e.g., BioBERT for medical, CodeBERT for programming)
   - Customized chunking strategies (e.g., by sections, by concepts)
   - Specialized ranking functions (e.g., recency weighting for news)

3. **Prompt Engineering**
   - Domain-appropriate instructions, terminology, and formats
   - Specialty-specific constraints and guardrails
   - Role-specific personas (e.g., educator, researcher, consultant)

4. **Evaluation Metrics**
   - Scientific: Citation accuracy, evidence quality
   - Technical: Step completeness, actionability
   - Educational: Clarity at appropriate learning level

In this tutorial, we'll focus on practical implementations of these customizations.
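
As one illustration of a specialized ranking function, the sketch below blends a retriever's similarity score with an exponential recency decay. The formula, the `half_life_days` and `alpha` parameters, the function name, and the candidate data are all assumptions made for this example; they are not part of this tutorial's pipeline.

```python
def recency_weighted_score(similarity, age_days, half_life_days=30.0, alpha=0.7):
    """Return alpha * similarity + (1 - alpha) * 0.5 ** (age_days / half_life_days)."""
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 for a new document, halves every half-life
    return alpha * similarity + (1.0 - alpha) * recency


# Hypothetical candidates: (doc_id, similarity score, age in days)
candidates = [
    ("old_but_relevant", 0.80, 365),
    ("fresh", 0.70, 1),
    ("stale", 0.60, 90),
]
ranked = sorted(
    candidates,
    key=lambda c: recency_weighted_score(c[1], c[2]),
    reverse=True,
)
print([doc_id for doc_id, _, _ in ranked])
```

With these illustrative weights, a slightly less similar but very recent document can outrank an older one; `alpha` controls how strongly similarity dominates.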

## 📚 Datasets

This tutorial uses a diverse set of datasets from the [BEIR benchmark collection](https://github.com/beir-cellar/beir) (via `ir_datasets`) to demonstrate how Retrieval-Augmented Generation (RAG) can be applied across multiple real-world domains.

These datasets represent realistic information needs from different user groups, such as researchers, developers, educators, and the general public. Pre-built indexes have been provided for each dataset to ensure the tutorial runs smoothly in a short time window (~30 minutes), avoiding the overhead of on-the-fly index construction.

### 🧪 Scientific Research Domain
- `beir/trec-covid`: Focused on COVID-19 research questions using the CORD-19 corpus.
- `beir/scifact`: Scientific claim verification, where the system must support or refute claims using scientific abstracts.
- `beir/nfcorpus`: Nutrition-focused medical retrieval, with plain-language queries from NutritionFacts.org and documents drawn mostly from PubMed.

### 🛠️ Technical Support Domain
- `beir/cqadupstack/android`: Community Q&A data from the Android Stack Exchange site.
- `beir/cqadupstack/webmasters`: Web hosting and webmaster technical queries.
- `beir/cqadupstack/unix`: Unix/Linux command-line and scripting support.

### 🎓 Education & Library Domain
- `beir/natural-questions`: Real user questions from Google Search and answers from Wikipedia.
- `beir/hotpotqa`: Multi-hop QA dataset requiring reasoning over multiple Wikipedia documents.
- `beir/nfcorpus`: Also used here for medically-themed educational queries.

### 🔍 Fact Verification Domain
- `beir/fever`: Fact-checking based on Wikipedia claims.
- `beir/climate-fever`: Focused on climate change-related claims and evidence.
- `beir/scifact`: Also used here for scientific claim verification.

### 🏥 Healthcare Information Domain
- `beir/nfcorpus`: Used for health-related literature review and question answering.
- `beir/trec-covid`: Reused here to address health policy and treatment questions during the pandemic.

> ℹ️ Each domain has one **default dataset** selected (the first in each list), but you can explore other datasets in that domain using the dropdown selector. All datasets are accessed through the `ir_datasets` library, and prebuilt indexes are provided to ensure quick experimentation.

## Preliminaries

Install required packages



In [None]:
!pip install -q sentence-transformers transformers torch numpy faiss-cpu tqdm ir_datasets ir_measures pandas matplotlib ipywidgets huggingface_hub


Import required libraries



In [None]:
import os
import torch
import requests
import json
import numpy as np
import time
import pandas as pd
from tqdm.notebook import tqdm
import ir_datasets
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from transformers import AutoTokenizer, AutoModelForCausalLM
import faiss
import pickle
import ir_measures
from ir_measures import *
import matplotlib.pyplot as plt
from IPython.display import display, clear_output, HTML, Markdown
import ipywidgets as widgets
from huggingface_hub import hf_hub_download

Set up GPU/CPU device

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [None]:
HUB_REPO_ID = "ShubhamC/rag-tutorial-prebuilt-indexes"

In [None]:
import warnings
warnings.filterwarnings("ignore")

## 🔍 Basic RAG Pipeline: A Minimal Example

Before diving into domain customization, let's understand the basic RAG workflow with a minimal working example. This gives us a foundation to build upon.

### The Four Essential Steps:

1. Load a document collection
2. Convert user query to a vector
3. Retrieve relevant documents
4. Generate an answer using retrieved information

In [None]:
# Quick demonstration of a minimal RAG pipeline
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# 1. Create a tiny document collection (normally this would be loaded from a database)
mini_docs = [
    "RAG stands for Retrieval-Augmented Generation in AI systems.",
    "Embedding models convert text into numerical vectors.",
    "FAISS is a library for efficient similarity search in vector spaces.",
    "Language models generate text based on provided prompts."
]

print("1. Loaded document collection with", len(mini_docs), "documents\n")

# 2. Create embeddings for documents
mini_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
doc_embeddings = mini_model.encode(mini_docs)
print("2. Created document embeddings with shape:", doc_embeddings.shape, "\n")

# Create a simple vector index
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)
print("3. Built search index with", index.ntotal, "vectors\n")

# 3. Process a query and retrieve
query = "What is RAG in AI?"
query_vector = mini_model.encode([query])
D, I = index.search(query_vector, k=2)  # Get top 2 results

print("Query:", query)
print("\nRetrieved documents:")
for i, doc_idx in enumerate(I[0]):
    print(f"[{i+1}] {mini_docs[doc_idx]} (distance: {D[0][i]:.2f})")

# 4. Generate answer (hardcoded placeholder text; no LLM API call is made here)
print("\n4. Generated Answer:")
print("RAG (Retrieval-Augmented Generation) is an AI technique that enhances language models by retrieving relevant information before generating responses. It combines the knowledge retrieval capabilities of search systems with the text generation abilities of large language models.")
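
The answer printed by the cell above is a fixed string. In a full pipeline, step 4 would assemble the retrieved passages into a prompt and send it to an LLM. Below is a minimal sketch of the assembly part only. The helper name `build_rag_prompt` and the prompt layout are assumptions for illustration, and no API call is made; the role/content message list is the general shape that chat-style APIs tend to accept.

```python
def build_rag_prompt(system_prompt, query, retrieved_docs):
    """Return chat-style messages whose user turn holds numbered context passages."""
    context = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    user_content = (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above and cite passages by number."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_rag_prompt(
    "You are a helpful assistant.",
    "What is RAG in AI?",
    [
        "RAG stands for Retrieval-Augmented Generation in AI systems.",
        "FAISS is a library for efficient similarity search in vector spaces.",
    ],
)
print(messages[1]["content"])
```

Numbering the passages gives the model a simple way to cite its sources, which matters for the domain prompts used later in this tutorial.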

## About Pre-built Indexes

For this tutorial, we're using pre-built indexes to save time. The indexes were created in advance
using datasets from various sources:

- BEIR (Benchmarking IR): Contains various IR tasks including scientific literature, news articles, etc.
- Stack Exchange collections: Technical Q&A across domains
- CORD-19: COVID-19 related research papers
- NQ (Natural Questions): General knowledge Q&A

The pre-built indexes include:
1. Document corpus with text and titles
2. Document embeddings using sentence-transformers
3. FAISS index for efficient retrieval
4. Sample queries with relevance judgments

In a typical RAG pipeline, you would need to:
1. Download and process a dataset
2. Create document embeddings
3. Build a search index
4. Perform retrieval
5. Generate answers with an LLM

For this tutorial, steps 1-3 are already done for you with the pre-built indexes.
We'll focus on steps 4-5: retrieval and generation.


## ⚙️ How the Pre-built Indexes Were Created

To ensure smooth performance during this tutorial, we’ve prepared **pre-built retrieval indexes** for all datasets in advance. This saves time and avoids requiring participants to download large corpora or compute document embeddings live.

We generated these indexes with a script built on the `ir_datasets`, `sentence-transformers`, and `faiss` libraries.

### 🛠️ Index Construction Pipeline
For each dataset, we followed this process:

1. **Load the dataset** using `ir_datasets`, including documents, queries, and relevance judgments (if available).
2. **Preprocess each document** by combining its title and text (if both are available).
3. **Generate dense embeddings** using the [`msmarco-distilbert-base-v3`](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-v3) SentenceTransformer model.
4. **Normalize embeddings** and index them using **FAISS** with an inner product search (cosine similarity on normalized vectors).
5. **Store metadata**, including:
   - A pickled dictionary of document texts and titles
   - The FAISS index (`faiss_index.bin`)
   - The NumPy matrix of document embeddings
   - Sample queries and relevance judgments for evaluation

This was implemented in a Python script (`create_prebuilt_indexes.py`), whose main function is called as follows:

```python
create_prebuilt_index(dataset_name="beir/trec-covid",
                      output_dir="prebuilt_indexes/beir_trec-covid",
                      model_name="sentence-transformers/msmarco-distilbert-base-v3")
```

You can modify and rerun this script to generate new indexes using your own datasets or preferred embedding models.
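
Step 4 of the pipeline relies on the fact that, for unit-length vectors, the inner product equals cosine similarity. The sketch below checks that identity in plain Python (rather than NumPy or FAISS, so it runs without extra dependencies). The helper names and the two example vectors are hypothetical. It also checks that the squared L2 distance between unit vectors equals 2 minus twice their cosine, which is why an L2 index and an inner-product index rank normalized vectors in the same order.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical 3-dimensional "embeddings"
doc = [3.0, 4.0, 0.0]
query = [1.0, 2.0, 2.0]
doc_n, query_n = l2_normalize(doc), l2_normalize(query)

inner = dot(doc_n, query_n)  # inner product of the normalized vectors
sq_l2 = sum((x - y) ** 2 for x, y in zip(doc_n, query_n))  # squared L2 distance

print(f"cosine={cosine(doc, query):.4f} inner={inner:.4f} squared_L2={sq_l2:.4f}")
```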

### 📁 Files in Each Index
Each prebuilt index contains:
- `corpus.pkl`: A dictionary mapping document IDs to `{text, title}`
- `embeddings.npy`: Dense vectors for each document
- `faiss_index.bin`: FAISS index for fast retrieval
- `doc_ids.pkl`: Mapping of embedding rows to document IDs
- `sample_queries.pkl`: Example queries for demonstration
- `qrels.pkl`: Relevance judgments (if available)

> ✅ **Note:** In this tutorial, we only load these prebuilt files and use them directly, skipping embedding and indexing steps to keep things interactive and lightweight.


## 🎯 Use Cases in This Tutorial

This tutorial explores how Retrieval-Augmented Generation (RAG) can be customized for different real-world domains. We've defined five practical **use cases**, each mapped to datasets available in the [`ir_datasets`](https://ir-datasets.com/) library and paired with domain-specific example queries and prompts.

Each use case provides:
- A curated set of relevant datasets
- Example queries based on real-world information needs
- A system prompt that guides the language model’s tone and behavior

These use cases demonstrate how RAG can go beyond generic QA to serve **domain-specific goals** like academic research, troubleshooting, or fact verification.

---

### 🧪 1. Scientific Research
Designed for tasks like literature review, scientific explanation, and research comprehension.

- **Default Dataset:** `beir/trec-covid`
- **Other Datasets:** `beir/scifact`, `beir/nfcorpus`
- **Example Queries:**
  - What are the most effective treatments for severe COVID-19?
  - How does mRNA vaccine technology work?
- **Prompt:** _You are a scientific research assistant. Provide accurate, evidence-based answers with appropriate scientific context and caveats._

---

### 🛠️ 2. Technical Support
Targets IT and developer support tasks using Stack Exchange-style technical Q&A data.

- **Default Dataset:** `beir/cqadupstack/android`
- **Other Datasets:** `beir/cqadupstack/webmasters`, `beir/cqadupstack/unix`
- **Example Queries:**
  - How do I fix network connectivity issues with Android devices?
  - What's causing my app to crash on startup?
- **Prompt:** _You are a technical support specialist. Provide clear, step-by-step solutions to technical problems with practical troubleshooting advice._

---

### 🎓 3. Education & Library
Supports research help, study guides, and educational content generation.

- **Default Dataset:** `beir/natural-questions`
- **Other Datasets:** `beir/hotpotqa`, `beir/nfcorpus`
- **Example Queries:**
  - What teaching methods are most effective for student engagement?
  - How did World War II impact post-war economic development?
- **Prompt:** _You are an educational assistant. Provide informative, well-structured answers suitable for learners, with clear explanations of complex concepts._

---

### 🔍 4. Fact Verification
Focuses on verifying potentially controversial or widely debated claims using factual sources.

- **Default Dataset:** `beir/fever`
- **Other Datasets:** `beir/scifact`, `beir/climate-fever`
- **Example Queries:**
  - Do vaccines cause autism?
  - Is 5G technology harmful to human health?
- **Prompt:** _You are a fact-checking assistant. Provide balanced, evidence-based assessments of claims with references to sources where possible._

---

### 🏥 5. Healthcare Information
Geared toward health and medical information, patient education, and clinical understanding.

- **Default Dataset:** `beir/nfcorpus`
- **Other Datasets:** `beir/trec-covid`
- **Example Queries:**
  - What are the potential side effects of statin medications?
  - How effective is cognitive behavioral therapy for anxiety?
- **Prompt:** _You are a healthcare information assistant. Provide evidence-based answers about medical topics, while noting that this is not medical advice. Focus on established research and clinical guidelines._


## Setup


Select RAG Use Case

In [None]:
# Dictionary of use cases with their descriptions, datasets, and example queries
# Focus on datasets publicly available in ir_datasets but using prebuilt indexes
use_cases = {
    "Scientific Research": {
        "description": "Support for scientific literature review, fact verification, and research paper comprehension",
        "datasets": ["beir/trec-covid", "beir/scifact", "beir/nfcorpus"],
        "default_dataset": "beir/trec-covid",
        "example_queries": [
            "What are the most effective treatments for severe COVID-19?",
            "How does mRNA vaccine technology work?",
            "What evidence supports aerosol transmission of respiratory viruses?",
            "What is the relationship between diet and cancer prevention?"
        ],
        "domain_prompt": "You are a scientific research assistant. Provide a clear, accurate, and evidence-based answer to the user's question using only the retrieved documents. Cite supporting information from the context explicitly, and include appropriate scientific context, limitations, and caveats where applicable. Do not speculate beyond the provided material."

    },
    "Technical Support": {
        "description": "IT helpdesk, programming assistance, and technical knowledge base",
        "datasets": ["beir/cqadupstack/android", "beir/cqadupstack/webmasters", "beir/cqadupstack/unix"],
        "default_dataset": "beir/cqadupstack/android",
        "example_queries": [
            "How do I fix network connectivity issues with Android devices?",
            "What's causing my app to crash on startup?",
            "How to implement pagination in a mobile application?",
            "Best practices for securing an Android device"
        ],
        "domain_prompt": "You are a technical support specialist. Based on the retrieved documents, provide a concise and practical solution to the user's problem. Structure your answer step-by-step, cite relevant technical details, and ensure instructions are clear and executable. Avoid speculation or unsupported suggestions."

    },
    "Education & Library": {
        "description": "Enhanced research assistance, study materials, and educational content",
        "datasets": ["beir/nfcorpus", "beir/natural-questions", "beir/hotpotqa"],
        "default_dataset": "beir/natural-questions",
        "example_queries": [
            "What teaching methods are most effective for student engagement?",
            "How did World War II impact economic development in post-war Europe?",
            "What are the key differences between behaviorist and constructivist learning theories?",
            "How does cellular respiration relate to photosynthesis?"
        ],
        "domain_prompt": "You are an educational assistant. Using the retrieved documents, provide a well-structured, accurate, and learner-friendly explanation of the topic. Break down complex concepts into simpler terms, include definitions or examples when needed, and ensure your tone is clear, supportive, and informative. Ground your answer entirely in the context provided."

    },
    "Fact Verification": {
        "description": "Verifying claims and statements across various sources",
        "datasets": ["beir/fever", "beir/scifact", "beir/climate-fever"],
        "default_dataset": "beir/fever",
        "example_queries": [
            "Is climate change primarily caused by human activities?",
            "Do vaccines cause autism?",
            "Does vitamin C prevent the common cold?",
            "Is 5G technology harmful to human health?"
        ],
        "domain_prompt": "You are a fact-checking assistant. Using only the retrieved documents, assess the accuracy of the user's claim. Provide a balanced, evidence-based analysis with references to specific statements or sources from the documents. If the evidence is inconclusive, clearly state the uncertainty and avoid speculation."

    },
    "Healthcare Information": {
        "description": "Medical literature review, patient education, and clinical guidelines",
        "datasets": ["beir/nfcorpus", "beir/trec-covid"],
        "default_dataset": "beir/nfcorpus",
        "example_queries": [
            "What is the relationship between diet and heart disease?",
            "How effective is cognitive behavioral therapy for anxiety disorders?",
            "What are the potential side effects of statin medications?",
            "How does family history affect cancer risk assessment?"
        ],
        "domain_prompt": "You are a healthcare information assistant. Based on the retrieved documents, provide an accurate, evidence-based summary addressing the user's question. Do not offer medical advice. Instead, focus on established research findings, clinical guidelines, and clearly state any risks, limitations, or uncertainties in the available evidence. Ground your answer entirely in the context provided."
    },
    "Campus Info": {
        "description": "Ask questions about services, offices, and procedures listed on the S&T website.",
        "datasets": ["custom_mst_site"],
        "default_dataset": "custom_mst_site",
        "example_queries": [
            "Where is the ISSS office located?",
            "What services are offered to sponsored international students?",
            "How can I contact the Graduate Studies department?",
            "Where can I find the event calendar?"
        ],
        "domain_prompt": "You are a helpful campus assistant for Missouri S&T. Using only the information available in the retrieved content, answer the user's question clearly and accurately. Reference the exact page or section when possible. Be concise, avoid speculation, and maintain a friendly and professional tone."

    }
}

In [None]:
# Step 1: Create dropdowns
use_case_dropdown = widgets.Dropdown(
    options=list(use_cases.keys()),
    description='Use Case:',
    value="Scientific Research"
)

dataset_dropdown = widgets.Dropdown(
    options=use_cases["Scientific Research"]["datasets"],
    description='Dataset:'
)

# Step 2: Update dataset list when use case changes
def update_datasets(change):
    new_use_case = change['new']
    new_datasets = use_cases[new_use_case]["datasets"]
    dataset_dropdown.options = new_datasets
    dataset_dropdown.value = use_cases[new_use_case]["default_dataset"]

use_case_dropdown.observe(update_datasets, names='value')

# Step 3: Display both dropdowns
display(use_case_dropdown, dataset_dropdown)

# Step 4: Function to confirm selections and assign variables
def confirm_selection(_):
    global use_case, selected_dataset
    use_case = use_case_dropdown.value
    selected_dataset = dataset_dropdown.value

    clear_output()
    print(f"✅ Use Case Selected: {use_case}")
    print(f"✅ Dataset Selected: {selected_dataset}")
    print(f"\nDescription:\n{use_cases[use_case]['description']}")
    print("\nExample queries:")
    for query in use_cases[use_case]['example_queries']:
        print(f"- {query}")

# Step 5: Confirm button
confirm_button = widgets.Button(description="Confirm Selection", button_style='success')
confirm_button.on_click(confirm_selection)
display(confirm_button)

✅ Use Case Selected: Scientific Research
✅ Dataset Selected: beir/trec-covid

Description:
Support for scientific literature review, fact verification, and research paper comprehension

Example queries:
- What are the most effective treatments for severe COVID-19?
- How does mRNA vaccine technology work?
- What evidence supports aerosol transmission of respiratory viruses?
- What is the relationship between diet and cancer prevention?
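
The cells that follow read the globals `use_case` and `selected_dataset`, which are only assigned once the button is clicked. When running the notebook non-interactively, a helper along these lines could set them instead. This is a sketch: `resolve_selection` is a hypothetical name, and the small `demo_use_cases` dictionary stands in for the full `use_cases` dictionary defined earlier.

```python
def resolve_selection(use_cases, use_case, dataset=None):
    """Validate a use case / dataset pair, falling back to the default dataset."""
    if use_case not in use_cases:
        raise KeyError(f"Unknown use case: {use_case!r}")
    config = use_cases[use_case]
    if dataset is None:
        dataset = config["default_dataset"]
    if dataset not in config["datasets"]:
        raise ValueError(f"{dataset!r} is not listed for use case {use_case!r}")
    return use_case, dataset


# Stand-in for the notebook's full `use_cases` dictionary
demo_use_cases = {
    "Scientific Research": {
        "datasets": ["beir/trec-covid", "beir/scifact", "beir/nfcorpus"],
        "default_dataset": "beir/trec-covid",
    },
}

use_case, selected_dataset = resolve_selection(demo_use_cases, "Scientific Research")
print(use_case, selected_dataset)
```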


Define the paths where pre-built index files should be

In [None]:
base_path = f"prebuilt_indexes/{selected_dataset.replace('/', '_')}"
corpus_path = f"{base_path}/corpus.pkl"
embeddings_path = f"{base_path}/embeddings.npy"
faiss_index_path = f"{base_path}/faiss_index.bin"
doc_ids_path = f"{base_path}/doc_ids.pkl"

Create the directory if it doesn't exist


In [None]:
os.makedirs(base_path, exist_ok=True)
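
Before loading an index, it can help to check which of the expected files are actually present under `base_path`. The sketch below uses the file names listed in "Files in Each Index". The helper names `index_dir_for` and `missing_index_files` are hypothetical, and the demo runs against a temporary directory rather than a real index.

```python
import os
import tempfile

# File names from "Files in Each Index"; qrels.pkl may legitimately be absent
EXPECTED_FILES = [
    "corpus.pkl", "embeddings.npy", "faiss_index.bin",
    "doc_ids.pkl", "sample_queries.pkl", "qrels.pkl",
]


def index_dir_for(dataset_name, root="prebuilt_indexes"):
    """Mirror the notebook's path convention: '/' in the dataset name becomes '_'."""
    return os.path.join(root, dataset_name.replace("/", "_"))


def missing_index_files(index_dir, expected=EXPECTED_FILES):
    """Return the expected file names that do not exist in index_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(index_dir, name))
    ]


with tempfile.TemporaryDirectory() as root:
    demo_dir = index_dir_for("beir/trec-covid", root=root)
    os.makedirs(demo_dir, exist_ok=True)
    # Create one empty placeholder file so the check has something to find
    open(os.path.join(demo_dir, "corpus.pkl"), "wb").close()
    missing = missing_index_files(demo_dir)
    print("Missing:", missing)
```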

Define the download URLs for each dataset

Function to generate sample corpus based on the selected dataset and use case

In [None]:
def generate_sample_corpus(dataset_name, current_use_case):
    """Generate a sample corpus with domain-specific content based on the dataset name."""
    corpus = {}

    if current_use_case == "Scientific Research":
        # Scientific research content
        for i in range(20):
            doc_id = f"doc{i}"
            if "covid" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"COVID-19 Research Paper {i}",
                    "text": f"This paper investigates the effects of COVID-19 on respiratory health. We conducted a study with {100+i} patients and found significant correlations between viral load and symptom severity. The study suggests that early intervention with antiviral medications may reduce hospitalization rates."
                }
            elif "scifact" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"Scientific Study on Topic {i}",
                    "text": f"Our research demonstrates that hypothesis {i} is supported by experimental evidence. The data shows a statistically significant effect (p<0.05) across multiple trials. These findings contradict previous assumptions and suggest a new mechanism for this phenomenon."
                }
            else:
                corpus[doc_id] = {
                    "title": f"Scientific Paper {i}",
                    "text": f"This research examines the relationship between diet and health outcomes. Analysis of data from {500+i} participants shows significant associations between consumption of processed foods and increased risk of chronic diseases. The findings highlight the importance of dietary interventions in public health strategies."
                }

    elif current_use_case == "Technical Support":
        # Technical support content
        for i in range(20):
            doc_id = f"doc{i}"
            if "android" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"Android Technical Issue {i}",
                    "text": f"Users experiencing battery drain on Android devices should check for apps running in the background. The issue often occurs after system updates or when location services are constantly active. To fix this, go to Settings > Battery > Battery Usage and identify power-hungry applications. Restricting background activity for these apps can significantly improve battery life."
                }
            elif "webmasters" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"Web Development Problem {i}",
                    "text": f"When implementing responsive designs, developers often encounter issues with viewport rendering on mobile devices. To address this, ensure your HTML includes the proper meta viewport tag. Also check that media queries are correctly implemented to handle different screen sizes. Testing across multiple devices is essential to verify responsive behavior."
                }
            else:
                corpus[doc_id] = {
                    "title": f"Technical Solution {i}",
                    "text": f"A common error when setting up network connections is incorrect DNS configuration. To troubleshoot, first verify that the DNS server addresses are correct. Then flush the DNS cache to ensure old records aren't causing conflicts. If problems persist, try using alternative DNS servers to determine if the issue is with your ISP's DNS resolution."
                }

    elif current_use_case == "Education & Library":
        # Educational content
        for i in range(20):
            doc_id = f"doc{i}"
            if "questions" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"Educational Topic {i}",
                    "text": f"This article explains the fundamental concepts of learning theories. Constructivism emphasizes how learners actively build knowledge through experience and reflection, while behaviorism focuses on observable behaviors and environmental conditioning. Understanding these frameworks helps educators design more effective teaching strategies that accommodate different learning styles and cognitive processes."
                }
            elif "hotpot" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"Historical Event {i}",
                    "text": f"The Industrial Revolution transformed economic systems through mechanization and factory production. Beginning in Britain in the late 18th century, it spread throughout Europe and North America, fundamentally changing social structures and labor practices. Technological innovations like the steam engine drove unprecedented growth while creating new challenges related to urbanization, working conditions, and economic inequality."
                }
            else:
                corpus[doc_id] = {
                    "title": f"Educational Resource {i}",
                    "text": f"This educational material covers key scientific concepts for secondary education. Topics include cellular biology, chemical reactions, and physical laws. The content is structured to build conceptual understanding through progressive learning objectives, providing examples and applications relevant to students' daily experiences."
                }

    elif current_use_case == "Fact Verification":
        # Fact verification content
        for i in range(20):
            doc_id = f"doc{i}"
            if "fever" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"Fact Check Article {i}",
                    "text": f"Climate scientists have reached overwhelming consensus that human activities are the primary driver of observed climate change since the mid-20th century. Multiple independent lines of evidence support this conclusion, including atmospheric CO2 measurements, temperature records, and climate model projections. Natural factors alone cannot explain the rapid warming observed in recent decades."
                }
            elif "scifact" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"Scientific Claim Assessment {i}",
                    "text": f"Research has consistently failed to find evidence supporting a causal link between vaccines and autism. Multiple large-scale epidemiological studies involving millions of children have found no association between vaccination and autism spectrum disorders. The original study suggesting this link was retracted due to methodological flaws and ethical violations."
                }
            else:
                corpus[doc_id] = {
                    "title": f"Fact Verification {i}",
                    "text": f"Analysis of the claim that 5G technology poses health risks finds insufficient scientific evidence to support this assertion. While all radiation sources deserve study, 5G operates using non-ionizing radiation that lacks sufficient energy to damage cellular DNA. Current research indicates that exposure levels from 5G infrastructure fall well below international safety guidelines."
                }

    elif current_use_case == "Healthcare Information":
        # Healthcare content
        for i in range(20):
            doc_id = f"doc{i}"
            if "nfcorpus" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"Nutrition Research Study {i}",
                    "text": f"This review examines the relationship between dietary patterns and cardiovascular health. Evidence consistently shows that diets rich in fruits, vegetables, whole grains, and lean proteins are associated with reduced risk of heart disease. Conversely, high consumption of processed foods, saturated fats, and added sugars correlates with increased cardiovascular risk factors including hypertension and elevated cholesterol."
                }
            elif "covid" in dataset_name.lower():
                corpus[doc_id] = {
                    "title": f"COVID-19 Treatment Review {i}",
                    "text": f"Clinical trials evaluating antiviral medications for COVID-19 have shown varying degrees of efficacy. Early treatment with certain antivirals may reduce symptom duration and hospitalization risk in high-risk populations. However, treatment effectiveness depends on timing, viral variants, and patient characteristics. Comprehensive management approaches typically include supportive care alongside targeted therapies."
                }
            else:
                corpus[doc_id] = {
                    "title": f"Medical Research {i}",
                    "text": f"This study examines treatment efficacy for anxiety disorders, comparing cognitive behavioral therapy (CBT) with pharmacological interventions. Meta-analysis of clinical trials indicates that CBT produces outcomes comparable to medication for many patients, with potentially more durable effects after treatment discontinuation. Combined approaches often yield superior results, particularly for severe or treatment-resistant cases."
                }
    elif current_use_case == "Campus Info":
        # Campus information content
        for i in range(20):
            doc_id = f"doc{i}"
            corpus[doc_id] = {
                "title": f"Missouri S&T Page {i}",
                "text": f"This page contains information about Missouri S&T services and offices. The International Student and Scholar Services (ISSS) office provides support for sponsored international students including visa assistance, cultural activities, and academic guidance. Students can contact the Graduate Studies department for information about graduate programs, thesis requirements, and funding opportunities."
            }

    else:
        # Generic content for any other use case
        for i in range(20):
            doc_id = f"doc{i}"
            corpus[doc_id] = {
                "title": f"Document {i}",
                "text": f"This is sample text for document {i}. It contains information relevant to various queries in this domain. The content includes key facts, explanations, and examples that might be useful for answering questions on this topic."
            }

    return corpus

Download the files if they don't exist


In [None]:
# --- Path Setup ---
repo_folder_name = selected_dataset.replace('/', '_')
base_path = f"prebuilt_indexes/{repo_folder_name}"
os.makedirs(base_path, exist_ok=True) # Create directory if it doesn't exist

# List of files expected for an index (Keep this)
files_to_download = ["corpus.pkl", "embeddings.npy", "faiss_index.bin", "doc_ids.pkl"] # Add qrels.pkl etc. if needed

print(f"Checking/downloading pre-built indexes for {selected_dataset} from HF Hub: {HUB_REPO_ID}...")

# --- Download Loop (Using HF Hub) ---
all_files_exist = True
for file_name in files_to_download:
    local_file_path = os.path.join(base_path, file_name)
    if not os.path.exists(local_file_path):
        all_files_exist = False # Mark that at least one file needs downloading
        print(f"Downloading {file_name}...")
        try:
            # Construct the path *within* the Hub repository
            # Assumes you uploaded into folders named like 'beir_trec-covid' etc.
            path_in_repo = f"{repo_folder_name}/{file_name}"

            # Use hf_hub_download
            downloaded_path = hf_hub_download(
                repo_id=HUB_REPO_ID,
                filename=path_in_repo,
                repo_type="dataset", # Specify it's a dataset repo
                local_dir=base_path, # Download directly into the target folder
                local_dir_use_symlinks=False # Avoids potential symlink issues
            )
            # Double-check file exists at the expected final path
            if not os.path.exists(local_file_path):
                 if os.path.exists(downloaded_path) and downloaded_path != local_file_path:
                     # hf_hub_download keeps the repo-relative path under local_dir, so the file
                     # lands in base_path/<repo_folder_name>/; move it up into base_path
                     os.rename(downloaded_path, local_file_path)
                     print(f"Moved downloaded file to {local_file_path}")
                 else:
                      raise FileNotFoundError(f"Download failed or file not found at expected path {local_file_path} or download path {downloaded_path} for {file_name}")

            print(f"Successfully downloaded {file_name} to {local_file_path}")

        except Exception as e:
            print(f"ERROR downloading {file_name} from Hugging Face Hub: {e}")
            print(f"Check connection and ensure the file exists at 'datasets/{HUB_REPO_ID}/tree/main/{path_in_repo}' on the Hub.")
            # --- Optional: Keep your fallback logic ---
            print(f"⚠️ Creating sample data for {file_name} since download failed")
            # Adapt this fallback logic based on your generate_sample_corpus function and other needs
            if file_name == "corpus.pkl":
                 try:
                     # Ensure generate_sample_corpus and use_case are defined earlier
                     sample_corpus = generate_sample_corpus(selected_dataset, use_case)
                     with open(local_file_path, 'wb') as f: pickle.dump(sample_corpus, f)
                 except NameError:
                     print("generate_sample_corpus or use_case not defined. Cannot create sample data.")
                 except Exception as sample_e:
                     print(f"Error creating sample corpus: {sample_e}")
            elif file_name == "doc_ids.pkl":
                 try:
                     sample_doc_ids = [f"doc{i}" for i in range(20)]
                     with open(local_file_path, 'wb') as f: pickle.dump(sample_doc_ids, f)
                 except Exception as sample_e:
                     print(f"Error creating sample doc_ids: {sample_e}")
            elif file_name == "embeddings.npy":
                 try:
                     # 768 dimensions to match the pre-built index and the default bi-encoder
                     sample_embeddings = np.random.rand(20, 768).astype(np.float32)
                     np.save(local_file_path, sample_embeddings)
                 except Exception as sample_e:
                     print(f"Error creating sample embeddings: {sample_e}")
            elif file_name == "faiss_index.bin":
                 try:
                     # Random placeholder vectors; the dimension must equal the query embedding size
                     sample_index = faiss.IndexFlatL2(768)
                     sample_index.add(np.random.rand(20, 768).astype(np.float32))
                     faiss.write_index(sample_index, local_file_path)
                 except Exception as sample_e:
                     print(f"Error creating sample faiss_index: {sample_e}")
            # --- End Optional Fallback ---

# Optional check message
if all_files_exist:
    print("All required index files already exist locally.")

Checking/downloading pre-built indexes for beir/trec-covid from HF Hub: ShubhamC/rag-tutorial-prebuilt-indexes...
Downloading corpus.pkl...


Moved downloaded file to prebuilt_indexes/beir_trec-covid/corpus.pkl
Successfully downloaded corpus.pkl to prebuilt_indexes/beir_trec-covid/corpus.pkl
Downloading embeddings.npy...


Moved downloaded file to prebuilt_indexes/beir_trec-covid/embeddings.npy
Successfully downloaded embeddings.npy to prebuilt_indexes/beir_trec-covid/embeddings.npy
Downloading faiss_index.bin...


Moved downloaded file to prebuilt_indexes/beir_trec-covid/faiss_index.bin
Successfully downloaded faiss_index.bin to prebuilt_indexes/beir_trec-covid/faiss_index.bin
Downloading doc_ids.pkl...


Moved downloaded file to prebuilt_indexes/beir_trec-covid/doc_ids.pkl
Successfully downloaded doc_ids.pkl to prebuilt_indexes/beir_trec-covid/doc_ids.pkl


In [None]:
print(f"Loading pre-built indexes for {selected_dataset}...")

Loading pre-built indexes for beir/trec-covid...


Load corpus

In [None]:
print("Loading document corpus...")
with open(corpus_path, 'rb') as f:
    corpus = pickle.load(f)

Loading document corpus...


Load document IDs

In [None]:
print("Loading document IDs...")
with open(doc_ids_path, 'rb') as f:
    doc_ids = pickle.load(f)

Loading document IDs...


Load embeddings

In [None]:
print("Loading document embeddings...")
doc_embeddings = np.load(embeddings_path)

Loading document embeddings...


Load FAISS index

In [None]:

print("Loading FAISS index...")
index = faiss.read_index(faiss_index_path)

Loading FAISS index...


In [None]:
try:
    print(f"Successfully loaded pre-built indexes for {selected_dataset}")
    print(f"Corpus size: {len(corpus)} documents")
    print(f"Embeddings shape: {doc_embeddings.shape}")
    print(f"FAISS index size: {index.ntotal} vectors")

except Exception as e:
    print(f"Error loading indexes: {e}")
    print(f"⚠️ Creating sample data for demonstration")

    # Create sample corpus
    corpus = {}
    for i in range(100):
        doc_id = f"doc{i}"
        corpus[doc_id] = {
            'title': f"Sample Document {i}",
            'text': f"This is sample text for document {i} related to {use_case}. It contains information relevant to various queries in this domain. The content includes key facts, explanations, and examples that might be useful for answering questions on this topic."
        }

    doc_ids = list(corpus.keys())

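    # Create sample embeddings (768 dimensions to match the default bi-encoder,
    # msmarco-distilbert-base-v3, so that query vectors fit the index)
    doc_embeddings = np.random.rand(len(doc_ids), 768).astype(np.float32)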
    faiss.normalize_L2(doc_embeddings)

    # Create FAISS index
    index = faiss.IndexFlatIP(doc_embeddings.shape[1])
    index.add(doc_embeddings)

    # Save sample data
    with open(corpus_path, 'wb') as f:
        pickle.dump(corpus, f)

    with open(doc_ids_path, 'wb') as f:
        pickle.dump(doc_ids, f)

    np.save(embeddings_path, doc_embeddings)
    faiss.write_index(index, faiss_index_path)

    print(f"✅ Created sample data with {len(corpus)} documents")


Successfully loaded pre-built indexes for beir/trec-covid
Corpus size: 171332 documents
Embeddings shape: (171332, 768)
FAISS index size: 171332 vectors
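
Retrieval later maps each FAISS row index back to a document through `doc_ids[idx]`, so the four loaded artifacts have to agree in size and order. A minimal sanity check, assuming the row order of the embeddings matches `doc_ids` (the helper name `check_index_alignment` is hypothetical and not part of the notebook):

```python
def check_index_alignment(corpus, doc_ids, n_embeddings, n_index_vectors):
    """Return a list of human-readable problems; an empty list means the artifacts agree."""
    problems = []
    if len(doc_ids) != n_embeddings:
        problems.append(f"{len(doc_ids)} doc IDs but {n_embeddings} embedding rows")
    if n_embeddings != n_index_vectors:
        problems.append(f"{n_embeddings} embedding rows but {n_index_vectors} index vectors")
    missing = [d for d in doc_ids if d not in corpus]
    if missing:
        problems.append(f"{len(missing)} doc IDs missing from corpus, e.g. {missing[0]!r}")
    if len(set(doc_ids)) != len(doc_ids):
        problems.append("duplicate doc IDs")
    return problems

# Toy data standing in for the loaded artifacts
toy_corpus = {"doc0": {"text": "a"}, "doc1": {"text": "b"}}
toy_ids = ["doc0", "doc1"]
print(check_index_alignment(toy_corpus, toy_ids, 2, 2))  # prints []
```

In the notebook this could be called as `check_index_alignment(corpus, doc_ids, doc_embeddings.shape[0], index.ntotal)`.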


Load queries and qrels from ir_datasets

In [None]:
import json
import os

def load_custom_topics_json(json_path):
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    topics_list = data["topics"] if "topics" in data else data
    queries = {}
    for topic in topics_list:
        topic_id = str(topic.get("id") or topic.get("number"))
        queries[topic_id] = {
            "id": topic_id,
            "text": topic.get("title", "").strip(),
            "description": topic.get("description", "").strip(),
            "narrative": topic.get("narrative", "").strip()
        }
    print(f"✅ Loaded {len(queries)} topics from JSON")
    return queries

def load_custom_qrels_txt(qrels_path):
    qrels_dict = {}
    with open(qrels_path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) == 4:
                topic_id, _, doc_id, relevance = parts
                if topic_id not in qrels_dict:
                    qrels_dict[topic_id] = {}
                qrels_dict[topic_id][doc_id] = int(relevance)
    print(f"✅ Loaded qrels for {len(qrels_dict)} topics")
    return qrels_dict

def select_sample_queries(queries, qrels_dict, max_queries=3):
    sample = {}
    for topic_id, query_info in queries.items():
        if topic_id in qrels_dict:
            sample[topic_id] = query_info
        if len(sample) >= max_queries:
            break
    return sample

# === Drop-in replacement begins here ===

try:
    if selected_dataset == "custom_mst_site":
        print("📦 Loading queries and qrels for custom MST dataset...")

        base_path = "prebuilt_indexes/custom_mst_site"
        topics_path = os.path.join(base_path, "topics.json")
        qrels_path = os.path.join(base_path, "auto_qrels.txt")

        queries = load_custom_topics_json(topics_path)
        qrels_dict = load_custom_qrels_txt(qrels_path)
        sample_queries = select_sample_queries(queries, qrels_dict)

        print(f"✅ Selected {len(sample_queries)} sample queries for demonstration")

    else:
        # Fall back to ir_datasets version
        import ir_datasets
        print("📥 Loading queries and relevance judgments from ir_datasets...")
        dataset = ir_datasets.load(selected_dataset)

        queries = {}
        qrels_dict = {}

        try:
            next(dataset.queries_iter())
            for query in dataset.queries_iter():
                query_id = query.query_id if hasattr(query, 'query_id') else getattr(query, '_id', f'q{len(queries)}')
                if hasattr(query, 'text'):
                    query_text = query.text
                elif hasattr(query, 'title'):
                    query_text = query.title
                elif hasattr(query, 'query'):
                    query_text = query.query
                else:
                    query_text = ""
                    for field in dir(query):
                        if not field.startswith('_') and isinstance(getattr(query, field), str) and field != 'query_id':
                            query_text = getattr(query, field)
                            break

                queries[query_id] = {'text': query_text, 'id': query_id}
                if len(queries) >= 100:
                    break
            print(f"✅ Loaded {len(queries)} queries from ir_datasets")
        except Exception as e:
            print(f"⚠️ Could not load queries from ir_datasets: {e!r}")

        try:
            next(dataset.qrels_iter())
            for qrel in dataset.qrels_iter():
                if not hasattr(qrel, 'query_id') or not hasattr(qrel, 'doc_id') or not hasattr(qrel, 'relevance'):
                    continue
                if qrel.query_id not in qrels_dict:
                    qrels_dict[qrel.query_id] = {}
                qrels_dict[qrel.query_id][qrel.doc_id] = qrel.relevance
            print(f"✅ Loaded relevance judgments for {len(qrels_dict)} queries")
        except Exception as e:
            print(f"⚠️ Could not load relevance judgments from ir_datasets: {e!r}")

        sample_queries = {}
        if queries and qrels_dict:
            for query_id, query_info in queries.items():
                if query_id in qrels_dict and len(sample_queries) < 3:
                    sample_queries[query_id] = query_info
        if not sample_queries and queries:
            for i, (query_id, query_info) in enumerate(queries.items()):
                if i < 3:
                    sample_queries[query_id] = query_info

        print(f"✅ Selected {len(sample_queries)} sample queries for demonstration")

except Exception as e:
    print("❌ Error loading queries and qrels:", e)


📥 Loading queries and relevance judgments from ir_datasets...


[INFO] [starting] opening zip file
[INFO] If you have a local copy of https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/ce62140cb23feb9becf6270d0d1fe6d1
[INFO] [starting] https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip
[INFO] [finished] https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip: [00:14] [73.9MB] [4.93MB/s]
                                                                                                              

✅ Loaded 50 queries from ir_datasets
✅ Loaded relevance judgments for 50 queries
✅ Selected 3 sample queries for demonstration


[INFO] [finished] opening zip file [15.92s]
[INFO] [starting] opening zip file
[INFO] [finished] opening zip file [0ms]
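
For the `custom_mst_site` branch, the loaders above expect a topics JSON file and a four-column qrels text file. The sketch below shows those formats as inferred from the fields the loaders read; the sample topic and document IDs are made up for illustration, and the parsing is a simplified stand-in for the loader functions.

```python
import json

# Hypothetical topics file: either {"topics": [...]} or a bare list is accepted by the loader
sample_topics_json = json.dumps({
    "topics": [
        {"id": 1, "title": "graduate thesis requirements",
         "description": "Where are the thesis formatting rules?", "narrative": ""}
    ]
})

# TREC-style qrels: topic_id, iteration (ignored), doc_id, relevance
sample_qrels_txt = "1 0 page_042 2\n1 0 page_007 0\n"

topics = json.loads(sample_topics_json)["topics"]
queries = {str(t["id"]): t["title"].strip() for t in topics}

qrels = {}
for line in sample_qrels_txt.splitlines():
    parts = line.split()
    if len(parts) == 4:  # lines with any other column count are skipped
        topic_id, _, doc_id, relevance = parts
        qrels.setdefault(topic_id, {})[doc_id] = int(relevance)

print(queries)
print(qrels)
```

Topic IDs are converted to strings on the query side and read as strings on the qrels side, which is what lets `select_sample_queries` match the two dictionaries by key.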


## Initialize retrieval models

Load bi-encoder model for query encoding

In [None]:

biencoder_model_name = "sentence-transformers/msmarco-distilbert-base-v3" # @param ["sentence-transformers/msmarco-distilbert-base-v3", "sentence-transformers/all-mpnet-base-v2", "sentence-transformers/all-MiniLM-L6-v2"]
bi_encoder = SentenceTransformer(biencoder_model_name)
bi_encoder.to(device)
print(f"Loaded bi-encoder model: {biencoder_model_name}")

Loaded bi-encoder model: sentence-transformers/msmarco-distilbert-base-v3


Load cross-encoder model for reranking

In [None]:

crossencoder_model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2" # @param ["cross-encoder/ms-marco-MiniLM-L-6-v2", "cross-encoder/ms-marco-MiniLM-L-12-v2", "cross-encoder/ms-marco-TinyBERT-L-2-v2"]
cross_encoder = CrossEncoder(crossencoder_model_name)
cross_encoder.to(device)
print(f"Loaded cross-encoder model: {crossencoder_model_name}")

Loaded cross-encoder model: cross-encoder/ms-marco-MiniLM-L-6-v2


Load LLM for answer generation

In [None]:
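# Read the Mistral API key from the environment; prompt for it if it is not set.
# Never hard-code a real key in a notebook that may be shared.
from getpass import getpass
if not os.environ.get("MISTRAL_API_KEY"):
    os.environ["MISTRAL_API_KEY"] = getpass("Enter your Mistral API key: ")
print("Mistral API configured and ready to use")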

Mistral API configured and ready to use


## 📏 Domain-Specific Evaluation: Beyond Standard Metrics

Understanding how to evaluate RAG systems properly is critical for real-world applications. Different domains require different evaluation approaches:

### Scientific Research Evaluation
- **Factual Accuracy**: Are all scientific claims supported by the retrieved documents?
- **Citation Quality**: Does the system properly attribute information to sources?
- **Uncertainty Handling**: Does the response appropriately express limitations and caveats?

### Technical Support Evaluation
- **Actionability**: Can a user follow the instructions without additional information?
- **Correctness**: Do the steps actually resolve the technical issue?
- **Safety**: Are proper warnings included for potentially harmful operations?

### Educational Content Evaluation
- **Comprehensibility**: Is the content appropriate for the intended education level?
- **Scaffolding**: Does the explanation build concepts in a logical order?
- **Engagement**: Does the content use appropriate examples and explanations?

### Healthcare Information Evaluation
- **Clinical Accuracy**: Is the medical information correct and up-to-date?
- **Completeness**: Are important warnings, contraindications, or limitations mentioned?
- **Clarity**: Is medical terminology appropriately explained?

### Implementing Custom Evaluation
In this tutorial, we use standard IR metrics (precision, recall, etc.) augmented with domain-specific considerations. In production systems, consider implementing:

1. Human-in-the-loop evaluation pipelines
2. Domain expert review for critical applications
3. Automated checks for domain-specific requirements

Run the next cell to see how we can implement custom evaluation metrics:


In [None]:
import re

def evaluate_scientific_rag(query, retrieved_docs, generated_answer, ground_truth=None):
    """Evaluate a scientific RAG response with domain-specific metrics"""

    # 1. Check for citations in the answer
    citation_pattern = r'\[(\d+)\]|\(([A-Za-z\s]+,\s*\d{4})\)'
    has_citations = bool(re.search(citation_pattern, generated_answer))

    # 2. Check if answer mentions limitations or uncertainty
    # Match whole words/phrases so that e.g. "may" does not fire on "dismay"
    uncertainty_terms = ['may', 'might', 'could', 'possibly', 'suggests',
                         'limited evidence', 'more research', 'not conclusive']
    answer_lower = generated_answer.lower()
    has_uncertainty = any(re.search(rf"\b{re.escape(term)}\b", answer_lower)
                          for term in uncertainty_terms)

    # 3. Check relevance of retrieved documents (simplified)
    # In practice, this might use a trained classifier or human evaluation
    # Guard against an empty result list to avoid a ZeroDivisionError
    if retrieved_docs:
        relevance_score = sum(doc['cross_score'] for doc in retrieved_docs) / len(retrieved_docs)
    else:
        relevance_score = 0.0

    # 4. Calculate a combined scientific quality score (example only)
    # Cross-encoder scores can be negative or large, so clamp the scaled value to [0, 2]
    scientific_quality = (
        (2 if has_citations else 0) +
        (1 if has_uncertainty else 0) +
        max(0.0, min(2.0, relevance_score / 3))
    ) / 5  # Normalize to 0-1

    # Return evaluation results
    return {
        'has_citations': has_citations,
        'acknowledges_uncertainty': has_uncertainty,
        'avg_relevance_score': relevance_score,
        'scientific_quality_score': scientific_quality,
    }

# Example usage with a mock result
example_query = "What is the effectiveness of remdesivir for COVID-19?"
example_docs = [{"cross_score": 7.5}, {"cross_score": 6.8}, {"cross_score": 6.2}]
example_answer = "Studies suggest remdesivir may improve recovery time in some COVID-19 patients, though more research is needed to confirm its effectiveness across different patient populations [1]."

scientific_eval = evaluate_scientific_rag(
    example_query,
    example_docs,
    example_answer
)

print("Scientific Domain Evaluation:")
for metric, value in scientific_eval.items():
    if isinstance(value, bool):
        print(f"- {metric}: {'✅ Yes' if value else '❌ No'}")
    elif isinstance(value, float):
        print(f"- {metric}: {value:.2f}")

### Using Domain-Specific Evaluation in Practice

For each domain, you would develop appropriate evaluation metrics:

1. **Scientific Domain**: Citation practices, uncertainty acknowledgment, evidence quality
2. **Technical Support**: Step clarity, command accuracy, safety considerations
3. **Educational Content**: Grade-level appropriateness, concept scaffolding, engagement
4. **Healthcare Information**: Clinical accuracy, completeness of warnings, clarity

These custom evaluations can be:
- **Automated**: For measurable aspects like presence of citations
- **Semi-automated**: Using classifiers trained on expert-labeled examples
- **Human-in-the-loop**: Expert review for critical applications

In production RAG systems, combining standard IR metrics with these domain-specific evaluations provides a more complete picture of system performance.
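
As an illustration of the automated tier for a second domain, the sketch below checks a technical-support answer for numbered steps and for a warning alongside risky commands. The function name, the list of risky patterns, and the warning words are assumptions chosen for the example, not a vetted safety policy.

```python
import re

# Illustrative patterns only; a real deployment would need a reviewed list
RISKY_PATTERNS = [r"\brm\s+-rf\b", r"\bmkfs\b", r"\bdd\s+if=", r"\bDROP\s+TABLE\b"]
WARNING_WORDS = ["warning", "caution", "back up", "backup", "irreversible"]

def evaluate_support_answer(answer):
    """Heuristic checks for actionability and safety in a support answer."""
    # Count lines that start like "1. ..." or "2) ..."
    steps = re.findall(r"^\s*\d+[.)]\s+\S", answer, flags=re.MULTILINE)
    has_risky = any(re.search(p, answer, flags=re.IGNORECASE) for p in RISKY_PATTERNS)
    has_warning = any(w in answer.lower() for w in WARNING_WORDS)
    return {
        "num_steps": len(steps),
        "has_numbered_steps": len(steps) >= 2,
        "mentions_risky_command": has_risky,
        # A risky command is acceptable only if some warning accompanies it
        "safety_ok": (not has_risky) or has_warning,
    }

example = (
    "1. Back up your data first.\n"
    "2. Run rm -rf ./build to clear the cache.\n"
    "3. Rebuild the project."
)
print(evaluate_support_answer(example))
```

Checks like these only cover the measurable surface of an answer; whether the steps actually resolve the issue still needs the semi-automated or human review tiers described above.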

In [None]:
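# Measure objects used below (harmless if already imported earlier in the notebook)
from ir_measures import nDCG, P, R, AP, RR, Rprec, RBP, Judged, infAP, ERR

def get_domain_metrics(domain):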
    """Return appropriate evaluation metrics for the selected domain."""
    # Base metrics for all domains
    base_metrics = [
        nDCG@5, P@5, R@5, AP
    ]

    # Domain-specific additional metrics
    domain_specific = {
        "Scientific Research": [
            nDCG@10,         # More comprehensive literature review (deeper results)
            R@10,            # Higher recall important for research
            RR               # First relevant result important for fact-checking
        ],
        "Technical Support": [
            RR,              # First relevant result critical for troubleshooting
            P@1, P@3,        # Precision at very top results important
            Rprec            # Balances precision/recall
        ],
        "Education & Library": [
            nDCG@10,         # Students often explore more results
            RBP(p=0.8),      # Models patience of students browsing results
            Judged@10        # Coverage of assessment for educational content
        ],
        "Legal Document Analysis": [
            RBP(p=0.9),      # High persistence: models a reader who works far down the ranking
            nDCG@20,         # May need to review many documents for legal research
            infAP            # Handles incomplete judgments common in legal collections
        ],
        "Healthcare Information": [
            P@1, P@3,        # Top results critical for medical information
            RR,              # First relevant result important for clinical questions
            ERR              # Models utility with diminishing returns
        ],
        "Campus Info": [
            P@1, P@3,        # Precise information location important
            RR,              # First relevant result critical
            nDCG@5           # Overall relevance ranking
        ]
    }

    return base_metrics + domain_specific.get(domain, [])
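
The `p` parameter of rank-biased precision is easiest to read from its formula, RBP = (1 - p) · Σ rel_i · p^(i-1): a larger `p` models a user who keeps reading further down the ranking. A small pure-Python sketch with binary relevance follows; the helper `rbp` is written here for illustration and is independent of `ir_measures`.

```python
def rbp(relevances, p):
    """Rank-biased precision for a ranked list of binary relevance labels."""
    # enumerate starts at 0, so p ** rank equals p^(i-1) for 1-based rank i
    return (1 - p) * sum(rel * p ** rank for rank, rel in enumerate(relevances))

ranking = [1, 0, 1, 0, 0]  # relevant documents at ranks 1 and 3
for p in (0.5, 0.8, 0.9):
    print(f"p={p}: RBP={rbp(ranking, p):.3f}")
```

On a short ranking like this one, a higher `p` gives a lower score because more of the total weight is reserved for deeper ranks that were never returned.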


## Define RAG functions

In [None]:

def retrieve_documents(query, top_k_first_stage=100, top_k_reranked=5):
    """
    Two-stage retrieval:
    1. Retrieve top_k_first_stage documents using pre-built FAISS index
    2. Rerank with cross-encoder to get final top_k_reranked
    """
    # 1. First-stage retrieval with biencoder + FAISS
    # Encode the query using the bi-encoder model
    query_embedding = bi_encoder.encode(query, show_progress_bar=False, convert_to_numpy=True)
    query_embedding = np.array([query_embedding], dtype=np.float32)
    faiss.normalize_L2(query_embedding)  # Normalize for cosine similarity

    # Search the FAISS index
    scores, indices = index.search(query_embedding, k=min(top_k_first_stage, index.ntotal))

    # 2. Reranking with cross-encoder
    cross_encoder_candidates = []
    for idx in indices[0]:  # indices comes as a 2D array
        doc_id = doc_ids[idx]
        doc_info = corpus[doc_id]
        title = doc_info.get('title', '')
        text = doc_info['text']
        combined_text = f"{title}. {text}" if title else text
        cross_encoder_candidates.append([query, combined_text])

    # Score with cross-encoder
    cross_scores = cross_encoder.predict(cross_encoder_candidates)

    # Sort by cross-encoder scores
    cross_results = []
    for i, idx in enumerate(indices[0]):
        doc_id = doc_ids[idx]
        biencoder_score = float(scores[0][i])  # Convert from numpy float to Python float
        cross_score = float(cross_scores[i])   # Ensure Python float
        cross_results.append((doc_id, biencoder_score, cross_score))

    # Sort by cross-encoder score
    cross_results = sorted(cross_results, key=lambda x: x[2], reverse=True)[:top_k_reranked]

    # Format final results
    results = []
    for doc_id, biencoder_score, cross_score in cross_results:
        doc_info = corpus[doc_id]
        title = doc_info.get('title', '')
        text = doc_info['text']
        combined_text = f"{title}. {text}" if title else text
        results.append({
            'doc_id': doc_id,
            'biencoder_score': biencoder_score,
            'cross_score': cross_score,
            'text': combined_text
        })

    return results
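
The retrieve-then-rerank pattern can be seen in isolation on toy data. The sketch below uses plain Python: the first stage ranks by inner product of L2-normalized vectors (which equals cosine similarity), and the second stage reorders the shortlist with a stand-in scoring function. The vectors and `toy_rerank_score` (simple word overlap) are invented for illustration; the notebook uses the bi-encoder, FAISS, and the cross-encoder instead.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

# doc_id -> (text, made-up embedding)
docs = {
    "d1": ("covid vaccine trial results", [0.9, 0.1, 0.0]),
    "d2": ("vaccine storage temperature", [0.7, 0.7, 0.0]),
    "d3": ("campus parking permits", [0.0, 0.1, 0.9]),
}
query_text = "vaccine storage temperature limits"
query_vec = normalize([1.0, 0.2, 0.0])

# Stage 1: cheap vector similarity over the whole collection, keep a shortlist of 2
first_stage = sorted(
    ((doc_id, inner(query_vec, normalize(vec))) for doc_id, (_, vec) in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)[:2]

# Stage 2: a more expensive scorer, applied only to the shortlist
def toy_rerank_score(query, text):
    return len(set(query.split()) & set(text.split()))

reranked = sorted(
    first_stage,
    key=lambda pair: toy_rerank_score(query_text, docs[pair[0]][0]),
    reverse=True,
)

print([doc_id for doc_id, _ in first_stage])
print([doc_id for doc_id, _ in reranked])
```

The shortlist order and the reranked order differ here, which is the point of the second stage: the first stage only has to get the right documents into the candidate set.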

In [None]:
# Define the Mistral API function for answer generation
def generate_answer(query, context, domain_prompt="", max_length=500):
    """Generate an answer using the Mistral API based on retrieved documents and domain-specific prompting"""

    # Read the API key from the MISTRAL_API_KEY environment variable
    # (get a key from Mistral AI; avoid pasting it directly into the notebook)
    api_key = os.environ.get("MISTRAL_API_KEY")
    if not api_key:
        raise ValueError("MISTRAL_API_KEY environment variable not set. Please set it before calling this function.")

    # Mistral API endpoint
    api_url = "https://api.mistral.ai/v1/chat/completions"

    # Create the messages list
    messages = []

    # Add system message with domain-specific prompt if provided
    if domain_prompt:
        messages.append({
            "role": "system",
            "content": domain_prompt.strip()
        })

    # Add user message with context and query
    messages.append({
        "role": "user",
        "content": f"Context:\n{context.strip()}\n\nQuestion: {query.strip()}"
    })

    # Prepare the API request payload
    payload = {
        "model": "mistral-large-latest",  # You can change to other Mistral models as needed
        "messages": messages,
        "max_tokens": max_length,
        "temperature": 0.7,
        "top_p": 0.9,
    }

    # Set up headers with API key
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    try:
        # Make the API request
        response = requests.post(api_url, headers=headers, json=payload, timeout=60)  # Avoid hanging indefinitely
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Parse the JSON response
        response_data = response.json()

        # Extract the generated text
        generated_text = response_data["choices"][0]["message"]["content"]

        return generated_text

    except Exception as e:
        print(f"Error calling Mistral API: {e}")
        # Fallback to a simple response if API call fails
        return f"I couldn't generate a response using the Mistral API due to an error: {str(e)}"

In [None]:
def run_rag_pipeline(query, query_id=None, top_k_first_stage=100, top_k_reranked=5, domain_prompt="", show_timings=False):
    """Run the full RAG pipeline with Mistral API for generation"""
    # Timing dictionary
    timings = {}

    # 1. Retrieve and rerank documents
    start_time = time.time()
    retrieved_docs = retrieve_documents(query, top_k_first_stage, top_k_reranked)
    timings['retrieval'] = time.time() - start_time

    # 2. Format context for LLM
    context = ""
    for i, doc in enumerate(retrieved_docs):
        context += f"Document {i+1} [Score: {doc['cross_score']:.3f}]:\n{doc['text']}\n\n"

    # 3. Generate answer with Mistral API
    start_time = time.time()
    answer = generate_answer(query, context, domain_prompt)
    timings['generation'] = time.time() - start_time

    # Total time
    timings['total'] = timings['retrieval'] + timings['generation']

    # Create run (for ir_measures evaluation)
    run = {}
    if query_id is not None:
        run[query_id] = {doc['doc_id']: float(doc['cross_score']) for doc in retrieved_docs}

    return retrieved_docs, answer, timings, run
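
To make the two data structures produced by the pipeline concrete, here is a small standalone sketch of the context string and the `run` dictionary. The helper names `format_context` and `build_run` are hypothetical, and the document texts are shortened placeholders; only the `doc_id` and `cross_score` values are taken from the results shown later in this section.

```python
def format_context(docs):
    """Join reranked documents into the context block passed to the LLM."""
    parts = [
        f"Document {i} [Score: {doc['cross_score']:.3f}]:\n{doc['text']}\n\n"
        for i, doc in enumerate(docs, start=1)
    ]
    return "".join(parts)


def build_run(query_id, docs):
    """Build the nested run structure: query_id -> {doc_id -> score}."""
    return {query_id: {doc["doc_id"]: float(doc["cross_score"]) for doc in docs}}


docs = [
    {"doc_id": "4dtk1kyh", "cross_score": 9.036142349243164, "text": "(text omitted)"},
    {"doc_id": "utsr0zv7", "cross_score": 8.175335884094238, "text": "(text omitted)"},
]

context = format_context(docs)
run = build_run("1", docs)
print(context)
print(run)
```

Note that the run is keyed first by query ID and then by document ID; the evaluation function expects exactly this nesting.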


In [None]:
import ir_measures
from ir_measures import nDCG, P, R, AP, RR

def evaluate_with_ir_measures(run, qrels, metrics_list=None):
    """
    Evaluate a run using ir_measures library

    Args:
        run: Dict mapping query_id -> {doc_id -> score}
        qrels: Dict mapping query_id -> {doc_id -> relevance}
        metrics_list: List of ir_measures metric objects (default: common metrics)

    Returns:
        Dict of metric results
    """
    if metrics_list is None:
        # Define default metrics to evaluate
        metrics_list = [
            nDCG@5, nDCG@10,       # Normalized Discounted Cumulative Gain
            P@5, P@10,             # Precision at k
            R@5, R@10,             # Recall at k
            AP,                    # Average Precision
            RR                      # Reciprocal Rank
        ]

    # Calculate aggregate metrics
    try:
        results = ir_measures.calc_aggregate(metrics_list, qrels, run)
        return results
    except Exception as e:
        print(f"Error during evaluation: {e}")
        return {}
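
To build intuition for what two of these metrics measure, the sketch below computes P@k and RR by hand for a single query. It is a simplified illustration in plain Python, not the `ir_measures` implementation: it assumes binary relevance (any judgment above 0 counts as relevant), and the document IDs, scores, and judgments are made up.

```python
def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def reciprocal_rank(ranked_doc_ids, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical run and qrels for one query
run = {"q1": {"d1": 0.9, "d2": 0.8, "d3": 0.4, "d4": 0.3, "d5": 0.1}}
qrels = {"q1": {"d2": 1, "d5": 2, "d9": 1}}

# Rank documents by descending score, then collect the relevant IDs
ranked = sorted(run["q1"], key=run["q1"].get, reverse=True)
relevant = {doc_id for doc_id, rel in qrels["q1"].items() if rel > 0}

p5 = precision_at_k(ranked, relevant, 5)
rr = reciprocal_rank(ranked, relevant)
print(f"P@5 = {p5:.2f}, RR = {rr:.2f}")
```

Here two of the five retrieved documents are relevant and the first relevant one sits at rank 2, so P@5 is 0.40 and RR is 0.50.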

In [None]:
import html

from IPython.display import display, HTML, Markdown

def display_rag_results(query, retrieved_docs, answer, timings=None, metrics=None, use_case=None):
    """Display RAG results in a formatted way with domain-specific formatting"""

    # Escape user-supplied text so characters such as < or & do not break the HTML
    title = html.escape(use_case) if use_case else 'Query'
    display(HTML(f"<h2 style='color:#eee;'>RAG Results for {title}</h2>"))
    display(HTML(f"<h3 style='color:#ccc;'>📝 Query: {html.escape(query)}</h3>"))

    # Display metrics if available
    if metrics:
        display(HTML("<h3 style='color:#ccc;'>📊 Retrieval Metrics:</h3>"))
        metrics_html = "<table style='color:#ddd; border-collapse: collapse;'>"
        metrics_html += "<tr><th style='padding: 4px 10px;'>Metric</th><th style='padding: 4px 10px;'>Value</th></tr>"
        for metric_name, metric_value in metrics.items():
            metrics_html += f"<tr><td style='padding: 4px 10px;'>{metric_name}</td><td style='padding: 4px 10px;'>{metric_value:.4f}</td></tr>"
        metrics_html += "</table>"
        display(HTML(metrics_html))

    # Display retrieved documents
    display(HTML("<h3 style='color:#ccc;'>📚 Retrieved Documents:</h3>"))
    for i, doc in enumerate(retrieved_docs):
        display(Markdown(f"**Document {i+1}:**"))
        display(Markdown(f"- **Biencoder Score:** {doc['biencoder_score']:.3f}"))
        display(Markdown(f"- **Cross-Encoder Score:** {doc['cross_score']:.3f}"))
        display(Markdown(f"- **Document ID:** {doc['doc_id']}"))
        display(Markdown(f"- **Text:** {doc['text'][:300]}..."))
        print()

    # Display generated answer
    display(HTML("<h3 style='color:#ccc;'>🤖 Generated Answer:</h3>"))

    # Apply consistent dark-mode-friendly, domain-specific styling
    color_map = {
        "Scientific Research": "#3498db",
        "Technical Support": "#2ecc71",
        "Education & Library": "#f39c12",
        "Legal Document Analysis": "#9b59b6",
        "Healthcare Information": "#e74c3c",
        "Campus Info": "#16a085"
    }
    border_color = color_map.get(use_case, "#7f8c8d")  # default gray

    answer_html = f'''
    <div style="
        border-left: 5px solid {border_color};
        padding: 12px 16px;
        margin: 10px 0;
        background-color: #1e1e1e;
        color: #e0e0e0;
        border-radius: 8px;
        font-size: 15px;
        line-height: 1.6;">
        {answer}
    </div>
    '''

    display(HTML(answer_html))

    # Display timings if available
    if timings:
        display(HTML("<h3 style='color:#ccc;'>⏱️ Performance Metrics:</h3>"))
        timings_html = "<table style='color:#ddd; border-collapse: collapse;'>"
        timings_html += "<tr><th style='padding: 4px 10px;'>Metric</th><th style='padding: 4px 10px;'>Time (seconds)</th></tr>"
        timings_html += f"<tr><td style='padding: 4px 10px;'>Retrieval time</td><td style='padding: 4px 10px;'>{timings['retrieval']:.3f}</td></tr>"
        timings_html += f"<tr><td style='padding: 4px 10px;'>Generation time</td><td style='padding: 4px 10px;'>{timings['generation']:.3f}</td></tr>"
        timings_html += f"<tr><td style='padding: 4px 10px;'>Total RAG time</td><td style='padding: 4px 10px;'>{timings['total']:.3f}</td></tr>"
        timings_html += "</table>"
        display(HTML(timings_html))


## Run RAG on Sample Queries for the Selected Use Case

In [None]:
print(f"\n=== Sample Queries from {use_case} Use Case ===\n")


=== Sample Queries from Scientific Research Use Case ===



Get domain-specific prompt

In [None]:
domain_prompt = use_cases[use_case]["domain_prompt"]
domain_metrics = get_domain_metrics(use_case)

In [None]:
print(domain_prompt)
print(domain_metrics)

You are a scientific research assistant. Provide a clear, accurate, and evidence-based answer to the user's question using only the retrieved documents. Cite supporting information from the context explicitly, and include appropriate scientific context, limitations, and caveats where applicable. Do not speculate beyond the provided material.
[nDCG@5, P@5, R@5, AP, nDCG@10, R@10, RR]


Store results for each sample query

In [None]:
sample_results = []
combined_run = {}  # For evaluation across all queries

In [None]:
for query_id, query_info in sample_queries.items():
    query_text = query_info['text']
    print(f"Running RAG for query: '{query_text}'")

    # Run the RAG pipeline with domain-specific prompt
    retrieved_docs, answer, timings, run = run_rag_pipeline(
        query_text,
        query_id=query_id,
        top_k_first_stage=100,
        top_k_reranked=5,
        domain_prompt=domain_prompt
    )

    # Combine run for overall evaluation
    combined_run.update(run)

    # Store results
    sample_results.append({
        'query_id': query_id,
        'query_text': query_text,
        'retrieved_docs': retrieved_docs,
        'answer': answer,
        'timings': timings,
        'run': run
    })

    # Display a brief summary
    print(f"- Retrieved {len(retrieved_docs)} documents")
    print(f"- Answer length: {len(answer)} characters")
    print("-----------------------------------")

Running RAG for query: 'what is the origin of COVID-19'
- Retrieved 5 documents
- Answer length: 1551 characters
-----------------------------------
Running RAG for query: 'how does the coronavirus respond to changes in the weather'
- Retrieved 5 documents
- Answer length: 1456 characters
-----------------------------------
Running RAG for query: 'will SARS-CoV2 infected people develop immunity? Is cross protection possible?'
- Retrieved 5 documents
- Answer length: 1637 characters
-----------------------------------


## ⚠️ **Note on Performance Without GPU**


Running the full RAG pipeline without GPU access can be very slow: in the precomputed run below, each query took roughly 10 to 16 minutes in total.

To keep the tutorial responsive, we’ve included precomputed results below, so you can work through the rest of the notebook without waiting.

If you run the actual pipeline yourself (once you have GPU access or more time), comment out the cell below so that it does not overwrite the `sample_results` produced by your own run.

In [None]:
sample_results = [
  {
    'query_id': '1',
    'query_text': 'what is the origin of COVID-19',
    'retrieved_docs': [
      {
        'doc_id': '4dtk1kyh',
        'biencoder_score': 0.7120170593261719,
        'cross_score': 9.036142349243164,
        'text': 'Origin of Novel Coronavirus (COVID-19): A Computational Biology Study using Artificial Intelligence. Origin of the COVID-19 virus has been intensely debated in the scientific community since the first infected cases were detected in December 2019. The disease has caused a global pandemic, leading to deaths of thousands of people across the world and thus finding origin of this novel coronavirus is important in responding and controlling the pandemic. Recent research results suggest that bats or pangolins might be the original hosts for the virus based on comparative studies using its genomic sequences. This paper investigates the COVID-19 virus origin by using artificial intelligence (AI) and raw genomic sequences of the virus. More than 300 genome sequences of COVID-19 infected cases collected from different countries are explored and analysed using unsupervised clustering methods. The results obtained from various AI-enabled experiments using clustering algorithms demonstrate that all examined COVID-19 virus genomes belong to a cluster that also contains bat and pangolin coronavirus genomes. This provides evidences strongly supporting scientific hypotheses that bats and pangolins are probable hosts for the COVID-19 virus. At the whole genome analysis level, our findings also indicate that bats are more likely the hosts for the COVID-19 virus than pangolins.'
      },
      {
        'doc_id': 'utsr0zv7',
        'biencoder_score': 0.7024656534194946,
        'cross_score': 8.175335884094238,
        'text': 'The Human Coronavirus Disease COVID-19: Its Origin, Characteristics, and Insights into Potential Drugs and Its Mechanisms. The emerging coronavirus disease (COVID-19) swept across the world, affecting more than 200 countries and territories. Genomic analysis suggests that the COVID-19 virus originated in bats and transmitted to humans through unknown intermediate hosts in the Wuhan seafood market, China, in December of 2019. This virus belongs to the Betacoronavirus group, the same group of the 2003 severe acute respiratory syndrome coronavirus (SARS-CoV), and for the similarity, it was named SARS-CoV-2. Given the lack of registered clinical therapies or vaccines, many physicians and scientists are investigating previously used clinical drugs for COVID-19 treatment. In this review, we aim to provide an overview of the CoVs origin, pathogenicity, and genomic structure, with a focus on SARS-CoV-2. Besides, we summarize the recently investigated drugs that constitute an option for COVID-19 treatment.'
      },
      {
        'doc_id': 'v99vlnox',
        'biencoder_score': 0.6849621534347534,
        'cross_score': 8.007100105285645,
        'text': 'COVID-19 in South Korea. A novel coronavirus (severe acute respiratory syndrome-CoV-2) that initially originated from Wuhan, China, in December 2019 has already caused a pandemic. While this novel coronavirus disease (COVID-19) frequently induces mild diseases, it has also generated severe diseases among certain populations, including older-aged individuals with underlying diseases, such as cardiovascular disease and diabetes. As of 31 March 2020, a total of 9786 confirmed cases with COVID-19 have been reported in South Korea. South Korea has the highest diagnostic rate for COVID-19, which has been the major contributor in overcoming this outbreak. We are trying to reduce the reproduction number of COVID-19 to less than one and eventually succeed in controlling this outbreak using methods such as contact tracing, quarantine, testing, isolation, social distancing and school closure. This report aimed to describe the current situation of COVID-19 in South Korea and our response to this outbreak.'
      },
      {
        'doc_id': 'nvofyg16',
        'biencoder_score': 0.6862881779670715,
        'cross_score': 7.909483909606934,
        'text': 'Covid-19 in South Korea.. A novel coronavirus (severe acute respiratory syndrome-CoV-2) that initially originated from Wuhan, China, in December 2019 has already caused a pandemic. While this novel coronavirus disease (covid-19) frequently induces mild diseases, it has also generated severe diseases among certain populations, including older-aged individuals with underlying diseases, such as cardiovascular disease and diabetes. As of 31 March 2020, a total of 9786 confirmed cases with covid-19 have been reported in South Korea. South Korea has the highest diagnostic rate for covid-19, which has been the major contributor in overcoming this outbreak. We are trying to reduce the reproduction number of covid-19 to less than one and eventually succeed in controlling this outbreak using methods such as contact tracing, quarantine, testing, isolation, social distancing and school closure. This report aimed to describe the current situation of covid-19 in South Korea and our response to this outbreak.'
      },
      {
        'doc_id': 'sh7lrdou',
        'biencoder_score': 0.7032864093780518,
        'cross_score': 7.851293087005615,
        'text': 'The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak. Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. Based on the large number of infected people that were exposed to the wet animal market in Wuhan City, China, it is suggested that this is likely the zoonotic origin of COVID-19. Person-to-person transmission of COVID-19 infection led to the isolation of patients that were subsequently administered a variety of treatments. Extensive measures to reduce person-to-person transmission of COVID-19 have been implemented to control the current outbreak. Special attention and efforts to protect or reduce transmission should be applied in susceptible populations including children, health care providers, and elderly people. In this review, we highlights the symptoms, epidemiology, transmission, pathogenesis, phylogenetic analysis and future directions to control the spread of this fatal disease.'
      }
    ],
    'answer': 'Scientific research assistants are responsible for conducting thorough research and analyzing scientific data to answer specific scientific questions. They work closely with their supervisors and peers to develop and refine research strategies, conduct experiments, collect and analyze data, write and edit research papers, present their work at scientific conferences and publish their results in peer-reviewed journals. Their work is crucial in advancing scientific knowledge and improving human health and wellbeing.',
    'timings': {
      'retrieval': 34.01727652549744,
      'generation': 595.5995147228241,
      'total': 629.6167912483215
    },
    'run': {
      '1': {
        '4dtk1kyh': 9.036142349243164,
        'utsr0zv7': 8.175335884094238,
        'v99vlnox': 8.007100105285645,
        'nvofyg16': 7.909483909606934,
        'sh7lrdou': 7.851293087005615
      }
    }
  },
  {
    'query_id': '2',
    'query_text': 'how does the coronavirus respond to changes in the weather',
    'retrieved_docs': [
      {
        'doc_id': 'w7ycc07b',
        'biencoder_score': 0.4392872452735901,
        'cross_score': 3.4371156692504883,
        'text': "Does weather affect the growth rate of COVID-19, a study to comprehend transmission dynamics on human health. Abstract The undefendable outbreak of novel coronavirus (SARS-COV-2) lead to a global health emergency due to its higher transmission rate and longer symptomatic duration, created a health surge in a short time. Since Nov 2019 the outbreak in China, the virus is spreading exponentially everywhere. The current study focuses on the relationship between environmental parameters and the growth rate of COVID-19. The statistical analysis suggests that the temperature changes retarded the growth rate and found that -6.28°C and +14.51°C temperature is the favorable range for COVID-19 growth. Gutenberg- Richter's relationship is used to estimate the mean daily rate of exceedance of confirmed cases concerning the change in temperature. Indeed, temperature is the most influential parameter that reduces the growth at the rate of 13-17 cases/day with a 1°C rise in temperature."
      },
      {
        'doc_id': 'w5kjmw88',
        'biencoder_score': 0.45745065808296204,
        'cross_score': 3.3182454109191895,
        'text': 'Weathering the pandemic: How the Caribbean Basin can use viral and environmental patterns to predict, prepare, and respond to COVID-19. The 2020 coronavirus pandemic is developing at different paces throughout the world. Some areas, like the Caribbean Basin, have yet to see the virus strike at full force. When it does, there is reasonable evidence to suggest the consequent COVID-19 outbreaks will overwhelm healthcare systems and economies. This is particularly concerning in the Caribbean as pandemics can have disproportionately higher mortality impacts on lower and middle-income countries. Preliminary observations from our team and others suggest that temperature and climatological factors could influence the spread of this novel coronavirus, making spatiotemporal predictions of its infectiousness possible. This review studies geographic and time-based distribution of known respiratory viruses in the Caribbean Basin in an attempt to foresee how the pandemic will develop in this region. This review is meant to aid in planning short- and long-term interventions to manage outbreaks at the international, national, and subnational levels in the region.'
      },
      {
        'doc_id': 'gan10za0',
        'biencoder_score': 0.4419426918029785,
        'cross_score': 3.288236379623413,
        'text': 'Weathering the pandemic: How the Caribbean Basin can use viral and environmental patterns to predict, prepare and respond to COVID‐19. The 2020 coronavirus pandemic is developing at different paces throughout the world. Some areas, like the Caribbean Basin, have yet to see the virus strike at full force. When it does, there is reasonable evidence to suggest the consequent COVID‐19 outbreaks will overwhelm healthcare systems and economies. This is particularly concerning in the Caribbean as pandemics can have disproportionately higher mortality impacts on lower and middle income countries. Preliminary observations from our team and others suggest that temperature and climatological factors could influence the spread of this novel coronavirus, making spatiotemporal predictions of its infectiousness possible. This review studies geographic and time‐based distribution of known respiratory viruses in the Caribbean Basin in an attempt to foresee how the pandemic will develop in this region. This review is meant to aid in planning short‐ and long‐term interventions to manage outbreaks at the international, national and sub‐national levels in the region. This article is protected by copyright. All rights reserved.'
      },
      {
        'doc_id': 'pdww20r4',
        'biencoder_score': 0.48396408557891846,
        'cross_score': 2.5067882537841797,
        'text': 'The behaviour changes in response to COVID-19 pandemic within Malaysia. The novel coronavirus infection, COVID-19, is a pandemic that currently affects the whole world. During this period, Malaysians displayed a variety of behaviour changes as a response to COVID-19, including panic buying, mass travelling during movement restriction and even absconding from treatment facilities. This article attempts to explore some of these behaviour changes from a behaviourist perspective in order to get a better understanding of the rationale behind the changes.'
      },
      {
        'doc_id': 'zespmk29',
        'biencoder_score': 0.47616642713546753,
        'cross_score': 2.3425703048706055,
        'text': 'How diseases rise and fall with the seasons—and what it could mean for coronavirus. Scientists and doctors have observed for thousands of years that some diseases, such as polio and influenza, rise and fall with the seasons But why? Ongoing research in animals and humans suggests a variety of causes, including changes in the environment (like pH, temperature, and humidity) and even seasonal and daily changes to our own immune systems Figuring out those answers could one day make all the difference in minimizing the impact of infectious disease outbreaks—such as coronavirus disease 2019'
      }
    ],
    'answer': 'Scientists have been studying the behavioral changes that have emerged during the ongoing global crisis caused by the novel corona virus (sars-cov2). The article "The Behavior Changes in Responses of Malaysian Citizens Towards the Corona Virus Pandemic" provides a comprehensive analysis of how people have responded to this crisis. It explores the reasons behind these changes and how they may impact the future of public health and disease prevention. Additionally, "How Diseases Rise and Fall With the Seasons—And What It Could Mean for Coronovirus" discusses the potential implications of weather patterns on disease transmission. Overall, these articles provide valuable insights',
    'timings': {
      'retrieval': 20.592284440994263,
      'generation': 817.0070235729218,
      'total': 837.599308013916
    },
    'run': {
      '2': {
        'w7ycc07b': 3.4371156692504883,
        'w5kjmw88': 3.3182454109191895,
        'gan10za0': 3.288236379623413,
        'pdww20r4': 2.5067882537841797,
        'zespmk29': 2.3425703048706055
      }
    }
  },
  {
    'query_id': '3',
    'query_text': 'will SARS-CoV2 infected people develop immunity? Is cross protection possible?',
    'retrieved_docs': [
      {
        'doc_id': 'eo4ehcjv',
        'biencoder_score': 0.6343124508857727,
        'cross_score': 5.052664756774902,
        'text': "Children's vaccines do not induce cross reactivity against SARS-CoV.. In contrast with adults, children infected by severe acute respiratory syndrome-corona virus (SARS-CoV) develop milder clinical symptoms. Because of this, it is speculated that children vaccinated with various childhood vaccines might develop cross immunity against SARS-CoV. Antisera and T cells from mice immunised with various vaccines were used to determine whether they developed cross reactivity against SARS-CoV. The results showed no marked cross reactivity against SARS-CoV, which implies that the reduced symptoms among children infected by SARS-CoV may be caused by other factors."
      },
      {
        'doc_id': 'yigj0u3n',
        'biencoder_score': 0.6495019197463989,
        'cross_score': 4.585306167602539,
        'text': 'Serologic cross-reactivity of SARS-CoV-2 with endemic and seasonal Betacoronaviruses. In order to properly understand the spread of SARS-CoV-2 infection and development of humoral immunity, researchers have evaluated the presence of serum antibodies of people worldwide experiencing the pandemic. These studies rely on the use of recombinant proteins from the viral genome in order to identify serum antibodies that recognize SARS-CoV-2 epitopes. Here, we discuss the cross-reactivity potential of SARS-CoV-2 antibodies with the full spike proteins of four other Betacoronaviruses that cause disease in humans, MERS-CoV, SARS-CoV, HCoV-OC43, and HCoV-HKU1. Using enzyme-linked immunosorbent assays (ELISAs), we detected the potential cross-reactivity of antibodies against SARS-CoV-2 towards the four other coronaviruses, with the strongest cross-recognition between SARS-CoV-2 and SARS /MERS-CoV antibodies, as expected based on sequence homology of their respective spike proteins. Further analysis of cross-reactivity could provide informative data that could lead to intelligently designed pan-coronavirus therapeutics or vaccines.'
      },
      {
        'doc_id': 'buwz6lu3',
        'biencoder_score': 0.6658353805541992,
        'cross_score': 4.557277202606201,
        'text': 'Lack of cross-neutralization by SARS patient sera towards SARS-CoV-2. Despite initial findings indicating that SARS-CoV and SARS-CoV-2 are genetically related belonging to the same virus species and that the two viruses used the same entry receptor, angiotensin-converting enzyme 2 (ACE2), our data demonstrated that there is no detectable cross-neutralization by SARS patient sera against SARS-CoV-2. We also found that there are significant levels of neutralizing antibodies in recovered SARS patients 9-17 years after initial infection. These findings will be of significant use in guiding the development of serologic tests, formulating convalescent plasma therapy strategies, and assessing the longevity of protective immunity for SARS-related coronaviruses in general as well as vaccine efficacy.'
      },
      {
        'doc_id': '8i1u1a9t',
        'biencoder_score': 0.6663722991943359,
        'cross_score': 4.5535807609558105,
        'text': 'Lack of cross-neutralization by SARS patient sera towards SARS-CoV-2. Despite initial findings indicating that SARS-CoV and SARS-CoV-2 are genetically related belonging to the same virus species and that the two viruses used the same entry receptor, angiotensin-converting enzyme 2 (ACE2), our data demonstrated that there is no detectable cross-neutralization by SARS patient sera against SARS-CoV-2. We also found that there are significant levels of neutralizing antibodies in recovered SARS patients 9–17 years after initial infection. These findings will be of significant use in guiding the development of serologic tests, formulating convalescent plasma therapy strategies, and assessing the longevity of protective immunity for SARS-related coronaviruses in general as well as vaccine efficacy.'
      },
      {
        'doc_id': 't3sjv4hv',
        'biencoder_score': 0.6718544960021973,
        'cross_score': 4.239818572998047,
        'text': 'SARS-CoV-2 infection protects against rechallenge in rhesus macaques. An understanding of protective immunity to SARS-CoV-2 is critical for vaccine and public health strategies aimed at ending the global COVID-19 pandemic. A key unanswered question is whether infection with SARS-CoV-2 results in protective immunity against re-exposure. We developed a rhesus macaque model of SARS-CoV-2 infection and observed that macaques had high viral loads in the upper and lower respiratory tract, humoral and cellular immune responses, and pathologic evidence of viral pneumonia. Following initial viral clearance, animals were rechallenged with SARS-CoV-2 and showed 5 log10 reductions in median viral loads in bronchoalveolar lavage and nasal mucosa compared with primary infection. Anamnestic immune responses following rechallenge suggested that protection was mediated by immunologic control. These data show that SARS-CoV-2 infection induced protective immunity against re-exposure in nonhuman primates.'
      }
    ],
    'answer': 'Certainly! The evidence presented in these documents supports the idea that people who have recovered from Sars-Cov2 may develop some level of immuno-tolerance against the virus. However, the specific mechanisms by which this occurs are not yet fully understood. Cross-protection, or the ability of someone who has been exposed to a particular pathogen to develop protection against subsequent infections, is a complex process that involves a variety of factors, including the type and severity of exposure, genetic makeup of the individual, host response to antigenic stimulation (e.g. Immune activation, cytokine release), and the duration and strength of protection afforded by the previous inoc',
    'timings': {
      'retrieval': 26.644312143325806,
      'generation': 955.8774149417877,
      'total': 982.5217270851135
    },
    'run': {
      '3': {
        'eo4ehcjv': 5.052664756774902,
        'yigj0u3n': 4.585306167602539,
        'buwz6lu3': 4.557277202606201,
        '8i1u1a9t': 4.5535807609558105,
        't3sjv4hv': 4.239818572998047
      }
    }
  }
]
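
Results in this shape can be written to disk once and reloaded in later sessions instead of being pasted into a cell. The sketch below shows one way to do that with the standard `json` module; the helper names, the file name, and the shortened example entry are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_results(results, path):
    """Write a list of result dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)


def load_results(path):
    """Read a list of result dictionaries back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Shortened, hypothetical entry with the same keys as sample_results
example_results = [
    {
        "query_id": "1",
        "query_text": "what is the origin of COVID-19",
        "retrieved_docs": [
            {"doc_id": "4dtk1kyh", "cross_score": 9.036142349243164, "text": "(text omitted)"}
        ],
        "answer": "(answer omitted)",
        "timings": {"retrieval": 1.0, "generation": 2.0, "total": 3.0},
        "run": {"1": {"4dtk1kyh": 9.036142349243164}},
    }
]

# Hypothetical file location in the system temp directory
path = os.path.join(tempfile.gettempdir(), "sample_results_example.json")
save_results(example_results, path)
restored = load_results(path)
print(restored == example_results)
```

Because the query IDs and document IDs are already strings, the nested `run` dictionary survives the JSON round trip unchanged.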

In [None]:
combined_run = {}

# Rebuild the nested run structure expected by ir_measures:
# query_id -> {doc_id -> score}
for result in sample_results:
    for qid, doc_scores in result["run"].items():
        combined_run[qid] = dict(doc_scores)


Evaluate all sample queries using ir_measures with domain-specific metrics

In [None]:
if combined_run and qrels_dict:
    print("\n=== Evaluation with ir_measures ===\n")

    # Calculate metrics
    metrics_results = evaluate_with_ir_measures(combined_run, qrels_dict, domain_metrics)

    # Display results
    print("Overall Evaluation Results:")
    for metric, value in metrics_results.items():
        print(f"  - {metric}: {value:.4f}")


=== Evaluation with ir_measures ===

Overall Evaluation Results:
  - R@5: 0.0005
  - RR: 0.0600
  - P@5: 0.0480
  - nDCG@5: 0.0443
  - AP: 0.0005
  - R@10: 0.0005
  - nDCG@10: 0.0288
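
These values look very low for documents that appear clearly on topic. One plausible explanation, which the notebook itself does not confirm, is that the aggregate is averaged over every query in the qrels while the run covers only three of them, so each missing query contributes a score of 0. The sketch below illustrates that effect in plain Python. The figure of 50 judged queries and the per-query RR of 1.0 are hypothetical values, chosen only because they would reproduce an aggregate of 0.06; `restrict_qrels` is likewise a hypothetical helper.

```python
def mean_over_qrels(per_query_scores, qrels_query_ids):
    """Average a metric over all judged queries, scoring missing queries as 0."""
    total = sum(per_query_scores.get(qid, 0.0) for qid in qrels_query_ids)
    return total / len(qrels_query_ids)


def restrict_qrels(qrels, run):
    """Keep only the judgments for queries that actually appear in the run."""
    return {qid: judged for qid, judged in qrels.items() if qid in run}


# Hypothetical setup: 50 judged queries, but only 3 were run
all_query_ids = [str(i) for i in range(1, 51)]
qrels = {qid: {"some_doc": 1} for qid in all_query_ids}
run = {"1": {"some_doc": 1.0}, "2": {"some_doc": 1.0}, "3": {"some_doc": 1.0}}
rr_per_query = {"1": 1.0, "2": 1.0, "3": 1.0}

diluted = mean_over_qrels(rr_per_query, all_query_ids)
focused_qrels = restrict_qrels(qrels, run)
focused = mean_over_qrels(rr_per_query, list(focused_qrels))

print(f"RR averaged over all judged queries: {diluted:.4f}")
print(f"RR averaged over evaluated queries:  {focused:.4f}")
```

If this is what is happening, passing a restricted qrels dictionary to the evaluation function would report metrics for the sampled queries only, which is usually the number of interest when spot-checking a handful of queries.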


Detailed Results Viewer with Domain-Specific Visualization

In [None]:
query_index = 0  # @param ["0", "1", "2"] {type:"raw"}
show_timings = True # @param {type:"boolean"}

In [None]:
if sample_results and query_index < len(sample_results):
    result = sample_results[query_index]

    # Evaluate this specific query with ir_measures using domain-specific metrics
    if qrels_dict and result['query_id'] in qrels_dict:
        metrics = evaluate_with_ir_measures(
            result['run'],
            qrels_dict,
            domain_metrics
        )
    else:
        metrics = None

    # Display results with domain-specific formatting
    display_rag_results(
        result['query_text'],
        result['retrieved_docs'],
        result['answer'],
        result['timings'] if show_timings else None,
        metrics,
        use_case
    )

| Metric | Value |
|---|---|
| R@5 | 0.0001 |
| RR | 0.02 |
| P@5 | 0.012 |
| nDCG@5 | 0.0102 |
| AP | 0.0001 |
| R@10 | 0.0001 |
| nDCG@10 | 0.0066 |


**Document 1:**

- **Biencoder Score:** 0.712

- **Cross-Encoder Score:** 9.036

- **Document ID:** 4dtk1kyh

- **Text:** Origin of Novel Coronavirus (COVID-19): A Computational Biology Study using Artificial Intelligence. Origin of the COVID-19 virus has been intensely debated in the scientific community since the first infected cases were detected in December 2019. The disease has caused a global pandemic, leading to...




**Document 2:**

- **Biencoder Score:** 0.702

- **Cross-Encoder Score:** 8.175

- **Document ID:** utsr0zv7

- **Text:** The Human Coronavirus Disease COVID-19: Its Origin, Characteristics, and Insights into Potential Drugs and Its Mechanisms. The emerging coronavirus disease (COVID-19) swept across the world, affecting more than 200 countries and territories. Genomic analysis suggests that the COVID-19 virus originat...




**Document 3:**

- **Biencoder Score:** 0.685

- **Cross-Encoder Score:** 8.007

- **Document ID:** v99vlnox

- **Text:** COVID-19 in South Korea. A novel coronavirus (severe acute respiratory syndrome-CoV-2) that initially originated from Wuhan, China, in December 2019 has already caused a pandemic. While this novel coronavirus disease (COVID-19) frequently induces mild diseases, it has also generated severe diseases ...




**Document 4:**

- **Biencoder Score:** 0.686

- **Cross-Encoder Score:** 7.909

- **Document ID:** nvofyg16

- **Text:** Covid-19 in South Korea.. A novel coronavirus (severe acute respiratory syndrome-CoV-2) that initially originated from Wuhan, China, in December 2019 has already caused a pandemic. While this novel coronavirus disease (covid-19) frequently induces mild diseases, it has also generated severe diseases...




**Document 5:**

- **Biencoder Score:** 0.703

- **Cross-Encoder Score:** 7.851

- **Document ID:** sh7lrdou

- **Text:** The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak. Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. Based on the large number of infected people that were ...




| Metric | Time (seconds) |
|---|---|
| Retrieval time | 1.411 |
| Generation time | 7.916 |
| Total RAG time | 9.327 |


## Try RAG with Your Own Query for this Domain

In [None]:

user_query = "What evidence supports the effectiveness of Remdesivir in treating COVID-19 patients?" # @param {type:"string"}
top_k_first_stage = 100 # @param {type:"slider", min:10, max:500, step:10}
top_k_reranked = 5 # @param {type:"slider", min:1, max:20, step:1}
show_timings = True # @param {type:"boolean"}

In [None]:
# If no query is provided, use a default example from the selected use case
if not user_query:
    user_query = use_cases[use_case]["example_queries"][0]
    print(f"Using example query: \"{user_query}\"")
    print("Enter your own query above to try something different!")
else:
    print(f"Processing custom query: \"{user_query}\"")

Processing custom query: "What evidence supports the effectiveness of Remdesivir in treating COVID-19 patients?"


In [None]:
# Get domain-specific prompt
domain_prompt = use_cases[use_case]["domain_prompt"]

In [None]:
# Run the RAG pipeline with domain-specific prompt
retrieved_docs, answer, timings, _ = run_rag_pipeline(
    user_query,
    top_k_first_stage=top_k_first_stage,
    top_k_reranked=top_k_reranked,
    domain_prompt=domain_prompt,
    show_timings=show_timings
)

In [None]:
# Display results with domain-specific formatting
display_rag_results(
    user_query,
    retrieved_docs,
    answer,
    timings if show_timings else None,
    use_case=use_case
)

**Document 1:**

- **Biencoder Score:** 0.712

- **Cross-Encoder Score:** 7.340

- **Document ID:** bjgy98r8

- **Text:** Large trial yields strongest evidence yet that antiviral drug can help COVID-19 patients. A candidate treatment for COVID-19 has shown convincing—albeit modest—benefit for the first time in a large, carefully controlled clinical trial in hospitalized patients The infected people who received remdesi...




**Document 2:**

- **Biencoder Score:** 0.714

- **Cross-Encoder Score:** 6.946

- **Document ID:** pkklt77i

- **Text:** Remdesivir investigational trials in COVID-19: a critical reappraisal. Abstract During outbreak of emerging disease, the most important aim is to discover an effective drug to save life. Consequently, a lot of effort are generally made by the industry to promote clinical trials with new drugs. Here ...




**Document 3:**

- **Biencoder Score:** 0.733

- **Cross-Encoder Score:** 6.818

- **Document ID:** 6tcwu832

- **Text:** Remdesivir Efficacy in Coronavirus Disease 2019 (COVID-19): A Systematic Review. Background: Researchers are working hard to find an effective treatment for the new coronavirus 2019. We performed a comprehensive systematic review to investigate the latest clinical evidence on the treatment efficacy ...




**Document 4:**

- **Biencoder Score:** 0.650

- **Cross-Encoder Score:** 6.754

- **Document ID:** eaqxifxu

- **Text:** Antivirals for COVID-19.. Drugs targeting RNA respiratory viruses has resulted in few effective therapies, highlighting challenges for antivirals to treat COVID-19. Several antivirals are being investigated for symptomatic COVID-19 but no definitive data support their clinical use. Remdesivir, with ...




**Document 5:**

- **Biencoder Score:** 0.706

- **Cross-Encoder Score:** 6.715

- **Document ID:** 7xc47la7

- **Text:** Remdesivir use in patients with coronavirus COVID-19 disease: a systematic review and meta-analysis. Background Coronavirus disease 2019 (COVID-19), caused by the novel coronavirus SARS-CoV-2, has led to significant global mortality and morbidity. Until now, no treatment has proven to be effective i...




| Metric | Time (seconds) |
|---|---|
| Retrieval time | 0.675 |
| Generation time | 10.844 |
| Total RAG time | 11.519 |


### 📊 Understanding the Results

Let's analyze what happened in this RAG example:

1. **Retrieval Quality**:
   - Examine the relevance scores of retrieved documents
   - Higher cross-encoder scores indicate better matches to the query
   - Does the first document contain the information needed?

2. **Answer Completeness**:
   - Check if the generated answer incorporates information from the retrieved documents
   - Is the answer comprehensive or missing key information?

3. **Domain Appropriateness**:
   - Does the response match the expected format for this domain?
   - For scientific content: Are citations and caveats included?
   - For technical support: Are steps clear and actionable?
   - For educational content: Is the explanation accessible and structured?

Try modifying the query slightly and observe how the retrieval results change.
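
One way to make the retrieval-quality check concrete is to compare how the two scoring stages order the same documents. The sketch below copies the scores from the Remdesivir example output above and reports each document's rank under the bi-encoder and under the cross-encoder. The `rank_shift` helper is a name introduced here for illustration and is not part of the tutorial's pipeline; the ranks are relative to the five displayed documents only, not to the full first-stage candidate list.

```python
# Scores copied from the Remdesivir example output above:
# (document ID, bi-encoder score, cross-encoder score)
results = [
    ("bjgy98r8", 0.712, 7.340),
    ("pkklt77i", 0.714, 6.946),
    ("6tcwu832", 0.733, 6.818),
    ("eaqxifxu", 0.650, 6.754),
    ("7xc47la7", 0.706, 6.715),
]


def rank_shift(rows):
    """Return {doc_id: (bi_rank, cross_rank)} using 1-based ranks (hypothetical helper)."""
    by_bi = sorted(rows, key=lambda r: r[1], reverse=True)
    by_cross = sorted(rows, key=lambda r: r[2], reverse=True)
    bi_rank = {r[0]: i for i, r in enumerate(by_bi, start=1)}
    cross_rank = {r[0]: i for i, r in enumerate(by_cross, start=1)}
    return {doc_id: (bi_rank[doc_id], cross_rank[doc_id]) for doc_id in bi_rank}


shifts = rank_shift(results)
for doc_id, (bi, cross) in shifts.items():
    print(f"{doc_id}: bi-encoder rank {bi} -> cross-encoder rank {cross}")
```

Documents whose rank changes between the two stages are good candidates for manual inspection, since they show where the cross-encoder disagreed with the embedding similarity.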


## Best Practices for RAG in Each Domain

### Scientific Research
- Use domain-specific embeddings trained on scientific literature.
- Implement citation tracking to show provenance of information.
- Consider combining with a knowledge graph for concept relationships.
- Include publication date as a ranking factor for recency.
- Use subject-specific vocabularies for query expansion.

### Technical Support
- Optimize for high precision in top results.
- Implement conversational context to track troubleshooting steps.
- Use step detection to format answers as procedures.
- Consider hybrid retrieval combining semantic and keyword search.
- Add feedback mechanisms to improve answer quality over time.
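
The hybrid retrieval suggestion above can be sketched with Reciprocal Rank Fusion (RRF), one common way to merge a semantic ranking with a keyword ranking without calibrating their scores against each other. This is an illustrative sketch rather than the tutorial's pipeline: the document IDs are hypothetical, and `k=60` is a commonly used default, not a value tuned for any dataset here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical IDs: one list from a dense retriever, one from keyword search
dense_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Because RRF uses only rank positions, it works even when the two retrievers produce scores on different scales, as the bi-encoder and cross-encoder scores in this tutorial do.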

### Education & Library
- Organize retrieved content by difficulty/education level.
- Implement learning path generation based on prerequisite concepts.
- Support multiple learning styles in answer generation.
- Include multimedia content recommendations when available.
- Use readability scores to match content to learner level.

### Legal Document Analysis
- Prioritize precision and traceability of information.
- Implement jurisdiction detection for relevant legal standards.
- Track legal precedent relationships between documents.
- Use specialized legal embeddings trained on case law.
- Include confidence scores and disclaimers in generated answers.

### Healthcare Information
- Implement evidence quality assessment for medical information.
- Include recency as a critical ranking factor.
- Use medical taxonomy mapping for query expansion.
- Apply stricter fact verification for medical claims.
- Include appropriate medical disclaimers in answers.
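
To illustrate recency as a ranking factor, the sketch below scales a relevance score by an exponential decay on document age. The function name, the one-year half-life, and the multiplicative form are all assumptions made for illustration. It also assumes relevance scores are non-negative (for example, normalized to the range 0 to 1), since multiplying a negative raw cross-encoder score by a decay factor would move it in the wrong direction.

```python
from datetime import date


def recency_weighted_score(relevance, published, today, half_life_days=365.0):
    """Scale a non-negative relevance score by an exponential decay on age.

    A document loses half of its weight every `half_life_days`.
    """
    age_days = max((today - published).days, 0)
    decay = 0.5 ** (age_days / half_life_days)
    return relevance * decay


today = date(2021, 1, 1)
# Hypothetical documents: an older, slightly more relevant one and a newer one
older = recency_weighted_score(0.80, date(2020, 1, 2), today)   # 365 days old
newer = recency_weighted_score(0.70, date(2020, 12, 2), today)  # 30 days old
print(f"older: {older:.3f}, newer: {newer:.3f}")
```

A shorter half-life favors new publications more aggressively, which may suit fast-moving topics but can bury foundational material.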

### Campus Info
- Use location-aware retrieval for campus service queries.
- Implement freshness checks for office hours and contact info.
- Include department hierarchy awareness for better routing.
- Use hybrid search combining exact name matching with semantic search.
- Prioritize official pages over event announcements.

---

## General Implementation Guidance for RAG

When implementing RAG for a specific domain, also consider:

- **Pre-processing:** Apply domain-specific cleaning and normalization.
- **Chunking strategy:** Adjust document chunking based on domain document structure.
- **Embedding models:** Fine-tune or select embeddings trained on domain data.
- **LLM prompting:** Craft specialized prompts with domain context.
- **Post-processing:** Format answers according to domain conventions.

If you are building a general-purpose system rather than a domain-specific one:

- Select embedding models suited to your content and query types.
- Experiment with chunking strategies based on your document structure.
- Fine-tune retrieval parameters (k values, reranking) based on evaluation.
- Craft effective prompts that include relevant context for the LLM.
- Implement user feedback mechanisms to improve over time.
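
As a starting point for the chunking experiments mentioned above, here is a minimal word-based chunker with overlap. The function name and the default sizes are assumptions for illustration; real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny example with 10 placeholder words, chunks of 4 words, overlap of 2
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Overlap reduces the chance that a sentence relevant to a query is split across a chunk boundary, at the cost of a larger index.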


## 🛠️ Troubleshooting Common Issues

When implementing RAG systems, you may encounter these common challenges:

### Retrieval Problems

1. **Irrelevant Documents Retrieved**
   - **Symptoms**: Retrieved documents don't match query intent
   - **Solutions**:
     - Try different embedding models (domain-specific if available)
     - Adjust chunking strategy (smaller or larger chunks)
     - Implement query expansion or reformulation
     - Add a reranking step with a cross-encoder

2. **Missing Information**
   - **Symptoms**: System can't find information you know exists in the corpus
   - **Solutions**:
     - Increase the number of retrieved documents (k)
     - Try hybrid retrieval (combine dense and sparse methods)
     - Check document preprocessing (ensure no information loss)

### Generation Issues

1. **Hallucinations Despite RAG**
   - **Symptoms**: Model generates incorrect information despite relevant context
   - **Solutions**:
     - Adjust prompt to emphasize using only retrieved information
     - Implement fact-checking or confidence scoring
     - Use a model with better instruction following

2. **Poor Integration of Retrieved Content**
   - **Symptoms**: Answer doesn't incorporate relevant retrieved information
   - **Solutions**:
     - Improve context formatting in the prompt
     - Try different LLMs or parameter settings
     - Test chain-of-thought or step-by-step reasoning
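
The prompt-related suggestions above (emphasize retrieved information only, improve context formatting) can be sketched as a small prompt builder. The function name, the wording of the instructions, and the `doc_id`/`text` dictionary keys are assumptions for illustration; they do not describe how `run_rag_pipeline` formats its prompt.

```python
def build_grounded_prompt(query, docs):
    """Build a prompt that restricts the model to numbered context passages.

    `docs` is a list of dicts with the (hypothetical) keys "doc_id" and "text".
    """
    context = "\n\n".join(
        f"[{i}] (ID: {d['doc_id']}) {d['text']}"
        for i, d in enumerate(docs, start=1)
    )
    return (
        "Answer the question using ONLY the numbered passages below. "
        "Cite passages as [1], [2], ... after each claim. "
        "If the passages do not contain the answer, say that you cannot "
        "answer from the provided context.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Placeholder documents, not real corpus entries
docs = [
    {"doc_id": "doc-001", "text": "Placeholder passage about the first topic."},
    {"doc_id": "doc-002", "text": "Placeholder passage about the second topic."},
]
prompt = build_grounded_prompt("What do the passages say?", docs)
print(prompt)
```

Numbering the passages gives the model a simple citation scheme and makes it easier to verify each claim in the answer against its source.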

### API and Resource Issues

1. **Slow Performance**
   - Try batch processing where possible
   - Consider quantized models or optimized libraries
   - Use caching for repeated queries

2. **API Failures**
   - Implement robust error handling and retries
   - Have fallback models or approaches ready
   - Cache results whenever possible

If you encounter any of these issues during the tutorial, raise your hand, and we'll work through them together.

## Evaluation Summary for Current Use Case

## 🎓 Conclusion: Building Effective Domain-Specific RAG Systems

This tutorial has demonstrated how to build and customize RAG systems for different domains using pre-built indexes over public datasets from the `ir_datasets` library. By combining these datasets with sentence-transformer embeddings, we've implemented practical RAG solutions tailored to real-world use cases.

### Core Principles for Domain-Specific RAG

1. **Know Your Domain**
   - Understand the specific information needs and constraints
   - Identify domain-specific language, terminology, and concepts
   - Recognize the appropriate evidence standards and response formats

2. **Customize Each Component**
   - Data sources: Select and preprocess domain-appropriate content
   - Embedding models: Choose or fine-tune for domain language
   - Retrieval strategy: Optimize parameters for domain needs
   - Prompting: Design domain-specific instructions and constraints
   - Evaluation: Measure what matters for your specific use case

3. **Iterate with Feedback**
   - Collect domain expert evaluation on system outputs
   - Analyze failure cases to identify improvement areas
   - Continuously update knowledge bases as domains evolve

### Why Domain-Specific RAG Matters

Adapting RAG systems to specific domains can improve performance in several ways:

- **Specialized Knowledge**: Retrieve content that aligns with domain-specific needs
- **Improved Contextual Understanding**: Domain-aware prompting leads to higher-quality answers
- **Targeted Evaluation**: Metrics tailored to domain objectives provide realistic assessments
- **Enhanced User Experience**: Formatting and output are better suited to user expectations
- **Higher Relevance**: Focused retrieval leads to better precision and recall

### From Tutorial to Production

To move from this tutorial to production-ready systems:

1. **Scale Your Data Pipeline**
   - Implement efficient document ingestion workflows
   - Set up regular knowledge base updates
   - Consider hybrid storage solutions for different content types

2. **Optimize for Performance**
   - Benchmark and optimize vector search for your scale
   - Implement caching strategies for common queries
   - Consider quantization for embedding models

3. **Enhance with Advanced Techniques**
   - Query rewriting and decomposition
   - Multi-stage retrieval architectures
   - Ensemble methods for improved accuracy
   - Hybrid search combining sparse and dense methods
   - Sophisticated chunking strategies (overlapping, hierarchical, semantic)
   - Self-reflection and validation capabilities

4. **Implement Robust Monitoring**
   - Track retrieval quality metrics over time
   - Monitor for knowledge gaps and hallucinations
   - Implement feedback loops from end users
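
For tracking retrieval quality over time, a few standard metrics can be computed per query from a ranked list of document IDs and a set of known-relevant IDs. The sketch below implements precision@k, recall@k, and reciprocal rank with the standard library; the document IDs and relevance judgments are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked results and relevance judgments for one query
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p5 = precision_at_k(retrieved, relevant, 5)
r5 = recall_at_k(retrieved, relevant, 5)
rr = reciprocal_rank(retrieved, relevant)
print(f"P@5={p5:.2f}  R@5={r5:.2f}  RR={rr:.2f}")
```

Averaging these values over a fixed set of test queries, and logging them after each index or model update, gives a simple regression signal for retrieval quality.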

### Suggested Next Steps

To expand upon this tutorial:

- **Build Your Own Indexes**:
  - Use datasets from `ir_datasets` such as `beir/trec-covid`, `beir/nfcorpus`, or the `beir/cqadupstack` subsets (e.g., `beir/cqadupstack/android`)
  - Generate dense embeddings using Sentence Transformers
  - Build and store FAISS indexes for fast retrieval

- **Customize for Your Domain**:
  - Create tailored prompt templates
  - Design custom evaluation methods
  - Use visualizations relevant to your domain
  - Implement hybrid retrieval suited to the task

- **Explore Advanced RAG Architectures**:
  - Experiment with multi-hop RAG for complex reasoning
  - Try Hypothetical Document Embeddings (HyDE)
  - Integrate knowledge graphs with vector search
  - Implement fine-tuning for domain-specific retrievers

### Further Learning Resources

- **[ir_datasets](https://ir-datasets.com/)** & [GitHub](https://github.com/allenai/ir_datasets)
- **[BEIR Benchmark](https://github.com/beir-cellar/beir)**
- **[HuggingFace Datasets](https://huggingface.co/datasets?search=irds)**
- **[Sentence Transformers](https://www.sbert.net/)**
- **[FAISS Library](https://github.com/facebookresearch/faiss)**
- **[LangChain Docs](https://python.langchain.com/docs/modules/data_connection/retrievers/)**
- **[LlamaIndex Docs](https://docs.llamaindex.ai/)**

The field of RAG is rapidly evolving, with new techniques and models emerging regularly. By focusing on domain-specific adaptations, you can build systems that deliver truly valuable and trustworthy information experiences across scientific research, technical support, education, fact verification, healthcare, and beyond.


## 🧭 A Note on Real-World RAG: Handling Multiple Data Sources with Query Routing

In this tutorial, we’ve simplified the setup by letting you choose a single **use case** (e.g., *Scientific Research*, *Technical Support*), and loading **one pre-built index** corresponding to a default dataset for that domain (like `beir/trec-covid` or `beir/cqadupstack/android`). All queries are executed against this single, focused index.

This works well for learning the core components of Retrieval-Augmented Generation (RAG) in a controlled environment. However, in production systems, the reality is far more complex.

---

### 🔍 Real-World RAG Involves Many Diverse Sources

Imagine building a RAG system for an enterprise or large-scale application. A single unified index often isn’t practical or efficient. Real-world systems typically need to answer questions using content drawn from multiple **heterogeneous sources**, such as:

- ✅ Internal technical documentation (e.g., Confluence, GitHub wikis)  
- ✅ Public product FAQs and marketing sites  
- ✅ A database of customer support tickets or live chat logs  
- ✅ Regulatory PDFs, whitepapers, or clinical guidelines  
- ✅ Web content or external APIs

Indexing all of this into one giant vector store is usually **not scalable**, and can dilute retrieval precision. Instead, most real-world systems maintain **multiple specialized indexes**, each optimized for a different content type, domain, or access policy.

---

### 🚦 Enter Query Routing

So how does a system decide **which index (or tool) to query** for a given user question?

This is where **Query Routing** becomes essential. A *Query Router* acts like a smart traffic controller inside the RAG pipeline. Before retrieving documents, it analyzes the incoming query and chooses the most relevant source(s) — or even routes the request to external tools like search APIs, databases, or reasoning modules.

---

### 🧠 Common Query Routing Strategies

Here are several routing strategies used in modern RAG systems:

1. **LLM-Based Routing**  
   Use a lightweight [language model](https://arxiv.org/abs/2303.11366) to classify or analyze the query intent. The LLM chooses the best index by comparing the query to high-level descriptions of each source.  
   *Example:* “What is the refund policy for software licenses?” → routed to the policy documents index.

2. **Semantic Routing (Embedding Similarity)**  
   Compute a dense vector (embedding) of the query, then compare it to “representative vectors” for each index. This can be done using [FAISS](https://github.com/facebookresearch/faiss), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), or [Milvus](https://milvus.io/).  
   *Example:* Map queries to closest-matching domains using cosine similarity.

3. **Keyword or Rule-Based Routing**  
   Set up simple keyword triggers or regex patterns to map queries to sources.  
   *Example:* Queries containing “error code” or “stack trace” → routed to technical docs.  
   This is fast and interpretable, but brittle if query language varies a lot.

4. **Metadata Filtering within Shared Indexes**  
   If you choose to merge multiple datasets into one index, you can still route using metadata fields (e.g., `document_source = 'faq'`). This lets you use **filters** to scope retrieval only to relevant subsets.
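
To make semantic routing (strategy 2) concrete, the sketch below picks the index whose representative vector is closest to the query by cosine similarity. Everything here is illustrative: the index names are hypothetical, and the 3-dimensional vectors stand in for real embeddings, which would come from the same embedding model used to build each index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route_by_embedding(query_vec, index_centroids):
    """Return (index_name, similarity) for the closest representative vector."""
    best = max(index_centroids, key=lambda name: cosine(query_vec, index_centroids[name]))
    return best, cosine(query_vec, index_centroids[best])


# Toy vectors standing in for per-index representative embeddings
index_centroids = {
    "policy_docs": [0.9, 0.1, 0.0],
    "tech_docs": [0.1, 0.9, 0.1],
    "support_tickets": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.8, 0.1]  # toy embedding of an incoming query

target, score = route_by_embedding(query_vec, index_centroids)
print(f"route to {target} (cosine similarity {score:.3f})")
```

A representative vector could be, for example, the mean embedding of a sample of documents from each index or the embedding of a short description of the source. With only a handful of indexes, a linear scan like this is enough; a vector search library becomes useful when there are many routing targets.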

---

### 🏗️ Hybrid Routing Is Common

In practice, many systems combine these approaches:

- Use **keyword filtering** for high-precision rules  
- Fall back to **semantic routing** for open-ended questions  
- Use an **LLM router** for nuanced intent classification  
- Combine results from multiple indexes and rerank with a **cross-encoder** or **relevance model**

Frameworks like [LangChain](https://docs.langchain.com/docs/components/retrievers/router-retriever/) and [Haystack](https://docs.haystack.deepset.ai/docs/routing_queries) offer modules for building query routers out-of-the-box.
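
Independently of any framework, the combination described above can be sketched as a small function that tries high-precision keyword rules first and defers to a fallback router otherwise. The rules, index names, and `toy_fallback` function are hypothetical; in a real system the fallback would be a semantic or LLM-based router.

```python
import re

# Hypothetical high-precision rules: (compiled pattern, index name)
KEYWORD_RULES = [
    (re.compile(r"\b(error code|stack trace)\b", re.IGNORECASE), "tech_docs"),
    (re.compile(r"\b(refund|license|policy)\b", re.IGNORECASE), "policy_docs"),
]


def route_query(query, fallback_router, default_index="general"):
    """Return (index_name, strategy): keyword rules first, then the fallback."""
    for pattern, index_name in KEYWORD_RULES:
        if pattern.search(query):
            return index_name, "keyword"
    routed = fallback_router(query)
    if routed is not None:
        return routed, "fallback"
    return default_index, "default"


def toy_fallback(query):
    """Stand-in for a semantic or LLM router; returns None when unsure."""
    return "support_tickets" if "customer" in query.lower() else None


for q in [
    "I see a stack trace on launch",
    "What is the refund policy for software licenses?",
    "A customer asked about shipping",
    "hello",
]:
    print(q, "->", route_query(q, toy_fallback))
```

Returning the strategy alongside the index name makes routing decisions easy to log and audit, which helps when tuning the rules.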

---

### 🛠️ Why We Didn’t Use Query Routing *Here*

For this tutorial, we kept the setup simple by pre-selecting one dataset per domain and routing all queries to a single index. This helps focus on learning RAG fundamentals like retrieval, reranking, and generation — without worrying about architectural complexity.

But in real-world applications — especially in enterprise, healthcare, legal, or customer support scenarios — **query routing is essential** to handle:
- Information silos
- Data volume
- Index-specific latency
- Access control

If you're building a production-ready RAG system, consider implementing query routing as an early design decision.

---

### 📚 Further Reading & Examples

- [Query Routing in LangChain](https://docs.langchain.com/docs/components/retrievers/router-retriever/)
- [Multihop RAG Routing in Google's FiD](https://arxiv.org/abs/2007.00849)
- [Retrieval Strategies in Haystack](https://docs.haystack.deepset.ai/docs/routing_queries)
- [Prompting for Tool Use](https://arxiv.org/abs/2302.12337) in Multi-Tool LLM systems