<div align="center">
<a href="https://rapidfire.ai/"><img src="https://raw.githubusercontent.com/RapidFireAI/rapidfireai/main/docs/images/RapidFire - Blue bug -white text.svg" width="115"></a>
<a href="https://discord.gg/6vSTtncKNN"><img src="https://raw.githubusercontent.com/RapidFireAI/rapidfireai/main/docs/images/discord-button.svg" width="145"></a>
<a href="https://oss-docs.rapidfire.ai/"><img src="https://raw.githubusercontent.com/RapidFireAI/rapidfireai/main/docs/images/documentation-button.svg" width="125"></a>
<br/>
Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/RapidFireAI/rapidfireai">GitHub</a></i> ⭐
<br/>
To install RapidFire AI on your own machine, see the <a href="https://oss-docs.rapidfire.ai/en/latest/walkthrough.html">Install and Get Started</a> guide in our docs.
</div>

# RapidFire AI Tutorial: Optimizing RAG Pipelines with Trackio Experiment Tracking 

Retrieval-Augmented Generation (RAG) is a practical way to make an AI assistant **answer using your documents**:

- **Retrieve**: find the most relevant passages for a question.
- **Generate**: give those passages to a language model so it can answer *grounded in evidence*.

In this tutorial, we'll build and evaluate a RAG pipeline for a **financial opinion Q&A** assistant using the [FiQA dataset](https://huggingface.co/datasets/explodinggradients/fiqa).

Examples of the kind of questions we're targeting:

- "Should I invest in index funds or individual stocks?"
- "What's a good way to save for retirement in my 30s?"
- "Is it worth refinancing my mortgage right now?"

## What We're Building

A concrete RAG pipeline that looks like this:

1. **Load a financial corpus** (documents + posts).
2. **Split documents into chunks** (so we can search smaller, more relevant pieces).
3. **Embed the chunks** (turn text into vectors) and store them in a vector index (FAISS).
4. **Retrieve top-K chunks** for each question using similarity search.
5. *(Optional)* **Rerank** the retrieved chunks with a stronger model to keep only the best evidence.
6. **Build a prompt** that includes the question + retrieved context.
7. **Generate an answer** with a vLLM model.
8. **Evaluate retrieval quality** (Precision, Recall, F1, NDCG@5, MRR) so we can tell which settings find better evidence.
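
Steps 3 and 4 can be illustrated with a toy sketch. It assumes a bag-of-words "embedding" in place of a real sentence-transformer model and a brute-force scan in place of a FAISS index. The names `embed`, `cosine`, and `retrieve_top_k` are hypothetical helpers for illustration, not part of RapidFire AI or LangChain.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and keep the k most similar."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up chunks for illustration only (not taken from FiQA)
chunks = [
    "index funds offer broad diversification at low cost",
    "refinancing a mortgage can lower your monthly payment",
    "individual stocks carry more risk than index funds",
]
top = retrieve_top_k("should i buy index funds or individual stocks", chunks, k=2)
for score, chunk in top:
    print(f"{score:.3f}  {chunk}")
```

A real pipeline swaps `embed` for a dense embedding model and the linear scan for a vector index, but the retrieve-by-similarity idea is the same.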

## Our Approach

RAG has a lot of "knobs", and it's easy to lose track of what helped. In this notebook we'll systematically vary both **retrieval settings** and **generator models** to find the best combination.

We'll use [RapidFireAI](https://github.com/RapidFireAI/rapidfireai) to:

- **Define a full experiment grid**: 2 chunking strategies × 2 reranking `top_n` values × 2 generator models = **8 total configs**.
- **Run all configs the same way** on the same dataset.
- **Compare retrieval metrics side-by-side** as they update (Precision/Recall/NDCG/MRR) to pick the best evidence-finding setup.
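
The grid size is just a Cartesian product of the knob values. The sketch below enumerates it with the standard library to show where the 8 comes from; it is an illustration of the counting only and does not claim to mirror how RapidFire AI expands a grid internally.

```python
from itertools import product

# Knob values used later in this tutorial
chunk_sizes = [256, 128]
rerank_top_n = [2, 5]
generators = ["Qwen/Qwen2.5-0.5B-Instruct", "Qwen/Qwen2.5-3B-Instruct"]

# One dict per combination of (chunk size, reranker top_n, generator)
grid = [
    {"chunk_size": size, "top_n": top_n, "generator": model}
    for size, top_n, model in product(chunk_sizes, rerank_top_n, generators)
]

print(len(grid))  # 8
for config in grid:
    print(config)
```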

We use **Trackio** for experiment tracking and visualization.

### Configure Trackio and Import RapidFire Components

First, we enable Trackio as the experiment tracking backend and disable MLflow and TensorBoard. These environment variables must be set **before** importing RapidFire components.

Then we import RapidFire's core classes for defining the RAG pipeline and running a grid search.

In [None]:
import os

# Enable Trackio as the tracking backend
os.environ["RF_TRACKIO_ENABLED"] = "true"

# Disable other tracking backends for standalone Trackio usage
os.environ["RF_MLFLOW_ENABLED"] = "false"
os.environ["RF_TENSORBOARD_ENABLED"] = "false"

from rapidfireai import Experiment
from rapidfireai.automl import List, RFLangChainRagSpec, RFvLLMModelConfig, RFPromptManager, RFGridSearch
import re, json
from typing import List as listtype, Dict, Any

### Loading the Data

We load the FiQA **queries** and **relevance labels (qrels)**. The qrels file contains ground truth information about which documents are relevant to which queries, which we'll use to evaluate retrieval quality.

In [None]:
from datasets import load_dataset
import pandas as pd
from pathlib import Path

# Dataset directory (relative to this notebook's location)
dataset_dir = Path("../datasets")

fiqa_dataset = load_dataset("json", data_files=str(dataset_dir / "fiqa" / "queries.jsonl"), split="train")
fiqa_dataset = fiqa_dataset.rename_columns({"text": "query", "_id": "query_id"})
qrels = pd.read_csv(str(dataset_dir / "fiqa" / "qrels.tsv"), sep="\t")
qrels = qrels.rename(
    columns={"query-id": "query_id", "corpus-id": "corpus_id", "score": "relevance"}
)

### Create Experiment

An `Experiment` is RapidFire's top-level container for this notebook run: it groups configs/runs, saves artifacts, and tracks metrics under a unique name. We set `mode="evals"` because we're running evaluation (not training). See the [Experiment API docs](https://oss-docs.rapidfire.ai/en/latest/experiment.html#api-experiment) for more details.

In [None]:
experiment = Experiment(experiment_name="exp1-fiqa-rag-trackio", mode="evals")

### Defining the RAG Search Space

This is where RapidFireAI shines. Instead of hardcoding a single RAG configuration, we define a search space using `RFLangChainRagSpec`.

We will test:

* **2 Chunking Strategies**: Different chunk sizes (256 vs 128 tokens).
* **2 Reranking Strategies**: Different `top_n` values (2 vs 5).

This gives us 4 retrieval combinations to evaluate. Combined with 2 generator models, we get **8 total configurations**.
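
To get a feel for how chunk size and overlap interact, here is a simplified sliding-window chunker over a list of tokens. It is only an approximation: `RecursiveCharacterTextSplitter` also tries to split on natural separators, which this sketch ignores. `chunk_tokens` is a hypothetical helper, and the 600-token document is synthetic.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic 600-token document
tokens = [f"t{i}" for i in range(600)]
for size in (256, 128):
    chunks = chunk_tokens(tokens, chunk_size=size, chunk_overlap=32)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

Smaller chunks produce more, finer-grained candidates for retrieval, at the cost of less context per chunk. That trade-off is what the two chunking strategies are meant to probe.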

In [None]:
from langchain_community.document_loaders import DirectoryLoader, JSONLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Per-Actor batch size for hardware efficiency
batch_size = 128

# 2 chunk sizes x 2 reranking top-n = 4 combinations in total
rag_gpu = RFLangChainRagSpec(
    document_loader=DirectoryLoader(
        path=str(dataset_dir / "fiqa"),
        glob="corpus.jsonl",
        loader_cls=JSONLoader,
        loader_kwargs={
            "jq_schema": ".",
            "content_key": "text",
            "metadata_func": lambda record, metadata: {
                "corpus_id": int(record.get("_id"))
            },  # store the document id
            "json_lines": True,
            "text_content": False,
        },
        sample_seed=42,
    ),
    # 2 chunking strategies with different chunk sizes
    text_splitter=List([
            RecursiveCharacterTextSplitter.from_tiktoken_encoder(
                encoding_name="gpt2", chunk_size=256, chunk_overlap=32
            ),
            RecursiveCharacterTextSplitter.from_tiktoken_encoder(
                encoding_name="gpt2", chunk_size=128, chunk_overlap=32
            ),
        ],
    ),
    embedding_cls=HuggingFaceEmbeddings,
    embedding_kwargs={
        "model_name": "sentence-transformers/all-MiniLM-L6-v2",
        "model_kwargs": {"device": "cuda:0"},
        "encode_kwargs": {"normalize_embeddings": True, "batch_size": batch_size},
    },
    vector_store=None,  # uses FAISS by default
    search_type="similarity",
    search_kwargs={"k": 15},
    # 2 reranking strategies with different top-n values
    reranker_cls=CrossEncoderReranker,
    reranker_kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L6-v2",
        "model_kwargs": {"device": "cuda:0"},
        "top_n": List([2, 5]),
    },
    enable_gpu_search=True,  # GPU-based exact search instead of ANN index
)

### Define Data Processing and Postprocessing Functions

We retrieve context for each question and turn it into LLM-ready prompts. The preprocessing function:
1. Performs batched retrieval over all queries
2. Extracts retrieved document IDs for evaluation
3. Serializes context into prompts for the generator

The postprocessing function attaches "ground truth" relevant documents from FiQA (`qrels`) so we can score retrieval quality later.

In [None]:
def sample_preprocess_fn(
    batch: Dict[str, listtype], rag: RFLangChainRagSpec, prompt_manager: RFPromptManager
) -> Dict[str, listtype]:
    """Function to prepare the final inputs given to the generator model"""

    INSTRUCTIONS = "Utilize your financial knowledge, give your answer or opinion to the input question or subject matter."

    # Perform batched retrieval over all queries; returns a list of lists of k documents per query
    all_context = rag.get_context(batch_queries=batch["query"], serialize=False)

    # Extract the retrieved document ids from the context
    retrieved_documents = [
        [doc.metadata["corpus_id"] for doc in docs] for docs in all_context
    ]

    # Serialize the retrieved documents into a single string per query using the default template
    serialized_context = rag.serialize_documents(all_context)
    batch["query_id"] = [int(query_id) for query_id in batch["query_id"]]

    # Each batch to contain conversational prompt, retrieved documents, and original 'query_id', 'query', 'metadata'
    return {
        "prompts": [
            [
                {"role": "system", "content": INSTRUCTIONS},
                {
                    "role": "user",
                    "content": f"Here is some relevant context:\n{context}. \nNow answer the following question using the context provided earlier:\n{question}",
                },
            ]
            for question, context in zip(batch["query"], serialized_context)
        ],
        "retrieved_documents": retrieved_documents,
        **batch,
    }


def sample_postprocess_fn(batch: Dict[str, listtype]) -> Dict[str, listtype]:
    """Function to postprocess outputs produced by generator model"""
    # Get ground truth documents for each query; can be done in preprocess_fn too but done here for clarity
    batch["ground_truth_documents"] = [
        qrels[qrels["query_id"] == query_id]["corpus_id"].tolist()
        for query_id in batch["query_id"]
    ]
    return batch
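
The postprocessing function above filters the `qrels` DataFrame once per query. One possible alternative, sketched here with only the standard library, is to build a `query_id -> [corpus_id, ...]` dictionary once and look each query up in it. The helper name `build_qrels_index` is hypothetical, and the rows below are made-up toy values in the same tab-separated layout, not real FiQA data.

```python
import csv
import io
from collections import defaultdict

# Toy qrels with the same column names as qrels.tsv (IDs are illustrative only)
toy_qrels_tsv = "query-id\tcorpus-id\tscore\n1\t10\t1\n1\t11\t1\n2\t20\t1\n"


def build_qrels_index(tsv_text: str) -> dict[int, list[int]]:
    """Group relevant corpus ids by query id in a single pass."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        index[int(row["query-id"])].append(int(row["corpus-id"]))
    return dict(index)


qrels_index = build_qrels_index(toy_qrels_tsv)

# Queries without any labeled documents fall back to an empty list
batch_query_ids = [1, 2, 3]
ground_truth = [qrels_index.get(query_id, []) for query_id in batch_query_ids]
print(ground_truth)  # [[10, 11], [20], []]
```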

### Define Custom Eval Metrics Functions

The following helper functions compute standard retrieval metrics from the retrieved vs. ground-truth document IDs:

- **Precision**: What fraction of retrieved documents are relevant?
- **Recall**: What fraction of relevant documents were retrieved?
- **F1 Score**: Harmonic mean of Precision and Recall.
- **NDCG@5**: Normalized Discounted Cumulative Gain at rank 5 (measures ranking quality).
- **MRR**: Mean Reciprocal Rank (how early does the first relevant doc appear?).

We compute metrics per batch and then combine them across batches so each config gets one consistent score.
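
To make these definitions concrete, here is a worked example for a single query. The ranked list and the relevant set are toy values, and relevance is treated as binary (a document is either relevant or not).

```python
import math

retrieved = [7, 3, 9, 4, 1]  # ranked list of document ids, best first (toy values)
relevant = {3, 4, 8}         # ground-truth relevant ids (toy values)

# 1 where the retrieved document is relevant, 0 otherwise: [0, 1, 0, 1, 0]
hits = [1 if doc in relevant else 0 for doc in retrieved]

precision = sum(hits) / len(retrieved)              # 2 of 5 retrieved are relevant
recall = sum(hits) / len(relevant)                  # 2 of 3 relevant were retrieved
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# DCG discounts each hit by log2(rank + 1), with ranks starting at 1
dcg = sum(hit / math.log2(rank + 2) for rank, hit in enumerate(hits[:5]))
# The ideal ranking puts all relevant documents (up to k=5) at the top
ideal_hits = min(5, len(relevant))
idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
ndcg_at_5 = dcg / idcg

# Reciprocal rank: the first relevant document sits at rank 2
first_hit_rank = hits.index(1) + 1
reciprocal_rank = 1 / first_hit_rank

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"ndcg@5={ndcg_at_5:.3f} rr={reciprocal_rank:.3f}")
```

Note that Precision, Recall, and F1 only depend on *which* documents were retrieved, while NDCG and the reciprocal rank also depend on the *order* in which they were returned.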

In [None]:
import math


def compute_ndcg_at_k(retrieved_docs: listtype, expected_docs: set, k=5):
    """Utility function to compute NDCG@k with binary relevance; retrieved_docs must be in rank order"""
    relevance = [1 if doc in expected_docs else 0 for doc in list(retrieved_docs)[:k]]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

    # IDCG: perfect ranking limited by min(k, len(expected_docs))
    # The ideal gain is 1 per relevant doc, matching the binary gain used for DCG above
    ideal_length = min(k, len(expected_docs))
    ideal_relevance = [1] * ideal_length + [0] * (k - ideal_length)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal_relevance))

    return dcg / idcg if idcg > 0 else 0.0


def compute_rr(retrieved_docs: listtype, expected_docs: set):
    """Utility function to compute Reciprocal Rank (RR) for a single query; retrieved_docs must be in rank order"""
    rr = 0.0
    for i, retrieved_doc in enumerate(retrieved_docs):
        if retrieved_doc in expected_docs:
            rr = 1 / (i + 1)
            break
    return rr


def sample_compute_metrics_fn(batch: Dict[str, listtype]) -> Dict[str, Dict[str, Any]]:
    """Function to compute all eval metrics based on retrievals and/or generations"""

    true_positives, precisions, recalls, f1_scores, ndcgs, rrs = 0, [], [], [], [], []
    total_queries = len(batch["query"])

    for pred, gt in zip(batch["retrieved_documents"], batch["ground_truth_documents"]):
        expected_set = set(gt)
        # Deduplicate while preserving rank order: several chunks can come from the same document
        ranked_docs = list(dict.fromkeys(pred))
        retrieved_set = set(ranked_docs)

        true_positives = len(expected_set.intersection(retrieved_set))
        precision = true_positives / len(retrieved_set) if len(retrieved_set) > 0 else 0
        recall = true_positives / len(expected_set) if len(expected_set) > 0 else 0
        f1 = (
            2 * precision * recall / (precision + recall)
            if (precision + recall) > 0
            else 0
        )

        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
        # Ranking metrics need the ordered list; a set would discard the retrieval order
        ndcgs.append(compute_ndcg_at_k(ranked_docs, expected_set, k=5))
        rrs.append(compute_rr(ranked_docs, expected_set))

    return {
        "Total": {"value": total_queries},
        "Precision": {"value": sum(precisions) / total_queries},
        "Recall": {"value": sum(recalls) / total_queries},
        "F1 Score": {"value": sum(f1_scores) / total_queries},
        "NDCG@5": {"value": sum(ndcgs) / total_queries},
        "MRR": {"value": sum(rrs) / total_queries},
    }


def sample_accumulate_metrics_fn(
    aggregated_metrics: Dict[str, listtype],
) -> Dict[str, Dict[str, Any]]:
    """Function to accumulate eval metrics across all batches"""

    num_queries_per_batch = [m["value"] for m in aggregated_metrics["Total"]]
    total_queries = sum(num_queries_per_batch)
    algebraic_metrics = ["Precision", "Recall", "F1 Score", "NDCG@5", "MRR"]

    return {
        "Total": {"value": total_queries},
        **{
            metric: {
                "value": sum(
                    m["value"] * queries
                    for m, queries in zip(
                        aggregated_metrics[metric], num_queries_per_batch
                    )
                )
                / total_queries,
                "is_algebraic": True,
                "value_range": (0, 1),
            }
            for metric in algebraic_metrics
        },
    }
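
The accumulation step weights each batch's mean by its number of queries. The toy numbers below (not real results) show why: when batches differ in size, a plain average of batch means is biased toward small batches.

```python
# Per-batch mean Precision and batch sizes (toy values for illustration)
batch_means = [0.50, 0.80]
batch_sizes = [100, 25]

# Plain average treats both batches equally, regardless of size
unweighted = sum(batch_means) / len(batch_means)

# Size-weighted average equals the mean over all 125 individual queries
weighted = sum(m * n for m, n in zip(batch_means, batch_sizes)) / sum(batch_sizes)

print(f"unweighted={unweighted:.2f} weighted={weighted:.2f}")
```

Here the weighted value (0.56) is the one that matches what you would get by averaging over every query directly.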

### Define vLLM Generator Configurations

We define two vLLM generator configurations with different model sizes:

1. **Qwen2.5-0.5B-Instruct**: A lightweight model for faster inference.
2. **Qwen2.5-3B-Instruct**: A larger model for potentially better generation quality.

Each generator config will be combined with the 4 retrieval configs (2 chunking × 2 reranking), giving us 8 total configurations to evaluate.

We bundle the generators with our preprocessing/metrics functions into `config_set`, which RapidFire will run across all configurations.

In [None]:
# 2 vLLM generator configs with different sizes of generator models

vllm_config1 = RFvLLMModelConfig(
    model_config={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "dtype": "half",
        "gpu_memory_utilization": 0.7,
        "tensor_parallel_size": 1,
        "distributed_executor_backend": "mp",
        "enable_chunked_prefill": False,
        "enable_prefix_caching": True,
        "max_model_len": 4096,
        "disable_log_stats": True,  # Disable vLLM progress logging
    },
    sampling_params={
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": 512,
    },
    rag=rag_gpu,
    prompt_manager=None,
)

vllm_config2 = RFvLLMModelConfig(
    model_config={
        "model": "Qwen/Qwen2.5-3B-Instruct",
        "dtype": "half",
        "gpu_memory_utilization": 0.7,
        "tensor_parallel_size": 1,
        "distributed_executor_backend": "mp",
        "enable_chunked_prefill": False,
        "enable_prefix_caching": True,
        "max_model_len": 4096,
        "disable_log_stats": True,  # Disable vLLM progress logging
    },
    sampling_params={
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": 512,
    },
    rag=rag_gpu,
    prompt_manager=None,
)

config_set = {
    "vllm_config": List([vllm_config1, vllm_config2]),  # Each represents 4 configs
    "batch_size": batch_size,
    "preprocess_fn": sample_preprocess_fn,
    "postprocess_fn": sample_postprocess_fn,
    "compute_metrics_fn": sample_compute_metrics_fn,
    "accumulate_metrics_fn": sample_accumulate_metrics_fn,
    "online_strategy_kwargs": {
        "strategy_name": "normal",
        "confidence_level": 0.95,
        "use_fpc": True,
    },
}

### Create Config Group

We create an `RFGridSearch` over `config_set`, producing **8 total configs** (2 generators × 2 chunkers × 2 rerankers) to run and compare.

In [None]:
# Simple grid search across all sets of config knob values = 8 combinations in total
config_group = RFGridSearch(config_set)

### Run Multi-Config Evals

Now we run the main evaluation function. Two tables will appear below:

1. **First table**: Lists all preprocessing/RAG sources (appears immediately).
2. **Second table**: Lists all individual runs with their knobs and metrics, updated in real-time via online aggregation showing both estimates and confidence intervals.
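
As a rough intuition for the estimates and confidence intervals in the second table, the sketch below computes a normal-approximation interval for a mean, with an optional finite population correction (FPC). This is an assumption about what `"strategy_name": "normal"`, `"confidence_level": 0.95`, and `"use_fpc": True` refer to, not a description of RapidFire AI's actual implementation. The function name `normal_ci_with_fpc` and all numbers are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def normal_ci_with_fpc(samples, population_size, confidence_level=0.95, use_fpc=True):
    """Return (estimate, lower, upper) for the mean of a finite population."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate the variance")
    # Two-sided critical value, e.g. about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    std_err = stdev(samples) / math.sqrt(n)
    if use_fpc and population_size > 1:
        # FPC shrinks the error as the sample covers more of the population
        std_err *= math.sqrt(max(population_size - n, 0) / (population_size - 1))
    estimate = mean(samples)
    return estimate, estimate - z * std_err, estimate + z * std_err


# Toy per-query scores seen so far, out of a toy population of 100 queries
scores_so_far = [0.2, 0.4, 0.6, 0.8, 0.4, 0.6]
estimate, lower, upper = normal_ci_with_fpc(scores_so_far, population_size=100)
print(f"estimate={estimate:.3f} interval=[{lower:.3f}, {upper:.3f}]")
```

Under this model the interval narrows as more shards are processed and collapses to a single point once every query has been evaluated.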

RapidFire AI provides an Interactive Controller that lets you manage executing runs dynamically:

- ⏹️ **Stop**: Gracefully stop a running config
- ▶️ **Resume**: Resume a stopped run
- 🗑️ **Delete**: Remove a run from this experiment
- 📋 **Clone**: Create a new run by editing the config dictionary of a parent run
- 🔄 **Refresh**: Update run status and metrics

In [None]:
# Launch evals of all RAG configs in the config_group, with the dataset split into 4 shards (num_shards=4)
# NB: If your machine has only 1 GPU, set num_actors=1
results = experiment.run_evals(
    config_group=config_group,
    dataset=fiqa_dataset,
    num_actors=2,
    num_shards=4,
    seed=42,
)

### View Trackio Dashboard

To visualize your experiment metrics in the Trackio dashboard, open a terminal and run:

```bash
trackio show --project "exp1-fiqa-rag-trackio"
```

This will launch the Trackio dashboard in your browser where you can:
- View evaluation metric curves as they update in real time
- Compare metrics across different configurations
- Analyze hyperparameter impacts on performance

### View Results

In [None]:
# Convert results dict to DataFrame
results_df = pd.DataFrame([
    {k: v['value'] if isinstance(v, dict) and 'value' in v else v for k, v in {**metrics_dict, 'run_id': run_id}.items()}
    for run_id, (_, metrics_dict) in results.items()
])

results_df

### End Experiment

In [None]:
experiment.end()

### View RapidFire AI Log Files

In [None]:
# Get the experiment-specific log file
log_file = experiment.get_log_file_path()

print(f"📄 Log File: {log_file}")
print()

if log_file.exists():
    print("=" * 80)
    print(f"Last 30 lines of {log_file.name}:")
    print("=" * 80)
    with open(log_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines[-30:]:
            print(line.rstrip())
else:
    print(f"❌ Log file not found: {log_file}")

### Conclusion

We built a Financial Q&A RAG pipeline and compared **8 configurations** (2 generators × 2 chunking strategies × 2 reranking settings) using standard retrieval metrics, all tracked with Trackio.

**What we covered:**
- Enabling Trackio as the sole experiment tracking backend
- Defining a search space with `RFLangChainRagSpec` and `RFGridSearch`
- Computing retrieval metrics (Precision, Recall, F1, NDCG@5, MRR)
- Visualizing results in the Trackio dashboard

**Ideas to explore next:**
- Try additional retrieval knobs (e.g., different embedding models, varying `k`, chunk overlap settings)
- Add generation quality metrics alongside retrieval metrics
- Scale to larger datasets or more generator model comparisons