# 04. Retrieval-Augmented Generation


Large Language Models (LLMs) are brilliant generalists: they've read the internet and can reason across domains, but they **don't know what they haven't seen**. Their parameters store general knowledge, not private, up-to-date, or domain-specific facts. **Retrieval-Augmented Generation (RAG)** bridges that gap.
It combines:

1. **Retrieval** – find relevant information from an **external knowledge base** (e.g., docs, databases, websites).
2. **Generation** – pass that retrieved context into an LLM to ground its answer.

This simple loop (*retrieve → augment → generate*) makes the model:

* **More accurate** (answers are grounded in retrieved facts, which reduces hallucinations)
* **More current** (retrieval can include recent or proprietary data)
* **Cheaper & smaller** (you don't need to fine-tune large models for every dataset)
* **Explainable** (you can trace answers back to the retrieved sources)

RAG is now a **foundation of many enterprise AI systems**, powering products like search-chat hybrids, coding copilots, knowledge assistants, and customer-support bots.
In short: *RAG makes LLMs grounded, trustworthy, and useful in the real world.*
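
To make the loop concrete before we bring in real tooling, here is a minimal sketch of *retrieve → augment → generate* in plain Python. Everything in it is an illustrative assumption: the three-sentence knowledge base, the word-overlap `retrieve` function (a stand-in for real search), and `echo_llm`, a fake "model" that simply returns the first context line.

```python
import re
from typing import Callable, List

# Toy knowledge base; in this tutorial the role is played by chunked Pokémon pages.
KNOWLEDGE_BASE = [
    "Pichu evolves into Pikachu when leveled up with high friendship.",
    "Raichu is the evolved form of Pikachu.",
    "Slowpoke is a Water/Psychic type.",
]

def tokenize(text: str) -> set:
    """Lowercase the text and return its set of ASCII words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, docs: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by shared words with the query (stand-in for real search)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def augment(query: str, context: List[str]) -> str:
    """Build the prompt: retrieved context first, then the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"CONTEXT:\n{joined}\n\nQUESTION: {query}"

def rag_answer(query: str, docs: List[str], generate: Callable[[str], str]) -> str:
    """Retrieve, augment, then hand the prompt to any text generator."""
    return generate(augment(query, retrieve(query, docs)))

def echo_llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat model: returns the first context line.
    return prompt.splitlines()[1]

print(rag_answer("Which Pokémon evolves into Pikachu?", KNOWLEDGE_BASE, echo_llm))
```

Swapping `retrieve` for a vector or hybrid search and `echo_llm` for a real chat model gives the pipeline built in the rest of this notebook.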

I highly recommend watching explanations of RAG from [IBM](https://www.youtube.com/watch?v=T-D1OfcDW1M) and [Cole Medin](https://www.youtube.com/watch?v=tLMViADvSNE).


## Scenario: Why Pokémon Queries Are Hard for Pure LLMs

Let's take something seemingly simple: asking questions about Pokémon species like *Pikachu*, *Charizard*, or *Mewtwo*.
At first glance, LLMs might seem to know this, but there are hidden challenges:

| Problem | Why it's hard for an LLM |
| --- | --- |
| **Data freshness** | Game mechanics, move sets, and forms change with every generation, so LLMs trained on older data may be outdated. |
| **Structured facts** | Evolution trees, base stats, and type matchups are stored in *tables*, not prose, which makes them hard for models to memorize precisely. |
| **Ambiguity** | Words like "form", "Mega Evolution", "TM", or "base stats" require domain-specific interpretation. |
| **Compositional queries** | "Which Pokémon evolves into Pikachu?" or "List Charizard's Mega forms and their base stats" require multiple lookups and reasoning steps. |

When we ask these **zero-shot**, even the best LLMs often **hallucinate**:

* inventing fake evolution lines,
* mixing up stats across generations,
* or returning vague, generic answers.

That's where **RAG** shines:

* We **retrieve** the real Pokémon data (from pokemondb.net in this tutorial).
* We **chunk and embed** those markdown pages in a **vector database (LanceDB)**.
* Then, for each query, we **retrieve the most relevant chunks** and let the LLM reason *grounded in evidence*.

So instead of guessing, our agent *reads* and *reasons*.
This setup scales naturally to enterprise settings, from Pokémon encyclopedias to product catalogs, regulatory documents, or customer knowledge bases.

For our data, we use [PokemonDB](https://pokemondb.net/pokedex/). We'll fetch the pages for **pichu, pikachu, raichu, charizard, mewtwo, slowpoke** and save each one as a `.md` file. These pages are HTML, so we convert them to Markdown for easier chunking.

In [2]:
import os, time, json, math, re, uuid, statistics, textwrap

from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass

from rich import print
from rich.table import Table
import numpy as np
import pandas as pd

# OpenRouter via OpenAI-compatible client
from openai import OpenAI

# Text splitters
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# PydanticAI
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic import BaseModel, Field

In [3]:
import requests, pathlib
from markdownify import markdownify as mdify

# Saving data for common pokemons
POKEMON = [
    ("pichu",     "https://pokemondb.net/pokedex/pichu"),
    ("pikachu",   "https://pokemondb.net/pokedex/pikachu"),
    ("raichu",    "https://pokemondb.net/pokedex/raichu"),
    ("charizard", "https://pokemondb.net/pokedex/charizard"),
    ("mewtwo",    "https://pokemondb.net/pokedex/mewtwo"),
    ("slowpoke",  "https://pokemondb.net/pokedex/slowpoke"),
]

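def fetch_markdown(url: str) -> str:
    """Download a page and convert its HTML to Markdown."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of saving an error page
    return mdify(resp.text, heading_style="ATX")

DATA_DIR = pathlib.Path("./data/pokemon_md")
DATA_DIR.mkdir(parents=True, exist_ok=True)  # write_text() fails if the directory is missing

downloaded = []
for name, url in POKEMON:
    md_text = fetch_markdown(url)
    path = DATA_DIR / f"{name}.md"
    path.write_text(md_text, encoding="utf-8")
    downloaded.append((name, str(path), url))

print(f"Saved {len(downloaded)} markdown files → {DATA_DIR}")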

Let's see what a sample of this data page looks like.

In [4]:
from IPython.display import Markdown, display
import pathlib

md_path = pathlib.Path("./data/pokemon_md/pikachu.md")
display(Markdown(md_path.read_text(encoding="utf-8")[3000:4000]))  # a 1000-character sample (characters 3000-3999)

1)

[![Pikachu artwork by Ken Sugimori](https://img.pokemondb.net/artwork/pikachu.jpg)](https://img.pokemondb.net/artwork/large/pikachu.jpg)

[Additional artwork](/artwork/pikachu)

## Pokédex data

|  |  |
| --- | --- |
| National № | **0025** |
| Type | [Electric](/type/electric) |
| Species | Mouse Pokémon |
| Height | 0.4 m (1′04″) |
| Weight | 6.0 kg (13.2 lbs) |
| Abilities | 1. [Static](/ability/static "Contact with the Pokémon may cause paralysis.") [Lightning Rod](/ability/lightning-rod "Draws in all Electric-type moves to up Sp. Attack.") (hidden ability) |
| Local № | 0025 (Yellow/Red/Blue) 0022 (Gold/Silver/Crystal) 0156 (Ruby/Sapphire/Emerald) 0025 (FireRed/LeafGreen) 0104 (Diamond/Pearl) 0104 (Platinum) 0022 (HeartGold/SoulSilver) 0036 (X/Y - Central Kalos) 0163 (Omega Ruby/Alpha Sapphire) 0025 (Sun/Moon - Alola dex) 0032 (U.Sun/U.Moon - Alola dex) 0025 (Let's Go Pikachu/Let's Go Eevee) 0194 (Sword/Shield) 0104 (Brilliant Diamond/Shining Pearl) 0056 (Legends: Arceus) 0074

## 4️⃣ Preparing our Knowledge Base: Chunking the Pokémon Markdown Files

Now that we've downloaded Pokémon data as `.md` files (for Pikachu, Charizard, Mewtwo, etc.),  
we need to **split the text into smaller chunks** before embedding it into a vector database.

Why?

- **LLMs and embeddings have context limits**, so we can't feed the entire document at once.  
- **Smaller, semantically coherent chunks** help retrieval systems match relevant sections precisely.  
- Chunking can also improve **Recall@k**, **latency**, and **embedding reuse** during updates.

We'll try two common splitting strategies:

| Splitter | Description | When to use |
|-----------|--------------|-------------|
| 🧩 **RecursiveCharacterTextSplitter** | Splits on a hierarchy of separators (paragraphs, then lines, then words) until each chunk fits the size limit, keeping an overlap between neighbouring chunks. | Generic text without structure. |
| 🧱 **MarkdownHeaderTextSplitter** | Splits along Markdown headers (`#`, `##`, `###`); we then cap chunk size with a second, recursive pass. | Structured content (docs, wikis, pages like Pokémon DB). |

After chunking, we'll have two parallel sets of documents:
- `docs_rec`: recursively chunked plain text  
- `docs_md`: structure-aware markdown chunks  

These will later be embedded into LanceDB and compared for retrieval quality.
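
To build intuition for what `chunk_size` and `chunk_overlap` do, here is a deliberately simplified sketch: a fixed-width sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators), and the function name `sliding_window_chunks` is made up for this illustration.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(25))  # 25 characters: "0123456789012..."
chunks = sliding_window_chunks(sample, chunk_size=10, overlap=3)

for c in chunks:
    print(repr(c))
print("source chars:", len(sample), "| chunk chars:", sum(len(c) for c in chunks))
```

Each window repeats the last 3 characters of the previous one, so the chunks together contain more characters than the source. The same effect shows up later when we compute a coverage ratio over overlapping chunks.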


In [None]:
from typing import List, Dict, Any, Optional, Tuple
import os

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# --- Chunking params ---
CHUNK_SIZE = 700
CHUNK_OVERLAP = 120

# --- Eval/profiling ---
EVAL_K_LIST = [1, 3, 5]
EMBEDDING_COST_PER_1K = float(os.getenv("EMBED_COST_PER_1K", "0.00013"))  # USD
PRINT_TOP_N = 5

def read_files_as_object_array(directory_path: str) -> List[Dict[str, str]]:
    out = []
    for fname in os.listdir(directory_path):
        fpath = os.path.join(directory_path, fname)
        if os.path.isfile(fpath):
            with open(fpath, "r", encoding="utf-8") as f:
                out.append({"filename": fname, "content": f.read()})
    return out

def recursive_text_splitter(data, chunk_size, overlap_size):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    texts = splitter.create_documents(
        [f"{d['filename']}\n{d['content']}" for d in data],
        metadatas=[{"filename": d["filename"]} for d in data],
    )
    return texts

def markdown_splitter(data, chunk_size, overlap_size):
    md_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")], strip_headers=True
    )
    md_splits = []
    for d in data:
        splits = md_splitter.split_text(d["content"])
        for s in splits:
            s.metadata["filename"] = d["filename"]
        md_splits.extend(splits)

    size_limiter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    return size_limiter.split_documents(md_splits)

docs_raw = read_files_as_object_array(str(DATA_DIR))
docs_rec = recursive_text_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)
docs_md  = markdown_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)

print(f"Recursive chunks: {len(docs_rec)} | Markdown+size chunks: {len(docs_md)}")

## 5️⃣ Building our Vector Database: Introduction to LanceDB

Before our agent can "retrieve" knowledge, we need a **database that understands vectors**: numerical representations of text meaning (embeddings).  
That's where **[LanceDB](https://lancedb.com/)** comes in.

### 🔍 What is LanceDB?
LanceDB is a **lightweight, local-first vector database** built on the **Lance columnar format**.  
It's designed for:
- **Storing and searching** high-dimensional embeddings (like text or image vectors).  
- Performing **semantic similarity queries** (e.g., "find texts most similar to this query").  
- **Hybrid retrieval**: combining full-text search (BM25 / Tantivy) and vector search.  
- **Speed and simplicity**: it runs locally (no separate server needed).

### 🧠 What we'll do here
1. **Embed** all Pokémon chunks with an embedding model served through OpenRouter (`qwen/qwen3-embedding-8b` by default, overridable via the `EMBEDDINGS_MODEL` environment variable).  
2. **Create / connect** to a LanceDB table named `"pokemon_pages_tmp"`.  
3. **Insert** each chunk's text, vector, and metadata (like filename & splitter type).  
4. **Build** a full-text search (FTS) index for keyword lookups alongside vector search.

After this step, we'll have a ready-to-query LanceDB store, the foundation for our Retrieval-Augmented Generation (RAG) pipeline.

In [None]:
from dotenv import load_dotenv
from openai import OpenAI

import lancedb
import uuid

load_dotenv()

OPENAI_BASE_URL = "https://openrouter.ai/api/v1"

EMBED_MODEL = os.getenv("EMBEDDINGS_MODEL", "qwen/qwen3-embedding-8b")

client = OpenAI(base_url=OPENAI_BASE_URL, api_key=os.getenv('OPENROUTER_API_KEY'))

DB_URI = "./db/sample-lancedb"
TABLE_NAME_TMP = "pokemon_pages_tmp"
TABLE_NAME = "pokemon_pages"

def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 64) -> List[List[float]]:
    """
    Returns a list of embedding vectors. Uses OpenAI-compatible client pointed at OpenRouter.
    """
    out = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        out.extend([e.embedding for e in resp.data])
    return out

db = lancedb.connect(DB_URI)
if TABLE_NAME_TMP in db.table_names():
    tbl = db.open_table(TABLE_NAME_TMP)
    print(f"Loaded LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")
else:
    all_chunks = []
    for d in docs_rec:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                        "metadata": {"filename": d.metadata.get("filename",""), "splitter": "recursive"}})
    for d in docs_md:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                        "metadata": {"filename": d.metadata.get("filename",""), "splitter": "markdown"}})

    print("Embedding chunks...")
    vectors = embed_texts([c["content"] for c in all_chunks])
    for c, v in zip(all_chunks, vectors):
        c["vector"] = v
    tbl = db.create_table(TABLE_NAME_TMP, data=all_chunks)
    tbl.create_fts_index("content")
    print(f"Indexed {len(all_chunks)} chunks into LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")

## 6️⃣ Searching the Knowledge Base: Semantic vs Keyword Search

Now that our Pokémon chunks are stored in **LanceDB**, let's learn how to **search** through them.

### 🧭 What is Semantic Search?
Traditional search engines (like keyword or BM25 search) match **exact words** or **phrases** in your query.  
But LLMs and embeddings represent meaning as **vectors in high-dimensional space**, a *semantic* space.  

In **semantic search**, we:
1. **Embed** the query into a vector (using the same embedding model as our database).  
2. Measure its **closeness** to all stored vectors (chunks) using **cosine similarity** or **dot product**.  
3. Retrieve the most **semantically similar** chunks, even if they don't share exact words.

For example:  
> Query → "Who evolves into Pikachu?"  
> Closest text → "Pichu evolves into Pikachu when leveled up with high friendship."

Even though words like "who" and "friendship" don't appear in both texts, their embeddings are **close** in the semantic space, allowing **meaning-based retrieval**. I recommend watching this video on vector search by [IBM](https://www.youtube.com/watch?v=gl1r1XV0SLw).
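
The three steps above can be sketched with hand-made vectors. The 3-dimensional "embeddings" below are invented purely for illustration (real embedding models return hundreds or thousands of dimensions), and `cosine_similarity` is a plain-Python helper, not a LanceDB call.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up 3-d vectors standing in for real chunk embeddings.
TOY_VECTORS = {
    "Pichu evolves into Pikachu with high friendship.": [0.9, 0.1, 0.0],
    "Mewtwo base stats table.": [0.0, 0.2, 0.9],
    "Slowpoke is a Water/Psychic type.": [0.1, 0.9, 0.2],
}

# Pretend this is the embedding of "Who evolves into Pikachu?".
query_vector = [0.8, 0.2, 0.1]

# Rank chunks by similarity to the query, most similar first.
ranked = sorted(
    TOY_VECTORS,
    key=lambda text: cosine_similarity(query_vector, TOY_VECTORS[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(query_vector, TOY_VECTORS[text]):.3f}  {text}")
```

The query never has to share a word with the winning chunk; only the direction of the two vectors matters.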

### 🧩 Three search modes we'll explore
| Method | Description | Strength |
|---------|--------------|-----------|
| 🔡 **FTS (Full Text Search)** | Matches literal terms using BM25 (like keyword search). | Great for rare names, exact filters, or numeric queries. |
| 🧠 **Vector Search** | Uses embedding similarity in high-dimensional space. | Captures meaning, paraphrases, and context. |
| ⚡ **Hybrid Search** | Fuses both (via Reciprocal Rank Fusion). | Balances precision (FTS) and recall (semantic). |

The next cell defines functions for each search type and prints the **top result** of each mode for two sample queries,  
so you can see how **semantic closeness** changes the quality of retrieval.
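
Before reading the search functions, it may help to see Reciprocal Rank Fusion as arithmetic. In the sketch below each document earns `1 / (k + rank)` from every list it appears in, with ranks starting at 1 and `k = 60`. The chunk IDs such as `pichu#1` are hypothetical labels, not real rows in our table.

```python
from collections import defaultdict
from typing import Dict, List

def rrf_scores(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) over every ranked list a document appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

# Hypothetical results from the two retrievers for the same query.
vector_ranking = ["raichu#3", "pichu#1", "pikachu#7"]
fts_ranking = ["pichu#1", "pikachu#2", "raichu#3"]

scores = rrf_scores([vector_ranking, fts_ranking])
fused = sorted(scores, key=scores.get, reverse=True)

for doc_id in fused:
    print(f"{scores[doc_id]:.5f}  {doc_id}")
```

`pichu#1` is ranked 2nd and 1st, so it scores `1/62 + 1/61` and edges out `raichu#3` at `1/61 + 1/63`. Documents found by only one retriever still appear, just lower down.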

In [62]:
from rich import print as rprint

def perform_vector_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    emb = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    qb = tbl.search(emb).metric('cosine').limit(top_k).select(["content", "metadata", "_distance", "vector"])
    if pokemon:
        qb = qb.where(f"metadata.filename = '{pokemon}.md'")
    return qb.to_list()

def perform_fts_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    qb = tbl.search(query, query_type="fts").limit(top_k).select(["content", "metadata", "_score", "vector"])
    if pokemon:
        qb = qb.where(f"metadata.filename = '{pokemon}.md'", prefilter=True)
    return qb.to_list()

def reciprocal_rank_fusion(results_a, results_b, k: int = 60):
    """Fuse two ranked lists: each hit contributes 1 / (k + rank), with ranks starting at 1."""
    def rid(x): return hash(x["content"])  # identify a chunk by its text
    scores = {}
    for i, r in enumerate(results_a):
        scores[rid(r)] = scores.get(rid(r), 0) + 1.0/(k+i+1)
    for i, r in enumerate(results_b):
        scores[rid(r)] = scores.get(rid(r), 0) + 1.0/(k+i+1)
    # Keep one copy of each chunk, then sort by fused score (highest first).
    uniq = {}
    for r in results_a + results_b:
        uniq[rid(r)] = r
    ranked = sorted(uniq.values(), key=lambda r: scores[rid(r)], reverse=True)
    return ranked

def perform_hybrid_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    vres = perform_vector_search(query, pokemon, top_k=top_k)
    fres = perform_fts_search(query, pokemon, top_k=top_k)
    fused = reciprocal_rank_fusion(vres, fres)[:top_k]
    return fused

queries = [
    "Which Pokémon evolves into Pikachu?",
    "Show Mega evolutions for Charizard",
]

for q in queries:
    rprint(f"\n[bold green]Query:[/] {q}")
    v = perform_vector_search(q, top_k=3)
    f = perform_fts_search(q, top_k=3)
    h = perform_hybrid_search(q, top_k=3)
    rprint("[magenta]Vector top1:[/]", v[0]["metadata"]["filename"], "→", v[0]["content"][:120].replace("\n"," "))
    rprint("[magenta]FTS    top1:[/]", f[0]["metadata"]["filename"], "→", f[0]["content"][:120].replace("\n"," "))
    rprint("[magenta]Hybrid top1:[/]", h[0]["metadata"]["filename"], "→", h[0]["content"][:120].replace("\n"," "))

## 7️⃣ Evaluating Retrieval Quality: Coverage, Recall, and Ranking Metrics

Once our Pokémon chunks are embedded and searchable, we need to **measure how well** the retrieval step is working.  
Even the best LLM can only answer correctly if the **right information** was fetched first.

### 🧩 Why Evaluation Matters
RAG systems rely on two main components:
1. **Retrieval** – finding the most relevant chunks from the knowledge base.  
2. **Generation** – the LLM reasoning over those chunks to answer questions.

If retrieval fails (missing or irrelevant chunks), generation will almost certainly fail too, no matter how smart the model is.  
That's why **retrieval metrics** are critical for diagnosing performance.

### 📊 Metrics we'll compute
| Metric | What it measures | Why it matters |
|---------|------------------|----------------|
| **Coverage Ratio** | Total characters across all chunks divided by total characters in the source documents. | Ensures chunking didn't lose too much information. |
| **Recall@k** | Whether at least one chunk from a relevant file appears in the top-k retrieved results. | Tests if the search finds what we need (completeness). |
| **MRR (Mean Reciprocal Rank)** | How early in the ranking the first relevant chunk appears. | Rewards search methods that bring correct answers to the top. |
| **Latency** *(later)* | Time taken for each search query. | Balances quality vs speed for production systems. |

In the next cell, we'll start with **coverage statistics**, verifying that our chunking step retains most of the source content for both splitters (recursive and markdown).  
Because neighbouring chunks overlap by up to `CHUNK_OVERLAP` characters, the ratio can come out slightly above 1.0.  
This acts as a sanity check before moving on to deeper retrieval evaluation.
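
As a quick worked example of the two ranking metrics, the sketch below scores three hand-written retrieval runs. The result lists are invented for illustration; the definitions follow the table above (Recall@k as a hit-or-miss check, reciprocal rank as `1 / position` of the first relevant file).

```python
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant file is among the top-k results, else 0.0."""
    return 1.0 if any(f in relevant for f in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant file, or 0.0 if none was retrieved."""
    for rank, f in enumerate(retrieved, start=1):
        if f in relevant:
            return 1.0 / rank
    return 0.0

# (retrieved filenames in ranked order, ground-truth filenames): made-up runs.
runs = [
    (["pikachu.md", "raichu.md", "pichu.md"], {"pichu.md"}),  # first hit at rank 3
    (["mewtwo.md", "charizard.md"], {"mewtwo.md"}),           # first hit at rank 1
    (["raichu.md", "pikachu.md"], {"slowpoke.md"}),           # no hit at all
]

mrr = sum(reciprocal_rank(r, gt) for r, gt in runs) / len(runs)
recall_1 = sum(recall_at_k(r, gt, 1) for r, gt in runs) / len(runs)
recall_3 = sum(recall_at_k(r, gt, 3) for r, gt in runs) / len(runs)

print(f"MRR={mrr:.3f}  Recall@1={recall_1:.3f}  Recall@3={recall_3:.3f}")
```

The reciprocal ranks are 1/3, 1, and 0, so the mean is about 0.444. Recall@1 only counts the second run, while Recall@3 also credits the first.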

In [63]:
import pandas as pd 

GROUND_TRUTH = {
    "Which Pokémon evolves into Pikachu?": ["pichu.md"],
    "Which Pokémon learns Volt Tackle via breeding/light ball mechanics?": ["pikachu.md", "pichu.md"],
    "Show Mega evolutions for Charizard": ["charizard.md"],
    "Base stats of Mewtwo": ["mewtwo.md"],
    "What is Mewtwo's base stat total (BST)?": ["mewtwo.md"],
    "What is Slowpoke's type?": ["slowpoke.md"],
    "What moves can Raichu learn by TM?": ["raichu.md"],
}

def coverage_stats(docs_raw, chunks) -> Dict[str, float]:
    total_chars = sum(len(d["content"]) for d in docs_raw)
    chunk_chars = sum(len(c.page_content) for c in chunks)
    return {
        "total_chars": total_chars,
        "chunk_chars": chunk_chars,
        "coverage_ratio": chunk_chars / total_chars if total_chars else 0.0
    }

cov_rec = coverage_stats(docs_raw, docs_rec)
cov_md  = coverage_stats(docs_raw, docs_md)

pd.DataFrame([
    {"splitter": "recursive", **cov_rec},
    {"splitter": "markdown",  **cov_md},
])

|  | splitter | total_chars | chunk_chars | coverage_ratio |
| --- | --- | --- | --- | --- |
| 0 | recursive | 249998 | 264501 | 1.058012 |
| 1 | markdown | 249998 | 259329 | 1.037324 |


In [64]:
import time 

def eval_search(queries: List[str], search_fn, ks=(1,3,5)) -> pd.DataFrame:
    rows = []
    for q in queries:
        t0 = time.time()
        results = search_fn(q, top_k=max(ks))
        elapsed = time.time() - t0
        filenames = [r["metadata"]["filename"] for r in results]
        gt = set(GROUND_TRUTH[q])
        recs = {}
        for k in ks:
            recs[f"Recall@{k}"] = 1.0 if any(f in gt for f in filenames[:k]) else 0.0
        rr = 0.0
        for i, f in enumerate(filenames, start=1):
            if f in gt:
                rr = 1.0 / i
                break
        rows.append({"query": q, "latency_ms": round(1000*elapsed,2), "MRR": rr, **recs})
    return pd.DataFrame(rows)

df_vec = eval_search(list(GROUND_TRUTH.keys()), perform_vector_search, ks=tuple(EVAL_K_LIST))
df_fts = eval_search(list(GROUND_TRUTH.keys()), perform_fts_search,    ks=tuple(EVAL_K_LIST))
df_hyb = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_search, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector","FTS","Hybrid"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Vector):[/]"); display(df_vec)
rprint("[bold]Per-query (FTS):[/]"); display(df_fts)
rprint("[bold]Per-query (Hybrid):[/]"); display(df_hyb)
rprint("[bold green]Summary:[/]"); display(summary)

**Per-query (Vector):**

|  | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Which Pokémon evolves into Pikachu? | 713.24 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 1230.84 | 0.5 | 0.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 787.39 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 748.83 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo's base stat total (BST)? | 668.39 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 712.11 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 652.49 | 1.0 | 1.0 | 1.0 | 1.0 |


**Per-query (FTS):**

|  | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Which Pokémon evolves into Pikachu? | 15.75 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 15.71 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 16.11 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 1.28 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo's base stat total (BST)? | 14.36 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 17.23 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 4.59 | 0.0 | 0.0 | 0.0 | 0.0 |


**Per-query (Hybrid):**

|  | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Which Pokémon evolves into Pikachu? | 751.2 | 0.333333 | 0.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 679.05 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 737.3 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 933.25 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo's base stat total (BST)? | 750.39 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 1000.14 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 987.52 | 1.0 | 1.0 | 1.0 | 1.0 |


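**Summary:**

|  | Method | MRR(mean) | Recall@1(mean) | Recall@3(mean) | Recall@5(mean) | Latency(ms, mean) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Vector | 0.786 | 0.714 | 0.857 | 0.857 | 787.613 |
| 1 | FTS | 0.857 | 0.857 | 0.857 | 0.857 | 12.147 |
| 2 | Hybrid | 0.905 | 0.857 | 1.0 | 1.0 | 834.121 |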


### 🔎 Interpreting the Results

**TL;DR:** *Hybrid wins on quality; FTS wins on speed.*

- **Hybrid (MRR=0.905, Recall@3/5=1.0, ~830 ms):** Best overall retrieval quality. Reciprocal Rank Fusion (RRF) captures **semantic matches** that FTS misses while still surfacing **exact-term hits**: it recovers both the Pikachu-evolution query that vector search missed and the Raichu TM query that FTS missed. Ideal default for general-purpose RAG.
- **Vector (MRR=0.786, Recall@5=0.857, ~790 ms):** Strong semantic coverage, great when users paraphrase. Far slower than FTS in this run, because every query first needs an embedding call to the remote API before the nearest-neighbor search.
- **FTS (MRR=0.857, Recall@k=0.857 at every k, ~12 ms):** **Blazing fast** and excels for **exact names, forms, numbers** (e.g., "TM", "Mega"). But it can miss paraphrases or semantic matches.

What to deploy
- **Default:** Hybrid.  
- **Query routing:** Use **FTS** for quoted phrases/IDs/numerics; otherwise **Hybrid**.  
- **Latency-sensitive paths:** FTS with a **semantic fallback** on low-confidence.
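
The routing and fallback ideas above can be sketched in a few lines. This is a rough heuristic under stated assumptions, not a tuned router: `looks_exact`, `route_query`, and `search_with_fallback` are hypothetical names, "quoted phrase or digit" is a crude proxy for exact-term queries, and "too few FTS hits" is a crude proxy for low confidence. The search functions are passed in, so any pair of retrievers with a `query -> list` shape will do.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

SearchFn = Callable[[str], List[Dict[str, Any]]]

def looks_exact(query: str) -> bool:
    """Heuristic: a quoted phrase or any digit suggests an exact-term lookup."""
    return bool(re.search(r'"[^"]+"', query) or re.search(r"\d", query))

def route_query(query: str) -> str:
    """Pick a search mode name for the query."""
    return "fts" if looks_exact(query) else "hybrid"

def search_with_fallback(
    query: str, fts_fn: SearchFn, hybrid_fn: SearchFn, min_results: int = 1
) -> Tuple[str, List[Dict[str, Any]]]:
    """Try fast FTS first; fall back to hybrid when FTS returns too few hits."""
    results = fts_fn(query)
    if len(results) >= min_results:
        return "fts", results
    return "hybrid", hybrid_fn(query)

print(route_query('"Volt Tackle" breeding'))               # quoted phrase
print(route_query("Which Pokémon evolves into Pikachu?"))  # natural language
```

In this notebook, the two callables could be thin wrappers around `perform_fts_search` and `perform_hybrid_search`.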

## 9️⃣ Improving Precision: What is Reranking and Why It Helps

Even after combining vector and keyword search, our top results may still include **partially relevant** or **redundant** chunks.  
That's where **reranking** comes in, a crucial final step in the retrieval pipeline.

### 🎯 What is Reranking?
Reranking means taking the **initial set of retrieved results** (e.g., top 20) and reordering them using a **more accurate relevance model**.  
This model computes a finer-grained similarity between the **query** and each retrieved chunk.

Common reranking approaches:
- **Embedding-based cosine similarity** *(lightweight)* – compares the query vector with each chunk's vector (as we'll do here).  
- **Cross-encoder models** *(heavier)* – feed `[query, passage]` pairs into an LLM or BERT-like model for deeper contextual matching.

### 💡 Why Reranking Helps
- **First-stage retrieval** (vector/FTS/hybrid) is optimized for speed, not precision.  
- **Reranking** refines the order to push **the most semantically aligned chunks** to the top, improving **MRR** and **answer faithfulness**.  
- It's especially useful when:
  - Many chunks share overlapping content.  
  - The query is nuanced or multi-faceted (e.g., "Mega evolutions and base stats of Charizard").  
  - You plan to feed only a few chunks into the LLM for generation.

In the next cell, we'll apply a simple **cosine-similarity-based reranker** that reorders hybrid search results using the query's embedding,  
a lightweight upgrade for small to mid-sized RAG systems.
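
Stripped of the embedding call, a reranker is just "score every candidate against the query, then sort". The sketch below makes the scorer pluggable so the same loop could host cosine similarity or a cross-encoder. `rerank` and `word_overlap` are hypothetical names, and the word-overlap scorer is only a toy stand-in for a real relevance model.

```python
from typing import Any, Callable, Dict, List

def rerank(
    query: str,
    candidates: List[Dict[str, Any]],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    """Reorder candidates by score_fn(query, passage), best first, and keep top_k."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c["content"]), reverse=True)
    return scored[:top_k]

def word_overlap(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that occur in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

# Made-up first-stage results, in the order the retriever returned them.
candidates = [
    {"content": "charizard base stats table"},
    {"content": "charizard mega evolutions and base stats"},
    {"content": "pikachu moves learnt by tm"},
]

top = rerank("mega evolutions and base stats of charizard", candidates, word_overlap, top_k=2)
for c in top:
    print(c["content"])
```

The second candidate covers 6 of the 7 query words and moves to the top, while the first covers only 3. A stronger `score_fn` changes the ordering but not the surrounding loop.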

In [37]:
import numpy as np 

def cosine(a, b):
    a = np.array(a); b = np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rerank_by_query_vector(query: str, results: List[Dict[str, Any]], top_k: int = 5):
    """
    Rerank retrieved results based on cosine similarity 
    between the query embedding and each result's embedding vector.
    Keeps the top_k by score, then only the best chunk per file,
    so fewer than top_k results may be returned.
    """
    qv = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    rescored = []
    for r in results:
        rescored.append((cosine(qv, r['vector']), r))
    rescored = sorted(rescored, key=lambda x: x[0], reverse=True)
    rescored = [r for _, r in rescored[:top_k]]
    # Deduplicate by filename, keeping the highest-scoring chunk of each file.
    results, mds = [], set()
    for r in rescored:
        if r['metadata']['filename'] in mds: continue
        mds.add(r['metadata']['filename']); results.append(r)
    return results

def perform_hybrid_rerank(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    fused = perform_hybrid_search(query, pokemon, top_k=top_k*10)
    return rerank_by_query_vector(query, fused, top_k=top_k)

df_hyr = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_rerank, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector","FTS","Hybrid","Reranking"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean(), df_hyr["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean(), df_hyr[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean(), df_hyr["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Hybrid + Rerank):[/]"); display(df_hyr)
rprint("[bold green]Summary:[/]"); display(summary)

**Per-query (Hybrid + Rerank):**

|  | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Which Pokémon evolves into Pikachu? | 2031.45 | 0.5 | 0.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 2542.5 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 4854.6 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 4485.11 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo's base stat total (BST)? | 9059.44 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 1868.7 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 1684.89 | 1.0 | 1.0 | 1.0 | 1.0 |


**Summary:**

|  | Method | MRR(mean) | Recall@1(mean) | Recall@3(mean) | Recall@5(mean) | Latency(ms, mean) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Vector | 0.786 | 0.714 | 0.857 | 0.857 | 1573.404 |
| 1 | FTS | 0.857 | 0.857 | 0.857 | 0.857 | 11.397 |
| 2 | Hybrid | 0.905 | 0.857 | 1.0 | 1.0 | 1451.989 |
| 3 | Reranking | 0.929 | 0.857 | 1.0 | 1.0 | 3789.527 |


**Takeaway:**  
On this 7-query set, reranking gives the **highest MRR** (0.929 vs 0.905 for hybrid) with the same perfect Recall@3/5, but at roughly 2.6× the mean latency (about 3.8 s vs 1.5 s).  
In practice, it's often used as an **optional second stage**, applied only when the agent is uncertain or when quality matters more than speed.

## 🔧 Packaging Retrieval as "Tools" for Agents

Now that we have multiple retrieval strategies (vector, FTS, hybrid),  
we'll wrap them into **simple, reusable tools** that return formatted text contexts.

These tools will later be used by our **PydanticAI agent** to decide:
- Which search mode to use (routing),
- How much context to retrieve, and  
- When to combine multiple sources (reflection and fusion).

Let's define these tool functions next.
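
On the "how much context" question, one simple option is to cap the formatted context at a character budget. The sketch below is an assumption-laden variant of the formatter defined in the next cell: `build_context_with_budget` and `max_chars` are hypothetical, and the budget counts only the chunk blocks, not the `---` separators between them.

```python
from typing import Any, Dict, List

def build_context_with_budget(results: List[Dict[str, Any]], max_chars: int = 1500) -> str:
    """Format results in ranked order, stopping before the budget would be exceeded."""
    parts: List[str] = []
    used = 0
    for r in results:
        block = f"Title: {r['metadata']['filename']}\nContent:\n{r['content']}"
        if used + len(block) > max_chars:
            break  # keep whole chunks only; never cut one in the middle
        parts.append(block)
        used += len(block)
    return "\n---\n".join(parts)

# Three fake results of 100 characters each; only two fit in a 250-character budget.
fake_results = [
    {"metadata": {"filename": "a.md"}, "content": "x" * 100},
    {"metadata": {"filename": "b.md"}, "content": "y" * 100},
    {"metadata": {"filename": "c.md"}, "content": "z" * 100},
]
context = build_context_with_budget(fake_results, max_chars=250)
print(context.count("Title:"), "chunks kept")
```

Because results arrive best-first, dropping from the tail loses the least relevant chunks first.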

In [39]:
def build_context_from_results(results: List[Dict[str,Any]]):
    return "\n---\n".join([
        f"Title: {r['metadata']['filename']}\nContent:\n{r['content']}"
        for r in results
    ])

def tool_vector(query: str, k: int = 5) -> str:
    res = perform_vector_search(query, top_k=k)
    return build_context_from_results(res)

def tool_fts(query: str, k: int = 5) -> str:
    res = perform_fts_search(query, top_k=k)
    return build_context_from_results(res)

def tool_hybrid(query: str, k: int = 5) -> str:
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results(res)

In [46]:
res = tool_hybrid("normal type attack charizard")
print(res)

## 12) Agentic RAG with PydanticAI: Routing + Reflection

In [43]:
import logfire
import nest_asyncio

nest_asyncio.apply()

logfire.configure()
logfire.instrument_pydantic_ai()

CHAT_MODEL  = os.getenv("CHAT_MODEL", "openrouter:openai/gpt-4o-mini")

class VanillaAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")

class RAGAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")
    used_tool: str = Field(description="Which tool was used: vector | fts | hybrid")
    citation: str = Field(description="Filename used to generate response.")

vanilla_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You are a Pokémon expert. Answer the given questions."
    ),
    output_type=VanillaAnswer
)

rag_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. If unknown, say 'I don't know from the corpus'. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid]
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
vanilla_response = vanilla_agent.run_sync(q)

print(vanilla_response.output)

rag_response = rag_agent.run_sync(q)

print(rag_response.output)

14:20:20.390 vanilla_agent run
14:20:20.391   chat openai/gpt-4o-mini


14:20:23.598 rag_agent run
14:20:23.600   chat openai/gpt-4o-mini
14:20:28.389   running 2 tools
14:20:28.400     running tool: tool_hybrid
14:20:28.400     running tool: tool_hybrid
14:20:29.948   chat openai/gpt-4o-mini


## 13) Empirical comparison: reflection vs no-reflection
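
The evaluation cell below calls `answer_with_reflection`, which is not defined anywhere in this section. The following is only a sketch of what such a function might look like, not the original implementation: it answers once, checks the answer, and retries with a different retrieval tool if the check fails. All names here (`ReflectiveAnswer`, `make_answer_with_reflection`, the three injected callables, and the `fts` then `hybrid` order) are assumptions, and the stubs at the bottom replace the real retrievers and LLM so the control flow can be run offline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReflectiveAnswer:
    answer: str
    used_tool: str

def make_answer_with_reflection(
    retrieve_fn: Callable[[str, str], str],      # (query, tool) -> context text
    generate_fn: Callable[[str, str], str],      # (query, context) -> answer text
    is_grounded_fn: Callable[[str, str], bool],  # (answer, context) -> accept it?
    first_tool: str = "fts",
    fallback_tool: str = "hybrid",
) -> Callable[..., ReflectiveAnswer]:
    """Build an answer function with an optional one-step reflection retry."""
    def answer_with_reflection(query: str, enable_reflection: bool = True) -> ReflectiveAnswer:
        ctx = retrieve_fn(query, first_tool)
        text = generate_fn(query, ctx)
        if enable_reflection and not is_grounded_fn(text, ctx):
            # Reflection: the first attempt failed the check, so retry with another tool.
            ctx = retrieve_fn(query, fallback_tool)
            text = generate_fn(query, ctx)
            return ReflectiveAnswer(answer=text, used_tool=fallback_tool)
        return ReflectiveAnswer(answer=text, used_tool=first_tool)
    return answer_with_reflection

# Offline stubs: FTS finds nothing, hybrid finds the right sentence.
stub_contexts = {"fts": "", "hybrid": "Pichu evolves into Pikachu."}
answer_with_reflection = make_answer_with_reflection(
    retrieve_fn=lambda q, tool: stub_contexts[tool],
    generate_fn=lambda q, ctx: ctx or "I don't know from the corpus",
    is_grounded_fn=lambda ans, ctx: bool(ctx),
)

print(answer_with_reflection("Which Pokémon evolves into Pikachu?", enable_reflection=False))
print(answer_with_reflection("Which Pokémon evolves into Pikachu?", enable_reflection=True))
```

With real components, `retrieve_fn` could dispatch to the tool functions from the previous section and `generate_fn` could wrap the RAG agent; the groundedness check is the part that needs the most care.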

In [None]:
EVAL = [
    {
        "q": "Which Pokémon evolves into Pikachu?",
        "must_files": ["pichu.md"],
        "must_keywords": ["Pichu", "evolv", "Pikachu"]
    },
    {
        "q": "Base stats of Mewtwo",
        "must_files": ["mewtwo.md"],
        "must_keywords": ["Base", "HP", "Attack", "Defense"]
    },
    {
        "q": "What is Slowpoke's type?",
        "must_files": ["slowpoke.md"],
        "must_keywords": ["Water", "Psychic"]
    },
]

def score_answer(ans_text: str, files: List[str], kws: List[str]) -> int:
    text = ans_text.lower()
    ok_files = all(f.lower() in text for f in files)
    ok_kw = all(kw.lower() in text for kw in kws)
    return int(ok_files and ok_kw)

rows = []
for ex in EVAL:
    t0 = time.time(); a1 = answer_with_reflection(ex["q"], enable_reflection=False); t1 = time.time()
    t2 = time.time(); a2 = answer_with_reflection(ex["q"], enable_reflection=True);  t3 = time.time()
    rows.append({
        "query": ex["q"],
        "no_reflection_score": score_answer(a1.answer, ex["must_files"], ex["must_keywords"]),
        "with_reflection_score": score_answer(a2.answer, ex["must_files"], ex["must_keywords"]),
        "no_reflection_latency_ms": round(1000*(t1-t0),2),
        "with_reflection_latency_ms": round(1000*(t3-t2),2),
        "tools": f"{a1.used_tool} vs {a2.used_tool}"
    })

df_reflect = pd.DataFrame(rows)
df_reflect, df_reflect.mean(numeric_only=True).to_frame("mean").T

## 14) Routing ablation (FTS-only vs Vector-only vs Hybrid)

In [None]:
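# Assumption: `retrieve` and `Answer` are not defined in this section. Here `retrieve`
# dispatches to the tool functions defined earlier, and `RAGAnswer` stands in for `Answer`.
RETRIEVERS = {"fts": tool_fts, "vector": tool_vector, "hybrid": tool_hybrid}

def retrieve(query: str, tool: str, k: int = 5) -> str:
    return RETRIEVERS[tool](query, k=k)

def run_agent_fixed_tool(query: str, tool: str) -> RAGAnswer:
    ctx = retrieve(query, tool, k=5)
    result = rag_agent.run_sync(f"CONTEXT:\n{ctx}\n\nQuestion: {query}")
    res = result.output   # the validated RAGAnswer instance
    res.used_tool = tool  # record the tool that was forced for this run
    return res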

ABLATE_QUERIES = list(GROUND_TRUTH.keys())
rows = []
for q in ABLATE_QUERIES:
    for tool in ["fts","vector","hybrid"]:
        a = run_agent_fixed_tool(q, tool)
        rows.append({"query": q, "tool": tool, "answer_len": len(a.answer), "used_tool": a.used_tool})
pd.DataFrame(rows)

## 15) Convenience helpers (API-like)

In [None]:
def build_context_from_results_simple(results):
    return "---\n".join([
        f"Title: {r['metadata']['filename']}\nContent:\n{r['content']}" for r in results
    ])

def perform_vector_search_simple(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    emb = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    qb = tbl.search(emb).limit(top_k).select(["content","metadata"])
    if pokemon is not None:
        qb = qb.where(f"metadata.filename = '{pokemon.lower()}.md'")
    return qb.to_list()

def perform_fts_search_simple(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    qb = tbl.search(query, query_type="fts").limit(top_k).select(["content","metadata"])
    if pokemon is not None:
        qb = qb.where(f"metadata.filename = '{pokemon.lower()}.md'", prefilter=True)
    return qb.to_list()

## 16) Wrap-up

- **Semantic (vector) vs keyword (FTS)**: both matter; hybrid often wins.  
- **RAG quality** hinges on chunking, metadata, embeddings, retrieval, and reranking.  
- **Agentic RAG** (routing + reflection) increases robustness on small corpora.  
- Track **coverage**, **Recall@k**, **MRR**, **latency**, **cost**; scale eval sets for reliability.