# 04. Retrieval-Augmented Generation


Large Language Models (LLMs) are brilliant generalists: they've read the internet and can reason across domains, but they **don't know what they haven't seen**. Their parameters store general knowledge, not private, up-to-date, or domain-specific facts. **Retrieval-Augmented Generation (RAG)** bridges that gap.
It combines:

1. **Retrieval** – find relevant information from an **external knowledge base** (e.g., docs, databases, websites).
2. **Generation** – pass that retrieved context into an LLM to ground its answer.

This simple loop (*retrieve → augment → generate*) makes the model:

* **More accurate** (answers are grounded in retrieved facts, which reduces hallucinations)
* **More current** (retrieval can include recent or proprietary data)
* **Cheaper & smaller** (you don't need to fine-tune large models for every dataset)
* **Explainable** (you can trace answers back to the retrieved sources)

RAG is now a **common foundation of enterprise AI systems**, powering products like search-chat hybrids, coding copilots, knowledge assistants, and customer-support bots.
In short: *RAG makes LLMs grounded, trustworthy, and useful in the real world.*
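
To make the loop concrete, here is a minimal sketch of *retrieve → augment → generate* in plain Python. Everything in it is an illustrative assumption rather than part of the tutorial's pipeline: the three-sentence `KNOWLEDGE_BASE`, the word-overlap `retrieve` function (a stand-in for real keyword or vector search), and `fake_llm`, which only pretends to be a chat model.

```python
from typing import Callable, List

# Toy knowledge base (assumption); later in the tutorial this role is played by LanceDB chunks.
KNOWLEDGE_BASE = [
    "Pichu evolves into Pikachu when leveled up with high friendship.",
    "Pikachu evolves into Raichu when exposed to a Thunder Stone.",
    "Slowpoke is a Water/Psychic type.",
]

def retrieve(query: str, docs: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by shared words with the query (stand-in for real search)."""
    q_words = set(query.lower().replace("?", "").split())
    scored = [(len(q_words & set(d.lower().rstrip(".").split())), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]

def augment(query: str, context: List[str]) -> str:
    """Place the retrieved passages in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Delegate to whatever LLM callable is supplied."""
    return llm(prompt)

def fake_llm(prompt: str) -> str:
    # Hypothetical placeholder: a real system would call a chat model here.
    return f"(model answer for a prompt of {len(prompt)} characters)"

question = "Which Pokemon evolves into Pikachu?"
passages = retrieve(question, KNOWLEDGE_BASE)
prompt = augment(question, passages)
print(prompt)
print(generate(prompt, fake_llm))
```

Swapping `retrieve` for a real search backend and `fake_llm` for a real model call changes the quality of each step, but not the shape of the loop.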

I highly recommend watching explanations of RAG from [IBM](https://www.youtube.com/watch?v=T-D1OfcDW1M) and [Cole Medin](https://www.youtube.com/watch?v=tLMViADvSNE).


## Scenario: Why Pokémon Queries Are Hard for Pure LLMs

Let's take something seemingly simple: asking questions about Pokémon species like *Pikachu*, *Charizard*, or *Mewtwo*.
At first glance, LLMs might seem to know this, but there are hidden challenges:

| Problem | Why it's hard for an LLM |
| --- | --- |
| **Data freshness** | Game mechanics, move sets, and forms change with every generation, so LLMs trained on older data may be outdated. |
| **Structured facts** | Evolution trees, base stats, and type matchups are stored in *tables*, not prose, which makes them hard for models to memorize precisely. |
| **Ambiguity** | Words like "form", "Mega Evolution", "TM", or "base stats" require domain-specific interpretation. |
| **Compositional queries** | "Which Pokémon evolves into Pikachu?" or "List Charizard's Mega forms and their base stats" require multiple lookups and reasoning steps. |

When we ask these **zero-shot**, even the best LLMs often **hallucinate**:

* inventing fake evolution lines,
* mixing up stats across generations,
* or returning vague, generic answers.

That's where **RAG** shines:

* We **retrieve** the real Pok√©mon data (from pokemondb.net in this tutorial).
* We **chunk and embed** those markdown pages in a **vector database (LanceDB)**.
* Then, for each query, we **retrieve the most relevant chunks** and let the LLM reason *grounded in evidence*.

So instead of guessing, our agent *reads* and *reasons*.
This setup scales naturally to enterprise settings, from Pokémon encyclopedias to product catalogs, regulatory documents, or customer knowledge bases.

For our data, we use [PokemonDB](https://pokemondb.net/pokedex/). We'll fetch the pages for **pichu, pikachu, raichu, charizard, mewtwo, slowpoke** and save each as a `.md` file. The pages are HTML, so we convert them to Markdown for easier chunking.

In [1]:
import requests, pathlib
from markdownify import markdownify as mdify

# Pokémon pages to download and save locally
POKEMON = [
    ("pichu",     "https://pokemondb.net/pokedex/pichu"),
    ("pikachu",   "https://pokemondb.net/pokedex/pikachu"),
    ("raichu",    "https://pokemondb.net/pokedex/raichu"),
    ("charizard", "https://pokemondb.net/pokedex/charizard"),
    ("mewtwo",    "https://pokemondb.net/pokedex/mewtwo"),
    ("slowpoke",  "https://pokemondb.net/pokedex/slowpoke"),
]

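def fetch_markdown(url: str) -> str:
    """Download a page and convert its HTML to Markdown."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of saving an error page
    return mdify(resp.text, heading_style="ATX")

DATA_DIR = pathlib.Path("./data/pokemon_md")
DATA_DIR.mkdir(parents=True, exist_ok=True)  # write_text fails if the folder is missing

downloaded = []
for name, url in POKEMON:
    md_text = fetch_markdown(url)
    path = DATA_DIR / f"{name}.md"
    path.write_text(md_text, encoding="utf-8")
    downloaded.append((name, str(path), url))

print(f"Saved {len(downloaded)} markdown files → {DATA_DIR}")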

Saved 6 markdown files → data\pokemon_md


Let's see what a sample of this data page looks like.

In [2]:
from IPython.display import Markdown, display
import pathlib

md_path = pathlib.Path("./data/pokemon_md/pikachu.md")
display(Markdown(md_path.read_text(encoding="utf-8")[3000:4000]))  # a 1,000-character slice (characters 3000-4000)

1)

[![Pikachu artwork by Ken Sugimori](https://img.pokemondb.net/artwork/pikachu.jpg)](https://img.pokemondb.net/artwork/large/pikachu.jpg)

[Additional artwork](/artwork/pikachu)

## Pokédex data

|  |  |
| --- | --- |
| National № | **0025** |
| Type | [Electric](/type/electric) |
| Species | Mouse Pokémon |
| Height | 0.4 m (1′04″) |
| Weight | 6.0 kg (13.2 lbs) |
| Abilities | 1. [Static](/ability/static "Contact with the Pokémon may cause paralysis.") [Lightning Rod](/ability/lightning-rod "Draws in all Electric-type moves to up Sp. Attack.") (hidden ability) |
| Local № | 0025 (Yellow/Red/Blue) 0022 (Gold/Silver/Crystal) 0156 (Ruby/Sapphire/Emerald) 0025 (FireRed/LeafGreen) 0104 (Diamond/Pearl) 0104 (Platinum) 0022 (HeartGold/SoulSilver) 0036 (X/Y – Central Kalos) 0163 (Omega Ruby/Alpha Sapphire) 0025 (Sun/Moon – Alola dex) 0032 (U.Sun/U.Moon – Alola dex) 0025 (Let's Go Pikachu/Let's Go Eevee) 0194 (Sword/Shield) 0104 (Brilliant Diamond/Shining Pearl) 0056 (Legends: Arceus) 0074

## 4️⃣ Preparing our Knowledge Base: Chunking the Pokémon Markdown Files

Now that we've downloaded Pokémon data as `.md` files (for Pikachu, Charizard, Mewtwo, etc.),  
we need to **split the text into smaller chunks** before embedding it into a vector database.

Why?

- **LLMs and embeddings have context limits**, so we can't feed the entire document at once.  
- **Smaller, semantically coherent chunks** help retrieval systems match relevant sections precisely.  
- Chunking also improves **Recall@k**, **latency**, and **embedding reuse** during updates.

We'll try two common splitting strategies:

| Splitter | Description | When to use |
|-----------|--------------|-------------|
| 🧩 **RecursiveCharacterTextSplitter** | Splits on a hierarchy of separators (paragraphs, lines, words) until each chunk fits the size limit, keeping an overlap between neighbors. | Generic text without structure. |
| 🧱 **MarkdownHeaderTextSplitter** | Splits along Markdown headers (`#`, `##`, `###`); in our code a size limiter is applied afterwards. | Structured content (docs, wikis, pages like Pokémon DB). |

After chunking, we'll have two parallel sets of documents:
- `docs_rec`: recursively chunked plain text  
- `docs_md`: structure-aware markdown chunks  

These will later be embedded into LanceDB and compared for retrieval quality.
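
To build intuition for `CHUNK_SIZE` and `CHUNK_OVERLAP`, the sketch below implements a plain sliding-window chunker. It is a deliberate simplification and an assumption of this example: `RecursiveCharacterTextSplitter` prefers to cut at separators rather than at fixed offsets, and `chunk_with_overlap` is a hypothetical helper, not a LangChain function. It also shows why the total number of chunk characters ends up larger than the source text once overlap is involved.

```python
from typing import List

def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, overlap=3)
for c in chunks:
    print(c)

# Overlapping characters are counted more than once, so this ratio is above 1.0.
coverage = sum(len(c) for c in chunks) / len(sample)
print(round(coverage, 3))
```

Each chunk repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.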


In [3]:
from typing import List, Dict, Any, Optional, Tuple
import os

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# --- Chunking params ---
CHUNK_SIZE = 700
CHUNK_OVERLAP = 120

# --- Eval/profiling ---
EVAL_K_LIST = [1, 3, 5]
EMBEDDING_COST_PER_1K = float(os.getenv("EMBED_COST_PER_1K", "0.00013"))  # USD
PRINT_TOP_N = 5

def read_files_as_object_array(directory_path: str) -> List[Dict[str, str]]:
    out = []
    for fname in os.listdir(directory_path):
        fpath = os.path.join(directory_path, fname)
        if os.path.isfile(fpath):
            with open(fpath, "r", encoding="utf-8") as f:
                out.append({"filename": fname, "content": f.read()})
    return out

def recursive_text_splitter(data, chunk_size, overlap_size):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    texts = splitter.create_documents(
        [f"{d['filename']}\n{d['content']}" for d in data],
        metadatas=[{"filename": d["filename"]} for d in data],
    )
    return texts

def markdown_splitter(data, chunk_size, overlap_size):
    md_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")], strip_headers=True
    )
    md_splits = []
    for d in data:
        splits = md_splitter.split_text(d["content"])
        for s in splits:
            s.metadata["filename"] = d["filename"]
        md_splits.extend(splits)

    size_limiter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap_size, length_function=len, is_separator_regex=False
    )
    return size_limiter.split_documents(md_splits)

docs_raw = read_files_as_object_array(str(DATA_DIR))
docs_rec = recursive_text_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)
docs_md  = markdown_splitter(docs_raw, CHUNK_SIZE, CHUNK_OVERLAP)

print(f"Recursive chunks: {len(docs_rec)} | Markdown+size chunks: {len(docs_md)}")

Recursive chunks: 526 | Markdown+size chunks: 509


## 5️⃣ Building our Vector Database: Introduction to LanceDB

Before our agent can "retrieve" knowledge, we need a **database that understands vectors**: numerical representations of text meaning (embeddings).  
That's where **[LanceDB](https://lancedb.com/)** comes in.

### 🔍 What is LanceDB?
LanceDB is a **lightweight, local-first vector database** built on the **Lance columnar format**.  
It's designed for:
- **Storing and searching** high-dimensional embeddings (like text or image vectors).  
- Performing **semantic similarity queries** (e.g., "find texts most similar to this query").  
- **Hybrid retrieval**: combining full-text search (BM25 / Tantivy) and vector search.  
- **Speed and simplicity**: it runs locally (no separate server needed).

### 🧠 What we'll do here
1. **Embed** all Pokémon chunks with an embedding model served through OpenRouter (the code defaults to `qwen/qwen3-embedding-8b`, overridable via `EMBEDDINGS_MODEL`).  
2. **Create / connect** to a LanceDB table (the cell below uses `"pokemon_pages_tmp"`).  
3. **Insert** each chunk's text, vector, and metadata (like filename & splitter type).  
4. **Build** a full-text search (FTS) index for keyword lookups alongside vector search.

After this step, we'll have a ready-to-query LanceDB store, the foundation for our Retrieval-Augmented Generation (RAG) pipeline.

In [4]:
from dotenv import load_dotenv
from openai import OpenAI

import lancedb
import uuid

load_dotenv()

OPENAI_BASE_URL = "https://openrouter.ai/api/v1"

EMBED_MODEL = os.getenv("EMBEDDINGS_MODEL", "qwen/qwen3-embedding-8b")

client = OpenAI(base_url=OPENAI_BASE_URL, api_key=os.getenv('OPENROUTER_API_KEY'))

DB_URI = "./db/sample-lancedb"
TABLE_NAME_TMP = "pokemon_pages_tmp"
TABLE_NAME = "pokemon_pages"

def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 64) -> List[List[float]]:
    """
    Returns a list of embedding vectors. Uses OpenAI-compatible client pointed at OpenRouter.
    """
    out = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        out.extend([e.embedding for e in resp.data])
    return out

db = lancedb.connect(DB_URI)
if TABLE_NAME_TMP in db.table_names():
    tbl = db.open_table(TABLE_NAME_TMP)
    print(f"Loaded LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")
else:
    all_chunks = []
    for d in docs_rec:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                        "metadata": {"filename": d.metadata.get("filename",""), "splitter": "recursive"}})
    for d in docs_md:
        all_chunks.append({"id": str(uuid.uuid4()), "content": d.page_content,
                        "metadata": {"filename": d.metadata.get("filename",""), "splitter": "markdown"}})

    print("Embedding chunks...")
    vectors = embed_texts([c["content"] for c in all_chunks])
    for c, v in zip(all_chunks, vectors):
        c["vector"] = v
    tbl = db.create_table(TABLE_NAME_TMP, data=all_chunks)
    tbl.create_fts_index("content")
    print(f"Indexed {len(all_chunks)} chunks into LanceDB at {DB_URI} (table={TABLE_NAME_TMP})")

Loaded LanceDB at ./db/sample-lancedb (table=pokemon_pages_tmp)


## 6️⃣ Searching the Knowledge Base: Semantic vs Keyword Search

Now that our Pokémon chunks are stored in **LanceDB**, let's learn how to **search** through them.

### 🧭 What is Semantic Search?
Traditional search engines (like keyword or BM25 search) match **exact words** or **phrases** in your query.  
But LLMs and embeddings represent meaning as **vectors in high-dimensional space**, a *semantic* space.  

In **semantic search**, we:
1. **Embed** the query into a vector (using the same embedding model as our database).  
2. Measure its **closeness** to all stored vectors (chunks), using **cosine similarity** or **dot product**.  
3. Retrieve the most **semantically similar** chunks, even if they don't share exact words.

For example:  
> Query → "Who evolves into Pikachu?"  
> Closest text → "Pichu evolves into Pikachu when leveled up with high friendship."

Even if words like "who" or "friendship" don't appear in both texts, their embeddings are **close** in the semantic space, allowing **meaning-based retrieval**. I recommend watching the video on vector search by [IBM](https://www.youtube.com/watch?v=gl1r1XV0SLw).
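
The three steps above can be sketched with hand-made vectors. The 3-dimensional "embeddings" in `toy_index` and `query_vector` are invented for illustration (real embedding models return hundreds or thousands of dimensions), and `top_k` is a brute-force scan rather than the indexed search a vector database performs.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors (assumption): each axis loosely stands for a topic.
toy_index: Dict[str, List[float]] = {
    "Pichu evolves into Pikachu with high friendship.": [0.9, 0.1, 0.0],
    "Charizard has two Mega Evolutions.": [0.1, 0.9, 0.1],
    "Slowpoke is a Water/Psychic type.": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "Who evolves into Pikachu?"

def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[float, str]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

results = top_k(query_vector, toy_index)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The Pichu sentence ranks first because its vector points in nearly the same direction as the query vector, regardless of which exact words the two texts share.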

### 🧩 Three search modes we'll explore
| Method | Description | Strength |
|---------|--------------|-----------|
| 🔡 **FTS (Full Text Search)** | Matches literal terms using BM25 (like keyword search). | Great for rare names, exact filters, or numeric queries. |
| 🧠 **Vector Search** | Uses embedding similarity in high-dimensional space. | Captures meaning, paraphrases, and context. |
| ⚡ **Hybrid Search** | Fuses both (via Reciprocal Rank Fusion). | Balances precision (FTS) and recall (semantic). |
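
Reciprocal Rank Fusion is easiest to follow on a tiny worked example. In the sketch below each document earns `1 / (k + rank)` from every ranking it appears in, with ranks starting at 1 and `k = 60` as the smoothing constant. The chunk IDs such as `pichu#7` are hypothetical labels invented for this illustration.

```python
from typing import Dict, List

def rrf(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) over every ranking a document appears in (rank starts at 1)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

# Hypothetical top-3 lists from two retrievers.
vector_ranking = ["raichu#3", "pichu#7", "pikachu#1"]
fts_ranking = ["pichu#7", "pikachu#9", "raichu#3"]

scores = rrf([vector_ranking, fts_ranking])
fused = sorted(scores, key=scores.get, reverse=True)
for doc_id in fused:
    print(f"{doc_id}: {scores[doc_id]:.5f}")
```

`pichu#7` wins with `1/62 + 1/61`, narrowly ahead of `raichu#3` with `1/61 + 1/63`: ranking well in both lists beats ranking first in one and last in the other. Documents found by only one retriever still appear, but lower down.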

The next cell defines functions for each search type and prints their **top results** side by side,  
so you can see how **semantic closeness** changes the quality of retrieval.

In [5]:
from rich import print as rprint

def perform_vector_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    emb = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    qb = tbl.search(emb).metric('cosine').limit(top_k).select(["content", "metadata", "_distance", "vector"])
    if pokemon:
        qb = qb.where(f"metadata.filename = '{pokemon}.md'")
    return qb.to_list()

def perform_fts_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    qb = tbl.search(query, query_type="fts").limit(top_k).select(["content", "metadata", "_score", "vector"])
    if pokemon:
        qb = qb.where(f"metadata.filename = '{pokemon}.md'", prefilter=True)
    return qb.to_list()

def reciprocal_rank_fusion(results_a, results_b, k: int = 60):
    def rid(x): return hash(x["content"])
    scores = {}
    for i, r in enumerate(results_a):
        scores[rid(r)] = scores.get(rid(r), 0) + 1.0/(k+i+1)
    for i, r in enumerate(results_b):
        scores[rid(r)] = scores.get(rid(r), 0) + 1.0/(k+i+1)
    uniq = {}
    for r in results_a + results_b:
        uniq[rid(r)] = r
    ranked = sorted(uniq.values(), key=lambda r: scores[hash(r['content'])], reverse=True)
    return ranked

def perform_hybrid_search(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    vres = perform_vector_search(query, pokemon, top_k=top_k)
    fres = perform_fts_search(query, pokemon, top_k=top_k)
    fused = reciprocal_rank_fusion(vres, fres)[:top_k]
    return fused

queries = [
    "Which Pokémon evolves into Pikachu?",
    "Show Mega evolutions for Charizard",
]

for q in queries:
    rprint(f"\n[bold green]Query:[/] {q}")
    v = perform_vector_search(q, top_k=3)
    f = perform_fts_search(q, top_k=3)
    h = perform_hybrid_search(q, top_k=3)
    rprint("[magenta]Vector top1:[/]", v[0]["metadata"]["filename"], "→", v[0]["content"][:120].replace("\n"," "))
    rprint("[magenta]FTS    top1:[/]", f[0]["metadata"]["filename"], "→", f[0]["content"][:120].replace("\n"," "))
    rprint("[magenta]Hybrid top1:[/]", h[0]["metadata"]["filename"], "→", h[0]["content"][:120].replace("\n"," "))

## 7️⃣ Evaluating Retrieval Quality: Coverage, Recall, and Ranking Metrics

Once our Pokémon chunks are embedded and searchable, we need to **measure how well** the retrieval step is working.  
Even the best LLM can only answer correctly if the **right information** was fetched first.

### 🧩 Why Evaluation Matters
RAG systems rely on two main components:
1. **Retrieval** – finding the most relevant chunks from the knowledge base.  
2. **Generation** – the LLM reasoning over those chunks to answer questions.

If retrieval fails (missing or irrelevant chunks), generation will inevitably fail too, no matter how smart the model is.  
That's why **retrieval metrics** are critical for diagnosing performance.

### 📊 Metrics we'll compute
| Metric | What it measures | Why it matters |
|---------|------------------|----------------|
| **Coverage Ratio** | Total characters across all chunks divided by total characters in the source documents. | Checks that chunking didn't lose too much information. |
| **Recall@k** | Whether at least one relevant chunk appears in the top-k retrieved results (a hit rate, averaged over queries). | Tests if the search finds what we need (completeness). |
| **MRR (Mean Reciprocal Rank)** | How early in the ranking the first relevant chunk appears. | Rewards search methods that bring correct answers to the top. |
| **Latency** *(later)* | Time taken for each search query. | Balances quality vs speed for production systems. |
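
A small worked example makes the two ranking metrics concrete. The rankings in `toy_runs` are invented for illustration and are not results from our index; `hit_at_k` and `reciprocal_rank` are hypothetical helper names that mirror the definitions in the table.

```python
from typing import Dict, List, Set, Tuple

def hit_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any of the first k results is relevant, else 0.0."""
    return 1.0 if any(f in relevant for f in ranked[:k]) else 0.0

def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, f in enumerate(ranked, start=1):
        if f in relevant:
            return 1.0 / position
    return 0.0

# Made-up rankings: (retrieved filenames in order, set of relevant filenames).
toy_runs: Dict[str, Tuple[List[str], Set[str]]] = {
    "Which Pokemon evolves into Pikachu?": (["pikachu.md", "raichu.md", "pichu.md"], {"pichu.md"}),
    "Base stats of Mewtwo": (["mewtwo.md", "charizard.md", "pikachu.md"], {"mewtwo.md"}),
}

n = len(toy_runs)
recall_at_1 = sum(hit_at_k(r, gt, 1) for r, gt in toy_runs.values()) / n
recall_at_3 = sum(hit_at_k(r, gt, 3) for r, gt in toy_runs.values()) / n
mrr = sum(reciprocal_rank(r, gt) for r, gt in toy_runs.values()) / n

print(recall_at_1, recall_at_3, round(mrr, 3))
```

The first query finds its relevant file only at position 3, so it contributes 0 to Recall@1, 1 to Recall@3, and 1/3 to MRR. MRR therefore distinguishes "found at the top" from "found eventually", which Recall@3 alone cannot.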

In the next cell, we'll start with **coverage statistics**, verifying that our chunking step retains most of the source content for both splitters (recursive and markdown).  
Because neighboring chunks overlap, the ratio can exceed 1.0; a value well below 1.0 would signal dropped text. This acts as a sanity check before moving on to deeper retrieval evaluation.

In [6]:
import pandas as pd 

GROUND_TRUTH = {
    "Which Pokémon evolves into Pikachu?": ["pichu.md"],
    "Which Pokémon learns Volt Tackle via breeding/light ball mechanics?": ["pikachu.md", "pichu.md"],
    "Show Mega evolutions for Charizard": ["charizard.md"],
    "Base stats of Mewtwo": ["mewtwo.md"],
    "What is Mewtwo's base stat total (BST)?": ["mewtwo.md"],
    "What is Slowpoke's type?": ["slowpoke.md"],
    "What moves can Raichu learn by TM?": ["raichu.md"],
}

def coverage_stats(docs_raw, chunks) -> Dict[str, float]:
    total_chars = sum(len(d["content"]) for d in docs_raw)
    chunk_chars = sum(len(c.page_content) for c in chunks)
    return {
        "total_chars": total_chars,
        "chunk_chars": chunk_chars,
        "coverage_ratio": chunk_chars / total_chars if total_chars else 0.0
    }

cov_rec = coverage_stats(docs_raw, docs_rec)
cov_md  = coverage_stats(docs_raw, docs_md)

pd.DataFrame([
    {"splitter": "recursive", **cov_rec},
    {"splitter": "markdown",  **cov_md},
])

|   | splitter | total_chars | chunk_chars | coverage_ratio |
| --- | --- | --- | --- | --- |
| 0 | recursive | 249998 | 264501 | 1.058012 |
| 1 | markdown | 249998 | 259329 | 1.037324 |


In [7]:
import time 

def eval_search(queries: List[str], search_fn, ks=(1,3,5)) -> pd.DataFrame:
    rows = []
    for q in queries:
        t0 = time.perf_counter()  # high-resolution timer; time.time() can report 0 ms for fast calls
        results = search_fn(q, top_k=max(ks))
        elapsed = time.perf_counter() - t0
        filenames = [r["metadata"]["filename"] for r in results]
        gt = set(GROUND_TRUTH[q])
        recs = {}
        for k in ks:
            recs[f"Recall@{k}"] = 1.0 if any(f in gt for f in filenames[:k]) else 0.0
        rr = 0.0
        for i, f in enumerate(filenames, start=1):
            if f in gt:
                rr = 1.0 / i
                break
        rows.append({"query": q, "latency_ms": round(1000*elapsed,2), "MRR": rr, **recs})
    return pd.DataFrame(rows)

df_vec = eval_search(list(GROUND_TRUTH.keys()), perform_vector_search, ks=tuple(EVAL_K_LIST))
df_fts = eval_search(list(GROUND_TRUTH.keys()), perform_fts_search,    ks=tuple(EVAL_K_LIST))
df_hyb = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_search, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector","FTS","Hybrid"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Vector):[/]"); display(df_vec)
rprint("[bold]Per-query (FTS):[/]"); display(df_fts)
rprint("[bold]Per-query (Hybrid):[/]"); display(df_hyb)
rprint("[bold green]Summary:[/]"); display(summary)

**Per-query (Vector):**

|   | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Which Pokémon evolves into Pikachu? | 586.07 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 501.79 | 0.5 | 0.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 4004.79 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 464.45 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo's base stat total (BST)? | 533.87 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 886.77 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 2282.72 | 1.0 | 1.0 | 1.0 | 1.0 |


**Per-query (FTS):**

|   | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Which Pokémon evolves into Pikachu? | 33.01 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 17.15 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 19.67 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo's base stat total (BST)? | 13.68 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 18.08 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 15.24 | 0.0 | 0.0 | 0.0 | 0.0 |


**Per-query (Hybrid):**

|   | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Which Pokémon evolves into Pikachu? | 533.85 | 0.333333 | 0.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 500.46 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 667.26 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 515.35 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo's base stat total (BST)? | 487.06 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 4685.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 491.01 | 1.0 | 1.0 | 1.0 | 1.0 |


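**Summary:**

|   | Method | MRR(mean) | Recall@1(mean) | Recall@3(mean) | Recall@5(mean) | Latency(ms, mean) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Vector | 0.786 | 0.714 | 0.857 | 0.857 | 1322.923 |
| 1 | FTS | 0.857 | 0.857 | 0.857 | 0.857 | 16.69 |
| 2 | Hybrid | 0.905 | 0.857 | 1.0 | 1.0 | 1125.713 |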


### 🔎 Interpreting the Results

**TL;DR:** *Hybrid wins on quality; FTS wins on speed.*

- **Hybrid (MRR=0.90, Recall@3/5=1.0, ~1100 ms):** Best overall retrieval quality. Reciprocal Rank Fusion (RRF) captures **semantic matches** that FTS misses while still surfacing **exact-term hits**. Ideal default for general-purpose RAG.
- **Vector (MRR=0.79, Recall@5=0.86, ~1300 ms):** Strong semantic coverage, which helps when users paraphrase. Far slower than FTS here, most likely because every query first needs a remote embedding call before the nearest-neighbor search.
- **FTS (MRR=0.86, Recall@k=0.86 for every k, ~17 ms):** **Blazing fast** and excels for **exact names, forms, numbers** (e.g., "TM", "Mega"). But it can miss paraphrases or semantic matches; in this run it returned nothing relevant for the Raichu TM query.

### What to deploy
- **Default:** Hybrid.  
- **Query routing:** Use **FTS** for quoted phrases/IDs/numerics; otherwise **Hybrid**.  
- **Latency-sensitive paths:** FTS with a **semantic fallback** on low-confidence.
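
The routing and fallback ideas above could look roughly like the sketch below. The heuristics are assumptions chosen for illustration: treating quoted phrases and digits as "exact-match" signals, and treating "too few FTS hits" as the low-confidence signal. `route_query` and `search_with_fallback` are hypothetical helpers, and the search functions are passed in as plain callables so the sketch stays independent of LanceDB.

```python
import re
from typing import Callable, List, Tuple

QUOTED_PHRASE = re.compile(r'"[^"]+"')
CONTAINS_DIGIT = re.compile(r"\d")

def route_query(query: str) -> str:
    """Return "fts" for exact-match style queries, otherwise "hybrid"."""
    if QUOTED_PHRASE.search(query) or CONTAINS_DIGIT.search(query):
        return "fts"
    return "hybrid"

def search_with_fallback(
    query: str,
    fts_search: Callable[[str], List[str]],
    semantic_search: Callable[[str], List[str]],
    min_hits: int = 1,
) -> Tuple[str, List[str]]:
    """Try the fast FTS path first; fall back to semantic search on too few hits."""
    hits = fts_search(query)
    if len(hits) >= min_hits:
        return "fts", hits
    return "semantic", semantic_search(query)

print(route_query('moves containing "Volt Tackle"'))  # fts
print(route_query("National number 0025"))            # fts
print(route_query("Who evolves into Pikachu?"))        # hybrid
```

In a real system the confidence signal would more likely come from retrieval scores than from a raw hit count, and the thresholds would need tuning against an evaluation set like `GROUND_TRUTH`.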

## 9️⃣ Improving Precision: What is Reranking and Why It Helps

Even after combining vector and keyword search, our top results may still include **partially relevant** or **redundant** chunks.  
That's where **reranking** comes in, a crucial final step in the retrieval pipeline.

### 🎯 What is Reranking?
Reranking means taking the **initial set of retrieved results** (e.g., top 20) and reordering them using a **more accurate relevance model**.  
This model computes a finer-grained similarity between the **query** and each retrieved chunk.

Common reranking approaches:
- **Embedding-based cosine similarity** *(lightweight)*: compares the query vector with each chunk's vector (as we'll do here).  
- **Cross-encoder models** *(heavier)*: feed `[query, passage]` pairs into an LLM or BERT-like model for deeper contextual matching.

### 💡 Why Reranking Helps
- **First-stage retrieval** (vector/FTS/hybrid) is optimized for speed, not precision.  
- **Reranking** refines the order to push **the most semantically aligned chunks** to the top, which can improve **MRR** and **answer faithfulness**.  
- It's especially useful when:
  - Many chunks share overlapping content.  
  - The query is nuanced or multi-faceted (e.g., "Mega evolutions and base stats of Charizard").  
  - You plan to feed only a few chunks into the LLM for generation.

In the next cell, we'll apply a simple **cosine-similarity-based reranker** that reorders hybrid search results using the query's embedding,  
a lightweight option to try on small to mid-sized RAG systems.

In [8]:
import numpy as np 

def cosine(a, b):
    a = np.array(a); b = np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rerank_by_query_vector(query: str, results: List[Dict[str, Any]], top_k: int = 5):
    """
    Rerank retrieved results based on cosine similarity
    between the query embedding and each result's embedding vector.
    Keeps only the best-scoring chunk per file, then returns up to top_k results.
    """
    qv = client.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding
    rescored = sorted(
        ((cosine(qv, r["vector"]), r) for r in results),
        key=lambda x: x[0],
        reverse=True,
    )
    # Deduplicate by filename *before* truncating, so up to top_k distinct files survive.
    reranked, seen = [], set()
    for _, r in rescored:
        fname = r["metadata"]["filename"]
        if fname in seen:
            continue
        seen.add(fname)
        reranked.append(r)
    return reranked[:top_k]

def perform_hybrid_rerank(query: str, pokemon: Optional[str] = None, top_k: int = 5):
    fused = perform_hybrid_search(query, pokemon, top_k=top_k*10)
    return rerank_by_query_vector(query, fused, top_k=top_k)

df_hyr = eval_search(list(GROUND_TRUTH.keys()), perform_hybrid_rerank, ks=tuple(EVAL_K_LIST))

summary = pd.DataFrame({
    "Method": ["Vector","FTS","Hybrid","Reranking"],
    "MRR(mean)": [df_vec["MRR"].mean(), df_fts["MRR"].mean(), df_hyb["MRR"].mean(), df_hyr["MRR"].mean()],
    **{f"Recall@{k}(mean)": [df_vec[f"Recall@{k}"].mean(), df_fts[f"Recall@{k}"].mean(), df_hyb[f"Recall@{k}"].mean(), df_hyr[f"Recall@{k}"].mean()] for k in EVAL_K_LIST},
    "Latency(ms, mean)": [df_vec["latency_ms"].mean(), df_fts["latency_ms"].mean(), df_hyb["latency_ms"].mean(), df_hyr["latency_ms"].mean()],
}).round(3)

rprint("[bold]Per-query (Hybrid + Rerank):[/]"); display(df_hyr)
rprint("[bold green]Summary:[/]"); display(summary)

**Per-query (Hybrid + Rerank):**

|   | query | latency_ms | MRR | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Which Pokémon evolves into Pikachu? | 2488.6 | 0.333333 | 0.0 | 1.0 | 1.0 |
| 1 | Which Pokémon learns Volt Tackle via breeding/... | 2057.19 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Show Mega evolutions for Charizard | 1164.91 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Base stats of Mewtwo | 1143.49 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | What is Mewtwo's base stat total (BST)? | 1174.59 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | What is Slowpoke's type? | 1132.06 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | What moves can Raichu learn by TM? | 7774.14 | 1.0 | 1.0 | 1.0 | 1.0 |


**Summary:**

|   | Method | MRR(mean) | Recall@1(mean) | Recall@3(mean) | Recall@5(mean) | Latency(ms, mean) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Vector | 0.786 | 0.714 | 0.857 | 0.857 | 1322.923 |
| 1 | FTS | 0.857 | 0.857 | 0.857 | 0.857 | 16.69 |
| 2 | Hybrid | 0.905 | 0.857 | 1.0 | 1.0 | 1125.713 |
| 3 | Reranking | 0.905 | 0.857 | 1.0 | 1.0 | 2419.283 |


**Takeaway:**  
On this small benchmark, reranking matches hybrid search (MRR 0.905, Recall@3/5 = 1.0) rather than improving on it, at roughly twice the latency (~2400 ms vs ~1100 ms). With only seven queries, the benchmark is too small to show whether reranking improves precision.  
In practice, it's often used as an **optional second stage**, applied only when the agent is uncertain or when quality matters more than speed.
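
One way to make the second stage optional is to gate it on how decisive the first-stage ranking looks. The sketch below uses the gap between the two best first-stage scores as the uncertainty signal. Both the signal and the `min_margin` threshold are assumptions for illustration, and `needs_rerank` is a hypothetical helper rather than part of this notebook.

```python
from typing import List

def needs_rerank(scores: List[float], min_margin: float = 0.05) -> bool:
    """Decide whether to rerank, given first-stage scores where higher is better.

    A small gap between the best and second-best score suggests the
    first stage could not clearly separate the candidates.
    """
    if len(scores) < 2:
        return False  # nothing to reorder
    ordered = sorted(scores, reverse=True)
    return (ordered[0] - ordered[1]) < min_margin

print(needs_rerank([0.91, 0.90, 0.40]))  # True: the top two are nearly tied
print(needs_rerank([0.90, 0.50]))        # False: a clear winner
```

Scores from different retrievers live on different scales (BM25 scores, cosine distances, RRF sums), so a threshold like this would have to be calibrated separately for each of them.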

## 🔧 Packaging Retrieval as "Tools" for Agents

Now that we have multiple retrieval strategies (vector, FTS, hybrid, and hybrid with reranking),  
we'll wrap them into **simple, reusable tools** that return formatted text contexts.

These tools will later be used by our **PydanticAI agent** to decide:
- Which search mode to use (routing),
- How much context to retrieve, and  
- When to combine multiple sources (reflection and fusion).

Let's define these tool functions next.

In [9]:
import logfire
import nest_asyncio

nest_asyncio.apply()

logfire.configure()
logfire.instrument_pydantic_ai()

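# Assumes a table named TABLE_NAME ("pokemon_pages") already exists in the database;
# the indexing cell above created TABLE_NAME_TMP ("pokemon_pages_tmp") instead.
tbl = db.open_table(TABLE_NAME)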

def build_context_from_results(results: List[Dict[str,Any]]):
    return "\n---\n".join([
        f"Title: {r['metadata']['filename']}\nContent:\n{r['content']}"
        for r in results
    ])

def tool_vector(query: str, k: int = 5) -> str:
    """Vector search"""
    logfire.info(f"Vector search called with query: {query}")
    res = perform_vector_search(query, top_k=k)
    return build_context_from_results(res)

def tool_fts(query: str, k: int = 5) -> str:
    """Full Text Search"""
    logfire.info(f"FTS search called with query: {query}")
    res = perform_fts_search(query, top_k=k)
    return build_context_from_results(res)

def tool_hybrid(query: str, k: int = 5) -> str:
    """Hybrid Search"""
    logfire.info(f"Hybrid search called with query: {query}")
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results(res)

def tool_rerank(query: str, k: int = 5) -> str:
    """Hybrid search followed by reranking"""
    logfire.info(f"Reranking search called with query: {query}")
    res = perform_hybrid_rerank(query, top_k=k)
    return build_context_from_results(res)

## 🔮 From Plain LLM to RAG-Enhanced Agent: Comparing Knowledge Access

Now that our retrieval tools are ready, let's test how much they actually help the model think.

### 🧠 Two Agents, Two Worlds
We'll create two simple agents using **PydanticAI**:

| Agent | Description | Data Access |
|--------|--------------|-------------|
| 🧩 **Vanilla Agent** | A plain LLM (e.g., Grok-4 or GPT-4) answering directly from its internal training data. | ❌ No external context |
| 📚 **RAG Agent** | Same model, but grounded with retrieved Pokémon chunks from LanceDB. It must answer *only* from the provided context. | ✅ Uses hybrid search tool |

### ⚔️ The Test
We'll ask both agents the same question:

> *"Who has more powerful normal type attack - Charizard or Pikachu?"*

The **Vanilla Agent** relies purely on what it "remembers."  
The **RAG Agent**, on the other hand, performs:
1. **Retrieval**: pulls relevant chunks from our local Pokémon corpus using `tool_hybrid`.  
2. **Grounded generation**: answers based strictly on retrieved evidence and cites sources (e.g., `[charizard.md]`).

This comparison highlights how RAG agents can **reduce hallucinations** and provide **traceable, verifiable answers** even with small, domain-specific knowledge bases.
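
Citations are only useful if they can be checked. As a sketch of what "traceable" can mean in practice, the snippet below pulls `[filename.md]` citations out of an answer and flags any that were not among the retrieved sources. The citation format follows the `[pikachu.md]` convention used in this section; the example answer text and the helper names `extract_citations` and `unsupported_citations` are made up for illustration.

```python
import re
from typing import List, Set

# Matches citations such as [charizard.md]
CITATION = re.compile(r"\[([\w-]+\.md)\]")

def extract_citations(answer: str) -> List[str]:
    """Return every cited filename in order of appearance."""
    return CITATION.findall(answer)

def unsupported_citations(answer: str, retrieved_files: Set[str]) -> List[str]:
    """Return citations that do not match any retrieved source."""
    return [c for c in extract_citations(answer) if c not in retrieved_files]

# Made-up answer and retrieval set for illustration.
answer = "Charizard has the higher Attack stat [charizard.md], compared with Pikachu [pikachu.md]."
retrieved = {"charizard.md", "pikachu.md"}

print(extract_citations(answer))
print(unsupported_citations(answer, retrieved))
print(unsupported_citations("See [mewtwo.md].", retrieved))
```

A check like this only confirms that cited files were actually retrieved. Whether the cited text supports the claim still needs a human or a separate evaluation step.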


In [13]:
from pydantic_ai import Agent
from pydantic import BaseModel, Field

CHAT_MODEL  = os.getenv("CHAT_MODEL", "openrouter:x-ai/grok-4-fast")

class VanillaAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")

class RAGAnswer(BaseModel):
    answer: str = Field(description="Concise, factual answer for the given query.")
    used_tool: str = Field(description="Which tool was used: vector | fts | hybrid")
    citation: str = Field(description="Filename used to generate response.")

vanilla_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You are a Pokémon expert. Answer the given questions."
    ),
    output_type=VanillaAnswer,
    retries=3
)

rag_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT. If unknown, say 'I don't know from the corpus'. "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
vanilla_response = vanilla_agent.run_sync(q)

rprint(vanilla_response.output)

rag_response = rag_agent.run_sync(q)

rprint(rag_response)

11:20:50.065 vanilla_agent run
11:20:50.072   chat x-ai/grok-4-fast


11:20:52.825 rag_agent run
11:20:52.825   chat x-ai/grok-4-fast
11:20:55.145   running 1 tool
11:20:55.145     running tool: tool_hybrid
11:20:55.145       Hybrid search called with query: Charizard vs Pikachu normal type attacks power comparison
11:20:56.037   chat x-ai/grok-4-fast
11:20:59.007   running 1 tool
11:20:59.007     running tool: tool_hybrid
11:20:59.007       Hybrid search called with query: Pikachu normal type moves power
11:21:06.900   chat x-ai/grok-4-fast
11:21:09.728   running 1 tool
11:21:09.730     running tool: tool_hybrid
11:21:09.730       Hybrid search called with query: Pikachu moves list normal type power
11:21:10.272   chat x-ai/grok-4-fast


The logs above show a clear difference in how the two agents work:

- 🧩 **Vanilla Agent:**  
  Made a single chat call with no tool use. It relies on its pretrained world knowledge, which may be outdated or incomplete, so its answer can be vague or only partially correct.  
  It has no access to our curated Pokémon corpus, so its response can drift or even hallucinate.

- 📚 **RAG Agent (Correct Answer):**  
  Called `tool_hybrid` three times with reformulated queries, retrieved the **Charizard** and **Pikachu** entries from our LanceDB knowledge base, analyzed their base attack stats,  
  and correctly identified that **Charizard has the stronger Normal-type attack**, **with a source citation** (e.g., `[charizard.md]`).  

This demonstrates the core benefit of **Retrieval-Augmented Generation**:
- It grounds responses in **real, verifiable data**.  
- It produces **contextually correct** and **source-traceable** answers.  
- It **reduces hallucinations** and improves trustworthiness, especially in factual, domain-specific tasks.

In short, the RAG agent doesn’t *guess*; it *knows where to look*.

In the next cell, we’ll run the same question again and compare how the **Reranking Agent** responds.  
Look for stronger alignment with retrieved facts and clearer source citations.

In [None]:
reranking_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_rerank],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = reranking_agent.run_sync(q)

rprint(rag_response)

15:29:43.530 reranking_agent run
15:29:43.532   chat x-ai/grok-4-fast
15:29:45.807   running 1 tool
15:29:45.808     running tool: tool_rerank
15:29:45.809       Reranking search called with query: Charizard Pikachu normal type moves power comparison
15:29:48.467   chat x-ai/grok-4-fast
15:29:52.070   running 1 tool
15:29:52.080     running tool: tool_rerank
15:29:52.080       Reranking search called with query: Charizard learnable normal type moves power
15:29:54.960   chat x-ai/grok-4-fast
15:29:57.605   running 1 tool
15:29:57.605     running tool: tool_rerank
15:29:57.606       Reranking search called with query: Pikachu learnable normal type moves power
15:29:59.481   chat x-ai/grok-4-fast


## 🧠 Building a Smarter Agent: Multi-Tool Retrieval and Dynamic Reasoning

So far, we’ve seen each retrieval method in isolation: vector, keyword, hybrid, and reranking.  
But real-world questions vary in structure: some are **factual**, some **numeric**, some **semantic**.  
No single search method fits them all.

### 🛠️ Enter the Multi-Tool Agent
In this step, we give our RAG agent **access to all retrieval tools**:
- 🔡 `tool_fts` → for exact terms (e.g., “TM45” or “Base stats”).  
- 🧠 `tool_vector` → for meaning-based matches and paraphrases.  
- ⚡ `tool_hybrid` → for balanced performance.  
- 🎯 `tool_rerank` → for highest-precision reranked retrieval.

The agent can now **choose the best tool dynamically** based on query type and context, an early example of **tool orchestration** or **self-routing**. In the run logged below, it happens to pick `tool_hybrid` for every call.

This brings us closer to a true **agentic RAG** system, one that reasons *about how to reason*.
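In the agent that follows, the LLM itself decides which tool to call. As a rough intuition for what such a routing decision looks like, here is a rule-based sketch. The `pick_tool` function and its heuristics are assumptions made for illustration only; they are not how the model actually chooses.

```python
import re

def pick_tool(query: str) -> str:
    """Toy heuristic router: map a query to one of the four tool names."""
    q = query.lower()
    # Exact identifiers such as "TM45" or quoted phrases favor keyword search
    if re.search(r"\btm\d+\b", q) or '"' in query:
        return "tool_fts"
    # Comparisons need precise top results, so prefer the reranker
    if " vs " in q or " or " in q or "compare" in q:
        return "tool_rerank"
    # Very short queries are usually entity lookups; hybrid is a safe default
    if len(q.split()) <= 3:
        return "tool_hybrid"
    # Longer natural-language questions lean on semantic similarity
    return "tool_vector"

for example in ["TM45", "Charizard or Pikachu attack", "Pikachu stats",
                "Which creature stores electricity in its cheeks"]:
    print(example, "->", pick_tool(example))
```

A hand-written router like this is cheap and predictable, while letting the LLM route (as the tutorial does) adapts to phrasing the rules never anticipated.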

In [None]:
multitool_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_fts, tool_vector, tool_hybrid, tool_rerank],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = multitool_agent.run_sync(q)

rprint(rag_response)

15:30:02.783 multitool_agent run
15:30:02.785   chat x-ai/grok-4-fast
15:30:08.556   running 1 tool
15:30:08.556     running tool: tool_hybrid
15:30:08.556       Hybrid search called with query: Charizard normal type moves base power
15:30:09.484   chat x-ai/grok-4-fast
15:30:11.066   running 1 tool
15:30:11.067     running tool: tool_hybrid
15:30:11.069       Hybrid search called with query: Pikachu normal type moves base power
15:30:22.920   chat x-ai/grok-4-fast
15:30:25.104   running 1 tool
15:30:25.104     running tool: tool_hybrid
15:30:25.104       Hybrid search called with query: Charizard learnable Normal type moves base power
15:30:25.858   chat x-ai/grok-4-fast
15:30:29.492   running 1 tool
15:30:29.492     running tool: tool_hybrid
15:30:29.492       Hybrid search called with query: Pikachu learnable Normal type moves base power
15:30:30.251   chat x-ai/grok-4-fast


## 🧩 Contextualised Retrieval: Using an LLM to Summarize Retrieved Evidence

So far, our agents have pulled relevant chunks from LanceDB and fed them *as-is* into the answering model.  
However, as context grows, simply concatenating text leads to **redundancy**, **token waste**, and sometimes **distracting noise**.

To address this, we introduce **Contextualised Retrieval**, an approach where a small LLM acts as a **retrieval summarizer**.

### 🧠 How this works
1. **Retrieve**: The agent first collects top-k chunks via hybrid search.  
2. **Summarize**: A lightweight *retrieval assistant* LLM processes these chunks and condenses them into a **focused summary**.  
3. **Augment**: The final answering agent then uses this *context summary* plus the original chunks for grounded reasoning.

For related reading, see [Anthropic's Guide on Contextual Retrieval](https://www.anthropic.com/engineering/contextual-retrieval), which describes a related but different technique (adding chunk-specific context to each chunk before indexing), and the article by [Wang et al. (2025)](https://arxiv.org/html/2510.09106v1).

### 🎯 Why this matters
- Puts a **compact summary** of the key attributes (types, evolutions, base stats) in front of the answering model. Because this implementation also appends the full chunks, it saves tokens only if you return the summary alone.  
- Improves **signal-to-noise ratio**, especially when multiple retrieved chunks overlap.  
- Enables a more **scalable agentic retrieval loop**, where the model *reflects on retrieved context* before reasoning.

The trade-off is one extra LLM call per search, which adds latency and cost.
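The summarize-then-augment layout can be tried without any LLM by injecting a stand-in summarizer. In this sketch, `contextualise` and `first_sentences` are hypothetical names, and the first-sentence heuristic is only a placeholder for what would be an LLM call in practice.

```python
from typing import Callable, List

def first_sentences(texts: List[str]) -> str:
    """Stand-in summarizer: keep the first sentence of each chunk."""
    return " ".join(t.split(". ")[0].rstrip(".") + "." for t in texts)

def contextualise(chunks: List[str], summarize: Callable[[List[str]], str]) -> str:
    """Put a short summary in front of the full retrieved chunks."""
    summary = summarize(chunks)
    full = "\n\n".join(chunks)
    return f"### Context Summary\n{summary}\n\n---\n### Full Retrieved Chunks\n{full}"

chunks = ["Chunk one, first sentence. More detail here.",
          "Chunk two, first sentence. Extra detail."]
print(contextualise(chunks, first_sentences))
```

Passing the summarizer in as a callable keeps the layout logic testable and makes it easy to swap in a cheaper or more capable model later.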

In the next cell, we’ll define:
- `build_context_from_results_via_llm()` → uses an LLM to synthesize a compact, focused context summary.  
- `tool_hybrid_contextualised()` → wraps hybrid retrieval + summarization as a single callable tool.

We’ll then run our **Contextualised Agent** to answer the same question.  
Expect shorter, sharper answers with clear citations and improved factual consistency.

In [None]:
def build_context_from_results_via_llm(query: str, results: List[Dict[str, Any]]) -> str:
    """Summarize retrieved chunks with an LLM and prepend the summary to the raw chunks."""
    combined = build_context_from_results(results)

    # Keep the instructions static; the query and retrieved text go in the user message
    retrieval_agent = Agent(
        model=CHAT_MODEL,
        system_prompt=(
            "You are a retrieval assistant helping an LLM ground its reasoning. "
            "Given a query and retrieved Pokémon entries, summarize only the most relevant "
            "details and context in 3-5 concise sentences. Focus on types, evolutions, "
            "base stats, and notable traits that help answer factual questions."
        ),
        retries=3
    )

    summary = retrieval_agent.run_sync(
        f"Input query:\n{query}\n\nRetrieved context:\n{combined}"
    ).output
    logfire.info(f"Summary returned: {summary}")
    return f"### Context Summary\n{summary}\n\n---\n### Full Retrieved Chunks\n{combined}"


def tool_hybrid_contextualised(query: str, k: int = 5) -> str:
    logfire.info(f"Contextual Retrieval search called with query: {query}")
    res = perform_hybrid_search(query, top_k=k)
    return build_context_from_results_via_llm(query, res)

contextual_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_hybrid_contextualised],
    retries=3
)


q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = contextual_agent.run_sync(q)

rprint(rag_response)

15:52:31.843 contextual_agent run
15:52:31.845   chat x-ai/grok-4-fast
15:52:34.406   running 1 tool
15:52:34.406     running tool: tool_hybrid_contextualised
15:52:34.406       Contextual Retrieval search called with query: Charizard Pikachu normal type attack power comparison
15:52:35.609       retrieval_agent run
15:52:35.609         chat x-ai/grok-4-fast
15:52:41.747       Summary returned: Pikachu is an Electric-type Pokémon with neutral effectiveness... compared to Pikachu's unevolved state without provided stats.
15:52:41.769   chat x-ai/grok-4-fast
15:52:45.984   running 1 tool
15:52:45.984     running tool: tool_hybrid_contextualised
15:52:45.984       Contextual Retrieval search called with query: Pikachu base stats attack
15:52:48.177       retrieval_agent run
15:52:48.177         chat x-ai/grok-4-fast
15:52:54.938       Summary returned: Pikachu is an Electric-type Pokémon that evolves from Pichu an...tack stat (standardly known as 55, though not confirmed here).
15:52:54

## 🔄 Multi-Query Retrieval: Expanding Recall Through Paraphrased Queries

Even the best retrievers can miss information if the query wording doesn’t match the phrasing in the knowledge base.  
For example, “Who has stronger normal attacks?” and “Which Pokémon hits harder with normal moves?” express the same intent,  
but may retrieve **different** chunks due to surface-level differences in tokens and structure.

To make our system more robust, we introduce **Multi-Query Retrieval**, also known as **Query Augmentation** or **Query Expansion**.

### 🧩 How it works
1. **Generate paraphrases**: An auxiliary *query-rewriting agent* produces multiple semantically equivalent versions of the input question.  
2. **Retrieve per variant**: Each variation runs its own hybrid search in LanceDB.  
3. **Merge and deduplicate**: Retrieved results are combined and deduplicated to form a richer, more complete context.

This strategy helps the system:
- Capture **lexical and syntactic diversity** in stored text.  
- Improve **Recall@k** and **coverage**, especially for sparse or under-represented phrasing.  
- Provide **redundant grounding**, which stabilizes the final generation step.
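The merge-and-deduplicate step can be sketched in a few lines. The `merge_results` helper and the toy result lists are hypothetical; they assume each result is a dict with a `content` field, as in the tutorial's search results.

```python
from typing import Dict, List

def merge_results(result_lists: List[List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Flatten per-query result lists, keeping the first copy of each chunk."""
    seen = set()
    merged = []
    for results in result_lists:
        for r in results:
            if r["content"] not in seen:
                seen.add(r["content"])
                merged.append(r)
    return merged

# Hypothetical results for an original query and one paraphrase
per_query = [
    [{"filename": "a.md", "content": "alpha"}, {"filename": "b.md", "content": "beta"}],
    [{"filename": "b.md", "content": "beta"}, {"filename": "c.md", "content": "gamma"}],
]
merged = merge_results(per_query)
print([r["filename"] for r in merged])  # ['a.md', 'b.md', 'c.md']
```

Keeping first-seen order means results for the original query stay ahead of those contributed only by paraphrases.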


For background, see the work by [Kostric and Balog (2024)](https://arxiv.org/pdf/2406.18960) on multi-query rewriting.


### ⚙️ What this code does
- `tool_multiquery()`  
  → Generates paraphrased queries via a *query-rewriting agent*, retrieves hybrid results for each, and merges them.  
- `multiquery_agent`  
  → Uses this tool to answer the same comparison question while grounding on a **broader semantic context**.

This approach trades extra latency (one rewriting call plus one search per variant) for **higher recall and resilience**,  
bringing our RAG pipeline closer to the *multi-query ensemble* systems used in production LLM retrieval frameworks.


In [None]:
from itertools import chain

class QueryVariations(BaseModel):
    variations: List[str]

def tool_multiquery(query: str, num_variations: int = 3, k: int = 5):
    """Run RAG with multiple paraphrased query variants to improve robustness."""
    # Step 1: Generate paraphrases of the input query
    variation_agent = Agent(
        model=CHAT_MODEL,
        system_prompt=(
            "You are a query rewriting assistant. Given a question, produce "
            f"{num_variations} short paraphrases that preserve meaning but vary wording."
        ),
        output_type=QueryVariations,
        retries=2
    )
    variations = variation_agent.run_sync(query).output.variations
    logfire.info(f"Variations: {variations}")
    queries = [query] + variations

    # Step 2: Retrieve results for all query variants
    all_results = list(chain.from_iterable(perform_hybrid_search(q, top_k=k) for q in queries))

    # Step 3: Deduplicate by chunk text (first occurrence wins) and merge
    unique_results = {}
    for r in all_results:
        unique_results.setdefault(r["content"], r)

    return build_context_from_results(list(unique_results.values()))

multiquery_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[tool_multiquery],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = multiquery_agent.run_sync(q)

rprint(rag_response)

15:54:49.332 multiquery_agent run
15:54:49.333   chat x-ai/grok-4-fast
15:54:52.268   running 1 tool
15:54:52.268     running tool: tool_multiquery
15:54:52.283       variation_agent run
15:54:52.284         chat x-ai/grok-4-fast
15:54:55.020       Variations: ['Between Charizard and Pikachu in Pokémon, which has the stro... Pikachu, which Pokémon has the mightier normal type attack?']
15:55:02.756   chat x-ai/grok-4-fast
15:55:07.015   running 1 tool
15:55:07.015     running tool: tool_multiquery
15:55:07.015       variation_agent run
15:55:07.015         chat x-ai/grok-4-fast
15:55:09.834       Variations: ["What are Pikachu's base stats in Pokémon?", "Pikachu's basic stats in the Pokémon games", 'Base stats of Pikachu in Pokémon']
15:55:12.920   chat x-ai/grok-4-fast
15:55:15.711   running 1 tool
15:55:15.711     running tool: tool_multiquery
15:55:15.714       variation_agent run
15:55:15.715         chat x-ai/grok-4-fast
15:55:18.375       Variations: ["What are Pikachu's bas

## 🔁 Iterative Retrieval with FLARE: Adaptive Query Expansion for RAG

So far, our retrieval tools (hybrid, reranking, multi-query) each treated a **single retrieval pass** as enough.  
But what if the question is **underspecified** or requires **multi-hop reasoning**, e.g., connecting multiple facts across Pokémon pages?

In such cases, a model needs to:
1. Recognize **when information is missing**, and  
2. Formulate **follow-up retrievals** to fill those gaps.

This idea leads us to **FLARE**: *Forward-Looking Active REtrieval augmented generation.*

### ⚙️ How FLARE works
Instead of doing one retrieval step, the agent operates in a **loop**:
1. The model analyzes the current context and identifies **information needs** (`needs`).  
2. If evidence is missing, it generates **new sub-queries** (like “Charizard base attack stat” or “Pikachu move power”).  
3. The system performs hybrid retrieval for each need and **expands the context**.  
4. Once enough evidence is gathered, it produces a **final grounded answer** (`final_answer`).

This structured, multi-step loop makes retrieval **adaptive and self-aware**, reducing hallucination risk. The original method in the paper by [Jiang et al. (2023)](https://arxiv.org/pdf/2305.06983) triggers retrieval when the model is about to generate low-confidence tokens; the version here is a simplified variant that asks the model to list its information needs explicitly.
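The control flow of this loop can be sketched without an LLM. In the example below, `iterative_retrieve`, `toy_decide`, and the `FACTS` lookup are hypothetical stand-ins for the agent and the retriever; they illustrate the needs-then-retrieve cycle under those assumptions rather than the paper's method.

```python
from typing import Callable, Dict, List, Optional, Tuple

def iterative_retrieve(
    question: str,
    decide: Callable[[str, str], Tuple[List[str], Optional[str]]],
    search: Callable[[str], str],
    max_steps: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Loop: ask for needs, retrieve for each, stop once an answer is produced."""
    context, used = "", []
    for _ in range(max_steps):
        needs, answer = decide(question, context)
        if answer is not None or not needs:
            return answer, used
        for need in needs:
            used.append(need)
            context += f"\n{need}: {search(need)}"
    return None, used  # step budget exhausted without an answer

# Stand-ins for the LLM and the retriever (toy data)
FACTS: Dict[str, str] = {"x value": "x is 3", "y value": "y is 5"}

def toy_decide(question: str, context: str):
    # Report every fact not yet present in the context as a need
    missing = [k for k, v in FACTS.items() if v not in context]
    return (missing, None) if missing else ([], "y is larger")

answer, used = iterative_retrieve("Which is larger, x or y?", toy_decide, FACTS.get)
print(answer, used)
```

The `max_steps` budget matters: without it, a model that keeps asking for more evidence would loop indefinitely.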

### 🧠 Key concepts demonstrated
| Concept | Description |
|----------|--------------|
| 🪞 *Self-reflective retrieval* | The model inspects its own context and identifies missing information. |
| 🔄 *Iterative retrieval loop* | It autonomously issues and resolves follow-up queries. |
| 📑 *Structured reasoning schema* | Outputs are typed (`needs`, `final_answer`), ensuring interpretability. |

### 🔬 What this code does
- Defines `FLAREAnswer` → a structured schema with two fields: `needs` (follow-up queries) and `final_answer` (final grounded output).  
- Implements `flare_agent` → a PydanticAI agent that follows the FLARE reasoning pattern.  
- Defines `flare_answer()` → runs the **adaptive retrieval loop** up to `max_steps`, adding new context at each iteration.

By the end, you’ll see how the agent autonomously **plans, retrieves, and finalizes** answers, a crucial building block toward **fully agentic RAG systems** that can *think before they answer*.


In [None]:
from typing import List, Optional

# Structured output: NEEDs + optional final field
class FLAREAnswer(BaseModel):
    needs: List[str] = Field(default_factory=list, description="Follow-up retrieval queries.")
    final_answer: Optional[str] = Field(default=None, description="Final answer when sufficient evidence.")

# Ask the model to fill the structured schema directly
flare_agent = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt=(
        "You answer strictly from CONTEXT.\n"
        "- If info is missing, populate `needs` with 1-3 short search queries.\n"
        "- When sufficient evidence is present in the CONTEXT, leave `needs` empty and write `final_answer`.\n"
        "Do not invent facts; cite filenames in the answers like [pikachu.md]."
    ),
    output_type=FLAREAnswer,
    retries=2,
)

def flare_answer(question: str, max_steps: int = 3, per_need_k: int = 5) -> tuple[FLAREAnswer, List[str]]:
    context, used = "", []
    for _ in range(max_steps):
        msg = f"CONTEXT:\n{context}\n\nQUESTION: {question}\n"
        out = flare_agent.run_sync(msg).output

        # If final available or no needs, return immediately
        if not out.needs or (out.final_answer and out.final_answer != 'null'):
            return out, used

        # Retrieve for each needed query and expand context
        new_ctx = []
        for q in out.needs:
            used.append(q)
            new_ctx.append(f"QUERY: {q}.\n\nRESPONSE:{tool_hybrid(q, k=per_need_k)}")
        context += ("\n\n" + "\n\n".join(new_ctx)) if new_ctx else ""

    # Last attempt: ask for a final answer with accumulated context
    final = flare_agent.run_sync(f"CONTEXT:\n{context}\n\nQUESTION: {question}\n").output
    return final, used

# --- Example ---
q = "Who has more powerful normal type attack - Charizard or Pikachu?"
res, used = flare_answer(q)

rprint("Final:", res.final_answer)
rprint("Needs issued:", res.needs)
rprint("Used queries:", used)

16:47:57.417 flare_agent run
16:47:57.417   chat google/gemini-2.5-pro
16:48:03.617 Hybrid search called with query: Pikachu normal type attacks
16:48:04.817 Hybrid search called with query: Charizard normal type attacks
16:48:05.505 flare_agent run
16:48:05.505   chat google/gemini-2.5-pro
16:48:17.112 Hybrid search called with query: Pikachu base attack stat
16:48:18.197 Hybrid search called with query: Pikachu normal type moves
16:48:18.900 Hybrid search called with query: Charizard normal type moves
16:48:21.617 flare_agent run
16:48:21.617   chat google/gemini-2.5-pro


## 🧭 Self-RAG: Self-Reflective Retrieval and Generation

We’ve now seen how agents can perform adaptive retrieval loops (**FLARE**) and multi-query reasoning.  
The next frontier in agentic RAG is **self-assessment**: teaching the model to **critique its own answers** and refine them automatically.  
This is the core idea behind **Self-RAG (Self-Reflective Retrieval-Augmented Generation)**.

### 🧠 What is Self-RAG?
**Self-RAG** (Asai et al., 2023) introduces a closed-loop system where the LLM not only retrieves and answers, but also *evaluates* the quality of its own reasoning using structured feedback signals. The paper trains a single model to emit special reflection tokens; here we approximate the idea with two separate agents.

In this setup:
1. The **Generator agent (`gen`)** produces an answer grounded in retrieved context.  
2. The **Critic agent (`crit`)** reviews that answer for:
   - **Correctness score (`correctness_score`, 0–1):** how well the evidence backs the answer.  
   - **Hallucination risk:** likelihood of unsupported or fabricated information.  
   - **Citation sufficiency:** whether the cited documents justify the claim.  
   - **Missing evidence queries:** follow-up retrievals needed to strengthen the answer.
3. If the critic identifies gaps, the system issues **additional retrievals**, expands the context, and retries, iterating until confidence crosses a threshold or the loop limit is reached.

### ⚙️ What this code does
- Defines two structured outputs:
  - `Ans` → stores the answer, citations, and retrieval tool used.  
  - `Crit` → stores evaluation metrics and follow-up needs.  
- Creates two agents:
  - `gen` (generator): answers based on context.  
  - `crit` (critic): evaluates the generator’s response.  
- Implements `selfrag()`, a **multi-turn self-reflective retrieval loop** combining both:
  - The generator writes → the critic reviews → retrieval expands → iteration continues.
- The process stops when:
  - The correctness score ≥ threshold (`th`),  
  - Citations are adequate,  
  - The critic requests no further evidence, and  
  - Hallucination risk is low.
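The stopping rule is easy to get subtly wrong, so here it is as a standalone predicate. The `Critique` dataclass is a hypothetical stand-in that mirrors the fields of the critic's output; it is a sketch, not the tutorial's own class.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Critique:
    """Hypothetical stand-in for the critic's structured output."""
    correctness_score: float
    hallucination_risk: str
    citation_ok: bool
    missing_evidence_queries: List[str] = field(default_factory=list)

def should_stop(c: Critique, th: float = 0.8) -> bool:
    """True only when every acceptance check passes."""
    return (
        c.correctness_score >= th
        and c.citation_ok
        and c.hallucination_risk.lower() == "low"
        and not c.missing_evidence_queries  # empty list means nothing left to fetch
    )

print(should_stop(Critique(0.9, "low", True)))                     # True
print(should_stop(Critique(0.9, "low", True, ["more evidence"])))  # False
print(should_stop(Critique(0.5, "Low", True)))                     # False
```

Note the emptiness test uses `not lst`. Writing `lst is []` compares object identity against a fresh list and is always false, which would keep the loop running until the step limit.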

To learn more, go through the paper by [Asai et al. (2023)](https://arxiv.org/pdf/2310.11511).

### 🎯 Why Self-RAG matters
Self-RAG represents a step toward **autonomous retrieval governance**: systems that *know what they don’t know* and can ask the right follow-up questions. It reduces hallucinations, improves factual grounding, and creates interpretable reasoning logs (`history`).

In the next cell, we’ll run `selfrag()` on the Charizard vs Pikachu question and observe how the model iteratively critiques, retrieves, and converges to a reliable, cited answer.


In [None]:
import json 

class Ans(BaseModel):
    answer: str
    citations: List[str] = []
    used_tool: str = "hybrid"

class Crit(BaseModel):
    correctness_score: float = Field(description="How good the generation is as a float between 0 and 1")
    hallucination_risk: str
    citation_ok: bool
    missing_evidence_queries: List[str] = []

gen = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt="Answer strictly from CONTEXT; if unknown say so. Cite filenames like [pikachu.md].",
    output_type=Ans,
    retries=3
)

crit = Agent(
    model="openrouter:google/gemini-2.5-pro",
    system_prompt=("Score correctness of ANSWER from CONTEXT (0-1), flag hallucination (low|medium|high), "
                   "whether citations suffice, and list up to 3 short follow-up queries. Correctness should be high only if the generation answers the query with required facts."),
    output_type=Crit, 
    retries=3
)

def selfrag(q: str, loops: int = 3, th: float = 0.8, k_init: int = 5, k_need: int = 5):
    ctx, hist, used = tool_hybrid(q, k_init), [], []
    for step in range(1, loops + 1):
        a = gen.run_sync(f"CONTEXT:\n{ctx}\n\nQUESTION: {q}").output
        c = crit.run_sync(f"CONTEXT:\n{ctx}\n\nQUESTION:\n{q}\n\nANSWER:\n{a.answer}").output

        state = {"step": step, "correctness": c.correctness_score, "risk": c.hallucination_risk,
                     "citation_ok": c.citation_ok, "missing": c.missing_evidence_queries, "generation": a.answer}
    
        logfire.info(json.dumps(state, indent=4).replace("{", "").replace("}", ""))

        hist.append(state)
        
        # Stop only when every check passes; `not lst` tests emptiness (`lst is []` is always False)
        if (c.correctness_score >= th and c.citation_ok
                and c.hallucination_risk.lower() == "low" and not c.missing_evidence_queries):
            return {"final": a.answer, "used_tool": a.used_tool, "used_queries": used, "history": hist}

        # Retrieve for each follow-up query the critic asked for and grow the context
        for need in c.missing_evidence_queries:
            used.append(need)
            ctx += "\n\n" + tool_hybrid(need, k_need)
        if not c.missing_evidence_queries:  # checks failed but no needs -> broaden once
            ctx += "\n\n" + tool_hybrid(q, max(3, k_need // 2))

    return {"final": a.answer, "used_tool": "hybrid",
            "used_queries": used, "history": hist, "note": "Stopped at max loops."}

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
res = selfrag(q)

rprint("Final:\n", res["final"])
rprint("Used queries:\n", res["used_queries"])
rprint("History:\n", [{k: h[k] for k in ("step","correctness","risk","citation_ok")} for h in res["history"]])

17:10:33.784 Hybrid search called with query: Who has more powerful normal type attack - Charizard or Pikachu?
17:10:34.851 gen run
17:10:34.851   chat google/gemini-2.5-pro
17:10:40.378 crit run
17:10:40.378   chat google/gemini-2.5-pro
17:10:50.437 
    "step": 1,
    "correctness": 1.0,
    "risk": "low",
    "citation_ok": true,
    "missing": [
        "Pikachu attack stats",
        "Charizard attack stats",
        "Pikachu vs Charizard attack power"
    ],
    "generation": "I am sorry, but this document does not contain the answer to this question. \n"

17:10:50.437 Hybrid search called with query: Pikachu attack stats
17:10:52.099 Hybrid search called with query: Charizard attack stats
17:10:53.606 Hybrid search called with query: Pikachu vs Charizard attack power
17:10:54.324 gen run
17:10:54.324   chat google/gemini-2.5-pro
17:11:01.248 crit run
17:11:01.248   chat google/gemini-2.5-pro
17:11:12.324 
    "step": 2,
    "correctness": 1.0,
    "risk": "low",
    "citation_ok

## 🧩 Late Chunking: Adaptive Retrieval Without Preprocessing Overhead

Up to this point, all our retrieval methods assumed we had **pre-chunked the entire corpus** ahead of time. While effective, this approach can be **wasteful**: it embeds and stores thousands of text fragments, many of which no query will ever need. In large-scale systems, pre-chunking becomes expensive in both **storage** and **embedding cost**.

To overcome this, we now explore **Late Chunking** in the sense of **Dynamic** or **On-Demand Chunking**: chunking at query time rather than at indexing time.

### ⚙️ What is Late Chunking?
Instead of embedding every chunk in advance, we:
1. **Embed entire documents** at a coarse level (1 vector per document).  
2. When a query arrives:
   - **Rank documents** by similarity to the query embedding.  
   - **Select top-N documents** likely to contain relevant information.  
   - **Chunk and embed only those documents**, then **rerank their chunks** by semantic similarity.  
3. Return the **top-k most relevant chunks** as context for the answering agent.

This approach shifts the chunking process to *after* initial retrieval, hence the name **Late Chunking**.
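The two-stage ranking can be illustrated with toy vectors. In this sketch, `two_stage_search`, the 2-D vectors, and the chunk names are hypothetical; in particular, the chunk vectors are looked up from a dict here, whereas a real pipeline would chunk and embed the selected documents at query time.

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def two_stage_search(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    chunk_vecs: Dict[str, Dict[str, List[float]]],
    top_docs: int = 1,
    top_chunks: int = 2,
) -> List[Tuple[str, str]]:
    """Rank documents first, then rank only the chunks of the winning documents."""
    best_docs = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)[:top_docs]
    # Only chunks of the selected documents are ever scored
    candidates = [(d, c, cosine(query_vec, v)) for d in best_docs for c, v in chunk_vecs[d].items()]
    candidates.sort(key=lambda t: t[2], reverse=True)
    return [(d, c) for d, c, _ in candidates[:top_chunks]]

# Toy 2-D vectors standing in for real embeddings
doc_vecs = {"a.md": [1.0, 0.0], "b.md": [0.0, 1.0]}
chunk_vecs = {
    "a.md": {"a1": [0.9, 0.1], "a2": [0.5, 0.5], "a3": [0.1, 0.9]},
    "b.md": {"b1": [0.0, 1.0]},
}
hits = two_stage_search([1.0, 0.0], doc_vecs, chunk_vecs)
print(hits)  # [('a.md', 'a1'), ('a.md', 'a2')]
```

The chunk `b1` is never scored because its document lost the first stage. That is the source of the savings, and also the risk: a relevant chunk inside a low-ranked document is missed.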

### 💡 Why it helps
| Benefit | Explanation |
|----------|-------------|
| 💰 **Efficiency** | Only a few documents are chunked and embedded per query, which cuts upfront indexing and storage cost. |
| ⚡ **Speed** | Avoids loading or embedding a large number of irrelevant chunks. |
| 🎯 **Precision** | Focuses chunking effort on documents already deemed semantically relevant. |
| 🧠 **Scalability** | Suitable for large corpora or dynamic datasets (e.g., fresh documents, evolving knowledge bases). |


The term comes from the paper by [Günther et al. (2024)](https://arxiv.org/pdf/2409.04701), which defines late chunking differently: a long-context embedding model encodes the whole document first, and chunk embeddings are then pooled from the resulting token embeddings. The query-time variant in this tutorial borrows the name but not that mechanism.

### 🧠 What this code does
1. Builds **lightweight document-level embeddings** (1 vector per `.md` file).  
2. Defines `late_chunk_search()`:
   - Retrieves top documents based on query–doc similarity.  
   - Chunks only those documents and embeds them on the fly.  
   - Reranks chunks to surface the most semantically relevant passages.  
3. Wraps it as a tool `late_chunk_context()` used by `latechunking_agent`.

The trade-off: indexing and memory costs drop, but every query now pays for chunking and embedding the selected documents, as the multi-second tool calls in the logs below suggest.

In [None]:
def _embed(texts: List[str]) -> List[List[float]]:
    return [e.embedding for e in client.embeddings.create(model=EMBED_MODEL, input=texts).data]

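def _cos(a, b):
    # Cosine similarity; the epsilon avoids division by zero for zero vectors
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def load_docs(dir_path: str) -> List[Dict]:
    # Read every .md file, closing each handle once its content is loaded
    docs = []
    for f in sorted(os.listdir(dir_path)):
        if f.endswith(".md"):
            with open(os.path.join(dir_path, f), encoding="utf-8") as fh:
                docs.append({"filename": f, "content": fh.read()})
    return docs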

# 1) Build a lightweight doc-level index in memory (no pre-chunking)
docs = load_docs(str(DATA_DIR))
doc_vecs = _embed([d["content"] for d in docs])

# 2) Late-chunk retrieval: rank docs by query-to-doc similarity, then chunk only top docs and rerank chunks
def late_chunk_search(query: str, top_docs=3, chunk_size=700, overlap=120, top_chunks=6) -> List[Dict]:
    logfire.info(f'Late chunking search with query: {query}')
    qv = _embed([query])[0]

    # rank full documents
    doc_scores = [(_cos(qv, v), i) for i, v in enumerate(doc_vecs)]
    top_doc_idxs = [i for _, i in sorted(doc_scores, reverse=True)[:top_docs]]

    # chunk only selected docs
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    chunks, metas = [], []
    for i in top_doc_idxs:
        for c in splitter.split_text(docs[i]["content"]):
            chunks.append(c)
            metas.append({"filename": docs[i]["filename"]})

    # rerank chunks by query similarity (embed chunks once)
    chunk_vecs = _embed(chunks) if chunks else []
    ranked = sorted(
        [{"content": c, "metadata": m, "_score": _cos(qv, v)} for c, m, v in zip(chunks, metas, chunk_vecs)],
        key=lambda x: x["_score"], reverse=True
    )
    
    return ranked[:top_chunks]

# 3) Build a context string for your RAG agent
def late_chunk_context(query: str, **kwargs) -> str:
    hits = late_chunk_search(query, **kwargs)
    return build_context_from_results(hits)


latechunking_agent = Agent(
    model=CHAT_MODEL,
    system_prompt=(
        "You answer strictly from the provided CONTEXT.  "
        "Always cite the filenames you relied on, e.g., [pikachu.md]."
    ),
    output_type=RAGAnswer,
    tools=[late_chunk_context],
    retries=3
)

q = "Who has more powerful normal type attack - Charizard or Pikachu?"
rag_response = latechunking_agent.run_sync(q)

rprint(rag_response)

17:13:47.398 latechunking_agent run
17:13:47.414   chat x-ai/grok-4-fast
17:13:51.037   running 1 tool
17:13:51.037     running tool: late_chunk_context
17:13:51.037       Late chunking search with query: Charizard vs Pikachu normal type moves power
17:13:57.576   chat x-ai/grok-4-fast
17:13:59.041   running 1 tool
17:13:59.042     running tool: late_chunk_context
17:13:59.042       Late chunking search with query: Charizard normal type moves
17:14:06.736   chat x-ai/grok-4-fast
17:14:08.616   running 1 tool
17:14:08.624     running tool: late_chunk_context
17:14:08.624       Late chunking search with query: Pikachu normal type moves power
17:14:17.083   chat x-ai/grok-4-fast
17:14:21.135   running 1 tool
17:14:21.135     running tool: late_chunk_context
17:14:21.135       Late chunking search with query: strongest normal type move Charizard Pikachu base power
17:14:35.013   chat x-ai/grok-4-fast


# 🏁 Conclusion: From Simple RAG to Adaptive, Agentic Retrieval

In this tutorial, we built a complete end-to-end **Retrieval-Augmented Generation (RAG)** pipeline, moving from basic semantic search to advanced agentic techniques that reason about *how* to retrieve.

In the next tutorial, we’ll go one level deeper into **relational reasoning** through **GraphRAG**, a paradigm that connects retrieved knowledge not just by similarity, but by **semantic relationships** and **causal links**.


### 🧩 Key takeaway
RAG is not just about retrieval; it’s about **reasoning with evidence**. From hybrid searches to self-reflective loops, each enhancement makes the agent **more reliable, interpretable, and adaptive**.

Next stop: **GraphRAG**, where your agents will not just fetch information, but *understand relationships, infer causality, and build knowledge networks.*

> 🕸️ *From chunks → to connections → to cognition.*