# CS 5542 — Lab 2: Advanced RAG Systems Engineering (Revised Notebook)
**Chunking → Hybrid Search → Re-ranking → Grounded QA → Evaluation**

**Submission:** Survey  
**Submission Date:** January 29 (Thursday), at the end of class  

## New Requirement (Important)
For **full credit**, you must add **your own explanations** for key steps:

- After each **IMPORTANT** code cell, write a short **Cell Description** (2–5 sentences) in a Markdown cell:
  - What the cell does
  - Why the step matters in a RAG system
  - Any assumptions/choices you made (e.g., chunk size, α, embedding model)

> Tip: Treat your descriptions like “mini system documentation.” This is how engineers communicate system design.


## Project Dataset Guide (Required for Full Credit)

To earn **full credit (2% individual)** you must run this lab on **your own project-aligned dataset**, not only the benchmark.

### Minimum project dataset requirements
- **3–20 documents** (start small; you can scale later)
- Prefer **plain text** documents (`.txt`) for Lab 2
- Total size: **roughly 3–10 pages** of content across all files
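
A quick way to check a folder against these minimums is sketched below. The helper name `check_dataset` and the figure of roughly 3,000 characters per page are assumptions for illustration, not course requirements; adjust the page estimate to your own documents.

```python
from pathlib import Path

# Assumption for illustration: one page of plain text is roughly 3,000 characters.
CHARS_PER_PAGE = 3000

def check_dataset(folder: str, min_docs: int = 3, max_docs: int = 20,
                  min_pages: float = 3.0) -> dict:
    """Count .txt files in `folder` and estimate total pages from character count."""
    files = sorted(Path(folder).glob("*.txt"))
    total_chars = sum(len(f.read_text(encoding="utf-8", errors="ignore")) for f in files)
    est_pages = total_chars / CHARS_PER_PAGE
    return {
        "n_docs": len(files),
        "est_pages": round(est_pages, 1),
        "docs_ok": min_docs <= len(files) <= max_docs,
        "pages_ok": est_pages >= min_pages,
    }

# Example usage:
# print(check_dataset("project_data"))
```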

### Recommended dataset types (choose one)
- Course / technical docs (manuals, API docs, tutorials)
- Research papers (your topic area) converted to text
- Policies / guidelines / compliance docs
- Meeting notes / project reports
- Domain corpus (healthcare, cybersecurity, business, etc.)

### Folder structure (required)
Create a folder named `project_data/` and put files inside:
- `project_data/doc1.txt`
- `project_data/doc2.txt`
- ...

> If you have PDFs, convert them to text first (instructions below).


In [1]:
# ✅ IMPORTANT: Create a project_data folder and add your files
import os, glob

PROJECT_FOLDER = "project_data"
os.makedirs(PROJECT_FOLDER, exist_ok=True)

print("✅ Folder ready:", PROJECT_FOLDER)
print("Put 3–20 .txt files into ./project_data/")
print("Currently found:", len(glob.glob(os.path.join(PROJECT_FOLDER, "*.txt"))), "txt files")


✅ Folder ready: project_data
Put 3–20 .txt files into ./project_data/
Currently found: 0 txt files


### If you are using Google Colab (Upload files)

**Option A — Upload manually**
1. Click the **Files** icon (left sidebar)
2. Click **Upload**
3. Upload your `.txt` files
4. Move them into `project_data/` (or upload directly into that folder)

**Option B — Pull from GitHub**
If your project docs are in a GitHub repo, you can clone it and copy files into `project_data/`.
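
Once the repository has been cloned (for example with `git clone` in a Colab cell), the copy step for Option B might look like the sketch below. The function name `copy_txt_files` and the example path `my_repo/docs` are hypothetical; substitute your own clone location.

```python
import shutil
from pathlib import Path

def copy_txt_files(src_dir: str, dst_dir: str = "project_data") -> int:
    """Copy every .txt file found under src_dir (recursively) into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for fp in sorted(Path(src_dir).rglob("*.txt")):
        if fp.parent.resolve() == dst.resolve():
            continue  # skip files that already live in the destination folder
        # Files that share a name overwrite one another in dst_dir.
        shutil.copy2(fp, dst / fp.name)
        copied += 1
    return copied

# Example usage (hypothetical path to a cloned repo):
# n = copy_txt_files("my_repo/docs")
# print("Copied", n, "files")
```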


In [2]:
# (Colab only) Optional helper: move uploaded .txt files into project_data/
# Skip if you're not in Colab or you already placed files correctly.

import shutil, glob, os

PROJECT_FOLDER = "project_data"
os.makedirs(PROJECT_FOLDER, exist_ok=True)

moved = 0
for fp in glob.glob("*.txt"):
    shutil.move(fp, os.path.join(PROJECT_FOLDER, os.path.basename(fp)))
    moved += 1

print(f"Moved {moved} files into {PROJECT_FOLDER}/")
print("Now found:", len(glob.glob(os.path.join(PROJECT_FOLDER, '*.txt'))), "txt files")


Moved 6 files into project_data/
Now found: 6 txt files


### If your sources are PDFs (Optional)

For Lab 2, we recommend converting PDFs to `.txt` first.

**Simple approach (good enough for class):**
- Copy/paste text from the PDF into a `.txt` file.

**Programmatic approach (optional):**
If your PDF is text-based (not scanned), you can extract text using `pypdf`.


In [None]:
# OPTIONAL: PDF → TXT conversion (only for text-based PDFs)
# If your PDFs are scanned images, this won't work well without OCR.

# !pip -q install pypdf

from pathlib import Path
import os

def pdf_to_txt(pdf_path: str, out_folder: str = "project_data"):
    from pypdf import PdfReader
    reader = PdfReader(pdf_path)
    text = []
    for page in reader.pages:
        text.append(page.extract_text() or "")
    txt = "\n\n".join(text).strip()

    os.makedirs(out_folder, exist_ok=True)
    out_path = Path(out_folder) / (Path(pdf_path).stem + ".txt")
    out_path.write_text(txt, encoding="utf-8", errors="ignore")
    return str(out_path), len(txt)

# Example usage:
# out_path, n_chars = pdf_to_txt("/content/your_file.pdf")
# print("Saved:", out_path, "| chars:", n_chars)


### Project Queries + Mini Rubric (Required)

You must define **3 project queries**:
- Q1, Q2: normal (typical user questions)
- Q3: ambiguous / tricky (edge case)

Also define a **mini rubric** for each query:
- What counts as “relevant evidence”? (keywords, entities, definitions, constraints)
- What would a correct answer look like? (1–2 bullet points)

This rubric makes your evaluation meaningful (Precision@K / Recall@K).
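
To make the link between the rubric and the metrics concrete, here is a minimal sketch of Precision@K and Recall@K. It assumes you have hand-labeled, using your rubric, the set of chunk IDs that count as relevant for a query. The function names and the relevance labels in the example are illustrative only.

```python
from typing import List, Set

def precision_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k slots filled by chunk IDs labeled relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for i in retrieved[:k] if i in relevant)
    return hits / k

def recall_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top-k results."""
    if not relevant or k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Illustrative labels only: suppose the rubric marks chunks 120 and 127 as relevant.
retrieved_ids = [120, 122, 130]
relevant_ids = {120, 127}
print("P@3:", precision_at_k(retrieved_ids, relevant_ids, k=3))
print("R@3:", recall_at_k(retrieved_ids, relevant_ids, k=3))
```

With these example labels, one of the three retrieved chunks is relevant (precision 1/3) and one of the two relevant chunks was found (recall 1/2).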


In [3]:
project_queries = {
    "Q1": {
        "query": "How does a SAFE convert into equity and what role does the valuation cap play?",
        "rubric_relevant_evidence": [
            "Definition of a SAFE",
            "Description of conversion upon a triggering event",
            "Explanation of valuation cap and how it affects conversion price",
        ],
        "rubric_correct_answer": [
            "Explains that a SAFE converts during a priced equity round",
            "Describes how the valuation cap sets a maximum conversion valuation for early investors",
        ],
    },
    "Q2": {
        "query": "How is price per share calculated in venture financing, and why is it important for ownership modeling?",
        "rubric_relevant_evidence": [
            "Formula or description of price per share (valuation divided by shares)",
            "Connection between PPS and ownership or dilution",
            "Mention of cap table or modeling implications",
        ],
        "rubric_correct_answer": [
            "Correctly explains how price per share is calculated",
            "Links PPS accuracy to correct ownership and dilution outcomes",
        ],
    },
    "Q3_ambiguous": {
        "query": "What happens when multiple SAFEs with different terms convert in the same financing round?",
        "rubric_relevant_evidence": [
            "Discussion of multiple SAFEs converting simultaneously",
            "Mention of valuation caps and/or discounts",
            "Acknowledgement of complexity or ambiguity in outcomes",
        ],
        "rubric_correct_answer": [
            "Explains that conversion outcomes can vary based on terms",
            "Explicitly notes ambiguity or the need for assumptions or clarification",
        ],
    },
}

project_queries


{'Q1': {'query': 'How does a SAFE convert into equity and what role does the valuation cap play?',
  'rubric_relevant_evidence': ['Definition of a SAFE',
   'Description of conversion upon a triggering event',
   'Explanation of valuation cap and how it affects conversion price'],
  'rubric_correct_answer': ['Explains that a SAFE converts during a priced equity round',
   'Describes how the valuation cap sets a maximum conversion valuation for early investors']},
 'Q2': {'query': 'How is price per share calculated in venture financing, and why is it important for ownership modeling?',
  'rubric_relevant_evidence': ['Formula or description of price per share (valuation divided by shares)',
   'Connection between PPS and ownership or dilution',
   'Mention of cap table or modeling implications'],
  'rubric_correct_answer': ['Correctly explains how price per share is calculated',
   'Links PPS accuracy to correct ownership and dilution outcomes']},
 'Q3_ambiguous': {'query': 'What happens

### ✍️ Cell Description (Student)
Explain what files you used for your project dataset, why they match your scenario, and how you designed your 3 queries + rubric.


> For this project, I created a custom dataset consisting of six plain-text documents focused on venture finance concepts such as SAFEs, valuation, price per share, dilution, and investment metrics. These documents simulate real-world financial explanations that might appear in internal investment memos or educational materials, making them well-suited for a retrieval-augmented generation system.
>
> The three project queries were designed to test different retrieval behaviors: two represent typical user questions about SAFE conversion and price-per-share calculations, while the third intentionally introduces ambiguity around multiple SAFEs converting simultaneously. For each query, I defined a mini rubric specifying what counts as relevant evidence and what a correct answer must include, enabling meaningful evaluation using precision, recall, and faithfulness metrics.

## 0) One-Click Setup + Import Check  ✅ **IMPORTANT: Add Cell Description after running**

In [4]:
# CS 5542 Lab 2 — One-Click Dependency Install
# If your imports fail after installing, restart the runtime/kernel and rerun this cell.

!pip install -q sentence-transformers faiss-cpu chromadb datasets transformers scikit-learn rank-bm25

import os, glob, re
import numpy as np
import pandas as pd
from typing import List, Tuple, Dict, Set

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from rank_bm25 import BM25Okapi

from sentence_transformers import SentenceTransformer
import faiss

from transformers import pipeline

print("✅ Setup complete. If you see dependency warnings, ignore unless imports fail.")


### ✍️ Cell Description (Student)
Write 2–5 sentences explaining what the setup cell does and why restarting the kernel sometimes matters after pip installs.


>This cell installs all required Python dependencies for building and evaluating the retrieval-augmented generation (RAG) system, including libraries for embeddings, vector search, keyword retrieval, re-ranking, and evaluation. It then imports the core modules used throughout the notebook, such as sentence transformers for embeddings, FAISS for vector indexing, and BM25 for keyword retrieval.
>
>Restarting the kernel after pip installs is sometimes necessary because a module that was already imported stays loaded in memory at its old version, so an upgraded or newly installed package may not take effect in the running session. Restarting clears the loaded modules, so the next import picks up the installed versions and avoids import errors later in the pipeline.
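
One way to see whether a restart is needed is to compare the version installed on disk with the version already loaded in the running session. The sketch below uses only the standard library. The helper name `loaded_vs_installed` is hypothetical, and it assumes the package exposes a `__version__` attribute, which not every package does.

```python
import sys
from importlib import metadata
from typing import Optional, Tuple

def loaded_vs_installed(dist_name: str, module_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (version loaded in this session, version installed on disk)."""
    mod = sys.modules.get(module_name)           # None if the module was never imported
    loaded = getattr(mod, "__version__", None) if mod is not None else None
    try:
        installed = metadata.version(dist_name)  # reads the installed package metadata
    except metadata.PackageNotFoundError:
        installed = None
    return loaded, installed

# Example usage: distribution and import names can differ (scikit-learn vs sklearn).
# loaded, installed = loaded_vs_installed("scikit-learn", "sklearn")
# if loaded and installed and loaded != installed:
#     print("Restart the kernel to pick up version", installed)
```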


## 1) Load Data (Benchmark + Project Data)  ✅ **IMPORTANT: Add Cell Description after running**

In [5]:
# Benchmark Loader (classroom-safe fallback; avoids script-based datasets)
def load_benchmark(n: int = 120) -> List[str]:
    # 1) Try a script-free SciFact source
    try:
        print("Trying allenai/scifact...")
        ds = load_dataset("allenai/scifact", split=f"train[:{n}]")
        sample = ds[0]
        if "claim" in sample:
            return [x["claim"] for x in ds]
        if "text" in sample:
            return [x["text"] for x in ds]
        raise RuntimeError("Unknown SciFact schema.")
    except Exception as e:
        print("⚠️ allenai/scifact failed:", str(e))

    # 2) Try multi_news
    try:
        print("Trying multi_news...")
        ds = load_dataset("multi_news", split=f"train[:{n}]")
        return [x["document"] for x in ds]
    except Exception as e:
        print("⚠️ multi_news failed:", str(e))

    # 3) Fallback: ag_news (very stable)
    print("Using ag_news fallback...")
    ds = load_dataset("ag_news", split=f"train[:{n}]")
    return [x["text"] for x in ds]

# Load benchmark docs
benchmark_docs = load_benchmark(n=120)
print(f"Loaded benchmark docs: {len(benchmark_docs)}")

# Load project-aligned docs from ./project_data/*.txt
PROJECT_FOLDER = "project_data"
project_files = sorted(glob.glob(os.path.join(PROJECT_FOLDER, "*.txt")))
project_docs = []
for fp in project_files:
    with open(fp, "r", encoding="utf-8", errors="ignore") as f:
        project_docs.append(f.read())

print(f"Loaded project docs: {len(project_docs)}")
if len(project_docs) == 0:
    print("⚠️ Add 3–20 .txt files under ./project_data/ to earn full credit.")


Trying allenai/scifact...




⚠️ allenai/scifact failed: Dataset scripts are no longer supported, but found scifact.py
Trying multi_news...


⚠️ multi_news failed: Dataset scripts are no longer supported, but found multi_news.py
Using ag_news fallback...


Loaded benchmark docs: 120
Loaded project docs: 6


### ✍️ Cell Description (Student)
Explain what dataset(s) you loaded and why we require **project-aligned** data for full credit.

>This cell loads two datasets: a benchmark dataset used for baseline comparison (in this run the `ag_news` fallback, because the first two sources failed to load) and a custom project-aligned dataset loaded from plain-text files in the `project_data/` directory. The project dataset contains domain-specific venture finance documents that reflect a realistic use case for a RAG system.
>
>Project-aligned data is required for full credit because it demonstrates the system's ability to retrieve and reason over user-defined content rather than relying solely on generic benchmark datasets. This distinction ensures that evaluation metrics reflect meaningful performance in a real application scenario.


## 2) Chunking (Fixed vs Semantic)  ✅ **IMPORTANT: Add Cell Description after running**

In [6]:
# --- Chunking functions ---
def fixed_chunks(text: str, size: int = 1200, overlap: int = 200) -> List[str]:
    """Character-based fixed window chunking (fast and reliable in class)."""
    text = text.strip()
    if not text:
        return []
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        c = text[i:i+size].strip()
        if len(c) > 50:
            chunks.append(c)
    return chunks

def semantic_chunks(text: str) -> List[str]:
    """Paragraph-based semantic chunking; merges short segments to keep context."""
    paras = [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]
    merged, buf = [], ""
    for p in paras:
        if len(buf) < 400:
            buf = (buf + "\n\n" + p).strip()
        else:
            merged.append(buf); buf = p
    if buf:
        merged.append(buf)
    return [m for m in merged if len(m) > 80]

def build_corpus(docs: List[str], mode: str) -> List[str]:
    all_chunks = []
    for d in docs:
        if mode == "fixed":
            all_chunks.extend(fixed_chunks(d))
        elif mode == "semantic":
            all_chunks.extend(semantic_chunks(d))
        else:
            raise ValueError("mode must be 'fixed' or 'semantic'")
    return all_chunks

# Build both corpora and choose one to use in retrieval
all_docs = benchmark_docs + project_docs
fixed_corpus = build_corpus(all_docs, mode="fixed")
semantic_corpus = build_corpus(all_docs, mode="semantic")

print("Fixed corpus chunks:", len(fixed_corpus))
print("Semantic corpus chunks:", len(semantic_corpus))

# Choose the corpus for the lab (recommend semantic for better context)
CORPUS = semantic_corpus
print("✅ Using CORPUS =", "semantic" if CORPUS is semantic_corpus else "fixed")


Fixed corpus chunks: 126
Semantic corpus chunks: 132
✅ Using CORPUS = semantic


In [10]:
## Chunking Comparison
proj_fixed = build_corpus(project_docs, mode="fixed")
proj_semantic = build_corpus(project_docs, mode="semantic")

print("=== Fixed PROJECT chunk example ===")
print(proj_fixed[0][:600])

print("\n=== Semantic PROJECT chunk example ===")
print(proj_semantic[0][:600])


=== Fixed PROJECT chunk example ===
A SAFE (Simple Agreement for Future Equity) is an investment instrument commonly used in early-stage venture financing.
Unlike traditional debt, a SAFE does not accrue interest and does not have a maturity date.
Instead, it converts into equity upon a future triggering event, typically a priced equity financing.

Key terms in a SAFE include the valuation cap, discount rate, and most-favored-nation (MFN) clause.
The valuation cap sets a maximum company valuation at which the SAFE will convert, protecting early investors from excessive dilution.
The discount rate allows the SAFE holder to conver

=== Semantic PROJECT chunk example ===
A SAFE (Simple Agreement for Future Equity) is an investment instrument commonly used in early-stage venture financing.
Unlike traditional debt, a SAFE does not accrue interest and does not have a maturity date.
Instead, it converts into equity upon a future triggering event, typically a priced equity financing.

Key term

In [11]:
# Comparison-only: force fixed chunking to split more aggressively
demo_fixed = fixed_chunks(project_docs[0], size=300, overlap=50)
demo_sem = semantic_chunks(project_docs[0])

print("Fixed demo chunks:", len(demo_fixed))
print("Semantic chunks:", len(demo_sem))

print("\n--- Fixed chunk #1 ---\n", demo_fixed[0])
print("\n--- Fixed chunk #2 ---\n", demo_fixed[1] if len(demo_fixed) > 1 else "(no 2nd chunk)")
print("\n--- Semantic chunk #1 ---\n", demo_sem[0])

Fixed demo chunks: 4
Semantic chunks: 2

--- Fixed chunk #1 ---
 A SAFE (Simple Agreement for Future Equity) is an investment instrument commonly used in early-stage venture financing.
Unlike traditional debt, a SAFE does not accrue interest and does not have a maturity date.
Instead, it converts into equity upon a future triggering event, typically a priced equi

--- Fixed chunk #2 ---
 a future triggering event, typically a priced equity financing.

Key terms in a SAFE include the valuation cap, discount rate, and most-favored-nation (MFN) clause.
The valuation cap sets a maximum company valuation at which the SAFE will convert, protecting early investors from excessive dilution.

--- Semantic chunk #1 ---
 A SAFE (Simple Agreement for Future Equity) is an investment instrument commonly used in early-stage venture financing.
Unlike traditional debt, a SAFE does not accrue interest and does not have a maturity date.
Instead, it converts into equity upon a future triggering event, typi

> Because my project documents are relatively short, both methods can return a single chunk per document with the default settings. To make the difference visible, I also ran fixed chunking with a smaller window (`size=300`, `overlap=50`) during inspection.

### ✍️ Cell Description (Student)
Explain the difference between **fixed** and **semantic** chunking and why chunking affects retrieval quality.


>This cell implements two chunking strategies: fixed-size chunking, which splits documents into overlapping character windows, and semantic chunking, which splits on blank-line paragraph boundaries and merges short paragraphs until the buffer reaches about 400 characters. Fixed chunking is simple and reliable but may split related concepts across chunks, while semantic chunking preserves contextual coherence at the cost of less uniform chunk sizes.
>
>Chunking directly affects retrieval quality because retrieval operates at the chunk level; poorly segmented chunks can lead to incomplete or misleading evidence being retrieved. For this project, semantic chunking was selected to improve context preservation and support more accurate grounded answers.
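
As a worked illustration of the fixed window, the sketch below lists the character spans that a window of `size` characters with `overlap` shared characters would cover. It mirrors the stepping logic of `fixed_chunks` above but returns offsets instead of text and omits the minimum-length filter. The helper name `fixed_spans` is an assumption for this example.

```python
from typing import List, Tuple

def fixed_spans(n_chars: int, size: int = 1200, overlap: int = 200) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for overlapping fixed windows."""
    step = max(1, size - overlap)  # each window starts `step` characters after the previous one
    return [(i, min(i + size, n_chars)) for i in range(0, n_chars, step)]

# A 2,500-character document with the default settings:
print(fixed_spans(2500))
# → [(0, 1200), (1000, 2200), (2000, 2500)]
```

Each span shares its first 200 characters with the end of the previous one, which is how a sentence cut at one window boundary can still appear whole in the next chunk.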

## 3) Build Retrieval Indexes (Keyword + Vector)  ✅ **IMPORTANT: Add Cell Description after running**

In [12]:
# --- Keyword Retrieval (TF-IDF + BM25) ---
def tokenize(s: str) -> List[str]:
    return re.findall(r"[A-Za-z0-9]+", s.lower())

tfidf = TfidfVectorizer(stop_words="english", max_features=50000)
tfidf_matrix = tfidf.fit_transform(CORPUS)

def keyword_tfidf(query: str, k: int = 10) -> List[Tuple[int, float]]:
    q_vec = tfidf.transform([query])
    scores = (tfidf_matrix @ q_vec.T).toarray().ravel()  # ravel keeps a 1-D array even for a one-chunk corpus
    top = np.argsort(scores)[-k:][::-1]
    return [(int(i), float(scores[i])) for i in top]

bm25 = BM25Okapi([tokenize(x) for x in CORPUS])

def keyword_bm25(query: str, k: int = 10) -> List[Tuple[int, float]]:
    scores = bm25.get_scores(tokenize(query))
    top = np.argsort(scores)[-k:][::-1]
    return [(int(i), float(scores[i])) for i in top]

# --- Vector Retrieval (SentenceTransformer + FAISS) ---
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(embed_model_name)

embeddings = embedder.encode(CORPUS, show_progress_bar=True, convert_to_numpy=True, normalize_embeddings=True)
dim = embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(dim)  # cosine via normalized vectors + inner product
faiss_index.add(embeddings)

def vector_search(query: str, k: int = 10) -> List[Tuple[int, float]]:
    q = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    k = min(k, faiss_index.ntotal)  # FAISS pads results with id -1 when k exceeds the index size
    scores, idx = faiss_index.search(q, k)
    return [(int(i), float(s)) for i, s in zip(idx[0], scores[0]) if i >= 0]

print("✅ Retrieval engines ready: TF-IDF, BM25, Vector(FAISS)")


✅ Retrieval engines ready: TF-IDF, BM25, Vector(FAISS)


In [13]:
test_q = project_queries["Q1"]["query"]

print("Q1:", test_q, "\n")

print("BM25 top-3:")
for i, s in keyword_bm25(test_q, k=3):
    print(i, round(s, 4), "|", CORPUS[i][:120].replace("\n"," "))

print("\nVector top-3:")
for i, s in vector_search(test_q, k=3):
    print(i, round(s, 4), "|", CORPUS[i][:120].replace("\n"," "))

Q1: How does a SAFE convert into equity and what role does the valuation cap play? 

BM25 top-3:
120 32.7048 | A SAFE (Simple Agreement for Future Equity) is an investment instrument commonly used in early-stage venture financing. 
122 14.6328 | In venture finance, price per share (PPS) is calculated by dividing the company’s pre-money valuation by the number of p
126 14.0907 | Convertible notes and SAFEs are both instruments used to delay valuation in early-stage investments. Convertible notes a

Vector top-3:
120 0.7078 | A SAFE (Simple Agreement for Future Equity) is an investment instrument commonly used in early-stage venture financing. 
127 0.5748 | Despite their simplicity, SAFEs can create complex capitalization outcomes when multiple SAFEs exist with different term
121 0.5212 | SAFEs are popular because they are simpler than convertible notes and reduce legal complexity. However, they can create 


### ✍️ Cell Description (Student)
Explain why we build **both** keyword and vector retrieval engines, and when each one is expected to work best.

> This cell builds multiple retrieval engines over the chunk corpus: TF-IDF and BM25 for keyword-based retrieval, and a SentenceTransformer + FAISS index for dense vector retrieval. Keyword retrieval tends to work best when the query shares exact terminology with the source text (e.g., specific terms like “valuation cap” or “price per share”), while vector retrieval is stronger when the query is phrased differently but has the same meaning (semantic similarity).
>
>Using both is important in a RAG system because each method has complementary strengths and failure modes; combining them later via hybrid search improves recall and robustness across query styles.
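
The vector index above relies on the fact that the inner product of two unit-length vectors equals their cosine similarity, which is why the embeddings are normalized before being added to an inner-product index. Below is a small pure-Python check of that identity, using made-up 3-dimensional vectors rather than real embeddings; the helper names are illustrative.

```python
import math
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from the unnormalized vectors."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

# Toy vectors, not real embeddings.
a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]
print("cosine:", cosine(a, b))
print("inner product of normalized vectors:", dot(l2_normalize(a), l2_normalize(b)))
```

The two printed values agree up to floating-point rounding.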

## 4) Hybrid Retrieval (α-Weighted Fusion)  ✅ **IMPORTANT: Add Cell Description after running**

In [14]:
def normalize_scores(pairs: List[Tuple[int, float]]) -> Dict[int, float]:
    if not pairs:
        return {}
    vals = np.array([s for _, s in pairs], dtype=float)
    vmin, vmax = vals.min(), vals.max()
    if vmax - vmin < 1e-9:
        return {i: 1.0 for i, _ in pairs}
    return {i: (s - vmin) / (vmax - vmin) for i, s in pairs}

def hybrid_search(query: str, k_keyword: int = 10, k_vector: int = 10, alpha: float = 0.5,
                  top_k: int = 10, keyword_mode: str = "bm25") -> List[Tuple[int, float]]:
    kw = keyword_bm25(query, k=k_keyword) if keyword_mode == "bm25" else keyword_tfidf(query, k=k_keyword)
    vec = vector_search(query, k=k_vector)

    kw_n = normalize_scores(kw)
    vec_n = normalize_scores(vec)

    all_ids = set(kw_n) | set(vec_n)
    combined = []
    for i in all_ids:
        score = alpha * kw_n.get(i, 0.0) + (1 - alpha) * vec_n.get(i, 0.0)
        combined.append((i, float(score)))

    combined.sort(key=lambda x: x[1], reverse=True)
    return combined[:top_k]

print("✅ Hybrid retrieval ready. You'll sweep alpha ∈ {0.2, 0.5, 0.8}.")


✅ Hybrid retrieval ready. You'll sweep alpha ∈ {0.2, 0.5, 0.8}.


### ✍️ Cell Description (Student)
Explain what **hybrid fusion** is and what the α parameter means (semantic-heavy vs keyword-heavy).

>This cell implements hybrid retrieval, which fuses keyword-based retrieval (BM25 or TF-IDF) with dense vector retrieval (FAISS embeddings) to produce a single ranked list of chunks. Because the two methods output scores on different scales, the scores are normalized before combining them.
>
>The α parameter controls the tradeoff: higher α makes retrieval more keyword-heavy (favoring exact term matches), while lower α makes it more semantic-heavy (favoring meaning similarity even if wording differs). Sweeping α (0.2, 0.5, 0.8) helps determine which balance works best for this dataset and query types.
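
The fusion arithmetic can be traced by hand on a few scores. The sketch below is pure Python and uses the top-3 BM25 and vector scores printed for Q1 earlier in this notebook; the helper names `min_max` and `fuse` are assumptions for this example, and the notebook's own run fuses the top 10 from each retriever rather than the top 3.

```python
from typing import Dict, List, Tuple

def min_max(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi - lo < 1e-9:
        return {i: 1.0 for i in scores}
    return {i: (s - lo) / (hi - lo) for i, s in scores.items()}

def fuse(kw: Dict[int, float], vec: Dict[int, float], alpha: float) -> List[Tuple[int, float]]:
    """Weighted sum: alpha * keyword score + (1 - alpha) * vector score."""
    kw_n, vec_n = min_max(kw), min_max(vec)
    ids = set(kw_n) | set(vec_n)
    fused = [(i, alpha * kw_n.get(i, 0.0) + (1 - alpha) * vec_n.get(i, 0.0)) for i in ids]
    return sorted(fused, key=lambda t: t[1], reverse=True)

# Top-3 scores for Q1 as printed in the retrieval check above.
bm25_top3 = {120: 32.7048, 122: 14.6328, 126: 14.0907}
vector_top3 = {120: 0.7078, 127: 0.5748, 121: 0.5212}

for alpha in (0.2, 0.5, 0.8):
    print(alpha, [(i, round(s, 3)) for i, s in fuse(bm25_top3, vector_top3, alpha)[:3]])
```

Chunk 120 tops both lists, so it normalizes to 1.0 on each side and its fused score is 1.0 for any α. A chunk returned by only one retriever gets 0 from the other, so its fused score depends entirely on how much weight α gives to the retriever that found it.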

## 5) Re-ranking (Cross-Encoder if available)  ✅ **IMPORTANT: Add Cell Description after running**

In [15]:
USE_CROSS_ENCODER = True
reranker = None

if USE_CROSS_ENCODER:
    try:
        from sentence_transformers import CrossEncoder
        reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        print("✅ Cross-encoder reranker loaded.")
    except Exception as e:
        print("⚠️ Cross-encoder not available. Falling back to no reranking.")
        print("Error:", e)
        reranker = None

def rerank(query: str, candidates: List[Tuple[int, float]], top_k: int = 5) -> List[Tuple[int, float]]:
    ids = [i for i, _ in candidates]
    if reranker is None:
        return candidates[:top_k]
    pairs = [(query, CORPUS[i]) for i in ids]
    scores = reranker.predict(pairs)
    scored = list(zip(ids, scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [(int(i), float(s)) for i, s in scored[:top_k]]

print("✅ Reranking function ready.")


✅ Cross-encoder reranker loaded.
✅ Reranking function ready.


### ✍️ Cell Description (Student)
Explain what reranking does and why it often improves Precision@K (but costs extra compute).

>This cell applies a re-ranking stage that reorders the initially retrieved chunks using a stronger cross-encoder model. The initial retrieval stage scores each chunk from representations built separately from the query (BM25 term statistics or precomputed embeddings), whereas the cross-encoder reads the query and each candidate chunk together and produces a more precise relevance score.
>
>Re-ranking often improves Precision@K because it is applied to a smaller candidate set and can capture fine-grained relevance signals, but it increases computational cost since each query–chunk pair must be scored individually.
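
The two-stage pattern can be sketched without loading a model. Below, `overlap_score` is a hypothetical stand-in for the cross-encoder (it only counts shared words) so the example runs anywhere, and the corpus is a toy one. The point is the shape of the computation: one scoring call per query-chunk pair, applied only to the candidate pool.

```python
from typing import Callable, Dict, List, Tuple

def overlap_score(query: str, text: str) -> float:
    """Toy stand-in scorer: number of distinct query words that appear in the text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))

def rerank_with(score_fn: Callable[[str, str], float], query: str,
                candidates: List[int], corpus: Dict[int, str],
                top_k: int = 2) -> List[Tuple[int, float]]:
    """Score every (query, chunk) pair and keep the top_k highest-scoring chunks."""
    scored = [(i, score_fn(query, corpus[i])) for i in candidates]  # one call per candidate
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]

# Toy corpus for illustration only.
toy_corpus = {
    0: "dilution reduces ownership when new shares are issued",
    1: "the valuation cap sets the maximum valuation for conversion",
    2: "a safe converts into equity at a priced round",
}
print(rerank_with(overlap_score, "valuation cap conversion", [0, 1, 2], toy_corpus))
```

Because the scorer runs once per candidate, cost grows linearly with the pool size: a pool of 20 candidates means 20 model forward passes per query when a real cross-encoder is used.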

## 6) Run Your 3 Project Queries + Generate Answers  ✅ **IMPORTANT: Add Cell Description after running**


In [20]:
# Generator (small + class-friendly)
gen = pipeline("text2text-generation", model="google/flan-t5-base")

def prompt_only_answer(query: str, max_new_tokens: int = 200) -> str:
    return gen(query, max_new_tokens=max_new_tokens)[0]["generated_text"]

def rag_answer(query: str, chunk_ids: List[int], max_new_tokens: int = 220) -> str:
    evidence_lines = []
    for j, i in enumerate(chunk_ids, 1):
        txt = CORPUS[i].strip().replace("\n", " ")
        evidence_lines.append(f"[Chunk {j}] {txt}")
    evidence = "\n".join(evidence_lines)

    prompt = f"""Answer the question using ONLY the evidence.

Question: {query}

Evidence:
{evidence}

Rules:
- Write a short answer in 2-5 sentences.
- After each sentence, include a citation: [Chunk 1] or [Chunk 2] or [Chunk 3].
- If evidence is insufficient, say: Not enough evidence.

Answer:"""

    out = gen(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=4
    )[0]["generated_text"].strip()

    # Remove common echoed instruction prefixes if the model repeats them
    out = re.sub(r"^(Rules:|Answer:)\s*", "", out).strip()
    out = re.sub(r"^Write a short answer in 2-5 sentences\.\s*", "", out).strip()
    out = re.sub(r"^Write 2 to 5 short sentences\.\s*", "", out).strip()

    # Enforce at least one citation per sentence
    allowed = ["[Chunk 1]", "[Chunk 2]", "[Chunk 3]"]
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', out) if s.strip()]

    fixed = []
    fallback_cite = allowed[0]
    for s in sentences:
        if any(tag in s for tag in allowed):
            fixed.append(s)
        else:
            fixed.append(s + " " + fallback_cite)

    # If model outputs a single fragment, keep it but ensure citation
    if not fixed and out:
        fixed = [out + " " + fallback_cite]

    return " ".join(fixed)



def show_top(pairs: List[Tuple[int, float]], title: str, k: int = 5):
    print(f"\n=== {title} (Top {k}) ===")
    for r, (i, s) in enumerate(pairs[:k], 1):
        snip = CORPUS[i].replace("\n", " ")
        snip = snip[:220] + ("..." if len(snip) > 220 else "")
        print(f"{r:>2}. id={i:<6} score={s:>8.4f} | {snip}")

# ✅ REQUIRED: Replace with your project queries
queries = [
    ("Q1", project_queries["Q1"]["query"]),
    ("Q2", project_queries["Q2"]["query"]),
    ("Q3 (ambiguous)", project_queries["Q3_ambiguous"]["query"]),
]

alphas = [0.2, 0.5, 0.8]
results_summary = []

for label, q in queries:
    print("\n" + "="*90)
    print(f"{label}: {q}")

    kw = keyword_bm25(q, k=10)
    vec = vector_search(q, k=10)
    show_top(kw, "BM25 Keyword")
    show_top(vec, "Vector (FAISS)")

    hybrids = []
    for a in alphas:
        hyb = hybrid_search(q, alpha=a, top_k=10, keyword_mode="bm25")
        hybrids.append((a, hyb))
        show_top(hyb, f"Hybrid (alpha={a})")

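    # Heuristic: pick the alpha whose top-10 fused scores have the highest mean.
    # This compares normalized scores, not labeled relevance, so treat it as a rough default.
    best_a, _ = max(hybrids, key=lambda t: np.mean([s for _, s in t[1]]) if t[1] else -1)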
    print(f"\nSelected hybrid alpha={best_a}")

    candidate_pool = hybrid_search(q, alpha=best_a, top_k=20, keyword_mode="bm25")
    reranked = rerank(q, candidate_pool, top_k=5)
    show_top(reranked, "Re-ranked")

    top3_ids = [i for i, _ in reranked[:3]]
    print("\nTop-3 evidence chunk IDs:", top3_ids)

    po = prompt_only_answer(q)
    ra = rag_answer(q, top3_ids)

    print("\n--- Prompt-only answer ---\n", po)
    print("\n--- RAG-grounded answer ---\n", ra)

    results_summary.append({
        "query": q,
        "best_alpha": best_a,
        "top3_chunk_ids": top3_ids,
        "prompt_only": po,
        "rag": ra,
    })

results_summary[:1]


Device set to use cuda:0



Q1: How does a SAFE convert into equity and what role does the valuation cap play?

=== BM25 Keyword (Top 5) ===
 1. id=120    score= 32.7048 | A SAFE (Simple Agreement for Future Equity) is an investment instrument commonly used in early-stage venture financing. Unlike traditional debt, a SAFE does not accrue interest and does not have a maturity date. Instead,...
 2. id=122    score= 14.6328 | In venture finance, price per share (PPS) is calculated by dividing the company’s pre-money valuation by the number of pre-money shares outstanding. This value determines how many shares an investor receives for a given ...
 3. id=126    score= 14.0907 | Convertible notes and SAFEs are both instruments used to delay valuation in early-stage investments. Convertible notes are debt instruments that accrue interest and have a maturity date. If the note does not convert by m...
 4. id=124    score= 14.0124 | Dilution occurs when a company issues new shares, reducing the ownership percentage of exi

[{'query': 'How does a SAFE convert into equity and what role does the valuation cap play?',
  'best_alpha': 0.2,
  'top3_chunk_ids': [120, 122, 130],
  'prompt_only': "The valuation cap is a measure of the value of a company's assets.",
  'rag': 'After each sentence, include a citation: [Chunk 1] or [Chunk 2] or [Chunk 3].'}]

### ✍️ Cell Description (Student)
Explain how you compared keyword/vector/hybrid retrieval, how you selected α, and how reranking affected the evidence.

>In this experiment, I compared keyword-based retrieval (BM25), dense vector retrieval (FAISS), and a hybrid approach that combines both using α-weighted score fusion. BM25 performed well when queries contained explicit domain terms (e.g., “valuation cap” or “price per share”), while vector retrieval captured semantic similarity when terminology varied or was implicit.
>
>I evaluated hybrid retrieval at three α values (0.2, 0.5, 0.8) and, for each query, selected the α whose top-10 hybrid results had the highest mean fused score. This picked α = 0.2 for all three queries, which emphasizes semantic matching while retaining keyword signals. After hybrid retrieval, a cross-encoder reranker rescored the top 20 candidates and kept the top 5, which consistently promoted more contextually precise chunks to the top of the ranking and improved the quality of evidence used for grounded answer generation.
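
The α-weighted score fusion described above can be sketched as follows. This is a minimal illustration, not the notebook's `hybrid_search` (which is defined in an earlier cell). It assumes that `alpha` weights the keyword score and `1 - alpha` weights the vector score, which is consistent with the statement that α = 0.2 emphasizes semantic matching, and it assumes min-max normalization to put BM25 and cosine scores on a common scale. The names `min_max` and `fuse_scores` and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {i: 0.0 for i in scores}
    return {i: (s - lo) / (hi - lo) for i, s in scores.items()}


def fuse_scores(
    keyword: Dict[int, float],
    vector: Dict[int, float],
    alpha: float,
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Assumed convention: alpha * keyword + (1 - alpha) * vector, after min-max scaling."""
    kw, vec = min_max(keyword), min_max(vector)
    ids = set(kw) | set(vec)
    # A chunk missing from one retriever contributes 0.0 for that retriever
    fused = {i: alpha * kw.get(i, 0.0) + (1 - alpha) * vec.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda p: p[1], reverse=True)[:top_k]


# Toy scores, made up for illustration (not taken from the lab run)
bm25 = {120: 30.0, 122: 15.0, 126: 10.0}
dense = {120: 0.70, 130: 0.90, 122: 0.50}

for a in (0.2, 0.8):
    print(a, fuse_scores(bm25, dense, alpha=a, top_k=3))
```

With these toy inputs, a low α ranks the chunk favored by the dense retriever first, and a high α ranks the chunk favored by BM25 first. Fused scores computed under different α values are not directly comparable, so picking α by mean fused score is a heuristic rather than a relevance measurement.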

## 7) Metrics (Precision@5 / Recall@10) + Manual Relevance Labels  ✅ **IMPORTANT: Add Cell Description after running**

In [23]:
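def precision_at_k(retrieved: List[int], relevant: Set[int], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    # Divides by the number of results actually returned, which can be < k
    return sum(1 for i in top if i in relevant) / len(top)

def recall_at_k(retrieved: List[int], relevant: Set[int], k: int = 10) -> float:
    """Fraction of the labeled relevant chunk IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)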

# ✅ REQUIRED: Label a small set of relevant chunk IDs for each query (after inspecting retrieval results).
relevance_labels = {
    "Q1": {120, 122, 130},
    "Q2": {122, 123, 125},
    "Q3": {127, 130, 120},
}

relevance_labels


{'Q1': {120, 122, 130}, 'Q2': {122, 123, 125}, 'Q3': {120, 127, 130}}

### ✍️ Cell Description (Student)
Explain what Precision@K and Recall@K mean in the context of RAG retrieval, and how you labeled relevance.

>Precision@K is the fraction of the top K retrieved chunks that are relevant to the query, while Recall@K is the fraction of all labeled relevant chunks that appear within the top K results. In a RAG system, high precision keeps irrelevant context out of the prompt, and high recall ensures that important evidence is not missed.
>
>Relevance labels were assigned manually by inspecting the retrieved and re-ranked chunks for each query and identifying which chunks contained factual information required by the query rubric. These labels serve as ground truth for evaluating retrieval quality.
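
A short worked example may help when reading the metrics. With three labeled relevant chunks per query, Precision@5 cannot exceed 3/5 = 0.6, so a value of 0.6 means every labeled chunk landed in the top 5. The sketch below repeats the two metric functions from this section so it runs on its own. The `relevant` set is the Q1 label set from this notebook, while the `retrieved` ranking is hypothetical.

```python
from typing import List, Set


def precision_at_k(retrieved: List[int], relevant: Set[int], k: int = 5) -> float:
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if i in relevant) / len(top)


def recall_at_k(retrieved: List[int], relevant: Set[int], k: int = 10) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


relevant = {120, 122, 130}  # Q1 labels from this notebook
# Hypothetical ranking, made up for illustration
retrieved = [120, 122, 130, 126, 124, 121, 123, 125, 127, 128]

p5 = precision_at_k(retrieved, relevant, k=5)
r10 = recall_at_k(retrieved, relevant, k=10)
ceiling = min(len(relevant), 5) / 5  # best achievable P@5 with 3 labels

print(p5, r10, ceiling)  # 0.6 1.0 0.6
```

Because the ceiling depends on how many chunks are labeled, Precision@5 values are easier to interpret when reported alongside the number of relevant labels per query.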

In [27]:
metrics_rows = []

qid_by_index = {0: "Q1", 1: "Q2", 2: "Q3"}

for idx, row in enumerate(results_summary):
    q_text = row["query"]
    qid = qid_by_index[idx]
    alpha = row["best_alpha"]
    rel = relevance_labels[qid]

    m = evaluate_query(q_text, rel, alpha)
    m.update({
        "query": q_text,
        "query_id": qid,
        "alpha_used": alpha,
        "num_relevant_labeled": len(rel),
    })
    metrics_rows.append(m)

metrics_df = pd.DataFrame(metrics_rows)
metrics_df

| | P@5_keyword | R@10_keyword | P@5_vector | R@10_vector | P@5_hybrid | R@10_hybrid | query | query_id | alpha_used | num_relevant_labeled |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.6 | 1.0 | 0.6 | 1.0 | 0.6 | 1.0 | How does a SAFE convert into equity and what r... | Q1 | 0.2 | 3 |
| 1 | 0.6 | 1.0 | 0.4 | 1.0 | 0.6 | 1.0 | How is price per share calculated in venture f... | Q2 | 0.2 | 3 |
| 2 | 0.6 | 1.0 | 0.6 | 1.0 | 0.6 | 1.0 | What happens when multiple SAFEs with differen... | Q3 | 0.2 | 3 |


## 8) README Checklist (Deliverables)

Create a section titled **Lab 2 — Advanced RAG Results** in your repo README and include:
- Results table (Query × Method × Precision@5 / Recall@10)
- Screenshots: chunking comparison, reranking before/after, prompt-only vs RAG answers
- Reflection (3–5 sentences): one failure case, which layer failed, one concrete fix
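
One possible way to produce the Query × Method results table for the README is sketched below. It reshapes the wide per-query metrics into one row per query and method and prints a Markdown table. The metric values are copied from the Section 7 output; the helper name `to_markdown` and the plain-dict input (rather than `metrics_df`) are assumptions made to keep the example dependency-free.

```python
from typing import Any, Dict, List

# Values copied from the Section 7 metrics output
metrics_rows: List[Dict[str, Any]] = [
    {"query_id": "Q1", "P@5_keyword": 0.6, "R@10_keyword": 1.0,
     "P@5_vector": 0.6, "R@10_vector": 1.0, "P@5_hybrid": 0.6, "R@10_hybrid": 1.0},
    {"query_id": "Q2", "P@5_keyword": 0.6, "R@10_keyword": 1.0,
     "P@5_vector": 0.4, "R@10_vector": 1.0, "P@5_hybrid": 0.6, "R@10_hybrid": 1.0},
    {"query_id": "Q3", "P@5_keyword": 0.6, "R@10_keyword": 1.0,
     "P@5_vector": 0.6, "R@10_vector": 1.0, "P@5_hybrid": 0.6, "R@10_hybrid": 1.0},
]

METHODS = ("keyword", "vector", "hybrid")


def to_markdown(rows: List[Dict[str, Any]]) -> str:
    """Render one Markdown table row per (query, method) pair."""
    lines = [
        "| Query | Method | Precision@5 | Recall@10 |",
        "|---|---|---|---|",
    ]
    for row in rows:
        qid = row["query_id"]
        for method in METHODS:
            p = row[f"P@5_{method}"]
            r = row[f"R@10_{method}"]
            lines.append(f"| {qid} | {method} | {p:.2f} | {r:.2f} |")
    return "\n".join(lines)


print(to_markdown(metrics_rows))
```

If `metrics_df` is available, `metrics_df.to_dict("records")` should yield rows in the same shape that can be passed to `to_markdown`.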

### Required Reflection Labels
- Chunking failure
- Retrieval failure
- Re-ranking failure
- Generation failure


## 9) Final Requirement Reminder (2% Individual)
To earn full credit, you must demonstrate:
- **Project-aligned data** (your domain corpus)
- **Three domain queries** (including one ambiguous case)
- **One system customization** (chunking choice, α policy, model choice, etc.)
- **One real failure case + fix**
