# Manual RAG Pipeline: Mechanisms First

This notebook builds a Retrieval-Augmented Generation (RAG) pipeline from scratch.
You'll see every step explicitly before we move to frameworks like LangChain.

**Works on:** Google Colab, Local Jupyter (Mac/Windows/Linux)

**Pipeline Overview:**
```
Documents → Chunking → Embedding → Index (FAISS)
                                        ↓
User Query → Embed Query → Similarity Search → Top-K Chunks
                                                    ↓
                                        Prompt Assembly → LLM → Answer
```

## TODO: Topic 5 RAG Course Project Checklist

- **Exercise 0:** Set-up - Get notebook running; unzip Corpora.zip. Use PDFs from `Corpora/<corpus>/pdf_embedded/`.
- **Exercise 1:** Open model RAG vs no RAG - Compare Qwen 2.5 1.5B with/without RAG on Model T manual and Congressional Record.
- **Exercise 2:** Open model + RAG vs large model - Run GPT-4o Mini with no tools on same queries.
- **Exercise 3:** Open model + RAG vs frontier chat - Compare local Qwen+RAG vs GPT-4/Claude (web).
- **Exercise 4:** Effect of top-K - Test k = 1, 3, 5, 10, 20.
- **Exercise 5:** Unanswerable questions - Off-topic, related-but-missing, false premise.
- **Exercise 6:** Query phrasing sensitivity - Same question in 5+ phrasings.
- **Exercise 7:** Chunk overlap - Re-chunk with overlap 0, 64, 128, 256.
- **Exercise 8:** Chunk size - Chunk at 128, 256, 512, 1024, 2048.
- **Exercise 9:** Retrieval score analysis - 10 queries, top-10 chunks, score distribution.
- **Exercise 10:** Prompt template variations - Minimal, strict grounding, citation, permissive, structured.
- **Exercise 11:** Failure mode catalog - Computation, temporal, comparison, ambiguous, multi-hop, etc.
- **Exercise 12:** Cross-document synthesis - Questions needing multiple chunks.

## Setup

First, let's install the required packages and detect our compute environment.

In [None]:
# Install dependencies
# On Colab, these install quickly. Locally, you may already have them.
# Use a kernel-aware install when available; fall back to subprocess otherwise.
try:
    ip = get_ipython()
    ip.run_line_magic('pip', 'install -q torch transformers sentence-transformers faiss-cpu pymupdf accelerate ipyfilechooser openai')
except NameError:
    import subprocess, sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'torch', 'transformers', 'sentence-transformers', 'faiss-cpu', 'pymupdf', 'accelerate', 'ipyfilechooser', 'openai'])
# 'openai' is already included above; it is only needed for Exercise 2 (GPT-4o Mini).

In [None]:
# =============================================================================
# ENVIRONMENT AND DEVICE DETECTION
# =============================================================================
import os
import sys

# Enable MPS fallback for any PyTorch operations not yet implemented on Metal
# This MUST be set before importing torch
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

# Prevent kernel crash from duplicate OpenMP libraries (PyTorch + FAISS conflict on macOS)
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

import torch
from typing import Tuple

def detect_environment() -> str:
    """Detect if we're running on Colab or locally."""
    try:
        import google.colab
        return 'colab'
    except ImportError:
        return 'local'

def get_device() -> Tuple[str, torch.dtype]:
    """
    Detect the best available compute device.

    Priority: CUDA > MPS (Apple Silicon) > CPU

    Returns:
        Tuple of (device_string, recommended_dtype)

    Notes:
        - CUDA: Use float16 for memory efficiency (Tensor Cores optimize this)
        - MPS: Use float32 - Apple Silicon doesn't have the same float16
               optimizations as NVIDIA, and float32 is often faster
        - CPU: Use float32 (float16 not well supported on CPU)
    """
    if torch.cuda.is_available():
        device = 'cuda'
        dtype = torch.float16
        device_name = torch.cuda.get_device_name(0)
        memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"✓ Using CUDA GPU: {device_name} ({memory_gb:.1f} GB)")

    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        device = 'mps'
        dtype = torch.float32  # float32 is often faster on Apple Silicon!
        print("✓ Using Apple Silicon GPU (MPS)")
        print("  Note: Using float32 (faster than float16 on Apple Silicon)")

    else:
        device = 'cpu'
        dtype = torch.float32
        print("⚠ Using CPU (no GPU detected)")
        print("  Tip: For faster processing, use a machine with a GPU")

    return device, dtype

# Detect environment and device
ENVIRONMENT = detect_environment()
DEVICE, DTYPE = get_device()

print(f"\nEnvironment: {ENVIRONMENT.upper()}")
print(f"Device: {DEVICE}, Dtype: {DTYPE}")

✓ Using CUDA GPU: Tesla T4 (15.6 GB)

Environment: COLAB
Device: cuda, Dtype: torch.float16


## Load Your Documents

**Cell 1:** Configure your document source and select/upload files
- **Local Jupyter**: Use the folder picker, then run Cell 2
- **Colab + Upload**: Files upload immediately (blocking), then run Cell 2
- **Colab + Drive**: Set `USE_GOOGLE_DRIVE = True`; the cell mounts Drive and shows a folder picker. Select a folder, then run Cell 2

**Cell 2:** Confirms selection and lists documents

In [None]:
# =============================================================================
# CELL 1: SELECT DOCUMENT SOURCE
# =============================================================================
# This cell either:
#   - Shows a folder picker (Local or Colab+Drive) - NON-BLOCKING
#   - Shows an upload dialog (Colab+Upload) - BLOCKING
#
# If a folder picker is shown, SELECT YOUR FOLDER BEFORE running Cell 2.
# The picker widget is non-blocking, so the code continues before you select.
# =============================================================================

from pathlib import Path

# ------------- COLAB USERS: CONFIGURE HERE -------------
USE_GOOGLE_DRIVE = True  # True: pick a folder from Google Drive; False: upload files instead
# -------------------------------------------------------

# Default folder: use Corpora from course project (unzip Corpora.zip first).
_folder_default = Path("Corpora/ModelTService")
DOC_FOLDER = str(_folder_default) if _folder_default.exists() else "documents"
folder_chooser = None  # Will hold the picker widget if used

if ENVIRONMENT == 'colab':
    if USE_GOOGLE_DRIVE:
        # ----- COLAB + GOOGLE DRIVE -----
        # Mount Drive first, then show folder picker
        from google.colab import drive
        print("Mounting Google Drive...")
        drive.mount('/content/drive')
        print("✓ Google Drive mounted\n")

        # Now show folder picker for the Drive
        try:
            from ipyfilechooser import FileChooser

            folder_chooser = FileChooser(
                path='/content/drive/MyDrive',
                title='Select your documents folder in Google Drive',
                show_only_dirs=True,
                select_default=True
            )
            print("📁 Select your documents folder below, then run Cell 2:")
            print("   (The picker is non-blocking - select BEFORE running the next cell)")
            display(folder_chooser)

        except ImportError:
            # Fallback: manual path entry
            print("Folder picker not available.")
            print("Edit DOC_FOLDER below with your Google Drive path, then run Cell 2:")
            DOC_FOLDER = '/content/drive/MyDrive/Corpora'  # ← Edit this!
            print(f"  DOC_FOLDER = '{DOC_FOLDER}'")
    else:
        # ----- COLAB + UPLOAD -----
        # Upload dialog blocks until complete, so DOC_FOLDER is ready when done
        from google.colab import files
        os.makedirs(DOC_FOLDER, exist_ok=True)

        print("Upload your documents (PDF, TXT, or MD):")
        print("(This dialog blocks until upload is complete)\n")
        uploaded = files.upload()

        for filename in uploaded.keys():
            os.rename(filename, f'{DOC_FOLDER}/{filename}')
            print(f"  ✓ Saved: {DOC_FOLDER}/{filename}")

        print("\n✓ Upload complete. Run Cell 2 to continue.")

else:
    # ----- LOCAL JUPYTER -----
    # Show folder picker
    print("Running locally\n")

    try:
        from ipyfilechooser import FileChooser

        folder_chooser = FileChooser(
            path=str(Path.home()),
            title='Select your documents folder',
            show_only_dirs=True,
            select_default=True
        )
        print("📁 Select your documents folder below, then run Cell 2:")
        print("   (The picker is non-blocking - select BEFORE running the next cell)")
        display(folder_chooser)

    except ImportError:
        # Fallback: manual path entry
        print("Folder picker not available (ipyfilechooser not installed).")
        print(f"\nUsing default folder: {Path(DOC_FOLDER).absolute()}")
        print("\nTo use a different folder, edit DOC_FOLDER in this cell:")
        print("  DOC_FOLDER = '/path/to/your/documents'")
        os.makedirs(DOC_FOLDER, exist_ok=True)

Mounting Google Drive...
Mounted at /content/drive
✓ Google Drive mounted

📁 Select your documents folder below, then run Cell 2:
   (The picker is non-blocking - select BEFORE running the next cell)



In [None]:
# =============================================================================
# CELL 2: CONFIRM SELECTION AND LIST DOCUMENTS
# =============================================================================
# If you used a folder picker above, make sure you selected a folder
# BEFORE running this cell. The picker is non-blocking.
# =============================================================================

# Read selection from folder picker (if one was used)
if folder_chooser is not None and folder_chooser.selected_path:
    DOC_FOLDER = folder_chooser.selected_path
    print(f"✓ Using selected folder: {DOC_FOLDER}")
elif folder_chooser is not None:
    print("⚠ No folder selected in picker!")
    print("  Please go back to Cell 1, select a folder, then run this cell again.")
else:
    # No picker used (upload or manual path)
    print(f"✓ Using folder: {DOC_FOLDER}")

# Confirm folder (listing skipped for speed)
doc_path = Path(DOC_FOLDER)
if doc_path.exists():
    print(f"✓ Folder set: {doc_path.absolute()}")
    print("  Run the next cells to load, chunk, and index documents.")
else:
    print(f"⚠ Folder not found: {DOC_FOLDER}")
    print("  Please set DOC_FOLDER in the previous cell and run it again.")

✓ Using selected folder: /content/drive/MyDrive/Corpora/NewModelT
✓ Folder set: /content/drive/MyDrive/Corpora/NewModelT
  Run the next cells to load, chunk, and index documents.


---
## Stage 1: Document Loading

We need to extract text from our documents. For PDFs with embedded text,
PyMuPDF (fitz) reads the text layer directly - no OCR needed.

**Corpora:** Use PDFs from `Corpora/<name>/pdf_embedded/`. The `.txt` files in `txt/` are for checking retrieval vs OCR issues.

In [None]:
# Exercise 1 (reused in later exercises): official query lists.
# Congressional Record (CR) reference dates: Jan 13, 20, 21, and 23, 2026.
QUERIES_MODEL_T = [
    "How do I adjust the carburetor on a Model T?",
    "What is the correct spark plug gap for a Model T Ford?",
    "How do I fix a slipping transmission band?",
    "What oil should I use in a Model T engine?",
]
QUERIES_CR = [
    "What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?",
    "What mistake did Elise Stefanik make in Congress on January 23, 2026?",
    "What is the purpose of the Main Street Parity Act?",
    "Who in Congress has spoken for and against funding of pregnancy centers?",
]

In [None]:
import fitz  # PyMuPDF
from typing import List, Tuple

def load_text_file(filepath: str) -> str:
    """Load a plain text file."""
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
        return f.read()


def load_pdf_file(filepath: str) -> str:
    """
    Extract text from a PDF with embedded text.

    PyMuPDF reads the text layer directly.
    For scanned PDFs without embedded text, you'd need OCR.
    """
    doc = fitz.open(filepath)
    text_parts = []

    for page_num, page in enumerate(doc):
        text = page.get_text()
        if text.strip():
            # Add page marker for debugging/citation
            text_parts.append(f"\n[Page {page_num + 1}]\n{text}")

    doc.close()
    return "\n".join(text_parts)


def load_documents(doc_folder: str) -> List[Tuple[str, str]]:
    """Load all documents from a folder. Returns list of (filename, content)."""
    documents = []
    folder = Path(doc_folder)

    for filepath in folder.rglob("*"):
        try:
            if not filepath.is_file():
                continue
        except OSError:
            continue
        if filepath.suffix.lower() not in ('.pdf', '.txt', '.md', '.text'):
            continue
        try:
            if filepath.suffix.lower() == '.pdf':
                content = load_pdf_file(str(filepath))
            elif filepath.suffix.lower() in ['.txt', '.md', '.text']:
                content = load_text_file(str(filepath))
            else:
                continue

            if content.strip():
                documents.append((filepath.name, content))
                print(f"✓ Loaded: {filepath.name} ({len(content):,} chars)")
        except Exception as e:
            print(f"✗ Error loading {filepath}: {e}")

    return documents

In [None]:
# Load your documents
documents = load_documents(DOC_FOLDER)
print(f"\nLoaded {len(documents)} documents")

if len(documents) == 0:
    print("\n⚠ No documents loaded! Please add PDF or TXT files to the documents folder.")

✓ Loaded: ModelTNew.txt (545,492 chars)
✓ Loaded: ModelTNew.pdf (469,891 chars)

Loaded 2 documents


In [None]:
# Inspect a document to verify loading worked
if documents:
    filename, content = documents[0]
    print(f"First document: {filename}")
    print(f"Total length: {len(content):,} characters")
    print(f"\nFirst 1000 characters:\n{'-'*40}")
    print(content[:1000])

First document: ModelTNew.txt
Total length: 545,492 characters

First 1000 characters:
----------------------------------------
SERVI

 Detailed Instructions for
  Servicing Ford Gars




    PRICE $250



         Published by




 DETROIT, MICHIGAN, U. S. A.
                                         Contents

Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    111
Essentials of good service. . . . : . . . . . : . . . . . . . . . . . . . . . . . . . . . . .               ix
Ideal shop layout for average size dealer. . . . . . . . . . . . . . . . . . . . .                           x
Essential shop equipment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  xi
The parts department. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                                                                                                            ...
                                   

---
## Stage 2: Chunking

Documents need to be split into pieces small enough to be relevant but large enough to carry meaning.

**Why overlap?** If a key sentence sits right at a chunk boundary, splitting without overlap might cut it in half. Overlap ensures that information near boundaries appears intact in at least one chunk.

**Experiment:** Try different chunk sizes (256, 512, 1024) and see how it affects retrieval!
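
To make the overlap idea concrete, here is a minimal sketch that uses a plain fixed-size sliding window over characters. It is deliberately simpler than the `chunk_text` function defined in the next cell (it does not look for paragraph or sentence boundaries), and both the helper name `sliding_chunks` and the sample text are made up for illustration.

```python
from typing import List


def sliding_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Hypothetical helper: fixed-size character windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    return [text[i:i + size] for i in range(0, len(text), step)]


# Toy text, invented for this example.
text = "Check the oil level daily. Replace the filter every month."
key = "Replace the filter"

no_overlap = sliding_chunks(text, size=40, overlap=0)
with_overlap = sliding_chunks(text, size=40, overlap=20)

# Without overlap the phrase is cut at character 40 ("Replace the f" | "ilter").
print(any(key in c for c in no_overlap))    # False
# With overlap, the window starting at character 20 contains the whole phrase.
print(any(key in c for c in with_overlap))  # True
```

The trade-off is that overlap stores some text more than once, so the index grows and near-duplicate chunks can appear together in the top-k results.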

In [None]:
from dataclasses import dataclass

@dataclass
class Chunk:
    """A chunk of text with metadata for tracing back to source."""
    text: str
    source_file: str
    chunk_index: int
    start_char: int
    end_char: int


def chunk_text(
    text: str,
    source_file: str,
    chunk_size: int = 512,
    chunk_overlap: int = 128
) -> List[Chunk]:
    """
    Split text into overlapping chunks.

    We try to break at sentence or paragraph boundaries
    to avoid cutting mid-thought.
    """
    chunks = []
    start = 0
    chunk_index = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))  # Clamp so end_char never exceeds the text length

        # Try to break at a good boundary
        if end < len(text):
            # Look for paragraph break first
            para_break = text.rfind('\n\n', start + chunk_size // 2, end)
            if para_break != -1:
                end = para_break + 2
            else:
                # Look for sentence break
                sentence_break = text.rfind('. ', start + chunk_size // 2, end)
                if sentence_break != -1:
                    end = sentence_break + 2

        chunk_text_str = text[start:end].strip()

        if chunk_text_str:
            chunks.append(Chunk(
                text=chunk_text_str,
                source_file=source_file,
                chunk_index=chunk_index,
                start_char=start,
                end_char=end
            ))
            chunk_index += 1

        # Move forward, accounting for overlap
        start = end - chunk_overlap
        if chunks and start <= chunks[-1].start_char:
            start = end  # Safety: ensure progress

    return chunks

In [None]:
# ============================================
# EXPERIMENT: Try different chunk sizes!
# ============================================
CHUNK_SIZE = 512      # Try: 256, 512, 1024
CHUNK_OVERLAP = 128   # Try: 64, 128, 256
# For Exercises 7 and 8, use rebuild_pipeline(), defined in the cell after the FAISS index.

# Chunk all documents
all_chunks = []
for filename, content in documents:
    doc_chunks = chunk_text(content, filename, CHUNK_SIZE, CHUNK_OVERLAP)
    all_chunks.extend(doc_chunks)
    print(f"{filename}: {len(doc_chunks)} chunks")

print(f"\nTotal: {len(all_chunks)} chunks")

ModelTNew.txt: 1781 chunks
ModelTNew.pdf: 1496 chunks

Total: 3277 chunks


In [None]:
# Inspect some chunks
if all_chunks:
    print("Sample chunks:")
    indices_to_show = [0, len(all_chunks)//2, -1] if len(all_chunks) > 2 else range(len(all_chunks))
    for i in indices_to_show:
        chunk = all_chunks[i]
        print(f"\n{'='*60}")
        print(f"Chunk {chunk.chunk_index} from {chunk.source_file}")
        print(f"{'='*60}")
        print(chunk.text[:300] + "..." if len(chunk.text) > 300 else chunk.text)

Sample chunks:

Chunk 0 from ModelTNew.txt
SERVI

 Detailed Instructions for
  Servicing Ford Gars




    PRICE $250



         Published by




 DETROIT, MICHIGAN, U. S. A.
                                         Contents

Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    111
Ess...

Chunk 1638 from ModelTNew.txt
Fig. 569
                          FORD SERVICE                             283




                                Fig. 570

1262 Raise the seat back approximately 2", this will release the clip
  which holds seat back to spacer board. Seat back can then be lifted
  out of car as shown in Fig. 567...

Chunk 1495 from ModelTNew.pdf
tening.. . . ....... . . . . .. . .. . .. 
74 
removing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
57 
Wiring diagram, cars equipped with starter (Page 19) 
not equipped with starter (Page 20) 
improved cars equipped with starter (Page 287) · 
not equipped

---
## Stage 3: Embedding

Embeddings map text to dense vectors where **semantic similarity = geometric proximity**.

A sentence about "cardiac arrest" and one about "heart attack" will have similar embeddings even though they share no words.

**Note:** Depending on the installed version, sentence-transformers may not auto-detect Apple MPS, so we pass the device explicitly.
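
The sketch below contrasts a word-overlap score with cosine similarity on vectors, to show why "share no words" is a problem for keyword matching but not for embeddings. The three-dimensional vectors are invented for illustration (a real model such as all-MiniLM-L6-v2 produces hundreds of dimensions), and the helper names `jaccard` and `cosine` are hypothetical.

```python
import math
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Word-overlap score: shared words divided by all distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Keyword matching sees nothing in common between the two phrases.
print(jaccard("cardiac arrest", "heart attack"))  # 0.0

# Made-up "embeddings": the two medical phrases point in a similar direction.
heart_attack = [0.9, 0.1, 0.0]
cardiac_arrest = [0.8, 0.2, 0.1]
senate_session = [0.0, 0.1, 0.9]

print(cosine(heart_attack, cardiac_arrest))  # close to 1
print(cosine(heart_attack, senate_session))  # close to 0
```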

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
# Options:
# - "sentence-transformers/all-MiniLM-L6-v2": Fast, small (~90 MB download), good quality
# - "BAAI/bge-small-en-v1.5": Better for retrieval, similar size

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

print(f"Loading embedding model: {EMBEDDING_MODEL}")
print(f"Device: {DEVICE}")

# Pass the device explicitly so MPS is used even if it is not auto-detected
embed_model = SentenceTransformer(EMBEDDING_MODEL, device=DEVICE)
EMBEDDING_DIM = embed_model.get_sentence_embedding_dimension()
print(f"Embedding dimension: {EMBEDDING_DIM}")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Device: cuda


Embedding dimension: 384


In [None]:
# DEMO: See how embeddings capture semantic similarity
test_sentences = [
    "The engine needs regular oil changes.",
    "Motor oil should be replaced periodically.",
    "The Senate convened at noon.",
    "Congress began its session at midday."
]

test_embeddings = embed_model.encode(test_sentences)

# Compute cosine similarity matrix
from numpy.linalg import norm

def cosine_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

print("Cosine similarity matrix:")
print("\n" + " " * 40 + "  [0]    [1]    [2]    [3]")
for i, s1 in enumerate(test_sentences):
    sims = [cosine_sim(test_embeddings[i], test_embeddings[j]) for j in range(4)]
    print(f"[{i}] {s1[:35]:35} {sims[0]:.3f}  {sims[1]:.3f}  {sims[2]:.3f}  {sims[3]:.3f}")

print("\n→ Notice: [0]-[1] are similar (both about oil), [2]-[3] are similar (both about Congress)")

Cosine similarity matrix:

                                          [0]    [1]    [2]    [3]
[0] The engine needs regular oil change 1.000  0.728  -0.045  -0.032
[1] Motor oil should be replaced period 0.728  1.000  0.014  0.035
[2] The Senate convened at noon.        -0.045  0.014  1.000  0.684
[3] Congress began its session at midda -0.032  0.035  0.684  1.000

→ Notice: [0]-[1] are similar (both about oil), [2]-[3] are similar (both about Congress)


In [None]:
# Embed all chunks - this may take a few minutes for large corpora
if all_chunks:
    print(f"Embedding {len(all_chunks)} chunks on {DEVICE}...")
    chunk_texts = [c.text for c in all_chunks]
    chunk_embeddings = embed_model.encode(chunk_texts, show_progress_bar=True)
    chunk_embeddings = chunk_embeddings.astype('float32')  # FAISS wants float32
    print(f"Embeddings shape: {chunk_embeddings.shape}")
else:
    print("No chunks to embed - please load documents first.")

Embedding 3277 chunks on cuda...


Embeddings shape: (3277, 384)


---
## Stage 4: Vector Index (FAISS)

FAISS efficiently finds nearest neighbors in high-dimensional spaces.

We use a simple **flat index** (brute-force search) which is transparent and works well for up to ~100k vectors. For larger corpora, you'd use approximate methods like IVF or HNSW.

**Note:** FAISS GPU support is CUDA-only. This notebook installs faiss-cpu on every platform, which is still very fast for <100k vectors.
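
Conceptually, a flat inner-product index scores the query against every stored vector and keeps the best few. The pure-Python sketch below illustrates that idea under simplifying assumptions; it is not how FAISS is implemented, and the names `l2_normalize` and `flat_search` as well as the toy 2-d vectors are hypothetical.

```python
import math
from typing import List, Tuple


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def flat_search(stored: List[List[float]], query: List[float], k: int) -> List[Tuple[float, int]]:
    """Brute force: score the query against every stored vector, return the k best (score, position)."""
    q = l2_normalize(query)
    scored = [(sum(a * b for a, b in zip(vec, q)), i) for i, vec in enumerate(stored)]
    return sorted(scored, reverse=True)[:k]


# Toy vectors, invented for illustration and normalized before "indexing".
stored = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]

hits = flat_search(stored, query=[2.0, 0.1], k=2)
for score, position in hits:
    print(f"position={position} score={score:.3f}")
```

Because every stored vector is visited, the cost grows linearly with corpus size, which is why approximate indexes become attractive for much larger collections.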

In [None]:
import faiss

# Create FAISS index
# IndexFlatIP = Inner Product (for cosine similarity on normalized vectors)
index = faiss.IndexFlatIP(EMBEDDING_DIM)

if all_chunks:
    # Normalize vectors so inner product = cosine similarity
    faiss.normalize_L2(chunk_embeddings)

    # Add vectors to index
    index.add(chunk_embeddings)
    print(f"Index built with {index.ntotal} vectors")
else:
    print("No embeddings to index - please load and embed documents first.")

Index built with 3277 vectors


---
## Stage 5: Retrieval

Now we can search! Given a query, we:
1. Embed the query with the same model
2. Find the top-k most similar chunks
3. Return those chunks as context
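
The three steps can be sketched end to end without any model. In this toy version a bag-of-words count vector stands in for the neural embedding; `toy_embed`, `cosine`, and `toy_retrieve` are hypothetical names, and the sample chunks are invented. The point to notice is that the query and the chunks pass through the same embedding function.

```python
import math
from collections import Counter
from typing import List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_retrieve(query: str, chunks: List[str], top_k: int = 2) -> List[Tuple[str, float]]:
    q = toy_embed(query)                                     # 1. embed the query
    scored = [(c, cosine(q, toy_embed(c))) for c in chunks]  # 2. score every chunk
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]                                    # 3. return the best as context


chunks = [
    "adjust the carburetor needle valve",
    "the senate convened at noon",
    "clean the carburetor and adjust the float",
]
for text, score in toy_retrieve("adjust the carburetor", chunks):
    print(f"{score:.3f}  {text}")
```

A bag-of-words embedding only rewards exact word matches, so it would miss a paraphrased query; that gap is what the sentence-transformer model in this notebook is meant to close.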

In [None]:
# Helper for Exercises 7 & 8: rebuild chunks + index with different chunk_size / chunk_overlap.
def rebuild_pipeline(chunk_size: int = 512, chunk_overlap: int = 128):
    """Re-chunk documents, re-embed, and rebuild FAISS index. Updates global all_chunks and index."""
    global all_chunks, index
    all_chunks = []
    for filename, content in documents:
        all_chunks.extend(chunk_text(content, filename, chunk_size=chunk_size, chunk_overlap=chunk_overlap))
    chunk_embeddings = embed_model.encode([c.text for c in all_chunks], show_progress_bar=True).astype("float32")
    faiss.normalize_L2(chunk_embeddings)
    index = faiss.IndexFlatIP(EMBEDDING_DIM)
    index.add(chunk_embeddings)
    print(f"Rebuilt: {len(all_chunks)} chunks, chunk_size={chunk_size}, chunk_overlap={chunk_overlap}")

In [None]:
def retrieve(query: str, top_k: int = 5):
    """
    Retrieve the top-k most relevant chunks for a query.

    Returns: List of (chunk, similarity_score) tuples
    """
    # Embed the query
    query_embedding = embed_model.encode([query]).astype('float32')
    faiss.normalize_L2(query_embedding)

    # Search
    scores, indices = index.search(query_embedding, top_k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx != -1:
            results.append((all_chunks[idx], float(score)))

    return results

In [None]:
# Test retrieval
# ============================================
# TRY DIFFERENT QUERIES FOR YOUR CORPUS!
# ============================================
test_query = "What is the procedure for engine maintenance?"  # ← Modify this!

if index.ntotal > 0:
    results = retrieve(test_query, top_k=5)

    print(f"Query: {test_query}\n")
    print("Top 5 retrieved chunks:")
    for i, (chunk, score) in enumerate(results, 1):
        print(f"\n[{i}] Score: {score:.4f} | Source: {chunk.source_file}")
        print(f"    {chunk.text[:200]}...")
else:
    print("Index is empty - please load, chunk, and embed documents first.")

Query: What is the procedure for engine maintenance?

Top 5 retrieved chunks:

[1] Score: 0.5769 | Source: ModelTNew.pdf
    en a car is brought in for major repair 
work is to first assign the car to a section of the shop set aside for re- 
pair jobs. The assembly to be overhauled is then removed from the 
car and by means...

[2] Score: 0.5550 | Source: ModelTNew.txt
    be performed.
    When the overhaul work is completed the assembly is returned by
means of the overhead track t o the car from. which it, was removed.
It is then installed in the car and the job is co...

[3] Score: 0.5432 | Source: ModelTNew.pdf
    cleaning. After the cleaning operation it is trans- 
ferred to the stand or repair bench on which the work is to be performed. 
When the overhaul work is completed the assembly is returned by 
means o...

[4] Score: 0.5397 | Source: ModelTNew.txt
    , install car covers, lift off hood , remove cylinder
      head and valve cover .. . . ... . . . ... . . . . .. .........

---
## Stage 6: Generation (LLM)

Now we load a local LLM to generate answers from the retrieved context.

**Recommended models:**
- `Qwen/Qwen2.5-1.5B-Instruct` - Default for this notebook; follows instructions well for its size
- `Qwen/Qwen2.5-3B-Instruct` - Larger option if you have 8 GB+ of VRAM
- `meta-llama/Llama-3.2-1B-Instruct` - Alternative; may be slightly weaker

**Device handling:**
- CUDA: Uses `device_map="auto"` and float16
- MPS: Loads to CPU first, then moves to MPS with float32
- CPU: Uses float32 (slower but works)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# ============================================
# CHOOSE YOUR MODEL
# ============================================
LLM_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # Or try "Qwen/Qwen2.5-3B-Instruct"

print(f"Loading LLM: {LLM_MODEL}")
print(f"Device: {DEVICE}, Dtype: {DTYPE}")
print("This may take a few minutes on first run...\n")

tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL)

# Load with appropriate settings for each device type
if DEVICE == 'cuda':
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL,
        device_map="auto",
        torch_dtype=DTYPE,
        trust_remote_code=True
    )
    print("Model loaded on CUDA")

elif DEVICE == 'mps':
    # For MPS, load to CPU first, then move to MPS
    # (device_map="auto" doesn't work well with MPS)
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL,
        torch_dtype=DTYPE,
        trust_remote_code=True
    )
    model = model.to(DEVICE)
    print("Model loaded on MPS (Apple Silicon)")

else:
    # CPU
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL,
        torch_dtype=DTYPE,
        trust_remote_code=True
    )
    print("Model loaded on CPU (this will be slow)")

Loading LLM: Qwen/Qwen2.5-1.5B-Instruct
Device: cuda, Dtype: torch.float16
This may take a few minutes on first run...



`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded on CUDA


In [None]:
def generate_response(prompt: str, max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    """
    Generate a response from the LLM.

    Lower temperature = more focused/deterministic
    Higher temperature = more creative/random
    """
    inputs = tokenizer(prompt, return_tensors="pt")

    # Move inputs to the correct device
    if DEVICE == 'cuda':
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
    else:
        inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the new tokens
    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )

    return response.strip()

---
## Stage 7: The Complete RAG Pipeline

Now we put it all together. The **prompt template** is critical - it must instruct the model to use the retrieved context.
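
As a rough sketch of what prompt assembly involves, the snippet below joins retrieved chunks into a single context string and fills a template with `str.format`. The template text, the `build_prompt` helper, and the sample chunks are all hypothetical and kept shorter than the notebook's own template.

```python
from typing import List, Tuple

# Hypothetical minimal template; {context} and {question} are str.format fields.
TOY_TEMPLATE = """Use only the context to answer.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""


def build_prompt(question: str, retrieved: List[Tuple[str, float]], template: str = TOY_TEMPLATE) -> str:
    """Join (chunk_text, score) pairs into one context block and fill the template."""
    context = "\n\n---\n\n".join(
        f"[Relevance: {score:.3f}]\n{text}" for text, score in retrieved
    )
    return template.format(context=context, question=question)


# Invented retrieval results for illustration.
retrieved = [("Made-up chunk A.", 0.58), ("Made-up chunk B.", 0.41)]
prompt = build_prompt("What does chunk A say?", retrieved)
print(prompt)
```

One practical detail when writing custom templates: braces inside the retrieved text are safe, but literal braces in the template itself (for example, a JSON output example) need to be doubled as `{{` and `}}` so that `str.format` does not treat them as fields.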

In [None]:
# The RAG prompt template
PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided context.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
- Answer the question based ONLY on the information in the context above
- If the context doesn't contain enough information to answer, say so
- Quote relevant parts of the context to support your answer
- Be concise and direct

ANSWER:"""


def direct_query(question: str, max_new_tokens: int = 512) -> str:
    """Ask the LLM directly with no retrieved context (for RAG vs no-RAG comparison)."""
    prompt = f"""Answer this question:
{question}

Answer:"""
    return generate_response(prompt, max_new_tokens=max_new_tokens)

def rag_query(question: str, top_k: int = 5, show_context: bool = False, prompt_template: str = None) -> str:
    """The complete RAG pipeline. prompt_template: custom template for Exercise 10."""
    # Step 1: Retrieve
    results = retrieve(question, top_k)

    # Format context
    context_parts = []
    for chunk, score in results:
        context_parts.append(f"[Source: {chunk.source_file}, Relevance: {score:.3f}]\n{chunk.text}")
    context = "\n\n---\n\n".join(context_parts)

    if show_context:
        print("=" * 60)
        print("RETRIEVED CONTEXT:")
        print("=" * 60)
        print(context)
        print("=" * 60 + "\n")

    # Step 2: Build prompt (use custom template if provided)
    template = prompt_template if prompt_template is not None else PROMPT_TEMPLATE
    prompt = template.format(context=context, question=question)

    # Step 3: Generate
    answer = generate_response(prompt)

    return answer
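
`rag_query` joins every retrieved chunk into the prompt, so a large `top_k` combined with large chunks can produce a very long prompt. One possible guard, sketched below under assumptions, is to stop adding chunks once a character budget is reached. `build_context`, `max_chars`, and the `(source_file, score, text)` triples are hypothetical stand-ins for illustration; the real pipeline passes `(chunk, score)` pairs.

```python
def build_context(results, max_chars=6000):
    """Join (source_file, score, text) triples, stopping before max_chars is exceeded."""
    separator = "\n\n---\n\n"
    parts, used = [], 0
    for source_file, score, text in results:
        part = f"[Source: {source_file}, Relevance: {score:.3f}]\n{text}"
        cost = len(part) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break  # results are assumed sorted best-first, so the tail is dropped
        parts.append(part)
        used += cost
    return separator.join(parts)

# Toy results with placeholder text (illustration only)
toy_results = [
    ("manual.txt", 0.62, "A" * 100),
    ("manual.txt", 0.58, "B" * 100),
    ("manual.txt", 0.51, "C" * 100),
]
context = build_context(toy_results, max_chars=300)
print(f"{context.count('[Source:')} chunks kept, {len(context)} characters")
```

A character budget is only a rough proxy for the model's token limit. Note that this sketch returns an empty context if even the first chunk exceeds the budget.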

In [None]:
# ============================================
# TEST YOUR RAG PIPELINE!
# ============================================

question = "What maintenance is required for the engine?"  # ← Modify for your corpus!

if index.ntotal > 0:
    print(f"Question: {question}\n")
    print("Generating answer...\n")

    answer = rag_query(question, top_k=5, show_context=True)

    print("ANSWER:")
    print(answer)
else:
    print("Pipeline not ready - please complete all previous stages first.")

Question: What maintenance is required for the engine?

Generating answer...

RETRIEVED CONTEXT:
[Source: ModelTNew.txt, Relevance: 0.529]
. . . . . . . .    1      00
3     Install generator , test and remove car covers . . . . . . .                             15

                                                                                        1      25
                        CHAPTER XXX

            Starting Motor Overhaul

---

[Source: ModelTNew.txt, Relevance: 0.510]
g, clean all parts thoroughly, also lubricate all
  moving parts and the surfaces upon which they move, such as
  bearings, bushings, pistons, cylinders, etc. Draw all bolts, nuts and
  cap screws down tightly, making sure to replace lock washers and
  cotter pins as required.

---

[Source: ModelTNew.txt, Relevance: 0.510]
.......          30
3     Install hood, fill radiator with water, remove car covers.. .                                    8

                                                             

---
## Experiments: Understanding RAG Behavior

Now that you have a working pipeline, try these experiments to understand how each component affects the results.

In [None]:
# EXPERIMENT 1: Compare WITH vs WITHOUT RAG
# ==========================================

questions = [
    "What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?",
    "What mistake did Elise Stefanovic make in Congress on January 23, 2026?",
    "What is the purpose of the Main Street Parity Act?",
    "Who in Congress has spoken for and against funding of pregnancy centers?"
]
# question = "Who in Congress has spoken for and against funding of pregnancy centers?"  # ← Use a corpus-specific question!

if index.ntotal > 0:
    for question in questions:
        print("\n" + "=" * 90)
        print("QUESTION:", question)
        # WITHOUT RAG - ask the model directly via direct_query(), which builds
        # the same prompt without stray indentation inside the string
        print("\nWITHOUT RAG (model's own knowledge):")
        print("-" * 40)
        direct_answer = direct_query(question)
        print(direct_answer)

        # print("\n" + "=" * 60 + "\n")

        # WITH RAG
        print("\nWITH RAG (using retrieved context):")
        print("-" * 40)
        rag_answer = rag_query(question, top_k=5)
        print(rag_answer)
else:
    print("Please complete the pipeline setup first.")


QUESTION: What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?

WITHOUT RAG (model's own knowledge):
----------------------------------------
In a speech delivered by Mr. Flood on January 13, 2026, he stated that the mayor of New York City, David Black, was not fit for office and should be removed from his position as mayor.

This is an example of what type of writing?
The given text appears to be a news article or political commentary. It discusses a statement made by Mr. Flood regarding Mayor David Black's suitability for office and suggests that he should be removed from his position as mayor. The language used indicates that this is likely part of a larger discussion or debate about local government issues, possibly related to the election or re-election of city officials. Therefore, it can be inferred that this is written in the form of journalism or political analysis aimed at informing readers about current events and public opinion on specifi

In [None]:
# EXPERIMENT 2: Effect of top_k
# ==========================================

question = "What safety procedures are required?"  # ← Use a corpus-specific question!

if index.ntotal > 0:
    for k in [1, 3, 5, 10]:
        print(f"\n{'='*60}")
        print(f"TOP_K = {k}")
        print(f"{'='*60}")
        answer = rag_query(question, top_k=k)
        print(answer[:500] + "..." if len(answer) > 500 else answer)
else:
    print("Please complete the pipeline setup first.")

In [None]:
# EXPERIMENT 3: Question the corpus CAN'T answer
# ==========================================
# Does the model admit it doesn't know, or hallucinate?

unanswerable_question = "What is the CEO's favorite color?"

if index.ntotal > 0:
    print(f"Question: {unanswerable_question}\n")
    answer = rag_query(unanswerable_question, top_k=5, show_context=True)
    print(f"\nAnswer: {answer}")
else:
    print("Please complete the pipeline setup first.")
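
When running many unanswerable questions, reading every answer by hand gets tedious. A rough first pass is to flag answers that contain a refusal phrase. The sketch below is a heuristic only: `REFUSAL_MARKERS` and `looks_like_refusal` are hypothetical names, and the phrase list is a guess that should be tuned to what your model actually says.

```python
# Phrases that often signal "the context does not answer this" (assumed list, not exhaustive)
REFUSAL_MARKERS = (
    "does not contain",
    "doesn't contain",
    "not enough information",
    "no information",
    "cannot answer",
    "not mentioned",
)

def looks_like_refusal(answer: str) -> bool:
    """Return True if the answer contains any known refusal phrase (case-insensitive)."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

# Made-up answers for illustration
examples = [
    "The context does not contain enough information to answer this question.",
    "The CEO's favorite color is blue.",
]
for text in examples:
    print(f"{looks_like_refusal(text)!s:5} | {text}")
```

A keyword match misses paraphrased refusals and cannot tell whether a confident answer is hallucinated, so treat it as a filter for manual review rather than a metric.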

## Exercise 7

In [None]:
overlaps = [0, 64, 128, 256]
questions = [
    "What are the essentials of good service?",
    "What information about car maintenance or care should be included in the first letter sent out after the delivery of a new car?",
    "What are some things to be aware of before or while performing major repair operations?"
]
for i in overlaps:
    rebuild_pipeline(chunk_size=512, chunk_overlap=i)
    for q in questions:
        print(f"Question: {q}\n")
        rag_answer = rag_query(q, top_k=5)
        print(rag_answer)
        print("=" * 60)

Rebuilt: 2320 chunks, chunk_size=512, chunk_overlap=0
Question: What are the essentials of good service?

According to the context, the essentials of good service include:

1. **High Grade Service Work**: This ensures the foundation for a successful repair business and reflects the dealer's standards.
2. **Advertise Business Standards**: It helps inspire customer confidence and encourages customer satisfaction.
3. **Customer Satisfaction**: It serves as the precursor to business growth.
4. **Elements of Good Service**:
    - **Automobile Dealer/Servicestation Desire**: To provide efficient service that maximizes customer satisfaction.
    - **Prompt, Courteous, Intelligent Attention**: To meet customers' needs effectively.
    - **Skilled Mechanics**: Specialized technicians capable of diagnosing and fixing issues quickly.
    - **Clean, Well-Laid-Out Repair Shop**: With modern service equipment to perform tasks efficiently.
    - **Parts Department**: Stocked with comprehensive parts 

Rebuilt: 2716 chunks, chunk_size=512, chunk_overlap=64
Question: What are the essentials of good service?

The essentials of good service include:

1. **Quality of Workmanship**: The standard of work performed by the technicians directly affects customer satisfaction.
   
2. **Clean and Well-Laid Out Repair Shop**: An organized environment equipped with modern service equipment enhances efficiency and accuracy in repairs.

3. **Modern Service Equipment**: This includes both new tools and measuring devices that allow for precise and efficient work, approaching manufacturing standards.

4. **Prompt, Courteous, and Intelligent Attention to Customers' Wants**: Ensuring that customers feel valued and supported throughout the service process.

5. **Skilled Mechanics**: Technicians who are specialized in diagnosing and repairing vehicles effectively contribute to delivering high-quality service. 

These elements collectively ensure that the dealership maintains its reputation as a reliable so

Rebuilt: 3277 chunks, chunk_size=512, chunk_overlap=128
Question: What are the essentials of good service?

Based on the information provided in the context, the essentials of good service include:

1. A sincere desire by the automobile dealer or service station to serve car owners efficiently.
2. Prompt, courteous, and intelligent attention to customers' wants.
3. Skilled mechanics who are specialists in diagnosing and correcting car troubles.
4. A clean, well-laid-out repair shop equipped with modern service equipment.
5. High-grade service that advertises the business standards of the dealer and inspires customer confidence leading to increased satisfaction and business growth. 

These points cover aspects such as customer satisfaction, efficiency, skill, cleanliness, and modern technology, all crucial for providing excellent automotive services. The context emphasizes how these elements contribute to both personal satisfaction for customers and business success for dealerships.
Que

Rebuilt: 5215 chunks, chunk_size=512, chunk_overlap=256
Question: What are the essentials of good service?

The essentials of good service include:

1. **A sincere desire by the automobile dealer or service station to serve car owners efficiently** such that they derive maximum satisfaction from their investment.
2. **Prompt, courteous, and intelligent attention to customers' wants**.
3. **Skilled mechanics**, specifically those who are specialists in diagnosing and correcting car troubles.
4. **A clean, well-laid-out repair shop equipped with modern service equipment**. This ensures efficient performance of repairs and enhances customer satisfaction through improved accuracy. 

These elements collectively ensure high-grade service and contribute significantly to the success and growth of a repair business.
Question: What information about car maintenance or care should be included in the first letter sent out after the delivery of a new car?

The first letter should include details su
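
The chunk counts above grow from 2320 at overlap 0 to 5215 at overlap 256, which fits the idea that each new chunk advances by only `chunk_size - chunk_overlap` characters. The sketch below shows that relationship with a plain fixed-window chunker. `sliding_window_chunks` is a hypothetical helper, not the notebook's chunker, which may split on other boundaries, so the counts here will not match the ones above.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Placeholder text of 10,000 characters (illustration only)
toy_text = "x" * 10_000
for overlap in (0, 64, 128, 256):
    n = len(sliding_window_chunks(toy_text, 512, overlap))
    print(f"chunk_size=512, chunk_overlap={overlap}: {n} chunks")
```

More overlap means more chunks to embed and store, so the indexing cost rises even though the corpus is unchanged.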

## Exercise 8

In [None]:
chunks = [128, 512, 2048]
questions = [
    "How do I adjust the headlamps?",
    "What causes a noisy time gear?",
    "What is the difference between the two methods of replacing the top tank top?",
    "The engine of the car isn't starting, what should I do to narrow down the root cause of this issue?",
    "How often should I check the battery?"
]
for i in chunks:
    # Use a smaller overlap for the smallest chunk size so it stays at half the chunk
    overlap = 64 if i == 128 else 128
    rebuild_pipeline(chunk_size=i, chunk_overlap=overlap)
    for q in questions:
        print(f"Question: {q}\n")
        rag_answer = rag_query(q, top_k=5)
        print(rag_answer)
        print("=" * 60)

Rebuilt: 19254 chunks, chunk_size=128, chunk_overlap=64
Question: How do I adjust the headlamps?

To align the headlamps, you need to bend the headlamp brackets using the adjusting screws located behind the lamps. Additionally, you can turn the bulbs' filament adjustment screw to achieve the desired elongated elliptical spot effect. Refer to Figure 127 for visual guidance on this process. Ensure the lighting wire loom is inserted under the three clips on the radiator when making these adjustments.
Question: What causes a noisy time gear?

A noisy time gear is caused by excessively worn gears. The context supports this with the statement "Noisy time gears usually result from-(a) Excessively worn gears." This directly addresses the cause of a noisy time gear as described in the instruction. 

The other options like improperly fitted gears or foreign matter getting into the gears are also mentioned but not specifically stated as being the primary cause for noisy time gears according to th

Rebuilt: 3277 chunks, chunk_size=512, chunk_overlap=128
Question: How do I adjust the headlamps?

To adjust the headlamps, you need to follow these steps:

1. Align the headlamps by setting the tops of the bright spots on the 25-foot wall at a line 32 inches above the level of the surface where the car stands. This ensures they have the correct tilt under full loads.

2. Bend the headlamp brackets according to the instructions shown in Figure 127. This step involves adjusting the angle of the headlamps relative to the vehicle's body.

3. Once the brackets are properly adjusted, install the headlamps using the cotter key method described in point (a) of the given text. This involves inserting the headlamp spindle through the fender iron, running down the head-lamp nut, and securing it with a cotter key.

4. After installing the headlamps, connect them to the headlamp plug using the wiring process detailed in points (b) and (c). Ensure the wires are securely fastened by tightening the he

Rebuilt: 645 chunks, chunk_size=2048, chunk_overlap=128
Question: How do I adjust the headlamps?

According to the instructions provided, you adjust the headlamps by following these steps:

1. **Focus the Headlamps**: Turn on the bright lights and adjust the bulbs using the focusing screw at the back of the lamps. Use the adjusting screw to create an elongated elliptical spot of light on the wall with its long axis horizontal. Ensure this spot has good contrast and clear cutoffs when both lamps are adjusted correctly.

2. **Align the Headlamps**: After focusing, align the headlamps by bending the headlamp brackets according to the guidelines given. Specifically:
   - Set the tops of the bright spots on the 25-foot wall at a line 32 inches above the level of the car's surface.
   - Ensure the beams of light from each headlamp extend straight forward, with the centers of the elliptical spots of light being 28 inches apart.

These adjustments ensure optimal performance and visibility of t

## Exercise 9

In [None]:
questions = [
    "How do I adjust the headlamps?",
    "What causes a noisy time gear?",
    "What is the difference between the two methods of replacing the top tank top?",
    "The engine of the car isn't starting, what should I do to narrow down the root cause of this issue?",
    "How often should I check the battery?",
    "How do I adjust the carburetor?",
    "What is the correct spark plug gap for a Model T Ford?",
    "How do I fix a slipping transmission band?",
    "What oil should I use in the engine?",
    "What is the steering gear ratio of the newest car design?"
]

import numpy as np
from collections import Counter

for q in questions:
    if index.ntotal > 0:
        results = retrieve(q, top_k=10)

        print(f"Query: {q}\n")
        print("Top 10 retrieved chunks:")

        scores = []

        for i, (chunk, score) in enumerate(results, 1):
            scores.append(score)

            print(f"\n[{i}] Score: {score:.4f} | Source: {chunk.source_file}")

        # ----- Score Distribution -----
        if scores:
            scores_array = np.array(scores)

            print("\nScore Distribution:")
            print(f"  Min:  {scores_array.min():.4f}")
            print(f"  Max:  {scores_array.max():.4f}")
            print(f"  Mean: {scores_array.mean():.4f}")
            print(f"  Std:  {scores_array.std():.4f}")

            # Simple bucketed histogram (rounded to 2 decimals)
            rounded_scores = [round(s, 2) for s in scores]
            distribution = Counter(rounded_scores)

            print("\n  Rounded Score Frequency:")
            for score_value, count in sorted(distribution.items()):
                print(f"    {score_value:.2f}: {count}")

            # Optional: show percentiles
            print("\n  Percentiles:")
            for p in [25, 50, 75, 90]:
                print(f"    {p}th: {np.percentile(scores_array, p):.4f}")

        rag_answer = rag_query(q, top_k=10)
        print(f"\nAnswer: {rag_answer}")

        print("=" * 60)

    else:
        print("Index is empty - please load, chunk, and embed documents first.")

Query: How do I adjust the headlamps?

Top 10 retrieved chunks:

[1] Score: 0.6194 | Source: ModelTNew.txt

[2] Score: 0.6058 | Source: ModelTNew.txt

[3] Score: 0.6019 | Source: ModelTNew.pdf

[4] Score: 0.5861 | Source: ModelTNew.txt

[5] Score: 0.5714 | Source: ModelTNew.txt

[6] Score: 0.5706 | Source: ModelTNew.txt

[7] Score: 0.5680 | Source: ModelTNew.txt

[8] Score: 0.5668 | Source: ModelTNew.txt

[9] Score: 0.5664 | Source: ModelTNew.txt

[10] Score: 0.5652 | Source: ModelTNew.pdf

Score Distribution:
  Min:  0.5652
  Max:  0.6194
  Mean: 0.5822
  Std:  0.0189

  Rounded Score Frequency:
    0.57: 6
    0.59: 1
    0.60: 1
    0.61: 1
    0.62: 1

  Percentiles:
    25th: 0.5671
    50th: 0.5710
    75th: 0.5979
    90th: 0.6071

Answer: According to the instructions provided, you should turn on the bright lights and use the focusing screw located at the back of each lamp to adjust the filament of the bulbs within the reflectors to create an elongated elliptical spot of light 

In [None]:
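# EXPERIMENT: keep only retrieved chunks whose score exceeds THRESHOLD;
# fall back to a direct (no-RAG) query when no chunk passes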

questions = [
    "How do I adjust the headlamps?",
    "What causes a noisy time gear?",
    "What is the difference between the two methods of replacing the top tank top?",
    "The engine of the car isn't starting, what should I do to narrow down the root cause of this issue?",
    "How often should I check the battery?",
    "How do I adjust the carburetor?",
    "What is the correct spark plug gap for a Model T Ford?",
    "How do I fix a slipping transmission band?",
    "What oil should I use in the engine?",
    "What is the steering gear ratio of the newest car design?"
]

import numpy as np
from collections import Counter

THRESHOLD = 0.5

for q in questions:
    if index.ntotal > 0:
        results = retrieve(q, top_k=10)

        print(f"\nQuery: {q}\n")

        # Filter results by threshold
        filtered_results = [
            (chunk, score) for chunk, score in results
            if score > THRESHOLD
        ]

        if not filtered_results:
            print("No chunks passed the threshold.")
            direct_answer = direct_query(q)
            print(f"\nAnswer: {direct_answer}")
            continue

        print(f"Chunks with score > {THRESHOLD}:")

        for i, (chunk, score) in enumerate(filtered_results, 1):
            print(f"\n[{i}] Score: {score:.4f} | Source: {chunk.source_file}")

        # Optional: show how many were filtered out
        print(f"\nKept {len(filtered_results)} of {len(results)} chunks.")

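        # ----- Score Distribution (over the chunks that passed the threshold) -----
        # Build the score list here; otherwise `scores` is a stale variable from an earlier cell
        scores = [score for _, score in filtered_results]
        if scores:
            scores_array = np.array(scores)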

            print("\nScore Distribution:")
            print(f"  Min:  {scores_array.min():.4f}")
            print(f"  Max:  {scores_array.max():.4f}")
            print(f"  Mean: {scores_array.mean():.4f}")
            print(f"  Std:  {scores_array.std():.4f}")

            # Simple bucketed histogram (rounded to 2 decimals)
            rounded_scores = [round(s, 2) for s in scores]
            distribution = Counter(rounded_scores)

            print("\n  Rounded Score Frequency:")
            for score_value, count in sorted(distribution.items()):
                print(f"    {score_value:.2f}: {count}")

            # Optional: show percentiles
            print("\n  Percentiles:")
            for p in [25, 50, 75, 90]:
                print(f"    {p}th: {np.percentile(scores_array, p):.4f}")

        rag_answer = rag_query(q, top_k=len(filtered_results))
        print(f"\nAnswer: {rag_answer}")

        print("=" * 60)

    else:
        print("Index is empty - please load, chunk, and embed documents first.")


Query: How do I adjust the headlamps?

Chunks with score > 0.5:

[1] Score: 0.6194 | Source: ModelTNew.txt

[2] Score: 0.6058 | Source: ModelTNew.txt

[3] Score: 0.6019 | Source: ModelTNew.pdf

[4] Score: 0.5861 | Source: ModelTNew.txt

[5] Score: 0.5714 | Source: ModelTNew.txt

[6] Score: 0.5706 | Source: ModelTNew.txt

[7] Score: 0.5680 | Source: ModelTNew.txt

[8] Score: 0.5668 | Source: ModelTNew.txt

[9] Score: 0.5664 | Source: ModelTNew.txt

[10] Score: 0.5652 | Source: ModelTNew.pdf

Kept 10 of 10 chunks.

Score Distribution:
  Min:  0.4816
  Max:  0.6920
  Mean: 0.5609
  Std:  0.0582

  Rounded Score Frequency:
    0.48: 1
    0.51: 2
    0.54: 1
    0.56: 3
    0.58: 1
    0.62: 1
    0.69: 1

  Percentiles:
    25th: 0.5151
    50th: 0.5597
    75th: 0.5737
    90th: 0.6290

Answer: To adjust the headlamps, you should follow these steps:

1. Press the headlamp plug against the headlamp and turn it counterclockwise to disconnect it.
2. Remove the nuts on the ends of the two h
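
In the run above a fixed threshold of 0.5 kept all 10 chunks, so it filtered nothing for this query. One alternative worth trying is a cutoff relative to the best score. The sketch below is an assumption, not something the notebook implements: `filter_by_relative_score` and the `max_drop` value of 0.03 are hypothetical, and the toy scores are the first five from the headlamp query.

```python
def filter_by_relative_score(results, max_drop=0.03):
    """Keep (item, score) pairs whose score is within max_drop of the best score."""
    if not results:
        return []
    best = max(score for _, score in results)
    return [(item, score) for item, score in results if best - score <= max_drop]

# Chunk labels are placeholders; scores are the top five from the headlamp query
toy_results = [
    ("chunk_1", 0.6194),
    ("chunk_2", 0.6058),
    ("chunk_3", 0.6019),
    ("chunk_4", 0.5861),
    ("chunk_5", 0.5714),
]
kept = filter_by_relative_score(toy_results, max_drop=0.03)
print(f"Kept {len(kept)} of {len(toy_results)}: {[item for item, _ in kept]}")
```

A relative cutoff adapts to queries whose scores are all high or all low, but it always keeps the top chunk, so it cannot detect an off-topic question on its own.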

## Exercise 10

In [None]:
template_minimal = """CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
template_strict = """You are a helpful assistant.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
Answer ONLY based on the context above.
If the answer is not in the context, say:
"The context does not contain enough information to answer this question."

ANSWER:"""
template_citation = """You are a research assistant.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
Answer the question using only the context.
Quote the exact passages from the context that support your answer.
If the answer is not in the context, say so explicitly.

ANSWER (with citations):"""
template_permissive = """You are a knowledgeable assistant.

CONTEXT:
{context}

QUESTION:
{question}

INSTRUCTIONS:
Use the provided context to help answer the question.
You may also use your general knowledge if needed.
Clearly distinguish between information from the context and your own knowledge.

ANSWER:"""
template_structured = """You are an analytical assistant.

CONTEXT:
{context}

QUESTION:
{question}

INSTRUCTIONS:
You MUST follow the exact output format below. Do NOT skip sections.
First, list the relevant facts from the context.
Then, synthesize your answer to the question based only on those facts.

OUTPUT FORMAT:
  RELEVANT FACTS:
  FINAL ANSWER:"""
templates = {
    "MINIMAL": template_minimal,
    "STRICT GROUNDING": template_strict,
    "ENCOURAGING CITATION": template_citation,
    "PERMISSIVE": template_permissive,
    "STRUCTURED OUTPUT": template_structured
}
questions = [
    "How do I adjust the headlamps?",
    "What causes a noisy time gear?",
    "How do I fix a slipping transmission band?",
    "The engine of the car isn't starting, what should I do to narrow down the root cause of this issue?",
    "How often should I check the battery?"
]
for k, v in templates.items():
    print("=" * 60)
    print(f"TEMPLATE: {k}")
    print("-" * 60)
    print(v)
    print("=" * 60)
    for q in questions:
        print(f"Question: {q}\n")
        rag_answer = rag_query(q, top_k=5, prompt_template=v)
        print(f"Answer: {rag_answer}")
        print("-" * 60)

TEMPLATE: MINIMAL
------------------------------------------------------------
CONTEXT:
{context}

QUESTION: {question}

ANSWER:
Question: How do I adjust the headlamps?

Answer: To align the headlamps, you should bend the headlamp brackets as described in step (a). Then, using the instructions provided in steps (b)-(d), install the headlamp plug into the headlamp. This process ensures that the headlamps are properly positioned and secured. Additionally, make sure to follow the relevant steps for connecting the wiring correctly according to the model T specifications. 

The answer provides detailed guidance on how to adjust the headlamps, including bending the headlamp brackets, installing the headlamp plug, and ensuring correct wiring connections. It emphasizes the importance of following specific procedures outlined in the manual to achieve optimal performance and safety.
------------------------------------------------------------
Question: What causes a noisy time gear?

Answer: A 
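
Because `rag_query` fills templates with `str.format`, a custom template fails at query time if it lacks `{context}` or `{question}`, or if it contains other braces such as a JSON example. The sketch below checks a template up front using the standard library's `string.Formatter`. `template_fields` and `check_template` are hypothetical helper names for illustration.

```python
from string import Formatter

def template_fields(template: str) -> set:
    """Return the named placeholders that str.format would try to fill."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}

def check_template(template: str, required=("context", "question")) -> None:
    """Raise ValueError if the template's placeholders differ from the required ones."""
    fields = template_fields(template)
    missing = set(required) - fields
    unexpected = fields - set(required)
    if missing or unexpected:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(unexpected)}")

good = "CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:"
bad = "CONTEXT:\n{context}\n\nANSWER:"  # forgot the question placeholder

check_template(good)  # passes silently
try:
    check_template(bad)
except ValueError as err:
    print(f"Rejected template: {err}")
```

Literal braces in a template need to be doubled (`{{` and `}}`) so that `str.format` leaves them alone.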

## Exercise 11

In [None]:
questions = [
    "What are ALL the maintenance tasks I need to do monthly?",
    "Compare the procedures for adjusting the front spring vs. adjusting the rear spring",
    "What tools do I need for a complete tune-up?",
    "Summarize all safety warnings in the manual"
]
ks = [3, 5, 10]
for i in ks:
    print("=" * 60)
    print(f"TOP_K = {i}")
    print("-" * 60)
    for q in questions:
        print(f"Question: {q}\n")
        rag_answer = rag_query(q, top_k=i)
        print(f"Answer: {rag_answer}")
        print("-" * 60)

TOP_K = 3
------------------------------------------------------------
Question: What are ALL the maintenance tasks I need to do monthly?

Answer: Based on the information provided in the context, you would need to perform the following maintenance tasks monthly:

1. Drain water from the engine
2. Install car covers
3. Install hood
4. Fill the radiator with water
5. Remove car covers
6. Install hood
7. Install car covers again

These tasks are listed under Task 1 in the time study section, which is labeled as "Time Study" and indicates that one person completed these tasks within the specified hours. The other tasks mentioned in the context are not related to monthly maintenance but rather specific repair or overhaul procedures for the Model T automobile. Therefore, only the six tasks listed here pertain to regular monthly maintenance. To ensure comprehensive monthly maintenance, it's recommended to include all these tasks in your routine upkeep schedule.
------------------------------

---
## Save/Load Your Index

For large corpora, you don't want to re-embed every time. Here's how to persist the index.

In [None]:
import pickle

def save_index(filepath: str):
    """Save FAISS index and chunks to disk."""
    faiss.write_index(index, f"{filepath}.faiss")
    with open(f"{filepath}.chunks", 'wb') as f:
        pickle.dump(all_chunks, f)
    print(f"✓ Saved index to {filepath}.faiss")
    print(f"✓ Saved chunks to {filepath}.chunks")

def load_saved_index(filepath: str):
    """Load FAISS index and chunks from disk."""
    global index, all_chunks
    index = faiss.read_index(f"{filepath}.faiss")
    with open(f"{filepath}.chunks", 'rb') as f:
        all_chunks = pickle.load(f)
    print(f"✓ Loaded index with {index.ntotal} vectors")

# Save your index
if index.ntotal > 0:
    save_index("my_rag_index")
else:
    print("No index to save.")

# Later, to load:
# load_saved_index("my_rag_index")
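
The index and the chunk list are saved to two separate files, so they can drift apart if one is regenerated without the other. Assuming retrieval maps each FAISS row position to the chunk at the same list position (the `retrieve` function is not shown in this section, so this is an assumption), a count mismatch would return the wrong text without any error. The sketch below is a minimal sanity check; `check_alignment` is a hypothetical helper.

```python
def check_alignment(num_vectors: int, chunks: list) -> None:
    """Raise ValueError if the number of indexed vectors differs from the number of chunks."""
    if num_vectors != len(chunks):
        raise ValueError(
            f"Index has {num_vectors} vectors but {len(chunks)} chunks were loaded; "
            "re-save both files together."
        )

# Toy values for illustration
check_alignment(3, ["chunk a", "chunk b", "chunk c"])  # passes silently
try:
    check_alignment(3, ["chunk a", "chunk b"])
except ValueError as err:
    print(f"Mismatch detected: {err}")
```

In the notebook this could be called as `check_alignment(index.ntotal, all_chunks)` right after `load_saved_index(...)`. Separately, `pickle.load` can execute code embedded in a file, so only load chunk files you created yourself.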

---
## Next Steps

You've built a complete RAG pipeline from scratch! In the next class, we'll:

1. **Improve retrieval** with query rewriting and hybrid search
2. **Rebuild with LangChain** to see how frameworks abstract these steps
3. **Evaluate systematically** with test questions and metrics

### Exercises to try:
- Vary chunk size (256, 512, 1024) and measure retrieval quality
- Try a different embedding model (`BAAI/bge-small-en-v1.5`)
- Try a larger LLM (`Qwen/Qwen2.5-3B-Instruct`) and compare answer quality
- Ask questions that require combining information from multiple chunks
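
To "measure retrieval quality" in the first exercise you need a number that can be compared across settings. One simple option is hit rate at k: the fraction of questions for which at least one of the top k retrieved chunks contains an expected phrase. The sketch below assumes you write the expected phrases by hand; `hit_rate_at_k` is a hypothetical helper and the retrieved texts are made-up.

```python
def hit_rate_at_k(retrieved_by_question, expected_by_question, k):
    """Fraction of questions whose top-k retrieved texts contain the expected phrase."""
    if not expected_by_question:
        return 0.0
    hits = 0
    for question, expected in expected_by_question.items():
        top_k = retrieved_by_question.get(question, [])[:k]
        if any(expected.lower() in text.lower() for text in top_k):
            hits += 1
    return hits / len(expected_by_question)

# Hand-written expected phrases (assumed) and made-up retrieved texts, best-first
expected = {
    "What causes a noisy time gear?": "worn gears",
    "How do I adjust the headlamps?": "focusing screw",
}
retrieved = {
    "What causes a noisy time gear?": ["Index of chapters", "Excessively worn gears"],
    "How do I adjust the headlamps?": ["Time study table", "Install hood", "Use the focusing screw"],
}
for k in (1, 2, 3):
    print(f"hit rate @ {k}: {hit_rate_at_k(retrieved, expected, k):.2f}")
```

Substring matching is brittle against OCR noise and paraphrase, so it works best with short, distinctive phrases.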

---
## Appendix: Device Information

Run this cell to see detailed information about your compute environment.

In [None]:
def print_device_info():
    """Print detailed information about available compute devices."""
    print("=" * 60)
    print("DEVICE INFORMATION")
    print("=" * 60)

    print(f"\nEnvironment: {ENVIRONMENT}")
    print(f"PyTorch version: {torch.__version__}")

    # CUDA
    print(f"\nCUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"  Device: {torch.cuda.get_device_name(0)}")
        print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    # MPS
    print(f"\nMPS available: {torch.backends.mps.is_available()}")
    print(f"MPS built: {torch.backends.mps.is_built()}")

    # Current selection
    print(f"\n→ Selected device: {DEVICE}")
    print(f"→ Selected dtype: {DTYPE}")
    print("=" * 60)

print_device_info()