# 📦 Day 3: Building a RAG System (Retrieval-Augmented Generation)

Welcome to the third session of the Generative AI workshop!

Today we'll learn how to **build a Retrieval-Augmented Generation (RAG) pipeline** using open-source tools. You'll see how to process documents, embed them into a vector store, and query them with a language model to generate answers grounded in the document's actual content.

🎯 **Objectives**

- Understand the concept and benefits of Retrieval-Augmented Generation (RAG)
- Chunk and embed a document using `sentence-transformers`
- Store and search document vectors using `ChromaDB`
- Query a document using a local language model (`FLAN-T5`)
- Build and test a simple QA system over a PDF — no API keys required!


## 🔧 Step 1: Install Required Packages

We'll begin by installing all the necessary Python libraries for this RAG pipeline:

- `chromadb` for vector storage and retrieval
- `PyPDF2` for extracting text from PDF documents
- `transformers` for loading our language model (FLAN-T5)
- `sentence-transformers` for generating text embeddings

This may take a minute the first time you run it.


In [None]:
!pip install chromadb PyPDF2 transformers sentence-transformers --quiet


## 📦 Step 2: Import Required Libraries

Now that we've installed our dependencies, let's import the necessary libraries:

- `PyPDF2` to read PDF files
- `sentence-transformers` to embed text chunks
- `transformers` to load and run our LLM (FLAN-T5)
- `chromadb` to store and retrieve vectorized document chunks
- `torch` as the backend for running the language model


In [None]:
import os
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import chromadb
from chromadb.config import Settings


## 📄 Step 3: Upload and Extract Text from a PDF

We'll now upload a PDF file using Colab's file uploader and extract its text content.

- This step reads each page of the PDF using `PyPDF2`
- It joins the extracted text into a single string
- The resulting `full_text` variable will be used for chunking and embedding in the next steps
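
As a rough illustration of the joining step, here is a minimal sketch that uses plain strings in place of PDF pages. The `join_pages` helper and the sample list are hypothetical; with a real PDF, each string would come from a page's text-extraction call, which can yield nothing for blank or scanned pages.

```python
def join_pages(page_texts):
    """Join per-page text with newlines, skipping pages that yielded nothing."""
    # Hypothetical helper: `page_texts` stands in for the text extracted from each page.
    return "\n".join(text for text in page_texts if text)

# Made-up sample: two pages with text, one empty page, one page with no result
sample_pages = ["Intro page.", "", None, "Final page."]
joined = join_pages(sample_pages)
print(joined)
```

Skipping empty entries keeps stray blank lines out of the text that is later chunked and embedded.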


In [None]:
from google.colab import files

# TODO: Upload a PDF file from your computer
uploaded = files.____()  # Hint: What method allows users to upload files?

# TODO: Get the filename of the uploaded file
filename = next(iter(____))  # Hint: What variable contains the uploaded files?

# TODO: Create a PDF reader object
reader = PdfReader(____)  # Hint: What variable contains the filename?

# Extract all text from the PDF
# TODO: Extract text from each page and join into a single string
full_text = "\n".join([page.____() for page in reader.____ if page.extract_text()])
# Hint: What method extracts text from a page? What attribute contains all the pages?

print(f"✅ PDF uploaded and processed!")
print(f"📄 Filename: {filename}")
print(f"📝 Total text length: {len(full_text)} characters")

# 💡 LEARNING NOTES:
# - This step reads each page of the PDF using PyPDF2
# - It joins the extracted text into a single string
# - The resulting full_text variable will be used for chunking and embedding in the next steps
# - We filter out empty pages to avoid processing blank content

## 👀 Optional: View Extracted Text

Let’s preview the extracted text from the PDF to ensure it was loaded correctly.

This is helpful for:
- Verifying that the PDF contains valid text (not just images)
- Understanding what content the model will later use for answering questions
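
Beyond eyeballing the preview, a quick automated sanity check can flag PDFs that yielded little or no usable text. The sketch below is only one possible heuristic; the helper names and both thresholds are assumptions chosen for illustration.

```python
def text_ratio(text):
    """Return the fraction of characters that are letters, digits, or whitespace."""
    if not text:
        return 0.0
    readable = sum(1 for ch in text if ch.isalnum() or ch.isspace())
    return readable / len(text)

def looks_like_real_text(text, min_chars=200, min_ratio=0.8):
    """Heuristic check: enough characters, and most of them readable."""
    # Both thresholds are assumed values, not recommendations.
    return len(text) >= min_chars and text_ratio(text) >= min_ratio

print(looks_like_real_text("word " * 100))  # long, readable text
print(looks_like_real_text(""))             # e.g. a scanned PDF with no text layer
```

If the check fails, the PDF is probably image-based and would need OCR before it can be used in this pipeline.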


In [None]:
# TODO: Display the first 1000 characters of the extracted text
print(____[:____])  # Hint: What variable contains our text and how many characters should we show?

# 💡 This is helpful for:
# - Verifying that the PDF contains valid text (not just images)
# - Understanding what content the model will later use for answering questions
# - Checking if the text extraction worked properly

## ✂️ Step 4: Chunk the Text

To make the text manageable for embedding and retrieval, we'll break the PDF content into smaller chunks.

- This function splits the text into sentences using regular expressions
- It groups sentences together until a character limit (e.g., 300) is reached
- The result is a list of `chunks`, each suitable for embedding in the next step
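
To see what the sentence-splitting regular expression does on its own, here is a small sketch on a made-up string. It covers only the splitting step; grouping sentences into chunks is left to the exercise below.

```python
import re

# Split wherever whitespace follows a period, exclamation mark, or question mark
SENTENCE_BOUNDARY = r'(?<=[.!?])\s+'

text = "RAG retrieves context first. Does it help? Yes! It grounds the answer."
sentences = re.split(SENTENCE_BOUNDARY, text)
for sentence in sentences:
    print(repr(sentence))

# The pattern is purely punctuation-based, so abbreviations are split too
abbreviation_case = re.split(SENTENCE_BOUNDARY, "See e.g. this.")
print(abbreviation_case)
```

The lookbehind `(?<=[.!?])` keeps the punctuation attached to the sentence it ends. Because the rule only looks at punctuation, text with many abbreviations or decimal-free initials may be split more finely than expected.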


In [None]:
# Simple chunking by sentence
import re

def chunk_text(text, max_length=300):
    # TODO: Split text into sentences using regular expressions
    sentences = re.split(r'(?<=[.!?])\s+', ____)  # Hint: What variable contains our extracted text?

    # TODO: Initialize empty lists for chunks and current chunk
    chunks, chunk = ____, ____  # Hint: What data structures should hold multiple chunks and one chunk? List and String

    # TODO: Loop through each sentence
    for sentence in ____:  # Hint: What variable contains our sentences?
        # TODO: Check if adding this sentence would exceed max_length
        if len(____) + len(____) <= ____:  # Hint: What variables represent current chunk, new sentence, and limit?
            # TODO: Add sentence to current chunk with space
            chunk += ____ + " "  # Hint: What sentence are we adding?
        else:
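            # TODO: Save the current chunk (if it has content) and start a new one
            if chunk.strip():  # Avoids storing an empty chunk when the very first sentence exceeds max_length
                chunks.append(chunk.____())  # Hint: What method removes extra whitespace?
            chunk = ____ + " "  # Hint: What sentence starts the new chunk?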

    # TODO: Don't forget the last chunk if it has content
    if ____:  # Hint: What variable might have remaining content?
        chunks.append(chunk.strip())

    return ____  # Hint: What should this function return?

# TODO: Apply chunking to our extracted text
chunks = chunk_text(____)  # Hint: What variable contains our full PDF text?

print(f"✅ Text chunked successfully!")
print(f"📊 Total chunks created: {len(chunks)}")
print(f"📏 Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} characters")
print(f"🔍 First chunk preview:")
print(chunks[0])


## 📚 Optional: Inspect the Chunks

Let’s inspect a few individual chunks to understand how the original text was segmented.

This helps you:
- See how the chunking logic grouped sentences together
- Verify whether the chunks are clean and meaningful for embedding


In [None]:
# TODO: Loop through the first 3 chunks
for i in range(____):  # Hint: How many chunks do we want to see?
    print(f"--- Chunk {____} ---")  # Hint: What variable represents the current chunk number? i + 1?
    print(____[____])  # Hint: What list contains our chunks and what index are we at?
    print()

## 🧬 Step 5: Generate Embeddings

We’ll now convert each text chunk into a numerical vector using a pre-trained sentence embedding model.

- We're using the `all-MiniLM-L6-v2` model from `sentence-transformers`
- These embeddings will later be stored in a vector database for retrieval

Each chunk is then represented as a fixed-length vector (384 dimensions for this model), so chunks with similar meaning end up close together in vector space.
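
To make "semantic closeness" concrete, the sketch below computes cosine similarity between tiny made-up vectors. Real embeddings have hundreds of dimensions; the three-dimensional values here are invented purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings" (not produced by any model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(f"cat vs kitten:  {cosine_similarity(cat, kitten):.3f}")
print(f"cat vs invoice: {cosine_similarity(cat, invoice):.3f}")
```

Vectors pointing in similar directions score close to 1, while unrelated ones score close to 0. Retrieval relies on this kind of comparison between the question's vector and each chunk's vector.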


In [None]:
# TODO: Initialize a sentence transformer model for creating embeddings
embedder = SentenceTransformer("____")  # Hint: What's the model name for all-MiniLM-L6-v2?

# TODO: Convert all chunks into embedding vectors
embeddings = embedder.____(____)  # Hint: What method creates embeddings and what variable contains our chunks?

print(f"✅ Embeddings created successfully!")
print(f"📊 Number of embeddings: {len(embeddings)}")
print(f"📏 Embedding dimension: {embeddings.shape[1]}")
print(f"🔍 First embedding preview (first 10 values):")
print(embeddings[0][:10])

## 🔎 Optional: Inspect Embeddings

Let’s take a quick look at the generated embeddings.

- Each embedding is a high-dimensional vector representing the meaning of a chunk
- These vectors are what the vector store compares against the question's embedding to retrieve relevant chunks later

Note: Embeddings are large arrays of numbers, so we’ll only display the first one for illustration.


In [None]:
# TODO: Print information about the first embedding
print(f"Embedding for Chunk 1 (dimension: {len(____[____])}):")  # Hint: What array contains embeddings and what index is the first?
print(____[____])  # Hint: What array and what index for the first embedding?

## 🗂️ Step 6: Store Chunks in ChromaDB

Now we’ll store the chunks and their corresponding embeddings in a ChromaDB collection.

- ChromaDB is an open-source vector database; the client used here runs in memory inside the notebook
- We create a collection named `"test"` (or reuse it if it already exists)
- Each chunk is added along with its embedding and a unique ID

This setup allows us to later search for relevant chunks based on user questions.
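
Conceptually, a vector store keeps each chunk next to its vector and returns the entries closest to a query vector. The toy class below sketches that idea with a brute-force search. `TinyVectorStore` is a hypothetical teaching aid, not how ChromaDB is implemented, and squared Euclidean distance is used simply as one common choice of metric.

```python
class TinyVectorStore:
    """Toy in-memory store: keeps (id, text, vector) rows and searches by brute force."""

    def __init__(self):
        self.rows = []

    def add(self, doc_id, text, vector):
        self.rows.append((doc_id, text, list(vector)))

    def query(self, query_vector, n_results=2):
        def squared_l2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        # Rank every stored row by its distance to the query (smallest first)
        ranked = sorted(self.rows, key=lambda row: squared_l2(row[2], query_vector))
        return [(doc_id, text) for doc_id, text, _ in ranked[:n_results]]

store = TinyVectorStore()
store.add("0", "Cats purr.", [1.0, 0.0])          # made-up 2-D vectors
store.add("1", "Dogs bark.", [0.0, 1.0])
store.add("2", "Kittens purr too.", [0.9, 0.1])

top = store.query([1.0, 0.05], n_results=2)
print(top)
```

A real vector database adds persistence options, metadata, and approximate-nearest-neighbor indexes so that search stays fast as the collection grows.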


In [None]:
# TODO: Create a ChromaDB client with anonymized telemetry disabled
chroma_client = chromadb.Client(Settings(anonymized_telemetry=____))  # Hint: Should telemetry be enabled? True or False?

# TODO: Create a collection to store our documents and embeddings
collection = chroma_client.create_collection(name="____", get_or_create=____)  # Hint: What should we name our collection, and should we reuse it if it already exists?

# TODO: Add documents and embeddings to the collection
for i, (chunk, emb) in enumerate(zip(____, ____)):  # Hint: What two lists should we iterate through together?
    collection.add(
        documents=[____],  # Hint: What text chunk are we adding?
        embeddings=[____.tolist()],  # Hint: What embedding (converted to list) are we adding?
        ids=[str(____)]  # Hint: What should be the unique ID for this document?
    )

print(f"✅ Vector database created successfully!")
print(f"📊 Total documents stored: {collection.count()}")

## 📋 Optional: Preview Stored Chunks in ChromaDB

Let’s confirm that the chunks and embeddings were properly added to the ChromaDB collection.

This quick check allows us to:
- View some of the stored chunk texts
- Ensure each one has a unique ID


In [None]:
# TODO: Retrieve documents from the collection
results = collection.get(include=["____"])  # Hint: What type of data do we want to retrieve from the collection?

# TODO: Display the first 3 documents
for i in range(min(____, len(results["____"]))):  # Hint: How many documents to show and what key contains the documents?
    print(f"📄 Chunk ID: {results['____'][____]}")  # Hint: What key contains IDs and what index are we at?
    print(results["____"][____])  # Hint: What key contains documents and what index are we at?
    print("____" * 80)  # Hint: What character should create a separator line?

## 🤖 Step 7: Load the Language Model (FLAN-T5)

We’ll now load a lightweight instruction-tuned language model to generate answers based on retrieved context.

- `google/flan-t5-base` is a small instruction-tuned model (roughly 250M parameters) suitable for Q&A tasks
- We load both the tokenizer and the model using Hugging Face Transformers

This model will take the retrieved document chunks and generate context-aware answers to user questions.


In [None]:
# TODO: Load the tokenizer for the T5 model
tokenizer = AutoTokenizer.from_pretrained("____")  # Hint: What's the model name for google/flan-t5-base?

# TODO: Load the T5 model for sequence-to-sequence generation
model = AutoModelForSeq2SeqLM.from_pretrained("____")  # Hint: Should we use the same name?

## ❓ Step 8: Define a Question-Answering Function

This function allows us to query the document using natural language and receive an answer generated by the language model.

Here’s how it works:

- It first embeds the user's question using the same embedding model as before
- It then queries the ChromaDB collection to retrieve the most relevant text chunks
- These chunks are used as context in a prompt passed to the `flan-t5-base` model
- The model generates an answer based on the context and the question

You can now ask the model questions about the uploaded PDF!
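
One detail worth keeping in mind is that the model can only read a limited amount of input, so the retrieved context may need trimming. The sketch below assembles a prompt from chunks until a character budget is reached. The `build_prompt` helper and the `max_context_chars` parameter are hypothetical, and a character count is only a crude stand-in for the model's real token limit.

```python
def build_prompt(chunks, question, max_context_chars=1000):
    """Assemble instruction + context + question, keeping context within a budget."""
    instruction = "You are a helpful assistant. Use the context to answer the question."
    selected, used = [], 0
    for chunk in chunks:  # chunks are assumed to be ordered most relevant first
        if used + len(chunk) > max_context_chars:
            break  # stop before the budget is exceeded
        selected.append(chunk)
        used += len(chunk)
    context = " ".join(selected)
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"

# Made-up chunks: one short and relevant, one far too long for the budget
prompt = build_prompt(
    ["Paris is the capital of France.", "x" * 2000],
    "What is the capital of France?",
    max_context_chars=100,
)
print(prompt)
```

Because the most relevant chunks come first, trimming from the end drops the least relevant material.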


In [None]:
def ask_question(query):
    # TODO: Convert the query into an embedding vector
    query_vec = embedder.____([____])[____]  # Hint: What method creates embeddings? What should we encode? What index for first result?

    # TODO: Search for similar documents in the vector database
    results = collection.query(query_embeddings=[____.tolist()], n_results=____)  # Hint: What vector to search with? How many similar chunks to retrieve? Bonus: How could you filter results by a similarity threshold?

    # TODO: Combine retrieved documents into context
    context = " ".join(results["____"][____])  # Hint: What key contains the retrieved documents? What index for our query results?

    # TODO: Create a prompt that includes context and question
    instruction = "You are a helpful assistant. Use the context to answer the question."
    prompt = (
        f"{instruction}\n\n"
        f"Context:\n{____}\n\n"  # Hint: What variable contains our retrieved context?
        f"Question: {____}\n\n"  # Hint: What variable contains the user's question?
        "Answer:"
    )

    # Display the components before generating answer
    print("=" * 80)
    print("🔍 QUERY:")
    print(f"'{query}'")
    print("\n" + "=" * 80)
    print("📋 INSTRUCTION:")
    print(instruction)
    print("\n" + "=" * 80)
    print("📄 RETRIEVED CONTEXT:")
    print(context)
    print("\n" + "=" * 80)
    print("🤖 GENERATED ANSWER:")
    print("-" * 40)

    # TODO: Tokenize the prompt for the model
    inputs = tokenizer(____, return_tensors="____")  # Hint: What should we tokenize? What tensor format does PyTorch use? ("pt")

    # TODO: Generate an answer using the model
    outputs = model.generate(**____, max_new_tokens=____)  # Hint: What inputs should we pass? What's a reasonable token limit for answers?

    # TODO: Decode the generated answer (it is printed below)
    full_response = tokenizer.decode(____[____], skip_special_tokens=____)  # Hint: What outputs to decode? What index for first result? Should we skip special tokens?

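    # FLAN-T5 is a seq2seq model, so its output normally contains only the answer;
    # if the text still includes an "Answer:" marker, keep only what follows it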
    if "Answer:" in full_response:
        answer_start = full_response.find("Answer:") + len("Answer:")
        answer = full_response[answer_start:].lstrip()  # Use lstrip() instead of strip() to preserve trailing whitespace
    else:
        answer = full_response.strip()  # Fallback if "Answer:" not found

    print(answer)
    print("=" * 80)

# 💡 LEARNING NOTES:
# - This function implements the complete RAG pipeline
# - It retrieves relevant context based on query similarity
# - It augments the prompt with retrieved information
# - The model generates answers using both the question and context
# - It prints debug output showing each component of the RAG process

## ▶️ Step 9: Ask a Question!

Let’s test the full pipeline by asking a question about the uploaded PDF.

- Fill in a question of your own, for example a request for a bullet-point summary
- You can replace the prompt with any question relevant to the document

Try experimenting with different question styles to explore the model's capabilities!


In [None]:
# TODO: Ask a question about the document content
ask_question("____")  # Hint: Write a question that would require information from your uploaded PDF

# 💡 Try different types of questions:
# - Factual questions about specific content
# - Summary requests
# - Questions that require combining information from multiple chunks



# 🚀 Advanced RAG Experiments
## For Students Who Want to Go Further!

Congratulations! You've built a complete RAG system. Now it's time to become a **real RAG researcher** and explore what makes these systems work better.

---

## 🔬 **Choose Your Experiment Track**

### 🧩 **Track 1: Chunking Strategy Optimization**
**The Question**: How does text splitting affect answer quality?

**Experiments to Try:**
- **Chunk size comparison**: Test 100, 300, 500, 1000 character chunks
- **Overlap experiments**: Add 50-100 character overlap between chunks
- **Smart boundaries**: Split by paragraphs vs. sentences vs. fixed length
- **Hybrid approaches**: Combine multiple splitting strategies

**Success Metrics**: Answer quality, retrieval accuracy, response coherence
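
As a possible starting point for the overlap experiments, here is a minimal sketch of fixed-length chunking with overlap. The `chunk_with_overlap` function and its defaults are hypothetical; it slices by characters rather than sentences, so it is a different strategy from the sentence-based chunker built earlier.

```python
def chunk_with_overlap(text, size=300, overlap=50):
    """Slice text into fixed-size windows where neighbors share `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks

demo = chunk_with_overlap("abcdefghij", size=4, overlap=2)
print(demo)
```

Each window repeats the tail of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk.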

---

### 🎯 **Track 2: Embedding Model Showdown**
**The Question**: Which embedding model gives the best retrieval results?

**Models to Compare:**
- `all-MiniLM-L6-v2` (what we used - fast and small)
- `all-mpnet-base-v2` (larger, potentially better quality)
- `all-MiniLM-L12-v2` (a deeper, 12-layer variant of the model we used)
- Domain-specific models for your document type

**Success Metrics**: Retrieval precision, answer relevance, speed comparison

---

### 🔍 **Track 3: Retrieval Strategy Enhancement**
**The Question**: How many chunks should we retrieve and how should we rank them?

**Experiments to Try:**
- **Retrieval count**: Test 1, 3, 5, 10 retrieved chunks
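- **Similarity thresholds**: Only use chunks whose similarity exceeds 0.5, 0.7, or 0.8 (note that ChromaDB's `query` reports distances, where lower means more similar)
- **Re-ranking**: Re-order the retrieved chunks using a different similarity metric (e.g., cosine vs. L2 vs. inner product)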
- **Context limits**: How much context can the model handle effectively?

**Success Metrics**: Answer completeness, hallucination reduction, context utilization
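
For the threshold experiment, one option is to post-filter the query results. The sketch below assumes the result dictionary uses the nested-list `documents` and `distances` layout, and that the collection was configured for cosine distance so that similarity can be taken as 1 - distance. With another metric the conversion would differ. The `filter_by_similarity` helper and the `fake_results` data are made up for illustration.

```python
def filter_by_similarity(results, min_similarity=0.7):
    """Keep (document, similarity) pairs for the first query that pass the threshold."""
    docs = results["documents"][0]
    distances = results["distances"][0]
    kept = []
    for doc, dist in zip(docs, distances):
        similarity = 1.0 - dist  # valid only under the cosine-distance assumption
        if similarity >= min_similarity:
            kept.append((doc, similarity))
    return kept

# Hand-written stand-in for a query result (not real output)
fake_results = {
    "documents": [["chunk A", "chunk B", "chunk C"]],
    "distances": [[0.1, 0.25, 0.6]],
}
kept = filter_by_similarity(fake_results, min_similarity=0.7)
for doc, similarity in kept:
    print(f"{similarity:.2f}  {doc}")
```

If nothing passes the threshold, the system could answer "I don't know" instead of generating from weak context, which is one way to study hallucination reduction.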

---

### 📚 **Track 4: Multi-Document Mastery**
**The Question**: How well does RAG work with multiple different documents?

**Experiments to Try:**
- Upload 2-3 different PDFs and ask cross-document questions
- Test document type mixing (PDFs + text files + web content)
- Source attribution: Can you track which document answered what?
- Conflicting information: How does the system handle contradictions?

**Success Metrics**: Cross-document reasoning, source accuracy, conflict resolution
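
For source attribution, one approach is to store the originating filename in each chunk's metadata (ChromaDB collections accept a `metadatas` argument alongside documents) and to tag the context with it. The sketch below shows only the tagging step on made-up data; the `source` key, the filenames, and the `format_context_with_sources` helper are all hypothetical.

```python
def format_context_with_sources(documents, metadatas):
    """Prefix each retrieved chunk with the name of the document it came from."""
    lines = []
    for doc, meta in zip(documents, metadatas):
        source = meta.get("source", "unknown")  # "source" is an assumed metadata key
        lines.append(f"[{source}] {doc}")
    return "\n".join(lines)

# Invented retrieval results that contradict each other
docs = ["Refunds take 5 days.", "Refunds take 10 days."]
metas = [{"source": "policy_2023.pdf"}, {"source": "policy_2024.pdf"}]

context = format_context_with_sources(docs, metas)
print(context)
```

Keeping the source visible in the context makes it easier to trace an answer back to a document and to spot when two documents disagree.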

---

### ⚡ **Track 5: Real-World Application**
**The Question**: Can you build something actually useful?

**Project Ideas:**
- **Study Assistant**: Upload your course materials, create a personal tutor
- **Research Helper**: Upload papers from your field, ask comparative questions
- **Policy Bot**: Upload company/school policies, create an internal help system
- **Personal Knowledge Base**: Upload your notes, papers, articles

**Success Metrics**: Practical utility, user satisfaction, real-world accuracy

---

### 📊 **Track 6: Evaluation & Quality Analysis**
**The Question**: How do we measure if our RAG system is actually good?

**Evaluation Methods to Build:**
- **Answer quality rubric**: Rate responses on accuracy, relevance, completeness
- **Retrieval evaluation**: Check if the right chunks were found
- **Speed benchmarking**: Measure response times across configurations
- **Hallucination detection**: Identify when the model makes things up

**Success Metrics**: Systematic quality measurement, performance optimization
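
For retrieval evaluation, a simple metric is the hit rate at k: the fraction of test questions for which the known-relevant chunk appears among the top k retrieved chunks. The sketch below assumes you have hand-labeled one relevant chunk ID per question; the IDs shown are invented.

```python
def hit_rate_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of queries whose relevant chunk ID appears in the top-k retrieved IDs."""
    if not retrieved_ids:
        return 0.0
    hits = sum(
        1 for got, want in zip(retrieved_ids, relevant_ids) if want in got[:k]
    )
    return hits / len(retrieved_ids)

# Made-up data: ranked chunk IDs per question, plus the labeled correct ID
retrieved = [["4", "1", "7"], ["2", "9", "3"], ["5", "6", "8"]]
relevant = ["1", "3", "0"]

score_at_3 = hit_rate_at_k(retrieved, relevant, k=3)
score_at_1 = hit_rate_at_k(retrieved, relevant, k=1)
print(f"hit rate@3: {score_at_3:.2f}")
print(f"hit rate@1: {score_at_1:.2f}")
```

Running the same labeled questions across different chunk sizes, embedding models, or retrieval counts gives a consistent number to compare configurations.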

---

## 📝 **Documentation Tips**

As you experiment, keep track of:
- ✅ **What you tried** (specific configurations, parameters)
- ✅ **What worked** (successful approaches and why)
- ✅ **What didn't work** (failures teach us too!)
- ✅ **Surprising discoveries** (unexpected results often lead to breakthroughs)
- ✅ **Practical insights** (what would you use in a real project?)

---

## 🤝 **Collaboration Encouraged!**

- **Team up** with classmates to tackle different tracks
- **Share findings** - compare results across different approaches
- **Peer review** each other's experiments
- **Present discoveries** to the class

---

## 🌟 **Remember**

> *"The best way to understand RAG is not just to build it, but to break it, improve it, and push its boundaries."*

**Every expert started as a curious experimenter. Every breakthrough began with someone asking "What if...?"**

Ready to become a RAG researcher? Pick your track and start experimenting! 🚀