# Simple RAG (Retrieval-Augmented Generation) with Gemini

**DATA 305 - Module 2: Prompt Pipelines and Structured Reasoning**

This notebook demonstrates a minimal RAG pipeline using `google-genai`, `sentence-transformers`, and `numpy` — no LangChain, no vector databases. By the end, you'll understand the core mechanics behind every RAG system:

1. **Chunk** a document into passages
2. **Embed** each chunk into a vector (using sentence-transformers)
3. **Retrieve** the most relevant chunks for a query
4. **Generate** a grounded answer using retrieved context (using Gemini)

## Setup

We need three packages: `google-genai`, `sentence-transformers`, and `numpy`. If you're running in Colab, run the install cell below first.

In [None]:
!pip install -q google-genai sentence-transformers numpy

In [None]:
# Replace with your own API key from Google AI Studio
my_api_key = "YOUR_API_KEY_HERE"

In [None]:
from google import genai
from sentence_transformers import SentenceTransformer
import numpy as np

client = genai.Client(api_key=my_api_key)

# Load a lightweight sentence-transformers model for embeddings
embedder = SentenceTransformer("all-MiniLM-L6-v2")

## Step 1: Load and Chunk a Document

RAG starts with **chunking**: splitting a document into smaller passages so we can retrieve only the relevant parts later.

Here we'll use a sample document defined as a string. In practice, you'd load a file from disk or the web.
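
As a rough sketch of the load-from-disk case, the standard-library `pathlib` module can read a plain-text file into a string. The helper name `load_document` and the file name `my_notes.txt` are hypothetical, and the example writes its own small file first so that it runs on its own.

```python
from pathlib import Path
import tempfile


def load_document(path):
    """Read a UTF-8 text file and return its contents without surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


# Write a small two-paragraph file to a temporary directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "my_notes.txt"  # hypothetical file name
    sample_path.write_text("First paragraph.\n\nSecond paragraph.\n", encoding="utf-8")
    loaded = load_document(sample_path)

print(loaded)
```

Assigning the returned string to `document` would let the rest of the notebook run unchanged.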

In [None]:
# A sample document about William & Mary (feel free to replace with your own text!)
document = """
The College of William & Mary, founded in 1693, is the second-oldest institution of higher
education in the United States. It was established by a royal charter issued by King William III
and Queen Mary II of England. The college is located in Williamsburg, Virginia.

William & Mary has a strong liberal arts tradition and is classified as a Public Ivy. The
university enrolls approximately 6,200 undergraduate students and 2,500 graduate students.
It is known for its rigorous academic programs, particularly in government, business, law,
and the sciences.

The campus features the Sir Christopher Wren Building, which is the oldest academic building
still in continuous use in the United States. The building was constructed between 1695 and 1700
and has survived three fires over its history. It originally housed the entire college including
classrooms, living quarters, and a chapel.

William & Mary has produced many notable alumni, including three U.S. Presidents: Thomas
Jefferson, James Monroe, and John Tyler. Other famous alumni include U.S. Supreme Court Chief
Justice John Marshall and comedian Jon Stewart. The university's motto is "Hark Upon the Gale."

The William & Mary Tribe competes in NCAA Division I athletics as a member of the Coastal
Athletic Association. The school colors are green, gold, and silver. Popular sports include
football, basketball, and gymnastics. The Tribe football team plays at Zable Stadium.

The university is home to several research centers, including the Virginia Institute of Marine
Science (VIMS), one of the largest marine science research and education centers in the country.
VIMS is located on the York River in Gloucester Point, Virginia, and conducts research on
coastal and marine ecosystems.
""".strip()

In [None]:
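def chunk_text(text, chunk_size=200, overlap=50):
    """
    Split text into overlapping chunks by character count.
    Overlap helps ensure we don't cut important context at chunk boundaries.
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise the window never advances and the loop below never ends
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    start = 0
    step = chunk_size - overlap  # how far the window advances each iteration
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end].strip()
        if chunk:  # skip windows that contain only whitespace
            chunks.append(chunk)
        if end >= len(text):
            break  # this window reached the end; another would only repeat its tail
        start += step
    return chunks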


# Try a paragraph-based chunking approach instead — split on double newlines
def chunk_by_paragraph(text):
    """Split text into chunks based on paragraph boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return paragraphs


chunks = chunk_by_paragraph(document)

print(f"Document split into {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print((chunk[:100] + "...") if len(chunk) > 100 else chunk)
    print()

## Step 2: Compute Embeddings

An **embedding** is a numerical vector that captures the meaning of a piece of text. Similar texts produce vectors that are close together in vector space.

We'll use the `all-MiniLM-L6-v2` model from [sentence-transformers](https://www.sbert.net/) to embed each chunk. This model runs locally (no API key needed for embeddings!) and produces 384-dimensional vectors.
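
To make "close together in vector space" concrete before running the real model, here is a toy sketch. The three vectors are hand-made for illustration and are not outputs of `all-MiniLM-L6-v2`; real embeddings have 384 dimensions rather than 3. The sketch uses Euclidean distance as the notion of closeness.

```python
import numpy as np

# Hand-made 3-dimensional vectors for illustration only; these are NOT model outputs.
cat = np.array([0.9, 0.1, 0.0])      # stands in for "The cat sat on the mat."
kitten = np.array([0.8, 0.2, 0.1])   # stands in for "A kitten rested on the rug."
finance = np.array([0.0, 0.1, 0.9])  # stands in for "Interest rates rose last quarter."

# Euclidean distance: a smaller value means the vectors are closer together.
dist_cat_kitten = np.linalg.norm(cat - kitten)
dist_cat_finance = np.linalg.norm(cat - finance)

print(f"cat vs kitten:  {dist_cat_kitten:.3f}")
print(f"cat vs finance: {dist_cat_finance:.3f}")
```

The two sentences about cats end up much closer to each other than either is to the finance sentence, which is the property retrieval relies on.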

In [None]:
def get_embeddings(texts):
    """Get embeddings for a list of texts using sentence-transformers."""
    return embedder.encode(texts, convert_to_numpy=True)


# Embed all chunks
chunk_embeddings = get_embeddings(chunks)

print(f"Each embedding has {chunk_embeddings[0].shape[0]} dimensions")
print(f"First embedding (first 10 values): {chunk_embeddings[0][:10]}")

## Step 3: Retrieve Relevant Chunks

Given a user query, we:
1. Embed the query using the same model
2. Compute **cosine similarity** between the query embedding and each chunk embedding
3. Return the top-k most similar chunks

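**Cosine similarity** is the cosine of the angle between two vectors. It ranges from -1 to 1, and values close to 1.0 mean the vectors point in nearly the same direction, which for embeddings indicates that the texts are semantically similar.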
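
The three steps above can also be written without a Python loop. The sketch below uses small hand-made vectors in place of real embeddings, and the name `top_k_cosine` is hypothetical. It scales every vector to unit length so that a single matrix-vector product yields all cosine similarities at once, and it assumes no vector is all zeros.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, top_k=2):
    """Return (indices, scores) of the top_k rows of doc_matrix most similar to query_vec."""
    # Scale each vector to unit length; the dot product of unit vectors is their cosine similarity.
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = doc_units @ query_unit
    # argsort is ascending, so reverse it and keep the first top_k indices.
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]


# Toy 2-dimensional vectors standing in for chunk embeddings (not model outputs).
toy_chunks = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # perpendicular to the query
    [1.0, 1.0],   # 45 degrees from the query
    [-1.0, 0.0],  # opposite direction
])
toy_query = np.array([2.0, 0.0])

indices, scores = top_k_cosine(toy_query, toy_chunks, top_k=2)
print(indices, scores)
```

Note that the query has length 2 while the first toy chunk has length 1, yet their similarity is still 1.0: cosine similarity depends only on direction, not on magnitude.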

In [None]:
def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve(query, chunks, chunk_embeddings, top_k=2):
    """Find the top_k most relevant chunks for a query."""
    query_embedding = get_embeddings([query])[0]

    similarities = [
        cosine_similarity(query_embedding, chunk_emb)
        for chunk_emb in chunk_embeddings
    ]

    # Get indices of top-k chunks sorted by similarity (highest first)
    top_indices = np.argsort(similarities)[::-1][:top_k]

    results = []
    for idx in top_indices:
        results.append({
            "chunk_index": int(idx),
            "similarity": float(similarities[idx]),
            "text": chunks[idx],
        })
    return results

In [None]:
# Try a retrieval query
query = "What presidents went to William & Mary?"
results = retrieve(query, chunks, chunk_embeddings, top_k=2)

print(f"Query: {query}\n")
for r in results:
    print(f"Chunk {r['chunk_index']} (similarity: {r['similarity']:.4f}):")
    print(r["text"])
    print()

## Step 4: Generate a Grounded Answer

Now we combine retrieval with generation. We pass the retrieved chunks as **context** in the prompt, asking Gemini to answer based only on the provided information.

This is the key idea of RAG: the model generates answers **grounded** in retrieved evidence rather than relying solely on its training data.
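
To make the idea concrete, here is a sketch of how retrieved chunks might be assembled into a prompt, with no API call involved. The function name `build_prompt`, the `[Chunk N]` labels, and the example texts and similarity scores are assumptions made for illustration; the dicts follow the shape returned by `retrieve()` in Step 3.

```python
def build_prompt(query, retrieved):
    """Assemble a grounded prompt from a question and a list of retrieved chunk dicts."""
    # Label each chunk with its index so the context blocks are easy to tell apart.
    context = "\n\n".join(
        f"[Chunk {r['chunk_index']}]\n{r['text']}" for r in retrieved
    )
    return (
        "Answer the following question using ONLY the provided context.\n"
        "If the context doesn't contain enough information, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


# Hand-written stand-ins for retrieval results; the scores are made up.
example_retrieved = [
    {"chunk_index": 3, "similarity": 0.71, "text": "Alumni include three U.S. Presidents."},
    {"chunk_index": 0, "similarity": 0.42, "text": "The college was founded in 1693."},
]

prompt = build_prompt("What presidents went to William & Mary?", example_retrieved)
print(prompt)
```

Keeping prompt construction in its own function makes it easy to inspect exactly what the model will see before spending an API call.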

In [None]:
def rag_query(query, chunks, chunk_embeddings, top_k=2):
    """Full RAG pipeline: retrieve relevant chunks, then generate an answer."""

    # Step 1: Retrieve
    retrieved = retrieve(query, chunks, chunk_embeddings, top_k=top_k)
    context = "\n\n".join([r["text"] for r in retrieved])

    # Step 2: Generate with context
    prompt = f"""Answer the following question using ONLY the provided context.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""

    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
    )

    return {
        "answer": response.text,
        "retrieved_chunks": retrieved,
    }

In [None]:
# Ask a question!
result = rag_query("What presidents went to William & Mary?", chunks, chunk_embeddings)

print("Answer:")
print(result["answer"])
print("\n--- Retrieved Context ---")
for r in result["retrieved_chunks"]:
    print(f"\nChunk {r['chunk_index']} (similarity: {r['similarity']:.4f}):")
    print(r["text"])

In [None]:
# Try another question
result = rag_query("Tell me about marine science research.", chunks, chunk_embeddings)

print("Answer:")
print(result["answer"])
print("\n--- Retrieved Context ---")
for r in result["retrieved_chunks"]:
    print(f"\nChunk {r['chunk_index']} (similarity: {r['similarity']:.4f}):")
    print(r["text"])

In [None]:
# Try a question the document can't answer
result = rag_query("What is the tuition at William & Mary?", chunks, chunk_embeddings)

print("Answer:")
print(result["answer"])
print("\n--- Retrieved Context ---")
for r in result["retrieved_chunks"]:
    print(f"\nChunk {r['chunk_index']} (similarity: {r['similarity']:.4f}):")
    print(r["text"])

## Exercises

1. **Try your own document:** Replace the `document` string with text from a Wikipedia article, a course syllabus, or any other source. How well does retrieval work?

2. **Experiment with chunking:** Try `chunk_text()` with different `chunk_size` and `overlap` values instead of `chunk_by_paragraph()`. How does chunk size affect retrieval quality?

3. **Change top_k:** What happens when you retrieve more or fewer chunks? Try `top_k=1` vs `top_k=4`.

4. **Improve the prompt:** Modify the generation prompt in `rag_query()`. Can you get the model to cite which chunk it used? Can you get it to respond in a specific format?