# Day 06 — RAG basics (lightweight)

This notebook shows a **simple RAG loop**: retrieve relevant notes, then generate an answer grounded in the retrieved context.

**Concepts covered**
- Retrieval vs. generation responsibilities
- Turning open data into retrievable notes
- Prompt grounding to reduce hallucinations


In [None]:
import pandas as pd

# Load the open dataset and drop rows missing any field used in the notes.
penguins = pd.read_csv("data/penguins.csv")
penguins = penguins.dropna(subset=["species", "island", "body_mass_g", "flipper_length_mm"])

summary = (
    penguins.groupby("species")
    .agg(
        avg_body_mass=("body_mass_g", "mean"),
        avg_flipper=("flipper_length_mm", "mean"),
        islands=("island", lambda x: ", ".join(sorted(set(x)))),
    )
    .reset_index()
)

# Turn each per-species summary row into a short, retrievable text note.
notes = []
for row in summary.itertuples(index=False):
    notes.append(
        {
            "title": f"{row.species} penguins",
            "text": (
                f"Found on {row.islands}. "
                f"Average body mass is about {row.avg_body_mass:.0f} g "
                f"with flipper length around {row.avg_flipper:.0f} mm."
            ),
        }
    )

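import re


def tokenize(text):
    """Lowercase and split on non-word characters so 'found?' matches 'found'."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, docs, top_k=2):
    """Rank docs by how many terms they share with the query."""
    query_terms = tokenize(query)
    scored = []
    for doc in docs:
        # Score against title and text so species names in the title can match.
        doc_terms = tokenize(f"{doc['title']} {doc['text']}")
        scored.append((len(query_terms & doc_terms), doc))
    # sorted() is stable, so docs with equal scores keep their input order.
    ranked = sorted(scored, key=lambda x: x[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]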
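
Word-overlap scoring counts every shared word equally, so a common word such as "is" contributes as much as a distinctive one. One common mitigation, sketched here under assumptions, is to drop stop words before scoring. `STOP_WORDS`, `content_terms`, `overlap_score`, and the toy notes are hypothetical names and data introduced for illustration; they are not part of the notebook.

```python
import re

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "and", "is", "it", "of", "on", "the", "where", "which", "with"}


def content_terms(text):
    """Lowercase, split on non-word characters, and drop stop words."""
    return {t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS}


def overlap_score(query, text):
    """Count the content words shared by the query and the text."""
    return len(content_terms(query) & content_terms(text))


# Toy notes, invented for this example (not taken from the penguin data).
toy_notes = [
    "The shop is open on weekdays.",
    "Flipper length is measured in millimetres.",
]
query = "How is flipper length measured?"
scores = [overlap_score(query, note) for note in toy_notes]
print(scores)  # [0, 3]
```

Without the filter, the first toy note would score 1 purely because it shares "is" with the query.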


In [None]:
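question = "Which penguin species is the heaviest, and where is it found?"
# A comparative question needs every species note, so retrieve all of them.
context_docs = retrieve(question, notes, top_k=len(notes))
context = "\n\n".join(f"- {d['title']}: {d['text']}" for d in context_docs)
print(context)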


Use the prompt template in `prompts/rag_prompt.txt` to send `context` + `question` to your model of choice.
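
As a rough sketch of this step, the snippet below fills a template with a context string and a question. The contents of `prompts/rag_prompt.txt` are not shown in this notebook, so `EXAMPLE_TEMPLATE`, its `{context}` and `{question}` placeholders, and `build_prompt` are assumptions made for illustration.

```python
# Hypothetical template: the real contents of prompts/rag_prompt.txt may differ.
EXAMPLE_TEMPLATE = (
    "Answer the question using the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(template, context, question):
    """Fill the template's {context} and {question} placeholders."""
    return template.format(context=context, question=question)


# Placeholder values so the example runs on its own.
example_context = "- Note A: first fact.\n\n- Note B: second fact."
example_question = "What does Note A say?"
prompt = build_prompt(EXAMPLE_TEMPLATE, example_context, example_question)
print(prompt)
```

In the notebook itself, the template could be read with `pathlib.Path("prompts/rag_prompt.txt").read_text()` and filled with `context` and `question`, provided the file uses the same placeholder names.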

For better grounding, ask the model to answer **only** from the retrieved context and to cite which note it used.
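
One possible way to phrase and check that instruction is sketched below. `GROUNDING_RULES`, `cited_titles`, and the example notes and reply are hypothetical; the check is a simple substring match on note titles and does not verify that the answer's claims are actually supported by the cited note.

```python
# Hypothetical instruction text that could be prepended to the prompt.
GROUNDING_RULES = (
    "Answer ONLY from the notes in the context. "
    "If the notes do not contain the answer, reply 'Not in the notes.' "
    "End with a line of the form 'Source: <note title>'."
)


def cited_titles(answer, docs):
    """Return the titles of retrieved notes that the answer mentions."""
    lowered = answer.lower()
    return [d["title"] for d in docs if d["title"].lower() in lowered]


# Made-up notes and model reply, for illustration only.
example_docs = [
    {"title": "Note A", "text": "First fact."},
    {"title": "Note B", "text": "Second fact."},
]
example_answer = "The first fact applies.\nSource: Note A"
print(cited_titles(example_answer, example_docs))  # ['Note A']
```

An empty result is a cheap signal that the model ignored the citation instruction and the answer deserves a closer look.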
