# Day 06 — RAG basics (lightweight)

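Retrieval-Augmented Generation (RAG) answers questions by first retrieving relevant passages from a knowledge base, then passing them to a language model as context. Here the knowledge base is a local CSV file.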

We will cover:
- Loading documents
- Simple vectorization with TF-IDF
- Retrieving relevant chunks
- Asking the model with context


## 1) Load documents
We’ll use a small penguin dataset (species, island, and body mass) and turn each row into a one-line text document.


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

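# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()
MODEL = "gpt-4o-mini"

penguins = pd.read_csv("data/penguins.csv")
# Drop rows missing any field used to build the document text.
penguins = penguins.dropna(subset=["species", "island", "body_mass_g"])

# Turn each row into a one-line text document for retrieval.
penguins["doc"] = (
    "Species: "
    + penguins["species"]
    + ". Island: "
    + penguins["island"]
    + ". Body mass (g): "
    + penguins["body_mass_g"].astype(str)
)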

penguins[["species", "island", "body_mass_g"]].head()
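
If `data/penguins.csv` is not at hand, the same document format can be reproduced with the standard library alone. The sketch below is a stand-in under stated assumptions, not the notebook's method: `SAMPLE_CSV` holds a few illustrative rows typed in by hand, `row_to_doc` and `REQUIRED` are hypothetical names, and only empty fields are treated as missing (pandas also recognizes markers such as `NA`). It formats body mass without the trailing `.0` that `astype(str)` produces for a float column.

```python
import csv
import io

# Illustrative rows typed in by hand; not read from the notebook's CSV.
SAMPLE_CSV = """species,island,body_mass_g
Adelie,Torgersen,3750
Adelie,Torgersen,
Gentoo,Biscoe,5000
"""

REQUIRED = ("species", "island", "body_mass_g")


def row_to_doc(row):
    """Render one CSV row as a one-line document (hypothetical helper)."""
    mass = float(row["body_mass_g"])
    return (
        f"Species: {row['species']}. "
        f"Island: {row['island']}. "
        f"Body mass (g): {mass:.0f}"
    )


# Keep only rows where every required field is non-empty,
# a rough analogue of dropna(subset=...).
rows = [
    r for r in csv.DictReader(io.StringIO(SAMPLE_CSV))
    if all(r[col] for col in REQUIRED)
]
docs = [row_to_doc(r) for r in rows]

for d in docs:
    print(d)
```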


## 2) Build a simple retriever
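We represent each document and the question as TF-IDF vectors, then rank documents by cosine similarity to the question. TF-IDF matches on shared words only, so it cannot compare numeric values: a question about the "heaviest" species is scored by word overlap, not by body mass.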


In [None]:
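# Fit TF-IDF on the documents; each row of X is one document vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(penguins["doc"])

question = "Which penguin species is the heaviest, and where is it found?"
# Vectorize the question with the same vocabulary and IDF weights.
q_vec = vectorizer.transform([question])

# Cosine similarity between the question and every document.
scores = cosine_similarity(q_vec, X).flatten()

# Indices of the three highest-scoring documents, best first.
top_idx = scores.argsort()[-3:][::-1]
context = "\n".join(penguins.iloc[top_idx]["doc"].tolist())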
context
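
To see roughly what the `TfidfVectorizer` and `cosine_similarity` calls above compute, here is a from-scratch sketch using only the standard library. It assumes raw term counts, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, L2 normalization, and tokens of two or more word characters. The function names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`, `top_k`) and the toy documents are made up for illustration and are not part of the notebook.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_idf(docs):
    # Document frequency: in how many documents each term appears.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    # Smoothed IDF, so no term gets a zero weight.
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def tfidf_vector(text, idf):
    # Terms outside the fitted vocabulary are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def top_k(question, docs, k=3):
    idf = fit_idf(docs)
    q = tfidf_vector(question, idf)
    scored = [(cosine(q, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    # Highest score first; ties broken by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in scored[:k]]


# Toy documents in the notebook's format (illustrative values).
toy_docs = [
    "Species: Adelie. Island: Torgersen. Body mass (g): 3750.0",
    "Species: Gentoo. Island: Biscoe. Body mass (g): 5000.0",
    "Species: Chinstrap. Island: Dream. Body mass (g): 3500.0",
]
ranked = top_k("Gentoo penguins on Biscoe island", toy_docs, k=2)
print(ranked)
```

The Gentoo document ranks first because it shares three terms with the question (`gentoo`, `biscoe`, `island`), while the others share only `island`.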


## 3) Ask the model with context
We pass the retrieved context to the model and instruct it to answer from that context alone.


In [None]:
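# Build a prompt that restricts the model to the retrieved context.
prompt = f"""
Answer the question using ONLY the context.
Context:
{context}

Question: {question}
"""

# A low temperature reduces randomness in the generated answer.
answer = client.responses.create(model=MODEL, input=prompt, temperature=0.2).output_text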
answer
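
The prompt assembly above can also be wrapped in a small function so it is easy to reuse and inspect without calling the API. This is a minimal sketch: `build_prompt` is a hypothetical helper, and both the numbered snippets and the extra fallback instruction are assumptions added here, not part of the notebook's prompt.

```python
def build_prompt(question, snippets):
    """Assemble a context-restricted prompt (hypothetical helper)."""
    # Number each snippet so an answer can point back to its source.
    numbered = "\n".join(
        f"[{i}] {snippet}" for i, snippet in enumerate(snippets, start=1)
    )
    return (
        "Answer the question using ONLY the context.\n"
        # Assumed addition: give the model a way out when retrieval misses.
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


example_prompt = build_prompt(
    "Where is the Gentoo penguin found?",
    [
        "Species: Gentoo. Island: Biscoe. Body mass (g): 5000.0",
        "Species: Adelie. Island: Torgersen. Body mass (g): 3750.0",
    ],
)
print(example_prompt)
```

The returned string could be passed as `input=` in the `client.responses.create` call in place of `prompt`.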


## 4) What to do next
For stronger RAG performance, explore dense embeddings in place of TF-IDF, chunking for longer documents, and reranking of the retrieved results.
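
As a starting point for chunking, here is a minimal sketch that splits long text into overlapping word windows. The penguin rows are already one line each and do not need it; it becomes relevant for longer sources such as articles. The helper name `chunk_words` and the default `size` and `overlap` values are arbitrary assumptions for illustration.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into word windows of `size` words sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, size=5, overlap=2)
for c in chunks:
    print(c)
```

Each chunk could then be indexed as its own document, in the same way each penguin row is indexed above.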
