# Document Q&A with Citations (RAG-lite)

**Goal:** Answer questions with **verifiable citations** to the exact document passages provided.

**No GPU required.**

You’ll learn:
- Basic document chunking
- Lightweight retrieval (TF‑IDF)
- Grounded answers with a citation schema



## 1. Setup and Installation

In [None]:
!pip -q install --upgrade openai pydantic pandas scikit-learn

## 2. Imports + API client

In [None]:
import os
from dataclasses import dataclass
from typing import List, Dict, Optional

import pandas as pd
from pydantic import BaseModel, Field, confloat
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from openai import OpenAI

if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("Missing OPENAI_API_KEY. Set it and re-run.")

client = OpenAI()


## 3. Example documents (synthetic)

In your real workflow, these could be:
- internal runbooks
- policy docs
- incident playbooks

Here we keep them synthetic.


In [None]:
docs = [
    {
        "source_id": "policy-privacy-v1",
        "title": "Patron Privacy Policy (Excerpt)",
        "text": """1. Purpose
NYPL protects patron privacy and limits disclosure of personally identifying information.

2. Data handling
Staff should only access patron data necessary for service delivery. Do not export patron records to personal devices.

3. Disclosure
Do not share patron activity data except as required by law or authorized policy.

4. Retention
Operational logs containing patron identifiers should be retained only as long as necessary and then deleted per retention schedules.
"""
    },
    {
        "source_id": "runbook-search-v1",
        "title": "Search Service Runbook (Excerpt)",
        "text": """1. Symptoms
Common symptoms include 5xx errors on /search and elevated latency.

2. First checks
Check deploy status, error budget alerts, and upstream dependency health (database, cache).

3. Mitigation
If error rate is high, rollback the last deploy and scale search workers. If DB timeouts appear, reduce query load and investigate indexes.

4. Communication
Post an incident update including impact, mitigations, and next update time.
"""
    },
    {
        "source_id": "policy-access-v1",
        "title": "Staff Account Access Policy (Excerpt)",
        "text": """1. Account lockouts
If staff SSO accounts are locked, verify identity via approved channels and reset through the standard access workflow.

2. Least privilege
Grant only the minimum roles needed to perform job functions.

3. Audit
All access changes must be logged with a ticket reference and approver.
"""
    }
]

# Preview each document's metadata plus its text length (dict union `|` needs Python 3.9+)
pd.DataFrame([
    {k: v for k, v in d.items() if k != "text"} | {"chars": len(d["text"])}
    for d in docs
])


## 4. Chunking + retrieval (TF‑IDF)

This is a simple “RAG-lite” approach:
- split docs into chunks
- retrieve top-k chunks
- ask the model to answer using only those chunks


In [None]:
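def chunk_text(text: str, chunk_size: int = 280, overlap: int = 40) -> List[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    # Very simple character-based chunking (fine for demos); it can cut words mid-way.
    if not 0 <= overlap < chunk_size:
        # Without this guard, overlap >= chunk_size would never advance and loop forever.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(len(text), start + chunk_size)
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = end - overlap  # step back so consecutive chunks share some context
    return [c for c in chunks if c]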

chunks=[]
for d in docs:
    for i, c in enumerate(chunk_text(d["text"])):
        chunks.append({
            "source_id": d["source_id"],
            "title": d["title"],
            "chunk_id": f'{d["source_id"]}::chunk{i}',
            "text": c
        })

df_chunks = pd.DataFrame(chunks)
df_chunks.head()
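
Character-based chunking can cut a sentence in half, which makes short quoted evidence harder to match later. As an alternative sketch (not the chunker this notebook uses), the hypothetical `chunk_by_section` below packs blank-line-separated sections into chunks, assuming documents separate sections with blank lines the way the synthetic docs above do.

```python
import re
from typing import List


def chunk_by_section(text: str, max_chars: int = 280) -> List[str]:
    """Pack blank-line-separated sections into chunks of up to max_chars."""
    sections = [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for section in sections:
        candidate = f"{current}\n\n{section}" if current else section
        if len(candidate) <= max_chars:
            current = candidate  # still fits: keep packing
        else:
            if current:
                chunks.append(current)
            # A section longer than max_chars becomes its own (oversized) chunk.
            current = section
    if current:
        chunks.append(current)
    return chunks


sample = "1. A\naaa\n\n2. B\nbbb\n\n3. C\nccc"
small_chunks = chunk_by_section(sample, max_chars=16)
packed_chunks = chunk_by_section(sample, max_chars=20)
```

The trade-off is uneven chunk sizes: sections are never split, so one long section can exceed `max_chars`.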


In [None]:
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df_chunks["text"].tolist())

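def retrieve(query: str, k: int = 4) -> pd.DataFrame:
    """Return the k chunks most similar to the query, highest score first."""
    q = vectorizer.transform([query])
    sims = cosine_similarity(q, X).flatten()
    top_idx = sims.argsort()[::-1][:k]  # indices of the k highest scores
    out = df_chunks.iloc[top_idx].copy()
    out["score"] = sims[top_idx]
    # Chunks with score 0.0 (no shared terms) still appear if fewer than k chunks match.
    return out.sort_values("score", ascending=False)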

retrieve("What should we do if /search is returning 5xx and latency is high?")
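
To see what the `score` column measures, here is a simplified, standard-library-only sketch of TF‑IDF with cosine similarity. It uses the smoothed IDF formula and L2 normalization that scikit-learn documents as its defaults, but the tokenizer is an assumption and there is no stop-word removal, so the numbers will not match `TfidfVectorizer` exactly. All function names are illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase alphanumeric runs are "terms".
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(corpus: List[str]) -> Dict[str, float]:
    n = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(tokenize(doc)))
    # Smoothed IDF: rare terms get a higher weight than common ones.
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    counts = Counter(t for t in tokenize(text) if t in idf)  # skip unseen terms
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


corpus = [
    "Common symptoms include 5xx errors on /search and elevated latency.",
    "Grant only the minimum roles needed to perform job functions.",
    "If error rate is high, rollback the last deploy and scale search workers.",
]
idf = fit_idf(corpus)
doc_vecs = [tfidf_vector(d, idf) for d in corpus]
query_vec = tfidf_vector("search latency 5xx", idf)
scores = [cosine(query_vec, v) for v in doc_vecs]
best = max(range(len(corpus)), key=scores.__getitem__)
```

The first document shares all three query terms, the third shares only "search", and the second shares none, so its score is exactly zero.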


## 5. Define an answer schema with citations

We will force the model to produce:
- an answer
- citations that point to specific chunks we provided

Structured Outputs constrains the response to this schema, which keeps it reliably machine-readable.

In [None]:
class Citation(BaseModel):
    chunk_id: str = Field(..., description="Must be one of the provided chunk_id values.")
    source_id: str
    title: str
    quoted_evidence: str = Field(..., description="A short excerpt (<=25 words) from the chunk that supports the answer.")

class GroundedAnswer(BaseModel):
    answer: str
    citations: List[Citation]
    could_not_answer: bool = Field(..., description="True if the retrieved chunks do not contain enough info.")
    confidence: confloat(ge=0, le=1)
    followups: List[str] = Field(default_factory=list)


## 6. Ask a question (grounded)

We pass only the retrieved chunks to the model, and instruct it to cite them.

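Note: Always validate that every returned `chunk_id` comes from the set you provided. The schema only requires a string, so the model can still cite an ID that was never in the prompt.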
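
A known `chunk_id` does not guarantee that the quote is real. One possible extra check, sketched here with plain dicts standing in for `Citation` objects, is to confirm that each `quoted_evidence` appears in the cited chunk after whitespace and case normalization. The helper names `normalize` and `find_unsupported_quotes` are hypothetical.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_unsupported_quotes(
    citations: List[Dict[str, str]],
    chunk_text_by_id: Dict[str, str],
) -> List[str]:
    """Return chunk_ids whose quoted_evidence is not found in the cited chunk."""
    bad = []
    for c in citations:
        chunk = chunk_text_by_id.get(c["chunk_id"])
        quote = normalize(c["quoted_evidence"])
        # Flag unknown chunks, empty quotes, and quotes absent from the chunk text.
        if chunk is None or not quote or quote not in normalize(chunk):
            bad.append(c["chunk_id"])
    return bad


chunk_text_by_id = {
    "runbook-search-v1::chunk0": (
        "3. Mitigation\nIf error rate is high, rollback the last deploy "
        "and scale search workers."
    ),
}
good = [{"chunk_id": "runbook-search-v1::chunk0",
         "quoted_evidence": "rollback the last\ndeploy"}]
invented = [{"chunk_id": "runbook-search-v1::chunk0",
             "quoted_evidence": "restart the database"}]
```

In this notebook, the mapping could be built with `dict(zip(ctx["chunk_id"], ctx["text"]))` and the citations converted with `c.model_dump()`, assuming Pydantic v2. Exact matching is strict: a paraphrased quote is flagged even if it is faithful.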


In [None]:
SYSTEM = """You answer questions for NYPL developers.
You MUST use only the provided document chunks as your source of truth.

Rules:
- If the chunks do not contain the answer, set could_not_answer=true and ask follow-up questions.
- Every non-trivial claim must be supported by at least one citation.
- quoted_evidence must be a short excerpt copied from the chunk.
"""


def answer_with_citations(question: str, k: int = 4, model: str = "gpt-4o-mini") -> GroundedAnswer:
    ctx = retrieve(question, k=k)
    allowed_chunk_ids = set(ctx["chunk_id"].tolist())

    context_block = "\n\n".join(
        [f"[{r.chunk_id}] ({r.source_id} | {r.title})\n{r.text}" for r in ctx.itertuples()]
    )

    response = client.responses.parse(
        model=model,
        input=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"QUESTION:\n{question}\n\nCHUNKS:\n{context_block}"},
        ],
        text_format=GroundedAnswer,
    )
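    out = response.output_parsed
    if out is None:
        # No parsed payload came back (for example, the model refused); fail loudly.
        raise ValueError("Model returned no parsed output.")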

    # Validate citations
    bad = [c.chunk_id for c in out.citations if c.chunk_id not in allowed_chunk_ids]
    if bad:
        raise ValueError(f"Model returned citations to unknown chunks: {bad}")

    return out

answer_with_citations("If /search is throwing 5xx errors, what are the first checks and mitigations?")


## 7. Exercises

### EXERCISE 1: Add a 'strict mode' fallback
If validation fails (unknown chunk_id), re-run once with a stronger system instruction.
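
One possible starting point, not a full solution: a small retry wrapper that calls a function once and, if it raises `ValueError`, calls it again with a stricter prompt. The names `with_strict_retry` and `STRICT_SUFFIX` are hypothetical, and the sketch assumes you adapt `answer_with_citations` to accept the system prompt as an argument, which it does not do as written.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical extra instruction appended on the retry.
STRICT_SUFFIX = (
    "\nSTRICT MODE: cite ONLY chunk_id values that appear verbatim in CHUNKS. "
    "If unsure, set could_not_answer=true."
)


def with_strict_retry(ask: Callable[[str], T], system_prompt: str) -> T:
    """Call ask(system_prompt); on ValueError, retry exactly once in strict mode."""
    try:
        return ask(system_prompt)
    except ValueError:
        # A second failure is not caught, so it propagates to the caller.
        return ask(system_prompt + STRICT_SUFFIX)


# Stand-in for the model call: fails validation once, then succeeds.
calls = []


def fake_ask(system_prompt: str) -> str:
    calls.append(system_prompt)
    if len(calls) == 1:
        raise ValueError("Model returned citations to unknown chunks")
    return "ok"


result = with_strict_retry(fake_ask, "BASE PROMPT")
```

Passing the call in as a function keeps the retry logic testable without an API key.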

### EXERCISE 2: Add a second retrieval method
Try a keyword overlap scorer and compare top-k chunks to TF‑IDF.
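
As a hedged starting point, a keyword overlap scorer can be as small as the sketch below. It scores a chunk by the fraction of query keywords it contains. The stop-word list, tokenizer, and function names are assumptions chosen for illustration, and it works on a plain list of chunk texts rather than `df_chunks`.

```python
import re
from typing import List, Set, Tuple

# Tiny illustrative stop-word list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "if",
             "to", "of", "on", "in", "what", "do", "we"}


def keywords(text: str) -> Set[str]:
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of query keywords that also occur in the chunk (0.0 to 1.0)."""
    q = keywords(query)
    if not q:
        return 0.0
    return len(q & keywords(chunk)) / len(q)


def retrieve_by_overlap(query: str, chunk_texts: List[str], k: int = 4) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k best chunks, highest score first."""
    scored = [(i, keyword_overlap(query, t)) for i, t in enumerate(chunk_texts)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable: ties keep input order
    return scored[:k]


texts = [
    "Common symptoms include 5xx errors on /search and elevated latency.",
    "Grant only the minimum roles needed to perform job functions.",
]
top = retrieve_by_overlap("5xx errors on /search", texts, k=1)
```

Unlike TF‑IDF, this treats every keyword as equally important, which is worth keeping in mind when you compare the two rankings.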

### EXERCISE 3: Add a new document
Add a synthetic 'Retention Schedule' doc, re-index, and ask a retention question.


In [None]:
# EXERCISE STARTER CELL

pass
