# Document Q&A with Citations (RAG-lite)

Goal: answer questions using **only provided documents**, with **verifiable citations**.

What you’ll practice:
- Chunking + retrieval (TF‑IDF)
- Prompting for grounded answers
- Structured schema for citations + validation
- A simple fallback: 'I don’t know'


## 1. Setup and Installation

**Estimated time:** ~60–90 minutes (with exercises)

### Install
If needed, install dependencies:
```bash
pip install -U openai pydantic pandas numpy scikit-learn
```

### Environment
Set your API key:
```bash
export OPENAI_API_KEY="..."
```

> **Note:** All example data in this notebook is synthetic (safe to share in training).

In [None]:
import os

assert os.getenv('OPENAI_API_KEY'), "Set OPENAI_API_KEY in your environment"

## 2. Imports + API client

In [None]:
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from env

In [None]:
from pydantic import BaseModel, Field
from typing import List, Optional
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


## 3. Example documents (synthetic policy excerpts)

Replace these with real NYPL docs later (handbook, policies, runbooks).


In [None]:
docs = {
  "Remote Work Policy": """Remote work is permitted up to two days per week for eligible roles.
Employees must be reachable during core hours (10am–4pm) and follow security guidance for devices.
Managers may revoke remote privileges if performance or security issues arise.""",
  "Incident Response Playbook": """If a public-facing service returns repeated 5xx errors, declare an incident and page on-call.
Within 15 minutes, provide an initial status update including scope, severity, and next steps.
After mitigation, conduct a post-incident review within 5 business days.""",
  "Patron Privacy": """Patron personally identifiable information (PII) must not be shared in public channels.
When discussing tickets, redact names, emails, phone numbers, and library card numbers."""
}

pd.DataFrame([{"doc":k, "text":v[:120]+"..."} for k,v in docs.items()])

## 4. Chunking + retrieval (TF‑IDF)

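We split each document into short chunks (lines packed greedily up to about 280 characters), index the chunks with TF‑IDF, and retrieve the top-k chunks by cosine similarity to the query.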


In [None]:
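def chunk_text(text: str, max_chars: int = 280):
    """Greedily pack non-empty lines into chunks of at most max_chars characters."""
    chunks = []
    cur = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if len(cur) + len(line) + 1 <= max_chars:
            cur = (cur + " " + line).strip()
        else:
            # Flush the current chunk. Skip it when empty, which happens if the
            # very first line is already longer than max_chars.
            if cur:
                chunks.append(cur)
            # A single line longer than max_chars becomes its own oversized chunk.
            cur = line
    if cur:
        chunks.append(cur)
    return chunks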

chunks=[]
for doc_name, text in docs.items():
    for i, ch in enumerate(chunk_text(text)):
        chunks.append({"doc": doc_name, "chunk_id": f"{doc_name}::c{i}", "text": ch})

df_chunks = pd.DataFrame(chunks)
df_chunks

In [None]:
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df_chunks["text"].tolist())

def retrieve(query: str, k: int = 3):
    q = vectorizer.transform([query])
    sims = cosine_similarity(q, X).ravel()
    top = sims.argsort()[::-1][:k]
    return df_chunks.iloc[top].assign(score=sims[top])

retrieve("What is the incident update timing?", k=3)
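
To make the scoring inside `retrieve` less of a black box, here is a minimal pure-Python sketch of TF‑IDF plus cosine similarity. It assumes the smoothed IDF formula that scikit-learn uses by default, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The tokenizer is deliberately naive and keeps stop words, so the scores will not match `TfidfVectorizer(stop_words="english")` exactly. The names `tokenize`, `fit_idf`, `vectorize`, `cosine`, and `toy_chunks` are illustrative and not part of the notebook.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): lowercase alphanumeric runs, stop words kept.
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_idf(texts):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(texts)
    df = Counter()
    for t in texts:
        df.update(set(tokenize(t)))
    return {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}

def vectorize(text, idf):
    # Term frequency times IDF, L2-normalized; terms unseen at fit time are dropped.
    tf = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(term, 0.0) for term, w in a.items())

toy_chunks = [
    "Within 15 minutes, provide an initial status update.",
    "Remote work is permitted up to two days per week.",
]
idf = fit_idf(toy_chunks)
chunk_vecs = [vectorize(c, idf) for c in toy_chunks]
query_vec = vectorize("When is the status update due?", idf)
scores = [cosine(query_vec, v) for v in chunk_vecs]
print(scores)
```

The first chunk scores higher because it shares "status" and "update" with the query. The second chunk still gets a non-zero score, but only through the word "is", which is the kind of spurious match that stop-word removal is meant to prevent.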

## 5. Define an answer schema with citations

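The structured schema requires a `citations` list, and `validate_citations` rejects any chunk ID that was not in the retrieved context. It does not check that the quoted text actually appears in the cited chunk.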


In [None]:
class Citation(BaseModel):
    chunk_id: str = Field(..., description="Must be one of the provided chunk_ids")
    quote: str = Field(..., description="Short verbatim quote (<=25 words) from the chunk")
    why_relevant: str = Field(..., description="Why this quote supports the answer")

class GroundedAnswer(BaseModel):
    answer: str
    citations: List[Citation]
    uncertainty: Optional[str] = Field(None, description="If unsure, say what is missing")

def validate_citations(citations, allowed_ids):
    bad = [c.chunk_id for c in citations if c.chunk_id not in allowed_ids]
    if bad:
        raise ValueError(f"Bad chunk_ids in citations: {bad}")
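
The ID check can be extended to the quotes themselves. The sketch below flags any citation whose quote is not a substring of the cited chunk after normalizing case, whitespace, and curly quotes. It is one possible approach, not something the notebook prescribes. `CitationStub` is a stand-in for the `Citation` model so the block runs on its own, and `chunk_text_by_id` is a hypothetical lookup that, in the notebook, could be built with `dict(zip(df_chunks["chunk_id"], df_chunks["text"]))`.

```python
import re
from dataclasses import dataclass

@dataclass
class CitationStub:
    # Stand-in for the notebook's Citation model (assumption, for self-containment).
    chunk_id: str
    quote: str

def _norm(s: str) -> str:
    # Unify curly quotes, collapse whitespace, and lowercase before comparing.
    s = s.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", s).strip().lower()

def find_unsupported_quotes(citations, chunk_text_by_id):
    """Return (chunk_id, quote) pairs whose quote is not found in the cited chunk."""
    bad = []
    for c in citations:
        source = chunk_text_by_id.get(c.chunk_id, "")
        # An empty quote or an unknown chunk_id counts as unsupported.
        if not c.quote.strip() or _norm(c.quote) not in _norm(source):
            bad.append((c.chunk_id, c.quote))
    return bad

chunk_text_by_id = {
    "Incident Response Playbook::c0":
        "After mitigation, conduct a post-incident review within 5 business days.",
}
citations = [
    CitationStub("Incident Response Playbook::c0", "post-incident review within 5 business days"),
    CitationStub("Incident Response Playbook::c0", "review within 2 business days"),
]
unsupported = find_unsupported_quotes(citations, chunk_text_by_id)
print(unsupported)
```

Substring matching is strict: a quote that the model has lightly paraphrased or truncated with an ellipsis will be flagged, which is usually the desired behavior when citations must be verifiable.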

## 6. Ask a question (grounded)

We send only the retrieved chunks as context.


In [None]:
SYSTEM = """You answer questions using ONLY the provided CONTEXT.
If the answer is not in the context, say you don't know, leave citations empty, and use the uncertainty field to explain what you'd need.
Otherwise, cite the exact chunk_ids you used, each with a short verbatim quote from that chunk."""

def answer_question(query: str, k: int = 3) -> GroundedAnswer:
    retrieved = retrieve(query, k=k)
    allowed_ids = set(retrieved["chunk_id"].tolist())

    context = "\n\n".join([f"[{row.chunk_id}] {row.text}" for row in retrieved.itertuples()])
    user = f"""QUESTION: {query}

CONTEXT:
{context}
"""

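    resp = client.responses.parse(
        model="gpt-4o-2024-08-06",
        input=[{"role":"system","content":SYSTEM},{"role":"user","content":user}],
        text_format=GroundedAnswer
    )
    out = resp.output_parsed
    # output_parsed can be None (for example, if the model refuses), so fail loudly here.
    if out is None:
        raise ValueError("Model returned no parsed GroundedAnswer.")
    validate_citations(out.citations, allowed_ids)
    return out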

ans = answer_question("When do we need to do a post-incident review?", k=3)
ans
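
`answer_question` raises a `ValueError` when citation validation fails, which stops the cell. In an application you may prefer to turn that failure into an explicit "I don't know" result. The sketch below shows one way to do that. To stay self-contained it takes any callable `ask` and returns a plain dict; `answer_or_fallback`, `FALLBACK_TEXT`, and `fake_ask` are hypothetical names. In the notebook you would pass `answer_question` and build a `GroundedAnswer` instead of a dict.

```python
from typing import Callable

FALLBACK_TEXT = "I don't know based on the provided documents."

def answer_or_fallback(ask: Callable[[str], dict], query: str) -> dict:
    """Call ask(query); on a validation error, return an explicit fallback result."""
    try:
        return ask(query)
    except ValueError as err:
        # Keep the reason so a reviewer can see why the answer was withheld.
        return {
            "answer": FALLBACK_TEXT,
            "citations": [],
            "uncertainty": f"Validation failed: {err}",
        }

def fake_ask(query: str) -> dict:
    # Hypothetical stand-in that simulates a model citing a chunk it was never given.
    raise ValueError("Bad chunk_ids in citations: ['Made Up Doc::c9']")

result = answer_or_fallback(fake_ask, "What is the cafeteria menu?")
print(result["answer"])
print(result["uncertainty"])
```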

## 7. Pretty-print with citations


In [None]:
print(ans.answer)
print()
for c in ans.citations:
    print("-", c.chunk_id, ":", c.quote)

## 8. Exercises


In [None]:
# EXERCISE — SOLUTION
# Add a hard guardrail that refuses answers if `citations` is empty.

def enforce_non_empty_citations(ans: GroundedAnswer) -> GroundedAnswer:
    if not ans.citations:
        raise ValueError("No citations provided; refuse to answer.")
    return ans

ans2 = answer_question("What are core hours?", k=3)
enforce_non_empty_citations(ans2)
ans2


In [None]:

# EXERCISE — SOLUTION
# Improve chunking: split by sentences (.) and aim for ~2–3 sentences per chunk. Rebuild the index and compare retrieval quality on 2 queries.

import re

def chunk_text_sentences(text: str, max_sents: int = 3):
    # Split after sentence-ending punctuation, then group max_sents sentences per chunk.
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sents[i:i + max_sents]) for i in range(0, len(sents), max_sents)]

# Build the second index under new names so the original df_chunks / vectorizer / X
# (used by retrieve and answer_question) stay intact for comparison.
sent_chunks = []
for doc_name, text in docs.items():
    for i, ch in enumerate(chunk_text_sentences(text, max_sents=3)):
        sent_chunks.append({"doc": doc_name, "chunk_id": f"{doc_name}::s{i}", "text": ch})

df_chunks_sent = pd.DataFrame(sent_chunks)

vectorizer_sent = TfidfVectorizer(stop_words="english")
X_sent = vectorizer_sent.fit_transform(df_chunks_sent["text"].tolist())

def retrieve2(query: str, k: int = 3):
    q = vectorizer_sent.transform([query])
    sims = cosine_similarity(q, X_sent).ravel()
    top = sims.argsort()[::-1][:k]
    return df_chunks_sent.iloc[top].assign(score=sims[top])

# Show both retrievers side by side for each query.
for label, query in [("Query 1", "When do we need a post-incident review?"),
                     ("Query 2", "What are core hours for remote work?")]:
    print(label, "- line-based chunks:")
    display(retrieve(query, k=3))
    print(label, "- sentence-based chunks:")
    display(retrieve2(query, k=3))
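
Eyeballing result tables works for two queries but does not scale. A minimal sketch of a more repeatable comparison is a hit rate at k over a small set of labeled queries. The function `hit_rate_at_k`, the `labeled` list, and `toy_retriever` are hypothetical; in the notebook a retriever could be wrapped as `lambda q, k: retrieve(q, k)["doc"].tolist()`. With only three chunks in the index, `k=1` is the informative setting, since `k=3` returns every chunk.

```python
from typing import Callable, List, Tuple

def hit_rate_at_k(retriever: Callable[[str, int], List[str]],
                  labeled_queries: List[Tuple[str, str]], k: int = 1) -> float:
    """Fraction of queries whose expected doc appears among the top-k retrieved docs."""
    if not labeled_queries:
        return 0.0
    hits = sum(1 for query, expected in labeled_queries if expected in retriever(query, k))
    return hits / len(labeled_queries)

# Each entry pairs a query with the document that should answer it.
labeled = [
    ("When do we need a post-incident review?", "Incident Response Playbook"),
    ("What are core hours for remote work?", "Remote Work Policy"),
]

def toy_retriever(query: str, k: int) -> List[str]:
    # Hypothetical stand-in that ignores the query and always ranks the playbook first.
    ranking = ["Incident Response Playbook", "Remote Work Policy", "Patron Privacy"]
    return ranking[:k]

print(hit_rate_at_k(toy_retriever, labeled, k=1))  # 0.5
print(hit_rate_at_k(toy_retriever, labeled, k=2))  # 1.0
```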


In [None]:
# EXERCISE — SOLUTION
# Add a fallback: if top similarity score < 0.15, return 'I don't know' without calling the model.

def answer_question_with_threshold(query: str, k: int = 3, thresh: float = 0.15):
    retrieved = retrieve(query, k=k)
    if retrieved["score"].max() < thresh:
        return GroundedAnswer(answer="I don't know based on the provided documents.", citations=[], uncertainty="Top retrieval score too low; need more relevant documents.")
    out = answer_question(query, k=k)
    return out

print(answer_question_with_threshold("What is the cafeteria menu?", k=3))
