# Agentic AI Workshop — Part 3: RAG Chatbot with ChromaDB

This notebook demonstrates:
- Ingesting a small set of documents
- Chunking + embedding with OpenAI Embeddings
- Storing & retrieving with ChromaDB
- Building a simple RAG prompt
- Serving a Gradio chat UI that runs retrieval on every turn

> **Prereqs**
> - `pip install chromadb openai gradio tiktoken` (`tiktoken` is optional; this notebook chunks by characters and only needs it if you switch to token-aware chunking)


In [None]:
# Standard library
import os
import re
import sys
import textwrap
import uuid
from pathlib import Path
from typing import Dict, List

# Third-party
import chromadb
import gradio as gr
from chromadb.utils import embedding_functions
from openai import OpenAI

# Make the shared workshop helpers importable
sys.path.append('../utils')
from helpers import load_env

In [None]:
# Load and verify API keys
load_env(api_key_type="OPENAI_API_KEY")


In [None]:
# Initialize OpenAI client
client = OpenAI()  # Uses OPENAI_API_KEY from environment
print("Client ready.")

In [None]:
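# Configure Chroma with on-disk persistence (use chromadb.EphemeralClient() for in-memory only)
# NOTE: OUTPUTS_DIR is not defined earlier in this notebook; "outputs" is an assumed default, adjust as needed.
OUTPUTS_DIR = "outputs"
SAVE_DIR = os.path.join(OUTPUTS_DIR, "chroma_db")
chroma_client = chromadb.PersistentClient(path=SAVE_DIR)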
collection = chroma_client.get_or_create_collection(
    name="workshop_docs",
    metadata={"hnsw:space": "cosine"}
)


In [None]:

# --- Chunking utility (simple)
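def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> List[str]:
    """Split text into character windows of max_chars that overlap by `overlap` characters."""
    if not 0 <= overlap < max_chars:
        # overlap >= max_chars would make the window stop advancing (infinite loop)
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(n, start + max_chars)
        chunks.append(text[start:end])
        if end == n:
            break  # last window reached; avoids a trailing chunk made only of overlap
        start += max_chars - overlap
    return chunks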

# --- Embedding & upsert
def embed_texts(texts: List[str], model: str = "text-embedding-3-small") -> List[List[float]]:
    resp = client.embeddings.create(model=model, input=texts)
    return [d.embedding for d in resp.data]

def ingest_documents(docs: List[Dict]):
    ids, docs_texts, metadatas = [], [], []
    for d in docs:
        chunks = chunk_text(d["text"])
        for i, ch in enumerate(chunks):
            ids.append(f"{d['id']}_{i}")
            docs_texts.append(ch)
            metadatas.append({"title": d["title"], "parent_id": d["id"], "chunk": i})
    
    embs = embed_texts(docs_texts)
    collection.upsert(ids=ids, documents=docs_texts, metadatas=metadatas, embeddings=embs)
    return len(ids)


# --- Retrieval
def retrieve(query: str, k: int = 3):
    q_emb = embed_texts([query])[0]
    res = collection.query(query_embeddings=[q_emb], n_results=k, include=["documents","metadatas","distances"])
    hits = []
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
        hits.append({"text": doc, "meta": meta, "distance": float(dist)})
    return hits

# Test retrieval (run after the ingest cell below; a fresh collection has nothing to return yet)
for h in retrieve("How many days per week can I work from home?", k=2):
    print(h["meta"]["title"], "→", h["distance"])
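
The `distance` values printed above depend on the collection's `hnsw:space` setting. With `cosine`, Chroma reports cosine distance, which is 1 minus cosine similarity, so smaller values mean closer matches. A minimal pure-Python sketch of that relationship follows; the helper names `cosine_similarity` and `cosine_distance` are illustrative and not part of Chroma.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance as 1 - similarity: 0 for same direction, 2 for opposite."""
    return 1.0 - cosine_similarity(a, b)


# Toy 2-D vectors (not real embeddings) to show the range of values
same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Vector length does not matter here, only direction, which is why `[1, 0]` and `[2, 0]` have distance 0.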

In [None]:
# --- Source documents
source_docs = [
    {
        "id": "hr_policy_1",
        "title": "Remote Work Policy",
        "text": """Our company supports flexible remote work arrangements. Employees can work from home up to 3 days per week with manager approval. Remote workers must maintain core hours of 10 AM to 3 PM in their local timezone for team collaboration. All remote employees are required to have a dedicated workspace with reliable high-speed internet (minimum 25 Mbps download). Equipment such as laptops, monitors, and ergonomic chairs can be requested through IT. Monthly stipends of $50 are provided for home internet expenses. Remote workers must be available for video calls and respond to messages within 2 hours during business hours."""
    },
    {
        "id": "hr_policy_2",
        "title": "Paid Time Off (PTO) Policy",
        "text": """Full-time employees accrue 15 days of PTO annually, increasing to 20 days after 3 years of service and 25 days after 7 years. PTO accrues at 1.25 days per month for standard employees. Time off requests should be submitted at least 2 weeks in advance for periods longer than 3 days. Unused PTO can be carried over up to 5 days into the next calendar year. The company observes 10 federal holidays annually. Employees also receive 5 sick days per year that do not roll over. PTO is prorated for new hires based on their start date."""
    },
    {
        "id": "hr_policy_3",
        "title": "Health Insurance Benefits",
        "text": """The company offers comprehensive health insurance plans through BlueCross BlueShield with three tier options: Basic, Standard, and Premium. Coverage begins on the first day of the month following your start date. The company covers 80% of the premium cost for employees and 60% for dependents. The Basic plan has a $2,000 deductible with $30 copays. The Standard plan has a $1,000 deductible with $20 copays. The Premium plan has a $500 deductible with $10 copays. All plans include dental and vision coverage. Open enrollment occurs every November for the following calendar year. Life changes such as marriage or birth qualify for special enrollment periods within 30 days of the event."""
    },
    {
        "id": "hr_policy_4",
        "title": "Professional Development and Training",
        "text": """Employees are allocated $1,500 annually for professional development, including conferences, courses, certifications, and workshops. Additional funds up to $3,000 may be approved for specialized certifications relevant to your role. The company provides free access to LinkedIn Learning, Coursera, and O'Reilly online learning platforms. Employees can dedicate up to 5 hours per month during work hours for learning activities. Tuition reimbursement of up to $5,000 per year is available for degree programs related to your field. Conference attendance requires manager approval and must be submitted 6 weeks in advance. All expenses must be submitted within 30 days with itemized receipts."""
    },
    {
        "id": "hr_policy_5",
        "title": "Parental Leave Policy",
        "text": """Primary caregivers are entitled to 16 weeks of fully paid parental leave. Secondary caregivers receive 8 weeks of fully paid leave. Leave must be taken within 12 months of the child's birth or adoption. Parents can choose to take leave continuously or split it into two separate periods with manager approval. The company provides a phased return-to-work option allowing 50% schedule for 4 weeks following the leave period. During parental leave, all benefits continue including health insurance and 401k matching. New parents also receive a $1,000 stipend for childcare or baby supplies. Employees must provide 4 weeks notice when possible, except in cases of emergency or early delivery."""
    },
]


# Ingest Documents 

In [None]:
# Ingest documents
num_chunks = ingest_documents(source_docs)
print(f"Ingested {num_chunks} chunks into Chroma.")

# RAG 

In [None]:

SYSTEM = (
    "You are a helpful RAG assistant. "
    "Answer based only on the provided CONTEXT. "
    "If the answer is not in context, say you don't know."
)

USER_TEMPLATE = "QUESTION: {question}\n\nCONTEXT:\n{context}"

def build_context(hits):
    ctx = []
    for h in hits:
        title = h["meta"]["title"]
        chunk_id = h["meta"]["chunk"]
        # print(h["meta"]["chunk"])
        ctx.append(f"[{title} / chunk {chunk_id}]\n{h['text']}")
    return "\n\n".join(ctx[:3])

def rag_answer(question: str):
    hits = retrieve(question, k=3)  # keep in sync with the 3 hits used by build_context and the UI
    context = build_context(hits)
    prompt = USER_TEMPLATE.format(question=question, context=context)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
    )
    return resp.choices[0].message.content, hits

question = "How many PTO days do I get?"
print("Question: ", question)
rag_response, retrieval_chunks = rag_answer(question)
print("AI Response: ", rag_response)

In [None]:
# --- Gradio Chat UI
def chat_with_rag(message, history):
    """Answer user question using RAG."""
    answer, hits = rag_answer(message)
    
    # Append the retrieved sources, labeled with the same 0-based chunk index used in the prompt context
    sources = "\n\n**Sources:**\n" + "\n".join(
        f"- {h['meta']['title']} (chunk {h['meta']['chunk']})"
        for h in hits[:3]
    )
    
    return answer + sources

# Create chat interface
demo = gr.ChatInterface(
    fn=chat_with_rag,
    title="RAG Chatbot with ChromaDB",
    description="Ask questions about the company's HR policies",
    examples=[
        "How many PTO days do I get?",
        "What's the remote work policy?",
        "Does the company cover health insurance costs?",
        "How much can I spend on professional development?",
        "What is the parental leave policy?"
    ],
    type="messages"
)

# Launch
demo.launch()
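
`chat_with_rag` ignores `history`, so every turn is retrieved in isolation and a follow-up such as "How long is it?" has nothing to anchor on. One possible mitigation is to fold the last few user messages into the retrieval query. The sketch below assumes `history` arrives as a list of dicts with `role` and `content` keys (the shape Gradio passes with `type="messages"`); `build_search_query` is a hypothetical helper, not part of Gradio or Chroma.

```python
from typing import Dict, List


def build_search_query(message: str, history: List[Dict], max_user_turns: int = 2) -> str:
    """Prefix the current message with the most recent user messages."""
    previous = [
        turn["content"]
        for turn in history
        if turn.get("role") == "user" and isinstance(turn.get("content"), str)
    ]
    # Guard against max_user_turns == 0, since previous[-0:] would return everything
    recent = previous[-max_user_turns:] if max_user_turns > 0 else []
    return " ".join(recent + [message])


# Example conversation (made-up turns for illustration)
history = [
    {"role": "user", "content": "What is the parental leave policy?"},
    {"role": "assistant", "content": "Primary caregivers get 16 weeks."},
]
query = build_search_query("How long is it?", history)
print(query)
```

The combined string would be passed to `retrieve` in place of the raw message. Plain concatenation is crude; asking the LLM to rewrite the follow-up as a standalone question is a common alternative, at the cost of an extra call.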


# EXERCISES



EXERCISE 1: Use Your Own Documents (Intermediate)
----------------------------------------------------------
Replace the short HR policies with 2-3 longer documents (500-2000 words each):
- Work documentation, PDFs, or blog posts
- Convert to text and add to source_docs list
- Test with realistic queries about your documents

Questions:
- How does RAG perform with longer, complex documents?
- Do you need to adjust chunk size (max_chars) or overlap?
- How many chunks (k) should you retrieve?

TODO: Add your own documents and test
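
For longer documents, fixed character windows often cut sentences in half. A rough alternative, assuming a naive regex sentence splitter is good enough for your text, is to pack whole sentences into chunks of up to `max_chars`. `chunk_by_sentence` is an illustrative helper and not part of the notebook.

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_chars: int = 800) -> List[str]:
    """Greedily pack whole sentences into chunks no longer than max_chars.

    A single sentence longer than max_chars is kept whole rather than split.
    """
    # Naive split: break after '.', '!' or '?' when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "PTO accrues monthly. Requests need two weeks notice. Unused days can carry over."
chunks = chunk_by_sentence(sample, max_chars=55)
for c in chunks:
    print(len(c), c)
```

This version has no overlap between chunks. Abbreviations such as "e.g." will confuse the splitter, so treat it as a starting point.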

EXERCISE 2: Tune Retrieval Parameters (Intermediate)
----------------------------------------------------------
Experiment with these parameters:
1. Number of chunks: k=2, k=5, k=10
2. Chunk size: max_chars=400, 800, 1600
3. Overlap: overlap=0, 50, 200

Test combinations and answer:
- Which gives the best answers?
- What are the trade-offs between precision and context?

TODO: Test 3-4 configurations and compare results
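
Before spending API calls, it can help to see how each configuration changes the number of chunks to embed. The sketch below assumes a sliding-window chunker like `chunk_text` that advances by `max_chars - overlap` and stops once a window reaches the end of the text, so treat the counts as approximate for other chunkers. The 5,000-character document length is a made-up example and `count_chunks` is an illustrative helper.

```python
import math
from itertools import product


def count_chunks(n_chars: int, max_chars: int, overlap: int) -> int:
    """Number of sliding windows needed to cover n_chars characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    if n_chars <= 0:
        return 0
    if n_chars <= max_chars:
        return 1
    step = max_chars - overlap
    # One initial window, plus one more per step needed to reach the end
    return 1 + math.ceil((n_chars - max_chars) / step)


DOC_LENGTH = 5000  # hypothetical document length in characters

grid = {}
for max_chars, overlap in product([400, 800, 1600], [0, 50, 200]):
    grid[(max_chars, overlap)] = count_chunks(DOC_LENGTH, max_chars, overlap)
    print(f"max_chars={max_chars:<5} overlap={overlap:<4} chunks={grid[(max_chars, overlap)]}")
```

Smaller chunks and larger overlaps both raise the chunk count, which increases embedding cost and makes a fixed `k` cover less of the document.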

EXERCISE 3: Test Failure Cases (Beginner)
----------------------------------------------------------
Test when RAG fails:
1. Out-of-scope: "What's the weather?" "Who is the CEO?"
2. Multi-document: "Compare PTO and parental leave policies"
3. Ambiguous: "How long is it?" (what is "it"?)
4. Contradictions: "I heard we get 30 PTO days?"

Document:
- How does the system respond?
- Does it say "I don't know" or make things up?
- How could you improve robustness?

TODO: Test scenarios and identify patterns in failures
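
One way to make out-of-scope questions fail safely is to drop hits whose distance exceeds a cutoff and skip the LLM call when nothing remains. The sketch assumes hits shaped like the output of `retrieve` (dicts with a `distance` key). The 0.6 cutoff and the sample distances are placeholders for illustration, not measured or recommended values; `filter_hits` and `FALLBACK` are hypothetical names.

```python
from typing import Dict, List

FALLBACK = "I don't know based on the available documents."


def filter_hits(hits: List[Dict], max_distance: float = 0.6) -> List[Dict]:
    """Keep only hits at or below the distance cutoff (smaller means closer)."""
    return [h for h in hits if h["distance"] <= max_distance]


# Made-up hits: one plausible match and one weak match
sample_hits = [
    {"text": "...", "meta": {"title": "Paid Time Off (PTO) Policy", "chunk": 0}, "distance": 0.31},
    {"text": "...", "meta": {"title": "Remote Work Policy", "chunk": 0}, "distance": 0.82},
]

kept = filter_hits(sample_hits)
print([h["meta"]["title"] for h in kept])

# With a strict cutoff nothing survives, so the caller would return FALLBACK
# instead of sending an empty or irrelevant context to the model.
strict = filter_hits(sample_hits, max_distance=0.1)
print(FALLBACK if not strict else "call the LLM")
```

A sensible cutoff depends on the embedding model and the documents, so log distances for a few in-scope and out-of-scope questions before picking one.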
