OpenFDA RAG demo notebook.

# CS 5588 — Week 3 Hands-On  
## Building a RAG Product Prototype (openFDA API)

**Goal (today):** Build a *working product prototype* that answers user questions from openFDA drug labeling records (text fields) with **evidence citations**.

**What you’ll leave with:**
- A project-ready RAG pipeline (API fetch → indexing → retrieval → grounded answer)
- A short **Product Brief** inside the notebook (persona, problem, value, success metrics)
- A small **demo loop** you can show to stakeholders (prompt → answer + citations)

> This hands-on is application-first: prioritize a realistic use case and a clean demo.


## 0) Product Brief (Fill in — REQUIRED for Week 3)

* **Team / Name:** Salman Mirza, Amy Ngo, and Nithin Songala
* **Project name (working title):** Drug Label Evidence Assistant (openFDA)

### 0.1 Target user persona

* **Who will use this?** Pharmacist, clinician, or regulatory analyst who needs quick answers from official drug labeling.
* **Context + pain point:** They don’t have time to scan long label sections manually, and they need answers that are **trustworthy** and **cited** for clinical or compliance decisions.

### 0.2 Problem statement (1–2 sentences)

* Stakeholders need fast, accurate answers from drug labeling text, but the information is spread across long SPL sections. This product supports evidence-backed decision-making by retrieving the right label sections and producing grounded answers with citations.

### 0.3 Value proposition (1 sentence)

* Provides **faster time-to-answer** with **higher trust** by returning an evidence pack (label sections) and a **citation-enforced grounded answer**, refusing when evidence is missing.

### 0.4 Success metrics (pick 2–3)

* **Time-to-answer:** average < 30 seconds per stakeholder question.
* **Citation coverage:** ≥ 2 citations per answer, including the required must-cite items for Tasks 1–2.
* **Refusal accuracy:** Task 3 returns “Not enough evidence in the retrieved context.” when the labels don’t contain the requested info.

## 1) Setup (Colab)
Run installs, then imports.


In [2]:
# === Setup & Imports (Colab-friendly) ===
import os, re, glob, json, math, html
from dataclasses import dataclass
from typing import List, Dict, Any, Optional

import numpy as np
import pandas as pd

# ---- Core deps ----
!pip -q install pandas numpy scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# ---- Retrieval deps ----
!pip -q install faiss-cpu rank-bm25
import faiss
from rank_bm25 import BM25Okapi

# ---- Dense + rerank (optional) ----
# Some environments may have version conflicts. We try the import and fall back gracefully if it fails.
USE_ST = True
USE_RERANK = True

try:
    from sentence_transformers import SentenceTransformer, CrossEncoder
except Exception as e:
    USE_ST = False
    USE_RERANK = False
    print("⚠️ sentence-transformers not available in this runtime. Falling back to TF-IDF for 'dense' retrieval.")
    print("   Error:", e)

print("USE_ST:", USE_ST, "| USE_RERANK:", USE_RERANK)


USE_ST: True | USE_RERANK: True


### 1.1 System dependencies (Colab/Linux)
No extra system dependencies are required for openFDA API text ingestion.


In [3]:
# No extra system dependencies required
print('System dependencies: none required for openFDA API text.')

System dependencies: none required for openFDA API text.


### 1.2 Imports


> **Note:** Dependencies are installed and imported above. If you restart the runtime, re-run Sections 1–2.

## 2) Configure openFDA API (drug labels)
We will fetch data from the openFDA drug labeling API instead of local JSON files.

- Base endpoint: https://api.fda.gov/drug/label.json
- Query format: `search=field:term`
- Use `limit` to control results (max 1000 per API call)
- Matching records are returned under the `results` field

Recommended today: start with a narrow query and a small limit.
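
To make the query format concrete, here is a minimal sketch of how a request URL for this endpoint can be assembled. The helper names `build_label_url` and `page_urls` are hypothetical and are not part of the `openfda_rag` module; the sketch assumes the standard openFDA query parameters `search`, `limit`, `skip`, and `api_key`, and it only builds URLs (no network call is made).

```python
from urllib.parse import urlencode

BASE_URL = "https://api.fda.gov/drug/label.json"

def build_label_url(search, limit=5, skip=0, api_key=None):
    """Return the request URL for one page of drug-label results (hypothetical helper)."""
    params = {"search": search, "limit": limit, "skip": skip}
    if api_key:
        params["api_key"] = api_key  # optional; only added when a key is supplied
    return f"{BASE_URL}?{urlencode(params)}"

def page_urls(search, page_size, max_records, api_key=None):
    """Yield one URL per page until max_records is covered (hypothetical helper)."""
    for skip in range(0, max_records, page_size):
        limit = min(page_size, max_records - skip)
        yield build_label_url(search, limit=limit, skip=skip, api_key=api_key)

# Narrow query, small limit, as recommended above.
url = build_label_url("drug_interactions:caffeine", limit=5)
print(url)

pages = list(page_urls("drug_interactions:caffeine", page_size=200, max_records=500))
print(len(pages), "pages")
```

Passing the resulting URL to any HTTP client returns JSON whose matching records sit under `results`.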


In [5]:
# === Cell 2: openFDA API config ===
OPENFDA_BASE_URL = "https://api.fda.gov/drug/label.json"
OPENFDA_API_KEY = ""  # optional; get one for higher rate limits
OPENFDA_SEARCH = "drug_interactions:caffeine"  # field:term
OPENFDA_SORT = None  # e.g., "effective_time:desc"
OPENFDA_LIMIT = 200  # max 1000 per API call
OPENFDA_MAX_RECORDS = 2000  # total records across pages
OPENFDA_PAUSE_S = 0.0  # add a small delay if you hit rate limits

print("openFDA base:", OPENFDA_BASE_URL)
print("Search:", OPENFDA_SEARCH)
print("Limit:", OPENFDA_LIMIT, "| Max records:", OPENFDA_MAX_RECORDS)


openFDA base: https://api.fda.gov/drug/label.json
Search: drug_interactions:caffeine
Limit: 200 | Max records: 2000


## 3) Define 3 stakeholder questions (application-oriented)
- **Q1/Q2:** require grounded text evidence  
- **Q3:** ambiguous/missing evidence → system should say **Not enough evidence in the retrieved context.**

Also add:
- Must-cite evidence (record/section)
- Success criteria (what a good answer must include)


In [33]:
QUERIES = [
    {
        "id": "Q1",
        "question": (
            "For caffeine-containing products, what drug interactions are listed and where are they described in the label?"
        ),
        "must_cite": [
            "drug_interactions"
        ],
        "success_criteria": [
            "Summarize listed interactions and cite the drug_interactions section."
        ],
        "keywords": [
            "caffeine", "drug interactions", "drug_interactions"
        ]
    },
    {
        "id": "Q2",
        "question": (
            "What dosage and administration guidance is provided for acetaminophen, and are there any key warnings?"
        ),
        "must_cite": [
            "dosage_and_administration",
            "warnings"
        ],
        "success_criteria": [
            "Provide dosage guidance and cite dosage_and_administration.",
            "Mention warnings and cite warnings."
        ],
        "keywords": [
            "acetaminophen", "dosage", "administration", "warnings"
        ]
    },
    {
        "id": "Q3",
        "question": (
            "Do the labels provide a projected economic cost to U.S. GDP in 2050 for antimicrobial resistance? "
            "If not, say: Not enough evidence in the retrieved context."
        ),
        "must_cite": [],
        "success_criteria": [
            "Not enough evidence in the retrieved context."
        ],
        "keywords": ["GDP", "2050", "economic cost", "antimicrobial resistance"]
    }
]

for q in QUERIES:
    print(f"{q['id']}: {q['question']}")

Q1: For caffeine-containing products, what drug interactions are listed and where are they described in the label?
Q2: What dosage and administration guidance is provided for acetaminophen, and are there any key warnings?
Q3: Do the labels provide a projected economic cost to U.S. GDP in 2050 for antimicrobial resistance? If not, say: Not enough evidence in the retrieved context.


## 4) Fetch openFDA data (text fields)


In [8]:
# ---- openFDA field config ----
OPENFDA_FIELD_ALLOWLIST = [
    "active_ingredient",
    "description",
    "dosage_and_administration",
    "drug_interactions",
    "information_for_patients",
    "when_using",
    "overdosage",
    "stop_use",
    "user_safety_warnings",
    "warnings",
]
OPENFDA_FIELD_BLOCKLIST = {"spl_product_data_elements"}  # noisy by default
INCLUDE_TABLE_FIELDS = False
MAX_RECORDS = None  # set an int for faster iteration during dev
MIN_CHARS = 40

# ---- Module import (openfda_rag) ----
from openfda_rag import (
    TextChunk,
    pick_text_fields,
    derive_doc_id,
    iter_openfda_records,
    build_openfda_query,
    tokenize,
)

if "OPENFDA_SEARCH" not in globals():
    OPENFDA_SEARCH = "drug_interactions:caffeine"

# Optional: build a query from a prompt if you set OPENFDA_SEARCH = "AUTO"
if OPENFDA_SEARCH == "AUTO":
    seed_text = QUERIES[0]["question"] if "QUERIES" in globals() else "drug interactions"
    OPENFDA_SEARCH = build_openfda_query(seed_text, fields=OPENFDA_FIELD_ALLOWLIST)

records = list(
    iter_openfda_records(
        search=OPENFDA_SEARCH,
        api_key=OPENFDA_API_KEY or None,
        base_url=OPENFDA_BASE_URL,
        limit=OPENFDA_LIMIT,
        max_records=OPENFDA_MAX_RECORDS,
        sort=OPENFDA_SORT,
        pause_s=OPENFDA_PAUSE_S,
    )
)

if MAX_RECORDS:
    records = records[:MAX_RECORDS]

record_chunks = []
for i, rec in enumerate(records):
    doc_id = derive_doc_id(rec, i)
    for field, text in pick_text_fields(
        rec, OPENFDA_FIELD_ALLOWLIST, OPENFDA_FIELD_BLOCKLIST, INCLUDE_TABLE_FIELDS
    ).items():
        if len(text) < MIN_CHARS:
            continue
        chunk_id = f"{doc_id}::{field}"
        record_chunks.append(TextChunk(chunk_id, doc_id, field, text))

print("Records:", len(records))
print("Text chunks:", len(record_chunks))
if record_chunks:
    print("Sample:", record_chunks[0].chunk_id, record_chunks[0].text[:250])



## 5) Images skipped (openFDA text-only)


In [10]:
# No image ingestion for openFDA labels (text-only)
print("Image ingestion skipped (openFDA labels are text-only).")

Image ingestion skipped (openFDA labels are text-only).


### 5.1 Optional captioning (not applicable)


In [11]:
print("Captioning skipped (no images).")

Captioning skipped (no images).


## 6) Chunking (record/section vs fixed-size)
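
The cell below relies on `fixed_size_chunk` from `openfda_rag`, whose implementation is not shown in this notebook. As a rough illustration of the idea, the following sketch splits text into overlapping windows. It assumes the two numeric arguments mean window size and overlap measured in whitespace-separated words; the real helper may count characters or tokens instead, so treat `fixed_size_chunk_sketch` as a hypothetical stand-in.

```python
from typing import List

def fixed_size_chunk_sketch(text: str, size: int = 250, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

demo = " ".join(f"w{i}" for i in range(10))
demo_chunks = fixed_size_chunk_sketch(demo, size=4, overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Record/section chunks keep a whole label field together, which makes citations easy to read; fixed-size windows trade that for more uniform passage lengths at retrieval time.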


In [12]:
from openfda_rag import SubChunk, fixed_size_chunk

sub_chunks = []
for rc in record_chunks:
    for j, t in enumerate(fixed_size_chunk(rc.text, 250, 40)):
        sub_chunks.append(SubChunk(f"{rc.chunk_id}::c{j+1}", rc.doc_id, rc.field, t))

print("Record chunks:", len(record_chunks))
print("Fixed-size chunks:", len(sub_chunks))



### Optional: Build preprocessing artifacts (API)
Run this once to create the `preprocessed/` artifacts from the openFDA API query in Section 2.


In [None]:
# Optional: build preprocessing artifacts from openFDA API (skip if already built)
from openfda_rag import build_artifacts

artifacts = build_artifacts(
    api_search=OPENFDA_SEARCH,
    output_dir="preprocessed",
    field_allowlist=OPENFDA_FIELD_ALLOWLIST,
    field_blocklist=OPENFDA_FIELD_BLOCKLIST,
    include_table_fields=INCLUDE_TABLE_FIELDS,
    min_chars=MIN_CHARS,
    use_st=USE_ST,
    save=True,
    api_key=OPENFDA_API_KEY or None,
    api_base_url=OPENFDA_BASE_URL,
    api_limit=OPENFDA_LIMIT,
    api_max_records=OPENFDA_MAX_RECORDS,
    api_sort=OPENFDA_SORT,
    api_pause_s=OPENFDA_PAUSE_S,
)


### Optional: Load preprocessed artifacts
If you built `preprocessed/` from the openFDA API, you can load its artifacts here and skip Sections 4–7.


In [None]:
import numpy as np

from openfda_rag import load_artifacts

PREPROCESSED_DIR = "preprocessed"
artifacts = load_artifacts(PREPROCESSED_DIR)

record_chunks = artifacts["record_chunks"]
sub_chunks = artifacts["sub_chunks"]
faiss_A = artifacts["faiss_A"]
faiss_B = artifacts["faiss_B"]
bm25_A = artifacts["bm25_A"]
bm25_B = artifacts["bm25_B"]

TEXT_CORPUS_A = record_chunks
TEXT_CORPUS_B = sub_chunks

PREPROCESSED_LOADED = True
manifest = artifacts["manifest"]
embedder_type = (manifest.get("embedder") or {}).get("type")
embedder_model = (manifest.get("embedder") or {}).get("model")

_dim = faiss_A.d if faiss_A is not None else 1

if embedder_type == "sentence_transformers":
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer(embedder_model or "sentence-transformers/all-MiniLM-L6-v2")

    def embed_texts(texts, batch_size: int = 32):
        return embedder.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=False,
            convert_to_numpy=True,
            normalize_embeddings=True,
        )
elif embedder_type == "tfidf":
    from sklearn.preprocessing import normalize

    vectorizer = artifacts.get("vectorizer")

    def embed_texts(texts, batch_size: int = 32):
        if vectorizer is None:
            return np.zeros((len(texts), _dim), dtype=np.float32)
        X = normalize(vectorizer.transform(texts))
        return X.toarray().astype(np.float32)
else:
    def embed_texts(texts, batch_size: int = 32):
        return np.zeros((len(texts), _dim), dtype=np.float32)

print("Loaded artifacts:", len(record_chunks), "record chunks |", len(sub_chunks), "sub-chunks")


## 7) Indexing & retrieval (dense + sparse + rerank)
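
The hybrid method in this section fuses the dense and sparse result lists by rank rather than by raw score: each chunk gets `alpha * (1/dense_rank) + (1 - alpha) * (1/sparse_rank)`, and a chunk missing from one list is treated as ranked one place past the end of that list. The toy example below reproduces that arithmetic on plain chunk IDs so the ordering can be checked by hand. The function name `fuse_ranks` and the IDs are illustrative only.

```python
def fuse_ranks(dense_ids, sparse_ids, alpha=0.5):
    """Weighted reciprocal-rank fusion over two ranked lists of chunk IDs."""
    dense_rank = {cid: r for r, cid in enumerate(dense_ids, start=1)}
    sparse_rank = {cid: r for r, cid in enumerate(sparse_ids, start=1)}
    fused = {}
    for cid in set(dense_rank) | set(sparse_rank):
        # A chunk absent from one list is ranked just past that list's end.
        dr = dense_rank.get(cid, len(dense_ids) + 1)
        sr = sparse_rank.get(cid, len(sparse_ids) + 1)
        fused[cid] = alpha * (1.0 / dr) + (1 - alpha) * (1.0 / sr)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

dense = ["A", "B", "C"]   # dense retrieval order (best first)
sparse = ["B", "D", "A"]  # BM25 order (best first)
ranking = fuse_ranks(dense, sparse, alpha=0.5)
for cid, score in ranking:
    print(cid, round(score, 4))
```

With `alpha=0.5`, chunk B scores 0.5·(1/2) + 0.5·(1/1) = 0.75 and chunk A scores 0.5·(1/1) + 0.5·(1/3) ≈ 0.667, so B ranks first even though dense retrieval preferred A. Raising `alpha` shifts weight toward the dense ranking.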


In [14]:
import os, re, warnings
from typing import List
import numpy as np

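# Local tokenizer; this definition shadows the `tokenize` imported from openfda_rag in Section 4.
def tokenize(text: str) -> List[str]:
    return [t.lower() for t in re.findall(r"[a-zA-Z0-9]+", text)]

# Quiet Hugging Face logging and progress bars.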
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

warnings.filterwarnings("ignore", message="The secret `HF_TOKEN` does not exist*")
warnings.filterwarnings("ignore", message="You are sending unauthenticated requests*")

try:
    from transformers.utils import logging as hf_logging
    hf_logging.set_verbosity_error()
except Exception:
    pass

try:
    from huggingface_hub.utils import logging as hub_logging
    hub_logging.set_verbosity_error()
except Exception:
    pass

if "PREPROCESSED_LOADED" in globals() and PREPROCESSED_LOADED:
    print("Using preprocessed artifacts; skipping indexing.")
else:
    # --- Embeddings (dense retrieval) ---
    # If SentenceTransformers is available, we use it. Otherwise, we fall back to TF-IDF vectors.
    if USE_ST:
        embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

        def embed_texts(texts: List[str], batch_size: int = 32) -> np.ndarray:
            return embedder.encode(
                texts,
                batch_size=batch_size,
                show_progress_bar=False,
                convert_to_numpy=True,
                normalize_embeddings=True
            )
    else:
        tfidf_vec = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
        _tfidf_fitted = False

        def embed_texts(texts: List[str], batch_size: int = 32) -> np.ndarray:
            global _tfidf_fitted
            X = tfidf_vec.fit_transform(texts) if not _tfidf_fitted else tfidf_vec.transform(texts)
            _tfidf_fitted = True
            X = normalize(X)
            return X.toarray().astype(np.float32)

    def build_faiss_ip(vectors: np.ndarray):
        dim = vectors.shape[1]
        index = faiss.IndexFlatIP(dim)
        index.add(vectors.astype(np.float32))
        return index

    TEXT_CORPUS_A = record_chunks
    TEXT_CORPUS_B = sub_chunks

    texts_A = [c.text for c in TEXT_CORPUS_A]
    vecs_A = embed_texts(texts_A) if texts_A else np.zeros((0, 384), dtype=np.float32)
    faiss_A = build_faiss_ip(vecs_A) if len(texts_A) > 0 else None
    bm25_A = BM25Okapi([tokenize(t) for t in texts_A]) if len(texts_A) > 0 else None

    texts_B = [c.text for c in TEXT_CORPUS_B]
    vecs_B = embed_texts(texts_B) if texts_B else np.zeros((0, 384), dtype=np.float32)
    faiss_B = build_faiss_ip(vecs_B) if len(texts_B) > 0 else None
    bm25_B = BM25Okapi([tokenize(t) for t in texts_B]) if len(texts_B) > 0 else None

    print("Indexes ready.")



Indexes ready.


In [16]:
import os, warnings
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

warnings.filterwarnings("ignore", message="You are sending unauthenticated requests*")
warnings.filterwarnings("ignore", message="The secret `HF_TOKEN` does not exist*")

try:
    from transformers.utils import logging as hf_logging
    hf_logging.set_verbosity_error()
except Exception:
    pass

try:
    from huggingface_hub.utils import logging as hub_logging
    hub_logging.set_verbosity_error()
except Exception:
    pass

In [17]:
def dense_search(query: str, index, corpus, top_k: int = 5):
    if index is None or len(corpus)==0:
        return []
    qv = embed_texts([query])
    scores, idxs = index.search(qv.astype(np.float32), top_k)
    out = []
    for s, i in zip(scores[0], idxs[0]):
        if int(i) >= 0:
            out.append((float(s), corpus[int(i)]))
    return out

def sparse_search(query: str, bm25, corpus, top_k: int = 5):
    if bm25 is None or len(corpus)==0:
        return []
    scores = bm25.get_scores(tokenize(query))
    top = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), corpus[int(i)]) for i in top]

def hybrid_fuse(dense_res, sparse_res, alpha: float = 0.5, top_k: int = 5):
    def k(item): return getattr(item, "chunk_id", str(item))
    dense_rank = {k(it): r for r, (_, it) in enumerate(dense_res, start=1)}
    sparse_rank = {k(it): r for r, (_, it) in enumerate(sparse_res, start=1)}
    keys = set(dense_rank) | set(sparse_rank)
    fused = []
    for key in keys:
        dr = dense_rank.get(key, len(dense_res)+1)
        sr = sparse_rank.get(key, len(sparse_res)+1)
        score = alpha*(1.0/dr) + (1-alpha)*(1.0/sr)
        obj = next((it for _, it in dense_res if k(it)==key), None) or next((it for _, it in sparse_res if k(it)==key), None)
        fused.append((score, obj))
    fused.sort(key=lambda x: x[0], reverse=True)
    return fused[:top_k]

# --- Reranker (optional) ---
reranker = None
if USE_ST and USE_RERANK:
    try:
        reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    except Exception as e:
        reranker = None
        USE_RERANK = False
        print("⚠️ Reranker unavailable, continuing without reranking. Error:", e)


def rerank(query: str, items, get_text, top_k=5):
    if reranker is None:
        return list(items)[:top_k]

    if not items:
        return []
    scores = reranker.predict([(query, get_text(it)) for it in items])
    ranked = sorted(zip(scores, items), key=lambda x: x[0], reverse=True)
    return [it for _, it in ranked[:top_k]]

def retrieve_text(query: str, chunking: str = "record", method: str = "hybrid", top_k: int = 5, alpha: float = 0.5, use_rerank: bool = True):
    if chunking in ("record", "page"):
        corpus, index, bm25 = TEXT_CORPUS_A, faiss_A, bm25_A
    else:
        corpus, index, bm25 = TEXT_CORPUS_B, faiss_B, bm25_B

    if method == "dense":
        res = dense_search(query, index, corpus, top_k=max(10, top_k))
        items = [it for _, it in res]
    elif method == "sparse":
        res = sparse_search(query, bm25, corpus, top_k=max(10, top_k))
        items = [it for _, it in res]
    else:
        d = dense_search(query, index, corpus, top_k=max(10, top_k))
        s = sparse_search(query, bm25, corpus, top_k=max(10, top_k))
        res = hybrid_fuse(d, s, alpha=alpha, top_k=max(10, top_k))
        items = [it for _, it in res]

    if use_rerank:
        return rerank(query, items, lambda it: it.text, top_k=top_k)
    return items[:top_k]



## 8) Evidence pack + citations (text-only)


In [20]:
def cite_text(it):
    return f"[{it.chunk_id}]"

def build_evidence_pack(question: str, q_keywords=None, chunking="record", method="hybrid",
                        top_k_text=4):
    query = question if not q_keywords else question + " " + " ".join(q_keywords)
    txt = retrieve_text(query, chunking=chunking, method=method,
                        top_k=top_k_text, use_rerank=True)

    pack = []
    for it in txt:
        pack.append({"type": "text", "cite": cite_text(it), "content": it.text[:800]})
    return pack

ep = build_evidence_pack(
    QUERIES[0]["question"],
    q_keywords=QUERIES[0]["keywords"]
)

for e in ep:
    print(e["cite"], e["type"], e["content"][:120])


## 9) Grounded response (LLM/VLM) — connect Gemini/HF if available


In [30]:
# 9) Grounded response (LLM/VLM optional) — BEST FALLBACK VERSION (no external API needed)

def rag_prompt(question: str, evidence_pack: list) -> str:
    evidence_lines = [f'{e["cite"]} {e["content"]}' for e in evidence_pack]
    evidence_block = "\n\n".join(evidence_lines)
    return f"""You are a grounded assistant. Use ONLY the evidence below.
Every key claim must cite evidence like [record_id::field::c#].
If the evidence is insufficient, respond exactly:
Not enough evidence in the retrieved context.

Evidence:
{evidence_block}

Question:
{question}

Answer (with citations):
"""

def _evidence_is_insufficient(question: str, evidence_pack: list) -> bool:
    # Heuristic: too little evidence, or mostly empty/noisy snippets
    if not evidence_pack or len(evidence_pack) < 2:
        return True
    total_chars = sum(len((e.get("content") or "").strip()) for e in evidence_pack)
    if total_chars < 400:
        return True

    # Noisy text can inflate the character count; require a minimum number of alphabetic tokens overall
    blob = " ".join((e.get("content") or "") for e in evidence_pack)
    alpha_tokens = re.findall(r"[A-Za-z]{3,}", blob)
    return len(alpha_tokens) < 60

def _missing_required_terms(question: str, evidence_pack: list) -> bool:
    """
    Query-aware refusal gate:
    If the question asks for GDP/economic cost projections for 2050,
    require those specific anchors to appear in retrieved evidence.
    """
    q = question.lower()
    blob = " ".join((e.get("content") or "").lower() for e in evidence_pack)

    # If user asks about GDP, we must see "gdp" in evidence
    if "gdp" in q and "gdp" not in blob:
        return True

    # If user asks about 2050, we must see "2050" in evidence
    if "2050" in q and "2050" not in blob:
        return True

    # If question is explicitly economic/cost, require an economic signal too
    if any(t in q for t in ["economic", "economy", "cost"]):
        econ_signals = ["economic", "economy", "cost", "billion", "trillion", "usd", "$", "percent", "%"]
        if not any(s in blob for s in econ_signals):
            return True

    return False

def _sentences_from_pack(evidence_pack: list) -> list:
    """Return [(sentence, cite)] pairs from evidence pack."""
    out = []
    for e in evidence_pack:
        cite = e["cite"]
        text = (e.get("content") or "").strip()
        if not text:
            continue
        # sentence-ish splitting (works OK for label text)
        parts = re.split(r"(?<=[.!?])\s+|\n+", text)
        for s in parts:
            s = s.strip()
            if len(s) >= 40:
                out.append((s, cite))
    return out

def generate_answer(question: str, evidence_pack: Optional[list] = None, max_sentences: int = 4) -> str:
    """
    Fallback grounded answer generator:
    - Enforces refusal when evidence is insufficient or missing required terms
    - Otherwise selects high-overlap sentences and formats them with citations
    """
    if evidence_pack is None:
        return "Not enough evidence in the retrieved context."

    if _evidence_is_insufficient(question, evidence_pack) or _missing_required_terms(question, evidence_pack):
        return "Not enough evidence in the retrieved context."

    q_tokens = set(tokenize(question))
    candidates = []
    for sent, cite in _sentences_from_pack(evidence_pack):
        s_tokens = set(tokenize(sent))
        overlap = len(q_tokens & s_tokens)
        # Light boost for numeric facts (helps for % / counts)
        has_number = bool(re.search(r"\d", sent))
        score = overlap + (2 if has_number else 0)
        candidates.append((score, sent, cite))

    candidates.sort(key=lambda x: x[0], reverse=True)

    picked = []
    used = set()
    for score, sent, cite in candidates:
        if score <= 0:
            continue
        key = (sent[:80].lower(), cite)
        if key in used:
            continue
        used.add(key)
        picked.append(f"{sent} {cite}")
        if len(picked) >= max_sentences:
            break

    if not picked:
        return "Not enough evidence in the retrieved context."

    return "\n".join(picked)

## 10) Demo loop (stakeholder-facing)


In [31]:
from openfda_rag import build_artifacts, build_openfda_query


def _embed_query(query: str, embedder_type: str, embedder_model: Optional[str], vectorizer):
    if embedder_type == "sentence_transformers":
        from sentence_transformers import SentenceTransformer

        if not hasattr(_embed_query, "_cache"):
            _embed_query._cache = {}
        model_name = embedder_model or "sentence-transformers/all-MiniLM-L6-v2"
        if model_name not in _embed_query._cache:
            _embed_query._cache[model_name] = SentenceTransformer(model_name)
        embedder = _embed_query._cache[model_name]
        return embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)

    if embedder_type == "tfidf" and vectorizer is not None:
        return normalize(vectorizer.transform([query])).toarray().astype(np.float32)

    return None


def retrieve_from_api(
    prompt: str,
    keywords: Optional[List[str]] = None,
    search: Optional[str] = None,
    fields: Optional[List[str]] = None,
    chunking: str = "record",
    method: str = "hybrid",
    top_k: int = 4,
    alpha: float = 0.5,
    use_rerank: bool = True,
    api_max_records: Optional[int] = None,
):
    query_text = prompt if not keywords else prompt + " " + " ".join(keywords)
    search_query = search or build_openfda_query(
        query_text, fields=fields or OPENFDA_FIELD_ALLOWLIST
    )

    artifacts = build_artifacts(
        api_search=search_query,
        field_allowlist=OPENFDA_FIELD_ALLOWLIST,
        field_blocklist=OPENFDA_FIELD_BLOCKLIST,
        include_table_fields=INCLUDE_TABLE_FIELDS,
        min_chars=MIN_CHARS,
        use_st=USE_ST,
        save=False,
        save_vectorizer=False,
        api_key=OPENFDA_API_KEY or None,
        api_base_url=OPENFDA_BASE_URL,
        api_limit=OPENFDA_LIMIT,
        api_max_records=api_max_records or OPENFDA_MAX_RECORDS,
        api_sort=OPENFDA_SORT,
        api_pause_s=OPENFDA_PAUSE_S,
    )

    if chunking in ("record", "page"):
        corpus, index, bm25 = (
            artifacts["record_chunks"],
            artifacts["faiss_A"],
            artifacts["bm25_A"],
        )
    else:
        corpus, index, bm25 = (
            artifacts["sub_chunks"],
            artifacts["faiss_B"],
            artifacts["bm25_B"],
        )

    embedder_type = (artifacts.get("manifest", {}).get("embedder") or {}).get("type")
    embedder_model = (artifacts.get("manifest", {}).get("embedder") or {}).get("model")
    vectorizer = artifacts.get("vectorizer")

    def local_dense_search(query: str, top_k_local: int = 5):
        if index is None or len(corpus) == 0:
            return []
        qv = _embed_query(query, embedder_type, embedder_model, vectorizer)
        if qv is None:
            return []
        scores, idxs = index.search(qv.astype(np.float32), top_k_local)
        out = []
        for s, i in zip(scores[0], idxs[0]):
            if int(i) >= 0:
                out.append((float(s), corpus[int(i)]))
        return out

    def local_sparse_search(query: str, top_k_local: int = 5):
        if bm25 is None or len(corpus) == 0:
            return []
        scores = bm25.get_scores(tokenize(query))
        top = np.argsort(scores)[::-1][:top_k_local]
        return [(float(scores[i]), corpus[int(i)]) for i in top]

    if method == "dense":
        res = local_dense_search(query_text, top_k_local=max(10, top_k))
        items = [it for _, it in res]
    elif method == "sparse":
        res = local_sparse_search(query_text, top_k_local=max(10, top_k))
        items = [it for _, it in res]
    else:
        d = local_dense_search(query_text, top_k_local=max(10, top_k))
        s = local_sparse_search(query_text, top_k_local=max(10, top_k))
        res = hybrid_fuse(d, s, alpha=alpha, top_k=max(10, top_k))
        items = [it for _, it in res]

    if use_rerank:
        items = rerank(query_text, items, lambda it: it.text, top_k=top_k)
    else:
        items = items[:top_k]

    evidence_pack = [
        {"type": "text", "cite": f"[{it.chunk_id}]", "content": it.text[:800]}
        for it in items
    ]

    prompt_text = rag_prompt(prompt, evidence_pack)
    answer = generate_answer(prompt, evidence_pack, max_sentences=4)

    return {
        "search": search_query,
        "evidence_pack": evidence_pack,
        "answer": answer,
        "prompt": prompt_text,
    }


def demo_one(qobj, chunking="record", method="hybrid",
             top_k_text=4, alpha=0.5, use_rerank=True):
    return retrieve_from_api(
        prompt=qobj["question"],
        keywords=qobj.get("keywords"),
        chunking=chunking,
        method=method,
        top_k=top_k_text,
        alpha=alpha,
        use_rerank=use_rerank,
    )


for q in QUERIES:
    result = demo_one(q)
    ep = result["evidence_pack"]
    ans = result["answer"]
    print("\n=== ", q["id"], " ===")
    print("Q:", q["question"])
    print("Search:", result["search"])
    print("Must-cite:", q.get("must_cite", []))
    print("Top evidence citations:", [e["cite"] for e in ep])
    print("Answer:\n", ans)



## 11) Week 3 acceptance tests (CS 5588)
Fill in after running your demo:
- Does the evidence pack include the must-cite items for Q1/Q2?
- Does Q3 properly refuse with “Not enough evidence…”?
- Is the output understandable to your target user?
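
The first two checks can be partly automated. The sketch below is one possible approach, not part of the course template: it assumes citations follow the `[doc_id::field]` or `[doc_id::field::cN]` pattern produced in Section 8, and that `doc_id` itself never contains `::`. The function names and the sample label IDs are hypothetical.

```python
REFUSAL = "Not enough evidence in the retrieved context."

def cited_fields(evidence_pack):
    """Collect the label field named in each citation of the evidence pack."""
    fields = set()
    for item in evidence_pack:
        parts = item["cite"].strip("[]").split("::")
        if len(parts) >= 2:
            fields.add(parts[1])  # second component is the label field
    return fields

def check_query(qobj, evidence_pack, answer):
    """Return 'PASS' or 'FAIL' for one stakeholder query."""
    must_cite = qobj.get("must_cite", [])
    if not must_cite:
        # Refusal task: the answer must be exactly the refusal sentence.
        return "PASS" if answer.strip() == REFUSAL else "FAIL"
    missing = set(must_cite) - cited_fields(evidence_pack)
    return "PASS" if not missing else "FAIL"

# Toy inputs with made-up IDs, for illustration only.
q1 = {"id": "Q1", "must_cite": ["drug_interactions"]}
q3 = {"id": "Q3", "must_cite": []}
pack = [
    {"cite": "[label-001::drug_interactions]"},
    {"cite": "[label-002::warnings::c2]"},
]
print(check_query(q1, pack, "example answer"))
print(check_query(q3, pack, REFUSAL))
```

The third check (whether the output is understandable to the target user) still needs a human reviewer.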


In [34]:
ACCEPTANCE_CHECKLIST = [
    {
        "qid": "Q1",
        "must_cite_expected": "drug_interactions",
        "pass_fail": "PENDING",
        "notes": "Evidence pack should include drug_interactions content for caffeine-related labels."
    },
    {
        "qid": "Q2",
        "must_cite_expected": "dosage_and_administration + warnings",
        "pass_fail": "PENDING",
        "notes": "Evidence pack should include dosage guidance and warnings for acetaminophen labels."
    },
    {
        "qid": "Q3",
        "must_cite_expected": "(none) — should refuse",
        "pass_fail": "PENDING",
        "notes": "System should refuse exactly: 'Not enough evidence in the retrieved context.'"
    },
]
ACCEPTANCE_CHECKLIST

[{'qid': 'Q1',
  'must_cite_expected': 'drug_interactions',
  'pass_fail': 'PENDING',
  'notes': 'Evidence pack should include drug_interactions content for caffeine-related labels.'},
 {'qid': 'Q2',
  'must_cite_expected': 'dosage_and_administration + warnings',
  'pass_fail': 'PENDING',
  'notes': 'Evidence pack should include dosage guidance and warnings for acetaminophen labels.'},
 {'qid': 'Q3',
  'must_cite_expected': '(none) — should refuse',
  'pass_fail': 'PENDING',
  'notes': "System should refuse exactly: 'Not enough evidence in the retrieved context.'"}]

## 11.5 Team work items (project enhancement)

Use this hands-on to **advance your semester project**. Each team member should “own” at least one deliverable below.

**Product Lead (Applicability)**
- Update your project **persona + workflow** so the openFDA RAG module is a *core feature*, not an add-on.
- Write 3 stakeholder tasks that map to your product’s real decision points (2 require label evidence, 1 must refuse).

**Systems Lead (Integration)**
- Replace the toy query with your **project-domain openFDA queries**.
- Add **metadata fields** that matter to your domain (e.g., `openfda.product_type`, `openfda.brand_name`, `application_number`).
- Implement a clean **`retrieve_from_api()` helper** your final demo can reuse.

**Evaluation & Risk Lead (Shipping readiness)**
- Build a tiny evaluation table: *Task × Method × P@5 × R@10 × Faithfulness*.
- Add one real failure scenario + mitigation UX (warnings, “show evidence” first, or human-in-the-loop flag).
- Draft the “If we shipped this” plan: data refresh, monitoring, and governance rule.
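
For the evaluation table, P@5 and R@10 can be computed from a ranked list of retrieved chunk IDs and a hand-labeled set of relevant chunk IDs. The sketch below uses the usual definitions (hits in the top k divided by k, and hits in the top k divided by the number of relevant items); the function names and toy IDs are illustrative, and faithfulness still has to be judged separately.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for cid in retrieved[:k] if cid in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Toy example with made-up chunk IDs.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "z"}
p5 = precision_at_k(retrieved, relevant, k=5)
r10 = recall_at_k(retrieved, relevant, k=10)
print("P@5:", p5, "| R@10:", round(r10, 3))
```

Here two of the five retrieved chunks are relevant (P@5 = 0.4), and two of the three relevant chunks were found (R@10 ≈ 0.667). Running this once per task and retrieval method fills one row of the table.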

**Bonus (Optional)**
- Add a minimal UI (Gradio/Streamlit) that shows: question → evidence pack → answer with citations.


## 12) Week 3 deliverables (CS 5588)
- Product Brief completed (persona, problem, value, success metrics)
- Demo run for Q1–Q3 with citations (screenshots encouraged)
- 1 failure case + mitigation plan (risk + fix)
- Repo link submitted in the survey
