# Demo Pipeline: Query → Retrieval → Extractive Summary → Uncertainty → Answer

This notebook demonstrates the end-to-end proof-of-concept pipeline using a small synthetic demonstration corpus.

Pipeline:
1) Load demo corpus (`data/demo_corpus/`)
2) Build TF–IDF index
3) Retrieve top-k evidence items (cosine similarity)
4) Produce a query-aware extractive summary
5) Calibrate and label uncertainty (Low / Medium / High)
6) Generate a citation-grounded answer with traceable evidence

Note: The demo corpus is synthetic/simplified and does not represent the full literature set used in the study.


In [None]:
## 1 Imports & Path Setup
from pathlib import Path
from dataclasses import dataclass
from typing import List

import pandas as pd

# --- Import repository modules (final versions; no algorithms are defined in this notebook) ---
from retrieval.tfidf_index import build_tfidf_index
from retrieval.cosine_search import retrieve_top_k

from summarization import summarize_retrieved
from uncertainty import annotate_retrieval_df, overall_confidence, ConfidenceConfig


In [None]:
## 2 Load Demo Corpus
REPO_ROOT = Path("..")  # notebooks/ is one level below repo root
CORPUS_DIR = REPO_ROOT / "data" / "demo_corpus"

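# Raise explicitly instead of asserting: assertions are skipped under `python -O`.
if not CORPUS_DIR.is_dir():
    raise FileNotFoundError(f"Corpus folder not found: {CORPUS_DIR.resolve()}")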

@dataclass
class Doc:
    doc_id: str
    title: str
    text: str

def load_demo_corpus(corpus_dir: Path) -> List[Doc]:
    docs: List[Doc] = []
    for fp in sorted(corpus_dir.glob("*.txt")):
        raw = fp.read_text(encoding="utf-8").strip()
        doc_id = fp.stem
        title = fp.stem.replace("_", " ").title()
        docs.append(Doc(doc_id=doc_id, title=title, text=raw))
    if not docs:
        raise RuntimeError(f"No .txt files found under: {corpus_dir.resolve()}")
    return docs

docs = load_demo_corpus(CORPUS_DIR)
len(docs), [d.doc_id for d in docs]


In [None]:
## 3 Build TF–IDF index
texts = [d.text for d in docs]
doc_ids = [d.doc_id for d in docs]
titles = [d.title for d in docs]

index = build_tfidf_index(
    texts=texts,
    doc_ids=doc_ids,
    titles=titles,
    ngram_range=(1, 2),
    max_features=5000,
    stop_words="english",
)

index

## 4 Retrieval demo: define a query & retrieve top-k evidence

We retrieve the top-k evidence items using TF–IDF + cosine similarity.  
The retrieval output includes:
- similarity score
- a short snippet for readability
- the full text (`text`) for downstream summarization (demo corpus only)
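
As a rough illustration of what TF–IDF plus cosine-similarity retrieval computes, the sketch below uses only the standard library. It is not the repository's `build_tfidf_index` / `retrieve_top_k` implementation: the helper names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `rank_documents`), the smoothed IDF formula, and the toy documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter(term for toks in tokenized_docs for term in set(toks))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; out-of-vocabulary terms are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_documents(query, docs, k=3):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    idf = build_idf(list(tokenized.values()))
    q_vec = tfidf_vector(tokenize(query), idf)
    scored = [
        (doc_id, cosine(q_vec, tfidf_vector(toks, idf)))
        for doc_id, toks in tokenized.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy documents, invented for this sketch only (not the demo corpus).
toy_docs = {
    "doc_a": "A toy text on long-term PM2.5 exposure and COPD exacerbations.",
    "doc_b": "A toy text on short-term NO2 exposure and childhood asthma symptoms.",
    "doc_c": "A toy text on early-life lead exposure and neurodevelopment.",
}
ranking = rank_documents("PM2.5 exposure and COPD exacerbations", toy_docs, k=2)
print(ranking)
```

The repository functions additionally take options such as `ngram_range`, `max_features`, and `stop_words`, which this sketch does not model.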


In [None]:
## 5 Retrieval
query = "Does long-term PM2.5 exposure increase COPD exacerbations?"

retrieved = retrieve_top_k(
    query=query,
    index=index,
    k=3,
    include_text=True,   # ensures summarization can use `text`
    snippet_len=360
)

retrieved[["rank", "doc_id", "similarity", "title"]]


In [None]:
## 6 Show evidence cards (snippet view)
for _, r in retrieved.iterrows():
    print("\n" + "-" * 80)
    print(f"Evidence Card #{int(r['rank'])}")
    print(f"Title: {r['title']}")
    print(f"doc_id: {r['doc_id']}")
    print(f"Similarity: {r['similarity']:.4f}")
    print(f"Snippet: {r['snippet']}")


## 7 Uncertainty calibration (annotate retrieval table)

We calibrate the similarity scores and map them to qualitative confidence labels (Low / Medium / High).

This step produces:
- calibrated_score
- confidence label per evidence item
- an overall confidence label for the final answer
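
One plausible reading of this step, sketched below with the standard library only, is to min-max normalize the similarities within the retrieved set, pass them through a logistic curve with slope `a` and midpoint `b`, and bucket the result with two thresholds. The parameter names mirror the `ConfidenceConfig` fields used in this notebook, but the formula and the helper names (`calibrate`, `label`) are assumptions, not the repository's `uncertainty` module.

```python
import math


def calibrate(similarities, a=10.0, b=0.5, normalize=True):
    """Map raw similarities to (0, 1) scores with an assumed logistic curve."""
    scores = list(similarities)
    if normalize and scores:
        lo, hi = min(scores), max(scores)
        span = hi - lo
        # If every similarity is identical, fall back to the midpoint 0.5.
        scores = [(s - lo) / span if span > 0 else 0.5 for s in scores]
    return [1.0 / (1.0 + math.exp(-a * (s - b))) for s in scores]


def label(score, t_low=0.33, t_high=0.67):
    """Bucket a calibrated score into Low / Medium / High."""
    if score >= t_high:
        return "High"
    if score >= t_low:
        return "Medium"
    return "Low"


# Invented similarity values, for illustration only.
sims = [0.42, 0.25, 0.05]
calibrated = calibrate(sims)
labels = [label(s) for s in calibrated]
print(list(zip(sims, labels)))
```

Note that with per-query min-max normalization the top-ranked item always maps to the highest score, so under this assumed scheme the labels express relative rather than absolute confidence.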


In [None]:
## 8 Uncertainty labeling
# Keep the defaults, or set the parameters to the values used in the manuscript
cfg = ConfidenceConfig(
    normalize=True,
    a=10.0,
    b=0.5,
    t_low=0.33,
    t_high=0.67
)

retrieved_u = annotate_retrieval_df(
    retrieved,
    similarity_col="similarity",
    out_score_col="calibrated_score",
    out_label_col="confidence",
    config=cfg
)

retrieved_u[["rank", "doc_id", "similarity", "calibrated_score", "confidence", "title"]]


## 9 Extractive summarization

We generate a query-aware extractive summary by scoring each sentence in the retrieved evidence texts against the query and keeping the top-scoring sentences.

Outputs:
- summary_text: concatenated top sentences
- selected_sentences: traceable sentence-level evidence (doc_id, title, similarity, sentence, sent_score)
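
To make the idea of query-aware sentence selection concrete, here is a minimal sketch. It assumes a naive punctuation-based sentence splitter and Jaccard token overlap as the sentence score; the repository's `summarize_retrieved` may score sentences differently, and the names `split_sentences`, `token_set`, and `summarize` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def token_set(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def summarize(query, evidence, top_n=3):
    """Score sentences by Jaccard overlap with the query and keep the top_n.

    `evidence` is a list of (doc_id, text) pairs; each selected sentence
    keeps its doc_id so the summary stays traceable.
    """
    q = token_set(query)
    rows = []
    for doc_id, text in evidence:
        for sentence in split_sentences(text):
            s = token_set(sentence)
            union = q | s
            score = len(q & s) / len(union) if union else 0.0
            rows.append({"doc_id": doc_id, "sentence": sentence, "sent_score": score})
    selected = sorted(rows, key=lambda r: r["sent_score"], reverse=True)[:top_n]
    summary = " ".join(r["sentence"] for r in selected)
    return summary, selected


# Toy evidence texts, invented for this sketch only.
toy_evidence = [
    ("doc_a", "This toy text mentions PM2.5 exposure and COPD exacerbations. "
              "It also has an unrelated sentence."),
    ("doc_b", "Another toy text about asthma symptoms."),
]
toy_summary, toy_selected = summarize(
    "PM2.5 exposure COPD exacerbations", toy_evidence, top_n=1
)
print(toy_summary)
```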


In [None]:
## 10 Summarization
summary_text, selected_sentences = summarize_retrieved(
    query=query,
    retrieved_df=retrieved_u,
    top_n_sentences=3
)

summary_text


In [None]:
## 11 Inspect selected sentences (traceability)
selected_sentences[["doc_id", "sent_score", "similarity", "sentence"]]


## 12 Final answer

We present:
- overall confidence (derived from evidence confidence labels)
- evidence-grounded summary
- traceable supporting evidence list
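
The sketch below shows one way these three parts could be assembled from plain dictionaries. It assumes that the `method="top1"` option used in this notebook means "take the label of the best-ranked evidence item"; that interpretation, the empty-evidence fallback, and the helper names (`overall_confidence_top1`, `format_answer`) are assumptions rather than the repository's behaviour.

```python
def overall_confidence_top1(evidence):
    """Return the label of the best-ranked item (assumed meaning of "top1")."""
    if not evidence:
        return "Low"  # assumption: no evidence maps to the lowest label
    return min(evidence, key=lambda item: item["rank"])["confidence"]


def format_answer(query, summary, evidence):
    """Build a plain-text answer with a confidence label and an evidence list."""
    conf = overall_confidence_top1(evidence)
    lines = [
        f"Question: {query}",
        "",
        f"Answer Summary (Confidence: {conf})",
        summary,
        "",
        "Supporting Evidence",
    ]
    for item in sorted(evidence, key=lambda e: e["rank"]):
        lines.append(
            f"- {item['title']} (doc_id: {item['doc_id']}, "
            f"confidence: {item['confidence']})"
        )
    return "\n".join(lines)


# Toy evidence rows, invented for this sketch only.
toy_rows = [
    {"rank": 2, "doc_id": "doc_b", "title": "Doc B", "confidence": "Low"},
    {"rank": 1, "doc_id": "doc_a", "title": "Doc A", "confidence": "High"},
]
toy_answer = format_answer("A toy question?", "A toy summary.", toy_rows)
print(toy_answer)
```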


In [None]:
## 13 Final answer generation
conf = overall_confidence(retrieved_u, label_col="confidence", rank_col="rank", method="top1")

# Header lines of the answer; evidence bullets are appended below.
answer_lines = [
    f"Question: {query}",
    "",
    f"Answer Summary (Confidence: {conf})",
    summary_text,
    "",
    "Supporting Evidence",
]
for _, r in retrieved_u.sort_values("rank").iterrows():
    answer_lines.append(
        f"- {r['title']} (doc_id: {r['doc_id']}, similarity: {r['similarity']:.3f}, confidence: {r['confidence']})"
    )

print("\n".join(answer_lines))

In [None]:
## 14 Try additional queries
more_queries = [
    "Does short-term NO2 exposure worsen childhood asthma symptoms?",
    "Does early-life lead exposure affect neurodevelopment?"
]

for q in more_queries:
    print("\n" + "=" * 100)
    r = retrieve_top_k(q, index, k=3, include_text=True, snippet_len=360)
    r = annotate_retrieval_df(r, config=cfg)
    s, sel = summarize_retrieved(q, r, top_n_sentences=3)
    c = overall_confidence(r, method="top1")

    print(f"Question: {q}\n")
    print(f"Answer Summary (Confidence: {c})")
    print(s)
    print("\nSupporting Evidence")
    for _, row in r.sort_values("rank").iterrows():
        # Same evidence-line format as the final answer cell above
        print(
            f"- {row['title']} (doc_id: {row['doc_id']}, "
            f"similarity: {row['similarity']:.3f}, confidence: {row['confidence']})"
        )
