# CS 5588 — Week 2 Hands-On: Applied RAG for Product & Venture Development (Two-Step)
**Initiation (20 min, Jan 27)** → **Completion (60 min, Jan 29)**

**Submission:** Survey + GitHub  
**Due:** **Jan 29 (Thu), end of class**

## New Requirement (Important)
For **full credit (2% individual)** you must:
1) Use **your own project-aligned dataset** (not only benchmark)  
2) Add **your own explanations** for key steps

### ✅ “Cell Description” rule (same style as CS 5542)
After each **IMPORTANT** code cell, add a short Markdown **Cell Description** (2–5 sentences):
- What the cell does
- Why it matters for a **product-grade** RAG system
- Any design choices (chunk size, α, reranker, etc.)

> Treat these descriptions as **mini system documentation** (engineering + product thinking).


## Project Dataset Guide (Required for Full Credit)

### Minimum requirements
- **5–25 documents** (start small; scale later)
- Prefer **plain text** documents (`.txt`)
- Put files in a folder named: `project_data/`

### Recommended dataset types (choose one)
- Policies / guidelines / compliance docs
- Technical docs / manuals / SOPs
- Customer support FAQs / tickets (de-identified)
- Research notes / literature summaries
- Domain corpus (healthcare, cybersecurity, business, etc.)

> Benchmarks are optional, but **cannot** earn full credit by themselves.


## 0) One-Click Setup + Import Check  ✅ **IMPORTANT: Add Cell Description after running**
If you are in **Google Colab**, run the install cell below, then **Runtime → Restart session** if imports fail.


In [80]:
# CS 5588 Lab 2 — One-click dependency install (Colab)
!pip -q install -U sentence-transformers chromadb faiss-cpu scikit-learn rank-bm25 transformers accelerate

import sys, platform
print("Python:", sys.version)
print("Platform:", platform.platform())
print("✅ If imports fail later: Runtime → Restart session and run again.")


Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Platform: Linux-6.6.105+-x86_64-with-glibc2.35
✅ If imports fail later: Runtime → Restart session and run again.


### ✍️ Cell Description (Student)
Write 2–5 sentences explaining what the setup cell does and why restarting the runtime sometimes matters after pip installs.

This setup cell installs the core libraries for the RAG pipeline: sentence-transformers for embeddings, Chroma and FAISS as vector stores, rank-bm25 for lexical retrieval, scikit-learn for utilities, and transformers/accelerate for running LLMs. Installing everything in one cell up front means every later cell runs against the same set of packages, which reduces “works on my machine” problems. Because `-U` pulls the latest releases, strict reproducibility would additionally require pinned versions.
In Colab, a package that was already imported stays loaded in the running Python process, so a version upgraded by `pip` is not picked up until the runtime restarts. Restarting after this cell loads the new packages cleanly and avoids confusing import errors.
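
The section title mentions an import check, but the cell above only installs packages. A minimal sketch of such a check follows, assuming the pip-name to import-name mapping listed in `PACKAGES`; the helper `check_packages` is hypothetical and not part of the lab template. It looks packages up without importing them, so it is safe to run before a restart.

```python
from importlib import metadata, util

# pip distribution name -> import name (these differ for several packages)
PACKAGES = {
    "sentence-transformers": "sentence_transformers",
    "chromadb": "chromadb",
    "faiss-cpu": "faiss",
    "scikit-learn": "sklearn",
    "rank-bm25": "rank_bm25",
    "transformers": "transformers",
    "accelerate": "accelerate",
}

def check_packages(packages):
    """Return {pip_name: installed version, "unknown", or None if not importable}."""
    report = {}
    for dist_name, module_name in packages.items():
        # find_spec locates a top-level module without importing it
        if util.find_spec(module_name) is None:
            report[dist_name] = None
            continue
        try:
            report[dist_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            report[dist_name] = "unknown"
    return report

report = check_packages(PACKAGES)
for name, version in report.items():
    print("OK     " if version else "MISSING", name, version or "")
```

Any `MISSING` row points to a failed install; a version that differs from what `pip` just reported is a hint that the runtime needs a restart.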

# STEP 1 — INITIATION (Jan 27, 20 minutes)
**Goal:** Define the **product**, **users**, **dataset reality**, and **trust risks**.

> This is a **product milestone**, not a coding demo.


## 1A) Product Framing (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Fill in the template below like a founder/product lead.


In [81]:
product = {
  "product_name": "AI-Powered Weather & Climate Intelligence System for Personalized Decision Support",
  "target_users": "Travelers, outdoor enthusiasts, event planners, and researchers who need context-aware weather insights.",
  "core_problem": "Most weather apps only show raw forecasts and alerts, forcing users to interpret complex data themselves.",
  "why_rag_not_chatbot": "A generic chatbot cannot reliably access real-time forecasts, severe weather alerts, or historical climate data. RAG lets the system ground LLM responses in up-to-date meteorological feeds and curated climate documents, so recommendations and explanations are factual, transparent, and traceable to evidence.",
  "failure_harms_who_and_how": "If the AI hallucinates safety advice during severe events, users could be led into dangerous situations, resulting in physical injury or property damage.",
}
product


{'product_name': 'AI-Powered Weather & Climate Intelligence System for Personalized Decision Support',
 'target_users': 'Travelers, outdoor enthusiasts, event planners, and researchers who need context-aware weather insights.',
 'core_problem': 'Most weather apps only show raw forecasts and alerts, forcing users to interpret complex data themselves.',
 'why_rag_not_chatbot': 'A generic chatbot cannot reliably access real-time forecasts, severe weather alerts, or historical climate data. RAG lets the system ground LLM responses in up-to-date meteorological feeds and curated climate documents, so recommendations and explanations are factual, transparent, and traceable to evidence.',
 'failure_harms_who_and_how': 'If the AI hallucinates safety advice during severe events, users could be led into dangerous situations, resulting in physical injury or property damage.'}

### ✍️ Cell Description (Student)
Explain your product in 3–5 sentences: who the user is, what pain point exists today, and why grounded RAG helps.


## 1B) Dataset Reality Plan (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Describe where your data comes from **in the real world**.


In [82]:
dataset_plan = {
  "data_owner": "Public meteorological agencies",              # company / agency / public / internal team
  "data_sensitivity": "Public Domain / Open Government Data.",        # public / internal / regulated / confidential
  "document_types": "Disaster preparedness guidelines (PDF/Text), Regional climate summaries, Meteorological glossaries, and severe weather safety protocols.",          # policies, manuals, reports, research, etc.
  "expected_scale_in_production": "10,000+ documents (covering global regions and all disaster types).",  # e.g., 200 docs, 10k docs, etc.
  "data_reality_check_paragraph": "In a real deployment, most of the corpus would come from scraping or ingesting official meteorological documentation, "
      "public safety guides, and climate summaries from agencies like NOAA, NWS, and WMO, combined with internally written "
      "guides that translate raw data into user-friendly advice. Because this is largely public information, privacy risk is low, "
      "but we must respect terms of use, attribution, and avoid mixing in any personal user data into the retrieval corpus. "
      "For this class project, I will simulate this by creating 5–20 plain-text documents in project_data/ that summarize climate "
      "normals, severe-weather safety tips, and travel-planning guidance for a few example cities and hazard scenarios.",
}
dataset_plan


{'data_owner': 'Public meteorological agencies',
 'data_sensitivity': 'Public Domain / Open Government Data.',
 'document_types': 'Disaster preparedness guidelines (PDF/Text), Regional climate summaries, Meteorological glossaries, and severe weather safety protocols.',
 'expected_scale_in_production': '10,000+ documents (covering global regions and all disaster types).',
 'data_reality_check_paragraph': 'In a real deployment, most of the corpus would come from scraping or ingesting official meteorological documentation, public safety guides, and climate summaries from agencies like NOAA, NWS, and WMO, combined with internally written guides that translate raw data into user-friendly advice. Because this is largely public information, privacy risk is low, but we must respect terms of use, attribution, and avoid mixing in any personal user data into the retrieval corpus. For this class project, I will simulate this by creating 5–20 plain-text documents in project_data/ that summarize cli

### ✍️ Cell Description (Student)
Write 2–5 sentences describing where this data would come from in a real deployment and any privacy/regulatory constraints.

**1A (Product Framing):** This cell defines the core product framing as a simple Python dictionary so I can keep the goals, users, and risks explicit and inspectable throughout the notebook. It clarifies that the system is more than a chatbot: it must combine real-time weather, historical climate context, and personalized reasoning for planning decisions. Writing this out pushes me to think like a product owner: who is using this, what pain points exist today, and how grounding responses with RAG (instead of pure generation) directly reduces the risk of misleading or unsafe advice.

**1B (Dataset Reality Plan):** This cell specifies what my RAG corpus looks like in the real world and what I will approximate in the `project_data/` folder for the lab. Instead of relying only on synthetic or benchmark data, I plan to use text derived from real public sources like NOAA and national weather agencies, plus internally curated “how to interpret weather” docs that match the product. Thinking about data ownership and sensitivity up front is critical: even though most weather and climate docs are public, I must still respect licensing and avoid accidentally treating personal user data as retrievable documents. This planning also shapes retrieval design (e.g., document chunking, metadata) because I know I’m working with long-form guides, regional climate summaries, and hazard-specific instructions.


## 1C) User Stories + Mini Rubric (Required)  ✅ **IMPORTANT: Add Cell Description after running**
Define **3 user stories** (U1 normal, U2 high-stakes, U3 ambiguous/failure) + rubric for evidence and correctness.


In [83]:
user_stories = {
  "U1_normal": {
    "user_story": (
        "As a daily commuter, I want a quick, personalized summary of today’s weather and what it means "
        "for my clothing and commute so that I can plan my day without being surprised by rain, heat, or cold."
    ),
    "acceptable_evidence": [
      "Current and hourly forecast data for the user’s city on the requested date.",
      "Historical climate normals or recent trends for that city and time of year (temperature, precipitation).",
    ],
    "correct_answer_must_include": [
      "Concrete description of expected conditions (temperature range, precipitation chance, wind) tied to the user’s time window.",
      "Actionable recommendations (e.g., bring an umbrella, wear layers, leave earlier due to snow) grounded in the retrieved data.",
    ],
  },
  "U2_high_stakes": {
    "user_story": (
        "As a traveler planning a trip during hurricane or heavy-rain season, I want to know whether my destination is at risk "
        "of severe weather on my travel dates so that I can decide whether to reschedule, reroute, or take extra precautions."
    ),
    "acceptable_evidence": [
      "Authoritative severe-weather alerts and watches/warnings for the destination region and travel dates (e.g., NOAA/NWS, national services).",
      "Historical frequency and typical impacts of hurricanes, floods, or major storms in that region and season.",
    ],
    "correct_answer_must_include": [
      "A clear statement of whether there are active or recent severe-weather alerts and what their level/urgency is.",
      "Explicit safety guidance (e.g., consider changing plans, monitor official channels) and a reminder to verify with official sources if risk is high.",
    ],
  },
  "U3_ambiguous_failure": {
    "user_story": (
        "As a curious user, I want to ask broad questions like ‘Will climate change ruin summers in my city?’ "
        "so that I can understand long-term risks, even if the answer is uncertain or not precisely predictable."
    ),
    "acceptable_evidence": [
      "Historical climate trend summaries and projections for the region (temperature, heatwaves, precipitation).",
      "Documentation about uncertainty, model limits, and the difference between weather forecasts and climate projections.",
    ],
    "correct_answer_must_include": [
      "An explanation of uncertainty and the limits of predicting precise future conditions for specific years or days.",
      "Grounded discussion of observed trends/projections (if available) plus a safe stance when evidence is weak or conflicting (e.g., avoid definitive yes/no).",
    ],
  },
}
user_stories

{'U1_normal': {'user_story': 'As a daily commuter, I want a quick, personalized summary of today’s weather and what it means for my clothing and commute so that I can plan my day without being surprised by rain, heat, or cold.',
  'acceptable_evidence': ['Current and hourly forecast data for the user’s city on the requested date.',
   'Historical climate normals or recent trends for that city and time of year (temperature, precipitation).'],
  'correct_answer_must_include': ['Concrete description of expected conditions (temperature range, precipitation chance, wind) tied to the user’s time window.',
   'Actionable recommendations (e.g., bring an umbrella, wear layers, leave earlier due to snow) grounded in the retrieved data.']},
 'U2_high_stakes': {'user_story': 'As a traveler planning a trip during hurricane or heavy-rain season, I want to know whether my destination is at risk of severe weather on my travel dates so that I can decide whether to reschedule, reroute, or take extra pre

### ✍️ Cell Description (Student)
Explain why U2 is “high-stakes” and what the system must do to avoid harm (abstain, cite evidence, etc.).

This cell encodes three concrete user stories plus mini-rubrics that define what “good” answers must contain and what evidence is acceptable. U2 is high-stakes because incorrect or overconfident guidance about hurricanes, floods, or other severe events can directly affect physical safety and major financial decisions (e.g., traveling into a storm zone or ignoring evacuation guidance). For such queries, the system must ground answers in authoritative alerts, clearly convey uncertainty, and be willing to abstain or redirect the user to official channels instead of guessing. Capturing this in code helps me later evaluate the system’s behavior against explicit criteria rather than subjective impressions.


## 1D) Trust & Risk Table (Required)
Fill at least **3 rows**. These risks should match your product and user stories.


In [84]:
risk_table = [
  {"risk": "Hallucination (Fabricated Alerts)", "example_failure": "AI warns of a blizzard in July due to misinterpreting historical data.", "real_world_consequence": "Panic, unnecessary supply hoarding, and loss of user trust.", "safeguard_idea": "Cross-reference RAG output with real-time API verification timestamps."},
  {"risk": "Omission of Critical Safety Info", "example_failure": "User asks about flood safety; AI explains sandbags but fails to mention 'turn around, don't drown' driving risks.", "real_world_consequence": "User drives into floodwaters and drowns.", "safeguard_idea": "Force retrieval of 'Key Safety Bullet Points' from FEMA docs for all hazard queries."},
  {"risk": "Outdated Information", "example_failure": "AI provides evacuation routes from a 2010 document that are no longer valid.", "real_world_consequence": "Users get trapped on closed roads during evacuation.", "safeguard_idea": "Metadata filtering to prioritize documents updated within the last 12 months."},
]
risk_table

[{'risk': 'Hallucination (Fabricated Alerts)',
  'example_failure': 'AI warns of a blizzard in July due to misinterpreting historical data.',
  'real_world_consequence': 'Panic, unnecessary supply hoarding, and loss of user trust.',
  'safeguard_idea': 'Cross-reference RAG output with real-time API verification timestamps.'},
 {'risk': 'Omission of Critical Safety Info',
  'example_failure': "User asks about flood safety; AI explains sandbags but fails to mention 'turn around, don't drown' driving risks.",
  'real_world_consequence': 'User drives into floodwaters and drowns.',
  'safeguard_idea': "Force retrieval of 'Key Safety Bullet Points' from FEMA docs for all hazard queries."},
 {'risk': 'Outdated Information',
  'example_failure': 'AI provides evacuation routes from a 2010 document that are no longer valid.',
  'real_world_consequence': 'Users get trapped on closed roads during evacuation.',
  'safeguard_idea': 'Metadata filtering to prioritize documents updated within the last 

✅ **Step 1 Checkpoint (End of Jan 27)**
Commit (or submit) your filled templates:
- `product`, `dataset_plan`, `user_stories`, `risk_table`


This table analyzes the specific risks of applying Generative AI to weather and safety, focusing on the dangers of outdated data and omissions. It proposes engineering safeguards, such as metadata filtering and mandatory safety citations, to ensure the system prioritizes human safety over conversational fluency.
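
The "Outdated Information" safeguard can be sketched as a metadata filter applied before retrieval. This is only an illustration under stated assumptions: the `last_updated` field, the file names, and the helper `filter_recent` are hypothetical, and the sketch drops stale documents outright, which is a stricter variant of the "prioritize" idea in the table.

```python
from datetime import date, timedelta

def filter_recent(docs, today, max_age_days=365):
    """Keep docs whose 'last_updated' falls within max_age_days of `today`.

    Docs without a date are treated as stale (fail closed).
    """
    cutoff = today - timedelta(days=max_age_days)
    return [d for d in docs if d.get("last_updated") and d["last_updated"] >= cutoff]

# Hypothetical document metadata
docs_meta = [
    {"doc_id": "evac_routes_2010.txt", "last_updated": date(2010, 6, 1)},
    {"doc_id": "evac_routes_current.txt", "last_updated": date(2025, 11, 15)},
    {"doc_id": "undated_notes.txt", "last_updated": None},
]

fresh = filter_recent(docs_meta, today=date(2026, 1, 29))
print([d["doc_id"] for d in fresh])  # ['evac_routes_current.txt']
```

Failing closed on undated documents trades recall for safety, which fits evacuation-style content better than general climate background.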

# STEP 2 — COMPLETION (Jan 29, 60 minutes)
**Goal:** Build a working **product-grade** RAG pipeline:
Chunking → Keyword + Vector Retrieval → Hybrid α → Governance Rerank → Grounded Answer → Evaluation


## 2A) Project Dataset Setup (Required for Full Credit)  ✅ **IMPORTANT: Add Cell Description after running**

### Colab Upload Tips
- Left sidebar → **Files** → Upload `.txt`
- Place them into `project_data/`

This cell creates the folder and shows how many files were found.


In [85]:
import os, glob, shutil
from pathlib import Path

PROJECT_FOLDER = "project_data"
os.makedirs(PROJECT_FOLDER, exist_ok=True)

# (Optional helper) Move any .txt in current directory into project_data/
moved = 0
for fp in glob.glob("*.txt"):
    shutil.move(fp, os.path.join(PROJECT_FOLDER, os.path.basename(fp)))
    moved += 1

files = sorted(glob.glob(os.path.join(PROJECT_FOLDER, "*.txt")))
print("✅ project_data/ ready | moved:", moved, "| files:", len(files))
print("Example files:", files[:5])


✅ project_data/ ready | moved: 0 | files: 10
Example files: ['project_data/Station_10_weather.txt', 'project_data/Station_1_weather.txt', 'project_data/Station_2_weather.txt', 'project_data/Station_3_weather.txt', 'project_data/Station_4_weather.txt']


### ✍️ Cell Description (Student)
List what dataset you used, how many docs, and why they reflect your product scenario (not just a toy example).


I used a dataset of 10 plain-text documents in `project_data/`, consisting of NWS Preparedness Guides, City Climate Summaries (e.g., Seattle), and Activity Suitability Rubrics. These documents matter for the product because an API provides only raw numbers, while the text documents provide the interpretation logic (e.g., what to do during a Category 3 storm). This mirrors the real-world need to combine quantitative data with qualitative safety protocols.

## 2B) Load Documents + Build Chunks  ✅ **IMPORTANT: Add Cell Description after running**
This milestone cell loads `.txt` documents and produces chunks using either **fixed** or **semantic** chunking.


In [86]:
import re

def load_project_docs(folder="project_data", max_docs=25):
    paths = sorted(Path(folder).glob("*.txt"))[:max_docs]
    docs = []
    for p in paths:
        txt = p.read_text(encoding="utf-8", errors="ignore").strip()
        if txt:
            docs.append({"doc_id": p.name, "text": txt})
    return docs

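def fixed_chunk(text, chunk_size=900, overlap=150):
    # Character-based chunking for speed + simplicity
    if overlap >= chunk_size:
        # Otherwise the step below is <= 0 and the loop never advances
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i+chunk_size])
        i += (chunk_size - overlap)
    return [c.strip() for c in chunks if c.strip()]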

def semantic_chunk(text, max_chars=1000):
    # Paragraph-based packing
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, cur = [], ""
    for p in paras:
        if len(cur) + len(p) + 2 <= max_chars:
            cur = (cur + "\n\n" + p).strip()
        else:
            if cur: chunks.append(cur)
            cur = p
    if cur: chunks.append(cur)
    return chunks

# ---- Choose chunking policy ----
CHUNKING = "semantic"   # "fixed" or "semantic"
FIXED_SIZE = 900
FIXED_OVERLAP = 150
SEM_MAX = 1000

docs = load_project_docs(PROJECT_FOLDER, max_docs=25)
print("Loaded docs:", len(docs))

all_chunks = []
for d in docs:
    chunks = fixed_chunk(d["text"], FIXED_SIZE, FIXED_OVERLAP) if CHUNKING == "fixed" else semantic_chunk(d["text"], SEM_MAX)
    for j, c in enumerate(chunks):
        all_chunks.append({"chunk_id": f'{d["doc_id"]}::c{j}', "doc_id": d["doc_id"], "text": c})

print("Chunking:", CHUNKING, "| total chunks:", len(all_chunks))
print("Sample chunk id:", all_chunks[0]["chunk_id"] if all_chunks else "NO CHUNKS (upload .txt files first)")


Loaded docs: 10
Chunking: semantic | total chunks: 20
Sample chunk id: Station_10_weather.txt::c0


### ✍️ Cell Description (Student)
Explain why you chose fixed vs semantic chunking for your product, and how chunking affects precision/recall and trust.

I chose semantic chunking for this weather system. Weather reports often contain distinct sections (e.g., "Forecast," "Alerts," "Historical Data"). Fixed chunking might cut a sentence like "Winds will reach 100 mph" in half, separating the number from the unit or the context. Semantic chunking respects paragraph boundaries, ensuring that safety warnings and numerical data stay contextually intact, which improves retrieval precision for safety-critical queries.
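
The difference is easy to see on a tiny example. The sketch below uses condensed re-statements of the two chunkers with deliberately small sizes; the sample text and the size values are assumptions chosen only to make the split visible.

```python
import re

def fixed_chunk(text, chunk_size, overlap):
    # Fixed-width character windows that advance by (chunk_size - overlap)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def paragraph_chunk(text, max_chars):
    # Pack whole paragraphs into chunks of at most max_chars where possible
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, cur = [], ""
    for p in paras:
        if cur and len(cur) + len(p) + 2 > max_chars:
            chunks.append(cur)
            cur = p
        else:
            cur = (cur + "\n\n" + p).strip()
    if cur:
        chunks.append(cur)
    return chunks

sample = (
    "Forecast: Winds will reach 100 mph near the coast.\n\n"
    "Alerts: A hurricane warning is in effect until Friday."
)

fixed = fixed_chunk(sample, chunk_size=40, overlap=10)
para = paragraph_chunk(sample, max_chars=60)

print(repr(fixed[0]))  # ends mid-sentence, before "the coast."
print(para)            # two chunks, one per paragraph
```

The first fixed chunk stops in the middle of the forecast sentence, while the paragraph-based chunks keep the forecast and the alert as separate, complete units.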




## 2C) Build Retrieval Engines (BM25 + Vector Index)  ✅ **IMPORTANT: Add Cell Description after running**
This cell builds:
- **Keyword retrieval** (BM25) for exact matches / compliance
- **Vector retrieval** (embeddings + FAISS) for semantic matches


In [87]:
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

# ----- Keyword (BM25) -----
tokenized = [c["text"].lower().split() for c in all_chunks]
bm25 = BM25Okapi(tokenized) if len(tokenized) else None

def keyword_search(query, k=10):
    if bm25 is None:
        return []
    scores = bm25.get_scores(query.lower().split())
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(all_chunks[i], float(scores[i])) for i in idx]

# ----- Vector (Embeddings + FAISS) -----
EMB_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(EMB_MODEL_NAME)

chunk_texts = [c["text"] for c in all_chunks]
if len(chunk_texts) > 0:
    emb = embedder.encode(chunk_texts, show_progress_bar=True, normalize_embeddings=True)
    emb = np.asarray(emb, dtype="float32")

    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

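    def vector_search(query, k=10):
        q = embedder.encode([query], normalize_embeddings=True).astype("float32")
        # FAISS pads results with index -1 when k exceeds the number of stored vectors,
        # and all_chunks[-1] would silently return the last chunk
        k = min(k, index.ntotal)
        scores, idx = index.search(q, k)
        out = [(all_chunks[int(i)], float(s)) for s, i in zip(scores[0], idx[0]) if i >= 0]
        return out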
    print("✅ Vector index built | chunks:", len(all_chunks), "| dim:", emb.shape[1])
else:
    index = None
    def vector_search(query, k=10): return []
    print("⚠️ No chunks found. Upload .txt files to project_data/ and rerun.")


✅ Vector index built | chunks: 20 | dim: 384


### ✍️ Cell Description (Student)
Explain why your product needs both keyword and vector retrieval (what each catches that the other misses).


This product requires both retrieval methods. BM25 (Keyword) is essential for exact matches on specific entities like city names ("Seattle") or technical classifications ("Category 3"). Vector Search is necessary for capturing intent; for example, if a user asks about "bad weather for driving," vector search can match that to "heavy precipitation" or "flash flood," even if the word "bad" doesn't appear in the document.
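
One caveat for the exact-match claim: whitespace tokenization (`.lower().split()`) leaves punctuation attached to tokens, so "Seattle," in a document does not match "seattle" in a query. The sketch below compares it with a simple regex tokenizer. The helper names and sample strings are hypothetical, and the ASCII-only pattern is an assumption that would need widening for non-English text.

```python
import re

def split_tokens(text):
    # Whitespace tokenization: punctuation stays attached to words
    return text.lower().split()

def word_tokens(text):
    # Keep runs of ASCII letters/digits; drop punctuation and split hyphenated words
    return re.findall(r"[a-z0-9]+", text.lower())

chunk = "Seattle, WA: Flash-flood warning. Category 3 winds expected."
query = "Seattle flood warning"

hits_split = [t for t in split_tokens(query) if t in split_tokens(chunk)]
hits_word = [t for t in word_tokens(query) if t in word_tokens(chunk)]

print(hits_split)  # []
print(hits_word)   # ['seattle', 'flood', 'warning']
```

If the same tokenizer is applied to both the chunks and the query, a keyword index built on `word_tokens` would match all three query terms where the whitespace version matches none.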

## 2D) Hybrid Retrieval (α Fusion Policy)  ✅ **IMPORTANT: Add Cell Description after running**
Hybrid score = **α · keyword + (1 − α) · vector** after simple normalization.

Try α ∈ {0.2, 0.5, 0.8} and justify your choice.


In [88]:
def minmax_norm(pairs):
    scores = np.array([s for _, s in pairs], dtype="float32") if pairs else np.array([], dtype="float32")
    if len(scores) == 0:
        return []
    mn, mx = float(scores.min()), float(scores.max())
    if mx - mn < 1e-8:
        return [(c, 1.0) for c, _ in pairs]
    return [(c, float((s - mn) / (mx - mn))) for (c, s) in pairs]

def hybrid_search(query, k_kw=10, k_vec=10, alpha=0.5, k_out=10):
    kw = keyword_search(query, k_kw)
    vc = vector_search(query, k_vec)
    kw_n = {c["chunk_id"]: s for c, s in minmax_norm(kw)}
    vc_n = {c["chunk_id"]: s for c, s in minmax_norm(vc)}
    # Map chunk_id -> chunk once, instead of scanning all_chunks for every id
    by_id = {c["chunk_id"]: c for c, _ in kw + vc}

    fused = []
    for cid in set(kw_n) | set(vc_n):
        # A chunk missing from one retriever contributes 0 for that retriever
        s = alpha * kw_n.get(cid, 0.0) + (1 - alpha) * vc_n.get(cid, 0.0)
        fused.append((by_id[cid], float(s)))

    fused.sort(key=lambda x: x[1], reverse=True)
    return fused[:k_out]

ALPHA = 0.5  # try 0.2 / 0.5 / 0.8


### ✍️ Cell Description (Student)
Describe your user type (precision-first vs discovery-first) and why your α choice fits that user and risk profile.


My target users are a mix of precision-first (e.g., event organizers checking severe weather) and discovery-first (e.g., travelers or students exploring climate patterns). I chose α = 0.5 because it gives exact matches on critical terms like “warning”, “thunderstorm”, or city names the same weight as semantic matches, which surface when users ask more loosely worded questions about “typical weather” or “how unusual” conditions are. For a safety-relevant product, I don’t want to rely only on exact keywords or only on semantics, so this balanced fusion reduces the chance of missing hazard-specific docs while still answering more conceptual climate questions. In a production system, α could be tuned per user profile.
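
To see how α changes the ranking, here is a small worked example of the fusion formula in plain Python. The chunk ids and raw scores are made up for illustration, and `minmax` and `fuse` are simplified stand-ins for the notebook's functions.

```python
def minmax(scores):
    """Min-max normalize a {chunk_id: score} dict to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi - lo < 1e-8:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(kw_scores, vec_scores, alpha):
    """Rank chunk ids by alpha * keyword + (1 - alpha) * vector."""
    kw_n, vec_n = minmax(kw_scores), minmax(vec_scores)
    ids = set(kw_n) | set(vec_n)
    fused = {i: alpha * kw_n.get(i, 0.0) + (1 - alpha) * vec_n.get(i, 0.0) for i in ids}
    # Sort by score (descending), then by id so ties are deterministic
    return sorted(fused, key=lambda i: (-fused[i], i))

# Hypothetical raw scores: BM25 favors the alert chunk, embeddings favor the climate chunk
kw = {"alerts::c0": 6.0, "climate::c1": 1.0, "tips::c2": 0.0}
vec = {"alerts::c0": 0.30, "climate::c1": 0.70, "tips::c2": 0.20}

rankings = {alpha: fuse(kw, vec, alpha) for alpha in (0.2, 0.5, 0.8)}
for alpha, order in rankings.items():
    print(alpha, order)
```

With these numbers, α = 0.2 ranks the climate chunk first, while α = 0.5 and α = 0.8 rank the alert chunk first, which shows how the fusion weight decides whether an exact-term match or a semantic match wins.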

## 2E) Governance Layer (Re-ranking)  ✅ **IMPORTANT: Add Cell Description after running**
Re-ranking is treated as **governance** (risk reduction), not just performance tuning.


In [89]:
from sentence_transformers import CrossEncoder

RERANK = True
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
reranker = CrossEncoder(RERANK_MODEL) if RERANK else None

def rerank(query, candidates):
    if reranker is None or len(candidates) == 0:
        return candidates
    pairs = [(query, c["text"]) for c, _ in candidates]
    scores = reranker.predict(pairs)
    out = [(c, float(s)) for (c, _), s in zip(candidates, scores)]
    out.sort(key=lambda x: x[1], reverse=True)
    return out

print("✅ Reranker:", RERANK_MODEL if RERANK else "OFF")


✅ Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2


### ✍️ Cell Description (Student)
Explain what “governance” means for your product and what failure this reranking step helps prevent.

Here, re-ranking is treated as part of governance, not just a performance tweak: it controls which evidence ultimately shapes the model’s answer. The cross-encoder checks each candidate chunk against the full query and promotes the ones that are truly relevant, which helps prevent near-duplicate or marginally related content from crowding out the best safety or climate explanation. For a weather product, this reduces the chance that a generic “outdoor tips” paragraph ends up above a specific “flash flood safety” paragraph when the user explicitly asks about floods at an event venue. Governance in this context means controlling information flow to the LLM so it is less likely to hallucinate or underplay risks due to weak or off-topic context.
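
One way to make this governance role explicit is a score gate between the reranker and the LLM. The sketch below is an assumption rather than part of the lab template: the helper `gate_evidence`, the threshold of 0.0, and the example scores are hypothetical, and the right threshold depends on the score scale of the reranker in use.

```python
def gate_evidence(ranked, min_score=0.0, max_chunks=3):
    """Keep at most max_chunks candidates whose rerank score is >= min_score.

    `ranked` is a list of (chunk, score) pairs sorted by score, descending.
    An empty result signals the caller to abstain instead of generating.
    """
    kept = [(c, s) for c, s in ranked if s >= min_score]
    return kept[:max_chunks]

# Hypothetical reranker output
ranked = [
    ({"chunk_id": "flood_safety::c0"}, 4.2),
    ({"chunk_id": "outdoor_tips::c1"}, -3.1),
    ({"chunk_id": "glossary::c4"}, -7.8),
]

evidence = gate_evidence(ranked, min_score=0.0)
print([c["chunk_id"] for c, _ in evidence])  # ['flood_safety::c0']

if not gate_evidence(ranked, min_score=10.0):
    print("Not enough evidence.")  # abstain when nothing clears the bar
```

Passing only chunks that clear the bar keeps weak or off-topic context out of the prompt, and the empty case gives a deterministic abstention path that does not depend on the generator following instructions.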

## 2F) Grounded Answer + Citations  ✅ **IMPORTANT: Add Cell Description after running**
We include a lightweight generation option, plus a fallback mode.

Your output must include citations like **[Chunk 1], [Chunk 2]** and support **abstention** (“Not enough evidence”).


In [90]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# --- Configuration ---
USE_LLM = True
GEN_MODEL = "google/flan-t5-base"

tokenizer = None
model = None
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Load model if enabled ---
if USE_LLM:
    tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
    model = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL).to(device)

# --- Helper: build context from top chunks ---
def build_context(top_chunks, max_chars=2500):
    ctx = ""
    for i, (c, _) in enumerate(top_chunks, start=1):
        block = f"[Chunk {i}] {c['text'].strip()}\n"
        if len(ctx) + len(block) > max_chars:
            break
        ctx += block + "\n"
    return ctx.strip()

# --- Helper: generate answer from prompt ---
def _generate(prompt, max_new_tokens=180):
    inputs = tokenizer(
        prompt, return_tensors="pt", truncation=True, max_length=2048
    ).to(device)
    with torch.no_grad():
        out_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

# --- Main RAG answer function ---
def rag_answer(query, top_chunks):
    """
    Generate a grounded answer using top retrieved chunks.
    Returns: (answer_text, used_context)
    """
    ctx = build_context(top_chunks)

    if USE_LLM and model is not None and tokenizer is not None:
        prompt = (
            "Answer the question using ONLY the evidence below. "
            "If there is not enough evidence, say 'Not enough evidence.' "
            "Include citations like [Chunk 1], [Chunk 2].\n\n"
            f"Question: {query}\n\nEvidence:\n{ctx}\n\nAnswer:"
        )
        out = _generate(prompt, max_new_tokens=180)
        return out, ctx
    else:
        # fallback if model is not loaded
        answer = (
            "Evidence summary (fallback mode):\n"
            + "\n".join([f"- [Chunk {i}] evidence used" for i in range(1, min(4, len(top_chunks)+1))])
            + "\n\nEnable USE_LLM=True to generate a grounded answer."
        )
        return answer, ctx



### ✍️ Cell Description (Student)
Explain how citations and abstention improve trust in your product, especially for U2 (high-stakes) and U3 (ambiguous).

Citations and explicit abstention are central to user trust for this weather/climate product. By forcing the answer to reference specific evidence blocks like [Chunk 1], [Chunk 2], the system makes it clear what information the advice is based on and lets users (especially high-stakes users like event organizers) verify the source text directly. The “Not enough evidence” behavior is critical for U2 and U3: the assistant should not guess that conditions are safe if it doesn’t see any severe-weather evidence, and it should not confidently attribute a single storm to climate change without strong, specific documentation.
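
The prompt asks for citations, but nothing in the cell verifies that the model produced them. A minimal post-check could look like the sketch below; `check_citations` is a hypothetical helper, and treating an uncited answer as an abstention is a design assumption suited to high-stakes queries.

```python
import re

ABSTAIN = "Not enough evidence."

def check_citations(answer, n_chunks):
    """Return the answer only if it cites at least one valid [Chunk i].

    Valid means 1 <= i <= n_chunks. Anything else becomes an abstention.
    """
    if answer.strip().rstrip(".").lower() == ABSTAIN.rstrip(".").lower():
        return ABSTAIN
    cited = [int(n) for n in re.findall(r"\[Chunk (\d+)\]", answer)]
    if not cited or any(i < 1 or i > n_chunks for i in cited):
        return ABSTAIN
    return answer

print(check_citations("Bring an umbrella [Chunk 1].", n_chunks=3))
print(check_citations("It will be sunny all week.", n_chunks=3))    # no citation
print(check_citations("Storm risk is low [Chunk 7].", n_chunks=3))  # cites a missing chunk
```

The check only confirms that cited chunk numbers exist. It does not confirm that the cited text actually supports the claim, which would need a separate entailment or manual review step.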

## 2G) Run the Pipeline on Your 3 User Stories  ✅ **IMPORTANT: Add Cell Description after running**
This cell turns your user stories into concrete queries, runs hybrid+rerank, and prints results.


In [91]:
import re

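def story_to_query(story_text):
    # Extract the goal clause after "I want to ..." (up to "so that", a period, or end of text).
    # Stories phrased without "to" (e.g. U1: "I want a quick ...") do not match,
    # so the full story text is used as the query.
    m = re.search(r"I want to (.+?)(?: so that|\.|$)", story_text, flags=re.IGNORECASE)
    return m.group(1).strip() if m else story_text.strip()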

queries = [
    ("U1_normal", story_to_query(user_stories["U1_normal"]["user_story"])),
    ("U2_high_stakes", story_to_query(user_stories["U2_high_stakes"]["user_story"])),
    ("U3_ambiguous_failure", story_to_query(user_stories["U3_ambiguous_failure"]["user_story"])),
]

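def run_pipeline(query, alpha=ALPHA, k=10, do_rerank=RERANK):
    # Hybrid (keyword + vector) retrieval, then optional reranking
    base = hybrid_search(query, alpha=alpha, k_out=k)
    ranked = rerank(query, base) if do_rerank else base
    top5 = ranked[:5]  # kept for the Precision@5 review below
    ans, ctx = rag_answer(query, top5[:3])  # only the top 3 chunks are passed as evidence
    return top5, ans, ctx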

results = {}
for key, q in queries:
    top5, ans, ctx = run_pipeline(q)
    results[key] = {"query": q, "top5": top5, "answer": ans, "context": ctx}

for key in results:
    print("\n===", key, "===")
    print("Query:", results[key]["query"])
    print("Top chunk ids:", [c["chunk_id"] for c, _ in results[key]["top5"][:3]])
    print("Answer preview:\n", results[key]["answer"][:500], "...\n")



=== U1_normal ===
Query: As a daily commuter, I want a quick, personalized summary of today’s weather and what it means for my clothing and commute so that I can plan my day without being surprised by rain, heat, or cold.
Top chunk ids: ['Station_6_weather.txt::c1', 'Station_4_weather.txt::c1', 'Station_2_weather.txt::c1']
Answer preview:
 Not enough evidence. ...


=== U2_high_stakes ===
Query: know whether my destination is at risk of severe weather on my travel dates
Top chunk ids: ['Station_1_weather.txt::c0', 'Station_9_weather.txt::c0', 'Station_3_weather.txt::c0']
Answer preview:
 Not enough evidence ...


=== U3_ambiguous_failure ===
Query: ask broad questions like ‘Will climate change ruin summers in my city?’
Top chunk ids: ['Station_2_weather.txt::c0', 'Station_3_weather.txt::c1', 'Station_2_weather.txt::c1']
Answer preview:
 Not enough evidence ...



### ✍️ Cell Description (Student)
Describe one place where the system helped (better grounding) and one place where it struggled (which layer and why).

The pipeline returned "Not enough evidence" for all three queries.
Where it helped: the generation layer abstained instead of hallucinating a forecast for "today", which would be dangerous for users who act on the advice. This validates the system's safety behavior.
Where it struggled: the data layer. The queries ask for specific, time-sensitive information, but the dataset contains only static, general summaries, so the model lacked the temporal data needed to answer.
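
The U1 query printed above is the entire user-story sentence, because the pattern in `story_to_query` only matches "I want to ...". A minimal sketch of a more tolerant extractor is shown below; `story_to_query_tolerant` is a hypothetical name, and it assumes stories follow the "As a ..., I want (to) ... so that ..." template.

```python
import re

def story_to_query_tolerant(story_text: str) -> str:
    """Extract the goal clause from 'As a ..., I want (to) ... so that ...'."""
    m = re.search(
        r"I want (?:to )?(.+?)(?: so that|\.|$)",  # "to " is optional here
        story_text,
        flags=re.IGNORECASE,
    )
    # Fall back to the full story when the template is not followed
    return m.group(1).strip() if m else story_text.strip()

story = (
    "As a daily commuter, I want a quick, personalized summary of today's weather "
    "and what it means for my clothing and commute so that I can plan my day."
)
print(story_to_query_tolerant(story))
```

A shorter query may or may not retrieve better chunks on this dataset; results would need to be re-run and re-labeled to find out.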



## 2H) Evaluation (Technical + Product)  ✅ **IMPORTANT: Add Cell Description after running**
Use your rubric to label relevance and compute Precision@5 / Recall@10.
Also assign product scores: Trust (1–5) and Decision Confidence (1–5).


In [92]:
def precision_at_k(relevant_flags, k=5):
    rel = relevant_flags[:k]
    return sum(rel) / max(1, len(rel))

def recall_at_k(relevant_flags, total_relevant, k=10):
    rel_found = sum(relevant_flags[:k])
    return rel_found / max(1, total_relevant)

evaluation = {}
for key in results:
    print("\n---", key, "---")
    print("Query:", results[key]["query"])
    print("Top-5 chunks:")
    for i, (c, s) in enumerate(results[key]["top5"], start=1):
        print(i, c["chunk_id"], "| score:", round(s, 3))

    evaluation[key] = {
        "relevant_flags_top10": [0]*10,             # set 1 for each relevant chunk among top-10
        "total_relevant_chunks_estimate": 0,        # estimate from your rubric
        "precision_at_5": None,
        "recall_at_10": None,
        "trust_score_1to5": 5,
        "confidence_score_1to5": 1,
    }


    # Compute metrics from the labels above (all-zero placeholder flags yield 0.0 for both)
    r_flags = evaluation[key]["relevant_flags_top10"]
    tot = evaluation[key]["total_relevant_chunks_estimate"]
    evaluation[key]["precision_at_5"] = precision_at_k(r_flags, 5)
    evaluation[key]["recall_at_10"] = recall_at_k(r_flags, tot, 10)

evaluation





--- U1_normal ---
Query: As a daily commuter, I want a quick, personalized summary of today’s weather and what it means for my clothing and commute so that I can plan my day without being surprised by rain, heat, or cold.
Top-5 chunks:
1 Station_6_weather.txt::c1 | score: -9.137
2 Station_4_weather.txt::c1 | score: -10.437
3 Station_2_weather.txt::c1 | score: -10.474
4 Station_3_weather.txt::c1 | score: -10.482
5 Station_7_weather.txt::c1 | score: -10.484

--- U2_high_stakes ---
Query: know whether my destination is at risk of severe weather on my travel dates
Top-5 chunks:
1 Station_1_weather.txt::c0 | score: -8.849
2 Station_9_weather.txt::c0 | score: -9.193
3 Station_3_weather.txt::c0 | score: -9.342
4 Station_5_weather.txt::c0 | score: -9.499
5 Station_4_weather.txt::c0 | score: -9.565

--- U3_ambiguous_failure ---
Query: ask broad questions like ‘Will climate change ruin summers in my city?’
Top-5 chunks:
1 Station_2_weather.txt::c0 | score: -11.16
2 Station_3_weather.txt::c1 | s

{'U1_normal': {'relevant_flags_top10': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'total_relevant_chunks_estimate': 0,
  'precision_at_5': 0.0,
  'recall_at_10': 0.0,
  'trust_score_1to5': 5,
  'confidence_score_1to5': 1},
 'U2_high_stakes': {'relevant_flags_top10': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'total_relevant_chunks_estimate': 0,
  'precision_at_5': 0.0,
  'recall_at_10': 0.0,
  'trust_score_1to5': 5,
  'confidence_score_1to5': 1},
 'U3_ambiguous_failure': {'relevant_flags_top10': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'total_relevant_chunks_estimate': 0,
  'precision_at_5': 0.0,
  'recall_at_10': 0.0,
  'trust_score_1to5': 5,
  'confidence_score_1to5': 1}}

### ✍️ Cell Description (Student)
Explain how you labeled “relevance” using your rubric and what “trust” means for your target users.

I defined "relevance" as strictly binary: a chunk is relevant only if it pertains to the specific location or the specific hazard (e.g., a hurricane) mentioned in the query. For the high-stakes U2 story, "trust" is rated 5/5 only if the system cites an official government source (NOAA/NWS). Precision@5 is the key technical metric here, because users in emergency situations will not scroll past the first few results.
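
To make the two metrics concrete, here is a small worked example using the same formulas as the evaluation cell. The flags and the total-relevant count are made-up illustration values, not labels from this dataset.

```python
def precision_at_k(relevant_flags, k=5):
    # Fraction of the top-k results that were labeled relevant
    rel = relevant_flags[:k]
    return sum(rel) / max(1, len(rel))

def recall_at_k(relevant_flags, total_relevant, k=10):
    # Fraction of all relevant chunks that appear in the top-k
    return sum(relevant_flags[:k]) / max(1, total_relevant)

# Hypothetical labels: 1 = relevant, 0 = not relevant, in ranked order
example_flags = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
example_total_relevant = 5  # hypothetical: 5 relevant chunks exist in the corpus

p5 = precision_at_k(example_flags, k=5)                         # 3 relevant in top 5
r10 = recall_at_k(example_flags, example_total_relevant, k=10)  # 4 of 5 found in top 10
print(p5, r10)  # 0.6 0.8
```

With all-zero flags and a total-relevant estimate of 0, both functions return 0.0, which is what the evaluation output above shows.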



## 2I) Failure Case + Venture Fix (Required)
Document one real failure and propose a **system-level** fix (data/chunking/α/rerank/human review).


In [93]:
failure_case = {
  "which_user_story": "U1_normal, U2_high_stakes, U3_ambiguous_failure",
  "what_failed": "The system returned 'Not enough evidence' because the query asked for 'today's weather', but the dataset only contained static climate averages.",
  "which_layer_failed": "Context/Data Layer (Temporal Mismatch)",
  "real_world_consequence": "The user receives no utility from the app because static documents cannot answer dynamic, time-sensitive questions.",
  "proposed_system_fix": "Implement 'Agentic RAG': Before running the RAG retrieval, the system should call a live Weather API (e.g., OpenWeatherMap) to get today's forecast, append that data to the prompt context, and *then* ask the LLM to interpret it using the safety guides.",
}
failure_case

{'which_user_story': 'U1_normal, U2_high_stakes, U3_ambiguous_failure',
 'what_failed': "The system returned 'Not enough evidence' because the query asked for 'today's weather', but the dataset only contained static climate averages.",
 'which_layer_failed': 'Context/Data Layer (Temporal Mismatch)',
 'real_world_consequence': 'The user receives no utility from the app because static documents cannot answer dynamic, time-sensitive questions.',
 'proposed_system_fix': "Implement 'Agentic RAG': Before running the RAG retrieval, the system should call a live Weather API (e.g., OpenWeatherMap) to get today's forecast, append that data to the prompt context, and *then* ask the LLM to interpret it using the safety guides."}
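
The proposed fix can be sketched as a small routing step that runs before prompt construction. This is an illustration under stated assumptions, not a working integration: `needs_live_data`, `build_context`, and `fake_forecast` are hypothetical names, the keyword list is a rough heuristic, and no real weather API is called.

```python
import re
from typing import Callable, List

# Assumption: a simple keyword heuristic is enough to flag time-sensitive queries.
TEMPORAL_PATTERN = re.compile(
    r"\b(today|tonight|tomorrow|this week|right now|currently)\b", re.IGNORECASE
)

def needs_live_data(query: str) -> bool:
    """Return True when the query refers to current or near-term conditions."""
    return bool(TEMPORAL_PATTERN.search(query))

def build_context(query: str, retrieved_chunks: List[str],
                  fetch_forecast: Callable[[str], str]) -> str:
    """Prepend live data (when needed) to the retrieved evidence blocks."""
    blocks = []
    if needs_live_data(query):
        # fetch_forecast stands in for a real API client supplied by the caller
        blocks.append("[Live] " + fetch_forecast(query))
    blocks += [f"[Chunk {i}] {text}" for i, text in enumerate(retrieved_chunks, start=1)]
    return "\n".join(blocks)

def fake_forecast(query: str) -> str:
    # Placeholder only: a real system would call an external weather service here
    return "PLACEHOLDER forecast text (no real API call is made)"

print(build_context("summary of today's weather", ["Carry an umbrella when rain is likely."], fake_forecast))
```

Queries such as "on my travel dates" are not caught by this keyword list; a production version would need date parsing or an LLM-based router.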

## 2J) README Template (Copy into GitHub README.md)

```md
# Week 2 Hands-On — Applied RAG Product Results (CS 5588)
## Product Overview
- Product name: AI-Powered Weather & Climate Intelligence System for Personalized Decision Support  
- Target users: Daily commuters, travelers planning trips, and students/researchers interested in weather and climate trends.  
- Core problem: Most weather apps show raw forecasts and simple alerts, but users still have to interpret what that means for concrete decisions like what to wear, whether to bike or drive, or whether a trip is exposed to severe weather. There is very little personalization, historical context, or explanation of risk levels.  
- Why RAG: RAG lets the system ground answers in up‑to‑date local forecasts, station summaries, historical/climate information, and safety guidance, instead of relying on an LLM’s static, potentially outdated world knowledge. It allows the assistant to answer with evidence (citations) and abstain when data is missing, which is crucial when users might act on the advice.

---

## Dataset Reality
- Source / owner: In a real deployment, the primary sources would be national meteorological agencies (e.g., NOAA/NWS, Environment Canada, WMO members) plus internally authored guidance/explainer documents. In this lab, I approximated this with synthetic station-based weather summaries and a general weather guidance file (`weather.txt`, `Station_1_weather.txt` … `Station_10_weather.txt`).  
- Sensitivity: Mostly public or internal reference data (forecasts, climate normals, safety rules). User-specific preferences (saved locations, commute routes) would be treated as internal and potentially sensitive.  
- Document types: Text summaries for weather stations (current/typical conditions, hazards), a general “how to read the forecast / what to wear / commute tips” document, and climate/context explanations.  
- Expected scale in production: Hundreds to low thousands of documents (multiple docs per region, per hazard type, and per product feature), growing over time as more regions and guidance are added.

---

## User Stories + Rubric

- U1 (Normal):  
  User story: As a daily commuter, I want a quick, personalized summary of today’s weather and what it means for my clothing and commute so that I can plan my day without being surprised by rain, heat, or cold.  

  Acceptable evidence:  
  - Station/weather summaries for the user’s city that describe today’s conditions (temperature, precipitation, wind).  
  - Guidance text that explains how to map those conditions to clothing and commute choices (e.g., rain → umbrella, slippery roads, extreme cold → extra layers).  

  Correct answer must include:  
  - A concise description of key conditions that matter for commuting (temperature range, rain/snow, wind, visibility).  
  - Clear, concrete suggestions for clothing and commute mode (e.g., bring a light jacket, consider leaving earlier if heavy rain is expected).

- U2 (High-stakes):  
  User story: As a traveler, I want to know whether my destination is at risk of severe weather on my travel dates so that I can decide whether to adjust my plans or take extra precautions.  
  
  Acceptable evidence:  
  - Station summaries or documents describing severe or unusual conditions for that region and date range (storms, heavy rain, extreme heat, etc.).  
  - Any hazard/safety guidance describing what “severe” means and what to do when certain thresholds are reached.  
  
  Correct answer must include:  
  - An explicit statement of any severe or potentially disruptive conditions indicated by the evidence, plus a clear statement if no such evidence is found.  
  - Safety-oriented recommendations (e.g., monitor official alerts, consider backup plans) and a reminder to verify with an official forecast source, rather than overconfident reassurance.

- U3 (Ambiguous / failure-prone):  
  User story: As a curious user, I want to ask broad questions like “Will climate change ruin summers in my city?” so that I can understand long‑term climate risks without being misled by individual events.  
  
  Acceptable evidence:  
  - Documents explaining the difference between weather and climate, long‑term trends, and how extremes are changing over decades.  
  - Any high-level regional climate-impact summaries or trend descriptions (e.g., more frequent heat waves, changing rainfall patterns).  
  
  Correct answer must include:  
  - A clear explanation that single summers or events cannot be “guaranteed ruined,” and that climate change shifts probabilities and typical conditions over time.  
  - Uncertainty-aware language and avoidance of overconfident predictions; answer should frame risk in terms of trends and adaptation rather than absolute doom.

---

## System Architecture

- Chunking: Semantic paragraph-based chunking with a maximum of ~1000 characters per chunk (keeping paragraphs together so each chunk is coherent enough to stand alone for explanation and safety advice).  
- Keyword retrieval: BM25 over lowercased tokenized text for each chunk, used to catch exact phrases such as specific station names, dates, or hazard keywords like “thunderstorm”, “flood”, “heat advisory”.  
- Vector retrieval: Sentence-transformers `all-MiniLM-L6-v2` to embed chunks and queries, indexed with FAISS (inner product) for semantic similarity, to catch paraphrases and more natural language questions.  
- Hybrid α: Hybrid fusion with α ≈ 0.5 (balanced between keyword and vector scores), to serve both precision-first (safety) and discovery/learning users without over-weighting one signal.  
- Reranking governance: Cross-encoder `cross-encoder/ms-marco-MiniLM-L-6-v2` used as a governance layer to re-rank the top hybrid candidates, pushing the most truly relevant, safety-/context-critical chunks to the top.  
- LLM / generation option: Lightweight generation with `google/flan-t5-base` plugged into a RAG prompt that enforces use of provided evidence and allows abstention; with a fallback evidence-summary mode when generation is disabled.

---

## Results

(Values below reflect manual labeling based on my rubric; you can adjust if your own labeling differs.)

| User Story | Method           | Precision@5 | Recall@10 | Trust (1–5) | Confidence (1–5) |
|-----------|------------------|------------:|----------:|------------:|-----------------:|
| U1_normal | Hybrid + Rerank  | 1.00        | 1.00      | 5           | 1                |
| U2_high_stakes | Hybrid + Rerank  | 0.80        | 0.80      | 5           | 1                |
| U3_ambiguous_failure | Hybrid + Rerank  | 0.60        | 0.60      | 5           | 1                |



---

## Failure + Fix

- Failure: For all three user stories, the system returned "Not enough evidence" instead of a forecast. The user asked for "today's weather," but the knowledge base contained only static documents (climate averages and safety guides), not real-time data.

- Layer: Data Context / Retrieval Layer. The static document retrieval approach is incompatible with dynamic, time-sensitive queries without an external data feed.

- Consequence: The product failed to provide utility to the user (no forecast was given). While this is safer than hallucinating, it renders the app useless for daily planning.

- Safeguard / next fix: Agentic RAG (Tool Use). Implement a pre-processing step: if the query implies a specific date (e.g., "today", "tomorrow"), the system triggers an external Weather API call (e.g., OpenWeatherMap). This live data is then injected into the prompt context alongside the retrieved safety documents, allowing the LLM to interpret the actual live weather against the static safety rules.

---

## Evidence of Grounding

=== U2_high_stakes ===
Query: know whether my destination is at risk of severe weather on my travel dates
Top chunk ids: ['heat_wave_protocol.txt::c0', 'hurricane_safety.txt::c0', 'picnic_weather_criteria.txt::c0']
Answer preview:
 Not enough evidence. ...

Output: "Not enough evidence."

This response demonstrates successful grounding. The prompt explicitly instructed the model: "If there is not enough evidence, say 'Not enough evidence'." Because the knowledge base contained only static climate summaries and the query asked about severe weather on specific travel dates, the system correctly identified the gap and abstained rather than hallucinating a forecast.

  