In [1]:
!pip -q install praw pandas tqdm



## 1. Environment Setup

Ensure **Python 3.9+** is installed.

Install required libraries:


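pip install google-genai google-generativeai faiss-cpu praw yt-dlp requests numpy pandas

export GEMINI_API_KEY="YOUR_API_KEY"

The notebook imports both Gemini SDKs: `google-genai` (`from google import genai`) for embeddings and generation, and `google-generativeai` for audio transcription. `yt-dlp` also needs `ffmpeg` on the PATH to extract MP3 audio.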
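Before running any cell, it can help to confirm that the key is actually visible to the Python process. A minimal sketch; `require_env` is a hypothetical helper and not part of the notebook:

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return value

# Usage in the notebook (raises if the key is missing):
# api_key = require_env("GEMINI_API_KEY")
```

Passing `env` explicitly is only there to make the helper easy to test without touching the real environment.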

## 2. Run Order

Run the notebook **top to bottom** without skipping cells, because later cells rely on objects (`client`, `chunks`, `index`) created by earlier ones.

The stages run in this sequence:

1. **Imports & Configuration**  
   - Initializes required libraries  
   - Configures Gemini API access  
   - Defines helper and utility functions  

2. **Reddit Data Loading & Processing**  
   - Loads scraped Reddit posts from JSONL files  
   - Merges post text and comments into documents  
   - Chunks documents into smaller text segments  

3. **YouTube Data Loading & Processing**  
   - Loads pre-generated YouTube transcripts  
   - Converts transcripts into text documents  
   - Chunks transcripts and merges them with Reddit data  

4. **Embedding & Vector Index Creation**  
   - Generates embeddings using Gemini  
   - Stores embeddings in a FAISS similarity index  

5. **Retrieval Functions**  
   - Embeds user queries  
   - Retrieves top-k most relevant chunks from FAISS  

6. **RAG Prompting & Generation**  
   - Injects retrieved context into the persona-constrained prompt  
   - Generates persona-grounded responses using Gemini  

7. **Baseline vs RAG Evaluation**  
   - Runs the same queries through a baseline LLM  
   - Compares outputs using a 0–2 evaluation rubric  
   - Analyzes persona match, groundedness, actionability, and sustainability alignment  
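
As an illustration of stage 7, the 0–2 rubric can be totaled per answer across the four criteria named above. This is a hypothetical sketch: the function name, the dictionary layout, and the example scores are assumptions, not the notebook's evaluation code or its results.

```python
CRITERIA = ("persona_match", "groundedness", "actionability", "sustainability_alignment")

def rubric_total(scores):
    """Validate one answer's 0-2 scores and return their sum (maximum 8)."""
    for name in CRITERIA:
        value = scores[name]
        if value not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2, got {value!r}")
    return sum(scores[name] for name in CRITERIA)

# Made-up scores, for illustration only
example = {"persona_match": 2, "groundedness": 1, "actionability": 2, "sustainability_alignment": 2}
print(rubric_total(example))  # 7
```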



In [10]:
import os

# Set these in Colab secrets / environment (recommended)
# os.environ["REDDIT_CLIENT_ID"] = "..."
# os.environ["REDDIT_CLIENT_SECRET"] = "..."
# os.environ["REDDIT_USER_AGENT"] = "DSO599_final_project (by u/YOUR_USERNAME)"


# SCRAPE REDDIT DATA

In [12]:
import os, csv, json, time, sys, re
from datetime import datetime, timezone
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

import requests

# ---------------- Config ----------------
THREAD_URLS = [
    "https://www.reddit.com/r/femalefashionadvice/comments/1pcnc1c/ladies_and_theydies_who_work_remotely_what_do_you/",
    "https://www.reddit.com/r/femalefashionadvice/comments/1olvz55/what_do_you_think_of_all_the_recent_highlow/",
    "https://www.reddit.com/r/femalefashionadvice/comments/1ba6kg1/where_do_2030yo_get_their_fashion_trendsadvice/",
    "https://www.reddit.com/r/science/comments/kfulot/women_overestimate_mens_attraction_to_thin_female/",
]

SAVE_DIR = "data"
os.makedirs(SAVE_DIR, exist_ok=True)
JSONL_PATH = os.path.join(SAVE_DIR, "reddit_threads.jsonl")
CSV_PATH   = os.path.join(SAVE_DIR, "reddit_comments.csv")

UA = "DSO599-final-project/1.0 (contact: your_email@school.edu)"  # customize
REQUEST_DELAY_SECS = 1.5

SESSION = requests.Session()
SESSION.headers.update({"User-Agent": UA})

# ---------------- Helpers ----------------
def utc_iso(ts: float) -> str:
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat() if ts else None

def normalize_reddit_url(url: str) -> str:
    # Strip query/fragment and force www.reddit.com
    base = url.split("?")[0].split("#")[0].rstrip("/")
    p = urlparse(base)
    if "reddit.com" in p.netloc:
        p = p._replace(netloc="www.reddit.com")
    return p.geturl()

def to_api_json_url(url: str) -> str:
    url = normalize_reddit_url(url)
    parsed = list(urlparse(url))
    if not parsed[2].endswith("/"):
        parsed[2] += "/"
    parsed[2] += ".json"
    qs = dict(parse_qsl(parsed[4]))
    qs.setdefault("sort", "best")
    qs.setdefault("raw_json", "1")
    parsed[4] = urlencode(qs)
    return urlunparse(parsed)

def get_json(url: str):
    r = SESSION.get(url, timeout=30)
    r.raise_for_status()
    return r.json()

def morechildren(link_fullname: str, children_ids):
    api = "https://www.reddit.com/api/morechildren"
    params = {
        "api_type": "json",
        "link_id": link_fullname,
        "children": ",".join(children_ids),
        "sort": "best",
        "raw_json": "1",
    }
    r = SESSION.post(api, data=params, timeout=60)
    r.raise_for_status()
    js = r.json()
    return js.get("json", {}).get("data", {}).get("things", [])

def walk_comments(tree_nodes, link_fullname, out_rows, post_permalink, depth=0):
    more_queue = []
    for node in tree_nodes:
        kind = node.get("kind")
        data = node.get("data", {}) or {}

        if kind == "t1":  # comment
            out_rows.append({
                "comment_id": data.get("id"),
                "parent_id": data.get("parent_id"),
                "author": data.get("author"),
                "body": (data.get("body") or ""),
                "score": data.get("score"),
                "created_utc": data.get("created_utc"),
                "created_at_utc": utc_iso(data.get("created_utc")),
                "depth": depth,
                "permalink": "https://www.reddit.com" + (data.get("permalink") or post_permalink),
            })
            replies = data.get("replies")
            if isinstance(replies, dict):
                walk_comments(replies.get("data", {}).get("children", []),
                              link_fullname, out_rows, post_permalink, depth + 1)

        elif kind == "more":
            children = data.get("children") or []
            if children:
                more_queue.append(children)

    # Expand "more" comments in batches
    for children in more_queue:
        for i in range(0, len(children), 100):
            batch = children[i:i+100]
            # polite delay + basic retry
            time.sleep(REQUEST_DELAY_SECS)
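            try:
                things = morechildren(link_fullname, batch)
            except requests.HTTPError as e:
                print(f"[warn] morechildren HTTP {e.response.status_code}. Sleeping & retry...", file=sys.stderr)
                time.sleep(5)
                try:
                    things = morechildren(link_fullname, batch)
                except requests.HTTPError as e2:
                    # Keep the comments collected so far instead of losing the whole thread.
                    print(f"[warn] retry failed (HTTP {e2.response.status_code}); skipping batch", file=sys.stderr)
                    continue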
            walk_comments(things, link_fullname, out_rows, post_permalink, depth)

def scrape_thread(post_url: str):
    post_url = normalize_reddit_url(post_url)
    time.sleep(REQUEST_DELAY_SECS)
    js = get_json(to_api_json_url(post_url))

    if not isinstance(js, list) or len(js) < 2:
        raise RuntimeError("Unexpected JSON shape")

    post_listing = js[0]["data"]["children"][0]["data"]
    link_fullname = "t3_" + post_listing["id"]
    post_permalink = post_listing.get("permalink") or ""

    comment_listing = js[1]["data"]["children"]
    rows = []
    walk_comments(comment_listing, link_fullname, rows, post_permalink, depth=0)

    thread = {
        "source_type": "reddit",
        "thread_url": post_url,
        "post_id": post_listing.get("id"),
        "subreddit": post_listing.get("subreddit"),
        "title": post_listing.get("title", ""),
        "author": post_listing.get("author"),
        "created_utc": post_listing.get("created_utc"),
        "created_at_utc": utc_iso(post_listing.get("created_utc")),
        "score": post_listing.get("score"),
        "num_comments_reported": post_listing.get("num_comments"),
        "selftext": post_listing.get("selftext") or "",
        "permalink": "https://www.reddit.com" + post_permalink if post_permalink else post_url,
        "scraped_at_utc": datetime.now(timezone.utc).isoformat(),
        "comments": rows,
    }
    return thread

def build_rag_text(thread: dict, max_comments: int = None) -> str:
    lines = []
    lines.append(f"SUBREDDIT: r/{thread.get('subreddit')}")
    lines.append(f"TITLE: {thread.get('title','')}")
    lines.append(f"URL: {thread.get('thread_url','')}")
    lines.append("")
    if thread.get("selftext"):
        lines.append("POST:")
        lines.append(thread["selftext"])
        lines.append("")
    lines.append("COMMENTS:")
    comments = thread.get("comments", [])
    if max_comments is not None:
        comments = comments[:max_comments]
    for cm in comments:
        a = cm.get("author") or "unknown"
        ts = cm.get("created_at_utc") or ""
        body = (cm.get("body") or "").replace("\n", " ").strip()
        lines.append(f"- ({ts}) u/{a}: {body}")
    return "\n".join(lines).strip()

# ---------------- Run all threads ----------------
all_comment_rows = []
with open(JSONL_PATH, "w", encoding="utf-8") as jf:
    for u in THREAD_URLS:
        try:
            print(f"Scraping: {u}")
            thread = scrape_thread(u)
            thread["rag_text"] = build_rag_text(thread)  # helpful for chunking later
            jf.write(json.dumps(thread, ensure_ascii=False) + "\n")

            for cm in thread["comments"]:
                all_comment_rows.append({
                    "thread_url": thread["thread_url"],
                    "subreddit": thread["subreddit"],
                    "post_id": thread["post_id"],
                    "post_title": thread["title"],
                    "comment_id": cm.get("comment_id"),
                    "parent_id": cm.get("parent_id"),
                    "author": cm.get("author"),
                    "created_at_utc": cm.get("created_at_utc"),
                    "score": cm.get("score"),
                    "depth": cm.get("depth"),
                    "body": cm.get("body"),
                    "permalink": cm.get("permalink"),
                })

            print(f"  ✓ got {len(thread['comments'])} comments (reported: {thread.get('num_comments_reported')})")
        except Exception as e:
            print(f"  ✗ ERROR: {e}")

# Save CSV
if all_comment_rows:
    import pandas as pd
    pd.DataFrame(all_comment_rows).to_csv(CSV_PATH, index=False)

print(f"\nDone.\nJSONL: {JSONL_PATH}\nCSV:   {CSV_PATH}")


Scraping: https://www.reddit.com/r/femalefashionadvice/comments/1pcnc1c/ladies_and_theydies_who_work_remotely_what_do_you/
  ✓ got 62 comments (reported: 64)
Scraping: https://www.reddit.com/r/femalefashionadvice/comments/1olvz55/what_do_you_think_of_all_the_recent_highlow/
  ✓ got 77 comments (reported: 81)
Scraping: https://www.reddit.com/r/femalefashionadvice/comments/1ba6kg1/where_do_2030yo_get_their_fashion_trendsadvice/
  ✓ got 158 comments (reported: 168)
Scraping: https://www.reddit.com/r/science/comments/kfulot/women_overestimate_mens_attraction_to_thin_female/


[warn] morechildren HTTP 429. Sleeping & retry...


  ✗ ERROR: 429 Client Error: Too Many Requests for url: https://www.reddit.com/api/morechildren

Done.
JSONL: data/reddit_threads.jsonl
CSV:   data/reddit_comments.csv
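
The run above lost the r/science thread because the single retry also hit HTTP 429. One common remedy is to retry with exponentially growing delays. A minimal sketch, assuming a hypothetical `with_backoff` wrapper; the delay values are illustrative and not tuned against Reddit's rate limits:

```python
import time

def with_backoff(fn, max_retries=4, base_delay=2.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))

# Possible use inside walk_comments (names from the cell above):
# things = with_backoff(lambda: morechildren(link_fullname, batch),
#                       retry_on=(requests.HTTPError,))
```

The `sleep` parameter is injectable so the helper can be exercised without real waiting.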


# Phase 1: Reddit RAG Data Source 1
# Step 1 — Gemini key

In [14]:
import os
from google import genai

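# Read the key from the environment; never hard-code it in the notebook.
# (Key redacted. A key that has been shared or committed should be revoked and regenerated.)
if not os.environ.get("GEMINI_API_KEY"):
    raise RuntimeError("Set GEMINI_API_KEY before running this cell.")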

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])


# Step 2 — Load Reddit JSONL and build a “document text”

In [16]:
import json
from datetime import datetime, timezone

REDDIT_JSONL_PATH = "data/reddit_threads.jsonl"

def load_reddit_threads(path):
    threads = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            threads.append(json.loads(line))
    return threads

def build_thread_text(thread, max_comments=250):
    """
    Make a single text block per thread:
    title + post + first N comments (to avoid huge threads).
    """
    title = thread.get("title","").strip()
    selftext = (thread.get("selftext") or "").strip()
    url = thread.get("thread_url","")

    parts = [f"TITLE: {title}", f"URL: {url}"]

    if selftext:
        parts.append("\nPOST:\n" + selftext)

    parts.append("\nCOMMENTS:")
    comments = thread.get("comments", [])[:max_comments]
    for c in comments:
        body = (c.get("body") or "").replace("\n", " ").strip()
        if body:
            author = c.get("author") or "unknown"
            parts.append(f"- u/{author}: {body}")

    return "\n".join(parts).strip()

threads = load_reddit_threads(REDDIT_JSONL_PATH)

# (Recommended) keep only femalefashionadvice for your fashion persona
threads = [t for t in threads if str(t.get("subreddit","")).lower() == "femalefashionadvice"]

print("Threads loaded:", len(threads))
print("Example title:", threads[0]["title"])


Threads loaded: 3
Example title: Ladies and theydies who work remotely, what do you wear when you want to feel semi-professional but also comfy ?


# Step 3 — Chunking

In [18]:
def chunk_text(text, size=250):
    words = text.split()
    return [" ".join(words[i:i+size]) for i in range(0, len(words), size)]

CHUNK_SIZE = 250  # Reddit: 200–350 is usually good
MAX_COMMENTS_PER_THREAD = 250  # keep cost under control

chunks = []
for t in threads:
    doc_text = build_thread_text(t, max_comments=MAX_COMMENTS_PER_THREAD)
    post_id = t.get("post_id") or "unknown"
    url = t.get("thread_url","")
    title = t.get("title","")

    for i, ch in enumerate(chunk_text(doc_text, size=CHUNK_SIZE)):
        ch = ch.strip()
        if not ch:
            continue
        chunks.append({
            "source": "reddit",
            "post_id": post_id,
            "title": title,
            "url": url,
            "chunk_id": f"reddit_{post_id}_c{i:03d}",
            "text": ch
        })

print("Total chunks:", len(chunks))
print("Example chunk:", chunks[0]["chunk_id"])


Total chunks: 64
Example chunk: reddit_1pcnc1c_c000
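
`chunk_text` splits on fixed 250-word boundaries, so a comment that straddles a boundary is cut in two. A common variant lets consecutive chunks overlap. The sketch below is an assumption about how that could look, not what the notebook ran; `overlap` is a hypothetical parameter:

```python
def chunk_text_overlap(text, size=250, overlap=50):
    """Split text into word windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end
    return chunks

print(chunk_text_overlap("a b c d e", size=3, overlap=1))  # ['a b c', 'c d e']
```

Overlap increases the number of chunks, and therefore the number of embedding calls, so it trades cost for continuity.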


# Step 4 — Embedding function

In [20]:
import numpy as np

def embed(text: str) -> np.ndarray:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text
    )
    embedding_obj = result.embeddings[0]
    return np.array(embedding_obj.values, dtype="float32")
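
Each call to `embed` is a separate API request, so re-running cells re-embeds text that has already been processed. One option is an in-memory cache keyed by a hash of the text. This is a sketch under that assumption; `make_cached_embed` is a hypothetical helper and works with any embedding function:

```python
import hashlib

def make_cached_embed(embed_fn):
    """Wrap an embedding function so identical texts are embedded only once."""
    cache = {}

    def cached(text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(text)
        return cache[key]

    cached.cache = cache  # exposed for inspection
    return cached

# In the notebook this could wrap the function above:
# embed_cached = make_cached_embed(embed)
```

The cache returns the same object on repeated calls, so callers should not modify the returned vector in place.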


# Step 5 — Compute embeddings + FAISS index

In [22]:
import faiss
import time

print("Computing embeddings...")

embeddings = []
valid_chunks = []

for idx, c in enumerate(chunks, 1):
    text = c["text"].strip()
    if not text:
        continue
    emb = embed(text)
    embeddings.append(emb)
    valid_chunks.append(c)

    if idx % 25 == 0:
        print(f"  embedded {idx}/{len(chunks)} chunks")

emb_matrix = np.vstack(embeddings)
dimension = emb_matrix.shape[1]

index = faiss.IndexFlatL2(dimension)
index.add(emb_matrix)

chunks = valid_chunks

print("Embeddings computed:", len(embeddings))
print("Vectors in index:", index.ntotal)


Computing embeddings...
  embedded 25/64 chunks
  embedded 50/64 chunks
Embeddings computed: 64
Vectors in index: 64
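
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The pure-Python sketch below mirrors that behavior on toy 2-D vectors to make the ranking explicit; it is illustrative only and not a replacement for FAISS:

```python
def l2_search(vectors, query, k):
    """Return the k nearest (index, squared_distance) pairs, closest first."""
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((i, dist))
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(l2_search(toy_vectors, [1.0, 1.0], k=2))  # [(1, 1.0), (0, 2.0)]
```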


# Step 6 — RAG functions

In [24]:
from google.genai.errors import ServerError, ClientError
import time

def safe_generate(prompt: str,
                  model: str = "models/gemini-2.5-flash",
                  max_retries: int = 5,
                  base_delay: float = 1.5):
    last_error = None
    for attempt in range(max_retries):
        try:
            response = client.models.generate_content(
                model=model,
                contents=prompt
            )
            return response
        except ServerError as e:
            last_error = e
            print(f"⚠️ ServerError on attempt {attempt+1}/{max_retries}: {e}")
            time.sleep(base_delay * (attempt + 1))
        except ClientError as e:
            print(f"❌ ClientError (not retried): {e}")
            raise
    raise RuntimeError("Failed after multiple retries.") from last_error

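def retrieve(query: str, k: int = 4):
    q_emb = embed(query).reshape(1, -1)
    distances, indices = index.search(q_emb, k)
    # FAISS pads missing results with -1 when k exceeds the index size;
    # without this filter, -1 would silently select chunks[-1].
    return [chunks[i] for i in indices[0] if i >= 0]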

def rag_answer(question: str, k: int = 4) -> str:
    retrieved = retrieve(question, k=k)
    context = "\n\n".join([f"[{c['chunk_id']}] {c['text']}" for c in retrieved])

    prompt = f"""
You are a Customer Persona Twin (persona-mimic) for a fashion-forward, sustainability-aware shopper.
You speak like this persona: confident, trend-aware, practical, and mindful of ethics/sustainability.

Rules:
- Use ONLY the CONTEXT below (Reddit thread data). If context is insufficient, say what’s missing.
- Prefer fall-appropriate styling, darker/earthy tones, and fashion-forward but wearable choices.
- Include sustainability-minded options (materials, thrifting, capsule ideas, cost-per-wear).
- Output format:
  1) "Style direction" (1–2 sentences)
  2) "Outfit formula" (3 bullet points)
  3) "Sustainable swaps" (2–4 bullet points)
  4) "What I’d avoid" (1–2 bullet points)
  5) "Cited signals" (list chunk_ids used)

CONTEXT:
{context}

QUESTION:
{question}
""".strip()

    try:
        resp = safe_generate(prompt, model="models/gemini-2.5-flash")
        return resp.text
    except RuntimeError:
        return "Model overloaded; retry later."


# Step 7 — Try it + baseline

In [26]:
question = "How would you style for fall if you want to look fashion-forward but also sustainable?"

print("RAG ANSWER:\n")
print(rag_answer(question, k=5))

print("\n\nBASELINE (NO RAG) ANSWER:\n")
try:
    baseline = safe_generate(question, model="models/gemini-2.5-flash")
    print(baseline.text)
except RuntimeError:
    print("Baseline overloaded; retry later.")


RAG ANSWER:

This fall, I'm leaning into a blend of refined comfort and thoughtful layering, ensuring each piece is both contemporary and built to last. It’s about creating a capsule that feels effortless but looks intentionally chic, moving past fleeting fads towards enduring style.

**Outfit formula**
*   **Base Layer & Elevated Comfort:** Start with a fitted, long-sleeve tee or a lightweight sweater in a natural fiber like cotton or fine merino wool. This is my comfortable base, versatile enough for camera-on meetings or quick errands.
*   **Polished Bottoms:** Pair with "office joggers" or boiled wool pants. They offer the comfort of loungewear but the structured look of trousers, especially in a rich fall hue like deep forest green, charcoal, or classic black.
*   **Layering Piece & Accessory:** Add an oversized button-down shirt, cut to be intentionally loose, or a soft, luxurious sweater in cashmere or mohair. Finish with a quality accessory, like a beautiful leather belt or a s

# Prompt Iteration v2

In [27]:
def rag_answer(question: str, k: int = 4) -> str:
    # 1) Retrieve top-k chunks
    retrieved = retrieve(question, k=k)

    # 2) Build context string from retrieved chunks
    context = "\n\n".join(
        [f"[{c['chunk_id']}] {c['text']}" for c in retrieved]
    )

    # 3) Build the prompt (NOW context exists)
    prompt = f"""
You are a Customer Persona Twin: “Trend-Conscious Sustainable Shopper”.
Your job is to mimic how this customer segment talks and decides.

Voice + priorities:
- concise, confident, trend-aware
- prefers darker/earthy fall tones, layering, polished comfort
- sustainability is non-negotiable (materials, thrifting, cost-per-wear)
- avoids fast fashion, synthetic-heavy “sweaty” fabrics, hype drops

Grounding rules:
- Use ONLY the CONTEXT below. Don’t invent brands or facts not in context.
- If a category is missing (e.g., shoes), suggest a neutral, widely compatible option without naming brands.

Output format (exact):
A) My fall vibe in one line
B) 3 outfit formulas (each = top + bottom + shoe + outer layer)
C) Sustainable moves (2–4 actionable bullets)
D) Messaging test (2 marketing lines that would appeal to me)
E) Evidence used (list chunk_ids)

CONTEXT:
{context}

QUESTION:
{question}
""".strip()

    # 4) Generate response
    try:
        response = safe_generate(prompt, model="models/gemini-2.5-flash")
        return response.text
    except RuntimeError:
        return "The RAG model is overloaded. Please retry."
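
Section E of the output format asks the model to list the chunk_ids it used, which makes a basic groundedness check possible: every cited ID should be one that was actually retrieved. A minimal sketch, assuming IDs follow the `reddit_<post_id>_c<NNN>` pattern built in Step 3; `unknown_citations` is a hypothetical helper:

```python
import re

# Matches IDs such as reddit_1olvz55_c002 (pattern assumed from Step 3)
CHUNK_ID_RE = re.compile(r"reddit_[A-Za-z0-9]+_c\d{3,}")

def unknown_citations(answer_text, retrieved_ids):
    """Return cited chunk_ids that were not among the retrieved chunks."""
    cited = set(CHUNK_ID_RE.findall(answer_text))
    return sorted(cited - set(retrieved_ids))

answer = "Evidence used: [reddit_1olvz55_c002], [reddit_1pcnc1c_c008]"
print(unknown_citations(answer, ["reddit_1olvz55_c002"]))  # ['reddit_1pcnc1c_c008']
```

An empty result only shows that the citations are plausible; it does not prove the answer's claims appear in those chunks.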


In [28]:
eval_questions = [
  "How would you style for fall if you want to look fashion-forward but sustainable?",
  "What fall colors feel current but not too trendy?",
  "I work remote—what’s a fall outfit that looks polished on camera but comfy?",
  "What materials should I look for if I want sustainable fall basics?",
  "How would you update a wardrobe with only 3 new fall pieces?",
  "Would you respond better to 'limited drop' messaging or 'cost-per-wear' messaging and why?",
  "What would you avoid buying this fall if you care about sustainability?",
  "Give me a sustainable fall capsule for a student budget."
]
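
To run every evaluation question through both systems, a small loop that collects the answers side by side is enough. The sketch below takes the two answer functions as parameters so it makes no API calls by itself; `run_comparison` is a hypothetical name:

```python
def run_comparison(questions, rag_fn, baseline_fn):
    """Ask every question of both systems and collect the answers side by side."""
    rows = []
    for q in questions:
        rows.append({
            "question": q,
            "rag_answer": rag_fn(q),
            "baseline_answer": baseline_fn(q),
        })
    return rows

# In the notebook, the real functions could be plugged in like this:
# rows = run_comparison(
#     eval_questions,
#     rag_fn=lambda q: rag_answer(q, k=5),
#     baseline_fn=lambda q: safe_generate(q).text,
# )
```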


In [29]:
print(rag_answer(
    "How would you style for fall if you want to be fashion-forward but sustainable?",
    k=5
))


A) My fall vibe in one line
Polished comfort in dark, earthy layers, crafted from natural fibers that last beyond the season.

B) 3 outfit formulas
1.  **Top:** A fitted long-sleeve t-shirt in cotton or silk
    **Bottom:** Boiled wool pants
    **Shoe:** Minimalist leather flats or ankle boots
    **Outer layer:** A soft cashmere or mohair sweater
2.  **Top:** An oversized button-down shirt in crisp cotton or flowy crepe silk
    **Bottom:** Dark wash jeans or tailored knit pants
    **Shoe:** Neutral low-heeled boots
    **Outer layer:** A structured wool blazer or a long, open-front sweater
3.  **Top:** A lightweight cotton sweater or a boatneck long-sleeve
    **Bottom:** Refined office joggers or linen pants
    **Shoe:** Clean leather loafers or dressy sneakers
    **Outer layer:** A longline cashmere cardigan

C) Sustainable moves
*   Prioritize natural fibers like cotton, linen, wool, and silk, avoiding synthetics that cause discomfort and end up in landfills.
*   Seek out item

In [34]:
question = "What materials should I look for if I want sustainable fall basics?"

print("RAG ANSWER:\n")
print(rag_answer(question, k=5))

print("\n\nBASELINE (NO RAG) ANSWER:\n")
try:
    baseline = safe_generate(question, model="models/gemini-2.5-flash")
    print(baseline.text)
except RuntimeError:
    print("Baseline overloaded; retry later.")

RAG ANSWER:

For sustainable fall basics, prioritize natural fibers that offer both comfort and longevity. Think quality over quantity.

**Materials to look for:**

*   **Wool:** A versatile choice, especially boiled wool for pants or cozy sweaters.
*   **Cashmere:** For elevated comfort and warmth.
*   **Cotton:** Look for 100% cotton in various weaves, from fitted tees to heavier denim.
*   **Linen:** Still relevant for layering or knit pants, even as it gets cooler.
*   **Silk:** Crepe silk, in particular, offers a matte finish and drapes beautifully for sophisticated layering.
*   **Mohair:** For soft, fluffy textures.

**Materials to avoid:**
Stay away from synthetics like acrylic, nylon, and polyester. They make you sweat and devalue the garment.

**Evidence used:**
[reddit_1olvz55_c002], [reddit_1pcnc1c_c008], [reddit_1olvz55_c001], [reddit_1pcnc1c_c012], [reddit_1ba6kg1_c024]


BASELINE (NO RAG) ANSWER:

Looking for sustainable fall basics is a fantastic goal! The best material

In [36]:
examples = []

# Example 1 — Strong RAG answer
q1 = "How would you style for fall if you want to be fashion-forward but sustainable?"
rag_1 = rag_answer(q1, k=5)
base_1 = safe_generate(q1, model="models/gemini-2.5-flash").text

examples.append({
    "type": "Strong RAG vs Baseline",
    "question": q1,
    "rag_answer": rag_1,
    "baseline_answer": base_1
})

# Example 2 — RAG says “not enough info”
q2 = "What fall shoe brands should I buy right now?"
rag_2 = rag_answer(q2, k=5)
base_2 = safe_generate(q2, model="models/gemini-2.5-flash").text

examples.append({
    "type": "Grounded Uncertainty",
    "question": q2,
    "rag_answer": rag_2,
    "baseline_answer": base_2
})

# Example 3: materials question
q3 = "What materials should I look for if I want sustainable fall basics?"
rag_3 = rag_answer(q3, k=5)
base_3 = safe_generate(q3, model="models/gemini-2.5-flash").text

examples.append({
    "type": "Strong RAG vs Baseline",
    "question": q3,
    "rag_answer": rag_3,
    "baseline_answer": base_3
})

examples


[{'type': 'Strong RAG vs Baseline',
  'question': 'How would you style for fall if you want to be fashion-forward but sustainable?',
  'rag_answer': 'Here\'s my approach to fall styling, balancing forward-thinking fashion with my commitment to a mindful wardrobe.\n\nA) My fall vibe in one line\nCurated comfort: natural fibers, earthy tones, and refined layering for a mindful, enduring fall aesthetic.\n\nB) 3 outfit formulas\n1.  **Top:** Fitted long-sleeve t-shirt (cotton or silk blend)\n    **Bottom:** Boiled wool pants (with the comfort and look of high-quality loungewear)\n    **Shoe:** Neutral ankle boots\n    **Outer layer:** Cashmere or mohair sweater\n2.  **Top:** Oversized button-down shirt (linen)\n    **Bottom:** Dark wash jeans\n    **Shoe:** Neutral loafers\n    **Outer layer:** Wool blazer\n3.  **Top:** Cotton lightweight sweater (in an earthy fall tone)\n    **Bottom:** Knit pants (dark, comfortable yet structured)\n    **Shoe:** Neutral flats\n    **Outer layer:** Classi

In [40]:
import json

with open("data/rag_examples.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)

print("Saved → data/rag_examples.json")


Saved → data/rag_examples.json


# Phase 2: YouTube as Data Source 2 (Video/Audio Download and Transcription)

In [45]:
!pip -q install yt-dlp google-generativeai



In [43]:
!ffmpeg -version
!ffprobe -version



ffmpeg version 8.0.1 Copyright (c) 2000-2025 the FFmpeg developers
built with Apple clang version 16.0.0 (clang-1600.0.26.6)
configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/8.0.1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --ena

# 1) Download audio-only MP3s

In [52]:
import os, subprocess, shlex

YOUTUBE_URLS = [
    "https://www.youtube.com/watch?v=6iRUd7b-0e8",
    "https://www.youtube.com/watch?v=vMnDaid_LPo",
    "https://www.youtube.com/watch?v=KLM-93oF-5k",
    "https://www.youtube.com/watch?v=7aZdvlt07wE",
    "https://www.youtube.com/watch?v=BsjHWMwJWos",
    "https://www.youtube.com/watch?v=uudFZHCH1tA",
    "https://www.youtube.com/watch?v=JTUMEmf6HFA",
    "https://www.youtube.com/watch?v=1iyPcPiwG_U",
    "https://www.youtube.com/watch?v=LS93Cmjs2TE",
]

AUDIO_DIR = "data/youtube_audio"
os.makedirs(AUDIO_DIR, exist_ok=True)

def download_audio(url: str):
    cmd = (
        f'yt-dlp -x --audio-format mp3 '
        f'-o "{AUDIO_DIR}/%(id)s.%(ext)s" {shlex.quote(url)}'
    )
    print("Running:", cmd)
    subprocess.run(cmd, shell=True, check=True)

for u in YOUTUBE_URLS:
    download_audio(u)

print("✅ Done downloading audio to:", AUDIO_DIR)


Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=6iRUd7b-0e8'
[youtube] Extracting URL: https://www.youtube.com/watch?v=6iRUd7b-0e8
[youtube] 6iRUd7b-0e8: Downloading webpage




[youtube] 6iRUd7b-0e8: Downloading android sdkless player API JSON
[youtube] 6iRUd7b-0e8: Downloading web safari player API JSON




[youtube] 6iRUd7b-0e8: Downloading m3u8 information




[info] 6iRUd7b-0e8: Downloading 1 format(s): 251-11
[download] data/youtube_audio/6iRUd7b-0e8.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/6iRUd7b-0e8.mp3; file is already in target format mp3
Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=vMnDaid_LPo'
[youtube] Extracting URL: https://www.youtube.com/watch?v=vMnDaid_LPo
[youtube] vMnDaid_LPo: Downloading webpage




[youtube] vMnDaid_LPo: Downloading android sdkless player API JSON
[youtube] vMnDaid_LPo: Downloading web safari player API JSON




[youtube] vMnDaid_LPo: Downloading m3u8 information




[info] vMnDaid_LPo: Downloading 1 format(s): 251-11
[download] data/youtube_audio/vMnDaid_LPo.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/vMnDaid_LPo.mp3; file is already in target format mp3
Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=KLM-93oF-5k'
[youtube] Extracting URL: https://www.youtube.com/watch?v=KLM-93oF-5k
[youtube] KLM-93oF-5k: Downloading webpage




[youtube] KLM-93oF-5k: Downloading android sdkless player API JSON
[youtube] KLM-93oF-5k: Downloading web safari player API JSON




[youtube] KLM-93oF-5k: Downloading m3u8 information




[info] KLM-93oF-5k: Downloading 1 format(s): 251-11
[download] data/youtube_audio/KLM-93oF-5k.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/KLM-93oF-5k.mp3; file is already in target format mp3
Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=7aZdvlt07wE'
[youtube] Extracting URL: https://www.youtube.com/watch?v=7aZdvlt07wE
[youtube] 7aZdvlt07wE: Downloading webpage




[youtube] 7aZdvlt07wE: Downloading android sdkless player API JSON
[youtube] 7aZdvlt07wE: Downloading web safari player API JSON




[youtube] 7aZdvlt07wE: Downloading m3u8 information




[info] 7aZdvlt07wE: Downloading 1 format(s): 251-8
[download] data/youtube_audio/7aZdvlt07wE.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/7aZdvlt07wE.mp3; file is already in target format mp3
Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=BsjHWMwJWos'
[youtube] Extracting URL: https://www.youtube.com/watch?v=BsjHWMwJWos
[youtube] BsjHWMwJWos: Downloading webpage




[youtube] BsjHWMwJWos: Downloading android sdkless player API JSON
[youtube] BsjHWMwJWos: Downloading web safari player API JSON




[youtube] BsjHWMwJWos: Downloading m3u8 information




[info] BsjHWMwJWos: Downloading 1 format(s): 251-9
[download] data/youtube_audio/BsjHWMwJWos.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/BsjHWMwJWos.mp3; file is already in target format mp3
Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=uudFZHCH1tA'
[youtube] Extracting URL: https://www.youtube.com/watch?v=uudFZHCH1tA
[youtube] uudFZHCH1tA: Downloading webpage




[youtube] uudFZHCH1tA: Downloading android sdkless player API JSON
[youtube] uudFZHCH1tA: Downloading web safari player API JSON




[youtube] uudFZHCH1tA: Downloading m3u8 information




[info] uudFZHCH1tA: Downloading 1 format(s): 251-3
[download] data/youtube_audio/uudFZHCH1tA.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/uudFZHCH1tA.mp3; file is already in target format mp3
Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=JTUMEmf6HFA'
[youtube] Extracting URL: https://www.youtube.com/watch?v=JTUMEmf6HFA
[youtube] JTUMEmf6HFA: Downloading webpage




[youtube] JTUMEmf6HFA: Downloading android sdkless player API JSON
[youtube] JTUMEmf6HFA: Downloading web safari player API JSON




[youtube] JTUMEmf6HFA: Downloading m3u8 information




[info] JTUMEmf6HFA: Downloading 1 format(s): 251-9
[download] data/youtube_audio/JTUMEmf6HFA.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/JTUMEmf6HFA.mp3; file is already in target format mp3
Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=1iyPcPiwG_U'
[youtube] Extracting URL: https://www.youtube.com/watch?v=1iyPcPiwG_U
[youtube] 1iyPcPiwG_U: Downloading webpage
[youtube] 1iyPcPiwG_U: Downloading android sdkless player API JSON
[youtube] 1iyPcPiwG_U: Downloading web safari player API JSON
[youtube] 1iyPcPiwG_U: Downloading m3u8 information
[info] 1iyPcPiwG_U: Downloading 1 format(s): 251-10
[download] data/youtube_audio/1iyPcPiwG_U.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/1iyPcPiwG_U.mp3; file is already in target format mp3
Running: yt-dlp -x --audio-format mp3 -o "data/youtube_audio/%(id)s.%(ext)s" 'https://www.youtube.com/watch?v=LS93Cmjs2TE'
[youtube] Extracting URL: https://www.youtube.com/watch?v=LS93Cmjs2TE
[youtube] LS93Cmjs2TE: Downloading webpage
[youtube] LS93Cmjs2TE: Downloading android sdkless player API JSON
[youtube] LS93Cmjs2TE: Downloading web safari player API JSON
[youtube] LS93Cmjs2TE: Downloading m3u8 information
[info] LS93Cmjs2TE: Downloading 1 format(s): 251-11
[download] data/youtube_audio/LS93Cmjs2TE.mp3 has already been downloaded
[ExtractAudio] Not converting audio data/youtube_audio/LS93Cmjs2TE.mp3; file is already in target format mp3
✅ Done downloading audio to: data/youtube_audio


# 2) Transcribe MP3s with Gemini → JSONL

In [62]:
import os, json, time, glob
from datetime import datetime, timezone
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("models/gemini-2.5-flash")

OUT_YT_JSONL = "data/youtube_transcripts.jsonl"

def upload_wait(path: str):
    f = genai.upload_file(path=path)
    while f.state.name == "PROCESSING":
        time.sleep(5)
        f = genai.get_file(name=f.name)
    if f.state.name == "FAILED":
        raise RuntimeError(f"Gemini failed processing {path}")
    return f

PROMPT = """
You are transcribing a fashion YouTube video for a customer persona twin RAG system.

Return ONLY valid JSON:
{
  "source_type": "youtube",
  "video_id": "string",
  "url": "string",
  "transcript": "string (readable, near-verbatim; don't over-summarize)",
  "key_fall_signals": ["string"],
  "key_sustainability_signals": ["string"],
  "scraped_at_utc": "string"
}

Rules:
- No markdown, JSON only.
- transcript should be plain text (no timestamps needed).
"""

mp3_files = sorted(glob.glob("data/youtube_audio/*.mp3"))

with open(OUT_YT_JSONL, "w", encoding="utf-8") as out:
    for path in mp3_files:
        video_id = os.path.splitext(os.path.basename(path))[0]
        print("Transcribing:", video_id)

        up = upload_wait(path)
        try:
            resp = model.generate_content([PROMPT, up])
            raw = resp.text

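            try:
                obj = json.loads(raw)
            except json.JSONDecodeError:
                # Fall back to the outermost {...} span, e.g. when the reply
                # has extra text or a code fence around the JSON object.
                start = raw.find("{")
                end = raw.rfind("}") + 1
                if start == -1 or end <= start:
                    print("  Skipping (no JSON object in response):", video_id)
                    continue
                obj = json.loads(raw[start:end])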

            obj["source_type"] = "youtube"
            obj["video_id"] = video_id
            obj["url"] = f"https://www.youtube.com/watch?v={video_id}"
            obj["scraped_at_utc"] = datetime.now(timezone.utc).isoformat()

            out.write(json.dumps(obj, ensure_ascii=False) + "\n")

        finally:
            try:
                genai.delete_file(up.name)
            except Exception:
                pass

print("✅ Saved transcripts to:", OUT_YT_JSONL)


Transcribing: 1iyPcPiwG_U
Transcribing: 6iRUd7b-0e8
Transcribing: 7aZdvlt07wE
Transcribing: BsjHWMwJWos
Transcribing: JTUMEmf6HFA
Transcribing: KLM-93oF-5k
Transcribing: LS93Cmjs2TE
Transcribing: uudFZHCH1tA
Transcribing: vMnDaid_LPo
✅ Saved transcripts to: data/youtube_transcripts.jsonl
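The parsing fallback in the cell above slices from the first `{` to the last `}`. A slightly more defensive variant is sketched below, assuming the model sometimes wraps its reply in a Markdown code fence. `extract_json_object` is a hypothetical helper name and is not defined anywhere else in this notebook.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this cell easy to paste

def extract_json_object(raw: str) -> dict:
    """Parse a JSON object from a model reply that may carry extra text.

    Hypothetical helper (an assumption, not the notebook's own code):
    strips a surrounding code fence if present, then falls back to the
    outermost {...} span.
    """
    text = raw.strip()
    # Drop a leading fence (optionally tagged, e.g. "json") and a trailing fence.
    text = re.sub(rf"^{FENCE}[a-zA-Z]*\s*", "", text)
    text = re.sub(rf"\s*{FENCE}$", "", text)
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model reply")
        obj = json.loads(text[start:end + 1])
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object at the top level")
    return obj
```

In the transcription loop, `obj = extract_json_object(resp.text)` inside a `try/except ValueError` would let one bad reply be skipped without stopping the remaining files.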


# 3) Append YouTube docs to your existing Reddit docs → all_docs_v2.jsonl

In [65]:
import json, os

REDDIT_JSONL = "data/reddit_threads.jsonl"   # your existing file
YOUTUBE_JSONL = "data/youtube_transcripts.jsonl"
ALL_DOCS_V2 = "data/all_docs_v2.jsonl"

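def load_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines so json.loads() does not fail on them
                yield json.loads(line)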

docs = []

# Reddit (fashion only)
for t in load_jsonl(REDDIT_JSONL):
    if str(t.get("subreddit","")).lower() != "femalefashionadvice":
        continue
    post_id = t.get("post_id","unknown")
    text = t.get("rag_text") or (t.get("title","") + "\n" + (t.get("selftext") or ""))
    docs.append({
        "doc_id": f"reddit_{post_id}",
        "source_type": "reddit",
        "title": t.get("title",""),
        "url": t.get("thread_url",""),
        "text": text
    })

# YouTube
for y in load_jsonl(YOUTUBE_JSONL):
    vid = y["video_id"]
    text = (
        "TRANSCRIPT:\n" + y.get("transcript","") +
        "\n\nFALL SIGNALS:\n" + "\n".join(y.get("key_fall_signals", [])) +
        "\n\nSUSTAINABILITY SIGNALS:\n" + "\n".join(y.get("key_sustainability_signals", []))
    )
    docs.append({
        "doc_id": f"youtube_{vid}",
        "source_type": "youtube",
        "title": "",
        "url": y.get("url"),
        "text": text
    })

with open(ALL_DOCS_V2, "w", encoding="utf-8") as f:
    for d in docs:
        f.write(json.dumps(d, ensure_ascii=False) + "\n")

print("✅ Wrote", len(docs), "docs to", ALL_DOCS_V2)


✅ Wrote 12 docs to data/all_docs_v2.jsonl


In [67]:
!pip -q install faiss-cpu



# 4) Rebuild chunks + embeddings + FAISS (Reddit + YouTube)

In [70]:
import json, os
import numpy as np
import faiss

DOCS_PATH = "data/all_docs_v2.jsonl"

def chunk_text_words(text: str, size: int = 250):
    words = text.split()
    return [" ".join(words[i:i+size]) for i in range(0, len(words), size)]

# ---- 1) Load docs + create chunks ----
CHUNK_SIZE = 250   # you can change to 200–400

chunks = []
with open(DOCS_PATH, "r", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        doc_id = doc["doc_id"]
        source = doc["source_type"]
        title  = doc.get("title","")
        url    = doc.get("url","")
        text   = doc.get("text","")

        for i, ch in enumerate(chunk_text_words(text, size=CHUNK_SIZE)):
            ch = ch.strip()
            if not ch:
                continue
            chunks.append({
                "source": source,
                "doc_id": doc_id,
                "title": title,
                "url": url,
                "chunk_id": f"{doc_id}_c{i:03d}",
                "text": ch
            })

print("✅ Total chunks:", len(chunks))

# ---- 2) Embed chunks ----
print("Computing embeddings (this may take a while)...")

embeddings = []
valid_chunks = []

for idx, c in enumerate(chunks, 1):
    txt = c["text"].strip()
    if not txt:
        continue

    emb = embed(txt)  # <-- uses your Gemini embedding function
    embeddings.append(emb)
    valid_chunks.append(c)

    if idx % 25 == 0:
        print(f"  embedded {idx}/{len(chunks)}")

emb_matrix = np.vstack(embeddings).astype("float32")

# ---- 3) Build FAISS index ----
dim = emb_matrix.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(emb_matrix)

# Replace chunks with valid ones
chunks = valid_chunks

print("✅ Index built.")
print("Dimension:", dim)
print("Vectors in index:", index.ntotal)


✅ Total chunks: 252
Computing embeddings (this may take a while)...
  embedded 25/252
  embedded 50/252
  embedded 75/252
  embedded 100/252
  embedded 125/252
  embedded 150/252
  embedded 175/252
  embedded 200/252
  embedded 225/252
  embedded 250/252
✅ Index built.
Dimension: 3072
Vectors in index: 252
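`chunk_text_words` cuts the text into fixed, non-overlapping windows, so a sentence that straddles a boundary ends up split across two chunks. One common alternative is to let consecutive windows share some words. The sketch below is an assumption about how that could look, not what this notebook ran; the function name and the `overlap` parameter are hypothetical.

```python
def chunk_text_words_overlap(text: str, size: int = 250, overlap: int = 50) -> list:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one.

    Hypothetical variant of chunk_text_words; with overlap=0 it produces the same chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Overlap increases the number of chunks, and therefore the number of embedding calls, so any change here means rebuilding the FAISS index.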


# Quick test: confirm YouTube is included in retrieval

In [73]:
q = "Give me a sustainable fall capsule wardrobe with outfit formulas."
hits = retrieve(q, k=8)

for h in hits:
    print(h["chunk_id"], "|", h["source"], "|", (h["url"] or "")[:60])


youtube_vMnDaid_LPo_c000 | youtube | https://www.youtube.com/watch?v=vMnDaid_LPo
youtube_vMnDaid_LPo_c016 | youtube | https://www.youtube.com/watch?v=vMnDaid_LPo
youtube_BsjHWMwJWos_c029 | youtube | https://www.youtube.com/watch?v=BsjHWMwJWos
youtube_6iRUd7b-0e8_c000 | youtube | https://www.youtube.com/watch?v=6iRUd7b-0e8
youtube_LS93Cmjs2TE_c010 | youtube | https://www.youtube.com/watch?v=LS93Cmjs2TE
youtube_uudFZHCH1tA_c015 | youtube | https://www.youtube.com/watch?v=uudFZHCH1tA
youtube_KLM-93oF-5k_c012 | youtube | https://www.youtube.com/watch?v=KLM-93oF-5k
youtube_vMnDaid_LPo_c001 | youtube | https://www.youtube.com/watch?v=vMnDaid_LPo


# 5) Re-run 3 demo questions and print outputs (baseline vs RAG)

In [76]:
Q1 = "How would you style for fall if you want to be fashion-forward but sustainable? Give 3 outfit formulas."
Q2 = "What materials should I prioritize for fall basics if I care about sustainability?"
Q3 = "What are the BEST sustainable shoe brands for fall 2025?"


In [78]:
def baseline_answer(q):
    return safe_generate(prompt=q, model="models/gemini-2.5-flash").text

def rag_persona_answer(q, k=6):
    return rag_answer(q, k=k)

for q in [Q1, Q2, Q3]:
    print("\n" + "="*80)
    print("QUESTION:", q)
    print("\n--- RAG ---")
    print(rag_persona_answer(q, k=8))
    print("\n--- BASELINE ---")
    print(baseline_answer(q))



QUESTION: How would you style for fall if you want to be fashion-forward but sustainable? Give 3 outfit formulas.

--- RAG ---
A) My fall vibe in one line
Chic, intentional layering in deep fall tones, where quality and longevity dictate my style.

B) 3 outfit formulas
1.  **Top:** Chunky cable knit burgundy sweater
    **Bottom:** Tailored pants
    **Shoe:** Black loafers
    **Outer layer:** Tweed blazer
2.  **Top:** Striped long sleeve shirt
    **Bottom:** Darker wash jeans
    **Shoe:** Black boots
    **Outer layer:** Chocolate brown jacket
3.  **Top:** Turtleneck
    **Bottom:** Barrel jean
    **Shoe:** Suede knee-high boots
    **Outer layer:** Deep brown wool coat

C) Sustainable moves
*   Prioritize quality, timeless pieces that will last for decades, avoiding trends I'll only wear once.
*   Curate a fall capsule wardrobe, shopping intentionally from a wish list.
*   Actively thrift for unique, secondhand items to minimize new purchases.
*   Consign or donate unworn items 
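
The comparison cell only prints the answers. To keep them for later scoring, the same loop could write one JSON line per question. This is a minimal sketch: `save_comparisons` and the output path in the usage comment are hypothetical names, and the two answer functions are passed in so the helper does not depend on how they are implemented.

```python
import json
from datetime import datetime, timezone

def save_comparisons(questions, rag_fn, baseline_fn, path) -> int:
    """Write one JSONL row per question with both answers; return the row count.

    Hypothetical helper: rag_fn and baseline_fn are any callables that take a
    question string and return an answer string.
    """
    n = 0
    with open(path, "w", encoding="utf-8") as out:
        for q in questions:
            row = {
                "question": q,
                "rag": rag_fn(q),
                "baseline": baseline_fn(q),
                "generated_at_utc": datetime.now(timezone.utc).isoformat(),
            }
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
            n += 1
    return n

# Possible usage in this notebook (the path is an assumption):
# save_comparisons([Q1, Q2, Q3],
#                  lambda q: rag_persona_answer(q, k=8),
#                  baseline_answer,
#                  "data/baseline_vs_rag.jsonl")
```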

In [80]:
q = "I work remotely. What are polished but comfy fall outfits for video calls?"
hits = retrieve(q, k=10)
for h in hits:
    print(h["chunk_id"], "|", h["source"])


reddit_1pcnc1c_c000 | reddit
reddit_1pcnc1c_c012 | reddit
reddit_1pcnc1c_c005 | reddit
reddit_1pcnc1c_c015 | reddit
reddit_1pcnc1c_c006 | reddit
reddit_1pcnc1c_c007 | reddit
reddit_1pcnc1c_c014 | reddit
reddit_1pcnc1c_c004 | reddit
reddit_1pcnc1c_c011 | reddit
reddit_1pcnc1c_c009 | reddit


In [82]:
hits = retrieve(q, k=15)
for h in hits:
    print(h["chunk_id"], "|", h["source"])


reddit_1pcnc1c_c000 | reddit
reddit_1pcnc1c_c012 | reddit
reddit_1pcnc1c_c005 | reddit
reddit_1pcnc1c_c015 | reddit
reddit_1pcnc1c_c006 | reddit
reddit_1pcnc1c_c007 | reddit
reddit_1pcnc1c_c014 | reddit
reddit_1pcnc1c_c004 | reddit
reddit_1pcnc1c_c011 | reddit
reddit_1pcnc1c_c009 | reddit
reddit_1pcnc1c_c003 | reddit
reddit_1pcnc1c_c008 | reddit
reddit_1pcnc1c_c010 | reddit
reddit_1pcnc1c_c016 | reddit
reddit_1pcnc1c_c001 | reddit


In [84]:
q_mix = "Give me 3 fall outfit formulas using layering and a capsule wardrobe approach, focused on sustainability."
hits = retrieve(q_mix, k=12)
for h in hits:
    print(h["chunk_id"], "|", h["source"])


youtube_vMnDaid_LPo_c016 | youtube
youtube_vMnDaid_LPo_c000 | youtube
youtube_6iRUd7b-0e8_c000 | youtube
youtube_BsjHWMwJWos_c029 | youtube
youtube_LS93Cmjs2TE_c010 | youtube
youtube_vMnDaid_LPo_c001 | youtube
youtube_6iRUd7b-0e8_c003 | youtube
youtube_uudFZHCH1tA_c015 | youtube
youtube_BsjHWMwJWos_c010 | youtube
youtube_6iRUd7b-0e8_c011 | youtube
youtube_6iRUd7b-0e8_c007 | youtube
youtube_KLM-93oF-5k_c000 | youtube


In [86]:
hits = retrieve(q_mix, k=25)
sources = {}
for h in hits:
    sources[h["source"]] = sources.get(h["source"], 0) + 1
    print(h["chunk_id"], "|", h["source"])
print("Source counts:", sources)


youtube_vMnDaid_LPo_c016 | youtube
youtube_vMnDaid_LPo_c000 | youtube
youtube_6iRUd7b-0e8_c000 | youtube
youtube_BsjHWMwJWos_c029 | youtube
youtube_LS93Cmjs2TE_c010 | youtube
youtube_vMnDaid_LPo_c001 | youtube
youtube_6iRUd7b-0e8_c003 | youtube
youtube_uudFZHCH1tA_c015 | youtube
youtube_BsjHWMwJWos_c010 | youtube
youtube_6iRUd7b-0e8_c011 | youtube
youtube_6iRUd7b-0e8_c007 | youtube
youtube_KLM-93oF-5k_c000 | youtube
youtube_KLM-93oF-5k_c012 | youtube
youtube_vMnDaid_LPo_c002 | youtube
reddit_1pcnc1c_c008 | reddit
youtube_uudFZHCH1tA_c014 | youtube
reddit_1pcnc1c_c015 | reddit
youtube_KLM-93oF-5k_c003 | youtube
youtube_6iRUd7b-0e8_c006 | youtube
youtube_uudFZHCH1tA_c006 | youtube
reddit_1pcnc1c_c000 | reddit
youtube_uudFZHCH1tA_c000 | youtube
reddit_1pcnc1c_c010 | reddit
reddit_1pcnc1c_c002 | reddit
youtube_KLM-93oF-5k_c005 | youtube
Source counts: {'youtube': 20, 'reddit': 5}
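The counts above show YouTube chunks dominating the top 25 for this query, while the remote-work query returned only Reddit chunks. If a more even mix is wanted in the prompt context, one option is to over-retrieve and then cap how many hits each source may contribute. The sketch below assumes hits are dicts with a `"source"` key, ordered best first, as the output of `retrieve` suggests; `cap_per_source` and `max_per_source` are hypothetical names.

```python
from collections import Counter

def cap_per_source(hits, k, max_per_source):
    """Keep at most `max_per_source` hits from each source, preserving rank order.

    Hypothetical post-filter: `hits` is a best-first list of dicts with a
    "source" key. Returns up to `k` hits.
    """
    kept = []
    seen = Counter()
    for h in hits:
        if seen[h["source"]] >= max_per_source:
            continue  # this source already has its quota
        kept.append(h)
        seen[h["source"]] += 1
        if len(kept) == k:
            break
    return kept

# Possible usage: over-retrieve, then balance.
# hits = cap_per_source(retrieve(q_mix, k=25), k=8, max_per_source=5)
```

The trade-off is that a cap can push a lower-ranked chunk ahead of a higher-ranked one from the dominant source, so it is worth checking answer quality before adopting it.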


In [88]:
def rag_answer(question: str, k: int = 8) -> str:
    retrieved = retrieve(question, k=k)

    # Build context with chunk ids
    context_blocks = []
    chunk_ids = []
    for c in retrieved:
        cid = c.get("chunk_id")
        chunk_ids.append(cid)
        context_blocks.append(f"[{cid}] {c['text']}")

    context = "\n\n".join(context_blocks)

    prompt = f"""
You are a Customer Persona Twin (persona-mimic): "Trend-Conscious Sustainable Shopper".

Voice + priorities:
- concise, confident, trend-aware
- prefers darker/earthy fall tones, layering, polished comfort
- sustainability is non-negotiable (natural fibers, thrifting, cost-per-wear)
- avoids fast fashion hype drops and synthetic-heavy "sweaty" fabrics

GROUNDING RULES (STRICT):
- Use ONLY what is stated in the CONTEXT.
- Do NOT invent brand names, materials, trends, or specific garments not mentioned.
- If shoes/brands are not mentioned, say "not specified in context".
- If the question asks for specific brand names, rankings ("best"), or future-year claims (e.g. 2025)
  and the CONTEXT does not explicitly provide them, you MUST refuse using the exact format below.

REFUSAL FORMAT (exact):
A) I don’t have enough info from the context to answer that.
E) Evidence used: [chunk_ids]

OUTPUT FORMAT (exact):
A) My fall vibe in one line
B) 3 outfit formulas (bullets; each = top + bottom + shoes + outer layer)
C) Sustainable moves (2–4 bullets; must be actionable)
D) Messaging test (write 2 marketing lines that would appeal to me)
E) Evidence used (list chunk_ids)

CONTEXT:
{context}

QUESTION:
{question}
""".strip()

    resp = safe_generate(prompt, model="models/gemini-2.5-flash")
    text = resp.text.strip()

    # If the model forgets to include evidence, append it safely
    if "Evidence used" not in text:
        text += "\n\nE) Evidence used\n" + "\n".join([f"* [{cid}]" for cid in chunk_ids])

    return text


In [90]:
print(rag_answer("What are the BEST sustainable shoe brands for fall 2025?", k=12))


A) I don’t have enough info from the context to answer that.
E) Evidence used: [youtube_KLM-93oF-5k_c012], [reddit_1olvz55_c002], [youtube_JTUMEmf6HFA_c015], [youtube_uudFZHCH1tA_c015], [youtube_LS93Cmjs2TE_c010], [youtube_LS93Cmjs2TE_c000], [youtube_BsjHWMwJWos_c008], [youtube_uudFZHCH1tA_c010], [youtube_BsjHWMwJWos_c029], [youtube_BsjHWMwJWos_c009], [youtube_vMnDaid_LPo_c016], [youtube_BsjHWMwJWos_c014]
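
Since the model writes the evidence list itself, it can be worth checking that every cited id really was in the retrieved set. The sketch below parses bracketed ids and reports any that were not retrieved. The helper name is hypothetical, and the regex assumes ids follow the `reddit_..._cNNN` / `youtube_..._cNNN` pattern seen in the outputs above.

```python
import re

# Matches ids such as [youtube_KLM-93oF-5k_c012] or [reddit_1olvz55_c002].
CHUNK_ID_RE = re.compile(r"\[((?:reddit|youtube)_[A-Za-z0-9_-]+_c\d{3,})\]")

def unknown_cited_ids(answer: str, retrieved_ids) -> list:
    """Return cited chunk ids that are not among the retrieved ids.

    Hypothetical check: an empty list means every bracketed id in the answer
    came from the retrieved context.
    """
    allowed = set(retrieved_ids)
    return [cid for cid in CHUNK_ID_RE.findall(answer) if cid not in allowed]
```

To use it with `rag_answer`, the function would need to return or expose its `chunk_ids` list alongside the text, which it does not do as written.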
