# Can an LLM Be Your Recommendation Engine?

**A Reproducible Experiment in Cold-Start Fashion Recommendation**

This notebook walks through a complete experiment testing whether an LLM can infer a user's latent style preferences from just 10 click signals and make accurate cross-retailer recommendations — no collaborative filtering, no purchase history, no style quiz.

## Experiment Overview

| Step | What happens | Key output |
|------|-------------|------------|
| **Phase 1** | Collect training data: 10 items the user clicked + 35 items they saw but skipped | `training_data.json`, `negative_samples.json` |
| **Phase 2A** | Synthesize a preference profile from positive signals only (5-agent ensemble) | `positive_only_brief.md` |
| **Phase 2B** | Synthesize a profile from positive + negative signals (5-agent ensemble) | `contrastive_brief.md` |
| **Phase 3** | Score 103 new items from 2 unseen retailers against both profiles | `fp_scores_contrastive.json`, `br_scores_contrastive.json` (and `_posonly` variants) |
| **Phase 4** | Blind evaluation: user labels all 103 items without seeing scores | `phase3_full_results.json` |
| **Phase 5** | Compute precision, recall, lift, tier calibration | Analysis below |

## What You Need to Reproduce This

1. **An Anthropic API key** (set as `ANTHROPIC_API_KEY` environment variable)
2. **Your own training data**: 5-15 items you'd click on + items you'd skip from a product category page
3. **A test catalog**: Items from a *different* retailer to score against the learned profile
4. **~$13-43** in API credits for the full run, going by the per-cell estimates below (less with a smaller catalog, fewer agents, or a cheaper model)
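
Most of the spend follows from the number of API calls, which is fixed by the ensemble size and the catalog size. A rough sketch of that arithmetic is below. The helper name `count_api_calls` is made up for illustration, and it counts calls only, not tokens or dollars.

```python
def count_api_calls(num_agents: int, catalog_size: int, num_profiles: int = 2) -> dict:
    """Count API calls for one full run of this notebook (hypothetical helper).

    Each profile needs `num_agents` synthesis calls plus one merge call,
    and every catalog item is scored once per profile.
    """
    synthesis = num_profiles * num_agents
    merge = num_profiles
    scoring = num_profiles * catalog_size
    return {
        "synthesis": synthesis,
        "merge": merge,
        "scoring": scoring,
        "total": synthesis + merge + scoring,
    }


# Settings used in this notebook: 5 agents, 103 test items, 2 profiles.
calls = count_api_calls(num_agents=5, catalog_size=103)
print(calls)
```

Scoring dominates the call count, so shrinking the test catalog is the quickest way to cut cost.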

---
## Setup

In [None]:
# Install dependencies (run once)
# !pip install anthropic pandas matplotlib seaborn

In [None]:
import anthropic
import json
import base64
import os
import re
import random
from pathlib import Path
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns

sns.set_theme(style="whitegrid", palette="muted")
plt.rcParams["figure.figsize"] = (10, 5)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

In [None]:
# ── Configuration ────────────────────────────────────────────────────
# Change these to customize the experiment for your own data.

MODEL = "claude-sonnet-4-5-20250929"  # or "claude-opus-4-6", "claude-haiku-4-5-20251001"
NUM_AGENTS = 5                         # how many independent synthesis runs to ensemble
DATA_DIR = Path("data")                # packaged experiment data

# Tier definitions (used for scoring)
TIER_SCHEMA = {
    1: "Strong Match — high confidence the user would click to explore this item",
    2: "Moderate Match — aligns with several preferences but has notable gaps",
    3: "Weak Match — one or two alignment points, but overall not a fit",
    4: "No Match — conflicts with core preferences or hits anti-preferences",
}

print(f"Model: {MODEL}")
print(f"Ensemble size: {NUM_AGENTS} agents")
print(f"Data dir: {DATA_DIR.resolve()}")

---
## Phase 1: Training Data

The raw signal is minimal: a user browsed ~370 jackets & coats on Anthropologie.com and **clicked on 10** (positive signal). We also captured **35 items the user saw but did not click** (negative signal).

Each item includes:
- Name, price, brand
- 1-2 hero images (what the user saw on the category browse page)
- Product attributes (material, silhouette, color, etc.)

### Collecting your own training data

To replicate with your own preferences:
1. Browse a product category page on any retailer
2. Save 5-15 items you'd click to explore further → `training_data.json`
3. Save 20-50 items you saw but skipped → `negative_samples.json`
4. For each item, save: name, price, and at least 1 hero image
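
As a starting point for your own files, here is a sketch of the JSON shape the cells in this notebook expect. The keys are the ones the notebook reads (`items`, `id`, `name`, `price_usd`, `hero_img_path`, `hero_img2_path`, plus the optional `source` and `category`); the values and the `missing_keys` helper are invented for illustration.

```python
import json

# Hypothetical example: values are invented, keys match what this notebook reads.
example_training_data = {
    "source": "example-retailer.com",   # optional, printed by the load cell
    "category": "jackets & coats",      # optional, printed by the load cell
    "items": [
        {
            "id": 1,
            "name": "Cropped Quilted Bomber",
            "price_usd": 198,
            "hero_img_path": "data/training/images/1_a.jpg",
            "hero_img2_path": "data/training/images/1_b.jpg",  # optional second image
        },
    ],
}

# The load cell indexes these keys directly, so a missing one raises KeyError.
REQUIRED_KEYS = ("id", "name", "price_usd")


def missing_keys(items: list) -> list:
    """Return (item index, key) pairs for required keys that are absent."""
    return [(i, k) for i, item in enumerate(items) for k in REQUIRED_KEYS if k not in item]


print("Missing keys:", missing_keys(example_training_data["items"]))
print(json.dumps(example_training_data, indent=2)[:200])
```

`negative_samples.json` can use the same `{"items": [...]}` wrapper or be a bare list; the load cell accepts both.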

In [None]:
# Load the training data from the original experiment
with open(DATA_DIR / "training" / "training_data.json") as f:
    training_data = json.load(f)

with open(DATA_DIR / "training" / "negative_samples.json") as f:
    negative_samples = json.load(f)

positive_items = training_data["items"]
negative_items = negative_samples["items"] if isinstance(negative_samples, dict) else negative_samples

print(f"Positive items (clicked): {len(positive_items)}")
print(f"Negative items (skipped): {len(negative_items)}")
print(f"Source: {training_data.get('source', 'N/A')}")
print(f"Category: {training_data.get('category', 'N/A')}")
print()
print("Positive items:")
for item in positive_items:
    print(f"  {item['id']:>3}. {item['name']:<55} ${item['price_usd']}")

### Helper: Encode images for the API

The Anthropic API accepts images as base64-encoded content blocks. This helper loads a local image and formats it for the messages API.

In [None]:
import mimetypes

def image_to_content_block(image_path: str) -> dict:
    """Convert a local image file to an Anthropic API content block."""
    path = Path(image_path)
    if not path.exists():
        raise FileNotFoundError(f"Image not found: {path}")
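    mime, _ = mimetypes.guess_type(str(path))
    if not mime:
        # guess_type can return None (e.g. for .webp on some Python versions);
        # map by suffix first so a non-JPEG file is not mislabeled as JPEG.
        by_suffix = {".png": "image/png", ".gif": "image/gif", ".webp": "image/webp"}
        mime = by_suffix.get(path.suffix.lower(), "image/jpeg")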
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": mime, "data": b64},
    }


def build_item_content_blocks(item: dict, include_images: bool = True) -> list:
    """Build a list of content blocks (text + images) for one item."""
    blocks = []
    text_desc = f"Item {item.get('id', '?')}: {item['name']} — ${item.get('price_usd', item.get('price', '?'))}"
    if "visual_description" in item:
        text_desc += f"\nVisual: {item['visual_description']}"
    if "key_details" in item:
        text_desc += f"\nDetails: {item['key_details']}"
    blocks.append({"type": "text", "text": text_desc})

    if include_images:
        for img_key in ["hero_img_path", "hero_img2_path"]:
            if img_key in item and item[img_key] and Path(item[img_key]).exists():
                blocks.append(image_to_content_block(item[img_key]))
    return blocks


# Quick test
sample = positive_items[0]
blocks = build_item_content_blocks(sample)
print(f"Built {len(blocks)} content blocks for: {sample['name']}")
print(f"  Block types: {[b['type'] for b in blocks]}")

---
## Phase 2A: Positive-Only Preference Synthesis

We ask the LLM to infer the user's latent preferences from **only the 10 clicked items** — no negative signal.

To reduce variance, we run **N independent agents** (same prompt, different samples from the model's output distribution) and merge their outputs into a consensus profile.

### The Prompt

This is the actual prompt used in the experiment. The key design choices:
- We provide **hero images** (what the user saw when browsing), not detail images
- We ask for **confidence tiers** (strong / medium); the merge step later adds a speculation tier for low-consensus findings
- We ask for **anti-preferences** inferred from absence

In [None]:
POSITIVE_ONLY_SYSTEM_PROMPT = """You are a style analyst specializing in inferring latent 
fashion preferences from implicit behavioral signals. You will receive a set of items 
that a user clicked on (chose to explore further) while browsing a large product 
category page. Your task is to infer the user's underlying style preferences.

For each preference dimension you identify, classify your confidence:
- STRONG PREFERENCES: Consistent patterns across most items (high confidence)
- MEDIUM CONFIDENCE: Appears in some items, plausible pattern
- ANTI-PREFERENCES: What the user likely avoids (inferred from absence)

Be specific and actionable. Instead of "likes casual styles", say "prefers relaxed/
oversized bomber silhouettes in cropped lengths with textural complexity."

Analyze the IMAGES carefully — visual signals (texture, drape, color temperature, 
styling context) are often more informative than text descriptions."""


def build_positive_only_prompt(items: list) -> list:
    """Build the user message content blocks for positive-only synthesis."""
    content = []
    content.append({
        "type": "text",
        "text": (
            f"A user browsed ~370 jackets & coats on Anthropologie.com and selected "
            f"these {len(items)} items (clicked to view detail page). "
            f"Infer the user's implicit style preferences from ONLY these positive signals.\n\n"
            f"For each item below, you'll see the name, price, and the hero images "
            f"visible on the category browse page (what the user saw before clicking).\n"
        ),
    })
    for item in items:
        content.extend(build_item_content_blocks(item, include_images=True))
    content.append({
        "type": "text",
        "text": (
            "\n\nBased on these clicked items, produce a comprehensive User Preference Profile "
            "with: STRONG PREFERENCES, MEDIUM CONFIDENCE PREFERENCES, and ANTI-PREFERENCES "
            "(inferred from what's absent in the selections). "
            "Cover: silhouette, material/texture, color palette, closure type, "
            "embellishment/details, price behavior, and overall aesthetic."
        ),
    })
    return content


print("Prompt builder ready. Preview of text blocks:")
preview = build_positive_only_prompt(positive_items)
for block in preview:
    if block["type"] == "text":
        print(f"  [text] {block['text'][:120]}...")
    else:
        print(f"  [image] {block['source']['media_type']}")

In [None]:
# ── Run the 5-agent ensemble (Phase 2A) ──────────────────────────────
# Each agent gets the identical prompt but produces an independent synthesis.
# No temperature is passed, so the API default (1.0) applies and runs differ.
#
# ⚠️  This cell makes API calls. Cost estimate: ~$3-8 depending on model.
#     Set RUN_API_CALLS = True to execute, or load saved results below.

RUN_API_CALLS = False  # Set to True to actually call the API

phase2a_syntheses = []

if RUN_API_CALLS:
    user_content = build_positive_only_prompt(positive_items)
    for i in range(NUM_AGENTS):
        print(f"Running agent {i+1}/{NUM_AGENTS}...")
        response = client.messages.create(
            model=MODEL,
            max_tokens=4096,
            system=POSITIVE_ONLY_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": user_content}],
        )
        synthesis = response.content[0].text
        phase2a_syntheses.append(synthesis)
        print(f"  Agent {i+1} done — {len(synthesis)} chars")
        
        # Save each agent's output
        out_dir = DATA_DIR / "results" / "agent_syntheses"
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / f"phase2a_agent{i+1}_synthesis.md", "w") as f:
            f.write(synthesis)
    print(f"\nAll {NUM_AGENTS} agents complete.")
else:
    # Load saved results from the original experiment
    for i in range(1, NUM_AGENTS + 1):
        path = DATA_DIR / "results" / "agent_syntheses" / f"phase2a_agent{i}_synthesis.md"
        if path.exists():
            phase2a_syntheses.append(path.read_text())
    print(f"Loaded {len(phase2a_syntheses)} saved agent syntheses from the original experiment.")

In [None]:
# ── Merge into consensus profile ─────────────────────────────────────
# We ask the LLM to read all N agent outputs and produce a merged brief,
# noting which preferences had full consensus vs. partial agreement.

MERGE_PROMPT = """You are given {n} independent style analyses of the same user's 
fashion preferences. Each was produced by a separate analyst seeing the same data.

Produce a single merged User Preference Profile that:
1. Notes consensus level for each preference (e.g., "5/5 agents agreed" vs "3/5")
2. Elevates high-consensus findings to STRONG PREFERENCES
3. Flags disagreements or low-consensus items as SPECULATION
4. Preserves specific, actionable language (not vague summaries)
5. Includes ANTI-PREFERENCES with consensus counts

Format the output as a structured markdown document with clear sections."""

if RUN_API_CALLS and phase2a_syntheses:
    agent_texts = "\n\n---\n\n".join(
        [f"## Agent {i+1} Analysis\n\n{s}" for i, s in enumerate(phase2a_syntheses)]
    )
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=MERGE_PROMPT.format(n=len(phase2a_syntheses)),
        messages=[{"role": "user", "content": agent_texts}],
    )
    phase2a_brief = response.content[0].text
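    profiles_dir = DATA_DIR / "results" / "profiles"
    profiles_dir.mkdir(parents=True, exist_ok=True)  # make sure the folder exists before writing
    with open(profiles_dir / "positive_only_brief.md", "w") as f: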
        f.write(phase2a_brief)
    print("Merged profile saved.")
else:
    phase2a_brief = (DATA_DIR / "results" / "profiles" / "positive_only_brief.md").read_text()
    print(f"Loaded saved positive-only preference brief ({len(phase2a_brief)} chars).")

print("\n" + "="*70)
print("POSITIVE-ONLY PREFERENCE BRIEF (first 2000 chars)")
print("="*70)
print(phase2a_brief[:2000])

---
## Phase 2B: Contrastive Preference Synthesis

Now we give the LLM **both** signals: the 10 items the user clicked **and** 35 items the user saw but did not click. The hypothesis is that negative signal should refine the profile — turning "likes bombers" into "likes bombers ONLY with embellishment."

### What changed in the prompt

The key addition is asking the LLM to use negative items to:
1. **Sharpen boundaries** — what specifically within a liked category gets rejected?
2. **Add conditional logic** — "likes X only when Y"
3. **Confirm anti-preferences** with direct evidence (not just absence)

In [None]:
CONTRASTIVE_SYSTEM_PROMPT = """You are a style analyst specializing in inferring latent 
fashion preferences from implicit behavioral signals. You will receive two sets of items:

1. POSITIVE items: items the user clicked on (chose to explore further)
2. NEGATIVE items: items the user saw on the same page but did NOT click

Use both signals to build a nuanced preference profile. The negative items are especially
valuable for:
- Sharpening boundaries within liked categories (e.g., "likes leather BUT NOT in moto style")
- Adding conditional logic ("likes embroidery ONLY in structured silhouettes")
- Confirming anti-preferences with direct evidence

For each preference, classify confidence and note whether it comes from positive signal,
negative signal, or both.

Analyze the IMAGES carefully — visual signals are often more informative than text."""


def build_contrastive_prompt(pos_items: list, neg_items: list) -> list:
    """Build user message for contrastive synthesis."""
    content = []
    content.append({
        "type": "text",
        "text": (
            f"A user browsed ~370 jackets & coats on Anthropologie.com. "
            f"They clicked on {len(pos_items)} items (POSITIVE signal) and saw but "
            f"did NOT click on {len(neg_items)} other items (NEGATIVE signal).\n\n"
            f"── POSITIVE ITEMS (user clicked to view) ──\n"
        ),
    })
    for item in pos_items:
        content.extend(build_item_content_blocks(item, include_images=True))

    content.append({"type": "text", "text": "\n── NEGATIVE ITEMS (user saw but skipped) ──\n"})
    for item in neg_items:
        content.extend(build_item_content_blocks(item, include_images=True))

    content.append({
        "type": "text",
        "text": (
            "\n\nProduce a User Preference Profile using BOTH positive and negative signals. "
            "For each preference: state it, note the evidence (which positive items support it, "
            "which negative items confirm the boundary), and classify confidence.\n"
            "Include: STRONG PREFERENCES, CONDITIONAL PREFERENCES (likes X only when Y), "
            "and ANTI-PREFERENCES (with specific negative-item evidence)."
        ),
    })
    return content


print(f"Contrastive prompt will include {len(positive_items)} positive + {len(negative_items)} negative items.")

In [None]:
# ── Run 5-agent ensemble for contrastive synthesis ────────────────────
# ⚠️  This cell is more expensive than 2A because of the larger input (45 items with images).
#     Cost estimate: ~$8-15 depending on model.

phase2b_syntheses = []

if RUN_API_CALLS:
    user_content = build_contrastive_prompt(positive_items, negative_items)
    for i in range(NUM_AGENTS):
        print(f"Running agent {i+1}/{NUM_AGENTS}...")
        response = client.messages.create(
            model=MODEL,
            max_tokens=6000,
            system=CONTRASTIVE_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": user_content}],
        )
        synthesis = response.content[0].text
        phase2b_syntheses.append(synthesis)
        print(f"  Agent {i+1} done — {len(synthesis)} chars")
        # Save each agent's output so the load branch below can find it on later runs
        out_dir = DATA_DIR / "results" / "agent_syntheses"
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / f"phase2b_agent{i+1}_synthesis.md", "w") as f:
            f.write(synthesis)
    print(f"\nAll {NUM_AGENTS} agents complete.")
else:
    for i in range(1, NUM_AGENTS + 1):
        path = DATA_DIR / "results" / "agent_syntheses" / f"phase2b_agent{i}_synthesis.md"
        if path.exists():
            phase2b_syntheses.append(path.read_text())
    print(f"Loaded {len(phase2b_syntheses)} saved contrastive agent syntheses.")

In [None]:
# ── Merge contrastive profile ────────────────────────────────────────

if RUN_API_CALLS and phase2b_syntheses:
    agent_texts = "\n\n---\n\n".join(
        [f"## Agent {i+1} Analysis\n\n{s}" for i, s in enumerate(phase2b_syntheses)]
    )
    response = client.messages.create(
        model=MODEL,
        max_tokens=6000,
        system=MERGE_PROMPT.format(n=len(phase2b_syntheses)),
        messages=[{"role": "user", "content": agent_texts}],
    )
    phase2b_brief = response.content[0].text
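    profiles_dir = DATA_DIR / "results" / "profiles"
    profiles_dir.mkdir(parents=True, exist_ok=True)  # make sure the folder exists before writing
    with open(profiles_dir / "contrastive_brief.md", "w") as f: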
        f.write(phase2b_brief)
    print("Merged contrastive profile saved.")
else:
    phase2b_brief = (DATA_DIR / "results" / "profiles" / "contrastive_brief.md").read_text()
    print(f"Loaded saved contrastive preference brief ({len(phase2b_brief)} chars).")

print("\n" + "="*70)
print("CONTRASTIVE PREFERENCE BRIEF (first 2000 chars)")
print("="*70)
print(phase2b_brief[:2000])

### What did negative signal add?

The contrastive profile's main value-add is **conditional logic** — turning flat preferences into boundary-aware rules:

| Positive-only said | Contrastive refined to |
|---|---|
| "Likes bombers" | "Likes bombers ONLY with embellishment" (skipped plain bombers at $360-400) |
| "Likes faux leather" | "Likes faux leather in unconventional silhouettes only" (skipped standard moto) |
| "Likes craft details" | "Likes craft details in structured silhouettes only" (skipped boho/tie-front with same details) |
| "Prefers relaxed fit" | "Actively rejects fitted/body-conscious" (confirmed by 3+ skipped fitted items) |

---
## Phase 3: Score New Catalog Items

Now we test the profiles on **unseen items from different retailers**. This is the cold-start cross-retailer transfer test.

- **Free People**: 70 jackets & coats
- **Banana Republic**: 33 jackets & coats

Each item is scored on the 4-tier scale by both the positive-only and contrastive profiles.

### The Scoring Prompt

For each item, we send the preference brief + the item's image and metadata, and ask the LLM to assign a tier with a rationale.
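
The scoring call asks for a fixed reply format (`Tier: [1-4]`, then `Rationale: ...`). That is easy to parse, but a reply may wrap the number in markdown bold or contain an out-of-range digit. A slightly more defensive parser might look like the sketch below; `parse_tier_response` is a hypothetical name, not a function from the original experiment.

```python
import re
from typing import Optional, Tuple

# Accept optional markdown bold around the digit and only the tiers 1-4.
TIER_RE = re.compile(r"Tier:\s*\**\s*([1-4])\b", re.IGNORECASE)
RATIONALE_RE = re.compile(r"Rationale:\s*(.+)", re.IGNORECASE | re.DOTALL)


def parse_tier_response(text: str) -> Tuple[Optional[int], str]:
    """Extract (tier, rationale) from a scoring reply.

    Returns tier=None when no valid tier is found, so the caller can
    retry or drop the item instead of recording a bogus score.
    """
    tier_match = TIER_RE.search(text)
    tier = int(tier_match.group(1)) if tier_match else None
    rationale_match = RATIONALE_RE.search(text)
    rationale = rationale_match.group(1).strip() if rationale_match else text.strip()
    return tier, rationale


print(parse_tier_response("Tier: 2\nRationale: Cropped bomber, but plain."))
print(parse_tier_response("Tier: **1**\nRationale: Embellished and relaxed."))
print(parse_tier_response("Tier: 7\nRationale: Out of range."))
```

Items that come back with `tier=None` are worth counting before the analysis, since they silently fall out of the tier tables.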

In [None]:
SCORING_SYSTEM_PROMPT = """You are a recommendation engine. You have a detailed user 
preference profile and must score new items on how well they match.

Score each item on this 4-tier scale:
- Tier 1 (Strong Match): High confidence the user would click to explore this item
- Tier 2 (Moderate Match): Aligns with several preferences but has notable gaps  
- Tier 3 (Weak Match): One or two alignment points, but overall not a fit
- Tier 4 (No Match): Conflicts with core preferences or hits anti-preferences

For each item, provide:
1. The tier (1-4)
2. A 1-2 sentence rationale citing specific preference dimensions

Be calibrated: Tier 1 should be rare (~5-10% of items). Most items should be Tier 3-4.
Analyze the IMAGE carefully — visual match matters more than text description match."""


def score_single_item(item: dict, preference_brief: str, item_image_path: str = None) -> dict:
    """Score a single catalog item against a preference profile."""
    content = []
    content.append({
        "type": "text",
        "text": f"## User Preference Profile\n\n{preference_brief}",
    })
    content.append({
        "type": "text",
        "text": f"\n## Item to Score\n\nName: {item['name']}\nPrice: {item.get('price', 'N/A')}",
    })
    if item_image_path and Path(item_image_path).exists():
        content.append(image_to_content_block(item_image_path))
    content.append({
        "type": "text",
        "text": "\nScore this item. Respond in this exact format:\nTier: [1-4]\nRationale: [your reasoning]",
    })

    response = client.messages.create(
        model=MODEL,
        max_tokens=300,
        system=SCORING_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": content}],
    )
    text = response.content[0].text

    # Parse tier from response
    tier_match = re.search(r"Tier:\s*(\d)", text)
    tier = int(tier_match.group(1)) if tier_match else None
    rationale_match = re.search(r"Rationale:\s*(.+)", text, re.DOTALL)
    rationale = rationale_match.group(1).strip() if rationale_match else text

    return {"id": item["id"], "name": item["name"], "tier": tier, "rationale": rationale}


def score_catalog(catalog: list, preference_brief: str, image_dir: str = None) -> list:
    """Score an entire catalog of items. Returns list of scored items."""
    results = []
    for i, item in enumerate(catalog):
        img_path = None
        if image_dir:
            # Try common naming patterns
            for ext in [".jpg", ".png", ".webp"]:
                candidate = Path(image_dir) / f"{item['id']}{ext}"
                if candidate.exists():
                    img_path = str(candidate)
                    break
        if not img_path:
            img_path = item.get("img_path")  # fallback to relative path in data (None if absent)
            
        result = score_single_item(item, preference_brief, img_path)
        results.append(result)
        if (i + 1) % 10 == 0:
            print(f"  Scored {i+1}/{len(catalog)} items...")
    return results


print("Scoring functions ready.")

In [None]:
# ── Load test catalogs ───────────────────────────────────────────────

with open(DATA_DIR / "test" / "fp_catalog.json") as f:
    fp_catalog = json.load(f)
with open(DATA_DIR / "test" / "br_catalog.json") as f:
    br_catalog = json.load(f)

print(f"Free People catalog: {len(fp_catalog)} items")
print(f"Banana Republic catalog: {len(br_catalog)} items")
print(f"Total items to score: {len(fp_catalog) + len(br_catalog)}")

In [None]:
# ── Score catalogs (or load saved scores) ────────────────────────────
# ⚠️  Scoring 103 items × 2 profiles = 206 API calls.
#     Cost estimate: ~$15-20 with Sonnet, ~$2-4 with Haiku.

if RUN_API_CALLS:
    img_base = str(DATA_DIR / "test" / "images")
    print("Scoring Free People with contrastive profile...")
    fp_scores_contrastive = score_catalog(fp_catalog, phase2b_brief, f"{img_base}/fp")
    print("Scoring Banana Republic with contrastive profile...")
    br_scores_contrastive = score_catalog(br_catalog, phase2b_brief, f"{img_base}/br")
    print("Scoring Free People with positive-only profile...")
    fp_scores_posonly = score_catalog(fp_catalog, phase2a_brief, f"{img_base}/fp")
    print("Scoring Banana Republic with positive-only profile...")
    br_scores_posonly = score_catalog(br_catalog, phase2a_brief, f"{img_base}/br")
    
    # Save
    scores_dir = DATA_DIR / "results" / "scores"
    scores_dir.mkdir(parents=True, exist_ok=True)
    for name, data in [("fp_scores_contrastive", fp_scores_contrastive),
                       ("br_scores_contrastive", br_scores_contrastive),
                       ("fp_scores_posonly", fp_scores_posonly),
                       ("br_scores_posonly", br_scores_posonly)]:
        with open(scores_dir / f"{name}.json", "w") as f:
            json.dump(data, f, indent=2)
    print("All scores saved.")
else:
    # Load saved scores from the original experiment
    scores_dir = DATA_DIR / "results" / "scores"
    with open(scores_dir / "fp_scores_contrastive.json") as f:
        fp_scores_contrastive = json.load(f)
    with open(scores_dir / "br_scores_contrastive.json") as f:
        br_scores_contrastive = json.load(f)
    with open(scores_dir / "fp_scores_posonly.json") as f:
        fp_scores_posonly = json.load(f)
    with open(scores_dir / "br_scores_posonly.json") as f:
        br_scores_posonly = json.load(f)
    print("Loaded saved scores from the original experiment.")

# Combine into a single DataFrame for analysis
all_contrastive = fp_scores_contrastive + br_scores_contrastive
all_posonly = fp_scores_posonly + br_scores_posonly
print(f"\nContrastive scores: {len(all_contrastive)} items")
print(f"Positive-only scores: {len(all_posonly)} items")

In [None]:
# ── Tier distribution comparison ─────────────────────────────────────

def tier_counts(scores):
    c = Counter(s["tier"] for s in scores)
    return {t: c.get(t, 0) for t in [1, 2, 3, 4]}

ct = tier_counts(all_contrastive)
pt = tier_counts(all_posonly)

print("Tier distribution:")
print(f"{'Tier':<8} {'Contrastive':>12} {'Positive-only':>14}")
print("-" * 36)
for t in [1, 2, 3, 4]:
    print(f"Tier {t:<3} {ct[t]:>12} {pt[t]:>14}")
print(f"{'Total':<8} {sum(ct.values()):>12} {sum(pt.values()):>14}")

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, (title, counts) in zip(axes, [("Contrastive Profile", ct), ("Positive-Only Profile", pt)]):
    colors = ["#5C7D68", "#B8726A", "#D8CFBF", "#A0908A"]
    ax.bar([f"Tier {t}" for t in [1,2,3,4]], [counts[t] for t in [1,2,3,4]], color=colors)
    ax.set_title(title, fontsize=13)
    ax.set_ylabel("Number of items")
    for i, t in enumerate([1,2,3,4]):
        ax.text(i, counts[t] + 0.5, str(counts[t]), ha="center", fontsize=11)
plt.suptitle("How the LLM distributed items across tiers", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
## Phase 4: Blind Evaluation

The user evaluated all 103 items **blind** — items were shuffled, and no tier labels or retailer names were shown. For each item, the user saw two hero images and indicated whether they'd click to explore further (binary: click / skip).

### Generating a blind evaluation UI

The cell below creates a shuffled HTML file that a user can open in a browser to do the blind evaluation. As written, each item shows one image (taken from the item's `img_path`) with a "Would Explore" button and a "Skip" button.
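
Once every item has been labeled, the page prints a JSON object mapping item id to `true`/`false`. A minimal sketch of turning that pasted JSON into the set of clicked ids used in the analysis follows. The function name and the sample ids are illustrative, not taken from the original experiment.

```python
import json


def clicked_ids_from_results(results_json: str) -> set:
    """Return the ids labeled 'Would Explore' in the pasted results JSON."""
    labels = json.loads(results_json)
    # Keys are item ids (always strings in JSON); values are booleans.
    return {item_id for item_id, clicked in labels.items() if clicked is True}


# Invented sample of what the results box might contain.
pasted = '{"FP-001": true, "FP-002": false, "BR-010": true}'
clicked = clicked_ids_from_results(pasted)
print(sorted(clicked))
print(f"{len(clicked)} of {len(json.loads(pasted))} items clicked")
```

Because JSON keys are strings, catalog ids stored as integers would need converting before they are compared against this set.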

In [None]:
def generate_blind_eval_html(scored_items: list, output_path: str, image_base: str = "."):
    """Generate a shuffled HTML evaluation page for blind labeling.
    
    The user opens this in a browser, clicks 'Would Explore' or 'Skip' for each item,
    then copies the JSON results at the end.
    """
    # Shuffle items (but keep a key to map back)
    shuffled = list(enumerate(scored_items))
    random.shuffle(shuffled)
    
    html = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Blind Evaluation</title>
<style>
  body { font-family: system-ui; max-width: 800px; margin: 40px auto; background: #f5f5f5; }
  .item { background: white; padding: 20px; margin: 16px 0; border-radius: 12px;
          box-shadow: 0 1px 3px rgba(0,0,0,0.1); text-align: center; }
  .item img { max-width: 300px; max-height: 400px; margin: 8px; border-radius: 8px; }
  .buttons { margin-top: 12px; }
  .buttons button { padding: 10px 24px; margin: 0 8px; border: none; border-radius: 8px;
                    cursor: pointer; font-size: 14px; }
  .btn-click { background: #5C7D68; color: white; }
  .btn-skip { background: #D8CFBF; color: #2A2320; }
  .done { opacity: 0.4; }
  #results { margin-top: 40px; background: white; padding: 20px; border-radius: 12px;
             white-space: pre-wrap; font-family: monospace; font-size: 12px; }
</style>
</head><body>
<h1>Blind Item Evaluation</h1>
<p>For each item, decide: would you click to explore this further?</p>
<p id="progress">0 / """ + str(len(scored_items)) + """ evaluated</p>\n"""

    for display_pos, (orig_idx, item) in enumerate(shuffled):
        img_path = item.get("img_path", "")
        html += f"""<div class="item" id="item-{display_pos}">
  <div style="color:#999;font-size:12px">Item {display_pos + 1} of {len(scored_items)}</div>
  <div><img src="{image_base}/{img_path}" onerror="this.style.display='none'"></div>
  <div class="buttons">
    <button class="btn-click" onclick="record('{item['id']}', true, this)">Would Explore</button>
    <button class="btn-skip" onclick="record('{item['id']}', false, this)">Skip</button>
  </div>
</div>\n"""

    html += """
<div id="results" style="display:none">
  <h3>Results (copy this JSON):</h3>
  <pre id="results-json"></pre>
</div>
<script>
const results = {};
let count = 0;
const total = """ + str(len(scored_items)) + """;
function record(id, clicked, btn) {
  if (results[id] !== undefined) return;
  results[id] = clicked;
  btn.closest('.item').classList.add('done');
  count++;
  document.getElementById('progress').textContent = count + ' / ' + total + ' evaluated';
  if (count === total) {
    document.getElementById('results').style.display = 'block';
    document.getElementById('results-json').textContent = JSON.stringify(results, null, 2);
  }
}
</script>
</body></html>"""

    with open(output_path, "w") as f:
        f.write(html)
    print(f"Blind evaluation HTML written to: {output_path}")
    print(f"Items: {len(scored_items)} (shuffled)")
    
    # Save the shuffle key for later mapping
    shuffle_key = [{"display_pos": dp, "id": item["id"]} for dp, (_, item) in enumerate(shuffled)]
    key_path = output_path.replace(".html", "_shuffle_key.json")
    with open(key_path, "w") as f:
        json.dump(shuffle_key, f, indent=2)
    print(f"Shuffle key saved to: {key_path}")


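# Example: generate eval HTML (uncomment to run).
# Pass catalog items rather than score dicts: score dicts hold only
# id/name/tier/rationale, so they carry no img_path for the page to display.
# generate_blind_eval_html(fp_catalog + br_catalog, "blind_eval.html")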
print("Blind evaluation generator ready.")

In [None]:
# ── Load ground truth from the original experiment ────────────────────

with open(DATA_DIR / "results" / "evaluation" / "phase3_full_results.json") as f:
    full_results = json.load(f)

# Build a set of clicked item IDs from the ground truth
clicked_ids = {item["id"] for item in full_results["clicked_items"]}

print(f"Total items evaluated: {full_results['total_items']}")
print(f"Total clicks (user would explore): {full_results['total_clicks']}")
print(f"Overall click rate: {full_results['overall_click_rate']}")
print(f"\nClicked items: {sorted(clicked_ids)}")

---
## Phase 5: Analysis

Now we merge the LLM's tier scores with the user's blind click labels and compute metrics.
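
To make the metric definitions concrete before running them on the real data, here is a small worked example in plain Python. The ten tiers and click labels are invented toy values, and `precision_recall_lift` is a hypothetical helper that mirrors the definitions used in this notebook (Tier 1 and Tier 2 count as recommendations).

```python
def precision_recall_lift(tiers, clicked, rec_tiers=(1, 2)):
    """Compute precision, recall and lift for tier-based recommendations."""
    recommended = [t in rec_tiers for t in tiers]
    tp = sum(r and c for r, c in zip(recommended, clicked))
    fp = sum(r and not c for r, c in zip(recommended, clicked))
    fn = sum((not r) and c for r, c in zip(recommended, clicked))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    base_rate = sum(clicked) / len(clicked)  # click rate with no model at all
    lift = precision / base_rate if base_rate else 0.0
    return precision, recall, lift


# Toy data: 10 items, 3 recommended (tiers 1-2), 3 clicked overall.
toy_tiers = [1, 2, 2, 3, 3, 4, 4, 4, 3, 4]
toy_clicked = [True, True, False, False, True, False, False, False, False, False]

precision, recall, lift = precision_recall_lift(toy_tiers, toy_clicked)
print(f"Precision: {precision:.1%}  Recall: {recall:.1%}  Lift: {lift:.2f}x")
```

Here 2 of the 3 recommended items were clicked (precision 2/3), 2 of the 3 clicked items were recommended (recall 2/3), and the base click rate is 3/10, so lift is (2/3) / 0.3, roughly 2.2x.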

In [None]:
# ── Build analysis DataFrame ─────────────────────────────────────────

rows = []
for item in all_contrastive:
    item_id = item["id"]
    # Find the positive-only score for the same item
    posonly_match = next((s for s in all_posonly if s["id"] == item_id), None)
    rows.append({
        "id": item_id,
        "name": item["name"],
        "retailer": "Free People" if item_id.startswith("FP") else "Banana Republic",
        "tier_contrastive": item["tier"],
        "tier_posonly": posonly_match["tier"] if posonly_match else None,
        "user_clicked": item_id in clicked_ids,
        "rationale_contrastive": item.get("rationale", ""),
    })

df = pd.DataFrame(rows)
df["is_rec_contrastive"] = df["tier_contrastive"].isin([1, 2])
df["is_rec_posonly"] = df["tier_posonly"].isin([1, 2])

print(f"Analysis DataFrame: {len(df)} items")
print(f"\nRetailer breakdown:")
print(df["retailer"].value_counts().to_string())
print(f"\nUser click rate: {df['user_clicked'].mean():.1%}")
df.head()

In [None]:
# ── Tier calibration ─────────────────────────────────────────────────
# For each tier: what fraction did the user actually click?
# A well-calibrated system has monotonically decreasing click rates from T1→T4.

def tier_calibration(df, tier_col, label=""):
    print(f"\n{'='*50}")
    print(f"Tier Calibration: {label}")
    print(f"{'='*50}")
    cal = df.groupby(tier_col).agg(
        n_items=("user_clicked", "count"),
        n_clicked=("user_clicked", "sum"),
    )
    cal["click_rate"] = cal["n_clicked"] / cal["n_items"]
    cal.index.name = "Tier"
    print(cal.to_string(formatters={"click_rate": "{:.1%}".format}))
    return cal

cal_ct = tier_calibration(df, "tier_contrastive", "Contrastive Profile")
cal_po = tier_calibration(df, "tier_posonly", "Positive-Only Profile")
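The monotonicity criterion mentioned in the calibration cell can also be checked mechanically rather than by eye. The following is a minimal sketch, assuming a table shaped like the one `tier_calibration` returns (tiers as the index, a `click_rate` column). The helper names `is_monotone_decreasing` and `calibration_violations` and the toy numbers are illustrative, not part of the original experiment.

```python
import pandas as pd

def is_monotone_decreasing(cal: pd.DataFrame, col: str = "click_rate") -> bool:
    """True if the click rate never increases as the tier number goes up."""
    rates = cal.sort_index()[col]
    return bool(rates.is_monotonic_decreasing)

def calibration_violations(cal: pd.DataFrame, col: str = "click_rate") -> list:
    """Return (better_tier, worse_tier) pairs where the worse tier got MORE clicks."""
    rates = cal.sort_index()[col]
    tiers = list(rates.index)
    return [(a, b) for a, b in zip(tiers, tiers[1:]) if rates.loc[b] > rates.loc[a]]

# Toy calibration table with made-up counts (not results from the experiment)
toy_cal = pd.DataFrame(
    {"n_items": [10, 20, 30, 40], "n_clicked": [4, 5, 9, 2]},
    index=pd.Index([1, 2, 3, 4], name="Tier"),
)
toy_cal["click_rate"] = toy_cal["n_clicked"] / toy_cal["n_items"]

print(is_monotone_decreasing(toy_cal))
print(calibration_violations(toy_cal))
```

In the toy table, Tier 3 outperforms Tier 2, so the check fails and the offending pair is reported. The same two helpers could be called on `cal_ct` and `cal_po`.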

In [None]:
# ── Precision, Recall, Lift ──────────────────────────────────────────

def compute_metrics(df, rec_col, label=""):
    recs = df[df[rec_col]]
    non_recs = df[~df[rec_col]]
    
    tp = recs["user_clicked"].sum()
    fp = len(recs) - tp
    fn = non_recs["user_clicked"].sum()
    tn = len(non_recs) - fn
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    base_rate = df["user_clicked"].mean()
    lift = precision / base_rate if base_rate > 0 else 0
    
    print(f"\n{'='*50}")
    print(f"Metrics: {label}")
    print(f"{'='*50}")
    print(f"Recommendations (T1+T2): {len(recs)} items")
    print(f"Non-recommendations (T3+T4): {len(non_recs)} items")
    print(f"")
    print(f"True Positives:  {tp:>3}  (recommended & user clicked)")
    print(f"False Positives: {fp:>3}  (recommended but user skipped)")
    print(f"False Negatives: {fn:>3}  (not recommended but user clicked)")
    print(f"True Negatives:  {tn:>3}  (not recommended & user skipped)")
    print(f"")
    print(f"Precision: {precision:.1%}  (of items we recommended, how many did the user like?)")
    print(f"Recall:    {recall:.1%}  (of items the user liked, how many did we recommend?)")
    print(f"Lift:      {lift:.2f}x  (vs. {base_rate:.1%} base click rate)")
    
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "lift": lift}

m_ct = compute_metrics(df, "is_rec_contrastive", "Contrastive Profile (T1+T2 = recommendation)")
m_po = compute_metrics(df, "is_rec_posonly", "Positive-Only Profile (T1+T2 = recommendation)")
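With only about a hundred test items, precision is computed from a few dozen recommendations at most, so the point estimate is noisy. One way to make that visible is a Wilson score interval around the proportion. This is a sketch of a standard technique, not something the original analysis computed; the function name `wilson_interval` and the example counts are assumptions for illustration.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical counts: 8 true positives among 20 recommendations
lo, hi = wilson_interval(8, 20)
print(f"precision = {8 / 20:.1%}, 95% interval = [{lo:.1%}, {hi:.1%}]")
```

To apply it to the real metrics, pass `tp` and `tp + fp` from the dictionary returned by `compute_metrics`. Overlapping intervals for the two profiles would suggest the precision difference is within noise.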

In [None]:
# ── Side-by-side comparison ──────────────────────────────────────────

comparison = pd.DataFrame({
    "Contrastive": m_ct,
    "Positive-Only": m_po,
}).T

comparison["precision"] = comparison["precision"].map("{:.1%}".format)
comparison["recall"] = comparison["recall"].map("{:.1%}".format)
comparison["lift"] = comparison["lift"].map("{:.2f}x".format)
comparison[["tp", "fp", "fn", "tn"]] = comparison[["tp", "fp", "fn", "tn"]].astype(int)

print("\n" + "="*60)
print("HEAD-TO-HEAD: Contrastive vs Positive-Only")
print("="*60)
print(comparison.to_string())

In [None]:
# ── Visualization: Tier calibration curves ───────────────────────────

fig, ax = plt.subplots(figsize=(8, 5))

tiers = [1, 2, 3, 4]
ct_rates = [cal_ct.loc[t, "click_rate"] if t in cal_ct.index else 0 for t in tiers]
po_rates = [cal_po.loc[t, "click_rate"] if t in cal_po.index else 0 for t in tiers]

x = range(len(tiers))
width = 0.35
bars1 = ax.bar([i - width/2 for i in x], ct_rates, width, label="Contrastive", color="#B8726A")
bars2 = ax.bar([i + width/2 for i in x], po_rates, width, label="Positive-Only", color="#5C7D68")

ax.axhline(y=df["user_clicked"].mean(), color="gray", linestyle="--", alpha=0.7, label=f"Base rate ({df['user_clicked'].mean():.1%})")
ax.set_xticks(x)
ax.set_xticklabels([f"Tier {t}" for t in tiers])
ax.set_ylabel("User click rate")
ax.set_title("Tier Calibration: Did higher tiers actually get more clicks?")
ax.yaxis.set_major_formatter(mticker.PercentFormatter(1.0))
ax.legend()

for bar_group in [bars1, bars2]:
    for bar in bar_group:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f"{height:.0%}", ha="center", va="bottom", fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# ── Tier agreement heatmap ───────────────────────────────────────────
# Cross-tabulate contrastive tier vs positive-only tier. Cells are colored and
# annotated by item count; pivot_rate holds the per-cell click rate for inspection.

cross = df.groupby(["tier_contrastive", "tier_posonly"]).agg(
    count=("user_clicked", "count"),
    clicks=("user_clicked", "sum"),
).reset_index()
cross["click_rate"] = cross["clicks"] / cross["count"]

pivot = cross.pivot(index="tier_contrastive", columns="tier_posonly", values="count").fillna(0).astype(int)
pivot_rate = cross.pivot(index="tier_contrastive", columns="tier_posonly", values="click_rate").fillna(0)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(pivot, annot=True, fmt="d", cmap="YlOrBr", ax=ax, cbar_kws={"label": "Item count"})
ax.set_xlabel("Positive-Only Tier")
ax.set_ylabel("Contrastive Tier")
ax.set_title("Where do the two profiles agree/disagree?")
plt.tight_layout()
plt.show()

# Agreement rate
agree = (df["tier_contrastive"] == df["tier_posonly"]).mean()
print(f"\nExact tier agreement: {agree:.1%}")
within_one = (abs(df["tier_contrastive"] - df["tier_posonly"]) <= 1).mean()
print(f"Within 1 tier: {within_one:.1%}")

In [None]:
# ── Error analysis: False Positives and False Negatives ──────────────

print("="*60)
print("FALSE POSITIVES (Contrastive): Recommended but user skipped")
print("="*60)
fps = df[(df["is_rec_contrastive"]) & (~df["user_clicked"])]
for _, row in fps.iterrows():
    print(f"  {row['id']}: {row['name']} (Tier {row['tier_contrastive']})")
    print(f"    Rationale: {row['rationale_contrastive'][:120]}...")
    print()

print("\n" + "="*60)
print("FALSE NEGATIVES (Contrastive): User clicked but we didn't recommend")
print("="*60)
fns = df[(~df["is_rec_contrastive"]) & (df["user_clicked"])]
for _, row in fns.iterrows():
    print(f"  {row['id']}: {row['name']} (Tier {row['tier_contrastive']})")
    print(f"    Rationale: {row['rationale_contrastive'][:120]}...")
    print()

---
## Cost Analysis

How much does this cost in practice? The answer depends heavily on model choice.

In [None]:
# ── Cost estimation ──────────────────────────────────────────────────
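# Approximate costs based on the original experiment's token usage.
# Prices are USD per million tokens and are a snapshot: check them against
# current published pricing before budgeting. "cache_read" is listed for
# reference only; estimate_cost() below does not apply it.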

PRICING = {
    "claude-opus-4-6":           {"input": 15.0, "output": 75.0, "cache_read": 1.50},
    "claude-sonnet-4-5-20250929": {"input": 3.0,  "output": 15.0, "cache_read": 0.30},
    "claude-haiku-4-5-20251001":  {"input": 0.80, "output": 4.0,  "cache_read": 0.08},
}

# Estimated tokens per operation (from original experiment)
TOKENS_PER_OP = {
    "profile_synthesis_per_agent": {"input": 50_000, "output": 3_000},  # with images
    "profile_merge":               {"input": 20_000, "output": 2_000},
    "score_single_item":           {"input": 5_000,  "output": 200},
}

def estimate_cost(model: str, n_agents: int, n_items: int, n_profiles: int = 2):
    """Estimate API cost for a full experiment run."""
    p = PRICING[model]
    
    # Profile synthesis: n_agents × n_profiles
    synth = TOKENS_PER_OP["profile_synthesis_per_agent"]
    synth_cost = n_agents * n_profiles * (
        synth["input"] * p["input"] / 1e6 + synth["output"] * p["output"] / 1e6
    )
    
    # Merge: n_profiles
    merge = TOKENS_PER_OP["profile_merge"]
    merge_cost = n_profiles * (
        merge["input"] * p["input"] / 1e6 + merge["output"] * p["output"] / 1e6
    )
    
    # Scoring: n_items × n_profiles
    score = TOKENS_PER_OP["score_single_item"]
    score_cost = n_items * n_profiles * (
        score["input"] * p["input"] / 1e6 + score["output"] * p["output"] / 1e6
    )
    
    total = synth_cost + merge_cost + score_cost
    return {
        "synthesis": synth_cost, "merge": merge_cost,
        "scoring": score_cost, "total": total,
    }

def print_cost_table(n_agents: int, n_items: int):
    """Print one cost row per model; Synthesis + Merge + Scoring sums to Total."""
    print(f"{'Model':<35} {'Synthesis':>10} {'Merge':>10} {'Scoring':>10} {'Total':>10}")
    print("-" * 79)
    for model_id in PRICING:
        short = model_id.split("-")[1].title()
        costs = estimate_cost(model_id, n_agents=n_agents, n_items=n_items)
        label = f"{short} ({n_agents} agents, {n_items} items)"
        # Right-align each dollar amount to the same width as its column header
        cells = " ".join(
            f"{'$' + format(costs[k], '.2f'):>10}"
            for k in ("synthesis", "merge", "scoring", "total")
        )
        print(f"{label:<35} {cells}")

print_cost_table(n_agents=5, n_items=103)

print("\n--- Smaller experiment (3 agents, 50 items) ---")
print_cost_table(n_agents=3, n_items=50)
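The pricing table carries a `cache_read` rate that the estimate never uses. If the preference profile is the same in every scoring prompt, part of each prompt could plausibly be served from cache. The sketch below shows how that would change the scoring line for a single profile. The function name `scoring_cost_with_cache` and the 60% cached fraction are assumptions, and the one-time cost of writing the cache is ignored.

```python
def scoring_cost_with_cache(n_items, input_tokens, output_tokens,
                            cached_fraction, price):
    """Scoring cost in USD when a fraction of each prompt is read from cache.

    Simplification: ignores the one-time cost of writing the cache entry.
    Prices are USD per million tokens.
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    per_item = (
        cached * price["cache_read"]
        + uncached * price["input"]
        + output_tokens * price["output"]
    ) / 1e6
    return n_items * per_item

# Rates copied from the Sonnet row of the pricing table in this section
sonnet = {"input": 3.0, "output": 15.0, "cache_read": 0.30}

no_cache = scoring_cost_with_cache(103, 5_000, 200, cached_fraction=0.0, price=sonnet)
# 0.6 is a guess at how much of each scoring prompt is the shared profile
with_cache = scoring_cost_with_cache(103, 5_000, 200, cached_fraction=0.6, price=sonnet)
print(f"no cache: ${no_cache:.2f}   60% cached: ${with_cache:.2f}")
```

The real saving depends on how much of the prompt is image content versus shared profile text, which this sketch does not model.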

---
## Key Findings

From the original experiment:

1. **The LLM understood preferences well from just 10 clicks.** Both profiles correctly identified the core aesthetic (craft-forward bombers, textural complexity, dark earth tones).

2. **Negative signal improved understanding but hurt recommendations.** The contrastive profile had sharper conditional logic ("bombers only with embellishment") but converted soft preferences into hard rejection gates, causing more false negatives.

3. **Positive-only achieved better recall at similar precision.** The simpler profile was more permissive, capturing more items the user actually liked.

4. **The "text bottleneck" limits recommendation quality.** The image → text → LLM pipeline loses continuous visual information (texture, drape, color temperature) when compressing to discrete text tokens. A native multimodal embedding approach might do better.

5. **Tier calibration worked directionally but not perfectly.** T4 items had the lowest click rate (~9%), so the system does pick out clear non-matches, but T1-T3 were not as cleanly separated as expected.

6. **Cross-retailer transfer is inherently hard.** The user's click rate on Banana Republic (3%) was far below Free People (24%), suggesting brand/aesthetic fit matters beyond individual item attributes.

---
## How to Run This With Your Own Data

### Step 1: Collect training data
Browse a product category page and record the items you clicked in `training_data.json` (skipped items go in `negative_samples.json`):

```json
{
  "source": "your-retailer.com",
  "category": "Jackets & Coats",
  "items": [
    {
      "id": 1,
      "name": "Item Name",
      "price_usd": 150,
      "hero_img_path": "images/item1.jpg",
      "visual_description": "optional text description"
    }
  ]
}
```
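Before spending API credits, it can help to sanity-check the file. The following is a minimal sketch using the field names from the example above. Treating `id`, `name`, `price_usd`, and `hero_img_path` as required is an assumption, and `validate_training_data` is a hypothetical helper rather than part of the notebook.

```python
import json

# Assumed-required fields; "visual_description" is marked optional in the example
REQUIRED_FIELDS = {"id", "name", "price_usd", "hero_img_path"}

def validate_training_data(data: dict) -> list:
    """Return human-readable problems; an empty list means the file looks usable."""
    problems = []
    items = data.get("items", [])
    if not items:
        problems.append("no items found")
    seen_ids = set()
    for i, item in enumerate(items):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems.append(f"item {i}: missing {sorted(missing)}")
        if item.get("id") in seen_ids:
            problems.append(f"item {i}: duplicate id {item.get('id')!r}")
        seen_ids.add(item.get("id"))
    return problems

# Deliberately broken sample: the second item reuses id 1 and has no image path
sample = json.loads("""
{
  "source": "your-retailer.com",
  "category": "Jackets & Coats",
  "items": [
    {"id": 1, "name": "Item Name", "price_usd": 150, "hero_img_path": "images/item1.jpg"},
    {"id": 1, "name": "Second Item", "price_usd": 90}
  ]
}
""")
print(validate_training_data(sample))
```

For a real file, replace the inline sample with `json.load(open(...))` on your own path.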

### Step 2: Set `RUN_API_CALLS = True` and run the cells
The notebook will synthesize both profiles and score your test catalog.

### Step 3: Do the blind evaluation
Open the generated HTML page, label each item without looking at the scores, and paste the resulting JSON back into the notebook.
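The analysis cells read four keys from the results file: `clicked_items`, `total_items`, `total_clicks`, and `overall_click_rate`. If you label items some other way, a file with those keys can be assembled by hand. This is a sketch only: the helper `build_results`, the item IDs, and the rounding of the click rate are assumptions, and the original file may carry additional fields.

```python
import json

def build_results(labels: dict) -> dict:
    """labels maps item ID -> True if the user would click, False if they would skip."""
    clicked = [{"id": item_id} for item_id, hit in labels.items() if hit]
    total = len(labels)
    return {
        "total_items": total,
        "total_clicks": len(clicked),
        "overall_click_rate": round(len(clicked) / total, 3) if total else 0.0,
        "clicked_items": clicked,
    }

# Hypothetical labels for four items (IDs are made up)
labels = {"FP-001": True, "FP-002": False, "BR-001": False, "BR-002": True}
results = build_results(labels)
print(json.dumps(results, indent=2))
```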

### Step 4: Analyze
The analysis cells compute precision, recall, lift, and tier calibration automatically.

### Tips for adaptation
- **Different categories**: This works for any product category where visual browsing is the primary signal (furniture, shoes, jewelry, etc.)
- **Fewer agents**: 3 agents is usually sufficient; 5 gives marginal improvement
- **Model choice**: Sonnet is the best cost/quality tradeoff for scoring; use Opus for profile synthesis if budget allows
- **Smaller catalogs**: Even 20-30 test items give meaningful signal