# SnackTrack ML: Hybrid Recommendation System Evaluation

This notebook provides a comprehensive evaluation of the SnackTrack **hybrid recommendation system**,
which blends five distinct recommendation strategies:

| Model | Approach | When it shines |
|-------|----------|----------------|
| **Content-Based** | pgvector cosine similarity on recipe embeddings | Users with clear taste profiles |
| **Collaborative** | User-user similarity from interaction patterns | Users with overlapping preferences |
| **Knowledge-Based** | Expert nutritional guidelines and dietary constraints | Cold-start, health-focused users |
| **VAE** | Variational Autoencoder latent space similarity | Sparse data, exploration |
| **RNN** | Sequential meal pattern prediction (GRU) | Users with temporal eating patterns |

### Evaluation focus

1. **Per-model performance**: How well does each individual model rank relevant items?
2. **Hybrid blending**: Does the weighted combination outperform any single model?
3. **Before vs After**: Impact of replacing random placeholder weights with trained weights.
4. **Weight sensitivity**: How robust is the hybrid to changes in blending weights?
5. **Cold-start analysis**: Performance across user maturity segments.

### Metrics

- **Precision@K**: Fraction of recommended items that are relevant
- **Recall@K**: Fraction of relevant items that are recommended
- **NDCG@K**: Normalized Discounted Cumulative Gain (rank-aware)
- **MRR**: Mean Reciprocal Rank of the first relevant item
- **Coverage**: Fraction of the item catalogue recommended at least once
- **Diversity**: Average pairwise cosine distance among recommendations

In [None]:
import sys
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

warnings.filterwarnings("ignore", category=FutureWarning)

# ---------------------------------------------------------------------------
# Path setup
# ---------------------------------------------------------------------------
sys.path.insert(0, "..")

from notebooks.utils.plot_helpers import (
    setup_plot_style, SNACKTRACK_COLORS, PALETTE,
)
from notebooks.utils.data_loader import (
    load_kaggle_dataset, extract_vae_features, _encode_time_features,
)
from notebooks.utils.weight_io import (
    load_vae_weights, load_rnn_weights,
    VAE_WEIGHT_SHAPES, RNN_WEIGHT_SHAPES,
)

setup_plot_style()
print(f"NumPy {np.__version__} | Pandas {pd.__version__}")

## 1. Load Trained Weights

We load the VAE and RNN weights produced by notebooks 04 and 05 respectively.
If the weight files are not found (e.g., the training notebooks haven't been run yet),
we fall back to random weights and print a prominent warning.
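
A weight file that loads without error can still be stale if the architecture changed after training. A minimal sketch of a shape check, assuming `VAE_WEIGHT_SHAPES` and `RNN_WEIGHT_SHAPES` map weight names to expected shape tuples; `check_weight_shapes` and the toy shapes are hypothetical and not part of `weight_io`:

```python
import numpy as np


def check_weight_shapes(weights, expected_shapes):
    """Return a list of problems; an empty list means every shape matches."""
    problems = []
    for key, shape in expected_shapes.items():
        if key not in weights:
            problems.append(f"missing key: {key}")
        elif tuple(weights[key].shape) != tuple(shape):
            problems.append(
                f"{key}: expected {tuple(shape)}, got {tuple(weights[key].shape)}"
            )
    return problems


# Toy shapes for illustration only; the real ones live in VAE_WEIGHT_SHAPES.
toy_expected = {"encoder_mu_w": (12, 4), "encoder_mu_b": (4,)}
toy_weights = {"encoder_mu_w": np.zeros((12, 4)), "encoder_mu_b": np.zeros(5)}
print(check_weight_shapes(toy_weights, toy_expected))
```

Running such a check right after loading would flag a mismatch before it turns into a broadcasting error deep inside a forward pass.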

In [None]:
# ---------------------------------------------------------------------------
# Load VAE weights
# ---------------------------------------------------------------------------
vae_weights = None
vae_trained = False

try:
    vae_weights = load_vae_weights()
    vae_trained = True
    print("VAE weights loaded successfully:")
    for key, arr in vae_weights.items():
        print(f"  {key:<20} {arr.shape}")
except FileNotFoundError:
    print("WARNING: VAE weights not found! Using random weights.")
    print("  Run notebook 04 (VAE training) first for meaningful results.")
    rng = np.random.default_rng(42)
    vae_weights = {}
    for key, shape in VAE_WEIGHT_SHAPES.items():
        if "b" in key or "mean" in key:
            vae_weights[key] = np.zeros(shape)
        elif "std" in key:
            vae_weights[key] = np.ones(shape)
        else:
            vae_weights[key] = rng.standard_normal(shape) * 0.1

print()

# ---------------------------------------------------------------------------
# Load RNN weights
# ---------------------------------------------------------------------------
rnn_weights = None
rnn_trained = False

try:
    rnn_weights = load_rnn_weights()
    rnn_trained = True
    print("RNN weights loaded successfully:")
    for key, arr in rnn_weights.items():
        print(f"  {key:<5} {arr.shape}")
except FileNotFoundError:
    print("WARNING: RNN weights not found! Using random weights.")
    print("  Run notebook 05 (RNN training) first for meaningful results.")
    rng = np.random.default_rng(42)
    rnn_weights = {}
    for key, shape in RNN_WEIGHT_SHAPES.items():
        if key.startswith("b"):
            rnn_weights[key] = np.zeros(shape)
        else:
            rnn_weights[key] = rng.standard_normal(shape) * 0.1

print()
print(f"VAE: {'TRAINED' if vae_trained else 'RANDOM'} weights")
print(f"RNN: {'TRAINED' if rnn_trained else 'RANDOM'} weights")

## 2. Prepare Evaluation Data

We load the Food.com interactions dataset and create a **temporal split**:
- **Training set**: First 80% of each user's interactions (chronologically).
- **Test set**: Last 20% of each user's interactions.

This simulates real-world conditions where we train on past behaviour
and evaluate on future interactions.

We also build user profiles from the training set: each profile is the rating-weighted
average of the embeddings of the recipes the user interacted with, scaled to unit length.
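
The split and the profile construction can be sketched on toy data as follows. This is an illustration under assumptions: `temporal_split` and `preference_vector` are hypothetical helper names, and the toy frame stands in for one user's rows of the interactions table.

```python
import numpy as np
import pandas as pd


def temporal_split(user_df, test_fraction=0.2):
    """Split one user's interactions chronologically into (train, test)."""
    ordered = user_df.sort_values("timestamp")
    split_idx = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:split_idx], ordered.iloc[split_idx:]


def preference_vector(embeddings, ratings):
    """Rating-weighted mean of recipe embeddings, scaled to unit length."""
    emb = np.asarray(embeddings, dtype=float)
    w = np.asarray(ratings, dtype=float).reshape(-1, 1)
    vec = (emb * w).sum(axis=0) / (w.sum() + 1e-8)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Five interactions for a single toy user, deliberately out of order.
toy = pd.DataFrame({
    "recipe_id": [10, 11, 12, 13, 14],
    "rating": [5, 1, 4, 3, 5],
    "timestamp": pd.to_datetime(
        ["2020-01-05", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-04"]
    ),
})
train, test = temporal_split(toy)
print(train["recipe_id"].tolist(), test["recipe_id"].tolist())

# Two orthogonal toy embeddings; the higher-rated one dominates the profile.
profile = preference_vector([[1.0, 0.0], [0.0, 1.0]], ratings=[3, 1])
print(profile)
```

Only the most recent interaction (recipe 10) ends up in the test set, so nothing from the user's future leaks into the profile.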

In [None]:
# ---------------------------------------------------------------------------
# Load interaction data
# ---------------------------------------------------------------------------
print("Loading datasets...")

try:
    interactions_df = load_kaggle_dataset("foodcom_interactions")
    print(f"  Food.com interactions: {len(interactions_df):,} rows")
except FileNotFoundError:
    print("  Food.com interactions not found. Generating synthetic data.")
    rng = np.random.default_rng(42)
    n_users, n_recipes = 300, 500
    n_interactions = 15000
    interactions_df = pd.DataFrame({
        "user_id": rng.integers(0, n_users, n_interactions),
        "recipe_id": rng.integers(0, n_recipes, n_interactions),
        "rating": rng.integers(1, 6, n_interactions),
        "date": pd.date_range("2020-01-01", periods=n_interactions, freq="30min"),
    })

# ---------------------------------------------------------------------------
# Identify columns
# ---------------------------------------------------------------------------
user_col = next((c for c in ["user_id", "author_id", "contributor_id"]
                 if c in interactions_df.columns), None)
recipe_col = next((c for c in ["recipe_id", "id"]
                   if c in interactions_df.columns), None)
rating_col = next((c for c in ["rating", "score", "interaction_value"]
                   if c in interactions_df.columns), None)
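date_col = next((c for c in ["date", "submitted", "created_at", "timestamp"]
                 if c in interactions_df.columns), None)

# Fail early with a clear message instead of a cryptic KeyError further down.
if user_col is None or recipe_col is None:
    raise KeyError("Interactions data must contain a user column and a recipe column.")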

if date_col:
    interactions_df["timestamp"] = pd.to_datetime(interactions_df[date_col], errors="coerce")
else:
    interactions_df["timestamp"] = pd.date_range(
        start="2020-01-01", periods=len(interactions_df), freq="h"
    )

interactions_df = interactions_df.dropna(subset=["timestamp"])

print(f"  User col: {user_col}, Recipe col: {recipe_col}, Rating col: {rating_col}")
print(f"  Total interactions: {len(interactions_df):,}")

# ---------------------------------------------------------------------------
# Build recipe embeddings (feature-based proxy)
# ---------------------------------------------------------------------------
all_recipe_ids = interactions_df[recipe_col].unique()

# Try to load recipe nutritional data
recipe_features = {}


def _first_number(row, *keys):
    """Return the first present, non-NaN value among `keys` as a float, else 0.0."""
    for key in keys:
        value = row.get(key)
        if value is not None and pd.notna(value):
            return float(value)
    return 0.0


try:
    recipes_df = load_kaggle_dataset("foodcom_reviews")
    for _, row in recipes_df.iterrows():
        # Test for None/NaN explicitly so that a valid recipe ID of 0 is kept.
        rid = next((row[c] for c in (recipe_col, "id", "recipe_id")
                    if c in row.index and pd.notna(row[c])), None)
        if rid is not None:
            emb = np.zeros(32, dtype=np.float64)
            emb[0] = _first_number(row, "calories") / 1000.0
            emb[1] = _first_number(row, "protein", "protein_g") / 100.0
            emb[2] = _first_number(row, "carbohydrate", "carbs_g") / 200.0
            emb[3] = _first_number(row, "total_fat", "fat_g") / 100.0
            emb[4] = _first_number(row, "sodium", "sodium_mg") / 2300.0
            emb[5] = _first_number(row, "fiber", "fiber_g") / 30.0
            emb[6] = _first_number(row, "sugar", "sugar_g") / 50.0
            recipe_features[rid] = emb
    print(f"  Loaded {len(recipe_features):,} recipe feature vectors")
except Exception as e:  # any load or parse failure falls back to random embeddings
    print(f"  Recipe features not available ({e}). Using random embeddings.")

# Fill missing with random
rng_emb = np.random.default_rng(123)
for rid in all_recipe_ids:
    if rid not in recipe_features:
        recipe_features[rid] = rng_emb.standard_normal(32) * 0.1

print(f"  Total recipe embeddings: {len(recipe_features):,}")

# ---------------------------------------------------------------------------
# Temporal train/test split (last 20% of each user's interactions)
# ---------------------------------------------------------------------------
TEST_FRACTION = 0.2
MIN_INTERACTIONS = 5  # need at least 5 interactions to evaluate

train_data = []  # list of (user_id, recipe_id, rating)
test_data = {}   # user_id -> set of relevant recipe_ids
user_profiles = {}  # user_id -> preference vector
user_interaction_counts = {}  # user_id -> count

for uid, group in interactions_df.sort_values("timestamp").groupby(user_col):
    if len(group) < MIN_INTERACTIONS:
        continue

    split_idx = int(len(group) * (1 - TEST_FRACTION))
    train_group = group.iloc[:split_idx]
    test_group = group.iloc[split_idx:]

    # Training data
    for _, row in train_group.iterrows():
        rating = float(row[rating_col]) if rating_col else 3.0
        train_data.append((uid, row[recipe_col], rating))

    # Test set: recipes the user interacted with in the test period
    # Only consider positive interactions (rating >= 3) as "relevant"
    if rating_col:
        relevant = set(test_group[test_group[rating_col] >= 3][recipe_col].values)
    else:
        relevant = set(test_group[recipe_col].values)

    if len(relevant) > 0:
        test_data[uid] = relevant

    # Build user preference vector from training interactions
    vectors = []
    weights = []
    for _, row in train_group.iterrows():
        rid = row[recipe_col]
        if rid in recipe_features:
            rating = float(row[rating_col]) if rating_col else 3.0
            vectors.append(recipe_features[rid])
            weights.append(rating)

    if vectors:
        vectors_arr = np.array(vectors)
        weights_arr = np.array(weights).reshape(-1, 1)
        pref_vec = (vectors_arr * weights_arr).sum(axis=0) / (weights_arr.sum() + 1e-8)
        norm = np.linalg.norm(pref_vec)
        if norm > 0:
            pref_vec = pref_vec / norm
        user_profiles[uid] = pref_vec

    user_interaction_counts[uid] = len(train_group)

print(f"\nEvaluation dataset:")
print(f"  Training interactions: {len(train_data):,}")
print(f"  Test users:           {len(test_data):,}")
print(f"  Users with profiles:  {len(user_profiles):,}")
print(f"  Unique recipes:       {len(all_recipe_ids):,}")
print(f"  Avg test items/user:  {np.mean([len(v) for v in test_data.values()]):.1f}")

## 3. Evaluation Metrics

Standard information retrieval metrics adapted for top-K recommendation:

- **Precision@K**: Of the top K items recommended, how many are relevant?
- **Recall@K**: Of all relevant items, how many appear in the top K?
- **NDCG@K**: How well are relevant items ranked? Earlier positions get more credit.
- **MRR**: How quickly does the first relevant item appear?
- **Coverage**: What fraction of the item catalogue gets recommended?
- **Diversity**: How different are the recommended items from each other?
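
As a worked example with toy IDs (not taken from the dataset), suppose a model returns `[7, 3, 9, 1, 4]` and the held-out relevant set is `{3, 4, 8}`. The compact calculation below is only an illustration of the ranking metrics for a single user:

```python
import math

recommended = [7, 3, 9, 1, 4]   # ranked list, best first (toy IDs)
relevant = {3, 4, 8}            # held-out positives (toy IDs)
k = 5

# True where the item at that rank is relevant: hits at ranks 2 and 5.
hits = [item in relevant for item in recommended[:k]]

precision = sum(hits) / k              # 2 of the 5 recommended items are relevant
recall = sum(hits) / len(relevant)     # 2 of the 3 relevant items were retrieved

# DCG discounts a hit at 1-based rank r by log2(r + 1).
dcg = sum(1.0 / math.log2(rank + 1)
          for rank, hit in enumerate(hits, start=1) if hit)
# The ideal ranking places min(|relevant|, k) hits at the top ranks.
idcg = sum(1.0 / math.log2(rank + 1)
           for rank in range(1, min(len(relevant), k) + 1))
ndcg = dcg / idcg

# Reciprocal rank of the first hit; 0 if the list contains no hit.
first_hit = next((rank for rank, hit in enumerate(hits, start=1) if hit), None)
rr = 1.0 / first_hit if first_hit else 0.0

print(f"P@{k}={precision:.3f} R@{k}={recall:.3f} NDCG@{k}={ndcg:.3f} RR={rr:.3f}")
```

Here Precision@5 is 0.4, Recall@5 is 2/3, the reciprocal rank is 0.5, and NDCG@5 is roughly 0.478. NDCG is lower than recall because neither hit sits at rank 1.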

In [None]:
def precision_at_k(recommended, relevant, k):
    """Precision@K: fraction of top-K recommendations that are relevant.

    Args:
        recommended: ordered list of recommended item IDs
        relevant: set of relevant (ground-truth) item IDs
        k: cutoff

    Returns:
        float in [0, 1]
    """
    if k == 0 or not recommended:
        return 0.0
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Recall@K: fraction of relevant items that appear in top-K.

    Args:
        recommended: ordered list of recommended item IDs
        relevant: set of relevant (ground-truth) item IDs
        k: cutoff

    Returns:
        float in [0, 1]
    """
    if not relevant or not recommended:
        return 0.0
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Normalized Discounted Cumulative Gain at K.

    Measures ranking quality: relevant items ranked higher contribute more.

    Args:
        recommended: ordered list of recommended item IDs
        relevant: set of relevant (ground-truth) item IDs
        k: cutoff

    Returns:
        float in [0, 1]
    """
    if not relevant or not recommended:
        return 0.0

    top_k = recommended[:k]

    # DCG: each hit at 1-based rank r contributes 1 / log2(r + 1)
    dcg = 0.0
    for i, item in enumerate(top_k):
        if item in relevant:
            dcg += 1.0 / np.log2(i + 2)  # i is 0-based, so r + 1 == i + 2

    # Ideal DCG: all relevant items ranked first
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))

    return dcg / idcg if idcg > 0 else 0.0


def mean_reciprocal_rank(recommended, relevant):
    """Reciprocal rank for one user: 1 / rank of the first relevant item.

    Averaging this value over all users gives the Mean Reciprocal Rank (MRR).

    Args:
        recommended: ordered list of recommended item IDs
        relevant: set of relevant (ground-truth) item IDs

    Returns:
        float in [0, 1]
    """
    if not relevant or not recommended:
        return 0.0
    for i, item in enumerate(recommended):
        if item in relevant:
            return 1.0 / (i + 1)
    return 0.0


def coverage(all_recommendations, total_items):
    """Catalogue coverage: fraction of items recommended at least once.

    Args:
        all_recommendations: list of recommendation lists (one per user)
        total_items: total number of items in the catalogue

    Returns:
        float in [0, 1]
    """
    if total_items == 0:
        return 0.0
    recommended_items = set()
    for recs in all_recommendations:
        recommended_items.update(recs)
    return len(recommended_items) / total_items


def diversity(recommendations, embeddings):
    """Intra-list diversity: average pairwise cosine distance among recommendations.

    Higher is more diverse.

    Args:
        recommendations: list of recommended item IDs
        embeddings: dict mapping item_id -> embedding vector

    Returns:
        float in [0, 2] (cosine distance range)
    """
    if len(recommendations) < 2:
        return 0.0

    vecs = []
    for rid in recommendations:
        if rid in embeddings:
            vecs.append(embeddings[rid])

    if len(vecs) < 2:
        return 0.0

    vecs = np.array(vecs)
    norms = np.linalg.norm(vecs, axis=1)
    total_dist = 0.0
    count = 0
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            # Cosine distance is undefined for zero vectors, so skip such pairs
            # instead of counting them as distance 0.
            if norms[i] > 0 and norms[j] > 0:
                cos_sim = np.dot(vecs[i], vecs[j]) / (norms[i] * norms[j])
                total_dist += 1.0 - cos_sim  # cosine distance
                count += 1

    return total_dist / count if count > 0 else 0.0


print("Metric functions defined:")
print("  precision_at_k, recall_at_k, ndcg_at_k,")
print("  mean_reciprocal_rank, coverage, diversity")

## 4. Per-Model Evaluation

We evaluate each of the five recommendation models independently on the test set.
Since we are running offline (without a database), we implement lightweight
NumPy-based versions of each model that approximate the production scoring logic.
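
The per-recipe Python loops used by the offline models are easy to read but slow on large catalogues. A possible vectorised variant of the cosine-similarity ranking, assuming the embeddings are stacked into one matrix whose rows align with `item_ids`; `top_n_by_cosine` is a hypothetical helper, not production code:

```python
import numpy as np


def top_n_by_cosine(query, item_ids, item_matrix, top_n=10, exclude=None):
    """Rank items by cosine similarity to `query`; zero-norm items score 0."""
    exclude = set(exclude or [])
    query = np.asarray(query, dtype=float)
    item_matrix = np.asarray(item_matrix, dtype=float)

    q_norm = np.linalg.norm(query)
    if q_norm == 0:
        return []

    norms = np.linalg.norm(item_matrix, axis=1)
    sims = np.zeros(len(item_ids))
    valid = norms > 0
    sims[valid] = (item_matrix[valid] @ query) / (norms[valid] * q_norm)

    # Descending order; a stable sort keeps input order for tied scores.
    order = np.argsort(-sims, kind="stable")
    ranked = [item_ids[i] for i in order if item_ids[i] not in exclude]
    return ranked[:top_n]


item_ids = ["a", "b", "c", "d"]
item_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
print(top_n_by_cosine([1.0, 0.2], item_ids, item_matrix, top_n=2, exclude={"a"}))
```

With item `a` excluded, the call returns `['c', 'b']`: `c` points closest to the query, and the zero vector `d` scores 0.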

In [None]:
# ---------------------------------------------------------------------------
# Model implementations (offline NumPy versions)
# ---------------------------------------------------------------------------

def content_based_recommend(user_id, user_profiles, recipe_features, top_n=10, exclude=None):
    """Content-based: cosine similarity between user profile and recipe embeddings."""
    if user_id not in user_profiles:
        return []
    pref = user_profiles[user_id]
    pref_norm = np.linalg.norm(pref)
    if pref_norm == 0:
        return []

    exclude = set(exclude or [])
    scored = []
    for rid, emb in recipe_features.items():
        if rid in exclude:
            continue
        emb_norm = np.linalg.norm(emb)
        if emb_norm > 0:
            sim = float(np.dot(pref, emb) / (pref_norm * emb_norm))
        else:
            sim = 0.0
        scored.append((rid, sim))

    scored.sort(key=lambda x: x[1], reverse=True)
    return [rid for rid, _ in scored[:top_n]]


def collaborative_recommend(user_id, train_data, user_interaction_counts,
                            recipe_features, top_n=10, exclude=None):
    """Collaborative filtering: user-user similarity from interaction overlap.

    `user_interaction_counts` and `recipe_features` are accepted but currently unused.
    """
    exclude = set(exclude or [])

    # Build user-item matrix (sparse representation)
    user_items = {}  # user_id -> {recipe_id: rating}
    for uid, rid, rating in train_data:
        if uid not in user_items:
            user_items[uid] = {}
        user_items[uid][rid] = rating

    if user_id not in user_items:
        return []

    target_items = user_items[user_id]

    # Find users with overlapping interactions
    similarities = []
    for other_uid, other_items in user_items.items():
        if other_uid == user_id:
            continue
        overlap = set(target_items.keys()) & set(other_items.keys())
        if len(overlap) < 2:
            continue

        # Cosine similarity on overlapping ratings
        v1 = np.array([target_items[r] for r in overlap])
        v2 = np.array([other_items[r] for r in overlap])
        n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
        if n1 > 0 and n2 > 0:
            sim = float(np.dot(v1, v2) / (n1 * n2))
            similarities.append((other_uid, sim))

    if not similarities:
        return []

    similarities.sort(key=lambda x: x[1], reverse=True)
    top_neighbors = similarities[:20]

    # Aggregate neighbor ratings
    candidate_scores = {}
    for neighbor_uid, sim in top_neighbors:
        if sim <= 0:
            continue
        for rid, rating in user_items[neighbor_uid].items():
            if rid in target_items or rid in exclude:
                continue
            candidate_scores[rid] = candidate_scores.get(rid, 0.0) + sim * rating

    scored = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
    return [rid for rid, _ in scored[:top_n]]


def knowledge_based_recommend(user_id, recipe_features, top_n=10, exclude=None):
    """Knowledge-based: score by nutritional balance heuristics."""
    exclude = set(exclude or [])

    scored = []
    for rid, emb in recipe_features.items():
        if rid in exclude:
            continue
        # Nutritional balance score: penalise extreme values
        # emb[0] = cal/1000, emb[1] = protein/100, emb[2] = carbs/200, emb[3] = fat/100
        cal = emb[0] * 1000
        protein = emb[1] * 100
        carbs = emb[2] * 200
        fat = emb[3] * 100

        # Target: 500 cal meal, 25g protein, 60g carbs, 20g fat
        cal_score = max(0, 1.0 - abs(cal - 500) / 500 * 0.5)
        protein_score = max(0, 1.0 - abs(protein - 25) / 25 * 0.5)
        carb_score = max(0, 1.0 - abs(carbs - 60) / 60 * 0.5)
        fat_score = max(0, 1.0 - abs(fat - 20) / 20 * 0.5)

        score = 0.3 * cal_score + 0.25 * protein_score + 0.25 * carb_score + 0.2 * fat_score
        scored.append((rid, score))

    scored.sort(key=lambda x: x[1], reverse=True)
    return [rid for rid, _ in scored[:top_n]]


def vae_recommend(user_id, user_profiles, recipe_features, vae_w, top_n=10, exclude=None):
    """VAE-based: encode user profile to latent space, find nearest recipes."""
    exclude = set(exclude or [])

    if user_id not in user_profiles:
        return []

    pref = user_profiles[user_id]

    # Extract 12D features from the 32D preference vector (first 12 dims)
    # In production, we use extract_features(); here we approximate
    feat_12d = np.zeros(12)
    feat_12d[:min(12, len(pref))] = pref[:12]

    # Encode to latent space
    normalized = (feat_12d - vae_w["feature_means"]) / (vae_w["feature_stds"] + 1e-8)
    user_latent = normalized @ vae_w["encoder_mu_w"] + vae_w["encoder_mu_b"]
    user_latent_norm = np.linalg.norm(user_latent)

    scored = []
    for rid, emb in recipe_features.items():
        if rid in exclude:
            continue
        feat = np.zeros(12)
        feat[:min(12, len(emb))] = emb[:12]
        norm_feat = (feat - vae_w["feature_means"]) / (vae_w["feature_stds"] + 1e-8)
        recipe_latent = norm_feat @ vae_w["encoder_mu_w"] + vae_w["encoder_mu_b"]
        r_norm = np.linalg.norm(recipe_latent)

        if user_latent_norm > 0 and r_norm > 0:
            sim = float(np.dot(user_latent, recipe_latent) / (user_latent_norm * r_norm))
        else:
            sim = 0.0
        scored.append((rid, sim))

    scored.sort(key=lambda x: x[1], reverse=True)
    return [rid for rid, _ in scored[:top_n]]


def rnn_recommend(user_id, train_data, recipe_features, rnn_w, top_n=10, exclude=None):
    """RNN-based: process user's meal history through GRU and predict next meal."""
    exclude = set(exclude or [])

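    # Get user's interactions; train_data was appended per user in timestamp
    # order, so this list is already chronological
    user_meals = [(rid, rating) for uid, rid, rating in train_data if uid == user_id]
    if len(user_meals) < 3:
        return []

    # Build sequence (use last 20 meals)
    recent_meals = user_meals[-20:]
    sequence = []
    # train_data carries no timestamps, so meal times are synthetic:
    # one meal every 6 hours, starting from a fixed reference time.
    base_time = pd.Timestamp("2023-06-15 12:00:00")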

    for i, (rid, rating) in enumerate(recent_meals):
        emb = recipe_features.get(rid, np.zeros(32))
        ts = base_time + pd.Timedelta(hours=i * 6)
        hour = ts.hour
        if 5 <= hour < 11:
            mt = "breakfast"
        elif 11 <= hour < 15:
            mt = "lunch"
        elif 15 <= hour < 21:
            mt = "dinner"
        else:
            mt = "snack"
        time_feat = _encode_time_features(ts, mt)
        x = np.concatenate([emb, time_feat])
        sequence.append(x)

    # GRU forward pass
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -20, 20)))

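    # Hidden size is inferred from the recurrent weight matrix rather than hard-coded
    h = np.zeros(rnn_w["Uz"].shape[0], dtype=np.float64)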
    for x in sequence:
        x = x.astype(np.float64)
        z = sigmoid(x @ rnn_w["Wz"] + h @ rnn_w["Uz"] + rnn_w["bz"])
        r = sigmoid(x @ rnn_w["Wr"] + h @ rnn_w["Ur"] + rnn_w["br"])
        h_cand = np.tanh(x @ rnn_w["Wh"] + (r * h) @ rnn_w["Uh"] + rnn_w["bh"])
        h = (1 - z) * h + z * h_cand

    predicted_emb = h @ rnn_w["Wo"] + rnn_w["bo"]
    pred_norm = np.linalg.norm(predicted_emb)

    # Score recipes by cosine similarity to prediction
    scored = []
    user_recipe_ids = {rid for rid, _ in user_meals}
    for rid, emb in recipe_features.items():
        if rid in exclude or rid in user_recipe_ids:
            continue
        emb_norm = np.linalg.norm(emb)
        if pred_norm > 0 and emb_norm > 0:
            sim = float(np.dot(predicted_emb, emb) / (pred_norm * emb_norm))
        else:
            sim = 0.0
        scored.append((rid, sim))

    scored.sort(key=lambda x: x[1], reverse=True)
    return [rid for rid, _ in scored[:top_n]]


# ---------------------------------------------------------------------------
# Evaluate each model
# ---------------------------------------------------------------------------
TOP_N = 10
test_users = list(test_data.keys())
# Subsample if too many users (for speed)
MAX_EVAL_USERS = 500
if len(test_users) > MAX_EVAL_USERS:
    rng_eval = np.random.default_rng(42)
    test_users = list(rng_eval.choice(test_users, MAX_EVAL_USERS, replace=False))

print(f"Evaluating {len(test_users)} test users, top-{TOP_N} recommendations per model...")

# Pre-build each user's set of training items, used to exclude already-seen recipes
user_train_items = {}
for uid, rid, rating in train_data:
    if uid not in user_train_items:
        user_train_items[uid] = set()
    user_train_items[uid].add(rid)

model_names = ["content", "collaborative", "knowledge", "vae", "rnn"]
model_results = {name: {"P@5": [], "P@10": [], "R@10": [], "NDCG@10": [],
                        "MRR": [], "recs": []} for name in model_names}

for uid in tqdm(test_users, desc="Evaluating models"):
    relevant = test_data[uid]
    exclude_items = user_train_items.get(uid, set())

    # One zero-argument callable per model; each returns that model's top-N list.
    model_calls = {
        "content": lambda: content_based_recommend(
            uid, user_profiles, recipe_features,
            top_n=TOP_N, exclude=exclude_items),
        "collaborative": lambda: collaborative_recommend(
            uid, train_data, user_interaction_counts, recipe_features,
            top_n=TOP_N, exclude=exclude_items),
        "knowledge": lambda: knowledge_based_recommend(
            uid, recipe_features, top_n=TOP_N, exclude=exclude_items),
        "vae": lambda: vae_recommend(
            uid, user_profiles, recipe_features, vae_weights,
            top_n=TOP_N, exclude=exclude_items),
        "rnn": lambda: rnn_recommend(
            uid, train_data, recipe_features, rnn_weights,
            top_n=TOP_N, exclude=exclude_items),
    }

    # Score every model's list with the same set of metrics.
    for name in model_names:
        recs = model_calls[name]()
        res = model_results[name]
        res["P@5"].append(precision_at_k(recs, relevant, 5))
        res["P@10"].append(precision_at_k(recs, relevant, 10))
        res["R@10"].append(recall_at_k(recs, relevant, 10))
        res["NDCG@10"].append(ndcg_at_k(recs, relevant, 10))
        res["MRR"].append(mean_reciprocal_rank(recs, relevant))
        res["recs"].append(recs)

# ---------------------------------------------------------------------------
# Results table
# ---------------------------------------------------------------------------
metrics = ["P@5", "P@10", "R@10", "NDCG@10", "MRR"]
results_table = []
for name in model_names:
    row = {"Model": name}
    for m in metrics:
        row[m] = np.mean(model_results[name][m])
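    # Coverage: the models rank over every recipe in recipe_features,
    # so that dict (not just the interacted recipes) is the catalogue
    row["Coverage"] = coverage(model_results[name]["recs"], len(recipe_features))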
    results_table.append(row)

results_df = pd.DataFrame(results_table)
results_df = results_df.set_index("Model")

print("\nPer-Model Results")
print("=" * 75)
print(results_df.to_string(float_format="{:.4f}".format))
print()

# Bar chart
fig, ax = plt.subplots(figsize=(12, 5))
x = np.arange(len(metrics))
width = 0.15
for i, name in enumerate(model_names):
    values = [results_df.loc[name, m] for m in metrics]
    ax.bar(x + i * width, values, width, label=name, color=PALETTE[i], alpha=0.85)

ax.set_xticks(x + width * 2)
ax.set_xticklabels(metrics)
ax.set_ylabel("Score")
ax.set_title("Per-Model Recommendation Quality")
ax.legend()
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()

## 5. Hybrid Blending Evaluation

We replicate the production hybrid blending logic from `app/recommender/hybrid.py`:

1. Each model generates candidate recommendations with raw scores.
2. Per-model scores are **normalized to [0, 1]** (min-max within each model).
3. Scores are **blended** using maturity-stage-dependent weights:

| Stage | Knowledge | Content | Collaborative | VAE | RNN |
|-------|-----------|---------|---------------|-----|-----|
| Cold-start (<5) | 0.50 | 0.30 | 0.00 | 0.10 | 0.10 |
| Early (5-19) | 0.25 | 0.30 | 0.15 | 0.15 | 0.15 |
| Mature (20+) | 0.15 | 0.20 | 0.25 | 0.20 | 0.20 |
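
Steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not a copy of `app/recommender/hybrid.py`: `min_max_normalize` and `blend_scores` are hypothetical names, a model whose scores are all identical is mapped to 1.0, and a recipe missing from a model's candidate list contributes 0 for that model.

```python
def min_max_normalize(scored):
    """Rescale [(item_id, score), ...] to [0, 1] within one model."""
    if not scored:
        return {}
    values = [s for _, s in scored]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: identical scores carry no ranking signal; map all to 1.0.
        return {item: 1.0 for item, _ in scored}
    return {item: (s - lo) / (hi - lo) for item, s in scored}


def blend_scores(model_scores, weights, top_n=10):
    """Weighted sum of normalized per-model scores, best first."""
    blended = {}
    for model, scored in model_scores.items():
        w = weights.get(model, 0.0)
        if w == 0.0:
            continue  # a zero-weight model cannot change the ranking
        for item, s in min_max_normalize(scored).items():
            blended[item] = blended.get(item, 0.0) + w * s
    ranked = sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# "Early" stage weights from the table above; the raw scores are toy values.
early_weights = {"knowledge": 0.25, "content": 0.30, "collaborative": 0.15,
                 "vae": 0.15, "rnn": 0.15}
toy_scores = {
    "content": [("r1", 0.9), ("r2", 0.1), ("r3", 0.5)],
    "knowledge": [("r1", 0.2), ("r2", 0.8), ("r3", 0.5)],
    "collaborative": [("r3", 12.0), ("r2", 4.0)],
}
blended = blend_scores(toy_scores, early_weights)
for item, score in blended:
    print(item, round(score, 3))
```

In this toy case `r3` ranks first (0.425) even though no single model except collaborative puts it on top. Normalizing first matters because the collaborative scores (sums of similarity times rating) live on a much larger scale than the cosine similarities.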

In [None]:
# ---------------------------------------------------------------------------
# Hybrid blending weights (from app/recommender/hybrid.py)
# ---------------------------------------------------------------------------
MODEL_WEIGHTS = {
    "cold_start": {
        "knowledge": 0.50,
        "content": 0.30,
        "collaborative": 0.00,
        "vae": 0.10,
        "rnn": 0.10,
    },
    "early": {
        "knowledge": 0.25,
        "content": 0.30,
        "collaborative": 0.15,
        "vae": 0.15,
        "rnn": 0.15,
    },
    "mature": {
        "knowledge": 0.15,
        "content": 0.20,
        "collaborative": 0.25,
        "vae": 0.20,
        "rnn": 0.20,
    },
}

COLD_START_THRESHOLD = 5


def get_maturity_stage(interaction_count):
    if interaction_count < COLD_START_THRESHOLD:
        return "cold_start"
    elif interaction_count < 20:
        return "early"
    return "mature"


def hybrid_recommend(user_id, user_profiles, train_data, recipe_features,
                     vae_w, rnn_w, user_interaction_counts, top_n=10, exclude=None):
    """Hybrid recommendation matching the production blending logic."""
    exclude = set(exclude or [])
    n_interactions = user_interaction_counts.get(user_id, 0)
    maturity = get_maturity_stage(n_interactions)
    weights = MODEL_WEIGHTS[maturity]

    # Gather per-model scored candidates
    all_model_scores = {}  # model_name -> [(recipe_id, score), ...]

    # Content-based
    if user_id in user_profiles:
        pref = user_profiles[user_id]
        pref_norm = np.linalg.norm(pref)
        scored = []
        for rid, emb in recipe_features.items():
            if rid in exclude:
                continue
            emb_norm = np.linalg.norm(emb)
            sim = float(np.dot(pref, emb) / (pref_norm * emb_norm + 1e-8))
            scored.append((rid, sim))
        all_model_scores["content"] = scored
    else:
        all_model_scores["content"] = []

    # Knowledge-based
    scored = []
    for rid, emb in recipe_features.items():
        if rid in exclude:
            continue
        cal = emb[0] * 1000
        protein = emb[1] * 100
        carbs = emb[2] * 200
        fat = emb[3] * 100
        cal_s = max(0, 1.0 - abs(cal - 500) / 500 * 0.5)
        prot_s = max(0, 1.0 - abs(protein - 25) / 25 * 0.5)
        carb_s = max(0, 1.0 - abs(carbs - 60) / 60 * 0.5)
        fat_s = max(0, 1.0 - abs(fat - 20) / 20 * 0.5)
        score = 0.3 * cal_s + 0.25 * prot_s + 0.25 * carb_s + 0.2 * fat_s
        scored.append((rid, score))
    all_model_scores["knowledge"] = scored

    # Collaborative
    if maturity != "cold_start":
        user_items = {}
        for uid, rid, rating in train_data:
            if uid not in user_items:
                user_items[uid] = {}
            user_items[uid][rid] = rating

        if user_id in user_items:
            target_items = user_items[user_id]
            candidate_scores = {}

            for other_uid, other_items in user_items.items():
                if other_uid == user_id:
                    continue
                overlap = set(target_items.keys()) & set(other_items.keys())
                if len(overlap) < 2:
                    continue
                v1 = np.array([target_items[r] for r in overlap])
                v2 = np.array([other_items[r] for r in overlap])
                n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
                if n1 > 0 and n2 > 0:
                    sim = float(np.dot(v1, v2) / (n1 * n2))
                    if sim > 0:
                        for rid, rating in other_items.items():
                            if rid not in target_items and rid not in exclude:
                                candidate_scores[rid] = candidate_scores.get(rid, 0) + sim * rating

            all_model_scores["collaborative"] = list(candidate_scores.items())
        else:
            all_model_scores["collaborative"] = []
    else:
        all_model_scores["collaborative"] = []

    # VAE
    if user_id in user_profiles:
        pref = user_profiles[user_id]
        feat = np.zeros(12)
        feat[:min(12, len(pref))] = pref[:12]
        norm_feat = (feat - vae_w["feature_means"]) / (vae_w["feature_stds"] + 1e-8)
        user_latent = norm_feat @ vae_w["encoder_mu_w"] + vae_w["encoder_mu_b"]
        ul_norm = np.linalg.norm(user_latent)

        scored = []
        for rid, emb in recipe_features.items():
            if rid in exclude:
                continue
            r_feat = np.zeros(12)
            r_feat[:min(12, len(emb))] = emb[:12]
            r_norm_feat = (r_feat - vae_w["feature_means"]) / (vae_w["feature_stds"] + 1e-8)
            r_latent = r_norm_feat @ vae_w["encoder_mu_w"] + vae_w["encoder_mu_b"]
            r_norm = np.linalg.norm(r_latent)
            if ul_norm > 0 and r_norm > 0:
                sim = float(np.dot(user_latent, r_latent) / (ul_norm * r_norm))
            else:
                sim = 0.0
            scored.append((rid, sim))
        all_model_scores["vae"] = scored
    else:
        all_model_scores["vae"] = []

    # RNN
    user_meals = [(rid, rating) for uid, rid, rating in train_data if uid == user_id]
    if len(user_meals) >= 3:
        recent = user_meals[-20:]
        h = np.zeros(64, dtype=np.float64)
        # The (uid, rid, rating) tuples carry no timestamps, so meals are placed
        # 6 hours apart starting from a fixed synthetic time.
        base_time = pd.Timestamp("2023-06-15 12:00:00")

        def sig(v):
            # Clip the input so np.exp cannot overflow
            return 1.0 / (1.0 + np.exp(-np.clip(v, -20, 20)))

        for i, (rid, _) in enumerate(recent):
            emb = recipe_features.get(rid, np.zeros(32))
            ts = base_time + pd.Timedelta(hours=i * 6)
            hour = ts.hour
            if 5 <= hour < 11:
                mt = "breakfast"
            elif 11 <= hour < 15:
                mt = "lunch"
            elif 15 <= hour < 21:
                mt = "dinner"
            else:
                mt = "snack"
            tf = _encode_time_features(ts, mt)
            x = np.concatenate([emb, tf]).astype(np.float64)

            # GRU cell: update gate z, reset gate r, candidate state hc
            z = sig(x @ rnn_w["Wz"] + h @ rnn_w["Uz"] + rnn_w["bz"])
            r = sig(x @ rnn_w["Wr"] + h @ rnn_w["Ur"] + rnn_w["br"])
            hc = np.tanh(x @ rnn_w["Wh"] + (r * h) @ rnn_w["Uh"] + rnn_w["bh"])
            h = (1 - z) * h + z * hc

        pred = h @ rnn_w["Wo"] + rnn_w["bo"]
        pred_norm = np.linalg.norm(pred)

        user_rids = {rid for rid, _ in user_meals}
        scored = []
        for rid, emb in recipe_features.items():
            if rid in exclude or rid in user_rids:
                continue
            en = np.linalg.norm(emb)
            if pred_norm > 0 and en > 0:
                sim = float(np.dot(pred, emb) / (pred_norm * en))
            else:
                sim = 0.0
            scored.append((rid, sim))
        all_model_scores["rnn"] = scored
    else:
        all_model_scores["rnn"] = []

    # --- Normalize per-model scores to [0, 1] ---
    for model_name, scored in all_model_scores.items():
        if not scored:
            continue
        scores_only = [s for _, s in scored]
        min_s = min(scores_only)
        max_s = max(scores_only)
        score_range = max_s - min_s
        if score_range > 0:
            all_model_scores[model_name] = [(rid, (s - min_s) / score_range) for rid, s in scored]
        else:
            all_model_scores[model_name] = [(rid, 1.0) for rid, s in scored]

    # --- Blend with weights ---
    combined = {}  # recipe_id -> blended score
    for model_name, scored in all_model_scores.items():
        w = weights.get(model_name, 0.0)
        if w <= 0:
            continue
        for rid, score in scored:
            combined[rid] = combined.get(rid, 0.0) + w * score

    sorted_recs = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return [rid for rid, _ in sorted_recs[:top_n]]


# ---------------------------------------------------------------------------
# Evaluate hybrid
# ---------------------------------------------------------------------------
hybrid_metrics = {"P@5": [], "P@10": [], "R@10": [], "NDCG@10": [], "MRR": [], "recs": []}

for uid in tqdm(test_users, desc="Evaluating hybrid"):
    relevant = test_data[uid]
    exclude_items = user_train_items.get(uid, set())

    recs = hybrid_recommend(
        uid, user_profiles, train_data, recipe_features,
        vae_weights, rnn_weights, user_interaction_counts,
        top_n=TOP_N, exclude=exclude_items
    )

    hybrid_metrics["P@5"].append(precision_at_k(recs, relevant, 5))
    hybrid_metrics["P@10"].append(precision_at_k(recs, relevant, 10))
    hybrid_metrics["R@10"].append(recall_at_k(recs, relevant, 10))
    hybrid_metrics["NDCG@10"].append(ndcg_at_k(recs, relevant, 10))
    hybrid_metrics["MRR"].append(mean_reciprocal_rank(recs, relevant))
    hybrid_metrics["recs"].append(recs)

# Add hybrid to results
hybrid_row = {"Model": "HYBRID"}
for m in metrics:
    hybrid_row[m] = np.mean(hybrid_metrics[m])
hybrid_row["Coverage"] = coverage(hybrid_metrics["recs"], len(all_recipe_ids))

full_results_df = pd.concat([
    results_df,
    pd.DataFrame([hybrid_row]).set_index("Model")
])

print("\nAll Model Results (including Hybrid)")
print("=" * 80)
print(full_results_df.to_string(float_format="{:.4f}".format))

# Highlight best per metric
print("\nBest model per metric:")
for m in metrics + ["Coverage"]:
    best_model = full_results_df[m].idxmax()
    best_val = full_results_df[m].max()
    print(f"  {m:<10} {best_model:<15} ({best_val:.4f})")

## 6. Before vs After Comparison

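We compare the hybrid system's performance with the **trained** VAE and RNN weights against
the same system with **randomly initialized** VAE and RNN weights, to quantify the impact of
the training pipeline (notebooks 04 and 05). The blending weights in `MODEL_WEIGHTS` are
identical in both runs, so any difference comes from the two neural models alone.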
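
As a minimal sketch of the comparison arithmetic, assuming each metric is stored as a list of per-user scores: the helper name `relative_change` and the toy numbers below are illustrative and not part of the notebook. The percent change is reported as NaN when the baseline mean is zero, because the ratio is undefined there.

```python
import math

import numpy as np


def relative_change(baseline_scores, new_scores):
    """Return (absolute, percent) change between the two mean scores.

    The percent change is NaN when the baseline mean is zero.
    """
    base = float(np.mean(baseline_scores))
    new = float(np.mean(new_scores))
    diff = new - base
    pct = diff / base * 100 if base > 0 else math.nan
    return diff, pct


# Toy per-user NDCG@10 values, made up for illustration only
random_ndcg = [0.10, 0.20, 0.00, 0.10]
trained_ndcg = [0.20, 0.30, 0.10, 0.20]

diff, pct = relative_change(random_ndcg, trained_ndcg)
print(f"absolute: {diff:+.3f}, relative: {pct:+.1f}%")
```

Formatting with `:+` keeps the sign explicit, so a regression is displayed as a negative number instead of being hidden behind a hard-coded plus sign.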

In [None]:
# ---------------------------------------------------------------------------
# Generate random weights for comparison
# ---------------------------------------------------------------------------
rng_random = np.random.default_rng(42)

random_vae_w = {}
for key, shape in VAE_WEIGHT_SHAPES.items():
    if "b" in key or "mean" in key:
        random_vae_w[key] = np.zeros(shape)
    elif "std" in key:
        random_vae_w[key] = np.ones(shape)
    else:
        random_vae_w[key] = rng_random.standard_normal(shape) * 0.1

random_rnn_w = {}
for key, shape in RNN_WEIGHT_SHAPES.items():
    if key.startswith("b"):
        random_rnn_w[key] = np.zeros(shape)
    else:
        random_rnn_w[key] = rng_random.standard_normal(shape) * 0.1

# ---------------------------------------------------------------------------
# Evaluate hybrid with random weights
# ---------------------------------------------------------------------------
random_metrics = {"P@5": [], "P@10": [], "R@10": [], "NDCG@10": [], "MRR": []}

for uid in tqdm(test_users, desc="Evaluating (random weights)"):
    relevant = test_data[uid]
    exclude_items = user_train_items.get(uid, set())

    recs = hybrid_recommend(
        uid, user_profiles, train_data, recipe_features,
        random_vae_w, random_rnn_w, user_interaction_counts,
        top_n=TOP_N, exclude=exclude_items
    )

    random_metrics["P@5"].append(precision_at_k(recs, relevant, 5))
    random_metrics["P@10"].append(precision_at_k(recs, relevant, 10))
    random_metrics["R@10"].append(recall_at_k(recs, relevant, 10))
    random_metrics["NDCG@10"].append(ndcg_at_k(recs, relevant, 10))
    random_metrics["MRR"].append(mean_reciprocal_rank(recs, relevant))

# ---------------------------------------------------------------------------
# Comparison table
# ---------------------------------------------------------------------------
comparison = []
for m in metrics:
    random_val = np.mean(random_metrics[m])
    trained_val = np.mean(hybrid_metrics[m])
    improvement = trained_val - random_val
    # Relative change is undefined for a zero baseline, so report NaN rather than inf
    pct_change = (improvement / random_val * 100) if random_val > 0 else float("nan")
    comparison.append({
        "Metric": m,
        "Random Weights": random_val,
        "Trained Weights": trained_val,
        "Improvement": improvement,
        "% Change": pct_change,
    })

comparison_df = pd.DataFrame(comparison).set_index("Metric")
print("Before vs After Training")
print("=" * 75)
print(comparison_df.to_string(float_format="{:.4f}".format))

# ---------------------------------------------------------------------------
# Bar chart
# ---------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(12, 5))
x = np.arange(len(metrics))
width = 0.35

random_vals = [np.mean(random_metrics[m]) for m in metrics]
trained_vals = [np.mean(hybrid_metrics[m]) for m in metrics]

bars1 = ax.bar(x - width / 2, random_vals, width, label="Random Weights",
               color=SNACKTRACK_COLORS["neutral"], alpha=0.7)
bars2 = ax.bar(x + width / 2, trained_vals, width, label="Trained Weights",
               color=SNACKTRACK_COLORS["primary"], alpha=0.85)

# Add value labels
for bar in bars1:
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.002,
            f"{bar.get_height():.3f}", ha="center", va="bottom", fontsize=9)
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.002,
            f"{bar.get_height():.3f}", ha="center", va="bottom", fontsize=9)

ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_ylabel("Score")
ax.set_title("Hybrid Recommendation: Random vs Trained Weights")
ax.legend()
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()

## 7. Weight Sensitivity Analysis

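We explore how sensitive the hybrid's NDCG@10 is to changes in the model blending
weights. For each model, we shift its mature-stage weight by +0.1 and by -0.1
(clipping at 0), re-normalize all five weights to sum to 1, and measure the
resulting NDCG@10. Only users in the mature stage are affected; cold-start and
early users keep their default weights.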

This helps determine:
- Which models contribute most to overall quality
- Whether the current weight configuration is near-optimal
- How robust the system is to imprecise weight tuning
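
The perturbation step can be sketched in isolation. This is an illustrative helper (`perturb_weights` is a hypothetical name) that follows the notebook's approach of shifting one weight, clipping at zero, and rescaling all weights to sum to 1. The weight values are copied from the mature stage of `MODEL_WEIGHTS`.

```python
def perturb_weights(base_weights, model_name, delta):
    """Shift one model's weight by delta, clip at zero, and rescale to sum to 1."""
    perturbed = dict(base_weights)
    perturbed[model_name] = max(0.0, perturbed[model_name] + delta)
    total = sum(perturbed.values())
    if total <= 0:
        raise ValueError("all weights are zero after perturbation")
    return {name: w / total for name, w in perturbed.items()}


# Values copied from the "mature" stage of MODEL_WEIGHTS
mature = {
    "knowledge": 0.15,
    "content": 0.20,
    "collaborative": 0.25,
    "vae": 0.20,
    "rnn": 0.20,
}

for delta in (-0.1, 0.0, 0.1):
    new_w = perturb_weights(mature, "collaborative", delta)
    print(f"delta={delta:+.1f}  collaborative={new_w['collaborative']:.4f}  "
          f"sum={sum(new_w.values()):.4f}")
```

Because every weight is rescaled, a +0.1 shift on `collaborative` ends at 0.35 / 1.10 rather than 0.35. The effective weight, not the nominal delta, is what should be read off the results table.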

In [None]:
# ---------------------------------------------------------------------------
# Weight sensitivity analysis
# ---------------------------------------------------------------------------
# We test on the "mature" stage weights as it is the most balanced
base_weights = MODEL_WEIGHTS["mature"].copy()
perturbation = 0.1
sensitivity_model_names = ["knowledge", "content", "collaborative", "vae", "rnn"]

# Subsample for speed
MAX_SENSITIVITY_USERS = 200
sensitivity_users = test_users[:MAX_SENSITIVITY_USERS]


def evaluate_with_weights(custom_weights, users_subset):
    """Evaluate hybrid NDCG@10 with custom blending weights for the mature stage."""
    # hybrid_recommend reads its blending weights from the global MODEL_WEIGHTS,
    # so swap in the custom mature-stage weights for the duration of the
    # evaluation and restore the originals afterwards. Users in the cold-start
    # and early stages keep their default weights.
    original_mature = MODEL_WEIGHTS["mature"]
    MODEL_WEIGHTS["mature"] = dict(custom_weights)
    try:
        ndcg_scores = []
        for uid in users_subset:
            relevant = test_data[uid]
            exclude_items = user_train_items.get(uid, set())

            # Re-using pre-computed per-model scores would be faster,
            # but for correctness we re-run the full hybrid
            recs = hybrid_recommend(
                uid, user_profiles, train_data, recipe_features,
                vae_weights, rnn_weights, user_interaction_counts,
                top_n=TOP_N, exclude=exclude_items
            )
            ndcg_scores.append(ndcg_at_k(recs, relevant, 10))
    finally:
        MODEL_WEIGHTS["mature"] = original_mature

    return np.mean(ndcg_scores)


# Evaluate baseline
baseline_ndcg = evaluate_with_weights(base_weights, sensitivity_users)

# Perturb each model weight
sensitivity_results = []

for model_name in tqdm(sensitivity_model_names, desc="Sensitivity analysis"):
    for delta in [-perturbation, 0, perturbation]:
        perturbed = base_weights.copy()
        perturbed[model_name] = max(0, perturbed[model_name] + delta)

        # Re-normalize all weights (including the perturbed one) so they sum to 1
        total = sum(perturbed.values())
        if total > 0:
            perturbed = {k: v / total for k, v in perturbed.items()}

        ndcg = evaluate_with_weights(perturbed, sensitivity_users)
        sensitivity_results.append({
            "Model": model_name,
            "Delta": delta,
            "Weight": perturbed[model_name],
            "NDCG@10": ndcg,
        })

sens_df = pd.DataFrame(sensitivity_results)
print("\nWeight Sensitivity Analysis (NDCG@10)")
print("=" * 60)

pivot = sens_df.pivot(index="Model", columns="Delta", values="NDCG@10")
pivot.columns = [f"delta={d:+.1f}" for d in pivot.columns]
print(pivot.to_string(float_format="{:.4f}".format))
print(f"\nBaseline NDCG@10 (current weights): {baseline_ndcg:.4f}")

# ---------------------------------------------------------------------------
# Heatmap
# ---------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(10, 5))

heatmap_data = sens_df.pivot(index="Model", columns="Delta", values="NDCG@10")
heatmap_data.columns = [f"{d:+.1f}" for d in heatmap_data.columns]

sns.heatmap(heatmap_data, annot=True, fmt=".4f", cmap="YlGn",
            ax=ax, linewidths=1, linecolor="white",
            cbar_kws={"label": "NDCG@10"})
ax.set_title("Weight Sensitivity: NDCG@10 by Model Weight Perturbation")
ax.set_xlabel("Weight Delta")
ax.set_ylabel("Model")
plt.tight_layout()
plt.show()

## 8. Cold-Start vs Mature Users

We split test users into three segments based on their training-set interaction count:

| Segment | Interaction Count | Characteristics |
|---------|-------------------|----------------|
| Cold-start | < 5 | Very little data; knowledge-based dominates |
| Early | 5--19 | Some signal; balanced blend |
| Mature | 20+ | Rich history; collaborative and RNN shine |

This analysis reveals how well the adaptive blending weights serve each segment.
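
The segment boundaries in the table can be sketched as a small standalone function. `COLD_START_THRESHOLD` matches the notebook constant. `MATURE_THRESHOLD`, `segment_for`, and the toy interaction counts are hypothetical names and values introduced only for this illustration.

```python
from collections import Counter

COLD_START_THRESHOLD = 5  # same value as the notebook constant
MATURE_THRESHOLD = 20     # hypothetical name; the notebook hard-codes 20


def segment_for(interaction_count):
    """Map a training-set interaction count to a maturity segment."""
    if interaction_count < COLD_START_THRESHOLD:
        return "cold_start"
    if interaction_count < MATURE_THRESHOLD:
        return "early"
    return "mature"


# Made-up interaction counts, chosen to hit both boundaries
toy_counts = {"u1": 0, "u2": 4, "u3": 5, "u4": 19, "u5": 20, "u6": 87}

segment_sizes = Counter(segment_for(n) for n in toy_counts.values())
for name in ("cold_start", "early", "mature"):
    print(f"{name:<10} {segment_sizes[name]} users")
```

Keeping both thresholds in named constants means the segment table, the blending logic, and the evaluation split cannot drift apart if a boundary changes.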

In [None]:
# ---------------------------------------------------------------------------
# Segment users
# ---------------------------------------------------------------------------
segments = {
    "Cold-start (<5)": [],
    "Early (5-19)": [],
    "Mature (20+)": [],
}

# Reuse get_maturity_stage so the segments always match the blending logic
stage_to_segment = {
    "cold_start": "Cold-start (<5)",
    "early": "Early (5-19)",
    "mature": "Mature (20+)",
}

for uid in test_users:
    stage = get_maturity_stage(user_interaction_counts.get(uid, 0))
    segments[stage_to_segment[stage]].append(uid)

print("User Segments:")
for seg, users in segments.items():
    print(f"  {seg}: {len(users)} users")

# ---------------------------------------------------------------------------
# Evaluate per segment
# ---------------------------------------------------------------------------
segment_results = []

for seg_name, seg_users in segments.items():
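    if not seg_users:
        # No users in this segment: report NaN rather than a misleading score of 0
        segment_results.append({
            "Segment": seg_name, "Users": 0,
            "P@5": np.nan, "P@10": np.nan, "R@10": np.nan,
            "NDCG@10": np.nan, "MRR": np.nan,
        })
        continue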

    seg_metrics = {m: [] for m in metrics}

    for uid in tqdm(seg_users, desc=f"Eval {seg_name}", leave=False):
        relevant = test_data[uid]
        exclude_items = user_train_items.get(uid, set())

        recs = hybrid_recommend(
            uid, user_profiles, train_data, recipe_features,
            vae_weights, rnn_weights, user_interaction_counts,
            top_n=TOP_N, exclude=exclude_items
        )

        seg_metrics["P@5"].append(precision_at_k(recs, relevant, 5))
        seg_metrics["P@10"].append(precision_at_k(recs, relevant, 10))
        seg_metrics["R@10"].append(recall_at_k(recs, relevant, 10))
        seg_metrics["NDCG@10"].append(ndcg_at_k(recs, relevant, 10))
        seg_metrics["MRR"].append(mean_reciprocal_rank(recs, relevant))

    row = {"Segment": seg_name, "Users": len(seg_users)}
    for m in metrics:
        row[m] = np.mean(seg_metrics[m])
    segment_results.append(row)

seg_df = pd.DataFrame(segment_results).set_index("Segment")

print("\nHybrid Performance by User Segment")
print("=" * 75)
print(seg_df.to_string(float_format="{:.4f}".format))

# ---------------------------------------------------------------------------
# Bar chart
# ---------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(12, 5))
seg_names = [s for s in segments.keys() if len(segments[s]) > 0]
x = np.arange(len(metrics))
width = 0.25

for i, seg_name in enumerate(seg_names):
    values = [seg_df.loc[seg_name, m] for m in metrics]
    ax.bar(x + i * width, values, width, label=seg_name,
           color=PALETTE[i], alpha=0.85)

ax.set_xticks(x + width * (len(seg_names) - 1) / 2)
ax.set_xticklabels(metrics)
ax.set_ylabel("Score")
ax.set_title("Hybrid Recommendation Quality by User Segment")
ax.legend()
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()

## 9. Summary

Final consolidated view of all evaluation results.

In [None]:
# ---------------------------------------------------------------------------
# Final summary
# ---------------------------------------------------------------------------
print("=" * 80)
print("  SnackTrack Hybrid Recommendation System --- Evaluation Summary")
print("=" * 80)

print("\n1. DATASET")
print(f"   Total interactions:   {len(train_data):,} (train) + {sum(len(v) for v in test_data.values()):,} (test)")
print(f"   Test users:           {len(test_data):,}")
print(f"   Unique recipes:       {len(all_recipe_ids):,}")
print(f"   Evaluation users:     {len(test_users)}")

print("\n2. WEIGHT STATUS")
print(f"   VAE: {'TRAINED' if vae_trained else 'RANDOM (run notebook 04 to train)'}")
print(f"   RNN: {'TRAINED' if rnn_trained else 'RANDOM (run notebook 05 to train)'}")

print("\n3. PER-MODEL RESULTS")
print(full_results_df.to_string(float_format="{:.4f}".format))

print("\n4. BEST MODEL PER METRIC")
for m in metrics + ["Coverage"]:
    best_model = full_results_df[m].idxmax()
    best_val = full_results_df[m].max()
    print(f"   {m:<10}  {best_model:<15}  {best_val:.4f}")

print("\n5. BEFORE vs AFTER TRAINING")
print(comparison_df.to_string(float_format="{:.4f}".format))

print("\n6. PERFORMANCE BY USER SEGMENT")
print(seg_df.to_string(float_format="{:.4f}".format))

print("\n7. KEY TAKEAWAYS")
hybrid_ndcg = np.mean(hybrid_metrics["NDCG@10"])
best_single_ndcg = results_df["NDCG@10"].max()
best_single_model = results_df["NDCG@10"].idxmax()
random_ndcg = np.mean(random_metrics["NDCG@10"])

print(f"   - Hybrid NDCG@10:           {hybrid_ndcg:.4f}")
print(f"   - Best single model:        {best_single_model} ({best_single_ndcg:.4f})")
if hybrid_ndcg > best_single_ndcg:
    if best_single_ndcg > 0:
        lift = (hybrid_ndcg - best_single_ndcg) / best_single_ndcg * 100
        print(f"   - Hybrid vs best single:    +{lift:.1f}% improvement")
    else:
        print("   - Hybrid vs best single:    hybrid is stronger (best single model scores 0)")
else:
    print(f"   - Hybrid vs best single:    hybrid does not beat {best_single_model}")

if random_ndcg > 0:
    training_lift = (hybrid_ndcg - random_ndcg) / random_ndcg * 100
    # Signed format so a regression is not printed as "+-x%"
    print(f"   - Trained vs random:        {training_lift:+.1f}% change")

print(f"   - Coverage:                 {hybrid_row['Coverage']:.4f}")

print("\n" + "=" * 80)
print("  Evaluation complete.")
print("=" * 80)