# Forecast Accuracy Review

**Purpose**: Compare new model candidates against our current predictions.  
**Test period**: FY start (Nov 25, 2025) through current week.  
**Audience**: Leadership — all charts are presentation-ready.

---

### How to read this notebook

1. **Run the walkthrough notebook first** (`walkthrough_your_data.ipynb`) to generate model predictions  
2. This notebook **loads those results** and builds the visuals  
3. Every chart is exported to interactive HTML; for slides, a PNG can be produced with `fig.write_image(...)`, which requires the `kaleido` package  
4. Sections are ordered for a top-down executive presentation
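
Before loading results, it can help to confirm that each prediction frame has the columns the loading cell expects (`unique_id`, `ds`, `yhat`). The sketch below is one possible pre-flight check; `check_predictions` is a hypothetical helper, not part of `ds_timeseries`, and the demo data is made up.

```python
import pandas as pd

REQUIRED_COLS = ["unique_id", "ds", "yhat"]  # schema stated in the loading cell


def check_predictions(name: str, preds: pd.DataFrame) -> list[str]:
    """Return a list of problems found (empty list means OK). Hypothetical helper."""
    problems = []
    missing = [c for c in REQUIRED_COLS if c not in preds.columns]
    if missing:
        problems.append(f"{name}: missing columns {missing}")
        return problems
    # Duplicate keys would multiply rows in the merge against actuals
    dupes = int(preds.duplicated(subset=["unique_id", "ds"]).sum())
    if dupes:
        problems.append(f"{name}: {dupes} duplicate (unique_id, ds) rows")
    n_null = int(preds["yhat"].isna().sum())
    if n_null:
        problems.append(f"{name}: {n_null} null yhat values")
    return problems


# Made-up demo frame with one duplicate key and one missing forecast
demo = pd.DataFrame({
    "unique_id": ["A", "A", "A"],
    "ds": pd.to_datetime(["2025-12-01", "2025-12-01", "2025-12-08"]),
    "yhat": [10.0, 11.0, None],
})
issues = check_predictions("demo", demo)
print(issues)
```

Running this over every entry in `all_predictions` before scoring would surface schema problems early instead of as an empty merge.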

---
## 0. Setup & Load Results

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio

# Consistent styling for all charts
pio.templates.default = "plotly_white"

from ds_timeseries.evaluation.metrics import wape, mae, bias, evaluate_forecast
from ds_timeseries.features import add_fiscal_features, rollup_to_fiscal_month
from ds_timeseries.evaluation.plots import (
    plot_forecast,
    plot_forecast_grid,
    plot_model_comparison,
    COLORS,
    MODEL_COLORS,
)

# --- Color palette ---
# Consistent across all leadership charts
BRAND = {
    "green":  "#2D9F5A",   # good / improvement
    "yellow": "#F5A623",   # acceptable / caution
    "red":    "#D94F4F",   # poor / needs attention
    "blue":   "#2E86AB",   # primary / our new model
    "gray":   "#8B95A2",   # benchmark / reference
    "dark":   "#2D3436",   # text
    "light":  "#F8F9FA",   # background
}

# Chart export settings
EXPORT_DIR = "../data/raw/charts"  # change if you want
CHART_HEIGHT = 500
CHART_WIDTH = 900

import os
os.makedirs(EXPORT_DIR, exist_ok=True)

print("Setup complete.")

In [None]:
# === EDIT THIS SECTION ===
# Load data from the walkthrough notebook.
# The walkthrough saves model_scorecard.csv. You need to also save/load:
#   - test actuals
#   - fiscal calendar
#   - model predictions dict
#
# Easiest approach: run both notebooks in the same Jupyter session,
# or save/load with parquet as shown below.

# --- Your actuals (full dataset, for train history + test) ---
# raw = pd.read_parquet("../data/raw/your_sales_data.parquet")
# raw["ds"] = pd.to_datetime(raw["ds"])
# ... (same column mapping as walkthrough) ...

# --- Or if you saved train/test separately: ---
# test = pd.read_parquet("../data/raw/test_actuals.parquet")
# train = pd.read_parquet("../data/raw/train_actuals.parquet")

# --- Your fiscal calendar ---
# fiscal_cal = pd.read_parquet("../data/raw/your_fiscal_calendar.parquet")

# --- Cutoff date ---
CUTOFF_DATE = pd.Timestamp("2025-11-25")

# --- For dollar impact estimation ---
AVG_UNIT_VALUE = 500   # average dollar value per unit of y
# If y is already in dollars, set this to 1

# --- Model predictions (dict of name -> DataFrame with unique_id, ds, yhat) ---
# Collect all model predictions into this dictionary.
# Each value is a DataFrame with columns: unique_id, ds, yhat
all_predictions = {
    # "Your Predictions": existing_preds,
    # "LightGBM":         lgb_preds,
    # "XGBoost":          xgb_preds,
    # "CatBoost":         cat_preds,
    # "DRFAM":            drfam_preds,
    # "SimpleEnsemble":   simple_ens_preds,
    # etc.
}

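# --- Validate that required variables are loaded ---
for var_name in ["test", "all_predictions"]:
    if var_name not in globals():
        raise NameError(
            f"'{var_name}' is not defined. Load your data above before continuing. "
            f"See walkthrough_your_data.ipynb for how to generate these."
        )
# An empty dict would pass the check above but produce an empty scorecard
if not all_predictions:
    raise ValueError("'all_predictions' is empty. Add at least one model's predictions above.")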

print(f"Models loaded: {list(all_predictions.keys())}")
print(f"Test period: {test['ds'].min().date()} to {test['ds'].max().date()}")
print(f"Test weeks: {test['ds'].nunique()}")
print(f"Series: {test['unique_id'].nunique()}")

### Score all models against actuals
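
The scorecard is ranked by WAPE. For reference, here is a sketch of the usual definitions, using the same WAPE formula as the weekly cell later in this notebook. The `_sketch` functions are illustrative stand-ins; the actual `evaluate_forecast` in `ds_timeseries` may differ in details such as the sign or normalization of bias.

```python
import numpy as np


def wape_sketch(y, yhat):
    """Sum of absolute errors divided by sum of absolute actuals."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    denom = np.abs(y).sum()
    return np.abs(y - yhat).sum() / denom if denom > 0 else np.nan


def mae_sketch(y, yhat):
    """Mean absolute error, in units of y."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return np.abs(y - yhat).mean()


def bias_sketch(y, yhat):
    """Mean of (forecast - actual); positive means over-forecast (assumed convention)."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return (yhat - y).mean()


# Made-up numbers: absolute errors are 10, 10 and 5
y_demo = [100, 50, 0]
yhat_demo = [90, 60, 5]
print(wape_sketch(y_demo, yhat_demo), mae_sketch(y_demo, yhat_demo), bias_sketch(y_demo, yhat_demo))
```

Because WAPE divides by total actuals rather than averaging per-row percentages, rows with zero actuals (like the third one here) do not cause division errors.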

In [None]:
# Build a unified scores table
scores_list = []

for model_name, preds in all_predictions.items():
    merged = test.merge(preds[["unique_id", "ds", "yhat"]], on=["unique_id", "ds"], how="inner")
    if len(merged) == 0:
        print(f"  WARNING: {model_name} — no matching rows")
        continue
    scores = evaluate_forecast(merged["y"], merged["yhat"])
    scores["model"] = model_name
    scores["n_rows"] = len(merged)
    scores_list.append(scores)

scorecard = pd.DataFrame(scores_list).sort_values("wape").reset_index(drop=True)
scorecard["rank"] = range(1, len(scorecard) + 1)

# Flag "Your Predictions" as the benchmark
benchmark_wape = scorecard.loc[scorecard["model"] == "Your Predictions", "wape"].values
if len(benchmark_wape) > 0:
    benchmark_wape = benchmark_wape[0]
    scorecard["vs_benchmark_pct"] = -(scorecard["wape"] - benchmark_wape) / benchmark_wape * 100
else:
    benchmark_wape = scorecard["wape"].max()
    scorecard["vs_benchmark_pct"] = np.nan

scorecard

---
## 1. Executive Summary — The Headline Number

One slide. One number. The improvement.
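
The dollar figure in the cell below is an extrapolation: test-period volume is scaled to 52 weeks, multiplied by the WAPE reduction, and then by `AVG_UNIT_VALUE`. Since WAPE times volume equals total absolute error in units, the result is best read as the value of mis-forecast units avoided, not as realized savings. A worked sketch with made-up inputs (the function name is hypothetical):

```python
def annualized_error_value(test_units, weeks_in_test, unit_value, wape_before, wape_after):
    """Dollar value of absolute forecast error removed, scaled to a 52-week year."""
    annual_units = test_units * 52 / max(weeks_in_test, 1)
    error_units_removed = annual_units * (wape_before - wape_after)
    return error_units_removed * unit_value


# Made-up inputs: 120,000 units over 12 weeks, $500 per unit, WAPE 40% -> 34%
impact = annualized_error_value(
    test_units=120_000, weeks_in_test=12, unit_value=500,
    wape_before=0.40, wape_after=0.34,
)
print(f"${impact:,.0f}")
```

The linear scaling assumes the test weeks are representative of the full year; a short or seasonal test period would make the annual number less reliable.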

In [None]:
best_model = scorecard.loc[scorecard["model"] != "Your Predictions"].iloc[0]
improvement_pct = (benchmark_wape - best_model["wape"]) / benchmark_wape * 100
improvement_abs = benchmark_wape - best_model["wape"]

# Estimate dollar impact: annualize test-period volume, then value the
# absolute error removed (volume x WAPE reduction x unit value)
test_volume = test["y"].sum()  # total units in test period
weeks_in_test = test["ds"].nunique()
annual_volume_projected = test_volume * (52 / max(weeks_in_test, 1))
dollar_impact = annual_volume_projected * AVG_UNIT_VALUE * improvement_abs

fig = go.Figure()

# Big number card
fig.add_trace(go.Indicator(
    mode="number+delta",
    value=best_model["wape"] * 100,
    number={"suffix": "%", "font": {"size": 72, "color": BRAND["blue"]}},
    delta={
        "reference": benchmark_wape * 100,
        "decreasing": {"color": BRAND["green"]},
        "increasing": {"color": BRAND["red"]},
        "suffix": " pts",
        "font": {"size": 28},
    },
    title={
        "text": (
            f"<b>Forecast Accuracy (WAPE)</b><br>"
            f"<span style='font-size:16px;color:{BRAND['gray']}'>Best Model: {best_model['model']}</span>"
        ),
        "font": {"size": 20, "color": BRAND["dark"]},
    },
    domain={"x": [0, 0.5], "y": [0, 1]},
))

fig.add_trace(go.Indicator(
    mode="number",
    value=dollar_impact,
    number={"prefix": "$", "font": {"size": 56, "color": BRAND["green"]}, "valueformat": ",.0f"},
    title={
        "text": (
            f"<b>Estimated Annual Impact</b><br>"
            f"<span style='font-size:14px;color:{BRAND['gray']}'>Projected from {weeks_in_test}-week test period</span>"
        ),
        "font": {"size": 20, "color": BRAND["dark"]},
    },
    domain={"x": [0.55, 1], "y": [0, 1]},
))

fig.update_layout(
    height=300,
    margin=dict(t=80, b=20, l=20, r=20),
    paper_bgcolor=BRAND["light"],
)

fig.write_html(f"{EXPORT_DIR}/01_executive_summary.html")
fig.show()

print(f"\nTalking point: 'We reduced forecast error from {benchmark_wape:.1%} to {best_model['wape']:.1%}")
print(f"               — a {improvement_pct:.0f}% improvement, worth an estimated ${dollar_impact:,.0f}/year.'")

---
## 2. Model Comparison — Which Approach Wins?

Horizontal bar chart. Your current predictions as a red dashed line.  
Everything to the left of the line is an improvement.

In [None]:
sc = scorecard.sort_values("wape", ascending=True).copy()

# Color: green if better than benchmark, gray if benchmark, red if worse
def bar_color(row):
    if row["model"] == "Your Predictions":
        return BRAND["gray"]
    return BRAND["green"] if row["wape"] < benchmark_wape else BRAND["red"]

sc["color"] = sc.apply(bar_color, axis=1)

fig = go.Figure()

fig.add_trace(go.Bar(
    y=sc["model"],
    x=sc["wape"],
    orientation="h",
    marker_color=sc["color"],
    text=[f"{w:.1%}" for w in sc["wape"]],
    textposition="outside",
    textfont=dict(size=13),
    hovertemplate="%{y}<br>WAPE: %{x:.2%}<extra></extra>",
))

# Benchmark reference line
fig.add_vline(
    x=benchmark_wape,
    line_dash="dash",
    line_color=BRAND["red"],
    line_width=2,
    annotation_text=f"Current: {benchmark_wape:.1%}",
    annotation_position="top",
    annotation_font=dict(size=13, color=BRAND["red"]),
)

fig.update_layout(
    title=dict(
        text="<b>Forecast Accuracy by Model</b><br>"
             "<span style='font-size:14px;color:gray'>WAPE — lower is better. Green = beats current predictions.</span>",
        font=dict(size=18, color=BRAND["dark"]),
    ),
    xaxis=dict(
        title="Weighted Absolute % Error (WAPE)",
        tickformat=".0%",
        gridcolor="#EEEEEE",
        range=[0, max(sc["wape"]) * 1.2],
    ),
    yaxis=dict(title=""),
    height=max(350, len(sc) * 40 + 120),
    width=CHART_WIDTH,
    margin=dict(l=180, r=80, t=100, b=60),
)

fig.write_html(f"{EXPORT_DIR}/02_model_comparison.html")
fig.show()

---
## 3. Accuracy Trendline — Week-by-Week Performance

Shows how each model's accuracy tracks over time against actuals.  
Leadership can see: "Is the model consistently good, or did it get lucky in one week?"
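
A single low-volume week can look dramatic in the weekly view while barely moving the running total. The sketch below, on made-up data, computes both the per-week WAPE and the cumulative WAPE (cumulative error divided by cumulative actuals) with one `groupby`.

```python
import pandas as pd

# Toy merged frame: two series, three weeks (made-up numbers)
merged_demo = pd.DataFrame({
    "unique_id": ["A", "B"] * 3,
    "ds": pd.to_datetime(["2025-12-01"] * 2 + ["2025-12-08"] * 2 + ["2025-12-15"] * 2),
    "y":    [100, 50, 120, 40, 10, 5],
    "yhat": [90, 60, 110, 50, 20, 10],
})
merged_demo["abs_err"] = (merged_demo["y"] - merged_demo["yhat"]).abs()
merged_demo["abs_y"] = merged_demo["y"].abs()

# One row per week: total absolute error and total absolute actuals
weekly_demo = merged_demo.groupby("ds")[["abs_err", "abs_y"]].sum()
weekly_demo["wape"] = weekly_demo["abs_err"] / weekly_demo["abs_y"]
# Cumulative WAPE pools error and volume over all weeks so far
weekly_demo["cum_wape"] = weekly_demo["abs_err"].cumsum() / weekly_demo["abs_y"].cumsum()
print(weekly_demo)
```

Here the third week has only 15 units of demand, so its weekly WAPE is 100% while the cumulative figure stays near 17%. Weeks with zero total actuals would need the same guard the notebook cell uses.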

In [None]:
# Compute weekly WAPE for each model
weekly_scores = []

for model_name, preds in all_predictions.items():
    merged = test.merge(preds[["unique_id", "ds", "yhat"]], on=["unique_id", "ds"], how="inner")
    for week_date, week_df in merged.groupby("ds"):
        total_actual = week_df["y"].abs().sum()
        if total_actual > 0:
            w = (week_df["y"] - week_df["yhat"]).abs().sum() / total_actual
        else:
            w = np.nan
        weekly_scores.append({
            "model": model_name,
            "ds": week_date,
            "wape": w,
            "total_actual": total_actual,
            "total_error": (week_df["y"] - week_df["yhat"]).abs().sum(),
        })

weekly_df = pd.DataFrame(weekly_scores)
weekly_df["ds"] = pd.to_datetime(weekly_df["ds"])

print(f"Weekly scores: {len(weekly_df)} rows")
print(f"Models: {weekly_df['model'].nunique()}")
print(f"Weeks: {weekly_df['ds'].nunique()}")

In [None]:
# --- Weekly WAPE trendline: all models ---
fig = go.Figure()

# Sort models by overall WAPE so legend is ordered best-to-worst
model_order = scorecard.sort_values("wape")["model"].tolist()

for idx, model_name in enumerate(model_order):
    model_weekly = weekly_df[weekly_df["model"] == model_name].sort_values("ds")

    is_benchmark = model_name == "Your Predictions"

    fig.add_trace(go.Scatter(
        x=model_weekly["ds"],
        y=model_weekly["wape"],
        mode="lines+markers",
        name=model_name,
        line=dict(
            color=BRAND["gray"] if is_benchmark else MODEL_COLORS[idx % len(MODEL_COLORS)],
            width=3 if is_benchmark else 2,
            dash="dash" if is_benchmark else "solid",
        ),
        marker=dict(size=6 if is_benchmark else 5),
        hovertemplate=f"{model_name}<br>Week: %{{x|%b %d}}<br>WAPE: %{{y:.1%}}<extra></extra>",
        opacity=1.0 if is_benchmark or idx < 3 else 0.5,
    ))

# Add fiscal month boundaries from your calendar
try:
    month_ends = fiscal_cal[fiscal_cal["is_fiscal_month_end"]]["ds"]
    month_ends_in_range = month_ends[
        (month_ends >= test["ds"].min()) & (month_ends <= test["ds"].max())
    ]
    for me in month_ends_in_range:
        fig.add_vline(
            x=me, line_dash="dot", line_color="#DDDDDD", line_width=1,
        )
except NameError:
    pass  # fiscal_cal not loaded

fig.update_layout(
    title=dict(
        text="<b>Weekly Forecast Accuracy Over Time</b><br>"
             "<span style='font-size:14px;color:gray'>WAPE per week — lower is better. "
             "Dashed gray = current predictions.</span>",
        font=dict(size=18, color=BRAND["dark"]),
    ),
    xaxis=dict(
        title="Week",
        tickformat="%b %d",
        gridcolor="#EEEEEE",
    ),
    yaxis=dict(
        title="WAPE (lower = more accurate)",
        tickformat=".0%",
        gridcolor="#EEEEEE",
    ),
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0),
    hovermode="x unified",
    height=CHART_HEIGHT,
    width=CHART_WIDTH,
    margin=dict(t=120, b=60),
)

fig.write_html(f"{EXPORT_DIR}/03_weekly_accuracy_trendline.html")
fig.show()

In [None]:
# --- Cumulative (running) WAPE: shows convergence over time ---
fig = go.Figure()

for idx, model_name in enumerate(model_order):
    model_weekly = weekly_df[weekly_df["model"] == model_name].sort_values("ds")
    # Cumulative WAPE = cumulative error / cumulative actual
    model_weekly = model_weekly.copy()
    model_weekly["cum_error"] = model_weekly["total_error"].cumsum()
    model_weekly["cum_actual"] = model_weekly["total_actual"].cumsum()
    model_weekly["cum_wape"] = model_weekly["cum_error"] / model_weekly["cum_actual"]

    is_benchmark = model_name == "Your Predictions"

    fig.add_trace(go.Scatter(
        x=model_weekly["ds"],
        y=model_weekly["cum_wape"],
        mode="lines",
        name=model_name,
        line=dict(
            color=BRAND["gray"] if is_benchmark else MODEL_COLORS[idx % len(MODEL_COLORS)],
            width=3 if is_benchmark else 2,
            dash="dash" if is_benchmark else "solid",
        ),
        hovertemplate=f"{model_name}<br>Through: %{{x|%b %d}}<br>Cumulative WAPE: %{{y:.1%}}<extra></extra>",
        opacity=1.0 if is_benchmark or idx < 3 else 0.5,
    ))

fig.update_layout(
    title=dict(
        text="<b>Cumulative Accuracy Over Time</b><br>"
             "<span style='font-size:14px;color:gray'>Running WAPE — stabilizes as more weeks are included. "
             "Dashed gray = current predictions.</span>",
        font=dict(size=18, color=BRAND["dark"]),
    ),
    xaxis=dict(title="Through Week", tickformat="%b %d", gridcolor="#EEEEEE"),
    yaxis=dict(title="Cumulative WAPE", tickformat=".0%", gridcolor="#EEEEEE"),
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0),
    hovermode="x unified",
    height=CHART_HEIGHT,
    width=CHART_WIDTH,
    margin=dict(t=120, b=60),
)

fig.write_html(f"{EXPORT_DIR}/04_cumulative_accuracy.html")
fig.show()

---
## 4. Fiscal Month Accuracy — The Monthly View

Leadership thinks in fiscal months. This rolls up weekly accuracy to show  
how each model performed per fiscal period.
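
Note that the roll-up cell below sums `y` and `yhat` across series before taking the absolute error, so it measures error on the monthly total. That can differ a lot from a per-series WAPE, because over- and under-forecasts offset each other. A sketch on made-up data, assuming a `fiscal_month` column is already attached:

```python
import pandas as pd

# Made-up per-series monthly totals
monthly_demo = pd.DataFrame({
    "fiscal_month": [1, 1, 2, 2],
    "unique_id": ["A", "B", "A", "B"],
    "y":    [100, 100, 80, 120],
    "yhat": [130, 70, 90, 100],
})
monthly_demo["abs_err"] = (monthly_demo["y"] - monthly_demo["yhat"]).abs()

by_month = monthly_demo.groupby("fiscal_month")[["y", "yhat", "abs_err"]].sum()
# Error on the monthly total: series-level misses can cancel out
by_month["net_pct_error"] = (by_month["y"] - by_month["yhat"]).abs() / by_month["y"].abs()
# WAPE: absolute errors are summed per series first, so nothing cancels
by_month["wape"] = by_month["abs_err"] / by_month["y"].abs()
print(by_month[["net_pct_error", "wape"]])
```

In month 1 the total is forecast perfectly (0% net error) even though each series is off by 30 units (30% WAPE). Which view is appropriate depends on whether leadership plans against the total or against individual Customer-Material pairs.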

In [None]:
# Roll up each model's predictions to fiscal months
monthly_scores = []

for model_name, preds in all_predictions.items():
    merged = test.merge(preds[["unique_id", "ds", "yhat"]], on=["unique_id", "ds"], how="inner")
    try:
        monthly = rollup_to_fiscal_month(
            merged, value_cols=["y", "yhat"], fiscal_calendar=fiscal_cal
        )
        # Aggregate across all series per fiscal month
        month_agg = (
            monthly
            .groupby(["fiscal_year", "fiscal_month"])
            .agg({"y": "sum", "yhat": "sum"})
            .reset_index()
        )
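        # Note: y and yhat are summed across series first, so over- and
        # under-forecasts within a month offset each other. This is the
        # error on the monthly total, not a per-series WAPE.
        month_agg["wape"] = (month_agg["y"] - month_agg["yhat"]).abs() / month_agg["y"].abs()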
        month_agg["model"] = model_name
        month_agg["period_label"] = "FY" + month_agg["fiscal_year"].astype(str) + " M" + month_agg["fiscal_month"].astype(str).str.zfill(2)
        monthly_scores.append(month_agg)
    except Exception as e:
        print(f"  {model_name}: Could not roll up — {e}")

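# pd.concat raises an unhelpful error on an empty list, so fail with a clear message
if not monthly_scores:
    raise ValueError("No fiscal-month scores computed. Check that fiscal_cal is loaded and the roll-up succeeded.")
monthly_all = pd.concat(monthly_scores, ignore_index=True)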
print(f"Fiscal months with data: {monthly_all['period_label'].nunique()}")

In [None]:
# --- Grouped bar chart: WAPE by fiscal month, grouped by model ---
# Only show top N models to keep it readable
TOP_N_MODELS = 5
top_models = scorecard.head(TOP_N_MODELS)["model"].tolist()
if "Your Predictions" not in top_models:
    top_models.append("Your Predictions")

monthly_top = monthly_all[monthly_all["model"].isin(top_models)].copy()
periods = sorted(monthly_top["period_label"].unique())

fig = go.Figure()

for idx, model_name in enumerate(top_models):
    model_data = monthly_top[monthly_top["model"] == model_name].set_index("period_label")

    is_benchmark = model_name == "Your Predictions"

    fig.add_trace(go.Bar(
        name=model_name,
        x=periods,
        y=[model_data.loc[p, "wape"] if p in model_data.index else np.nan for p in periods],
        marker_color=BRAND["gray"] if is_benchmark else MODEL_COLORS[idx % len(MODEL_COLORS)],
        marker_pattern_shape="/" if is_benchmark else "",
        hovertemplate=f"{model_name}<br>%{{x}}<br>WAPE: %{{y:.1%}}<extra></extra>",
    ))

fig.update_layout(
    barmode="group",
    title=dict(
        text="<b>Accuracy by Fiscal Month</b><br>"
             "<span style='font-size:14px;color:gray'>WAPE per fiscal month — lower is better. "
             "Hatched bars = current predictions.</span>",
        font=dict(size=18, color=BRAND["dark"]),
    ),
    xaxis=dict(title="Fiscal Period", gridcolor="#EEEEEE"),
    yaxis=dict(title="WAPE", tickformat=".0%", gridcolor="#EEEEEE"),
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0),
    height=CHART_HEIGHT,
    width=CHART_WIDTH,
    margin=dict(t=120, b=60),
)

fig.write_html(f"{EXPORT_DIR}/05_fiscal_month_accuracy.html")
fig.show()

---
## 5. Traffic Light Summary — At a Glance

How many Customer-Material combinations fall into each accuracy bucket?  
Simple red/yellow/green that operations teams can act on.
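
Bucket edges matter here. With pandas' default right-closed bins, `pd.cut` leaves a WAPE of exactly 0 outside the first bin (it becomes NaN), and 0.35 lands in the "Good" bucket. A sketch with `right=False`, which gives the half-open intervals [0, 0.35), [0.35, 0.50), [0.50, inf) on made-up values:

```python
import pandas as pd

# Made-up per-series WAPE values, including the edge cases 0.0, 0.35 and 0.50
series_wape_demo = pd.Series(
    [0.0, 0.20, 0.35, 0.49, 0.50, 1.70],
    index=["A", "B", "C", "D", "E", "F"],
    name="wape",
)
labels = ["Good (<35%)", "Acceptable (35-50%)", "Needs Attention (>50%)"]

buckets = pd.cut(
    series_wape_demo,
    bins=[0, 0.35, 0.50, float("inf")],
    labels=labels,
    right=False,  # left-closed bins: 0.0 is included, 0.35 starts "Acceptable"
)
print(buckets.value_counts(sort=False))
```

These edges line up with a plain `w < 0.35` / `w < 0.50` threshold check, so every series gets a bucket.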

In [None]:
# Compare traffic light distributions: Your Predictions vs Best Model
compare_models = ["Your Predictions", best_model["model"]]

traffic_data = []

for model_name in compare_models:
    if model_name not in all_predictions:
        continue
    preds = all_predictions[model_name]
    merged = test.merge(preds[["unique_id", "ds", "yhat"]], on=["unique_id", "ds"], how="inner")

    # Per-series WAPE
    series_wape = (
        merged.groupby("unique_id")
        .apply(lambda x: (x["y"] - x["yhat"]).abs().sum() / max(x["y"].abs().sum(), 1e-6), include_groups=False)
        .reset_index()
        .rename(columns={0: "wape"})
    )

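    # Traffic light buckets
    # right=False gives [0, 0.35), [0.35, 0.50), [0.50, inf), so a WAPE of
    # exactly 0 is counted as "Good" instead of falling outside the bins
    series_wape["bucket"] = pd.cut(
        series_wape["wape"],
        bins=[0, 0.35, 0.50, float("inf")],
        labels=["Good (<35%)", "Acceptable (35-50%)", "Needs Attention (>50%)"],
        right=False,
    )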
    counts = series_wape["bucket"].value_counts().to_dict()
    total = len(series_wape)

    for bucket, count in counts.items():
        traffic_data.append({
            "model": model_name,
            "bucket": bucket,
            "count": count,
            "pct": count / total,
        })

traffic_df = pd.DataFrame(traffic_data)

# Stacked bar chart
fig = go.Figure()

bucket_colors = {
    "Good (<35%)": BRAND["green"],
    "Acceptable (35-50%)": BRAND["yellow"],
    "Needs Attention (>50%)": BRAND["red"],
}

for bucket_name in ["Good (<35%)", "Acceptable (35-50%)", "Needs Attention (>50%)"]:
    bucket_data = traffic_df[traffic_df["bucket"] == bucket_name]
    fig.add_trace(go.Bar(
        name=bucket_name,
        y=bucket_data["model"],
        x=bucket_data["pct"],
        orientation="h",
        marker_color=bucket_colors[bucket_name],
        text=[f"{p:.0%} ({c})" for p, c in zip(bucket_data["pct"], bucket_data["count"])],
        textposition="inside",
        textfont=dict(size=14, color="white"),
        customdata=bucket_data["count"],  # per-bar series count for the hover label
        hovertemplate=f"{bucket_name}<br>%{{y}}<br>%{{x:.1%}} (%{{customdata}} series)<extra></extra>",
    ))

fig.update_layout(
    barmode="stack",
    title=dict(
        text="<b>Accuracy Distribution: Current vs Best Model</b><br>"
             "<span style='font-size:14px;color:gray'>% of Customer-Material combinations in each accuracy bucket</span>",
        font=dict(size=18, color=BRAND["dark"]),
    ),
    xaxis=dict(title="% of Series", tickformat=".0%", gridcolor="#EEEEEE", range=[0, 1]),
    yaxis=dict(title=""),
    legend=dict(orientation="h", yanchor="bottom", y=1.05, xanchor="center", x=0.5),
    height=300,
    width=CHART_WIDTH,
    margin=dict(l=200, r=40, t=120, b=60),
)

fig.write_html(f"{EXPORT_DIR}/06_traffic_light.html")
fig.show()

---
## 6. Forecast vs Actual — Top Products

Pick the highest-volume Customer-Material pairs that leadership knows.  
Show the forecast tracking actuals week by week.

In [None]:
# Find top 6 series by total volume in test period
top_series = (
    test.groupby("unique_id")["y"].sum()
    .sort_values(ascending=False)
    .head(6)
    .index.tolist()
)

print(f"Top {len(top_series)} series by volume:")
for s in top_series:
    vol = test[test["unique_id"] == s]["y"].sum()
    print(f"  {s}: {vol:,.0f} total units")

In [None]:
# Build a comparison dict: Your Predictions vs Best Model
comparison_preds = {}
if "Your Predictions" in all_predictions:
    comparison_preds["Current"] = all_predictions["Your Predictions"]
comparison_preds[best_model["model"]] = all_predictions[best_model["model"]]

# Also need full actuals (train + test) for historical context
# If you have the full df loaded:
try:
    full_actuals = pd.concat([train[["unique_id", "ds", "y"]], test[["unique_id", "ds", "y"]]])
except NameError:
    full_actuals = test[["unique_id", "ds", "y"]]  # fallback to test only

In [None]:
# 2x3 grid: top 6 products, showing Current vs Best Model vs Actuals
n_cols = 3
n_rows = 2

fig = make_subplots(
    rows=n_rows, cols=n_cols,
    subplot_titles=[s[:40] for s in top_series],
    vertical_spacing=0.12,
    horizontal_spacing=0.06,
)

for idx, series_id in enumerate(top_series):
    row = idx // n_cols + 1
    col = idx % n_cols + 1

    # Historical (last 26 weeks before cutoff)
    series_hist = full_actuals[
        (full_actuals["unique_id"] == series_id) & (full_actuals["ds"] < CUTOFF_DATE)
    ].sort_values("ds").tail(26)

    # Actuals in test period
    series_actual = test[test["unique_id"] == series_id].sort_values("ds")

    # Historical line
    fig.add_trace(go.Scatter(
        x=series_hist["ds"], y=series_hist["y"],
        mode="lines", line=dict(color=BRAND["dark"], width=2),
        name="Historical", showlegend=(idx == 0),
        legendgroup="hist",
        hovertemplate="%{y:.0f}<extra>Historical</extra>",
    ), row=row, col=col)

    # Actual in test period
    fig.add_trace(go.Scatter(
        x=series_actual["ds"], y=series_actual["y"],
        mode="lines+markers", line=dict(color=BRAND["dark"], width=2, dash="dot"),
        marker=dict(size=5),
        name="Actual", showlegend=(idx == 0),
        legendgroup="actual",
        hovertemplate="%{y:.0f}<extra>Actual</extra>",
    ), row=row, col=col)

    # Each model's forecast
    for m_idx, (m_name, m_preds) in enumerate(comparison_preds.items()):
        series_pred = m_preds[m_preds["unique_id"] == series_id].sort_values("ds")
        color = BRAND["gray"] if m_name == "Current" else BRAND["blue"]
        dash = "dash" if m_name == "Current" else "solid"

        fig.add_trace(go.Scatter(
            x=series_pred["ds"], y=series_pred["yhat"],
            mode="lines", line=dict(color=color, width=2, dash=dash),
            name=m_name, showlegend=(idx == 0),
            legendgroup=m_name,
            hovertemplate=f"%{{y:.0f}}<extra>{m_name}</extra>",
        ), row=row, col=col)

    # Cutoff line
    fig.add_vline(x=CUTOFF_DATE, line_dash="dot", line_color="#CCCCCC", row=row, col=col)

fig.update_layout(
    title=dict(
        text="<b>Top Products: Forecast vs Actual</b><br>"
             "<span style='font-size:14px;color:gray'>Solid dark = history, dotted dark = actual, "
             "dashed gray = current predictions, solid blue = new model</span>",
        font=dict(size=18, color=BRAND["dark"]),
    ),
    legend=dict(orientation="h", yanchor="bottom", y=1.06, xanchor="left", x=0),
    height=600,
    width=CHART_WIDTH + 100,
    margin=dict(t=140, b=40),
)
fig.update_xaxes(tickformat="%b %d", gridcolor="#EEEEEE")
fig.update_yaxes(gridcolor="#EEEEEE")

fig.write_html(f"{EXPORT_DIR}/07_top_products_forecast_vs_actual.html")
fig.show()

---
## 7. Biggest Improvements & Biggest Misses

Where did the new model help the most?  
Where does it still struggle? (Actionable for ops.)
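
A large WAPE change on a tiny series matters less than a modest change on a high-volume one. One option, sketched below on made-up data, is to split improved and regressed series explicitly and rank them by approximate error units removed (WAPE change times volume, assuming non-negative actuals). The `units_of_error_removed` column is an illustrative addition, not something the cell below computes.

```python
import pandas as pd

# Made-up per-series WAPE for the current and new model, plus test-period volume
comparison_demo = pd.DataFrame(
    {
        "current_wape": [0.60, 0.30, 0.45, 0.20],
        "new_wape":     [0.40, 0.35, 0.45, 0.10],
        "volume":       [1000, 50, 300, 5000],
    },
    index=["A", "B", "C", "D"],
)
comparison_demo["improvement"] = comparison_demo["current_wape"] - comparison_demo["new_wape"]
# WAPE x volume is total absolute error, so this approximates error units removed
comparison_demo["units_of_error_removed"] = (
    comparison_demo["improvement"] * comparison_demo["volume"]
)

improved = comparison_demo[comparison_demo["improvement"] > 0].sort_values(
    "units_of_error_removed", ascending=False
)
regressed = comparison_demo[comparison_demo["improvement"] < 0].sort_values(
    "units_of_error_removed"
)
print(improved.index.tolist(), regressed.index.tolist())
```

Series D improves by only 10 points but removes the most error because of its volume; series C is unchanged and appears in neither list.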

In [None]:
if "Your Predictions" in all_predictions:
    benchmark_preds = all_predictions["Your Predictions"]
    best_preds = all_predictions[best_model["model"]]

    # Per-series WAPE for both
    def series_wapes(preds):
        merged = test.merge(preds[["unique_id", "ds", "yhat"]], on=["unique_id", "ds"], how="inner")
        return (
            merged.groupby("unique_id")
            .apply(lambda x: (x["y"] - x["yhat"]).abs().sum() / max(x["y"].abs().sum(), 1e-6), include_groups=False)
            .rename("wape")
        )

    wape_current = series_wapes(benchmark_preds)
    wape_new = series_wapes(best_preds)

    comparison = pd.DataFrame({
        "current_wape": wape_current,
        "new_wape": wape_new,
    }).dropna()
    comparison["improvement"] = comparison["current_wape"] - comparison["new_wape"]
    comparison["volume"] = test.groupby("unique_id")["y"].sum()

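    # Top 15 improvements (only series where the new model is better)
    top_improvements = comparison[comparison["improvement"] > 0].sort_values("improvement", ascending=False).head(15)
    # Top 15 regressions (only series where the new model is worse)
    top_regressions = comparison[comparison["improvement"] < 0].sort_values("improvement", ascending=True).head(15)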

    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=["Biggest Improvements", "Needs Investigation"],
        horizontal_spacing=0.15,
    )

    # Improvements
    fig.add_trace(go.Bar(
        y=[s[:25] for s in top_improvements.index],
        x=top_improvements["improvement"],
        orientation="h",
        marker_color=BRAND["green"],
        text=[f"+{v:.0%}" for v in top_improvements["improvement"]],
        textposition="outside",
        hovertemplate="%{y}<br>Improvement: %{x:.1%}<extra></extra>",
        showlegend=False,
    ), row=1, col=1)

    # Regressions
    fig.add_trace(go.Bar(
        y=[s[:25] for s in top_regressions.index],
        x=top_regressions["improvement"].abs(),
        orientation="h",
        marker_color=BRAND["red"],
        text=[f"-{v:.0%}" for v in top_regressions["improvement"].abs()],
        textposition="outside",
        hovertemplate="%{y}<br>Regression: %{x:.1%}<extra></extra>",
        showlegend=False,
    ), row=1, col=2)

    fig.update_layout(
        title=dict(
            text=f"<b>Where {best_model['model']} Helps vs Hurts</b><br>"
                 "<span style='font-size:14px;color:gray'>WAPE improvement per Customer-Material pair</span>",
            font=dict(size=18, color=BRAND["dark"]),
        ),
        height=550,
        width=CHART_WIDTH + 100,
        margin=dict(l=200, r=80, t=100, b=40),
    )
    fig.update_xaxes(tickformat=".0%", gridcolor="#EEEEEE")
    fig.update_yaxes(gridcolor="#EEEEEE")

    fig.write_html(f"{EXPORT_DIR}/08_improvements_and_misses.html")
    fig.show()

    print(f"\n{(comparison['improvement'] > 0).sum()} of {len(comparison)} series improved")
    print(f"{(comparison['improvement'] < 0).sum()} series regressed")
    print(f"{(comparison['improvement'] == 0).sum()} series unchanged")
else:
    print("No benchmark predictions loaded — skipping improvement analysis.")

---
## 8. Accuracy by Hierarchy Level

Roll up accuracy to each level of the hierarchy.  
Shows whether the model is better at the aggregate or granular level.
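
For WAPE to differ between levels, actuals and forecasts have to be summed to each level before the error is taken; scoring the same row-level data under different labels gives identical numbers. A minimal sketch on made-up data, where `wape_at_level` is a hypothetical helper:

```python
import pandas as pd

# Made-up data: two customers, two materials each, one week
merged_demo = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2"],
    "unique_id":   ["C1-M1", "C1-M2", "C2-M1", "C2-M2"],
    "ds": pd.to_datetime(["2025-12-01"] * 4),
    "y":    [100, 100, 50, 150],
    "yhat": [120, 80, 60, 150],
})


def wape_at_level(df, level_col=None):
    """Sum y and yhat to (level, week), then compute WAPE on the sums."""
    keys = ["ds"] if level_col is None else [level_col, "ds"]
    agg = df.groupby(keys)[["y", "yhat"]].sum()
    return (agg["y"] - agg["yhat"]).abs().sum() / max(agg["y"].abs().sum(), 1e-6)


results = {
    "Total": wape_at_level(merged_demo),
    "Customer": wape_at_level(merged_demo, "customer_id"),
    "Cust-Material": wape_at_level(merged_demo, "unique_id"),
}
print(results)
```

Customer C1's two materials miss in opposite directions, so the error disappears at the Customer level while it is fully visible at the Cust-Material level.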

In [None]:
# === EDIT THIS: hierarchy columns in your test data ===
# These should match your data. Comment out any you don't have.
hierarchy_levels = {
    "Total":           None,  # all data
    "Parent Customer": "parent_customer_id",
    "Customer":        "customer_id",
    "Profit Center":   "profit_center_id",
    "Material":        "material_id",
    "Cust-Material":   "unique_id",
}

hier_scores = []

for model_name in compare_models:
    if model_name not in all_predictions:
        continue
    preds = all_predictions[model_name]
    merged = test.merge(preds[["unique_id", "ds", "yhat"]], on=["unique_id", "ds"], how="inner")

    for level_name, level_col in hierarchy_levels.items():
        if level_col is not None and level_col not in merged.columns:
            continue
        # Sum actuals and forecasts to this level (per week), then compute
        # WAPE on the sums so that each level gets its own score
        if level_col is None:
            # Total level
            agg = merged.groupby("ds")[["y", "yhat"]].sum()
            n_groups = 1
        else:
            agg = merged.groupby([level_col, "ds"])[["y", "yhat"]].sum()
            n_groups = merged[level_col].nunique()
        w = wape(agg["y"], agg["yhat"])

        hier_scores.append({
            "model": model_name,
            "level": level_name,
            "wape": w,
            "n_groups": n_groups,
        })

hier_df = pd.DataFrame(hier_scores)

if len(hier_df) > 0:
    fig = px.bar(
        hier_df,
        x="level", y="wape", color="model",
        barmode="group",
        text_auto=".1%",
        color_discrete_map={
            compare_models[0]: BRAND["gray"] if compare_models[0] == "Your Predictions" else BRAND["blue"],
            compare_models[-1]: BRAND["blue"],
        } if len(compare_models) == 2 else None,
    )
    fig.update_layout(
        title=dict(
            text="<b>Accuracy by Hierarchy Level</b><br>"
                 "<span style='font-size:14px;color:gray'>WAPE at each aggregation level — "
                 "accuracy improves as you aggregate up</span>",
            font=dict(size=18, color=BRAND["dark"]),
        ),
        xaxis=dict(title=""),
        yaxis=dict(title="WAPE", tickformat=".0%", gridcolor="#EEEEEE"),
        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0,
                    title=""),
        height=CHART_HEIGHT,
        width=CHART_WIDTH,
        margin=dict(t=120, b=60),
    )
    fig.write_html(f"{EXPORT_DIR}/09_accuracy_by_hierarchy.html")
    fig.show()
else:
    print("No hierarchy columns found in test data.")

---
## 9. Bias Analysis — Over-Forecasting vs Under-Forecasting

Leadership cares about direction. Over-forecasting means excess inventory.  
Under-forecasting means stockouts.
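
The sign convention used in this section is forecast minus actual, expressed as a share of total actuals: positive means over-forecast, negative means under-forecast. A small sketch with made-up numbers (`bias_pct` is an illustrative function name):

```python
def bias_pct(actuals, forecasts):
    """Total (forecast - actual) divided by total actuals; positive = over-forecast."""
    total_actual = sum(actuals)
    total_bias = sum(f - a for a, f in zip(actuals, forecasts))
    return total_bias / max(abs(total_actual), 1e-6)


# Made-up week: 300 units of actual demand across two series
over = bias_pct([100, 200], [120, 210])   # forecasts above actuals
under = bias_pct([100, 200], [90, 180])   # forecasts below actuals
mixed = bias_pct([100, 200], [130, 170])  # misses cancel out
print(f"{over:+.1%} {under:+.1%} {mixed:+.1%}")
```

The mixed case shows why bias should be read alongside WAPE: a bias near zero can hide large offsetting errors.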

In [None]:
# Weekly bias (forecast - actual): positive = over-forecast
bias_data = []

for model_name in compare_models:
    if model_name not in all_predictions:
        continue
    preds = all_predictions[model_name]
    merged = test.merge(preds[["unique_id", "ds", "yhat"]], on=["unique_id", "ds"], how="inner")

    for week_date, week_df in merged.groupby("ds"):
        total_bias = (week_df["yhat"] - week_df["y"]).sum()
        total_actual = week_df["y"].sum()
        bias_data.append({
            "model": model_name,
            "ds": week_date,
            "bias_units": total_bias,
            "bias_pct": total_bias / max(abs(total_actual), 1e-6),
            "total_actual": total_actual,
        })

bias_df = pd.DataFrame(bias_data)

fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=["Weekly Bias (% of Actual)", "Cumulative Bias Direction"],
    vertical_spacing=0.15,
)

for idx, model_name in enumerate(compare_models):
    if model_name not in all_predictions:
        continue
    model_bias = bias_df[bias_df["model"] == model_name].sort_values("ds").copy()

    is_benchmark = model_name == "Your Predictions"
    color = BRAND["gray"] if is_benchmark else BRAND["blue"]

    # Weekly bias
    fig.add_trace(go.Bar(
        x=model_bias["ds"],
        y=model_bias["bias_pct"],
        name=model_name,
        marker_color=[BRAND["red"] if b > 0 else BRAND["green"] for b in model_bias["bias_pct"]],
        opacity=0.4 if is_benchmark else 0.7,
        showlegend=False,
        hovertemplate=f"{model_name}<br>%{{x|%b %d}}<br>Bias: %{{y:.1%}}<extra></extra>",
    ), row=1, col=1)

    # Cumulative bias
    model_bias["cum_bias"] = model_bias["bias_units"].cumsum()
    fig.add_trace(go.Scatter(
        x=model_bias["ds"],
        y=model_bias["cum_bias"],
        mode="lines",
        name=model_name,
        line=dict(color=color, width=2.5, dash="dash" if is_benchmark else "solid"),
        hovertemplate=f"{model_name}<br>Through: %{{x|%b %d}}<br>Cum. Bias: %{{y:,.0f}} units<extra></extra>",
    ), row=2, col=1)

fig.add_hline(y=0, line_dash="solid", line_color="#CCCCCC", row=1, col=1)
fig.add_hline(y=0, line_dash="solid", line_color="#CCCCCC", row=2, col=1)

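# Annotations: placed just right of the plot area, against the top subplot's y-axis.
# row/col are left out so the explicit xref="paper" is kept; when row/col are passed,
# Plotly may replace xref with the subplot's date axis, where x=1.02 is meaningless.
fig.add_annotation(
    text="Over-forecasting (excess inventory)",
    xref="paper", yref="y", x=1.02, y=0.05,
    showarrow=False, font=dict(size=11, color=BRAND["red"]),
    textangle=-90,
)
fig.add_annotation(
    text="Under-forecasting (stockout risk)",
    xref="paper", yref="y", x=1.02, y=-0.05,
    showarrow=False, font=dict(size=11, color=BRAND["green"]),
    textangle=-90,
)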

fig.update_layout(
    title=dict(
        text="<b>Forecast Bias Analysis</b><br>"
             "<span style='font-size:14px;color:gray'>Red = over-forecast (excess inventory), "
             "green = under-forecast (stockout risk)</span>",
        font=dict(size=18, color=BRAND["dark"]),
    ),
    height=650,
    width=CHART_WIDTH,
    margin=dict(t=100, b=40, r=80),
)
fig.update_xaxes(tickformat="%b %d", gridcolor="#EEEEEE")
fig.update_yaxes(tickformat=".0%", gridcolor="#EEEEEE", row=1, col=1)
fig.update_yaxes(gridcolor="#EEEEEE", row=2, col=1, title="Cumulative units")

fig.write_html(f"{EXPORT_DIR}/10_bias_analysis.html")
fig.show()

---
## 10. Summary Table for Slides

Copy-pasteable table for PowerPoint.
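
One way to get a table onto the clipboard in a paste-friendly form is tab-separated text, which spreadsheets split into one cell per value. The sketch below uses a stand-in frame with placeholder values (`demo_summary` is hypothetical, not the real scorecard); the same calls should work on the `summary` frame built in the next cell.

```python
import pandas as pd

# Stand-in for the formatted summary table; values are placeholders
demo_summary = pd.DataFrame({
    "#": [1, 2],
    "Model": ["Model A", "Model B"],
    "WAPE": ["31.2%", "44.0%"],
    "Status": ["GREEN", "YELLOW"],
})

# With no path argument, to_csv returns the text instead of writing a file
tsv_text = demo_summary.to_csv(sep="\t", index=False)
print(tsv_text)

# Optionally copy straight to the clipboard (needs a desktop session):
# demo_summary.to_clipboard(index=False)
```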

In [None]:
# Final summary table
summary = scorecard[["rank", "model", "wape", "mae", "bias"]].copy()

# Traffic-light status label for slides (text labels; the table cell below colors them).
# Computed from the numeric WAPE before formatting, so a value such as 0.3496
# is not rounded to "35.0%" and then misclassified as YELLOW.
def traffic_light(w):
    if w < 0.35:
        return "GREEN"
    elif w < 0.50:
        return "YELLOW"
    return "RED"

summary["status"] = summary["wape"].apply(traffic_light)

# Format numeric columns for display
summary["wape"] = summary["wape"].map("{:.1%}".format)
summary["mae"] = summary["mae"].map("{:.1f}".format)
summary["bias"] = summary["bias"].map("{:+.1f}".format)
summary = summary.rename(columns={
    "rank": "#",
    "model": "Model",
    "wape": "WAPE",
    "mae": "MAE",
    "bias": "Bias",
    "status": "Status",
})

print("\n" + "=" * 70)
print("SUMMARY FOR SLIDES")
print("=" * 70)
print(f"Test Period: {test['ds'].min().date()} to {test['ds'].max().date()} ({test['ds'].nunique()} weeks)")
print(f"Series: {test['unique_id'].nunique()} Customer-Material combinations")
print(f"Best Model: {best_model['model']}")
print(f"Improvement: {improvement_pct:.0f}% better than current predictions")
print(f"Estimated Impact: ${dollar_impact:,.0f}/year")
print("=" * 70)
print()
print(summary.to_string(index=False))

In [None]:
# Table as a clean Plotly figure (screenshot-ready)
fig = go.Figure(data=[go.Table(
    header=dict(
        values=list(summary.columns),
        fill_color=BRAND["blue"],
        font=dict(color="white", size=14),
        align="center",
        height=36,
    ),
    cells=dict(
        values=[summary[c] for c in summary.columns],
        fill_color=[
            [BRAND["light"]] * len(summary),  # #
            [BRAND["light"]] * len(summary),  # Model
            [BRAND["light"]] * len(summary),  # WAPE
            [BRAND["light"]] * len(summary),  # MAE
            [BRAND["light"]] * len(summary),  # Bias
            [BRAND["green"] if s == "GREEN" else BRAND["yellow"] if s == "YELLOW" else BRAND["red"]
             for s in summary["Status"]],  # Status: one color per row (flat list, like the other columns)
        ],
        font=dict(size=13),
        align="center",
        height=30,
    ),
)])

fig.update_layout(
    title=dict(
        text="<b>Model Scorecard</b>",
        font=dict(size=18, color=BRAND["dark"]),
    ),
    height=max(200, len(summary) * 32 + 120),
    width=CHART_WIDTH,
    margin=dict(t=60, b=20, l=20, r=20),
)

fig.write_html(f"{EXPORT_DIR}/11_scorecard_table.html")
fig.show()

---
## Export All Charts

All charts have been auto-saved as interactive HTML to `EXPORT_DIR` (`../data/raw/charts` by default).  

For static PNG (for PowerPoint), uncomment and run below.  
Requires `kaleido`: `pip install kaleido`
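
A figure cannot be rebuilt from its exported HTML file, so PNG export has to work from the figure objects themselves. The helper below is a minimal sketch of one way to do that, assuming you collect each figure in a dict as it is created. `export_figures_as_png` and the `figures` dict are hypothetical names, not part of this notebook or of Plotly; the only Plotly call used is `Figure.write_image`.

```python
from pathlib import Path

def export_figures_as_png(figures, export_dir, scale=2):
    """Save each figure in `figures` ({file stem: plotly Figure}) as a PNG.

    Returns the list of paths written. Relies on Figure.write_image,
    which needs kaleido installed.
    """
    out_dir = Path(export_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for stem, fig in figures.items():
        png_path = out_dir / f"{stem}.png"
        # Width and height fall back to the figure's own layout settings
        fig.write_image(str(png_path), scale=scale)
        written.append(png_path)
    return written

# Hypothetical usage: register each figure right after it is built, e.g.
#   figures = {}
#   figures["10_bias_analysis"] = fig
# then, at the end of the notebook:
#   export_figures_as_png(figures, EXPORT_DIR)
```

Keying the dict by the same stem used for the HTML file keeps the PNG and HTML names aligned with the presentation order table.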

In [None]:
# # Uncomment to export PNGs for slides.
# # Plotly cannot rebuild a figure from its exported HTML file, so write_image()
# # has to be called on each figure object while it is still in memory
# # (for example, right after that chart's write_html() call above).
# # At this point `fig` holds the last figure built, the scorecard table:
# fig.write_image(f"{EXPORT_DIR}/11_scorecard_table.png", scale=2)

print(f"\nAll charts saved to: {EXPORT_DIR}/")
print("Files:")
for f in sorted(os.listdir(EXPORT_DIR)):
    if f.endswith(".html"):
        print(f"  {f}")

---
## Presentation Order (Recommended)

| Slide | Chart | File | Talking Point |
|-------|-------|------|---------------|
| 1 | Executive Summary | `01_executive_summary.html` | "We improved accuracy by X%, worth $Y/year" |
| 2 | Model Comparison | `02_model_comparison.html` | "We tested N approaches — [best model] wins" |
| 3 | Weekly Trendline | `03_weekly_accuracy_trendline.html` | "The improvement is consistent week over week" |
| 4 | Cumulative Accuracy | `04_cumulative_accuracy.html` | "As we accumulate more weeks, the gap holds" |
| 5 | Fiscal Month View | `05_fiscal_month_accuracy.html` | "It works across fiscal periods" |
| 6 | Traffic Light | `06_traffic_light.html` | "More products moved from red/yellow to green" |
| 7 | Top Products | `07_top_products_forecast_vs_actual.html` | "Here are our biggest products — model tracks closely" |
| 8 | Improvements | `08_improvements_and_misses.html` | "Biggest wins and where we still need work" |
| 9 | Hierarchy Accuracy | `09_accuracy_by_hierarchy.html` | "Accuracy is strong at every level of rollup" |
| 10 | Bias Analysis | `10_bias_analysis.html` | "Less systematic bias = fewer inventory surprises" |
| 11 | Scorecard | `11_scorecard_table.html` | "Full scorecard for reference" |