# Cross-validation and Metrics in Time Series Forecasting

This tutorial demonstrates proper cross-validation techniques and evaluation metrics for time series forecasting.

**Duration:** ~10 minutes

## Learning objectives

By the end of this tutorial, you will be able to:
- Use time series splitters for proper cross-validation
- Visualize cross-validation windows with `plot_windows`
- Apply various forecasting metrics
- Use the `evaluate` function for comprehensive model assessment

## 1. Time Series Cross-validation Splitters

Time series observations depend on earlier observations, so randomly shuffled folds leak future information into training. Every fold must train on the past and test on the future.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sktime.datasets import load_airline
from sktime.forecasting.exp_smoothing import ExponentialSmoothing
from sktime.forecasting.model_evaluation import evaluate
from sktime.forecasting.naive import NaiveForecaster
from sktime.split import (
    ExpandingWindowSplitter,
    SlidingWindowSplitter,
    temporal_train_test_split,
)
from sktime.utils.plotting import plot_windows

# Load and examine data
y = load_airline()
print(f"Dataset: {y.shape[0]} observations from {y.index[0]} to {y.index[-1]}")

# Create different types of splitters
splitters = {
    "Expanding Window": ExpandingWindowSplitter(
        initial_window=36,  # Initial training window
        step_length=12,  # Step between CV folds
        fh=[1, 2, 3, 6, 12],  # Forecast horizons to evaluate
    ),
    "Sliding Window": SlidingWindowSplitter(
        window_length=48,  # Fixed training window size
        step_length=6,  # Step between CV folds
        fh=[1, 3, 6, 12],  # Forecast horizons
    ),
}

for name, splitter in splitters.items():
    n_splits = splitter.get_n_splits(y)
    print(f"{name}: {n_splits} splits")
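
To make the two strategies concrete, the sketch below generates train and test positions by hand with NumPy. It illustrates the idea under simple assumptions (integer positions, a first fold that ends after the initial window) and is not sktime's implementation. The helper names `expanding_windows` and `sliding_windows` are hypothetical, and 144 is the length of the airline series.

```python
import numpy as np


def expanding_windows(n_obs, initial_window, step_length, fh):
    """Yield (train, test) positions; the training window grows each fold."""
    fh = np.asarray(fh)
    end = initial_window  # number of observations in the current training set
    # Stop once the furthest horizon would fall outside the series
    while end - 1 + fh.max() < n_obs:
        yield np.arange(0, end), end - 1 + fh
        end += step_length


def sliding_windows(n_obs, window_length, step_length, fh):
    """Yield (train, test) positions; the training window keeps a fixed size."""
    fh = np.asarray(fh)
    end = window_length
    while end - 1 + fh.max() < n_obs:
        yield np.arange(end - window_length, end), end - 1 + fh
        end += step_length


expanding = list(expanding_windows(144, 36, 12, [1, 2, 3, 6, 12]))
sliding = list(sliding_windows(144, 48, 6, [1, 3, 6, 12]))

first_train, first_test = expanding[0]
print(len(expanding), len(sliding))  # 9 15
print(len(first_train), first_test.tolist())  # 36 [36, 37, 38, 41, 47]
```

In both helpers the test positions are the last training position plus each horizon, so the training data always precedes the test data.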

## 2. Visualizing Cross-validation Windows

Understanding how your data is split is crucial for proper evaluation.

In [None]:
# Visualize different splitting strategies
fig, axes = plt.subplots(2, 1, figsize=(15, 8))

# Expanding window
plot_windows(
    splitters["Expanding Window"],
    y,
    ax=axes[0],
    title="Expanding Window Cross-validation",
)

# Sliding window
plot_windows(
    splitters["Sliding Window"], y, ax=axes[1], title="Sliding Window Cross-validation"
)

plt.tight_layout()
plt.show()

print("Window Visualization Legend:")
print("- Blue: Training data")
print("- Orange: Test data (forecast horizon)")
print("- Each row represents one cross-validation fold")

## 3. Forecasting Metrics

Different metrics capture different aspects of forecast quality.

In [None]:
from sktime.performance_metrics.forecasting import (
    geometric_mean_absolute_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    mean_squared_percentage_error,
    median_absolute_error,
)

# Create sample predictions for demonstration
y_train, y_test = temporal_train_test_split(y, test_size=12)

# Fit simple forecasters
naive_forecaster = NaiveForecaster(strategy="seasonal_last", sp=12)
naive_forecaster.fit(y_train)
y_pred_naive = naive_forecaster.predict(fh=range(1, 13))

exp_smoothing = ExponentialSmoothing(trend="add", seasonal="multiplicative", sp=12)
exp_smoothing.fit(y_train)
y_pred_exp = exp_smoothing.predict(fh=range(1, 13))

# Calculate various metrics
metrics = {
    "MAE": mean_absolute_error,
    "MSE": mean_squared_error,
    "MAPE": mean_absolute_percentage_error,
    "MedAE": median_absolute_error,
    "MSPE": mean_squared_percentage_error,
    "GMAE": geometric_mean_absolute_error,
}

print("Forecasting Metrics Comparison:")
predictions = {
    "Naive Forecaster": y_pred_naive,
    "Exponential Smoothing": y_pred_exp,
}

for forecaster_name, y_pred in predictions.items():
    print(f"\n{forecaster_name}:")
    for name, metric_func in metrics.items():
        value = metric_func(y_test, y_pred)
        if name == "MAPE":
            # MAPE is returned as a fraction, so format it as a percentage
            print(f"{name:>6}: {value:>8.2%}")
        elif name == "MSPE":
            # MSPE averages squared fractions, so it is not a percentage
            print(f"{name:>6}: {value:>8.4f}")
        else:
            print(f"{name:>6}: {value:>8.2f}")
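
The definitions behind these numbers are short enough to compute by hand. The sketch below applies the standard textbook formulas to a three-point toy example with NumPy. The values in `actual` and `forecast` are made up for illustration, and the code is not sktime's implementation.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 180.0, 400.0])

errors = actual - forecast  # [-10, 20, 0]
abs_errors = np.abs(errors)  # [10, 20, 0]

mae = abs_errors.mean()  # mean absolute error
mse = (errors**2).mean()  # mean squared error
mape = (abs_errors / np.abs(actual)).mean()  # mean absolute percentage error
medae = np.median(abs_errors)  # median absolute error
mspe = ((errors / actual) ** 2).mean()  # mean squared percentage error

print(f"MAE:   {mae:.2f}")  # 10.00
print(f"MSE:   {mse:.2f}")  # 166.67
print(f"MAPE:  {mape:.2%}")  # 6.67%
print(f"MedAE: {medae:.2f}")  # 10.00
print(f"MSPE:  {mspe:.4f}")  # 0.0067
```

The first two points have the same relative error (10%) but different absolute errors, which is why MAPE treats them equally while MAE and MSE do not.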

## 4. Understanding Different Metrics

Each metric has different properties and use cases.

In [None]:
print("Metric Properties and Use Cases:")

print("\n1. MEAN ABSOLUTE ERROR (MAE):")
print("   - Units: Same as original data")
print("   - More robust to outliers than MSE")
print("   - Easy to interpret")
print("   - Use when: You want interpretable, robust error measure")

print("\n2. MEAN SQUARED ERROR (MSE):")
print("   - Units: Squared original units")
print("   - Penalizes large errors more")
print("   - Sensitive to outliers")
print("   - Use when: Large errors are particularly costly")

print("\n3. MEAN ABSOLUTE PERCENTAGE ERROR (MAPE):")
print("   - Units: Percentage")
print("   - Scale-independent")
print("   - Undefined or inflated when actual values are at or near zero")
print("   - Use when: Comparing across different scales")

print("\n4. MEDIAN ABSOLUTE ERROR (MedAE):")
print("   - Units: Same as original data")
print("   - Very robust to outliers")
print("   - Can hide large errors that affect fewer than half of the points")
print("   - Use when: Data has many outliers")

# Demonstrate metric behavior with artificial examples
print("\n\nMetric Behavior Example:")

# Create examples with different error patterns
y_true = pd.Series([100, 100, 100, 100, 100])
# Every error is 5, so MAE = RMSE = MedAE = 5
y_pred_consistent = pd.Series([95, 95, 95, 95, 95])
# Four exact hits and one error of 50: MAE = 10, RMSE ≈ 22.4, MedAE = 0
y_pred_outlier = pd.Series([100, 100, 100, 100, 50])

examples = {
    "Consistent Small Errors": y_pred_consistent,
    "One Large Error": y_pred_outlier,
}

for scenario, y_pred in examples.items():
    print(f"\n{scenario}:")
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    medae = median_absolute_error(y_true, y_pred)

    print(f"  MAE:  {mae:.1f}")
    print(f"  RMSE: {rmse:.1f}")
    print(f"  MedAE: {medae:.1f}")
    print(f"  Errors: {(y_true - y_pred).abs().tolist()}")

## 5. Comprehensive Evaluation with evaluate()

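The `evaluate` function fits the forecaster on each training window produced by the splitter, scores its predictions on the matching test window, and returns one row of results per fold.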

In [None]:
# Set up comprehensive evaluation
forecasters = {
    "Naive": NaiveForecaster(strategy="seasonal_last", sp=12),
    "ExponentialSmoothing": ExponentialSmoothing(
        trend="add", seasonal="multiplicative", sp=12
    ),
}

# Use expanding window splitter for evaluation
cv_splitter = ExpandingWindowSplitter(
    initial_window=36, step_length=6, fh=[1, 3, 6, 12]
)

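# Define metrics to evaluate; evaluate expects metric objects, not strings
from sktime.performance_metrics.forecasting import (
    MeanAbsoluteError,
    MeanAbsolutePercentageError,
    MeanSquaredError,
)

scoring = [
    MeanAbsoluteError(),
    MeanSquaredError(),
    MeanAbsolutePercentageError(),
]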

print("Running comprehensive evaluation...")
print(f"CV Splitter: {cv_splitter.get_n_splits(y)} folds")
print(f"Forecast horizons: {cv_splitter.fh}")
print(f"Metrics: {scoring}")

# Evaluate forecasters
results = {}
for name, forecaster in forecasters.items():
    print(f"\nEvaluating {name}...")
    result = evaluate(
        forecaster=forecaster, y=y, cv=cv_splitter, scoring=scoring, return_data=True
    )
    results[name] = result
    print(f"Completed {name}")

print("\nEvaluation completed!")

## 6. Analyzing Evaluation Results

In [None]:
# Analyze results
print("Cross-validation Results Summary:")
print("=" * 50)

for name, result in results.items():
    print(f"\n{name.upper()}:")

    # Get the evaluation metrics
    if hasattr(result, "columns"):  # DataFrame result
        metrics_df = result
    else:  # Dictionary result
        metrics_df = pd.DataFrame(result)

    print(metrics_df.describe())

# Compare forecasters
print("\n\nFORECASTER COMPARISON:")
print("=" * 30)

comparison_metrics = []
for name, result in results.items():
    # Keep only the score columns, which evaluate prefixes with "test_";
    # this drops timings and the raw series added by return_data=True
    score_columns = [col for col in result.columns if col.startswith("test_")]

    # Calculate mean performance across folds
    mean_metrics = result[score_columns].mean()
    mean_metrics.name = name
    comparison_metrics.append(mean_metrics)

if comparison_metrics:
    comparison_df = pd.concat(comparison_metrics, axis=1)
    print(comparison_df)

    # Every metric used here is an error, so the lowest mean score wins
    print("\nBest Forecaster by Metric:")
    for metric in comparison_df.index:
        best = comparison_df.loc[metric].idxmin()
        print(f"{metric}: {best}")
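
A mean over folds says nothing about how much the scores vary between folds. The sketch below adds a standard error to each mean, using a small hand-made table instead of real results. The numbers are invented for illustration, the column names follow the `test_<MetricName>` pattern used in `evaluate` output, and `summarize_folds` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Invented per-fold scores, for illustration only
fold_scores = pd.DataFrame(
    {
        "test_MeanAbsoluteError": [10.0, 12.0, 14.0],
        "test_MeanAbsolutePercentageError": [0.04, 0.05, 0.06],
    }
)


def summarize_folds(scores):
    """Return the mean and standard error of each score column across folds."""
    n_folds = len(scores)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(n_folds)  # sample std / sqrt(n)
    return pd.DataFrame({"mean": mean, "sem": sem, "n_folds": n_folds})


summary = summarize_folds(fold_scores)
print(summary)
```

Treat the standard error as a rough guide only: folds from an expanding window share most of their training data, so their scores are not independent.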

## 7. Forecast Horizon Analysis

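Forecast error usually grows with the horizon, so it helps to look at accuracy per horizon as well as the average over all horizons. Note that `evaluate` reports one score per fold, aggregated over every horizon in `fh`; the cell below produces a per-horizon breakdown only if the results contain an `fh` column and otherwise falls back to overall statistics.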

In [None]:
# Analyze performance by forecast horizon
print("Performance by Forecast Horizon:")
print("=" * 35)

# Create detailed horizon analysis
horizon_analysis = {}

for name, result in results.items():
    print(f"\n{name}:")

    if hasattr(result, "columns"):  # DataFrame result
        metrics_df = result
    else:
        metrics_df = pd.DataFrame(result)

    # Group by forecast horizon if available
    if "fh" in metrics_df.columns:
        horizon_stats = metrics_df.groupby("fh").agg(["mean", "std"])
        print(horizon_stats)
        horizon_analysis[name] = horizon_stats
    else:
        # If fh not in columns, show overall statistics
        print("Forecast horizon information not available in results")
        print(metrics_df.describe())

# Visualize horizon performance if available
if horizon_analysis:
    print("\nCreating horizon performance visualization...")

    # Plot performance by horizon for MAPE
    fig, ax = plt.subplots(figsize=(10, 6))

    for name, stats in horizon_analysis.items():
        if "test_MeanAbsolutePercentageError" in stats.columns:
            mape_mean = stats[("test_MeanAbsolutePercentageError", "mean")]
            mape_std = stats[("test_MeanAbsolutePercentageError", "std")]

            ax.plot(mape_mean.index, mape_mean.values, "o-", label=name)
            ax.fill_between(
                mape_mean.index,
                mape_mean.values - mape_std.values,
                mape_mean.values + mape_std.values,
                alpha=0.3,
            )

    ax.set_xlabel("Forecast Horizon")
    ax.set_ylabel("MAPE")
    ax.set_title("Forecast Performance by Horizon")
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
else:
    print("\nDetailed horizon analysis not available with current results structure")
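
When `evaluate` is called with `return_data=True`, each fold's test and predicted series are stored in the `y_test` and `y_pred` columns, which is enough to rebuild per-horizon errors. The sketch below does this on a small hand-made frame with invented values. The helper `errors_by_horizon` is hypothetical, and it assumes every series holds one value per entry of `fh`, in the same order.

```python
import numpy as np
import pandas as pd

# Invented stand-in for evaluate output: two folds, horizons 1 and 3
toy_results = pd.DataFrame(
    {
        "y_test": [
            pd.Series([100.0, 200.0], index=[10, 12]),
            pd.Series([100.0, 200.0], index=[11, 13]),
        ],
        "y_pred": [
            pd.Series([110.0, 180.0], index=[10, 12]),
            pd.Series([90.0, 150.0], index=[11, 13]),
        ],
    }
)


def errors_by_horizon(results, fh):
    """Return mean and std of the absolute percentage error per horizon."""
    rows = []
    for y_test, y_pred in zip(results["y_test"], results["y_pred"]):
        actual = np.asarray(y_test, dtype=float).ravel()
        predicted = np.asarray(y_pred, dtype=float).ravel()
        ape = np.abs(actual - predicted) / np.abs(actual)
        # Pair each error with the horizon it belongs to
        rows.extend({"fh": h, "ape": e} for h, e in zip(fh, ape))
    return pd.DataFrame(rows).groupby("fh")["ape"].agg(["mean", "std"])


horizon_errors = errors_by_horizon(toy_results, fh=[1, 3])
print(horizon_errors)
```

With the results from section 5, the equivalent call would be `errors_by_horizon(results["Naive"], fh=[1, 3, 6, 12])`, provided those columns are present.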

## 8. Best Practices for Cross-validation

Key guidelines for proper time series evaluation.

In [None]:
print("Cross-validation Best Practices:")
print("=" * 35)

print("\n1. TEMPORAL ORDERING:")
print("   ✓ Always respect temporal order")
print("   ✓ Training data must come before test data")
print("   ✓ Never use future information for training")

print("\n2. SPLITTER SELECTION:")
print("   • ExpandingWindowSplitter: When you want to use all available history")
print("   • SlidingWindowSplitter: When recent data is most relevant")
print("   • Consider data size and computational constraints")

print("\n3. FORECAST HORIZON:")
print("   • Test multiple horizons: [1, 3, 6, 12]")
print("   • Include your actual use case horizon")
print("   • Consider seasonal patterns (e.g., 12 for monthly data)")

print("\n4. METRICS SELECTION:")
print("   • Use multiple metrics to get complete picture")
print("   • MAPE for scale-independent comparison")
print("   • MAE/RMSE for absolute error understanding")
print("   • Choose metrics aligned with business objectives")

print("\n5. STATISTICAL SIGNIFICANCE:")
print("   • Use sufficient CV folds (typically 3-10)")
print("   • Report confidence intervals when possible")
print("   • Consider seasonal effects in fold selection")

# Demonstrate proper vs improper CV
print("\n\nCOMMON MISTAKES TO AVOID:")
print("=" * 30)

print("\n❌ WRONG - Random splits:")
print("   from sklearn.model_selection import KFold")
print("   cv = KFold(n_splits=5, shuffle=True)  # Destroys temporal order!")

print("\n✓ CORRECT - Temporal splits:")
print("   from sktime.split import ExpandingWindowSplitter")
print("   cv = ExpandingWindowSplitter(initial_window=36)")

print("\n❌ WRONG - Using future data for preprocessing:")
print("   scaler.fit(X_full)  # Uses future data!")
print("   X_scaled = scaler.transform(X_full)")

print("\n✓ CORRECT - Preprocessing in pipeline:")
print("   from sktime.forecasting.compose import ForecastingPipeline")
print("   pipeline = ForecastingPipeline([('scale', scaler), ('forecast', model)])")

print("\n❌ WRONG - Single metric evaluation:")
print("   # Only using RMSE might miss important patterns")

print("\n✓ CORRECT - Multiple metrics:")
print("   scoring = ['mean_absolute_error', 'mean_absolute_percentage_error', ...]")
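
The preprocessing mistake is easy to reproduce without any forecasting library. The sketch below standardizes a toy trending series with NumPy, once with statistics from the full series (leaky) and once with statistics from the training part only. The series and the 8/2 split are arbitrary choices for illustration.

```python
import numpy as np

series = np.arange(10, dtype=float)  # toy series with an upward trend
train, test = series[:8], series[8:]

# Leaky: the mean of the full series already "knows" about the test period
leaky_mean = series.mean()

# Leak-free: estimate the statistics on training data only, then reuse them
train_mean, train_std = train.mean(), train.std()
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_mean, leaky_mean)  # 3.5 4.5
```

Because the series trends upward, the leaky mean is shifted toward the test values. A pipeline avoids this by refitting the transformation inside each fold.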

## Summary

In this tutorial, you learned:

1. **Time Series Splitters**: `ExpandingWindowSplitter` and `SlidingWindowSplitter` for proper CV
2. **Window Visualization**: Using `plot_windows` to understand data splits
3. **Forecasting Metrics**: MAE, MSE, MAPE, MedAE and their properties
4. **Comprehensive Evaluation**: Using `evaluate()` for robust model assessment
5. **Results Analysis**: Interpreting CV results and comparing forecasters
6. **Horizon Analysis**: Understanding how performance varies with forecast distance
7. **Best Practices**: Guidelines for proper time series evaluation

## Key Takeaways

- **Temporal Order**: Always respect time ordering in cross-validation
- **Multiple Metrics**: Use several metrics to get a complete picture
- **Multiple Horizons**: Test performance at different forecast distances
- **Statistical Rigor**: Use sufficient folds and report uncertainty
- **Business Alignment**: Choose metrics that reflect real-world costs

## Next Steps

- Learn "Hyperparameter Tuning" to optimize model performance using CV
- Explore "Probabilistic Forecasting" for uncertainty quantification
- Try "Global Forecasting" for advanced evaluation techniques