# Model Evaluation

This notebook walks through evaluating ERCOT RTLMP (real-time locational marginal price) spike prediction models. It covers calculating performance metrics, visualizing them, and comparing results across price thresholds and time periods.

**Learning Objectives:**
*   Understand the importance of model evaluation in the ERCOT RTLMP spike prediction system.
*   Learn how to calculate and interpret various performance metrics, including AUC, precision, recall, and Brier score.
*   Visualize model performance using ROC curves, precision-recall curves, calibration curves, and confusion matrices.
*   Analyze model performance across different price thresholds and time periods.
*   Compare the performance of multiple models and identify the best model for different scenarios.
*   Generate a comprehensive evaluation report for stakeholders.

# Setup and Imports

In [None]:
# External imports
import pandas as pd  # version 2.0+
import numpy as np  # version 1.24+
import matplotlib.pyplot as plt  # version 3.7+
import seaborn as sns  # version 0.12+
import plotly.express as px  # version 5.14+
import plotly.graph_objects as go  # version 5.14+
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, average_precision_score, brier_score_loss  # version 1.2+
from sklearn.model_selection import train_test_split  # version 1.2+
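import datetime  # Standard
import pathlib  # Standard
from typing import Dict, List, Tuple  # Standard; needed for the type hints in the utility functions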

In [None]:
# Internal modules
from src.backend.models.evaluation import ModelEvaluator, ThresholdOptimizer
from src.backend.visualization.performance_plots import ModelPerformancePlotter
from src.backend.backtesting.performance_metrics import BacktestingMetricsCalculator
from src.backend.models.xgboost_model import XGBoostModel
from src.backend.data.fetchers.ercot_api import ERCOTDataFetcher
from src.backend.features.feature_pipeline import FeaturePipeline

In [None]:
# Global constants and utility functions
PRICE_THRESHOLDS = [50.0, 100.0, 200.0, 300.0]
DEFAULT_NODE = 'HB_NORTH'
EVALUATION_METRICS = ['accuracy', 'precision', 'recall', 'f1', 'auc', 'brier_score']
MODEL_REGISTRY_PATH = '../models/registry'

def create_target_variables(rtlmp_df: pd.DataFrame, thresholds: List[float]) -> pd.DataFrame:
    """Creates binary target variables for different price thresholds"""
    # Hourly maximum price; computed once because it does not depend on the threshold
    hourly_data = rtlmp_df.groupby(pd.Grouper(key='timestamp', freq='H'))['price'].max()
    # Index the targets by hour so each column aligns with hourly_data rather than the raw rows
    targets = pd.DataFrame(index=hourly_data.index)
    for threshold in thresholds:
        targets[f'spike_occurred_{threshold}'] = (hourly_data > threshold).astype(int)
    return targets

def load_and_prepare_data(start_date: str, end_date: str, node: str, thresholds: List[float]) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Loads historical data and prepares features and targets for model evaluation"""
    data_fetcher = ERCOTDataFetcher()
    rtlmp_df = data_fetcher.fetch_historical_data(start_date=start_date, end_date=end_date, identifiers=[node])
    grid_df = data_fetcher.fetch_historical_data(start_date=start_date, end_date=end_date, identifiers=[])
    feature_pipeline = FeaturePipeline()
    feature_pipeline.add_data_source('rtlmp_df', rtlmp_df)
    feature_pipeline.add_data_source('grid_df', grid_df)
    features = feature_pipeline.create_features()
    targets = create_target_variables(rtlmp_df, thresholds)
    return features, targets

def split_evaluation_data(features: pd.DataFrame, targets: pd.DataFrame, test_size: float, random_state: int) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Splits data into training and evaluation sets"""
    X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test

def compare_thresholds(metrics_by_threshold: Dict[float, Dict[str, float]], metrics_to_compare: List[str]) -> pd.DataFrame:
    """Compares model performance across different price thresholds"""
    comparison_data = []
    for threshold, metrics in metrics_by_threshold.items():
        threshold_metrics = {metric: metrics[metric] for metric in metrics_to_compare}
        threshold_metrics['threshold'] = threshold
        comparison_data.append(threshold_metrics)
    comparison_df = pd.DataFrame(comparison_data).set_index('threshold')
    return comparison_df

def plot_feature_importance(model: XGBoostModel, top_n: int) -> Tuple[plt.Figure, plt.Axes]:
    """Plots feature importance from a trained model"""
    importance = model.get_feature_importance()
    top_features = sorted(importance, key=importance.get, reverse=True)[:top_n]
    values = [importance[feature] for feature in top_features]

    fig, ax = plt.subplots(figsize=(10, 8))
    ax.barh(top_features, values)
    ax.set_xlabel('Importance')
    ax.set_ylabel('Feature')
    ax.set_title(f'Top {top_n} Feature Importances')
    plt.tight_layout()
    return fig, ax

# Data Preparation

Load and prepare data for model evaluation. This includes fetching historical RTLMP data, creating target variables for different price thresholds, and splitting the data into training and evaluation sets.
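
To make the target construction concrete, here is a minimal sketch on toy data. The helper name `hourly_spike_targets`, the 5-minute sampling interval, and the price values are assumptions for illustration only; the column names follow the `spike_occurred_{threshold}` pattern used in this notebook.

```python
import pandas as pd

def hourly_spike_targets(prices: pd.DataFrame, thresholds: list[float]) -> pd.DataFrame:
    """Return one 0/1 column per threshold, indexed by hour (hypothetical helper)."""
    # Maximum price observed within each hour
    hourly_max = prices.set_index("timestamp")["price"].resample("60min").max()
    targets = pd.DataFrame(index=hourly_max.index)
    for threshold in thresholds:
        # An hour counts as a spike when its maximum price strictly exceeds the threshold
        targets[f"spike_occurred_{threshold}"] = (hourly_max > threshold).astype(int)
    return targets

# Toy prices covering two hours, with a single 120.0 interval in the second hour
toy_prices = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01 00:00", periods=24, freq="5min"),
    "price": [30.0] * 20 + [120.0] + [30.0] * 3,
})
toy_targets = hourly_spike_targets(toy_prices, [50.0, 100.0, 200.0])
print(toy_targets)
```

Because the comparison is strict, an hour whose maximum price equals the threshold exactly is not labeled as a spike.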

In [None]:
start_date = '2023-01-01'
end_date = '2023-06-30'
node = DEFAULT_NODE
thresholds = PRICE_THRESHOLDS

features, targets = load_and_prepare_data(start_date, end_date, node, thresholds)
X_train, X_test, y_train, y_test = split_evaluation_data(features, targets, test_size=0.2, random_state=42)
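
Note that `train_test_split` shuffles rows by default, so test hours end up interleaved with training hours. For time-ordered data, a chronological hold-out is a common alternative. The sketch below assumes both frames share a sorted-able datetime index; `chronological_split` and the toy column `load_forecast` are hypothetical names, not part of the project code.

```python
import pandas as pd

def chronological_split(feature_frame, target_frame, test_size):
    """Hold out the most recent `test_size` fraction of rows, without shuffling."""
    feature_frame = feature_frame.sort_index()
    target_frame = target_frame.loc[feature_frame.index]
    n_test = int(round(len(feature_frame) * test_size))
    split_at = len(feature_frame) - n_test
    return (feature_frame.iloc[:split_at], feature_frame.iloc[split_at:],
            target_frame.iloc[:split_at], target_frame.iloc[split_at:])

# Ten toy hourly rows
idx = pd.date_range("2023-01-01", periods=10, freq="60min")
toy_X = pd.DataFrame({"load_forecast": range(10)}, index=idx)
toy_y = pd.DataFrame({"spike_occurred_100.0": [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]}, index=idx)

X_tr, X_te, y_tr, y_te = chronological_split(toy_X, toy_y, test_size=0.2)
print(X_tr.index.max(), "<", X_te.index.min())
```

With this split, every test row is later than every training row, which is closer to how a forecasting model is used in practice.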

# Basic Model Evaluation

Evaluate a single model with standard metrics such as AUC, precision, recall, and Brier score. This provides a baseline understanding of the model's performance.

In [None]:
# Load or train an XGBoost model
model = XGBoostModel(model_id='eval_model')
# Train the model if it hasn't been trained yet
if not model.is_trained():
    model.train(X_train, y_train['spike_occurred_100.0'])

# Initialize ModelEvaluator
evaluator = ModelEvaluator()

# Evaluate the model on test data
metrics = evaluator.evaluate(model, X_test, y_test['spike_occurred_100.0'])

# Display evaluation results
print(metrics)
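
The internals of `ModelEvaluator.evaluate` are not shown here. As a reference point, the same four metrics can be computed directly with scikit-learn. The labels and probabilities below are toy values, and the 0.5 cutoff used to derive hard predictions is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    brier_score_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy ground truth and predicted spike probabilities
toy_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
toy_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.2, 0.8, 0.9])
# Hard predictions at an assumed 0.5 probability cutoff
toy_pred = (toy_prob >= 0.5).astype(int)

summary = {
    "auc": roc_auc_score(toy_true, toy_prob),            # ranking quality, cutoff-free
    "precision": precision_score(toy_true, toy_pred),    # share of predicted spikes that were real
    "recall": recall_score(toy_true, toy_pred),          # share of real spikes that were caught
    "brier_score": brier_score_loss(toy_true, toy_prob)  # mean squared error of the probabilities
}
print(summary)
```

AUC and the Brier score use the probabilities directly, while precision and recall depend on the chosen cutoff.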

# Performance Visualization

Visualize model performance with various plots, including ROC curves, precision-recall curves, calibration curves, and confusion matrices. These visualizations provide insights into different aspects of model behavior.

In [None]:
# Initialize ModelPerformancePlotter
plotter = ModelPerformancePlotter()
plotter.load_model_data(model_id='eval_model', y_true=y_test['spike_occurred_100.0'], y_prob=model.predict_proba(X_test), y_pred=model.predict(X_test))

# Create ROC curve plot
fig_roc, ax_roc = plotter.plot_roc_curve()
plt.show()

# Create precision-recall curve plot
fig_pr, ax_pr = plotter.plot_precision_recall_curve()
plt.show()

# Create calibration curve plot
fig_cal, ax_cal = plotter.plot_calibration_curve()
plt.show()

# Create confusion matrix plot
fig_cm, ax_cm = plotter.plot_confusion_matrix()
plt.show()

# Create comprehensive performance dashboard
fig_dashboard = plotter.create_performance_dashboard()
fig_dashboard.show()
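
The data behind these four plots can also be produced with scikit-learn, which is useful for checking a plot against raw numbers. This is a sketch on synthetic, well-calibrated-by-construction probabilities; it does not reproduce what `ModelPerformancePlotter` does internally.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve

rng = np.random.default_rng(0)
syn_prob = rng.uniform(0, 1, size=500)
# Each label is 1 with probability equal to its predicted probability
syn_true = (rng.uniform(0, 1, size=500) < syn_prob).astype(int)

# ROC curve: false positive rate vs. true positive rate over all cutoffs
fpr, tpr, roc_cutoffs = roc_curve(syn_true, syn_prob)
# Precision-recall curve: one more (precision, recall) point than cutoffs
precision, recall, pr_cutoffs = precision_recall_curve(syn_true, syn_prob)
# Calibration curve: observed spike rate vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(syn_true, syn_prob, n_bins=5)
# Confusion matrix at an assumed 0.5 cutoff; rows are actual, columns are predicted
cm = confusion_matrix(syn_true, (syn_prob >= 0.5).astype(int))
print(cm)
```

For a well-calibrated model, `frac_pos` tracks `mean_pred` closely; systematic gaps indicate over- or under-confident probabilities.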

# Threshold Analysis

Analyze model performance across different price thresholds. This helps understand the model's sensitivity to varying spike definitions and identify the optimal threshold for specific use cases.

In [None]:
# Evaluate model across different thresholds
evaluator = ModelEvaluator()
metrics_by_threshold = evaluator.evaluate_by_threshold(model, X_test, y_test, PRICE_THRESHOLDS)

# Compare metrics between thresholds
metrics_to_compare = ['precision', 'recall', 'f1', 'auc']
comparison_df = compare_thresholds(metrics_by_threshold, metrics_to_compare)
print(comparison_df)

# Initialize ThresholdOptimizer
optimizer = ThresholdOptimizer()

# Find optimal threshold for different metrics
optimal_threshold = optimizer.optimize_for_model(model, X_test, y_test['spike_occurred_100.0'])
print(f"Optimal threshold: {optimal_threshold}")

# Plot threshold optimization curves
fig_opt, ax_opt = optimizer.plot_optimization_curve()
plt.show()
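
Two different thresholds are in play here: the price threshold that defines a spike, and the probability cutoff that turns a predicted probability into a yes/no alert. What `ThresholdOptimizer` optimizes is not shown in this notebook; the sketch below assumes the second meaning and picks the cutoff with the highest F1 from a small candidate list. `best_f1_cutoff` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_cutoff(y_true, y_prob, candidates):
    """Return (cutoff, f1) for the candidate cutoff with the highest F1."""
    scores = [
        f1_score(y_true, (y_prob >= c).astype(int), zero_division=0)
        for c in candidates
    ]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# Toy labels and probabilities
toy_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
toy_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.2, 0.8, 0.9])

cutoff, f1 = best_f1_cutoff(toy_true, toy_prob, np.array([0.3, 0.4, 0.5, 0.7]))
print(f"cutoff={cutoff}, f1={f1:.3f}")
```

F1 weights precision and recall equally. If missed spikes are costlier than false alarms, a recall-weighted objective may be a better fit.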

# Temporal Analysis

Analyze model performance over different time periods, such as by hour of day, day of week, or month. This helps identify patterns in model performance and potential areas for improvement.

In [None]:
# Evaluate model performance over time
evaluator = ModelEvaluator()
time_column = 'timestamp'
time_grouping = 'month'
temporal_metrics = evaluator.evaluate_over_time(model, features, targets['spike_occurred_100.0'], time_column, time_grouping)

# Plot metrics by month (matches time_grouping above)
fig_temporal, ax_temporal = plt.subplots(figsize=(10, 6))
ax_temporal.plot(temporal_metrics.index, temporal_metrics['precision'], label='Precision')
ax_temporal.plot(temporal_metrics.index, temporal_metrics['recall'], label='Recall')
ax_temporal.set_xlabel('Month')
ax_temporal.set_ylabel('Metric Value')
ax_temporal.set_title('Temporal Performance Analysis')
ax_temporal.legend()
plt.show()
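
A temporal breakdown of this kind can be sketched with a plain pandas group-by. The column names `y_true` and `y_pred` and the helper `metrics_by_month` are assumptions for illustration; `evaluate_over_time` may work differently.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def metrics_by_month(frame: pd.DataFrame) -> pd.DataFrame:
    """Precision and recall per calendar month for columns y_true / y_pred."""
    rows = {}
    for month, group in frame.groupby(frame["timestamp"].dt.month):
        rows[month] = {
            "precision": precision_score(group["y_true"], group["y_pred"], zero_division=0),
            "recall": recall_score(group["y_true"], group["y_pred"], zero_division=0),
            # Number of actual spikes, to judge how reliable the two ratios are
            "n_spikes": int(group["y_true"].sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index").rename_axis("month")

toy_frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-05", "2023-01-06", "2023-01-07",
                                 "2023-02-01", "2023-02-02", "2023-02-03"]),
    "y_true": [1, 0, 1, 0, 1, 0],
    "y_pred": [1, 1, 0, 0, 1, 0],
})
monthly = metrics_by_month(toy_frame)
print(monthly)
```

Reporting the spike count next to each metric helps avoid over-reading months that contain only a handful of spikes.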

# Model Comparison

Compare performance between multiple models. This helps identify the best model for different scenarios and understand the trade-offs between different model configurations.

In [None]:
# Load multiple models for comparison
model1 = XGBoostModel(model_id='model1')
model2 = XGBoostModel(model_id='model2')

# Train models if they haven't been trained yet
if not model1.is_trained():
    model1.train(X_train, y_train['spike_occurred_100.0'])
if not model2.is_trained():
    model2.train(X_train, y_train['spike_occurred_100.0'])

# Initialize ModelEvaluator
evaluator = ModelEvaluator()

# Compare models using ModelEvaluator
comparison_df = evaluator.compare_models([model1, model2], X_test, y_test['spike_occurred_100.0'])
print(comparison_df)

# Visualize model comparison with bar charts
plotter = ModelPerformancePlotter()
model_metrics = {}
model_metrics['model1'] = evaluator.evaluate(model1, X_test, y_test['spike_occurred_100.0'])
model_metrics['model2'] = evaluator.evaluate(model2, X_test, y_test['spike_occurred_100.0'])
fig_compare, ax_compare = plotter.plot_metric_comparison(model_metrics)
plt.show()

# Compare models across different thresholds
metrics_by_threshold_model1 = evaluator.evaluate_by_threshold(model1, X_test, y_test, PRICE_THRESHOLDS)
metrics_by_threshold_model2 = evaluator.evaluate_by_threshold(model2, X_test, y_test, PRICE_THRESHOLDS)

# Identify the model with the highest AUC at any threshold
best_model = None
best_auc = 0
results_by_model = {'model1': metrics_by_threshold_model1, 'model2': metrics_by_threshold_model2}
# Loop variables are named so they do not overwrite `metrics` and `metrics_by_threshold` from earlier cells
for model_id, model_results in results_by_model.items():
    for threshold, threshold_metrics in model_results.items():
        if threshold_metrics['auc'] > best_auc:
            best_auc = threshold_metrics['auc']
            best_model = model_id

print(f"Best model based on AUC: {best_model} with AUC = {best_auc}")
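
The loop above keeps a single overall maximum. Since each price threshold defines a different target, it can be more informative to pick a winner per threshold. The sketch below assumes results shaped like the output of `evaluate_by_threshold` (threshold mapped to a metrics dict); the helper name and the numbers are placeholders, not results from any real model.

```python
import pandas as pd

def best_model_per_threshold(results: dict) -> pd.Series:
    """results maps model_id -> {threshold -> {"auc": value}}; returns the winner per threshold."""
    auc_table = pd.DataFrame({
        model_id: {thr: m["auc"] for thr, m in by_thr.items()}
        for model_id, by_thr in results.items()
    })
    # Rows are thresholds, columns are models; idxmax picks the best column per row
    return auc_table.idxmax(axis=1)

# Placeholder values purely for illustration
placeholder_results = {
    "model1": {50.0: {"auc": 0.80}, 100.0: {"auc": 0.70}},
    "model2": {50.0: {"auc": 0.75}, 100.0: {"auc": 0.72}},
}
winners = best_model_per_threshold(placeholder_results)
print(winners)
```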

# Backtesting Analysis

Evaluate model performance through backtesting, simulating historical forecasts over a user-specified time window. This provides a more realistic assessment of model performance in a production environment.

In [None]:
# Initialize BacktestingMetricsCalculator
calculator = BacktestingMetricsCalculator()

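# Run backtesting for a specific time window
# Note: this window lies inside the training date range above, so these metrics are not out-of-sample
backtest_start = '2023-01-01'
backtest_end = '2023-01-31'
# load_and_prepare_data returns (features, targets), so predictions are generated from the features
backtest_features, backtest_actuals = load_and_prepare_data(backtest_start, backtest_end, node, thresholds)
# Assumption: calculate_all_metrics expects predicted probabilities, like the plotter above
backtest_predictions = model.predict_proba(backtest_features)
backtesting_metrics = calculator.calculate_all_metrics(backtest_predictions, backtest_actuals, model_id='eval_model', thresholds=thresholds)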

# Visualize backtesting results
fig_backtest, ax_backtest = plt.subplots(figsize=(10, 6))
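for threshold, threshold_metrics in backtesting_metrics.items():
    # Convert the dict views to lists so matplotlib receives plain sequences
    ax_backtest.plot(list(threshold_metrics.keys()), list(threshold_metrics.values()), label=f'Threshold {threshold}')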
ax_backtest.set_xlabel('Metric')
ax_backtest.set_ylabel('Value')
ax_backtest.set_title('Backtesting Metrics')
ax_backtest.legend()
plt.show()

# Generate a report from the backtesting metrics
backtesting_report = calculator.generate_report(model_id='eval_model', output_path='backtest_report.json')
print(backtesting_report)
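
One common way to simulate historical forecasts is a walk-forward loop: train on everything up to a cut-off, score the next block, then move the cut-off forward. The sketch below uses scikit-learn's `TimeSeriesSplit` with a logistic regression on synthetic data as a stand-in. It is not the project's backtesting framework, and the feature matrix is random noise with one informative column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
syn_X = rng.normal(size=(600, 3))
# Synthetic labels: spikes are more likely when the first feature is high
syn_y = (syn_X[:, 0] + rng.normal(scale=0.5, size=600) > 1.0).astype(int)

fold_aucs = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(syn_X):
    # Every training row precedes every test row, so no future data leaks in
    assert train_idx.max() < test_idx.min()
    clf = LogisticRegression().fit(syn_X[train_idx], syn_y[train_idx])
    fold_prob = clf.predict_proba(syn_X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(syn_y[test_idx], fold_prob))

print([round(a, 3) for a in fold_aucs])
```

The spread of the per-fold scores gives a rough sense of how stable performance is over time, which a single train/test split cannot show.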

# Feature Importance Analysis

Analyze feature importance and impact on model performance. This helps identify key drivers of model performance and potential areas for feature engineering improvements.

In [None]:
# Extract feature importance from the model
fig_importance, ax_importance = plot_feature_importance(model, top_n=10)
plt.show()
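
What `get_feature_importance()` returns is not shown here. Importances built into tree models are derived from the training process, so a useful cross-check is permutation importance measured on held-out data. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost model, with synthetic data and hypothetical feature names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
syn_X = rng.normal(size=(400, 3))
syn_y = (syn_X[:, 0] > 0.5).astype(int)  # only the first column carries signal
names = ["signal", "noise_a", "noise_b"]

# Fit on the first 300 rows, measure importance on the remaining 100
clf = GradientBoostingClassifier(random_state=0).fit(syn_X[:300], syn_y[:300])
result = permutation_importance(clf, syn_X[300:], syn_y[300:], n_repeats=10, random_state=0)

# Sort features by the mean drop in score when their values are shuffled
ranked = sorted(zip(names, result.importances_mean), key=lambda kv: kv[1], reverse=True)
for name, drop in ranked:
    print(f"{name}: {drop:.3f}")
```

A feature that ranks high in the model's built-in importances but shows almost no permutation importance on held-out data may be one the model overfits to.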

# Comprehensive Evaluation Report

Generate a complete evaluation report for stakeholders, summarizing key findings from the evaluation process. This report should include performance metrics, visualizations, and insights into model behavior.

In [None]:
# Generate evaluation report using ModelEvaluator
evaluator = ModelEvaluator()
report = evaluator.generate_report(model, X_test, y_test['spike_occurred_100.0'], output_path='evaluation_report.json')

# Format and display the report
print(report)
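
The structure of the file written by `generate_report` is not documented in this notebook. If a custom report is needed, a minimal JSON layout could look like the sketch below. The field names and the `build_report` helper are assumptions, and the metric values are left as `None` placeholders.

```python
import json
from datetime import datetime, timezone

def build_report(model_id: str, node: str, results_by_threshold: dict) -> str:
    """Serialize evaluation results to a JSON string (hypothetical layout)."""
    report_dict = {
        "model_id": model_id,
        "node": node,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        # JSON object keys are strings, so thresholds are converted explicitly
        "metrics_by_threshold": {str(thr): m for thr, m in results_by_threshold.items()},
    }
    return json.dumps(report_dict, indent=2, sort_keys=True)

report_json = build_report("eval_model", "HB_NORTH", {100.0: {"auc": None, "precision": None}})
print(report_json)
```

Recording the model ID, node, and generation time alongside the metrics makes it easier to trace a report back to the run that produced it.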

# Conclusion

This notebook evaluated ERCOT RTLMP spike prediction models using standard classification metrics, diagnostic plots, breakdowns by price threshold and time period, model comparison, backtesting, and feature importance.

After running it on your own data, summarize the key findings here: where the model is strong or weak (for example by threshold, by month, or in calibration), and which improvements and next steps those results suggest.