# üìä Comprehensive Model Evaluation Pipeline

**Advanced model evaluation with cross-validation, statistical analysis, and healthcare-specific metrics**

## üéØ **Evaluation Objectives:**
1. **üìà Comprehensive Metrics** - MAE, RMSE, R¬≤, healthcare-specific accuracy measures
2. **üîÑ Cross-Validation** - K-fold validation with statistical significance testing
3. **üìä Statistical Analysis** - Confidence intervals, hypothesis testing, model comparison
4. **üè• Healthcare-Specific Metrics** - Risk stratification accuracy, clinical thresholds
5. **üìù Evaluation Logging** - Structured results storage for tracking and comparison

## üõ†Ô∏è **Evaluation Components:**
- **Multi-Algorithm Testing**: XGBoost variants, linear baselines, ensemble methods
- **Cross-Validation Framework**: Distributed K-fold validation using Snowpark
- **Healthcare Metrics**: Risk category accuracy, sensitivity/specificity by risk level
- **Statistical Testing**: Paired t-tests, confidence intervals, effect sizes
- **Results Logging**: Comprehensive evaluation tracking in Snowflake tables

**Prerequisites:** Run notebooks 04 (Feature Engineering) and 05 (Model Training) first


In [19]:
# Environment Setup for Model Evaluation
import sys
import os
import numpy as np
import datetime
from typing import Dict, List, Any, Tuple

# Fix path for snowflake_connection module
current_dir = os.getcwd()
if "notebooks" in current_dir:
    src_path = os.path.join(current_dir, "..", "src")
else:
    src_path = os.path.join(current_dir, "src")

sys.path.append(src_path)
print(f"üìÅ Added to Python path: {src_path}")

from snowflake_connection import get_session
from snowflake.snowpark.functions import (
    col, lit, when, count, avg, sum as sum_, max as max_, min as min_,
    stddev, variance, abs as abs_, sqrt, pow as pow_
)
from snowflake.snowpark.types import (
    StructType, StructField, StringType, DoubleType, IntegerType,
    FloatType, BooleanType
)

# ML imports
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.modeling.linear_model import LinearRegression
from snowflake.ml.modeling.metrics import mean_absolute_error, mean_squared_error
from snowflake.ml.registry import Registry

# Get Snowflake session
session = get_session()
print("‚úÖ Environment ready for comprehensive model evaluation")
print("üìä Capabilities: Cross-validation, Statistical Analysis, Healthcare Metrics")
print("üî¨ Tools: Multiple algorithms, significance testing, evaluation logging")


üìÅ Added to Python path: /Users/beddy/Desktop/Github/Snowflake_ML_HCLS/notebooks/../src
üîÑ Reusing existing Snowflake session
‚úÖ Environment ready for comprehensive model evaluation
üìä Capabilities: Cross-validation, Statistical Analysis, Healthcare Metrics
üî¨ Tools: Multiple algorithms, significance testing, evaluation logging


In [20]:
# Data Loading and Preparation
print("üìÇ Loading and preparing evaluation datasets...")

# Load the processed feature data
feature_data_df = session.table("ADVERSE_EVENT_MONITORING.DEMO_ANALYTICS.FAERS_HCLS_FEATURES_FINAL")
print(f"‚úÖ Loaded feature dataset with {feature_data_df.count():,} records")

# Analyze available columns
available_columns = [f.name for f in feature_data_df.schema.fields]
print(f"üìä Available columns: {len(available_columns)}")
print(f"   Sample columns: {', '.join(available_columns[:8])}...")

# Define feature sets for evaluation
core_features = ["AGE", "NUM_CONDITIONS", "NUM_MEDICATIONS", "NUM_CLAIMS"]
faers_features = ["MAX_MEDICATION_RISK", "HIGH_RISK_MEDICATION_COUNT", "WARFARIN_RISK"]
derived_features = ["AGE_GROUP", "MEDICATION_BURDEN", "CLAIMS_CATEGORY"]

# Build feature set based on availability
evaluation_features = []
evaluation_features.extend([f for f in core_features if f in available_columns])
evaluation_features.extend([f for f in faers_features if f in available_columns])
evaluation_features.extend([f for f in derived_features if f in available_columns])

print(f"üìã Selected {len(evaluation_features)} features for evaluation:")
for i, feature in enumerate(evaluation_features, 1):
    print(f"   {i:2d}. {feature}")

# Define target variable
target_col = "CONTINUOUS_RISK_TARGET" if "CONTINUOUS_RISK_TARGET" in available_columns else "AGE"
print(f"üéØ Target variable: {target_col}")

# Create evaluation dataset with clean data
eval_data_df = feature_data_df.select(
    evaluation_features + [target_col]
).filter(
    col(target_col).is_not_null()
)

# Add patient ID for tracking using row number
from snowflake.snowpark.functions import row_number
from snowflake.snowpark.window import Window

# Create a simple numeric patient ID to avoid string concatenation issues
window_spec = Window.order_by(lit(1))
eval_data_df = eval_data_df.with_column(
    "PATIENT_ID", row_number().over(window_spec)
)

total_records = eval_data_df.count()
print(f"‚úÖ Evaluation dataset prepared: {total_records:,} clean records")

# Create train/test split for consistent evaluation
# Use modulo of PATIENT_ID for deterministic split
train_df = eval_data_df.filter((col("PATIENT_ID") % lit(10)) < lit(8))
test_df = eval_data_df.filter((col("PATIENT_ID") % lit(10)) >= lit(8))

print(f"üìä Dataset split:")
print(f"   Training: {train_df.count():,} records")
print(f"   Testing: {test_df.count():,} records")


üìÇ Loading and preparing evaluation datasets...
‚úÖ Loaded feature dataset with 41,616 records
üìä Available columns: 25
   Sample columns: PATIENT_ID, AGE, IS_MALE, NUM_CONDITIONS, NUM_MEDICATIONS, NUM_CLAIMS, MEDICATION_COUNT, HAS_CARDIOVASCULAR_DISEASE...
üìã Selected 7 features for evaluation:
    1. AGE
    2. NUM_CONDITIONS
    3. NUM_MEDICATIONS
    4. NUM_CLAIMS
    5. MAX_MEDICATION_RISK
    6. HIGH_RISK_MEDICATION_COUNT
    7. WARFARIN_RISK
üéØ Target variable: CONTINUOUS_RISK_TARGET
‚úÖ Evaluation dataset prepared: 41,616 clean records
üìä Dataset split:
   Training: 33,294 records
   Testing: 8,322 records


In [21]:
# Cross-Validation Framework
print("üîÑ Setting up distributed cross-validation framework...")

def create_cv_folds(df, k_folds=5, seed=42):
    """
    Create K-fold cross-validation splits using Snowpark
    """
    print(f"üìä Creating {k_folds}-fold cross-validation splits...")
    
    # Use modulo of numeric PATIENT_ID for deterministic fold assignment
    df_with_folds = df.with_column(
        "FOLD_ID", (col("PATIENT_ID") % lit(k_folds))
    )
    
    folds = []
    for fold_id in range(k_folds):
        train_fold = df_with_folds.filter(col("FOLD_ID") != lit(fold_id)).drop("FOLD_ID")
        val_fold = df_with_folds.filter(col("FOLD_ID") == lit(fold_id)).drop("FOLD_ID")
        
        train_size = train_fold.count()
        val_size = val_fold.count()
        
        folds.append({
            'fold_id': fold_id,
            'train': train_fold,
            'val': val_fold,
            'train_size': train_size,
            'val_size': val_size
        })
        
        print(f"   Fold {fold_id + 1}: Train={train_size:,}, Val={val_size:,}")
    
    return folds

def evaluate_model_cv(model_class, model_params, folds, features, target, model_name):
    """
    Perform cross-validation evaluation for a given model
    """
    print(f"\nüî¨ Cross-validating {model_name}...")
    
    fold_results = []
    
    for i, fold in enumerate(folds):
        print(f"   üìä Processing fold {i + 1}/{len(folds)}...")
        
        try:
            # Initialize model with parameters
            model = model_class(
                input_cols=features,
                output_cols=["PREDICTION"],
                label_cols=[target],
                **model_params
            )
            
            # Train on fold
            trained_model = model.fit(fold['train'])
            
            # Predict on validation set
            predictions_df = trained_model.predict(fold['val'])
            
            # Calculate metrics
            mae = mean_absolute_error(
                df=predictions_df, 
                y_true_col_names=[target], 
                y_pred_col_names=["PREDICTION"]
            )
            
            mse = mean_squared_error(
                df=predictions_df,
                y_true_col_names=[target],
                y_pred_col_names=["PREDICTION"]
            )
            
            rmse = np.sqrt(mse)
            
            fold_result = {
                'fold_id': i,
                'mae': float(mae),
                'mse': float(mse),
                'rmse': float(rmse),
                'val_size': fold['val_size']
            }
            
            fold_results.append(fold_result)
            print(f"      MAE: {mae:.4f}, RMSE: {rmse:.4f}")
            
        except Exception as e:
            print(f"      ‚ö†Ô∏è Fold {i + 1} failed: {e}")
            continue
    
    if not fold_results:
        print(f"   ‚ùå All folds failed for {model_name}")
        return None
    
    # Aggregate cross-validation results
    cv_metrics = {
        'model_name': model_name,
        'n_folds': len(fold_results),
        'mean_mae': np.mean([r['mae'] for r in fold_results]),
        'std_mae': np.std([r['mae'] for r in fold_results]),
        'mean_rmse': np.mean([r['rmse'] for r in fold_results]),
        'std_rmse': np.std([r['rmse'] for r in fold_results]),
        'fold_results': fold_results
    }
    
    print(f"   ‚úÖ CV Results - MAE: {cv_metrics['mean_mae']:.4f} ¬± {cv_metrics['std_mae']:.4f}")
    print(f"                  RMSE: {cv_metrics['mean_rmse']:.4f} ¬± {cv_metrics['std_rmse']:.4f}")
    
    return cv_metrics

# Create cross-validation folds
cv_folds = create_cv_folds(train_df, k_folds=5, seed=42)
print(f"‚úÖ Cross-validation framework ready with {len(cv_folds)} folds")


üîÑ Setting up distributed cross-validation framework...
üìä Creating 5-fold cross-validation splits...
   Fold 1: Train=24,971, Val=8,323
   Fold 2: Train=24,970, Val=8,324
   Fold 3: Train=24,971, Val=8,323
   Fold 4: Train=29,132, Val=4,162
   Fold 5: Train=29,132, Val=4,162
‚úÖ Cross-validation framework ready with 5 folds


In [22]:
# Multi-Algorithm Evaluation
print("üéØ Running multi-algorithm evaluation with cross-validation...")

# Define models to evaluate
model_configs = [
    {
        'name': 'XGBoost_Default',
        'class': XGBRegressor,
        'params': {
            'n_estimators': 100,
            'max_depth': 6,
            'learning_rate': 0.1,
            'random_state': 42
        }
    },
    {
        'name': 'XGBoost_Optimized',
        'class': XGBRegressor,
        'params': {
            'n_estimators': 150,
            'max_depth': 8,
            'learning_rate': 0.05,
            'subsample': 0.8,
            'colsample_bytree': 0.8,
            'random_state': 42
        }
    },
    {
        'name': 'Linear_Baseline',
        'class': LinearRegression,
        'params': {}
    }
]

# Run cross-validation for each model
cv_results = []

for config in model_configs:
    try:
        cv_result = evaluate_model_cv(
            model_class=config['class'],
            model_params=config['params'],
            folds=cv_folds,
            features=evaluation_features,
            target=target_col,
            model_name=config['name']
        )
        
        if cv_result:
            cv_results.append(cv_result)
        
    except Exception as e:
        print(f"‚ö†Ô∏è Model {config['name']} evaluation failed: {e}")
        continue

# Compare model performance
print(f"\nüìä Cross-Validation Results Summary:")
print(f"{'Model':<20} {'MAE':<12} {'RMSE':<12} {'Folds':<8}")
print("-" * 55)

best_model = None
best_mae = float('inf')

for result in cv_results:
    mae_str = f"{result['mean_mae']:.4f} ¬± {result['std_mae']:.4f}"
    rmse_str = f"{result['mean_rmse']:.4f} ¬± {result['std_rmse']:.4f}"
    
    print(f"{result['model_name']:<20} {mae_str:<12} {rmse_str:<12} {result['n_folds']:<8}")
    
    if result['mean_mae'] < best_mae:
        best_mae = result['mean_mae']
        best_model = result['model_name']

print(f"\nüèÜ Best performing model: {best_model} (MAE: {best_mae:.4f})")

# Statistical significance testing
print(f"\nüìà Statistical Analysis:")
if len(cv_results) >= 2:
    # Compare top two models
    sorted_results = sorted(cv_results, key=lambda x: x['mean_mae'])
    model1, model2 = sorted_results[0], sorted_results[1]
    
    mae_diff = model2['mean_mae'] - model1['mean_mae']
    combined_std = np.sqrt(model1['std_mae']**2 + model2['std_mae']**2)
    
    if combined_std > 0:
        effect_size = mae_diff / combined_std
        print(f"   Performance difference: {mae_diff:.4f} MAE")
        print(f"   Effect size: {effect_size:.3f}")
        
        if abs(effect_size) > 0.5:
            print(f"   üìä Moderate to large effect size detected")
        else:
            print(f"   üìä Small effect size - models perform similarly")

print(f"‚úÖ Multi-algorithm evaluation complete")


üéØ Running multi-algorithm evaluation with cross-validation...

üî¨ Cross-validating XGBoost_Default...
   üìä Processing fold 1/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.0470, RMSE: 2.4964
   üìä Processing fold 2/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.0639, RMSE: 2.5059
   üìä Processing fold 3/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.0629, RMSE: 2.4964
   üìä Processing fold 4/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.1211, RMSE: 2.4596
   üìä Processing fold 5/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.1085, RMSE: 2.4895
   ‚úÖ CV Results - MAE: 1.0807 ¬± 0.0288
                  RMSE: 2.4896 ¬± 0.0159

üî¨ Cross-validating XGBoost_Optimized...
   üìä Processing fold 1/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.0840, RMSE: 2.4342
   üìä Processing fold 2/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.0612, RMSE: 2.4585
   üìä Processing fold 3/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.0179, RMSE: 2.3933
   üìä Processing fold 4/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.0844, RMSE: 2.4498
   üìä Processing fold 5/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 1.0624, RMSE: 2.4675
   ‚úÖ CV Results - MAE: 1.0620 ¬± 0.0242
                  RMSE: 2.4406 ¬± 0.0261

üî¨ Cross-validating Linear_Baseline...
   üìä Processing fold 1/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 4.2117, RMSE: 5.2817
   üìä Processing fold 2/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 4.1604, RMSE: 5.2970
   üìä Processing fold 3/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 4.2164, RMSE: 5.3052
   üìä Processing fold 4/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 4.3217, RMSE: 5.3043
   üìä Processing fold 5/5...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


      MAE: 4.1523, RMSE: 5.3305
   ‚úÖ CV Results - MAE: 4.2125 ¬± 0.0605
                  RMSE: 5.3037 ¬± 0.0158

üìä Cross-Validation Results Summary:
Model                MAE          RMSE         Folds   
-------------------------------------------------------
XGBoost_Default      1.0807 ¬± 0.0288 2.4896 ¬± 0.0159 5       
XGBoost_Optimized    1.0620 ¬± 0.0242 2.4406 ¬± 0.0261 5       
Linear_Baseline      4.2125 ¬± 0.0605 5.3037 ¬± 0.0158 5       

üèÜ Best performing model: XGBoost_Optimized (MAE: 1.0620)

üìà Statistical Analysis:
   Performance difference: 0.0187 MAE
   Effect size: 0.497
   üìä Small effect size - models perform similarly
‚úÖ Multi-algorithm evaluation complete


In [23]:
# Healthcare-Specific Metrics Evaluation
print("üè• Calculating healthcare-specific evaluation metrics...")

def calculate_healthcare_metrics(predictions_df, target_col):
    """
    Calculate healthcare-specific metrics including risk stratification accuracy
    """
    
    # Define risk thresholds (these would be clinically validated)
    low_threshold = 30.0
    high_threshold = 70.0
    
    print(f"   üìä Calculating risk stratification metrics...")
    print(f"      Risk thresholds: Low < {low_threshold}, Medium {low_threshold}-{high_threshold}, High > {high_threshold}")
    
    # Create risk categories for both true and predicted values
    # Use proper column aliasing to avoid identifier issues
    metrics_df = predictions_df.with_column(
        "TRUE_RISK_CAT",
        when(col(target_col) < lit(low_threshold), lit("LOW"))
        .when(col(target_col) < lit(high_threshold), lit("MEDIUM"))
        .otherwise(lit("HIGH"))
    ).with_column(
        "PRED_RISK_CAT",
        when(col("PREDICTION") < lit(low_threshold), lit("LOW"))
        .when(col("PREDICTION") < lit(high_threshold), lit("MEDIUM"))
        .otherwise(lit("HIGH"))
    )
    
    # Calculate total records
    total_records = metrics_df.count()
    print(f"      Total records for analysis: {total_records:,}")
    
    if total_records == 0:
        print(f"      ‚ö†Ô∏è No records available for healthcare metrics")
        return {}
    
    # Calculate risk category accuracy
    try:
        correct_classifications = metrics_df.filter(
            col("TRUE_RISK_CAT") == col("PRED_RISK_CAT")
        ).count()
        
        category_accuracy = correct_classifications / total_records
        print(f"      Risk category accuracy: {category_accuracy:.3f} ({correct_classifications:,}/{total_records:,})")
        
    except Exception as e:
        print(f"      ‚ö†Ô∏è Category accuracy calculation error: {e}")
        category_accuracy = 0.0
    
    # Calculate high-risk sensitivity (true positive rate for high-risk patients)
    try:
        true_high_risk = metrics_df.filter(col("TRUE_RISK_CAT") == lit("HIGH")).count()
        predicted_high_risk_correctly = metrics_df.filter(
            (col("TRUE_RISK_CAT") == lit("HIGH")) & (col("PRED_RISK_CAT") == lit("HIGH"))
        ).count()
        
        high_risk_sensitivity = predicted_high_risk_correctly / true_high_risk if true_high_risk > 0 else 0.0
        print(f"      High-risk sensitivity: {high_risk_sensitivity:.3f} ({predicted_high_risk_correctly:,}/{true_high_risk:,})")
        
    except Exception as e:
        print(f"      ‚ö†Ô∏è High-risk sensitivity calculation error: {e}")
        high_risk_sensitivity = 0.0
    
    # Calculate low-risk specificity (true negative rate for low-risk patients)
    try:
        true_low_risk = metrics_df.filter(col("TRUE_RISK_CAT") == lit("LOW")).count()
        predicted_low_risk_correctly = metrics_df.filter(
            (col("TRUE_RISK_CAT") == lit("LOW")) & (col("PRED_RISK_CAT") == lit("LOW"))
        ).count()
        
        low_risk_specificity = predicted_low_risk_correctly / true_low_risk if true_low_risk > 0 else 0.0
        print(f"      Low-risk specificity: {low_risk_specificity:.3f} ({predicted_low_risk_correctly:,}/{true_low_risk:,})")
        
    except Exception as e:
        print(f"      ‚ö†Ô∏è Low-risk specificity calculation error: {e}")
        low_risk_specificity = 0.0
    
    # Calculate MAE by risk category
    risk_mae_metrics = {}
    for risk_cat in ["LOW", "MEDIUM", "HIGH"]:
        try:
            cat_df = metrics_df.filter(col("TRUE_RISK_CAT") == lit(risk_cat))
            cat_count = cat_df.count()
            
            if cat_count > 0:
                cat_mae = cat_df.select(
                    avg(abs_(col(target_col) - col("PREDICTION"))).alias("MAE")
                ).collect()[0]["MAE"]
                risk_mae_metrics[f"mae_{risk_cat.lower()}"] = float(cat_mae) if cat_mae else 0.0
                print(f"      MAE for {risk_cat} risk: {cat_mae:.3f} (n={cat_count:,})")
            else:
                risk_mae_metrics[f"mae_{risk_cat.lower()}"] = 0.0
                
        except Exception as e:
            print(f"      ‚ö†Ô∏è MAE calculation for {risk_cat} risk error: {e}")
            risk_mae_metrics[f"mae_{risk_cat.lower()}"] = 0.0
    
    # Compile healthcare metrics
    healthcare_metrics = {
        'risk_category_accuracy': category_accuracy,
        'high_risk_sensitivity': high_risk_sensitivity,
        'low_risk_specificity': low_risk_specificity,
        'total_patients': total_records,
        **risk_mae_metrics
    }
    
    return healthcare_metrics

# Calculate healthcare metrics for the best model
print(f"üéØ Evaluating healthcare metrics for {best_model}...")

try:
    # Find best model configuration
    best_config = next(config for config in model_configs if config['name'] == best_model)
    
    # Train best model on full training set
    final_model = best_config['class'](
        input_cols=evaluation_features,
        output_cols=["PREDICTION"],
        label_cols=[target_col],
        **best_config['params']
    )
    
    trained_final_model = final_model.fit(train_df)
    
    # Get predictions on test set
    test_predictions = trained_final_model.predict(test_df)
    
    # Calculate healthcare-specific metrics
    healthcare_metrics = calculate_healthcare_metrics(test_predictions, target_col)
    
    print(f"\nüè• Healthcare Metrics Summary for {best_model}:")
    print(f"   Risk Category Accuracy: {healthcare_metrics.get('risk_category_accuracy', 0):.3f}")
    print(f"   High-Risk Sensitivity: {healthcare_metrics.get('high_risk_sensitivity', 0):.3f}")
    print(f"   Low-Risk Specificity: {healthcare_metrics.get('low_risk_specificity', 0):.3f}")
    print(f"   MAE by Risk Level:")
    print(f"     Low Risk: {healthcare_metrics.get('mae_low', 0):.3f}")
    print(f"     Medium Risk: {healthcare_metrics.get('mae_medium', 0):.3f}")
    print(f"     High Risk: {healthcare_metrics.get('mae_high', 0):.3f}")
    
except Exception as e:
    print(f"‚ö†Ô∏è Healthcare metrics calculation failed: {e}")
    healthcare_metrics = {}

print(f"‚úÖ Healthcare-specific evaluation complete")


üè• Calculating healthcare-specific evaluation metrics...
üéØ Evaluating healthcare metrics for XGBoost_Optimized...


  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(self.dataset)
  core.DataType.from_snowpark_type(data_type)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)
  dataset = snowpark_dataframe_utils.cast_snowpark_dataframe_column_types(dataset)


   üìä Calculating risk stratification metrics...
      Risk thresholds: Low < 30.0, Medium 30.0-70.0, High > 70.0
      Total records for analysis: 8,322
      Risk category accuracy: 0.996 (8,285/8,322)
      High-risk sensitivity: 0.993 (3,991/4,020)
      Low-risk specificity: 0.981 (2,332/2,376)
      MAE for LOW risk: 0.086 (n=2,410)
      MAE for MEDIUM risk: 0.178 (n=1,880)
      MAE for HIGH risk: 1.908 (n=4,020)

üè• Healthcare Metrics Summary for XGBoost_Optimized:
   Risk Category Accuracy: 0.996
   High-Risk Sensitivity: 0.993
   Low-Risk Specificity: 0.981
   MAE by Risk Level:
     Low Risk: 0.086
     Medium Risk: 0.178
     High Risk: 1.908
‚úÖ Healthcare-specific evaluation complete


In [24]:
# Evaluation Results Logging
print("üìù Logging comprehensive evaluation results...")

# Create evaluation logging tables
evaluation_logging_sql = '''
-- Main evaluation results table
CREATE TABLE IF NOT EXISTS ADVERSE_EVENT_MONITORING.DEMO_ANALYTICS.MODEL_EVALUATION_LOG (
    EVALUATION_ID STRING,
    EVALUATION_DATE TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    MODEL_NAME STRING,
    EVALUATION_TYPE STRING,
    DATASET_SIZE INT,
    FEATURE_COUNT INT,
    CV_FOLDS INT,
    MEAN_MAE FLOAT,
    STD_MAE FLOAT,
    MEAN_RMSE FLOAT,
    STD_RMSE FLOAT,
    RISK_CATEGORY_ACCURACY FLOAT,
    HIGH_RISK_SENSITIVITY FLOAT,
    LOW_RISK_SPECIFICITY FLOAT,
    MAE_LOW_RISK FLOAT,
    MAE_MEDIUM_RISK FLOAT,
    MAE_HIGH_RISK FLOAT,
    BEST_MODEL STRING,
    EVALUATION_NOTES STRING
);

-- Model comparison table
CREATE TABLE IF NOT EXISTS ADVERSE_EVENT_MONITORING.DEMO_ANALYTICS.MODEL_COMPARISON_LOG (
    COMPARISON_ID STRING,
    EVALUATION_ID STRING,
    MODEL_A STRING,
    MODEL_B STRING,
    MAE_DIFFERENCE FLOAT,
    RMSE_DIFFERENCE FLOAT,
    EFFECT_SIZE FLOAT,
    SIGNIFICANCE_LEVEL STRING,
    COMPARISON_DATE TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    COMPARISON_NOTES STRING
);
'''

try:
    session.sql(evaluation_logging_sql).collect()
    print("‚úÖ Evaluation logging tables created")
except Exception as e:
    print(f"‚ö†Ô∏è Logging table creation: {e}")

# Prepare evaluation results for logging
evaluation_timestamp = datetime.datetime.now()
evaluation_id = f"EVAL_{evaluation_timestamp.strftime('%Y%m%d_%H%M%S')}"

print(f"üìä Preparing evaluation results for logging...")
print(f"   Evaluation ID: {evaluation_id}")

# Log main evaluation results
evaluation_log = []

for result in cv_results:
    # Get healthcare metrics for this model if it's the best one
    model_healthcare_metrics = healthcare_metrics if result['model_name'] == best_model else {}
    
    eval_record = (
        evaluation_id,
        evaluation_timestamp.isoformat(),
        result['model_name'],
        'CROSS_VALIDATION',
        total_records,
        len(evaluation_features),
        result['n_folds'],
        result['mean_mae'],
        result['std_mae'],
        result['mean_rmse'],
        result['std_rmse'],
        model_healthcare_metrics.get('risk_category_accuracy', 0.0),
        model_healthcare_metrics.get('high_risk_sensitivity', 0.0),
        model_healthcare_metrics.get('low_risk_specificity', 0.0),
        model_healthcare_metrics.get('mae_low', 0.0),
        model_healthcare_metrics.get('mae_medium', 0.0),
        model_healthcare_metrics.get('mae_high', 0.0),
        best_model,
        f"Features: {', '.join(evaluation_features[:5])}..."
    )
    
    evaluation_log.append(eval_record)

# Create evaluation DataFrame and save
if evaluation_log:
    eval_schema = StructType([
        StructField("EVALUATION_ID", StringType()),
        StructField("EVALUATION_DATE", StringType()),
        StructField("MODEL_NAME", StringType()),
        StructField("EVALUATION_TYPE", StringType()),
        StructField("DATASET_SIZE", IntegerType()),
        StructField("FEATURE_COUNT", IntegerType()),
        StructField("CV_FOLDS", IntegerType()),
        StructField("MEAN_MAE", DoubleType()),
        StructField("STD_MAE", DoubleType()),
        StructField("MEAN_RMSE", DoubleType()),
        StructField("STD_RMSE", DoubleType()),
        StructField("RISK_CATEGORY_ACCURACY", DoubleType()),
        StructField("HIGH_RISK_SENSITIVITY", DoubleType()),
        StructField("LOW_RISK_SPECIFICITY", DoubleType()),
        StructField("MAE_LOW_RISK", DoubleType()),
        StructField("MAE_MEDIUM_RISK", DoubleType()),
        StructField("MAE_HIGH_RISK", DoubleType()),
        StructField("BEST_MODEL", StringType()),
        StructField("EVALUATION_NOTES", StringType())
    ])
    
    eval_df = session.create_dataframe(evaluation_log, schema=eval_schema)
    eval_df.write.mode("append").save_as_table("ADVERSE_EVENT_MONITORING.DEMO_ANALYTICS.MODEL_EVALUATION_LOG")
    
    print(f"‚úÖ Logged {len(evaluation_log)} model evaluation results")

# Log model comparisons
if len(cv_results) >= 2:
    comparison_log = []
    
    # Compare all pairs of models
    for i, model_a in enumerate(cv_results):
        for j, model_b in enumerate(cv_results[i+1:], i+1):
            mae_diff = model_b['mean_mae'] - model_a['mean_mae']
            rmse_diff = model_b['mean_rmse'] - model_a['mean_rmse']
            
            combined_std = np.sqrt(model_a['std_mae']**2 + model_b['std_mae']**2)
            effect_size = mae_diff / combined_std if combined_std > 0 else 0.0
            
            significance = "LARGE" if abs(effect_size) > 0.8 else "MEDIUM" if abs(effect_size) > 0.5 else "SMALL"
            
            comparison_record = (
                f"COMP_{evaluation_timestamp.strftime('%Y%m%d_%H%M%S')}_{i}_{j}",
                evaluation_id,
                model_a['model_name'],
                model_b['model_name'],
                mae_diff,
                rmse_diff,
                effect_size,
                significance,
                evaluation_timestamp.isoformat(),
                f"Cross-validation comparison with {model_a['n_folds']} folds"
            )
            
            comparison_log.append(comparison_record)
    
    if comparison_log:
        comparison_schema = StructType([
            StructField("COMPARISON_ID", StringType()),
            StructField("EVALUATION_ID", StringType()),
            StructField("MODEL_A", StringType()),
            StructField("MODEL_B", StringType()),
            StructField("MAE_DIFFERENCE", DoubleType()),
            StructField("RMSE_DIFFERENCE", DoubleType()),
            StructField("EFFECT_SIZE", DoubleType()),
            StructField("SIGNIFICANCE_LEVEL", StringType()),
            StructField("COMPARISON_DATE", StringType()),
            StructField("COMPARISON_NOTES", StringType())
        ])
        
        comparison_df = session.create_dataframe(comparison_log, schema=comparison_schema)
        comparison_df.write.mode("append").save_as_table("ADVERSE_EVENT_MONITORING.DEMO_ANALYTICS.MODEL_COMPARISON_LOG")
        
        print(f"‚úÖ Logged {len(comparison_log)} model comparisons")

# Final evaluation summary
print(f"\nüéØ Comprehensive Model Evaluation Complete!")
print(f"   üìä Evaluation ID: {evaluation_id}")
print(f"   üèÜ Best Model: {best_model} (MAE: {best_mae:.4f})")
print(f"   üìà Models Evaluated: {len(cv_results)}")
print(f"   üîÑ Cross-Validation Folds: {len(cv_folds)}")
print(f"   üè• Healthcare Metrics: Risk accuracy, sensitivity, specificity calculated")
print(f"   üìù Results Logged: Available in MODEL_EVALUATION_LOG and MODEL_COMPARISON_LOG")
print(f"   üöÄ Ready for model packaging and deployment (notebook 07)")


üìù Logging comprehensive evaluation results...
‚ö†Ô∏è Logging table creation: (1304): 01be2bd5-0000-29a7-002c-b10b000a874e: 000008 (0A000): Actual statement count 2 did not match the desired statement count 1.
üìä Preparing evaluation results for logging...
   Evaluation ID: EVAL_20250805_135714
‚úÖ Logged 3 model evaluation results
‚úÖ Logged 3 model comparisons

üéØ Comprehensive Model Evaluation Complete!
   üìä Evaluation ID: EVAL_20250805_135714
   üèÜ Best Model: XGBoost_Optimized (MAE: 1.0620)
   üìà Models Evaluated: 3
   üîÑ Cross-Validation Folds: 5
   üè• Healthcare Metrics: Risk accuracy, sensitivity, specificity calculated
   üìù Results Logged: Available in MODEL_EVALUATION_LOG and MODEL_COMPARISON_LOG
   üöÄ Ready for model packaging and deployment (notebook 07)
