# ML Practice Questions - Part 3: Model Evaluation and Validation

This notebook covers essential concepts in model evaluation and validation, including cross-validation strategies, bias-variance tradeoff, overfitting/underfitting, performance metrics selection, and model selection criteria. Each question includes theoretical foundations, practical implementations, and real-world applications.

## Learning Objectives
- Master cross-validation techniques and their appropriate use cases
- Understand bias-variance tradeoff and its implications for model selection
- Identify and address overfitting and underfitting in machine learning models
- Select appropriate performance metrics for different types of problems
- Apply rigorous model selection and comparison methodologies

## Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with scikit-learn and numpy
- Knowledge of basic statistics and probability

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import (
    train_test_split, cross_val_score, StratifiedKFold, 
    TimeSeriesSplit, LeaveOneOut, validation_curve, learning_curve
)
from sklearn.datasets import make_classification, make_regression, load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    mean_squared_error, mean_absolute_error, r2_score,
    classification_report, confusion_matrix, roc_curve
)
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
np.random.seed(42)

## Question 1: Cross-Validation Strategies and Their Applications ⭐⭐⭐

**Problem**: You are working on three different machine learning projects:
1. A medical diagnosis system with limited, expensive-to-collect samples
2. A time series forecasting model for stock prices
3. A large-scale image classification system with millions of samples

For each scenario, determine the most appropriate cross-validation strategy, implement it, and explain why it's optimal for that specific use case.

### Theoretical Foundation

Cross-validation is a statistical method used to estimate the performance of machine learning models on unseen data. The key principle is to use different subsets of data for training and validation to get a more robust estimate of model performance.

**K-Fold Cross-Validation**:
- Divide data into k folds
- Train on k-1 folds, validate on 1 fold
- Repeat k times
- Average performance across all folds

**Stratified K-Fold**:
- Maintains class distribution in each fold
- Essential for imbalanced datasets
- Ensures each fold is representative of the overall dataset

**Time Series Split**:
- Respects temporal ordering
- No future information leak into past predictions
- Simulates real-world deployment scenario

**Leave-One-Out (LOO)**:
- Special case of k-fold where k = n (number of samples)
- Maximizes training data usage
- Nearly unbiased estimate, but with high variance and high computational cost (n model fits)
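
The differences between these strategies are easiest to see on a tiny dataset. The sketch below is illustrative only: the toy arrays `X_demo` and `y_demo` and the helper `positives_per_fold` are assumptions made for this example, not part of the notebook's later cells. It counts how many positive samples land in each validation fold and checks the ordering guarantee of `TimeSeriesSplit`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit, LeaveOneOut

# Toy data (assumed for illustration): 20 samples, 4 positives listed first
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.array([1] * 4 + [0] * 16)

def positives_per_fold(splitter, X, y):
    """Count positive samples in each validation fold (hypothetical helper)."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

# Unshuffled K-Fold cuts the data into contiguous blocks
kfold_counts = positives_per_fold(KFold(n_splits=4), X_demo, y_demo)

# Stratified K-Fold spreads the positives across folds
strat_counts = positives_per_fold(StratifiedKFold(n_splits=4), X_demo, y_demo)

# TimeSeriesSplit: every training index precedes every validation index
ts_ordered = all(
    train_idx.max() < val_idx.min()
    for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X_demo)
)

# Leave-One-Out produces one split per sample
n_loo_splits = LeaveOneOut().get_n_splits(X_demo)

print("KFold positives per fold:     ", kfold_counts)
print("Stratified positives per fold:", strat_counts)
print("TimeSeriesSplit ordered:      ", ts_ordered)
print("Leave-One-Out splits:         ", n_loo_splits)
```

With sorted labels, unshuffled K-Fold puts all positives into the first fold, which is exactly the failure mode stratification is meant to avoid.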

In [None]:
# Scenario 1: Medical diagnosis with limited samples
print("=" * 60)
print("SCENARIO 1: Medical Diagnosis (Limited Samples)")
print("=" * 60)

# Simulate small medical dataset
X_medical, y_medical = make_classification(
    n_samples=150,  # Small dataset
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.8, 0.2],  # Class imbalance (common in medical diagnosis)
    class_sep=0.8,
    random_state=42
)

print(f"Dataset size: {X_medical.shape[0]} samples")
print(f"Class distribution: {np.bincount(y_medical)}")
print(f"Class imbalance ratio: {np.bincount(y_medical)[1] / np.bincount(y_medical)[0]:.2f}")

# Compare different CV strategies for small dataset
model = LogisticRegression(random_state=42)

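# Standard K-Fold (an integer cv would silently use StratifiedKFold for classifiers)
from sklearn.model_selection import KFold
kfold_scores = cross_val_score(
    model, X_medical, y_medical,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc'
)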

# Stratified K-Fold (recommended for imbalanced data)
stratified_scores = cross_val_score(
    model, X_medical, y_medical, 
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42), 
    scoring='roc_auc'
)

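# Leave-One-Out (maximizes training data)
# ROC-AUC is undefined on a single held-out sample, so pool the
# out-of-fold probabilities and compute one AUC over all of them.
from sklearn.model_selection import cross_val_predict
loo_proba = cross_val_predict(
    model, X_medical, y_medical, cv=LeaveOneOut(), method='predict_proba'
)[:, 1]
loo_auc = roc_auc_score(y_medical, loo_proba)

print("\nCross-Validation Results:")
print(f"K-Fold (5): {kfold_scores.mean():.3f} ± {kfold_scores.std():.3f}")
print(f"Stratified K-Fold (5): {stratified_scores.mean():.3f} ± {stratified_scores.std():.3f}")
print(f"Leave-One-Out (pooled predictions): {loo_auc:.3f}")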

print("\n📊 Analysis for Medical Diagnosis:")
print("✅ RECOMMENDED: Stratified K-Fold")
print("   - Maintains class distribution in each fold")
print("   - Handles class imbalance effectively")
print("   - Provides stable performance estimates")
print("   - Computationally efficient compared to LOO")

In [None]:
# Scenario 2: Time series forecasting
print("\n" + "=" * 60)
print("SCENARIO 2: Time Series Forecasting")
print("=" * 60)

# Simulate time series data (stock prices)
np.random.seed(42)
n_timepoints = 500
time_index = np.arange(n_timepoints)

# Create features from past values only: lags and trailing moving averages.
# (Centered moving averages would include the target itself and leak the future.)
price_series = 100 + np.cumsum(np.random.randn(n_timepoints) * 0.5)
prices = pd.Series(price_series)
features = pd.DataFrame({
    'lag_1': prices.shift(1),                      # previous day's price
    'lag_2': prices.shift(2),                      # price two days back
    'ma_5': prices.shift(1).rolling(5).mean(),     # trailing 5-day MA
    'ma_10': prices.shift(1).rolling(10).mean()    # trailing 10-day MA
}).dropna()  # Drop the first 10 rows, where the 10-day MA is undefined

X_ts = features.to_numpy()
y_ts = price_series[features.index.to_numpy()]  # Target: price on the current day

print(f"Time series length: {len(y_ts)} time points")
print(f"Features: {X_ts.shape[1]} (lag-1, lag-2, 5-day MA, 10-day MA)")

# Compare standard CV vs Time Series CV
model_ts = Ridge(alpha=1.0)

# WRONG: Standard K-Fold (violates temporal order)
wrong_cv_scores = cross_val_score(model_ts, X_ts, y_ts, cv=5, scoring='neg_mean_squared_error')

# CORRECT: Time Series Split
tscv = TimeSeriesSplit(n_splits=5)
correct_cv_scores = cross_val_score(model_ts, X_ts, y_ts, cv=tscv, scoring='neg_mean_squared_error')

print("\nCross-Validation Results (MSE):")
print(f"❌ Standard K-Fold: {-wrong_cv_scores.mean():.3f} ± {wrong_cv_scores.std():.3f}")
print(f"✅ Time Series Split: {-correct_cv_scores.mean():.3f} ± {correct_cv_scores.std():.3f}")

# Visualize the time series splits
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Plot original time series
axes[0].plot(y_ts, alpha=0.7, label='Price Series')
axes[0].set_title('Original Time Series')
axes[0].set_ylabel('Price')
axes[0].legend()

# Visualize TimeSeriesSplit
for i, (train_idx, val_idx) in enumerate(tscv.split(X_ts)):
    if i < 3:  # Show first 3 splits
        axes[1].scatter(train_idx, [i] * len(train_idx), alpha=0.6, s=1, label=f'Train {i+1}')
        axes[1].scatter(val_idx, [i] * len(val_idx), alpha=0.8, s=2, label=f'Val {i+1}')

axes[1].set_xlabel('Time Index')
axes[1].set_ylabel('CV Split')
axes[1].set_title('Time Series Cross-Validation Splits')
axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

print("\n📊 Analysis for Time Series:")
print("✅ MANDATORY: Time Series Split")
print("   - Respects temporal ordering")
print("   - Prevents data leakage from future")
print("   - Simulates realistic deployment scenario")
print("   - Training set always precedes validation set")

In [None]:
# Scenario 3: Large-scale image classification
print("\n" + "=" * 60)
print("SCENARIO 3: Large-Scale Image Classification")
print("=" * 60)

# Simulate large dataset
X_large, y_large = make_classification(
    n_samples=10000,  # Large dataset
    n_features=2048,  # High-dimensional (like CNN features)
    n_informative=1000,
    n_redundant=500,
    n_classes=10,
    random_state=42
)

print(f"Dataset size: {X_large.shape[0]:,} samples")
print(f"Feature dimension: {X_large.shape[1]:,} features")
print(f"Number of classes: {len(np.unique(y_large))}")
print(f"Class distribution: {np.bincount(y_large)}")

# For large datasets, we can use simpler CV strategies
model_large = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)

# Simple train-validation split (sufficient for large datasets)
X_train, X_val, y_train, y_val = train_test_split(
    X_large, y_large, test_size=0.2, stratify=y_large, random_state=42
)

model_large.fit(X_train, y_train)
val_accuracy = model_large.score(X_val, y_val)

print(f"\nSimple Train-Val Split Accuracy: {val_accuracy:.3f}")

# 3-Fold CV (efficient for large datasets)
cv_scores_3fold = cross_val_score(
    model_large, X_large, y_large, 
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42), 
    scoring='accuracy'
)

print(f"3-Fold Stratified CV: {cv_scores_3fold.mean():.3f} ± {cv_scores_3fold.std():.3f}")

# Compare computational time
import time

# Time simple split
start_time = time.time()
model_temp = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
model_temp.fit(X_train, y_train)
_ = model_temp.score(X_val, y_val)
simple_split_time = time.time() - start_time

# Time 3-fold CV
start_time = time.time()
_ = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1), 
    X_large, y_large, 
    cv=3, 
    scoring='accuracy'
)
cv_3fold_time = time.time() - start_time

print(f"\nComputational Efficiency:")
print(f"Simple Split: {simple_split_time:.2f} seconds")
print(f"3-Fold CV: {cv_3fold_time:.2f} seconds (×{cv_3fold_time/simple_split_time:.1f} slower)")

print("\n📊 Analysis for Large-Scale Classification:")
print("✅ RECOMMENDED: Simple Train-Val Split or 3-Fold CV")
print("   - Large datasets provide stable estimates with fewer folds")
print("   - Computational efficiency is crucial")
print("   - Simple split often sufficient for performance estimation")
print("   - 3-fold CV provides good balance of robustness and speed")

### Key Takeaways for Cross-Validation Strategy Selection

1. **Small Datasets (< 1000 samples)**: Use Stratified K-Fold or Leave-One-Out
2. **Time Series Data**: Always use TimeSeriesSplit to prevent data leakage
3. **Large Datasets (> 10,000 samples)**: Simple train-val split or 3-fold CV
4. **Imbalanced Classes**: Always use Stratified K-Fold
5. **Computational Constraints**: Reduce number of folds or use simple split
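
These rules of thumb can be written down as a small dispatch function. The sketch below is one possible encoding, not a scikit-learn API: `choose_cv` is a hypothetical helper, and the 10,000-sample threshold and fold counts simply mirror the list above.

```python
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

def choose_cv(n_samples, is_time_series=False, is_classification=True, random_state=42):
    """Hypothetical helper: map the rules of thumb above to a scikit-learn splitter."""
    if is_time_series:
        # Temporal data: training folds must precede validation folds
        return TimeSeriesSplit(n_splits=5)

    # Large datasets give stable estimates with fewer (cheaper) folds
    n_splits = 3 if n_samples > 10_000 else 5

    if is_classification:
        # Stratification keeps class proportions similar across folds
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

print(choose_cv(150))
print(choose_cv(500, is_time_series=True))
print(choose_cv(50_000, is_classification=False))
```

The returned object can be passed directly as the `cv` argument of `cross_val_score`.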

## Question 2: Understanding and Diagnosing Bias-Variance Tradeoff ⭐⭐⭐

**Problem**: You are comparing three different models for a regression task:
1. Linear regression (high bias, low variance)
2. Random forest (medium bias, medium variance)
3. K-nearest neighbors with k=1 (low bias, high variance)

Implement a bias-variance decomposition analysis to understand how each model's complexity affects its prediction error components.

### Theoretical Foundation

The bias-variance decomposition is fundamental to understanding model performance:

**Expected Test Error = Bias² + Variance + Irreducible Error**

Where:
- **Bias**: Error from overly simplistic assumptions
- **Variance**: Error from sensitivity to small fluctuations in training set
- **Irreducible Error**: Noise inherent in the problem

**Mathematical Formulation**:
$$\text{Bias}[\hat{f}(x)] = E[\hat{f}(x)] - f(x)$$
$$\text{Variance}[\hat{f}(x)] = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$

**Model Complexity Effects**:
- **Underfit** (High Bias, Low Variance): Model too simple
- **Overfit** (Low Bias, High Variance): Model too complex
- **Sweet Spot** (Balanced): Optimal complexity
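
**Derivation sketch**: Assuming squared loss and data generated as $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$, $\text{Var}[\varepsilon] = \sigma^2$, and $\varepsilon$ independent of the training set (these assumptions are stated here for the derivation, and the expectation is over training sets and noise), the decomposition follows by adding and subtracting $E[\hat{f}(x)]$:

$$
\begin{aligned}
E[(y - \hat{f}(x))^2] &= E[(f(x) + \varepsilon - \hat{f}(x))^2] \\
&= E[(f(x) - \hat{f}(x))^2] + \sigma^2 \\
&= (E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma^2 \\
&= \text{Bias}[\hat{f}(x)]^2 + \text{Variance}[\hat{f}(x)] + \sigma^2
\end{aligned}
$$

The cross terms vanish because $E[\varepsilon] = 0$ and $E[\hat{f}(x) - E[\hat{f}(x)]] = 0$. Note that the bias is measured against the noise-free $f(x)$, not against the noisy observation $y$.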

In [None]:
def bias_variance_decomposition(model_class, model_params, X, y, n_trials=100,
                                test_size=0.3, true_fn=None):
    """
    Estimate bias², variance, and irreducible error by bootstrap resampling.
    
    Parameters:
    -----------
    model_class : class
        The model class (e.g., LinearRegression)
    model_params : dict
        Parameters for the model
    X, y : array-like
        Full dataset; a fixed test subset is held out internally
    n_trials : int
        Number of bootstrap trials
    test_size : float
        Fraction of data to use for testing
    true_fn : callable, optional
        Noise-free target function f(x) for single-feature synthetic data.
        Without it, bias² is measured against the noisy targets, so it
        absorbs the noise and the irreducible term is zero by construction.
    
    Returns:
    --------
    dict : Bias², variance, total error, and irreducible error
    """
    n_samples = X.shape[0]
    n_test = int(n_samples * test_size)
    
    # Fixed test set for fair comparison
    test_indices = np.random.choice(n_samples, n_test, replace=False)
    train_pool = np.setdiff1d(np.arange(n_samples), test_indices)
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # Reference for the bias term: noise-free values when available
    y_ref = true_fn(X_test.ravel()) if true_fn is not None else y_test
    
    predictions = []
    
    for trial in range(n_trials):
        # Bootstrap sample (with replacement) from the training pool
        train_indices = np.random.choice(train_pool, size=len(train_pool), replace=True)
        X_train_boot = X[train_indices]
        y_train_boot = y[train_indices]
        
        # Train model on bootstrap sample
        model = model_class(**model_params)
        model.fit(X_train_boot, y_train_boot)
        
        # Predict on test set
        y_pred = model.predict(X_test)
        predictions.append(y_pred)
    
    predictions = np.array(predictions)
    
    # Calculate bias and variance
    mean_prediction = np.mean(predictions, axis=0)
    bias_squared = np.mean((mean_prediction - y_ref) ** 2)
    variance = np.mean(np.var(predictions, axis=0))
    
    # Total error against the observed (noisy) targets
    total_error = np.mean((predictions - y_test) ** 2)
    
    return {
        'bias_squared': bias_squared,
        'variance': variance,
        'total_error': total_error,
        # Residual term: estimates the noise variance when true_fn is given
        'irreducible_error': total_error - bias_squared - variance
    }

# Generate synthetic regression data with known noise
np.random.seed(42)
n_samples = 200
X_reg = np.random.randn(n_samples, 1)
true_function = lambda x: 1.5 * x + 0.3 * x**2 - 0.1 * x**3
noise_std = 0.3
y_reg = true_function(X_reg.ravel()) + np.random.normal(0, noise_std, n_samples)

print("Bias-Variance Decomposition Analysis")
print("=" * 50)
print(f"Dataset: {n_samples} samples with noise std = {noise_std}")
print(f"True function: f(x) = 1.5x + 0.3x² - 0.1x³")

# Compare different models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

models_to_test = [
    (LinearRegression, {}, "Linear Regression (High Bias)"),
    (RandomForestRegressor, {'n_estimators': 10, 'random_state': 42}, "Random Forest (Medium)"),
    (KNeighborsRegressor, {'n_neighbors': 1}, "KNN-1 (High Variance)"),
    (KNeighborsRegressor, {'n_neighbors': 10}, "KNN-10 (Lower Variance)")
]

results = []
for model_class, params, name in models_to_test:
    # Pass the known true function so bias is measured against noise-free targets
    decomp = bias_variance_decomposition(
        model_class, params, X_reg, y_reg, true_fn=true_function
    )
    decomp['model'] = name
    results.append(decomp)
    
    print(f"\n{name}:")
    print(f"  Bias²: {decomp['bias_squared']:.4f}")
    print(f"  Variance: {decomp['variance']:.4f}")
    print(f"  Total Error: {decomp['total_error']:.4f}")
    print(f"  Irreducible: {decomp['irreducible_error']:.4f}")

In [None]:
# Visualize bias-variance tradeoff
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Bias vs Variance
models = [r['model'] for r in results]
bias_values = [r['bias_squared'] for r in results]
variance_values = [r['variance'] for r in results]

x_pos = np.arange(len(models))
width = 0.35

axes[0].bar(x_pos - width/2, bias_values, width, label='Bias²', alpha=0.7, color='red')
axes[0].bar(x_pos + width/2, variance_values, width, label='Variance', alpha=0.7, color='blue')
axes[0].set_xlabel('Model')
axes[0].set_ylabel('Error Component')
axes[0].set_title('Bias-Variance Tradeoff')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels([m.split('(')[0].strip() for m in models], rotation=45)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Total Error Components
total_errors = [r['total_error'] for r in results]
irreducible_errors = [r['irreducible_error'] for r in results]

axes[1].bar(x_pos, bias_values, label='Bias²', alpha=0.7, color='red')
axes[1].bar(x_pos, variance_values, bottom=bias_values, label='Variance', alpha=0.7, color='blue')
axes[1].bar(x_pos, irreducible_errors, 
           bottom=[b+v for b,v in zip(bias_values, variance_values)], 
           label='Irreducible', alpha=0.7, color='gray')

axes[1].scatter(x_pos, total_errors, color='black', s=100, marker='x', linewidth=3, label='Total Error')
axes[1].set_xlabel('Model')
axes[1].set_ylabel('Error')
axes[1].set_title('Total Error Decomposition')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels([m.split('(')[0].strip() for m in models], rotation=45)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find the model with best bias-variance tradeoff
best_model_idx = np.argmin(total_errors)
best_model = results[best_model_idx]

print(f"\n🏆 Best Model: {best_model['model']}")
print(f"   Total Error: {best_model['total_error']:.4f}")
print(f"   Bias-Variance Balance: {best_model['bias_squared']:.4f} vs {best_model['variance']:.4f}")

In [None]:
# Demonstrate model complexity effect with polynomial regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

print("\nModel Complexity Analysis: Polynomial Regression")
print("=" * 55)

degrees = range(1, 11)
complexity_results = []

for degree in degrees:
    # Create polynomial pipeline
    poly_model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    
    # Simplified bias-variance analysis for efficiency
    n_trials = 50
    test_errors = []
    
    for trial in range(n_trials):
        X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=trial)
        poly_model.fit(X_train, y_train)
        y_pred = poly_model.predict(X_test)
        test_errors.append(mean_squared_error(y_test, y_pred))
    
    complexity_results.append({
        'degree': degree,
        'mean_error': np.mean(test_errors),
        'error_std': np.std(test_errors)
    })

# Plot complexity vs error
fig, ax = plt.subplots(1, 1, figsize=(10, 6))

degrees_list = [r['degree'] for r in complexity_results]
mean_errors = [r['mean_error'] for r in complexity_results]
error_stds = [r['error_std'] for r in complexity_results]

ax.errorbar(degrees_list, mean_errors, yerr=error_stds, marker='o', capsize=5)
ax.set_xlabel('Polynomial Degree (Model Complexity)')
ax.set_ylabel('Test Error (MSE)')
ax.set_title('Model Complexity vs Test Error')
ax.grid(True, alpha=0.3)

# Mark optimal complexity
optimal_idx = np.argmin(mean_errors)
optimal_degree = degrees_list[optimal_idx]
ax.axvline(x=optimal_degree, color='red', linestyle='--', alpha=0.7, label=f'Optimal Degree = {optimal_degree}')
ax.legend()

plt.show()

print(f"Optimal polynomial degree: {optimal_degree}")
print(f"Minimum test error: {min(mean_errors):.4f} ± {error_stds[optimal_idx]:.4f}")

print("\n📊 Key Insights:")
print("1. Low complexity (degree 1-2): High bias, low variance")
print(f"2. Optimal complexity (degree {optimal_degree}): Balanced bias-variance")
print("3. High complexity (degree 8+): Low bias, high variance")
print("4. Error increases beyond optimal point due to overfitting")

### Key Takeaways for Bias-Variance Tradeoff

1. **High Bias Models**: Underfit the data, consistent but inaccurate predictions
2. **High Variance Models**: Overfit the data, accurate on training but unstable
3. **Optimal Complexity**: Minimizes total error by balancing bias and variance
4. **Model Selection**: Use validation curves to find optimal complexity
5. **Ensemble Methods**: Can reduce variance while maintaining low bias
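
Takeaway 4 mentions validation curves; scikit-learn's `validation_curve` computes training and validation scores over a range of one hyperparameter. The sketch below is a minimal example under stated assumptions: the data (`X_vc`, `y_vc`) are regenerated here with a cubic form chosen for illustration, and the degree range 1 to 8 is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic data with Gaussian noise (assumed for illustration)
rng = np.random.default_rng(0)
X_vc = rng.normal(size=(200, 1))
x = X_vc.ravel()
y_vc = 1.5 * x + 0.3 * x**2 - 0.1 * x**3 + rng.normal(scale=0.3, size=200)

pipeline = Pipeline([
    ("poly", PolynomialFeatures()),
    ("linear", LinearRegression()),
])
degrees = np.arange(1, 9)

# Scores have shape (n_degrees, n_folds); scoring is negated MSE
train_scores, val_scores = validation_curve(
    pipeline, X_vc, y_vc,
    param_name="poly__degree", param_range=degrees,
    cv=5, scoring="neg_mean_squared_error",
)

train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
best_degree = int(degrees[np.argmin(val_mse)])

for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"degree={d}: train MSE={tr:.4f}, validation MSE={va:.4f}")
print(f"Degree with lowest validation MSE: {best_degree}")
```

Training error keeps falling as the degree grows, while validation error is what indicates where added complexity stops helping.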

## Question 3: Performance Metrics Selection for Different Problem Types ⭐⭐⭐

**Problem**: You are working on four different machine learning projects with distinct characteristics:

1. **Fraud Detection**: Highly imbalanced (0.1% fraud cases), high cost of false negatives
2. **Medical Screening**: Imbalanced (5% positive cases), high cost of false negatives
3. **Marketing Response**: Balanced classes, equal cost for both error types
4. **Regression Task**: Predicting house prices with outliers present

For each scenario, determine the most appropriate evaluation metric and explain why it's optimal.

### Theoretical Foundation

**Classification Metrics**:
- **Accuracy**: (TP + TN) / (TP + TN + FP + FN)
- **Precision**: TP / (TP + FP) - Quality of positive predictions
- **Recall (Sensitivity)**: TP / (TP + FN) - Coverage of actual positives
- **F1-Score**: 2 × (Precision × Recall) / (Precision + Recall)
- **ROC-AUC**: Area under Receiver Operating Characteristic curve
- **PR-AUC**: Area under Precision-Recall curve

**Regression Metrics**:
- **MSE**: Mean Squared Error - Penalizes large errors heavily
- **MAE**: Mean Absolute Error - Less sensitive to outliers than MSE
- **MAPE**: Mean Absolute Percentage Error - Scale-independent, but undefined when a true value is zero
- **R²**: Coefficient of determination - Proportion of variance explained
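
As a sanity check, the classification formulas above can be computed by hand from confusion-matrix counts and compared with scikit-learn. The labels and error values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

# Made-up labels: 4 positives, 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy:  {accuracy:.3f} (sklearn: {accuracy_score(y_true, y_pred):.3f})")
print(f"Precision: {precision:.3f} (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"Recall:    {recall:.3f} (sklearn: {recall_score(y_true, y_pred):.3f})")
print(f"F1:        {f1:.3f} (sklearn: {f1_score(y_true, y_pred):.3f})")

# Regression: one large error moves MSE far more than MAE
errors = np.array([1.0, 1.0, 1.0, 10.0])
mae_demo = np.mean(np.abs(errors))
mse_demo = np.mean(errors ** 2)
print(f"MAE={mae_demo:.2f}, MSE={mse_demo:.2f}")
```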

In [None]:
# Scenario 1: Fraud Detection (Extremely Imbalanced)
print("=" * 60)
print("SCENARIO 1: Fraud Detection (0.1% fraud rate)")
print("=" * 60)

# Create extremely imbalanced dataset
n_samples = 10000
fraud_rate = 0.001  # 0.1% fraud

# Use the labels generated together with the features so the features carry
# signal about fraud. flip_y=0 disables random label noise, which would
# otherwise outnumber the handful of true fraud cases.
X_fraud, y_fraud = make_classification(
    n_samples=n_samples,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[1-fraud_rate, fraud_rate],
    flip_y=0,
    random_state=42
)

n_fraud = int(y_fraud.sum())
n_normal = n_samples - n_fraud

print(f"Dataset: {n_samples:,} transactions")
print(f"Fraud cases: {n_fraud} ({fraud_rate*100:.1f}%)")
print(f"Normal cases: {n_normal} ({(1-fraud_rate)*100:.1f}%)")

# Train models
X_train, X_test, y_train, y_test = train_test_split(
    X_fraud, y_fraud, test_size=0.3, stratify=y_fraud, random_state=42
)

# Compare different models
models_fraud = {
    'Naive Classifier': None,  # Placeholder: handled explicitly in the loop below
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Balanced RF': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
}

fraud_results = []

for name, model in models_fraud.items():
    if name == 'Naive Classifier':
        # Naive classifier that always predicts majority class
        y_pred = np.zeros(len(y_test))
        y_pred_proba = np.column_stack([np.ones(len(y_test))*0.999, np.ones(len(y_test))*0.001])
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)
    
    # Calculate various metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    
    if len(np.unique(y_pred_proba[:, 1])) > 1:
        roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
        # PR-AUC (Precision-Recall AUC)
        from sklearn.metrics import average_precision_score
        pr_auc = average_precision_score(y_test, y_pred_proba[:, 1])
    else:
        roc_auc = 0.5
        pr_auc = fraud_rate
    
    fraud_results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1': f1,
        'ROC-AUC': roc_auc,
        'PR-AUC': pr_auc
    })

# Display results
fraud_df = pd.DataFrame(fraud_results)
print("\nMetrics Comparison:")
print(fraud_df.round(4))

print("\n📊 Analysis for Fraud Detection:")
print("❌ Accuracy is misleading (99.9% by always predicting 'no fraud')")
print("❌ ROC-AUC can be optimistic for extreme imbalance")
print("✅ RECOMMENDED: PR-AUC (Precision-Recall AUC)")
print("   - Focuses on positive class performance")
print("   - Not affected by large number of true negatives")
print("   - Baseline = positive class proportion")
print("✅ ALSO CONSIDER: Recall (to minimize false negatives)")

In [None]:
# Scenario 2: Medical Screening (Moderately Imbalanced)
print("\n" + "=" * 60)
print("SCENARIO 2: Medical Screening (5% positive rate)")
print("=" * 60)

# Create medical dataset
X_medical, y_medical = make_classification(
    n_samples=2000,
    n_features=15,
    n_informative=12,
    n_classes=2,
    weights=[0.95, 0.05],  # 5% positive cases
    random_state=42
)

print(f"Medical dataset: {len(y_medical):,} patients")
print(f"Positive cases: {sum(y_medical)} ({sum(y_medical)/len(y_medical)*100:.1f}%)")

# Train model
X_train_med, X_test_med, y_train_med, y_test_med = train_test_split(
    X_medical, y_medical, test_size=0.3, stratify=y_medical, random_state=42
)

model_med = RandomForestClassifier(n_estimators=100, random_state=42)
model_med.fit(X_train_med, y_train_med)
y_pred_med = model_med.predict(X_test_med)
y_pred_proba_med = model_med.predict_proba(X_test_med)

# Calculate comprehensive metrics
print("\nDetailed Classification Report:")
print(classification_report(y_test_med, y_pred_med, target_names=['Negative', 'Positive']))

# Confusion Matrix
cm = confusion_matrix(y_test_med, y_pred_med)
print("\nConfusion Matrix:")
print(f"                Predicted")
print(f"Actual    Neg    Pos")
print(f"Neg     {cm[0,0]:4d}   {cm[0,1]:4d}")
print(f"Pos     {cm[1,0]:4d}   {cm[1,1]:4d}")

# Cost-sensitive analysis
tn, fp, fn, tp = cm.ravel()
cost_fn = 100  # Cost of missing a positive case
cost_fp = 10   # Cost of false alarm

total_cost = fn * cost_fn + fp * cost_fp
print(f"\nCost Analysis:")
print(f"False Negatives: {fn} × ${cost_fn} = ${fn * cost_fn}")
print(f"False Positives: {fp} × ${cost_fp} = ${fp * cost_fp}")
print(f"Total Cost: ${total_cost}")

# ROC and PR curves
from sklearn.metrics import precision_recall_curve, roc_curve, average_precision_score

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test_med, y_pred_proba_med[:, 1])
roc_auc_med = roc_auc_score(y_test_med, y_pred_proba_med[:, 1])

axes[0].plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc_med:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', label='Random Classifier')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve - Medical Screening')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test_med, y_pred_proba_med[:, 1])
pr_auc_med = average_precision_score(y_test_med, y_pred_proba_med[:, 1])
baseline_precision = sum(y_test_med) / len(y_test_med)

axes[1].plot(recall, precision, label=f'PR Curve (AUC = {pr_auc_med:.3f})')
axes[1].axhline(y=baseline_precision, color='k', linestyle='--', label=f'Baseline ({baseline_precision:.3f})')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Analysis for Medical Screening:")
print("✅ RECOMMENDED: F1-Score or PR-AUC")
print("   - Balances precision and recall")
print("   - Appropriate for moderate imbalance")
print("✅ ALSO CONSIDER: Recall (high sensitivity needed)")
print("✅ COST-SENSITIVE: Custom metric based on misclassification costs")

In [None]:
# Scenario 3: Marketing Response (Balanced)
print("\n" + "=" * 60)
print("SCENARIO 3: Marketing Response (Balanced Classes)")
print("=" * 60)

# Create balanced dataset
X_marketing, y_marketing = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_classes=2,
    weights=[0.5, 0.5],  # Balanced
    random_state=42
)

print(f"Marketing dataset: {len(y_marketing):,} customers")
print(f"Response rate: {sum(y_marketing)} ({sum(y_marketing)/len(y_marketing)*100:.1f}%)")

# Train model
X_train_mrkt, X_test_mrkt, y_train_mrkt, y_test_mrkt = train_test_split(
    X_marketing, y_marketing, test_size=0.3, random_state=42
)

model_mrkt = LogisticRegression(random_state=42)
model_mrkt.fit(X_train_mrkt, y_train_mrkt)
y_pred_mrkt = model_mrkt.predict(X_test_mrkt)
y_pred_proba_mrkt = model_mrkt.predict_proba(X_test_mrkt)

# Comprehensive evaluation for balanced case
accuracy_mrkt = accuracy_score(y_test_mrkt, y_pred_mrkt)
precision_mrkt = precision_score(y_test_mrkt, y_pred_mrkt)
recall_mrkt = recall_score(y_test_mrkt, y_pred_mrkt)
f1_mrkt = f1_score(y_test_mrkt, y_pred_mrkt)
roc_auc_mrkt = roc_auc_score(y_test_mrkt, y_pred_proba_mrkt[:, 1])

print(f"\nPerformance Metrics:")
print(f"Accuracy: {accuracy_mrkt:.3f}")
print(f"Precision: {precision_mrkt:.3f}")
print(f"Recall: {recall_mrkt:.3f}")
print(f"F1-Score: {f1_mrkt:.3f}")
print(f"ROC-AUC: {roc_auc_mrkt:.3f}")

print("\n📊 Analysis for Marketing Response:")
print("✅ RECOMMENDED: Accuracy or F1-Score")
print("   - Classes are balanced")
print("   - Equal cost for both error types")
print("   - Accuracy is interpretable and reliable")
print("✅ ALSO GOOD: ROC-AUC for ranking customers")

In [None]:
# Scenario 4: House Price Regression (With Outliers)
print("\n" + "=" * 60)
print("SCENARIO 4: House Price Regression (With Outliers)")
print("=" * 60)

# Create regression dataset with outliers
np.random.seed(42)
n_houses = 500
X_houses = np.random.randn(n_houses, 5)  # Features: size, location, age, etc.
true_prices = 200000 + 50000 * X_houses[:, 0] + 30000 * X_houses[:, 1] + np.random.randn(n_houses) * 10000

# Add outliers (luxury mansions)
n_outliers = 20
outlier_indices = np.random.choice(n_houses, n_outliers, replace=False)
true_prices[outlier_indices] *= 3  # Make some houses 3x more expensive

y_houses = true_prices

print(f"House dataset: {n_houses} properties")
print(f"Price range: ${y_houses.min():,.0f} - ${y_houses.max():,.0f}")
print(f"Median price: ${np.median(y_houses):,.0f}")
print(f"Number of outliers: {n_outliers}")

# Split data
X_train_house, X_test_house, y_train_house, y_test_house = train_test_split(
    X_houses, y_houses, test_size=0.3, random_state=42
)

# Train different models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import HuberRegressor  # Less sensitive to outliers than least squares

models_regression = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Huber Regressor': HuberRegressor()
}

regression_results = []

for name, model in models_regression.items():
    model.fit(X_train_house, y_train_house)
    y_pred_house = model.predict(X_test_house)
    
    # Calculate different regression metrics
    mse = mean_squared_error(y_test_house, y_pred_house)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test_house, y_pred_house)
    r2 = r2_score(y_test_house, y_pred_house)
    
    # MAPE (Mean Absolute Percentage Error)
    mape = np.mean(np.abs((y_test_house - y_pred_house) / y_test_house)) * 100
    
    # Median Absolute Error (robust to outliers)
    median_ae = np.median(np.abs(y_test_house - y_pred_house))
    
    regression_results.append({
        'Model': name,
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'MAPE (%)': mape,
        'Median AE': median_ae
    })

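# Display results
regression_df = pd.DataFrame(regression_results)
print("\nRegression Metrics Comparison:")
# MSE is in squared dollars, so only the dollar-scale metrics get a $ prefix
regression_df['MSE'] = regression_df['MSE'].apply(lambda x: f"{x:,.0f}")
for col in ['RMSE', 'MAE', 'Median AE']:
    regression_df[col] = regression_df[col].apply(lambda x: f"${x:,.0f}")
print(regression_df)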

# Visualize predictions vs actual
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, (name, model) in enumerate(models_regression.items()):
    y_pred = model.predict(X_test_house)
    
    axes[i].scatter(y_test_house, y_pred, alpha=0.6)
    axes[i].plot([y_test_house.min(), y_test_house.max()], 
                [y_test_house.min(), y_test_house.max()], 'r--', lw=2)
    axes[i].set_xlabel('Actual Price ($)')
    axes[i].set_ylabel('Predicted Price ($)')
    axes[i].set_title(f'{name}')
    axes[i].grid(True, alpha=0.3)
    
    # Add R² to plot
    r2_val = r2_score(y_test_house, y_pred)
    axes[i].text(0.05, 0.95, f'R² = {r2_val:.3f}', 
                transform=axes[i].transAxes, fontsize=12, 
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

print("\n📊 Analysis for House Price Regression:")
print("❌ MSE/RMSE: squared errors let a few outliers dominate the score")
print("✅ RECOMMENDED: MAE (Mean Absolute Error)")
print("   - Less sensitive to outliers than MSE/RMSE")
print("   - Interpretable in dollar terms")
print("   - Linear penalty for errors")
print("✅ ALSO CONSIDER: Median Absolute Error for extreme outliers")
print("✅ FOR EXPLANATION: R² for variance explained")

### Summary: Metric Selection Guidelines

| Problem Type | Best Metric | Reason |
|--------------|-------------|--------|
| **Extreme Imbalance** | PR-AUC, Recall | Focus on positive class |
| **Moderate Imbalance** | F1-Score, PR-AUC | Balance precision/recall |
| **Balanced Classes** | Accuracy, ROC-AUC | Simple and reliable |
| **Cost-Sensitive** | Custom Cost Function | Reflect business impact |
| **Regression (Normal)** | RMSE, R² | Standard and interpretable |
| **Regression (Outliers)** | MAE, Median AE | Robust to outliers |
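
The last two rows of the table can be checked with a small worked example. The sketch below uses made-up prices and predictions (not the notebook's housing data) and compares how RMSE, MAE, and median absolute error react when a single prediction is badly wrong.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, median_absolute_error
)

# Hypothetical house prices (dollars) and predictions, made up for illustration
y_true_demo = np.array([200_000, 250_000, 300_000, 350_000, 400_000], dtype=float)
y_pred_clean = y_true_demo + np.array([5_000, -5_000, 5_000, -5_000, 5_000])

# Same predictions, except the last sale is missed by $500,000
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = y_true_demo[-1] + 500_000


def summarize(y_true, y_pred):
    """Return RMSE, MAE, and median absolute error as a dict."""
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "Median AE": float(median_absolute_error(y_true, y_pred)),
    }


clean = summarize(y_true_demo, y_pred_clean)
outlier = summarize(y_true_demo, y_pred_outlier)
for metric in clean:
    print(f"{metric:>9}: clean = {clean[metric]:>10,.0f}   "
          f"with outlier = {outlier[metric]:>10,.0f}")
```

With one extreme miss, RMSE moves the most, MAE moves less, and the median absolute error does not move at all, which is the ordering the table relies on.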

### Key Principles:
1. **Consider class distribution** and business context
2. **Match metric to problem objectives**
3. **Use multiple metrics** for comprehensive evaluation
4. **Validate on realistic test conditions**
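
As an illustration of principles 2 and 3, the sketch below scores one classifier with several metrics at once, including a custom cost function. The cost values, the synthetic dataset, and the names `average_cost` and `cost_scorer` are assumptions made for this example. `average_precision` is used here as the usual scikit-learn summary of the precision-recall curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_validate

# Hypothetical business costs: a missed positive costs 10x a false alarm
COST_FALSE_NEGATIVE = 10.0
COST_FALSE_POSITIVE = 1.0


def average_cost(y_true, y_pred):
    """Mean misclassification cost per sample (lower is better)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (COST_FALSE_NEGATIVE * fn + COST_FALSE_POSITIVE * fp) / len(y_true)


# greater_is_better=False makes scikit-learn negate the value,
# so a higher (less negative) scorer output still means a better model
cost_scorer = make_scorer(average_cost, greater_is_better=False)

# Synthetic imbalanced problem (about 5% positives), for illustration only
X_demo, y_demo = make_classification(
    n_samples=2000, weights=[0.95, 0.05], random_state=42
)

cv_demo = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo, cv=5,
    scoring={
        "cost": cost_scorer,
        "pr_auc": "average_precision",
        "f1": "f1",
        "recall": "recall",
    },
)

for key in ("cost", "pr_auc", "f1", "recall"):
    scores = cv_demo[f"test_{key}"]
    print(f"{key:>7}: {scores.mean():.3f} ± {scores.std():.3f}")
```

The `cost` row is reported as a negative number because of the sign flip. Its magnitude is the average cost per sample under the assumed cost values.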

## Question 4: Model Selection and Comparison Framework ⭐⭐⭐

**Problem**: You need to select the best model among several candidates for a classification task. Implement a rigorous model comparison framework that accounts for:
1. Statistical significance of performance differences
2. Multiple evaluation metrics
3. Computational efficiency
4. Model interpretability

Create a scoring system that combines these factors to make an objective model selection decision.

### Theoretical Foundation

**Statistical Testing for Model Comparison**:
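- **Paired t-test**: Compare mean performance across CV folds (both models must be scored on the same folds; because the training sets overlap, the test tends to be optimistic)
- **McNemar's test**: Compare error patterns on the same test set
- **Wilcoxon signed-rank test**: Non-parametric alternative to the paired t-test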
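
The three tests can be sketched side by side as follows. The choice of models, the 10-fold split, and the 70/30 hold-out are assumptions for illustration. McNemar's test is computed by hand with the continuity-corrected chi-square approximation, which is rough when the two models disagree on only a few samples.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_t, y_t = load_breast_cancer(return_X_y=True)
model_a = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_b = GaussianNB()

# Use the same folds for both models so the per-fold scores are paired
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores_a = cross_val_score(model_a, X_t, y_t, cv=folds, scoring="accuracy")
scores_b = cross_val_score(model_b, X_t, y_t, cv=folds, scoring="accuracy")

# Paired t-test and Wilcoxon signed-rank test on the per-fold differences
p_ttest = stats.ttest_rel(scores_a, scores_b).pvalue
diffs = scores_a - scores_b
# wilcoxon is undefined when every difference is zero, so guard that case
p_wilcoxon = stats.wilcoxon(diffs).pvalue if np.any(diffs != 0) else 1.0

# McNemar's test: compare which test samples each model gets right
X_tr, X_te, y_tr, y_te = train_test_split(
    X_t, y_t, test_size=0.3, stratify=y_t, random_state=42
)
correct_a = model_a.fit(X_tr, y_tr).predict(X_te) == y_te
correct_b = model_b.fit(X_tr, y_tr).predict(X_te) == y_te
n_a_only = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
n_b_only = int(np.sum(~correct_a & correct_b))  # A wrong, B right

if n_a_only + n_b_only > 0:
    chi2_stat = (abs(n_a_only - n_b_only) - 1) ** 2 / (n_a_only + n_b_only)
    p_mcnemar = float(stats.chi2.sf(chi2_stat, df=1))
else:
    p_mcnemar = 1.0  # the models never disagree on this test set

print(f"Paired t-test p-value:  {p_ttest:.4f}")
print(f"Wilcoxon p-value:       {p_wilcoxon:.4f}")
print(f"McNemar p-value:        {p_mcnemar:.4f} "
      f"(A-only correct: {n_a_only}, B-only correct: {n_b_only})")
```

Only the discordant samples (one model right, the other wrong) enter McNemar's statistic, which is why it needs predictions on a shared test set and not fold-level scores.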

**Multi-Criteria Decision Making**:
- **Weighted scoring**: Combine normalized metrics
- **Pareto analysis**: Trade-off between competing objectives
- **Business constraints**: Incorporate practical limitations
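
Pareto analysis can be sketched without choosing any weights: a model is kept if no other model is at least as good on every criterion and strictly better on one. The function name `pareto_front` and the candidate values below are hypothetical, made up for illustration.

```python
import numpy as np


def pareto_front(points):
    """Boolean mask of non-dominated rows, assuming higher is better in every column."""
    points = np.asarray(points, dtype=float)
    is_efficient = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        # p is dominated if another row is >= on every criterion and > on at least one
        dominated_by = np.all(points >= p, axis=1) & np.any(points > p, axis=1)
        if dominated_by.any():
            is_efficient[i] = False
    return is_efficient


# Hypothetical (accuracy, speed score, interpretability) values
candidates = {
    "Model A": (0.97, 0.20, 0.3),
    "Model B": (0.95, 0.90, 0.9),
    "Model C": (0.94, 0.80, 0.6),  # worse than Model B on every criterion
    "Model D": (0.96, 0.95, 0.2),
}

mask = pareto_front(list(candidates.values()))
pareto_models = [name for name, keep in zip(candidates, mask) if keep]
print("Pareto-optimal candidates:", pareto_models)
```

Weighted scoring then only has to choose among the Pareto-optimal candidates, since a dominated model cannot win under any set of non-negative weights that puts positive weight on a criterion where it is strictly worse.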

In [None]:
import time
from scipy import stats
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

class ModelComparison:
    def __init__(self, X, y, cv_folds=5):
        self.X = X
        self.y = y
        self.cv_folds = cv_folds
        self.results = {}
        
    def evaluate_model(self, name, model, scoring=('accuracy', 'f1', 'roc_auc')):
        """Evaluate a single model with cross-validation."""
        print(f"Evaluating {name}...")
        
        # Cross-validation with multiple metrics; cross_validate also
        # records per-fold fit and scoring times
        cv_results = cross_validate(
            model, self.X, self.y, 
            cv=self.cv_folds, 
            scoring=list(scoring),
            return_train_score=True,
            n_jobs=-1
        )
        
        # Average time to fit one fold (excludes scoring and parallel overhead)
        training_time = np.mean(cv_results['fit_time'])
        
        # Measure prediction time (on the full dataset, for timing only)
        model.fit(self.X, self.y)
        start_time = time.perf_counter()
        _ = model.predict(self.X)
        prediction_time = time.perf_counter() - start_time
        
        # Store results
        self.results[name] = {
            'cv_results': cv_results,
            'training_time': training_time,
            'prediction_time': prediction_time,
            'model': model
        }
        
        # Calculate summary statistics
        for metric in scoring:
            test_scores = cv_results[f'test_{metric}']
            self.results[name][f'{metric}_mean'] = np.mean(test_scores)
            self.results[name][f'{metric}_std'] = np.std(test_scores)
            self.results[name][f'{metric}_scores'] = test_scores
    
    def statistical_comparison(self, model1, model2, metric='accuracy'):
        """Paired t-test on per-fold CV scores (both models must share the same folds)."""
        scores1 = self.results[model1][f'{metric}_scores']
        scores2 = self.results[model2][f'{metric}_scores']
        
        # Paired t-test
        t_stat, p_value = stats.ttest_rel(scores1, scores2)
        
        return {
            't_statistic': t_stat,
            'p_value': p_value,
            'significant': p_value < 0.05,
            'better_model': model1 if np.mean(scores1) > np.mean(scores2) else model2
        }
    
    def create_comparison_table(self):
        """Create comprehensive comparison table."""
        comparison_data = []
        
        for name, results in self.results.items():
            comparison_data.append({
                'Model': name,
                'Accuracy': f"{results['accuracy_mean']:.3f} ± {results['accuracy_std']:.3f}",
                'F1-Score': f"{results['f1_mean']:.3f} ± {results['f1_std']:.3f}",
                'ROC-AUC': f"{results['roc_auc_mean']:.3f} ± {results['roc_auc_std']:.3f}",
                'Train Time (s)': f"{results['training_time']:.2f}",
                'Pred Time (s)': f"{results['prediction_time']:.4f}"
            })
        
        return pd.DataFrame(comparison_data)
    
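    def multi_criteria_scoring(self, weights=None):
        """Calculate multi-criteria scores for model selection (weights should sum to 1)."""
        if weights is None:
            # Default weighting; avoids a mutable default argument
            weights = {'accuracy': 0.3, 'f1': 0.3, 'roc_auc': 0.2,
                       'speed': 0.1, 'interpretability': 0.1}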
        
        # Interpretability scores (subjective, domain-dependent)
        interpretability_scores = {
            'Logistic Regression': 0.9,
            'Naive Bayes': 0.8,
            'Random Forest': 0.6,
            'SVM': 0.3,
            'Gradient Boosting': 0.4,
            'Neural Network': 0.2
        }
        
        # Normalize metrics (higher is better)
        models = list(self.results.keys())
        
        # Performance metrics (already 0-1, higher is better)
        accuracy_scores = [self.results[m]['accuracy_mean'] for m in models]
        f1_scores = [self.results[m]['f1_mean'] for m in models]
        roc_auc_scores = [self.results[m]['roc_auc_mean'] for m in models]
        
        # Speed metrics (lower is better, so invert)
        total_times = [self.results[m]['training_time'] + self.results[m]['prediction_time'] 
                      for m in models]
        max_time = max(total_times)
        speed_scores = [(max_time - t) / max_time for t in total_times]
        
        # Calculate weighted scores
        final_scores = []
        for i, model in enumerate(models):
            score = (
                weights['accuracy'] * accuracy_scores[i] +
                weights['f1'] * f1_scores[i] +
                weights['roc_auc'] * roc_auc_scores[i] +
                weights['speed'] * speed_scores[i] +
                weights['interpretability'] * interpretability_scores.get(model, 0.5)
            )
            final_scores.append(score)
        
        # Create ranking
        ranking_data = []
        for i, model in enumerate(models):
            ranking_data.append({
                'Model': model,
                'Accuracy': accuracy_scores[i],
                'F1': f1_scores[i],
                'ROC-AUC': roc_auc_scores[i],
                'Speed': speed_scores[i],
                'Interpretability': interpretability_scores.get(model, 0.5),
                'Final Score': final_scores[i]
            })
        
        ranking_df = pd.DataFrame(ranking_data)
        ranking_df = ranking_df.sort_values('Final Score', ascending=False)
        ranking_df['Rank'] = range(1, len(ranking_df) + 1)
        
        return ranking_df

# Load dataset for comparison
data = load_breast_cancer()
X_comp, y_comp = data.data, data.target

# Standardize features inside each pipeline so the scaler is fit on the
# training folds only; scaling the full dataset before cross-validation
# would leak test-fold statistics into training
from sklearn.pipeline import make_pipeline

print("Model Comparison Framework")
print("=" * 50)
print(f"Dataset: {X_comp.shape[0]} samples, {X_comp.shape[1]} features")
print("Task: Binary classification (breast cancer detection)")

# Initialize comparison framework
comparator = ModelComparison(X_comp, y_comp, cv_folds=5)

# Define models to compare
models_to_compare = {
    'Logistic Regression': make_pipeline(
        StandardScaler(), LogisticRegression(random_state=42, max_iter=1000)),
    'Naive Bayes': make_pipeline(StandardScaler(), GaussianNB()),
    'Random Forest': make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42)),
    'SVM': make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
    'Gradient Boosting': make_pipeline(
        StandardScaler(), GradientBoostingClassifier(random_state=42)),
    'Neural Network': make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(100,), random_state=42, max_iter=500))
}

# Evaluate all models
for name, model in models_to_compare.items():
    comparator.evaluate_model(name, model)

print("\nModel evaluation completed!")

In [None]:
# Display comparison table
print("\nPerformance Comparison:")
comparison_table = comparator.create_comparison_table()
print(comparison_table.to_string(index=False))

# Statistical significance testing
print("\n" + "=" * 60)
print("STATISTICAL SIGNIFICANCE TESTING")
print("=" * 60)

# Compare top 3 models
accuracy_means = {name: results['accuracy_mean'] for name, results in comparator.results.items()}
top_3_models = sorted(accuracy_means.items(), key=lambda x: x[1], reverse=True)[:3]

print(f"\nTop 3 models by accuracy:")
for i, (model, acc) in enumerate(top_3_models, 1):
    print(f"{i}. {model}: {acc:.3f}")

print("\nPairwise statistical comparisons (accuracy):")
for i in range(len(top_3_models)):
    for j in range(i+1, len(top_3_models)):
        model1, model2 = top_3_models[i][0], top_3_models[j][0]
        comparison = comparator.statistical_comparison(model1, model2, 'accuracy')
        
        significance = "✅ Significant" if comparison['significant'] else "❌ Not significant"
        print(f"{model1} vs {model2}:")
        print(f"   p-value: {comparison['p_value']:.4f} ({significance})")
        print(f"   Better: {comparison['better_model']}")
        print()

In [None]:
# Multi-criteria decision making
print("=" * 60)
print("MULTI-CRITERIA MODEL SELECTION")
print("=" * 60)

# Define different weighting scenarios
scenarios = {
    'Performance Focus': {'accuracy': 0.4, 'f1': 0.3, 'roc_auc': 0.3, 'speed': 0.0, 'interpretability': 0.0},
    'Balanced': {'accuracy': 0.3, 'f1': 0.3, 'roc_auc': 0.2, 'speed': 0.1, 'interpretability': 0.1},
    'Production Focus': {'accuracy': 0.2, 'f1': 0.2, 'roc_auc': 0.1, 'speed': 0.3, 'interpretability': 0.2},
    'Interpretability Focus': {'accuracy': 0.2, 'f1': 0.2, 'roc_auc': 0.1, 'speed': 0.1, 'interpretability': 0.4}
}

for scenario_name, weights in scenarios.items():
    print(f"\n{scenario_name} Scenario:")
    ranking = comparator.multi_criteria_scoring(weights)
    print(ranking[['Rank', 'Model', 'Final Score']].head(3).to_string(index=False))

# Detailed ranking for balanced scenario
print("\n" + "=" * 60)
print("DETAILED RANKING (Balanced Scenario)")
print("=" * 60)

balanced_ranking = comparator.multi_criteria_scoring(scenarios['Balanced'])
print(balanced_ranking.round(3).to_string(index=False))

# Visualization of trade-offs
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Performance vs Speed trade-off
models = balanced_ranking['Model'].values
performance = balanced_ranking['Accuracy'].values
speed = balanced_ranking['Speed'].values
interpretability = balanced_ranking['Interpretability'].values

scatter = axes[0,0].scatter(speed, performance, c=interpretability, s=100, alpha=0.7, cmap='viridis')
for i, model in enumerate(models):
    axes[0,0].annotate(model.split()[0], (speed[i], performance[i]), 
                      xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[0,0].set_xlabel('Speed Score (normalized)')
axes[0,0].set_ylabel('Accuracy')
axes[0,0].set_title('Performance vs Speed Trade-off')
axes[0,0].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[0,0], label='Interpretability')

# F1 vs ROC-AUC
f1_scores = balanced_ranking['F1'].values
roc_scores = balanced_ranking['ROC-AUC'].values
axes[0,1].scatter(f1_scores, roc_scores, s=100, alpha=0.7)
for i, model in enumerate(models):
    axes[0,1].annotate(model.split()[0], (f1_scores[i], roc_scores[i]), 
                      xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[0,1].set_xlabel('F1-Score')
axes[0,1].set_ylabel('ROC-AUC')
axes[0,1].set_title('F1 vs ROC-AUC Performance')
axes[0,1].grid(True, alpha=0.3)

# Final scores comparison
final_scores = balanced_ranking['Final Score'].values
colors = plt.cm.RdYlGn(np.linspace(0.3, 0.9, len(models)))
bars = axes[1,0].bar(range(len(models)), final_scores, color=colors)
axes[1,0].set_xlabel('Model Rank')
axes[1,0].set_ylabel('Final Score')
axes[1,0].set_title('Multi-Criteria Final Scores')
axes[1,0].set_xticks(range(len(models)))
axes[1,0].set_xticklabels([m.split()[0] for m in models], rotation=45)
axes[1,0].grid(True, alpha=0.3)

# Component contribution
components = ['Accuracy', 'F1', 'ROC-AUC', 'Speed', 'Interpretability']
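# Look weights up by key so they stay aligned with `components`
weight_keys = ['accuracy', 'f1', 'roc_auc', 'speed', 'interpretability']
weights_balanced = [scenarios['Balanced'][k] for k in weight_keys]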
best_model_idx = 0  # Top ranked model
best_model_scores = [
    balanced_ranking.iloc[best_model_idx]['Accuracy'],
    balanced_ranking.iloc[best_model_idx]['F1'],
    balanced_ranking.iloc[best_model_idx]['ROC-AUC'],
    balanced_ranking.iloc[best_model_idx]['Speed'],
    balanced_ranking.iloc[best_model_idx]['Interpretability']
]
contributions = [w * s for w, s in zip(weights_balanced, best_model_scores)]

axes[1,1].pie(contributions, labels=components, autopct='%1.1f%%', startangle=90)
axes[1,1].set_title(f'Score Contribution - {models[0]}')

plt.tight_layout()
plt.show()

# Final recommendation
best_model_name = balanced_ranking.iloc[0]['Model']
best_score = balanced_ranking.iloc[0]['Final Score']

print(f"\n🏆 FINAL RECOMMENDATION: {best_model_name}")
print(f"   Overall Score: {best_score:.3f}")
print("   Rationale: Highest weighted score under the 'Balanced' scenario")
print("   Evaluation: 5-fold cross-validation; check the pairwise tests above before treating the lead as significant")

### Key Takeaways for Model Selection Framework

1. **Statistical Validation**: Always test for significance of performance differences
2. **Multi-Criteria Evaluation**: Consider performance, speed, interpretability, and business constraints
3. **Scenario-Based Selection**: Adapt weighting based on deployment requirements
4. **Cross-Validation**: Use robust evaluation methods for reliable estimates
5. **Documentation**: Clearly document selection criteria and trade-offs made
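
On takeaway 1: a plain paired t-test on CV folds treats the fold scores as independent, although their training sets overlap. One commonly used adjustment is the corrected resampled t-test of Nadeau and Bengio, which inflates the variance by the test-to-train size ratio. The sketch below is a minimal version under that assumption. The function name and the fold scores are hypothetical, and the sizes correspond roughly to 5-fold CV on 569 samples.

```python
import numpy as np
from scipy import stats


def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Corrected resampled t-test for paired per-fold scores.

    Variance is scaled by (1/k + n_test/n_train) instead of 1/k,
    where k is the number of folds.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    k = len(diffs)
    var = np.var(diffs, ddof=1)
    if var == 0:
        # Identical difference in every fold: the statistic is undefined
        return float("nan"), float("nan")
    t_stat = diffs.mean() / np.sqrt((1.0 / k + n_test / n_train) * var)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=k - 1)
    return float(t_stat), float(p_value)


# Hypothetical per-fold accuracies for two models on the same 5 folds
fold_scores_a = [0.98, 0.96, 0.97, 0.99, 0.97]
fold_scores_b = [0.96, 0.95, 0.97, 0.97, 0.96]

t_corr, p_corr = corrected_paired_ttest(
    fold_scores_a, fold_scores_b, n_train=455, n_test=114
)
p_plain = float(stats.ttest_rel(fold_scores_a, fold_scores_b).pvalue)

print(f"Plain paired t-test p-value: {p_plain:.4f}")
print(f"Corrected t-test p-value:    {p_corr:.4f}")
```

The corrected p-value is never smaller than the plain one, so a difference that survives the correction is a more conservative basis for preferring one model.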

### Model Selection Checklist:
- ✅ Performance metrics appropriate for the problem
- ✅ Statistical significance testing
- ✅ Computational efficiency assessment
- ✅ Interpretability requirements
- ✅ Business constraints and deployment context
- ✅ Robustness to different data distributions

## Summary and Best Practices

This notebook covered essential concepts in model evaluation and validation:

### 🎯 Key Learning Points:

1. **Cross-Validation Strategy Selection**
   - Match CV strategy to data characteristics
   - Stratified K-Fold for imbalanced data
   - Time Series Split for temporal data
   - Simple splits for large datasets

2. **Bias-Variance Tradeoff**
   - Understand components of prediction error
   - Optimize model complexity for best generalization
   - Use validation curves to find optimal parameters

3. **Performance Metrics Selection**
   - Choose metrics based on problem type and business impact
   - PR-AUC for extreme imbalance
   - F1-Score for moderate imbalance
   - MAE for robust regression

4. **Model Selection Framework**
   - Statistical testing for significance
   - Multi-criteria decision making
   - Consider practical constraints

### 🚀 Next Steps:
- Practice with different datasets and problem types
- Implement custom evaluation metrics for specific business needs
- Explore ensemble methods for improved performance
- Study advanced topics like model calibration and uncertainty quantification