# SVC and Neural Network Models: Margin-Based and Deep Learning Approaches

This notebook implements two advanced models with GridSearchCV hyperparameter tuning:
1. **SVC (RBF Kernel)** - Margin-based classifier with non-linear decision boundary
2. **MLPClassifier** - Feedforward neural network for tabular classification

Each model is trained and evaluated on three datasets:
- **Raw/Original** dataset (imbalanced)
- **SMOTE Balanced** dataset (synthetic oversampling)
- **ADASYN Balanced** dataset (adaptive synthetic oversampling)

**Approach:**
- GridSearchCV with `f1` scoring on the positive class (better suited to imbalanced classification than accuracy)
- Centralized metrics storage for all model-dataset combinations
- Storage of predicted probabilities for ROC curve plotting (Task 5)
- Best hyperparameter storage for each model-dataset combination
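
To see why accuracy alone is a weak tuning target here, the sketch below scores a hypothetical classifier that always predicts the majority class. The class counts come from the Attrition distribution printed in Step 1; the always-negative classifier is an assumption made only for illustration.

```python
# Class counts taken from the Attrition distribution printed in Step 1.
n_negative = 1233  # Attrition = 0
n_positive = 237   # Attrition = 1
n_total = n_negative + n_positive

# Hypothetical classifier that always predicts the majority class (0).
tp, fp = 0, 0
tn, fn = n_negative, n_positive

baseline_accuracy = (tp + tn) / n_total
baseline_recall = tp / (tp + fn)
# F1 = 2*TP / (2*TP + FP + FN); it is 0 when no positive is ever predicted.
baseline_f1 = 2 * tp / (2 * tp + fp + fn)

print(f"Positive rate:     {n_positive / n_total:.4f}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall:   {baseline_recall:.4f}")
print(f"Baseline F1:       {baseline_f1:.4f}")
```

The baseline reaches roughly 84% accuracy while finding no leavers at all, which is why F1 on the positive class is the more informative score for this data.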

**Prerequisites:** Run notebook 01 (data exploration) first to generate the processed data files.


## Step 1: Load Processed Data from Notebook 01

The following cell loads the pre-processed and scaled dataframe from notebook 01. This ensures we're working with the same processed data without re-running all data cleaning and feature engineering steps.


In [1]:
# ============================================================
# Load Processed Data from Notebook 01
# ============================================================

import pickle
import pandas as pd
import numpy as np
import os

# Check if processed data exists
data_dir = "../data/processed"
pkl_path = f"{data_dir}/df_scaled.pkl"

if os.path.exists(pkl_path):
    # Load the processed dataframe
    with open(pkl_path, "rb") as f:
        df_scaled = pickle.load(f)
    
    # Load the scaler (if needed for future transformations)
    scaler_path = f"{data_dir}/scaler.pkl"
    if os.path.exists(scaler_path):
        with open(scaler_path, "rb") as f:
            scaler = pickle.load(f)
    
    print("=" * 70)
    print("PROCESSED DATA LOADED SUCCESSFULLY")
    print("=" * 70)
    print(f" Loaded from: {pkl_path}")
    print(f" DataFrame shape: {df_scaled.shape}")
    print(f" Columns: {df_scaled.shape[1]}")
    print(f" Rows: {df_scaled.shape[0]}")
    print(f"\nFirst few columns: {list(df_scaled.columns[:10])}")
    print(f"\nAttrition distribution:")
    print(df_scaled["Attrition"].value_counts())
    print(f"\n Data ready for modeling!")
    print("=" * 70)
    
else:
    raise FileNotFoundError(
        f"\n{'='*70}\n"
        f"PROCESSED DATA NOT FOUND\n"
        f"{'='*70}\n"
        f"File not found: {pkl_path}\n\n"
        f"Please run notebook 01 (01-data-exploration.ipynb) first:\n"
        f"1. Execute all cells in notebook 01\n"
        f"2. This will save the processed data to {data_dir}/\n"
        f"3. Then return to this notebook\n"
        f"{'='*70}"
    )


PROCESSED DATA LOADED SUCCESSFULLY
 Loaded from: ../data/processed/df_scaled.pkl
 DataFrame shape: (1470, 29)
 Columns: 29
 Rows: 1470

First few columns: ['Attrition', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobSatisfaction', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike']

Attrition distribution:
Attrition
0    1233
1     237
Name: count, dtype: int64

 Data ready for modeling!


## Step 2: Prepare Three Datasets (Raw, SMOTE, ADASYN)

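Split the data into training and test sets, then build three training sets (raw, SMOTE, ADASYN). Oversampling is applied to the training split only; the test set stays untouched and is shared by all three.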
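
Both oversamplers create synthetic minority rows by interpolating between an existing minority sample and one of its minority-class neighbours; ADASYN additionally generates more rows around minority samples that are surrounded by majority samples. The sketch below shows only the interpolation step. It is a simplified illustration, not imbalanced-learn's implementation: the neighbour search is omitted, and `interpolate_minority`, `x_i`, and `x_neighbor` are hypothetical names and values.

```python
import random

def interpolate_minority(sample, neighbor, rng):
    """Return a synthetic point on the segment between two minority samples."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)
x_i = [1.0, 2.0]         # hypothetical minority-class sample
x_neighbor = [3.0, 6.0]  # hypothetical minority-class neighbour of x_i

synthetic = interpolate_minority(x_i, x_neighbor, rng)
print(synthetic)
```

Because every synthetic row lies between two real minority rows, oversampling densifies the minority region rather than adding new information.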


In [2]:
# ============================================================
# Prepare Three Datasets: Raw, SMOTE, ADASYN
# ============================================================

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, ADASYN

# Separate features and target
X = df_scaled.drop(columns=["Attrition"])
y = df_scaled["Attrition"]

# Train-test split (80-20) with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("=" * 70)
print("TRAIN-TEST SPLIT COMPLETE")
print("=" * 70)
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"\nTraining set Attrition distribution:")
print(y_train.value_counts())
print(f"\nTest set Attrition distribution:")
print(y_test.value_counts())
print("=" * 70)

# Prepare three datasets
training_sets = {}

# 1. Raw/Original dataset (imbalanced)
training_sets['raw'] = {
    'name': 'Raw/Original',
    'description': 'Original imbalanced dataset',
    'X_train': X_train,
    'y_train': y_train
}

# 2. SMOTE balanced dataset
print("\nApplying SMOTE...")
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
training_sets['smote'] = {
    'name': 'SMOTE Balanced',
    'description': 'SMOTE synthetic oversampling',
    'X_train': X_train_smote,
    'y_train': y_train_smote
}
print(f"SMOTE dataset shape: {X_train_smote.shape}")
print(f"SMOTE Attrition distribution:\n{pd.Series(y_train_smote).value_counts()}")

# 3. ADASYN balanced dataset
print("\nApplying ADASYN...")
adasyn = ADASYN(random_state=42)
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train, y_train)
training_sets['adasyn'] = {
    'name': 'ADASYN Balanced',
    'description': 'ADASYN adaptive synthetic oversampling',
    'X_train': X_train_adasyn,
    'y_train': y_train_adasyn
}
print(f"ADASYN dataset shape: {X_train_adasyn.shape}")
print(f"ADASYN Attrition distribution:\n{pd.Series(y_train_adasyn).value_counts()}")

print("\n" + "=" * 70)
print("THREE DATASETS PREPARED SUCCESSFULLY")
print("=" * 70)


TRAIN-TEST SPLIT COMPLETE
Training set shape: (1176, 28)
Test set shape: (294, 28)

Training set Attrition distribution:
Attrition
0    986
1    190
Name: count, dtype: int64

Test set Attrition distribution:
Attrition
0    247
1     47
Name: count, dtype: int64

Applying SMOTE...
SMOTE dataset shape: (1972, 28)
SMOTE Attrition distribution:
Attrition
0    986
1    986
Name: count, dtype: int64

Applying ADASYN...
ADASYN dataset shape: (1911, 28)
ADASYN Attrition distribution:
Attrition
0    986
1    925
Name: count, dtype: int64

THREE DATASETS PREPARED SUCCESSFULLY
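
The shapes in the output above can be checked by hand. The following sketch reproduces them from the printed counts; the rounding used for the stratified split is an approximation that happens to match here, not a description of how `train_test_split` allocates rows internally.

```python
# Counts copied from the cell output above.
n_total, n_positive = 1470, 237
n_train, n_test = 1176, 294
train_counts = {0: 986, 1: 190}

# Stratification keeps the positive rate (about 16.1%) in both splits.
expected_train_pos = round(n_train * n_positive / n_total)
expected_test_pos = round(n_test * n_positive / n_total)

# SMOTE tops up the minority class until both classes have 986 rows.
smote_synthetic = train_counts[0] - train_counts[1]
smote_rows = n_train + smote_synthetic

# ADASYN stopped at 925 positives, so it added fewer synthetic rows.
adasyn_synthetic = 925 - train_counts[1]
adasyn_rows = n_train + adasyn_synthetic

print(expected_train_pos, expected_test_pos)
print(smote_synthetic, smote_rows)
print(adasyn_synthetic, adasyn_rows)
```

ADASYN ends slightly short of a perfect 50/50 balance because the number of rows it generates depends on the local density around each minority sample.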




## Step 3: Imports and Extended Evaluation Function

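Import the required libraries and define two helpers: `evaluate_model`, which returns the standard metrics plus the confusion matrix components (tn, fp, fn, tp), and `train_model_with_gridsearch`, which tunes a model with GridSearchCV and evaluates the best estimator on the test set.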
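
The positive-class metrics all follow from the four confusion matrix cells, which is why storing tn, fp, fn, and tp is enough to re-derive them later. A minimal sketch, using the counts this notebook reports for SVC on the raw dataset; `metrics_from_counts` is a hypothetical helper, not part of the notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive the positive-class metrics from the four confusion-matrix cells."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts reported for SVC on the raw dataset (TN: 229, FP: 18, FN: 24, TP: 23).
svc_raw = metrics_from_counts(tn=229, fp=18, fn=24, tp=23)
for name, value in svc_raw.items():
    print(f"{name}: {value:.4f}")
```

ROC-AUC is the one stored metric that cannot be recovered this way, because it depends on the full ranking of predicted probabilities rather than on a single threshold.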


In [3]:
# ============================================================
# Imports and Extended Evaluation Function
# ============================================================

from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix
)
import pandas as pd
import numpy as np
import inspect

def evaluate_model(y_test, y_pred, y_prob):
    """
    Calculate all required evaluation metrics including confusion matrix components.
    
    Parameters:
    -----------
    y_test : array-like
        True labels
    y_pred : array-like
        Predicted labels
    y_prob : array-like
        Predicted probabilities for positive class
        
    Returns:
    --------
    dict : Dictionary containing all metrics and confusion matrix components
    """
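    # Pin the label order so ravel() always yields (tn, fp, fn, tp),
    # even if one class is missing from both y_test and y_pred
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()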
    
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_prob),
        'confusion_matrix': cm,
        'tn': int(tn),
        'fp': int(fp),
        'fn': int(fn),
        'tp': int(tp)
    }


def train_model_with_gridsearch(model_class, param_grid, X_train, y_train, X_test, y_test,
                                model_name, dataset_name, **model_kwargs):
    """
    Unified function to train any model with GridSearchCV.
    Stores predicted probabilities for ROC curves.
    
    Parameters:
    -----------
    model_class : class
        Model class to instantiate
    param_grid : dict
        Hyperparameter grid for GridSearchCV
    X_train : DataFrame/array
        Training features
    y_train : Series/array
        Training labels
    X_test : DataFrame/array
        Test features
    y_test : Series/array
        Test labels
    model_name : str
        Name of the model (e.g., 'SVC', 'MLP')
    dataset_name : str
        Name of the dataset (e.g., 'raw', 'smote', 'adasyn')
    **model_kwargs : dict
        Additional keyword arguments for model initialization
        
    Returns:
    --------
    dict : Dictionary containing best estimator, metrics, and best parameters
    """
    
    # Check if model accepts random_state parameter
    sig = inspect.signature(model_class.__init__)
    accepts_random_state = 'random_state' in sig.parameters
    
    # Build initialization parameters
    init_params = {}
    if accepts_random_state and 'random_state' not in model_kwargs:
        init_params['random_state'] = 42
    
    # Add any additional kwargs (these will override defaults if they conflict)
    init_params.update(model_kwargs)
    
    # Initialize model
    model = model_class(**init_params)
    
    # GridSearchCV scored on positive-class F1 (harmonic mean of precision and recall)
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        scoring='f1',  # F1 of the positive class (Attrition = 1)
        cv=5,
        n_jobs=-1,
        verbose=1
    )
    
    print(f"  Running GridSearchCV for {model_name} on {dataset_name} dataset...")
    grid_search.fit(X_train, y_train)
    
    # Get best estimator
    best_estimator = grid_search.best_estimator_
    
    # Evaluate on test set
    y_test_pred = best_estimator.predict(X_test)
    y_test_prob = best_estimator.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    metrics = evaluate_model(y_test, y_test_pred, y_test_prob)
    
    print(f"  Best parameters: {grid_search.best_params_}")
    print(f"  Test Metrics - Accuracy: {metrics['accuracy']:.4f}, Precision: {metrics['precision']:.4f}, "
          f"Recall: {metrics['recall']:.4f}, F1: {metrics['f1']:.4f}, ROC-AUC: {metrics['roc_auc']:.4f}")
    print(f"  Confusion Matrix - TN: {metrics['tn']}, FP: {metrics['fp']}, FN: {metrics['fn']}, TP: {metrics['tp']}")
    
    return {
        'best_estimator': best_estimator,
        'metrics': metrics,
        'best_params': grid_search.best_params_,
        'y_test': y_test,
        'y_test_prob': y_test_prob
    }


## Step 4: Initialize Extended Storage

Initialize the containers that collect metrics, ROC probabilities, best hyperparameters, and best estimators for every model-dataset combination.


In [4]:
# ============================================================
# Initialize Extended Storage
# ============================================================

# Centralized metrics storage
# Each entry will have: model_name, dataset_type, accuracy, precision, recall, f1, roc_auc, tn, fp, fn, tp
all_metrics = []

# ROC probabilities storage for Task 5
# Structure: {model_name: {dataset_type: {'y_true': array, 'y_score': array}}}
roc_probabilities = {
    'SVC': {},
    'MLP': {}
}

# Best hyperparameters storage
# Structure: {model_name: {dataset_type: best_params_dict}}
best_hyperparameters = {
    'SVC': {},
    'MLP': {}
}

# Best estimators storage
# Structure: {model_name: {dataset_type: best_estimator}}
best_estimators = {
    'SVC': {},
    'MLP': {}
}

print("=" * 70)
print("EXTENDED STORAGE INITIALIZED")
print("=" * 70)
print("Storage structures ready for:")
print("  - Centralized metrics table")
print("  - ROC probabilities (for Task 5)")
print("  - Best hyperparameters")
print("  - Best estimators")
print("=" * 70)


EXTENDED STORAGE INITIALIZED
Storage structures ready for:
  - Centralized metrics table
  - ROC probabilities (for Task 5)
  - Best hyperparameters
  - Best estimators


## Step 5: SVC (RBF Kernel) Implementation

Train an RBF-kernel SVC on each dataset with GridSearchCV hyperparameter tuning, then store the metrics, predicted probabilities, best parameters, and best estimator.
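
The `gamma` values in the grid control how quickly the RBF kernel's similarity decays with distance. The sketch below evaluates the kernel by hand for a few settings. In scikit-learn, `gamma='auto'` means `1 / n_features` and `gamma='scale'` means `1 / (n_features * X.var())`; only the `'auto'` value is computed here because it does not need the data. The vectors `x` and `z` are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

n_features = 28              # feature count after dropping the Attrition column
gamma_auto = 1 / n_features  # scikit-learn's gamma='auto'

# Two hypothetical scaled feature vectors that differ in a single feature.
x = [0.0] * n_features
z = [0.0] * n_features
z[0] = 2.0

for gamma in (0.001, 0.01, gamma_auto, 0.1):
    print(f"gamma={gamma:.4f} -> K(x, z)={rbf_kernel(x, z, gamma):.4f}")
```

Larger `gamma` makes each training point's influence more local, which allows a more flexible boundary but also makes it easier to fit synthetic samples too closely.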


In [5]:
# ============================================================
# SVC (RBF Kernel) with GridSearchCV
# ============================================================

print("=" * 70)
print("TRAINING SVC (RBF KERNEL) MODELS")
print("=" * 70)

# Define hyperparameter grid for SVC
svc_param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1]
}

# Fixed parameters: RBF kernel and probability=True (required for predict_proba)
svc_fixed_params = {
    'kernel': 'rbf',
    'probability': True  # Required for predict_proba and ROC curves
}

# Train SVC on each dataset
for dataset_key in ['raw', 'smote', 'adasyn']:
    dataset_name = training_sets[dataset_key]['name']
    print(f"\n{'='*70}")
    print(f"Training SVC on: {dataset_name}")
    print("=" * 70)
    
    # Train model
    result = train_model_with_gridsearch(
        model_class=SVC,
        param_grid=svc_param_grid,
        X_train=training_sets[dataset_key]['X_train'],
        y_train=training_sets[dataset_key]['y_train'],
        X_test=X_test,
        y_test=y_test,
        model_name='SVC',
        dataset_name=dataset_name,
        **svc_fixed_params
    )
    
    # Store best estimator
    best_estimators['SVC'][dataset_key] = result['best_estimator']
    
    # Store best hyperparameters
    best_hyperparameters['SVC'][dataset_key] = result['best_params']
    
    # Store ROC probabilities
    roc_probabilities['SVC'][dataset_key] = {
        'y_true': result['y_test'],
        'y_score': result['y_test_prob']
    }
    
    # Store metrics
    all_metrics.append({
        'model_name': 'SVC',
        'dataset_type': dataset_key,
        'accuracy': result['metrics']['accuracy'],
        'precision_positive': result['metrics']['precision'],
        'recall_positive': result['metrics']['recall'],
        'f1_positive': result['metrics']['f1'],
        'roc_auc': result['metrics']['roc_auc'],
        'tn': result['metrics']['tn'],
        'fp': result['metrics']['fp'],
        'fn': result['metrics']['fn'],
        'tp': result['metrics']['tp']
    })

print("\n" + "=" * 70)
print("SVC TRAINING COMPLETE")
print("=" * 70)


TRAINING SVC (RBF KERNEL) MODELS

Training SVC on: Raw/Original
  Running GridSearchCV for SVC on Raw/Original dataset...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
  Best parameters: {'C': 100, 'gamma': 0.01}
  Test Metrics - Accuracy: 0.8571, Precision: 0.5610, Recall: 0.4894, F1: 0.5227, ROC-AUC: 0.7786
  Confusion Matrix - TN: 229, FP: 18, FN: 24, TP: 23

Training SVC on: SMOTE Balanced
  Running GridSearchCV for SVC on SMOTE Balanced dataset...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
  Best parameters: {'C': 10, 'gamma': 0.1}
  Test Metrics - Accuracy: 0.8095, Precision: 0.3548, Recall: 0.2340, F1: 0.2821, ROC-AUC: 0.7339
  Confusion Matrix - TN: 227, FP: 20, FN: 36, TP: 11

Training SVC on: ADASYN Balanced
  Running GridSearchCV for SVC on ADASYN Balanced dataset...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
  Best parameters: {'C': 10, 'gamma': 0.1}
  Test Metrics - Accuracy: 0.8197, Precision: 0.4062, Recall: 0.2766, F

## Step 6: MLPClassifier (Neural Network) Implementation

Train an MLPClassifier on each dataset with GridSearchCV hyperparameter tuning, then store the metrics, predicted probabilities, best parameters, and best estimator.
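
The cost of the search is the number of grid combinations times the number of folds. The sketch below enumerates the MLP grid used in this step with `itertools.product`; `example_grid` is a copy made for illustration so that the notebook's own variables are not touched.

```python
from itertools import product

# Copy of the MLP grid from this step, kept under a separate name.
example_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50), (100, 50)],
    "alpha": [0.0001, 0.001, 0.01],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate.
names = list(example_grid)
candidates = [dict(zip(names, values)) for values in product(*example_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
for candidate in candidates[:3]:
    print(candidate)
```

With the default `refit=True`, GridSearchCV also fits the winning candidate once more on the full training set, so the actual number of model fits is one higher than the count it prints.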


In [6]:
# ============================================================
# MLPClassifier (Neural Network) with GridSearchCV
# ============================================================

print("=" * 70)
print("TRAINING MLPClassifier (NEURAL NETWORK) MODELS")
print("=" * 70)

# Define hyperparameter grid for MLPClassifier
mlp_param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
    'alpha': [0.0001, 0.001, 0.01]
}

# Fixed parameters: max_iter, random_state, early_stopping
mlp_fixed_params = {
    'max_iter': 500,
    'random_state': 42,
    'early_stopping': True
}

# Train MLPClassifier on each dataset
for dataset_key in ['raw', 'smote', 'adasyn']:
    dataset_name = training_sets[dataset_key]['name']
    print(f"\n{'='*70}")
    print(f"Training MLPClassifier on: {dataset_name}")
    print("=" * 70)
    
    # Train model
    result = train_model_with_gridsearch(
        model_class=MLPClassifier,
        param_grid=mlp_param_grid,
        X_train=training_sets[dataset_key]['X_train'],
        y_train=training_sets[dataset_key]['y_train'],
        X_test=X_test,
        y_test=y_test,
        model_name='MLP',
        dataset_name=dataset_name,
        **mlp_fixed_params
    )
    
    # Store best estimator
    best_estimators['MLP'][dataset_key] = result['best_estimator']
    
    # Store best hyperparameters
    best_hyperparameters['MLP'][dataset_key] = result['best_params']
    
    # Store ROC probabilities
    roc_probabilities['MLP'][dataset_key] = {
        'y_true': result['y_test'],
        'y_score': result['y_test_prob']
    }
    
    # Store metrics
    all_metrics.append({
        'model_name': 'MLP',
        'dataset_type': dataset_key,
        'accuracy': result['metrics']['accuracy'],
        'precision_positive': result['metrics']['precision'],
        'recall_positive': result['metrics']['recall'],
        'f1_positive': result['metrics']['f1'],
        'roc_auc': result['metrics']['roc_auc'],
        'tn': result['metrics']['tn'],
        'fp': result['metrics']['fp'],
        'fn': result['metrics']['fn'],
        'tp': result['metrics']['tp']
    })

print("\n" + "=" * 70)
print("MLPClassifier TRAINING COMPLETE")
print("=" * 70)


TRAINING MLPClassifier (NEURAL NETWORK) MODELS

Training MLPClassifier on: Raw/Original
  Running GridSearchCV for MLP on Raw/Original dataset...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
  Best parameters: {'alpha': 0.0001, 'hidden_layer_sizes': (100,)}
  Test Metrics - Accuracy: 0.8571, Precision: 0.6923, Recall: 0.1915, F1: 0.3000, ROC-AUC: 0.7362
  Confusion Matrix - TN: 243, FP: 4, FN: 38, TP: 9

Training MLPClassifier on: SMOTE Balanced
  Running GridSearchCV for MLP on SMOTE Balanced dataset...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
  Best parameters: {'alpha': 0.01, 'hidden_layer_sizes': (100, 50)}
  Test Metrics - Accuracy: 0.7891, Precision: 0.3529, Recall: 0.3830, F1: 0.3673, ROC-AUC: 0.7196
  Confusion Matrix - TN: 214, FP: 33, FN: 29, TP: 18

Training MLPClassifier on: ADASYN Balanced
  Running GridSearchCV for MLP on ADASYN Balanced dataset...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
  Best parameters: {'alpha':

## Step 7: Extended Metrics Table Display

Display the comprehensive metrics table with all performance metrics and confusion matrix components.
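
The per-model summary aggregates only three rows each (raw, SMOTE, ADASYN), so the reported standard deviations are rough. The sketch below recomputes the mean and sample standard deviation of F1 by hand from the values in this notebook's performance table. `statistics.stdev` uses the sample formula (n - 1 in the denominator), which is also the default for pandas `std`.

```python
from statistics import mean, stdev

# F1 values from the performance table, in the order raw, SMOTE, ADASYN.
svc_f1 = [0.522727, 0.282051, 0.329114]
mlp_f1 = [0.300000, 0.367347, 0.418605]

for name, values in (("SVC", svc_f1), ("MLP", mlp_f1)):
    print(f"{name}: mean={mean(values):.4f}, std={stdev(values):.4f}")
```

A mean over three differently resampled training sets mixes the effect of the model with the effect of the resampling strategy, so it is best read alongside the per-row table rather than in place of it.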


In [7]:
# ============================================================
# Extended Metrics Table Display
# ============================================================

if all_metrics:
    # Convert to DataFrame
    metrics_df = pd.DataFrame(all_metrics)
    
    # Map dataset_type to readable names
    dataset_name_map = {
        'raw': 'Raw/Original',
        'smote': 'SMOTE Balanced',
        'adasyn': 'ADASYN Balanced'
    }
    metrics_df['dataset'] = metrics_df['dataset_type'].map(dataset_name_map)
    
    print("=" * 70)
    print("COMPREHENSIVE METRICS TABLE")
    print("=" * 70)
    
    # Display full table
    display_cols = ['model_name', 'dataset', 'accuracy', 'precision_positive', 
                    'recall_positive', 'f1_positive', 'roc_auc']
    print("\n--- Performance Metrics ---")
    print(metrics_df[display_cols].to_string(index=False))
    
    # Display confusion matrix components
    print("\n--- Confusion Matrix Components ---")
    cm_cols = ['model_name', 'dataset', 'tn', 'fp', 'fn', 'tp']
    print(metrics_df[cm_cols].to_string(index=False))
    
    # Summary statistics by model
    print("\n--- Summary Statistics by Model ---")
    summary_by_model = metrics_df.groupby('model_name').agg({
        'accuracy': ['mean', 'std'],
        'precision_positive': ['mean', 'std'],
        'recall_positive': ['mean', 'std'],
        'f1_positive': ['mean', 'std'],
        'roc_auc': ['mean', 'std']
    }).round(4)
    print(summary_by_model)
    
    # Summary statistics by dataset
    print("\n--- Summary Statistics by Dataset ---")
    summary_by_dataset = metrics_df.groupby('dataset').agg({
        'accuracy': ['mean', 'std'],
        'precision_positive': ['mean', 'std'],
        'recall_positive': ['mean', 'std'],
        'f1_positive': ['mean', 'std'],
        'roc_auc': ['mean', 'std']
    }).round(4)
    print(summary_by_dataset)
    
    print("\n" + "=" * 70)
else:
    print("No metrics available yet. Please run model training cells first.")


COMPREHENSIVE METRICS TABLE

--- Performance Metrics ---
model_name         dataset  accuracy  precision_positive  recall_positive  f1_positive  roc_auc
       SVC    Raw/Original  0.857143            0.560976         0.489362     0.522727 0.778620
       SVC  SMOTE Balanced  0.809524            0.354839         0.234043     0.282051 0.733913
       SVC ADASYN Balanced  0.819728            0.406250         0.276596     0.329114 0.740460
       MLP    Raw/Original  0.857143            0.692308         0.191489     0.300000 0.736153
       MLP  SMOTE Balanced  0.789116            0.352941         0.382979     0.367347 0.719614
       MLP ADASYN Balanced  0.829932            0.461538         0.382979     0.418605 0.710569

--- Confusion Matrix Components ---
model_name         dataset  tn  fp  fn  tp
       SVC    Raw/Original 229  18  24  23
       SVC  SMOTE Balanced 227  20  36  11
       SVC ADASYN Balanced 228  19  34  13
       MLP    Raw/Original 243   4  38   9
       MLP  SMOTE B

## Step 8: ROC Probabilities Storage Summary

Display the structure of stored ROC probabilities for Task 5 (ROC curve plotting).
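
The stored `y_true` and `y_score` pairs are all that a ROC curve or ROC-AUC needs. As a sketch of what ROC-AUC measures, the function below computes it as the fraction of positive-negative pairs in which the positive sample receives the higher score. This pairwise form is written for clarity, not speed, and the labels and scores are hypothetical rather than taken from the notebook's models.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores for illustration only.
toy_true = [0, 0, 1, 1, 0, 1]
toy_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
print(f"{roc_auc(toy_true, toy_score):.4f}")
```

Here 8 of the 9 positive-negative pairs are ranked correctly. Because only the ranking matters, ROC-AUC is unaffected by where the 0.5 decision threshold falls, unlike precision, recall, and F1.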


In [8]:
# ============================================================
# ROC Probabilities Storage Summary
# ============================================================

print("=" * 70)
print("ROC PROBABILITIES STORAGE")
print("=" * 70)
print("\nStructure: {model_name: {dataset_type: {'y_true': array, 'y_score': array}}}")
print("\nStored probabilities for ROC curve plotting (Task 5):\n")

for model_name in ['SVC', 'MLP']:
    print(f"\n{model_name}:")
    print("-" * 70)
    for dataset_key in ['raw', 'smote', 'adasyn']:
        dataset_name = training_sets[dataset_key]['name']
        if dataset_key in roc_probabilities[model_name]:
            stored_data = roc_probabilities[model_name][dataset_key]
            y_true = stored_data['y_true']
            y_score = stored_data['y_score']
            print(f"  {dataset_name}:")
            print(f"    y_true shape: {y_true.shape if hasattr(y_true, 'shape') else len(y_true)}")
            print(f"    y_score shape: {y_score.shape if hasattr(y_score, 'shape') else len(y_score)}")
            print(f"    y_true unique values: {np.unique(y_true)}")
            print(f"    y_score range: [{y_score.min():.4f}, {y_score.max():.4f}]")
        else:
            print(f"  {dataset_name}: Not stored")

print("\n" + "=" * 70)
print("All probabilities stored successfully for Task 5")
print("=" * 70)


ROC PROBABILITIES STORAGE

Structure: {model_name: {dataset_type: {'y_true': array, 'y_score': array}}}

Stored probabilities for ROC curve plotting (Task 5):


SVC:
----------------------------------------------------------------------
  Raw/Original:
    y_true shape: (294,)
    y_score shape: (294,)
    y_true unique values: [0 1]
    y_score range: [0.0125, 0.7770]
  SMOTE Balanced:
    y_true shape: (294,)
    y_score shape: (294,)
    y_true unique values: [0 1]
    y_score range: [0.0000, 0.9874]
  ADASYN Balanced:
    y_true shape: (294,)
    y_score shape: (294,)
    y_true unique values: [0 1]
    y_score range: [0.0000, 0.9903]

MLP:
----------------------------------------------------------------------
  Raw/Original:
    y_true shape: (294,)
    y_score shape: (294,)
    y_true unique values: [0 1]
    y_score range: [0.0026, 0.6900]
  SMOTE Balanced:
    y_true shape: (294,)
    y_score shape: (294,)
    y_true unique values: [0 1]
    y_score range: [0.0000, 0.9812]
  AD

## Step 9: Best Hyperparameters Storage

Display the best hyperparameters for each model-dataset combination.
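
If the best parameters need to outlive the notebook session, one option is to write the nested dictionary to JSON. This is a sketch of that option, not something the notebook does; `best_hyperparameters_example` is a hypothetical copy holding a few of the values reported in this notebook.

```python
import json

# A few of the best-parameter values reported in this notebook.
best_hyperparameters_example = {
    "SVC": {"raw": {"C": 100, "gamma": 0.01}, "smote": {"C": 10, "gamma": 0.1}},
    "MLP": {"raw": {"alpha": 0.0001, "hidden_layer_sizes": (100,)}},
}

encoded = json.dumps(best_hyperparameters_example, indent=2)
decoded = json.loads(encoded)

# JSON has no tuple type, so hidden_layer_sizes comes back as a list.
restored = tuple(decoded["MLP"]["raw"]["hidden_layer_sizes"])
print(decoded["MLP"]["raw"]["hidden_layer_sizes"], restored)
```

Converting `hidden_layer_sizes` back to a tuple after loading keeps the restored parameters identical to what GridSearchCV returned.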


In [9]:
# ============================================================
# Best Hyperparameters Storage
# ============================================================

print("=" * 70)
print("BEST HYPERPARAMETERS")
print("=" * 70)
print("\nStructure: {model_name: {dataset_type: best_params_dict}}\n")

for model_name in ['SVC', 'MLP']:
    print(f"{model_name}:")
    print("-" * 70)
    for dataset_key in ['raw', 'smote', 'adasyn']:
        dataset_name = training_sets[dataset_key]['name']
        if dataset_key in best_hyperparameters[model_name]:
            params = best_hyperparameters[model_name][dataset_key]
            print(f"\n  {dataset_name}:")
            for param_name, param_value in params.items():
                print(f"    {param_name}: {param_value}")
        else:
            print(f"\n  {dataset_name}: Not available")
    print()

print("=" * 70)
print("All hyperparameters stored successfully")
print("=" * 70)


BEST HYPERPARAMETERS

Structure: {model_name: {dataset_type: best_params_dict}}

SVC:
----------------------------------------------------------------------

  Raw/Original:
    C: 100
    gamma: 0.01

  SMOTE Balanced:
    C: 10
    gamma: 0.1

  ADASYN Balanced:
    C: 10
    gamma: 0.1

MLP:
----------------------------------------------------------------------

  Raw/Original:
    alpha: 0.0001
    hidden_layer_sizes: (100,)

  SMOTE Balanced:
    alpha: 0.01
    hidden_layer_sizes: (100, 50)

  ADASYN Balanced:
    alpha: 0.01
    hidden_layer_sizes: (100, 50)

All hyperparameters stored successfully


## Step 10: Best Estimators Storage Summary

Display a summary of stored best estimators for each model-dataset combination.


In [10]:
# ============================================================
# Best Estimators Storage Summary
# ============================================================

print("=" * 70)
print("BEST ESTIMATORS STORAGE")
print("=" * 70)
print("\nStructure: {model_name: {dataset_type: best_estimator}}\n")

for model_name in ['SVC', 'MLP']:
    print(f"{model_name}:")
    print("-" * 70)
    for dataset_key in ['raw', 'smote', 'adasyn']:
        dataset_name = training_sets[dataset_key]['name']
        if dataset_key in best_estimators[model_name]:
            estimator = best_estimators[model_name][dataset_key]
            print(f"  {dataset_name}:")
            print(f"    Type: {type(estimator).__name__}")
            print(f"    Parameters: {estimator.get_params()}")
        else:
            print(f"  {dataset_name}: Not available")
    print()

print("=" * 70)
print("All best estimators stored successfully")
print("=" * 70)


BEST ESTIMATORS STORAGE

Structure: {model_name: {dataset_type: best_estimator}}

SVC:
----------------------------------------------------------------------
  Raw/Original:
    Type: SVC
    Parameters: {'C': 100, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 0.01, 'kernel': 'rbf', 'max_iter': -1, 'probability': True, 'random_state': 42, 'shrinking': True, 'tol': 0.001, 'verbose': False}
  SMOTE Balanced:
    Type: SVC
    Parameters: {'C': 10, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 0.1, 'kernel': 'rbf', 'max_iter': -1, 'probability': True, 'random_state': 42, 'shrinking': True, 'tol': 0.001, 'verbose': False}
  ADASYN Balanced:
    Type: SVC
    Parameters: {'C': 10, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 0.1, 'kerne