# Comprehensive ML Models for Fall Prediction

This notebook implements multiple machine learning models for fall prediction:
- **Neural Network** (Multi-layer Perceptron)
- **Support Vector Machine** (SVM)
- **Random Forest** with OOB error
- **Gradient Boosting**
- **XGBoost**
- **Logistic Regression**

All models use:
- 75/25 stratified train/test split
- Hyperparameter search (GridSearchCV or RandomizedSearchCV)
- OOB estimates where applicable (Random Forest via `oob_score_`; Gradient Boosting only when `subsample < 1.0`)

## Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score, confusion_matrix, roc_auc_score, roc_curve,
    classification_report, f1_score, precision_score, recall_score
)

# Models
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Load data
df = pd.read_csv('../data/combined_output.csv')
print(f"Dataset shape: {df.shape}")
print(f"\nClass distribution:\n{df['Faller'].value_counts()}")

# Check for missing values
print(f"\nMissing values per column:")
missing = df.isnull().sum()
if missing.sum() > 0:
    print(missing[missing > 0])
else:
    print("No missing values")

# Feature columns; imputation is fit after the split so test rows do not influence the medians
feature_cols = [col for col in df.columns if col not in ['ID', 'Faller']]
df.head()

## Data Preparation: 75/25 Stratified Split

In [None]:
# Prepare features and target
X = df[feature_cols].values
y = (df['Faller'] == 'F').astype(int).values  # Convert to binary: F=1, NF=0

# 75/25 stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE
)

print(f"Training set size: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set size: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f"\nTraining set class distribution:")
print(f"  Non-fallers (0): {np.sum(y_train == 0)} ({np.sum(y_train == 0)/len(y_train)*100:.1f}%)")
print(f"  Fallers (1): {np.sum(y_train == 1)} ({np.sum(y_train == 1)/len(y_train)*100:.1f}%)")
print(f"\nTest set class distribution:")
print(f"  Non-fallers (0): {np.sum(y_test == 0)} ({np.sum(y_test == 0)/len(y_test)*100:.1f}%)")
print(f"  Fallers (1): {np.sum(y_test == 1)} ({np.sum(y_test == 1)/len(y_test)*100:.1f}%)")

# Median imputation, fit on the training set only to avoid test-set leakage
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
print(f"\nAfter imputation - Missing values: {np.isnan(X_train).sum() + np.isnan(X_test).sum()}")

# Standardize features (statistics come from the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nNumber of features: {X_train.shape[1]}")
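The scaler above is fit once on the whole training set, so during the cross-validated searches that follow, every validation fold has already contributed to the scaling statistics. A minimal sketch of one way to avoid this is to wrap preprocessing and model in a `Pipeline`, so each step is refit inside every fold. It uses synthetic data in place of the fall dataset; `X_demo`, `y_demo`, `pipe`, and `search` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fall dataset (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject some missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Imputer and scaler are refit on the training part of every CV fold
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)  # raw, unscaled training data goes in

test_accuracy = search.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

The same pattern should carry over to the other estimators in this notebook by swapping the final step and prefixing the grid keys with the step name.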

## Helper Functions

In [None]:
def calculate_metrics(y_true, y_pred, y_proba=None):
    """
    Calculate comprehensive metrics for binary classification.
    """
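    # Fix the label order so ravel() always yields four values, even if a class is never predicted
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])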
    tn, fp, fn, tp = cm.ravel()
    
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred) * 100,
        'sensitivity': tp / (tp + fn) * 100 if (tp + fn) > 0 else 0,
        'specificity': tn / (tn + fp) * 100 if (tn + fp) > 0 else 0,
        'precision': precision_score(y_true, y_pred, zero_division=0) * 100,
        'f1': f1_score(y_true, y_pred, zero_division=0) * 100,
    }
    
    if y_proba is not None:
        metrics['auc'] = roc_auc_score(y_true, y_proba)
    
    return metrics

def print_metrics(metrics, title):
    """
    Print metrics in a formatted way.
    """
    print(f"\n{title}")
    print("=" * 60)
    print(f"Accuracy:    {metrics['accuracy']:.2f}%")
    print(f"Sensitivity: {metrics['sensitivity']:.2f}%")
    print(f"Specificity: {metrics['specificity']:.2f}%")
    print(f"Precision:   {metrics['precision']:.2f}%")
    print(f"F1 Score:    {metrics['f1']:.2f}%")
    if 'auc' in metrics:
        print(f"AUC:         {metrics['auc']:.4f}")
    if 'oob_score' in metrics:
        print(f"OOB Score:   {metrics['oob_score']:.2f}%")
    print("=" * 60)
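As a small worked example of the confusion-matrix arithmetic inside `calculate_metrics`, the sketch below uses hand-made labels. These are toy values chosen for illustration, not results from the fall dataset.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = faller, 0 = non-faller (made up for illustration)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# labels=[0, 1] fixes the layout to [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn) * 100  # share of true fallers that were caught
specificity = tn / (tn + fp) * 100  # share of non-fallers correctly cleared
precision = tp / (tp + fp) * 100    # share of predicted fallers that are real

print(tn, fp, fn, tp)
print(f"{sensitivity:.1f} {specificity:.1f} {precision:.1f}")
```

Here three of four fallers are detected (sensitivity 75%), four of six non-fallers are cleared (specificity about 66.7%), and three of five positive predictions are correct (precision 60%).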

## Model 1: Neural Network (Multi-layer Perceptron)

Hyperparameter search for:
- Hidden layer sizes
- Activation functions
- Learning rate
- Alpha (L2 regularization)

In [None]:
print("=" * 80)
print("MODEL 1: NEURAL NETWORK (Multi-layer Perceptron)")
print("=" * 80)

# Define hyperparameter grid
nn_param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (100, 100)],
    'activation': ['relu', 'tanh'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01],
    'max_iter': [500]
}

print(f"\nHyperparameter search space:")
print(f"  Hidden layer sizes: {nn_param_grid['hidden_layer_sizes']}")
print(f"  Activation functions: {nn_param_grid['activation']}")
print(f"  Alpha (L2): {nn_param_grid['alpha']}")
print(f"  Learning rates: {nn_param_grid['learning_rate_init']}")
print(f"\nTotal combinations: {np.prod([len(v) for v in nn_param_grid.values()])}")

# GridSearchCV with 5-fold cross-validation
nn_grid = GridSearchCV(
    MLPClassifier(random_state=RANDOM_STATE, early_stopping=True),
    nn_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("\nRunning GridSearchCV (this may take a few minutes)...")
nn_grid.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {nn_grid.best_params_}")
print(f"Best CV accuracy: {nn_grid.best_score_ * 100:.2f}%")

# Evaluate on test set
nn_best = nn_grid.best_estimator_
y_pred_nn = nn_best.predict(X_test_scaled)
y_proba_nn = nn_best.predict_proba(X_test_scaled)[:, 1]

nn_metrics = calculate_metrics(y_test, y_pred_nn, y_proba_nn)
print_metrics(nn_metrics, "Neural Network - Test Set Performance")

## Model 2: Support Vector Machine (SVM)

Hyperparameter search for:
- Kernel types (RBF, linear, polynomial)
- C (regularization parameter)
- Gamma (kernel coefficient)

In [None]:
print("\n" + "=" * 80)
print("MODEL 2: SUPPORT VECTOR MACHINE (SVM)")
print("=" * 80)

# Define hyperparameter grid
svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['rbf', 'linear', 'poly'],
    'gamma': ['scale', 'auto', 0.001, 0.01],
    'probability': [True]  # For AUC calculation
}

print(f"\nHyperparameter search space:")
print(f"  C: {svm_param_grid['C']}")
print(f"  Kernels: {svm_param_grid['kernel']}")
print(f"  Gamma: {svm_param_grid['gamma']}")
print(f"\nTotal combinations: {np.prod([len(v) if isinstance(v, list) else 1 for v in svm_param_grid.values()])}")

# GridSearchCV with 5-fold cross-validation
svm_grid = GridSearchCV(
    SVC(random_state=RANDOM_STATE),
    svm_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("\nRunning GridSearchCV (this may take a few minutes)...")
svm_grid.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {svm_grid.best_params_}")
print(f"Best CV accuracy: {svm_grid.best_score_ * 100:.2f}%")

# Evaluate on test set
svm_best = svm_grid.best_estimator_
y_pred_svm = svm_best.predict(X_test_scaled)
y_proba_svm = svm_best.predict_proba(X_test_scaled)[:, 1]

svm_metrics = calculate_metrics(y_test, y_pred_svm, y_proba_svm)
print_metrics(svm_metrics, "SVM - Test Set Performance")
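In scikit-learn's `SVC`, `gamma` only affects the `rbf`, `poly`, and `sigmoid` kernels, so the flat grid above refits the linear kernel once per `gamma` value with identical results. A sketch of an alternative, assuming the same candidate values, passes a list of dictionaries so `gamma` varies only where it matters. `flat_grid` and `split_grid` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Flat grid as used above: every kernel is paired with every gamma
flat_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["rbf", "linear", "poly"],
    "gamma": ["scale", "auto", 0.001, 0.01],
}

# Split grid: gamma is only varied for the kernels that use it
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10, 100],
     "gamma": ["scale", "auto", 0.001, 0.01]},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
print(n_flat, n_split)  # 48 and 36 candidates
```

`GridSearchCV` accepts the list form directly as its `param_grid`, so with 5-fold cross-validation this would cut the search from 240 fits to 180.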

## Model 3: Random Forest with OOB Error

Hyperparameter search for:
- Number of estimators
- Max features
- Max depth
- Min samples split

In [None]:
print("\n" + "=" * 80)
print("MODEL 3: RANDOM FOREST with OOB Error")
print("=" * 80)

# Define hyperparameter grid
rf_param_grid = {
    'n_estimators': [100, 200, 365, 500],
    'max_features': [1, 'sqrt', 'log2'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'oob_score': [True]
}

print(f"\nHyperparameter search space:")
print(f"  N estimators: {rf_param_grid['n_estimators']}")
print(f"  Max features: {rf_param_grid['max_features']}")
print(f"  Max depth: {rf_param_grid['max_depth']}")
print(f"  Min samples split: {rf_param_grid['min_samples_split']}")
print(f"\nTotal combinations: {np.prod([len(v) if isinstance(v, list) else 1 for v in rf_param_grid.values()])}")

# RandomizedSearchCV for efficiency (too many combinations)
rf_random = RandomizedSearchCV(
    RandomForestClassifier(random_state=RANDOM_STATE),
    rf_param_grid,
    n_iter=50,  # Try 50 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1,
    random_state=RANDOM_STATE
)

print("\nRunning RandomizedSearchCV (this may take a few minutes)...")
rf_random.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {rf_random.best_params_}")
print(f"Best CV accuracy: {rf_random.best_score_ * 100:.2f}%")

# Evaluate on test set
rf_best = rf_random.best_estimator_
y_pred_rf = rf_best.predict(X_test_scaled)
y_proba_rf = rf_best.predict_proba(X_test_scaled)[:, 1]

rf_metrics = calculate_metrics(y_test, y_pred_rf, y_proba_rf)
rf_metrics['oob_score'] = rf_best.oob_score_ * 100
print_metrics(rf_metrics, "Random Forest - Test Set Performance")
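The OOB score reported above is the accuracy on samples that each tree did not see in its bootstrap draw, and the OOB error is its complement. A minimal sketch on synthetic data (an assumption, standing in for the fall dataset) shows the relationship. `rf_demo` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the fall dataset (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# oob_score=True requires bootstrap sampling, which is the default
rf_demo = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
rf_demo.fit(X_demo, y_demo)

oob_accuracy = rf_demo.oob_score_  # accuracy on out-of-bag samples
oob_error = 1.0 - oob_accuracy     # the "OOB error" is its complement
print(f"OOB accuracy: {oob_accuracy:.3f}, OOB error: {oob_error:.3f}")
```

Because `best_estimator_` is refit on the full training set after the search, its `oob_score_` is an estimate over all training rows and can be compared against the cross-validated accuracy as a sanity check.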

## Model 4: Gradient Boosting with OOB Improvement

Hyperparameter search for:
- Number of estimators
- Learning rate
- Max depth
- Subsample ratio

In [None]:
print("\n" + "=" * 80)
print("MODEL 4: GRADIENT BOOSTING with OOB Improvement")
print("=" * 80)

# Define hyperparameter grid
gb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 0.9, 1.0],
    'min_samples_split': [2, 5]
}

print(f"\nHyperparameter search space:")
print(f"  N estimators: {gb_param_grid['n_estimators']}")
print(f"  Learning rate: {gb_param_grid['learning_rate']}")
print(f"  Max depth: {gb_param_grid['max_depth']}")
print(f"  Subsample: {gb_param_grid['subsample']}")
print(f"\nTotal combinations: {np.prod([len(v) for v in gb_param_grid.values()])}")

# RandomizedSearchCV for efficiency
gb_random = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=RANDOM_STATE),
    gb_param_grid,
    n_iter=40,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1,
    random_state=RANDOM_STATE
)

print("\nRunning RandomizedSearchCV (this may take a few minutes)...")
gb_random.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {gb_random.best_params_}")
print(f"Best CV accuracy: {gb_random.best_score_ * 100:.2f}%")

# Evaluate on test set
gb_best = gb_random.best_estimator_
y_pred_gb = gb_best.predict(X_test_scaled)
y_proba_gb = gb_best.predict_proba(X_test_scaled)[:, 1]

gb_metrics = calculate_metrics(y_test, y_pred_gb, y_proba_gb)

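# train_score_ holds the training loss at each boosting stage; it is not an OOB estimate
final_train_loss = gb_best.train_score_[-1]

# oob_improvement_ exists only when subsample < 1.0; its sum is the cumulative OOB loss reduction
oob_improvement = np.sum(gb_best.oob_improvement_) if gb_best.subsample < 1.0 else None
gb_metrics['oob_improvement'] = oob_improvement

print_metrics(gb_metrics, "Gradient Boosting - Test Set Performance")
print(f"Final training loss: {final_train_loss:.4f}")
if oob_improvement is not None:
    print(f"Cumulative OOB improvement (loss reduction): {oob_improvement:.4f}")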
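In scikit-learn, `train_score_` records the training loss at each boosting stage, while `oob_improvement_` is populated only when `subsample < 1.0`. The sketch below, on synthetic data (an assumption, not the fall dataset), shows both cases and one common heuristic for reading a stage count off the cumulative OOB improvement. `gb_sub` and `gb_full` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the fall dataset (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# subsample < 1.0 turns on stochastic gradient boosting and OOB tracking
gb_sub = GradientBoostingClassifier(n_estimators=100, subsample=0.8, random_state=42)
gb_sub.fit(X_demo, y_demo)

# Per-stage change in OOB loss; the running sum is the total improvement so far
cumulative_oob = np.cumsum(gb_sub.oob_improvement_)
best_n_estimators = int(np.argmax(cumulative_oob)) + 1

# With subsample=1.0 there are no out-of-bag samples, so the attribute is absent
gb_full = GradientBoostingClassifier(n_estimators=100, subsample=1.0, random_state=42)
gb_full.fit(X_demo, y_demo)

print(best_n_estimators, hasattr(gb_full, "oob_improvement_"))
```

The OOB-based stage count tends to be a conservative estimate, so it is better treated as a rough guide than as a replacement for cross-validation.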

## Model 5: XGBoost

Hyperparameter search for:
- Max depth
- Learning rate
- Number of estimators
- Subsample
- Colsample by tree

In [None]:
print("\n" + "=" * 80)
print("MODEL 5: XGBoost")
print("=" * 80)

# Define hyperparameter grid
xgb_param_grid = {
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2]
}

print(f"\nHyperparameter search space:")
print(f"  Max depth: {xgb_param_grid['max_depth']}")
print(f"  Learning rate: {xgb_param_grid['learning_rate']}")
print(f"  N estimators: {xgb_param_grid['n_estimators']}")
print(f"  Subsample: {xgb_param_grid['subsample']}")
print(f"  Colsample by tree: {xgb_param_grid['colsample_bytree']}")
print(f"\nTotal combinations: {np.prod([len(v) for v in xgb_param_grid.values()])}")

# RandomizedSearchCV for efficiency
xgb_random = RandomizedSearchCV(
    xgb.XGBClassifier(random_state=RANDOM_STATE, eval_metric='logloss'),
    xgb_param_grid,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1,
    random_state=RANDOM_STATE
)

print("\nRunning RandomizedSearchCV (this may take a few minutes)...")
xgb_random.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {xgb_random.best_params_}")
print(f"Best CV accuracy: {xgb_random.best_score_ * 100:.2f}%")

# Evaluate on test set
xgb_best = xgb_random.best_estimator_
y_pred_xgb = xgb_best.predict(X_test_scaled)
y_proba_xgb = xgb_best.predict_proba(X_test_scaled)[:, 1]

xgb_metrics = calculate_metrics(y_test, y_pred_xgb, y_proba_xgb)
print_metrics(xgb_metrics, "XGBoost - Test Set Performance")
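To put the randomized search in perspective, the sketch below counts the full XGBoost grid and draws 50 candidates with `ParameterSampler`, the helper that `RandomizedSearchCV` uses for sampling. The grid values are copied from the cell above; `xgb_space` is an illustrative name and no XGBoost model is fit here.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as xgb_param_grid above
xgb_space = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "n_estimators": [100, 200, 300],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
    "gamma": [0, 0.1, 0.2],
}

total = len(ParameterGrid(xgb_space))  # size of the exhaustive grid
sampled = list(ParameterSampler(xgb_space, n_iter=50, random_state=42))

# When every parameter is given as a list, sampling is without replacement
unique = {tuple(sorted(p.items())) for p in sampled}
coverage = len(sampled) / total
print(total, len(sampled), len(unique), f"{coverage:.1%}")
```

With 1,296 possible combinations, 50 draws cover just under 4% of the grid, so the reported best parameters are the best of a small sample and not necessarily the grid optimum.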

## Model 6: Logistic Regression

Hyperparameter search for:
- C (regularization parameter)
- Penalty type (L1, L2)
- Solver

In [None]:
print("\n" + "=" * 80)
print("MODEL 6: LOGISTIC REGRESSION")
print("=" * 80)

# Define hyperparameter grid
lr_param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [1000]
}

print(f"\nHyperparameter search space:")
print(f"  C: {lr_param_grid['C']}")
print(f"  Penalty: {lr_param_grid['penalty']}")
print(f"  Solver: {lr_param_grid['solver']}")
print(f"\nTotal combinations: {np.prod([len(v) if isinstance(v, list) else 1 for v in lr_param_grid.values()])}")

# GridSearchCV with 5-fold cross-validation
lr_grid = GridSearchCV(
    LogisticRegression(random_state=RANDOM_STATE),
    lr_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("\nRunning GridSearchCV...")
lr_grid.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {lr_grid.best_params_}")
print(f"Best CV accuracy: {lr_grid.best_score_ * 100:.2f}%")

# Evaluate on test set
lr_best = lr_grid.best_estimator_
y_pred_lr = lr_best.predict(X_test_scaled)
y_proba_lr = lr_best.predict_proba(X_test_scaled)[:, 1]

lr_metrics = calculate_metrics(y_test, y_pred_lr, y_proba_lr)
print_metrics(lr_metrics, "Logistic Regression - Test Set Performance")

## Summary: Model Comparison

In [None]:
# Create comprehensive comparison table
comparison_data = [
    {
        'Model': 'Neural Network',
        'Accuracy': f"{nn_metrics['accuracy']:.2f}%",
        'Sensitivity': f"{nn_metrics['sensitivity']:.2f}%",
        'Specificity': f"{nn_metrics['specificity']:.2f}%",
        'Precision': f"{nn_metrics['precision']:.2f}%",
        'F1 Score': f"{nn_metrics['f1']:.2f}%",
        'AUC': f"{nn_metrics['auc']:.4f}"
    },
    {
        'Model': 'SVM',
        'Accuracy': f"{svm_metrics['accuracy']:.2f}%",
        'Sensitivity': f"{svm_metrics['sensitivity']:.2f}%",
        'Specificity': f"{svm_metrics['specificity']:.2f}%",
        'Precision': f"{svm_metrics['precision']:.2f}%",
        'F1 Score': f"{svm_metrics['f1']:.2f}%",
        'AUC': f"{svm_metrics['auc']:.4f}"
    },
    {
        'Model': 'Random Forest',
        'Accuracy': f"{rf_metrics['accuracy']:.2f}%",
        'Sensitivity': f"{rf_metrics['sensitivity']:.2f}%",
        'Specificity': f"{rf_metrics['specificity']:.2f}%",
        'Precision': f"{rf_metrics['precision']:.2f}%",
        'F1 Score': f"{rf_metrics['f1']:.2f}%",
        'AUC': f"{rf_metrics['auc']:.4f}"
    },
    {
        'Model': 'Gradient Boosting',
        'Accuracy': f"{gb_metrics['accuracy']:.2f}%",
        'Sensitivity': f"{gb_metrics['sensitivity']:.2f}%",
        'Specificity': f"{gb_metrics['specificity']:.2f}%",
        'Precision': f"{gb_metrics['precision']:.2f}%",
        'F1 Score': f"{gb_metrics['f1']:.2f}%",
        'AUC': f"{gb_metrics['auc']:.4f}"
    },
    {
        'Model': 'XGBoost',
        'Accuracy': f"{xgb_metrics['accuracy']:.2f}%",
        'Sensitivity': f"{xgb_metrics['sensitivity']:.2f}%",
        'Specificity': f"{xgb_metrics['specificity']:.2f}%",
        'Precision': f"{xgb_metrics['precision']:.2f}%",
        'F1 Score': f"{xgb_metrics['f1']:.2f}%",
        'AUC': f"{xgb_metrics['auc']:.4f}"
    },
    {
        'Model': 'Logistic Regression',
        'Accuracy': f"{lr_metrics['accuracy']:.2f}%",
        'Sensitivity': f"{lr_metrics['sensitivity']:.2f}%",
        'Specificity': f"{lr_metrics['specificity']:.2f}%",
        'Precision': f"{lr_metrics['precision']:.2f}%",
        'F1 Score': f"{lr_metrics['f1']:.2f}%",
        'AUC': f"{lr_metrics['auc']:.4f}"
    }
]

comparison_df = pd.DataFrame(comparison_data)
print("\n" + "=" * 120)
print("COMPREHENSIVE MODEL COMPARISON")
print("=" * 120)
print(comparison_df.to_string(index=False))
print("=" * 120)

# Save to CSV
comparison_df.to_csv('../scripts/ml_models_comparison.csv', index=False)
print("\nComparison saved to '../scripts/ml_models_comparison.csv'")

## Visualizations

In [None]:
# Extract numeric values for plotting
models = ['Neural\nNetwork', 'SVM', 'Random\nForest', 'Gradient\nBoosting', 'XGBoost', 'Logistic\nRegression']
accuracy_vals = [nn_metrics['accuracy'], svm_metrics['accuracy'], rf_metrics['accuracy'], 
                 gb_metrics['accuracy'], xgb_metrics['accuracy'], lr_metrics['accuracy']]
sensitivity_vals = [nn_metrics['sensitivity'], svm_metrics['sensitivity'], rf_metrics['sensitivity'],
                   gb_metrics['sensitivity'], xgb_metrics['sensitivity'], lr_metrics['sensitivity']]
specificity_vals = [nn_metrics['specificity'], svm_metrics['specificity'], rf_metrics['specificity'],
                   gb_metrics['specificity'], xgb_metrics['specificity'], lr_metrics['specificity']]
auc_vals = [nn_metrics['auc'], svm_metrics['auc'], rf_metrics['auc'],
           gb_metrics['auc'], xgb_metrics['auc'], lr_metrics['auc']]

# Create comprehensive comparison plot
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Accuracy comparison
axes[0, 0].bar(models, accuracy_vals, color='steelblue', alpha=0.8)
axes[0, 0].set_ylabel('Accuracy (%)', fontsize=12)
axes[0, 0].set_title('(a) Model Accuracy Comparison', fontsize=13, fontweight='bold')
axes[0, 0].set_ylim([0, 100])
axes[0, 0].grid(True, alpha=0.3, axis='y')
for i, v in enumerate(accuracy_vals):
    axes[0, 0].text(i, v + 1, f'{v:.1f}%', ha='center', va='bottom', fontsize=10)

# Plot 2: Sensitivity vs Specificity
x = np.arange(len(models))
width = 0.35
axes[0, 1].bar(x - width/2, sensitivity_vals, width, label='Sensitivity', color='coral', alpha=0.8)
axes[0, 1].bar(x + width/2, specificity_vals, width, label='Specificity', color='lightgreen', alpha=0.8)
axes[0, 1].set_ylabel('Score (%)', fontsize=12)
axes[0, 1].set_title('(b) Sensitivity vs Specificity', fontsize=13, fontweight='bold')
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels(models)
axes[0, 1].set_ylim([0, 100])
axes[0, 1].legend(fontsize=11)
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Plot 3: AUC comparison
axes[1, 0].bar(models, [a * 100 for a in auc_vals], color='purple', alpha=0.7)
axes[1, 0].set_ylabel('AUC Score (%)', fontsize=12)
axes[1, 0].set_title('(c) AUC Score Comparison', fontsize=13, fontweight='bold')
axes[1, 0].set_ylim([0, 100])
axes[1, 0].grid(True, alpha=0.3, axis='y')
for i, v in enumerate(auc_vals):
    axes[1, 0].text(i, v * 100 + 1, f'{v:.3f}', ha='center', va='bottom', fontsize=10)

# Plot 4: Overall performance as a grouped bar chart
# (accuracy, sensitivity, specificity, and AUC for each model)
metrics_names = ['Accuracy', 'Sensitivity', 'Specificity', 'AUC']
x_pos = np.arange(len(metrics_names))
bar_width = 0.13

for i, (model, acc, sens, spec, auc) in enumerate(zip(
    ['NN', 'SVM', 'RF', 'GB', 'XGB', 'LR'],
    accuracy_vals, sensitivity_vals, specificity_vals, [a * 100 for a in auc_vals]
)):
    offset = (i - 2.5) * bar_width
    axes[1, 1].bar(x_pos + offset, [acc, sens, spec, auc], bar_width, label=model)

axes[1, 1].set_ylabel('Score (%)', fontsize=12)
axes[1, 1].set_title('(d) Overall Performance Metrics', fontsize=13, fontweight='bold')
axes[1, 1].set_xticks(x_pos)
axes[1, 1].set_xticklabels(metrics_names)
axes[1, 1].legend(ncol=3, fontsize=9, loc='lower right')
axes[1, 1].set_ylim([0, 100])
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('../scripts/ml_models_comprehensive_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("Comprehensive comparison plot saved as '../scripts/ml_models_comprehensive_comparison.png'")

In [None]:
# ROC Curves for all models
plt.figure(figsize=(10, 8))

# Calculate ROC curves for each model
fpr_nn, tpr_nn, _ = roc_curve(y_test, y_proba_nn)
fpr_svm, tpr_svm, _ = roc_curve(y_test, y_proba_svm)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_proba_rf)
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_proba_gb)
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_proba_xgb)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_proba_lr)

# Plot all ROC curves
plt.plot(fpr_nn, tpr_nn, label=f'Neural Network (AUC = {nn_metrics["auc"]:.3f})', linewidth=2)
plt.plot(fpr_svm, tpr_svm, label=f'SVM (AUC = {svm_metrics["auc"]:.3f})', linewidth=2)
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {rf_metrics["auc"]:.3f})', linewidth=2)
plt.plot(fpr_gb, tpr_gb, label=f'Gradient Boosting (AUC = {gb_metrics["auc"]:.3f})', linewidth=2)
plt.plot(fpr_xgb, tpr_xgb, label=f'XGBoost (AUC = {xgb_metrics["auc"]:.3f})', linewidth=2)
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {lr_metrics["auc"]:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier', linewidth=2)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curves - All Models Comparison', fontsize=14, fontweight='bold')
plt.legend(loc="lower right", fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../scripts/roc_curves_all_models.png', dpi=300, bbox_inches='tight')
plt.show()

print("ROC curves saved as '../scripts/roc_curves_all_models.png'")
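As a small worked example of what `roc_curve` and `roc_auc_score` compute, the sketch below uses four hand-made scores (toy values, not model output) and checks the AUC against its pairwise-ranking interpretation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores, made up for illustration
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)
auc_demo = roc_auc_score(y_true_demo, y_score_demo)

# AUC equals the fraction of (positive, negative) pairs ranked correctly
pos = y_score_demo[y_true_demo == 1]
neg = y_score_demo[y_true_demo == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc_demo, pairwise)
```

Three of the four positive/negative pairs are ordered correctly, so both values come out at 0.75. The AUC depends only on how the scores rank the cases, not on any single decision threshold.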

## Hyperparameter Search Summary

In [None]:
# Summary of best hyperparameters for each model
print("\n" + "=" * 80)
print("BEST HYPERPARAMETERS SUMMARY")
print("=" * 80)

hyperparam_summary = [
    {
        'Model': 'Neural Network',
        'Best Parameters': str(nn_grid.best_params_),
        'CV Accuracy': f"{nn_grid.best_score_ * 100:.2f}%",
        'Test Accuracy': f"{nn_metrics['accuracy']:.2f}%"
    },
    {
        'Model': 'SVM',
        'Best Parameters': str(svm_grid.best_params_),
        'CV Accuracy': f"{svm_grid.best_score_ * 100:.2f}%",
        'Test Accuracy': f"{svm_metrics['accuracy']:.2f}%"
    },
    {
        'Model': 'Random Forest',
        'Best Parameters': str(rf_random.best_params_),
        'CV Accuracy': f"{rf_random.best_score_ * 100:.2f}%",
        'Test Accuracy': f"{rf_metrics['accuracy']:.2f}%"
    },
    {
        'Model': 'Gradient Boosting',
        'Best Parameters': str(gb_random.best_params_),
        'CV Accuracy': f"{gb_random.best_score_ * 100:.2f}%",
        'Test Accuracy': f"{gb_metrics['accuracy']:.2f}%"
    },
    {
        'Model': 'XGBoost',
        'Best Parameters': str(xgb_random.best_params_),
        'CV Accuracy': f"{xgb_random.best_score_ * 100:.2f}%",
        'Test Accuracy': f"{xgb_metrics['accuracy']:.2f}%"
    },
    {
        'Model': 'Logistic Regression',
        'Best Parameters': str(lr_grid.best_params_),
        'CV Accuracy': f"{lr_grid.best_score_ * 100:.2f}%",
        'Test Accuracy': f"{lr_metrics['accuracy']:.2f}%"
    }
]

hyperparam_df = pd.DataFrame(hyperparam_summary)
for _, row in hyperparam_df.iterrows():
    print(f"\n{row['Model']}:")
    print(f"  Best Parameters: {row['Best Parameters']}")
    print(f"  CV Accuracy: {row['CV Accuracy']}")
    print(f"  Test Accuracy: {row['Test Accuracy']}")

# Save to CSV
hyperparam_df.to_csv('../scripts/hyperparameter_summary.csv', index=False)
print("\n" + "=" * 80)
print("Hyperparameter summary saved to '../scripts/hyperparameter_summary.csv'")

## Key Findings and Recommendations

This notebook implements and compares six machine learning models for fall prediction:

1. **Neural Network (MLP)**: Feed-forward network with one or two hidden layers
2. **SVM**: Support Vector Machine with RBF, linear, and polynomial kernels
3. **Random Forest**: Ensemble method with OOB error estimation
4. **Gradient Boosting**: Sequential ensemble with boosting
5. **XGBoost**: Optimized gradient boosting implementation
6. **Logistic Regression**: Linear baseline model

### Methodology:
- **Data Split**: 75/25 stratified train/test split, which preserves the class proportions in both sets
- **Hyperparameter Tuning**: GridSearchCV or RandomizedSearchCV with 5-fold cross-validation
- **OOB Estimation**: OOB accuracy reported for Random Forest; for Gradient Boosting, OOB improvement is available only when the selected `subsample` is below 1.0
- **Metrics**: Comprehensive evaluation including accuracy, sensitivity, specificity, precision, F1, and AUC

### Next Steps:
1. Ensemble methods combining top models
2. Feature importance analysis
3. Error analysis and misclassification patterns
4. Clinical validation with domain experts