### Stroke Prediction Model Training

This notebook trains both baseline and improved models for comparison.

**Note:** SMOTE has already been applied to the training set in `data_preprocessing.ipynb`; the test set remains imbalanced (50 stroke cases out of 1022).

**Sections:**
1. Setup and Data Loading
2. Baseline Model Training (6 algorithms)
3. Improved Model Training (Hyperparameter Tuning)
4. Model Comparison
5. Feature Importance
6. Save Models

#### 1. Setup and Data Loading

In [6]:
import pandas as pd
import numpy as np
import joblib
import os
import warnings
warnings.filterwarnings('ignore')

# Models
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
    precision_recall_curve, make_scorer
)

# Model improvement
from sklearn.model_selection import GridSearchCV, StratifiedKFold

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [7]:
# Detect correct path
current_dir = os.getcwd()
if 'Notebooks' in current_dir:
    data_path = '../Data/'
    models_path = '../Models/'
else:
    data_path = 'Data/'
    models_path = 'Models/'

print(f"Current directory: {current_dir}")
print(f"Data path: {data_path}")
print(f"Models path: {models_path}")

Current directory: c:\Users\chigu\Desktop\stroke-prediction-project\Notebooks
Data path: ../Data/
Models path: ../Models/


In [8]:
# Load preprocessed data (already SMOTE-balanced from data_preprocessing.ipynb)
X_train = pd.read_csv(f'{data_path}X_train_preprocessed.csv')
X_test = pd.read_csv(f'{data_path}X_test_preprocessed.csv')
y_train = pd.read_csv(f'{data_path}y_train_preprocessed.csv').values.ravel()
y_test = pd.read_csv(f'{data_path}y_test_preprocessed.csv').values.ravel()

print("✅ Data loaded successfully")
print(f"\nTraining samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {list(X_train.columns)}")
print(f"\nClass distribution (train):")
print(f"  No Stroke: {np.bincount(y_train)[0]}")
print(f"  Stroke: {np.bincount(y_train)[1]}")
print(f"  Balance ratio: {np.bincount(y_train)[0] / np.bincount(y_train)[1]:.2f}:1")
print("\n✅ Data is already balanced (SMOTE was applied in preprocessing)")

✅ Data loaded successfully

Training samples: 7778
Test samples: 1022
Features: ['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status']

Class distribution (train):
  No Stroke: 3889
  Stroke: 3889
  Balance ratio: 1.00:1

✅ Data is already balanced (SMOTE was applied in preprocessing)


#### 2. Baseline Model Training

Train 6 different models without hyperparameter tuning to establish baseline performance.

In [9]:
print("=" * 70)
print("BASELINE MODEL TRAINING")
print("=" * 70)

models = {
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42, verbosity=0),
    'LogisticRegression': LogisticRegression(max_iter=500, random_state=42),
    'SVM': SVC(kernel='linear', probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'NaiveBayes': GaussianNB()
}

baseline_results = {}
for name, model in models.items():
    print(f'\nTraining {name}...')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    
    baseline_results[name] = {
        'Accuracy': acc,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }
    
    print(f'  Accuracy: {acc:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}')

baseline_df = pd.DataFrame(baseline_results).T
print("\n" + "=" * 70)
print("BASELINE RESULTS")
print("=" * 70)
print(baseline_df.to_string())

BASELINE MODEL TRAINING

Training RandomForest...
  Accuracy: 0.9022, Precision: 0.1212, Recall: 0.1600, F1: 0.1379

Training XGBoost...
  Accuracy: 0.8992, Precision: 0.1045, Recall: 0.1400, F1: 0.1197

Training LogisticRegression...
  Accuracy: 0.7857, Precision: 0.1494, Recall: 0.7200, F1: 0.2474

Training SVM...
  Accuracy: 0.7759, Precision: 0.1434, Recall: 0.7200, F1: 0.2392

Training KNN...
  Accuracy: 0.7945, Precision: 0.0876, Recall: 0.3400, F1: 0.1393

Training NaiveBayes...
  Accuracy: 0.7172, Precision: 0.1030, Recall: 0.6200, F1: 0.1766

BASELINE RESULTS
                    Accuracy  Precision  Recall  F1 Score
RandomForest        0.902153   0.121212    0.16  0.137931
XGBoost             0.899217   0.104478    0.14  0.119658
LogisticRegression  0.785714   0.149378    0.72  0.247423
SVM                 0.775930   0.143426    0.72  0.239203
KNN                 0.794521   0.087629    0.34  0.139344
NaiveBayes          0.717221   0.102990    0.62  0.176638
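
Accuracy alone is a weak yardstick here because the test set is heavily imbalanced. As a rough check, assuming the test-set class counts from the `support` column of the classification reports in Section 3 (972 no-stroke, 50 stroke), a trivial classifier that always predicts "No Stroke" would reach about 0.951 accuracy, higher than every baseline above, while detecting no strokes at all:

```python
# Test-set class counts, taken from the 'support' column of the
# classification reports in Section 3.
n_no_stroke = 972
n_stroke = 50
n_total = n_no_stroke + n_stroke

# A classifier that always predicts "No Stroke" gets every negative
# right and every positive wrong.
majority_accuracy = n_no_stroke / n_total
majority_recall = 0 / n_stroke

print(f"Majority-class accuracy: {majority_accuracy:.4f}")
print(f"Majority-class recall:   {majority_recall:.4f}")
```

This is why recall and F1 on the stroke class are the more informative columns in the table.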


#### 3. Improved Model Training

Apply improvements:
- Hyperparameter tuning with GridSearchCV
- Optimal threshold selection

Note: SMOTE and scaling were already applied during preprocessing.

In [16]:
print("\n" + "=" * 70)
print("IMPROVED MODEL TRAINING")
print("=" * 70)

print("\n[1/2] Optimizing hyperparameters...")

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [15, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'class_weight': ['balanced', None]
}

rf_base = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(
    rf_base,
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring=make_scorer(f1_score),
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\n   ✅ Best parameters:")
for param, value in grid_search.best_params_.items():
    print(f"     {param}: {value}")
print(f"   Best F1 Score (CV): {grid_search.best_score_:.4f}")


IMPROVED MODEL TRAINING

[1/2] Optimizing hyperparameters...
Fitting 3 folds for each of 48 candidates, totalling 144 fits

   ✅ Best parameters:
     class_weight: None
     max_depth: None
     min_samples_leaf: 1
     min_samples_split: 2
     n_estimators: 200
   Best F1 Score (CV): 0.9427
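
The "48 candidates, totalling 144 fits" line in the log follows directly from the grid: one candidate per combination of values, each fitted once per fold. A small standard-library sketch that reproduces the count (it only enumerates the combinations and does not train anything):

```python
from itertools import product

# Same grid and fold count as in the cell above.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [15, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'class_weight': ['balanced', None],
}
n_splits = 3

# One candidate per element of the Cartesian product of all value lists.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_splits

print(f"{n_candidates} candidates x {n_splits} folds = {n_fits} fits")
```

Note that the cross-validated F1 of 0.9427 is measured on SMOTE-balanced folds, so it is not directly comparable with F1 on the imbalanced test set.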


In [11]:
print("\n[2/2] Evaluating improved model...")
best_model = grid_search.best_estimator_

y_pred_improved = best_model.predict(X_test)
y_pred_proba_improved = best_model.predict_proba(X_test)[:, 1]

print("\n" + "=" * 70)
print("IMPROVED MODEL RESULTS")
print("=" * 70)
print(classification_report(y_test, y_pred_improved, target_names=['No Stroke', 'Stroke']))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred_improved)
print(f"   True Negatives:  {cm[0][0]:4d}  (Correctly predicted No Stroke)")
print(f"   False Positives: {cm[0][1]:4d}  (Incorrectly predicted Stroke)")
print(f"   False Negatives: {cm[1][0]:4d}  (Missed Stroke cases - BAD!)")
print(f"   True Positives:  {cm[1][1]:4d}  (Correctly predicted Stroke)")

auc_score = roc_auc_score(y_test, y_pred_proba_improved)
print(f"\n   AUC-ROC Score: {auc_score:.4f}")


[2/2] Evaluating improved model...

IMPROVED MODEL RESULTS
              precision    recall  f1-score   support

   No Stroke       0.96      0.94      0.95       972
      Stroke       0.14      0.20      0.17        50

    accuracy                           0.90      1022
   macro avg       0.55      0.57      0.56      1022
weighted avg       0.92      0.90      0.91      1022


Confusion Matrix:
   True Negatives:   912  (Correctly predicted No Stroke)
   False Positives:   60  (Incorrectly predicted Stroke)
   False Negatives:   40  (Missed Stroke cases - BAD!)
   True Positives:    10  (Correctly predicted Stroke)

   AUC-ROC Score: 0.7910
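
The stroke-class row of the classification report can be reproduced by hand from the four confusion-matrix counts printed above, which is a useful sanity check on how the matrix is indexed:

```python
# Counts from the confusion matrix above (default 0.5 threshold).
tn, fp, fn, tp = 912, 60, 40, 10

precision = tp / (tp + fp)           # share of predicted strokes that are real
recall = tp / (tp + fn)              # share of real strokes that are detected
f1 = 2 * tp / (2 * tp + fp + fn)     # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1:        {f1:.2f}")
print(f"Accuracy:  {accuracy:.2f}")
```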


In [12]:
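# Find optimal threshold
# Caveat: the threshold is selected on the test set, so the metrics reported
# at this threshold are optimistic; a held-out validation split would avoid this.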
print("\n" + "=" * 70)
print("OPTIMAL THRESHOLD SELECTION")
print("=" * 70)

precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_improved)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx] if optimal_idx < len(thresholds) else 0.5

print(f"\n   Default threshold: 0.5")
print(f"   Optimal threshold: {optimal_threshold:.3f}")
print(f"   F1 at optimal threshold: {f1_scores[optimal_idx]:.4f}")

y_pred_optimal = (y_pred_proba_improved >= optimal_threshold).astype(int)
print("\n   Performance with optimal threshold:")
print(classification_report(y_test, y_pred_optimal, target_names=['No Stroke', 'Stroke']))


OPTIMAL THRESHOLD SELECTION

   Default threshold: 0.5
   Optimal threshold: 0.335
   F1 at optimal threshold: 0.2437

   Performance with optimal threshold:
              precision    recall  f1-score   support

   No Stroke       0.97      0.87      0.92       972
      Stroke       0.16      0.48      0.24        50

    accuracy                           0.85      1022
   macro avg       0.57      0.68      0.58      1022
weighted avg       0.93      0.85      0.89      1022



#### 4. Model Comparison

In [13]:
print("\n" + "=" * 70)
print("BASELINE vs IMPROVED COMPARISON")
print("=" * 70)

comparison = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'AUC-ROC'],
    'Baseline (RF)': [
        baseline_results['RandomForest']['Accuracy'],
        baseline_results['RandomForest']['Precision'],
        baseline_results['RandomForest']['Recall'],
        baseline_results['RandomForest']['F1 Score'],
        0  # placeholder: AUC-ROC was not computed for the baseline
    ],
    'Improved (RF)': [
        accuracy_score(y_test, y_pred_optimal),
        precision_score(y_test, y_pred_optimal),
        recall_score(y_test, y_pred_optimal),
        f1_score(y_test, y_pred_optimal),
        auc_score
    ]
})

# Signed relative change; the '+' format flag prints the sign for both
# gains and losses, so a decrease shows as '-5.3%' rather than '+-5.3%'.
comparison['Improvement'] = comparison.apply(
    lambda row: f"{((row['Improved (RF)'] - row['Baseline (RF)']) / row['Baseline (RF)'] * 100):+.1f}%"
    if row['Baseline (RF)'] > 0 else 'N/A',
    axis=1
)

print("\n", comparison.to_string(index=False))

cm_baseline = confusion_matrix(y_test, models['RandomForest'].predict(X_test))
cm_improved = confusion_matrix(y_test, y_pred_optimal)

print("\n" + "=" * 70)
print("KEY INSIGHTS")
print("=" * 70)

if cm_baseline[1][0] > 0:
    fn_reduction = ((cm_baseline[1][0] - cm_improved[1][0]) / cm_baseline[1][0] * 100)
    print(f"\n✓ False Negatives reduced by: {fn_reduction:.1f}%")
    print(f"  (Fewer missed stroke cases: {cm_baseline[1][0]} → {cm_improved[1][0]})")

if cm_baseline[1][1] > 0:
    tp_increase = ((cm_improved[1][1] - cm_baseline[1][1]) / cm_baseline[1][1] * 100)
    print(f"\n✓ True Positives increased by: {tp_increase:.1f}%")
    print(f"  (More stroke cases detected: {cm_baseline[1][1]} → {cm_improved[1][1]})")

print("\n💡 The improved model is better for medical screening!")


BASELINE vs IMPROVED COMPARISON

    Metric  Baseline (RF)  Improved (RF) Improvement
 Accuracy       0.902153       0.854207       -5.3%
Precision       0.121212       0.163265      +34.7%
   Recall       0.160000       0.480000     +200.0%
 F1 Score       0.137931       0.243655      +76.6%
  AUC-ROC       0.000000       0.790957         N/A

KEY INSIGHTS

✓ False Negatives reduced by: 38.1%
  (Fewer missed stroke cases: 42 → 26)

✓ True Positives increased by: 200.0%
  (More stroke cases detected: 8 → 24)

💡 The improved model is better for medical screening!
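
The percentages in this comparison are relative changes against the baseline, and they can be checked by hand from the printed values. The sketch below recomputes two of them and the false-negative reduction; `relative_change` is a hypothetical helper name used only for illustration. Note that the accuracy change is a decrease: the gain in recall is paid for with more false positives.

```python
def relative_change(baseline, improved):
    """Signed percentage change from baseline to improved."""
    return (improved - baseline) / baseline * 100

# Values copied from the comparison table above.
accuracy_change = relative_change(0.902153, 0.854207)
recall_change = relative_change(0.16, 0.48)

# False negatives: 42 at baseline, 26 with the tuned threshold.
fn_baseline, fn_improved = 42, 26
fn_reduction = (fn_baseline - fn_improved) / fn_baseline * 100

print(f"Accuracy: {accuracy_change:+.1f}%")
print(f"Recall:   {recall_change:+.1f}%")
print(f"False negatives reduced by {fn_reduction:.1f}%")
```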


#### 5. Feature Importance

In [14]:
print("\n" + "=" * 70)
print("TOP 10 MOST IMPORTANT FEATURES")
print("=" * 70)

feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n", feature_importance.head(10).to_string(index=False))
print("\n💡 Age is typically the strongest predictor of stroke risk")


TOP 10 MOST IMPORTANT FEATURES

           Feature  Importance
              age    0.425884
avg_glucose_level    0.187328
              bmi    0.149237
        work_type    0.074867
   smoking_status    0.053145
   Residence_type    0.029854
           gender    0.025910
     ever_married    0.024206
     hypertension    0.016049
    heart_disease    0.013520

💡 Age is typically the strongest predictor of stroke risk
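
Random forest `feature_importances_` in scikit-learn are normalized to sum to 1, so each value can be read as a share of the total. A quick check on the numbers printed above, which also shows how much of the total the top three features account for:

```python
# Importances copied from the table above.
importances = {
    'age': 0.425884,
    'avg_glucose_level': 0.187328,
    'bmi': 0.149237,
    'work_type': 0.074867,
    'smoking_status': 0.053145,
    'Residence_type': 0.029854,
    'gender': 0.025910,
    'ever_married': 0.024206,
    'hypertension': 0.016049,
    'heart_disease': 0.013520,
}

total = sum(importances.values())
top3_share = sum(sorted(importances.values(), reverse=True)[:3])

print(f"Sum of importances: {total:.6f}")
print(f"Share held by top 3 features: {top3_share:.1%}")
```

The three continuous features (`age`, `avg_glucose_level`, `bmi`) carry roughly three quarters of the total importance.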


#### 6. Save Models

In [15]:
print("\n" + "=" * 70)
print("SAVING MODELS")
print("=" * 70)

# Save baseline
joblib.dump(models['RandomForest'], f'{models_path}stroke_model_baseline.pkl')
print(f"\n✓ Baseline model saved: {models_path}stroke_model_baseline.pkl")

# Save improved
joblib.dump(best_model, f'{models_path}stroke_model_improved.pkl')
print(f"✓ Improved model saved: {models_path}stroke_model_improved.pkl")

# Save metrics
metrics = pd.DataFrame([{
    'Model': 'RandomForest_Improved',
    'Accuracy': accuracy_score(y_test, y_pred_optimal),
    'Precision': precision_score(y_test, y_pred_optimal),
    'Recall': recall_score(y_test, y_pred_optimal),
    'F1_Score': f1_score(y_test, y_pred_optimal),
    'AUC_ROC': auc_score,
    'Optimal_Threshold': optimal_threshold
}])

metrics.to_csv(f'{models_path}model_performance_improved.csv', index=False)
print(f"✓ Metrics saved: {models_path}model_performance_improved.csv")

comparison.to_csv(f'{models_path}model_comparison.csv', index=False)
print(f"✓ Comparison saved: {models_path}model_comparison.csv")

print("\n" + "=" * 70)
print("✅ TRAINING COMPLETE!")
print("=" * 70)
print("\nTo use the improved model in your web app:")
print("\n1. Backup current model:")
print(f"   copy {models_path}stroke_model.pkl {models_path}stroke_model_backup.pkl")
print("\n2. Replace with improved model:")
print(f"   copy {models_path}stroke_model_improved.pkl {models_path}stroke_model.pkl")
print("\n3. Restart the web application")
print(f"\n4. (Optional) Update threshold in Webapp/utils.py to: {optimal_threshold:.3f}")
print("\n" + "=" * 70)


SAVING MODELS

✓ Baseline model saved: ../Models/stroke_model_baseline.pkl
✓ Improved model saved: ../Models/stroke_model_improved.pkl
✓ Metrics saved: ../Models/model_performance_improved.csv
✓ Comparison saved: ../Models/model_comparison.csv

✅ TRAINING COMPLETE!

To use the improved model in your web app:

1. Backup current model:
   copy ../Models/stroke_model.pkl ../Models/stroke_model_backup.pkl

2. Replace with improved model:
   copy ../Models/stroke_model_improved.pkl ../Models/stroke_model.pkl

3. Restart the web application

4. (Optional) Update threshold in Webapp/utils.py to: 0.335

