# 04 Model Training & Baseline Evaluation

This notebook loads all preprocessed datasets from 03 Feature Engineering,
applies SMOTE to each variant‚Äôs training data, saves the balanced datasets,
and performs model training.

All train/test splits are fixed from 03 to prevent data leakage.

This notebook focuses on training baseline classifiers for:
- **Original structured features**
- **Word2Vec (W2V) embeddings**

Workflow:
1. Load processed datasets
2. Resampling with SMOTE
3. Define classifiers and parameter grids
4. Run mixed CV + hyperparameter search
5. Retrain best models on SMOTE-balanced sets
6. Save results (CSV + Markdown) for reporting


## 0. Setup
Import core modules


In [1]:
import psutil, os
print([p.info for p in psutil.process_iter(['pid','name']) if 'python' in p.info['name'].lower()])

[{'name': 'python.exe', 'pid': 8760}, {'name': 'python.exe', 'pid': 10172}, {'name': 'python.exe', 'pid': 25992}]


In [20]:
# --- One-time module reload cell (safe within Jupyter) ---

import importlib
import pandas as pd
import numpy as np
import os
from sklearn.metrics import roc_auc_score, make_scorer

# Core project imports
import src.models as models
import src.resampling as resampling
import src.utils as utils
import src.evaluation as evaluation

# Reload to ensure latest updates (e.g., _decision_or_proba, auc_scorer)
importlib.reload(models)
importlib.reload(resampling)
importlib.reload(utils)
importlib.reload(evaluation)

# Pull updated functions/classes into namespace
from src.models import (
    get_classifiers,
    get_param_distributions,
    get_n_iter_random_per_clf,
    repeated_cv_with_mixed_search,
    auc_scorer
)

from src.resampling import (
    resample_training_data,
    print_class_balance
)

from src.utils import resolve_path
from src.evaluation import export_summary

print("‚úÖ All modules reloaded successfully (models, resampling, utils, evaluation).")


‚úÖ All modules reloaded successfully (models, resampling, utils, evaluation).


In [2]:
import pandas as pd
import numpy as np
import os
from sklearn.metrics import roc_auc_score, make_scorer

from src.models import (
    get_classifiers, 
    get_param_distributions,
    get_n_iter_random_per_clf,
    repeated_cv_with_mixed_search,
    auc_scorer
    
)

from src.resampling import resample_training_data, print_class_balance
from src.utils import resolve_path
from src.evaluation import export_summary


## 1. Load Feature Datasets
Load all variants (original, w2v_rad, w2v_dis, w2v_comb) produced in 03 Feature Engineering.  
Each variant includes four CSVs: `X_train`, `X_test`, `y_train`, `y_test`.

In [3]:
variants = ["original", "w2v_radiology", "w2v_discharge", "w2v_combined"]
datasets = {}

for variant in variants:
    X_train = pd.read_csv(resolve_path(f"data/processed/{variant}/data_{variant}_xtrain.csv"))
    X_test  = pd.read_csv(resolve_path(f"data/processed/{variant}/data_{variant}_xtest.csv"))
    y_train = pd.read_csv(resolve_path(f"data/processed/{variant}/data_{variant}_ytrain.csv")).squeeze()
    y_test  = pd.read_csv(resolve_path(f"data/processed/{variant}/data_{variant}_ytest.csv")).squeeze()

    datasets[variant] = {
        "X_train": X_train,
        "X_test": X_test,
        "y_train": y_train,
        "y_test": y_test
    }

    print(f"‚úÖ Loaded {variant} dataset ‚Üí Train: {X_train.shape}, Test: {X_test.shape}")
    print_class_balance(y_train, f"{variant} training set (before SMOTE)")

‚úÖ Loaded original dataset ‚Üí Train: (4166, 43), Test: (1042, 43)
original training set (before SMOTE) class balance: {0: 3204, 1: 962}
‚úÖ Loaded w2v_radiology dataset ‚Üí Train: (4166, 143), Test: (1042, 143)
w2v_radiology training set (before SMOTE) class balance: {0: 3204, 1: 962}
‚úÖ Loaded w2v_discharge dataset ‚Üí Train: (4166, 143), Test: (1042, 143)
w2v_discharge training set (before SMOTE) class balance: {0: 3204, 1: 962}
‚úÖ Loaded w2v_combined dataset ‚Üí Train: (4166, 143), Test: (1042, 143)
w2v_combined training set (before SMOTE) class balance: {0: 3204, 1: 962}


## 2. Apply SMOTE Per Variant

Resample only the training set for each variant using SMOTE.
This ensures the test set remains untouched for unbiased evaluation.

In [4]:
for variant, data in datasets.items():
    X_train_res, y_train_res = resample_training_data(
        data["X_train"], data["y_train"], method="smote"
    )
    datasets[variant]["X_train_res"] = X_train_res
    datasets[variant]["y_train_res"] = y_train_res
    print_class_balance(y_train_res, f"{variant} training set (after SMOTE)")



üîÅ Applying SMOTE to training data ...
‚úÖ Resampled training set shape: (6408, 43)
   Class balance after resampling: Counter({0: 3204, 1: 3204})
original training set (after SMOTE) class balance: {0: 3204, 1: 3204}
üîÅ Applying SMOTE to training data ...
‚úÖ Resampled training set shape: (6408, 143)
   Class balance after resampling: Counter({0: 3204, 1: 3204})
w2v_radiology training set (after SMOTE) class balance: {0: 3204, 1: 3204}
üîÅ Applying SMOTE to training data ...
‚úÖ Resampled training set shape: (6408, 143)
   Class balance after resampling: Counter({0: 3204, 1: 3204})
w2v_discharge training set (after SMOTE) class balance: {0: 3204, 1: 3204}
üîÅ Applying SMOTE to training data ...
‚úÖ Resampled training set shape: (6408, 143)
   Class balance after resampling: Counter({0: 3204, 1: 3204})
w2v_combined training set (after SMOTE) class balance: {0: 3204, 1: 3204}


## 3. Save SMOTE-Balanced Training Sets

Each variant‚Äôs SMOTE-balanced training data are saved
to `data/processed/{variant}` for external reuse and verification.
Paths use `resolve_path()` for portability.

In [5]:
for variant, data in datasets.items():
    # Create variant-specific directory
    out_dir = resolve_path(f"data/processed/{variant}")
    os.makedirs(out_dir, exist_ok=True)

    # Save SMOTE-balanced training data and labels separately
    X_train_res = pd.DataFrame(data["X_train_res"])
    y_train_res = pd.Series(data["y_train_res"], name="target")

    # Save consistent with 03_feature_engineering style
    X_train_res.to_csv(os.path.join(out_dir, f"data_{variant}_xtrain_res.csv"), index=False)
    y_train_res.to_csv(os.path.join(out_dir, f"data_{variant}_ytrain_res.csv"), index=False)

    print(f"‚úÖ Saved SMOTE-balanced training sets for {variant} under {out_dir}")

‚úÖ Saved SMOTE-balanced training sets for original under C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\original
‚úÖ Saved SMOTE-balanced training sets for w2v_radiology under C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\w2v_radiology
‚úÖ Saved SMOTE-balanced training sets for w2v_discharge under C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\w2v_discharge
‚úÖ Saved SMOTE-balanced training sets for w2v_combined under C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\w2v_combined


## 4. Define Classifiers & Hyperparameter Distributions
Load from `src/models.py`:
- `get_classifiers()` for base estimators
- `get_param_distributions()` for Grid/Randomized search spaces


In [6]:
classifiers = get_classifiers()
param_spaces = get_param_distributions()
n_iter_random_per_clf = get_n_iter_random_per_clf()

print("‚úÖ Classifiers and hyperparameter grids initialized.")
print("Available classifiers:", list(classifiers.keys()))

‚úÖ Classifiers and hyperparameter grids initialized.
Available classifiers: ['LogisticRegression', 'DecisionTree', 'RandomForest', 'GradientBoosting', 'XGB', 'LGBM', 'CatBoost', 'SVC', 'MLP', 'NaiveBayes']


## 5. Train on Structured Features (Original)

The call to `repeated_cv_with_mixed_search()` runs both non-SMOTE and
SMOTE-balanced training internally, logs results to MLflow, and saves
artifacts to the appropriate `results/models/original/` folder.

Run `repeated_cv_with_mixed_search`:
- Hyperparameter search on non-SMOTE data
- Retrain on SMOTE data with best params
- Save models to `results/models/original/`
- Evaluated on the holdout test set
- Export summary to CSV + Markdown


In [7]:
X_train_orig = datasets["original"]["X_train"]
X_test_orig  = datasets["original"]["X_test"]
y_train_orig = datasets["original"]["y_train"]
y_test_orig  = datasets["original"]["y_test"]
X_train_smote_orig = datasets["original"]["X_train_res"]
y_train_smote_orig = datasets["original"]["y_train_res"]

results_orig, summary_orig = repeated_cv_with_mixed_search(
    X_train_orig, y_train_orig, X_test_orig, y_test_orig,
    classifiers=classifiers,
    param_spaces=param_spaces,
    X_train_smote=X_train_smote_orig,
    y_train_smote=y_train_smote_orig,
    n_splits=5,
    n_repeats=10,
    scoring=auc_scorer,
    n_iter_random=50,
    n_iter_random_per_clf=n_iter_random_per_clf,
    save_prefix="results/models/{mode}/",
    mode="original_baseline",
    log_mlflow=True
)

export_summary(summary_orig, save_prefix="reports/", mode="original_baseline")
print("‚úÖ Finished model training for Original dataset.")


‚úÖ MLflow tracking initialized under unified experiment 'Thesis_ModelTraining'
Tracking URI: file:///C:/Users/tyler/OneDrive%20-%20University%20of%20Pittsburgh/BIOST%202021%20Thesis/Masters-Thesis/mlflow_tracking (Experiment ID: 169692831354922862)

üîπ Running LogisticRegression...
Fitting 50 folds for each of 44 candidates, totalling 2200 fits
   Performing descriptive StratifiedKFold CV on original training set for LogisticRegression...
   Descriptive CV AUC: 0.7039 ¬± 0.0140
üíæ Saved LogisticRegression model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\original_baseline_20251024_2113\original_baseline_20251024_2113_LogisticRegression_model.pkl
‚úÖ LogisticRegression done. Best params: {'clf__C': 0.01, 'clf__max_iter': 1000, 'clf__penalty': 'l2', 'clf__solver': 'liblinear'}
   CV ROC-AUC: 0.706 ¬± 0.018
   Holdout ROC-AUC: 0.723
üíæ Saved non-SMOTE metrics for LogisticRegression to C:\Users\tyler\OneDrive - Univers

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Performing descriptive StratifiedKFold CV on original training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC: 0.7184 ¬± 0.0177
üíæ Saved XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\original_baseline_20251024_2113\original_baseline_20251024_2113_XGB_model.pkl
‚úÖ XGB done. Best params: {'clf__colsample_bytree': 0.9165996316800473, 'clf__gamma': 0.4692763545078751, 'clf__learning_rate': 0.010233629752304298, 'clf__max_depth': 6, 'clf__min_child_weight': 5, 'clf__n_estimators': 360, 'clf__subsample': 0.7912726728878613}
   CV ROC-AUC: 0.717 ¬± 0.019
   Holdout ROC-AUC: 0.729
üíæ Saved non-SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\original_baseline_20251024_2113\original_baseline_20251024_2113_XGB_metrics_non_smote.json


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   SMOTE Holdout ROC-AUC: 0.7127
   Performing descriptive StratifiedKFold CV on SMOTE training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC (SMOTE): 0.9106 ¬± 0.0083
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\original_baseline_20251024_2113\original_baseline_20251024_2113_XGB_smote_model.pkl
üíæ Saved SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\original_baseline_20251024_2113\original_baseline_20251024_2113_XGB_metrics_smote.json
‚è±Ô∏è  Runtime for XGB: 3.47 minutes
üèÅ MLflow run for 'XGB' closed cleanly.

üîπ Running LGBM...
Fitting 50 folds for each of 50 candidates, totalling 2500 fits
   Performing descriptive StratifiedKFold CV on original training set for LGBM...
   Descriptive CV AUC: 0.7069 ¬± 0.0161
üíæ Saved LGBM model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\original_baseline_20251024_2113\original_baselin

## 6. Train on Word2Vec Features
The call to `repeated_cv_with_mixed_search()` runs both non-SMOTE and
SMOTE-balanced training internally, logs results to MLflow, and saves
artifacts to the appropriate `results/models/w2v/` folder.

Run `repeated_cv_with_mixed_search`:
- Hyperparameter search on non-SMOTE data
- Retrain on SMOTE data with best params
- Save models to `results/models/w2v/`
- Evaluated on the holdout test set
- Export summary to CSV + Markdown


In [9]:
X_train_w2v = datasets["w2v_radiology"]["X_train"]
X_test_w2v  = datasets["w2v_radiology"]["X_test"]
y_train_w2v = datasets["w2v_radiology"]["y_train"]
y_test_w2v  = datasets["w2v_radiology"]["y_test"]
X_train_smote_w2v = datasets["w2v_radiology"]["X_train_res"]
y_train_smote_w2v = datasets["w2v_radiology"]["y_train_res"]

results_w2v, summary_w2v = repeated_cv_with_mixed_search(
    X_train_w2v, y_train_w2v, X_test_w2v, y_test_w2v,
    classifiers=classifiers,
    param_spaces=param_spaces,
    X_train_smote=X_train_smote_w2v,
    y_train_smote=y_train_smote_w2v,
    n_splits=5,
    n_repeats=10,
    scoring=auc_scorer,
    n_iter_random=50,
    n_iter_random_per_clf=n_iter_random_per_clf,
    save_prefix="results/models/{mode}/",
    mode="w2v_radiology_baseline",
    log_mlflow=True
)

export_summary(summary_w2v, save_prefix="reports/", mode="w2v_radiology_baseline")
print("‚úÖ Finished model training for Radiology Word2Vec dataset.")


‚úÖ MLflow tracking initialized under unified experiment 'Thesis_ModelTraining'
Tracking URI: file:///C:/Users/tyler/OneDrive%20-%20University%20of%20Pittsburgh/BIOST%202021%20Thesis/Masters-Thesis/mlflow_tracking (Experiment ID: 169692831354922862)

üîπ Running LogisticRegression...
Fitting 50 folds for each of 44 candidates, totalling 2200 fits
   Performing descriptive StratifiedKFold CV on original training set for LogisticRegression...
   Descriptive CV AUC: 0.7345 ¬± 0.0161
üíæ Saved LogisticRegression model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\w2v_radiology_baseline_20251025_0258\w2v_radiology_baseline_20251025_0258_LogisticRegression_model.pkl
‚úÖ LogisticRegression done. Best params: {'clf__C': 0.1, 'clf__l1_ratio': 0.5, 'clf__max_iter': 1000, 'clf__penalty': 'elasticnet', 'clf__solver': 'saga'}
   CV ROC-AUC: 0.736 ¬± 0.017
   Holdout ROC-AUC: 0.752
üíæ Saved non-SMOTE metrics for LogisticRegression to

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Performing descriptive StratifiedKFold CV on original training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC: 0.7480 ¬± 0.0160
üíæ Saved XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\w2v_radiology_baseline_20251025_0258\w2v_radiology_baseline_20251025_0258_XGB_model.pkl
‚úÖ XGB done. Best params: {'clf__colsample_bytree': 0.9454044297767479, 'clf__gamma': 0.4303652916281717, 'clf__learning_rate': 0.01208563915935721, 'clf__max_depth': 10, 'clf__min_child_weight': 3, 'clf__n_estimators': 848, 'clf__subsample': 0.8454489914076949}
   CV ROC-AUC: 0.744 ¬± 0.017
   Holdout ROC-AUC: 0.752
üíæ Saved non-SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\w2v_radiology_baseline_20251025_0258\w2v_radiology_baseline_20251025_0258_XGB_metrics_non_smote.json


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   SMOTE Holdout ROC-AUC: 0.7420
   Performing descriptive StratifiedKFold CV on SMOTE training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC (SMOTE): 0.9533 ¬± 0.0048
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\w2v_radiology_baseline_20251025_0258\w2v_radiology_baseline_20251025_0258_XGB_smote_model.pkl
üíæ Saved SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\w2v_radiology_baseline_20251025_0258\w2v_radiology_baseline_20251025_0258_XGB_metrics_smote.json
‚è±Ô∏è  Runtime for XGB: 36.54 minutes
üèÅ MLflow run for 'XGB' closed cleanly.

üîπ Running LGBM...
Fitting 50 folds for each of 50 candidates, totalling 2500 fits
   Performing descriptive StratifiedKFold CV on original training set for LGBM...
   Descriptive CV AUC: 0.7476 ¬± 0.0167
üíæ Saved LGBM model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\w2v_radiology_baseline_2025

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


   Performing descriptive StratifiedKFold CV on original training set for SVC...
   Descriptive CV AUC: 0.7098 ¬± 0.0086
üíæ Saved SVC model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\w2v_radiology_baseline_20251025_0258\w2v_radiology_baseline_20251025_0258_SVC_model.pkl
‚úÖ SVC done. Best params: {'clf__C': 10, 'clf__gamma': 'scale', 'clf__kernel': 'linear', 'clf__shrinking': False}
   CV ROC-AUC: 0.711 ¬± 0.017
   Holdout ROC-AUC: 0.731
üíæ Saved non-SMOTE metrics for SVC to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\{mode}\w2v_radiology_baseline_20251025_0258\w2v_radiology_baseline_20251025_0258_SVC_metrics_non_smote.json
   SMOTE Holdout ROC-AUC: 0.7338
   Performing descriptive StratifiedKFold CV on SMOTE training set for SVC...
   Descriptive CV AUC (SMOTE): 0.7750 ¬± 0.0128
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained SVC model to C:\Users\tyler\One

## 7: (Optional) Future Variants ‚Äî Discharge & Combined

To extend training later, uncomment and adapt the same pattern for:
- `datasets["w2v_dis"]`
- `datasets["w2v_comb"]`

Each should call `repeated_cv_with_mixed_search()` with appropriate paths:
`save_prefix="results/models/w2v_dis/"` and `"results/models/w2v_comb/"`.

In [None]:
# ==================================================
# [Optional] Train classifiers for Discharge and Combined variants
# ==================================================
# for variant in ["w2v_dis", "w2v_comb"]:
#     data = datasets[variant]
#     print(f"\nüß† Training classifiers for {variant.upper()} variant")
#
#     results_orig, summary_orig = repeated_cv_with_mixed_search(
#         data["X_train"], data["y_train"], data["X_test"], data["y_test"],
#         classifiers, param_spaces,
#         descriptive_cv=True,
#         mode=variant
#     )
#
#     results_smote, summary_smote = repeated_cv_with_mixed_search(
#         data["X_train"], data["y_train"], data["X_test"], data["y_test"],
#         classifiers, param_spaces,
#         X_train_smote=data["X_train_res"],
#         y_train_smote=data["y_train_res"],
#         descriptive_cv=True,
#         mode=f"{variant}_smote"
#     )
#
#     results_all[variant] = {
#         "non_smote": summary_orig,
#         "smote": summary_smote
#     }


## 8. Baseline Comparison Summary

Merge and compare model performance summaries across all available feature variants.
Results are saved under `results/reports/` and optionally logged to MLflow.

In [10]:
import os
import pandas as pd
from src.utils import resolve_path

# --- Tag datasets for clarity ---
summary_orig["Dataset"] = "original"
summary_w2v["Dataset"] = "w2v_radiology"

# Optional: add discharge/combined if available
if "summary_w2v_dis" in globals():
    summary_w2v_dis["Dataset"] = "w2v_discharge"

if "summary_w2v_comb" in globals():
    summary_w2v_comb["Dataset"] = "w2v_combined"

# --- Merge all summaries into one DataFrame ---
all_summaries = [summary_orig, summary_w2v]

if "summary_w2v_dis" in globals():
    all_summaries.append(summary_w2v_dis)

if "summary_w2v_comb" in globals():
    all_summaries.append(summary_w2v_comb)

baseline_summary = pd.concat(all_summaries, axis=0)

# --- Save merged comparison summary ---
baseline_summary_path = resolve_path("results/reports/baseline_comparison.csv")
os.makedirs(os.path.dirname(baseline_summary_path), exist_ok=True)
baseline_summary.to_csv(baseline_summary_path, index=False)

print(f"üíæ Saved merged baseline summary to {baseline_summary_path}")

# --- Optional: Log to MLflow ---
if "mlflow" in globals():
    mlflow.log_artifact(baseline_summary_path, artifact_path="summaries")
    print(f"üì§ Logged baseline comparison summary to MLflow at 'summaries/'")

# --- Display tidy summary table ---
cols_to_display = [
    "Dataset",
    "Classifier",
    "Holdout ROC-AUC",
    "Holdout Precision",
    "Holdout Recall",
    "Holdout F1",
    "Holdout ROC-AUC (SMOTE)",
    "Final Holdout ROC-AUC (SMOTE)",
]

display(baseline_summary[[c for c in cols_to_display if c in baseline_summary.columns]])

üíæ Saved merged baseline summary to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\reports\baseline_comparison.csv


Unnamed: 0,Dataset,Classifier,Holdout ROC-AUC,Holdout Precision,Holdout Recall,Holdout F1,Holdout ROC-AUC (SMOTE),Final Holdout ROC-AUC (SMOTE)
6,original,CatBoost,0.734284,0.545455,0.124481,0.202703,0.724805,0.724805
4,original,XGB,0.728923,0.535714,0.124481,0.20202,0.712693,
3,original,GradientBoosting,0.726701,0.55102,0.112033,0.186207,0.703203,
2,original,RandomForest,0.726405,0.916667,0.045643,0.086957,0.702074,
5,original,LGBM,0.725975,0.482759,0.174274,0.256098,0.709917,
0,original,LogisticRegression,0.723406,0.483333,0.120332,0.192691,0.716185,
9,original,NaiveBayes,0.701084,0.406844,0.443983,0.424603,0.681948,
8,original,MLP,0.69493,0.433333,0.107884,0.172757,0.669951,
7,original,SVC,0.688066,0.777778,0.029046,0.056,0.687258,
1,original,DecisionTree,0.63486,0.516667,0.128631,0.20598,0.629568,


## 9. Completion
Models and results are saved to:
- `results/models/original/`
- `results/models/w2v/`
- `reports/`


In [11]:
print("‚úÖ Training complete. Models + summaries saved to ../results/models/ and ../results/reports/")


‚úÖ Training complete. Models + summaries saved to ../results/models/ and ../results/reports/


In [None]:
'''
Troubleshooting step for SVC, MLP, and NB

# --- Test on SVC, MLP, and NB ---
X_train_orig = datasets["original"]["X_train"]
X_test_orig  = datasets["original"]["X_test"]
y_train_orig = datasets["original"]["y_train"]
y_test_orig  = datasets["original"]["y_test"]
X_train_smote_orig = datasets["original"]["X_train_res"]
y_train_smote_orig = datasets["original"]["y_train_res"]

# --- Restrict classifiers to SVC, MLP, and Naive Bayes only ---
subset_classifiers = {k: v for k, v in classifiers.items() if k in ["SVC", "MLP", "NaiveBayes"]}

# --- Restrict param spaces accordingly ---
subset_param_spaces = {k: v for k, v in param_spaces.items() if k in ["SVC", "MLP", "NaiveBayes"]}

# --- Run cross-validation and SMOTE retraining just for these three ---
results_orig, summary_orig = repeated_cv_with_mixed_search(
    X_train_orig, y_train_orig, X_test_orig, y_test_orig,
    classifiers=subset_classifiers,
    param_spaces=subset_param_spaces,
    X_train_smote=X_train_smote_orig,
    y_train_smote=y_train_smote_orig,
    n_splits=5,
    n_repeats=10,
    scoring=auc_scorer,
    n_iter_random=50,
    n_iter_random_per_clf=n_iter_random_per_clf,
    save_prefix="results/models/original/",
    mode="original_svc_mlp_nb",
    log_mlflow=True
)

export_summary(summary_orig, save_prefix="reports/", mode="original_svc_mlp_nb")
print("‚úÖ Finished model training for Original dataset (SVC, MLP, NB only).")
'''

In [21]:
'''
# --- Minimal Dry Run: Radiology Word2Vec ---

# 1Ô∏è‚É£ Load W2V Radiology datasets
X_train_w2v = datasets["w2v_radiology"]["X_train"]
X_test_w2v  = datasets["w2v_radiology"]["X_test"]
y_train_w2v = datasets["w2v_radiology"]["y_train"]
y_test_w2v  = datasets["w2v_radiology"]["y_test"]
X_train_smote_w2v = datasets["w2v_radiology"]["X_train_res"]
y_train_smote_w2v = datasets["w2v_radiology"]["y_train_res"]

# 2Ô∏è‚É£ Define a one-point param grid for each classifier
param_spaces_dryrun = {
    "LogisticRegression": {
        "clf__penalty": ["l2"],
        "clf__solver": ["lbfgs"],
        "clf__C": [1.0],
        "clf__max_iter": [100]
    },
    "DecisionTree": {
        "clf__max_depth": [5]
    },
    "RandomForest": {
        "clf__n_estimators": [10],
        "clf__max_depth": [5]
    },
    "GradientBoosting": {
        "clf__n_estimators": [10],
        "clf__learning_rate": [0.1],
        "clf__max_depth": [3]
    },
    "XGB": {
        "clf__n_estimators": [10],
        "clf__learning_rate": [0.1],
        "clf__max_depth": [3]
    },
    "LGBM": {
        "clf__n_estimators": [10],
        "clf__learning_rate": [0.1],
        "clf__max_depth": [3]
    },
    "CatBoost": {
        "clf__iterations": [10],
        "clf__depth": [3],
        "clf__learning_rate": [0.1]
    },
    "SVC": {
        "clf__C": [1.0],
        "clf__kernel": ["linear"]
    },
    "MLP": {
        "clf__hidden_layer_sizes": [(8,)],
        "clf__activation": ["relu"],
        "clf__max_iter": [100]
    },
    "NaiveBayes": {
        "clf__var_smoothing": [1e-9]
    }
}

# 3Ô∏è‚É£ Minimal iterations/repeats
n_iter_random_per_clf_dryrun = {name: 1 for name in param_spaces_dryrun.keys()}

# 4Ô∏è‚É£ Run minimal test (no MLflow, single repeat, one candidate each)
results_dryrun, summary_dryrun = repeated_cv_with_mixed_search(
    X_train_w2v, y_train_w2v, X_test_w2v, y_test_w2v,
    classifiers=classifiers,
    param_spaces=param_spaces_dryrun,
    X_train_smote=X_train_smote_w2v,
    y_train_smote=y_train_smote_w2v,
    n_splits=2,           # reduce folds for speed
    n_repeats=1,
    scoring=auc_scorer,
    n_iter_random=1,
    n_iter_random_per_clf=n_iter_random_per_clf_dryrun,
    save_prefix="results/models/dryrun/",
    mode="dry_run",
    log_mlflow=False
)
'''


üîπ Running LogisticRegression...
Fitting 2 folds for each of 1 candidates, totalling 2 fits
   Performing descriptive StratifiedKFold CV on original training set for LogisticRegression...
   Descriptive CV AUC: 0.7246 ¬± 0.0136
üíæ Saved LogisticRegression model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_LogisticRegression_model.pkl
‚úÖ LogisticRegression done. Best params: {'clf__C': 1.0, 'clf__max_iter': 100, 'clf__penalty': 'l2', 'clf__solver': 'lbfgs'}
   CV ROC-AUC: 0.7077 ¬± 0.005
   Holdout ROC-AUC: 0.7434
üíæ Saved non-SMOTE metrics for LogisticRegression to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_LogisticRegression_metrics_non_smote.json
   SMOTE Holdout ROC-AUC: 0.7399
   Performing descriptive StratifiedKFold CV on SMOTE training set for LogisticRegressi

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


   Descriptive CV AUC (SMOTE): 0.7771 ¬± 0.0127
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained LogisticRegression model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_LogisticRegression_smote_model.pkl
üíæ Saved SMOTE metrics for LogisticRegression to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_LogisticRegression_metrics_smote.json
‚è±Ô∏è  Runtime for LogisticRegression: 0.13 minutes

üîπ Running DecisionTree...
Fitting 2 folds for each of 1 candidates, totalling 2 fits
   Performing descriptive StratifiedKFold CV on original training set for DecisionTree...
   Descriptive CV AUC: 0.6248 ¬± 0.0235
üíæ Saved DecisionTree model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_202510

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Performing descriptive StratifiedKFold CV on original training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC: 0.6983 ¬± 0.0189
üíæ Saved XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_XGB_model.pkl
‚úÖ XGB done. Best params: {'clf__n_estimators': 10, 'clf__max_depth': 3, 'clf__learning_rate': 0.1}
   CV ROC-AUC: 0.6932 ¬± 0.005
   Holdout ROC-AUC: 0.69
üíæ Saved non-SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_XGB_metrics_non_smote.json
   SMOTE Holdout ROC-AUC: 0.6722
   Performing descriptive StratifiedKFold CV on SMOTE training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC (SMOTE): 0.8369 ¬± 0.0187
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_XGB_smote_model.pkl
üíæ Saved SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_XGB_metrics_smote.json
‚è±Ô∏è  Runtime for XGB: 0.08 minutes

üîπ Running LGBM...
Fitting 2 folds for each of 1 candidates, totalling 2 fits
   Performing descriptive StratifiedKFold CV on original training set for LGBM...
   Descriptive CV AUC: 0.6978 ¬± 0.0173
üíæ Saved LGBM model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_LGBM_model.pkl
‚úÖ LGBM done. Best params: {'clf__n_estimators': 10, 'clf__max_depth': 3, '

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


   Performing descriptive StratifiedKFold CV on original training set for CatBoost...
   Descriptive CV AUC: 0.6765 ¬± 0.0153
üíæ Saved CatBoost model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_CatBoost_model.pkl
‚úÖ CatBoost done. Best params: {'clf__learning_rate': 0.1, 'clf__iterations': 10, 'clf__depth': 3}
   CV ROC-AUC: 0.6783 ¬± 0.012
   Holdout ROC-AUC: 0.687
üíæ Saved non-SMOTE metrics for CatBoost to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_CatBoost_metrics_non_smote.json
   SMOTE Holdout ROC-AUC: 0.6741
   Performing descriptive StratifiedKFold CV on SMOTE training set for CatBoost...
   Descriptive CV AUC (SMOTE): 0.8371 ¬± 0.0180
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained CatBoost model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


   Performing descriptive StratifiedKFold CV on original training set for SVC...
   Descriptive CV AUC: 0.6855 ¬± 0.0166
üíæ Saved SVC model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_SVC_model.pkl
‚úÖ SVC done. Best params: {'clf__C': 1.0, 'clf__kernel': 'linear'}
   CV ROC-AUC: 0.6908 ¬± 0.009
   Holdout ROC-AUC: 0.6758
üíæ Saved non-SMOTE metrics for SVC to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_SVC_metrics_non_smote.json
   SMOTE Holdout ROC-AUC: 0.7344
   Performing descriptive StratifiedKFold CV on SMOTE training set for SVC...
   Descriptive CV AUC (SMOTE): 0.7755 ¬± 0.0126
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained SVC model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_



   Performing descriptive StratifiedKFold CV on original training set for MLP...




   Descriptive CV AUC: 0.7269 ¬± 0.0130
üíæ Saved MLP model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_MLP_model.pkl
‚úÖ MLP done. Best params: {'clf__activation': 'relu', 'clf__hidden_layer_sizes': (8,), 'clf__max_iter': 100}
   CV ROC-AUC: 0.7078 ¬± 0.005
   Holdout ROC-AUC: 0.7299
üíæ Saved non-SMOTE metrics for MLP to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_MLP_metrics_non_smote.json




   SMOTE Holdout ROC-AUC: 0.7394
   Performing descriptive StratifiedKFold CV on SMOTE training set for MLP...




   Descriptive CV AUC (SMOTE): 0.8131 ¬± 0.0129
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained MLP model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_MLP_smote_model.pkl
üíæ Saved SMOTE metrics for MLP to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_MLP_metrics_smote.json
‚è±Ô∏è  Runtime for MLP: 0.19 minutes

üîπ Running NaiveBayes...
Fitting 2 folds for each of 1 candidates, totalling 2 fits
   Performing descriptive StratifiedKFold CV on original training set for NaiveBayes...
   Descriptive CV AUC: 0.6634 ¬± 0.0200
üíæ Saved NaiveBayes model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\dryrun\dry_run_20251026_1212\dry_run_20251026_1212_NaiveBayes_model.pkl
‚úÖ NaiveBayes done. Best params: {'clf__var_smoothi

In [26]:
# ==================================================
# üöÄ Refit Final Models Using Best Parameters (Original Structured Data)
# ==================================================
import os
from src.models import (
    get_classifiers,
    repeated_cv_with_mixed_search,
    auc_scorer
)
from src.utils import resolve_path
from src.evaluation import export_summary

# --------------------------------------------------
# 1Ô∏è‚É£  Define best parameters from your 20251024_2113 run
# --------------------------------------------------
param_spaces_fixed = {
    "LogisticRegression": [{
        "clf__C": [0.01],
        "clf__penalty": ["l2"],
        "clf__solver": ["liblinear"],
        "clf__max_iter": [1000]
    }],
    "DecisionTree": [{
        "clf__criterion": ["gini"],
        "clf__max_depth": [5],
        "clf__min_samples_leaf": [10],
        "clf__min_samples_split": [2]
    }],
    "RandomForest": [{
        "clf__bootstrap": [True],
        "clf__max_depth": [10],
        "clf__max_features": ["log2"],
        "clf__min_samples_leaf": [4],
        "clf__min_samples_split": [7],
        "clf__n_estimators": [935]
    }],
    "GradientBoosting": [{
        "clf__learning_rate": [0.013979488347959958],
        "clf__max_depth": [3],
        "clf__max_features": ["log2"],
        "clf__min_samples_leaf": [2],
        "clf__min_samples_split": [10],
        "clf__n_estimators": [445],
        "clf__subsample": [0.7293016342019151]
    }],
    "XGB": [{
        "clf__colsample_bytree": [0.9165996316800473],
        "clf__gamma": [0.4692763545078751],
        "clf__learning_rate": [0.010233629752304298],
        "clf__max_depth": [6],
        "clf__min_child_weight": [5],
        "clf__n_estimators": [360],
        "clf__subsample": [0.7912726728878613]
    }],
    "LGBM": [{
        "clf__colsample_bytree": [0.7880464524154114],
        "clf__learning_rate": [0.014223946814525337],
        "clf__max_depth": [20],
        "clf__min_child_samples": [90],
        "clf__n_estimators": [591],
        "clf__num_leaves": [193],
        "clf__subsample": [0.9370526621593617]
    }],
    "CatBoost": [{
        "clf__depth": [4],
        "clf__iterations": [847],
        "clf__l2_leaf_reg": [8],
        "clf__learning_rate": [0.014223946814525337]
    }],
    "SVC": [{
        "clf__C": [1],
        "clf__gamma": ["scale"],
        "clf__kernel": ["rbf"],
        "clf__shrinking": [True]
    }],
    "MLP": [{
        "clf__activation": ["relu"],
        "clf__alpha": [0.001],
        "clf__early_stopping": [True],
        "clf__hidden_layer_sizes": [(64,)],
        "clf__learning_rate_init": [0.001],
        "clf__n_iter_no_change": [10],
        "clf__solver": ["adam"]
    }],
    "NaiveBayes": [{
        "clf__var_smoothing": [1e-9]
    }]
}

# --------------------------------------------------
# 2Ô∏è‚É£  Prepare fixed iteration configuration
# --------------------------------------------------
n_iter_random_per_clf_fixed = {k: 1 for k in param_spaces_fixed.keys()}

# --------------------------------------------------
# 3Ô∏è‚É£  Load structured-only dataset (original)
# --------------------------------------------------
X_train_orig = datasets["original"]["X_train"]
X_test_orig  = datasets["original"]["X_test"]
y_train_orig = datasets["original"]["y_train"]
y_test_orig  = datasets["original"]["y_test"]
X_train_smote_orig = datasets["original"]["X_train_res"]
y_train_smote_orig = datasets["original"]["y_train_res"]

# --------------------------------------------------
# 4Ô∏è‚É£  Run refit (no hyperparameter search, one fit per model)
# --------------------------------------------------
results_orig_refit, summary_orig_refit = repeated_cv_with_mixed_search(
    X_train_orig, y_train_orig, X_test_orig, y_test_orig,
    classifiers=get_classifiers(),
    param_spaces=param_spaces_fixed,
    X_train_smote=X_train_smote_orig,
    y_train_smote=y_train_smote_orig,
    n_splits=5,
    n_repeats=10,
    scoring=auc_scorer,
    n_iter_random=1,
    n_iter_random_per_clf=n_iter_random_per_clf_fixed,
    descriptive_cv=True,
    save_prefix="results/models/original_baseline/",
    mode="original_baseline",
    log_mlflow=True
)

# --------------------------------------------------
# 5Ô∏è‚É£  Export summary
# --------------------------------------------------
export_summary(summary_orig_refit, save_prefix="reports/20251026_NEW", mode="original_baseline")
print("‚úÖ Finished model refit for Original Structured dataset.")

‚úÖ MLflow tracking initialized under unified experiment 'Thesis_ModelTraining'
Tracking URI: file:///C:/Users/tyler/OneDrive%20-%20University%20of%20Pittsburgh/BIOST%202021%20Thesis/Masters-Thesis/mlflow_tracking (Experiment ID: 169692831354922862)

üîπ Running LogisticRegression...
Fitting 50 folds for each of 1 candidates, totalling 50 fits




   Performing descriptive StratifiedKFold CV on original training set for LogisticRegression...
   Descriptive CV AUC: 0.7039 ¬± 0.0140
üíæ Saved LogisticRegression model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\original_baseline\original_baseline_20251027_0818\original_baseline_20251027_0818_LogisticRegression_model.pkl
‚úÖ LogisticRegression done. Best params: {'clf__C': 0.01, 'clf__max_iter': 1000, 'clf__penalty': 'l2', 'clf__solver': 'liblinear'}
   CV ROC-AUC: 0.7058 ¬± 0.018
   Holdout ROC-AUC: 0.7234
üíæ Saved non-SMOTE metrics for LogisticRegression to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\original_baseline\original_baseline_20251027_0818\original_baseline_20251027_0818_LogisticRegression_metrics_non_smote.json
   SMOTE Holdout ROC-AUC: 0.7162
   Performing descriptive StratifiedKFold CV on SMOTE training set for LogisticRegression...
   Descriptive CV AUC

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Performing descriptive StratifiedKFold CV on original training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC: 0.7184 ¬± 0.0177
üíæ Saved XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\original_baseline\original_baseline_20251027_0818\original_baseline_20251027_0818_XGB_model.pkl
‚úÖ XGB done. Best params: {'clf__subsample': 0.7912726728878613, 'clf__n_estimators': 360, 'clf__min_child_weight': 5, 'clf__max_depth': 6, 'clf__learning_rate': 0.010233629752304298, 'clf__gamma': 0.4692763545078751, 'clf__colsample_bytree': 0.9165996316800473}
   CV ROC-AUC: 0.7173 ¬± 0.019
   Holdout ROC-AUC: 0.7289
üíæ Saved non-SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\original_baseline\original_baseline_20251027_0818\original_baseline_20251027_0818_XGB_metrics_non_smote.json


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   SMOTE Holdout ROC-AUC: 0.7127
   Performing descriptive StratifiedKFold CV on SMOTE training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC (SMOTE): 0.9106 ¬± 0.0083
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\original_baseline\original_baseline_20251027_0818\original_baseline_20251027_0818_XGB_smote_model.pkl
üíæ Saved SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\original_baseline\original_baseline_20251027_0818\original_baseline_20251027_0818_XGB_metrics_smote.json
‚è±Ô∏è  Runtime for XGB: 0.19 minutes
üèÅ MLflow run for 'XGB' closed cleanly.

üîπ Running LGBM...
Fitting 50 folds for each of 1 candidates, totalling 50 fits
   Performing descriptive StratifiedKFold CV on original training set for LGBM...
   Descriptive CV AUC: 0.7069 ¬± 0.0161
üíæ Saved LGBM model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\original_baseline\original_baseline_

In [25]:
# ==================================================
# üöÄ Refit Final Models Using Previously Discovered Best Parameters
# ==================================================
import os
from src.models import (
    get_classifiers,
    repeated_cv_with_mixed_search,
    auc_scorer
)
from src.utils import resolve_path

# --------------------------------------------------
# 1Ô∏è‚É£  Define best parameters from your 20251025_0258 run
# --------------------------------------------------
# ==================================================
# ‚úÖ Corrected param_spaces_fixed for single-iteration refit
# ==================================================
param_spaces_fixed = {
    "LogisticRegression": [{
        "clf__C": [0.1],
        "clf__l1_ratio": [0.5],
        "clf__max_iter": [1000],
        "clf__penalty": ["elasticnet"],
        "clf__solver": ["saga"]
    }],
    "DecisionTree": [{
        "clf__criterion": ["entropy"],
        "clf__max_depth": [5],
        "clf__min_samples_leaf": [10],
        "clf__min_samples_split": [2]
    }],
    "RandomForest": [{
        "clf__bootstrap": [False],
        "clf__max_depth": [10],
        "clf__max_features": ["sqrt"],
        "clf__min_samples_leaf": [4],
        "clf__min_samples_split": [10],
        "clf__n_estimators": [356]
    }],
    "GradientBoosting": [{
        "clf__learning_rate": [0.019393987736667576],
        "clf__max_depth": [5],
        "clf__max_features": ["log2"],
        "clf__min_samples_leaf": [4],
        "clf__min_samples_split": [3],
        "clf__n_estimators": [301],
        "clf__subsample": [0.9684482051282945]
    }],
    "XGB": [{
        "clf__colsample_bytree": [0.9454044297767479],
        "clf__gamma": [0.4303652916281717],
        "clf__learning_rate": [0.01208563915935721],
        "clf__max_depth": [10],
        "clf__min_child_weight": [3],
        "clf__n_estimators": [848],
        "clf__subsample": [0.8454489914076949]
    }],
    "LGBM": [{
        "clf__colsample_bytree": [0.7880464524154114],
        "clf__learning_rate": [0.014223946814525337],
        "clf__max_depth": [20],
        "clf__min_child_samples": [90],
        "clf__n_estimators": [591],
        "clf__num_leaves": [193],
        "clf__subsample": [0.9370526621593617]
    }],
    "CatBoost": [{
        "clf__depth": [7],
        "clf__iterations": [654],
        "clf__l2_leaf_reg": [8],
        "clf__learning_rate": [0.02031655633456552]
    }],
    "SVC": [{
        "clf__C": [10],
        "clf__gamma": ["scale"],
        "clf__kernel": ["linear"],
        "clf__shrinking": [False]
    }],
    "MLP": [{
        "clf__activation": ["relu"],
        "clf__alpha": [0.0001],
        "clf__early_stopping": [True],
        "clf__hidden_layer_sizes": [(64,)],
        "clf__learning_rate_init": [0.001],
        "clf__n_iter_no_change": [10],
        "clf__solver": ["adam"]
    }],
    "NaiveBayes": [{
        "clf__var_smoothing": [1e-9]
    }]
}


# --------------------------------------------------
# 2Ô∏è‚É£  Prepare minimal refit configuration
# --------------------------------------------------
n_iter_random_per_clf_fixed = {k: 1 for k in param_spaces_fixed.keys()}

# Note: using existing datasets from memory
# Ensure these variables already loaded:
# X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v,
# X_train_smote_w2v, y_train_smote_w2v

# --------------------------------------------------
# 3Ô∏è‚É£  Run lightweight refit
# --------------------------------------------------
results_refit, summary_refit = repeated_cv_with_mixed_search(
    X_train_w2v, y_train_w2v, X_test_w2v, y_test_w2v,
    classifiers=get_classifiers(),
    param_spaces=param_spaces_fixed,
    X_train_smote=X_train_smote_w2v,
    y_train_smote=y_train_smote_w2v,
    n_splits=5,
    n_repeats=10,
    scoring=auc_scorer,
    n_iter_random=1,
    n_iter_random_per_clf=n_iter_random_per_clf_fixed,
    descriptive_cv=True,
    save_prefix="results/models/w2v_radiology_baseline/",
    mode="w2v_radiology_baseline",
    log_mlflow=True
)

print("‚úÖ Refit complete ‚Äî all models retrained with fixed parameters.")

export_summary(summary_w2v, save_prefix="reports/20251026_NEW", mode="w2v_radiology_baseline")
print("‚úÖ Finished model training for Radiology Word2Vec dataset.")


‚úÖ MLflow tracking initialized under unified experiment 'Thesis_ModelTraining'
Tracking URI: file:///C:/Users/tyler/OneDrive%20-%20University%20of%20Pittsburgh/BIOST%202021%20Thesis/Masters-Thesis/mlflow_tracking (Experiment ID: 169692831354922862)

üîπ Running LogisticRegression...
Fitting 50 folds for each of 1 candidates, totalling 50 fits
   Performing descriptive StratifiedKFold CV on original training set for LogisticRegression...
   Descriptive CV AUC: 0.7345 ¬± 0.0161
üíæ Saved LogisticRegression model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\w2v_radiology_baseline\w2v_radiology_baseline_20251026_1338\w2v_radiology_baseline_20251026_1338_LogisticRegression_model.pkl
‚úÖ LogisticRegression done. Best params: {'clf__C': 0.1, 'clf__l1_ratio': 0.5, 'clf__max_iter': 1000, 'clf__penalty': 'elasticnet', 'clf__solver': 'saga'}
   CV ROC-AUC: 0.7359 ¬± 0.017
   Holdout ROC-AUC: 0.7523
üíæ Saved non-SMOTE metrics for Logist

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Performing descriptive StratifiedKFold CV on original training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC: 0.7480 ¬± 0.0160
üíæ Saved XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\w2v_radiology_baseline\w2v_radiology_baseline_20251026_1338\w2v_radiology_baseline_20251026_1338_XGB_model.pkl
‚úÖ XGB done. Best params: {'clf__subsample': 0.8454489914076949, 'clf__n_estimators': 848, 'clf__min_child_weight': 3, 'clf__max_depth': 10, 'clf__learning_rate': 0.01208563915935721, 'clf__gamma': 0.4303652916281717, 'clf__colsample_bytree': 0.9454044297767479}
   CV ROC-AUC: 0.7435 ¬± 0.017
   Holdout ROC-AUC: 0.7522
üíæ Saved non-SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\w2v_radiology_baseline\w2v_radiology_baseline_20251026_1338\w2v_radiology_baseline_20251026_1338_XGB_metrics_non_smote.json


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   SMOTE Holdout ROC-AUC: 0.7420
   Performing descriptive StratifiedKFold CV on SMOTE training set for XGB...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


   Descriptive CV AUC (SMOTE): 0.9533 ¬± 0.0048
   (5 valid folds out of 5)
üíæ Saved SMOTE-trained XGB model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\w2v_radiology_baseline\w2v_radiology_baseline_20251026_1338\w2v_radiology_baseline_20251026_1338_XGB_smote_model.pkl
üíæ Saved SMOTE metrics for XGB to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\w2v_radiology_baseline\w2v_radiology_baseline_20251026_1338\w2v_radiology_baseline_20251026_1338_XGB_metrics_smote.json
‚è±Ô∏è  Runtime for XGB: 6.45 minutes
üèÅ MLflow run for 'XGB' closed cleanly.

üîπ Running LGBM...
Fitting 50 folds for each of 1 candidates, totalling 50 fits
   Performing descriptive StratifiedKFold CV on original training set for LGBM...
   Descriptive CV AUC: 0.7476 ¬± 0.0167
üíæ Saved LGBM model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\w2v_ra

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


   Performing descriptive StratifiedKFold CV on original training set for SVC...
   Descriptive CV AUC: 0.7098 ¬± 0.0086
üíæ Saved SVC model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\w2v_radiology_baseline\w2v_radiology_baseline_20251026_1338\w2v_radiology_baseline_20251026_1338_SVC_model.pkl
‚úÖ SVC done. Best params: {'clf__C': 10, 'clf__gamma': 'scale', 'clf__kernel': 'linear', 'clf__shrinking': False}
   CV ROC-AUC: 0.7108 ¬± 0.017
   Holdout ROC-AUC: 0.731
üíæ Saved non-SMOTE metrics for SVC to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\models\w2v_radiology_baseline\w2v_radiology_baseline_20251026_1338\w2v_radiology_baseline_20251026_1338_SVC_metrics_non_smote.json
   SMOTE Holdout ROC-AUC: 0.7338
   Performing descriptive StratifiedKFold CV on SMOTE training set for SVC...
   Descriptive CV AUC (SMOTE): 0.7750 ¬± 0.0128
   (5 valid folds out of 5)
üíæ Saved SMOTE-traine

In [27]:
# ==================================================
# üìä Merge and Export Refit Comparison Summaries
# ==================================================
import os
import pandas as pd
from src.utils import resolve_path

# --- Tag refit summaries for clarity ---
summary_orig_refit["Dataset"] = "original"
summary_refit["Dataset"] = "w2v_radiology"

# Optional: include discharge/combined refit summaries if available
if "summary_w2v_dis_refit" in globals():
    summary_w2v_dis_refit["Dataset"] = "w2v_discharge"

if "summary_w2v_comb_refit" in globals():
    summary_w2v_comb_refit["Dataset"] = "w2v_combined"

# --- Merge all available refit summaries ---
all_refit_summaries = [summary_orig_refit, summary_refit]

if "summary_w2v_dis_refit" in globals():
    all_refit_summaries.append(summary_w2v_dis_refit)

if "summary_w2v_comb_refit" in globals():
    all_refit_summaries.append(summary_w2v_comb_refit)

refit_comparison = pd.concat(all_refit_summaries, axis=0, ignore_index=True)

# --- Save merged refit comparison summary ---
refit_summary_path = resolve_path("results/reports/20251027baseline_comparison.csv")
os.makedirs(os.path.dirname(refit_summary_path), exist_ok=True)
refit_comparison.to_csv(refit_summary_path, index=False)
print(f"üíæ Saved merged refit comparison summary to {refit_summary_path}")

# --- Optional: Log to MLflow ---
if "mlflow" in globals():
    try:
        mlflow.log_artifact(refit_summary_path, artifact_path="summaries")
        print("üì§ Logged refit comparison summary to MLflow under 'summaries/'")
    except Exception as e:
        print(f"‚ö†Ô∏è MLflow logging skipped ({e})")

# --- Display tidy summary table ---
cols_to_display = [
    "Dataset",
    "Classifier",
    "Holdout ROC-AUC",
    "Holdout Precision",
    "Holdout Recall",
    "Holdout F1",
    "Holdout ROC-AUC (SMOTE)",
    "Final Holdout ROC-AUC (SMOTE)",
]

display_cols = [c for c in cols_to_display if c in refit_comparison.columns]
display(refit_comparison[display_cols].sort_values(by=["Dataset", "Holdout ROC-AUC"], ascending=[True, False]))


üíæ Saved merged refit comparison summary to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\results\reports\20251027baseline_comparison.csv


Unnamed: 0,Dataset,Classifier,Holdout ROC-AUC,Holdout Precision,Holdout Recall,Holdout F1,Holdout ROC-AUC (SMOTE),Final Holdout ROC-AUC (SMOTE)
0,original,CatBoost,0.734284,0.545455,0.124481,0.202703,0.724805,0.724805
1,original,XGB,0.728923,0.535714,0.124481,0.20202,0.712693,
2,original,GradientBoosting,0.726701,0.55102,0.112033,0.186207,0.703203,
3,original,RandomForest,0.726405,0.916667,0.045643,0.086957,0.702074,
4,original,LGBM,0.725975,0.482759,0.174274,0.256098,0.709917,
5,original,LogisticRegression,0.723406,0.483333,0.120332,0.192691,0.716185,
6,original,NaiveBayes,0.701084,0.406844,0.443983,0.424603,0.681948,
7,original,MLP,0.69493,0.433333,0.107884,0.172757,0.669951,
8,original,SVC,0.688066,0.777778,0.029046,0.056,0.687227,
9,original,DecisionTree,0.63486,0.516667,0.128631,0.20598,0.629568,
