# Predictive Maintenance Classification Workflow
## Loads saved models when available to avoid retraining; starts from `full_normalized.csv`, which already includes SMOTE

This Jupyter notebook is a pipeline for training, saving, and reloading ML models that predict machine failures. It includes:

- Data loading and preprocessing
- Baseline and tuned model training (Decision Tree, Random Forest, XGBoost on GPU)
- Saving trained models to `TrainedModels/` and reloading them on later runs
- Model evaluation and visualization with detailed metrics
- XGBoost threshold tuning and CSV export for visualization
- Summary of model performance saved to CSV



## 1. Imports and Configuration

Import necessary libraries and configure warnings, plotting, and the model save directory.

In [1]:
import os
import warnings
import gc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score, RepeatedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    recall_score, precision_score
)
import xgboost as xgb

warnings.filterwarnings("ignore")
%matplotlib inline

SAVE_DIR = "TrainedModels"
os.makedirs(SAVE_DIR, exist_ok=True)
print(f"Model directory: {SAVE_DIR}")

Model directory: TrainedModels


## 2. Data Loading & Preparation

Load the normalized dataset, select relevant features, and split into train/test sets.

In [2]:
dtype_dict = {
    'norm_power': 'float32',
    'norm_temp_diff': 'float32',
    'norm_tool_wear_adjusted': 'float32',
    'Bool_MF': 'bool'
}
use_columns = list(dtype_dict.keys())

data = pd.read_csv("full_normalized.csv", dtype=dtype_dict, usecols=use_columns)
data['Bool_MF_int'] = data['Bool_MF'].astype('int8')
X = data[['norm_power', 'norm_temp_diff', 'norm_tool_wear_adjusted']]
y = data['Bool_MF_int']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train/test shapes:", X_train.shape, X_test.shape)
gc.collect()
DTRAIN = xgb.DMatrix(X_train, label=y_train)
DTEST  = xgb.DMatrix(X_test, label=y_test)

Train/test shapes: (15443, 3) (3861, 3)


## 3. Helper Functions 

Functions to save models and evaluate both sklearn and XGBoost models.

In [3]:
def save_model(obj, fname):
    path = os.path.join(SAVE_DIR, fname)
    if hasattr(obj, 'save_model'):
        obj.save_model(path)
    else:
        joblib.dump(obj, path)
    print(f"Saved {fname}")


def evaluate_sklearn(model, name, X_tr, y_tr, X_te, y_te):
    train_pred = model.predict(X_tr)
    test_pred = model.predict(X_te)
    train_acc = accuracy_score(y_tr, train_pred)
    test_acc = accuracy_score(y_te, test_pred)
    cv_mean = cross_val_score(model, X_tr, y_tr, cv=5, scoring='accuracy', n_jobs=-1).mean()
    print(f"--- {name} ---")
    print(f"{name} - Training Accuracy: {train_acc:.4f}")
    print(f"{name} - Test Accuracy:      {test_acc:.4f}")
    print(f"{name} CV Mean Score:       {cv_mean:.6f}\n")
    print(f"{name} Confusion Matrix (Test):")
    print(confusion_matrix(y_te, test_pred), "\n")
    print(f"{name} Classification Report (Test):")
    print(classification_report(y_te, test_pred))


def evaluate_xgb(booster, name, dtrain, dtest, y_tr, y_te):
    # Predict with the trees up to the early-stopped best iteration only
    best_range = (0, booster.best_iteration + 1)
    tr_proba = booster.predict(dtrain, iteration_range=best_range)
    te_proba = booster.predict(dtest, iteration_range=best_range)
    tr_pred = (tr_proba >= 0.5).astype(int)
    te_pred = (te_proba >= 0.5).astype(int)
    tr_acc = accuracy_score(y_tr, tr_pred)
    te_acc = accuracy_score(y_te, te_pred)
    print(f"--- {name} ---")
    print(f"{name} - Training Accuracy: {tr_acc:.4f}")
    print(f"{name} - Test Accuracy:      {te_acc:.4f}\n")
    print(f"{name} Confusion Matrix (Test):")
    print(confusion_matrix(y_te, te_pred), "\n")
    print(f"{name} Classification Report (Test):")
    print(classification_report(y_te, te_pred))
    return te_proba
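
The model sections that follow each repeat the same "load if a saved file exists, otherwise train and save" check. A minimal sketch of that pattern as one reusable helper is shown here. The name `load_or_train` and the `fake_train` stand-in are hypothetical, and the standard library's `pickle` is swapped in for `joblib` so the sketch runs without extra dependencies.

```python
import os
import pickle
import tempfile


def load_or_train(path, train_fn):
    """Return (object, was_cached): load from `path` if present, else train and save."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh), True
    obj = train_fn()  # only called when no saved file exists
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return obj, False


# Stand-in for model training: returns a plain dict and counts its calls.
calls = []


def fake_train():
    calls.append(1)
    return {"max_depth": 6}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "model.pkl")
    first, first_cached = load_or_train(cache_path, fake_train)
    second, second_cached = load_or_train(cache_path, fake_train)

print(first_cached, second_cached, len(calls))
```

The second call finds the file written by the first, so the training function runs only once.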


## 4. Decision Tree: Baseline and Tuned

Load the baseline Decision Tree if a saved copy exists, otherwise train and save it; then do the same for the grid-searched tuned tree.


In [4]:
DT_BASE = "DecisionTree_baseline.pkl"
if os.path.exists(os.path.join(SAVE_DIR, DT_BASE)):
    dt_baseline = joblib.load(os.path.join(SAVE_DIR, DT_BASE))
else:
    dt_baseline = DecisionTreeClassifier(random_state=42)
    dt_baseline.fit(X_train, y_train)
    save_model(dt_baseline, DT_BASE)
# Evaluate
evaluate_sklearn(dt_baseline, "Baseline Decision Tree", X_train, y_train, X_test, y_test)

# Tuned
DT_TUNED = "DecisionTree_best.pkl"
if os.path.exists(os.path.join(SAVE_DIR, DT_TUNED)):
    best_dt = joblib.load(os.path.join(SAVE_DIR, DT_TUNED))
else:
    grid = {
        'max_depth': [6, 8, 10], 'min_samples_split': [5, 10, 15],
        'min_samples_leaf': [2, 4, 5, 7], 'criterion': ['gini', 'entropy'],
        'ccp_alpha': [0.0, 0.001, 0.01]
    }
    dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=42), grid, cv=5,
                           scoring='accuracy', n_jobs=-1, verbose=1)
    dt_grid.fit(X_train, y_train)
    best_dt = dt_grid.best_estimator_
    save_model(best_dt, DT_TUNED)
# Evaluate
evaluate_sklearn(best_dt, "Tuned Decision Tree", X_train, y_train, X_test, y_test)


--- Baseline Decision Tree ---
Baseline Decision Tree - Training Accuracy: 1.0000
Baseline Decision Tree - Test Accuracy:      0.9321
Baseline Decision Tree CV Mean Score:       0.919187

Baseline Decision Tree Confusion Matrix (Test):
[[1844  132]
 [ 130 1755]] 

Baseline Decision Tree Classification Report (Test):
              precision    recall  f1-score   support

           0       0.93      0.93      0.93      1976
           1       0.93      0.93      0.93      1885

    accuracy                           0.93      3861
   macro avg       0.93      0.93      0.93      3861
weighted avg       0.93      0.93      0.93      3861

--- Tuned Decision Tree ---
Tuned Decision Tree - Training Accuracy: 0.9277
Tuned Decision Tree - Test Accuracy:      0.9073
Tuned Decision Tree CV Mean Score:       0.899696

Tuned Decision Tree Confusion Matrix (Test):
[[1804  172]
 [ 186 1699]] 

Tuned Decision Tree Classification Report (Test):
              precision    recall  f1-score   support
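
The accuracy, precision, and recall figures in these reports follow directly from the confusion matrices. The sketch below recomputes them from the two matrices printed above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels). The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(cm):
    """Accuracy plus class-1 precision and recall from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),  # of predicted failures, how many were real
        "recall": tp / (tp + fn),     # of real failures, how many were caught
    }


# Confusion matrices copied from the Decision Tree outputs above
baseline_dt = binary_metrics([[1844, 132], [130, 1755]])
tuned_dt = binary_metrics([[1804, 172], [186, 1699]])

for label, metrics in [("baseline", baseline_dt), ("tuned", tuned_dt)]:
    print(label, {k: round(v, 4) for k, v in metrics.items()})
```

The recomputed accuracies match the printed test accuracies, which is a quick way to check a report or to recover per-class figures when a report is cut off.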



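## 5. Random Forest: Baseline and Tuned

Load or train the baseline Random Forest, then load or train a tuned forest found by randomized search.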

In [5]:
# Baseline
RF_BASE = "RandomForest_baseline.pkl"
if os.path.exists(os.path.join(SAVE_DIR, RF_BASE)):
    rf_baseline = joblib.load(os.path.join(SAVE_DIR, RF_BASE))
else:
    rf_baseline = RandomForestClassifier(random_state=42, n_jobs=-1)
    rf_baseline.fit(X_train, y_train)
    save_model(rf_baseline, RF_BASE)
# Evaluate
evaluate_sklearn(rf_baseline, "Baseline Random Forest", X_train, y_train, X_test, y_test)

# Tuned
RF_TUNED = "RandomForest_best.pkl"
if os.path.exists(os.path.join(SAVE_DIR, RF_TUNED)):
    best_rf = joblib.load(os.path.join(SAVE_DIR, RF_TUNED))
else:
    from scipy.stats import randint
    dist = {'n_estimators': randint(50,200), 'max_depth': randint(3,20),
            'min_samples_split': randint(2,10), 'min_samples_leaf': randint(1,4)}
    rf_search = RandomizedSearchCV(RandomForestClassifier(random_state=42, n_jobs=-1),
                                    dist, n_iter=20, cv=5, random_state=42,
                                    n_jobs=-1, verbose=1)
    rf_search.fit(X_train, y_train)
    best_rf = rf_search.best_estimator_
    save_model(best_rf, RF_TUNED)
# Evaluate
evaluate_sklearn(best_rf, "Tuned Random Forest", X_train, y_train, X_test, y_test)

--- Baseline Random Forest ---
Baseline Random Forest - Training Accuracy: 1.0000
Baseline Random Forest - Test Accuracy:      0.9358
Baseline Random Forest CV Mean Score:       0.934339

Baseline Random Forest Confusion Matrix (Test):
[[1845  131]
 [ 117 1768]] 

Baseline Random Forest Classification Report (Test):
              precision    recall  f1-score   support

           0       0.94      0.93      0.94      1976
           1       0.93      0.94      0.93      1885

    accuracy                           0.94      3861
   macro avg       0.94      0.94      0.94      3861
weighted avg       0.94      0.94      0.94      3861

--- Tuned Random Forest ---
Tuned Random Forest - Training Accuracy: 0.9915
Tuned Random Forest - Test Accuracy:      0.9308
Tuned Random Forest CV Mean Score:       0.931231

Tuned Random Forest Confusion Matrix (Test):
[[1841  135]
 [ 132 1753]] 

Tuned Random Forest Classification Report (Test):
              precision    recall  f1-score   support



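## 6. XGBoost (GPU): Baseline and Tuned

Load or train a baseline booster on the GPU, then load or train a tuned booster: a randomized search picks the hyperparameters, and the final model is retrained with early stopping.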


In [6]:
# Baseline
XGB_BASE = "XGBoost_baseline.json"
base_params = {'objective':'binary:logistic','eval_metric':'logloss',
               'tree_method':'gpu_hist','predictor':'gpu_predictor'}
if os.path.exists(os.path.join(SAVE_DIR, XGB_BASE)):
    xgb_baseline = xgb.Booster()
    xgb_baseline.load_model(os.path.join(SAVE_DIR, XGB_BASE))
else:
    xgb_baseline = xgb.train(base_params, DTRAIN, num_boost_round=200,
                              evals=[(DTEST,'eval')], early_stopping_rounds=10,
                              verbose_eval=False)
    save_model(xgb_baseline, XGB_BASE)
# Evaluate
_ = evaluate_xgb(xgb_baseline, "Baseline XGBoost", DTRAIN, DTEST, y_train, y_test)

# Tuned
XGB_TUNED = "XGBoost_best.json"
if os.path.exists(os.path.join(SAVE_DIR, XGB_TUNED)):
    best_xgb = xgb.Booster()
    best_xgb.load_model(os.path.join(SAVE_DIR, XGB_TUNED))
else:
    # RepeatedKFold is already imported in the first cell
    from xgboost import XGBClassifier
    param_dist = {
        'n_estimators':[100,150,200,250,300],'max_depth':[4,6,8,10],
        'learning_rate':[0.01,0.05,0.1,0.15],'subsample':[0.7,0.8,0.9,1.0],
        'colsample_bytree':[0.7,0.8,0.9,1.0],'gamma':[0,0.1,0.5,1],
        'reg_alpha':[0,0.01,0.1,1],'reg_lambda':[1,1.5,2,3]
    }
    xgb_clf = XGBClassifier(**base_params, random_state=42)
    rkf = RepeatedKFold(n_splits=5,n_repeats=2,random_state=42)
    xgb_search = RandomizedSearchCV(xgb_clf, param_dist, n_iter=30,
                                    scoring='accuracy', cv=rkf,
                                    n_jobs=1, random_state=42, verbose=1)
    xgb_search.fit(X_train, y_train)
    # Merge the search winners with the fixed objective/GPU settings
    params = xgb_search.best_params_.copy()
    params.update(base_params)
    rounds = params.pop('n_estimators', 200)
    best_xgb = xgb.train(params, DTRAIN, num_boost_round=rounds,
                         evals=[(DTEST,'eval')], early_stopping_rounds=10,
                         verbose_eval=False)
    save_model(best_xgb, XGB_TUNED)
# Evaluate
probas_best = evaluate_xgb(best_xgb, "Tuned XGBoost", DTRAIN, DTEST, y_train, y_test)


--- Baseline XGBoost ---
Baseline XGBoost - Training Accuracy: 0.9964
Baseline XGBoost - Test Accuracy:      0.9808

Baseline XGBoost Confusion Matrix (Test):
[[1944   32]
 [  42 1843]] 

Baseline XGBoost Classification Report (Test):
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1976
           1       0.98      0.98      0.98      1885

    accuracy                           0.98      3861
   macro avg       0.98      0.98      0.98      3861
weighted avg       0.98      0.98      0.98      3861

--- Tuned XGBoost ---
Tuned XGBoost - Training Accuracy: 0.9922
Tuned XGBoost - Test Accuracy:      0.9788

Tuned XGBoost Confusion Matrix (Test):
[[1932   44]
 [  38 1847]] 

Tuned XGBoost Classification Report (Test):
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1976
           1       0.98      0.98      0.98      1885

    accuracy                           0.98      3861
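
The `gpu_hist` and `gpu_predictor` settings used in this section belong to the XGBoost 1.x parameter style. XGBoost 2.0 introduced a `device` parameter that supersedes them. The sketch below shows one way to build the base parameters for either style. The helper name `gpu_params` is hypothetical, and in a notebook the version string would typically come from `xgb.__version__`.

```python
def gpu_params(xgb_version):
    """Build GPU training parameters appropriate for an XGBoost version string."""
    major = int(xgb_version.split(".")[0])
    params = {"objective": "binary:logistic", "eval_metric": "logloss"}
    if major >= 2:
        # 2.x style: histogram tree method, GPU selected through `device`
        params.update({"tree_method": "hist", "device": "cuda"})
    else:
        # 1.x style, as used in this notebook
        params.update({"tree_method": "gpu_hist", "predictor": "gpu_predictor"})
    return params


print(gpu_params("1.7.6"))
print(gpu_params("2.0.3"))
```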


## 7. XGBoost Threshold Tuning & Export

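Scan a range of decision thresholds over the tuned XGBoost test probabilities and save the accuracy, precision, and recall at each threshold to CSV for visualization.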

In [7]:
thresholds=[0.1,0.15,0.2,0.25,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
records=[]
for th in thresholds:
    preds = (probas_best>=th).astype(int)
    records.append({
        'threshold':th,
        'accuracy':accuracy_score(y_test,preds),
        'precision':precision_score(y_test,preds),
        'recall':recall_score(y_test,preds)
    })
thresh_df = pd.DataFrame(records)
thresh_df.to_csv(os.path.join(SAVE_DIR,'xgb_threshold_tuning.csv'),index=False)
print("Threshold tuning data saved.")


Threshold tuning data saved.
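
Once the scan is saved, a threshold still has to be chosen. One possible rule, shown here as a sketch rather than the notebook's method, is to require a minimum recall (missed failures are usually the costly error) and take the most precise threshold that meets it. The labels and scores below are toy values invented for illustration, and both helper names are hypothetical.

```python
def scan_thresholds(y_true, proba, thresholds):
    """Compute class-1 precision and recall at each threshold."""
    rows = []
    for th in thresholds:
        preds = [int(p >= th) for p in proba]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append({"threshold": th, "precision": precision, "recall": recall})
    return rows


def pick_threshold(rows, min_recall):
    """Return the highest-precision row whose recall meets the target, or None."""
    eligible = [r for r in rows if r["recall"] >= min_recall]
    return max(eligible, key=lambda r: r["precision"]) if eligible else None


# Toy data, for illustration only
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
proba = [0.05, 0.35, 0.45, 0.8, 0.95, 0.6, 0.3, 0.1]

rows = scan_thresholds(y_true, proba, [0.2, 0.4, 0.5, 0.7])
best = pick_threshold(rows, min_recall=0.75)
print(best)
```

With a stricter or looser recall target the chosen threshold shifts, which is the trade-off the exported CSV is meant to make visible.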


## 8. Model Performance Summary & Export
Compile a summary of all models and export for dashboarding.

In [8]:
models_summary = []
entries = [
    ('DT Baseline', dt_baseline.predict(X_test)),
    ('DT Tuned',    best_dt.predict(X_test)),
    ('RF Baseline', rf_baseline.predict(X_test)),
    ('RF Tuned',    best_rf.predict(X_test)),
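    # Use the same best-iteration range as evaluate_xgb so the summary matches section 6
    ('XGB Baseline', (xgb_baseline.predict(
        DTEST, iteration_range=(0, xgb_baseline.best_iteration + 1)) >= 0.5).astype(int)),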
    ('XGB Tuned',   (probas_best>=0.5).astype(int))
]
for name, preds in entries:
    models_summary.append({
        'model':name,
        'accuracy':accuracy_score(y_test,preds),
        'precision':precision_score(y_test,preds),
        'recall':recall_score(y_test,preds)
    })
summary_df = pd.DataFrame(models_summary)
summary_df.to_csv(os.path.join(SAVE_DIR,'model_performance_summary.csv'),index=False)
print("Performance summary saved.")



Performance summary saved.
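
As a small illustration of consuming the exported summary, the sketch below ranks models by a chosen metric using only the standard library. The sample rows reuse the test accuracies printed in the earlier sections. Reading from an in-memory string instead of `TrainedModels/model_performance_summary.csv` is an assumption made to keep the example self-contained, and `rank_models` is a hypothetical helper.

```python
import csv
import io

# Test accuracies copied from the evaluation outputs above
sample_csv = """model,accuracy
DT Baseline,0.9321
DT Tuned,0.9073
RF Baseline,0.9358
RF Tuned,0.9308
XGB Baseline,0.9808
XGB Tuned,0.9788
"""


def rank_models(fh, metric="accuracy"):
    """Read summary rows from a file-like object and sort them best-first."""
    rows = list(csv.DictReader(fh))
    return sorted(rows, key=lambda r: float(r[metric]), reverse=True)


ranked = rank_models(io.StringIO(sample_csv))
for row in ranked:
    print(f"{row['model']:<13}{float(row['accuracy']):.4f}")
```

To rank the real export, pass an open file handle for the CSV in place of the `StringIO` object.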


## **Workflow complete!** Use CSVs for plotting/dashboarding.
