# Predictive maintenance

Predictive maintenance is maintenance performed at a predicted time before a machine fails. This allows maintenance to be scheduled in advance, reducing the cost of unplanned downtime.

In this notebook, we will build a deployable end-to-end classification model to predict whether a machine failure will occur. We will train three gradient boosted decision tree (GBDT) libraries (XGBoost, LightGBM, and CatBoost) and compare their performance.

## Data

We will use a simulated dataset taken from [Matzka (2020)](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset). It consists of 10,000 data points stored as rows, with columns such as product type, air temperature, process temperature, rotational speed, torque, tool wear, and machine failure. The machine failures are grouped into 5 subcategories (TWF, HDF, PWF, OSF, RNF). For simplicity, we will predict only the binary machine failure label. A concise summary of the data, the distribution of the target variable, and pair plots are given below.

In [None]:
import os
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ['svg']
%matplotlib inline

In [None]:
DATA_DIR = '../input/predictive-maintenance'
FILE = 'ai4i2020.csv'
df = pd.read_csv(os.path.join(DATA_DIR, FILE))

In [None]:
df.info()

In [None]:
df.head()

### Target Distribution

In [None]:
df['Machine failure'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.show()

### Subgroups of the Machine Failure

In [None]:
# Count the 0/1 flags of each failure subcategory among the failed machines.
df[df['Machine failure'] == 1][['TWF', 'HDF', 'PWF', 'OSF', 'RNF']].apply(pd.Series.value_counts)

### Pairplots

In [None]:
def plot_pair():
    sns.pairplot(data=df.drop(['UDI', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF'], axis=1).sample(1000).select_dtypes(include='number'),
                 hue='Machine failure',
                 plot_kws={'s':6},
                 corner=True)
    plt.show()

plot_pair()

# Modeling

We will use scikit-learn together with the scikit-learn-compatible APIs of XGBoost, LightGBM, and CatBoost. For hyper-parameter tuning, we will use the Optuna optimization library. Model performance will be measured with the area under the ROC curve (AUC).
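
As a quick illustration of the metric, AUC can be read as the fraction of (failure, non-failure) pairs in which the failure receives the higher score. The sketch below checks this on made-up labels and scores; the numbers are illustrative only and do not come from the dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.65])

auc = roc_auc_score(y_true, y_score)

# Pairwise view: share of (positive, negative) pairs ranked correctly,
# counting ties as half a correct pair.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc, pairwise_auc)
```

Here 7 of the 8 pairs are ranked correctly, so both values are 0.875. Because AUC depends only on the ranking of the scores, it does not require choosing a decision threshold.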

The modeling consists of two parts. In the first part, we will write a column transformer for clean and reproducible data preprocessing, and then tune the hyper-parameters of each model in a 5-fold cross-validation scheme with early stopping. In the second part, we will retrain the models with the best-fit parameters over the same cross-validation splits used for tuning. This yields 5 models per GBDT algorithm, one per fold, and we will obtain predictions by averaging the outputs of those 5 models.
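
The out-of-fold scheme described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the data is synthetic (`make_classification`), `LogisticRegression` stands in for a GBDT estimator, and early stopping is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in data (not the AI4I dataset).
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.9, 0.1],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=84)
oof_preds = np.zeros(len(y))
fold_models = []

for train_idx, valid_idx in cv.split(X, y):
    # A fresh estimator is created inside the loop for every fold.
    model = LogisticRegression(max_iter=1000)  # stand-in for a GBDT model
    model.fit(X[train_idx], y[train_idx])
    # Each sample is scored by the one model that did not see it in training.
    oof_preds[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    fold_models.append(model)

oof_auc = roc_auc_score(y, oof_preds)
print('Out-of-fold AUC: %.3f' % oof_auc)
```

Fixing `random_state` in the splitter is what makes it possible to reuse the same splits later when retraining with the tuned parameters.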

In [None]:
import os
import time
import joblib
from functools import partial, wraps

import numpy as np
import pandas as pd

import xgboost as xgb
import lightgbm as lgb
import catboost as cb

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import roc_auc_score, plot_roc_curve, plot_precision_recall_curve

import warnings
warnings.filterwarnings("ignore")

In [None]:
def read_data(data_dir, file, target, cols):
    """
    Read the tabular data and split it into input (x), output (y) components.
    """
    df = pd.read_csv(os.path.join(data_dir, file), usecols=cols)
    y = df[target]
    x = df.drop([target], axis=1)
    return x, y


def get_train_test_data():
    """
    Hard code some of the variables as they won't change for different experiments.
    """
    DATA_DIR = '../input/predictive-maintenance'
    TRAIN_FILE = 'train.csv'
    TEST_FILE = 'test.csv'
    COLS = ['Type', 'Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
            'Tool wear [min]', 'Machine failure']
    CATS = ['Type']
    NUMS = ['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
            'Tool wear [min]']
    TARGET = 'Machine failure'

    x_train, y_train = read_data(DATA_DIR, TRAIN_FILE, TARGET, COLS)
    x_test, y_test = read_data(DATA_DIR, TEST_FILE, TARGET, COLS)
    return x_train, y_train, x_test, y_test, NUMS, CATS


def get_preprocessor(est_name, nums, cats):
    """
    We will need the transformers defined below for parameter tuning and then to obtain the best-fit model.
    So we need this function to help us call the transformers.
    """
    if est_name == 'cb':
        preprocessor = ColumnTransformer(transformers=[('num', StandardScaler(), nums)],
                                         remainder='passthrough')
    elif est_name == 'lgb':
        preprocessor = ColumnTransformer(transformers=[('num', StandardScaler(), nums),
                                                       ('cat', OrdinalEncoder(), cats)],
                                         remainder='passthrough')
    elif est_name == 'xgb':
        preprocessor = ColumnTransformer(transformers=[('num', StandardScaler(), nums),
                                                       ('cat', OneHotEncoder(), cats)],
                                         remainder='passthrough')
    return preprocessor


def get_estimator(est_name, params):
    """
    Estimators have to be instantiated inside the cross validation loops. This will help us to do that.
    Alternatively one can instantiate an estimator and then clone it.
    """
    if est_name == 'xgb':
        estimator = xgb.XGBClassifier(**params)
    elif est_name == 'lgb':
        estimator = lgb.LGBMClassifier(**params)
    elif est_name == 'cb':
        estimator = cb.CatBoostClassifier(**params)
    return estimator


def cross_validate(est_name, x, y, params, cv=None, method='predict_proba', return_model=False):
    """
    When tuning the parameters of GBDTs, we will use early stopping. Sklearn's cross validation functions do not allow early stopping.
    So we need to use this custom cross validator.
    """
    if cv is None:
        cv = StratifiedKFold(n_splits=5, random_state=84, shuffle=True)

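    # Use a plain array so the positional fold indices stay valid
    # even if y is a pandas Series with a non-default index.
    y = np.asarray(y)
    oof_preds = np.zeros(len(y))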
    models = []

    for fold, (train_index, validation_index) in enumerate(cv.split(x, y)):
        x_train = x[train_index]
        y_train = y[train_index]

        x_validation = x[validation_index]
        y_validation = y[validation_index]

        estimator = get_estimator(est_name, params)
        estimator.fit(x_train, y_train, eval_set=[(x_validation, y_validation)], early_stopping_rounds=10,
                      verbose=False)

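        if method == 'predict_proba':
            # Keep only the probability of the positive class.
            validation_pred = estimator.predict_proba(x_validation)[:, 1]
        elif method == 'predict':
            # predict() returns class labels; flatten them to a 1-D array.
            validation_pred = np.ravel(estimator.predict(x_validation))
        else:
            raise ValueError("method must be 'predict_proba' or 'predict'")

        oof_preds[validation_index] = validation_pred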

        if return_model:
            models.append(estimator)

    if return_model:
        return {'oof_preds': oof_preds, 'models': models}
    else:
        return {'oof_preds': oof_preds}

    
def define_objective(trial, x, y, est_name):
    """
    This will define the objective function that optuna needs.
    The parameters and their distributions are hard-coded here.
    """
    if est_name == 'xgb':
        params = {
            'eval_metric': trial.suggest_categorical('eval_metric', ['auc']),
            'n_estimators': trial.suggest_int('n_estimators', 300, 300),
            'num_parallel_tree': trial.suggest_int('num_parallel_tree', 1, 5),
            'max_depth': trial.suggest_int('max_depth', 2, 32),
            'reg_alpha': trial.suggest_float('reg_alpha', 0, 20),
            'reg_lambda': trial.suggest_float('reg_lambda', 0, 20),
            'min_child_weight': trial.suggest_float('min_child_weight', 0, 5),
            'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.5, log=True),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1, step=0.01),
            'colsample_bynode': trial.suggest_float('colsample_bynode', 0.1, 1, step=0.01),
            'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.1, 1, step=0.01),
            'subsample': trial.suggest_float('subsample', 0.5, 1, step=0.05)}

    elif est_name == 'lgb':
        params = {
            'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.5, log=True),
            'boosting_type': trial.suggest_categorical('boosting_type', ['gbdt']),
            'metric': trial.suggest_categorical('metric', ['auc']),
            'feature_pre_filter': trial.suggest_categorical('feature_pre_filter', [False]),
            'reg_alpha': trial.suggest_float('reg_alpha', 0, 20),
            'reg_lambda': trial.suggest_float('reg_lambda', 0, 20),
            'num_leaves': trial.suggest_int('num_leaves', 2, 32),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1, step=0.01),
            'subsample': trial.suggest_float('subsample', 0.5, 1, step=0.01),
            'subsample_freq': trial.suggest_int('subsample_freq', 7, 7),
            'min_child_samples': trial.suggest_int('min_child_samples', 1, 30),
            'early_stopping_round': trial.suggest_int('early_stopping_round', 10, 10),
            'n_estimators': trial.suggest_int('n_estimators', 100, 100),
            'verbosity': trial.suggest_categorical('verbosity', [-1])}

    elif est_name == 'cb':
        params = {
            'loss_function': trial.suggest_categorical('loss_function', ['Logloss']),
            'eval_metric': trial.suggest_categorical('eval_metric', ['AUC']),
            'iterations': trial.suggest_int('iterations', 50, 50),
            'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.5, log=True),
            'depth': trial.suggest_int('depth', 6, 12),
            'verbose': trial.suggest_categorical('verbose', [False]),
            'early_stopping_rounds': trial.suggest_categorical('early_stopping_rounds', [10]),
            'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 0, 100),
            'bagging_temperature': trial.suggest_float('bagging_temperature', 0.8, 1),
            'cat_features': trial.suggest_categorical('cat_features', [[5]])}

    results = cross_validate(est_name, x, y, params, method='predict_proba')

    # Score against the y passed to this function, not a global variable.
    return roc_auc_score(y, results['oof_preds'])


def tune_parameters(x_train, y_train, est_name, n_trials, nums, cats):
    """
    This will tune the parameters of a given model and store the parameters in a file.
    """
    preprocessor = get_preprocessor(est_name, nums, cats)
    x_train = preprocessor.fit_transform(x_train)

    study_name = 'study_' + est_name + '.pkl'
    if os.path.exists(study_name):
        study = joblib.load(study_name)
    else:
        sampler = optuna.samplers.TPESampler()
        study = optuna.create_study(sampler=sampler, direction='maximize')

    objective = partial(define_objective, x=x_train, y=y_train, est_name=est_name)
    study.optimize(objective, n_trials=n_trials, gc_after_trial=True)
    joblib.dump(study, study_name)
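
To make the difference between the encoders in `get_preprocessor` concrete, the sketch below applies a one-hot and an ordinal encoding to a tiny made-up frame. The frame, its values, and the helper `make_transformer` are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Tiny made-up frame with one categorical and one numeric column.
toy = pd.DataFrame({'Type': ['L', 'M', 'H', 'L'],
                    'Torque [Nm]': [40.0, 42.5, 38.0, 45.5]})


def make_transformer(encoder):
    # Hypothetical helper: scale the numeric column, encode the categorical one.
    return ColumnTransformer(transformers=[('num', StandardScaler(), ['Torque [Nm]']),
                                           ('cat', encoder, ['Type'])],
                             sparse_threshold=0)  # always return a dense array


onehot = make_transformer(OneHotEncoder()).fit_transform(toy)
ordinal = make_transformer(OrdinalEncoder()).fit_transform(toy)

print(onehot.shape)    # one scaled column plus one column per category
print(ordinal.shape)   # one scaled column plus one integer-coded column
print(ordinal[:, 1])   # categories are coded in sorted order: H, L, M
```

One-hot encoding widens the matrix by one column per category, while ordinal encoding keeps a single integer-coded column.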

In [None]:
# Tune all the models with N_TRIALS.
EST_NAMES = ['xgb', 'lgb', 'cb']
N_TRIALS = 50

# Read data
x_train, y_train, x_test, y_test, nums, cats = get_train_test_data()

# Tune hyper-parameters
for EST_NAME in EST_NAMES:
    print('Tuning ' + EST_NAME + ' parameters...')
    tune_parameters(x_train, y_train, est_name=EST_NAME, n_trials=N_TRIALS, nums=nums, cats=cats)
print('Done')

The tuning step stores the best-fit parameters of each algorithm. Next, we will write a sklearn classifier that will 1) retrain 5 models with the best-fit parameters, one for each of the five folds, and 2) predict the probabilities by averaging the outputs of the 5 models.
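
The averaging step itself is simple. A minimal sketch with made-up fold outputs (the probabilities below are hypothetical, not model results):

```python
import numpy as np

# Hypothetical positive-class probabilities: 5 fold models (rows) x 4 samples (columns).
fold_probas = np.array([
    [0.10, 0.80, 0.30, 0.95],
    [0.20, 0.70, 0.45, 0.90],
    [0.15, 0.75, 0.35, 0.85],
    [0.05, 0.85, 0.50, 0.99],
    [0.10, 0.90, 0.40, 0.91],
])

# Average over the models, then apply a decision threshold of 0.5.
mean_proba = fold_probas.mean(axis=0)
labels = np.where(mean_proba < 0.5, 0, 1)

print(mean_proba)
print(labels)
```

Averaging smooths out the fold-to-fold variation of the individual models; the threshold only matters when hard labels are needed, since AUC is computed from the averaged probabilities directly.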

In [None]:
def get_best_params(est_name='xgb'):
    study_name = 'study_' + est_name + '.pkl'
    study = joblib.load(study_name)
    params = study.best_params
    return params

class MeanClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, transformer, est_name, params):
        self.transformer = transformer
        self.est_name = est_name
        self.params = params

    def fit(self, X, y=None):
        self.classes_ = np.unique(y)

        X = self.transformer.fit_transform(X)
        self.models_ = cross_validate(self.est_name, X, y, self.params, cv=None, method='predict_proba',
                                      return_model=True)['models']
        return self

    def predict_proba(self, X, y=None):
        X = self.transformer.transform(X)
        # shape[0] also works when the transformer returns a sparse matrix.
        y_pred = np.zeros(X.shape[0])
        for model in self.models_:
            y_pred += model.predict_proba(X)[:, 1] / len(self.models_)
        return y_pred

    def predict(self, X, y=None, threshold=0.5):
        y_pred = self.predict_proba(X)
        return np.where(y_pred < threshold, 0, 1)

    def score(self, x_true, y_true):
        y_pred = self.predict_proba(x_true)
        return roc_auc_score(y_true, y_pred)

In [None]:
# Best parameters
lgbparams = get_best_params('lgb')
xgbparams = get_best_params('xgb')
cbparams = get_best_params('cb')

# Transformers
transformer_lgb = get_preprocessor('lgb', nums, cats)
transformer_xgb = get_preprocessor('xgb', nums, cats)
transformer_cb = get_preprocessor('cb', nums, cats)

# Classifiers
clf_lgb = MeanClassifier(transformer_lgb, 'lgb', lgbparams)
clf_xgb = MeanClassifier(transformer_xgb, 'xgb', xgbparams)
clf_cb = MeanClassifier(transformer_cb, 'cb', cbparams)

## Model Fitting and Test Results

The classifiers were written in the sklearn classifier format. Now, we can easily fit the models and obtain the test scores. Note that the classifiers defined above include the data preprocessors (column transformers); hence, they are ready to be saved and deployed right away.
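
As a rough sketch of what saving and reloading could look like, the example below persists a fitted estimator with `joblib` and checks that the restored copy gives the same predictions. The pipeline, the random data, and the file name `model.pkl` are stand-ins chosen for this illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in model trained on random data; preprocessing and estimator
# live in one object, so both are saved together.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

same = np.allclose(model.predict_proba(X), restored.predict_proba(X))
print(same)
```

When a custom class such as `MeanClassifier` is saved this way, its definition has to be importable in the process that loads the file.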

In [None]:
clfs = [clf_lgb, clf_xgb, clf_cb]
for clf in clfs:
    clf.fit(x_train, y_train)
    print('%s Test Score (AUC): %f' % (clf.models_[0].__class__.__name__, clf.score(x_test, y_test)))

In [None]:
fig, ax = plt.subplots()
ax.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05],
       title="Receiver operating characteristic")
for clf in clfs:
    viz = plot_roc_curve(clf, x_test, y_test, ax=ax, name=clf.est_name)
plt.show()