## Naive Bayes
---

This notebook fits a Naive Bayes model to the Scania Trucks Air Pressure System (APS) predictive maintenance dataset, obtained from [UCI's data repository](https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks). 

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

from tqdm import tqdm
from itertools import product

from sklearn.naive_bayes import ComplementNB
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import auc, roc_curve, precision_recall_curve, make_scorer, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline

sns.set()

### Loading Data

In [4]:
df_train = pd.read_csv(r'./data/aps_failure_training_set_data_only.csv')

### Misclassification Cost Function

The dataset comes with a pre-defined challenge metric, shown below:

Cost of Misclassification = 10*(False Positive) + 500*(False Negative)

This cost will be used in lieu of traditional metrics, such as accuracy and/or AUROC/AUPR. 

In [5]:
def calc_misclassification_cost(y, y_pred):
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    return 10*fp + 500*fn


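# greater_is_better=False makes scikit-learn negate the cost, so higher scores are still better.
# needs_proba / needs_threshold are omitted: they defaulted to False and are deprecated in recent scikit-learn releases.
misclassification_cost = make_scorer(
    calc_misclassification_cost,
    greater_is_better=False
)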
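
As a quick sanity check of the metric and of the scorer's sign convention, the sketch below evaluates a toy example. The names `calc_cost_demo`, `y_true_demo`, `y_pred_demo` and the `DummyClassifier` are illustrative assumptions, not part of the notebook's data or models.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, make_scorer


def calc_cost_demo(y, y_pred):
    # Same formula as the challenge metric: 10 per false positive, 500 per false negative
    tn, fp, fn, tp = confusion_matrix(y, y_pred, labels=['neg', 'pos']).ravel()
    return 10 * fp + 500 * fn


# Toy labels (hypothetical): 2 false positives and 1 false negative
y_true_demo = np.array(['neg', 'neg', 'neg', 'pos', 'pos'])
y_pred_demo = np.array(['neg', 'pos', 'pos', 'neg', 'pos'])
raw_cost = calc_cost_demo(y_true_demo, y_pred_demo)  # 10*2 + 500*1

# A scorer built with greater_is_better=False returns the negated cost
scorer_demo = make_scorer(calc_cost_demo, greater_is_better=False)
X_dummy = np.zeros((5, 1))
always_neg = DummyClassifier(strategy='constant', constant='neg').fit(X_dummy, y_true_demo)
score = scorer_demo(always_neg, X_dummy, y_true_demo)  # 2 missed positives -> -(500*2)

print(raw_cost, score)
```

This is why the cross-validation scores are negated again before being reported as a cost.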

### Strategy
The Naive Bayes model is tuned with the same strategy used for Logistic Regression, with two changes:

* All parameters remain the same, except that C & class_weights from LogisticRegression are replaced with alpha & norm.

* MaxAbsScaler and StandardScaler are removed because they can produce negative values, which the Naive Bayes algorithm does not accept.

For this project, the ComplementNB variant of the algorithm will be used as scikit-learn recommends it for imbalanced datasets.
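
The scaler restriction can be illustrated with a small sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration): ComplementNB raises a ValueError when fitted on standardized features, while min-max scaled features are accepted.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (hypothetical, not the APS dataset)
rng = np.random.default_rng(0)
X_demo = rng.random((20, 3))
y_demo = np.array([0, 1] * 10)

# StandardScaler centers each feature at zero, so roughly half the values are negative
X_std = StandardScaler().fit_transform(X_demo)
try:
    ComplementNB().fit(X_std, y_demo)
    rejected = False
except ValueError:
    rejected = True

# MinMaxScaler maps the training data into [0, 1], which ComplementNB accepts
X_mm = MinMaxScaler().fit_transform(X_demo)
model = ComplementNB().fit(X_mm, y_demo)

print('Standardized input rejected:', rejected)
```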

### Grid Search Over All Strategies

This can also be considered a Full Factorial Design of Experiments (DOE), as the search covers data scalers, NaN imputation, and data sampling in addition to the model's hyperparameters (i.e., it optimizes the preprocessing steps as well as the hyperparameters).

In [6]:
# df --> X_train & y_train
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']

In [27]:
# Run all combinations (i.e., Full Factorial)
fill_na = [0, -1, -100, -10_000, -1_000_000, 'mean', 'most_frequent']
scalers = ['minmax']
imbalance = [None, 'smote']
alpha = [0.001, 0.1, 1, 10, 100, 1000, 10_000]
norm = [True, False]

s = [fill_na, scalers, imbalance, alpha, norm]
df_full_fact = pd.DataFrame(list(product(*s)), columns=['fill_na', 'scaler', 'imbalance', 'alpha', 'norm'])

display(df_full_fact)
print('Total number of runs = %i' % df_full_fact.shape[0])

Unnamed: 0,fill_na,scaler,imbalance,alpha,norm
0,0,minmax,,0.001,True
1,0,minmax,,0.001,False
2,0,minmax,,0.100,True
3,0,minmax,,0.100,False
4,0,minmax,,1.000,True
...,...,...,...,...,...
191,most_frequent,minmax,smote,100.000,False
192,most_frequent,minmax,smote,1000.000,True
193,most_frequent,minmax,smote,1000.000,False
194,most_frequent,minmax,smote,10000.000,True
195,most_frequent,minmax,smote,10000.000,False


Total number of runs = 196


In [28]:
def fit_sklearn_pipeline(srs):
    steps = []

    # IMPUTATION METHOD
    if isinstance(srs.fill_na, (int, np.integer)):
        steps.append(('impute', SimpleImputer(strategy='constant', fill_value=srs.fill_na)))
    else:
        steps.append(('impute', SimpleImputer(strategy=srs.fill_na)))

    # SCALING METHOD
    if srs.scaler == 'minmax':
        steps.append(('scale', MinMaxScaler()))
    elif srs.scaler == 'maxabs':
        steps.append(('scale', MaxAbsScaler()))
    else:
        steps.append(('scale', StandardScaler()))

    # SMOTE (needs imblearn's Pipeline) & CLASSIFIER
    if srs.imbalance == 'smote':
        steps.append(('smote', SMOTE(random_state=1)))
        steps.append(('naive_bayes', ComplementNB(alpha=srs.alpha, norm=srs.norm)))
        pipe = imbPipeline(steps=steps)

    else:
        steps.append(('naive_bayes', ComplementNB(alpha=srs.alpha, norm=srs.norm)))
        pipe = Pipeline(steps=steps)

    # FIT
    pipe.fit(X_train, y_train)

    return pipe

In [30]:
%%time
results = []

for row in tqdm(df_full_fact.iterrows(), total=df_full_fact.shape[0]):
    # Fit pipeline
    pipe = fit_sklearn_pipeline(row[1])

    # Calculate average misclassification cost over all 5 CV folds (scorer returns negated cost)
    scores = cross_val_score(pipe, cv=5, X=X_train, y=y_train, scoring=misclassification_cost, n_jobs=-1)
    cv_mean_cost = -scores.mean()

    # Append to results list
    results.append((*row[1].tolist(), cv_mean_cost))

# Create results dataframe
df_results = pd.DataFrame(results, columns=['fill_na', 'scaler', 'imbalance', 'alpha', 'norm', 'cv_mean_cost'])
df_results.to_csv(r'./results/cnb_tuning.csv', index=False)

100%|██████████| 196/196 [03:43<00:00,  1.14s/it]

CPU times: user 3min 21s, sys: 1min 45s, total: 5min 6s
Wall time: 3min 43s
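
For comparison, a factorial search over preprocessing and model parameters can also be expressed with scikit-learn's GridSearchCV by addressing pipeline steps as `step__parameter`. The sketch below is an assumption-laden illustration, not the notebook's method: it runs on synthetic data (`X_demo`, `y_demo`), uses a reduced grid, omits the SMOTE step to avoid the imblearn dependency, and keeps the default accuracy scoring.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with ~10% missing values (hypothetical, not the APS dataset)
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
y_demo = np.where(rng.random(200) < 0.2, 'pos', 'neg')

demo_pipe = Pipeline([
    ('impute', SimpleImputer()),
    ('scale', MinMaxScaler()),
    ('naive_bayes', ComplementNB()),
])

# Two sub-grids: constant fills need fill_value, statistic-based fills do not
nb_grid = {'naive_bayes__alpha': [0.1, 1, 10], 'naive_bayes__norm': [True, False]}
param_grid = [
    {'impute__strategy': ['constant'], 'impute__fill_value': [0, -1], **nb_grid},
    {'impute__strategy': ['mean', 'most_frequent'], **nb_grid},
]

# A custom scorer could be supplied through the scoring argument instead of the default
search = GridSearchCV(demo_pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)

n_combos = len(search.cv_results_['params'])
print('Combinations evaluated:', n_combos)
print('Best parameters:', search.best_params_)
```

The manual loop used in this notebook has the advantage of letting the pipeline class itself change between runs (Pipeline vs. imblearn's Pipeline), which a single parameter grid does not express as directly.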





In [31]:
df_results = pd.read_csv(r'./results/cnb_tuning.csv').fillna('None')
df_results.sort_values(by='cv_mean_cost')[:10]

Unnamed: 0,fill_na,scaler,imbalance,alpha,norm,cv_mean_cost
47,-1,minmax,smote,1.0,False,14474.0
45,-1,minmax,smote,0.1,False,14584.0
31,-1,minmax,,0.1,False,14684.0
43,-1,minmax,smote,0.001,False,14684.0
3,0,minmax,,0.1,False,14702.0
19,0,minmax,smote,1.0,False,14726.0
17,0,minmax,smote,0.1,False,14728.0
33,-1,minmax,,1.0,False,14758.0
5,0,minmax,,1.0,False,14774.0
21,0,minmax,smote,10.0,False,14846.0


* Unlike Logistic Regression, Naive Bayes performs better when NaN values are filled with a value close to the rest of the data (0 or -1) rather than a large negative offset. Filling with -1 does slightly better than filling with 0, possibly because 0 is a valid value for some features.

* SMOTE lowers the algorithm's misclassification cost, but only slightly: the best run with SMOTE costs 14474, versus 14684 for the best run without it.
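
The two observations above can be checked directly against the top-10 table. The sketch below re-enters only the ten rows displayed there (with 'None' standing for no resampling), so it says nothing about the remaining 186 runs; the name `top10` is introduced here for illustration.

```python
import pandas as pd

# The ten best rows copied from the results table (fill_na, imbalance, alpha, cv_mean_cost)
top10 = pd.DataFrame(
    [
        (-1, 'smote', 1.0, 14474.0),
        (-1, 'smote', 0.1, 14584.0),
        (-1, 'None', 0.1, 14684.0),
        (-1, 'smote', 0.001, 14684.0),
        (0, 'None', 0.1, 14702.0),
        (0, 'smote', 1.0, 14726.0),
        (0, 'smote', 0.1, 14728.0),
        (-1, 'None', 1.0, 14758.0),
        (0, 'None', 1.0, 14774.0),
        (0, 'smote', 10.0, 14846.0),
    ],
    columns=['fill_na', 'imbalance', 'alpha', 'cv_mean_cost'],
)

# Lowest cost reached for each resampling choice and each fill value
best_by_imbalance = top10.groupby('imbalance')['cv_mean_cost'].min()
best_by_fill = top10.groupby('fill_na')['cv_mean_cost'].min()

smote_gain = best_by_imbalance.loc['None'] - best_by_imbalance.loc['smote']
fill_gain = best_by_fill.loc[0] - best_by_fill.loc[-1]

print('Cost reduction from SMOTE: %i' % smote_gain)
print('Cost reduction from filling with -1 instead of 0: %i' % fill_gain)
```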

### Best Estimator - Plots

In [32]:
pipe = fit_sklearn_pipeline(df_full_fact.iloc[47])

In [33]:
scores = cross_val_score(pipe, cv=5, X=X_train, y=y_train, scoring=misclassification_cost, n_jobs=-1)
cv_mean_cost = -scores.mean()
print('Cross-Validated Cost: %i' % cv_mean_cost)

Cross-Validated Cost: 14474


In [35]:
def plot_roc_auc(y_true, y_pred, model_name, file_path, figsize=(10, 8)):
    # Create figure
    fig = plt.figure(figsize=figsize)

    # Calculate ROC Curve & AUC
    fpr, tpr, thresholds = roc_curve(y_true, y_pred)
    area = auc(fpr, tpr)
    plt.title('ROC Curve | %s | AUC = %0.5f' % (model_name, area))
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')

    # Save & close plot
    plt.plot(fpr, tpr)
    fig.savefig(file_path)
    plt.close(fig)

    return area


def plot_precision_recall_auc(y_true, y_pred, model_name, file_path, figsize=(10, 8)):
    # Create figure
    fig = plt.figure(figsize=figsize)

    # Calculate Precision-Recall Curve & AUC
    pr, rc, thresholds = precision_recall_curve(y_true, y_pred)
    area = auc(rc, pr)
    plt.title('Precision-Recall Curve | %s | AUC = %0.5f' % (model_name, area))
    plt.xlabel('Recall')
    plt.ylabel('Precision')

    # Save & close plot
    plt.plot(rc, pr)
    fig.savefig(file_path)
    plt.close(fig)

    return area

probs = pipe.predict_proba(X_train)

plot_roc_auc(y_train.replace({'neg': 0, 'pos': 1}), probs[:, 1], 'Complement Naive Bayes', r'./results/cnb_roc.jpg');
plot_precision_recall_auc(y_train.replace({'neg': 0, 'pos': 1}), probs[:, 1], 'Complement Naive Bayes', r'./results/cnb_pr.jpg');

Note that the probabilities produced by a Naive Bayes algorithm are generally poorly calibrated, and the curves here are computed on the training data. The plots are provided for reference only.

![image](./results/cnb_roc.jpg)
![image](./results/cnb_pr.jpg)
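
One way to inspect how trustworthy predicted probabilities are is a reliability (calibration) curve, which compares the mean predicted probability in each bin with the observed fraction of positives. The following is a minimal sketch on synthetic data, assuming non-negative features similar in spirit to the min-max scaled inputs; `X_demo`, `y_demo`, and the class separation of 0.3 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.naive_bayes import ComplementNB

# Synthetic non-negative features (hypothetical stand-in for the scaled APS data)
rng = np.random.default_rng(0)
n_samples = 2000
y_demo = (rng.random(n_samples) < 0.3).astype(int)
X_demo = rng.random((n_samples, 10)) + 0.3 * y_demo[:, None]

# Probability of the positive class (column 1, since classes_ is sorted as [0, 1])
probs_demo = ComplementNB().fit(X_demo, y_demo).predict_proba(X_demo)[:, 1]

# For a well-calibrated model, frac_pos would track mean_pred closely in every bin
frac_pos, mean_pred = calibration_curve(y_demo, probs_demo, n_bins=10)

for observed, predicted in zip(frac_pos, mean_pred):
    print('predicted %.2f -> observed %.2f' % (predicted, observed))
```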

### Misclassification Cost on Test Set

In [36]:
df_test = pd.read_csv(r'./data/aps_failure_test_set_data_only.csv')

X_test = df_test.drop('class', axis=1)
y_test = df_test['class']
y_pred = pipe.predict(X_test)

print('Misclassification Cost on Test Data: %i' % calc_misclassification_cost(y_test, y_pred))

Misclassification Cost on Test Data: 19060


In [37]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Number of Type 1 Faults: %i' % fp)
print('Number of Type 2 Faults: %i' % fn)

Number of Type 1 Faults: 806
Number of Type 2 Faults: 22


* The Naive Bayes algorithm performed worse than the tuned Logistic Regression on the test set.

* However, Naive Bayes produced the same number of Type 2 Faults (false negatives) as the Logistic Regression. This is the more important fault to minimize, since each one costs 500 versus 10 for a Type 1 Fault (false positive).
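
The test-set cost can be broken down by fault type using the counts reported above. This short calculation shows that the 22 false negatives account for a little over half of the total cost, even though they are far fewer than the 806 false positives.

```python
# Counts taken from the test-set confusion matrix reported above
fp, fn = 806, 22

fp_cost = 10 * fp    # cost contributed by Type 1 Faults
fn_cost = 500 * fn   # cost contributed by Type 2 Faults
total_cost = fp_cost + fn_cost
fn_share = fn_cost / total_cost

print('Type 1 cost: %i' % fp_cost)
print('Type 2 cost: %i' % fn_cost)
print('Total cost: %i (Type 2 share: %.1f%%)' % (total_cost, 100 * fn_share))
```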