In [31]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline
sns.set_style('darkgrid')
seed = 42

# APS Failure at Scania Trucks

It seems like a lot of the classification tasks worth pursuing have low (< 5%) target prevalence, and in many of those tasks, there are a large number of both categorical and continuous predictors. In this notebook, I'll walk through a variety of approaches for dealing with unbalanced datasets.

## Data Description
`aps_failure_test_set.csv`: 11.9MB (16,000 obs)

`aps_failure_training_set.csv`: 44.7MB (60,000 obs)

The datasets' positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS.

The attributes are as follows: class, then anonymized operational data. The operational data have an identifier and a bin id, like `Identifier_Bin`. In total there are 171 attributes, of which 7 are histogram variables. Missing values are denoted by `na`.

## Challenge Metric
Since this dataset was part of a challenge, they also provided a "challenge metric" formula to weight the cost of false positives and false negatives:

`Cost_1(FP) = 10` and `cost_2(FN) = 500`

We will want to minimize this.

In [2]:
def cost(y_true, y_pred, fp_cost=10, fn_cost=500, normalize=True):
    cm = confusion_matrix(y_true, y_pred)
    fp = cm[1][1]
    fn = cm[0][1]
    
    c = fp * fp_cost + fn * fn_cost
    
    return c / len(y_true) if normalize else c

## Exploration

In [3]:
train_df = pd.read_csv('aps_failure_training_set.csv', header=14, na_values='na')
train_df.head()

Unnamed: 0,class,aa_000,ab_000,ac_000,ad_000,ae_000,af_000,ag_000,ag_001,ag_002,...,ee_002,ee_003,ee_004,ee_005,ee_006,ee_007,ee_008,ee_009,ef_000,eg_000
0,neg,76698,,2130706000.0,280.0,0.0,0.0,0.0,0.0,0.0,...,1240520.0,493384.0,721044.0,469792.0,339156.0,157956.0,73224.0,0.0,0.0,0.0
1,neg,33058,,0.0,,0.0,0.0,0.0,0.0,0.0,...,421400.0,178064.0,293306.0,245416.0,133654.0,81140.0,97576.0,1500.0,0.0,0.0
2,neg,41040,,228.0,100.0,0.0,0.0,0.0,0.0,0.0,...,277378.0,159812.0,423992.0,409564.0,320746.0,158022.0,95128.0,514.0,0.0,0.0
3,neg,12,0.0,70.0,66.0,0.0,10.0,0.0,0.0,0.0,...,240.0,46.0,58.0,44.0,10.0,0.0,0.0,0.0,4.0,32.0
4,neg,60874,,1368.0,458.0,0.0,0.0,0.0,0.0,0.0,...,622012.0,229790.0,405298.0,347188.0,286954.0,311560.0,433954.0,1218.0,0.0,0.0


Convert target into a binary variable and rename column so it doesn't use a keyword (aka `class`) that prevents dot accessibility.

In [4]:
train_df['target'] = train_df['class'].map({'neg': 0, 'pos': 1})
train_df = train_df.drop('class', axis=1)

## Metadata Generation
I find it helpful to put together a metadata-set that describes important characteristics of each variable.

In [5]:
def generate_metadata(df):
    meta = df.isnull().sum().to_frame('n_missing')
    meta['perc_missing'] = meta['n_missing'] / len(df)
    meta['n_unique'] = df.nunique()
    
    descs = train_df.describe().T
    descs['n_valid'] = descs['count'].copy()
    return meta.join(descs.drop('count', axis=1))

In [6]:
meta = generate_metadata(train_df)

Uncomment the cell below to view the metadata in its entirety.

In [7]:
# with pd.option_context('display.max_rows', None, 'display.max_columns', None):
#     display(meta.sort_values('perc_missing', ascending=True))

A few things to notice about `meta`...

- Everything is numeric. This is typical of UCI datasets, but normally, we would have to think about how to handle other types.
- `cd_000` has only one unique value. We'll start by encoding it as a binary variable.

For the baseline model, we will drop predictors that are over 25% missing and impute the median for the rest.

### Remove variables with > 25% missingness

In [8]:
bad_vars = meta[meta.perc_missing > 0.25].index

In [9]:
train_df = train_df.loc[:, ~train_df.columns.isin(bad_vars)]

### Train/Dev Split
We want to save the test set for an unbiased, out-of-sample assessment of the final model, so let's split the training data into a new training set and a development set. 

In [10]:
golden_data = train_test_split(train_df.drop('target', axis=1), train_df.target, test_size=.2, stratify=train_df.target)

In [11]:
data = golden_data.copy()

## Establishing a baseline model

Let's start by getting a baseline model with logistic regression with L1 and L2 regularization (default in `sklearn`). 

In [12]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('lr', LogisticRegression(solver='lbfgs', random_state=seed))
])

In [13]:
def assess_model(pl, data):
    X_train, X_dev, y_train, y_dev = data
    
    pl.fit(X_train, y_train)
    
    # train assessment
    y_preds = pl.predict(X_train)
    print('##### Train #####')
    print(classification_report(y_train, y_preds))
    print(f'Normalized train cost: {cost(y_train, y_preds):.{2}f}\n')
    
    # dev assessment
    y_preds = pl.predict(X_dev)
    print('##### Test #####')
    print(classification_report(y_dev, y_preds))
    print(f'Normalized dev cost: {cost(y_dev, y_preds):.{2}f}\n')
    
    return pl

In [14]:
pl = assess_model(pl, data)

##### Train #####
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     47200
           1       0.86      0.68      0.76       800

   micro avg       0.99      0.99      0.99     48000
   macro avg       0.93      0.84      0.88     48000
weighted avg       0.99      0.99      0.99     48000

Normalized train cost: 1.03

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     11800
           1       0.80      0.66      0.73       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.90      0.83      0.86     12000
weighted avg       0.99      0.99      0.99     12000

Normalized dev cost: 1.44





It looks like our baseline normalized cost for on the `dev` set is 1.44. Let's see if we can beat it!

## Experiment 1: ElasticNet
I'm always confused about the difference between elastic net and logistic regression in `sklearn` because the logistic regression uses L1 and L2 regularization by default.

In [16]:
from sklearn.linear_model import SGDClassifier

### Vanilla ElasticNet

In [27]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('en', SGDClassifier(loss='log', penalty='elasticnet', random_state=seed))
])

In [28]:
pl = assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     47200
           1       0.73      0.63      0.67       800

   micro avg       0.99      0.99      0.99     48000
   macro avg       0.86      0.81      0.83     48000
weighted avg       0.99      0.99      0.99     48000

Normalized train cost: 2.07

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     11800
           1       0.68      0.60      0.64       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.84      0.80      0.82     12000
weighted avg       0.99      0.99      0.99     12000

Normalized dev cost: 2.43



### ElasticNet + Adaptive Learning Rate

In [29]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('en', SGDClassifier(loss='log', penalty='elasticnet', learning_rate='adaptive', eta0=1, random_state=seed))
])

In [30]:
pl = assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     47200
           1       0.81      0.14      0.24       800

   micro avg       0.99      0.99      0.99     48000
   macro avg       0.90      0.57      0.62     48000
weighted avg       0.98      0.99      0.98     48000

Normalized train cost: 0.31

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     11800
           1       0.78      0.15      0.26       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.88      0.58      0.63     12000
weighted avg       0.98      0.99      0.98     12000

Normalized dev cost: 0.40



### L1 Regularization

In [41]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('en', SGDClassifier(loss='log', penalty='l1', random_state=seed))
])

In [42]:
pl = assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     47200
           1       0.77      0.70      0.73       800

   micro avg       0.99      0.99      0.99     48000
   macro avg       0.88      0.85      0.86     48000
weighted avg       0.99      0.99      0.99     48000

Normalized train cost: 1.89

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     11800
           1       0.70      0.64      0.67       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.85      0.82      0.83     12000
weighted avg       0.99      0.99      0.99     12000

Normalized dev cost: 2.36



### L1 Regularization + Adaptive Learning Rate

In [37]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('en', SGDClassifier(loss='log', penalty='l1', learning_rate='adaptive', eta0=1, random_state=seed))
])

In [38]:
pl = assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     47200
           1       0.85      0.43      0.57       800

   micro avg       0.99      0.99      0.99     48000
   macro avg       0.92      0.72      0.78     48000
weighted avg       0.99      0.99      0.99     48000

Normalized train cost: 0.73

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     11800
           1       0.83      0.40      0.54       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.91      0.70      0.76     12000
weighted avg       0.99      0.99      0.99     12000

Normalized dev cost: 0.73



Interesting....the L1-only model gives very similar results for both the train and dev sets.

### L2 Penality only

In [43]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('en', SGDClassifier(loss='log', penalty='l2', random_state=seed))
])

In [44]:
pl = assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     47200
           1       0.76      0.61      0.68       800

   micro avg       0.99      0.99      0.99     48000
   macro avg       0.88      0.80      0.84     48000
weighted avg       0.99      0.99      0.99     48000

Normalized train cost: 1.74

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     11800
           1       0.69      0.59      0.64       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.84      0.79      0.82     12000
weighted avg       0.99      0.99      0.99     12000

Normalized dev cost: 2.31



### L2 Penality + Adaptive Learning Rate

In [45]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('en', SGDClassifier(loss='log', penalty='l2', learning_rate='adaptive', eta0=1, random_state=seed))
])

In [46]:
pl = assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     47200
           1       0.70      0.13      0.22       800

   micro avg       0.98      0.98      0.98     48000
   macro avg       0.84      0.56      0.60     48000
weighted avg       0.98      0.98      0.98     48000

Normalized train cost: 0.47

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     11800
           1       0.71      0.16      0.26       200

   micro avg       0.98      0.98      0.98     12000
   macro avg       0.85      0.58      0.63     12000
weighted avg       0.98      0.98      0.98     12000

Normalized dev cost: 0.57



### No regularization + Adaptive Learning Rate

In [47]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('en', SGDClassifier(loss='log', penalty='none', learning_rate='adaptive', eta0=1, random_state=seed))
])

In [48]:
pl = assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     47200
           1       0.81      0.36      0.50       800

   micro avg       0.99      0.99      0.99     48000
   macro avg       0.90      0.68      0.75     48000
weighted avg       0.99      0.99      0.99     48000

Normalized train cost: 0.78

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     11800
           1       0.72      0.34      0.47       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.85      0.67      0.73     12000
weighted avg       0.98      0.99      0.98     12000

Normalized dev cost: 1.18



### Results
The elastic net model with learning rate adaptation significantly outperformed the baseline model based on the challenge metric. However, I am having some heartburn about the whole challenge metric optimization...is a false  positive really 50 times worse than a false negative???

## Experiment 2: RandomForest

### Vanilla RandomForest

In [32]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('rf', RandomForestClassifier())
])

In [33]:
assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     47200
           1       1.00      0.97      0.98       800

   micro avg       1.00      1.00      1.00     48000
   macro avg       1.00      0.98      0.99     48000
weighted avg       1.00      1.00      1.00     48000

Normalized train cost: 0.17

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     11800
           1       0.83      0.64      0.72       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.91      0.82      0.86     12000
weighted avg       0.99      0.99      0.99     12000

Normalized dev cost: 1.23



Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('en', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None,...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

Hmmm...this is the largest disparity in train/test cost that we've seen so far, which makes me think the model is overfit. I wonder if a RF would do better if we did some feature selection first.

### Feature Selection + Random Forest

In [34]:
from sklearn.feature_selection import SelectFromModel

In [35]:
pl = Pipeline([
    ('imputer', SimpleImputer(np.nan, strategy='median')),
    ('scaler',  StandardScaler()),
    ('selector', SelectFromModel(SGDClassifier(loss='log', penalty='elasticnet', learning_rate='adaptive', eta0=1, random_state=seed), )),
    ('rf', RandomForestClassifier())
])

In [36]:
pl = assess_model(pl, data)



##### Train #####
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     47200
           1       1.00      0.97      0.98       800

   micro avg       1.00      1.00      1.00     48000
   macro avg       1.00      0.98      0.99     48000
weighted avg       1.00      1.00      1.00     48000

Normalized train cost: 0.17

##### Test #####
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     11800
           1       0.83      0.65      0.73       200

   micro avg       0.99      0.99      0.99     12000
   macro avg       0.91      0.82      0.86     12000
weighted avg       0.99      0.99      0.99     12000

Normalized dev cost: 1.19



Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selector', SelectFromModel(estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       ea...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

The feature selection helped our testing performance a bit, but not by much. 