## Preface  

Most classification algorithms perform optimally only when the number of samples in each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more other classes, have proven to be a challenge while at the same time becoming more and more common.  
(Source: https://pypi.python.org/pypi/imbalanced-learn)

Imbalanced data is common in real-world settings: genuine versus fraudulent transactions and healthy versus infected patients, to name two. Applying machine learning techniques carelessly to such problems can give very poor predictions. Furthermore, an inappropriate performance metric can lead to wrong conclusions: on a heavily skewed dataset, accuracy stays high even when the model misses most minority-class samples. In this project, we tackle this problem on a credit card fraud dataset and compare how much each technique improves the model.  


## Preprocessing

### Sampling techniques
1. Undersampling
    - Random undersampling
    - Cluster Centroids
    - Near Miss
2. Oversampling
    - Random oversampling: generate new samples by randomly resampling the underrepresented class with replacement
    - Synthetic Minority Oversampling Technique (SMOTE)
3. Combined over- and undersampling
    - SMOTEENN
    - SMOTETomek  
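
As a rough illustration of the two oversampling ideas listed above, the sketch below implements random oversampling and a SMOTE-style interpolation in plain Python. It is a simplified toy, not the imbalanced-learn implementation: the function names `random_oversample` and `smote_like` are hypothetical, the data is made up, and the SMOTE variant simply interpolates toward a randomly chosen one of the k nearest minority neighbors under Euclidean distance.

```python
import math
import random

def random_oversample(minority, n_target, rng):
    """Duplicate minority samples, drawn with replacement, until n_target is reached."""
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return minority + extra

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points between a sample and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base`, excluding the point itself
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy 2-D minority class
oversampled = random_oversample(minority, 10, rng)
synthetic = smote_like(minority, 6, k=2, rng=rng)
print(len(oversampled), len(synthetic))
```

Random oversampling only repeats existing points, whereas the interpolation step produces new points that lie between existing minority samples.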

### Training techniques  
1. Class weighting
2. Sample weighting
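
Class weighting leaves the data untouched and instead scales each class's contribution to the loss. Below is a minimal sketch of the heuristic behind scikit-learn's `class_weight='balanced'` option, `n_samples / (n_classes * count)`, applied to the class counts of the full dataset used in this notebook (284,315 legitimate and 492 fraudulent transactions). The helper name `balanced_class_weights` is hypothetical, and the per-sample list at the end is just one way such weights could be turned into sample weights.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 284315 + [1] * 492
weights = balanced_class_weights(labels)

# One weight per sample, looked up from its class
sample_weights = [weights[label] for label in labels]
print(weights)
```

With these weights, both classes carry the same total weight, so a missed fraud case costs the model far more than a misclassified legitimate transaction.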

## Import the libraries

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import ClusterCentroids, NearMiss, RandomUnderSampler
from imblearn.combine import SMOTEENN, SMOTETomek

from sklearn.metrics import recall_score, accuracy_score, confusion_matrix, \
f1_score, precision_score, auc, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

from IPython.display import display

## Load the dataset

In [27]:
df = pd.read_csv('./input/creditcard.csv')
display(df.head(5))

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [29]:
display(df.describe())

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.758743e-12,-8.252298e-13,-9.636929e-13,8.316157e-13,1.591952e-13,4.247354e-13,-3.05018e-13,8.693344e-14,-1.179712e-12,...,-3.406543e-13,-5.713163e-13,-9.725303e-13,1.464139e-12,-6.989087e-13,-5.61526e-13,3.332112e-12,-3.518886e-12,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [28]:
df['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64
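
These counts show why accuracy alone is misleading here. As a small worked example (plain arithmetic on the counts above, no model involved), a trivial classifier that labels every transaction as class 0 already reaches about 99.83% accuracy while catching no fraud at all:

```python
n_legit, n_fraud = 284315, 492  # class counts from value_counts()

# A trivial model that predicts class 0 for every transaction
true_negatives = n_legit
accuracy = true_negatives / (n_legit + n_fraud)
recall = 0 / n_fraud  # no fraud case is ever flagged

print('accuracy = {:.4f}, recall = {:.1f}'.format(accuracy, recall))
```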

In [4]:
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
df = df.drop(['Time'], axis = 1)
display(df.head(5))

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0.244964,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,-0.342475,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,1.160686,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0.140534,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,-0.073403,0
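
`StandardScaler` replaces each value with `(x - mean) / std`. As a quick sanity check, the sketch below reproduces the first two scaled `Amount` values from the mean and standard deviation reported by `df.describe()`. One caveat: `describe()` reports the sample standard deviation while `StandardScaler` uses the population one, a difference that is negligible for 284,807 rows. The helper `standardize` is hypothetical.

```python
mean_amount = 88.349619   # 'mean' row of df.describe() for Amount
std_amount = 250.120109   # 'std' row of df.describe() for Amount

def standardize(x, mean, std):
    """Z-score: distance from the mean in units of standard deviation."""
    return (x - mean) / std

z_first = standardize(149.62, mean_amount, std_amount)   # Amount of row 0
z_second = standardize(2.69, mean_amount, std_amount)    # Amount of row 1
print(z_first, z_second)
```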


## Split the data 

In [9]:
X = df.iloc[:, df.columns != 'Class']
y = df.iloc[:, df.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

print(X_train.shape)
print(X_test.shape)

(199364, 29)
(85443, 29)


In [31]:
print(y.head(10))

   0
0  0
1  0
2  0
3  0
4  0
5  0
6  0
7  0
8  0
9  0

In [10]:
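def benchmark(sampling_type, X, y):
    """Grid-search C for an L1-penalized logistic regression; return (name, best CV score, best C)."""
    # liblinear supports the L1 penalty; newer scikit-learn defaults to lbfgs, which does not
    lr = LogisticRegression(penalty = 'l1', solver = 'liblinear')
    param_grid = {'C': [0.01, 1, 100]}
    g_search = GridSearchCV(estimator = lr, param_grid = param_grid, scoring = 'accuracy',
                            cv = 5, verbose = 2)
    g_search = g_search.fit(X.values, y.values.ravel())
    return sampling_type, g_search.best_score_, g_search.best_params_['C']

def transform(transformer, X, y):
    """Resample (X, y) with an imbalanced-learn sampler; return (sampler name, X, y) as DataFrames."""
    print('Transforming {}'.format(transformer.__class__.__name__))
    # fit_resample replaces the older fit_sample name in current imbalanced-learn releases
    X_resampled, y_resampled = transformer.fit_resample(X.values, y.values.ravel())
    return transformer.__class__.__name__, pd.DataFrame(X_resampled), pd.DataFrame(y_resampled)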

In [11]:
datasets = []
datasets.append(('base', X_train, y_train))
datasets.append(transform(SMOTE(), X_train, y_train))
datasets.append(transform(RandomOverSampler(), X_train, y_train))
datasets.append(transform(NearMiss(n_jobs=-1), X_train, y_train))
datasets.append(transform(RandomUnderSampler(), X_train, y_train))
# datasets.append(transform(SMOTEENN(), X_train, y_train))
# datasets.append(transform(SMOTETomek(), X_train, y_train))

Transforming SMOTE
Transforming RandomOverSampler
Transforming NearMiss
Transforming RandomUnderSampler
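
With their default settings, each of these samplers balances the two classes 1:1: oversamplers grow the minority class to the size of the majority, and undersamplers shrink the majority to the size of the minority. The sketch below works out the resulting training-set sizes. The training class counts are an inference rather than something the notebook prints: the test-set confusion matrices contain 160 fraud cases, which leaves 492 - 160 = 332 fraud cases among the 199,364 training rows.

```python
n_train = 199364                      # rows in X_train
n_fraud_train = 492 - 160             # inferred: total fraud minus fraud in the test set
n_legit_train = n_train - n_fraud_train

sizes = {
    'base': n_train,
    'oversampled (SMOTE, RandomOverSampler)': 2 * n_legit_train,
    'undersampled (NearMiss, RandomUnderSampler)': 2 * n_fraud_train,
}
for name, size in sizes.items():
    print('{}: {} rows'.format(name, size))
```

The large gap between these sizes helps explain why the grid search takes minutes on the oversampled sets and only seconds on the undersampled ones.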


### Determine Hyper-parameters

In [12]:
benchmark_scores = []
for sample_type, X, y in datasets:
    print('------')
    print('{}'.format(sample_type))
    benchmark_scores.append(benchmark(sample_type, X, y))
    print('------')


------
base
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  2.9min finished
------
------
SMOTE
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed: 138.3min finished
------
------
RandomOverSampler
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  4.9min finished
------
------
NearMiss
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    3.0s finished
------
------
RandomUnderSampler
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    2.0s finished
------


In [15]:
display(benchmark_scores)

[('base', 0.9992375754900584, 100),
 ('SMOTE', 0.9455263475220065, 100),
 ('RandomOverSampler', 0.949543289521283, 1),
 ('NearMiss', 0.9683734939759037, 100),
 ('RandomUnderSampler', 0.9337349397590361, 1)]

### Evaluate the models

In [18]:
scores = []
for sampling_type, score, param in benchmark_scores:
    print('Training on {}'.format(sampling_type))
    # liblinear supports the L1 penalty on current scikit-learn versions
    lr = LogisticRegression(penalty='l1', solver='liblinear', C = param)
    for s_type, X, y in datasets:
        if s_type == sampling_type:
            lr.fit(X.values, y.values.ravel())
            pred_test = lr.predict(X_test.values)
            pred_test_prob = lr.predict_proba(X_test.values)
            probs = lr.decision_function(X_test.values)  # signed scores, not probabilities
            # Hard labels give a one-point ROC curve, so 'auc_roc' below equals balanced
            # accuracy; pass `probs` instead to get the usual threshold-free ROC AUC
            fpr, tpr, threshold = roc_curve(y_test.values.ravel(), pred_test)
            prec, recall, thres = precision_recall_curve(y_test.values.ravel(), probs)
            # Sort by precision (ties broken by recall), as the removed `reorder=True` did
            order = np.lexsort((recall, prec))
            scores.append((sampling_type,
                           f1_score(y_test.values.ravel(), pred_test),
                           precision_score(y_test.values.ravel(), pred_test),
                           recall_score(y_test.values.ravel(), pred_test),
                           accuracy_score(y_test.values.ravel(), pred_test),
                           auc(fpr, tpr),
                           auc(prec[order], recall[order]),
                           confusion_matrix(y_test.values.ravel(), pred_test)))

Training on base
Training on SMOTE
Training on RandomOverSampler
Training on NearMiss
Training on RandomUnderSampler
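
ROC AUC is meant to be computed from continuous scores. Feeding hard 0/1 predictions to `roc_curve` collapses the curve to a single operating point, and the area then equals balanced accuracy, the mean of the true positive and true negative rates. The sketch below illustrates the difference on made-up data, using a plain-Python pairwise definition of ROC AUC (the probability that a random positive scores higher than a random negative, with ties counting one half). `roc_auc_pairwise` is a hypothetical helper; scikit-learn's `roc_auc_score` computes the same quantity.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 0, 1, 1]
scores = [-2.0, -1.0, -0.5, 0.3, -0.2, 1.5]   # made-up decision_function outputs
labels = [int(s > 0) for s in scores]          # hard predictions at threshold 0

auc_from_scores = roc_auc_pairwise(y_true, scores)
auc_from_labels = roc_auc_pairwise(y_true, labels)
print(auc_from_scores, auc_from_labels)  # → 0.875 0.625
```

On this toy data the hard-label value (0.625) is exactly the balanced accuracy at threshold 0: a true positive rate of 1/2 and a true negative rate of 3/4.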


In [19]:
sampling_results = pd.DataFrame(scores, columns=['Sampling Type', 'f1', 'precision',
                                                 'recall', 'accuracy', 'auc_roc',
                                                 'auc_pr', 'confusion_matrix'])
display(sampling_results)

| | Sampling Type | f1 | precision | recall | accuracy | auc_roc | auc_pr | confusion_matrix |
|---|---|---|---|---|---|---|---|---|
| 0 | base | 0.698182 | 0.834783 | 0.6 | 0.999029 | 0.799889 | 0.743501 | [[85264, 19], [64, 96]] |
| 1 | SMOTE | 0.114575 | 0.061097 | 0.91875 | 0.973409 | 0.946131 | 0.754611 | [[83024, 2259], [13, 147]] |
| 2 | RandomOverSampler | 0.122792 | 0.065825 | 0.9125 | 0.975586 | 0.944102 | 0.759541 | [[83211, 2072], [14, 146]] |
| 3 | NearMiss | 0.005759 | 0.002888 | 0.975 | 0.369545 | 0.671704 | 0.100784 | [[31419, 53864], [4, 156]] |
| 4 | RandomUnderSampler | 0.109172 | 0.058034 | 0.91875 | 0.971923 | 0.945386 | 0.722427 | [[82897, 2386], [13, 147]] |
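
Most columns in this table follow directly from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. As a worked check, the sketch below recomputes the `base` row from its confusion matrix; the helper name `metrics_from_confusion` is hypothetical. It also computes balanced accuracy, which matches the `auc_roc` column here because that column was computed from hard predictions.

```python
def metrics_from_confusion(cm):
    """Derive common binary metrics from a [[TN, FP], [FN, TP]] confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        'precision': precision,
        'recall': recall,
        'f1': 2 * precision * recall / (precision + recall),
        'accuracy': (tp + tn) / (tn + fp + fn + tp),
        'balanced_accuracy': (recall + specificity) / 2,
    }

base = metrics_from_confusion([[85264, 19], [64, 96]])
for name, value in base.items():
    print('{}: {:.6f}'.format(name, value))
```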


### Weighted Class

In [23]:
# liblinear supports the L1 penalty on current scikit-learn versions
lr = LogisticRegression(penalty = 'l1', solver = 'liblinear', class_weight = 'balanced')
lr.fit(X_train.values, y_train.values.ravel())
scores = []
pred_test = lr.predict(X_test.values)
pred_test_proba = lr.predict_proba(X_test.values)
proba = lr.decision_function(X_test.values)  # signed scores, not probabilities
# ROC from hard labels (a single operating point), matching the sampling benchmark
fpr, tpr, thresholds = roc_curve(y_test.values.ravel(), pred_test)
p, r, t = precision_recall_curve(y_test.values.ravel(), proba)
# Sort by precision (ties broken by recall), as the removed `reorder=True` did
order = np.lexsort((r, p))
scores.append(("weighted_base", f1_score(y_test.values.ravel(),pred_test),
               precision_score(y_test.values.ravel(),pred_test),
               recall_score(y_test.values.ravel(),pred_test),
               accuracy_score(y_test.values.ravel(),pred_test),
               auc(fpr, tpr),
               auc(p[order], r[order]),
               confusion_matrix(y_test.values.ravel(),pred_test)))
scores = pd.DataFrame(scores, columns = ['Sampling Type','f1','precision',
                                         'recall','accuracy','auc_roc','auc_pr',
                                         'confusion_matrix'])

In [24]:
results = pd.concat([sampling_results, scores])  # DataFrame.append was removed in pandas 2.0
display(results)

| | Sampling Type | f1 | precision | recall | accuracy | auc_roc | auc_pr | confusion_matrix |
|---|---|---|---|---|---|---|---|---|
| 0 | base | 0.698182 | 0.834783 | 0.6 | 0.999029 | 0.799889 | 0.743501 | [[85264, 19], [64, 96]] |
| 1 | SMOTE | 0.114575 | 0.061097 | 0.91875 | 0.973409 | 0.946131 | 0.754611 | [[83024, 2259], [13, 147]] |
| 2 | RandomOverSampler | 0.122792 | 0.065825 | 0.9125 | 0.975586 | 0.944102 | 0.759541 | [[83211, 2072], [14, 146]] |
| 3 | NearMiss | 0.005759 | 0.002888 | 0.975 | 0.369545 | 0.671704 | 0.100784 | [[31419, 53864], [4, 156]] |
| 4 | RandomUnderSampler | 0.109172 | 0.058034 | 0.91875 | 0.971923 | 0.945386 | 0.722427 | [[82897, 2386], [13, 147]] |
| 0 | weighted_base | 0.121162 | 0.064889 | 0.9125 | 0.975212 | 0.943915 | 0.759453 | [[83179, 2104], [14, 146]] |


## Reflection
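- Undersampling yields high recall (0.92 to 0.98), but at the cost of very low precision (0.003 for NearMiss, 0.06 for RandomUnderSampler)
- SMOTE and RandomOverSampler perform best on auc_roc and auc_pr taken together, with an acceptable number of false positives (roughly 2,100 to 2,300 of the 85,283 legitimate test transactions)
- Class weighting gives results comparable to the sampling techniques without resampling the data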