## Preface  

Most classification algorithms will only perform optimally when the number of samples of each class is roughly the same. Highly skewed datasets, where the minority is heavily outnumbered by one or more classes, have proven to be a challenge while at the same time becoming more and more common.  
https://pypi.python.org/pypi/imbalanced-learn 

Imbalanced data is common in real-world settings: genuine versus fraudulent transactions and healthy versus infected patients, to name a few. Applying machine learning techniques to such problems without care can give very poor predictions. Furthermore, an inappropriate performance metric can lead to the wrong conclusion; accuracy, for example, rewards a model that simply predicts the majority class. In this project, we tackle this problem and compare how much each technique improves the model.  
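
To make the point about misleading metrics concrete, here is a minimal sketch on made-up data (the 1,000 labels below are hypothetical and unrelated to the credit card dataset). A "model" that always predicts the majority class looks excellent by accuracy while detecting no fraud at all.

```python
# Hypothetical toy labels: 1,000 transactions, 10 of them fraudulent (label 1).
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of the fraudulent transactions that were actually caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.990
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

This is why the evaluation later in the notebook reports precision, recall, F1 and area-based scores alongside accuracy.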


## Preprocessing

### Sampling techniques
1. Undersampling
    - Random undersampling
    - Cluster Centroids
    - Near Miss
2. Oversampling
    - Random oversampling: generate new samples by randomly resampling the under-represented class with replacement
    - Synthetic Minority Over-sampling Technique (SMOTE)
3. Combined over- and undersampling
    - SMOTEENN
    - SMOTETomek  
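
The two random strategies in the list above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how imbalanced-learn implements them; the helper names `random_undersample` and `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Group row indices by class label."""
    groups = {}
    for idx, label in enumerate(y):
        groups.setdefault(label, []).append(idx)
    return groups


def random_undersample(X, y, seed=0):
    """Shrink every class to the size of the smallest one (no replacement)."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = min(len(idxs) for idxs in groups.values())
    keep = sorted(i for idxs in groups.values() for i in rng.sample(idxs, target))
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Grow every class to the size of the largest one by duplicating rows."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = max(len(idxs) for idxs in groups.values())
    keep = list(range(len(y)))  # keep every original row
    for idxs in groups.values():
        # Draw the missing rows with replacement from the same class.
        keep.extend(rng.choices(idxs, k=target - len(idxs)))
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 10 majority rows (label 0), 2 minority rows (label 1).
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2

_, y_under = random_undersample(X, y)
_, y_over = random_oversample(X, y)
print(Counter(y_under))  # 2 rows per class
print(Counter(y_over))   # 10 rows per class
```

Undersampling discards majority rows and therefore information, while oversampling keeps everything but repeats minority rows, which is the trade-off the later benchmarks compare.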

### Training techniques  
1. Class weighting
2. Sample weighting
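
As a rough illustration of class weighting, the sketch below computes weights with the formula scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`. The helper name `balanced_class_weights` and the toy labels are assumptions made for this example. Sample weighting follows the same idea, except that the weight is attached to each row individually.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its frequency."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical toy labels: 990 legitimate (0) and 10 fraudulent (1) samples.
y = [0] * 990 + [1] * 10

class_weights = balanced_class_weights(y)
# Per-sample weights: every row inherits the weight of its class.
sample_weights = [class_weights[label] for label in y]

print(class_weights[1])            # 50.0
print(round(class_weights[0], 3))  # 0.505
```

With these weights each class contributes the same total weight to the loss (here 500 per class), so the minority class is no longer drowned out.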

## Import the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from IPython.display import display
%matplotlib inline

from imblearn.under_sampling import ClusterCentroids, NearMiss, RandomUnderSampler
from imblearn.over_sampling import SMOTE, RandomOverSampler

from imblearn.combine import SMOTEENN, SMOTETomek

from sklearn.metrics import recall_score, accuracy_score, confusion_matrix, \
f1_score, precision_score, auc, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler


## Load the dataset

In [None]:
df = pd.read_csv('./input/creditcard.csv')
display(df.head(5))

In [None]:
display(df.describe())

In [None]:
df['Class'].value_counts()

In [None]:
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
df = df.drop(['Time'], axis = 1)
display(df.head(5))

## Split the data 

In [None]:
X = df.iloc[:, df.columns != 'Class']
y = df.iloc[:, df.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

print(X_train.shape)
print(X_test.shape)

In [None]:
def transform(transformer, X, y):
    print('Transforming {}'.format(transformer.__class__.__name__))
    ## fit_resample replaces the older fit_sample method in imbalanced-learn
    X_resampled, y_resampled = transformer.fit_resample(X.values, y.values.ravel())
    return transformer.__class__.__name__, pd.DataFrame(X_resampled), pd.DataFrame(y_resampled)

def benchmark(sampling_type, X, y):
    ## The L1 penalty needs a solver that supports it, such as liblinear
    lr = LogisticRegression(penalty = 'l1', solver = 'liblinear')
    param_grid = {'C': [0.1,1,10]}
    g_search = GridSearchCV(estimator = lr, param_grid = param_grid, scoring = 'accuracy',
                            cv = 4, verbose = 2)
    g_search = g_search.fit(X.values, y.values.ravel())
    return sampling_type, g_search.best_score_, g_search.best_params_['C']


In [None]:
datasets = []
datasets.append(('base', X_train, y_train))
datasets.append(transform(RandomUnderSampler(), X_train, y_train))
datasets.append(transform(NearMiss(n_jobs=-1), X_train, y_train))
datasets.append(transform(RandomOverSampler(), X_train, y_train))
datasets.append(transform(SMOTE(), X_train, y_train))

## SMOTEENN and SMOTETomek are computationally demanding on larger data sets, so they are left disabled.
## datasets.append(transform(SMOTEENN(), X_train, y_train))
## datasets.append(transform(SMOTETomek(), X_train, y_train))

In [None]:
display([item[0] for item in datasets])
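
Unlike random oversampling, SMOTE creates new minority points rather than copies. The core idea is to pick a minority sample, pick one of its nearest minority neighbours, and place a synthetic point somewhere on the segment between them. The sketch below is a simplified illustration of that interpolation step only; `smote_like`, its parameters and the toy points are hypothetical, and the imbalanced-learn implementation differs in detail.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5)

for point in new_points:
    print(point)
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies.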

### Determine Hyper-Parameters

In [None]:
## sampling_type, g_search.best_score_, g_search.best_params_['C']
benchmark_scores = []
for sample_type, X, y in datasets:
    print('------')
    print('{}'.format(sample_type))
    benchmark_scores.append(benchmark(sample_type,X,y))
    print('------')


In [None]:
display(benchmark_scores)

### Evaluate the models

In [None]:
print (X_train.shape)

In [None]:
print(type(benchmark_scores))

In [None]:
## http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba

## http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py
from sklearn.metrics import average_precision_score

scores = []
for sampling_type, score, param in benchmark_scores:
    print('Training on {}'.format(sampling_type))
    lr = LogisticRegression(penalty='l1', solver='liblinear', C = param)
    for s_type, X, y in datasets:
        if s_type == sampling_type:
            lr.fit(X.values, y.values.ravel())
            y_pred_class = lr.predict(X_test.values)
            
            ## Probability estimates
            y_pred_prob = lr.predict_proba(X_test.values)
            
            ## Predict confidence scores for samples.
            y_pred_confi = lr.decision_function(X_test.values)
            
            ## Use the continuous scores: hard labels reduce the ROC curve to a single operating point
            fpr, tpr, threshold = roc_curve(y_test.values.ravel(), y_pred_confi)
            prec, recall, thres = precision_recall_curve(y_test.values.ravel(), y_pred_confi)

            
            ## average_precision = average_precision_score(y_test.values.ravel(), y_pred_confi)

            
            scores.append((sampling_type,
                           accuracy_score(y_test.values.ravel(), y_pred_class),
                           f1_score(y_test.values.ravel(), y_pred_class),
                           precision_score(y_test.values.ravel(), y_pred_class),
                           recall_score(y_test.values.ravel(), y_pred_class),
                           average_precision_score(y_test.values.ravel(), y_pred_confi),
                           auc(fpr, tpr),
                           auc(recall, prec),  ## x = recall, y = precision
                           confusion_matrix(y_test.values.ravel(), y_pred_class)))  ## tn, fp, fn, tp 
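
The area under the ROC curve depends on how the model ranks the test samples, so it is best computed from continuous scores such as the output of `decision_function` rather than from hard class labels. The sketch below illustrates this with a pairwise definition of ROC AUC on made-up scores; `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair, which matches the trapezoidal area
    under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: both positives outrank every negative.
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4]
labels = [int(s >= 0.5) for s in scores]  # hard predictions at threshold 0.5

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, labels)

print(auc_from_scores)  # 1.0
print(auc_from_labels)  # 0.75
```

The ranking is perfect, yet thresholding first throws away the ordering among samples on the same side of the threshold and the reported area drops.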

In [None]:
sampling_results = pd.DataFrame(scores, columns=['Sampling Type', 'accuracy', 'f1', 'precision',
                                                 'recall', 'average_precision', 
                                                 'auc_roc', 'auc_pr', 'confusion_matrix'])
display(sampling_results)
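
For reading the table, it may help to see how the scalar metrics follow from the confusion matrix, whose entries for a binary problem are ordered tn, fp, fn, tp. The sketch below uses made-up counts, not results from this notebook, and `metrics_from_confusion` is a hypothetical helper.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive the headline metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts chosen for illustration only.
m = metrics_from_confusion(tn=85000, fp=300, fn=20, tp=120)

for name, value in m.items():
    print(f"{name:>9}: {value:.4f}")
```

In this made-up case accuracy stays above 0.99 while precision is below 0.3, which is the pattern to look for when comparing undersampling against the other strategies.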

### Weighted Class

In [None]:
lr = LogisticRegression(penalty = 'l1', solver = 'liblinear', class_weight = 'balanced')
lr.fit(X_train.values, y_train.values.ravel())

In [None]:
scores = []
y_pred_class = lr.predict(X_test.values)
y_pred_proba = lr.predict_proba(X_test.values)
y_pred_confi = lr.decision_function(X_test.values)
fpr, tpr, thresholds = roc_curve(y_test.values.ravel(), y_pred_confi)
precision, recall, thres = precision_recall_curve(y_test.values.ravel(), y_pred_confi)
scores.append(("weighted_base", 
               accuracy_score(y_test.values.ravel(),y_pred_class), 
               f1_score(y_test.values.ravel(),y_pred_class),
               precision_score(y_test.values.ravel(),y_pred_class),
               recall_score(y_test.values.ravel(),y_pred_class),
               average_precision_score(y_test.values.ravel(), y_pred_confi),
               auc(fpr, tpr),
               auc(recall, precision),  ## x = recall, y = precision
               confusion_matrix(y_test.values.ravel(),y_pred_class)))
scores = pd.DataFrame(scores, columns = ['Sampling Type', 'accuracy', 'f1', 'precision',
                                                 'recall', 'average_precision', 
                                                 'auc_roc', 'auc_pr', 'confusion_matrix'])

In [None]:
results = pd.concat([sampling_results, scores], ignore_index=True)
display(results)


## Reflection
- Undersampling yields high recall, but at the cost of a large drop in precision
- SMOTE and RandomOverSampler perform best on auc_roc and auc_pr, with an acceptable level of false positives
- Class weighting can give results comparable to the sampling techniques