# Outline of this code:
1. undersample non fraud dataset and make a balanced dataset as 1:1 to train the model
2. use precision, recall, f1 and kappa to choose the best parameter c for regularization
3. try different ratio of data to get more accurate and stable results
4. using SMOTE to generate synthetic data points, accuracy is 0.944 and recall is 0.914 by logistic regression
5. try to use random forest and orignal 1:1 ratio data set
6. try to combine random forest and data after SMOTE
7. eventually achieve recall rate is 1 and accuracy is 0.999871035061 and AUC is 0.99987081469

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score, accuracy_score, recall_score, f1_score, cohen_kappa_score
from sklearn import preprocessing
from sklearn.model_selection import KFold

In [2]:
data = pd.read_csv("./creditcard.csv")
data.head()
print("Non fraud rate:")
len(data.loc[data.loc[:, 'Class'] == 0, :]) / len(data.loc[:, 'Class'])

Non fraud rate:


0.9982725143693799

We found that the data is really biased toward non fraud data points. Therefore, in order to get robust result we have to resample the data. Firstly, we try to under sample the non-fraud data points to get 1:1 ratio dataset.

In [4]:
# under sample non fraud
len_fraud = len(data.loc[data.loc[:, 'Class'] == 1, :])
print("number of fraud: ", len_fraud)
# sample from non fraud to have 1:1 propotion
sub_non_fraud = data.loc[data.loc[:, 'Class'] == 0, :].sample(len_fraud)
sub_non_fraud.head()

number of fraud:  492


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
142700,84885.0,1.282544,0.343615,0.155205,0.608999,-0.19951,-0.868578,0.075792,-0.17847,0.063093,...,-0.311994,-0.899064,0.08581,-0.016022,0.262946,0.116895,-0.023705,0.027006,1.29,0
148328,89718.0,-0.151769,0.773317,0.21751,-0.990222,1.273956,-0.133331,1.220424,-0.233007,0.057706,...,-0.32835,-0.66878,-0.04967,-0.043039,-0.537554,0.123571,0.180082,0.013355,11.99,0
283061,171347.0,-1.159551,-1.26701,-0.664532,-0.333422,2.151709,-0.265853,0.929949,-0.313039,-0.008446,...,0.197362,1.26117,1.161797,-0.329007,0.094537,-0.122662,0.069772,-0.314031,84.39,0
58107,48234.0,-2.11849,2.251822,1.969822,2.925761,-1.280051,0.295898,-0.624246,0.633182,-0.263062,...,0.112593,0.271349,0.054213,0.719109,-0.267708,0.072916,-0.863805,-0.067662,0.76,0
235474,148402.0,-1.303719,1.105589,-3.37912,-0.383921,0.847778,1.403939,-0.032561,-1.787864,-1.020774,...,-0.554459,2.550605,-0.325024,-1.667217,-1.654295,-0.054046,0.206256,0.047341,233.29,0


In [5]:
# combine resample fraud and non fraud data
data_resample = pd.concat([sub_non_fraud, data.loc[data.loc[:, 'Class'] == 1, :]])
print("Non fraud rate:")
len(data_resample.loc[data_resample.loc[:, 'Class'] == 0, :]) / len(data_resample.loc[:, 'Class'])

Non fraud rate:


0.5

After under sample the non-fraud data points, we try logistic regression first as out benchmark. In this practice, we care not only the accuracy rate but also the recall rate. We introduce F1 and Cohen’s kappa to compare different regularization power c.  

In [11]:
# use k fold to find the highest recall rate parameter
X = data_resample.iloc[:, 1:29]
Y = data_resample.loc[:, "Class"]
kf = KFold(n_splits=5, shuffle=True)
kf.get_n_splits(X)
# test on different regularation power
cs = [0.001, 0.01, 0.1, 1, 10, 100]
for c in cs:
    accuracy = []
    recall = []
    f1 = []
    kappa = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
        y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
        logreg = LogisticRegression(penalty='l1', solver='liblinear', C=c, max_iter=100)
        logreg.fit(X_train, y_train)
        y_test_pred = logreg.predict(X_test)
        recall.append(recall_score(y_test, y_test_pred))
        accuracy.append(accuracy_score(y_test, y_test_pred))
        f1.append(f1_score(y_test, y_test_pred))
        kappa.append(cohen_kappa_score(y_test, y_test_pred))
    print("For c=", c, "\t recall rate is", np.mean(recall), " and accuracy is", np.mean(accuracy), " f1 =", np.mean(f1), " kappa =", np.mean(kappa))

For c= 0.001 	 recall rate is 0.958063986415  and accuracy is 0.764171760075  f1 = 0.801660541863  kappa = 0.528178323233
For c= 0.01 	 recall rate is 0.920821294594  and accuracy is 0.933922096757  f1 = 0.931979526091  kappa = 0.86710778374
For c= 0.1 	 recall rate is 0.902105846883  and accuracy is 0.942054283642  f1 = 0.939361869465  kappa = 0.883856840724
For c= 1 	 recall rate is 0.910399919161  and accuracy is 0.937004040195  f1 = 0.93401401012  kappa = 0.872990971213
For c= 10 	 recall rate is 0.915312438843  and accuracy is 0.935957733347  f1 = 0.933854859828  kappa = 0.871268818006
For c= 100 	 recall rate is 0.914643641694  and accuracy is 0.938008909147  f1 = 0.936494466004  kappa = 0.875465334199


### After comparing Kappa and f1 score, we choose to use c = 1 as the best parameter for regularization.  
Then we use c=1 and resample dataset to train logistic regression and predict on the original whole dataset.

In [13]:
# use all data after resample to train the model
X = data_resample.iloc[:, 1:29]
Y = data_resample.loc[:, "Class"]
logreg = LogisticRegression(penalty='l1', solver='liblinear', C=1, max_iter=100)
logreg.fit(X, Y)
# apply to original dataset
ori_data = pd.read_csv("./creditcard.csv")
X_test = ori_data.iloc[:, 1:29]
Y_test = ori_data.loc[:, 'Class']
Y_test_predict = logreg.predict(X_test)
accuracy_score(Y_test, Y_test_predict)
recall_score(Y_test, Y_test_predict)
# AUC score
roc_auc_score(Y_test, Y_test_predict)

0.94710745367306304

## Different ratio of data  
There is other method to tackle imbalance dataset. First I would like to try different ratio of fraud and non-fraud data such as 1:1.5, 1:2 and so on. Let's see what can it bring to us.

In [16]:
# function to try to use different ratio and output the result
def result_by_ratio(size_non_fraud):
    # generate different ratio of data
    sub_non_fraud = data.loc[data.loc[:, 'Class'] == 0, :].sample(int(len_fraud*size_non_fraud))
    data_resample = pd.concat([sub_non_fraud, data.loc[data.loc[:, 'Class'] == 1, :]])
    print("-----------------------------------------------")
    print("Non fraud rate:", len(data_resample.loc[data_resample.loc[:, 'Class'] == 0, :]) / len(data_resample.loc[:, 'Class']))
    # use k fold to find the highest recall rate parameter
    X = data_resample.iloc[:, 1:29]
    Y = data_resample.loc[:, "Class"]
    kf = KFold(n_splits=5, shuffle=True)
    kf.get_n_splits(X)
    # test on different regularation power
    cs = [0.001, 0.01, 0.1, 1, 10, 100]
    for c in cs:
        accuracy = []
        recall = []
        f1 = []
        kappa = []
        for train_index, test_index in kf.split(X):
            X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
            y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
            logreg = LogisticRegression(penalty='l1', solver='liblinear', C=c, max_iter=100)
            logreg.fit(X_train, y_train)
            y_test_pred = logreg.predict(X_test)
            recall.append(recall_score(y_test, y_test_pred))
            accuracy.append(accuracy_score(y_test, y_test_pred))
            f1.append(f1_score(y_test, y_test_pred))
            kappa.append(cohen_kappa_score(y_test, y_test_pred))
    chosed_c = cs[kappa.index(max(kappa))]
    # use all data after resample to train the model
    X = data_resample.iloc[:, 1:29]
    Y = data_resample.loc[:, "Class"]
    logreg = LogisticRegression(penalty='l1', solver='liblinear', C=chosed_c, max_iter=100)
    logreg.fit(X, Y)
    # apply to original dataset
    ori_data = pd.read_csv("./creditcard.csv")
    X_test = ori_data.iloc[:, 1:29]
    Y_test = ori_data.loc[:, 'Class']
    Y_test_predict = logreg.predict(X_test)
    print("when ratio is 1:", size_non_fraud, "accuracy is ", \
              accuracy_score(Y_test, Y_test_predict), "and recall is",recall_score(Y_test, Y_test_predict), \
          " and ROC score is", roc_auc_score(Y_test, Y_test_predict))

In [17]:
ratio_list = [1, 1.5, 2, 2.5, 3, 3.5, 4]
for ratio in ratio_list:
    result_by_ratio(ratio)

-----------------------------------------------
Non fraud rate: 0.5
when ratio is 1: 1 accuracy is  0.970580077035 and recall is 0.90243902439  and ROC score is 0.936568508907
-----------------------------------------------
Non fraud rate: 0.6
when ratio is 1: 1.5 accuracy is  0.981457618668 and recall is 0.908536585366  and ROC score is 0.94506019603
-----------------------------------------------
Non fraud rate: 0.6666666666666666
when ratio is 1: 2 accuracy is  0.99102901263 and recall is 0.867886178862  and ROC score is 0.929564143543
-----------------------------------------------
Non fraud rate: 0.7142857142857143
when ratio is 1: 2.5 accuracy is  0.994792965061 and recall is 0.857723577236  and ROC score is 0.926376868723
-----------------------------------------------
Non fraud rate: 0.75
when ratio is 1: 3 accuracy is  0.550460487277 and recall is 0.955284552846  and ROC score is 0.752522251099
-----------------------------------------------
Non fraud rate: 0.7777777777777778


from the above result, we saw some dramatic drop in accuracy when we use some sample size and seems there is no good ratio of data that can help us to cure the imbalanced data. 
#### now I would like to generate Synthetic Samples by SMOTE  
The SMOTE is a synthetic minority over-sampling technique to over sample the data. You may refer to [Imbalace learning](https://github.com/scikit-learn-contrib/imbalanced-learn) to have more info and different method. Here I use the basic one. For proformance purpose, I reduce the nonfraud data point to quarter to reduce the time to train the model.

In [36]:
from imblearn.over_sampling import SMOTE
sub_non_fraud = data.loc[data.loc[:, 'Class'] == 0, :].sample(int(len(data.loc[:, 'Class']) / 2))
data_resample = pd.concat([sub_non_fraud, data.loc[data.loc[:, 'Class'] == 1, :]])
X = data_resample.iloc[:, 1:29]
y = data_resample.loc[:, "Class"]
sm = SMOTE(kind='regular')
X_resampled, y_resampled = sm.fit_sample(X, y)

In [37]:
# size of X and y after SMOTE
print("Size of X", X_resampled.shape)
print("Size of y", y_resampled.shape)
print("Size of fraud", y_resampled[y_resampled == 1].shape)

Size of X (284806, 28)
Size of y (284806,)
Size of fraud (142403,)


After SMOTE, we use logistic regression based on resample data.

In [38]:
from sklearn.model_selection import train_test_split
X_resampled = pd.DataFrame(X_resampled)
y_resampled = pd.DataFrame(y_resampled)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size = 0.3, random_state = 42)
c = [0.01, 0.1, 1, 10]
logreg = LogisticRegressionCV(penalty='l2', solver='sag', Cs=c, refit=True, cv=10, max_iter=100)
logreg.fit(X_train, y_train)
y_test_predict = logreg.predict(X_test)

  y = column_or_1d(y, warn=True)


In [39]:
print("accuracy is ", accuracy_score(y_test, y_test_predict), "and recall is",recall_score(y_test, y_test_predict))
print("AUC score is ", roc_auc_score(y_test, y_test_predict))

accuracy is  0.945413262798 and recall is 0.915253050844
AUC score is  0.945509582836


### The result seems similar to previous one and doesn't improve a lot. Let's try different learning algorithm for example Random Forest. We use resample dataset by SMOTE combined with random forest. 

In [40]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
y_test_predict = rf.predict(X_test)
print("accuracy is ", accuracy_score(y_test, y_test_predict), "and recall is",recall_score(y_test, y_test_predict))
print("AUC score is ", roc_auc_score(y_test, y_test_predict))



accuracy is  0.999719107699 and recall is 1.0
AUC score is  0.999718210638


## Here we get recall rate is almost 1 and really high AUC score. 
The result seems incredibly good. Does the result come from SMOTE or Random Forest. Let's use only the original data to train the random forest again without SMOTE resample dataset.

In [21]:
# under sample non fraud
len_fraud = len(data.loc[data.loc[:, 'Class'] == 1, :])
# sample from non fraud to have 50/50 propotion
sub_non_fraud = data.loc[data.loc[:, 'Class'] == 0, :].sample(len_fraud)
data_resample = pd.concat([sub_non_fraud, data.loc[data.loc[:, 'Class'] == 1, :]])
X = data_resample.iloc[:, 1:29]
y = data_resample.loc[:, "Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
rf2 = RandomForestClassifier(n_estimators=100)
rf2.fit(X_train, y_train)

ori_data = pd.read_csv("./creditcard.csv")
X_test = ori_data.iloc[:, 1:29]
y_test = ori_data.loc[:, 'Class']
y_test_predict = rf2.predict(X_test)
print("accuracy rate is ", accuracy_score(y_test, y_test_predict))
print("recall rate is ", recall_score(y_test, y_test_predict))
print("AUC score is ", roc_auc_score(y_test, y_test_predict))

accuracy rate is  0.970839902109
recall rate is  0.967479674797
AUC score is  0.969162695848


#### we conclude that combining SMOTE and Random forest, we can get a really good result. If we use only random forest, the result seems decent. 