In [1]:
import numpy as np
import pandas as pd 
import os
import xgboost as xgb
from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt

# Exploratory data analysis
We start by loading the dataset, looking at some summary statistics, and inspecting the first few rows of data.

In [2]:
df = pd.read_csv('./data/creditcard.csv')
print("Data shape is: ", df.shape)
print("Data types are:")
print(df.dtypes)

Data shape is:  (284807, 31)
Data types are:
Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object


It seems that all of our data is numeric, and most of the features have uninformative names (V1 to V28). This is anonymized data, so it's not likely we will get any information about what each feature actually means. This leaves us with little feature engineering to do. With purely numerical features on tabular data, the most we can normally do is arithmetic transformations. These can lead to overfitting, however, so we will avoid them here.

Let's look at some summary statistics.

In [3]:
df.describe().T

feature,count,mean,std,min,25%,50%,75%,max
Time,284807.0,94813.86,47488.145955,0.0,54201.5,84692.0,139320.5,172792.0
V1,284807.0,3.91956e-15,1.958696,-56.40751,-0.920373,0.018109,1.315642,2.45493
V2,284807.0,5.688174e-16,1.651309,-72.715728,-0.59855,0.065486,0.803724,22.057729
V3,284807.0,-8.769071e-15,1.516255,-48.325589,-0.890365,0.179846,1.027196,9.382558
V4,284807.0,2.782312e-15,1.415869,-5.683171,-0.84864,-0.019847,0.743341,16.875344
V5,284807.0,-1.552563e-15,1.380247,-113.743307,-0.691597,-0.054336,0.611926,34.801666
V6,284807.0,2.010663e-15,1.332271,-26.160506,-0.768296,-0.274187,0.398565,73.301626
V7,284807.0,-1.694249e-15,1.237094,-43.557242,-0.554076,0.040103,0.570436,120.589494
V8,284807.0,-1.927028e-16,1.194353,-73.216718,-0.20863,0.022358,0.327346,20.007208
V9,284807.0,-3.137024e-15,1.098632,-13.434066,-0.643098,-0.051429,0.597139,15.594995


The V features look to be roughly normalized: their means are very close to zero, and their standard deviations are of order one (roughly 0.5 to 2). Time is clearly on a different scale. Depending on the model we choose, we may need to normalize the data further. An SVM would almost certainly require further normalization to get good results, whereas tree-based models tend not to require it.
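As a rough illustration of what "further normalization" would mean for a scale-sensitive model such as an SVM, here is a minimal z-score sketch. The `standardize` helper and the toy arrays are hypothetical and not part of the notebook; the one design point it shows is that the mean and standard deviation come from the training rows only and are then reused on the test rows.

```python
import numpy as np

def standardize(train, test):
    """Z-score both arrays using statistics from the training array only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Toy data (an assumption): two columns on very different scales.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=[90000.0, 0.0], scale=[47000.0, 2.0], size=(1000, 2))
toy_test = rng.normal(loc=[90000.0, 0.0], scale=[47000.0, 2.0], size=(200, 2))

train_scaled, test_scaled = standardize(toy_train, toy_test)

# After scaling, each training column has mean ~0 and standard deviation ~1.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The test columns will not be exactly zero-mean after this transform, which is expected: they are scaled with the training statistics so that no information from the hold-out rows leaks into preprocessing.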

# Class imbalance
Now let's look at the class balance. In fraud detection the positive class is typically very rare (less than 1% of records). It is important to take this into account when building our model: a model that predicts negative/False for the entire dataset would classify more than 99% of the data correctly, yet it would be useless for fraud detection.

In these cases there are a few strategies we can try.

1. Use a scoring metric that is not dominated by the majority class, such as the area under the ROC curve or the F1 score
2. Upsample or downsample the dataset to eliminate the class imbalance
3. When training the ML model, use a weighted loss function

In [4]:
not_features = ['Class', 'Time']
all_features = [x for x in df.columns if x not in not_features]
print('negative classes:', sum(df['Class'] == 0))
print('positive classes:', sum(df['Class'] == 1))
print('ratio:', df['Class'].mean())

negative classes: 284315
positive classes: 492
ratio: 0.001727485630620034
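To make the point about accuracy concrete, the sketch below plugs the class counts printed above into the definitions of accuracy and F1 for a hypothetical "model" that predicts the negative class for every row. It is plain arithmetic, not a call into any library.

```python
n_neg, n_pos = 284315, 492  # class counts printed above

# Confusion-matrix entries for a classifier that always predicts negative.
tp, fp, fn, tn = 0, 0, n_pos, n_neg

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"accuracy = {accuracy:.4f}")
print(f"F1       = {f1:.4f}")
```

The accuracy comes out above 99.8% while the F1 score is zero, which is why the rest of the notebook scores models with F1 rather than accuracy.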


We will do some simple feature selection. In this case we will just remove features that have a very low correlation with the target.

In [5]:
plt.figure(figsize=(18,14))
df_corr = df.corr()

ax = sns.heatmap(df_corr, linewidths=.5, square=True, annot=True, fmt='.1f')

In [6]:
features = [x for x in df_corr[df_corr['Class'].abs() > 0.05][['Class']].index if x not in not_features]
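One caveat with a filter like this is that Pearson correlation only measures linear association. The toy example below (synthetic data, not the credit card dataset) builds a binary label that depends entirely on the magnitude of a feature, and shows that the raw feature still has essentially zero correlation with the label.

```python
import numpy as np

# Symmetric toy feature on [-1, 1]; the label fires only for extreme values.
x = np.arange(-100, 101) / 100.0
label = (np.abs(x) > 0.8).astype(int)

r_raw = np.corrcoef(x, label)[0, 1]          # linear association with x
r_abs = np.corrcoef(np.abs(x), label)[0, 1]  # association with |x|

print(f"corr(x, label)   = {r_raw:.3f}")
print(f"corr(|x|, label) = {r_abs:.3f}")
```

A threshold on `|corr|` would drop `x` here even though a tree model could separate the classes perfectly with two splits on it.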

We would like some way to validate our model, so we will split the data into a train and test set. The test set will be kept until the very end to check if we are overfitting. For model tuning, we will be using cross validation.

In [7]:
split = int(len(df)*0.7)
train_df = df.iloc[:split]
test_df = df.iloc[split:]

In [8]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
import lightgbm as lgbm
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = lgbm.LGBMClassifier(random_state=0, n_jobs=7)

First let's see how a default LGBM model performs. We did some feature selection, so it's worth checking whether removing those features actually helped. Data often contains non-linear relationships, and simply removing features with low correlation can give worse performance.

In [9]:
result = cross_val_score(estimator=model, X=train_df[features], y=train_df['Class'], cv=kfold, scoring='f1')

print('mean score: ', np.mean(result))
print('score std: ',np.std(result))

mean score:  0.26905332645
score std:  0.173131260422


In [10]:
result = cross_val_score(estimator=model, X=train_df[all_features], y=train_df['Class'], cv=kfold, scoring='f1')

print('mean score: ', np.mean(result))
print('score std: ',np.std(result))

mean score:  0.397089286786
score std:  0.180945240547


The mean F1 score is clearly higher with all the features included (0.40 versus 0.27), so we will use all of them for the remainder of the training.
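The fold-to-fold spread in both runs is large, so it is worth a rough check of how well the two means are separated. The sketch below uses only the numbers printed above and treats the 10 fold scores as independent, which is an approximation (cross-validation folds share training data), so the result should be read as a ballpark figure only.

```python
import math

n_folds = 10
mean_sel, std_sel = 0.26905332645, 0.173131260422   # selected features (cell 9)
mean_all, std_all = 0.397089286786, 0.180945240547  # all features (cell 10)

# Approximate standard error of each mean, and of their difference.
se_sel = std_sel / math.sqrt(n_folds)
se_all = std_all / math.sqrt(n_folds)
diff = mean_all - mean_sel
se_diff = math.sqrt(se_sel ** 2 + se_all ** 2)

print(f"difference in mean F1: {diff:.3f}")
print(f"approx. standard error: {se_diff:.3f}")
print(f"difference / standard error: {diff / se_diff:.2f}")
```

The difference is between one and two standard errors, so the evidence favors keeping all features but is not overwhelming; this is one motivation for the repeated cross-validation used in the next cells.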

Now let's compare a few different models. We will check whether using a weighted loss function gives better results, and we will compare two different boosted tree models.

In [11]:
def calc_scores(feat_list, estimator, n_reps=1):
    result = []
    for i in range(n_reps):
        kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
        splits = kfold.split(X=train_df[feat_list], y=train_df['Class']) 
        score = cross_val_score(estimator=estimator, X=train_df[feat_list], y=train_df['Class'], cv=splits, scoring='f1')
        result += list(score)
    return result

In [12]:
import warnings
warnings.filterwarnings("ignore")

model_list = []

weight_dict = {0: train_df['Class'].mean(), 1:1/train_df['Class'].mean()}
model_list.append(('lgbm weighted', lgbm.LGBMClassifier(class_weight=weight_dict, 
                                                        random_state=0, 
                                                        n_jobs=7)))
model_list.append(('xgb weighted', xgb.XGBClassifier(random_state=0,
                                                     n_jobs=7,
                                                     scale_pos_weight=sum(train_df['Class']==1)/sum(train_df['Class']==0))))
model_list.append(('lgbm default', lgbm.LGBMClassifier(random_state=0,
                                                      n_jobs=7)))
model_list.append(('xgb default', xgb.XGBClassifier(random_state=0,
                                                   n_jobs=7)))

for label, model in model_list:
    score = calc_scores(all_features, estimator=model, n_reps=5)
    print(label, np.mean(score))

lgbm weighted 0.762343259776
xgb weighted 0.0
lgbm default 0.305161573662
xgb default 0.852409453922


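Weighting improves LightGBM considerably (0.76 versus 0.31), but interestingly the default XGBoost model performs best of all (0.85), so we will continue with XGBoost as our main model. The 0.0 for weighted XGBoost is probably an artifact of the scale_pos_weight value passed in the cell above, which down-weights the positive class, rather than evidence that weighting cannot help.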
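For reference, the XGBoost parameter documentation suggests sum(negative instances) / sum(positive instances) as a typical value for `scale_pos_weight`. The sketch below compares that with the inverse ratio, using the full-dataset class counts printed earlier as stand-ins (the training split will have slightly different counts).

```python
# Full-dataset counts from earlier; the 70% training split will differ a little.
n_neg, n_pos = 284315, 492

conventional = n_neg / n_pos   # up-weights the rare positive class
inverse = n_pos / n_neg        # the ratio computed in the weighted XGBoost cell

print(f"negatives / positives = {conventional:.1f}")
print(f"positives / negatives = {inverse:.5f}")
```

With the inverse ratio each fraud case counts for far less than a normal transaction, so a model that never predicts fraud becomes close to optimal for the weighted loss, which is consistent with an F1 score of 0.0.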

An F1 score of ~0.85 is quite good all things considered, but we can try to improve it with some model tuning. Here we will keep the tuning simple and vary one parameter at a time, carrying the best value forward. There are more complex and lengthy tuning methods, but they tend to give only marginal improvements over a simple search across a few select parameters.

In [13]:
result = []
for param in [0.01,0.05,0.1]:
    model = xgb.XGBClassifier(random_state=0, learning_rate=param, n_jobs=7)
    score = calc_scores(all_features, estimator=model, n_reps=5)
    result.append((param, np.mean(score)))
    print(param, np.mean(score))

0.01 0.793200878223
0.05 0.846185834246
0.1 0.852409453922


In [14]:
result = []
for param in [0.5,0.6,0.7,0.8,0.9,1]:
    model = xgb.XGBClassifier(random_state=0, learning_rate=0.1, subsample=param, n_jobs=7)
    score = calc_scores(all_features, estimator=model, n_reps=5)
    result.append((param, np.mean(score)))
    print(param, np.mean(score))

0.5 0.853910510358
0.6 0.859727479774
0.7 0.857970291882
0.8 0.856966034868
0.9 0.855173748065
1 0.852409453922


In [15]:
result = []
for param in [0.5,0.6,0.7,0.8,0.9,1]:
    model = xgb.XGBClassifier(random_state=0, learning_rate=0.1, subsample=0.6, colsample_bytree=param, n_jobs=7)
    score = calc_scores(all_features, estimator=model, n_reps=5)
    result.append((param, np.mean(score)))
    print(param, np.mean(score))

0.5 0.861995713354
0.6 0.861012062285
0.7 0.859574192337
0.8 0.86078150756
0.9 0.860119479661
1 0.859727479774
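The three cells above tune one parameter at a time and carry the best value forward, which is much cheaper than an exhaustive grid over the same values. The sketch below just counts the work involved in each approach; it does not train anything, and the variable names are illustrative.

```python
from itertools import product

learning_rates = [0.01, 0.05, 0.1]
subsamples = [0.5, 0.6, 0.7, 0.8, 0.9, 1]
colsamples = [0.5, 0.6, 0.7, 0.8, 0.9, 1]

# calc_scores is called with n_reps=5 and uses 5 folds per repetition.
fits_per_setting = 5 * 5

sequential_settings = len(learning_rates) + len(subsamples) + len(colsamples)
grid_settings = len(list(product(learning_rates, subsamples, colsamples)))

print("one-at-a-time:", sequential_settings, "settings,",
      sequential_settings * fits_per_setting, "model fits")
print("full grid:    ", grid_settings, "settings,",
      grid_settings * fits_per_setting, "model fits")
```

The trade-off is that a one-at-a-time search can miss interactions between parameters, for example a lower learning rate that only pays off with a different subsample value.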


Tuning has given only a small improvement (from about 0.852 to 0.862), so let's stop here and train and test our final model.

In [16]:
model = xgb.XGBClassifier(random_state=0, 
                          learning_rate=0.1, 
                          subsample=0.6, 
                          colsample_bytree=0.5, 
                          n_jobs=7)

model.fit(X=train_df[all_features], y=train_df['Class'])

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.5, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, n_jobs=7,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.6, verbosity=1)

In [17]:
from sklearn.metrics import f1_score

f1_score(y_pred=model.predict(test_df[all_features]), y_true=test_df['Class'])

0.80000000000000004

Not bad! The hold-out F1 score of 0.80 is somewhat below the ~0.86 we were achieving in cross-validation, but close enough to suggest that we are not overfitting very much and that the model generalizes reasonably well.