# Credit Card Fraud Detection


# 1. Introduction
<a id='1'></a>

We use a dataset of credit card transactions made over a two-day period in September 2013 by European cardholders. The dataset contains 284,807 transactions, of which 492 (0.17%) are fraudulent.

Each transaction has 30 features, all of which are numerical. The features `V1, V2, ..., V28` are the result of a PCA transformation. To protect confidentiality, background information on these features is not available. The `Time` feature contains the time elapsed since the first transaction, and the `Amount` feature contains the transaction amount. The response variable, `Class`, is 1 in the case of fraud, and 0 otherwise.

Our goal in this project is to construct models to predict whether a credit card transaction is fraudulent. We'll attempt a supervised learning approach. We'll also create visualizations to help us understand the structure of the data and unearth any interesting patterns.

# 2. Data Analysis
<a id='2'></a>

Import basic libraries:

In [2]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas options
pd.set_option('display.max_colwidth', 1000, 'display.max_rows', None, 'display.max_columns', None)

# Plotting options
%matplotlib inline
mpl.style.use('ggplot')
sns.set(style='whitegrid')

Read the data into a pandas DataFrame.

In [3]:
!gdown 1c2a2xgV47ijS7xH0TgPwPAdTZBQ605sP

'gdown' is not recognized as an internal or external command,
operable program or batch file.


In [5]:
import gdown

# Google Drive file ID
file_id = '1c2a2xgV47ijS7xH0TgPwPAdTZBQ605sP'

# URL to download the file
url = f'https://drive.google.com/uc?id={file_id}'

# Output file name
output_file = 'creditcard.csv'  # Specify the desired file extension

# Download the file
gdown.download(url, output_file, quiet=False)  # Set quiet=True to suppress output


Access denied with the following error:



 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=1c2a2xgV47ijS7xH0TgPwPAdTZBQ605sP 



In [4]:
# Read the file saved by the download step above
transactions = pd.read_csv('creditcard.csv')

Check basic metadata.

In [4]:
transactions.shape

(284807, 31)

In [5]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

View the class distribution:

In [8]:
transactions['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [9]:
transactions['Class'].value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64

Only 0.17% of the transactions (492 out of 284,807) are fraudulent.
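With this much imbalance, plain accuracy says little. As a quick illustration that uses only the class counts printed above, a hypothetical classifier that labels every transaction as legitimate would already be about 99.83% accurate while catching no fraud at all:

```python
# Class counts taken from the value_counts() output above
n_legit, n_fraud = 284315, 492
n_total = n_legit + n_fraud

# A hypothetical "always predict 0" classifier gets every legitimate transaction right
baseline_accuracy = n_legit / n_total
# ...but it detects none of the fraudulent ones
baseline_recall = 0 / n_fraud

print(f'accuracy = {baseline_accuracy:.6f}, fraud recall = {baseline_recall:.1f}')
```

For this reason, a metric that accounts for both classes is more informative here than accuracy alone.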

# 3. Train/Test Split
<a id='3'></a>

## 3.1 Undersampling
<a id='3.1'></a>

Build a smaller dataset by undersampling the majority class: keep all 492 fraudulent transactions and draw a random sample of 100,000 legitimate transactions. The result is still heavily imbalanced (about 0.49% fraud).

In [10]:
legit = transactions[transactions.Class == 0]
fraud = transactions[transactions.Class == 1]

In [11]:
legit_sample = legit.sample(n=100000)

Concatenate the two DataFrames:

In [12]:
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [13]:
new_dataset['Class'].value_counts()

0    100000
1       492
Name: Class, dtype: int64

In [14]:
new_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94684.84012,0.007943,-0.005912,0.01492,-0.009291,0.00497,-0.000108,0.011054,-0.00049,-0.001861,0.010758,-0.009876,0.012303,-0.006096,0.009838,-3e-06,0.009235,0.011778,-0.000133,0.001286,-0.000278,-0.000979,-0.001494,-0.000279,-0.001035,-0.001618,-0.000296,-0.00137,-0.000316,88.531656
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,-5.676883,3.800173,-6.259393,-0.109334,-6.971723,-0.092929,-4.139946,-6.665836,-2.246308,0.680659,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


Before we begin preprocessing, we split off a test data set. First split the data into features and response variable:

In [15]:
X = new_dataset.drop(labels='Class', axis=1) # Features
y = new_dataset.loc[:,'Class']               # Response
del new_dataset                              # Delete the original data

We'll use a test size of 20%. We also stratify the split on the response variable, which is very important to do because there are so few fraudulent transactions.

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
del X, y

In [18]:
X_train.shape

(80393, 30)

In [19]:
X_test.shape

(20099, 30)
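Stratification means the 20% test fraction is drawn from each class separately, so both splits keep the same fraud rate. The idea can be sketched in plain Python as follows. This is a simplified illustration with a hypothetical helper `stratified_split` and toy labels, not scikit-learn's implementation:

```python
import random

def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx), sampling the same fraction from every class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Indices of all samples that belong to this class
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Toy labels: 100 legitimate (0) and 10 fraudulent (1) transactions
labels = [0] * 100 + [1] * 10
train_idx, test_idx = stratified_split(labels)
n_fraud_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_fraud_test)
```

Without stratification, a random 20% draw from so few fraud cases could by chance leave the test set with too few of them to evaluate a model reliably.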

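In [20]:
# Work on explicit copies to prevent pandas view warnings
# (the `is_copy` attribute is no longer available in current pandas)
X_train = X_train.copy()
X_test = X_test.copy()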

## 3.2 Upsampling using SMOTE
<a id='3.2'></a>

In [21]:
def resamplingDataPrep(X_train, y_train, target_var): 
    # concatenate our training data back together
    resampling = X_train.copy()
    resampling[target_var] = y_train.values
    # separate minority and majority classes
    majority_class = resampling[resampling[target_var]==0]
    minority_class = resampling[resampling[target_var]==1]
    # Get a class count to understand the class imbalance.
    print('majority_class: '+ str(len(majority_class)))
    print('minority_class: '+ str(len(minority_class)))
    return majority_class, minority_class

In [22]:
# Use a named target column and avoid shadowing the built-in `min`
majority, minority = resamplingDataPrep(X_train, y_train, 'Class')

majority_class: 79999
minority_class: 394


In [23]:
from imblearn.over_sampling import SMOTE

In [24]:
def upsample_SMOTE(X_train, y_train, ratio=1.0):
    """Upsample the minority class using SMOTE.

    `ratio` is the desired ratio of minority to majority samples after
    resampling. The default of 1.0 yields balanced classes.
    """
    sm = SMOTE(random_state=23, sampling_strategy=ratio)
    X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)
    # Report the class counts after resampling
    class_counts = y_train_sm.value_counts()
    print('majority_class: ' + str(class_counts[0]))
    print('minority_class: ' + str(class_counts[1]))
    return X_train_sm, y_train_sm

In [25]:
X_train_sm, y_train_sm = upsample_SMOTE(X_train, y_train, ratio=1.0)

majority_class: 79999
minority_class: 79999
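
SMOTE creates the extra minority samples by interpolating between existing minority points and their nearest minority neighbours. The core idea can be sketched with NumPy as below. This is a simplified illustration, not imbalanced-learn's implementation; the helper name `smote_sketch`, its parameters, and the toy points are assumptions made for the example:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise squared distances between minority samples
    dist = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base point and one of its k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the synthetic point at a random position along the connecting segment
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=10, k=2)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region already occupied by the minority class.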


### Final distribution in train set:
79,999 legitimate and 79,999 fraudulent transactions (159,998 in total)

In [26]:
X_train, y_train = X_train_sm, y_train_sm

# 4. Mutual Information between Fraud and the Predictors
<a id='4'></a>

[Mutual information](https://en.wikipedia.org/wiki/Mutual_information) is a non-parametric method to estimate the mutual dependence between two variables. Mutual information of 0 indicates no dependence, and higher values indicate higher dependence. According to the [sklearn User Guide](http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection), "mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation." After resampling we have 159,998 training samples, so mutual information should work well. Because the target variable is discrete, we use [`mutual_info_classif`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif) (as opposed to [`mutual_info_regression`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression) for a continuous target).

In [None]:
from sklearn.feature_selection import mutual_info_classif

In [None]:
mutual_infos = pd.Series(data=mutual_info_classif(X_train, y_train, discrete_features=False, random_state=1), index=X_train.columns)

The mutual information of each variable with `Class`, in descending order:

In [None]:
mutual_infos.sort_values(ascending=False)

V14       0.531440
V10       0.465248
V12       0.464279
V4        0.441128
V17       0.441111
V11       0.415013
V3        0.394386
V16       0.352956
V7        0.329546
Amount    0.318155
V2        0.292301
V9        0.279042
V27       0.251656
Time      0.248493
V21       0.244110
V1        0.228429
V18       0.207154
V6        0.196228
V28       0.179425
V8        0.172210
V5        0.155805
V20       0.120983
V19       0.106473
V24       0.067919
V23       0.060694
V26       0.056855
V25       0.034999
V22       0.032365
V15       0.025263
V13       0.024378
dtype: float64

The five variables with the highest mutual information with `Class` are, in decreasing order, V14, V10, V12, V4, and V17.
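
To make the scale of these values concrete, the definition of mutual information can be evaluated directly for two discrete variables. The sketch below is a plug-in estimate from empirical frequencies with a hypothetical helper `mutual_information` and toy data. It is not the estimator used by `mutual_info_classif`, which relies on nearest-neighbour methods for continuous features:

```python
from collections import Counter
from math import log

def mutual_information(x, y):
    """Plug-in estimate of the mutual information (in nats) of two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # Sum p(a, b) * log(p(a, b) / (p(a) * p(b))) over all observed pairs
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

target = [0, 1] * 50               # balanced binary target
copy_of_target = list(target)      # fully dependent on the target
independent = [0, 0, 1, 1] * 25    # statistically independent of the target

mi_dependent = mutual_information(copy_of_target, target)
mi_independent = mutual_information(independent, target)
print(f'{mi_dependent:.4f} {mi_independent:.4f}')
```

For a balanced binary target the largest possible value is $\ln 2 \approx 0.693$ nats, which gives a reference point for reading the scores above.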

# 5. Evaluation method
<a id='5'></a>

In [27]:
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score

In [28]:
def classification_eval(estimator, X_test, y_test):
    """
    Print several metrics of classification performance of an estimator, given features X_test and true labels y_test.
    
    Input: estimator or GridSearchCV instance, X_test, y_test
    Returns: text printout of metrics
    """
    y_pred = estimator.predict(X_test)
    
    # Number of decimal places based on number of samples
    dec = np.int64(np.ceil(np.log10(len(y_test))))
    
    print('CONFUSION MATRIX')
    print(confusion_matrix(y_test, y_pred), '\n')
    
    print('CLASSIFICATION REPORT')
    print(classification_report(y_test, y_pred, digits=dec))
    
    print('SCALAR METRICS')
    format_str = '%%13s = %%.%if' % dec
    print(format_str % ('MCC', matthews_corrcoef(y_test, y_pred)))
    if y_test.nunique() <= 2: # Additional metrics for binary classification
        try:
            y_score = estimator.predict_proba(X_test)[:,1]
        except AttributeError:  # estimator has no predict_proba (e.g. hinge-loss models)
            y_score = estimator.decision_function(X_test)
        print(format_str % ('AUPRC', average_precision_score(y_test, y_score)))
        print(format_str % ('AUROC', roc_auc_score(y_test, y_score)))
    print(format_str % ("Cohen's kappa", cohen_kappa_score(y_test, y_pred)))
    print(format_str % ('Accuracy', accuracy_score(y_test, y_pred)))

This helper prints the confusion matrix, the per-class classification report, and several scalar metrics (MCC, AUPRC, AUROC, Cohen's kappa, and accuracy). We apply it to the held-out test set for each model in the next section.

# 6. Modeling
<a id='6'></a>

The following models are trained:
* Logistic regression
* Support vector classifier
* Random forest
* Decision tree

## 6.1 Logistic Regression and Support Vector Classifier
<a id='6.1'></a>

The class [`SGDClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) implements multiple linear classifiers with SGD training, which makes learning much faster on large datasets. We'll implement the model as a machine learning pipeline that includes [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for data standardization (rescaling each variable to zero mean and unit variance).

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

In [None]:
pipeline_sgd = Pipeline([
    ('scaler', StandardScaler(copy=False)),
    ('model', SGDClassifier(max_iter=1000, tol=1e-3, random_state=1, warm_start=True))
])
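
Standardization inside the pipeline means the mean and standard deviation are learned from the training data only and then reused for any new data. The arithmetic can be sketched with NumPy as follows, using toy arrays that are assumptions for illustration and not values from the dataset:

```python
import numpy as np

# Toy data: two features on very different scales
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

# Statistics are estimated on the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)   # population standard deviation (ddof=0)

# The same transformation is applied to training and test data
X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std
print(X_train_std.mean(axis=0), X_train_std.std(axis=0), X_test_std)
```

Putting the scaler in the pipeline lets the grid search refit these statistics on the training portion of each cross-validation split.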

We'll conduct a grid search over several hyperparameter choices. The search uses 5-fold cross-validation with stratified folds. The type of linear classifier is chosen with the `loss` hyperparameter. For a linear SVC we set `loss = 'hinge'`, and for logistic regression we set `loss = 'log'` (scikit-learn 1.1 renamed this option to `'log_loss'`, and later releases no longer accept `'log'`).

Set the hyperparameter grids to search over, one grid for the linear SVC and one for logistic regression:

In [None]:
param_grid_sgd = [{
    'model__loss': ['log'],
    'model__penalty': ['l1', 'l2'],
    'model__alpha': np.logspace(start=-3, stop=3, num=20)
}, {
    'model__loss': ['hinge'],
    'model__alpha': np.logspace(start=-3, stop=3, num=20),
    'model__class_weight': [None, 'balanced']
}]
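
The two `loss` settings differ only in how a single prediction is penalized. As a rough sketch, the standard hinge and logistic losses for a label $y \in \{-1, +1\}$ and a decision score can be written as below. The function names are chosen for this example only:

```python
from math import exp, log

def hinge_loss(y, score):
    """Linear SVC loss for a label y in {-1, +1} and a decision score."""
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    """Logistic regression loss for a label y in {-1, +1} and a decision score."""
    return log(1.0 + exp(-y * score))

# Compare both losses for a positive example at several decision scores
for score in (-2.0, 0.0, 0.5, 2.0):
    print(f'score={score:+.1f}  hinge={hinge_loss(1, score):.3f}  '
          f'log={logistic_loss(1, score):.3f}')
```

The hinge loss is exactly zero once a point is classified correctly with a margin of at least 1, whereas the logistic loss keeps shrinking but never reaches zero.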

The grid search, implemented by [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), uses [`StratifiedKFold`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) with 5 folds for the train/validation splits. We'll use [`matthews_corrcoef`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html) (the [Matthews correlation coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient), MCC) as our scoring metric.

In [31]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

In [None]:
MCC_scorer = make_scorer(matthews_corrcoef)
grid_sgd = GridSearchCV(estimator=pipeline_sgd, param_grid=param_grid_sgd, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)

Perform the grid search:

In [None]:
import warnings
with warnings.catch_warnings(): # Suppress warnings from the matthews_corrcoef function
    warnings.simplefilter("ignore")
    grid_sgd.fit(X_train, y_train)

Fitting 5 folds for each of 80 candidates, totalling 400 fits
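
The candidate count in this log follows directly from the two hyperparameter grids. A small sanity check that enumerates the combinations in plain Python (only the grid sizes matter, so the alpha values are replaced by their indices):

```python
from itertools import product

n_alphas = 20  # length of np.logspace(start=-3, stop=3, num=20)

# First grid: logistic loss with two penalties
grid_log = list(product(['log'], ['l1', 'l2'], range(n_alphas)))
# Second grid: hinge loss with two class-weight settings
grid_hinge = list(product(['hinge'], range(n_alphas), [None, 'balanced']))

n_candidates = len(grid_log) + len(grid_hinge)
n_fits = n_candidates * 5  # 5-fold cross-validation
print(n_candidates, n_fits)
```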


Mean cross-validated MCC score of the best estimator found:

In [None]:
grid_sgd.best_score_

0.9590021961840508

This is a good MCC score: random guessing has a score of 0, and a perfect predictor has a score of 1. Now check the best hyperparameters found in the grid search:

In [None]:
grid_sgd.best_params_

{'model__alpha': 0.001, 'model__loss': 'log', 'model__penalty': 'l1'}

So logistic regression with an L1 penalty performed better than the linear SVC, and with the lowest level of regularization in the grid ($\alpha = 0.001$).

In [None]:
classification_eval(grid_sgd, X_test, y_test)

CONFUSION MATRIX
[[19797   204]
 [    9    89]] 

CLASSIFICATION REPORT
              precision    recall  f1-score   support

           0    0.99955   0.98980   0.99465     20001
           1    0.30375   0.90816   0.45524        98

    accuracy                        0.98940     20099
   macro avg    0.65165   0.94898   0.72495     20099
weighted avg    0.99615   0.98940   0.99202     20099

SCALAR METRICS
          MCC = 0.52187
        AUPRC = 0.84508
        AUROC = 0.96483
Cohen's kappa = 0.45123
     Accuracy = 0.98940
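
The scalar metrics can be recomputed by hand from the confusion matrix, which shows why the MCC is so much lower than the accuracy here. The sketch below applies the standard formulas for precision, recall, and the MCC to the four counts printed above:

```python
from math import sqrt

# Confusion matrix from the output above: rows are true classes, columns are predictions
tn, fp = 19797, 204
fn, tp = 9, 89

precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of fraud that is detected
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f'precision = {precision:.5f}, recall = {recall:.5f}, MCC = {mcc:.5f}')
```

The 204 false positives outnumber the 89 true positives, so precision for the fraud class is low and pulls the MCC down even though almost all legitimate transactions are classified correctly.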


## 6.2 Random Forest
<a id='6.2'></a>

Next we'll try a random forest model, implemented in [`RandomForestClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef

We do not need to rescale the data for tree-based models, so our pipeline will simply consist of the random forest model. We'll leave the pipeline implementation in place in case we want to add preprocessing steps in the future.

In [None]:
pipeline_rf = Pipeline([
    ('model', RandomForestClassifier(n_jobs=-1, random_state=1))
])
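
The reason rescaling is unnecessary is that a tree split depends only on the ordering of a feature's values, which standardization preserves. A minimal pure-Python sketch with a single Gini-based split illustrates this. The helpers `gini` and `best_split` and the toy amounts are assumptions for the example, not scikit-learn code:

```python
def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the indices sent to the left branch by the best threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float('inf'), None
    for cut in range(1, len(order)):
        if values[order[cut - 1]] == values[order[cut]]:
            continue  # no threshold can separate equal values
        left, right = order[:cut], order[cut:]
        # Weighted impurity of the two child nodes
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

amounts = [1.0, 5.0, 20.0, 300.0, 900.0, 2500.0]
is_fraud = [0, 0, 0, 1, 1, 1]

# Standardize the feature (zero mean, unit variance)
mean = sum(amounts) / len(amounts)
std = (sum((a - mean) ** 2 for a in amounts) / len(amounts)) ** 0.5
scaled = [(a - mean) / std for a in amounts]

same_partition = best_split(amounts, is_fraud) == best_split(scaled, is_fraud)
print(same_partition)
```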

The random forest takes much longer to train on this fairly large dataset, so we don't actually do a hyperparameter grid search and only specify the number of estimators. We'll leave the grid search implemented in case we decide to try different hyperparameter values in the future.

In [None]:
param_grid_rf = {'model__n_estimators': [75]}

In [None]:
from sklearn.model_selection import GridSearchCV
MCC_scorer = make_scorer(matthews_corrcoef)
grid_rf = GridSearchCV(estimator=pipeline_rf, param_grid=param_grid_rf, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)

Perform the grid search:

In [None]:
grid_rf.fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('model',
                                        RandomForestClassifier(n_jobs=-1,
                                                               random_state=1))]),
             n_jobs=-1, param_grid={'model__n_estimators': [75]},
             scoring=make_scorer(matthews_corrcoef), verbose=1)

In [None]:
grid_rf.best_score_

0.9994500956412578

The random forest achieved a much higher cross-validated MCC than the best linear model, and without any hyperparameter tuning.

In [None]:
grid_rf.best_params_

{'model__n_estimators': 75}

In [None]:
classification_eval(grid_rf, X_test, y_test)

CONFUSION MATRIX
[[19994     7]
 [   13    85]] 

CLASSIFICATION REPORT
              precision    recall  f1-score   support

           0    0.99935   0.99965   0.99950     20001
           1    0.92391   0.86735   0.89474        98

    accuracy                        0.99900     20099
   macro avg    0.96163   0.93350   0.94712     20099
weighted avg    0.99898   0.99900   0.99899     20099

SCALAR METRICS
          MCC = 0.89469
        AUPRC = 0.89078
        AUROC = 0.97938
Cohen's kappa = 0.89424
     Accuracy = 0.99900


## 6.3 Decision Tree
<a id='6.3'></a>

In [36]:
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

In [37]:
MCC_scorer = make_scorer(matthews_corrcoef)
params = {'criterion':['gini','entropy'],'max_depth':[4,10]}
grid_search_cv = GridSearchCV(estimator=DecisionTreeClassifier(random_state=1), param_grid=params, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)
grid_search_cv.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=1), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [4, 10]},
             scoring=make_scorer(matthews_corrcoef), verbose=1)

In [38]:
grid_search_cv.best_score_

0.9833975545664597

In [39]:
grid_search_cv.best_params_

{'criterion': 'gini', 'max_depth': 10}

In [41]:
classification_eval(grid_search_cv, X_test, y_test)

CONFUSION MATRIX
[[19819   182]
 [   10    88]] 

CLASSIFICATION REPORT
              precision    recall  f1-score   support

           0    0.99950   0.99090   0.99518     20001
           1    0.32593   0.89796   0.47826        98

    accuracy                        0.99045     20099
   macro avg    0.66271   0.94443   0.73672     20099
weighted avg    0.99621   0.99045   0.99266     20099

SCALAR METRICS
          MCC = 0.53782
        AUPRC = 0.65611
        AUROC = 0.94032
Cohen's kappa = 0.47450
     Accuracy = 0.99045


# 7. Conclusion
<a id='7'></a>

We were able to accurately identify fraudulent credit card transactions using a random forest model. We found that the five variables with the highest mutual information with fraud are, in decreasing order, V14, V10, V12, V4, and V17.

We used the [Matthews correlation coefficient (MCC)](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) to compare the performance of different models. On the held-out test set, the best linear model (logistic regression with an L1 penalty) achieved an MCC of 0.52, the decision tree achieved 0.54, and the random forest achieved 0.89. The cross-validated MCC scores were much higher (0.96, 0.98, and 1.00, respectively), most likely because the validation folds were drawn from the SMOTE-balanced training data while the test set keeps its original class imbalance. We therefore chose the random forest as the best model.

To improve a chosen model, we searched over a grid of hyperparameters and compared performance with cross-validation. It may be possible to improve the random forest model by further tweaking the hyperparameters, given additional time and/or computational power.

References

* [Kaggle Dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud)
* [sklearn](https://scikit-learn.org/)
