# Credit Card Fraud

We will detect credit card fraud from the features of our dataset using 3 different models. This notebook covers the Logistic Regression one.

We want to minimize the False Negative Rate (FNR), i.e. the fraction of actual frauds that the model fails to flag.
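As a minimal sketch of the metric, FNR can be computed as FN / (FN + TP), which equals 1 minus the recall of the fraud class. The function name `false_negative_rate` and the toy labels below are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): share of actual positives predicted as negative."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    if positives.sum() == 0:
        raise ValueError("FNR is undefined when there are no positive labels")
    false_negatives = np.sum(positives & (y_pred == 0))
    return false_negatives / positives.sum()

# Toy labels (hypothetical): 4 actual frauds, 1 of them missed by the prediction
y_true = [0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 1, 0]
fnr = false_negative_rate(y_true, y_pred)
print(fnr)
```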

Since the dataset is unbalanced, we can try two techniques that may help us make better predictions:

- Adding Gaussian noise to the fraud data to create more fraud samples and reduce the imbalance
- Randomly sampling the fraud data to train k models, then averaging them (or choosing the best one)
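The two ideas above could look roughly like the sketch below. It runs on synthetic data rather than `creditcard.csv`, and it makes two assumptions: the noise is scaled per feature by `noise_scale` times that feature's standard deviation, and the second technique is read as undersampling the non-fraud class so that each of the k models sees a balanced subset. All function names and parameters here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def augment_with_noise(X_minority, n_copies, noise_scale=0.05):
    """Return n_copies jittered copies of every minority row.

    The Gaussian noise has a per-feature standard deviation of
    noise_scale times that feature's standard deviation (an assumption).
    """
    feature_std = X_minority.std(axis=0)
    repeated = np.repeat(X_minority, n_copies, axis=0)
    noise = rng.normal(0.0, 1.0, size=repeated.shape) * feature_std * noise_scale
    return repeated + noise

def train_undersampled_models(X_major, X_minor, k):
    """Train k models, each on all minority rows plus an equally
    sized random draw (without replacement) of majority rows."""
    models = []
    for _ in range(k):
        idx = rng.choice(len(X_major), size=len(X_minor), replace=False)
        X = np.vstack([X_major[idx], X_minor])
        y = np.concatenate([np.zeros(len(idx)), np.ones(len(X_minor))])
        models.append(LogisticRegression(max_iter=500).fit(X, y))
    return models

def average_fraud_probability(models, X):
    """Average the predicted probability of class 1 over all models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Synthetic stand-in for the non-fraud and fraud rows
X_major = rng.normal(0.0, 1.0, size=(2000, 5))
X_minor = rng.normal(2.0, 1.0, size=(20, 5))

X_minor_aug = augment_with_noise(X_minor, n_copies=10)
models = train_undersampled_models(X_major, X_minor, k=5)
proba = average_fraud_probability(models, X_minor)
print(X_minor_aug.shape, len(models), proba.shape)
```

Either technique should only be applied to the training split, so that the test set keeps the original class balance.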
    
 

In [1]:
import numpy as np
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import random

In [2]:
import os 

data_path = os.path.join('dataset', 'creditcard.csv')
df = pd.read_csv(data_path, low_memory=False)
df = df.sample(frac=1).reset_index(drop=True)
df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,171525.0,-1.429956,-0.103944,-1.597035,-0.931097,2.273441,-1.46514,1.035409,-0.185441,0.012495,...,-0.131308,0.159357,0.468893,0.233064,-0.149044,0.199873,-0.294163,-0.326358,5.0,0
1,160248.0,-0.075964,0.441855,1.282697,-0.49723,-0.15482,-0.47178,0.419333,-0.272168,1.068102,...,0.246722,0.985217,-0.167083,0.062401,-0.403061,-0.343578,-0.316589,-0.166177,9.99,0
2,57461.0,-0.860359,1.49919,0.575222,1.070761,-0.497898,-1.065072,0.229828,0.320389,-0.736921,...,0.219184,0.571068,0.020289,0.748144,-0.166166,-0.400871,-0.565567,-0.355837,1.5,0
3,115415.0,1.68017,-1.783177,-1.95464,-2.906774,1.013218,3.475922,-1.277428,0.970543,2.709952,...,0.2113,0.399229,0.084951,0.720211,-0.238584,-0.908217,0.087708,-0.001138,186.52,0
4,122880.0,2.02653,-0.993197,-1.167852,-0.600632,-0.414395,0.191167,-0.706712,0.069864,-0.003371,...,-0.298261,-0.398354,0.146119,0.022785,-0.07693,-0.692099,0.041413,-0.0357,63.0,0


In [3]:
frauds = df.loc[df['Class'] == 1]
non_frauds = df.loc[df['Class'] == 0]
print("We have", len(frauds), "fraud data points and", len(non_frauds), "nonfraudulent data points.")

We have 492 fraud data points and 284315 nonfraudulent data points.


In [4]:
from sklearn import datasets, linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

X = df.iloc[:,:-1]
y = df['Class']

print("X and y sizes, respectively:", X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35)
print("Train and test sizes, respectively:", len(X_train), len(y_train), "|", len(X_test), len(y_test))
# Count frauds with boolean comparisons on each Series instead of indexing with a mask built from df
n_frauds = (y == 1).sum()
print("Total number of frauds:", n_frauds, '--', n_frauds / len(y))
print("Number of frauds on y_test:", (y_test == 1).sum(), '--', (y_test == 1).mean())
print("Number of frauds on y_train:", (y_train == 1).sum(), '--', (y_train == 1).mean())

X and y sizes, respectively: (284807, 30) (284807,)
Train and test sizes, respectively: 185124 185124 | 99683 99683
Total number of frauds: 492 -- 0.001727485630620034
Number of frauds on y_test: 172 -- 0.001725469739072861
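The split above happens to keep the fraud ratio nearly identical in both parts, but that is not guaranteed for a plain random split. A minimal sketch of forcing it with the `stratify` argument of `train_test_split`, shown on synthetic labels (the array sizes and `random_state=0` are assumptions made for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 10,000 rows with 0.2% positives
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the class proportions (almost) equal in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.35, stratify=y_demo, random_state=0
)
print("Fraud ratio overall:", y_demo.mean())
print("Fraud ratio in train:", yd_train.mean())
print("Fraud ratio in test:", yd_test.mean())
```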
Number of frauds on y_train: 320 -- 0.0017285711198980144


In [5]:
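import statsmodels.api as sm

# Note: sm.Logit does not add an intercept on its own, so this model is fit
# without a constant term (sm.add_constant(X_train) would add one).
logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()
print(result.summary())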

Optimization terminated successfully.
         Current function value: 0.009763
         Iterations 12
                           Logit Regression Results                           
Dep. Variable:                  Class   No. Observations:               185124
Model:                          Logit   Df Residuals:                   185094
Method:                           MLE   Df Model:                           29
Date:                Sun, 29 Nov 2020   Pseudo R-squ.:                  0.2325
Time:                        10:13:12   Log-Likelihood:                -1807.4
converged:                       True   LL-Null:                       -2355.1
Covariance Type:            nonrobust   LLR p-value:                6.135e-212
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Time        -9.07e-05   1.97e-06    -46.015      0.000   -9.46e-05   -8.68e-05
V1             0.5406      0

We can drop `V12` and `V24` for simplicity.

In [6]:
X_train = X_train.drop(['V12', 'V24'], axis=1)
X_test = X_test.drop(['V12', 'V24'], axis=1)

X_train.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V19,V20,V21,V22,V23,V25,V26,V27,V28,Amount
13644,135878.0,-0.385107,-0.321071,1.873765,-2.109192,-0.102245,0.734279,-0.12376,0.133729,-0.720243,...,-1.220042,-0.319007,-0.265724,-0.213265,-0.045185,-0.537135,0.33761,-0.059367,-0.07691,37.8
63998,42065.0,-0.885626,0.354243,3.024195,1.229266,-0.599005,-0.373351,-0.317182,0.19173,-0.173917,...,0.278815,0.27304,0.381962,0.995761,-0.164883,0.144407,-0.095235,0.122233,0.113776,31.7
269636,10257.0,-0.40876,1.178969,1.67134,1.025924,0.111683,0.167013,0.203719,0.110314,0.415913,...,1.360106,0.06197,-0.261476,-0.578248,-0.01865,-0.562634,0.28839,-0.054011,0.105176,0.89
208294,169773.0,-1.597879,1.740235,-1.139454,-0.462958,0.440525,-1.476728,1.096408,-0.501919,1.727109,...,-0.275614,0.45755,-0.035504,0.824499,-0.178688,-0.188269,-0.225491,-0.106443,-0.14678,16.45
265542,40884.0,-2.869366,-2.379694,0.176013,2.098922,-3.063481,2.051683,5.337109,-0.302851,-1.801196,...,-2.283251,2.704565,0.616966,-0.685312,3.152048,0.128802,-0.545887,-0.358066,0.230097,1319.74


In [7]:
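from sklearn.linear_model import LogisticRegression

# Raise max_iter above the default of 100 to give the solver more iterations to converge
log_clf = LogisticRegression(max_iter=500)
log_clf.fit(X_train, y_train)

# predict() labels a row as fraud when its predicted probability exceeds 0.5
y_predict = log_clf.predict(X_test)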

In [8]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve

confusion = confusion_matrix(y_test, y_predict)
print("Confusion matrix:\n%s" % confusion)
print('\n')
print(classification_report(y_test, y_predict))

Confusion matrix:
[[99492    19]
 [   67   105]]


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     99511
           1       0.85      0.61      0.71       172

    accuracy                           1.00     99683
   macro avg       0.92      0.81      0.85     99683
weighted avg       1.00      1.00      1.00     99683
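In terms of our target metric, the confusion matrix above gives FNR = 67 / (67 + 105) ≈ 0.39, which matches the fraud-class recall of 0.61. One common way to trade precision for a lower FNR is to lower the decision threshold applied to `predict_proba`. The sketch below illustrates this on synthetic data; the threshold values 0.5 and 0.1 and the helper `fnr_at` are assumptions for illustration, not tuned values for this dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 5000 negatives, 100 positives
X_syn = np.vstack([rng.normal(0.0, 1.0, (5000, 2)), rng.normal(1.5, 1.0, (100, 2))])
y_syn = np.concatenate([np.zeros(5000, dtype=int), np.ones(100, dtype=int)])

clf_syn = LogisticRegression(max_iter=500).fit(X_syn, y_syn)
fraud_proba = clf_syn.predict_proba(X_syn)[:, 1]  # probability of class 1

def fnr_at(threshold):
    """FNR when a row is flagged as fraud if its probability >= threshold."""
    pred = (fraud_proba >= threshold).astype(int)
    false_negatives = np.sum((y_syn == 1) & (pred == 0))
    return false_negatives / np.sum(y_syn == 1)

fnr_default = fnr_at(0.5)  # the threshold predict() uses implicitly
fnr_low = fnr_at(0.1)      # a lower threshold flags more rows as fraud
print("FNR at 0.5:", fnr_default)
print("FNR at 0.1:", fnr_low)
```

Lowering the threshold can never increase the FNR, but it usually produces more false positives, so the choice depends on how costly a missed fraud is compared with a false alarm.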



In [9]:
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

confusion = confusion_matrix(y_test, y_predict)
print("Confusion matrix:\n%s" % confusion)
print('\n')
print(classification_report(y_test, y_predict))

Confusion matrix:
[[99511     0]
 [  172     0]]


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     99511
           1       0.00      0.00      0.00       172

    accuracy                           1.00     99683
   macro avg       0.50      0.50      0.50     99683
weighted avg       1.00      1.00      1.00     99683
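The SVM above predicts no frauds at all. One plausible cause, which is an assumption and not verified on this dataset, is the combination of unscaled features (such as `Time` and `Amount`) with the strong class imbalance. A minimal sketch of a variant that standardizes the features and reweights the classes, shown on synthetic data with one deliberately large-scale column:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: 2000 negatives, 40 positives, 3 informative features
n_major, n_minor = 2000, 40
informative = np.vstack([
    rng.normal(0.0, 1.0, (n_major, 3)),
    rng.normal(2.0, 1.0, (n_minor, 3)),
])
# One uninformative column on a much larger scale, loosely mimicking Time
large_scale = rng.uniform(0, 170_000, size=(n_major + n_minor, 1))
X_svm = np.hstack([large_scale, informative])
y_svm = np.concatenate([np.zeros(n_major, dtype=int), np.ones(n_minor, dtype=int)])

# StandardScaler puts all features on a comparable scale;
# class_weight="balanced" penalizes errors on the rare class more heavily
svm_clf = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
svm_clf.fit(X_svm, y_svm)
svm_pred = svm_clf.predict(X_svm)
print("Rows predicted as fraud:", svm_pred.sum())
```

On the real data, the pipeline would be fit on `X_train` and evaluated on `X_test`, so the scaler only learns its statistics from the training split.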



