# Credit Card Fraud

We will detect credit card fraud from the features of our dataset using three different models. This notebook covers the logistic regression model.

Our goal is to minimize the false negative rate (FNR), i.e. the share of actual frauds that the model fails to flag.
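
For reference, the standard definition of the metric, written with $TP$ and $FN$ as the counts of true positives and false negatives for the fraud class (these symbols are introduced here and are not used elsewhere in the notebook):

```latex
\mathrm{FNR} = \frac{FN}{FN + TP} = 1 - \mathrm{recall}
```

Minimizing the FNR is therefore the same as maximizing recall on the fraud class.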

Since the dataset is imbalanced, we can try two techniques that may improve our predictions:

- Add Gaussian noise to the fraud data to create more fraud samples and reduce the imbalance
- Randomly sample the fraud data, train k models, and average them (or choose the best)
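
The two ideas above could be sketched roughly as follows. This is only an illustration on synthetic data, not the approach used later in this notebook. The helper names (`augment_with_noise`, `fit_undersampled_ensemble`, `predict_ensemble`), the noise scale, and `k=5` are all assumptions. The second bullet does not spell out the sampling scheme, so the sketch assumes one common reading: each model sees every fraud plus an equally sized random subset of non-frauds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def augment_with_noise(X_minority, n_copies=5, noise_scale=0.05, seed=0):
    """Return the minority rows plus n_copies jittered copies of them."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Scale the noise per feature so every column is perturbed proportionally.
    sigma = noise_scale * X_minority.std(axis=0)
    copies = [X_minority + rng.normal(0.0, 1.0, X_minority.shape) * sigma
              for _ in range(n_copies)]
    return np.vstack([X_minority, *copies])


def fit_undersampled_ensemble(X, y, k=5, seed=0):
    """Fit k models, each on all frauds plus an equal-sized non-fraud sample."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    fraud_idx = np.flatnonzero(y == 1)
    normal_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(k):
        sampled = rng.choice(normal_idx, size=len(fraud_idx), replace=False)
        idx = np.concatenate([fraud_idx, sampled])
        models.append(LogisticRegression(max_iter=500).fit(X[idx], y[idx]))
    return models


def predict_ensemble(models, X, threshold=0.5):
    """Average the fraud probabilities of all models and threshold the mean."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= threshold).astype(int)


# Synthetic stand-in for the real data: 1000 "normal" rows, 20 "fraud" rows.
rng = np.random.default_rng(42)
X_normal = rng.normal(0.0, 1.0, size=(1000, 3))
X_fraud = rng.normal(2.0, 1.0, size=(20, 3))
X_demo = np.vstack([X_normal, X_fraud])
y_demo = np.array([0] * 1000 + [1] * 20)

X_fraud_aug = augment_with_noise(X_fraud, n_copies=5)
models = fit_undersampled_ensemble(X_demo, y_demo, k=5)
y_hat = predict_ensemble(models, X_demo)
print(X_fraud_aug.shape, len(models), y_hat.shape)
```

If either technique is applied to the real data, it should be applied to the training split only, so that the test set keeps the original class balance.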
    
 

In [1]:
import numpy as np
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import random

In [2]:
import os 

data_path = os.path.join('./dataset', 'creditcard.csv')
df = pd.read_csv(data_path, low_memory=False)
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,133521.0,-0.637538,0.296389,0.18733,-1.105026,1.80686,-0.574477,0.845785,-0.019007,-0.078737,...,-0.301556,-1.007152,-0.079411,0.029798,-0.279752,0.046322,-0.144923,-0.071432,3.99,0
1,149237.0,-0.987315,1.538775,-0.518598,-0.216469,0.245469,-0.93791,0.597796,0.409474,-2.700244,...,-0.043974,0.084921,-0.213016,1.075975,0.487331,-0.455578,-0.260338,-0.001232,3.0,0
2,166066.0,-1.146722,1.888179,-1.820423,-1.279848,0.618486,-1.33588,0.854469,0.558976,-0.542045,...,0.288139,0.862489,-0.193448,-0.455373,-0.135807,0.112807,0.348545,0.24906,1.46,0
3,59278.0,-1.100552,-0.646598,2.38453,-1.618709,-0.307001,0.379083,-0.865919,0.488413,-0.330671,...,0.332767,1.057012,-0.47855,-0.434706,0.813838,0.049642,0.295435,0.114912,20.0,0
4,133906.0,-0.14398,0.938837,-0.685807,-0.080526,0.393012,-1.171641,1.44646,-0.511394,-0.078431,...,0.305385,1.077396,-0.074748,-0.005408,-0.356848,-0.179243,-0.057604,0.125009,102.85,0


In [3]:
frauds = df.loc[df['Class'] == 1]
non_frauds = df.loc[df['Class'] == 0]
print("We have", len(frauds), "fraud data points and", len(non_frauds), "nonfraudulent data points.")

We have 492 fraud data points and 284315 nonfraudulent data points.
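
With counts this skewed, plain accuracy says very little. The short calculation below uses only the two counts printed above to show how well a trivial classifier that never predicts fraud would score. The variable names are illustrative.

```python
# Class counts taken from the output above.
n_fraud = 492
n_non_fraud = 284315
total = n_fraud + n_non_fraud

# A model that always answers "not fraud" is right on every non-fraud row.
baseline_accuracy = n_non_fraud / total
fraud_share = n_fraud / total

print(f"Fraud share: {fraud_share:.4%}")
print(f"Accuracy of always predicting 'not fraud': {baseline_accuracy:.4%}")
```

Such a baseline misses every fraud, so its false negative rate is 1.0 even though its accuracy is above 99.8%.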


In [4]:
from sklearn import datasets, linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

X = df.iloc[:,:-1]
y = df['Class']

print("X and y sizes, respectively:", X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35)
print("Train and test sizes, respectively:", len(X_train), len(y_train), "|", len(X_test), len(y_test))
# Count frauds directly on each Series instead of indexing with a mask built from df.
print("Total number of frauds:", (y == 1).sum(), '--', (y == 1).mean())
print("Number of frauds on y_test:", (y_test == 1).sum(), '--', (y_test == 1).mean())
print("Number of frauds on y_train:", (y_train == 1).sum(), '--', (y_train == 1).mean())

X and y sizes, respectively: (284807, 30) (284807,)
Train and test sizes, respectively: 185124 185124 | 99683 99683
Total number of frauds: 492 -- 0.001727485630620034
Number of frauds on y_test: 153 -- 0.0015348655237101612
Number of frauds on y_train: 339 -- 0.0018312050301419588
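
The split above is not stratified, so the fraud rate differs between the training set (about 0.18%) and the test set (about 0.15%). One option, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below shows the effect on synthetic data. `X_demo`, `y_demo`, and the fixed `random_state` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with 20 positives (0.2%).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 4))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the class proportions (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, stratify=y_demo, random_state=0
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate: ", y_te.mean())
```

Applied to the real data, this would amount to adding `stratify=y` to the existing call. The numbers reported in the rest of the notebook come from the unstratified split.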


In [5]:
import statsmodels.api as sm

logit_model = sm.Logit(y_train,X_train)
result = logit_model.fit()
print(result.summary())

# y_predicted = np.array(logistic.predict(X_test))
# y_right = np.array(y_test)

Optimization terminated successfully.
         Current function value: 0.009919
         Iterations 12
                           Logit Regression Results                           
Dep. Variable:                  Class   No. Observations:               185124
Model:                          Logit   Df Residuals:                   185094
Method:                           MLE   Df Model:                           29
Date:                Fri, 27 Nov 2020   Pseudo R-squ.:                  0.2582
Time:                        23:55:12   Log-Likelihood:                -1836.2
converged:                       True   LL-Null:                       -2475.3
Covariance Type:            nonrobust   LLR p-value:                8.542e-251
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Time       -9.273e-05      2e-06    -46.441      0.000   -9.66e-05   -8.88e-05
V1             0.6174      0

We can drop `V12` and `V24` for simplicity.
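
The notebook does not state the criterion used to pick these two columns. One common heuristic, shown here purely as an assumption, is to drop features whose p-value in the fitted model exceeds a chosen significance level. The helper `columns_above_alpha`, the feature names, and the p-values below are made up for illustration. With the fitted statsmodels result, the `result.pvalues` Series could be passed instead.

```python
import pandas as pd


def columns_above_alpha(pvalues, alpha=0.05):
    """Return the names of features whose p-value exceeds alpha."""
    return pvalues[pvalues > alpha].index.tolist()


# Made-up p-values for hypothetical features; not taken from the summary above.
example_pvalues = pd.Series(
    {"feat_a": 0.001, "feat_b": 0.20, "feat_c": 0.03, "feat_d": 0.60}
)

to_drop = columns_above_alpha(example_pvalues)
print(to_drop)
```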

In [6]:
X_train = X_train.drop(['V12', 'V24'], axis=1)
X_test = X_test.drop(['V12', 'V24'], axis=1)

X_train.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V19,V20,V21,V22,V23,V25,V26,V27,V28,Amount
138657,62043.0,-0.798938,1.495366,1.125734,0.661778,0.170331,0.041542,0.315464,0.508031,-0.813616,...,0.343001,0.071559,-0.138857,-0.34426,-0.058045,-0.122893,-0.49525,0.279362,0.125115,1.18
127359,48640.0,0.991741,-0.956233,0.606811,-1.220208,-1.223123,-0.245266,-0.614406,0.191199,1.76721,...,0.702099,0.025903,0.219173,0.71166,-0.184273,0.436225,0.14875,0.026962,0.020726,99.95
14315,35723.0,1.175925,0.155353,0.678996,1.327546,-0.527452,-0.559155,-0.014592,-0.090783,0.477159,...,0.047419,-0.131084,-0.249055,-0.557257,0.017509,0.518787,-0.526744,0.036864,0.028948,17.62
153558,80059.0,1.159657,0.134954,0.585037,0.506047,-0.362031,-0.321897,-0.134882,0.08411,-0.186849,...,-0.262652,-0.139834,-0.168162,-0.50458,0.182394,0.077123,0.09519,-0.018086,0.006852,1.78
147962,85710.0,1.193034,-1.493623,0.527295,-1.290145,-1.889211,-0.765774,-0.954508,-0.046217,-1.943084,...,0.023915,-0.125894,-0.32207,-0.988427,0.122612,-0.027468,-0.520887,-0.00108,0.040963,147.19


In [7]:
from sklearn.linear_model import LogisticRegression 
 
log_clf = LogisticRegression(max_iter=500)
log_clf.fit(X_train, y_train)
 
y_predict = log_clf.predict(X_test)

In [8]:
from sklearn.metrics import classification_report, confusion_matrix

confusion = confusion_matrix(y_test, y_predict)
print("Confusion matrix:\n%s" % confusion)
print('\n')
print(classification_report(y_test, y_predict))

Confusion matrix:
[[99506    24]
 [   65    88]]


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     99530
           1       0.79      0.58      0.66       153

    accuracy                           1.00     99683
   macro avg       0.89      0.79      0.83     99683
weighted avg       1.00      1.00      1.00     99683
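
Since the stated goal is a low false negative rate, it helps to read that number off the confusion matrix directly. The snippet below hard-codes the matrix printed above (rows are true classes, columns are predicted classes, as returned by `confusion_matrix`). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
confusion = np.array([[99506, 24],
                      [65, 88]])
tn, fp, fn, tp = confusion.ravel()

fnr = fn / (fn + tp)        # share of actual frauds that were missed
recall = tp / (tp + fn)     # share of actual frauds that were caught
precision = tp / (tp + fp)  # share of fraud predictions that were correct

print(f"FNR: {fnr:.3f}  recall: {recall:.3f}  precision: {precision:.3f}")
```

The model misses 65 of the 153 frauds in the test set, an FNR of roughly 42%, which matches the 0.58 recall in the classification report.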

