# Credit Card Fraud

We detect credit card fraud from the features in our dataset using three different models. This notebook covers the Logistic Regression model.

Our goal is to minimize the False Negative Rate (FNR): the fraction of actual frauds that the model labels as non-fraudulent.
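As a minimal sketch of how this metric can be computed, the snippet below derives FNR = FN / (FN + TP) from scikit-learn's `confusion_matrix`. The helper name `false_negative_rate` and the toy labels are made up for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def false_negative_rate(y_true, y_pred):
    """Return FN / (FN + TP) for binary labels where 1 marks a fraud."""
    # For labels [0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn / (fn + tp)


# Toy labels (hypothetical): 4 frauds, 1 of which is missed by the predictions.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

fnr = false_negative_rate(y_true, y_pred)
print(fnr)  # 0.25
```

The false positive at index 1 does not affect the result, which is why FNR alone says nothing about how many legitimate transactions get flagged.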

Since the dataset is heavily imbalanced, we can try two techniques that may improve our predictions:

- Adding Gaussian noise to the fraud data to create extra fraud samples and reduce the imbalance
- Randomly sampling the fraud data, training k models, and averaging them (or choosing the best one)
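The two ideas above could be sketched roughly as follows. This is an illustration on synthetic data, not the notebook's implementation: the function names, the noise scale `sigma`, and the choice to pair all frauds with an equally sized random sample of non-frauds for each of the k models are assumptions (one possible reading of the second bullet).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def augment_with_noise(X_fraud, n_copies=5, sigma=0.05):
    """Technique 1: return the fraud rows plus n_copies jittered copies of them.

    The noise is scaled by each feature's standard deviation so that
    wide-ranging features are perturbed proportionally.
    """
    X_fraud = np.asarray(X_fraud, dtype=float)
    feature_std = X_fraud.std(axis=0, keepdims=True)
    copies = [
        X_fraud + rng.normal(0.0, sigma, X_fraud.shape) * feature_std
        for _ in range(n_copies)
    ]
    return np.vstack([X_fraud] + copies)


def train_balanced_ensemble(X, y, k=5):
    """Technique 2 (assumed reading): fit k models, each on all frauds
    plus an equally sized random sample of non-frauds."""
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(k):
        sampled = rng.choice(legit_idx, size=len(fraud_idx), replace=False)
        idx = np.concatenate([fraud_idx, sampled])
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models


def predict_average(models, X):
    """Average the fraud probabilities of all models and threshold at 0.5."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)


# Synthetic stand-in for the real data: 2000 rows, 40 of them "frauds".
X_demo = rng.normal(size=(2000, 4))
y_demo = np.zeros(2000, dtype=int)
y_demo[:40] = 1
X_demo[:40] += 2.0  # shift the frauds so the classes are separable

augmented = augment_with_noise(X_demo[y_demo == 1])
models = train_balanced_ensemble(X_demo, y_demo, k=5)
predictions = predict_average(models, X_demo)
print(augmented.shape, len(models), predictions.shape)
```

Choosing the best of the k models instead of averaging would replace `predict_average` with a selection step on a held-out set, for example by lowest FNR.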
    
 

In [3]:
import numpy as np
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import scale
import random

In [4]:
df = pd.read_csv('creditcard.csv', low_memory=False)
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,145721.0,1.965977,-0.101403,-1.211262,1.270604,0.244049,-0.044854,-0.059925,0.052383,0.564367,...,0.209118,0.816229,-0.000331,0.703329,0.376746,-0.420003,0.003401,-0.060886,2.0,0
1,776.0,1.139557,0.159935,0.049789,0.54235,-0.024712,-0.788885,0.480907,-0.332393,-0.31456,...,-0.462033,-1.508388,0.105812,-0.114298,0.183347,-0.112136,-0.049833,0.031446,88.88,0
2,37569.0,0.989122,-1.32929,0.503809,-1.077895,-0.769903,1.575578,-1.194731,0.670416,-0.401293,...,-0.237766,-0.211168,0.155213,-0.98193,-0.323915,1.424766,-0.01076,-0.004914,89.9,0
3,68203.0,1.468156,-1.059905,0.306992,-1.501983,-1.349983,-0.506555,-1.00017,-0.084204,-2.080436,...,-0.174713,-0.182721,-0.031887,-0.007128,0.394275,-0.205539,0.022378,0.010371,30.0,0
4,117822.0,1.873501,-1.292506,-1.898019,-0.37675,-0.302375,-0.313058,-0.182886,-0.206041,-0.066789,...,-0.009603,0.137327,-0.174894,0.013225,0.223711,-0.023733,-0.036527,-0.024956,190.0,0


In [5]:
frauds = df.loc[df['Class'] == 1]
non_frauds = df.loc[df['Class'] == 0]
print("We have", len(frauds), "fraud data points and", len(non_frauds), "nonfraudulent data points.")

We have 492 fraud data points and 284315 nonfraudulent data points.


In [9]:
from sklearn import datasets, linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

In [10]:
X = df.iloc[:,:-1]
y = df['Class']

print("X and y sizes, respectively:", len(X), len(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35)
print("Train and test sizes, respectively:", len(X_train), len(y_train), "|", len(X_test), len(y_test))
# Count frauds directly on each Series instead of masking it with df['Class']
print("Total number of frauds:", (y == 1).sum(), (y == 1).sum() / len(y))
print("Number of frauds on y_test:", (y_test == 1).sum(), (y_test == 1).sum() / len(y_test))
print("Number of frauds on y_train:", (y_train == 1).sum(), (y_train == 1).sum() / len(y_train))

X and y sizes, respectively: 284807 284807
Train and test sizes, respectively: 185124 185124 | 99683 99683
Total number of frauds: 492 0.001727485630620034
Number of frauds on y_test: 163 0.0016351835317957926
Number of frauds on y_train: 329 0.001777187182645146
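The fraud share differs slightly between the two splits (about 0.164% in test versus 0.178% in train) because the split is purely random. One possible alternative, not used in this notebook, is to pass `stratify=y` to `train_test_split` so both splits keep nearly the same class ratio. A small sketch on synthetic labels (the `*_demo` names and sizes are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, heavily imbalanced labels: 20 positives out of 10000 rows.
X_demo = rng.normal(size=(10000, 3))
y_demo = np.zeros(10000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, stratify=y_demo, random_state=0
)

print("positives in train:", y_tr.sum(), "share:", y_tr.mean())
print("positives in test:", y_te.sum(), "share:", y_te.mean())
```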


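In [28]:
# `logistic` is not defined above this point; going by the execution counts,
# it was fitted in an earlier-executed cell (see In [20] below).
y_predicted = logistic.predict(X_test)  # predict() already returns a NumPy array
y_right = np.array(y_test)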

In [29]:
confusion = confusion_matrix(y_right, y_predicted)
print("Confusion matrix:\n%s" % confusion)

Confusion matrix:
[[95831  3689]
 [   14   149]]


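In [27]:
# FNR = FN / (FN + TP), read from the confusion matrix above rather than hard-coded counts
tn, fp, fn, tp = confusion.ravel()
print("False negative rate: %.4f" % (fn / (fn + tp)))

False negative rate: 0.0859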


# Logistic Regression with balanced class weights
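To give a feel for what `class_weight="balanced"` does, the sketch below reproduces the weights scikit-learn assigns, `n_samples / (n_classes * np.bincount(y))`, using the class counts of the training split printed earlier (329 frauds out of 185124 rows). The array `y_demo` is a synthetic stand-in with those counts, not the real labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the same class counts as the training split.
y_demo = np.zeros(185124, dtype=int)
y_demo[:329] = 1

# Weights scikit-learn derives for class_weight="balanced".
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# The same weights computed by hand from the documented formula.
manual = len(y_demo) / (2 * np.bincount(y_demo))

print("non-fraud weight:", weights[0])
print("fraud weight:", weights[1])
```

Each fraud row therefore counts several hundred times more than a non-fraud row in the loss, which pushes the model toward fewer false negatives at the cost of more false positives.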

In [20]:
best_c, best_fnr = 1, 1
for _ in range(20):
    c = random.uniform(1, 10000)
    logistic = linear_model.LogisticRegression(C=c, class_weight="balanced")
    logistic.fit(X_train, y_train)
    #print("Score: ", logistic.score(X_test, y_test))
    y_predicted2 = np.array(logistic.predict(X_test))
    y_right2 = np.array(y_test)
    confusion_matrix2 = confusion_matrix(y_right2, y_predicted2)
    #print("Confusion matrix:\n%s" % confusion_matrix2)
    #confusion_matrix2.plot(normalized=True)
    #plt.show()
    #confusion_matrix2.print_stats()
    