# Credit Card Fraud Detection

Credit card fraud is a broad term for theft and fraud committed using a payment card, such as a credit or debit card, as a fraudulent source of funds in a transaction. This notebook uses the credit card fraud dataset from Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud) to detect fraudulent transactions. The steps are:
1. Exploratory data analysis
2. Split the data into train and test sets, then normalise it.
3. Train an MLP:
     a. Two hidden layers (8 nodes and 4 nodes) and one output layer (2 nodes)
     b. ReLU activation
     c. Adam optimiser
4. Train an autoencoder (encoder-decoder):
     a. Encoder of 8 nodes
     b. tanh activation
     c. RMSProp optimiser
     d. Use the AUC score to validate the result
 

# Exploratory Data Analysis

Load data

In [1]:
import os
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # pyplot, not the top-level matplotlib package
import tensorflow as tf  # the cells below use the TensorFlow 1.x graph API
from sklearn.metrics import roc_auc_score as auc

# Silence library warnings (this also covers DeprecationWarning)
warnings.filterwarnings("ignore")



In [2]:
df = pd.read_csv('creditcard.csv')

Exploration of Data

In [3]:
df.shape

(284807, 31)

In [4]:
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

The dataset has 31 columns. Three of them are the transaction time, the amount, and the label (Class). The remaining 28 (V1 to V28) were obtained by dimensionality reduction with PCA to keep user data private.

In [5]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [6]:
df.isnull().values.any()

False

There are no null values in the dataset.

The dataset is highly unbalanced, with only 492 fraudulent entries out of 284,807 observations (about 0.17%).
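
To put the imbalance in numbers, the short calculation below uses only the two counts quoted above (the variable names are illustrative). It also shows the accuracy of a trivial classifier that labels every transaction as legitimate, which is one reason plain accuracy is a poor yardstick for this dataset.

```python
# Counts quoted in the text above
n_total = 284807
n_fraud = 492

fraud_rate = n_fraud / n_total
# Accuracy of a classifier that labels every transaction as legitimate
baseline_accuracy = 1 - fraud_rate

print("Fraud rate: {:.4%}".format(fraud_rate))
print("Always-legitimate accuracy: {:.4%}".format(baseline_accuracy))
```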

# Test and Train Split

In [7]:
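from sklearn.model_selection import train_test_split

# Sort by time and keep only the first 25,000 transactions
df.sort_values('Time', inplace=True)
df = df[:25000]

# 75/25 split. X_train and X_test keep every column of df, including 'Class'.
X_train, X_test = train_test_split(df, test_size=0.25, random_state=42)
y_train = X_train['Class']
y_test = X_test['Class']
# Copies for the autoencoder section: y_train and y_test are reassigned in the MLP section
y_train1 = X_train['Class']
y_test1 = X_test['Class']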
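
A sketch of one way to keep the label out of the feature matrix when splitting. It runs on a small synthetic array rather than creditcard.csv, and the helper name `split_features_labels` is hypothetical. The shuffle imitates a random split but is not `train_test_split` itself.

```python
import numpy as np

def split_features_labels(data, label_col, test_fraction=0.25, seed=42):
    """Shuffle rows, then return (X_train, X_test, y_train, y_test) with the
    label column removed from the feature matrices."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(data.shape[0])
    n_test = int(np.ceil(test_fraction * data.shape[0]))
    test_idx, train_idx = order[:n_test], order[n_test:]
    labels = data[:, label_col].astype(int)
    features = np.delete(data, label_col, axis=1)
    # The same index arrays are applied to features and labels, so rows stay paired
    return (features[train_idx], features[test_idx],
            labels[train_idx], labels[test_idx])

# Synthetic stand-in: 8 rows, 3 feature columns, label in the last column
toy = np.column_stack([np.arange(24, dtype=float).reshape(8, 3),
                       np.array([0, 0, 1, 0, 0, 0, 1, 0])])
X_tr, X_te, y_tr, y_te = split_features_labels(toy, label_col=-1)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Because the label is removed before any scaling or training, it cannot end up among the model inputs.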

In [8]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10827270190304489385
]


Normalise Data

In [9]:
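from sklearn.preprocessing import StandardScaler

# Scale each column to unit variance without centring (with_mean=False).
# Fit on the training set only and reuse the same scaling for the test set.
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)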
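
As a rough illustration of what scaling with `with_mean=False` amounts to, the NumPy sketch below divides each column by its standard deviation without subtracting the mean, and applies the training statistics to the test rows. The helper names and data are made up, and this is not scikit-learn's implementation.

```python
import numpy as np

def fit_scale(train):
    """Per-column standard deviation; constant columns get a scale of 1."""
    scale = train.std(axis=0)
    scale[scale == 0.0] = 1.0
    return scale

def apply_scale(data, scale):
    """Divide by the scale only: no mean is subtracted."""
    return data / scale

train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
test = np.array([[2.0, 10.0]])

scale = fit_scale(train)                 # statistics come from the training rows only
train_scaled = apply_scale(train, scale)
test_scaled = apply_scale(test, scale)   # test rows reuse the training scale
print(train_scaled.std(axis=0))
```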

# MLP

In [10]:
# Parameter initialisation
n_epochs = 20
batch_size = 200
learning_rate = 0.001  # used by the RMSProp optimiser in the autoencoder section

n_input = X_train.shape[1]
n_hidden_1 = 8
n_hidden_2 = 4
n_classes = 2
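
For orientation, the sizes implied by these settings can be worked out by hand. The sketch below assumes the 25,000-row subset and the 25% test split used earlier, and that the split rounds the test size up.

```python
import math

n_rows = 25000        # rows kept by df[:25000]
test_size = 0.25
batch_size = 200
n_epochs = 20

n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
# Same formula as total_batch in the training loops
batches_per_epoch = int(n_train / batch_size)
updates = batches_per_epoch * n_epochs

print(n_train, n_test, batches_per_epoch, updates)
```

With these numbers each epoch runs 93 mini-batches, so 20 epochs amount to 1,860 parameter updates.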

In [11]:
# Initialising placeholders: X holds the features, y_ the one-hot labels
X = tf.placeholder(tf.float32, shape=[None, n_input])
y_ = tf.placeholder(tf.int32, shape=[None,n_classes])

In [12]:
#initialising weights and biases
weights= {
    'h1': tf.Variable(tf.truncated_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.truncated_normal([n_hidden_2, n_classes])),
}
biases = {
    'b1': tf.Variable(tf.zeros([n_hidden_1])),
    'b2': tf.Variable(tf.zeros([n_hidden_2])),
    'b3': tf.Variable(tf.zeros([n_classes])),
}

ReLU Activation

In [13]:
hidden_layer_1 =  tf.nn.relu(tf.add(tf.matmul(X, weights['h1']), biases['b1']))
hidden_layer_2 =  tf.nn.relu(tf.add(tf.matmul(hidden_layer_1, weights['h2']), biases['b2']))
out_layer = tf.add(tf.matmul(hidden_layer_2, weights['out']), biases['b3'])
pred_probs = tf.nn.softmax(out_layer)
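
The same forward pass can be written in NumPy to make the shapes explicit. This is an illustrative sketch with random weights and a made-up batch, not the TensorFlow graph itself. The names `relu`, `softmax` and `forward` are not from the notebook, and 31 inputs is assumed to match the column count of df.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def forward(x, params):
    h1 = relu(x @ params["h1"] + params["b1"])       # (batch, 8)
    h2 = relu(h1 @ params["h2"] + params["b2"])      # (batch, 4)
    logits = h2 @ params["out"] + params["b3"]       # (batch, 2)
    return softmax(logits)

rng = np.random.RandomState(0)
n_input, n_hidden_1, n_hidden_2, n_classes = 31, 8, 4, 2
params = {
    "h1": rng.normal(size=(n_input, n_hidden_1)), "b1": np.zeros(n_hidden_1),
    "h2": rng.normal(size=(n_hidden_1, n_hidden_2)), "b2": np.zeros(n_hidden_2),
    "out": rng.normal(size=(n_hidden_2, n_classes)), "b3": np.zeros(n_classes),
}
probs = forward(rng.normal(size=(5, n_input)), params)
print(probs.shape, probs.sum(axis=1))
```

Each row of `probs` holds the two class probabilities and sums to 1. Column 1 plays the role of the fraud probability that the notebook passes to the AUC score.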

Adam Optimiser

In [14]:
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=out_layer))

optimizer = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
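
For reference, the quantity being minimised is the mean softmax cross-entropy between one-hot labels and logits. The NumPy sketch below follows that standard definition. The function name and the toy inputs are illustrative.

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    """Mean over rows of -sum(labels * log_softmax(logits))."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(np.mean(-(labels * log_probs).sum(axis=1)))

labels = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot rows
logits = np.array([[2.0, 0.0], [0.0, 0.0]])   # raw outputs before softmax
print(softmax_cross_entropy(labels, logits))
```

The second row has equal logits, so it contributes log(2) to the mean. The first row is confidently correct and contributes much less.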





In [15]:
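# One-hot encode the labels as a NumPy array, then slice out train and validation sets

Y = np.zeros((df.shape[0], 2))
Y[range(df.shape[0]), df['Class'].values] = 1              # one-hot encoding
# NOTE: Y follows the row order of df, whereas X_train was shuffled by train_test_split,
# so the label slices below are not row-aligned with the feature slices.
x_train = X_train[:int(X_train.shape[0] * 0.8)]             # first 80% of X_train
y_train = Y[:int(X_train.shape[0] * 0.8)]
y_test = Y[X_train.shape[0]:]
# NOTE: validation starts at the 20% mark, so it overlaps the training slice
validation_x = X_train[int(X_train.shape[0] * 0.2):]
validation_y = Y[int(X_train.shape[0] * 0.2):X_train.shape[0]]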


# Initializing the variables
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    total_batch = int(X_train.shape[0]/batch_size)
    # Training cycle
    for epoch in range(n_epochs):
        # Loop over all mini batches
        for i in range(total_batch):
            batch_idx = np.random.choice(x_train.shape[0], batch_size)
            batch_xs = x_train[batch_idx]
            batch_ys = y_train[batch_idx]
            # Run optimiser
            _, c = sess.run([optimizer, cross_entropy], feed_dict={X: batch_xs, y_: batch_ys})


        #Display logs per epoch step
        val_probs = sess.run(pred_probs, feed_dict={X: validation_x})
        print("Epoch:", '%04d' % (epoch+1),
              "loss=", "{:.9f}".format(c), 
              "Auc Value=", "{:.6f}".format(auc(validation_y[:, 1], val_probs[:, 1])))

    print("Optimization Finished!")
    test_probs = sess.run(pred_probs, feed_dict={X: X_test})
    
    print("Test auc score: {}".format(auc(y_test[:, 1], test_probs[:, 1])))

Epoch: 0001 loss= 0.395792812 Auc Value= 0.446548
Epoch: 0002 loss= 0.290122271 Auc Value= 0.448552
Epoch: 0003 loss= 0.163585544 Auc Value= 0.450091
Epoch: 0004 loss= 0.151765630 Auc Value= 0.451070
Epoch: 0005 loss= 0.193790928 Auc Value= 0.453087
Epoch: 0006 loss= 0.227496713 Auc Value= 0.455138
Epoch: 0007 loss= 0.236617491 Auc Value= 0.455749
Epoch: 0008 loss= 0.142748922 Auc Value= 0.457164
Epoch: 0009 loss= 0.134193167 Auc Value= 0.457846
Epoch: 0010 loss= 0.186997682 Auc Value= 0.458721
Epoch: 0011 loss= 0.096065715 Auc Value= 0.460010
Epoch: 0012 loss= 0.102554210 Auc Value= 0.461164
Epoch: 0013 loss= 0.187635824 Auc Value= 0.462299
Epoch: 0014 loss= 0.100742072 Auc Value= 0.463049
Epoch: 0015 loss= 0.169581801 Auc Value= 0.463922
Epoch: 0016 loss= 0.109514125 Auc Value= 0.464569
Epoch: 0017 loss= 0.136733741 Auc Value= 0.465322
Epoch: 0018 loss= 0.160967365 Auc Value= 0.465747
Epoch: 0019 loss= 0.122621067 Auc Value= 0.465869
Epoch: 0020 loss= 0.153940678 Auc Value= 0.466207
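
One detail worth checking in the cell above is that the one-hot labels stay row-aligned with the shuffled features. A minimal sketch of an aligned 80:20 train/validation split follows. The data is synthetic, and `one_hot` and `train_val_split` are hypothetical helpers, not the notebook's code.

```python
import numpy as np

def one_hot(labels, n_classes=2):
    encoded = np.zeros((labels.shape[0], n_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1
    return encoded

def train_val_split(features, labels, train_fraction=0.8):
    """Cut features and labels at the same row so pairs stay together."""
    cut = int(features.shape[0] * train_fraction)
    return features[:cut], labels[:cut], features[cut:], labels[cut:]

features = np.arange(20, dtype=float).reshape(10, 2)   # 10 already-shuffled rows
labels = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])      # labels in the same row order

x_tr, y_tr, x_val, y_val = train_val_split(features, one_hot(labels))
print(x_tr.shape, y_tr.shape, x_val.shape, y_val.shape)
```

Cutting both arrays at the same index keeps the training and validation rows disjoint and keeps every label paired with its own feature row.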


# Encoder-Decoder

In [16]:
#initialise placeholder, weights and biases
X = tf.placeholder("float", [None, n_input])

weights = {
    'w1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'w2': tf.Variable(tf.random_normal([n_hidden_1, n_input])),
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_input])),
}

In [17]:
# Encoder Hidden layer with tanh activation
encoder_layer_1 = tf.nn.tanh(tf.add(tf.matmul(X, weights['w1']), biases['b1']))
decoder_layer_1 = tf.nn.tanh(tf.add(tf.matmul(encoder_layer_1, weights['w2']),biases['b2']))

In [18]:
#predicted Y
y_pred = decoder_layer_1
#True Y
y_true = X

In [19]:
# Per-row mean squared reconstruction error (used as the score passed to auc)
batch_mse = tf.reduce_mean(tf.pow(y_true - y_pred, 2), 1)

# Define loss and RMSProp Optimizer, minimize the squared error
loss = tf.reduce_mean(tf.pow(y_true - y_pred, 2))
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
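
The two reductions above differ only in their axis: `batch_mse` keeps one error per row, while `loss` averages everything into a single scalar. The NumPy sketch below shows this with a random single-hidden-layer tanh autoencoder. The weights and inputs are made up, and 31 inputs is assumed to match the column count of df.

```python
import numpy as np

rng = np.random.RandomState(0)
n_input, n_hidden = 31, 8
x = rng.normal(size=(4, n_input))

w1, b1 = rng.normal(size=(n_input, n_hidden)), rng.normal(size=n_hidden)
w2, b2 = rng.normal(size=(n_hidden, n_input)), rng.normal(size=n_input)

encoded = np.tanh(x @ w1 + b1)                # (4, 8) bottleneck
reconstructed = np.tanh(encoded @ w2 + b2)    # (4, 31), every value within [-1, 1]

row_mse = np.mean((x - reconstructed) ** 2, axis=1)   # one score per row
scalar_loss = np.mean((x - reconstructed) ** 2)       # single training loss

print(row_mse.shape, scalar_loss)
```

Since every row has the same number of columns, the mean of `row_mse` equals `scalar_loss`. The per-row values are what get ranked when scoring anomalies.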

In [20]:
# Initializing the variables
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    total_batch = int(X_train.shape[0]/batch_size)
    # Training cycle
    for epoch in range(n_epochs):
        # Loop over all batches
        for i in range(total_batch):
            batch_idx = np.random.choice(X_train.shape[0], batch_size)
            batch_xs = X_train[batch_idx]
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, loss], feed_dict={X: batch_xs})
            
        # Display logs per epoch step
        train_batch_mse = sess.run(batch_mse, feed_dict={X: X_train})
        print("Epoch:", '%04d' % (epoch+1),
              "loss=", "{:.9f}".format(c), 
              "Train Auc=", "{:.6f}".format(auc(y_train1, train_batch_mse)))

    print("Optimization Finished!")
    test_batch_mse = sess.run(batch_mse, feed_dict={X: X_test})
    print("Test auc score: {:.6f}".format(auc(y_test1, test_batch_mse)))
    

Epoch: 0001 loss= 1.736669660 Train Auc= 0.997516
Epoch: 0002 loss= 1.494547486 Train Auc= 0.997532
Epoch: 0003 loss= 1.819013476 Train Auc= 0.997541
Epoch: 0004 loss= 1.213060141 Train Auc= 0.997556
Epoch: 0005 loss= 1.621577978 Train Auc= 0.997563
Epoch: 0006 loss= 1.089150071 Train Auc= 0.997583
Epoch: 0007 loss= 0.933269382 Train Auc= 0.997580
Epoch: 0008 loss= 1.380723596 Train Auc= 0.997597
Epoch: 0009 loss= 1.258286715 Train Auc= 0.997616
Epoch: 0010 loss= 0.810753524 Train Auc= 0.997624
Epoch: 0011 loss= 0.893336356 Train Auc= 0.997627
Epoch: 0012 loss= 0.925928354 Train Auc= 0.997637
Epoch: 0013 loss= 0.797889352 Train Auc= 0.997649
Epoch: 0014 loss= 0.883463025 Train Auc= 0.997657
Epoch: 0015 loss= 0.983460784 Train Auc= 0.997657
Epoch: 0016 loss= 0.688339233 Train Auc= 0.997660
Epoch: 0017 loss= 0.643325567 Train Auc= 0.997652
Epoch: 0018 loss= 0.998958409 Train Auc= 0.997655
Epoch: 0019 loss= 0.816455722 Train Auc= 0.997661
Epoch: 0020 loss= 0.929107070 Train Auc= 0.997661
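
The AUC reported above treats each row's reconstruction error as an anomaly score. It equals the probability that a randomly chosen fraud row scores higher than a randomly chosen legitimate row, with ties counted as one half. A small NumPy sketch of that pairwise definition follows. The function name and the toy scores are illustrative, and the notebook itself uses `roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

y = [0, 0, 0, 1, 0, 1]
errors = [0.2, 0.4, 0.1, 0.9, 0.8, 0.7]
print(pairwise_auc(y, errors))
```

In this toy case 7 of the 8 fraud/legitimate pairs are ordered correctly, giving 0.875. The all-pairs comparison is quadratic in the number of rows, so it is meant for illustration rather than for large datasets.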


In [21]:
# Optional: save the trained model (run inside the tf.Session block above; data_dir must be defined)
# save_model = os.path.join(data_dir, 'temp_saved_model_1layer.ckpt')
# saver = tf.train.Saver()
# save_path = saver.save(sess, save_model)
# print("Model saved in file: %s" % save_path)
    

# Conclusion

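The autoencoder gives a test AUC score of nearly 0.99 (its train AUC in the log above is about 0.998), while the MLP gives a much lower AUC score (a validation AUC of about 0.47 after 20 epochs).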