# Fraud Detection Using AutoEncoders - An Unsupervised Learning Method

In this lab we'll use a credit card fraud dataset to predict fraudulent transactions. The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers two days of transactions, of which 492 out of 284,807 are frauds. The dataset is highly imbalanced: the positive class (frauds) accounts for only 0.173% of all transactions.

## Introduction

In many real-life situations you do not have labeled data, yet you still need to detect fraudulent activities. This notebook walks through how to learn features and build a model from an unlabeled dataset, and how to use that model to detect fraud.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import boto3
import os
import sagemaker
from sklearn.metrics import recall_score, classification_report, auc, roc_curve, precision_recall_curve, confusion_matrix
from sklearn.model_selection import train_test_split
from sagemaker.predictor import csv_serializer 
from sklearn.preprocessing import StandardScaler
from sagemaker.tensorflow import TensorFlow


In [None]:
# sagemaker session
session = sagemaker.Session()
sagemaker_iam_role = sagemaker.get_execution_role()

# data location
bucket = session.default_bucket()
prefix = 'sagemaker/DEMO-autoencoder-fraud'

## Investigate and process the data

Let's start by downloading and reading in the credit card fraud data set.

In [None]:
# creating directory structure
!mkdir -p ./data

In [None]:
# download from s3 to the ./data/ folder
!curl https://s3-us-west-2.amazonaws.com/sagemaker-e2e-solutions/fraud-detection/creditcardfraud.zip -o ./data/creditcardfraud.zip

In [None]:
# uncompress the dataset
!unzip -o ./data/creditcardfraud.zip -d ./data/

In [None]:
# read in data with pandas
data = pd.read_csv('./data/creditcard.csv', delimiter=',')

Let's take a peek at our data (we only show a subset of the columns in the table):

In [None]:
print(data.columns)
data.head(10)

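The last column represents the class, i.e. fraud or non-fraud. Below we plot the frequency of each class to see how the data is distributed. The autoencoder never uses this label as a training target, since this is an unsupervised technique; we use it only to select normal transactions for training and to evaluate the results.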

This dataset has 28 columns of anonymized features, $V_i$ for $i=1..28$, along with columns for time, amount, and class. The columns $V_i$ are the result of a PCA, so they are already centered at $0$ mean; their spread varies by component, as the `data.describe()` output below shows. You can read more about PCA here - https://en.wikipedia.org/wiki/Principal_component_analysis

In [None]:
data.describe()

In [None]:
nonfrauds, frauds = data.groupby('Class').size()
print('Number of frauds: ', frauds)
print('Number of non-frauds: ', nonfrauds)
print('Percentage of fraudulent data: {:.3%}'.format(frauds/(frauds + nonfrauds)))

The class column indicates whether or not a transaction is fraudulent. The vast majority of the data is non-fraudulent, with only $492$ ($0.173\%$) of the transactions being fraudulent.

In [None]:
labels = ['normal','fraud']
classes = data['Class'].value_counts(sort=True)
classes.plot(kind = 'bar', rot=0)
plt.title("Transaction class distribution")
plt.xticks(range(2), labels)
plt.xlabel("Class")
plt.ylabel("Frequency");

This clearly shows that the dataset is highly imbalanced.

In [None]:
feature_columns = data.columns[:-1]
label_column = data.columns[-1]

features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')

### Prepare Data and Upload to S3

In [None]:
# move class to the first column
model_data = data
model_data = pd.concat([model_data['Class'], model_data.drop(['Class'], axis=1)], axis=1)
model_data.head()

In [None]:
# data split and save as CSV - 70/20/10
train_df, temp_df = train_test_split(model_data, test_size=0.3, random_state=4321, shuffle=True, stratify=model_data['Class'])
val_df, test_df = train_test_split(temp_df, test_size=0.33333, random_state=4321, shuffle=True, stratify=temp_df['Class'])
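
As a quick check on the 70/20/10 comment: the second split takes about one third of the 30% holdout as the test set, which leaves roughly 20% of all rows for validation and 10% for test. The short calculation below works through that arithmetic. The variable names are illustrative and are not used elsewhere in the notebook.

```python
# Fractions passed to the two train_test_split calls above.
first_test_size = 0.3        # share of all rows held out of training
second_test_size = 0.33333   # share of the holdout that becomes the test set

train_frac = 1.0 - first_test_size
test_frac = first_test_size * second_test_size
val_frac = first_test_size - test_frac

print(f"train={train_frac:.2f}, validation={val_frac:.2f}, test={test_frac:.2f}")
```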

Also note that two columns, Time and Amount, are not scaled. Let's scale them using the standard scaler from sklearn.

In [None]:
# work on copies so the column assignments below do not trigger pandas' SettingWithCopyWarning
train_df, val_df, test_df = train_df.copy(), val_df.copy(), test_df.copy()

# fit each scaler on the training split only, then apply it to all three splits
time_scaler = StandardScaler()
time_scaler.fit(train_df['Time'].values.reshape(-1, 1))
train_df['Time'] = time_scaler.transform(train_df['Time'].values.reshape(-1, 1))
val_df['Time'] = time_scaler.transform(val_df['Time'].values.reshape(-1, 1))
test_df['Time'] = time_scaler.transform(test_df['Time'].values.reshape(-1, 1))

amount_scaler = StandardScaler()
amount_scaler.fit(train_df['Amount'].values.reshape(-1, 1))
train_df['Amount'] = amount_scaler.transform(train_df['Amount'].values.reshape(-1, 1))
val_df['Amount'] = amount_scaler.transform(val_df['Amount'].values.reshape(-1, 1))
test_df['Amount'] = amount_scaler.transform(test_df['Amount'].values.reshape(-1, 1))
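
To make the scaling step concrete, here is a minimal NumPy sketch of what a standard scaler does: it learns the mean and standard deviation on the training split and reuses those same statistics on the other splits. The arrays are synthetic stand-ins for the `Amount` column, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
demo_train_amount = rng.exponential(scale=90.0, size=1000)  # synthetic stand-in for training amounts
demo_val_amount = rng.exponential(scale=90.0, size=200)     # synthetic stand-in for validation amounts

# "fit": statistics come from the training split only
amount_mean = demo_train_amount.mean()
amount_std = demo_train_amount.std()  # population std (ddof=0), as StandardScaler uses

# "transform": the same statistics are applied to every split
demo_train_scaled = (demo_train_amount - amount_mean) / amount_std
demo_val_scaled = (demo_val_amount - amount_mean) / amount_std

print("train mean/std:", demo_train_scaled.mean(), demo_train_scaled.std())
print("validation mean/std:", demo_val_scaled.mean(), demo_val_scaled.std())
```

The training split ends up with zero mean and unit standard deviation by construction, while the validation split is only approximately standardized because it never influenced the statistics.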


In [None]:
train_df.describe()

## Build the Model - AutoEncoder

AutoEncoders are a special kind of neural network whose input is 'x' and whose target output is also 'x'. In other words, we are trying to learn a function whose output reproduces its input.

A few things to note:

- We reduce the number of nodes in the middle of the network, which forces it to learn the features of the dataset. The intuition is that this "code" is a set of abstracted features that acts as a fingerprint for "fraud" or "non-fraud" activity.
- Since we start with the input 'x', reduce it to abstracted features, and then reconstruct 'x', we don't need a labeled dataset.
- The "code" is intuitively a representation of abstracted features. For credit card fraud, the abstraction would answer "when does a fraud occur?", for example several transactions made by the same person, with the same credit card, from multiple places.

For our credit card dataset, we take all the non-fraudulent data and try to re-create it (f(x) = x). During this process the network should learn a representation of what non-fraudulent activity looks like. Once the model has been trained on what is 'normal', anything that does not match this normal representation can be declared abnormal.

For inference, we give both fraud and non-fraud data to the model. The model's predictions let us calculate the reconstruction error. This is where we set a threshold, which lets a domain expert define how much error is tolerated as normal and when to declare a data point abnormal.
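
The idea of training on normal data only and then thresholding the reconstruction error can be sketched without a neural network. The example below is only an illustration on synthetic data: it uses a linear projection (PCA via SVD) as a stand-in for the autoencoder, and every name in it, including the 99th-percentile tolerance, is an assumption rather than part of this notebook's model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "normal" data: 10 features driven by 3 latent factors plus small noise.
mixing = rng.normal(size=(3, 10))
demo_normal_train = rng.normal(size=(2000, 3)) @ mixing + 0.05 * rng.normal(size=(2000, 10))
demo_normal_test = rng.normal(size=(500, 3)) @ mixing + 0.05 * rng.normal(size=(500, 10))
# Synthetic "anomalies": points that ignore the latent structure.
demo_anomalies = rng.normal(scale=3.0, size=(20, 10))

# Linear stand-in for the autoencoder, fitted on normal data only.
center = demo_normal_train.mean(axis=0)
_, _, vt = np.linalg.svd(demo_normal_train - center, full_matrices=False)
components = vt[:3]  # the "code" has 3 dimensions

def reconstruct(x):
    code = (x - center) @ components.T   # encode
    return code @ components + center    # decode

def reconstruction_error(x):
    return np.mean(np.abs(x - reconstruct(x)), axis=1)  # per-row mean absolute error

# Hypothetical tolerance: the 99th percentile of the error on normal training data.
demo_threshold = np.percentile(reconstruction_error(demo_normal_train), 99)

normal_flagged = np.mean(reconstruction_error(demo_normal_test) > demo_threshold)
anomalies_flagged = np.mean(reconstruction_error(demo_anomalies) > demo_threshold)
print("share of normal rows flagged:", normal_flagged)
print("share of anomalies flagged:", anomalies_flagged)
```

Because the stand-in model only ever saw normal data, it reconstructs normal rows well and anomalous rows poorly, which is the same separation we hope to see from the autoencoder.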

In [None]:
train_x = train_df.loc[train_df['Class'] == 0]
train_x = train_x.drop(['Class'], axis=1)
train_x.to_csv('./data/train_normal.csv', header=False, index=False)

In [None]:
val_df.to_csv('./data/validation_normal.csv', index=False)

In [None]:
test_y = test_df['Class']
test_x = test_df.drop(['Class'], axis=1)

Let's check one more time which columns are being used in the training dataset.

In [None]:
print(train_x.columns)

In [None]:
print('Shapes are: Train=',train_x.shape,' Test=',test_x.shape)

### Upload the data to S3 using the SageMaker session

In [None]:
output_location = f's3://{bucket}/{prefix}/output'
s3_train_data = f's3://{bucket}/{prefix}/train/train_normal.csv'
s3_validation_data = f's3://{bucket}/{prefix}/validation/validation_normal.csv'

# upload to s3 bucket defined above
session.upload_data('./data/train_normal.csv',bucket,f'{prefix}/train')
session.upload_data('./data/validation_normal.csv',bucket,f'{prefix}/validation')

print('Uploaded training data location: {}'.format(s3_train_data))
print('Uploaded validation data location: {}'.format(s3_validation_data))
print('Training artifacts will be uploaded to: {}'.format(output_location))

## Build the Model - AutoEncoder

We are going to create two layers each for the encoder and the decoder.

The first layer has a dimension close to half of the network's input dimension, and the second layer is half of the first. In this case that gives 14 and 7 units.
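
To visualize the layer sizes described above, the following NumPy sketch traces a batch through a 30 → 14 → 7 → 14 → 30 stack with random, untrained weights. It is not the training script: the `tanh` activation, the linear output layer, and all names are assumptions made only to show how the shapes change.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim = 30  # Time, V1..V28, Amount
layer_dims = [input_dim, 14, 7, 14, input_dim]

# Random, untrained weights: the point is only to trace the shapes.
demo_weights = [rng.normal(scale=0.1, size=(n_in, n_out))
                for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:])]
demo_biases = [np.zeros(n_out) for n_out in layer_dims[1:]]

def forward(x):
    """Return (code, reconstruction) for a batch x of shape (n, input_dim)."""
    activations = x
    code = None
    for i, (w, b) in enumerate(zip(demo_weights, demo_biases)):
        activations = activations @ w + b
        if i < len(demo_weights) - 1:
            activations = np.tanh(activations)  # assumed hidden activation
        if i == 1:
            code = activations                  # output of the 7-unit bottleneck
    return code, activations

demo_batch = rng.normal(size=(5, input_dim))
demo_code, demo_recon = forward(demo_batch)
print("code shape:", demo_code.shape)
print("reconstruction shape:", demo_recon.shape)
```

The reconstruction has the same shape as the input, which is what allows a per-row reconstruction error to be computed later.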

In [None]:
!pygmentize ./code/fraud_ae_tf.py

## Train the AutoEncoder Model

In [None]:
tf_estimator = TensorFlow(entry_point='./code/fraud_ae_tf.py', 
                          role=sagemaker_iam_role,
                          instance_count=1, 
                          instance_type='ml.m5.xlarge',
                          framework_version='1.12', 
                          py_version='py3',
                          script_mode=True,
                          use_spot_instances=True,
                          max_run=3600,
                          max_wait=3600,
                          hyperparameters={
                              'epochs': 25,
                              'batch-size': 256,
                              'learning-rate': 0.0002}
                         )

In [None]:
tf_estimator.fit({'training': s3_train_data, 'validation': s3_validation_data})

## Predicting with AutoEncoder

Let's first deploy the trained model to a real-time endpoint.

In [None]:
%%time

import time 
tf_endpoint_name = 'keras-tf-autoencoder-'+time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

tf_predictor = tf_estimator.deploy(initial_instance_count=1,
                         instance_type='ml.m5.xlarge',      
                         endpoint_name=tf_endpoint_name)    

We are going to calculate the mean absolute error (MAE) between the predicted and the expected values; this is our reconstruction error. Here we process the whole test dataset in a single request. If needed, we could split the data into chunks and predict in batches; whether that is necessary depends on the endpoint timeout.

In [None]:
y_pred = tf_predictor.predict(test_x.to_numpy())
mae = np.mean(np.abs(test_x.to_numpy() - y_pred['predictions']), axis=1)
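
If a single request turns out to be too slow for the endpoint, the prediction can be chunked. The helper below is a hypothetical sketch, not part of the SageMaker SDK: `predict_in_batches` and `fake_predict` are illustrative names, and the fake predictor simply returns a noisy copy of its input so the example runs offline.

```python
import numpy as np

def predict_in_batches(predict_fn, data, batch_size=1000):
    """Call predict_fn on consecutive row chunks of data and stack the results."""
    outputs = []
    for start in range(0, len(data), batch_size):
        chunk = data[start:start + batch_size]
        outputs.append(np.asarray(predict_fn(chunk)))
    return np.vstack(outputs)

# Stand-in for the endpoint: returns a slightly noisy copy of its input.
rng = np.random.default_rng(0)

def fake_predict(chunk):
    return chunk + 0.01 * rng.normal(size=chunk.shape)

demo_x = rng.normal(size=(2500, 30))
demo_pred = predict_in_batches(fake_predict, demo_x, batch_size=1000)

# Per-row mean absolute error, computed the same way as in the cell above.
demo_mae = np.mean(np.abs(demo_x - demo_pred), axis=1)
print(demo_pred.shape, demo_mae.shape)
```

With the deployed endpoint, `predict_fn` could be something like `lambda chunk: tf_predictor.predict(chunk)['predictions']`, assuming the response keeps the format used in the cell above.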

Let's see how the reconstruction error relates to the true class.

In [None]:
error_df = pd.DataFrame({'reconstruction_error': mae,'true_class': test_y})
error_df.describe()

## Reconstruction error without fraud

In [None]:
error_df[(error_df['true_class']== 0)]['reconstruction_error'].describe()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
normal_error_df = error_df[error_df['true_class'] == 0]
_ = ax.hist(normal_error_df.reconstruction_error.values, bins=50)


## Reconstruction error with fraud

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
fraud_error_df = error_df[error_df['true_class'] == 1]
_ = ax.hist(fraud_error_df.reconstruction_error.values, bins=50)

In [None]:
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(111)
_ = ax.hist(error_df[error_df['true_class']==0]['reconstruction_error'].values, bins=50,density=True,color='blue',edgecolor='black',alpha=0.5,label='normal')
_ = ax.hist(error_df[error_df['true_class']==1]['reconstruction_error'], bins=50,density=True,color='red',edgecolor='black',alpha=0.5,label='fraud')
plt.legend()
plt.xlabel('reconstruction error, MAE')
plt.ylabel('normalized count')
#plt.ylim((0,50))

In [None]:
fpr, tpr, thresholds = roc_curve(error_df.true_class, error_df.reconstruction_error)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.001, 1])
plt.ylim([0, 1.001])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show();

In [None]:
precision, recall, th = precision_recall_curve(error_df.true_class, error_df.reconstruction_error)
plt.plot(recall, precision, 'b', label='Precision-Recall curve')
plt.title('Recall vs Precision')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()

In [None]:
# the last precision value has no matching threshold, so drop it
plt.plot(th, precision[:-1], 'b', label='Threshold-Precision curve')
plt.title('Precision for different threshold values')
plt.xlabel('Threshold')
plt.ylabel('Precision')
plt.show()

In [None]:
# the last recall value has no matching threshold, so drop it
plt.plot(th, recall[:-1], 'b', label='Threshold-Recall curve')
plt.title('Recall for different threshold values')
plt.xlabel('Reconstruction error')
plt.ylabel('Recall')
plt.show()

In [None]:
threshold = 2
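
The threshold is a judgment call, so it can help to see how precision and recall move as it changes. The sketch below uses a small hand-made array of errors and labels, not the notebook's results, and `precision_recall_at` is a hypothetical helper that assumes at least one fraud label is present.

```python
import numpy as np

def precision_recall_at(errors, true_class, candidate):
    """Precision and recall when rows with error > candidate are flagged as fraud."""
    errors = np.asarray(errors)
    true_class = np.asarray(true_class)
    flagged = errors > candidate
    true_pos = np.sum(flagged & (true_class == 1))
    precision = true_pos / flagged.sum() if flagged.sum() else 0.0
    recall = true_pos / np.sum(true_class == 1)
    return precision, recall

# Hand-made illustration data: reconstruction errors and their true classes.
demo_errors = np.array([0.2, 0.4, 0.5, 1.1, 2.5, 3.0, 0.3, 4.2])
demo_labels = np.array([0, 0, 0, 0, 1, 0, 1, 1])

for candidate in (0.25, 1.0, 2.0, 3.5):
    p, r = precision_recall_at(demo_errors, demo_labels, candidate)
    print(f"threshold={candidate}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold flags fewer rows, which tends to improve precision at the cost of recall; the right trade-off depends on how expensive missed frauds are compared with false alarms.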

In [None]:
groups = error_df.groupby('true_class')
fig, ax = plt.subplots()

for name, group in groups:
    ax.plot(group.index, group.reconstruction_error, marker='o', ms=3.5, linestyle='',
            label= "Fraud" if name == 1 else "Normal")
ax.hlines(threshold, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold')
ax.legend()
plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")
plt.show();

In [None]:
LABELS = ["Normal", "Fraud"]

In [None]:
y_pred = [1 if e > threshold else 0 for e in error_df.reconstruction_error.values]
conf_matrix = confusion_matrix(error_df.true_class, y_pred)
plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

In [None]:
print("Classification Report: ")
print(classification_report(y_true=error_df.true_class, y_pred=y_pred))

In [None]:
# clean up
tf_predictor.delete_endpoint()

## Data Acknowledgements


The dataset used to demonstrate the fraud detection solution has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project. We cite the following works:

- Dal Pozzolo, Andrea; Caelen, Olivier; Johnson, Reid A.; Bontempi, Gianluca. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
- Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014, Pergamon.
- Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018, IEEE.
- Dal Pozzolo, Andrea. Adaptive Machine learning for credit card fraud detection. ULB MLG PhD thesis (supervised by G. Bontempi).
- Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018, Elsevier.
- Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018, Springer International Publishing.