# Credit Card Fraud Detection
_**Predict fraudulent credit card transactions using Amazon SageMaker's Linear-Learner Algorithm**_

----

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)
1. [Predict](#Predict)
1. [CleanUp](#CleanUp)




## Background

A fraudulent transaction is one not authorized by the credit card holder. It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.  This notebook illustrates how to use one of SageMaker's built-in algorithms, Linear Learner, to predict whether a credit card transaction is fraudulent.

The dataset used in this notebook contains credit card transactions made by European cardholders in September 2013, over a period of two days.  Out of the total 284,807 transactions, 492 are identified as fraudulent. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

More details about the dataset can be found here - https://www.kaggle.com/mlg-ulb/creditcardfraud

## Setup

* Import the necessary libraries.
* Specify the SageMaker role ARN used to give access to your data. The snippet below uses the same role as your SageMaker notebook instance. If you are running this notebook elsewhere, specify the full ARN of a role with the AmazonSageMakerFullAccess policy attached.
* Specify the S3 bucket that you want to use for training and storing model artifacts.

In [None]:
import os
import boto3
import re
import io
import json
import sagemaker
from sagemaker import get_execution_role

import numpy as np 
import pandas as pd


In [None]:
role = get_execution_role()

# We will use the S3 bucket associated with the session.  You can always choose to use a different S3 bucket as needed.
sess = sagemaker.Session()
bucket = sess.default_bucket()

prefix = 'fraud-detection-linear-learner' # folder to upload training files within the bucket

## Data

#### Data Acknowledgements

The dataset used to demonstrate the fraud detection solution was collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the [DefeatFraud](https://mlg.ulb.ac.be/wordpress/portfolio_page/defeatfraud-assessment-and-validation-of-deep-feature-engineering-and-learning-solutions-for-fraud-detection/) project.
We cite the following works:
* Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015
* Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon
* Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE
* Dal Pozzolo, Andrea. Adaptive Machine learning for credit card fraud detection, ULB MLG PhD thesis (supervised by G. Bontempi)
* Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier
* Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Let's start by downloading and reading in the credit card fraud data set.

In [None]:
%%bash
wget https://s3-us-west-2.amazonaws.com/sagemaker-e2e-solutions/fraud-detection/creditcardfraud.zip
unzip creditcardfraud.zip

In [None]:
data = pd.read_csv('creditcard.csv', delimiter=',')

Let's take a peek at our data (we only show a subset of the columns in the table):

In [None]:
print(data.columns)
data[['Time', 'V1', 'V2', 'V27', 'V28', 'Amount', 'Class']].describe()

##TODO : Show more/less details about the data? Make it clear again about the anonymized data.

The `Class` column indicates whether or not a transaction is fraudulent. The majority of the data is non-fraudulent, with only $492$ (about $0.17\%$) of the examples being fraudulent.

In [None]:
nonfrauds, frauds = data.groupby('Class').size()
print('Number of frauds: ', frauds)
print('Number of non-frauds: ', nonfrauds)
print('Percentage of fraudulent data:', 100.*frauds/(frauds + nonfrauds))

This dataset has 28 columns of anonymized features, $V_i$ for $i=1..28$, along with columns for time, amount, and class. The columns $V_i$ are principal components produced by a PCA, so they are centered at $0$ mean. Their standard deviations are not all equal to one, as the `describe()` output above shows.

Tip: For our dataset this amount of preprocessing gives reasonable accuracy, but further preprocessing steps can improve it. For unbalanced datasets like ours, where the positive (fraudulent) examples occur much less frequently than the negative (legitimate) examples, we may try “over-sampling” the minority class by generating synthetic data (read about SMOTE in Data Mining for Imbalanced Datasets: An Overview, https://link.springer.com/chapter/10.1007%2F0-387-25465-X_40) or undersampling the majority class by using ensemble methods (see http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.6858&rep=rep1&type=pdf).
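
As a rough illustration of the undersampling idea in the tip above, the sketch below randomly drops majority-class rows until a chosen ratio of negatives to positives remains. It is plain random undersampling, not the ensemble method from the linked paper and not SMOTE. The function name `undersample_majority`, its `ratio` parameter, and the toy arrays are assumptions made for this example only.

```python
import numpy as np

def undersample_majority(features, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class (label 0) rows.

    Keeps every minority-class (label 1) row and at most
    ratio * n_minority majority-class rows, sampled without replacement.
    """
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(labels == 1)
    majority_idx = np.flatnonzero(labels == 0)
    n_keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return features[keep], labels[keep]

# Toy data (hypothetical, not the real dataset): 1000 rows, 10 positives.
toy_labels = np.zeros(1000, dtype="float32")
toy_labels[:10] = 1
toy_features = np.arange(2000, dtype="float32").reshape(1000, 2)

balanced_features, balanced_labels = undersample_majority(
    toy_features, toy_labels, ratio=5.0
)
print(balanced_features.shape, balanced_labels.mean())
```

If a step like this is used, it should be applied to the training split only, so that the test split keeps the original class proportions.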

In [None]:
# Split the data into training and test sets.
# Note: Linear Learner also accepts an optional 'validation' channel; this notebook does not use one.

from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, test_size=0.2)

# DataFrame.size counts cells (rows * columns), so use shape[0] for row counts.
print("Rows in entire data : " , data.shape[0])
print("Rows in training data : " , train_data.shape[0])
print("Rows in test data : " , test_data.shape[0])
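
With so few positive examples, a purely random split can leave the training and test sets with slightly different fraud rates. One possible refinement is a stratified split, which preserves the class ratio in both parts. scikit-learn's `train_test_split` supports this through its `stratify` argument. The sketch below shows the same idea with NumPy only. The helper name `stratified_split_indices` and the toy labels are assumptions for illustration.

```python
import numpy as np

def stratified_split_indices(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        # Shuffle the row indices of this class, then cut off the test share.
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * len(cls_idx)))
        test_parts.append(cls_idx[:n_test])
        train_parts.append(cls_idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Toy labels (hypothetical): 95 negatives and 5 positives.
toy_labels = np.array([0] * 95 + [1] * 5)
train_idx, test_idx = stratified_split_indices(toy_labels, test_fraction=0.2)
print(len(train_idx), len(test_idx), toy_labels[test_idx].sum())
```

The returned index arrays can be used with `DataFrame.iloc` to select the corresponding rows.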

In [None]:
#Convert training data into features and labels
feature_columns = train_data.columns[:-1]
#print(feature_columns)
label_column = train_data.columns[-1]
#print(label_column)

features = train_data[feature_columns].values.astype('float32')
labels = (train_data[label_column].values).astype('float32')

#Convert test data into features and labels
test_feature_columns = test_data.columns[:-1]
test_label_column = test_data.columns[-1]

test_features = test_data[test_feature_columns].values.astype('float32')
test_labels = (test_data[test_label_column].values).astype('float32')

In [None]:
## Save the test features as a CSV file without a header row, so we can use it for real-time and batch predictions.
test_df = pd.DataFrame(test_features)
test_df.to_csv('test_data.csv', index=False, header=False)

In [None]:
## Inspect the test_data dataframe with and without the label column
print("Rows in test data : " , test_data.shape[0])
print("Columns with label", len(test_data.columns))
test_data_no_label = test_data[test_data.columns[:-1]]
print("Columns no label", len(test_data_no_label.columns))

In [None]:
fraud_transaction_indices = np.where(test_labels == 1)
non_fraud_transaction_indices = np.where(test_labels == 0)
 
print('fraud_transaction_indices  : ', fraud_transaction_indices)
print('non_fraud_transaction_indices  : ', non_fraud_transaction_indices)

#### Prepare Data and Upload to S3

The Amazon common libraries provide utilities to convert NumPy n-dimensional arrays into the Record-IO format, which SageMaker uses for a concise representation of features and labels. The Record-IO format is implemented via protocol buffers, so the serialization is very efficient.

In [None]:
import io
import sagemaker.amazon.common as smac

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0);

Now we upload the data to S3 using boto3.

In [None]:
import boto3
import os
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()

prefix = 'linear-learner'
key = 'recordio-pb-data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)

s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('Uploaded training data location: {}'.format(s3_train_data))

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

## Train

Now we train a Linear Learner using SageMaker's built-in algorithm. To specify the Linear Learner algorithm, we use a utility function to obtain its URI. A complete list of built-in algorithms is available here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'linear-learner')

SageMaker abstracts training with Estimators. We pass the container and the training parameters to the estimator, set the hyperparameters for the linear learner, and fit the estimator to the data in S3.

Training takes about 11 minutes.

In [None]:
from sagemaker import get_execution_role

linear = sagemaker.estimator.Estimator(container,
                                       get_execution_role(), 
                                       train_instance_count=1, 
                                       train_instance_type='ml.c4.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=session)
linear.set_hyperparameters(feature_dim=features.shape[1],
                           predictor_type='binary_classifier',
                           mini_batch_size=200)

linear.fit({'train': s3_train_data})
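
Because frauds are so rare, one option worth considering is to weight positive examples more heavily during training. The sketch below computes the weight that would make both classes contribute equally to the loss, using the class counts quoted in the Background section. The hyperparameter name `positive_example_weight_mult` is taken from the Linear Learner documentation as the author of this sketch recalls it, so treat it as an assumption and verify it for your SageMaker version. The helper `balanced_positive_weight` is hypothetical.

```python
def balanced_positive_weight(n_negative, n_positive):
    """Weight that lets positives and negatives contribute equally to the loss."""
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return n_negative / n_positive

# Class counts quoted in the Background section of this notebook.
n_total, n_frauds = 284807, 492
weight = balanced_positive_weight(n_total - n_frauds, n_frauds)

# Hypothetical hyperparameter set. Check the names against the Linear Learner
# documentation before passing them to linear.set_hyperparameters(**hyperparameters).
hyperparameters = {
    "feature_dim": 30,  # Time, V1..V28, and Amount
    "predictor_type": "binary_classifier",
    "mini_batch_size": 200,
    "positive_example_weight_mult": weight,
}
print(round(weight, 1))
```

In practice the counts should come from the training split rather than the full dataset, although with a random 80/20 split the ratio will be similar.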

## Host

Now we deploy the estimator to an endpoint. This step takes about 10 minutes.

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

linear_predictor = linear.deploy(initial_instance_count=1,
                                 endpoint_name="fraud-detection-ep",
                                 instance_type='ml.m4.xlarge')
# Specify input and output formats.
linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer

## Predict

### RealTime

In [None]:
#Function to get predictions from the sagemaker endpoint.
def get_fraud_prediction(data):
    sagemaker_endpoint_name = 'fraud-detection-ep'  # must match the endpoint_name passed to deploy()
    sagemaker_runtime = boto3.client('sagemaker-runtime')
    response = sagemaker_runtime.invoke_endpoint(EndpointName=sagemaker_endpoint_name, ContentType='text/csv',
                                                 Body=data)
    print('\nresponse:', response)
    result = json.loads(response['Body'].read().decode())
    print('\nresult:', result)
    pred = int(result['predictions'][0]['predicted_label'])
    return pred

In [None]:
#data_payload = str(test_data[0])
fraud_transaction_index = fraud_transaction_indices[0][0]
print("fraud_transaction_index ", fraud_transaction_index)
data_payload = ','.join(map(str, test_features[fraud_transaction_index]))
pred = get_fraud_prediction(data_payload)
print('\nprediction:', pred)
print('Original : ', test_labels[fraud_transaction_index])
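
The request and response handling used above can be tried offline, without calling the endpoint. The sketch below builds a CSV payload from one feature row and parses a response body shaped like the one `get_fraud_prediction` reads. The `score` field and the sample body are assumptions about the response schema, and the helper names are hypothetical.

```python
import json

def to_csv_payload(feature_row):
    """Serialize one row of numeric features as a single CSV line."""
    return ",".join(str(float(value)) for value in feature_row)

def parse_prediction(response_body):
    """Extract (predicted_label, score) from a JSON response body.

    The 'score' key is an assumption about the response schema;
    None is returned for it when the key is absent.
    """
    first = json.loads(response_body)["predictions"][0]
    return int(first["predicted_label"]), first.get("score")

# Hypothetical response body, shaped like the one parsed above.
sample_body = '{"predictions": [{"score": 0.93, "predicted_label": 1.0}]}'
label, score = parse_prediction(sample_body)
print(to_csv_payload([0.0, -1.5, 2.25]), label, score)
```

Keeping serialization and parsing in small functions like these makes it easier to unit test them separately from the network call.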

### Batch Predictions

In [None]:
#boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test', key)).upload('test.csv')
key='test_data.csv'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test', key)).upload_file('test_data.csv')
  

The batch transform job below takes about 5 minutes.

In [None]:
%%time

sm_transformer = linear.transformer(1, 'ml.m4.xlarge')

# start a transform job
input_location = 's3://{}/{}/test/{}'.format(bucket, prefix, key)
print(input_location)
sm_transformer.transform(input_location, split_type='Line', content_type='text/csv',compression_type='None')
sm_transformer.wait()

### Download and look at the output generated by batch transform

In [None]:
# Download the output data from S3 to local filesystem
batch_output = sm_transformer.output_path
!mkdir -p batch_data/output
!aws s3 cp --recursive $batch_output/ batch_data/output/
# Head to see what the batch output looks like
!head batch_data/output/*
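
Once predicted labels have been extracted from the batch output, they can be compared with `test_labels`. For a dataset this unbalanced, accuracy alone can look high even when many frauds are missed, so precision and recall are usually more informative. The sketch below computes them with NumPy on toy arrays. The function name `precision_recall` and the toy values are assumptions for illustration, and the step of parsing the batch output into an array of labels is not shown.

```python
import numpy as np

def precision_recall(true_labels, predicted_labels):
    """Precision and recall for the positive (fraud) class."""
    true_labels = np.asarray(true_labels).astype(bool)
    predicted_labels = np.asarray(predicted_labels).astype(bool)
    tp = int(np.sum(true_labels & predicted_labels))   # frauds caught
    fp = int(np.sum(~true_labels & predicted_labels))  # false alarms
    fn = int(np.sum(true_labels & ~predicted_labels))  # frauds missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels and predictions (hypothetical values).
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
toy_pred = np.array([0, 0, 1, 0, 1, 1, 0, 0, 0, 0])
precision, recall = precision_recall(toy_true, toy_pred)
accuracy = float(np.mean(toy_true == toy_pred))
print(precision, recall, accuracy)
```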

## CleanUp

Delete the prediction endpoint when you're done. You can do that on the Endpoints page of the Amazon SageMaker console, or you can uncomment and execute the cell below.

In [None]:
#linear_predictor.delete_endpoint()