# Credit Card Fraud Detection

In this tutorial, we use **Amazon SageMaker to train and deploy an XGBoost model** that predicts whether or not a person is going to default on their credit card payment. We use the “default of credit card clients” dataset from the UCI Machine Learning Repository.


First, set up your notebook: define the S3 bucket you’re using, import the libraries you will need, and get the Amazon SageMaker execution IAM role from the notebook environment. SageMaker uses this role to run training jobs and access your data.

In [5]:
bucket = 'mmalikg-sagemaker-repo'
prefix = 'sagemaker/xgboost_credit_risk'

import boto3
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer

# Get the IAM role attached to this notebook environment
role = get_execution_role()

The next step is to download the dataset, which comes as a **.xls file**.

In [6]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls

--2019-06-12 18:35:53--  https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5539328 (5.3M) [application/x-httpd-php]
Saving to: ‘default of credit card clients.xls.1’


2019-06-12 18:35:53 (16.8 MB/s) - ‘default of credit card clients.xls.1’ saved [5539328/5539328]



In [7]:
dataset = pd.read_excel('default of credit card clients.xls', index_col=0)
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 15)
dataset

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,...,X18,X19,X20,X21,X22,X23,Y
ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,...,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,20000,2,2,1,24,2,2,...,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,...,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,...,1518,1500,1000,1000,1000,5000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29997,150000,1,3,2,43,-1,-1,...,1837,3526,8998,129,0,0,0
29998,30000,1,2,2,37,4,3,...,0,0,22000,4200,2000,3100,1
29999,80000,1,3,1,41,1,-1,...,85900,3409,1178,1926,52964,1804,1
30000,50000,1,2,1,46,0,0,...,2078,1800,1430,1000,1000,1000,1


In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30001 entries, ID to 30000
Data columns (total 24 columns):
X1     30001 non-null object
X2     30001 non-null object
X3     30001 non-null object
X4     30001 non-null object
X5     30001 non-null object
X6     30001 non-null object
X7     30001 non-null object
X8     30001 non-null object
X9     30001 non-null object
X10    30001 non-null object
X11    30001 non-null object
X12    30001 non-null object
X13    30001 non-null object
X14    30001 non-null object
X15    30001 non-null object
X16    30001 non-null object
X17    30001 non-null object
X18    30001 non-null object
X19    30001 non-null object
X20    30001 non-null object
X21    30001 non-null object
X22    30001 non-null object
X23    30001 non-null object
Y      30001 non-null object
dtypes: object(24)
memory usage: 5.7+ MB
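
The `info()` output above reports 30,001 entries and `object` dtypes because the spreadsheet has two header rows, and the second one (`ID`, `LIMIT_BAL`, ...) was read in as data. The notebook drops that row in a later cell. As a minimal sketch of how the values could also be converted to numbers, here is the same clean-up on a small stand-in frame; the toy values and the names `raw` and `clean` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the spreadsheet as loaded above: the first row holds
# descriptive column names rather than data, so every column is dtype object.
raw = pd.DataFrame(
    {
        "X1": ["LIMIT_BAL", 20000, 120000],
        "X5": ["AGE", 24, 26],
        "Y": ["default payment next month", 1, 1],
    },
    index=["ID", 1, 2],
)

# Drop the descriptive row, then convert each column to a numeric dtype.
clean = raw.drop("ID").apply(pd.to_numeric)

print(clean.dtypes)
```

The numeric conversion is optional for writing CSV files, but it makes later arithmetic and summary statistics behave as expected.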


By reading the dataset we see that it has 30,000 records, and each record has 23 associated attributes to describe features relevant to the credit scores of the person the record represents. The attributes are the following:

X1: Amount of the given credit.

X2: Gender (1 = male; 2 = female).

X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

X4: Marital status (1 = married; 2 = single; 3 = others).

X5: Age (year).

X6 – X11: History of past payments. Tracked past monthly payment records (from April to September, 2005) are displayed as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005… X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months… 8 = payment delay for eight months; 9 = payment delay for nine months and above.

X12-X17: Amount of bill statement. X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005… X17 = amount of bill statement in April, 2005.

X18-X23: Amount of previous payment. X18 = amount paid in September, 2005; X19 = amount paid in August, 2005… X23 = amount paid in April, 2005.

Y: Did the person default? (Yes = 1, No = 0)

The “Y” attribute is known as the target attribute. This is the attribute that we want XGBoost to predict. Because the target attribute is binary, our model performs binary classification. In this dataset, a 1 in the Y column means that the person defaulted and a 0 means that they did not.

Amazon SageMaker XGBoost can train on data in either CSV or LibSVM format. For this example, we use CSV.

**The CSV file should:**
  - Have the target variable in the first column
  - Not have a header row
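
As a small illustration of this layout, the sketch below moves the target column to the front of a toy frame and writes it out with no header and no index. The frame and its values are hypothetical; only the column-reordering pattern mirrors the cell that follows.

```python
import io

import pandas as pd

# Hypothetical three-column frame with the target "Y" in the last position.
toy = pd.DataFrame({"X1": [20000, 90000], "X2": [2, 1], "Y": [1, 0]})

# Put the target column first, followed by all remaining feature columns.
ordered = pd.concat([toy["Y"], toy.drop(columns=["Y"])], axis=1)

# Write to an in-memory buffer without a header row or index column.
buffer = io.StringIO()
ordered.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```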

In [42]:
dataset = dataset.drop('ID')
dataset = pd.concat([dataset['Y'], dataset.drop(['Y'], axis=1)], axis=1)

Here, we split our dataset into **training, validation, and testing sets** (70%, 20%, and 10% of the shuffled rows, respectively). XGBoost trains on the training set and uses the validation set to evaluate prediction results as the model is trained. We will make predictions against the testing set after the model has been deployed.

In [48]:
train_data, validation_data, test_data = np.split(dataset.sample(frac=1, random_state=1729), [int(0.7 * len(dataset)), int(0.9 * len(dataset))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
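
To see what the cut points do, here is a minimal sketch of the same 70/20/10 split on a 100-row toy frame. It uses plain `iloc` slices instead of `np.split`, which is a substitution made for this illustration; the frame and the names `train`, `valid`, and `test` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 100-row frame standing in for the real dataset.
toy = pd.DataFrame({"Y": np.arange(100) % 2, "X1": np.arange(100)})

# Shuffle once with a fixed seed so the split is reproducible.
shuffled = toy.sample(frac=1, random_state=1729)

# Cut points at 70% and 90% of the rows.
train_end = int(0.7 * len(shuffled))
valid_end = int(0.9 * len(shuffled))

train = shuffled.iloc[:train_end]
valid = shuffled.iloc[train_end:valid_end]
test = shuffled.iloc[valid_end:]

print(len(train), len(valid), len(test))
```

Because the three slices are taken from one shuffled frame, no row can appear in more than one set.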

**Set up data channels for the XGBoost estimator to use.**

In [57]:
# Upload train.csv to S3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')

# Upload validation.csv to S3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

train_data = 's3://{}/{}/{}'.format(bucket, prefix, 'train')

validation_data = 's3://{}/{}/{}'.format(bucket, prefix, 'validation')

s3_output_location = 's3://{}/{}/{}'.format(bucket, prefix, 'xgboost_fraud_detection_model')

train_channel = sagemaker.session.s3_input(train_data, content_type='text/csv')
valid_channel = sagemaker.session.s3_input(validation_data, content_type='text/csv')

data_channels = {'train': train_channel, 'validation': valid_channel}
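
A note on building the S3 keys: `os.path.join` uses the path separator of the operating system running the code, so it only yields forward slashes on Linux or macOS. A minimal sketch of an OS-independent alternative uses `posixpath.join`. The helper names `s3_key` and `s3_uri` and the bucket name are hypothetical, chosen for this illustration.

```python
import posixpath

bucket = "example-bucket"  # hypothetical bucket name
prefix = "sagemaker/xgboost_credit_risk"


def s3_key(*parts):
    """Join key components with forward slashes regardless of the host OS."""
    return posixpath.join(prefix, *parts)


def s3_uri(*parts):
    """Build the s3:// URI for a key under the configured prefix."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))


print(s3_key("train", "train.csv"))
print(s3_uri("train"))
```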

The next step kicks off our **XGBoost training job**. We first define the location of the Amazon SageMaker XGBoost training container. We then create an Amazon SageMaker estimator. By changing values such as train_instance_count and train_instance_type, we can change the number and size of the instances we run on, which scales and distributes the training.

XGBoost also has a number of hyperparameters that we can tune to improve model performance. Here, we set values for some of the most commonly tuned hyperparameters. Notice that the objective parameter is set to binary:logistic. This parameter tells XGBoost what kind of problem we are solving (classification, regression, ranking, etc.). In this case we are solving a binary classification problem: predicting whether or not a person is likely to default on their credit card payments.
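
To make the `binary:logistic` objective more concrete: with this objective the model outputs a probability, obtained by passing a real-valued score through the logistic (sigmoid) function. The sketch below is plain NumPy, not SageMaker or XGBoost API, and the raw scores are made up for illustration.

```python
import numpy as np


def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))


raw_scores = np.array([-2.0, 0.0, 2.0])  # made-up scores for illustration
probabilities = sigmoid(raw_scores)

print(probabilities)
```

A score of zero maps to a probability of 0.5, and scores of equal size but opposite sign map to probabilities that sum to 1.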

In [59]:
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'xgboost')
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path=s3_output_location,
                                    sagemaker_session=sess)
xgb.set_hyperparameters(eta=0.1,
                        objective='binary:logistic',
                        num_round=25)

xgb.fit(inputs=data_channels,  logs=True)

2019-06-07 20:19:27 Starting - Starting the training job...
2019-06-07 20:19:29 Starting - Launching requested ML instances......
2019-06-07 20:20:31 Starting - Preparing the instances for training......................................................
2019-06-07 20:30:00 Downloading - Downloading input data
2019-06-07 20:30:00 Training - Downloading the training image...
Arguments: train
[2019-06-07:20:30:20:INFO] Running standalone xgboost training.
[2019-06-07:20:30:20:INFO] File size need to be processed in the node: 2.32mb. Available memory size in the node: 8447.71mb
[2019-06-07:20:30:20:INFO] Determined delimiter of CSV input is ','
[20:30:20] S3DistributionType set as FullyReplicated
[20:30:20] 21000x23 matrix with 483000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-06-07:20:30:20:INFO] Determined delimiter of CSV input is ','
[20:30:20] S3DistributionType set as Fully

In [60]:
test_data

Unnamed: 0,Y,X1,X2,X3,X4,X5,X6,...,X17,X18,X19,X20,X21,X22,X23
21002,1,280000,1,3,1,40,2,...,192023,10000,9000,8000,6738,6974,7600
25402,0,90000,2,1,2,24,0,...,36415,2366,4000,3000,2000,3480,15000
14067,0,140000,1,2,2,39,1,...,128118,7000,0,6000,5200,5000,5000
25302,0,80000,2,3,2,25,0,...,38121,3200,4000,3000,1500,1500,1300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22054,0,50000,2,3,2,29,0,...,43662,1840,1311,2000,42705,1569,2822
1268,0,70000,2,2,2,34,1,...,31873,1500,2900,0,2500,4000,0
3175,0,100000,2,2,2,23,-2,...,4737,9187,5408,12920,9656,4737,4513
1678,0,290000,2,1,1,50,0,...,83687,7000,10009,5000,5000,5000,10000


In [62]:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------------------------------------------------------------------------------!

In [63]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

# Drop the label column (column 0) before sending rows to the endpoint
predictions = predict(test_data.values[:, 1:])
predictions




array([0.59283876, 0.1414548 , 0.31579986, ..., 0.41416201, 0.11323805,
       0.11232713])

The output is an array of predictions. The first element is the probability that the person in the first row of the testing set will default on their credit card payments. The array continues with probabilities for all remaining rows in the testing set.
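
To turn these probabilities into yes/no predictions and compare them with the true labels, one option is to apply a cutoff and count the four outcome types. The sketch below assumes a 0.5 threshold, which is a common default rather than a choice made in this tutorial; the helper names and the example values are hypothetical.

```python
import numpy as np


def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities to 0/1 labels at the given threshold."""
    return (np.asarray(probabilities) >= threshold).astype(int)


def confusion_counts(actual, predicted):
    """Return (true_pos, false_pos, true_neg, false_neg) counts."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    true_pos = int(np.sum((actual == 1) & (predicted == 1)))
    false_pos = int(np.sum((actual == 0) & (predicted == 1)))
    true_neg = int(np.sum((actual == 0) & (predicted == 0)))
    false_neg = int(np.sum((actual == 1) & (predicted == 0)))
    return true_pos, false_pos, true_neg, false_neg


example_probs = [0.9, 0.2, 0.6, 0.4]  # made-up probabilities
example_actual = [1, 0, 0, 1]         # made-up true labels
example_pred = to_labels(example_probs)

print(example_pred, confusion_counts(example_actual, example_pred))
```

In this notebook, the same helpers could be applied to `predictions` and to the first column of `test_data` converted to integers, for example `test_data.iloc[:, 0].astype(int)`.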