# Credit card fraud detector

In this solution we will build the core of a credit card fraud detection system using SageMaker. We will start by training an anomaly detection algorithm, then proceed to train two XGBoost models for supervised training. To deal with the highly unbalanced data common in fraud detection, our first model will use re-weighting of the data, and the second will use re-sampling, using the popular SMOTE technique for oversampling the rare fraud data.

Our solution includes an example of making calls to a REST API to simulate a real deployment, using AWS Lambda to trigger both the anomaly detection and XGBoost model.

## Investigate and process the data

Let's start by reading in the credit card fraud data set.

In [1]:
!pwd

/home/ec2-user/SageMaker/PUBG-clustering-player-behavior-for-cheaters


In [2]:
# !pip install -r /home/ec2-user/SageMaker/PUBG-clustering-player-behavior-for-cheaters/requirements.txt

In [3]:
#!aws s3 cp s3://sagemaker-fraud-machine-learning-inputbucket-ee-x-pod-lab-kp /home/ec2-user/SageMaker/source/notebooks --recursive


In [4]:
# !aws s3 ls s3://sagemaker-fraud-machine-learning-inputbucket-ee-x-pod-lab-kp/pubgclustering/data/PUBG_Player_Statistics.csv

In [5]:
!pwd

/home/ec2-user/SageMaker/PUBG-clustering-player-behavior-for-cheaters


In [6]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.covariance import EllipticEnvelope
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import boto
import boto3

In [7]:
# data = pd.read_csv('creditcard.csv', delimiter=',')
file = "/home/ec2-user/SageMaker/PUBG-clustering-player-behavior-for-cheaters/data/PUBG_Player_Statistics.csv"
data = pd.read_csv(file, delimiter=',')

Let's take a peek at our data (we only show a subset of the columns in the table):

In [8]:
print(data.columns)
# data[['Time', 'V1', 'V2', 'V27', 'V28', 'Amount', 'Class']].describe()

Index(['player_name', 'tracker_id', 'solo_KillDeathRatio', 'solo_WinRatio',
       'solo_TimeSurvived', 'solo_RoundsPlayed', 'solo_Wins',
       'solo_WinTop10Ratio', 'solo_Top10s', 'solo_Top10Ratio',
       ...
       'squad_RideDistance', 'squad_MoveDistance', 'squad_AvgWalkDistance',
       'squad_AvgRideDistance', 'squad_LongestKill', 'squad_Heals',
       'squad_Revives', 'squad_Boosts', 'squad_DamageDealt', 'squad_DBNOs'],
      dtype='object', length=152)


In [9]:
#---------Preprocessing
## Create a copy of the dataframe
df = data.copy()
cols = np.arange(52, 152, 1)

# Drop entries if they have null values
df.dropna(inplace = True)

## Drop columns after the 52nd index
df.drop(df.columns[cols], axis = 1, inplace = True)

## Drop player_name and tracker id
df.drop(df.columns[[0, 1]], axis = 1, inplace = True)

## Drop Knockout and Revives
df.drop(df.columns[[49]], axis = 1, inplace = True)
df.drop(columns = ['solo_Revives'], inplace = True)

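## Remove the 'solo_' prefix from the column names
## (str.lstrip strips a set of characters, not a prefix, so we slice instead)
df.rename(columns = lambda x: x[len('solo_'):] if x.startswith('solo_') else x, inplace = True)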

## Combine a few columns 
df['TotalDistance'] = df['WalkDistance'] + df['RideDistance']
df['AvgTotalDistance'] = df['AvgWalkDistance'] + df['AvgRideDistance']

# Remove outliers: drop players who played fewer rounds than the mean
df = df.drop(df[df['RoundsPlayed'] < df['RoundsPlayed'].mean()].index)
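
The renaming step above removes the `solo_` prefix from each column name. One detail worth knowing is that `str.lstrip` takes a set of characters rather than a prefix, so it can remove more than intended when the remainder of the name starts with one of those characters. The sketch below illustrates the difference; `strip_prefix` and the lowercase column name are hypothetical and are not part of the notebook or the dataset.

```python
def strip_prefix(name, prefix="solo_"):
    """Remove `prefix` only when `name` actually starts with it."""
    return name[len(prefix):] if name.startswith(prefix) else name

# With the capitalised names in this dataset both approaches agree,
# because 'L' is not in the character set {'s', 'o', 'l', '_'}.
same_result = "solo_Losses".lstrip("solo_") == strip_prefix("solo_Losses")

# A hypothetical lowercase name shows where they diverge:
# lstrip keeps removing 'l', 'o' and 's' until it meets 'e'.
lstrip_result = "solo_losses".lstrip("solo_")   # 'es'
prefix_result = strip_prefix("solo_losses")     # 'losses'

print(same_result, lstrip_result, prefix_result)
```

On Python 3.9 and later, `str.removeprefix("solo_")` does the same job as the helper.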

In [10]:
df.columns

Index(['KillDeathRatio', 'WinRatio', 'TimeSurvived', 'RoundsPlayed', 'Wins',
       'WinTop10Ratio', 'Top10s', 'Top10Ratio', 'Losses', 'Rating',
       'BestRating', 'DamagePg', 'HeadshotKillsPg', 'HealsPg', 'KillsPg',
       'MoveDistancePg', 'RevivesPg', 'RoadKillsPg', 'TeamKillsPg',
       'TimeSurvivedPg', 'Top10sPg', 'Kills', 'Assists', 'Suicides',
       'TeamKills', 'HeadshotKills', 'HeadshotKillRatio', 'VehicleDestroys',
       'RoadKills', 'DailyKills', 'WeeklyKills', 'RoundMostKills',
       'MaxKillStreaks', 'WeaponAcquired', 'Days', 'LongestTimeSurvived',
       'MostSurvivalTime', 'AvgSurvivalTime', 'WinPoints', 'WalkDistance',
       'RideDistance', 'MoveDistance', 'AvgWalkDistance', 'AvgRideDistance',
       'LongestKill', 'Heals', 'Boosts', 'DamageDealt', 'TotalDistance',
       'AvgTotalDistance'],
      dtype='object')

In [11]:
# Create train, development and test sets using scikit-learn
train, test = train_test_split(df, test_size=0.3, random_state = 10)
dev, test = train_test_split(test, test_size = 0.2, random_state = 10)
data = train

print("The number of training samples is", len(train))
print("The number of development samples is", len(dev))
print("The number of testing samples is", len(test))

The number of training samples is 20771
The number of development samples is 7121
The number of testing samples is 1781
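
The two chained splits above produce roughly a 70% / 24% / 6% partition into training, development and test sets. The sketch below reproduces the printed counts with plain arithmetic. It assumes that a fractional `test_size` is rounded up for the held-out part and down for the remainder, which matches the numbers shown; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out size is rounded up and the kept size is
    rounded down, which is consistent with the counts printed above.
    """
    n_held_out = math.ceil(test_size * n_samples)
    n_kept = math.floor((1 - test_size) * n_samples)
    return n_kept, n_held_out

total = 20771 + 7121 + 1781                  # rows left after preprocessing
n_train, n_holdout = split_sizes(total, 0.3)  # first split: 30% held out
n_dev, n_test = split_sizes(n_holdout, 0.2)   # second split: 20% of the held-out rows

for name, n in [("train", n_train), ("dev", n_dev), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / total:.1%})")
```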


In [12]:
with pd.option_context('display.max_columns', 52):
    print(data.describe(include = 'all'))

       KillDeathRatio      WinRatio  TimeSurvived  RoundsPlayed          Wins  \
count    20771.000000  20771.000000  2.077100e+04  20771.000000  20771.000000   
mean         1.289158      2.204012  1.484172e+05    174.985894      3.554475   
std          0.602602      2.510500  9.339460e+04    113.147056      4.939222   
min          0.100000      0.000000  3.813548e+04     80.000000      0.000000   
25%          0.900000      0.680000  9.091498e+04    104.000000      1.000000   
50%          1.160000      1.460000  1.195404e+05    139.000000      2.000000   
75%          1.520000      2.910000  1.733681e+05    205.000000      4.000000   
max         17.410000     40.210000  1.219536e+06   1552.000000    102.000000   

       WinTop10Ratio        Top10s    Top10Ratio        Losses        Rating  \
count   20771.000000  20771.000000  20771.000000  20771.000000  20771.000000   
mean        0.138708     23.884743     14.369067    171.431419   2059.159131   
std         0.137145     19.21

In [13]:
label_train = train['Rating']
feature_train = train.drop('Rating', axis=1)

In [14]:
label_test = test['Rating']
feature_test = test.drop('Rating', axis=1)

In [15]:
# Scale the data (Normalize)
scaler1 = StandardScaler()
X_train = scaler1.fit_transform(feature_train)
scaler2 = StandardScaler()
y_train = scaler2.fit_transform(label_train.values.reshape(-1,1))


In [16]:
# Scale the data (Normalize)

X_test = scaler1.transform(feature_test)
y_test = scaler2.transform(label_test.values.reshape(-1,1))

In [31]:
X_train.shape

(20771, 49)
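
The scaling cells above fit each scaler on the training data only and then reuse it on the test data, so that no test statistics leak into training. The following is a minimal pure-Python sketch of that pattern for a single column. The numbers are made up, and the helper functions are hypothetical stand-ins for `StandardScaler`, which likewise uses the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Standardize values with statistics learned elsewhere (e.g. on the training set)."""
    return [(v - mean) / std for v in values]

def inverse_transform(scaled, mean, std):
    """Map standardized values back to the original units."""
    return [s * std + mean for s in scaled]

train_ratings = [1800.0, 2000.0, 2200.0]   # made-up training targets
test_ratings = [1900.0, 2400.0]            # made-up test targets

mean, std = fit_scaler(train_ratings)               # fit on training data only
scaled_test = transform(test_ratings, mean, std)    # reuse the training statistics
restored = inverse_transform(scaled_test, mean, std)

print(mean, round(std, 3), [round(s, 3) for s in scaled_test], restored)
```

The inverse transform is what later turns model outputs back into rating points.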

## Supervised Learning

Once we have gathered an adequate amount of labeled training data, we can use a supervised learning algorithm that discovers relationships between the features and the dependent class.

We will use Gradient Boosted Trees as our model, as they have a proven track record, are highly scalable and can deal with missing data, reducing the need to pre-process datasets.

### Prepare Data and Upload to S3

First we copy the data to an in-memory buffer.

In [17]:
y_train = [y[0] for y in y_train]

In [18]:
import io
import sklearn
from sklearn.datasets import dump_svmlight_file   
from package import config
import os
import sagemaker

buf = io.BytesIO()
bucket = config.MODEL_DATA_S3_BUCKET
prefix = 'fraud-classifier-score-detection'
session = sagemaker.Session()

sklearn.datasets.dump_svmlight_file(X_train, y_train, buf)
buf.seek(0);
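
The buffer now holds the training data in svmlight/libsvm text format: one line per row, starting with the label and followed by `index:value` pairs. The sketch below hand-rolls a simplified version of that layout to show what the lines look like. It is an illustration under stated assumptions (zero-based indices, zero-valued features omitted, short number formatting), not the scikit-learn implementation, and `to_libsvm_line` is a hypothetical helper.

```python
def to_libsvm_line(label, features, zero_based=True):
    """Format one row as 'label idx:value idx:value ...', skipping zero features."""
    offset = 0 if zero_based else 1
    pairs = [f"{i + offset}:{v:g}" for i, v in enumerate(features) if v != 0]
    return " ".join([f"{label:g}"] + pairs)

# Made-up standardized features and labels
rows = [[0.5, 0.0, -1.25], [0.0, 2.0, 0.0]]
labels = [0.3, -1.1]

lines = [to_libsvm_line(y, x) for y, x in zip(labels, rows)]
print("\n".join(lines))
```

Omitting zeros is why a file in this format can contain fewer entries than rows times columns.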

Now we upload the data to S3 using boto3.

In [19]:
key = 'fraud-dataset-score1'
subdir = 'base'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', subdir, key)).upload_fileobj(buf)

s3_train_data = 's3://{}/{}/train/{}/{}'.format(bucket, prefix, subdir, key)
print('Uploaded training data location: {}'.format(s3_train_data))

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

Uploaded training data location: s3://sagemaker-fraud-machine-learning-modeldatabucket-pb9mj8rs1vhd/fraud-classifier-score-detection/train/base/fraud-dataset-score1
Training artifacts will be uploaded to: s3://sagemaker-fraud-machine-learning-modeldatabucket-pb9mj8rs1vhd/fraud-classifier-score-detection/output


We can now train using SageMaker's built-in XGBoost algorithm. To specify the XGBoost algorithm, we use a utility function to obtain its URI. A complete list of built-in algorithms is found here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

In [20]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='1.0-1')


'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


SageMaker abstracts training via Estimators. We can pass the classifier and parameters along with hyperparameters to the estimator, and fit the estimator to the data in S3. An important parameter here is `scale_pos_weight` which scales the weights of the positive vs. negative class examples. This is crucial to do in an imbalanced dataset like the one we are using here, otherwise the majority class would dominate the learning.

In [21]:
from math import sqrt
from sagemaker import get_execution_role

# Because the data set is so highly skewed, we set the scale position weight conservatively,
# as sqrt(num_nonfraud/num_fraud).
# Other recommendations for the scale_pos_weight are setting it to (num_nonfraud/num_fraud).
# Note: the dictionary below does not set scale_pos_weight and uses a regression objective.

hyperparams = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "silent":"0",
        "objective":"reg:linear",
        "num_round":"50"
}



Let us explain the hyper-parameters. The one that is most relevant for learning from skewed data is `scale_pos_weight`, a ratio that weighs the examples of the positive class (fraud) against the negative class (legitimate). It is commonly set to `(num_nonfraud/num_fraud)`, but when the data is extremely skewed a more conservative choice is `sqrt(num_nonfraud/num_fraud)`. For the credit card data in this example, this would be `sqrt(284,807/492)`, which would give the fraud examples a weight of ~24. Note that the `hyperparams` dictionary above does not set `scale_pos_weight`.

The rest of the hyper-parameters are as follows:

* `max_depth`: This is the maximum depth of the trees that will be built for our ensemble. A max depth of 5 will give us trees with up to 32 leaves. Note that tree size grows exponentially when increasing this parameter (`num_leaves=2^max_depth`), so a max depth of 10 would give us trees with 1024 leaves, which are likely to overfit.
* `subsample`: The subsample ratio that we use to select a subset of the complete data to train each tree in the ensemble. With a value of 0.7, each tree is trained on a random sample containing 70% of the complete data. This is used to prevent overfitting.
* `num_round`: This is the size of the ensemble. We will train for 50 "rounds", each training round adding a new tree to the ensemble.
* `eta`: This is the step size shrinkage applied at each update. This value will shrink the weights of new features to prevent overfitting.
* `gamma`: This is the minimum loss reduction to reach before splitting a leaf. Splitting a leaf can sometimes have a small benefit, and splitting such leaves can lead to overfitting. By setting `gamma` to values larger than zero, we ensure that there should be at least some non-negligible amount of accuracy gain before splitting a leaf.
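* `min_child_weight`: This parameter has a similar effect to `gamma`: it is the minimum sum of instance weights required in a child, so higher values mean a leaf is split only when each resulting child is supported by enough data.
* `objective`: The dictionary above uses `reg:linear`, a squared-error regression objective (a deprecated alias of `reg:squarederror` in recent XGBoost versions), because the target is the continuous `Rating` column. For binary classification we would use a logistic loss objective such as `binary:logistic` instead.
* `eval_metric`: Having a good evaluation metric is crucial when dealing with imbalanced data, and AUC is a common choice for classification. The dictionary above does not set it, so XGBoost uses the default metric for the objective (RMSE for regression).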
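
The arithmetic behind two of the figures quoted above can be checked directly. This is a small sketch that uses the counts given in the text (`284,807` and `492`); the variable and function names are illustrative only.

```python
from math import sqrt

# Counts quoted in the text for the credit card data set
num_nonfraud = 284_807
num_fraud = 492

full_weight = num_nonfraud / num_fraud          # the commonly recommended ratio
conservative_weight = sqrt(full_weight)         # the more conservative square-root variant

def max_leaves(max_depth):
    """Upper bound on the number of leaves of a binary tree of the given depth."""
    return 2 ** max_depth

print(f"num_nonfraud/num_fraud       = {full_weight:.1f}")
print(f"sqrt(num_nonfraud/num_fraud) = {conservative_weight:.1f}")
print(f"max leaves at depth 5  = {max_leaves(5)}")
print(f"max leaves at depth 10 = {max_leaves(10)}")
```

The square root reduces the weight from several hundred to about 24, which is why it is described as the conservative option.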

In [22]:
clf = sagemaker.estimator.Estimator(container,
                                    get_execution_role(),
                                    hyperparameters=hyperparams,
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path=output_location,
                                    sagemaker_session=session)



We can now fit our supervised training model. The call to `fit` below should take around 5 minutes to complete.

In [23]:
clf.fit({'train': s3_train_data})



2020-09-23 00:14:51 Starting - Starting the training job...
2020-09-23 00:14:53 Starting - Launching requested ML instances......
2020-09-23 00:16:14 Starting - Preparing the instances for training.........
2020-09-23 00:17:48 Downloading - Downloading input data
2020-09-23 00:17:48 Training - Downloading the training image...
2020-09-23 00:18:21 Uploading - Uploading generated training model.
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:linear to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
[00:18:18] 20771x49 matrix with 976237 entries loaded from /opt/ml/input/data/train
INFO:root:Single node training.
INFO:root:Train matrix has 20771 rows
Parameters: 

### Host Classifier

Now we deploy the estimator to an endpoint. As before, progress will be indicated by `-`, and the deployment should be done after about 10 minutes.

In [38]:
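# Delete the endpoint left over from a previous run before redeploying
# (only valid if `predictor` already exists in this session)
predictor.delete_endpoint()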

In [39]:
from sagemaker.predictor import csv_serializer

predictor = clf.deploy(initial_instance_count=1,
                       model_name="{}-xgb-score2".format(config.STACK_NAME),
                       endpoint_name="{}-xgb-score2".format(config.STACK_NAME),
                       instance_type='ml.m4.xlarge', 
                       serializer=csv_serializer,
                       deserializer=None,
                       content_type='text/csv'
                       )



---------------!

## Evaluation

Once we have trained the model we can use it to make predictions for the test set.

In [40]:
# Because we have a large test set, we call predict on smaller batches
def predict(current_predictor, data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, current_predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')
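
The batching logic in `predict` can be tried without a live endpoint. The sketch below is a pure-Python approximation under stated assumptions: `StubPredictor` is a hypothetical stand-in that returns one comma-separated score per row as bytes, and `split_into_batches` mimics the near-equal chunking of `np.array_split` for a list of rows.

```python
class StubPredictor:
    """Hypothetical stand-in for a deployed endpoint: one CSV score per row, as bytes."""
    def predict(self, rows):
        return ",".join(str(float(sum(row))) for row in rows).encode("utf-8")

def split_into_batches(rows, batch_rows=500):
    """Split rows into int(n / batch_rows + 1) near-equal chunks, larger chunks first."""
    n_batches = int(len(rows) / float(batch_rows) + 1)
    base, extra = divmod(len(rows), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(rows[start:start + size])
        start += size
    return batches

def predict_in_batches(predictor, rows, batch_rows=500):
    """Call the predictor once per batch and collect the parsed scores in order."""
    scores = []
    for batch in split_into_batches(rows, batch_rows):
        if not batch:          # skip empty chunks
            continue
        response = predictor.predict(batch).decode("utf-8")
        scores.extend(float(s) for s in response.split(","))
    return scores

rows = [[float(i), 1.0] for i in range(1781)]   # same row count as the test set
batches = split_into_batches(rows)
scores = predict_in_batches(StubPredictor(), rows)
print([len(b) for b in batches], len(scores))
```

Keeping each request to a few hundred rows keeps individual payloads small, at the cost of a few more round trips.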

In [41]:
raw_preds = predict(predictor, X_test)

In [42]:
from sklearn.metrics import mean_absolute_error

# Both arrays are standardized, so this error is in units of standard deviations of Rating
mean_absolute_error(y_test, raw_preds)

0.12294080013954453

In [43]:
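# inverse_transform expects a 2D array, so reshape to a column and flatten the result
y_pred = scaler2.inverse_transform(raw_preds.reshape(-1, 1)).ravel()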

In [46]:
y_pred[100:120]

array([1848.66862436, 2535.32131814, 1673.89029883, 1942.82280686,
       2652.46228703, 2008.38311133, 2082.21514075, 1990.20157275,
       1960.61736451, 2016.94419206, 2737.14591935, 2062.73540727,
       2229.97074948, 2499.78695679, 2275.62574551, 2121.69896321,
       2119.75447128, 2300.04836151, 2212.26562226, 2084.63478919])

In [47]:
label_test[100:120]

57444    1880.39
3409     2554.83
29238    1671.34
5743     1960.45
11757    2718.97
33620    1995.55
14746    2066.20
43903    1987.61
41285    1949.17
87335    2028.89
145      2813.79
42213    2108.29
2183     2216.46
1676     2546.16
86401    2301.43
15801    2137.92
61349    2132.76
28713    2298.64
11791    2193.26
52764    2080.42
Name: Rating, dtype: float64
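
To summarise the two listings above in a single number, the sketch below computes the mean absolute error in rating points over just these 20 samples. The predicted values are copied from the output above and rounded to two decimals. This covers only the displayed slice, so it is not a substitute for the full test-set metric.

```python
# Predictions and true ratings for test rows 100-119, copied from the outputs above
predicted = [1848.67, 2535.32, 1673.89, 1942.82, 2652.46, 2008.38, 2082.22,
             1990.20, 1960.62, 2016.94, 2737.15, 2062.74, 2229.97, 2499.79,
             2275.63, 2121.70, 2119.75, 2300.05, 2212.27, 2084.63]
actual = [1880.39, 2554.83, 1671.34, 1960.45, 2718.97, 1995.55, 2066.20,
          1987.61, 1949.17, 2028.89, 2813.79, 2108.29, 2216.46, 2546.16,
          2301.43, 2137.92, 2132.76, 2298.64, 2193.26, 2080.42]

abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]
mae_points = sum(abs_errors) / len(abs_errors)   # mean absolute error in rating points
worst = max(abs_errors)                          # largest single miss in this slice

print(f"MAE over {len(abs_errors)} samples: {mae_points:.2f} rating points")
print(f"Largest error: {worst:.2f} rating points")
```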