# Using an Amazon SageMaker built-in algorithm to predict fraud

# Part 1: Preprocessing data
If you are interested in learning about preprocessing data, start here. Otherwise, skip to Part 2, where we load the data from npy files.

In [None]:
#imports
import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
import sagemaker.amazon.common as smac
import os
import sagemaker
from sagemaker.predictor import csv_serializer, json_deserializer
import re
from sagemaker import get_execution_role

### Add your name to the `name` variable. Don't change any other variable in this section.

In [None]:
name = "" # Can include hyphens (-), but not spaces
region = boto3.session.Session().region_name
bucket = 'sagemaker-eu-west-1-483308273948'
original_key = 'visa-kaggle/original.csv'
protocol="s3://"
datafile = 'data/original.csv'
prefix = name+ '/dataset'
 
# Define IAM role
role = get_execution_role()

sagemaker_session = sagemaker.Session()

### Data ingestion
1. Downloading the file locally
2. Loading the file into pandas for inspection
3. Converting the data to NumPy
4. Shuffling the data
5. Splitting the data into training and validation sets
6. Breaking up each data set into data and labels

In [None]:
#Downloading the file to a local folder
!mkdir -p data
client = boto3.client('s3')
with open(datafile, 'wb') as f:
    client.download_fileobj(bucket, original_key, f)

### Loading the data and understanding its format
Read the local CSV and understand its format.
Use pandas to load the file:

```python
df = pd.read_csv(datafile)
```

and read the first 5 lines using:

```python
df.head(5)
```

In [None]:
# loading data into pandas for inspection


Using pandas, we can get more info about the data.
```python 
df['Class'].count()
``` 
shows us that there are 284,807 records in this dataset
```python 
df[df['Class'] == 1]['Class'].count()
``` 
but only 492 of them are fraud
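
The two counts above imply a heavily imbalanced dataset. A quick calculation makes that explicit. The counts are hard-coded from the text so the snippet runs without the CSV:

```python
# Counts quoted in the text above.
total_records = 284807
fraud_records = 492

fraud_rate = fraud_records / total_records
print("fraud rate: {:.4%}".format(fraud_rate))

# A model that always predicts "not fraud" would already reach this accuracy,
# which is why accuracy alone says little on this dataset.
baseline_accuracy = 1 - fraud_rate
print("always-negative baseline accuracy: {:.4%}".format(baseline_accuracy))
```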

In [None]:
# Use the command above to examine the data


### Converting pandas to numpy
We'll convert the data to NumPy so that we can shuffle it, manipulate it, and extract the labels.

Run
```python
raw_data = df.values
```
to convert the DataFrame to a NumPy array.
```python
raw_data.shape
```
prints the dimensions of the matrix (rows, columns).
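
As a small illustration on made-up data, the sketch below converts a toy DataFrame to a NumPy array. The column names and values are placeholders. The point is that the column order is preserved, so a label column placed last stays last:

```python
import pandas as pd

# Toy frame: two feature columns followed by the label column 'Class'.
toy_df = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Amount": [149.62, 2.69, 378.66],
    "Class": [0, 0, 1],
})

toy_matrix = toy_df.values   # 2-D NumPy array, one row per record
print(toy_matrix.shape)      # (rows, columns)
print(toy_matrix.dtype)      # mixed int/float columns are upcast to a common float type
print(toy_matrix[:, -1])     # the last column is still the label
```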

In [None]:
# Converting Data Into Numpy


### Shuffling the data and splitting between data and labels
Using
```python
np.random.seed(123)
```
we configure the NumPy random generator to use a constant seed, so the shuffle is reproducible.

Shuffle the data in place:
```python
np.random.shuffle(raw_data)
```

and split off the last column as the label:

```python
label = raw_data[:, -1]
data = raw_data[:, :-1]
```

Let's make sure that both have the same number of records:
```python
print("label_shape = {}; data_shape= {}".format(label.shape, data.shape))
```
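
To see why it is safe to shuffle before separating the label, here is a minimal sketch on a toy matrix. `np.random.shuffle` permutes whole rows, so every label stays attached to its own features:

```python
import numpy as np

# Toy matrix: 5 rows, 3 feature columns plus a label column that encodes the row id.
toy = np.array([[i, i * 10, i * 100, i] for i in range(5)], dtype="float64")

np.random.seed(123)       # fixed seed so the permutation is reproducible
np.random.shuffle(toy)    # shuffles rows in place; each row stays intact

toy_label = toy[:, -1]    # last column
toy_data = toy[:, :-1]    # every column except the last

# Each label still matches the first feature of its own row.
rows_intact = bool(np.all(toy_data[:, 0] == toy_label))
print(rows_intact)
print(toy_label.shape, toy_data.shape)
```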

In [None]:
# Shuffling the data and splitting between data and label


### Splitting data into training and validation datasets
In this example we'll use 60% of the data for training and 40% for validation.

We can get the training dataset size using:
```python
train_size = int(data.shape[0]*0.6)
```

We'll split both the data and the labels into training and validation sets:
```python
train_data  = data[:train_size, :]
val_data = data[train_size:, :]

train_label = label[:train_size]
val_label = label[train_size:]
```

We'll verify the shapes:
```python
print("training data shape= {}; training label shape = {} \nValidation data shape= {}; validation label shape = {}".format(train_data.shape, train_label.shape,val_data.shape,val_label.shape))
```
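
Because fraud cases are rare, it can be worth checking that a plain positional split leaves positives in both parts. The sketch below does this on synthetic labels. `positive_share` is a hypothetical helper, not part of the notebook:

```python
import numpy as np

def positive_share(labels):
    """Return (number of positives, fraction of positives) for a 0/1 label vector."""
    labels = np.asarray(labels)
    positives = int(np.count_nonzero(labels))
    return positives, positives / float(len(labels))

# Synthetic labels: 1000 records, 10 positives, shuffled with a fixed seed.
rng = np.random.RandomState(123)
toy_labels = np.zeros(1000)
toy_labels[:10] = 1
rng.shuffle(toy_labels)

# Same 60/40 positional split as above.
toy_train_size = int(toy_labels.shape[0] * 0.6)
toy_train, toy_val = toy_labels[:toy_train_size], toy_labels[toy_train_size:]

print("train:", positive_share(toy_train))
print("validation:", positive_share(toy_val))
```

On the real data, the same helper could be applied to `train_label` and `val_label`.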

In [None]:
#Splitting data into validation and training and breaking dataset into data and label


# Part 2: Training
In this part we load the data from pre-processed files and train the model.
We'll start by creating the train and test sets:
```python
train_set = (train_data, train_label)
test_set = (val_data, val_label)
```

In [None]:
# Creating the train and test sets


# Data Conversion
Amazon algorithms support CSV and RecordIO-wrapped protobuf. RecordIO is faster than CSV, especially for algorithms that deal with sparse matrices.

```python
vectors = np.array([t.tolist() for t in train_set[0]]).astype('float32')
labels = np.array([t.tolist() for t in train_set[1]]).astype('float32')

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)
```
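
A minimal sketch of the dtype conversion on toy data, assuming the inputs are already NumPy arrays as in Part 1. It suggests that calling `astype('float32')` directly yields the same values as the round trip through Python lists, and shows that float32 halves the memory footprint:

```python
import numpy as np

toy_features = np.arange(12, dtype="float64").reshape(4, 3)

# Round trip through Python lists, as in the snippet above.
via_lists = np.array([row.tolist() for row in toy_features]).astype("float32")
# Direct conversion of the existing array.
direct = toy_features.astype("float32")

same_values = bool(np.array_equal(via_lists, direct))
print(same_values, direct.dtype)
print(toy_features.nbytes, direct.nbytes)  # float32 uses half the bytes of float64
```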

In [None]:
# Data Conversion


# Upload training data
Now that we've created our recordIO-wrapped protobuf, we'll need to upload it to S3, so that Amazon SageMaker training can use it.

Upload to S3:
```python
key = 'recordio-pb-data'
boto3.resource('s3').Bucket(sagemaker_session.default_bucket()).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(sagemaker_session.default_bucket(), prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))
```
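
The object key and the `s3://` URI above are built from the same parts. The sketch below isolates that step in a hypothetical helper, `build_s3_uri`, with placeholder bucket and prefix values. It uses `posixpath.join` so the key always contains forward slashes, which is a portability assumption rather than something the notebook requires:

```python
import posixpath

def build_s3_uri(bucket_name, key_prefix, channel, object_name):
    """Return (object key, s3:// URI) for a dataset channel such as 'train'."""
    object_key = posixpath.join(key_prefix, channel, object_name)
    return object_key, "s3://{}/{}".format(bucket_name, object_key)

# Placeholder values for illustration only.
example_key, example_uri = build_s3_uri(
    "example-bucket", "alice/dataset", "train", "recordio-pb-data")
print(example_key)
print(example_uri)
```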

In [None]:
# Upload training data


Let's also set up an S3 output location for the model artifact that training produces.
```python
output_location = 's3://{}/{}/output'.format(sagemaker_session.default_bucket(), prefix)
print('training artifacts will be uploaded to: {}'.format(output_location))
```

In [None]:
# Output location


# Training the model
At this point we use the linear learner from the Amazon built-in algorithms. The Docker image containing the algorithm is available in multiple regions. We take the following steps:
1. Define the containers.
2. Create an Estimator object and pass it the hyperparameters as well as the model location.
3. Run Estimator.fit to begin training the model.

SageMaker uses one of these prebuilt containers for the linear-learner built-in algorithm:
```python
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest'}
```
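
The dictionary only covers four regions, so `containers[region]` raises a bare `KeyError` anywhere else. One possible way to make that failure easier to read is sketched below. `image_for_region` is a hypothetical helper, and the two entries are copied from the dictionary above:

```python
def image_for_region(images_by_region, region_name):
    """Return the image URI for region_name, or raise a descriptive error."""
    try:
        return images_by_region[region_name]
    except KeyError:
        raise ValueError(
            "No linear-learner image configured for region '{}'. Known regions: {}".format(
                region_name, ", ".join(sorted(images_by_region))))

example_images = {
    "us-west-2": "174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest",
    "eu-west-1": "438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest",
}
print(image_for_region(example_images, "eu-west-1"))
```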

In [None]:
# Create the containers dictionary


Use the Estimator class to create a new SageMaker training job:
```python
sess = sagemaker.Session()
linear = sagemaker.estimator.Estimator(containers[region],
                                       role, # IAM role that lets the training job read the data and upload the model
                                       train_instance_count=1, #number of instances for training
                                       train_instance_type='ml.c4.xlarge', # type of training instance
                                       output_path=output_location, #S3 location for the trained model
                                       sagemaker_session=sess,
                                       base_job_name='linear-learner-' + name)
```
Set the hyperparameters:
```python
linear.set_hyperparameters(feature_dim=30, # dataset has 30 columns (features)
                           predictor_type='binary_classifier',
                           mini_batch_size=200)
```
Pass the S3 location of the training data to the training job and start training:
```python
linear.fit({'train': s3_train_data})
```
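
The `feature_dim` value passed to `set_hyperparameters` has to match the number of feature columns in the training data. A small sketch with a toy array, assuming the label has already been split off as in Part 1, derives the value from the data instead of hard-coding it:

```python
import numpy as np

# Toy stand-in for train_data: 8 records with 30 feature columns.
toy_train_data = np.zeros((8, 30), dtype="float32")

derived_feature_dim = toy_train_data.shape[1]
toy_hyperparameters = {
    "feature_dim": derived_feature_dim,      # taken from the data, not typed by hand
    "predictor_type": "binary_classifier",
    "mini_batch_size": 200,
}
print(toy_hyperparameters)
```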

In [None]:
# Start the training job


# Hosting the model
We use SageMaker to host the live model by calling `deploy` on the estimator we defined previously. This creates an endpoint that serves the model from a Docker container on the chosen instance type, and the endpoint supports autoscaling.

```python
linear_predictor = linear.deploy(initial_instance_count=1, #Initial number of instances. 
                                                           #Autoscaling can increase the number of instances.
                                 instance_type='ml.m4.xlarge',# instance type
                                 name='linear-learner-' + name)
```

In [None]:
# Host a SageMaker endpoint for prediction


# Prediction
`deploy` returns a live endpoint (`linear_predictor`). Predictors in SageMaker accept CSV and JSON. In this case we serialize requests as CSV and deserialize the JSON responses.

Configure the prediction endpoint:
```python
linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer
```
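
To illustrate what travels in each direction, the sketch below turns a small array into CSV text and reads labels out of a mock JSON-style response. It only illustrates the formats. `rows_to_csv` is a hypothetical helper, not the SDK's serializer, and the mock response contains only the fields this notebook reads later on:

```python
import numpy as np

def rows_to_csv(rows):
    """Serialize a 1-D or 2-D array to CSV text, one record per line."""
    records = np.atleast_2d(rows).tolist()
    return "\n".join(",".join(str(value) for value in record) for record in records)

toy_rows = np.array([[0.0, -1.5, 149.62], [1.0, 1.25, 2.69]])
csv_payload = rows_to_csv(toy_rows)
print(csv_payload)

# Mock response (an assumption for illustration), already parsed from JSON.
mock_response = {"predictions": [{"predicted_label": 0.0}, {"predicted_label": 1.0}]}
mock_labels = [p["predicted_label"] for p in mock_response["predictions"]]
print(mock_labels)
```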

Predict a single item:
```python
print(train_set[0][48:49])
print("The data Actual label: " + str(train_set[1][48:49][0]))
linear_predictor.predict(train_set[0][48:49])
```

Create a confusion matrix from the validation data:
```python
non_zero = np.count_nonzero(test_set[1])
zero = len(test_set[1]) - non_zero
print("validation set includes: {} non zero and {} items with value zero".format(non_zero, zero))

predictions = []
for array in np.array_split(test_set[0], 100):
    result = linear_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)

import pandas as pd

pd.crosstab(test_set[1], predictions, rownames=['actuals'], colnames=['predictions'])
```

The confusion matrix above indicates that:
- of 162 fraudulent cases, we detected 143 correctly
- 39 non-fraudulent transactions were flagged as fraud, out of a total of 113761 non-fraudulent transactions.
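
The counts above translate directly into the usual classification metrics. The sketch below reads 113761 as the number of non-fraudulent validation transactions, which is an interpretation of the text. It is consistent with the validation set size, since 113761 + 162 = 113923 is 40% of the 284,807 records:

```python
# Counts quoted in the text above.
actual_fraud = 162
true_positives = 143          # fraud correctly flagged
false_positives = 39          # legitimate transactions flagged as fraud
actual_legitimate = 113761    # assumed to be the non-fraud count (see note above)

false_negatives = actual_fraud - true_positives       # fraud that was missed
true_negatives = actual_legitimate - false_positives

recall = true_positives / actual_fraud
precision = true_positives / (true_positives + false_positives)
false_positive_rate = false_positives / actual_legitimate

print("recall: {:.4f}".format(recall))
print("precision: {:.4f}".format(precision))
print("false positive rate: {:.6f}".format(false_positive_rate))
```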


# Delete the endpoint
When you're done with this notebook, run the delete_endpoint line in the cell below. This removes the hosted endpoint you created and avoids charges from a stray instance being left running.

In [None]:
import sagemaker

sagemaker.Session().delete_endpoint(linear_predictor.endpoint)