# Zero to AI Hero ft. SageMaker

Welcome to the WWC Connect 2019 SageMaker workshop!

In this notebook, we will show you how to train and deploy a machine learning model on SageMaker.

We will look at a breast cancer dataset and use SageMaker's Linear Learner algorithm to classify tumors as either malignant or benign.

The dataset consists of historical, anonymized patient data from the US and contains information on 10 different attributes of patient tumors.

(classification 1= Benign, 2= Malignant)

In [26]:
import numpy as np
import pandas as pd

import boto3
import re

import sagemaker
from sagemaker import get_execution_role
from sklearn.preprocessing import LabelEncoder
from sagemaker.predictor import csv_serializer

## Setting up your bucket
Amazon Simple Storage Service (S3) provides object storage and is designed to make web-scale computing easier for developers.

S3 allows users to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.

For more information on S3, see https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html


In [3]:
# Specify your bucket name
bucket_name = 'udemy-workshop-chloe'

training_folder = r'wwc-workshop/data/'
validation_folder = r'wwc-workshop/validation/'
# Keep the test file under its own prefix so it is not picked up by the
# validation channel, which reads everything under the validation prefix
test_folder = r'wwc-workshop/test/'

s3_model_output_location = r's3://{0}/wwc-workshop/model'.format(bucket_name)
s3_training_file_location = r's3://{0}/{1}'.format(bucket_name,training_folder)
s3_validation_file_location = r's3://{0}/{1}'.format(bucket_name,validation_folder)
s3_test_file_location = r's3://{0}/{1}'.format(bucket_name,test_folder)

In [4]:
print(s3_model_output_location)
print(s3_training_file_location)

s3://udemy-workshop-chloe/wwc-workshop/model
s3://udemy-workshop-chloe/wwc-workshop/data/
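
The string formatting above can also be wrapped in a small helper. This is only a sketch: `s3_uri` is a hypothetical function name, not part of boto3 or the SageMaker SDK, and it does nothing more than join a bucket name and a key prefix.

```python
def s3_uri(bucket: str, key_prefix: str) -> str:
    """Join a bucket name and a key prefix into an s3:// URI."""
    if not bucket or "/" in bucket:
        raise ValueError(f"invalid bucket name: {bucket!r}")
    # S3 keys are not filesystem paths: strip a leading slash so the
    # result never contains 's3://bucket//...'.
    return f"s3://{bucket}/{key_prefix.lstrip('/')}"


bucket_name = "udemy-workshop-chloe"
prefixes = {
    "model": "wwc-workshop/model",
    "train": "wwc-workshop/data/",
    "validation": "wwc-workshop/validation/",
}
locations = {name: s3_uri(bucket_name, prefix) for name, prefix in prefixes.items()}

for name, uri in locations.items():
    print(f"{name}: {uri}")
```

Keeping the prefixes in one dictionary makes it harder to assign two locations to the same variable by accident.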


## Data
We have already pre-processed the data for this workshop and split it into 3 files, which you have downloaded:
1. train.csv
2. validation.csv
3. test.csv

If you would like to see how we did this, the code can be found in the Data-cleaning.ipynb file.
For SageMaker to use these files, we need to upload them to the S3 bucket that we have created, using the following code.


In [6]:
# Writing to and reading from S3 is straightforward.
# Files are referred to as objects in S3.
# A file name is referred to as a key name in S3.

# Objects in the S3 Standard storage class are automatically replicated across
# at least three Availability Zones in the region where the bucket was created.

# http://boto3.readthedocs.io/en/latest/guide/s3.html
def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)

In [7]:
write_to_s3('train.csv', 
            bucket_name,
            training_folder + 'train.csv')

In [8]:
write_to_s3('validation.csv',
            bucket_name,
            validation_folder + 'validation.csv')

In [None]:
write_to_s3('test.csv',
            bucket_name,
            test_folder + 'test.csv')
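
The three upload cells repeat the same pattern, so a minimal sketch of a table-driven version may help. `build_upload_plan` and `upload_all` are hypothetical helper names introduced here for illustration. Only the plan is built below; `upload_all` needs boto3 and valid AWS credentials and is therefore defined but not called.

```python
import posixpath


def build_upload_plan(files_by_prefix):
    """Return (local_filename, s3_key) pairs. S3 keys always use forward slashes."""
    plan = []
    for prefix, filenames in files_by_prefix.items():
        for filename in filenames:
            key = posixpath.join(prefix, posixpath.basename(filename))
            plan.append((filename, key))
    return plan


def upload_all(plan, bucket):
    """Upload every planned file to the bucket (requires boto3 and credentials)."""
    import boto3  # imported lazily so the plan can be built without AWS access

    s3 = boto3.client("s3")
    for filename, key in plan:
        s3.upload_file(filename, bucket, key)


plan = build_upload_plan({
    "wwc-workshop/data/": ["train.csv"],
    "wwc-workshop/validation/": ["validation.csv"],
})
for filename, key in plan:
    print(f"{filename} -> {key}")
```

Printing the plan before uploading is a cheap way to confirm that each file lands under the prefix you expect.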

## Training Algorithm Docker Image
SageMaker maintains a separate image for each algorithm and region:
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

In [90]:
# Establish a session with AWS
sess = sagemaker.Session()

In [91]:
role = get_execution_role()

In [92]:
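# The SageMaker SDK maintains the algorithm-to-container mapping for us
# Specify the region, algorithm and version
# (this is the SageMaker Python SDK v1 call; in SDK v2 the equivalent
# function is sagemaker.image_uris.retrieve)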
container = sagemaker.amazon.amazon_estimator.get_image_uri(
    sess.boto_region_name,
    "linear-learner", 
    "latest")

print('Using SageMaker linear-learner container:\n{} ({})'.format(container, sess.boto_region_name))

Using SageMaker linear-learner container:
644912444149.dkr.ecr.eu-west-2.amazonaws.com/linear-learner:latest (eu-west-2)
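
The image URI printed above encodes the registry account, the region, the algorithm repository and the tag. The sketch below pulls those parts out with a regular expression. `parse_image_uri` is a hypothetical helper, and the pattern assumes the `amazonaws.com` domain seen in this output; other AWS partitions use different domains.

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_image_uri(uri):
    """Split an ECR image URI into account, region, repository and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_image_uri(
    "644912444149.dkr.ecr.eu-west-2.amazonaws.com/linear-learner:latest"
)
print(parts)
```

Because the region is part of the URI, a container looked up in one region cannot simply be reused for a training job in another.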


In [99]:
# Configure the training job
# Specify the type and number of instances to use
# and the S3 location where the final model artifacts need to be stored

#   Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html

estimator = sagemaker.estimator.Estimator(
    container,
    role, 
    train_instance_count=1, 
    train_instance_type='ml.m4.xlarge',
    output_path=s3_model_output_location,
    sagemaker_session=sess,
    base_job_name='linear-workshop-v5-1')

In [100]:
estimator.set_hyperparameters(feature_dim=9,
                              predictor_type='binary_classifier',
                              mini_batch_size=200)

In [101]:
estimator.hyperparameters()

{'feature_dim': 9,
 'predictor_type': 'binary_classifier',
 'mini_batch_size': 200}
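
`feature_dim` has to match the number of feature columns in the training data. A minimal sketch of a sanity check follows, assuming the training CSV has no header row and carries the label in its first column (the layout SageMaker's built-in algorithms expect for `text/csv` training input). `infer_feature_dim` is a hypothetical helper, and the two sample rows are taken from the test data shown later in this notebook.

```python
import csv
import io


def infer_feature_dim(csv_text):
    """Return the number of feature columns, treating column 0 as the label."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("no data rows found")
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError(f"rows have differing column counts: {sorted(widths)}")
    return widths.pop() - 1  # subtract the label column


# Assumed layout: label first, then the nine tumor attributes.
sample = "0,5,1,1,1,2,1,2,1,1\n1,10,5,7,3,3,7,3,3,8\n"
print(infer_feature_dim(sample))  # 9
```

Running a check like this on the first few lines of the training file before calling `fit` can catch a stray index or header column early.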

## Specify Training Data Location and, Optionally, Validation Data Location

In [102]:
training_input_config = sagemaker.session.s3_input(
    s3_data=s3_training_file_location,
    content_type='text/csv',
    s3_data_type='S3Prefix')

validation_input_config = sagemaker.session.s3_input(
    s3_data=s3_validation_file_location,
    content_type='text/csv',
    s3_data_type='S3Prefix'
)

data_channels = {'train': training_input_config, 'validation': validation_input_config}

In [103]:
print(training_input_config.config)
print(validation_input_config.config)

{'DataSource': {'S3DataSource': {'S3DataDistributionType': 'FullyReplicated', 'S3DataType': 'S3Prefix', 'S3Uri': 's3://udemy-workshop-chloe/wwc-workshop/data/'}}, 'ContentType': 'text/csv'}
{'DataSource': {'S3DataSource': {'S3DataDistributionType': 'FullyReplicated', 'S3DataType': 'S3Prefix', 'S3Uri': 's3://udemy-workshop-chloe/wwc-workshop/validation/'}}, 'ContentType': 'text/csv'}
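
The printed `config` is a plain nested dictionary. To make its structure easier to read, here is a sketch that builds the same dictionary by hand. `channel_config` is a hypothetical function written for illustration; it mirrors the output above and is not how the SDK constructs the value internally.

```python
def channel_config(s3_uri, content_type="text/csv",
                   distribution="FullyReplicated", s3_data_type="S3Prefix"):
    """Build a channel description shaped like the printed s3_input config."""
    return {
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": distribution,
                "S3DataType": s3_data_type,
                "S3Uri": s3_uri,
            }
        },
        "ContentType": content_type,
    }


train_channel = channel_config("s3://udemy-workshop-chloe/wwc-workshop/data/")
print(train_channel)
```

With `S3Prefix`, every object under the given prefix is part of the channel, so each prefix should contain only the files meant for that channel.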


## Train the model

In [104]:
estimator.fit(data_channels)

2019-11-25 20:32:59 Starting - Starting the training job...
2019-11-25 20:33:00 Starting - Launching requested ML instances...
2019-11-25 20:33:55 Starting - Preparing the instances for training......
2019-11-25 20:34:51 Downloading - Downloading input data...
2019-11-25 20:35:24 Training - Downloading the training image...
2019-11-25 20:35:56 Uploading - Uploading generated training model
Docker entrypoint called with argument(s): train
[11/25/2019 20:35:47 INFO 139924829296448] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'


2019-11-25 20:36:03 Completed - Training job completed
Training seconds: 72
Billable seconds: 72
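
The status lines in this log make it possible to see where the wall-clock time went. The sketch below parses the timestamps and measures the gap between phases; `first_seen_by_phase` and `seconds_between` are hypothetical helpers, and the log text is copied from the output above.

```python
from datetime import datetime

STATUS_LOG = """\
2019-11-25 20:32:59 Starting - Starting the training job...
2019-11-25 20:33:00 Starting - Launching requested ML instances...
2019-11-25 20:33:55 Starting - Preparing the instances for training......
2019-11-25 20:34:51 Downloading - Downloading input data...
2019-11-25 20:35:24 Training - Downloading the training image...
2019-11-25 20:35:56 Uploading - Uploading generated training model
2019-11-25 20:36:03 Completed - Training job completed
"""


def first_seen_by_phase(log_text):
    """Map each phase name to the timestamp of its first status line."""
    first_seen = {}
    for line in log_text.splitlines():
        timestamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        phase = line[20:].split(" - ", 1)[0]
        first_seen.setdefault(phase, timestamp)
    return first_seen


def seconds_between(first_seen, start_phase, end_phase):
    """Whole seconds elapsed between the first lines of two phases."""
    return int((first_seen[end_phase] - first_seen[start_phase]).total_seconds())


phases = first_seen_by_phase(STATUS_LOG)
print(seconds_between(phases, "Starting", "Completed"))     # 184
print(seconds_between(phases, "Downloading", "Completed"))  # 72
```

In this run the whole job took 184 seconds of wall-clock time, and the span from the first `Downloading` line to `Completed` is 72 seconds, the same as the reported training seconds. This single log does not establish that the figure is always computed that way.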


## Deploy Model

In [105]:
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge',
                             endpoint_name='linear-workshop-v5-1')

-------------------------------------------------------------------------------------!

## Run Predictions

In [123]:
test = pd.read_csv('test.csv')

In [127]:
test.head()

Unnamed: 0,diagnosis,clump_thickness,uniformity_of_cell_size,uniformity_of_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses
0,0,5,1,1,1,2,1,2,1,1
1,0,1,1,1,1,2,1,3,1,1
2,0,3,1,1,1,2,5,5,1,1
3,1,10,5,7,3,3,7,3,3,8
4,0,3,1,1,1,1,1,2,1,1


In [129]:
testLabels = test['diagnosis']

In [132]:
testData = test.iloc[:, 1:]

In [148]:
testData = testData.to_numpy()
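
Two small pieces of glue are useful around an endpoint call: turning a feature row into a CSV payload, and reading the JSON that comes back. The sketch below shows both without contacting the endpoint. `row_to_csv` and `parse_prediction` are hypothetical helpers, the use of a `text/csv` payload is an assumption about how you choose to call the endpoint, and the response string is copied from the prediction output in this notebook.

```python
import json


def row_to_csv(row):
    """Serialise one feature row as a single comma-separated line."""
    return ",".join(str(value) for value in row)


def parse_prediction(response_text):
    """Extract (predicted_label, score) from a single-row JSON response."""
    body = json.loads(response_text)
    first = body["predictions"][0]
    return int(first["predicted_label"]), first["score"]


payload = row_to_csv([5, 1, 1, 1, 2, 1, 2, 1, 1])
label, score = parse_prediction(
    '{"predictions": [{"score": 0.33217474818229675, "predicted_label": 0.0}]}'
)
print(payload)
print(label, score)
```

Parsing the response into numbers makes it straightforward to compare predicted labels with the actual ones instead of reading them off the printed output.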

In [158]:
# Walk through the feature rows and their labels together
for data, label in zip(testData, testLabels):
    print(data)
    print("Actual = " + str(label))
    prediction = predictor.predict(data).decode('utf-8')
    print(prediction)
    

[5 1 1 1 2 1 2 1 1]
Actual = 0
{"predictions": [{"score": 0.33217474818229675, "predicted_label": 0.0}]}
[1 1 1 1 2 1 3 1 1]
Actual = 0
{"predictions": [{"score": 0.3345429301261902, "predicted_label": 1.0}]}
[3 1 1 1 2 5 5 1 1]
Actual = 0
{"predictions": [{"score": 0.3365776836872101, "predicted_label": 1.0}]}
[10  5  7  3  3  7  3  3  8]
Actual = 1
{"predictions": [{"score": 0.33779898285865784, "predicted_label": 1.0}]}
[3 1 1 1 1 1 2 1 1]
Actual = 0
{"predictions": [{"score": 0.332842618227005, "predicted_label": 0.0}]}
[ 5  5  5  2  5 10  4  3  1]
Actual = 1
{"predictions": [{"score": 0.3437749147415161, "predicted_label": 1.0}]}
[5 1 1 1 2 1 1 1 1]
Actual = 0
{"predictions": [{"score": 0.3319191038608551, "predicted_label": 0.0}]}
[4 1 1 1 2 1 1 2 1]
Actual = 0
{"predictions": [{"score": 0.3324126899242401, "predicted_label": 0.0}]}
[5 2 1 1 2 1 3 1 1]
Actual = 0
{"predictions": [{"score": 0.333195298910141, "predicted_label": 1.0}]}
[1 1 1 1 2 1 1 1 1]
Actual = 0
{"predictions":