# AWS SageMaker Training and Deploying
## Cyclone Kenneth 2019-04-25
### Part II

In this Part II notebook, we will upload the training data that we generated in the previous notebook to AWS S3. We will then kick off an AWS SageMaker object detection training job and monitor the results. At the end of this notebook, you will have trained your own OSM-based CNN object detector!

![](assets/happycloud.png)



## Neural Network (Super simplified)

We have a bunch of stacked 'neurons', which are mathematical functions with weights.

The number of neurons and how they are connected to each other defines an 'architecture'.

We have a loss function that is checked iteratively to assess whether the neurons (and their weights) are trending toward 'good': do the predictions align with the truth (this is the validation data)?

Weights are (typically) initialized randomly, so the net starts out pretty dumb. Learning is achieved through the iterative process of training on many examples, which improves the weights.
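To make the loop above concrete, here is a minimal sketch in plain Python. It is not the SageMaker algorithm: it trains a single linear 'neuron' with randomly initialized weights by gradient descent on a mean-squared-error loss. The toy data, learning rate, and function names are all illustrative assumptions.

```python
import random

def predict(w, b, x):
    """A single linear 'neuron': weighted input plus a bias."""
    return w * x + b

def mse_loss(w, b, data):
    """Mean squared error between the predictions and the truth."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def train(data, lr=0.05, epochs=500, seed=0):
    """Start from random weights and repeatedly step against the loss gradient."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    history = [mse_loss(w, b, data)]
    for _ in range(epochs):
        # Gradients of the MSE loss with respect to w and b
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
        history.append(mse_loss(w, b, data))
    return w, b, history

# Toy 'truth' generated from y = 2x + 1
data = [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0, 3.0)]
w, b, history = train(data)
print(f"loss {history[0]:.4f} -> {history[-1]:.8f}, w={w:.3f}, b={b:.3f}")
```

The loss shrinks as the weights move toward the values that generated the data, which is the same idea the object detector applies at a much larger scale.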

### Goal: Minimize the loss function!

A couple of things worth noting:

🤔 ML models are not super useful unless they are scaled across a large amount of data

🤔 To effectively scale across data, you need to be efficient

🤔 Because we will be passing sensitive data to this notebook in order to scale our cloud compute through SageMaker, we will use papermill to run this notebook from within Python. It creates a simple wrapper around the notebook so that we can specify variables.

e.g.

``` python
import papermill as pm

pm.execute_notebook(
    'osm_ml_training_pt2.ipynb',
    'osm_ml_training_pt2_out.ipynb',
    parameters=dict(sage_bucket='', my_bucket='', role=''),
)
```

In [1]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

We will use 'papermill' (https://github.com/nteract/papermill) to pass sensitive variables to this Jupyter notebook. Things like passwords and cloud locations should be parameterized as a best practice and never stored in a repo (especially a public-facing one).

You will need to run `aws configure --profile uw` and enter the credentials I give you.
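One way to keep those values out of the repo is to read them from environment variables just before calling papermill. The sketch below is an assumption, not part of the workshop code: the `OSM_ML_` prefix and the `notebook_parameters` helper are hypothetical names chosen for illustration.

```python
import os

# Parameter names expected by this notebook's parameters cell
REQUIRED = ("sage_bucket", "my_bucket", "role", "ACCESS_KEY", "SECRET_KEY")

def notebook_parameters(env=os.environ, prefix="OSM_ML_"):
    """Collect papermill parameters from environment variables.

    Raises KeyError naming every missing variable, so that nothing is
    silently passed through as an empty string.
    """
    missing = [prefix + name.upper() for name in REQUIRED
               if not env.get(prefix + name.upper())]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: env[prefix + name.upper()] for name in REQUIRED}

# Usage (requires papermill and the variables to be set):
# import papermill as pm
# pm.execute_notebook('osm_ml_training_pt2.ipynb',
#                     'osm_ml_training_pt2_out.ipynb',
#                     parameters=notebook_parameters())
```

Note that papermill writes the injected parameters into the output notebook, so the executed copy should be treated as sensitive too.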

In [2]:
ACCESS_KEY = ''
SECRET_KEY = ''
sage_bucket = ''          # the 'top-level' S3 bucket, in which you will have a team data folder
my_bucket = ''            # the 'folder' where your SageMaker data lives
prefix = my_bucket        # your model prefix
sessname = ''
nclass = 1                # number of object classes
epochs = 50               # number of passes over the training data
mini_batch_size = 2       # number of samples per training batch
lr = 0.001                # learning rate
lr_scheduler_factor = 0.1
overlap = 0.5
momentum = 0.45           # SGD momentum (the algorithm default, per the training log, is 0.9)
weight_decay = 0.0005
nms_thresh = 0.45
image_shape = 256
label_width = 600
n_train_samples = 16551   # should match the number of samples in train.rec
network = 'resnet-50'
optim = 'sgd'             # stochastic gradient descent: an iterative method for optimizing an objective function
role = ''
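Two of these values, `overlap` and `nms_thresh`, are thresholds on how much two bounding boxes overlap. A minimal sketch of the underlying measure, intersection over union (IoU), and of greedy non-maximum suppression follows. It assumes boxes given as `(xmin, ymin, xmax, ymax)` and is an illustration only; the built-in algorithm's exact implementation is not shown in this notebook.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, thresh=0.45):
    """Greedy non-maximum suppression over (score, box) pairs.

    Boxes are visited from highest to lowest score; a box is dropped when
    its IoU with an already kept box exceeds `thresh`.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept

# Made-up boxes for illustration: the second nearly duplicates the first
dets = [
    (0.9, (0.10, 0.10, 0.50, 0.50)),
    (0.8, (0.12, 0.12, 0.52, 0.52)),
    (0.7, (0.60, 0.60, 0.90, 0.90)),
]
kept = nms(dets)
print([score for score, _ in kept])
```

With these made-up boxes, the near-duplicate is suppressed and the two non-overlapping detections remain.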

In [3]:
# Parameters
sage_bucket = "eagleview-data"
my_bucket = "team_echidna"
role = "arn:aws:iam::649760770673:role/service-role/AmazonSageMaker-ExecutionRole-20190910T173949"
ACCESS_KEY = "<redacted>"
SECRET_KEY = "<redacted>"


In [4]:
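import boto3

# Session bound to the 'uw' profile created with `aws configure --profile uw`
my_east_sesison = boto3.Session(region_name='us-east-2', profile_name='uw')
# S3 client that authenticates with the keys passed in as parameters
s3_client = my_east_sesison.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)
s3 = my_east_sesison.resource('s3')

# upload_file(local_path, bucket, key): push the validation and training records to S3
s3_client.upload_file('rec/val.rec', sage_bucket, my_bucket+'/validation/val.rec')
s3_client.upload_file('rec/train.rec', sage_bucket, my_bucket+'/train/train.rec')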

In [5]:
sess = sagemaker.Session(boto_session=my_east_sesison)
training_image = get_image_uri(sess.boto_region_name, 'object-detection', repo_version="latest")


In [6]:
s3_train_data = 's3://{}/{}'.format(sage_bucket, my_bucket+'/train/')
s3_validation_data = 's3://{}/{}'.format(sage_bucket, my_bucket+'/validation/')

s3_output_location = 's3://{}/{}/output'.format(sage_bucket, my_bucket)

od_model = sagemaker.estimator.Estimator(training_image,
                                         role,
                                         train_instance_count=1,
                                         train_instance_type='ml.p2.xlarge',  # GPU instance
                                         train_volume_size=50,                # attached storage in GB
                                         train_max_run=360000,                # maximum runtime in seconds (100 hours)
                                         input_mode='File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

od_model.set_hyperparameters(base_network=network,
                             use_pretrained_model=1,
                             num_classes=nclass,
                             mini_batch_size=mini_batch_size,
                             epochs=epochs,
                             learning_rate=lr,
                             lr_scheduler_step='3,6',
                             lr_scheduler_factor=lr_scheduler_factor,
                             optimizer=optim,
                             momentum=momentum,
                             weight_decay=weight_decay,
                             overlap_threshold=overlap,
                             nms_threshold=nms_thresh,
                             image_shape=image_shape,   
                             label_width=label_width,		
                             num_training_samples=n_train_samples)

train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='application/x-recordio', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='application/x-recordio', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}
od_model.fit(inputs=data_channels, logs=True)    

  

2019-09-11 05:34:56 Starting - Starting the training job...
2019-09-11 05:34:57 Starting - Launching requested ML instances......
2019-09-11 05:36:19 Starting - Preparing the instances for training......
2019-09-11 05:37:20 Downloading - Downloading input data...
2019-09-11 05:37:47 Training - Downloading the training image...
2019-09-11 05:38:34 Training - Training image download completed. Training in progress..

Docker entrypoint called with argument(s): train
[09/11/2019 05:38:37 INFO 140277027632960] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'label_width': u'350', u'early_stopping_min_epochs': u'10', u'epochs': u'30', u'overlap_threshold': u'0.5', u'lr_scheduler_factor': u'0.1', u'_num_kv_servers': u'auto', u'weight_decay': u'0.0005', u'mini_batch_size': u'32', u'use_pretrained_model': u'0', u'freeze_layer_pattern': u'', u'lr_scheduler_step': u'', u'early_stopping': u'False', u'early_stopping_patience': u'5', u'momentum': u'0.9', u'num_training_samples': u'', u'optimizer': u'sgd', u'_tuning_objective_metric': u'', u'early_stopping_tolerance': u'0.0', u'learning_rate': u'0.001', u'kv_store': u'device', u'nms_threshold': u'0.45', u'num_classes': u'', u'base_network': u'vgg-16', u'nms_topk': u'400', u'_kvstore': u'device', u'image_shape': u'300'}
[09/11/2019 05:38:37 INFO 140277027632960] Merging with provid

[09/11/2019 05:38:54 INFO 140277027632960] #quality_metric: host=algo-1, epoch=0, batch=10 train cross_entropy <loss>=(0.9417682073812569)
[09/11/2019 05:38:54 INFO 140277027632960] #quality_metric: host=algo-1, epoch=0, batch=10 train smooth_l1 <loss>=(1.3592917940257925)
[09/11/2019 05:38:54 INFO 140277027632960] Round of batches complete
[09/11/2019 05:38:54 INFO 140277027632960] Updated the metrics
[09/11/2019 05:38:54 INFO 140277027632960] #quality_metric: host=algo-1, epoch=0, validation mAP <score>=(0.007122408475943063)
[09/11/2019 05:38:54 INFO 140277027632960] Updating the best model with validation-mAP=0.007122408475943063
[09/11/2019 05:38:54 INFO 140277027632960] Saved checkpoint to "/opt/ml/model/model_algo_1-0000.params"
[09/11/2019 05:38:54 INFO 140277027632960] #progress_metric: host=algo-1, completed 2 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0,

[09/11/2019 05:38:59 INFO 140277027632960] #quality_metric: host=algo-1, epoch=3, batch=9 train cross_entropy <loss>=(0.8111637453489666)
[09/11/2019 05:38:59 INFO 140277027632960] #quality_metric: host=algo-1, epoch=3, batch=9 train smooth_l1 <loss>=(1.3640248503866075)
[09/11/2019 05:38:59 INFO 140277027632960] Round of batches complete
[09/11/2019 05:38:59 INFO 140277027632960] Updated the metrics
[09/11/2019 05:39:00 INFO 140277027632960] #quality_metric: host=algo-1, epoch=3, validation mAP <score>=(0.02092431031130315)
[09/11/2019 05:39:00 INFO 140277027632960] #progress_metric: host=algo-1, completed 8 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "max":

[09/11/2019 05:39:03 INFO 140277027632960] #quality_metric: host=algo-1, epoch=5, validation mAP <score>=(0.009521666973255)
[09/11/2019 05:39:03 INFO 140277027632960] #progress_metric: host=algo-1, completed 12 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Records Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Max Records Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Reset Count": {"count": 1, "max": 6, "sum": 6.0, "min": 6}}, "EndTime": 1568180343.700893, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/Object Detection", "epoch": 5}, "StartTime": 1568180342.076891}

[09/11/2019 05:39:13 INFO 140277027632960] #quality_metric: host=algo-1, epoch=11, batch=9 train cross_entropy <loss>=(0.7663869636574971)
[09/11/2019 05:39:13 INFO 140277027632960] #quality_metric: host=algo-1, epoch=11, batch=9 train smooth_l1 <loss>=(0.6520126701630268)
[09/11/2019 05:39:13 INFO 140277027632960] Round of batches complete
[09/11/2019 05:39:13 INFO 140277027632960] Updated the metrics
[09/11/2019 05:39:14 INFO 140277027632960] #quality_metric: host=algo-1, epoch=11, validation mAP <score>=(0.016914574133555422)
[09/11/2019 05:39:14 INFO 140277027632960] #progress_metric: host=algo-1, completed 24 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "

[09/11/2019 05:39:24 INFO 140277027632960] #quality_metric: host=algo-1, epoch=17, batch=9 train cross_entropy <loss>=(0.7644274594291808)
[09/11/2019 05:39:24 INFO 140277027632960] #quality_metric: host=algo-1, epoch=17, batch=9 train smooth_l1 <loss>=(0.7978552674490308)
[09/11/2019 05:39:24 INFO 140277027632960] Round of batches complete
[09/11/2019 05:39:24 INFO 140277027632960] Updated the metrics
[09/11/2019 05:39:24 INFO 140277027632960] #quality_metric: host=algo-1, epoch=17, validation mAP <score>=(0.008188251692166889)
[09/11/2019 05:39:24 INFO 140277027632960] #progress_metric: host=algo-1, completed 36 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "

[09/11/2019 05:39:39 INFO 140277027632960] #quality_metric: host=algo-1, epoch=26, batch=10 train cross_entropy <loss>=(0.7620755615880934)
[09/11/2019 05:39:39 INFO 140277027632960] #quality_metric: host=algo-1, epoch=26, batch=10 train smooth_l1 <loss>=(0.9095835604910123)
[09/11/2019 05:39:39 INFO 140277027632960] Round of batches complete
[09/11/2019 05:39:40 INFO 140277027632960] Updated the metrics
[09/11/2019 05:39:40 INFO 140277027632960] #quality_metric: host=algo-1, epoch=26, validation mAP <score>=(0.015967914094507214)
[09/11/2019 05:39:40 INFO 140277027632960] #progress_metric: host=algo-1, completed 54 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1,

[09/11/2019 05:39:50 INFO 140277027632960] #quality_metric: host=algo-1, epoch=32, batch=10 train cross_entropy <loss>=(0.7492496317083185)
[09/11/2019 05:39:50 INFO 140277027632960] #quality_metric: host=algo-1, epoch=32, batch=10 train smooth_l1 <loss>=(0.7660203472641874)
[09/11/2019 05:39:50 INFO 140277027632960] Round of batches complete
[09/11/2019 05:39:50 INFO 140277027632960] Updated the metrics
[09/11/2019 05:39:50 INFO 140277027632960] #quality_metric: host=algo-1, epoch=32, validation mAP <score>=(0.03994922693949471)
[09/11/2019 05:39:50 INFO 140277027632960] #progress_metric: host=algo-1, completed 66 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1,

[09/11/2019 05:40:00 INFO 140277027632960] #quality_metric: host=algo-1, epoch=38, batch=10 train cross_entropy <loss>=(0.7530057083847176)
[09/11/2019 05:40:00 INFO 140277027632960] #quality_metric: host=algo-1, epoch=38, batch=10 train smooth_l1 <loss>=(0.8252594185690595)
[09/11/2019 05:40:00 INFO 140277027632960] Round of batches complete
[09/11/2019 05:40:00 INFO 140277027632960] Updated the metrics
[09/11/2019 05:40:01 INFO 140277027632960] #quality_metric: host=algo-1, epoch=38, validation mAP <score>=(0.029008919811766672)
[09/11/2019 05:40:01 INFO 140277027632960] #progress_metric: host=algo-1, completed 78 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1,


2019-09-11 05:40:22 Uploading - Uploading generated training model

[09/11/2019 05:40:10 INFO 140277027632960] #quality_metric: host=algo-1, epoch=44, batch=10 train cross_entropy <loss>=(0.7405127572341704)
[09/11/2019 05:40:10 INFO 140277027632960] #quality_metric: host=algo-1, epoch=44, batch=10 train smooth_l1 <loss>=(0.8364117917880206)
[09/11/2019 05:40:10 INFO 140277027632960] Round of batches complete
[09/11/2019 05:40:11 INFO 140277027632960] Updated the metrics
[09/11/2019 05:40:11 INFO 140277027632960] #quality_metric: host=algo-1, epoch=44, validation mAP <score>=(0.01968443735513238)
[09/11/2019 05:40:11 INFO 140277027632960] #progress_metric: host=algo-1, completed 90 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1,

[09/11/2019 05:40:20 INFO 140277027632960] Saved checkpoint to "/opt/ml/model/model_algo_1-0000.params"
[09/11/2019 05:40:20 INFO 140277027632960] Test data is not provided.
#metrics {"Metrics": {"epochs": {"count": 1, "max": 50, "sum": 50.0, "min": 50}, "totaltime": {"count": 1, "max": 102302.70314216614, "sum": 102302.70314216614, "min": 102302.70314216614}, "setuptime": {"count": 1, "max": 12.18414306640625, "sum": 12.18414306640625, "min": 12.18414306640625}}, "EndTime": 1568180420.083149, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "AWS/Object Detection"}, "StartTime": 1568180317.861361}



2019-09-11 05:40:43 Completed - Training job completed


Training seconds: 203
Billable seconds: 203
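Scanning this log by eye gets tedious, so here is a small sketch that pulls the per-epoch validation mAP out of the log text with a regular expression. The sample lines are copied from the output above with the color codes removed; the helper name and the idea of saving the log to a string are assumptions for illustration.

```python
import re

# Matches lines such as:
#   ... epoch=3, validation mAP <score>=(0.02092431031130315)
MAP_PATTERN = re.compile(r"epoch=(\d+), validation mAP <score>=\(([0-9.eE+-]+)\)")

def validation_map_by_epoch(log_text):
    """Return {epoch: mAP} for every validation mAP line found in the log."""
    return {int(epoch): float(score) for epoch, score in MAP_PATTERN.findall(log_text)}

sample_log = """\
[09/11/2019 05:38:54 INFO 140277027632960] #quality_metric: host=algo-1, epoch=0, validation mAP <score>=(0.007122408475943063)
[09/11/2019 05:39:00 INFO 140277027632960] #quality_metric: host=algo-1, epoch=3, validation mAP <score>=(0.02092431031130315)
[09/11/2019 05:39:50 INFO 140277027632960] #quality_metric: host=algo-1, epoch=32, validation mAP <score>=(0.03994922693949471)
[09/11/2019 05:40:11 INFO 140277027632960] #quality_metric: host=algo-1, epoch=44, validation mAP <score>=(0.01968443735513238)
"""

scores = validation_map_by_epoch(sample_log)
best_epoch = max(scores, key=scores.get)
print(f"best epoch in this excerpt: {best_epoch} (mAP={scores[best_epoch]:.4f})")
```

Among these four sampled lines, epoch 32 has the highest validation mAP, at roughly 0.04.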


So now you are training!!! This will take a little while. We are only training for a small number of epochs (50), so we don't expect to have a really robust model. Potentially many hundreds of epochs may be required, depending on the quality and amount of training data we have.

To level set, this model will be CRAPPY. But that is OK. You now have the basic tools required to set up and improve upon your own problem.

🤔 What are the big considerations as a data scientist?

🤔 What could we do to improve our model?

🤔 How could we evaluate the quality of our data?
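As one starting point for the second question, it can help to see what learning rate the job actually used. The sketch below assumes a step schedule in which the rate is multiplied by `lr_scheduler_factor` at each epoch listed in `lr_scheduler_step` (set to `'3,6'` in the training cell). Whether the built-in algorithm applies the drop at exactly those epochs is an assumption here, and `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(epoch, base_lr=0.001, steps=(3, 6), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each listed epoch."""
    drops = sum(1 for step in steps if epoch >= step)
    return base_lr * factor ** drops

# Defaults mirror lr=0.001, lr_scheduler_step='3,6', lr_scheduler_factor=0.1
for epoch in (0, 2, 3, 5, 6, 49):
    print(epoch, learning_rate_at(epoch))
```

Under that assumption, the rate is already 100 times smaller than `lr` from epoch 6 onward, so most of the 50 epochs would run with very small updates. Moving the steps later is one experiment to try.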


In [7]:
object_detector = od_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')   

#response = object_detector.predict(data)

# Tears down the SageMaker endpoint and endpoint configuration
#object_detector.delete_endpoint()

# Deletes the SageMaker model
#object_detector.delete_model()


----------------------------------------------------------------------------------------------!
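Once the endpoint is up, the commented-out `predict` call returns the detections. A minimal sketch of filtering them by confidence follows. It assumes the JSON response layout documented for the built-in object detection algorithm, `{"prediction": [[class_id, score, xmin, ymin, xmax, ymax], ...]}` with box coordinates normalized to 0..1. The response below is made up for illustration and is not output from this model.

```python
import json

def confident_detections(response_body, min_score=0.5):
    """Parse a JSON response and keep detections scoring at least min_score."""
    detections = json.loads(response_body)["prediction"]
    # Each detection is [class_id, score, xmin, ymin, xmax, ymax]
    return [d for d in detections if d[1] >= min_score]

# Made-up response for illustration only (not real model output)
fake_response = json.dumps({"prediction": [
    [0.0, 0.91, 0.10, 0.20, 0.30, 0.40],
    [0.0, 0.12, 0.50, 0.50, 0.70, 0.80],
]})

strong = confident_detections(fake_response)
print(len(strong), "detection(s) kept")
```

The `min_score` cutoff is a judgment call. With a weakly trained model, expect to tune it by looking at the detections on a few tiles.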

Outside of this notebook, you can investigate your endpoint in the SageMaker console and run the test.sh script with the appropriate AWS keys/role. This will generate some stats.

You can also edit the output in 'endpoint_infer_slippygeo.py' to write out a GeoJSON file that you can then explore in QGIS, etc.
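The contents of that script are not shown here, so the following is an independent sketch rather than its actual logic. It assumes each image is a standard slippy map tile addressed by `z/x/y` and that a detection is `[class_id, score, xmin, ymin, xmax, ymax]` with coordinates normalized to the tile. The function names are hypothetical.

```python
import json
import math

def tile_fraction_to_lonlat(z, x, y, fx, fy):
    """Lon/lat of a fractional position (fx, fy in 0..1) inside slippy tile z/x/y."""
    n = 2 ** z
    lon = (x + fx) / n * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * (y + fy) / n))))
    return lon, lat

def detection_to_feature(z, x, y, detection):
    """Turn one normalized detection into a GeoJSON polygon feature."""
    class_id, score, xmin, ymin, xmax, ymax = detection
    west, north = tile_fraction_to_lonlat(z, x, y, xmin, ymin)
    east, south = tile_fraction_to_lonlat(z, x, y, xmax, ymax)
    # Exterior ring, counterclockwise and closed
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {
        "type": "Feature",
        "properties": {"class_id": class_id, "score": score},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }

# Made-up detection in the single zoom-0 tile, for illustration only
feature = detection_to_feature(0, 0, 0, [0, 0.9, 0.25, 0.25, 0.75, 0.75])
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection)[:80])
```

Writing `collection` to a `.geojson` file with `json.dump` gives something QGIS can open directly.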