# Lab 2: Training using Amazon SageMaker

**Goal:**

   In this lab, we'll train a model using XGBoost and the training dataset created in Lab 1. In a real data science development lifecycle, this would be one of many experiments.
   
   * **Lab Outcome:** A trained model, stored as a SageMaker model artifact, that we will deploy and evaluate in Lab 3.

**Dependencies:**
   
   1. This lab requires the training/validation/test datasets created in Lab1.

----

## Step 1: Configure Training Job

Set up the execution role, region, S3 bucket, and model prefix that the training job will use.

In [None]:
%%time

import os
import boto3
import re
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

bucket = sagemaker.Session().default_bucket()
model_prefix = 'workshop/model'


# customize to your bucket where you have stored the data
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)

**Get XGBoost container image**

We are using SageMaker's built-in XGBoost algorithm, so we pull the managed container image for the current region below.

*You can ignore the warning indicating that a newer version is available.*

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

### Set data input paths from Lab 1
Set variables pointing to the training/validation data created in Lab 1. We retrieve `data_prefix`, a string variable that was stored in Lab 1.

In [None]:
# Retrieve the stored variables (variables stored in Lab1)
%store -r data_prefix

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, data_prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, data_prefix), content_type='csv')

## Step 2: Execute Training Job


The example below starts a training job with the Amazon SageMaker Python SDK. You can also start a training job with the AWS SDK for Python (Boto3) using the *create_training_job* method.
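
For comparison, the Boto3 route might look roughly like the sketch below. It only assembles the request for `create_training_job` and does not call AWS. The helper name `build_training_request`, the placeholder URIs and role ARN, and the volume size and runtime limit are assumptions for illustration, not values from this lab.

```python
def build_training_request(job_name, image_uri, role_arn, train_uri,
                           validation_uri, output_uri, hyperparameters):
    """Assemble keyword arguments for the Boto3 create_training_job call.

    Hypothetical helper for illustration; the volume size and runtime
    limit below are assumed values, not settings taken from this lab.
    """
    def channel(name, s3_uri):
        # One input channel reading CSV files from an S3 prefix.
        return {
            'ChannelName': name,
            'ContentType': 'csv',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': s3_uri,
                    'S3DataDistributionType': 'FullyReplicated',
                }
            },
        }

    return {
        'TrainingJobName': job_name,
        'AlgorithmSpecification': {
            'TrainingImage': image_uri,
            'TrainingInputMode': 'File',
        },
        'RoleArn': role_arn,
        'InputDataConfig': [
            channel('train', train_uri),
            channel('validation', validation_uri),
        ],
        'OutputDataConfig': {'S3OutputPath': output_uri},
        'ResourceConfig': {
            'InstanceType': 'ml.m4.xlarge',
            'InstanceCount': 1,
            'VolumeSizeInGB': 10,  # assumed value
        },
        'StoppingCondition': {'MaxRuntimeInSeconds': 3600},  # assumed value
        # The API expects every hyperparameter value as a string.
        'HyperParameters': {k: str(v) for k, v in hyperparameters.items()},
    }


# All values below are placeholders, not real resources.
request = build_training_request(
    job_name='example-job',
    image_uri='<xgboost-image-uri>',
    role_arn='arn:aws:iam::123456789012:role/ExampleRole',
    train_uri='s3://example-bucket/example-prefix/train',
    validation_uri='s3://example-bucket/example-prefix/validation',
    output_uri='s3://example-bucket/workshop/model/output',
    hyperparameters={'max_depth': 5, 'objective': 'binary:logistic', 'num_round': 100},
)
print(request['TrainingJobName'])
```

With real values in place, the request could be submitted with `boto3.client('sagemaker').create_training_job(**request)`.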

**References:**

  * *[Common Parameters for Built-In Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html):* Reference specifying common parameters for Amazon SageMaker algorithms such as XGBoost
  
**Amazon SageMaker XGBoost Notes:**

  * **[XGBoost Hyperparameters:](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html)** 
      * We start with an initial set of hyperparameters and set the objective to `binary:logistic`, which performs logistic regression for binary classification. For this objective, the model outputs a probability.
      * We have not explicitly set an evaluation metric. By default, SageMaker assigns evaluation metrics based on the objective. In this case, they are `train:error` and `validation:error`.
      * Imbalanced dataset: From Lab 1, we know the dataset is imbalanced, with far fewer transactions identified as recurring payments than as non-recurring payments. One experiment we try below is adjusting the `scale_pos_weight` hyperparameter. A general recommendation is:
                  scale_pos_weight = sum(negative instances) / sum(positive instances)
                  From Lab 1: negative instances = 208715, positive instances = 2358
                              208715 / 2358 = ~88.5, rounded to 89
 
      
  * **Instance Type:** 
       * Amazon SageMaker XGBoost currently only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M4) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. Although it supports the use of disk space to handle data that does not fit into main memory (the out-of-core feature available with the libsvm input mode), writing cache files onto disk slows the algorithm processing time.
       * During experimentation and for small datasets, you can also choose to train on your [notebook locally](https://aws.amazon.com/blogs/machine-learning/use-the-amazon-sagemaker-local-mode-to-train-on-your-notebook-instance/) using the compute/memory of your notebook instance as opposed to training instances. However, this is only recommended for small experiments, so that the notebook instance does not have to be overprovisioned.
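
The `scale_pos_weight` recommendation in the hyperparameter notes above can be checked with a few lines of Python. This is a minimal sketch using the class counts reported from Lab 1; the function name `recommended_scale_pos_weight` is hypothetical.

```python
def recommended_scale_pos_weight(negative_count, positive_count):
    """Return the ratio of negative to positive examples."""
    if positive_count <= 0:
        raise ValueError('positive_count must be greater than zero')
    return negative_count / positive_count


# Class counts reported in Lab 1.
ratio = recommended_scale_pos_weight(208715, 2358)

# Round to the nearest integer to get the value passed to the estimator.
scale_pos_weight = round(ratio)
print('ratio = {:.2f}, scale_pos_weight = {}'.format(ratio, scale_pos_weight))
```

Treat the result as a starting point for experimentation rather than a fixed setting.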

In [None]:
sess = sagemaker.Session()

# Converting datetime object to string
from datetime import datetime
dateTimeObj = datetime.now() 
timestampStr = dateTimeObj.strftime("%d%m%Y-%H%M%S%f")
training_job_name = 'sagemaker-xgboost-workshop-' +  timestampStr

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, model_prefix),
                                    enable_sagemaker_metrics=True,
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        scale_pos_weight=89,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation},job_name=training_job_name) 

In [None]:
print(training_job_name)

### View Model Artifact in S3 

Once the training job is complete, the model will be output to an S3 bucket.  This model artifact will be used for hosting our model and getting predictions.  
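
As a rough sketch of where the artifact lands: SageMaker writes it as `model.tar.gz` under `<output_path>/<training job name>/output/`. The helper below only builds that S3 URI as a string; its name and the example values are hypothetical. After `fit` completes, the estimator's `model_data` attribute should report the same location.

```python
def model_artifact_uri(bucket, model_prefix, training_job_name):
    """Build the S3 URI where the trained model archive is expected."""
    return 's3://{}/{}/output/{}/output/model.tar.gz'.format(
        bucket, model_prefix, training_job_name)


# Placeholder values for illustration only.
example_uri = model_artifact_uri(
    'example-bucket', 'workshop/model', 'sagemaker-xgboost-workshop-example')
print(example_uri)
```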

In [None]:
from IPython.display import Markdown, display

# Build a console link to the artifact folder, using the session region instead of a hard-coded one
s3_model_artifact = 'https://s3.console.aws.amazon.com/s3/buckets/' + bucket + '/' + model_prefix + '/output/' + training_job_name + '/output/?region=' + region + '&tab=overview'
display(Markdown('S3 Model Artifact: [link](' + s3_model_artifact + ')'))

### View CloudWatch Training Metrics

In [None]:
%matplotlib inline
from sagemaker.analytics import TrainingJobAnalytics

validation_metric_name = 'validation:error'
validation_metrics_dataframe = TrainingJobAnalytics(training_job_name=training_job_name,metric_names=[validation_metric_name]).dataframe()
validation_metrics_dataframe.head()

training_metric_name = 'train:error'
training_metrics_dataframe = TrainingJobAnalytics(training_job_name=training_job_name,metric_names=[training_metric_name]).dataframe()
training_metrics_dataframe.head()

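# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
import pandas as pd
metrics = pd.concat([training_metrics_dataframe, validation_metrics_dataframe])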
metrics.head()
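
For reference, XGBoost's `error` metric is the fraction of misclassified examples, where a predicted probability above 0.5 counts as the positive class. The sketch below reproduces that calculation on made-up labels and probabilities; the function name and the data are illustrative only and do not come from this lab's training job.

```python
def binary_error_rate(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    if len(labels) != len(probabilities):
        raise ValueError('labels and probabilities must have the same length')
    if not labels:
        raise ValueError('at least one example is required')
    wrong = sum(
        1 for y, p in zip(labels, probabilities)
        if (p > threshold) != (y == 1)
    )
    return wrong / len(labels)


# Made-up data: the second and fourth examples are misclassified.
labels = [0, 0, 1, 1, 0]
probabilities = [0.1, 0.7, 0.8, 0.4, 0.2]
error = binary_error_rate(labels, probabilities)
print('error = {:.2f}'.format(error))
```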


### View Training Job in SageMaker Console 

Training logs appear as output in the notebook above, but you can also view the training job in the SageMaker console. There you can scroll down to monitor system metrics (CPU/memory/disk), which helps right-size training instances for cost and performance in future runs. Click [HERE](https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs/) to find your training job in the console and view its metrics. The link opens the us-east-1 region; switch regions in the console if your job ran elsewhere.
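
The same job details can also be read programmatically. The sketch below assumes a Boto3 SageMaker client and uses its `describe_training_job` call. The helper name `summarize_training_job` and the stub client used to exercise it offline are hypothetical, and the stub's values are made up.

```python
def summarize_training_job(sagemaker_client, job_name):
    """Return a few fields of interest from describe_training_job."""
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    return {
        'status': response['TrainingJobStatus'],
        'model_artifact': response.get('ModelArtifacts', {}).get('S3ModelArtifacts'),
        'billable_seconds': response.get('BillableTimeInSeconds'),
    }


class StubSageMakerClient:
    """Offline stand-in that mimics the shape of the real response."""

    def describe_training_job(self, TrainingJobName):
        return {
            'TrainingJobName': TrainingJobName,
            'TrainingJobStatus': 'Completed',
            'ModelArtifacts': {
                'S3ModelArtifacts': 's3://example-bucket/workshop/model/output/'
                                    + TrainingJobName + '/output/model.tar.gz'
            },
            'BillableTimeInSeconds': 120,  # made-up value
        }


summary = summarize_training_job(StubSageMakerClient(), 'example-job')
print(summary['status'])
```

In the notebook, you would pass `boto3.client('sagemaker')` and `training_job_name` instead of the stub and the placeholder name.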

# Congratulations - You've completed Lab 2

In this lab we utilized the training/validation datasets created in Lab 1 to train a model.  In the next lab, we'll host the model for predictions.

In [None]:
# Let's collect & store variables we will need to use for Lab3

%store training_job_name
%store data_prefix
