# Train Models with SageMaker
Now that we've spent a healthy amount of time formatting and transforming our data, let's train our first model on SageMaker.

At a minimum, participants in a 3-day course should be able to produce a single model. As a stretch goal, they should be able to produce two or more models and compare them.

In [7]:
import pandas as pd
import sagemaker

In [8]:
data = pd.read_csv("../Data/fewer_labeled_rows_by_block.csv")

Because the dimensionality here is so high (roughly 1,500 columns for about 33,000 rows), it's wise to start with a simple model. We'll begin with SageMaker's built-in Linear Learner algorithm.

In [9]:
data.shape

(33063, 1512)

In [10]:
import boto3
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

# Look up the Linear Learner container image for the notebook's region.
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

In [11]:
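import numpy as np

def create_a_train_set(data):
    """Split the DataFrame into a feature matrix and a label vector."""
    # The label column becomes the target vector.
    ys = np.array(data["Target"])

    # Drop the label and the columns that are not used as features.
    drop_list = ["Target", "Date", "Primary Type"]
    xs = np.array(data.drop(drop_list, axis=1))

    return [xs, ys]

train_set = create_a_train_set(data)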

In [12]:
len(train_set[0][0])

1509
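With 1,509 features per row, even a linear model carries a fair number of weights. As a rough sketch, assuming a plain multiclass linear model with one weight vector and one bias per class, and the six classes passed later as `num_classes`, the parameter count works out as follows. The helper name `linear_param_count` is hypothetical and is not part of SageMaker.

```python
def linear_param_count(feature_dim, num_classes):
    """Weights plus biases for a multiclass linear model (one weight vector per class)."""
    weights = feature_dim * num_classes
    biases = num_classes
    return weights + biases

n_rows = 33063      # rows reported by data.shape
feature_dim = 1509  # columns left after dropping Target, Date, and Primary Type
num_classes = 6     # assumption: matches the num_classes hyperparameter used for training

params = linear_param_count(feature_dim, num_classes)
print(params)           # 9060
print(n_rows / params)  # training rows available per parameter
```

Only a few rows per parameter are available, which is one way to see why a simple model is a sensible starting point here.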

In [None]:
train_records = multiclass_estimator.record_set(train_features, train_labels, channel='train')
val_records = multiclass_estimator.record_set(val_features, val_labels, channel='validation')
test_records = multiclass_estimator.record_set(test_features, test_labels, channel='test')

In [21]:
sess = sagemaker.Session()
bucket = "sagemaker-chicago-data"

role = get_execution_role()

In [37]:
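import io
import os

import sagemaker.amazon.common as smac

prefix = "sagemaker"
key = "modeling-data"

# Assumption: `buf` is not defined in any cell shown here. It is built below by
# serializing the features and labels to RecordIO-protobuf, the format that
# Linear Learner reads by default.
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, train_set[0].astype('float32'), train_set[1].astype('float32'))
buf.seek(0)

boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)

s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)

print('uploaded training data location: {}'.format(s3_train_data))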

uploaded training data location: s3://sagemaker-chicago-data/sagemaker/train/modeling-data


In [38]:
output_location = 's3://{}/{}/model_artifacts'.format(bucket, prefix)

print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://sagemaker-chicago-data/sagemaker/model_artifacts


In [40]:
linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.m4.4xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess)

linear.set_hyperparameters(feature_dim=1509,
                           predictor_type='multiclass_classifier',
                           mini_batch_size=200,
                           num_classes=6)

linear.fit({'train': s3_train_data})

INFO:sagemaker:Creating training-job with name: linear-learner-2018-10-22-22-32-27-495


2018-10-22 22:32:27 Starting - Starting the training job...
2018-10-22 22:32:29 Starting - Launching requested ML instances......
2018-10-22 22:33:32 Starting - Preparing the instances for training......
2018-10-22 22:34:55 Downloading - Downloading input data
2018-10-22 22:34:55 Training - Downloading the training image.
Docker entrypoint called with argument(s): train
[10/22/2018 22:35:01 INFO 139848689157952] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_method': u'uniform', u'init_sigma': u'0.01', u'lr_scheduler_minimum_lr': u'auto', u'target_recall'


2018-10-22 22:35:00 Training - Training image download completed. Training in progress.
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 5.417369101478746e-05, "sum": 5.417369101478746e-05, "min": 5.417369101478746e-05}}, "EndTime": 1540247716.171216, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 1}, "StartTime": 1540247716.171152}
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 4.388509922357502e-06, "sum": 4.388509922357502e-06, "min": 4.388509922357502e-06}}, "EndTime": 1540247716.171317, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 1}, "StartTime": 1540247716.1713}
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 0.0, "sum": 0.0, "min": 0.0}}, "EndTime": 1540247716.171361, "Dimensions": {"model": 2, "Host": "algo-1", "Op

#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 5.1867107380971766e-05, "sum": 5.1867107380971766e-05, "min": 5.1867107380971766e-05}}, "EndTime": 1540247729.77974, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 3}, "StartTime": 1540247729.779677}
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 7.808701202245147e-07, "sum": 7.808701202245147e-07, "min": 7.808701202245147e-07}}, "EndTime": 1540247729.779826, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 3}, "StartTime": 1540247729.779811}
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 0.0, "sum": 0.0, "min": 0.0}}, "EndTime": 1540247729.779869, "Dimensions": {"model": 2, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 3}, "StartTime": 15402

#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 5.180459314336379e-05, "sum": 5.180459314336379e-05, "min": 5.180459314336379e-05}}, "EndTime": 1540247736.538656, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 4}, "StartTime": 1540247736.538591}
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 5.71764143497796e-07, "sum": 5.71764143497796e-07, "min": 5.71764143497796e-07}}, "EndTime": 1540247736.538738, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 4}, "StartTime": 1540247736.538724}
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 0.0, "sum": 0.0, "min": 0.0}}, "EndTime": 1540247736.53879, "Dimensions": {"model": 2, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 4}, "StartTime": 1540247736.


2018-10-22 22:35:51 Uploading - Uploading generated training model
2018-10-22 22:35:51 Completed - Training job completed
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 5.166427635898193e-05, "sum": 5.166427635898193e-05, "min": 5.166427635898193e-05}}, "EndTime": 1540247743.322549, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 5}, "StartTime": 1540247743.322483}
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 4.374278055779128e-07, "sum": 4.374278055779128e-07, "min": 4.374278055779128e-07}}, "EndTime": 1540247743.322633, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 5}, "StartTime": 1540247743.322618}
#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": {"count": 1, "max": 0.0, "sum": 0.0, "min": 0.0}}, "EndTime": 1540247743.322676, "Dimensions

Billable seconds: 77
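The training log above interleaves `#metrics` records for several models (`"model": 0`, `1`, `2`) and epochs, which makes the objective hard to follow by eye. A minimal sketch for pulling those values out of captured log text, assuming each record sits on one line in the format shown above; the helper name `parse_metrics_line` is hypothetical, and truncated records are skipped rather than guessed at:

```python
import json
import re

# Matches the "#metrics" marker followed by a JSON object that runs to the end of the line.
METRICS_RE = re.compile(r"#metrics\s+(\{.*\})\s*$")

def parse_metrics_line(line):
    """Return (epoch, model, metric_name, mean_value) tuples, or [] for incomplete records."""
    match = METRICS_RE.search(line)
    if match is None:
        return []
    try:
        record = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    if "Dimensions" not in record:
        return []
    dims = record["Dimensions"]
    return [
        (dims.get("epoch"), dims.get("model"), name, stats["sum"] / stats["count"])
        for name, stats in record.get("Metrics", {}).items()
    ]

# One complete record copied from the log above.
sample = ('#metrics {"Metrics": {"train_multiclass_cross_entropy_objective": '
          '{"count": 1, "max": 5.417369101478746e-05, "sum": 5.417369101478746e-05, '
          '"min": 5.417369101478746e-05}}, "EndTime": 1540247716.171216, '
          '"Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", '
          '"Algorithm": "Linear Learner", "epoch": 1}, "StartTime": 1540247716.171152}')

rows = parse_metrics_line(sample)
print(rows)
```

Collecting these tuples across all lines and grouping by `model` gives one training curve per model, which is easier to compare than the raw log.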


In [1]:
from sklearn.model_selection import train_test_split

In [6]:
import sagemaker

sess = sagemaker.Session()

endpoint_name = "chicago-predictor"

predictor = sagemaker.predictor.RealTimePredictor(endpoint_name, sess)

# Now that we have our predictor, let's evaluate the model

In [16]:
def create_train_test_set(train_set):
    """Hold out 20% of the rows as a test set, using a fixed seed for reproducibility."""
    train_features, test_features, train_labels, test_labels = train_test_split(
        train_set[0], train_set[1], test_size=0.2, random_state=0)

    return train_features, test_features, train_labels, test_labels

train_xs, test_xs, train_ys, test_ys = create_train_test_set(train_set)

In [3]:
def evaluate_metrics(predictor, test_features, test_labels):
    """
    Evaluate a model on a test set using the given prediction endpoint. 
    Display classification metrics.
    This function was written by Zohar and others on the SageMaker blog
    """
    # split the test dataset into 100 batches and evaluate using prediction endpoint
    prediction_batches = [predictor.predict(batch) for batch in np.array_split(test_features, 100)]

    # parse protobuf responses to extract predicted labels
#     extract_label = lambda x: x.label['predicted_label'].float32_tensor.values
#     test_preds = np.concatenate([np.array([extract_label(x) for x in batch]) for batch in prediction_batches])
#     test_preds = test_preds.reshape((-1,))
    
#     # calculate accuracy
#     accuracy = (test_preds == test_labels).sum() / test_labels.shape[0]
    
#     # calculate recall for each class
#     recall_per_class, classes = [], []
#     for target_label in np.unique(test_labels):
#         recall_numerator = np.logical_and(test_preds == target_label, test_labels == target_label).sum()
#         recall_denominator = (test_labels == target_label).sum()
#         recall_per_class.append(recall_numerator / recall_denominator)
#         classes.append(label_map[target_label])
#     recall = pd.DataFrame({'recall': recall_per_class, 'class_label': classes})
#     recall.sort_values('class_label', ascending=False, inplace=True)

#     # calculate confusion matrix
#     label_mapper = np.vectorize(lambda x: label_map[x])
#     confusion_matrix = pd.crosstab(label_mapper(test_labels), label_mapper(test_preds), 
#                                    rownames=['Actuals'], colnames=['Predictions'], normalize='index')

#     # display results
#     sns.heatmap(confusion_matrix, annot=True, fmt='.2f', cmap="YlGnBu").set_title('Confusion Matrix')  
#     ax = recall.plot(kind='barh', x='class_label', y='recall', color='steelblue', title='Recall', legend=False)
#     ax.set_ylabel('')
#     print('Accuracy: {:.3f}'.format(accuracy))
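The commented-out section above computes accuracy and per-class recall from the predicted labels. A minimal, endpoint-free sketch of those two calculations, using plain Python sequences and made-up labels; the helper name `accuracy_and_recall` is hypothetical and is not taken from the original blog code:

```python
from collections import Counter

def accuracy_and_recall(true_labels, predicted_labels):
    """Return overall accuracy and a {class: recall} dict for the classes seen in true_labels."""
    true_labels = list(true_labels)
    predicted_labels = list(predicted_labels)
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have the same length")

    total = Counter()    # rows per true class
    correct = Counter()  # correctly predicted rows per true class
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1

    accuracy = sum(correct.values()) / len(true_labels)
    recall = {label: correct[label] / total[label] for label in sorted(total)}
    return accuracy, recall

# Made-up labels for illustration only.
toy_truth = [0, 0, 1, 1, 2]
toy_preds = [0, 1, 1, 1, 0]
accuracy, recall = accuracy_and_recall(toy_truth, toy_preds)
print(accuracy)  # 0.6
print(recall)    # {0: 0.5, 1: 1.0, 2: 0.0}
```

Per-class recall is worth reporting alongside accuracy because a model can score well overall while missing a rare class entirely, as class 2 shows in the toy data.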


In [18]:
evaluate_metrics(predictor, test_xs, test_ys)

ParamValidationError: Parameter validation failed:
Invalid type for parameter Body, value: [[50.          0.          0.         ...  0.          0.
   0.        ]
 [46.88888889  0.          0.         ...  0.          0.
   0.        ]
 [50.          0.          0.         ...  0.          0.
   0.        ]
 ...
 [80.55555556  0.          0.         ...  0.          0.
   0.        ]
 [41.33333333  0.          0.         ...  0.          0.
   0.        ]
 [69.11111111  0.          0.         ...  0.          0.
   0.        ]], type: <class 'numpy.ndarray'>, valid types: <class 'bytes'>, <class 'bytearray'>, file-like object
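The `ParamValidationError` above says the request `Body` must be `bytes`, a `bytearray`, or a file-like object, but the predictor forwarded the raw `numpy.ndarray`. Each batch therefore has to be serialized before it is sent. In version 1 of the SageMaker Python SDK this is commonly done by setting `predictor.content_type = 'text/csv'` and `predictor.serializer = sagemaker.predictor.csv_serializer`; treat that as an assumption to check against the installed SDK version. The sketch below shows what such a serializer does, using a hypothetical helper `rows_to_csv_bytes` and made-up values:

```python
def rows_to_csv_bytes(rows):
    """Encode a 2-D sequence of numbers as CSV text, one row per line, and return bytes."""
    lines = [",".join(repr(float(value)) for value in row) for row in rows]
    return "\n".join(lines).encode("utf-8")

batch = [[50.0, 0.0, 0.0], [46.5, 0.0, 1.0]]  # made-up feature rows
payload = rows_to_csv_bytes(batch)
print(type(payload).__name__)  # bytes
print(payload.decode("utf-8"))
```

The helper also accepts a 2-D NumPy array, since it only iterates over rows and values.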

# Next, we'll run the model with balanced class weights, and observe the improvement in performance
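Balanced class weights give rare classes more influence on the loss, so the model is not rewarded for simply predicting the majority class. One common convention weights each class by `n_samples / (n_classes * class_count)`; the sketch below applies it to made-up labels. This illustrates the idea only and is not necessarily the exact weighting Linear Learner applies. Linear Learner documents a `balance_multiclass_weights` hyperparameter for this purpose, and whether this notebook uses it is an assumption. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}

toy_labels = [0, 0, 0, 1]  # made-up, imbalanced labels for illustration
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these weights every class contributes the same total weight to the loss, regardless of how many rows it has.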