# Boston Housing - Price Prediction with Amazon SageMaker

In this module we use the same dataset and model, but update the workflow to use SageMaker features that scale dataset transformation and model training beyond a single Jupyter notebook.

This notebook includes all key steps such as preprocessing data with SageMaker Processing, and model training and deployment with SageMaker hosted training and inference. Automatic Model Tuning in SageMaker is used to tune the model's hyperparameters. If you are using TensorFlow 2, you can use the Amazon SageMaker prebuilt TensorFlow 2 framework container with training scripts similar to those you would use outside SageMaker.

In this first notebook we will predict house prices based on the well-known Boston Housing dataset with a simple regression model in TensorFlow 2. This public dataset contains 13 features describing the housing stock of towns in the Boston area. Features include the average number of rooms, accessibility to radial highways, adjacency to a major river, etc.

To begin, we'll import some necessary packages and set up directories for training and test data.  We'll also set up a SageMaker Session to perform various operations, and specify an Amazon S3 bucket to hold input data and output.  The default bucket used here is created by SageMaker if it doesn't already exist, and named in accordance with the AWS account ID and AWS Region.  
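As a rough illustration of the naming convention, the S3 URIs printed in this notebook's output follow the pattern `sagemaker-{region}-{account_id}`. The helper below is hypothetical (it is not part of the SageMaker SDK) and only reproduces that pattern; in practice, call `session.default_bucket()` rather than building the name by hand.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Hypothetical helper: mirrors the 'sagemaker-{region}-{account_id}' pattern."""
    return "sagemaker-{}-{}".format(region, account_id)


# Region and account ID as they appear in the S3 URIs printed in this notebook.
print(default_bucket_name("eu-west-1", "761128311188"))
```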

## Data preprocessing

Next, we'll import the dataset and transform it with SageMaker Processing, which can be used to process terabytes of data in a SageMaker-managed cluster separate from the instance running your notebook server. In a typical SageMaker workflow, notebooks are only used for prototyping and can be run on relatively inexpensive and less powerful instances, while processing, training and model hosting tasks are run on separate, more powerful SageMaker-managed instances.  SageMaker Processing includes off-the-shelf support for Scikit-learn, as well as a Bring Your Own Container option, so it can be used with many different data transformation technologies and tasks.  An alternative to SageMaker Processing is [SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler/), a visual data preparation tool integrated with the SageMaker Studio UI.    

To work with SageMaker Processing, first we'll load the Boston Housing dataset, save the raw feature data and upload it to Amazon S3 so it can be accessed by SageMaker Processing.  We'll also save the labels for training and testing.

In [51]:
import os
import boto3
import sagemaker

session = sagemaker.session.Session()

data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

In [72]:
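import numpy as np
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2.
from sklearn.datasets import load_boston

# X holds the 13 feature columns, y the target house prices.
X, y = load_boston(return_X_y=True)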

np.save(os.path.join(data_dir, 'X.npy'), X)
np.save(os.path.join(data_dir, 'y.npy'), y)

raw_dataset = session.upload_data(path=data_dir, key_prefix='boston-housing/data/raw')

In [56]:
print(raw_dataset)

s3://sagemaker-eu-west-1-761128311188/boston-housing/data/raw


Next, simply supply an ordinary Python data preprocessing script as shown below.  For this example, we're using a SageMaker prebuilt Scikit-learn framework container, which includes many common functions for processing data.  There are few limitations on what kinds of code and operations you can run, and only a minimal API contract:  input and output data must be placed in specified directories.  If this is done, SageMaker Processing automatically loads the input data from S3 and uploads transformed data back to S3 when the job is complete.
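The contents of `code/preprocessing.py` are not reproduced in this notebook, so the following is only a sketch of what a script honoring that contract might look like. It assumes the raw files are named `X.npy` and `y.npy` (as uploaded above), standardizes with plain NumPy in place of scikit-learn's `StandardScaler`, and does not persist the scaler. The training file names, the `{"instances": [...]}` layout of `X_val.json`, the 20% validation split, and the `preprocess` function itself are all assumptions made for illustration.

```python
import json
import os

import numpy as np


def preprocess(input_dir, output_dir, val_fraction=0.2, seed=0):
    """Standardize features using training statistics only, then save the splits."""
    X = np.load(os.path.join(input_dir, "X.npy"))
    y = np.load(os.path.join(input_dir, "y.npy"))

    # Shuffle once, then hold out the first `val_fraction` of rows for validation.
    order = np.random.default_rng(seed).permutation(len(X))
    n_val = int(len(X) * val_fraction)
    train_idx, val_idx = order[n_val:], order[:n_val]

    # Fit the scaling statistics on the training rows only...
    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns

    # ...and apply the same transformation to both splits.
    X_train = (X[train_idx] - mean) / std
    X_val = (X[val_idx] - mean) / std

    os.makedirs(output_dir, exist_ok=True)
    # Assumed output names and formats; the real script may differ.
    np.save(os.path.join(output_dir, "X_train.npy"), X_train)
    np.save(os.path.join(output_dir, "y_train.npy"), y[train_idx])
    with open(os.path.join(output_dir, "X_val.json"), "w") as f:
        json.dump({"instances": X_val.tolist()}, f)
    with open(os.path.join(output_dir, "y_val.json"), "w") as f:
        json.dump(y[val_idx].tolist(), f)
    return X_train, X_val


# Inside the Processing container these directories exist; elsewhere this is skipped.
if __name__ == "__main__" and os.path.isdir("/opt/ml/processing/input"):
    preprocess("/opt/ml/processing/input", "/opt/ml/processing/output")
```

Keeping the logic in a function that takes the two directories as arguments makes it easy to try the script locally on a small sample before paying for a Processing job.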

Before starting the SageMaker Processing job, we instantiate a `SKLearnProcessor` object. This object allows you to specify the instance type to use in the job, as well as how many instances. Spinning up a cluster is just a matter of setting `instance_count` to 2 or more, but our transformation includes a `StandardScaler`, which must be fitted on all of the training data and then applied identically to the train and test data. That fit can't be parallelized with `scikit-learn`, but since the dataset is small, a single instance is sufficient.

In [127]:
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

execution_role = get_execution_role()

sklearn_processor = SKLearnProcessor(framework_version='0.23-1',
                                     role=execution_role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

We're now ready to run the Processing job. Because this job uses a single instance, the `ProcessingInput` below keeps the default `FullyReplicated` distribution type, as the job description in the output confirms. To distribute the data files equally among several instances instead, you would set `s3_data_distribution_type='ShardedByS3Key'` on the `ProcessingInput` object, so that with `n` instances each instance receives roughly `1/n` of the files under the specified S3 prefix. It may take around 3 minutes for the following code cell to run, mainly to set up the cluster. At the end of the job, SageMaker automatically tears down the cluster.

In [128]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

bucket = session.default_bucket() 

processed_dataset = 's3://{}/boston-housing/data/processed'.format(bucket)

sklearn_processor.run(
    code='code/preprocessing.py',
    inputs=[ProcessingInput(
        source=raw_dataset,
        destination='/opt/ml/processing/input'
    )],
    outputs=[ProcessingOutput(
        source='/opt/ml/processing/output',
        destination=processed_dataset
    )]
)


Job Name:  sagemaker-scikit-learn-2021-04-13-14-57-57-462
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-761128311188/boston-housing/data/raw', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-761128311188/sagemaker-scikit-learn-2021-04-13-14-57-57-462/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-761128311188/boston-housing/data/processed', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
......................
INPUT FILE LIST: 
['/

In the log output of a multi-instance SageMaker Processing job, you would see logs in a different color for each instance, and with the `ShardedByS3Key` distribution type each instance would receive different files. With the default `FullyReplicated` distribution type used here, each instance receives a copy of **all** files. By spreading the data equally among `n` instances, you can expect a speedup of approximately a factor of `n` for most stateless data transformations. With the processed data now in S3, we'll move on to training and inference code.
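To make the difference between the two distribution types concrete, here is a small local simulation. It does not call SageMaker; the `distribute` function, the file names, and the round-robin assignment are assumptions chosen purely for illustration, and the service may assign objects to instances differently.

```python
def distribute(files, n_instances, sharded):
    """Return, for each instance, the list of files it would see.

    Round-robin assignment is an assumption made for illustration only.
    """
    if not sharded:
        # FullyReplicated: every instance gets a copy of every file.
        return [list(files) for _ in range(n_instances)]
    # ShardedByS3Key: each file goes to exactly one instance.
    return [list(files[i::n_instances]) for i in range(n_instances)]


files = ["part-{}.npy".format(i) for i in range(6)]  # hypothetical object keys

print(distribute(files, 2, sharded=True))   # each instance sees 3 of the 6 files
print(distribute(files, 2, sharded=False))  # each instance sees all 6 files
```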

## Training

Now that we've prepared a dataset, we can move on to SageMaker's model training functionality. With SageMaker hosted training, the actual training occurs not on the notebook instance, but on a separate cluster of machines managed by SageMaker. Before starting hosted training, the data must be in S3, or in an EFS or FSx for Lustre file system. The Processing job above already wrote the transformed dataset to S3, so no additional upload is needed.

We're now ready to set up an Estimator object for hosted training. We simply call `fit` to start the actual hosted training.

In [79]:
from sagemaker.tensorflow import TensorFlow

hyperparameters = {'epochs': 50, 'batch_size': 128, 'learning_rate': 0.01}

hosted_estimator = TensorFlow(
                       source_dir='code',
                       entry_point='train.py',
                       instance_type='ml.c5.xlarge',
                       instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       framework_version='2.3.1',
                       py_version='py37')

After starting the hosted training job with the `fit` method call below, you should observe the validation loss converge with each epoch. Can we do better? We'll look into a way to do so in the **Automatic Model Tuning** section below. In the meantime, the hosted training job should take about 3 minutes to complete.

In [None]:
hosted_estimator.fit({'input':processed_dataset})
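The `train.py` entry point is not shown in this notebook. As a hedged sketch of how such a script typically receives its configuration: in script mode, each key of the `hyperparameters` dictionary is passed as a command-line flag, and SageMaker sets environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_INPUT` (one per data channel; the channel here is named `input`). The argument names `--sm_model_dir` and `--input`, and the `parse_args` function, are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as flags named after the dictionary keys.
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--learning_rate", type=float, default=0.01)

    # Paths come from environment variables set inside the training container;
    # the fallbacks are the conventional container locations.
    parser.add_argument("--sm_model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--input", type=str,
                        default=os.environ.get("SM_CHANNEL_INPUT",
                                               "/opt/ml/input/data/input"))

    # parse_known_args tolerates extra flags the framework may add.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "50", "--batch_size", "128", "--learning_rate", "0.01"])
print(args.epochs, args.batch_size, args.learning_rate)
```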

The training job produces a model saved in S3 that we can retrieve.  This is an example of the modularity of SageMaker: having trained the model in SageMaker, you can now take the model out of SageMaker and run it anywhere else.  Alternatively, you can deploy the model into a production-ready environment using SageMaker's hosted endpoints functionality, as shown in the **SageMaker hosted endpoint** section below.

Retrieving the model from S3 is very easy:  the hosted training estimator you created above stores a reference to the model's location in S3.  You simply copy the model from S3 using the estimator's `model_data` property and unzip it to inspect the contents.

In [81]:
!aws s3 cp {hosted_estimator.model_data} ./model/model.tar.gz

download: s3://sagemaker-eu-west-1-761128311188/tensorflow-training-2021-04-12-23-00-29-769/output/model.tar.gz to model/model.tar.gz


The unzipped archive should include the assets required by TensorFlow Serving to load the model and serve it, including a .pb file:  

In [82]:
!tar -xvzf ./model/model.tar.gz -C ./model

1/
1/saved_model.pb
1/assets/
1/variables/
1/variables/variables.data-00000-of-00001
1/variables/variables.index


## Validation
    
Next we validate the trained model with offline batch scoring (inference/prediction). The inputs to this step are the model we trained earlier and the validation data produced by the Processing job. We use SageMaker Batch Transform to generate the predictions, then compare them with the true values by computing the mean squared error (MSE).

In [98]:
batch_input = 's3://{}/boston-housing/data/processed/X_val.json'.format(bucket)
batch_output = 's3://{}/boston-housing/batch-output'.format(bucket)

tf_transformer = hosted_estimator.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=batch_output
)

tf_transformer.transform(data=batch_input, content_type='application/json')

........................
INFO:__main__:starting services
INFO:tfs_utils:using default model name: model
INFO:tfs_utils:tensorflow serving model config: 
model_config_list: {
  config: {
    name: "model",
    base_path: "/opt/ml/model",
    model_platform: "tensorflow"
  }
}

INFO:__main__:using default model name: model
INFO:__main__:tensorflow serving model config: 
model_config_list: {
  config: {
    name: "model",
    base_path: "/opt/ml/model",
    model_platform: "tensorflow"
  }
}

INFO:__main__:tensorflow version info:
TensorFlow ModelServer: 2.3.0-rc0+dev.sha.no_git
TensorFlow Library: 2.3.0
INFO:__main__:tensorflow serving command: tensorflow_model_server --port=10000 --rest_api_port=10001 --model_config_file=/sagemaker/model-config.cfg --max_num_load_retries=0 
INFO:__main__:started tensorflow serving (pid: 12)
INFO:__main__:nginx config: 

In [99]:
!aws s3 cp --recursive $tf_transformer.output_path ./

download: s3://sagemaker-eu-west-1-761128311188/boston-housing/batch-output/X_val.json.out to ./X_val.json.out


In [102]:
!aws s3 cp s3://$bucket/boston-housing/data/processed/y_val.json ./

download: s3://sagemaker-eu-west-1-761128311188/boston-housing/data/processed/y_val.json to ./y_val.json


In [117]:
import json 

with open("X_val.json.out", "r") as read_file:
    validation_pred = json.load(read_file)
    validation_pred = [item for item_arr in validation_pred['predictions'] for item in item_arr]

with open("y_val.json", "r") as read_file:
    validation_true = json.load(read_file)

In [119]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(validation_true, validation_pred)
print('Validation MSE: {}'.format(mse))

Validation MSE: 41.97837042075817
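For readers who want to see what the two cells above compute, here is a dependency-free sketch of the same steps: flattening the nested `predictions` list returned by the model and averaging the squared errors. The numbers are made up for illustration and are not taken from the job output; `flatten_predictions` and `mse_manual` are hypothetical helper names.

```python
def flatten_predictions(response):
    """Turn {'predictions': [[a], [b], ...]} into [a, b, ...]."""
    return [value for row in response["predictions"] for value in row]


def mse_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
response = {"predictions": [[20.0], [31.0], [14.0]]}
y_true = [22.0, 30.0, 15.0]

print(mse_manual(y_true, flatten_predictions(response)))  # (4 + 1 + 1) / 3 = 2.0
```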


## Automatic Model Tuning

So far we have simply run one Hosted Training job without any real attempt to tune hyperparameters to produce a better model.  Selecting the right hyperparameter values to train your model can be difficult, and typically is very time consuming if done manually. The right combination of hyperparameters is dependent on your data and algorithm; some algorithms have many different hyperparameters that can be tweaked; some are very sensitive to the hyperparameter values selected; and most have a non-linear relationship between model fit and hyperparameter values.  SageMaker Automatic Model Tuning helps automate the hyperparameter tuning process:  it runs multiple training jobs with different hyperparameter combinations to find the set with the best model performance.

We begin by specifying the hyperparameters we wish to tune, and the range of values over which to tune each one.  We also must specify an objective metric to be optimized:  in this use case, we'd like to minimize the validation loss.

In [120]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
  'learning_rate': ContinuousParameter(0.001, 0.2, scaling_type="Logarithmic"),
  'epochs': IntegerParameter(20, 70),
  'batch_size': IntegerParameter(64, 256),
}

metric_definitions = [{'Name': 'loss',
                       'Regex': ' loss: ([0-9\\.]+)'},
                     {'Name': 'val_loss',
                       'Regex': ' val_loss: ([0-9\\.]+)'}]

objective_metric_name = 'val_loss'
objective_type = 'Minimize'
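The regular expressions in `metric_definitions` are applied to the training job's log output to extract metric values. The sketch below applies the same two patterns to a single log line with Python's `re` module; the log line itself is hypothetical, and real training logs may be formatted differently.

```python
import re

metric_definitions = [
    {"Name": "loss", "Regex": " loss: ([0-9\\.]+)"},
    {"Name": "val_loss", "Regex": " val_loss: ([0-9\\.]+)"},
]

# Hypothetical Keras-style log line, for illustration only.
log_line = "Epoch 70/70 - 0s - loss: 8.9512 - val_loss: 10.0801"

for definition in metric_definitions:
    match = re.search(definition["Regex"], log_line)
    print(definition["Name"], float(match.group(1)))
```

Note the leading space in each pattern: it keeps the `loss` pattern from also matching the tail of `val_loss`.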

Next we specify a HyperparameterTuner object that takes the above definitions as parameters.  Each tuning job must be given a budget:  a maximum number of training jobs.  A tuning job will complete after that many training jobs have been executed.  

We also can specify how much parallelism to employ, in this case five jobs, meaning that the tuning job will complete after three series of five jobs in parallel have completed.  For the default Bayesian Optimization tuning strategy used here, the tuning search is informed by the results of previous groups of training jobs, so we don't run all of the jobs in parallel, but rather divide the jobs into groups of parallel jobs.  There is a trade-off: using more parallel jobs will finish tuning sooner, but likely will sacrifice tuning search accuracy. 
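The arithmetic behind "three series of five jobs" can be sketched as below. This is a simplification that assumes jobs run in lock-step batches; the `tuning_rounds` helper is hypothetical and not part of the SageMaker SDK.

```python
import math


def tuning_rounds(max_jobs, max_parallel_jobs):
    """Number of sequential batches if jobs ran in lock-step groups."""
    return math.ceil(max_jobs / max_parallel_jobs)


print(tuning_rounds(15, 5))   # 3 rounds: later rounds can learn from earlier ones
print(tuning_rounds(15, 15))  # 1 round: fastest, but no job informs another
print(tuning_rounds(15, 1))   # 15 rounds: most informed search, slowest
```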

Now we can launch a hyperparameter tuning job by calling the `fit` method of the HyperparameterTuner object. The tuning job may take around 10 minutes to finish. While you're waiting, the status of the tuning job, including metadata and results for individual training jobs within the tuning job, can be checked in the SageMaker console in the **Hyperparameter tuning jobs** panel.

In [123]:
tuner = HyperparameterTuner(hosted_estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=15,
                            max_parallel_jobs=5,
                            objective_type=objective_type)

tuner.fit({'input':processed_dataset}, job_name='boston-housing-tuning')
tuner.wait()

..........................................................................................................................!
!


After the tuning job is finished, we can use the `HyperparameterTuningJobAnalytics` object from the SageMaker Python SDK to list the top 5 tuning jobs with the best performance. Although the results vary from tuning job to tuning job, the best validation loss from the tuning job (under the FinalObjectiveValue column) likely will be substantially lower than the validation loss from the hosted training job above.  

In [125]:
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics('boston-housing-tuning')
tuner_metrics.dataframe().sort_values(['FinalObjectiveValue'], ascending=True).head(5)

| | batch_size | epochs | learning_rate | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 66.0 | 70.0 | 0.011654 | boston-housing-tuning-012-3215c008 | Completed | 10.0801 | 2021-04-13 11:45:00+00:00 | 2021-04-13 11:46:00+00:00 | 60.0 |
| 2 | 181.0 | 70.0 | 0.057179 | boston-housing-tuning-013-3bd04d5a | Completed | 12.0952 | 2021-04-13 11:45:08+00:00 | 2021-04-13 11:45:58+00:00 | 50.0 |
| 7 | 64.0 | 68.0 | 0.011654 | boston-housing-tuning-008-94bb43d9 | Completed | 12.29 | 2021-04-13 11:42:00+00:00 | 2021-04-13 11:43:09+00:00 | 69.0 |
| 14 | 76.0 | 41.0 | 0.025032 | boston-housing-tuning-001-37e37537 | Completed | 12.4254 | 2021-04-13 11:38:42+00:00 | 2021-04-13 11:39:32+00:00 | 50.0 |
| 1 | 239.0 | 31.0 | 0.113001 | boston-housing-tuning-014-bf9cb803 | Completed | 16.5758 | 2021-04-13 11:45:14+00:00 | 2021-04-13 11:46:04+00:00 | 50.0 |


The total training time and the status of each training job can be checked with the following lines of code. Because automatic early stopping is off by default, all the training jobs should complete normally. For an example of a more in-depth analysis of a tuning job, see the SageMaker official sample [HPO_Analyze_TuningJob_Results.ipynb](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb) notebook.

In [126]:
total_time = tuner_metrics.dataframe()['TrainingElapsedTimeSeconds'].sum() / 3600
print("The total training time is {:.2f} hours".format(total_time))
tuner_metrics.dataframe()['TrainingJobStatus'].value_counts()

The total training time is 0.24 hours


Completed    15
Name: TrainingJobStatus, dtype: int64

## Endpoint deployment

Assuming the best model from the tuning job is better than the model produced by the individual hosted training job above, we could now easily deploy that model to production. A convenient option is to use a SageMaker hosted endpoint, which serves real-time predictions from the trained model. (For asynchronous, offline predictions on large datasets, you can use either SageMaker Processing or SageMaker Batch Transform.) The endpoint will retrieve the TensorFlow SavedModel created during training and deploy it within a SageMaker TensorFlow Serving container. This all can be accomplished with a few lines of code.

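More specifically, we first copy the `StandardScaler` saved during preprocessing into the `code` directory so that it is packaged alongside the inference script. We then create a model from the best estimator of the tuning job with `tuner.best_estimator().create_model(...)`, supplying a custom `inference.py` entry point, and call that model's `deploy` method to host it on a SageMaker endpoint.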

In [134]:
!aws s3 cp s3://$bucket/boston-housing/data/processed/StandardScaler.pkl ./code/

download: s3://sagemaker-eu-west-1-761128311188/boston-housing/data/processed/StandardScaler.pkl to code/StandardScaler.pkl


In [137]:
best_model = tuner.best_estimator().create_model(entry_point='inference.py',
              source_dir='code')


2021-04-13 11:46:00 Starting - Preparing the instances for training
2021-04-13 11:46:00 Downloading - Downloading input data
2021-04-13 11:46:00 Training - Training image download completed. Training in progress.
2021-04-13 11:46:00 Uploading - Uploading generated training model
2021-04-13 11:46:00 Completed - Training job completed


In [138]:
tuning_predictor = best_model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


-----------------------------*

UnexpectedStatusException: Error hosting endpoint boston-housing-tuning-012-3215c008-2021-04-13-15-50-12-267: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

Once the endpoint is in service, we can compare the predictions it generates with the actual target values:

In [133]:
results = tuning_predictor.predict(x_test[:10])['predictions'] 
flat_list = [float('%.1f'%(item)) for sublist in results for item in sublist]
print('predictions: \t{}'.format(np.array(flat_list)))
print('target values: \t{}'.format(y_test[:10].round(decimals=1)))

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "{"error": "{\n    \"error\": \"JSON Value: [\\n    [\\n        18.0846,\\n        0.0,\\n        18.1,\\n        0.0,\\n        0.679,\\n        6.434,\\n        100.0,\\n        1.8347,\\n        24.0,\\n        666.0,\\n        20.2,\\n        27.25,\\n        29.05\\n    ],\\n    [\\n        0.12329,\\n        0.0,\\n        10.01,\\n        0.0,\\n        0.547,\\n        5.913,\\n        92.9,\\n        2.3534,\\n        6.0,\\n        432.0,\\n        17.8,\\n        394.95,\\n        16.21\\n    ],\\n    [\\n        0.05497,\\n        0.0,\\n        5.19,\\n        0.0,\\n        0.515,\\n        5.985,\\n        45.4,\\n        4.8122,\\n        5.0,\\n        224.0,\\n        20.2,\\n        396.9,\\n        9.74\\n    ],\\n    [\\n        1.27346,\\n        0.0,\\n        19.58,\\n        1.0,\\n        0.605,\\n        6.25,\\n        92.6,\\n        1.7984,\\n        5.0,\\n        403.0,\\n        14.7,\\n        338.92,\\n        5.5\\n    ],\\n    [\\n        0.07151,\\n        0.0,\\n        4.49,\\n        0.0,\\n        0.449,\\n        6.121,\\n        56.8,\\n        3.7476,\\n        3.0,\\n        247.0,\\n        18.5,\\n        395.15,\\n        8.44\\n    ],\\n    [\\n        0.27957,\\n        0.0,\\n        9.69,\\n        0.0,\\n        0.585,\\n        5.926,\\n        42.6,\\n        2.3817,\\n        6.0,\\n        391.0,\\n        19.2,\\n        396.9,\\n        13.59\\n    ],\\n    [\\n        0.03049,\\n        55.0,\\n        3.78,\\n        0.0,\\n        0.484,\\n        6.874,\\n        28.1,\\n        6.4654,\\n        5.0,\\n        370.0,\\n        17.6,\\n        387.97,\\n        4.61\\n    ],\\n    [\\n        0.03551,\\n        25.0,\\n        4.86,\\n        0.0,\\n        0.426,\\n        6.167,\\n        46.7,\\n        5.4007,\\n        4.0,\\n        281.0,\\n        19.0,\\n    
    390.64,\\n        7.51\\n    ],\\n    [\\n        0.09299,\\n        0.0,\\n        25.65,\\n        0.0,\\n        0.581,\\n        5.961,\\n        92.9,\\n        2.0869,\\n        2.0,\\n        188.0,\\n        19.1,\\n        378.09,\\n        17.93\\n    ],\\n    [\\n        3.56868,\\n        0.0,\\n        18.1,\\n        0.0,\\n        0.58,\\n        6.437,\\n        75.0,\\n        2.8965,\\n        24.0,\\n        666.0,\\n        20.2,\\n        393.37,\\n        14.36\\n    ]\\n] Is not object\"\n}"}". See https://eu-west-1.console.aws.amazon.com/cloudwatch/home?region=eu-west-1#logEventViewer:group=/aws/sagemaker/Endpoints/boston-housing-tuning-012-3215c008-2021-04-13-15-19-39-268 in account 761128311188 for more information.
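The `Is not object` message in the error above suggests that the serving container received a bare JSON array, whereas the TensorFlow Serving REST predict API expects a JSON object such as `{"instances": [...]}`. A minimal sketch of building such a payload follows. The `to_tfs_payload` helper is hypothetical, and whether the wrapping belongs in the notebook or in `inference.py` (whose contents are not shown here) is left open.

```python
import json


def to_tfs_payload(rows):
    """Wrap feature rows in the {"instances": [...]} object of the TF Serving row format."""
    return json.dumps({"instances": [[float(v) for v in row] for row in rows]})


# Two rows from the error message, truncated to three features for brevity.
rows = [[0.12329, 0.0, 10.01], [0.05497, 0.0, 5.19]]
payload = to_tfs_payload(rows)
print(payload)
```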