## Regression with Amazon SageMaker Linear Learner algorithm


_**Single machine training for regression with Amazon SageMaker Linear Learner algorithm**_

---

---
## Contents
1. [Introduction](#Introduction)
2. [Setup](#Setup)
   1. [Exploring the dataset](#Exploring-the-dataset)
3. [Training the Linear Learner Model](#Training-the-linear-learner-model)
   1. [Training with SageMaker Training](#Training-with-sagemaker-training)
   2. [Training with Automatic Model Tuning (HPO)](#Training-with-automatic-model-tuning-HPO)
4. [Set up hosting for the model](#Set-up-hosting-for-the-model)
5. [Inference](#Inference)
6. [Delete the Endpoint](#Delete-the-Endpoint)
7. [Assignment](#Assignment)
---
## Introduction

This notebook demonstrates the use of Amazon SageMaker’s implementation of the Linear Learner algorithm to train and host a regression model. We use the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone). 

The dataset contains 9 fields. The first is the number of rings, which indicates the age of the abalone (age in years equals the number of rings plus 1.5). The rings are usually counted under a microscope to estimate the abalone's age, so we will use our algorithm to predict the age from the other features. The fields appear in the dataset in the following order:

'Rings','sex','Length','Diameter','Height','Whole Weight','Shucked Weight','Viscera Weight' and 'Shell Weight'

The features from sex to Shell Weight are physical measurements that can be taken with the right tools, so predicting the age from them avoids the complexity of examining the abalone under a microscope to determine its age.
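
The relationship between ring count and age mentioned above can be written as a one-line helper. This is only an illustrative sketch; the function name `rings_to_age` is hypothetical and is not part of the dataset or the SageMaker SDK.

```python
def rings_to_age(rings: int) -> float:
    """Estimate abalone age in years from the ring count (age = rings + 1.5)."""
    return rings + 1.5


print(rings_to_age(10))  # 11.5
```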


---
## Setup


This notebook was tested in Amazon SageMaker Studio on a ml.t3.medium instance with Python 3 (Data Science) kernel.

Let's start by specifying:
1. The S3 buckets and prefixes that you want to use for training data and model data. These should be in the same region as the notebook instance, training, and hosting.
1. The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).

In [2]:
! pip install --upgrade sagemaker



In [3]:
import os
import boto3
import re
import sagemaker


role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# S3 bucket for training data.
# Feel free to specify a different bucket and prefix.
data_bucket = f"sagemaker-example-files-prod-{region}"
data_prefix = "datasets/tabular/uci_abalone"


# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket and prefix
output_bucket = sagemaker.Session().default_bucket()
output_prefix = "sagemaker/DEMO-linear-learner-abalone-regression"



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Exploring the dataset

We pre-processed the Abalone dataset [1] and stored it in an S3 bucket. It was downloaded from the [National Taiwan University's CS department's tools for regression on the abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone). The scripts used for downloading and pre-processing can be found in the [Appendix](#Appendix). They download the data, convert it from libsvm format to csv format, split it into train, validation and test sets, and upload it to the S3 bucket.

The dataset contains a total of 9 fields. Throughout this notebook, they will be named as follows 'age','sex','Length','Diameter','Height','Whole.weight','Shucked.weight','Viscera.weight' and 'Shell.weight' respectively.

The data frame representation below explains the values of each field.
Note: in the original libsvm data, the age field is an integer and the rest of the features are in the format "feature_number":"feature_value".

```
'data.frame':
age              int  15 7 9 10 7 8 20 16 9 19 ...
Sex               <feature_number>: Factor w/ 3 levels "F","I","M" values of 1,2,3
Length            <feature_number>: float  0.455 0.35 0.53 0.44 0.33 0.425 ...
Diameter          <feature_number>: float  0.365 0.265 0.42 0.365 0.255 0.3 ...
Height            <feature_number>: float  0.095 0.09 0.135 0.125 0.08 0.095 ...
Whole.weight      <feature_number>: float  0.514 0.226 0.677 0.516 0.205 ...
Shucked.weight    <feature_number>: float  0.2245 0.0995 0.2565 0.2155 0.0895 ...
Viscera.weight    <feature_number>: float  0.101 0.0485 0.1415 0.114 0.0395 ...
Shell.weight      <feature_number>: float  0.15 0.07 0.21 0.155 0.055 0.12 ...
```
>[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
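
To make the libsvm-to-CSV conversion mentioned above concrete, here is a minimal sketch of how a single line could be converted. It is not the pre-processing script from the Appendix: the helper name `libsvm_line_to_csv` and the sample line are assumptions for illustration, and the sketch assumes 1-based feature numbers.

```python
def libsvm_line_to_csv(line: str, num_features: int = 8) -> str:
    """Convert one libsvm line ("label idx:value ...") to a CSV row.

    Features missing from the line are written as 0, the usual libsvm
    convention for sparse data.
    """
    parts = line.split()
    label, pairs = parts[0], parts[1:]
    values = ["0"] * num_features
    for pair in pairs:
        index, value = pair.split(":")
        values[int(index) - 1] = value  # libsvm feature numbers start at 1
    return ",".join([label] + values)


# Hypothetical sample line (the sex value 3 is made up for illustration)
sample_line = "15 1:3 2:0.455 3:0.365 4:0.095 5:0.514 6:0.2245 7:0.101 8:0.15"
print(libsvm_line_to_csv(sample_line))
```

The label stays in the first column, which matches the layout of the CSV files used in the rest of this notebook.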

In [4]:
import boto3

s3 = boto3.client("s3")

FILE_TRAIN = "abalone_dataset1_train.csv"
FILE_TEST = "abalone_dataset1_test.csv"
FILE_VALIDATION = "abalone_dataset1_validation.csv"

# downloading the train, test, and validation files from data_bucket
s3.download_file(data_bucket, f"{data_prefix}/train_csv/{FILE_TRAIN}", FILE_TRAIN)
s3.download_file(data_bucket, f"{data_prefix}/test_csv/{FILE_TEST}", FILE_TEST)
s3.download_file(data_bucket, f"{data_prefix}/validation_csv/{FILE_VALIDATION}", FILE_VALIDATION)
s3.upload_file(FILE_TRAIN, output_bucket, f"{output_prefix}/train/{FILE_TRAIN}")
s3.upload_file(FILE_TEST, output_bucket, f"{output_prefix}/test/{FILE_TEST}")
s3.upload_file(FILE_VALIDATION, output_bucket, f"{output_prefix}/validation/{FILE_VALIDATION}")

In [5]:
import pandas as pd  # Read in csv and store in a pandas dataframe

df = pd.read_csv(
    FILE_TRAIN,
    sep=",",
    encoding="latin1",
    names=[
        "age",
        "sex",
        "Length",
        "Diameter",
        "Height",
        "Whole.weight",
        "Shucked.weight",
        "Viscera.weight",
        "Shell.weight",
    ],
)
print(df.head(1))

   age  sex  Length  Diameter  Height  Whole.weight  Shucked.weight  \
0    8    2   0.615      0.48    0.16        1.2525           0.585   

   Viscera.weight  Shell.weight  
0          0.2595          0.33  



---
Let us prepare the handshake between our data channels and the algorithm. To do this, we create `sagemaker.inputs.TrainingInput` objects (called `sagemaker.session.s3_input` in version 1 of the SDK) from our [data channels](https://sagemaker.readthedocs.io/en/v1.2.4/session.html#). These objects are then put in a simple dictionary, which the algorithm consumes. Notice that we use `text/csv` as the `content_type` for the pre-processed files. We use two channels here, one for training and one for validation. The test samples from above will be used in the prediction step.

In [6]:
# creating the inputs for the fit() function with the training and validation location
s3_train_data = f"s3://{output_bucket}/{output_prefix}/train"
print(f"training files will be taken from: {s3_train_data}")
s3_validation_data = f"s3://{output_bucket}/{output_prefix}/validation"
print(f"validation files will be taken from: {s3_validation_data}")
output_location = f"s3://{output_bucket}/{output_prefix}/output"
print(f"training artifacts output location: {output_location}")

# generating the TrainingInput objects accepted by fit() in the sdk
train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)
validation_data = sagemaker.inputs.TrainingInput(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)

training files will be taken from: s3://sagemaker-us-east-1-051826717648/sagemaker/DEMO-linear-learner-abalone-regression/train
validation files will be taken from: s3://sagemaker-us-east-1-051826717648/sagemaker/DEMO-linear-learner-abalone-regression/validation
training artifacts output location: s3://sagemaker-us-east-1-051826717648/sagemaker/DEMO-linear-learner-abalone-regression/output


## Training the Linear Learner Model

Training can be done by either calling SageMaker Training with a set of hyperparameters values to train with, or by leveraging SageMaker Automatic Model Tuning ([AMT](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)). AMT, also known as hyperparameter tuning (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

In this notebook, both methods are used for demonstration purposes, but the model that the HPO job creates is the one that is eventually hosted. You can instead deploy the model created by the standalone training job by setting the variable `deploy_amt_model` below to `False`.

### Training with SageMaker Training

First, we retrieve the image for the Linear Learner Algorithm according to the region.

In [7]:
# getting the linear learner image according to the region
from sagemaker.image_uris import retrieve

container = retrieve("linear-learner", boto3.Session().region_name, version="1")
print(container)
deploy_amt_model = True

382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1


Then we create an [estimator from the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) using the Linear Learner container image, and we set up the training parameters and hyperparameter configuration.

In [8]:
%%time
import boto3
import sagemaker
from time import gmtime, strftime

sess = sagemaker.Session()

job_name = "DEMO-linear-learner-abalone-regression-" + strftime("%Y%m%d-%H-%M-%S", gmtime())
print("Training job", job_name)

linear = sagemaker.estimator.Estimator(
    container,
    role,
    input_mode="File",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path=output_location,
    sagemaker_session=sess,
)

linear.set_hyperparameters(
    feature_dim=8,
    epochs=16,
    wd=0.01,
    loss="absolute_loss",
    predictor_type="regressor",
    normalize_data=True,
    optimizer="adam",
    mini_batch_size=100,
    lr_scheduler_step=100,
    lr_scheduler_factor=0.99,
    lr_scheduler_minimum_lr=0.0001,
    learning_rate=0.1,
)

Training job DEMO-linear-learner-abalone-regression-20240920-18-56-52
CPU times: user 70.7 ms, sys: 8.12 ms, total: 78.8 ms
Wall time: 104 ms
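
The `lr_scheduler_*` hyperparameters set above interact with `learning_rate`. The sketch below illustrates one plausible reading of their documented meaning: every `lr_scheduler_step` steps the learning rate is multiplied by `lr_scheduler_factor`, and it never drops below `lr_scheduler_minimum_lr`. It is an approximation for intuition only, not the algorithm's actual implementation, and `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(step, learning_rate=0.1, lr_scheduler_step=100,
                     lr_scheduler_factor=0.99, lr_scheduler_minimum_lr=0.0001):
    """Approximate step-decay schedule using the values from the cell above."""
    decays = step // lr_scheduler_step  # number of decays applied so far
    decayed = learning_rate * lr_scheduler_factor ** decays
    return max(lr_scheduler_minimum_lr, decayed)  # never go below the floor


for step in (0, 100, 1000, 100000):
    print(step, learning_rate_at(step))
```

Under this reading, a factor of 0.99 decays the rate slowly, so the minimum learning rate only comes into play after a very large number of steps.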


---
After configuring the Estimator object and setting its hyperparameters, the only remaining thing to do is to train the algorithm, which the following cell does. Training involves a few steps. First, the instances that we requested when creating the Estimator are provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instances. Once this is done, the training job begins. The provisioning and data downloading take time, depending on the size of the data, so it might be a few minutes before we start getting logs for our training job. The logs also print the loss on the validation data (for example the mean squared error, which is the `validation:mse` metric used for tuning later in this notebook) after every epoch, that is, every full pass over the dataset. These metrics are a proxy for the quality of the model.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 bucket that was set up as `output_path` in the estimator. For this example, training takes between 4 and 6 minutes.


In [10]:
%%time
# Regenerate the job name so that re-running this cell does not reuse the name
# of an existing training job (which fails with a ResourceInUse error)
job_name = "DEMO-linear-learner-abalone-regression-" + strftime("%Y%m%d-%H-%M-%S", gmtime())
linear.fit(inputs={"train": train_data, "validation": validation_data}, job_name=job_name)

### Training with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)) <a id='AMT'></a>
***
As mentioned above, instead of manually configuring our hyperparameter values and training with SageMaker Training, we'll use Amazon SageMaker Automatic Model Tuning.

The code sample below shows you how to use the `HyperparameterTuner`. For recommended default hyperparameter ranges, check the [Amazon SageMaker Linear Learner HPs documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html).

The tuning job will take 8 to 10 minutes to complete.
***

In [12]:
import time
from sagemaker.tuner import IntegerParameter, ContinuousParameter
from sagemaker.tuner import HyperparameterTuner

job_name = "DEMO-ll-aba-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print("Tuning job name: ", job_name)

# Linear Learner tunable hyper parameters can be found here https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner-tuning.html
hyperparameter_ranges = {
    "wd": ContinuousParameter(1e-7, 1, scaling_type="Auto"),
    "learning_rate": ContinuousParameter(1e-5, 1, scaling_type="Auto"),
    "mini_batch_size": IntegerParameter(100, 2000, scaling_type="Auto"),
}

# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).
max_jobs = 6
# Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
# if max_jobs=max_parallel_jobs then Bayesian search turns to Random.
max_parallel_jobs = 2


hp_tuner = HyperparameterTuner(
    linear,
    "validation:mse",
    hyperparameter_ranges,
    max_jobs=max_jobs,
    max_parallel_jobs=max_parallel_jobs,
    objective_type="Minimize",
)


# Launch a SageMaker Tuning job to search for the best hyperparameters
hp_tuner.fit(inputs={"train": train_data, "validation": validation_data}, job_name=job_name)

INFO:sagemaker:Creating hyperparameter tuning job with name: DEMO-ll-aba-2024-09-20-19-14-56


Tuning job name:  DEMO-ll-aba-2024-09-20-19-14-56
........................................................!


## Set up hosting for the model

Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same instance (or type of instance) that we used to train. Training is a prolonged, compute-heavy job with compute and memory requirements that hosting typically does not have. We can choose any type of instance we want to host the model. In our case we chose the ml.m4.xlarge instance to train, but we host the model on the less expensive CPU instance, ml.c4.xlarge. The endpoint deployment can be accomplished as follows:


In [13]:
%%time
# creating the endpoint out of the trained model

if deploy_amt_model:
    linear_predictor = hp_tuner.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge")
else:
    linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge")
print(f"\ncreated endpoint: {linear_predictor.endpoint_name}")


2024-09-20 19:18:24 Starting - Preparing the instances for training
2024-09-20 19:18:24 Downloading - Downloading the training image
2024-09-20 19:18:24 Training - Training image download completed. Training in progress.
2024-09-20 19:18:24 Uploading - Uploading generated training model
2024-09-20 19:18:24 Completed - Resource reused by training job: DEMO-ll-aba-2024-09-20-19-14-56-004-cdb2c9f9

INFO:sagemaker:Creating model with name: DEMO-ll-aba-2024-09-20-19-19-53-792





INFO:sagemaker:Creating endpoint-config with name DEMO-ll-aba-2024-09-20-19-14-56-001-c4ec9f34
INFO:sagemaker:Creating endpoint with name DEMO-ll-aba-2024-09-20-19-14-56-001-c4ec9f34


--------!
created endpoint: DEMO-ll-aba-2024-09-20-19-14-56-001-c4ec9f34
CPU times: user 105 ms, sys: 8.32 ms, total: 113 ms
Wall time: 4min 37s


## Inference

Now that the trained model is deployed at an endpoint that is up and running, we can use this endpoint for inference. To do this, we configure the [predictor object](https://sagemaker.readthedocs.io/en/v1.2.4/predictors.html) to serialize its input as text/csv and to deserialize the JSON reply received from the endpoint.


In [16]:
# configure the predictor to serialize csv input and parse the response as json
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

linear_predictor.serializer = CSVSerializer()
linear_predictor.deserializer = JSONDeserializer()

---
We then use the test file, which contains the records that we held back, to test the model's predictions. Each time the cell below is run, it selects a random sample from the test set to perform inference with.

In [15]:
%%time
import random

# getting a random testing sample from our test file
with open(FILE_TEST, "r") as f:
    test_data = [line.strip() for line in f if line.strip()]
sample = random.choice(test_data).split(",")
actual_age = sample[0]
payload = ",".join(sample[1:])  # removing actual age from the sample

# Invoke the predictor and analyze the result
result = linear_predictor.predict(payload)

# extracting the prediction value
result = round(float(result["predictions"][0]["score"]), 2)

# "accuracy" here is 100 minus the absolute percentage error of this single prediction
accuracy = round(100 - abs(result - float(actual_age)) / float(actual_age) * 100, 2)
print(f"Actual age: {actual_age}\nPrediction: {result}\nAccuracy: {accuracy}")

Actual age: 10
Prediction: 11.7
Accuracy: 83.0
CPU times: user 17.6 ms, sys: 0 ns, total: 17.6 ms
Wall time: 576 ms
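
The per-sample "accuracy" printed above is 100 minus the absolute percentage error. The sketch below isolates that calculation and the response parsing so they can be checked without calling the endpoint. The response dictionary mirrors the shape accessed in the cell above (`result["predictions"][0]["score"]`); the helper names and the sample response are assumptions for illustration.

```python
def extract_score(response: dict) -> float:
    """Pull the first predicted value out of a deserialized endpoint response."""
    return round(float(response["predictions"][0]["score"]), 2)


def percent_accuracy(prediction: float, actual: float) -> float:
    """Return 100 minus the absolute percentage error of a single prediction."""
    return round(100 - abs(prediction - actual) / actual * 100, 2)


# Hypothetical response, chosen to match the sample output above
response = {"predictions": [{"score": 11.7}]}
prediction = extract_score(response)
print(percent_accuracy(prediction, 10))
```

Note that this measure is undefined when the actual value is 0 and becomes negative when the error exceeds 100% of the actual value, so it is best read as a quick sanity check on one sample rather than as a model-level metric.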


## Delete the Endpoint
Having an endpoint running incurs costs, so as a clean-up step we should delete the endpoint.

In [13]:
sagemaker.Session().delete_endpoint(linear_predictor.endpoint_name)
print(f"deleted {linear_predictor.endpoint_name} successfully!")

INFO:sagemaker:Deleting endpoint with name: DEMO-ll-aba-2024-09-04-02-39-11-002-d76b32f1


deleted DEMO-ll-aba-2024-09-04-02-39-11-002-d76b32f1 successfully!


# Assignment 2

## Assignment Questions

Please submit the completed notebook with your answers for your assignment.

### Question 1

Changing the model algorithm from Linear Regression to an alternative (e.g., Ridge Regression, Random Forest Regression, Neural Network Regression) may improve ML performance. 

*Answer in the cell below:* True/False? If true, in 3 or fewer sentences say why this may be the case.

True: the data is somewhat correlated and complex, in that different ages will likely have conflicting and similar data points for the various attributes, so there is no direct correlation between, say, age and weight.

### Question 2

Increasing the amount of training data may improve ML performance. 

*Answer in the cell below*: True/False? If true, in 3 or fewer sentences say why this may be the case.

False: the complexity in the data will remain.

### Question 3

Automated Hyperparameter Optimization (HPO) can help with finding the best training hyperparameters for most algorithms.

*Answer in the cell below*: True/False? If true, in 3 or fewer sentences say why this may be the case.

True: because different hyperparameter values can be tried to determine the optimal ones.

### Question 4

No feature engineering was carried out on the dataset in this notebook. By carrying out some feature engineering we can likely get better results. 

*Answer in the cell below*: Can you suggest one or two fields/attributes/columns that we might apply feature engineering on, and how the data in the field might be transformed.

Add a ratio between some of the fields to help normalise the data into fewer clusters.
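
As a minimal sketch of the ratio idea above, the helper below derives two ratio features from a single record. The choice of ratios (shucked weight to whole weight, and length to diameter) and the new field names are assumptions for illustration, not a tested recommendation; the sample values come from the first training row printed earlier.

```python
def add_ratio_features(row: dict) -> dict:
    """Return a copy of the record with two illustrative ratio features added."""
    enriched = dict(row)
    # Guard against division by zero for records with a zero measurement
    enriched["Shucked.ratio"] = (
        row["Shucked.weight"] / row["Whole.weight"] if row["Whole.weight"] else 0.0
    )
    enriched["Length.to.diameter"] = (
        row["Length"] / row["Diameter"] if row["Diameter"] else 0.0
    )
    return enriched


first_row = {
    "Length": 0.615,
    "Diameter": 0.48,
    "Whole.weight": 1.2525,
    "Shucked.weight": 0.585,
}
print(add_ratio_features(first_row))
```

With the pandas data frame loaded earlier, the same idea would be a column division such as `df["Shucked.weight"] / df["Whole.weight"]`. If extra columns are added to the training files, the `feature_dim` hyperparameter would also need to match the new number of features.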

## One Step Further ...

For those of you who have the time and motivation to learn more ...

- Add code to evaluate the model on the full set of test data records 'test_data', outputting useful metrics for a Regression model
- Add your code in the cell below
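
One possible starting point, offered as a sketch rather than a model answer: the helper below computes the mean absolute error, root mean squared error and R² from lists of actual and predicted values, using only the standard library. The function name `regression_metrics` and the toy numbers are assumptions for illustration and are not model output.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE and R^2 for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R^2 is undefined when all actual values are identical
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "rmse": math.sqrt(ss_res / n), "r2": r2}


# Toy values for illustration only
print(regression_metrics([10, 8, 12], [11, 8, 10]))
```

To apply it to this notebook's data, the actual values would be the first field of each line in `test_data` and the predicted values would be the `score` returned by `linear_predictor.predict` for the remaining fields, which requires the endpoint to still be running.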


## One More Step Further ...

For those of you who have the time and motivation to learn more ...

- Try implementing your ideas with respect to questions 1 and/or 4 and see the impact on the model's performance
- Add your code in the cells below
