# California Housing Regression Model Building

This demo shows how to use SageMaker Studio notebooks to build machine learning models. We'll cover Jupyter extensions, local model building, scaled SageMaker training jobs, hyperparameter optimization, and model deployment.

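We demonstrate these capabilities through a `California Housing` regression example. The walkthrough is organized as follows: setup, exploratory data analysis, local model building, scaled training with SageMaker training jobs and experiment tracking, hyperparameter tuning, and deployment of the best model.

Make sure you have selected the `Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)` kernel.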

### Setup

In [None]:
# Installed Libraries
import os
import time
import boto3
import itertools
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sagemaker.tensorflow import TensorFlow
import sagemaker
from sagemaker import get_execution_role

# Project Imports
from california_housing_tf2 import get_model, train_model

## Exploratory Data Analysis

### Download California Housing dataset

In [None]:
data_dir = os.path.join(os.getcwd(), "data")
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), "data/train")
os.makedirs(train_dir, exist_ok=True)

test_dir = os.path.join(os.getcwd(), "data/test")
os.makedirs(test_dir, exist_ok=True)

data_set = fetch_california_housing(as_frame=True)

In [None]:
data_set.frame.head()

#### Objective
The target is the median house value for each district, expressed in units of $100,000. Because the target is a continuous quantity, this is a regression problem.

### Install Plotting Libraries

In [None]:
%pip install -q plotly nbformat matplotlib

### Visualize Data with Matplotlib

In [None]:
import matplotlib.pyplot as plt

data_set.frame.hist(figsize=(12, 10), bins=15, edgecolor="black")
plt.subplots_adjust(hspace=0.7, wspace=0.4)

### Interactively Visualize Data with Plotly

In [None]:
import plotly.express as px

# Pass the full frame and select the column by name
fig = px.histogram(data_set.frame, x="HouseAge", nbins=15)
fig.show()

### Data Transformations

In [None]:
X = pd.DataFrame(data_set.data, columns=data_set.feature_names)
Y = pd.DataFrame(data_set.target)

# We partition the dataset into 2/3 training and 1/3 test set.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)

scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

np.save(os.path.join(train_dir, "x_train.npy"), x_train)
np.save(os.path.join(test_dir, "x_test.npy"), x_test)
np.save(os.path.join(train_dir, "y_train.npy"), y_train)
np.save(os.path.join(test_dir, "y_test.npy"), y_test)
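
To make the scaling step concrete, here is a minimal pure-Python sketch of standardization for a single feature. It mirrors the pattern used above: the mean and standard deviation are computed from the training data only and then reused for the test data. The helper names and the sample values are hypothetical and are not taken from the dataset.

```python
import statistics


def fit_standardizer(column):
    """Return (mean, std) computed from the training column only."""
    mean = statistics.mean(column)
    # Population standard deviation (ddof=0), which is what StandardScaler uses.
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std


def standardize(column, mean, std):
    """Apply previously fitted statistics to any column."""
    return [(value - mean) / std for value in column]


# Hypothetical values for one feature, for illustration only.
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 10.0]

mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)  # reuses the training statistics
```

Fitting on the training split alone keeps information about the test split from leaking into preprocessing.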

## Build Model Locally

In [None]:
my_model = get_model()
print(my_model.summary())

In [None]:
learning_rate = 0.1
epochs = 20
batch_size = 64
train_model(model=my_model, learning_rate=learning_rate, epochs=epochs,
            batch_size=batch_size,
            x_train=x_train, y_train=y_train, x_test=x_test,
            y_test=y_test, output_dir=os.getcwd())

## Use SageMaker Training Jobs for Scaled Training

In [None]:
sess = boto3.Session()
sm = sess.client("sagemaker")
role = get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=sess)
bucket = sagemaker_session.default_bucket()
prefix = "tf2-california-housing-experiment"

### Upload Data to S3

In [None]:
# Reuse the session created above instead of instantiating a new one per upload
s3_inputs_train = sagemaker_session.upload_data(
    path="data/train", bucket=bucket, key_prefix=prefix + "/train"
)
s3_inputs_test = sagemaker_session.upload_data(
    path="data/test", bucket=bucket, key_prefix=prefix + "/test"
)
inputs = {"train": s3_inputs_train, "test": s3_inputs_test}
print(inputs)

### Step 1 - Set up the Experiment

Create an experiment to track all the model training iterations. Experiments are a good way to organize your data science work. You can create an experiment for: [1] a business use case you are addressing (e.g. an experiment named “customer churn prediction”), [2] a data science team that owns the experiment (e.g. an experiment named “marketing analytics experiment”), or [3] a specific data science and ML project. Think of it as a “folder” for organizing your “files”.

### Create an Experiment

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

In [None]:
california_housing_experiment = Experiment.create(
    experiment_name=f"tf2-california-housing-{int(time.time())}",
    description="Training on california housing dataset",
    sagemaker_boto_client=sm,
)
print(california_housing_experiment)

### Step 2 - Track Experiment

Now create a Trial for each training run to track its inputs, parameters, and metrics.

While training the TensorFlow 2 regression model on SageMaker, you will experiment with several values of the learning rate. You will create a Trial to track each training job run. Each training job is attached to its Trial as a trial component through the experiment configuration passed to `fit`.

In [None]:
hyperparam_options = {"learning_rate": [0.1, 0.5, 0.3], "epochs": [100]}

hypnames, hypvalues = zip(*hyperparam_options.items())
trial_hyperparameter_set = [dict(zip(hypnames, h)) for h in itertools.product(*hypvalues)]
trial_hyperparameter_set
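
As a sketch of what the grid expansion above produces, the same `itertools.product` logic can be wrapped in a small helper. The function name `expand_grid` is hypothetical; the options are the ones used in this notebook.

```python
import itertools


def expand_grid(options):
    """Return one dict per combination of the listed hyperparameter values."""
    names, values = zip(*options.items())
    return [dict(zip(names, combo)) for combo in itertools.product(*values)]


grid = expand_grid({"learning_rate": [0.1, 0.5, 0.3], "epochs": [100]})
for params in grid:
    print(params)
```

With three learning rates and a single epoch setting, the grid contains three combinations, so the loop that follows launches three training jobs. Adding a second value to any list multiplies the number of jobs accordingly.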

The training jobs below are launched asynchronously (`wait=False`), so several of them run at the same time. If your account's resource limits do not allow that, request a limit increase or set `wait=True` to run the jobs sequentially.


In [None]:
from sagemaker.tensorflow import TensorFlow

run_number = 1
for trial_hyp in trial_hyperparameter_set:
    # Use the trial-specific hyperparameters (this demo has no static ones)
    hyperparams = trial_hyp

    # Create unique job name with hyperparameter and time
    time_append = int(time.time())
    hyp_append = "-".join([str(elm).replace(".", "-") for elm in trial_hyp.values()])
    training_job_name = f"tf2-california-housing-training-{hyp_append}-{time_append}"
    trial_name = f"trial-tf2-california-housing-training-{hyp_append}-{time_append}"
    trial_desc = f"my-tensorflow2-california-housing-run-{run_number}"

    # Create a new Trial inside the experiment
    tf2_california_housing_trial = Trial.create(
        trial_name=trial_name,
        experiment_name=california_housing_experiment.experiment_name,
        sagemaker_boto_client=sm,
        tags=[{"Key": "trial-desc", "Value": trial_desc}],
    )

    # Create an experiment config that associates training job to the Trial
    experiment_config = {
        "ExperimentName": california_housing_experiment.experiment_name,
        "TrialName": tf2_california_housing_trial.trial_name,
        "TrialComponentDisplayName": training_job_name,
    }

    metric_definitions = [
        {"Name": "loss", "Regex": "loss: ([0-9\\.]+)"},
        {"Name": "accuracy", "Regex": "accuracy: ([0-9\\.]+)"},
        {"Name": "val_loss", "Regex": "val_loss: ([0-9\\.]+)"},
        {"Name": "val_accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"},
    ]

    # Create a TensorFlow Estimator with the Trial specific hyperparameters
    tf2_california_housing_estimator = TensorFlow(
        entry_point="california_housing_tf2.py",
        role=sagemaker.get_execution_role(),
        instance_count=1,
        instance_type="ml.m5.large",
        framework_version="2.4.1",
        hyperparameters=hyperparams,
        py_version="py37",
        metric_definitions=metric_definitions,
        enable_sagemaker_metrics=True,
        tags=[{"Key": "trial-desc", "Value": trial_desc}],
    )

    # Launch a training job
    tf2_california_housing_estimator.fit(
        inputs, job_name=training_job_name, experiment_config=experiment_config, wait=False,
    )

    # give it a while before dispatching the next training job
    time.sleep(2)
    run_number = run_number + 1
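
The loop above replaces `.` with `-` when it builds job names because training job names accept only a restricted character set. The sketch below reproduces that naming scheme and checks the result. The pattern and the 63-character limit reflect the `CreateTrainingJob` naming rules as we understand them and should be treated as an assumption to verify against the current API reference. The helper names and the fixed timestamp are hypothetical.

```python
import re

# Assumed naming rule: alphanumerics and hyphens, no leading or trailing hyphen.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63  # assumed limit


def build_job_name(prefix, hyperparams, timestamp):
    """Join hyperparameter values into a name, replacing '.' with '-'."""
    hyp_part = "-".join(str(v).replace(".", "-") for v in hyperparams.values())
    return f"{prefix}-{hyp_part}-{timestamp}"


def is_valid_job_name(name):
    """Check the name against the assumed pattern and length limit."""
    return len(name) <= MAX_JOB_NAME_LENGTH and bool(JOB_NAME_PATTERN.match(name))


# Fixed timestamp so the example is deterministic.
name = build_job_name(
    "tf2-california-housing-training",
    {"learning_rate": 0.1, "epochs": 100},
    1700000000,
)
print(name, is_valid_job_name(name))
```

A check like this is useful when the grid grows, since every extra hyperparameter lengthens the generated name.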

### Compare the model training runs for an experiment

Now you will use the analytics capabilities of the SageMaker Python SDK to query and compare the training runs and identify the best model produced by the experiment. You can retrieve trial components by using a search expression.

In [None]:
from sagemaker.analytics import ExperimentAnalytics

experiment_name = california_housing_experiment.experiment_name

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session, experiment_name=experiment_name
)
trial_comp_ds_jobs = trial_component_analytics.dataframe()
trial_comp_ds_jobs

Let's show the final validation loss, epochs, and learning rate for each run.
Because lower loss is better, you will sort the results by validation loss in ascending order so that the best run appears first.

In [None]:
trial_comp_ds_jobs = trial_comp_ds_jobs.sort_values("val_loss - Last", ascending=True)
trial_comp_ds_jobs[["TrialComponentName", "val_loss - Last", "epochs", "learning_rate"]]

### Compare Experiments, Trials, and Trial Components in Amazon SageMaker Studio

You can compare experiments, trials, and trial components by selecting the entities and opening them in the trial components list. The trial components list is referred to as the Studio Leaderboard. In the Leaderboard you can do the following:
- View detailed information about the entities
- Compare entities
- Stop a training job
- Deploy a model

<b>To compare experiments, trials, and trial components</b>
- In the left sidebar of SageMaker Studio, choose the <b>SageMaker Experiment List icon</b>.
- In the <b>Experiments</b> browser, choose either the experiment or trial list. 
- Choose the experiments or trials that you want to compare, right-click the selection, and then choose <b>Open in trial component list</b>. The Leaderboard opens and lists the associated Experiments entities as shown in the following screenshot.

![studio_trial_component_list](./images/studio_trial_component_list.png)

## Use Hyperparameter Tuning

In [None]:
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

objective_metric_name = "loss"
objective_type = "Minimize"
metric_definitions = [
    {"Name": "loss", "Regex": "loss: ([0-9\\.]+)"},
    {"Name": "accuracy", "Regex": "accuracy: ([0-9\\.]+)"},
    {"Name": "val_loss", "Regex": "val_loss: ([0-9\\.]+)"},
    {"Name": "val_accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"},
]

# Search space explored by the tuner
hyperparameter_ranges = {"learning_rate": ContinuousParameter(1e-4, 1e-3)}

tf2_california_housing_estimator = TensorFlow(
    entry_point="california_housing_tf2.py",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.large",
    framework_version="2.4.1",
    py_version="py37",
)

tuner = HyperparameterTuner(
    estimator=tf2_california_housing_estimator,
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    base_tuning_job_name="housing-hpo",
    strategy="Bayesian",
    max_jobs=6,
    max_parallel_jobs=3,
    objective_type=objective_type,
)

tuner.fit(inputs)
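
The metric definitions above extract values from the training log with regular expressions. The sketch below uses Python's `re` module on a hypothetical Keras-style log line (the real script's log format may differ) to show that the pattern `loss: ([0-9\.]+)` also matches the tail of `val_loss: ...`. Prefixing the pattern with a word boundary `\b` is one possible way to match only the training loss in Python. Whether the same pattern behaves identically in SageMaker's metric extraction is not verified here.

```python
import re

# Hypothetical log line, for illustration only.
log_line = (
    "Epoch 3/20 - loss: 0.5321 - accuracy: 0.0031 "
    "- val_loss: 0.6012 - val_accuracy: 0.0029"
)

# Same pattern as in the metric definitions: matches inside "val_loss" too.
loose = re.findall(r"loss: ([0-9\.]+)", log_line)

# "_" counts as a word character, so "\b" does not match between "_" and "l".
strict = re.findall(r"\bloss: ([0-9\.]+)", log_line)

print(loose)
print(strict)
```

If the tuning objective is meant to be the training loss only, it is worth confirming which value the `loss` metric actually captures in the job's metrics before comparing runs.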

In [None]:
# Summarize the training jobs launched by the tuning job
results = tuner.analytics()
results.training_job_summaries()

### Deploy Best Model

In [None]:
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

In [None]:
predictions = predictor.predict(x_test[:10])
print(predictions)
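
To compare endpoint output with `y_test`, the response first has to be flattened. The sketch below assumes the endpoint returns the usual TensorFlow Serving JSON shape, `{"predictions": [[value], ...]}`, with one output per row. Both the response shape and the single-output model are assumptions about `california_housing_tf2.py`, and the numbers are placeholders rather than real results.

```python
def flatten_predictions(response):
    """Flatten a TF Serving style response into a list of floats."""
    return [float(row[0]) for row in response["predictions"]]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Placeholder response and targets, for illustration only.
response = {"predictions": [[2.0], [1.5], [3.0]]}
targets = [2.5, 1.5, 2.0]

preds = flatten_predictions(response)
mae = mean_absolute_error(targets, preds)
print(preds, mae)
```

In the notebook, `response` would be the `predictions` object returned by `predictor.predict`, and `targets` the first ten values of `y_test`.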

In [None]:
predictor.delete_endpoint()