# Students Do: Housing Price Prediction on SageMaker

* **Dataset:** [Boston house prices dataset - Harrison, D. and Rubinfeld, D.L.](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html)
* **Goal:** Predict the price of a house using linear regression given certain input features.

**Note:** Import this notebook into your notebook instance on Amazon SageMaker and run it there.

In [None]:
# Initial imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


## Loading the Boston House Price Data from `sklearn`

In [None]:
# Loading the Boston house price data from sklearn
# (load_boston was removed in scikit-learn 1.2, so this cell needs an earlier version)
from sklearn.datasets import load_boston

boston_dataset = load_boston()



In [None]:
dir(boston_dataset)



In [None]:
print(boston_dataset.DESCR)



In [None]:
# Creating a DataFrame with the Boston House Data features
features_df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
features_df.head()



In [None]:
# Creating a DataFrame with the target data
target_df = pd.DataFrame(boston_dataset.target)
target_df.head()



In [None]:
# Plot target distribution
target_df.plot.hist(bins=20)


## Data Preparation

A linear regression model will be trained using the average number of rooms per dwelling (`RM`) to predict the house price.

* `X` is the predictor variable vector with the values of `RM`.
* `Y` is the target variable vector with the house price values.
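
As a rough illustration of the shapes involved, the sketch below splits a single-feature dataset with `train_test_split`. It uses synthetic stand-in data rather than the Boston dataset, and the names `rooms`, `prices`, `X_demo`, and `Y_demo` are hypothetical. Keeping the predictor as a 2-D column and the target as a 1-D vector is an assumption about what the later encoding step expects, not something this notebook states.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Boston dataset): 100 samples, 1 feature.
rng = np.random.default_rng(seed=0)
rooms = rng.uniform(3.0, 9.0, size=100)
prices = 9.0 * rooms - 34.0 + rng.normal(0.0, 2.0, size=100)

# A single predictor is kept as a 2-D column with shape (n_samples, 1),
# while the target stays a 1-D vector with shape (n_samples,).
X_demo = rooms.reshape(-1, 1)
Y_demo = prices

# By default 25% of the rows go to the test set; random_state makes the
# split repeatable between runs.
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, random_state=42)

print(X_tr.shape, X_te.shape, Y_tr.shape, Y_te.shape)
```

With 100 rows and the default `test_size`, the split yields 75 training rows and 25 testing rows.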

In [None]:
# Define the X and Y vectors
X = 
Y = 

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = 


## Machine Learning Model Creation

In [None]:
bucket = "your_s3_bucket_name_here"
prefix = "boston-housing-regression"

# Amazon SageMaker and related imports
# (these names follow version 1.x of the SageMaker Python SDK)
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker import get_execution_role
import boto3  # AWS Python sdk

import os
import io
import time
import json
import re

# AWS IAM role
role = get_execution_role()


### Uploading Training Data to Amazon S3

To train your machine learning model using Amazon SageMaker, the training data should be passed through an Amazon S3 bucket in the [protobuf recordIO format](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#td-serialization).

The protobuf recordIO format is a method to serialize structured data (similar to `JSON`) so that different applications can exchange it or store it.

Using the protobuf recordIO format allows you to take advantage of _Pipe mode_ when training the algorithms that support it. In _Pipe mode_, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput.
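
To make the idea of record framing more concrete, here is a minimal sketch of how length-prefixed records can be written to and read back from a byte stream, which is what makes streaming possible. The layout (a magic number, a length, the payload, then zero padding to a 4-byte boundary) and the `MAGIC` value are assumptions modeled on RecordIO-style formats. The helpers `write_record` and `read_records` are hypothetical and are not part of the SageMaker SDK. In the real format each payload would be a serialized protobuf message, while plain bytes stand in for it here.

```python
import io
import struct

# Assumed framing constant, modeled on RecordIO-style formats.
# It is an illustration only and is not taken from this notebook.
MAGIC = 0xCED7230A


def write_record(stream, payload: bytes) -> None:
    """Frame one payload: magic, length, data, zero padding to 4 bytes."""
    stream.write(struct.pack("<I", MAGIC))
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)
    stream.write(b"\x00" * (-len(payload) % 4))


def read_records(stream):
    """Yield the payloads back, one record at a time."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream
        magic, length = struct.unpack("<II", header)
        if magic != MAGIC:
            raise ValueError("bad record header")
        payload = stream.read(length)
        stream.read(-length % 4)  # skip the padding bytes
        yield payload


demo_buf = io.BytesIO()
for payload in (b"row-1", b"row-two", b"r3"):
    write_record(demo_buf, payload)

demo_buf.seek(0)
decoded = list(read_records(demo_buf))
print(decoded)
```

Because every record carries its own length, a reader can consume the stream one record at a time without loading the whole file first.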

The following code encodes the training data as a Protocol Buffer and then uploads it to the Amazon S3 bucket.

In [None]:
# Encode the training data as Protocol Buffer
buf = io.BytesIO()
vectors = np.array(X_train).astype("float32")
labels = np.array(Y_train).astype("float32")
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)

# Upload encoded training data to Amazon S3
key = "linear_train.data"
boto3.resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train", key)
).upload_fileobj(buf)
s3_train_data = "s3://{}/{}/train/{}".format(bucket, prefix, key)
print("Training data uploaded to: {}".format(s3_train_data))


#### Upload Test Data to Amazon S3

If you provide test data, the algorithm logs include the test score for the final model.

In [None]:
# Encode the testing data as Protocol Buffer


# Upload encoded testing data to Amazon S3



### Training the Machine Learning Model

Once you have uploaded your data to Amazon S3, it's time to train the machine learning model. In this activity, you will use Amazon SageMaker's [_linear learner algorithm_](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) to run a linear regression prediction model.

You can learn more about the different Amazon SageMaker built-in algorithms [on this page](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).

First, the container image of the linear learner algorithm is retrieved for the current AWS region.

In [None]:
# Retrieve the linear learner container image for the current region
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, "linear-learner")


Next, the estimator is created from that container. Training runs on an `ml.m4.xlarge` instance, which is set through `train_instance_type`.

**Note:** This step might take a few minutes.

In [None]:
# Start the Amazon SageMaker session
sess = sagemaker.Session()

# Create an instance of the linear learner estimator
linear = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=sess,
)

# Define linear learner hyperparameters
linear.set_hyperparameters(
    feature_dim=1,
    mini_batch_size=100,
    predictor_type="regressor",
    epochs=10,
    num_models=32,
    loss="absolute_loss",
)

# Fitting the linear learner model with the training data
linear.fit({"train": s3_train_data, "test": s3_test_data})


### Deploying the Model to Make Predictions

In this section, the trained `linear-learner` model will be used to predict house prices. Deploy the model using an `ml.t2.medium` instance type.

**Note:** This step might take a few minutes.

In [None]:
# An instance of the linear-learner predictor is created
linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.t2.medium")



In [None]:
# Linear predictor configurations
linear_predictor.content_type = "text/csv"
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer



In [None]:
# Making some predictions using the test data.
result = linear_predictor.predict(X_test)
y_predictions = np.array([r["score"] for r in result["predictions"]])
y_predictions
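
The exchange with the endpoint can be pictured as CSV text going in and JSON coming back. The sketch below imitates that round trip with the standard library only. The helper `to_csv_payload` is hypothetical and is not the SDK's serializer. The response shape is inferred from the `result["predictions"]` and `r["score"]` accesses in the cell above, and the values in `mock_response` are invented for illustration.

```python
import csv
import io
import json


def to_csv_payload(rows):
    """Render a list of feature rows as CSV text, one sample per line."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue().rstrip("\n")


# Two samples with a single feature each (for example, a room count).
payload = to_csv_payload([[6.2], [7.1]])

# Invented response body with the same shape the cell above reads from.
mock_response = '{"predictions": [{"score": 21.5}, {"score": 30.25}]}'
scores = [p["score"] for p in json.loads(mock_response)["predictions"]]

print(repr(payload))
print(scores)
```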


### Model Evaluation

To evaluate the model, the predicted house prices are plotted against the actual values. Additionally, the `RMSE` and `R2` scores are calculated.
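
For reference, both scores can be computed by hand. The sketch below uses a tiny made-up set of values (not model output) to show the arithmetic behind `RMSE`, the square root of the mean squared error, and `R2`, one minus the ratio of the residual sum of squares to the total sum of squares. The function and variable names are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values: every prediction is off by exactly 1.
actual_demo = [20.0, 22.0, 24.0, 26.0]
predicted_demo = [21.0, 21.0, 25.0, 25.0]

rmse_demo = rmse(actual_demo, predicted_demo)
r2_demo = r2(actual_demo, predicted_demo)
print("RMSE: {:.3f}".format(rmse_demo))
print("R2 score: {:.3f}".format(r2_demo))
```

Here every squared error is 1, so the `RMSE` is 1. The residual sum of squares is 4 against a total of 20, which gives an `R2` of 0.8.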

In [None]:
# Plotting predicted vs. actual values
plt.plot(np.array(Y_test), label="actual")
plt.plot(y_predictions, label="predict")
plt.legend()
plt.show()



In [None]:
# Calculating the RMSE and R2 scores
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(Y_test, y_predictions))
r2 = r2_score(Y_test, y_predictions)

print("RMSE: {}".format(rmse))
print("R2 score: {}".format(r2))


Finally, the endpoint is deleted to avoid further AWS resource usage.

In [None]:
# Delete the Amazon SageMaker endpoint
sagemaker.Session().delete_endpoint(linear_predictor.endpoint)
