# Regression with Amazon SageMaker XGBoost algorithm
## House prices prediction

---

## Contents

1. [Data Preprocessing](#Data-Preprocessing)
2. [Training the XGBoost model](#Training-the-XGBoost-model)
3. [Set up hosting for the model](#Set-up-hosting-for-the-model)
4. [Using SageMaker Endpoint](#Using-SageMaker-Endpoint)

---

In [None]:
import os
import boto3
import re
import copy
import time
import pandas as pd
import numpy as np
import sagemaker
from mxnet import nd
from time import gmtime, strftime
from sagemaker import get_execution_role

role = get_execution_role()

region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()

bucket = 'sagemaker-eu-west-1-483308273948'  # replace with the name of your own S3 bucket (create it first)
prefix = 'house_prices'
# URL of the bucket where the data is stored; adjust it to your own bucket
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)

## Data Preprocessing

We'll read the dataset from the existing repository into memory for preprocessing prior to training. This processing could instead be done by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. The next step would then be to transfer the data to S3 for use in training. For small datasets such as this one, reading into memory isn't onerous, though it would be for larger datasets.


In [None]:
train = pd.read_csv('data/train.csv')
label = pd.read_csv('data/label.csv', header=None)[1]

In [None]:
pd.options.display.max_columns = 999
train.head(5)

In [None]:
# Split the dataset into training and validation sets, each with its data and labels

# 80%-20% training to validation
# DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
train = train.values
label = label.values
train_size = int(train.shape[0]*0.8)

train_data  = train[:train_size,:]
val_data = train[train_size:,:]

train_label = label[:train_size]
val_label = label[train_size:]

In [None]:
import io
train_data_url = ""
validation_data_url = ""
def to_libsvm(f, labels, values):
    # Write one "<label> <index>:<value> ..." line per row, using 1-based feature indices
    f.write(bytes('\n'.join(
        ['{} {}'.format(label, ' '.join(['{}:{}'.format(i + 1, el) for i, el in enumerate(vec)])) for label, vec in
         zip(labels, values)]), 'utf-8'))
    return f

def write_to_s3(fobj, bucket, key):
    return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(fobj)

partitions = [('train', (train_data,train_label)), ('validation', (val_data,val_label))]
for partition_name, partition in partitions:
    print('{}: {} {}'.format(partition_name, partition[0].shape, partition[1].shape))
    labels = partition[1].tolist()
    vectors = partition[0].tolist()
    f = io.BytesIO()
    to_libsvm(f, labels, vectors)
    f.seek(0)
    key = "{}/csv/{}".format(prefix,partition_name)
    url = 's3://{}/{}'.format(bucket, key)
    print('Writing to {}'.format(url))
    write_to_s3(f, bucket, key)
    print('Done writing to {}'.format(url))
    if (partition_name == "train"):
        train_data_url = url
    else:
        validation_data_url = url

output = 's3://{}/{}'.format(bucket, prefix+'/output')

In [None]:
output
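
To make the LIBSVM layout concrete, here is a minimal sketch that formats two rows the same way `to_libsvm` does: the label first, then 1-based `index:value` pairs. The helper name `format_libsvm_row` and the sample values are made up for illustration and are not part of the dataset.

```python
def format_libsvm_row(label, values):
    """Format one row as '<label> <1-based index>:<value> ...'."""
    pairs = ' '.join('{}:{}'.format(i + 1, v) for i, v in enumerate(values))
    return '{} {}'.format(label, pairs)

# Made-up sample data, for illustration only
sample_labels = [12.25, 11.9]
sample_rows = [[60.0, 8450.0, 7.0], [20.0, 9600.0, 6.0]]

libsvm_text = '\n'.join(
    format_libsvm_row(label, row) for label, row in zip(sample_labels, sample_rows)
)
print(libsvm_text)
```

Like the notebook's function, this sketch writes every feature, including zeros, rather than using the sparse form that LIBSVM also allows.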


## Training the XGBoost model


In [None]:
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}
container = containers[boto3.Session().region_name]

In [None]:
%%time
import boto3
import sagemaker
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, CategoricalParameter, ContinuousParameter

sess = sagemaker.Session()
region = boto3.Session().region_name

train = sagemaker.s3_input(s3_data=train_data_url,content_type='libsvm')
validation = sagemaker.s3_input(s3_data=validation_data_url,content_type='libsvm')

# Creating a new sagemaker job
estimator = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.c4.xlarge',
                                       output_path=output,
                                       sagemaker_session=sess)

# Set the job hyperparameters
estimator.set_hyperparameters(eval_metric='rmse',
                           objective="reg:linear",
                           num_round=100)

hyperparameter_ranges = {'eta': ContinuousParameter(0, 1), # The eta parameter shrinks the feature weights to make the boosting process more conservative.
                         'alpha' : ContinuousParameter(0, 2), # L1 regularization term on weights. Increasing this value makes models more conservative.
                         'min_child_weight' : ContinuousParameter(1, 10), # Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning.
                         'max_depth' : IntegerParameter(1, 10)} # Maximum depth of a tree.

objective_metric_name = 'validation:rmse'

tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type='Minimize',
                            max_jobs=9,
                            max_parallel_jobs=3)
tuner.fit({'train': train,'validation' : validation})
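
The tuner above minimizes `validation:rmse`, the root mean squared error on the validation channel. XGBoost computes this metric internally; the sketch below only illustrates what the number measures. The function name `rmse` and the sample values are made up for illustration.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError('predictions and targets must have the same length')
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions and labels, for illustration only
example_predictions = [12.0, 11.5, 12.5]
example_targets = [12.0, 12.0, 12.0]
print(rmse(example_predictions, example_targets))
```

Because the errors are squared before averaging, a few large misses raise the metric more than many small ones.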


## Set up hosting for the model
Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inferences) from the model dynamically.

_Note, Amazon SageMaker allows you the flexibility of importing models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or other deployment target._

In [None]:
xgboost_predictor = tuner.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

## Using SageMaker Endpoint

In [None]:
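from sagemaker.predictor import csv_serializer, json_deserializer

# Serialize requests as CSV so the payload matches the declared content type
xgboost_predictor.content_type = 'text/csv'
xgboost_predictor.serializer = csv_serializer
# For a single row the response is one number in plain text, which parses as JSON
xgboost_predictor.deserializer = json_deserializer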
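
With the content type set to `text/csv`, the request body for one house is a single comma-separated line of feature values, without the label column. The sketch below shows what such a payload looks like. The helper `to_csv_payload` and the feature values are hypothetical and only illustrate the format; they are not the SDK's own serializer.

```python
def to_csv_payload(features):
    """Join feature values into a single comma-separated line."""
    return ','.join(str(value) for value in features)

# Made-up feature vector, for illustration only
example_features = [60.0, 8450.0, 7.0]
payload = to_csv_payload(example_features)
print(payload)
```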

In [None]:
from math import exp
item = 44
print("Data: \n" + str(val_data[item].tolist()) + "\n")
print("Predicted price: $" + str(round(exp(xgboost_predictor.predict(val_data[item].tolist())))))
print("Real price: $" + str(round(exp(val_label[item]))))
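
The calls to `exp` above suggest that the labels are stored as the natural logarithm of the sale price. This is an assumption inferred from the code rather than something the notebook states. Under that assumption, the round trip between the two scales can be sketched as follows, using a made-up price.

```python
import math

price = 208500.0               # made-up sale price in dollars
log_price = math.log(price)    # the scale the labels are assumed to use
recovered = math.exp(log_price)  # back to dollars, as done for predictions above

print(round(log_price, 4), round(recovered))
```

One consequence of working in log space is that errors are relative rather than absolute: a prediction that is off by 0.1 in log space is off by roughly 10% in price, whatever the price level.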

## Predict on new data

In [None]:
test = pd.read_csv('data/test.csv')
len(test)

In [None]:
prices = []
for item in test.values:
    data = {}
    for column in test.keys():
        data[column] = item[test.columns.get_loc(column)]
    data['Price'] = exp(xgboost_predictor.predict(item.tolist()))
    prices.append(data)

In [None]:
df = pd.DataFrame(prices)

In [None]:
import matplotlib.pyplot as plt
df.plot.scatter(x='LotArea', y='Price')
plt.show()

## Delete the Endpoint

In [None]:
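import sagemaker

# Deleting the endpoint stops it from incurring charges. Cells that call
# xgboost_predictor.predict() will no longer work once it is deleted.
sagemaker.Session().delete_endpoint(xgboost_predictor.endpoint)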

In [None]:
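import operator

# Absolute error between the endpoint's prediction and the label for each validation row
values = {}
for i, item in enumerate(val_data):
    values[i] = abs(xgboost_predictor.predict(item.tolist()) - val_label[i])

# Sort rows from smallest to largest error
sorted_values = sorted(values.items(), key=operator.itemgetter(1))

print(sorted_values)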
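
The same ranking can be tried out without a live endpoint. The sketch below applies the idea to made-up predictions and labels; the function name `rank_by_abs_error` and the numbers are illustrative only.

```python
def rank_by_abs_error(predictions, targets):
    """Return (index, absolute error) pairs sorted from smallest to largest error."""
    errors = [(i, abs(p - t)) for i, (p, t) in enumerate(zip(predictions, targets))]
    return sorted(errors, key=lambda pair: pair[1])

# Made-up predictions and labels, for illustration only
example_predictions = [12.1, 11.4, 12.9]
example_targets = [12.0, 12.0, 12.0]

ranked = rank_by_abs_error(example_predictions, example_targets)
worst_index, worst_error = ranked[-1]
print(worst_index, round(worst_error, 2))
```

The last entries of the ranking point at the rows the model handles worst, which are usually the most informative ones to inspect by hand.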

In [None]:
!aws s3 cp s3://sagemaker-eu-west-1-483308273948/house_prices/output/xgboost-180614-1835-004-9859f1c0/output/model.tar.gz ./

In [None]:
!tar xzvf model.tar.gz

In [None]:
!pip install xgboost

In [None]:
import pickle as pkl
import xgboost as xgb

# Load the pickled model extracted from model.tar.gz and close the file afterwards
with open('xgboost-model', 'rb') as f:
    model = pkl.load(f)

In [None]:
# DMatrix expects 2-D input: a single row with one column per feature
data = xgb.DMatrix(val_data[1].reshape(1, -1))

In [None]:
val_data[1]

In [None]:
xgboost_predictor.predict(val_data[0].tolist())

In [None]:
model.predict(data)

In [None]:
val_data[0]