# Recommendations with XGBoost
_**Using Gradient Boosted Trees to Provide Movie Recommendations**_

---

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)
  1. [Evaluate](#Evaluate)

---

## Background


TODO

This notebook is NOT part of the workshop.  It is used to pretrain the XGBoost movie recommendation model.  The trained model should be uploaded to the correct S3 bucket so that it can be used for deployment.


---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance. TODO : Confirm the notebook instance type._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data.  See the documentation for how to create these.  Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [None]:
bucket = 'sagemaker-us-west-2-555360056434'  ##TODO : Change this to session bucket
prefix = 'sagemaker/recommendations-xgboost-movie'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()


Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

---
## Data

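The MovieLens 100K dataset contains 100,000 ratings (on a scale of 1 to 5) from 943 users on 1,682 movies.  This notebook uses two of its files: `u.data`, a tab-separated list of user id, item id, rating, and timestamp, and `u.user`, a pipe-separated list of user id, age, gender, occupation, and zip code.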

In [None]:
## Has the MovieLens data with all features already been prepared?
feature_data_prepared = os.path.exists('ml-100k/movielens_data_allfeatures.csv')


In [None]:
##TODO : Make this conditional.  Only need this the first time the data is being prepared

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip
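
One way to address the TODO above is to skip the download when the data is already on disk. The sketch below assumes a hypothetical helper name, `ensure_movielens`, and uses only the Python standard library instead of `wget` and `unzip`; the marker path passed in is whatever file you treat as proof that the data is present.

```python
import os
import urllib.request
import zipfile


def ensure_movielens(marker_path, url, zip_path="ml-100k.zip", extract_dir="."):
    """Download and unzip the archive only if marker_path does not exist yet.

    Returns True if a download happened, False if it was skipped.
    (Hypothetical helper, not part of the original notebook.)
    """
    if os.path.exists(marker_path):
        return False
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
    return True


# Usage (left commented out so the sketch does not touch the network):
# ensure_movielens("ml-100k/u.data",
#                  "http://files.grouplens.org/datasets/movielens/ml-100k.zip")
```

Checking for `ml-100k/u.data` rather than the prepared feature file keeps the download step independent of the later preparation cells.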

Combine data from multiple files to create training data.

In [None]:
##TODO : Make this conditional.  Only need this the first time the data is being prepared


## Start by reading the full ratings file, u.data, into a dataframe
## (tab-separated: user id, item id, rating, timestamp)
df = pd.read_csv('ml-100k/u.data', header=None, delimiter = '\t')
df.columns = ["User", "Item", "Rating", "TimeStamp"]

print("Number of ratings : ", len(df))
print( df["User"])

In [None]:
##TODO : Make this conditional.  Only need this the first time the data is being prepared

## Now get the additional columns for user gender, age, occupation, Zipcode
df_user = pd.read_csv('ml-100k/u.user', header=None, delimiter = '|')
df_user.columns = ["User", "Age", "Gender", "Occupation", "Zipcode"]

len(df_user)

In [None]:
##Combine the two dataframes to get the complete data set.
#A left merge on "User" attaches each user's gender, occupation and zipcode
#to every rating row, without a slow row-by-row loop.
user_features = df_user[["User", "Gender", "Occupation", "Zipcode"]]
user_features = user_features.rename(columns={"Zipcode": "Zip Code"})
df = df.merge(user_features, on="User", how="left")

print("After update")
print(df[:100])

df.to_csv("ml-100k/movielens_data_allfeatures.csv")


## Preprocess features of the MovieLens data

In [None]:
movie_lens_data_df = pd.read_csv('ml-100k/movielens_data_allfeatures.csv')
pd.set_option('display.max_columns', 500)
movie_lens_data_df

In [None]:
##Remove the unnamed:0 column
movie_lens_data_df.drop(['Unnamed: 0'],axis=1, inplace=True)


## One-hot encode categorical values

In [None]:
# One hot encode "Gender" 
movie_lens_data_df = pd.concat([movie_lens_data_df,pd.get_dummies(movie_lens_data_df['Gender'], prefix='Gender')],axis=1)
movie_lens_data_df

In [None]:
##Drop the original feature, since it is not needed anymore.
movie_lens_data_df.drop(['Gender'],axis=1, inplace=True)

In [None]:
# One hot encode the 'Occupation' attribute 
movie_lens_data_df = pd.concat([movie_lens_data_df,pd.get_dummies(movie_lens_data_df['Occupation'], prefix='Occupation')],axis=1)
movie_lens_data_df

In [None]:
##Drop the original feature, since it is not needed anymore.
movie_lens_data_df.drop(['Occupation'],axis=1, inplace=True)

In [None]:
##For SageMaker XGBoost, the target variable should be the first column and there should be no headers in the file.
##So move the 'Rating' column to the beginning of the dataframe.
rating = movie_lens_data_df['Rating']
movie_lens_data_df.drop(labels=['Rating'], axis=1,inplace = True)
movie_lens_data_df.insert(0, 'Rating', rating)
movie_lens_data_df

In [None]:
##Check the columns after all the processing.
movie_lens_data_df.columns

## Explore the data : TODO

And now let's split the data into training, validation, and test sets.  This will help prevent us from overfitting the model, and allow us to test the model's accuracy on data it hasn't already seen.

In [None]:
##Shuffle the rows, then cut at 70% and 90% to get the training, validation, and test sets.

train_data, validation_data, test_data = np.split(movie_lens_data_df.sample(frac=1, random_state=1729), [int(0.7 * len(movie_lens_data_df)), int(0.9 * len(movie_lens_data_df))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)


print("Number of training samples : " , len(train_data))
print("Number of validation samples : " , len(validation_data))
print("Number of test samples : " , len(test_data))
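
To make the cut points concrete, here is a small sketch of the arithmetic behind the split. `split_sizes` is a hypothetical helper written for illustration; it mirrors the `int(0.7 * n)` and `int(0.9 * n)` boundaries passed to `np.split` above and reports how many rows land in each piece.

```python
def split_sizes(n_rows, cut_fractions=(0.7, 0.9)):
    """Return the sizes of the pieces produced by cutting n_rows at the given fractions."""
    cuts = [int(fraction * n_rows) for fraction in cut_fractions]
    bounds = [0] + cuts + [n_rows]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# With the 100,000 ratings in MovieLens 100K:
print(split_sizes(100000))  # [70000, 20000, 10000]
```

So the three pieces hold 70%, 20%, and 10% of the shuffled rows respectively.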




Now we'll upload these files to S3.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
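
Before training, it may be worth confirming that every field in the uploaded CSV files parses as a number, since the built-in XGBoost CSV input expects numeric features. Two likely culprits here are the `Zip Code` column, which appears to include some alphanumeric postal codes, and one-hot columns written as `True`/`False` by newer pandas versions. The sketch below uses a hypothetical helper, `non_numeric_columns`, and a made-up two-row sample rather than the real file.

```python
import csv
import io


def non_numeric_columns(csv_file):
    """Return the sorted 0-based indexes of columns holding a value float() rejects."""
    bad = set()
    for row in csv.reader(csv_file):
        for index, value in enumerate(row):
            try:
                float(value)
            except ValueError:
                bad.add(index)
    return sorted(bad)


# Made-up sample rows: the second one has a postal code and boolean dummies.
sample = io.StringIO(
    "4,196,242,881250949,55105,1,0\n"
    "3,186,302,891717742,V3N4P,True,False\n"
)
print(non_numeric_columns(sample))  # [4, 5, 6]
```

To check the real file, open it and pass the handle in, for example `with open('train.csv', newline='') as f: print(non_numeric_columns(f))`.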

---
## Train

Moving on to training, first we'll need to specify the location of the XGBoost algorithm container.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  Each round trains an additional model on the residuals of the previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  It shrinks the contribution of each new tree, so smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.

More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).
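
To build some intuition for how `eta` and `num_round` interact, here is a deliberately simplified sketch. It is not XGBoost's tree learner: the "weak learner" in each round is just the mean of the current residuals, and `boost_constant` is a hypothetical name. It only illustrates that each round fits the residuals and that `eta` scales how much of that fit is added.

```python
def boost_constant(targets, eta, num_round):
    """Toy boosting: each round fits the mean residual and adds eta times that fit."""
    prediction = 0.0
    history = []
    for _ in range(num_round):
        residuals = [target - prediction for target in targets]
        weak_fit = sum(residuals) / len(residuals)  # trivial stand-in for a tree
        prediction += eta * weak_fit                # eta shrinks each round's step
        history.append(prediction)
    return history


ratings = [1, 3, 4, 4, 5]  # made-up ratings with mean 3.4
for eta in (0.2, 1.0):
    print(eta, [round(p, 3) for p in boost_constant(ratings, eta, 5)])
```

With `eta=1.0` the toy model jumps to the mean in one round, while `eta=0.2` approaches it gradually, which is why smaller `eta` values are usually paired with more rounds.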

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=6,
                        eta=0.2,
                        gamma=5,
                        min_child_weight=6,
                        subsample=0.9,
                        silent=0,
                        objective='reg:linear',
                        num_round=60)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

---
## Host

Now that we've trained the algorithm, let's create a model and deploy it to a hosted endpoint.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')



### Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model simply by making an HTTP POST request.  But first, we'll need to set up serializers and deserializers for passing our `test_data` NumPy arrays to the model behind the endpoint.

In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array
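
The steps above could be sketched roughly as follows. This is a minimal illustration rather than the notebook's tested code: `predict_in_batches` and `fake_endpoint` are hypothetical names, and `predict_fn` is assumed to be any callable that takes a CSV string and returns the endpoint's comma-separated response (as `str` or `bytes`). How it wraps `xgb_predictor.predict` depends on the serializer configured above.

```python
def predict_in_batches(rows, predict_fn, batch_size=500):
    """Send rows to predict_fn in CSV mini-batches and return the predictions as floats."""
    predictions = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One CSV line per row, one payload per mini-batch.
        payload = "\n".join(",".join(str(value) for value in row) for row in batch)
        response = predict_fn(payload)
        if isinstance(response, bytes):
            response = response.decode("utf-8")
        fields = response.replace("\n", ",").split(",")
        predictions.extend(float(field) for field in fields if field.strip())
    return predictions


# Stand-in for the endpoint (hypothetical): "predicts" the first feature of each row.
def fake_endpoint(payload):
    return ",".join(line.split(",")[0] for line in payload.splitlines()).encode("utf-8")


print(predict_in_batches([[1, 2], [3, 4], [5, 6]], fake_endpoint, batch_size=2))  # [1.0, 3.0, 5.0]
```

Passing the callable in as a parameter keeps the batching logic testable without a live endpoint.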

In [None]:
print("test_data type is ", type(test_data))

ratings = test_data['Rating']

print("ratings type ", type(ratings))

test_data.drop('Rating', axis=1, inplace=True)

In [None]:
##Indices causing errors : 48, 75, 95, 134
##as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
test_data_matrix = test_data.values

#Try removing these indices : TODO

print("test_data_matrix type is ", type(test_data_matrix), " shape ", test_data_matrix.shape)

test_data_matrix_subset = test_data_matrix[:10]

predictions=[]

for i in range(0, 20):
    predicted_value = xgb_predictor.predict(test_data_matrix[i])
    predictions.append(predicted_value)
    print("predicted value ", predicted_value)
    
print("Number of predictions ", len(predictions))
print("Number of original ratings ", len(ratings))

In [None]:
##Compare with the original values 
for i in range(0, 20):
    #Prediction returned is a byte array.  Convert this to float to compare with the original
    prediction = float(predictions[i].decode())  
    print("predicted value ", prediction, " original value ", ratings.values[i])


TODO : Show some metrics
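
As a starting point for the metrics TODO, since the model is trained as a regression on ratings, root mean squared error (RMSE) and mean absolute error (MAE) are natural choices. The sketch below uses a hypothetical helper, `regression_metrics`, and made-up numbers rather than real endpoint output.

```python
import math


def regression_metrics(actual, predicted):
    """Return (rmse, mae) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = [p - a for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Made-up ratings and predictions, for illustration only.
rmse, mae = regression_metrics([4, 3, 5, 2], [3.5, 3.0, 4.0, 3.0])
print("RMSE: %.3f  MAE: %.3f" % (rmse, mae))  # RMSE: 0.750  MAE: 0.625
```

With the cells above, the inputs would be the first 20 values of `ratings` and the decoded `predictions`, for example `regression_metrics(list(ratings.values[:20]), [float(p.decode()) for p in predictions])`.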

### (Optional) Clean-up

If you're ready to be done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)