# Music Recommendations with XGBoost

Using Gradient Boosted Trees to Provide Music Recommendations


This notebook is NOT part of the workshop. It is used to pretrain the XGBoost music recommendation model. The trained model should be uploaded to the correct S3 bucket so that it can be used for deployment.


This notebook was created and tested on an ml.m4.xlarge notebook instance. TODO: Check the notebook instance type.



# Data

TODO: Explain the Million Song Dataset.

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import time

In [None]:
triplets_file = 'https://static.turi.com/datasets/millionsong/10000.txt'
songs_metadata_file = 'https://static.turi.com/datasets/millionsong/song_data.csv'

In [None]:
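# Read the (user, song, listen count) triplets. The file is tab-separated and has no header row.
data = pd.read_csv(triplets_file, sep="\t", header=None)
data.columns = ['user_id', 'song_id', 'listen_count']

# Read the song metadata.
song_df_2 = pd.read_csv(songs_metadata_file)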

In [None]:
song_df = pd.merge(data, song_df_2.drop_duplicates(['song_id']), on="song_id", how="left")

In [None]:
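# Keep the first 10,000 rows. Copy them so that the column assignments below modify an independent DataFrame.
song_df = song_df.head(10000).copy()

# Combine the title and artist_name columns into a single 'song' column.
song_df['song'] = song_df['title'].map(str) + " - " + song_df['artist_name']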

In [None]:
song_df.head()

In [None]:
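# Encode user IDs as integer codes; unlike np.arange, pd.factorize gives repeated users the same code.
song_df['user_id'] = pd.factorize(song_df['user_id'])[0]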

In [None]:
song_df.head()


In [None]:
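# Encode song IDs as integer codes; unlike np.arange, pd.factorize gives repeated songs the same code.
song_df['song_id'] = pd.factorize(song_df['song_id'])[0]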

In [None]:
song_df.head()


In [None]:
# For SageMaker XGBoost CSV input, the target variable (label) must be the first column and the file must have no header row.
# So move the 'listen_count' column to the beginning of the DataFrame.
rating = song_df.pop('listen_count')
song_df.insert(0, 'listen_count', rating)
song_df
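To make the layout described in the comments above concrete, here is a minimal sketch on a toy DataFrame. The column values are invented for illustration and are not taken from the dataset. It moves the label to the first column and writes the frame without a header or index, which is the CSV shape the training files need.

```python
import io
import pandas as pd

# Toy data; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "user_id": [0, 1, 2],
    "song_id": [10, 11, 12],
    "listen_count": [1, 5, 2],
})

# Move the label column to position 0.
label = toy_df.pop("listen_count")
toy_df.insert(0, "listen_count", label)

# Write without a header row or index column.
buffer = io.StringIO()
toy_df.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Each printed line starts with the listen count, followed by the feature values.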

In [None]:
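# One-hot encode the 'song' attribute. dtype=np.uint8 keeps the indicator columns numeric (0/1),
# because newer pandas versions default to booleans, which would be written to CSV as True/False.
song_df = pd.concat([song_df, pd.get_dummies(song_df['song'], prefix='song', dtype=np.uint8)], axis=1)
song_df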
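As a small illustration of what one-hot encoding produces, the sketch below encodes a toy 'song' column with three rows and two distinct songs. The song names are placeholders. Passing `dtype=int` is an assumption made here so that the indicator columns are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Toy column with two distinct songs; the names are placeholders.
toy = pd.DataFrame({"song": ["A - X", "B - Y", "A - X"]})

# One indicator column is created per distinct song.
encoded = pd.get_dummies(toy["song"], prefix="song", dtype=int)
print(encoded)
```

Each row has a 1 in the column for its own song and 0 elsewhere, so the number of new columns grows with the number of distinct songs.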

In [None]:
# Drop the original 'song' feature, since it is no longer needed after one-hot encoding.
song_df.drop(['song'], axis=1, inplace=True)

In [None]:
song_df

In [None]:
# Drop the remaining text columns, since they are no longer needed.
song_df.drop(['title', 'release', 'artist_name'], axis=1, inplace=True)

In [None]:
song_df

In [None]:
song_df.columns

# Setup

This notebook was created and tested on an ml.p2.xlarge notebook instance. TODO: Check the notebook instance type.


Specify the following:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).


In [None]:
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

In [None]:
bucket = 'reinvent2019-sagemaker'  # TODO: Change this to the session bucket
prefix = 'sagemaker/recommendations-xgboost-songsnew'



Now let's split the data into training, validation, and test sets. Holding out validation and test data helps us detect overfitting and lets us measure the model's accuracy on data it has not already seen.

In [None]:
# Shuffle once, then slice into 70% training, 20% validation, and 10% test data.
shuffled = song_df.sample(frac=1, random_state=1729)
train_end = int(0.7 * len(shuffled))
validation_end = int(0.9 * len(shuffled))
train_data = shuffled.iloc[:train_end]
validation_data = shuffled.iloc[train_end:validation_end]
test_data = shuffled.iloc[validation_end:].copy()

train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)

print("Number of training samples:", len(train_data))
print("Number of validation samples:", len(validation_data))
print("Number of test samples:", len(test_data))
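The 70/20/10 split can be checked on a toy frame. The sketch below uses 100 made-up rows and the same random seed as the cell above. The variable names are illustrative only. It shuffles once and slices by position, so the three parts never overlap.

```python
import numpy as np
import pandas as pd

# Toy frame with 100 rows; the contents are irrelevant to the split.
toy = pd.DataFrame({"x": np.arange(100)})

# Shuffle the rows, then cut at the 70% and 90% marks.
shuffled_toy = toy.sample(frac=1, random_state=1729)
n_rows = len(shuffled_toy)
first_cut, second_cut = int(0.7 * n_rows), int(0.9 * n_rows)

toy_train = shuffled_toy.iloc[:first_cut]
toy_validation = shuffled_toy.iloc[first_cut:second_cut]
toy_test = shuffled_toy.iloc[second_cut:]

print(len(toy_train), len(toy_validation), len(toy_test))
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs produce the same three subsets.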



In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

# Train

Moving on to training, we first need to specify the location of the XGBoost algorithm container.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

In [None]:
# Point the training job at the CSV files uploaded to S3.
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')


In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=6,
                        eta=0.2,
                        gamma=5,
                        min_child_weight=6,
                        subsample=0.9,
                        silent=0,
                        objective='reg:linear',
                        num_round=60)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

# Host

Now that we've trained the algorithm, let's create a model and deploy it to a hosted endpoint.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')


# Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model simply by making an HTTP POST request. But first, we need to set up serializers and deserializers for passing our test_data NumPy arrays to the model behind the endpoint.


In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None
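To picture what travels over the wire with this configuration, the sketch below turns one feature row into a comma-separated line and parses a byte response back into a float. The helper `row_to_csv`, the feature values, and the response bytes are all hypothetical. This is not the SDK's serializer implementation, only an approximation of the data shapes involved.

```python
import numpy as np

def row_to_csv(row):
    """Hypothetical helper: join one feature row into a single CSV line."""
    return ",".join(str(value) for value in row)

# A made-up feature row (no label column).
feature_row = np.array([3, 17, 2001, 0, 1])
payload = row_to_csv(feature_row)
print(payload)

# With no deserializer set, the response arrives as raw bytes.
response_bytes = b"2.37"  # invented response for illustration
predicted_count = float(response_bytes.decode())
print(predicted_count)
```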

In [None]:
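print("test_data type is", type(test_data))

# Keep the true listen counts aside for comparison with the predictions.
ratings = test_data['listen_count']

print("ratings type is", type(ratings))

# Remove the label so that only feature columns are sent to the endpoint.
test_data = test_data.drop(columns=['listen_count'])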

In [None]:
# Indices causing errors: 48, 75, 95.134
# TODO: Try removing these indices.

# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same NumPy array.
test_data_matrix = test_data.to_numpy()

print("test_data_matrix type is ", type(test_data_matrix), " shape ", test_data_matrix.shape)

predictions = []

# Send the first 20 test rows to the endpoint, one request per row.
for i in range(20):
    predicted_value = xgb_predictor.predict(test_data_matrix[i])
    predictions.append(predicted_value)
    print("predicted value ", predicted_value)

print("Number of predictions ", len(predictions))
print("Number of original ratings ", len(ratings))

In [None]:
# Compare the predictions with the original values.
for i in range(20):
    # Each prediction is returned as a byte string. Convert it to a float to compare it with the original value.
    prediction = float(predictions[i].decode())
    print("predicted value ", prediction, " original value ", ratings.values[i])
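Printing pairs of values gives a qualitative check; a single error metric can summarize them. The sketch below computes the root mean squared error (RMSE) on made-up numbers. RMSE is a choice made here for illustration, not something this notebook prescribes, although it is consistent with the squared-error regression objective used for training. The arrays are placeholders for the decoded predictions and the matching `ratings` values.

```python
import numpy as np

# Placeholder values standing in for decoded predictions and true listen counts.
predicted_counts = np.array([1.0, 2.0, 4.0])
actual_counts = np.array([1.0, 3.0, 2.0])

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((predicted_counts - actual_counts) ** 2))
print("RMSE:", rmse)
```

A lower RMSE means the predicted listen counts are, on average, closer to the observed ones.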