# Recomendations with XGBoost
_**Using Gradient Boosted Trees to Provide Movie Recommendations**_

---

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Compile](#Compile)
1. [Host](#Host)
  1. [Evaluate](#Evaluate)
  1. [Relative cost of errors](#Relative-cost-of-errors)
1. [Extensions](#Extensions)

---

## Background


TODO

This notebook will NOT be part of the workshop.  This one is used to pretrain the xgboost movie recommendation model.  Trained model should be uploaded to the correct S3 bucket to be used for deployment


---

## Setup

_This notebook was created and tested on an ml.m4.xlarge TODO : Check notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s).

In [1]:
bucket = 'sagemaker-us-west-2-555360056434'  ##TODO : Change this to session bucket
prefix = 'sagemaker/recommendations-xgboost-movie'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()


Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

---
## Data

Explain Movie Lens Data

In [3]:
## Is the movie lens data with all feature prepped??
feature_data_prepared = False

if (os.path.exists('ml-100k/movielens_data_allfeatures.csv')):
    feature_data_prepared = False


In [4]:
##TODO : Make this conditional.  Only need this the first time the data is being prepared

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2019-11-26 19:04:50--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2019-11-26 19:04:51 (11.8 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base    

Combine data from multiple files to create training data.

In [5]:
##Explore item data

df = pd.read_csv('ml-100k/u.item', delimiter ='|', encoding='latin-1')

df.columns = ["movie id", "movie title", "release date", "video release date",
              "IMDb URL", "unknown", "Action", "Adventure", "Animation",
              "Children's","Comedy","Crime","Documentary","Drama","Fantasy",
              "Film-Noir","Horror","Musical","Mystery", "Romance","Sci-Fi",
              "Thriller","War","Western"]

#We can drop the IMDB URL
df.drop(['IMDb URL'],axis=1, inplace=True)

## TODO : Can we drop release date  and video release date
df

#df.columns = ["User", "Item", "Rating", "TimeStamp"]

Unnamed: 0,movie id,movie title,release date,video release date,unknown,Action,Adventure,Animation,Children's,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,2,GoldenEye (1995),01-Jan-1995,,0,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,3,Four Rooms (1995),01-Jan-1995,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,4,Get Shorty (1995),01-Jan-1995,,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,5,Copycat (1995),01-Jan-1995,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,7,Twelve Monkeys (1995),01-Jan-1995,,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
6,8,Babe (1995),01-Jan-1995,,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
7,9,Dead Man Walking (1995),01-Jan-1995,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,10,Richard III (1995),22-Jan-1996,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9,11,Seven (Se7en) (1995),01-Jan-1995,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [10]:
##TODO : Make this conditional.  Only need this the first time the data is being prepared

## Start by reading ua.base into a dataframe
df_data = pd.read_csv('ml-100k/ua.base', header=None, delimiter = '\t')
df_data.columns = ["User", "Item", "Rating", "TimeStamp"]


print(df_data)
len(df_data)

       User  Item  Rating  TimeStamp
0         1     1       5  874965758
1         1     2       3  876893171
2         1     3       4  878542960
3         1     4       3  876893119
4         1     5       3  889751712
5         1     6       5  887431973
6         1     7       4  875071561
7         1     8       1  875072484
8         1     9       5  878543541
9         1    10       3  875693118
10        1    11       2  875072262
11        1    12       5  878542960
12        1    13       5  875071805
13        1    14       5  874965706
14        1    15       5  875071608
15        1    16       5  878543541
16        1    17       3  875073198
17        1    18       4  887432020
18        1    19       5  875071515
19        1    21       1  878542772
20        1    22       4  875072404
21        1    23       4  875072895
22        1    24       3  875071713
23        1    25       4  875071805
24        1    26       3  875072442
25        1    27       2  876892946
2

90570

In [7]:
##TODO : Make this conditional.  Only need this the first time the data is being prepared

## Now get the additional columns for user gender, age, occupation, zipcode
df_user = pd.read_csv('ml-100k/u.user', header=None, delimiter = '|')
df_user.columns = ["User", "Age", "Gender", "Occupation", "Zipcode"]

print("Number of users : ", len(df_user))
df_user

Number of users :  943


Unnamed: 0,User,Age,Gender,Occupation,Zipcode
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,05201
8,9,29,M,student,01002
9,10,53,M,lawyer,90703


In [8]:
##Also, get the additional columns for the item (movie)
df_item = pd.read_csv('ml-100k/u.item', delimiter ='|', encoding='latin-1')

df_item.columns = ["movie id", "movie title", "release date", "video release date",
              "IMDb URL", "unknown", "Action", "Adventure", "Animation",
              "Children's","Comedy","Crime","Documentary","Drama","Fantasy",
              "Film-Noir","Horror","Musical","Mystery", "Romance","Sci-Fi",
              "Thriller","War","Western"]

print("Number of items (movies) : ", len(df_item))

Number of items (movies) :  1681


In [13]:
##Combine the user and items dataframes to get the complete data set.
#Iterate through the dataframe
##Not the most efficient code.  Can use better utilities

for i, row in df_data.iterrows():
    #print("row is ", type(row), " : ", row)
    user_id = row['User']
    item_id = row['Item']
 
    ##Initial items genres with 0
    df_data.at[i,"unknown"] = 0
    df_data.at[i,"Action"] = 0
    df_data.at[i,"Adventure"] = 0
    df_data.at[i,"Children's"] = 0
    df_data.at[i,"Comedy"] = 0
    df_data.at[i,"Crime"] = 0
    
    df_data.at[i,"Documentary"] = 0
    df_data.at[i,"Drama"] = 0
    df_data.at[i,"Fantasy"] = 0
    df_data.at[i,"Film-Noir"] = 0
    df_data.at[i,"Horror"] = 0
    
    df_data.at[i,"Musical"] = 0
    df_data.at[i,"Mystery"] = 0
    df_data.at[i,"Romance"] = 0
    df_data.at[i,"Sci-Fi"] = 0
    df_data.at[i,"Thriller"] = 0
    df_data.at[i,"War"] = 0
    df_data.at[i,"Western"] = 0
    
    item_match = df_item.loc[df_item['movie id'] == item_id]
    #print("Matching item is  ", item_match, ' with len ', len(item_match))
    
    if len(item_match) != 0 : 
        
        df_data.at[i,"unknown"] = item_match['unknown'].values[0]
        df_data.at[i,"Action"] = item_match['Action'].values[0]
        df_data.at[i,"Adventure"] = item_match['Adventure'].values[0]
        df_data.at[i,"Children's"] = item_match['Children\'s'].values[0]
        df_data.at[i,"Comedy"] = item_match['Comedy'].values[0]
        df_data.at[i,"Crime"] = item_match['Crime'].values[0]

        df_data.at[i,"Documentary"] = item_match['Documentary'].values[0]
        df_data.at[i,"Drama"] = item_match['Drama'].values[0]
        df_data.at[i,"Fantasy"] = item_match['Fantasy'].values[0]
        df_data.at[i,"Film-Noir"] = item_match['Film-Noir'].values[0]
        df_data.at[i,"Horror"] = item_match['Horror'].values[0]

        df_data.at[i,"Musical"] = item_match['Musical'].values[0]
        df_data.at[i,"Mystery"] = item_match['Mystery'].values[0]
        df_data.at[i,"Romance"] = item_match['Romance'].values[0]
        df_data.at[i,"Sci-Fi"] = item_match['Sci-Fi'].values[0]
        df_data.at[i,"Thriller"] = item_match['Thriller'].values[0]
        df_data.at[i,"War"] = item_match['War'].values[0]
        df_data.at[i,"Western"] = item_match['Western'].values[0]
        
    print("find a match for user_id ", user_id)
    ##For this user_id get gender, occupation and zipcode
    match = df_user.loc[df_user['User'] == user_id]
    user_age = match['Age'].values[0]
    user_gender = match['Gender'].values[0]
    user_occupation = match['Occupation'].values[0]
    user_zipcode = match['Zipcode'].values[0]
    
    df_data.at[i,"Age"] = user_age
    df_data.at[i,"Gender"] = user_gender
    df_data.at[i,"Occupation"] = user_occupation    
    df_data.at[i,"Zip Code"] = user_zipcode    
    
print("After update")
print(df_data[:100])

df_data.to_csv("ml-100k/movielens_data_allfeatures.csv")


row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                    1
Rating                  5
TimeStamp       874965758
unknown                 0
Action                  0
Adventure               0
Children's              0
Comedy                  0
Crime                   0
Documentary             0
Drama                   0
Fantasy                 0
Film-Noir               0
Horror                  0
Musical                 0
Mystery                 0
Romance                 0
Sci-Fi                  0
Thriller                0
War                     0
Western                 0
Age                    24
Gender                  M
Occupation     technician
Zip Code            85711
Name: 0, dtype: object
find a match for user_id  1
row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                    2
Rating                  3
TimeStamp       876893171
unknown                 0
Action                  1
Adventure            

row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                   51
Rating                  4
TimeStamp       878543275
unknown                 0
Action                  0
Adventure               0
Children's              0
Comedy                  0
Crime                   0
Documentary             0
Drama                   1
Fantasy                 0
Film-Noir               0
Horror                  0
Musical                 0
Mystery                 0
Romance                 1
Sci-Fi                  0
Thriller                0
War                     1
Western                 1
Age                    24
Gender                  M
Occupation     technician
Zip Code            85711
Name: 48, dtype: object
find a match for user_id  1
row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                   52
Rating                  4
TimeStamp       875072205
unknown                 0
Action                  0
Adventure           

find a match for user_id  1
row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                   99
Rating                  3
TimeStamp       875072547
unknown                 0
Action                  0
Adventure               0
Children's              1
Comedy                  0
Crime                   0
Documentary             0
Drama                   0
Fantasy                 0
Film-Noir               0
Horror                  0
Musical                 1
Mystery                 0
Romance                 0
Sci-Fi                  0
Thriller                0
War                     0
Western                 0
Age                    24
Gender                  M
Occupation     technician
Zip Code            85711
Name: 95, dtype: object
find a match for user_id  1
row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                  100
Rating                  5
TimeStamp       878543541
unknown                 0
Action            

find a match for user_id  1
row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                  146
Rating                  4
TimeStamp       875071561
unknown                 0
Action                  0
Adventure               0
Children's              0
Comedy                  0
Crime                   0
Documentary             0
Drama                   1
Fantasy                 0
Film-Noir               0
Horror                  0
Musical                 0
Mystery                 0
Romance                 0
Sci-Fi                  0
Thriller                0
War                     0
Western                 0
Age                    24
Gender                  M
Occupation     technician
Zip Code            85711
Name: 141, dtype: object
find a match for user_id  1
row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                  147
Rating                  3
TimeStamp       875240993
unknown                 0
Action           

row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                  194
Rating                  4
TimeStamp       876892743
unknown                 0
Action                  0
Adventure               0
Children's              0
Comedy                  1
Crime                   1
Documentary             0
Drama                   0
Fantasy                 0
Film-Noir               0
Horror                  0
Musical                 0
Mystery                 0
Romance                 0
Sci-Fi                  0
Thriller                0
War                     0
Western                 0
Age                    24
Gender                  M
Occupation     technician
Zip Code            85711
Name: 185, dtype: object
find a match for user_id  1
row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                  195
Rating                  5
TimeStamp       876892855
unknown                 0
Action                  1
Adventure          

row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                  238
Rating                  4
TimeStamp       875072235
unknown                 0
Action                  0
Adventure               0
Children's              0
Comedy                  1
Crime                   0
Documentary             0
Drama                   0
Fantasy                 0
Film-Noir               0
Horror                  0
Musical                 0
Mystery                 0
Romance                 0
Sci-Fi                  0
Thriller                0
War                     0
Western                 0
Age                    24
Gender                  M
Occupation     technician
Zip Code            85711
Name: 228, dtype: object
find a match for user_id  1
row is  <class 'pandas.core.series.Series'>  :  User                    1
Item                  239
Rating                  4
TimeStamp       878542845
unknown                 0
Action                  0
Adventure          

row is  <class 'pandas.core.series.Series'>  :  User                   2
Item                 242
Rating                 5
TimeStamp      888552084
unknown                0
Action                 0
Adventure              0
Children's             0
Comedy                 1
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 0
Musical                0
Mystery                0
Romance                0
Sci-Fi                 0
Thriller               0
War                    0
Western                0
Age                   53
Gender                 F
Occupation         other
Zip Code           94043
Name: 271, dtype: object
find a match for user_id  2
row is  <class 'pandas.core.series.Series'>  :  User                   2
Item                 255
Rating                 4
TimeStamp      888551341
unknown                0
Action                 0
Adventure              0
Children's             0
C

find a match for user_id  2
row is  <class 'pandas.core.series.Series'>  :  User                   3
Item                 181
Rating                 4
TimeStamp      889237482
unknown                0
Action                 1
Adventure              1
Children's             0
Comedy                 0
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 0
Musical                0
Mystery                0
Romance                1
Sci-Fi                 1
Thriller               0
War                    1
Western                0
Age                   23
Gender                 M
Occupation        writer
Zip Code           32067
Name: 314, dtype: object
find a match for user_id  3
row is  <class 'pandas.core.series.Series'>  :  User                   3
Item                 258
Rating                 2
TimeStamp      889237026
unknown                0
Action                 0
Adventure              

row is  <class 'pandas.core.series.Series'>  :  User                   3
Item                 355
Rating                 3
TimeStamp      889237247
unknown                0
Action                 0
Adventure              1
Children's             0
Comedy                 0
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 0
Musical                0
Mystery                0
Romance                0
Sci-Fi                 1
Thriller               1
War                    0
Western                0
Age                   23
Gender                 M
Occupation        writer
Zip Code           32067
Name: 357, dtype: object
find a match for user_id  3
row is  <class 'pandas.core.series.Series'>  :  User                    4
Item                   11
Rating                  4
TimeStamp       892004520
unknown                 0
Action                  0
Adventure               0
Children's         

find a match for user_id  5
row is  <class 'pandas.core.series.Series'>  :  User                   5
Item                 144
Rating                 3
TimeStamp      875636141
unknown                0
Action                 1
Adventure              0
Children's             0
Comedy                 0
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 0
Musical                0
Mystery                0
Romance                0
Sci-Fi                 0
Thriller               1
War                    0
Western                0
Age                   33
Gender                 F
Occupation         other
Zip Code           15213
Name: 400, dtype: object
find a match for user_id  5
row is  <class 'pandas.core.series.Series'>  :  User                   5
Item                 145
Rating                 1
TimeStamp      875720830
unknown                0
Action                 1
Adventure              

row is  <class 'pandas.core.series.Series'>  :  User                   5
Item                 243
Rating                 1
TimeStamp      878844164
unknown                0
Action                 0
Adventure              0
Children's             1
Comedy                 1
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 0
Musical                0
Mystery                0
Romance                0
Sci-Fi                 0
Thriller               0
War                    0
Western                0
Age                   33
Gender                 F
Occupation         other
Zip Code           15213
Name: 441, dtype: object
find a match for user_id  5
row is  <class 'pandas.core.series.Series'>  :  User                   5
Item                 250
Rating                 3
TimeStamp      875635265
unknown                0
Action                 1
Adventure              0
Children's             0
C

row is  <class 'pandas.core.series.Series'>  :  User                   5
Item                 406
Rating                 1
TimeStamp      875635807
unknown                0
Action                 0
Adventure              0
Children's             0
Comedy                 0
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 1
Musical                0
Mystery                0
Romance                0
Sci-Fi                 0
Thriller               1
War                    0
Western                0
Age                   33
Gender                 F
Occupation         other
Zip Code           15213
Name: 488, dtype: object
find a match for user_id  5
row is  <class 'pandas.core.series.Series'>  :  User                   5
Item                 407
Rating                 3
TimeStamp      875635431
unknown                0
Action                 0
Adventure              0
Children's             0
C

find a match for user_id  5
row is  <class 'pandas.core.series.Series'>  :  User                   5
Item                 456
Rating                 1
TimeStamp      875636375
unknown                0
Action                 1
Adventure              0
Children's             0
Comedy                 1
Crime                  0
Documentary            0
Drama                  0
Fantasy                0
Film-Noir              0
Horror                 0
Musical                0
Mystery                0
Romance                0
Sci-Fi                 0
Thriller               0
War                    0
Western                0
Age                   33
Gender                 F
Occupation         other
Zip Code           15213
Name: 535, dtype: object
find a match for user_id  5
row is  <class 'pandas.core.series.Series'>  :  User                   5
Item                 457
Rating                 1
TimeStamp      879198898
unknown                0
Action                 0
Adventure              

row is  <class 'pandas.core.series.Series'>  :  User                   6
Item                 166
Rating                 4
TimeStamp      883601426
unknown                0
Action                 0
Adventure              0
Children's             0
Comedy                 0
Crime                  0
Documentary            0
Drama                  1
Fantasy                0
Film-Noir              0
Horror                 0
Musical                0
Mystery                0
Romance                0
Sci-Fi                 0
Thriller               0
War                    0
Western                0
Age                   42
Gender                 M
Occupation     executive
Zip Code           98101
Name: 580, dtype: object
find a match for user_id  6
row is  <class 'pandas.core.series.Series'>  :  User                   6
Item                 168
Rating                 4
TimeStamp      883602865
unknown                0
Action                 0
Adventure              0
Children's             0
C

KeyboardInterrupt: 

In [19]:
print(df_data[5000:5003])

      User  Item  Rating  TimeStamp  unknown  Action  Adventure  Children's  \
5000    56    22       5  892676376      0.0     1.0        0.0         0.0   
5001    56    25       4  892911166      0.0     0.0        0.0         0.0   
5002    56    28       5  892678669      0.0     1.0        0.0         0.0   

      Comedy  Crime  ...  Mystery  Romance  Sci-Fi  Thriller  War  Western  \
5000     0.0    0.0  ...      0.0      0.0     0.0       0.0  1.0      0.0   
5001     1.0    0.0  ...      0.0      0.0     0.0       0.0  0.0      0.0   
5002     0.0    0.0  ...      0.0      0.0     0.0       1.0  0.0      0.0   

       Age  Gender  Occupation  Zip Code  
5000  25.0       M   librarian     46260  
5001  25.0       M   librarian     46260  
5002  25.0       M   librarian     46260  

[3 rows x 26 columns]


## Preprocess feature of the movie lens data 

In [20]:
movie_lens_data_df = pd.read_csv('ml-100k/movielens_data_allfeatures.csv')
pd.set_option('display.max_columns', 500)
movie_lens_data_df

Unnamed: 0.1,Unnamed: 0,User,Item,Rating,TimeStamp,unknown,Action,Adventure,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,Age,Gender,Occupation,Zip Code
0,0,1,1,5,874965758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,M,technician,85711
1,1,1,2,3,876893171,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,24.0,M,technician,85711
2,2,1,3,4,878542960,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,24.0,M,technician,85711
3,3,1,4,3,876893119,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,M,technician,85711
4,4,1,5,3,889751712,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,24.0,M,technician,85711
5,5,1,6,5,887431973,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,M,technician,85711
6,6,1,7,4,875071561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,24.0,M,technician,85711
7,7,1,8,1,875072484,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,M,technician,85711
8,8,1,9,5,878543541,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,M,technician,85711
9,9,1,10,3,875693118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,24.0,M,technician,85711


In [21]:
##Remove the unnamed:0 column
movie_lens_data_df.drop(['Unnamed: 0'],axis=1, inplace=True)


## One hot encode categorial values

In [22]:
# One hot encode "Gender" 
movie_lens_data_df = pd.concat([movie_lens_data_df,pd.get_dummies(movie_lens_data_df['Gender'], prefix='Gender')],axis=1)
#movie_lens_data_df

In [23]:
##Drop the original feature, since it is not needed anymore.
movie_lens_data_df.drop(['Gender'],axis=1, inplace=True)

In [24]:
# One hot encode the 'Occupation' attribute 
movie_lens_data_df = pd.concat([movie_lens_data_df,pd.get_dummies(movie_lens_data_df['Occupation'], prefix='Occupation')],axis=1)
#movie_lens_data_df

In [25]:
##Drop the original feature, since it is not needed anymore.
movie_lens_data_df.drop(['Occupation'],axis=1, inplace=True)

In [26]:
##For SageMaker XGBoost, the predictor variable should be the first column and there should be no headers in the file.
##So move the 'Rating' colum to the begining of the dataframe.
rating = movie_lens_data_df['Rating']
movie_lens_data_df.drop(labels=['Rating'], axis=1,inplace = True)
movie_lens_data_df.insert(0, 'Rating', rating)
#movie_lens_data_df

In [27]:
##Check the columns after all the processing.
movie_lens_data_df.columns

Index(['Rating', 'User', 'Item', 'TimeStamp', 'unknown', 'Action', 'Adventure',
       'Children's', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
       'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
       'Thriller', 'War', 'Western', 'Age', 'Zip Code', 'Gender_F', 'Gender_M',
       'Occupation_administrator', 'Occupation_artist', 'Occupation_doctor',
       'Occupation_educator', 'Occupation_engineer',
       'Occupation_entertainment', 'Occupation_executive',
       'Occupation_healthcare', 'Occupation_homemaker', 'Occupation_lawyer',
       'Occupation_librarian', 'Occupation_marketing', 'Occupation_none',
       'Occupation_other', 'Occupation_programmer', 'Occupation_retired',
       'Occupation_salesman', 'Occupation_scientist', 'Occupation_student',
       'Occupation_technician', 'Occupation_writer'],
      dtype='object')

## Explore the data : TODO

And now let's split the data into training, validation, and test sets.  This will help prevent us from overfitting the model, and allow us to test the models accuracy on data it hasn't already seen.

In [28]:
##Pick up here again.

train_data, validation_data, test_data = np.split(movie_lens_data_df.sample(frac=1, random_state=1729), [int(0.7 * len(movie_lens_data_df)), int(0.9 * len(movie_lens_data_df))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
test_data.to_csv('test.csv', header=False, index=False)


print("Number of training samples : " , len(train_data))
print("Number of validation samples : " , len(validation_data))
print("Number of test samples : " , len(test_data))




Number of training samples :  63398
Number of validation samples :  18115
Number of test samples :  9057


Now we'll upload these files to S3.

In [32]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')

print("Movie recommendation training data uploaded to ", bucket, os.path.join(prefix, 'train/train.csv'))
print("Movie recommendation validation data uploaded to ", bucket, os.path.join(prefix, 'validation/validation.csv'))
print("Movie recommendation test data uploaded to ", bucket, os.path.join(prefix, 'test/test.csv'))

Movie recommendation training data uploaded to  sagemaker-us-west-2-555360056434 sagemaker/recommendations-xgboost-movie/train/train.csv
Movie recommendation validation data uploaded to  sagemaker-us-west-2-555360056434 sagemaker/recommendations-xgboost-movie/validation/validation.csv
Movie recommendation test data uploaded to  sagemaker-us-west-2-555360056434 sagemaker/recommendations-xgboost-movie/test/test.csv


---
## Train

Moving onto training, first we'll need to specify the locations of the XGBoost algorithm containers.

In [33]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

	get_image_uri(region, 'xgboost', '0.90-1').


Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [34]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  This is essentially the subsequent models that are trained using the residuals of previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  Larger values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.

More detail on XGBoost's hyperparmeters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).

In [35]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=6,
                        eta=0.2,
                        gamma=5,
                        min_child_weight=6,
                        subsample=0.9,
                        silent=0,
                        objective='reg:linear',
                        num_round=60)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

2019-11-26 19:26:59 Starting - Starting the training job...
2019-11-26 19:27:00 Starting - Launching requested ML instances......
2019-11-26 19:28:02 Starting - Preparing the instances for training......
2019-11-26 19:29:21 Downloading - Downloading input data
2019-11-26 19:29:21 Training - Downloading the training image..[31mArguments: train[0m
[31m[2019-11-26:19:29:41:INFO] Running standalone xgboost training.[0m
[31m[2019-11-26:19:29:41:INFO] File size need to be processed in the node: 11.57mb. Available memory size in the node: 8522.71mb[0m
[31m[2019-11-26:19:29:41:INFO] Determined delimiter of CSV input is ','[0m
[31m[19:29:41] S3DistributionType set as FullyReplicated[0m
[31m[19:29:41] 63398x46 matrix with 2916308 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,[0m
[31m[2019-11-26:19:29:41:INFO] Determined delimiter of CSV input is ','[0m
[31m[19:29:41] S3DistributionType set as FullyReplicated[0m
[31m[19:29:41] 18115x46 matrix w


2019-11-26 19:29:53 Uploading - Uploading generated training model
2019-11-26 19:29:53 Completed - Training job completed
[31m[19:29:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 64 extra nodes, 46 pruned nodes, max_depth=6[0m
[31m[57]#011train-rmse:0.945981#011validation-rmse:0.972677[0m
[31m[19:29:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 48 pruned nodes, max_depth=5[0m
[31m[58]#011train-rmse:0.945841#011validation-rmse:0.972631[0m
[31m[19:29:46] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 36 extra nodes, 16 pruned nodes, max_depth=6[0m
[31m[59]#011train-rmse:0.945565#011validation-rmse:0.972541[0m
Training seconds: 50
Billable seconds: 50


---
## Host

Now that we've trained the algorithm, let's create a model and deploy it to a hosted endpoint.

In [36]:
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')



--------------------------------------------------------------------------------------------------!

### Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model very easily, simply by making an http POST request.  But first, we'll need to setup serializers and deserializers for passing our `test_data` NumPy arrays to the model behind the endpoint.

In [37]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batchs to CSV string payloads
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [38]:
print("test_data type is ", type(test_data))

ratings = test_data['Rating']

print("ratings type ", type(ratings))

test_data.drop('Rating', axis=1, inplace=True)

test_data

test_data type is  <class 'pandas.core.frame.DataFrame'>
ratings type  <class 'pandas.core.series.Series'>


Unnamed: 0,User,Item,TimeStamp,unknown,Action,Adventure,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,Age,Zip Code,Gender_F,Gender_M,Occupation_administrator,Occupation_artist,Occupation_doctor,Occupation_educator,Occupation_engineer,Occupation_entertainment,Occupation_executive,Occupation_healthcare,Occupation_homemaker,Occupation_lawyer,Occupation_librarian,Occupation_marketing,Occupation_none,Occupation_other,Occupation_programmer,Occupation_retired,Occupation_salesman,Occupation_scientist,Occupation_student,Occupation_technician,Occupation_writer
11626,122,956,879270850,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,22206,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
6220,63,10,875748004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,31.0,75240,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
33916,342,192,875320082,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,98006,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
39502,393,1409,889729536,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,83686,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
76963,796,127,892660147,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,33755,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
36621,373,474,877098919,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,24.0,55116,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
65240,654,204,887864610,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,27.0,78739,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
77540,801,332,890332719,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,22.0,92154,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
27566,295,1,879517580,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,31.0,50325,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
53484,524,197,884637347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,56.0,02159,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [39]:
##Indice causing errors : 48,75,95.134
test_data_matrix = test_data.as_matrix()

#Try removing these indices : TODO

print("test_data_matrix type is ", type(test_data_matrix), " shape ", test_data_matrix.shape)

test_data_matrix_subset = test_data_matrix[:10]

predictions=[]

for i in range(0, 100):
    print(test_data_matrix[i])
    predicted_value = xgb_predictor.predict(test_data_matrix[i])
    predictions.append(predicted_value)
    print("predicted value ", predicted_value)
    
print("Number of predictions ", len(predictions))
print("Number of original ratings ", len(ratings))

  from ipykernel import kernelapp as app


test_data_matrix type is  <class 'numpy.ndarray'>  shape  (9057, 46)
[122 956 879270850 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 32.0 '22206' 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1]
predicted value  b'3.50076365471'
[63 10 875748004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 1.0 0.0 31.0 '75240' 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0]
predicted value  b'3.77948379517'
[342 192 875320082 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 25.0 '98006' 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0]
predicted value  b'3.99539732933'
[393 1409 889729536 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 19.0 '83686' 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0]
predicted value  b'2.6379570961'
[796 127 892660147 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 32.0 '33755' 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1]
predicted value  b'4.19961929321

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (415) from model with message "Loading csv data failed with Exception, please ensure data is in csv format:
 <type 'exceptions.ValueError'>
 could not convert string to float: Y1A6B". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/xgboost-2019-11-26-19-26-59-179 in account 555360056434 for more information.

In [None]:
##Compare with the original values 
for i in range(0, 20):
    #Prediction returned is a byte array.  Convert this to float to compare with the original
    prediction = float(predictions[i].decode())  
    print("predicted value ", prediction, " original value ", ratings.values[i])


TODO : Show some metrics

### (Optional) Clean-up

If you're ready to be done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)