# Capstone Project 2: Recommender System

## Implementing the Recommender System

There are multiple ways to implement a recommender system. The approach explored in this project is matrix factorization. This method factorizes a User x Item matrix of ratings into two smaller matrices: a Users matrix and an Items matrix. These smaller matrices contain latent features. The values of these latent features are estimated from the known values in the original User x Item matrix: the dot product of a user's latent features and an item's latent features gives the estimated rating for that user/item pair, and gradient descent adjusts the latent features until the estimates best match the known ratings. Once the latent features have been learned, the same dot product can be used on the user/item pairs that did not originally have ratings. Of all the items that a particular user has not rated, the 5 with the highest estimated ratings can be recommended.
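
The idea can be illustrated with a minimal NumPy sketch. It is not the implementation used later in this project (Surprise handles that), and the toy ratings matrix, the function name factorize, and the values chosen for the number of factors, learning rate, and regularization are all assumptions made for illustration.

```python
import numpy as np

def factorize(ratings, n_factors=2, lr=0.01, reg=0.02, n_epochs=2000, seed=0):
    """Factorize a ratings matrix (np.nan = unrated) with stochastic gradient descent.

    Hypothetical helper for illustration; all default values are arbitrary.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user latent features
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item latent features
    known = np.argwhere(~np.isnan(ratings))               # (user, item) pairs with ratings
    for _ in range(n_epochs):
        for u, i in known:
            err = ratings[u, i] - P[u] @ Q[i]             # error of the dot-product estimate
            p_u = P[u].copy()                             # keep old value for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Toy 4 users x 4 items matrix (made-up ratings, np.nan marks unrated pairs)
R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [1, np.nan, 5, 4]], dtype=float)

P, Q = factorize(R)
R_hat = P @ Q.T  # estimated rating for every user/item pair, including unrated ones
print(np.round(R_hat, 2))
```

The entries of R_hat at the positions that were np.nan in R are the estimates that a recommender would rank.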

The necessary imports will be made and the cleaned dataset will be read into a pandas DataFrame.

In [1]:
#importing relevant packages and modules
import pandas as pd
import numpy as np
import os

In [3]:
#accessing the local directory for the data
PATH = os.path.join(os.environ['HOMEPATH'], 'data', 'amazon_cleaned.csv')
#reading in the data and saving it into a DataFrame
df = pd.read_csv(PATH, index_col=0)


In [4]:
df.head()

Unnamed: 0,itemID,rating,reviewText,reviewTime,reviewerID,summary,foundHelpful,totalHelpful
0,528881469,5,We got this GPS for my husband who is an (OTR)...,2013-06-02,AO94DHGC771SJ,Gotta have GPS!,0,0
1,528881469,1,"I'm a professional OTR truck driver, and I bou...",2010-11-25,AMO214LNFCEI4,Very Disappointed,12,15
2,528881469,3,"Well, what can I say. I've had this unit in m...",2010-09-09,A3N7T0DY83Y4IG,1st impression,43,45
3,528881469,2,"Not going to write a long review, even thought...",2010-11-24,A1H8PY3QHMQQA0,"Great grafics, POOR GPS",9,10
4,528881469,1,I've had mine for a year and here's what we go...,2011-09-29,A24EV6RXELQZ63,"Major issues, only excuses for support",0,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1689188 entries, 0 to 1689187
Data columns (total 8 columns):
itemID          1689188 non-null object
rating          1689188 non-null int64
reviewText      1688117 non-null object
reviewTime      1689188 non-null object
reviewerID      1689188 non-null object
summary         1689173 non-null object
foundHelpful    1689188 non-null int64
totalHelpful    1689188 non-null int64
dtypes: int64(3), object(5)
memory usage: 116.0+ MB


The library that will be used to create the recommender system is Surprise (distributed as scikit-surprise). This library takes data in the form of 3 columns: users, items, and ratings. To work with the dataset efficiently, the reviewerID and itemID columns will be encoded as integers. This can be done by creating an ordered categorical data type from the sorted unique values of the reviewerIDs and itemIDs. The position of each value in the ordered list of categories will be used as the ID for that reviewer or item. The pandas API provides CategoricalDtype for creating a categorical data type.

In [6]:
from pandas.api.types import CategoricalDtype

#creating category datatypes
reviewer = CategoricalDtype(sorted(df.reviewerID.unique()), ordered=True)
item = CategoricalDtype(sorted(df.itemID.unique()), ordered=True)

#setting the reviewerID and itemID to their respective datatype
df_cat = df.astype({'reviewerID':reviewer,'itemID':item})
#subsetting data by relevant columns, using the category indices
df_rec = pd.DataFrame(({'reviewerID':df_cat.reviewerID.cat.codes,
                          'itemID':df_cat.itemID.cat.codes,
                          'rating':df_cat.rating})).reset_index(drop=True)
df_rec.head()

Unnamed: 0,reviewerID,itemID,rating
0,176008,0,5
1,173739,0,1
2,134504,0,3
3,24476,0,2
4,57419,0,1
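
Because the integer codes are positions in the sorted category list, they can be mapped back to the original IDs, and an original ID can be looked up to find its code. A small sketch on made-up IDs (the values below are not from the dataset) shows both directions:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy stand-in for the reviewerID column (made-up values)
raw_ids = pd.Series(['B7', 'A3', 'C1', 'A3'])
dtype = CategoricalDtype(sorted(raw_ids.unique()), ordered=True)

# original ID -> integer code (position in the sorted categories)
codes = raw_ids.astype(dtype).cat.codes

# integer code -> original ID
decoded = dtype.categories.take(codes.to_numpy())

# looking up the code of one specific original ID
code_for_c1 = dtype.categories.get_loc('C1')

print(list(codes), list(decoded), code_for_c1)
```

With the notebook's objects, the same lookup on reviewer.categories would give the integer ID of a known reviewerID.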


The surprise module will be imported as well as GridSearchCV. The surprise reader will be used to parse the data. The dataset will be loaded from a dataframe and parsed with the reader. GridSearchCV will be used to find the optimal hyperparameters.

In [7]:
import surprise
from surprise.model_selection import GridSearchCV, cross_validate

#creating a reader to parse the ratings (default rating scale is 1 to 5)
reader = surprise.Reader()

#loading the (user, item, rating) columns into a surprise Dataset
data = surprise.Dataset.load_from_df(df_rec, reader)

GridSearchCV will be used to tune the hyperparameters of the model. The model that will be used is SVD, a matrix factorization algorithm popularized by Simon Funk during the Netflix Prize recommender system challenge. GridSearchCV takes a dictionary of hyperparameter values, param_grid. Here, lr_all is the learning rate and reg_all is the regularization term applied to all parameters. The model is fit to the dataset and cross validated with every combination of the hyperparameters provided, and the combination that achieves the lowest RMSE is reported as the best.

In [8]:
#hyperparameters to tune
param_grid = {'lr_all':[0.001, 0.01], 'reg_all':[0.1, 0.5]}
#running gridsearch on the SVD model
gs = GridSearchCV(surprise.SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)
#storing the parameters with the best RMSE into variables
best_lr = gs.best_params['rmse']['lr_all']
best_reg = gs.best_params['rmse']['reg_all']

#displaying the best parameters for each measure
gs.best_params

{'rmse': {'lr_all': 0.01, 'reg_all': 0.5},
 'mae': {'lr_all': 0.01, 'reg_all': 0.1}}

Using the best hyperparameters from GridSearchCV, the surprise SVD model will be evaluated with 5-fold cross validation on the entire dataset. The results of the cross validation will show whether the model performs consistently when fit to different subsets of the data.

In [11]:
#checking performance of model with cross validate
svd = surprise.SVD(lr_all=best_lr, reg_all=best_reg, random_state=29)
cv = cross_validate(svd, data, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0898  1.0883  1.0886  1.0917  1.0932  1.0903  0.0019  
MAE (testset)     0.8190  0.8183  0.8180  0.8204  0.8206  0.8193  0.0011  
Fit time          117.27  124.32  120.35  123.40  111.09  119.29  4.79    
Test time         5.05    4.44    3.82    4.70    4.10    4.42    0.43    


The results show that there is not much difference between the folds, as seen from the very low standard deviation. The root mean squared error (RMSE) is the square root of the average squared error, so it reflects the typical size of a prediction error while weighting large errors more heavily. On average, the RMSE value was 1.0903. Given that the ratings range from 1 to 5, the model's predictions are typically off by about one star, which is about 22% of the maximum rating.
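
The arithmetic behind these numbers can be checked with a short sketch. The per-fold RMSE values are copied from the cross-validation table above; the helper names rmse and mae and the three toy ratings are made up for illustration, and the 1 to 5 rating scale is an assumption based on the default scale of the surprise Reader.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Toy example (made-up ratings): one large miss raises RMSE more than MAE
toy_rmse = rmse([5, 3, 1], [4, 3, 3])
toy_mae = mae([5, 3, 1], [4, 3, 3])

# Per-fold test RMSE values from the cross-validation table
fold_rmse = np.array([1.0898, 1.0883, 1.0886, 1.0917, 1.0932])
mean_rmse = fold_rmse.mean()
std_rmse = fold_rmse.std()  # population standard deviation (ddof=0)

# Error relative to the maximum rating and to the width of an assumed 1-5 scale
share_of_max = mean_rmse / 5
share_of_range = mean_rmse / (5 - 1)

print(round(mean_rmse, 4), round(std_rmse, 4))
print(round(share_of_max, 2), round(share_of_range, 2))
```

Measured against the width of the scale rather than its maximum, the same RMSE is a somewhat larger share, so the percentage depends on which reference is chosen.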

Now that the model has been fit to the data, the ratings can be predicted for each user/item pair. A list of all the items a given user has not rated yet can be created. From this list, the top 5 highest predicted ratings will be used to recommend items for the user. A function will be made in order to perform the recommendation. 
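
One caveat is worth noting: cross_validate refits the algorithm on each training fold, so the svd object it leaves behind has most likely been trained on only one training fold rather than on every rating. A minimal sketch of refitting on the full dataset before making recommendations is shown below. The helper name fit_on_all_ratings is hypothetical, while build_full_trainset and fit are the Surprise methods it relies on.

```python
def fit_on_all_ratings(algo, data):
    """Refit `algo` on every rating in `data` and return it.

    Assumes `data` is a surprise Dataset (e.g. from Dataset.load_from_df)
    and `algo` is a surprise algorithm such as SVD. Hypothetical helper.
    """
    trainset = data.build_full_trainset()  # trainset with no ratings held out
    algo.fit(trainset)
    return algo

# Usage with the notebook's objects (not run here):
# svd = fit_on_all_ratings(svd, data)
```

Fitting on all ratings takes roughly as long as one cross-validation fold, and it means the predictions are based on everything each reviewer has rated.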

In [32]:
def recommendTopN(rid, model, n_rec=5):
    #creating list of unrated items for this reviewer
    iids = df_rec['itemID'].unique()
    iids_reviewer = df_rec[df_rec.reviewerID == rid].itemID
    #set difference, items reviewer hasn't rated yet
    iids_to_pred = np.setdiff1d(iids, iids_reviewer)
    #creating predictions for this reviewer, arbitrary rating of 0 for default
    reviewerset = [[rid, iid, 0] for iid in iids_to_pred]
    predictions = model.test(reviewerset)
    #creating array of prediction results
    pred_ratings = np.array([pred.est for pred in predictions])
    #sorting by values, -1 for descending, n_rec for top n recommendations, index from pred_ratings
    i_topN = pred_ratings.argsort()[::-1][:n_rec]
    #using index from pred_ratings to get itemID
    topN_iid = iids_to_pred[i_topN]
    #using index from pred_ratings to get the predicted ratings
    topN_rating = pred_ratings[i_topN]
    print("The top {} recommendations for reviewer {} are:".format(n_rec, reviewer.categories[rid]))
    for itemID, rating in zip(topN_iid, topN_rating):
        print("\titemID: {} \titem: {} \tpredicted rating: {:.3f}".format(itemID, item.categories[itemID], rating))
    return topN_iid, topN_rating

In [33]:
x = recommendTopN(50211, svd, n_rec=5)

The top 5 recommendations for user A1ZD690RCXOSB are:
	itemID: 50352 	item: B0087RF5RG 	rating: 4.943
	itemID: 32961 	item: B0041ORN9M 	rating: 4.904
	itemID: 9653 	item: B000HVHDJ8 	rating: 4.898
	itemID: 1376 	item: B000068IGO 	rating: 4.893
	itemID: 57954 	item: B00C10T2EC 	rating: 4.891


This function provides the top 5 recommended items for the given user. The top 5 items are chosen by the ratings that the model predicts the user would give each unrated item. The function accepts different recommender models as well as a different number of recommendations.
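
The selection step inside the function sorts every predicted rating in order to keep only a handful. As a possible alternative, the sketch below uses np.argpartition to isolate the n largest values first and then sorts only those. The function name top_n and the toy values are assumptions for illustration, not part of the original function.

```python
import numpy as np

def top_n(item_ids, est_ratings, n=5):
    """Return the n item IDs with the highest estimated ratings, best first."""
    item_ids = np.asarray(item_ids)
    est_ratings = np.asarray(est_ratings, dtype=float)
    n = min(n, est_ratings.size)
    if n <= 0:
        return item_ids[:0], est_ratings[:0]
    # argpartition moves the indices of the n largest values into the last n slots (unordered)
    candidates = np.argpartition(est_ratings, -n)[-n:]
    # sort only those n candidates, highest estimate first
    order = candidates[np.argsort(est_ratings[candidates])[::-1]]
    return item_ids[order], est_ratings[order]

# Toy example with made-up item IDs and predicted ratings
ids, ratings = top_n([10, 11, 12, 13, 14, 15],
                     [3.2, 4.9, 1.0, 4.5, 4.7, 2.2], n=3)
print(ids, ratings)
```

For a few tens of thousands of items the difference is small, so the full argsort used in recommendTopN is also a reasonable choice.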

### Summary:

Matrix factorization is used to break down a larger matrix of user/item ratings into separate users and items matrices so that the dot product can be performed to make predictions on unrated items for a user. Using Surprise, the SVD model was fit to the dataset with the hyperparameters that performed the best in the grid search. The model was cross validated to check that it performs consistently on different subsets of the data. It was found that the average RMSE of the model was 1.0903. The model allows for predictions of user/item ratings. A function was then made so that the top N recommendations can be provided for any reviewer. This function returns the top items that have not yet been rated by the reviewer, along with their predicted ratings.