# Loading the Dataset

In [52]:
import os
os.chdir("E:\Insofe\PH.D\PHD_ML")

In [53]:
# Import libraries
import numpy as np
import pandas as pd

# Reading ratings file
ratings = pd.read_csv("user_hotel_rating-1555730075105.csv")

In [54]:
ratings.head()

Unnamed: 0,userid,Hotelid,OverallRating
0,user_78131,hotel_558,3
1,user_78131,hotel_603,2
2,user_78131,hotel_610,3
3,user_78131,hotel_574,3
4,user_78131,hotel_570,3


In [55]:
# Removing user_ Sign from userid column
ratings['Hotelid'] = ratings['Hotelid'].str.replace('hotel_', '').astype('int64')

In [56]:
# Removing user_ Sign from userid column
ratings['userid'] = ratings['userid'].str.replace('user_', '').astype('int64')

Also let's count the number of unique users and hotels.

In [57]:
n_users = ratings.userid.unique().shape[0]
n_hotels = ratings.Hotelid.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of Hotels = ' + str(n_hotels))

Number of users = 5010 | Number of Hotels = 130


Now I want the format of my ratings matrix to be one row per user and one column per movie. To do so, I'll pivot ratings to get that and call the new variable Ratings (with a capital *R).

In [58]:
Ratings = ratings.pivot(index = 'userid', columns ='Hotelid', values = 'OverallRating').fillna(0)
Ratings.head()

Hotelid,501,502,503,504,506,507,508,509,510,511,...,630,631,632,633,634,635,636,637,639,640
userid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
21002,0.0,3.0,0.0,3.0,3.0,0.0,4.0,0.0,3.0,0.0,...,4.0,0.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,0.0
21025,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,3.0,2.0,...,0.0,2.0,0.0,0.0,4.0,2.0,0.0,0.0,3.0,3.0
21042,0.0,0.0,0.0,0.0,4.0,0.0,3.0,0.0,3.0,0.0,...,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0
21051,0.0,2.0,5.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,5.0,5.0,0.0,0.0,5.0,5.0,5.0,5.0,0.0,5.0
21054,0.0,1.0,0.0,1.0,0.0,3.0,4.0,2.0,0.0,4.0,...,4.0,0.0,0.0,0.0,3.0,2.0,1.0,2.0,4.0,4.0


In [59]:
Ratings.to_csv("Ratings.csv")

In [60]:
Ratings1 = pd.read_csv("Ratings1.csv")

In [61]:
Ratings1.columns

Index(['Row_no', 'userid', '501', '502', '503', '504', '506', '507', '508',
       '509',
       ...
       '630', '631', '632', '633', '634', '635', '636', '637', '639', '640'],
      dtype='object', length=132)

In [62]:
Ratings1.head(5)

Unnamed: 0,Row_no,userid,501,502,503,504,506,507,508,509,...,630,631,632,633,634,635,636,637,639,640
0,1,21002,0,3,0,3,3,0,4,0,...,4,0,0,3,0,0,0,3,0,0
1,2,21025,0,0,0,2,0,0,0,3,...,0,2,0,0,4,2,0,0,3,3
2,3,21042,0,0,0,0,4,0,3,0,...,0,0,2,0,0,0,2,0,2,0
3,4,21051,0,2,5,0,5,0,0,0,...,5,5,0,0,5,5,5,5,0,5
4,5,21054,0,1,0,1,0,3,4,2,...,4,0,0,0,3,2,1,2,4,4


In [63]:
Ratings1.loc[Ratings1['userid'] == 21002, 'Row_no'].iloc[0]

1

Last but not least, I need to de-normalize the data (normalize by each users mean) and convert it from a dataframe to a numpy array.

In [69]:

R = Ratings.as_matrix()
user_ratings_mean = np.mean(R, axis = 1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)

  


With my ratings matrix properly formatted and normalized, I'm ready to do some dimensionality reduction. But first, let's go over the math.

# Model-Based Collaborative Filtering
Model-based Collaborative Filtering is based on matrix factorization (MF) which has received greater exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems where it can deal better with scalability and sparsity than Memory-based CF:

The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items.
When you have a very sparse matrix, with a lot of dimensions, by doing matrix factorization, you can restructure the user-item matrix into low-rank structure, and you can represent the matrix by the multiplication of two low-rank matrices, where the rows contain the latent vector.
You fit this matrix to approximate your original matrix, as closely as possible, by multiplying the low-rank matrices together, which fills in the entries missing in the original matrix.
For example, let's check the sparsity of the ratings dataset:

In [70]:
sparsity = round(1.0 - len(ratings) / float(n_users * n_hotels), 3)
print('The sparsity level of Hotels dataset is ' +  str(sparsity * 100) + '%')

The sparsity level of Hotels dataset is 58.199999999999996%


# Support Vector Decomposition (SVD)
A well-known matrix factorization method is Singular value decomposition (SVD). At a high level, SVD is an algorithm that decomposes a matrix $A$ into the best lower rank (i.e. smaller/simpler) approximation of the original matrix $A$. Mathematically, it decomposes A into a two unitary matrices and a diagonal matrix:

where $A$ is the input data matrix (users's ratings), $U$ is the left singular vectors (user "features" matrix), $\Sigma$ is the diagonal matrix of singular values (essentially weights/strengths of each concept), and  $V^{T}$ is the right singluar vectors (movie "features" matrix). $U$ and $V^{T}$ are column orthonomal, and represent different things. $U$ represents how much users "like" each feature and $V^{T}$ represents how relevant each feature is to each movie.

To get the lower rank approximation, I take these matrices and keep only the top $k$ features, which can be thought of as the underlying tastes and preferences vectors.

# Setting Up SVD
Scipy and Numpy both have functions to do the singular value decomposition. I'm going to use the Scipy function svds because it let's me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it after).



In [71]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R, k = 50)

As I'm going to leverage matrix multiplication to get predictions, I'll convert the $\Sigma$ (now are values) to the diagonal matrix form.

In [72]:
sigma = np.diag(sigma)

# Making Predictions from the Decomposed Matrices
I now have everything I need to make movie ratings predictions for every user. I can do it all at once by following the math and matrix multiply $U$, $\Sigma$, and $V^{T}$ back to get the rank $k=50$ approximation of $A$.

But first, I need to add the user means back to get the actual star ratings prediction.

In [73]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)


With the predictions matrix for every user, I can build a function to recommend movies for any user. I return the list of movies the user has already rated, for the sake of comparison.

In [74]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)
preds.head()

Hotelid,501,502,503,504,506,507,508,509,510,511,...,630,631,632,633,634,635,636,637,639,640
0,1.421711,2.578467,2.198804,4.102824,2.69817,1.113877,4.546845,1.544424,3.435844,1.42777,...,3.727534,0.900897,2.149955,3.217377,2.507979,1.499241,1.922172,4.190599,2.200845,2.419727
1,1.267634,2.387099,1.526894,3.050668,2.436692,1.901982,2.736718,3.879747,1.95326,1.83703,...,2.137318,3.051434,3.365642,2.361558,5.261645,3.773936,2.710147,0.878873,3.990151,2.996152
2,1.130127,1.193146,1.842137,1.353468,5.132136,0.496458,3.681109,0.673854,4.191412,1.514681,...,1.680593,0.374712,3.226474,1.485946,1.916444,1.123832,3.037241,1.142092,2.7506,1.152776
3,2.084745,4.637345,7.132861,2.051363,7.004469,2.182085,1.124913,3.090875,1.827924,1.629233,...,7.878621,7.284585,1.86831,2.560925,7.346225,7.00427,7.071337,7.192435,1.861107,7.703493
4,1.110677,1.263767,2.357511,1.289578,1.817639,5.811811,4.77204,2.964819,2.199234,3.218805,...,3.399571,2.841067,1.757202,1.431574,4.458158,2.848928,1.311629,4.577747,4.121669,4.407313


In [75]:
def recommend_hotels(predictions, userid, Ratings1, original_ratings, num_recommendations):
    
    # Get and sort the user's predictions
    user_row_number = Ratings1.loc[Ratings1['userid'] == userid, 'Row_no'].iloc[0] - 1 # Userid starts at 1, not 0
    sorted_user_predictions = preds.iloc[user_row_number].sort_values(ascending=False) # User ID starts at 1
    # Get the user's data and merge in the hotel information.
    user_data = original_ratings[original_ratings.userid == (userid)]
    user_full = (user_data.sort_values(['OverallRating'], ascending=False)
                 )
    print(user_full.head(2))
    print('User {0} has already rated {1} hotels.'.format(userid, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings hotels not already rated.'.format(num_recommendations))
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (ratings[~ratings['Hotelid'].isin(user_full['Hotelid'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'Hotelid',
               right_on = 'Hotelid').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).drop_duplicates('Hotelid').drop(['userid'],axis=1).iloc[:num_recommendations]
                      )
    return user_full, recommendations
    
    

Let's try to recommend 20 Hotels for user with ID 21051.

In [76]:
already_rated, predictions = recommend_hotels(preds, 21051, Ratings1, ratings, 5)

       userid  Hotelid  OverallRating
46028   21051      630              5
46055   21051      586              5
User 21051 has already rated 53 hotels.
Recommending highest 5 predicted ratings hotels not already rated.


In [77]:
# Top 20 movies that User 21051 has rated 
already_rated.head(20)

Unnamed: 0,userid,Hotelid,OverallRating
46028,21051,630,5
46055,21051,586,5
46057,21051,573,5
46058,21051,577,5
46059,21051,623,5
46060,21051,537,5
46061,21051,628,5
46062,21051,503,5
46063,21051,539,5
46064,21051,605,5


In [79]:
# Top 5 hotels that User 21051 hopefully will enjoy
predictions

Unnamed: 0,Hotelid,OverallRating,Predictions
81425,546,3,4.941662
17543,619,5,4.751268
45992,563,1,3.844053
106551,529,3,3.424692
19377,553,3,3.11965


# Model Evaluation
Can't forget to evaluate our model, can we?

Instead of doing manually like the last time, I will use the Surprise library that provided various ready-to-use powerful prediction algorithms including (SVD) to evaluate its RMSE (Root Mean Squared Error) on the MovieLens dataset. It is a Python scikit building and analyzing recommender systems.

In [80]:
# Import libraries from Surprise package
from surprise import Reader, Dataset, SVD, evaluate

# Load Reader library
reader = Reader()

# Load ratings dataset with Dataset library
data = Dataset.load_from_df(ratings[['userid', 'Hotelid', 'OverallRating']], reader)

# Split the dataset for 5-fold evaluation
data.split(n_folds=5)



In [81]:
# Use the SVD algorithm.
svd = SVD()

# Compute the RMSE of the SVD algorithm.
evaluate(svd, data, measures=['RMSE'])



Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.8303
------------
Fold 2
RMSE: 0.8328
------------
Fold 3
RMSE: 0.8300
------------
Fold 4
RMSE: 0.8361
------------
Fold 5
RMSE: 0.8281
------------
------------
Mean RMSE: 0.8314
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.8302830253279723,
                             0.8327503521045512,
                             0.8299678654029405,
                             0.8360533438100443,
                             0.82809154325299]})