# Matrix Factorization with SVD
Xiaolan Li

This project builds a collaborative-filtering recommender system based on singular value decomposition (SVD) of a user-item rating matrix.

Two methods, mean centering and baseline estimation, are used to deal with the sparsity of the matrix. SVD models with different numbers of factors k, and therefore different 'energy', are then used to produce the predicted ratings.

Finally, the models are compared by RMSE to select the most reasonable one for recommending items to users.

## Loading the Dataset

The data is a subset of Yelp restaurant reviews from [here](https://www.kaggle.com/omkarsabnis/yelp-reviews-dataset). It contains 10,000 reviews and 10 columns.

In [1]:
import pandas as pd
df_reviews = pd.read_csv('https://raw.githubusercontent.com/xiaolancara/Recommender-System/main/data/yelp.csv')
df_reviews.tail()

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9995 | VY_tvNUCCXGXQeSvJl757Q | 2012-07-28 | Ubyfp2RSDYW0g7Mbr8N3iA | 3 | First visit...Had lunch here today - used my G... | review | _eqQoPtQ3e3UxLE4faT6ow | 1 | 2 | 0 |
| 9996 | EKzMHI1tip8rC1-ZAy64yg | 2012-01-18 | 2XyIOQKbVFb6uXQdJ0RzlQ | 4 | Should be called house of deliciousness!\n\nI ... | review | ROru4uk5SaYc3rg8IU7SQw | 0 | 0 | 0 |
| 9997 | 53YGfwmbW73JhFiemNeyzQ | 2010-11-16 | jyznYkIbpqVmlsZxSDSypA | 4 | I recently visited Olive and Ivy for business ... | review | gGbN1aKQHMgfQZkqlsuwzg | 0 | 0 | 0 |
| 9998 | 9SKdOoDHcFoxK5ZtsgHJoA | 2012-12-02 | 5UKq9WQE1qQbJ0DJbc-B6Q | 2 | My nephew just moved to Scottsdale recently so... | review | 0lyVoNazXa20WzUyZPLaQQ | 0 | 0 | 0 |
| 9999 | pF7uRzygyZsltbmVpjIyvw | 2010-10-16 | vWSmOhg2ID1MNZHaWapGbA | 5 | 4-5 locations.. all 4.5 star average.. I think... | review | KSBFytcdjPKZgXKQnYQdkA | 0 | 0 | 0 |


Because the full rating matrix would be very large, I keep only the 100 most-reviewed businesses to reduce its dimensions and save computation time.

In [2]:
df_new = pd.DataFrame(df_reviews.groupby('business_id')['stars'].count().sort_values(ascending=False)).reset_index().iloc[:100]
df_reviews = df_reviews.loc[df_reviews['business_id'].isin(df_new['business_id'])].reset_index(drop = True)
df_reviews.tail(5)

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1634 | 9Y3aQAVITkEJYe5vLZr13w | 2010-04-01 | ZoTUU6EJ1OBNr7mhqxHBLw | 5 | This is the place for a fabulos breakfast!! I ... | review | vasHsAZEgLZGJDTlIweUYQ | 0 | 1 | 0 |
| 1635 | r-a-Cn9hxdEnYTtVTB5bMQ | 2012-04-07 | j9HwZZoBBmJgOlqDSuJcxg | 1 | The food is delicious. The service: discrimi... | review | toPtsUtYoRB-5-ThrOy2Fg | 0 | 0 | 0 |
| 1636 | xY1sPHTA2RGVFlh5tZhs9g | 2012-06-02 | TM8hdYqs5Zi1jO5Yrq6E0g | 4 | For our first time we had a great time! Our se... | review | GvaNZY4poCcd3H4WxHjrLQ | 0 | 2 | 0 |
| 1637 | R8VwdLyvsp9iybNqRvm94g | 2011-10-03 | pcEeHdAJPoFNF23es0kKWg | 5 | Yes I do rock the hipster joints. I dig this ... | review | b92Y3tyWTQQZ5FLifex62Q | 1 | 1 | 1 |
| 1638 | 53YGfwmbW73JhFiemNeyzQ | 2010-11-16 | jyznYkIbpqVmlsZxSDSypA | 4 | I recently visited Olive and Ivy for business ... | review | gGbN1aKQHMgfQZkqlsuwzg | 0 | 0 | 0 |


## Missing Value Processing

In [3]:
n_user = df_reviews.user_id.unique().shape[0]
n_business = df_reviews.business_id.unique().shape[0]
print('Number of users = ' + str(n_user) + ' | Number of business = ' + str(n_business))

Number of users = 1423 | Number of business = 100


In [4]:
# Sparsity = share of user-business pairs that have no rating
matrixSparsity = 1 - len(df_reviews) / (n_user * n_business)

print('matrixSparsity:', matrixSparsity)

matrixSparsity: 0.9884820801124385


The new subset has 1,423 users and 100 businesses, and 98.8% of the entries in its user-business rating matrix are missing, as shown below.

In [5]:
RatingsNan = df_reviews.pivot(index = 'user_id', columns ='business_id', values = 'stars')
RatingsNan

business_id,-4A5xmN21zi_TXnUESauUQ,-AAig9FG0s8gYE4f8GfowQ,-sC66z4SO3tR7nFCjfQwuQ,1NZLxU5WvB5roPFzneAlLw,2bdKR3l4o-S1CscLqqnvVw,2ceeU8e3nZjaPfGmLwh4kg,3l72FflaaeI0tWEAWN3-gQ,3n9mSKySEv3G03YjcU-YOQ,3oZcTGb_oDHGwZFiP-7kxQ,53YGfwmbW73JhFiemNeyzQ,...,rZbHg4ACfN3iShdsT47WKQ,sbsFamEj5wDxNAjUKrMcSw,tZXPhvufHhfejGrRp554Lg,uEJQSIjWui-TDWXaGlcqyQ,uKSX1n1RoAzGq4bV8GPHVg,uR2aNW75R4oYs9w7aw-_kQ,wH9WtaTlrRawH_IpK90RPg,xY1sPHTA2RGVFlh5tZhs9g,yVQiGdxmnrkJDyQXv2maNA,z3yFuLVrmH-3RJruPEMYKw
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--65q1FpAL_UQtVZ2PTGew,,,,,,,,,,,...,,,,,,,,,,
--rlgfAvvi0BtfRDA1p-VQ,,,,,,,,,,,...,,,,,,,,,,
-2jevGd5B6dqAT7AwBW6lA,,,,,,,,,,,...,,,,,,,,,,
-7LfdqX286W8zJ01ljY_SQ,,,,,,,,,,,...,,,,,,,,,,
-F32Vl8Rk4dwsmk0f2wRIw,,,,4.0,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zqwR5M2gZGKUeRwIx_axpA,,,,,,,,,,,...,,,,,,,,,,
zt7Yfeld6yR_bD4EKaOMjQ,,,,,,,,,,,...,,,,,,,,,,
zvgQkY3MLsF6R1-PuktgaA,,,,,,,,,,,...,,,,,,2.0,,,,
zw-bIcZP4_VEi3UetomDeg,,,,,,,,,,,...,,,,,,,,,,


I therefore use two methods to fill the missing entries of the ratings matrix.

### MeanCenter

In the MeanCenter method, I replace every NaN with 0, convert the DataFrame to a NumPy array, and normalize each row by subtracting that user's mean rating. Note that the mean is taken over the zero-filled row, so unrated businesses pull it toward 0.

In [6]:
Ratings = RatingsNan.fillna(0)
Ratings.head()

business_id,-4A5xmN21zi_TXnUESauUQ,-AAig9FG0s8gYE4f8GfowQ,-sC66z4SO3tR7nFCjfQwuQ,1NZLxU5WvB5roPFzneAlLw,2bdKR3l4o-S1CscLqqnvVw,2ceeU8e3nZjaPfGmLwh4kg,3l72FflaaeI0tWEAWN3-gQ,3n9mSKySEv3G03YjcU-YOQ,3oZcTGb_oDHGwZFiP-7kxQ,53YGfwmbW73JhFiemNeyzQ,...,rZbHg4ACfN3iShdsT47WKQ,sbsFamEj5wDxNAjUKrMcSw,tZXPhvufHhfejGrRp554Lg,uEJQSIjWui-TDWXaGlcqyQ,uKSX1n1RoAzGq4bV8GPHVg,uR2aNW75R4oYs9w7aw-_kQ,wH9WtaTlrRawH_IpK90RPg,xY1sPHTA2RGVFlh5tZhs9g,yVQiGdxmnrkJDyQXv2maNA,z3yFuLVrmH-3RJruPEMYKw
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--65q1FpAL_UQtVZ2PTGew,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--rlgfAvvi0BtfRDA1p-VQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-2jevGd5B6dqAT7AwBW6lA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-7LfdqX286W8zJ01ljY_SQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-F32Vl8Rk4dwsmk0f2wRIw,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
import numpy as np
R = np.array(Ratings)
user_ratings_mean = np.mean(R, axis = 1)
Ratings_MeanCenter = R - user_ratings_mean.reshape(-1, 1)

In [8]:
Ratings_MeanCenter.shape

(1423, 100)

In [9]:
Ratings_MeanCenter

array([[-0.12, -0.12, -0.12, ..., -0.12, -0.12, -0.12],
       [-0.05, -0.05, -0.05, ..., -0.05, -0.05, -0.05],
       [-0.02, -0.02, -0.02, ..., -0.02, -0.02, -0.02],
       ...,
       [-0.02, -0.02, -0.02, ..., -0.02, -0.02, -0.02],
       [-0.05, -0.05, -0.05, ..., -0.05, -0.05, -0.05],
       [-0.03, -0.03, -0.03, ..., -0.03, -0.03, -0.03]])
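
As a point of comparison, the sketch below mean-centers a small made-up matrix in two ways: with the mean of the zero-filled row, as in the cell above, and with the mean of the observed ratings only. The `toy` matrix and the variable names are illustrative assumptions, and the second variant is an alternative, not what this notebook uses.

```python
import numpy as np

# Toy ratings matrix (users x items); NaN marks an unrated item.
toy = np.array([
    [5.0, np.nan, 3.0],
    [np.nan, 2.0, np.nan],
])

# Mean over the zero-filled row (what the cell above computes).
mean_zero_filled = np.nan_to_num(toy, nan=0.0).mean(axis=1)

# Mean over observed ratings only (an alternative, not used in this notebook).
mean_observed = np.nanmean(toy, axis=1)

# Center observed ratings on the observed mean; unrated entries become 0.
centered = np.where(np.isnan(toy), 0.0, toy - mean_observed[:, None])

print(mean_zero_filled)
print(mean_observed)
print(centered)
```

For the first user the zero-filled mean is (5 + 0 + 3) / 3, while the observed mean is (5 + 3) / 2 = 4, which shows how strongly the zeros affect the centering when most entries are missing.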

### Baseline Estimation

In the Baseline Estimation method, each NaN is estimated from the global average rating plus a user bias and an item bias, where a bias is the difference between that user's (or item's) average rating and the global average.

The algorithm reference is from https://www.youtube.com/watch?v=4RSigTais8o

In [10]:
import math
def Basedline_predictor():
    Baseline_avg = np.nanmean(RatingsNan)
    BiasUser = []
    BiasItem = []
    Baseline_Ratings = np.zeros(RatingsNan.shape)
    for userid in RatingsNan.index:
        user_rating = RatingsNan.loc[userid]
        BiasUser.append(np.nanmean(user_rating))
    for itemid in RatingsNan.columns:
        item_rating = RatingsNan.loc[:,itemid]
        BiasItem.append(np.nanmean(item_rating))
    for i in range(len(RatingsNan.index)):
        for j in range(len(RatingsNan.columns)):
            if math.isnan(RatingsNan.iloc[i,j]):
                # baseline predictor: user bias + item bias + global average
                # (BiasUser and BiasItem hold the user and item mean ratings)
                Baseline_Ratings[i][j] = (BiasUser[i] - Baseline_avg) + (BiasItem[j] - Baseline_avg) + Baseline_avg
            else:
                Baseline_Ratings[i][j]= RatingsNan.iloc[i,j]
    return Baseline_Ratings
Ratings_Baseline = Basedline_predictor()

In [11]:
Ratings_Baseline.shape

(1423, 100)

In [12]:
Ratings_Baseline

array([[4.42859154, 4.27474539, 4.42538641, ..., 3.70618352, 4.05038641,
        4.73628385],
       [5.42859154, 5.27474539, 5.42538641, ..., 4.70618352, 5.05038641,
        5.73628385],
       [2.42859154, 2.27474539, 2.42538641, ..., 1.70618352, 2.05038641,
        2.73628385],
       ...,
       [2.42859154, 2.27474539, 2.42538641, ..., 1.70618352, 2.05038641,
        2.73628385],
       [5.42859154, 5.27474539, 5.42538641, ..., 4.70618352, 5.05038641,
        5.73628385],
       [3.42859154, 3.27474539, 3.42538641, ..., 2.70618352, 3.05038641,
        3.73628385]])
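
The nested loops above can also be written with NumPy broadcasting. The following is a minimal sketch, assuming every user and every business has at least one rating (otherwise `np.nanmean` returns NaN for that row or column); `baseline_fill` and the toy DataFrame are hypothetical names for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def baseline_fill(ratings_nan: pd.DataFrame) -> np.ndarray:
    """Fill NaN entries with user mean + item mean - global mean."""
    values = ratings_nan.to_numpy(dtype=float)
    mu = np.nanmean(values)                  # global average rating
    user_mean = np.nanmean(values, axis=1)   # one value per user (row)
    item_mean = np.nanmean(values, axis=0)   # one value per business (column)
    # (user_mean - mu) + (item_mean - mu) + mu, broadcast to the full matrix
    estimate = user_mean[:, None] + item_mean[None, :] - mu
    # keep actual ratings, use the estimate only where the rating is missing
    return np.where(np.isnan(values), estimate, values)

toy = pd.DataFrame(
    [[5.0, np.nan], [3.0, 1.0]],
    index=["u1", "u2"],
    columns=["b1", "b2"],
)
filled = baseline_fill(toy)
print(filled)
```

On the toy data the global average is 3, user `u1` averages 5 and business `b2` averages 1, so the missing entry is filled with 5 + 1 - 3 = 3. The estimates are not bounded to the 1-5 star range (the output above contains values such as 5.43); `np.clip(filled, 1, 5)` would be one optional way to bound them.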

## Singular Value Decomposition (SVD)

A well-known matrix factorization method is singular value decomposition (SVD).

At a high level, SVD decomposes a matrix A into two unitary matrices and a diagonal matrix of singular values. Keeping only the k largest singular values gives the best rank-k (i.e. smaller/simpler) approximation of the original matrix A.

![SVD](https://miro.medium.com/max/625/1*W4MnB2hyvgqedLmwJLrpqw.png)

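To choose the number of factors k, there is an 'energy' calculation: the sum of the squared singular values that are kept, divided by the sum of all squared singular values. As a rule of thumb, keeping 80%-90% of the energy gives good predictions.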

The algorithm reference is from https://www.youtube.com/watch?v=iG517ZbIzMw
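
To make the energy rule concrete, here is a small sketch that computes the cumulative energy for every k and picks the smallest k that reaches a target. It uses the full spectrum from `np.linalg.svd`, whereas the notebook's own function uses `svds`; `energy_profile`, `smallest_k`, and the toy matrix are assumptions for illustration.

```python
import numpy as np

def energy_profile(matrix):
    """Cumulative share of squared singular values, in percent, for k = 1..rank."""
    sigma = np.linalg.svd(matrix, compute_uv=False)  # sorted in descending order
    squared = sigma ** 2
    return 100.0 * np.cumsum(squared) / squared.sum()

def smallest_k(matrix, target=80.0):
    """Smallest number of factors whose energy reaches the target percentage."""
    profile = energy_profile(matrix)
    # index of the first entry >= target, converted to a 1-based factor count
    k = int(np.searchsorted(profile, target)) + 1
    return min(k, len(profile))

# Toy matrix with singular values 3, 2 and 1.
toy = np.diag([3.0, 2.0, 1.0])
print(energy_profile(toy))
print(smallest_k(toy, target=80.0))
```

For the toy matrix the squared singular values are 9, 4 and 1, so one factor keeps 9/14 (about 64%) of the energy and two factors keep 13/14 (about 93%); the smallest k reaching 80% is therefore 2.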

In [13]:
from scipy.sparse.linalg import svds

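# define a function to decompose the matrix using a truncated SVD and calculate the 'energy' kept by k factors
def SVD_Energy(Matrix, k):
    # svds (default solver) needs k < min(Matrix.shape), so r is the largest rank it can return
    r = min(Matrix.shape) - 1
    U_k, sigma_k, Vt_k = svds(Matrix, k=k)
    U_r, sigma_r, Vt_r = svds(Matrix, k=r)
    # energy: squared singular values of the top k as a percentage of the top r
    sigmaEnergy = (np.sum(sigma_k**2) / np.sum(sigma_r**2)) * 100
    return sigmaEnergy, U_k, sigma_k, Vt_k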

In [14]:
# define a function to obtain the predicted rating matrix after decomposing the original matrix
def predsMatrix(orig_Matrix, n_compose, method='MeanCenter'):
    sigmaEnergy, U, sigma, Vt = SVD_Energy(Matrix=orig_Matrix, k=n_compose)
    sigma = np.diag(sigma)

    # obtain the predicted rating matrix
    if method == 'MeanCenter':
        # add back the user means that were subtracted before decomposing
        all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
    elif method == 'Baseline':
        all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
    else:
        raise ValueError("method must be 'MeanCenter' or 'Baseline'")

    preds = pd.DataFrame(all_user_predicted_ratings, columns=RatingsNan.columns, index=RatingsNan.index)
    return sigmaEnergy, preds

## Model Evaluation

In [15]:
import math

# define a function to calculate the RMSE of a model
# It only calculates the error between the actual star ratings and the predicted stars
# (unrated entries are NaN and are skipped)
def evaluateModel(preds):
    preds_np = preds.to_numpy()
    ratings_np_nan = RatingsNan.to_numpy()
    diff_act_pred = np.subtract(ratings_np_nan, preds_np)
    sq_diff_act_pred = np.square(diff_act_pred)
    
    mse = sq_diff_act_pred[~np.isnan(sq_diff_act_pred)].mean()
    rmse = math.sqrt(mse)
    return rmse
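
Because the RMSE above is computed on the same ratings that were used to build the factorized matrix, it measures reconstruction error rather than error on unseen ratings. One possible check, which this notebook does not perform, is to hide a fraction of the observed ratings before filling and decomposing, then score only those hidden entries. The sketch below shows the masking and scoring steps on toy data; `holdout_split` and `masked_rmse` are hypothetical helpers.

```python
import numpy as np

def holdout_split(ratings_nan, test_fraction=0.2, seed=0):
    """Move a random fraction of the observed ratings into a test matrix."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(ratings_nan))   # (row, col) of each rating
    n_test = max(1, int(len(observed) * test_fraction))
    picked = observed[rng.choice(len(observed), size=n_test, replace=False)]
    rows, cols = picked[:, 0], picked[:, 1]

    train = ratings_nan.copy()
    test = np.full(ratings_nan.shape, np.nan)
    test[rows, cols] = ratings_nan[rows, cols]
    train[rows, cols] = np.nan                       # hide them from training
    return train, test

def masked_rmse(actual_nan, predicted):
    """RMSE over the entries of actual_nan that are not NaN."""
    mask = ~np.isnan(actual_nan)
    return float(np.sqrt(np.mean((actual_nan[mask] - predicted[mask]) ** 2)))

toy = np.array([
    [5.0, np.nan, 3.0],
    [np.nan, 2.0, 4.0],
    [1.0, np.nan, np.nan],
])
train, test = holdout_split(toy, test_fraction=0.4)
# A constant prediction of 3 stars stands in for a real model here.
print(masked_rmse(test, np.full(toy.shape, 3.0)))
```

In this dataset most users have a single review, so hiding it leaves that user with no ratings at all; a real split would need to handle that case.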

In [16]:
# create lists to store evaluation result for each method
Energy = []
RMSE = []
method = []

With all the functions prepared, I can now apply them.

__Method 1: Ratings_MeanCenter,n_compose = 70__

In [17]:
sigmaEnergy, preds_MC70 = predsMatrix(orig_Matrix = Ratings_MeanCenter,n_compose = 70)
rmse = evaluateModel(preds_MC70)

Energy.append(sigmaEnergy)
RMSE.append(rmse)
method.append('MeanCenter_70')

print('sigmaEnergy',sigmaEnergy,'%')
preds_MC70.head()

sigmaEnergy 82.2612166967768 %


business_id,-4A5xmN21zi_TXnUESauUQ,-AAig9FG0s8gYE4f8GfowQ,-sC66z4SO3tR7nFCjfQwuQ,1NZLxU5WvB5roPFzneAlLw,2bdKR3l4o-S1CscLqqnvVw,2ceeU8e3nZjaPfGmLwh4kg,3l72FflaaeI0tWEAWN3-gQ,3n9mSKySEv3G03YjcU-YOQ,3oZcTGb_oDHGwZFiP-7kxQ,53YGfwmbW73JhFiemNeyzQ,...,rZbHg4ACfN3iShdsT47WKQ,sbsFamEj5wDxNAjUKrMcSw,tZXPhvufHhfejGrRp554Lg,uEJQSIjWui-TDWXaGlcqyQ,uKSX1n1RoAzGq4bV8GPHVg,uR2aNW75R4oYs9w7aw-_kQ,wH9WtaTlrRawH_IpK90RPg,xY1sPHTA2RGVFlh5tZhs9g,yVQiGdxmnrkJDyQXv2maNA,z3yFuLVrmH-3RJruPEMYKw
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--65q1FpAL_UQtVZ2PTGew,-0.089195,-0.011911,0.014441,-0.035998,0.103113,0.048367,-0.051231,0.061499,0.07248,0.018404,...,0.083916,-0.046706,0.027772,-0.007016,0.030856,-0.089733,0.179162,-0.022116,-0.370984,0.193972
--rlgfAvvi0BtfRDA1p-VQ,-0.030004,-0.0415,0.022647,0.001373,0.042239,-0.054011,-0.079966,0.0056,0.023528,-0.014744,...,-0.028565,-0.068782,-0.006435,-0.0072,0.0236,-0.07422,0.026117,-0.020413,-0.043101,-0.10307
-2jevGd5B6dqAT7AwBW6lA,-0.012001,-0.0166,0.009059,0.000549,0.016895,-0.021605,-0.031986,0.00224,0.009411,-0.005898,...,-0.011426,-0.027513,-0.002574,-0.00288,0.00944,-0.029688,0.010447,-0.008165,-0.01724,-0.041228
-7LfdqX286W8zJ01ljY_SQ,-0.01095,-0.005446,-0.005126,-0.000464,0.042818,-0.000508,0.011737,0.008098,0.025671,-0.000648,...,-0.001648,0.008613,0.030195,0.013052,0.001511,-0.005281,0.002125,-0.001559,0.022837,-0.002992
-F32Vl8Rk4dwsmk0f2wRIw,0.002601,-0.003623,0.001567,3.949642,0.025105,-0.01056,-0.015378,-7.3e-05,0.01004,-0.00779,...,-0.015519,-0.013135,0.015865,-0.002584,0.000963,-0.012159,-0.009589,-0.001677,-0.025071,-0.042022


__Method 2: Ratings_MeanCenter,n_compose = 80__

In [18]:
sigmaEnergy, preds_MC80 = predsMatrix(orig_Matrix = Ratings_MeanCenter,n_compose = 80)
rmse = evaluateModel(preds_MC80)

Energy.append(sigmaEnergy)
RMSE.append(rmse)
method.append('MeanCenter_80')

print('sigmaEnergy',sigmaEnergy,'%')
preds_MC80.head()

sigmaEnergy 89.34920346626221 %


business_id,-4A5xmN21zi_TXnUESauUQ,-AAig9FG0s8gYE4f8GfowQ,-sC66z4SO3tR7nFCjfQwuQ,1NZLxU5WvB5roPFzneAlLw,2bdKR3l4o-S1CscLqqnvVw,2ceeU8e3nZjaPfGmLwh4kg,3l72FflaaeI0tWEAWN3-gQ,3n9mSKySEv3G03YjcU-YOQ,3oZcTGb_oDHGwZFiP-7kxQ,53YGfwmbW73JhFiemNeyzQ,...,rZbHg4ACfN3iShdsT47WKQ,sbsFamEj5wDxNAjUKrMcSw,tZXPhvufHhfejGrRp554Lg,uEJQSIjWui-TDWXaGlcqyQ,uKSX1n1RoAzGq4bV8GPHVg,uR2aNW75R4oYs9w7aw-_kQ,wH9WtaTlrRawH_IpK90RPg,xY1sPHTA2RGVFlh5tZhs9g,yVQiGdxmnrkJDyQXv2maNA,z3yFuLVrmH-3RJruPEMYKw
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--65q1FpAL_UQtVZ2PTGew,-0.00937,-0.011021,0.002508,-0.008572,0.12682,0.004585,0.056629,0.004457,0.046647,0.021492,...,0.094063,-0.011936,-0.233605,-0.024312,0.019303,-0.01289,0.005929,-0.009123,0.053764,0.14315
--rlgfAvvi0BtfRDA1p-VQ,0.023436,-0.053841,-0.003752,0.014362,0.04195,-0.038037,0.023935,0.025671,0.002477,0.010704,...,-0.022496,-0.010127,0.084244,-0.02994,-0.001198,-0.041164,-0.044933,-0.012503,-0.079331,-0.043919
-2jevGd5B6dqAT7AwBW6lA,0.009374,-0.021537,-0.001501,0.005745,0.01678,-0.015215,0.009574,0.010268,0.000991,0.004282,...,-0.008998,-0.004051,0.033698,-0.011976,-0.000479,-0.016465,-0.017973,-0.005001,-0.031733,-0.017568
-7LfdqX286W8zJ01ljY_SQ,-0.013493,-0.007135,-0.002503,-0.00325,0.042647,-0.004324,0.00416,0.002447,0.022479,-0.00124,...,-0.000364,-0.005272,0.045508,0.014024,0.003425,-0.003762,0.00199,-0.001875,0.002663,-0.003477
-F32Vl8Rk4dwsmk0f2wRIw,0.005441,-0.003772,0.000513,3.952821,0.025527,-0.008247,-0.006894,-0.000315,0.004349,-0.002413,...,-0.014269,-0.001635,0.012571,-0.003252,0.001307,0.003164,-0.011314,0.00011,0.018042,-0.03865


__Method 3: Ratings_Baseline,n_compose = 70__

In [19]:
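# note: method is left at its default ('MeanCenter'), so predsMatrix adds the user means
# to the reconstruction; passing method = 'Baseline' would skip that step
sigmaEnergy, preds_BL70 = predsMatrix(orig_Matrix = Ratings_Baseline,n_compose = 70)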
rmse = evaluateModel(preds_BL70)

Energy.append(sigmaEnergy)
RMSE.append(rmse)
method.append('Baseline_70')

print('sigmaEnergy',sigmaEnergy,'%')
preds_BL70.head()

sigmaEnergy 99.99918562163356 %


business_id,-4A5xmN21zi_TXnUESauUQ,-AAig9FG0s8gYE4f8GfowQ,-sC66z4SO3tR7nFCjfQwuQ,1NZLxU5WvB5roPFzneAlLw,2bdKR3l4o-S1CscLqqnvVw,2ceeU8e3nZjaPfGmLwh4kg,3l72FflaaeI0tWEAWN3-gQ,3n9mSKySEv3G03YjcU-YOQ,3oZcTGb_oDHGwZFiP-7kxQ,53YGfwmbW73JhFiemNeyzQ,...,rZbHg4ACfN3iShdsT47WKQ,sbsFamEj5wDxNAjUKrMcSw,tZXPhvufHhfejGrRp554Lg,uEJQSIjWui-TDWXaGlcqyQ,uKSX1n1RoAzGq4bV8GPHVg,uR2aNW75R4oYs9w7aw-_kQ,wH9WtaTlrRawH_IpK90RPg,xY1sPHTA2RGVFlh5tZhs9g,yVQiGdxmnrkJDyQXv2maNA,z3yFuLVrmH-3RJruPEMYKw
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--65q1FpAL_UQtVZ2PTGew,4.548858,4.388642,4.545997,3.872218,3.541569,3.754328,3.865684,4.420009,3.921809,3.768923,...,4.613806,3.325693,3.921038,4.13628,4.004964,3.839765,3.948072,3.825676,4.172695,4.856587
--rlgfAvvi0BtfRDA1p-VQ,5.477223,5.325391,5.475177,4.807713,4.472101,4.685432,4.796852,5.350144,4.852018,4.700236,...,5.543395,4.255601,4.851168,5.066118,4.940443,4.769825,4.875992,4.757526,5.104349,5.786101
-2jevGd5B6dqAT7AwBW6lA,2.447278,2.296321,2.445477,1.778335,1.442225,1.65644,1.766353,2.320902,1.821733,1.670576,...,2.51389,1.225858,1.822757,2.037419,1.911564,1.740183,1.845564,1.728168,2.071979,2.756248
-7LfdqX286W8zJ01ljY_SQ,4.457099,4.307983,4.469538,3.793494,3.461811,3.684638,3.768062,4.343048,3.835678,3.687606,...,4.537008,3.246188,3.861724,4.071381,3.939802,3.762079,3.864697,3.747598,4.083166,4.777302
-F32Vl8Rk4dwsmk0f2wRIw,4.461691,4.318262,4.465738,3.909,3.461924,3.675941,3.787334,4.345636,3.843376,3.687227,...,4.534285,3.241677,3.838972,4.059532,3.926913,3.757031,3.869819,3.743535,4.095664,4.776817


__Method 4: Ratings_Baseline,n_compose = 80__

In [20]:
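# note: method is left at its default ('MeanCenter'), so predsMatrix adds the user means
# to the reconstruction; passing method = 'Baseline' would skip that step
sigmaEnergy, preds_BL80 = predsMatrix(orig_Matrix = Ratings_Baseline,n_compose = 80)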
rmse = evaluateModel(preds_BL80)

Energy.append(sigmaEnergy)
RMSE.append(rmse)
method.append('Baseline_80')

print('sigmaEnergy',sigmaEnergy,'%')
preds_BL80.head()

sigmaEnergy 99.99964327823774 %


business_id,-4A5xmN21zi_TXnUESauUQ,-AAig9FG0s8gYE4f8GfowQ,-sC66z4SO3tR7nFCjfQwuQ,1NZLxU5WvB5roPFzneAlLw,2bdKR3l4o-S1CscLqqnvVw,2ceeU8e3nZjaPfGmLwh4kg,3l72FflaaeI0tWEAWN3-gQ,3n9mSKySEv3G03YjcU-YOQ,3oZcTGb_oDHGwZFiP-7kxQ,53YGfwmbW73JhFiemNeyzQ,...,rZbHg4ACfN3iShdsT47WKQ,sbsFamEj5wDxNAjUKrMcSw,tZXPhvufHhfejGrRp554Lg,uEJQSIjWui-TDWXaGlcqyQ,uKSX1n1RoAzGq4bV8GPHVg,uR2aNW75R4oYs9w7aw-_kQ,wH9WtaTlrRawH_IpK90RPg,xY1sPHTA2RGVFlh5tZhs9g,yVQiGdxmnrkJDyQXv2maNA,z3yFuLVrmH-3RJruPEMYKw
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--65q1FpAL_UQtVZ2PTGew,4.547844,4.397212,4.545337,3.88,3.541927,3.755065,3.868602,4.420783,3.920028,3.769697,...,4.613657,3.325693,3.920432,4.13612,4.011161,3.839639,3.947339,3.827158,4.172587,4.856318
--rlgfAvvi0BtfRDA1p-VQ,5.47685,5.32494,5.475186,4.809152,4.472055,4.685137,4.796806,5.350221,4.852202,4.699974,...,5.543377,4.255587,4.850772,5.065821,4.942179,4.770234,4.876204,4.757249,5.103794,5.786119
-2jevGd5B6dqAT7AwBW6lA,2.446933,2.296253,2.445302,1.780167,1.442105,1.65549,1.768363,2.32099,1.822806,1.670089,...,2.513654,1.225768,1.820538,2.036328,1.911822,1.740559,1.846197,1.727517,2.071989,2.756225
-7LfdqX286W8zJ01ljY_SQ,4.469942,4.315991,4.465637,3.795778,3.461432,3.673263,3.783307,4.342061,3.846275,3.68867,...,4.533603,3.245069,3.84106,4.054612,3.925778,3.75771,3.867843,3.745713,4.088476,4.776474
-F32Vl8Rk4dwsmk0f2wRIw,4.468766,4.310255,4.465239,4.035368,3.461407,3.672291,3.776593,4.339957,3.848399,3.688607,...,4.532631,3.244437,3.842036,4.052897,3.926357,3.757439,3.869773,3.745035,4.094251,4.776033


## Model Selection

Lastly, let's see the results of all methods.

In [21]:
col_dict = {'Energy':Energy,'RMSE':RMSE}
df_result = pd.DataFrame(col_dict,index = method)
df_result

| | Energy | RMSE |
| --- | --- | --- |
| MeanCenter_70 | 82.261217 | 1.531726 |
| MeanCenter_80 | 89.349203 | 1.162162 |
| Baseline_70 | 99.999186 | 0.113841 |
| Baseline_80 | 99.999643 | 0.09266 |


From the table above, the Baseline method has a remarkably low error between predicted and actual ratings. However, whichever k I choose for the Baseline method, its energy stays close to 100%, which is not what I want: almost every entry of the baseline-filled matrix is the sum of a user term and an item term, so a handful of singular values already capture nearly all of its energy. In addition, for the MeanCenter method, the decomposition with the higher energy (k = 80) has the lower RMSE on this dataset.

Thus, I select MeanCenter_80, which has an RMSE of 1.162, to generate recommendations for users.

## Recommend Items for user

In [22]:
# define a recommend_items function for user 
def recommend_items(predictions, userID, original_ratings, num_recommendations):
    
    # Get and sort the user's predictions
    sorted_user_predictions = predictions.loc[userID].sort_values(ascending=False).reset_index()
    
    # Get the user's data information.
    user_data = original_ratings[original_ratings.user_id == (userID)]
    user_full = (user_data.sort_values(['stars'], ascending=False))
    user_full_num = user_full.shape[0]
    print('User {0} has already rated {1} business.'.format(userID, user_full_num))
    
    print('Recommending highest {0} predicted ratings business not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating business that the user hasn't rated yet.
    recommendations = pd.DataFrame(sorted_user_predictions[~sorted_user_predictions['business_id'].isin(user_full['business_id'])].iloc[:num_recommendations])
    recommendations.columns=['business_id','similar_score']
    return user_full, recommendations
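
The filtering step in `recommend_items` can be tried on a tiny example. This is a simplified sketch of the same idea (rank a user's predicted ratings and drop businesses already rated), using made-up IDs and a hypothetical helper `top_unrated` rather than the function above.

```python
import pandas as pd

# Toy predicted ratings: one user, four businesses.
preds = pd.DataFrame(
    [[4.2, 3.1, 4.8, 2.0]],
    index=["u1"],
    columns=pd.Index(["b1", "b2", "b3", "b4"], name="business_id"),
)
# Toy review log: the user has already rated business b3.
reviews = pd.DataFrame({"user_id": ["u1"], "business_id": ["b3"], "stars": [5]})

def top_unrated(predictions, user_id, reviews, n):
    """Return the n highest predicted ratings among businesses the user has not rated."""
    rated = reviews.loc[reviews["user_id"] == user_id, "business_id"].tolist()
    scores = predictions.loc[user_id].drop(index=rated, errors="ignore")
    return scores.nlargest(n)

top = top_unrated(preds, "u1", reviews, 2)
print(top)
```

Business `b3` has the highest predicted rating but is excluded because the user already rated it, so the two recommendations are `b1` and `b2`.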

In [23]:
# find the count of reviews group by user 
df_reviews.groupby('user_id')['stars'].count().sort_values(ascending=False)

user_id
wHg1YkCzdZq9WBJOTRgxHQ    7
0bNXP9quoJEgyVZu9ipGgQ    5
LqgGgWi3FLHBViX9tmZ9sw    4
-txH2zJSBZQHO6RWvoWXuQ    4
HZeFzs42f0iGaA-sP_hUnA    4
                         ..
dvu9jhoAg88OIKRxuaTKpA    1
e7_mPkNLzbyWMXOBpT0E5Q    1
e8ZXaLh79xm9h5OKLavILQ    1
eA7hkwrrknhxhqs-dOhDYg    1
WQX1Hio90vjGkASKM0v5kA    1
Name: stars, Length: 1423, dtype: int64

In [24]:
# using MeanCenter with 80 factors to recommend items for the user with the most reviews
already_rated, predictions = recommend_items(preds_MC80, 'wHg1YkCzdZq9WBJOTRgxHQ', df_reviews, 5)

User wHg1YkCzdZq9WBJOTRgxHQ has already rated 7 business.
Recommending highest 5 predicted ratings business not already rated.


In [25]:
# Top 5 business that User has rated 
already_rated.head(5)

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 572 | Bc4DoKgrKCtCuN-0O5He3A | 2009-12-19 | -qqrl4101KbQKIdar1lMRw | 5 | Cashew brittle, almond brittle, bacon brittle!... | review | wHg1YkCzdZq9WBJOTRgxHQ | 9 | 8 | 6 |
| 632 | z3yFuLVrmH-3RJruPEMYKw | 2010-04-24 | 7iUIThqzcZwOi7Xtu0C1jg | 5 | When I moved to the Valley, our first place wa... | review | wHg1YkCzdZq9WBJOTRgxHQ | 6 | 7 | 5 |
| 331 | rZbHg4ACfN3iShdsT47WKQ | 2010-03-17 | DFjpEBuLU5Wu4cJ8JAeX8w | 4 | "You smell like smoke." That's the greeting I... | review | wHg1YkCzdZq9WBJOTRgxHQ | 7 | 7 | 7 |
| 900 | FV0BkoGOd3Yu_eJnXY15ZA | 2009-02-17 | gSc3pwGVSiCtGKDTuvNQCg | 4 | Decided to go for Round 2 for a light dinner t... | review | wHg1YkCzdZq9WBJOTRgxHQ | 4 | 3 | 3 |
| 148 | AqbgC7Gul5Es1rRzGNLDFA | 2010-08-23 | J1q-zeAespG5YRRSxalSfQ | 3 | Somehow while attempting to write an update to... | review | wHg1YkCzdZq9WBJOTRgxHQ | 7 | 8 | 9 |


In [26]:
# Top 5 business that User hopefully will enjoy
predictions

| | business_id | similar_score |
| --- | --- | --- |
| 7 | k8JnZBspVOI8kLcQek-Chw | 0.528854 |
| 8 | AaKlegu7gmOCD4rEESF76Q | 0.474532 |
| 9 | Xq9tkiHhyN_aBFswFeGLvA | 0.342034 |
| 10 | 9ziO3NpoNTKHvIKCBFB_fQ | 0.331167 |
| 11 | DcrM4hwDcU2G6vuh2cnaYQ | 0.273784 |


These look like pretty good recommendations.

## Conclusion

With SVD, the MeanCenter method gives a better-behaved matrix factorization energy than the Baseline method, whose energy stays close to 100% regardless of k.

The SVD-based collaborative filtering model on the user-item matrix is simple and evaluates reasonably well as a recommender system.

## References

https://towardsdatascience.com/the-4-recommendation-engines-that-can-predict-your-movie-tastes-109dc4e10c52

https://towardsdatascience.com/yelp-restaurant-recommendation-system-capstone-project-264fe7a7dea1

https://stackoverflow.com/questions/29338183/recommendation-system-and-baseline-predictors