# Model Based Collaborative Filtering

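Here we use model-based techniques to improve on the robustness and accuracy of Memory-Based CF. Matrix factorization methods such as SVD and NMF compress the user-item matrix, which has a very high number of dimensions, into a low-dimensional representation. KNN is included for comparison; strictly speaking it is a neighbourhood-based method rather than a dimensionality reduction technique.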
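
To make the idea of compressing the user-item matrix concrete, here is a minimal sketch using a plain truncated SVD from NumPy. It assumes a small, fully observed toy matrix with made-up values; a real ratings matrix is mostly empty, so this illustrates the low-rank idea only and is not the procedure applied to the dataset in this notebook.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); values are made up.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

def low_rank_approximation(matrix, rank):
    """Keep only the `rank` largest singular values of `matrix`."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]

# Two latent dimensions are enough to capture the two taste groups in the toy data.
R_2 = low_rank_approximation(R, rank=2)
print(np.round(R_2, 2))
```

Each user and each item is then described by only `rank` numbers instead of a full row or column of ratings.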


In [1]:
# Import the libraries used to read and process the dataset
import pandas as pd
import numpy as np
from sklearn import preprocessing

In [2]:
df = pd.read_csv('ratings_Beauty.csv')


In [4]:
df

Unnamed: 0,UserId,ProductId,Rating,Timestamp
0,A39HTATAQ9V7YF,0205616461,5.0,1369699200
1,A3JM6GV9MNOF9X,0558925278,3.0,1355443200
2,A1Z513UWSAAO0F,0558925278,5.0,1404691200
3,A1WMRR494NWEWV,0733001998,4.0,1382572800
4,A3IAAVS479H7M7,0737104473,1.0,1274227200
...,...,...,...,...
2023065,A3DEHKPFANB8VA,B00LORWRJA,5.0,1405296000
2023066,A3DEHKPFANB8VA,B00LOS7MEE,5.0,1405296000
2023067,AG9TJLJUN5OM3,B00LP2YB8E,5.0,1405382400
2023068,AYBIB14QOI9PC,B00LPVG6V0,5.0,1405555200


In [7]:
# Count the ratings per product and keep only products with at least 1000 ratings.
# A lower threshold keeps more products, which increases the dimensionality and
# needs more compute power than is available on a local machine.
rating_of_product = df.groupby('ProductId')['Rating'].count()
ratings_of_products_df = pd.DataFrame(rating_of_product)
filtered_ratings_per_product = ratings_of_products_df[ratings_of_products_df.Rating >= 1000]
filtered_ratings_per_product

ProductId,Rating
B0000YUXI0,2143
B000142FVW,1558
B0009V1YR8,2869
B000FS05VG,1589
B000TKH6G2,1379
B000ZMBSPE,2041
B00121UVU0,1838
B00150LT40,1349
B00188IFHS,1341
B001JKTTVQ,1468


In [8]:
# Build a list of products to keep
popular_products = filtered_ratings_per_product.index.tolist()
print("Number of popular products (at least 1000 ratings): ", len(popular_products))
filtered_ratings_data = df[df["ProductId"].isin(popular_products)]
print("The size of dataset has changed from ", len(df), " to ", len(filtered_ratings_data))

Number of popular products (at least 1000 ratings):  22
The size of dataset has changed from  2023070  to  43063
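
The trade-off behind the threshold can be sketched on toy data: the lower the minimum rating count, the more products (and therefore dimensions) survive the filter. The product IDs below are hypothetical and the helper `keep_popular` is an illustrative name, not part of the notebook.

```python
import numpy as np

# Hypothetical product IDs, one entry per rating (not taken from ratings_Beauty.csv).
product_ids = np.array(["p1", "p2", "p1", "p3", "p1", "p2", "p4"])

def keep_popular(ids, min_ratings):
    """Boolean mask selecting ratings whose product has at least `min_ratings` ratings."""
    unique_ids, counts = np.unique(ids, return_counts=True)
    popular = unique_ids[counts >= min_ratings]
    return np.isin(ids, popular)

for threshold in (1, 2, 3):
    mask = keep_popular(product_ids, threshold)
    n_products = len(np.unique(product_ids[mask]))
    print(f"threshold={threshold}: {n_products} products, {mask.sum()} ratings kept")
```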


In [9]:
from surprise import Reader, Dataset, KNNBasic, SVD, NMF
from surprise.model_selection import GridSearchCV, cross_validate


I am using the Surprise library to implement KNN, SVD and NMF.
Its documentation is available at https://surprise.readthedocs.io/en/stable/
Surprise implements most of the algorithms commonly needed for model-based recommendation systems.

To load a dataset from a pandas DataFrame, use the Dataset.load_from_df() method. It also needs a Reader object, but only the rating_scale parameter must be specified.

The Reader class is normally used to parse a file containing ratings; here it only supplies the rating scale.

In [11]:
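# Only the rating scale is needed when loading from a DataFrame
reader = Reader(rating_scale=(0.5, 5.0))

# load_from_df expects the columns in this order: user, item, rating
data = Dataset.load_from_df(filtered_ratings_data[['UserId', 'ProductId', 'Rating']], reader=reader)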

# K-Nearest Neighbours (KNN)

In [13]:
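# Use Mean Squared Difference (MSD) similarity between users
sim_options = {'name': 'msd'}

# Predict each rating from the 15 most similar neighbours, evaluated with 5-fold cross-validation
algo = KNNBasic(k=15, sim_options=sim_options)
cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, verbose=True)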

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2679  1.2820  1.2434  1.2703  1.2662  1.2660  0.0126  
Fit time          64.49   78.95   70.44   73.26   70.32   71.49   4.69    
Test time         0.65    0.56    0.60    0.54    0.55    0.58    0.04    


{'test_rmse': array([1.26788585, 1.28204568, 1.2433788 , 1.27033014, 1.26621639]),
 'fit_time': (64.48896431922913,
  78.94706225395203,
  70.44118452072144,
  73.2566487789154,
  70.32270073890686),
 'test_time': (0.6484858989715576,
  0.5581698417663574,
  0.6041581630706787,
  0.5350089073181152,
  0.55301833152771)}
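
The MSD similarity used above can be sketched for a single pair of users as follows. This is a minimal illustration, assuming ratings are stored as NumPy vectors with `np.nan` for unrated items and that similarity is defined as 1 / (MSD + 1) over co-rated items; the vectors and the zero returned when no items are shared are assumptions made for the example.

```python
import numpy as np

def msd_similarity(ratings_u, ratings_v):
    """MSD similarity between two rating vectors; np.nan marks a missing rating."""
    common = ~np.isnan(ratings_u) & ~np.isnan(ratings_v)
    if not common.any():
        return 0.0  # no co-rated items: treated as not similar (an assumption)
    msd = np.mean((ratings_u[common] - ratings_v[common]) ** 2)
    return 1.0 / (msd + 1.0)

# Made-up rating vectors over four items.
u = np.array([5.0, 3.0, np.nan, 1.0])
v = np.array([4.0, np.nan, 2.0, 1.0])
w = np.array([1.0, 3.0, 5.0, 5.0])

print(msd_similarity(u, v))  # users who mostly agree
print(msd_similarity(u, w))  # users who mostly disagree
```

Users who agree on their co-rated items get a similarity close to 1, and the k most similar users then contribute most to the predicted rating.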

# Tuning KNN using GridSearchCV

In [14]:
# The neighbourhood size parameter of KNNBasic is named `k`
k_values = [5, 10, 15]
param_grid = {'k': k_values}

gs = GridSearchCV(KNNBasic, param_grid=param_grid, measures=['RMSE'])
gs.fit(data)

print('\n\n###############')
# Best RMSE score
print('Best Score :', gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print('Best Parameters :', gs.best_params['rmse'])
print('###############')

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi
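
Conceptually, a grid search tries every combination in the parameter grid and keeps the one with the lowest error. The sketch below shows that loop in plain Python; `grid_search` and `toy_error` are hypothetical names, and the scoring function is a stand-in with a made-up minimum rather than a real cross-validated model.

```python
from itertools import product

def grid_search(evaluate, param_grid):
    """Evaluate every parameter combination and return (best_score, best_params).

    `evaluate` is a hypothetical callable that takes keyword parameters and
    returns an error to minimise (for example a mean cross-validated RMSE).
    """
    names = list(param_grid)
    best_score, best_params = float("inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score < best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Stand-in scoring function with a made-up minimum at k=10; not a real model.
def toy_error(k):
    return 1.0 + 0.01 * abs(k - 10)

best_score, best_params = grid_search(toy_error, {"k": [5, 10, 15]})
print(best_score, best_params)
```

With more than one parameter in the grid, the number of fits grows as the product of the list lengths, multiplied again by the number of cross-validation folds.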

# Singular Value Decomposition (SVD)

In [15]:
algo = SVD()
cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2104  1.2275  1.2271  1.2227  1.2310  1.2237  0.0072  
Fit time          2.67    2.42    2.29    1.97    1.77    2.22    0.32    
Test time         0.06    0.04    0.04    0.03    0.03    0.04    0.01    


{'test_rmse': array([1.21040106, 1.22749314, 1.22707846, 1.22267768, 1.23103277]),
 'fit_time': (2.6673812866210938,
  2.42238450050354,
  2.285001516342163,
  1.9690957069396973,
  1.767554521560669),
 'test_time': (0.06033730506896973,
  0.04401206970214844,
  0.040024518966674805,
  0.03201889991760254,
  0.032019853591918945)}
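
The SVD algorithm in Surprise is a matrix factorization model in which a rating is predicted as a global mean plus a user bias, an item bias and the dot product of user and item factor vectors, all learned by stochastic gradient descent. The sketch below illustrates that idea on made-up (user, item, rating) triples. The function names, hyperparameters and data are assumptions for illustration, and the RMSE is measured on the training data, so it is not comparable to the cross-validated figures above.

```python
import numpy as np

def predict(model, u, i):
    """Predicted rating: global mean + user bias + item bias + factor dot product."""
    mu, b_u, b_i, P, Q = model
    return mu + b_u[u] + b_i[i] + P[u] @ Q[i]

def fit_biased_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
                  lr=0.01, reg=0.02, seed=0):
    """Fit the model by SGD on a list of (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(0, 0.1, (n_users, n_factors))
    Q = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            # Regularised gradient steps on biases and factors
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def rmse(model, ratings):
    errors = [r - predict(model, u, i) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Made-up ratings: 4 users, 3 items.
toy_ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
               (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]

model = fit_biased_mf(toy_ratings, n_users=4, n_items=3)
# Baseline that always predicts the global mean
mean_only = (model[0], np.zeros(4), np.zeros(3), np.zeros((4, 2)), np.zeros((3, 2)))
print("global-mean RMSE:", rmse(mean_only, toy_ratings))
print("fitted RMSE:     ", rmse(model, toy_ratings))
```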

# Non-Negative Matrix Factorization (NMF)

In [17]:
algo = NMF()
cross_validate(data=data, algo=algo, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2846  1.2816  1.2682  1.2786  1.2780  1.2782  0.0055  
Fit time          3.42    3.38    3.41    3.38    3.40    3.40    0.01    
Test time         0.06    0.03    0.04    0.04    0.04    0.04    0.01    


{'test_rmse': array([1.28457273, 1.28162271, 1.26824242, 1.27864237, 1.2780382 ]),
 'fit_time': (3.4189417362213135,
  3.38118052482605,
  3.408555746078491,
  3.38480281829834,
  3.4014081954956055),
 'test_time': (0.056090593338012695,
  0.0319979190826416,
  0.03600955009460449,
  0.040010929107666016,
  0.03600931167602539)}
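
To illustrate what non-negative factorization means, here is a minimal sketch that factorizes a small ratings matrix into two non-negative matrices while fitting only the observed entries. It uses the classic multiplicative update rules with a mask, a swapped-in technique chosen for brevity; it should not be read as the optimisation procedure Surprise uses internally. The toy matrix, the use of 0 for missing ratings and all names are assumptions.

```python
import numpy as np

def masked_nmf(R, mask, n_factors=2, n_iters=200, seed=0, eps=1e-9):
    """Factorize R ~ W @ H with W, H >= 0, fitting only entries where mask is True."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, (R.shape[0], n_factors))
    H = rng.uniform(0.1, 1.0, (n_factors, R.shape[1]))
    MR = mask * R
    for _ in range(n_iters):
        # Multiplicative updates keep every entry non-negative
        W *= (MR @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ MR) / (W.T @ (mask * (W @ H)) + eps)
    return W, H

def masked_rmse(R, mask, W, H):
    return float(np.sqrt(np.mean((R[mask] - (W @ H)[mask]) ** 2)))

# Made-up ratings; 0 marks a missing rating in this toy example.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
mask = R > 0

W0, H0 = masked_nmf(R, mask, n_iters=0)  # random initialisation only
W, H = masked_nmf(R, mask)
print("initial RMSE:", masked_rmse(R, mask, W0, H0))
print("fitted RMSE: ", masked_rmse(R, mask, W, H))
print(np.round(W @ H, 2))  # includes estimates for the missing entries
```

Because all factors are non-negative, each latent dimension can only add to a predicted rating, which is one reason NMF factors are often considered easier to interpret.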