# Collaborative Filtering Modeling

In this notebook, I use the SVD algorithm from the Surprise library to build a collaborative filtering recommendation system.

In [None]:
# Import necessary libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

from surprise import Dataset, Reader
from surprise import SVD 
from surprise import accuracy 
from surprise.model_selection import train_test_split, GridSearchCV 

In [None]:
# Load the data
df = pd.read_csv('../data/processed/merged_for_svd.csv')

## SVD

#### Create reader object

In [None]:
# Initialize a reader object with our 0.5-5.0 rating scale
reader = Reader(rating_scale=(df['rating'].min(), df['rating'].max()))
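The rating scale matters because Surprise clips predicted ratings to it by default. A minimal sketch of that idea in plain Python follows; `clip_to_scale` is a hypothetical helper for illustration, not part of Surprise, and the 0.5-5.0 bounds are taken from the comment above.

```python
def clip_to_scale(estimate, rating_scale=(0.5, 5.0)):
    """Clamp a raw predicted rating to the bounds of the rating scale."""
    lower, upper = rating_scale
    return max(lower, min(upper, estimate))

# Raw estimates outside the scale are pulled back to its edges
print(clip_to_scale(5.7))
print(clip_to_scale(0.1))
print(clip_to_scale(3.2))
```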

#### Dataset generator

In [10]:
# Create dataset generator
data_gen = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
data_gen

<surprise.dataset.DatasetAutoFolds at 0x16c300a10>

#### Train/test split the data

In [11]:
trainset, testset = train_test_split(data_gen, test_size=0.2, random_state=42)
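Conceptually, this split shuffles the rating rows and holds out 20% of them for testing. The sketch below shows that idea with the standard library only; `shuffle_split` and the toy rows are hypothetical, and because Surprise uses its own shuffling, this will not reproduce the exact split above.

```python
import random

def shuffle_split(ratings, test_size=0.2, seed=42):
    """Shuffle the rating rows and hold out a fraction of them for testing."""
    rows = list(ratings)
    random.Random(seed).shuffle(rows)  # seeded, so the split is repeatable
    n_test = int(round(test_size * len(rows)))
    return rows[n_test:], rows[:n_test]

# Hypothetical (userId, movieId, rating) rows, for illustration only
toy_ratings = [(u, m, 3.0) for u in range(5) for m in range(2)]
train_rows, test_rows = shuffle_split(toy_ratings)
print(len(train_rows), len(test_rows))
```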

### Modeling

The hyperparameters for this first baseline model are chosen arbitrarily; they are not Surprise's defaults (n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02).

In [None]:
# Fit the model
svd = SVD(reg_all=0.05, lr_all=0.0025, n_factors=100, n_epochs=30)
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x16891e180>
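To make the hyperparameters concrete: SVD in this setting estimates a rating as the global mean plus a user bias, an item bias, and the dot product of two latent-factor vectors, and it learns those values by stochastic gradient descent. Here lr_all plays the role of the learning rate, reg_all the regularization strength, n_factors the length of each factor vector, and n_epochs the number of passes over the training set. The sketch below applies one update for a single rating using the standard update rules; it is an illustration with hypothetical toy values, not Surprise's actual implementation.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.0025, reg=0.05):
    """Apply one SGD update for a single observed rating r_ui."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the values from before this step
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Toy values (hypothetical): global mean 3.5, two latent factors, true rating 5.0
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]
before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)
print(before, after)  # the estimate moves slightly toward the true rating
```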

In [None]:
# Make predictions
predictions = svd.test(testset)

### Model Evaluation

#### Root Mean Squared Error (RMSE)

In [14]:
print(accuracy.rmse(predictions))

RMSE: 0.8789
0.8789324838424579


#### Mean Absolute Error (MAE)

In [15]:
print(accuracy.mae(predictions))

MAE:  0.6772
0.6772170515543106
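Both error metrics above are simple to reproduce by hand from (true rating, estimated rating) pairs. The sketch below uses hypothetical toy pairs; with real Surprise predictions, the pairs would come from each prediction's `r_ui` and `est` fields.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Hypothetical (true, estimated) pairs, for illustration only
toy_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]
print(rmse(toy_pairs))  # 0.75
print(mae(toy_pairs))   # 0.625
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does, which is why it is the larger of the two here.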


#### Fraction of Concordant Pairs (FCP)

In [16]:
accuracy.fcp(predictions)

FCP:  0.6636


0.6635937182215395
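FCP measures ranking quality rather than rating error: for each user, it looks at pairs of rated items and asks how often the predicted ratings put the pair in the same order as the true ratings. The following is a simplified sketch of that idea with hypothetical toy data; Surprise's own implementation differs in how it handles ties and averages over users, so the numbers will not match exactly.

```python
from itertools import combinations

def fcp(preds_by_user):
    """Fraction of item pairs, within each user, that are ranked in the right order."""
    concordant = discordant = 0
    for preds in preds_by_user.values():
        for (true_a, est_a), (true_b, est_b) in combinations(preds, 2):
            if true_a == true_b:
                continue  # no preference to recover when the true ratings tie
            if (true_a - true_b) * (est_a - est_b) > 0:
                concordant += 1
            else:
                discordant += 1  # wrong order, or a tie in the estimates
    total = concordant + discordant
    return concordant / total if total else 0.0

# Hypothetical (true, estimated) ratings per user, for illustration only
toy_preds = {
    "A": [(5.0, 4.5), (3.0, 3.0), (1.0, 3.5)],
    "B": [(4.0, 4.0), (2.0, 2.0)],
}
print(fcp(toy_preds))  # 0.75
```

A score of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.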

These scores leave room for improvement. Before moving on to model optimization, let's run a quick evaluation with cross_validate from the Surprise library.

In [17]:
from surprise.model_selection import cross_validate

In [18]:
algo = SVD()
cross_validate(algo, data_gen, measures=['RMSE', 'MAE', 'FCP'], cv=5, verbose=True)

Evaluating RMSE, MAE, FCP of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8785  0.8742  0.8865  0.8618  0.8682  0.8738  0.0084  
MAE (testset)     0.6720  0.6702  0.6796  0.6650  0.6692  0.6712  0.0048  
FCP (testset)     0.6512  0.6537  0.6697  0.6726  0.6581  0.6611  0.0086  
Fit time          0.37    0.36    0.36    0.36    0.36    0.36    0.00    
Test time         0.04    0.04    0.04    0.04    0.04    0.04    0.00    


{'test_rmse': array([0.87849451, 0.87417833, 0.8864518 , 0.86184196, 0.86817434]),
 'test_mae': array([0.6719964 , 0.67019209, 0.67961491, 0.66499076, 0.66918246]),
 'test_fcp': array([0.65120584, 0.65372255, 0.66968265, 0.67263084, 0.65808339]),
 'fit_time': (0.36647987365722656,
  0.36095499992370605,
  0.36016201972961426,
  0.3616800308227539,
  0.3599710464477539),
 'test_time': (0.04133296012878418,
  0.0393679141998291,
  0.03963303565979004,
  0.03938007354736328,
  0.0395660400390625)}
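The Mean and Std columns in the table are just summary statistics over the per-fold scores in the returned dictionary. As a quick check, the sketch below recomputes them for RMSE from the values printed above, assuming the reported Std is the population standard deviation.

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the 'test_rmse' array above
fold_rmse = [0.87849451, 0.87417833, 0.8864518, 0.86184196, 0.86817434]

mean_rmse = mean(fold_rmse)
std_rmse = pstdev(fold_rmse)  # population standard deviation (divides by n)
print(round(mean_rmse, 4), round(std_rmse, 4))
```

The result should agree with the 0.8738 and 0.0084 reported in the table.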

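The cross-validated scores (mean RMSE 0.8738, MAE 0.6712, FCP 0.6611) are close to the single-split results above, even though this run uses SVD's default hyperparameters. Now, onto optimization.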

## Model Optimization

Here I use a cross-validated grid search to find the best hyperparameters for the model.

In [None]:
# Creating a parameter grid for GridSearchCV
param_grid = {
    'n_factors': [50, 100, 150],           # Number of latent factors
    'lr_all': [0.002, 0.005, 0.01],        # Learning rate for all parameters
    'reg_all': [0.02, 0.05, 0.1],          # Regularization term for all parameters
    'n_epochs': [20, 50, 100]              # Number of iterations
}

In [None]:
# Initialize a GridSearchCV object
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae", "fcp"], cv=5)

In [None]:
# Running the grid search over my specified parameters
gs.fit(data_gen)
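This search is expensive because every combination in the grid is fit once per fold. The sketch below counts the work involved, assuming the combinations are enumerated in `itertools.product` order over the grid's keys (which matches the row order seen in the results table).

```python
from itertools import product

param_grid = {
    'n_factors': [50, 100, 150],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.05, 0.1],
    'n_epochs': [20, 50, 100],
}

# One dict per hyperparameter combination, last key varying fastest
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 5

print(len(combos))            # 81 combinations
print(len(combos) * n_folds)  # 405 model fits in total
print(combos[79])
```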

In [22]:
pd.DataFrame(gs.cv_results)

| | split0_test_rmse | split1_test_rmse | split2_test_rmse | split3_test_rmse | split4_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | ... | rank_test_fcp | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_n_factors | param_lr_all | param_reg_all | param_n_epochs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.887010 | 0.887983 | 0.878330 | 0.884048 | 0.876095 | 0.882693 | 0.004712 | 68 | 0.681126 | 0.686924 | ... | 73 | 0.242185 | 0.011855 | 0.077888 | 0.075899 | {'n_factors': 50, 'lr_all': 0.002, 'reg_all': ... | 50 | 0.002 | 0.02 | 20 |
| 1 | 0.877858 | 0.876475 | 0.863703 | 0.873373 | 0.866027 | 0.871487 | 0.005647 | 50 | 0.671393 | 0.675169 | ... | 57 | 0.567590 | 0.026436 | 0.059502 | 0.041092 | {'n_factors': 50, 'lr_all': 0.002, 'reg_all': ... | 50 | 0.002 | 0.02 | 50 |
| 2 | 0.888985 | 0.882042 | 0.875635 | 0.880659 | 0.872050 | 0.879874 | 0.005788 | 64 | 0.678516 | 0.677899 | ... | 65 | 1.072411 | 0.005289 | 0.067272 | 0.057242 | {'n_factors': 50, 'lr_all': 0.002, 'reg_all': ... | 50 | 0.002 | 0.02 | 100 |
| 3 | 0.886180 | 0.887737 | 0.878285 | 0.883696 | 0.875397 | 0.882259 | 0.004697 | 66 | 0.681122 | 0.686957 | ... | 56 | 0.223188 | 0.003652 | 0.060444 | 0.042679 | {'n_factors': 50, 'lr_all': 0.002, 'reg_all': ... | 50 | 0.002 | 0.05 | 20 |
| 4 | 0.873889 | 0.874613 | 0.866010 | 0.871060 | 0.863193 | 0.869753 | 0.004460 | 41 | 0.669322 | 0.673933 | ... | 44 | 0.550258 | 0.020065 | 0.061180 | 0.043959 | {'n_factors': 50, 'lr_all': 0.002, 'reg_all': ... | 50 | 0.002 | 0.05 | 50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76 | 0.859018 | 0.859863 | 0.848357 | 0.858690 | 0.846913 | 0.854568 | 0.005692 | 10 | 0.656225 | 0.660548 | ... | 12 | 1.185448 | 0.001426 | 0.061758 | 0.041405 | {'n_factors': 150, 'lr_all': 0.01, 'reg_all': ... | 150 | 0.010 | 0.05 | 50 |
| 77 | 0.861548 | 0.859363 | 0.850039 | 0.855868 | 0.846156 | 0.854595 | 0.005739 | 11 | 0.657466 | 0.659657 | ... | 10 | 2.415816 | 0.052048 | 0.065488 | 0.047542 | {'n_factors': 150, 'lr_all': 0.01, 'reg_all': ... | 150 | 0.010 | 0.05 | 100 |
| 78 | 0.868343 | 0.866953 | 0.857336 | 0.862949 | 0.854077 | 0.861932 | 0.005481 | 30 | 0.665118 | 0.667900 | ... | 30 | 0.489466 | 0.002551 | 0.075714 | 0.069706 | {'n_factors': 150, 'lr_all': 0.01, 'reg_all': ... | 150 | 0.010 | 0.10 | 20 |
| 79 | 0.856519 | 0.855567 | 0.847104 | 0.852155 | 0.842634 | 0.850796 | 0.005246 | 5 | 0.654988 | 0.656599 | ... | 4 | 1.187856 | 0.002947 | 0.040947 | 0.000257 | {'n_factors': 150, 'lr_all': 0.01, 'reg_all': ... | 150 | 0.010 | 0.10 | 50 |


### Best model according to the different evaluation metrics

In [None]:
# Best FCP parameters
best_fcp_est = gs.best_estimator['fcp']
print(gs.best_params['fcp'])
gs.best_score['fcp']

{'n_factors': 150, 'lr_all': 0.01, 'reg_all': 0.1, 'n_epochs': 100}


0.6846921730434321

In [None]:
# Best MAE parameters
best_mae_est = gs.best_estimator['mae']
print(gs.best_params['mae'])
gs.best_score['mae']

{'n_factors': 150, 'lr_all': 0.01, 'reg_all': 0.1, 'n_epochs': 100}


0.650350258594008

In [None]:
# Best RMSE parameters
best_rmse_est = gs.best_estimator['rmse']
print(gs.best_params['rmse'])
gs.best_score['rmse']

{'n_factors': 150, 'lr_all': 0.005, 'reg_all': 0.1, 'n_epochs': 100}


0.8497407009989321

All three metrics select the same n_factors (150), reg_all (0.1), and n_epochs (100). The only difference is the learning rate: 0.005 for the best RMSE model versus 0.01 for the best FCP and MAE models.
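The same comparison can be made programmatically, which is handy when the grid has more parameters. In the sketch below, the dictionary is copied from the printed outputs above, and `differing_keys` is a hypothetical helper; with a fitted grid search, `gs.best_params` could presumably be passed in directly.

```python
# Best parameters per metric, copied from the outputs above
best_params = {
    'fcp':  {'n_factors': 150, 'lr_all': 0.01,  'reg_all': 0.1, 'n_epochs': 100},
    'mae':  {'n_factors': 150, 'lr_all': 0.01,  'reg_all': 0.1, 'n_epochs': 100},
    'rmse': {'n_factors': 150, 'lr_all': 0.005, 'reg_all': 0.1, 'n_epochs': 100},
}

def differing_keys(params_by_metric):
    """Return only the hyperparameters whose best value differs across metrics."""
    keys = next(iter(params_by_metric.values())).keys()
    return {
        key: {metric: params[key] for metric, params in params_by_metric.items()}
        for key in keys
        if len({params[key] for params in params_by_metric.values()}) > 1
    }

print(differing_keys(best_params))
```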