# Chapter 6 - Collaborative Filtering

### The Framework

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Load the u.user file into a dataframe
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv('data/u.user', sep='|', names=u_cols, encoding='latin-1')

users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [3]:
#Load the u.item file into a dataframe
i_cols = ['movie_id', 'title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('data/u.item', sep='|', names=i_cols, encoding='latin-1')

movies.head()

Unnamed: 0,movie_id,title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            1682 non-null   int64  
 1   title               1682 non-null   object 
 2   release date        1681 non-null   object 
 3   video release date  0 non-null      float64
 4   IMDb URL            1679 non-null   object 
 5   unknown             1682 non-null   int64  
 6   Action              1682 non-null   int64  
 7   Adventure           1682 non-null   int64  
 8   Animation           1682 non-null   int64  
 9   Children's          1682 non-null   int64  
 10  Comedy              1682 non-null   int64  
 11  Crime               1682 non-null   int64  
 12  Documentary         1682 non-null   int64  
 13  Drama               1682 non-null   int64  
 14  Fantasy             1682 non-null   int64  
 15  Film-Noir           1682 non-null   int64  
 16  Horror

In [5]:
#Remove all information except Movie ID and title
movies = movies[['movie_id', 'title']]

In [6]:
# Load the u.data file into a dataframe
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']

ratings = pd.read_csv('data/u.data', sep='\t', names=r_cols,
 encoding='latin-1')

# Drop the timestamp column
ratings = ratings.drop('timestamp', axis=1)

ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [7]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   user_id   100000 non-null  int64
 1   movie_id  100000 non-null  int64
 2   rating    100000 non-null  int64
dtypes: int64(3)
memory usage: 2.3 MB


### Baseline

Let's now split our ratings dataset in such a way that 75% of each user's ratings are in the training dataset and 25% are in the testing dataset. We will do this in a slightly hacky way: we will treat the user_id field as the target variable (or y) and the ratings DataFrame as the predictor variables (or X). We will then pass these two variables into scikit-learn's train_test_split function and stratify along y. Because each user_id acts as a class, this ensures that the proportion of each user's ratings is the same in both the training and testing datasets:

In [8]:
#Import the train_test_split function
from sklearn.model_selection import train_test_split

#Assign X as the original ratings dataframe and y as the user_id column of ratings.
X = ratings.copy()
y = ratings['user_id']

#Split into training and test datasets, stratified along user_id
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify=y, random_state=42)
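
To see what stratifying on user_id buys us, here is a minimal sketch on a toy DataFrame. The three users, their eight ratings each, and the names `toy`, `train`, and `test` are made up for illustration and are not part of the MovieLens data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings (hypothetical): 3 users with 8 ratings each
toy = pd.DataFrame({
    'user_id': [u for u in (1, 2, 3) for _ in range(8)],
    'movie_id': list(range(8)) * 3,
    'rating': [3, 4, 5, 2, 1, 4, 3, 5] * 3,
})

# Stratify on user_id so every user contributes the same share to each split
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy['user_id'], random_state=0)

# Count how many ratings each user has in each split
train_counts = train['user_id'].value_counts().sort_index().to_dict()
test_counts = test['user_id'].value_counts().sort_index().to_dict()
print(train_counts, test_counts)
```

Each toy user ends up with six ratings in the training split and two in the test split, which is the 75/25 proportion we asked for.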

We will be using the RMSE to assess our modeling performance.

In [9]:
#Import the mean_squared_error function
from sklearn.metrics import mean_squared_error

#Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
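
As a quick sanity check on what the RMSE measures, the following sketch computes it by hand with NumPy only. The helper name `rmse_np` and the three example ratings are assumptions made for illustration.

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root mean squared error computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# True ratings 3, 4, 5 against a constant prediction of 3:
# squared errors are 0, 1, 4, so the RMSE is sqrt(5 / 3)
example = rmse_np([3, 4, 5], [3, 3, 3])
print(example)
```

Because the errors are squared before averaging, a single large miss is penalised more heavily than several small ones.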

Next, let's define our baseline collaborative filter model. All our collaborative filter (or CF) models will take in a `user_id` and `movie_id` as input and output a floating point number between 1 and 5. We define our baseline model in such a way that it returns 3 regardless of `user_id` or `movie_id`:

In [10]:
#Define the baseline model to always return 3.
def baseline(user_id, movie_id):
    return 3.0

To test the potency of our model, we compute the RMSE obtained by that particular model for all user-movie pairs in the test dataset:

In [11]:
#Function to compute the RMSE score obtained on the testing set by a model
def score(cf_model):
    
    #Construct a list of user-movie tuples from the testing dataset
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    
    #Predict the rating for every user-movie tuple
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])
    
    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['rating'])
    
    #Return the final RMSE score
    return rmse(y_true, y_pred)

We're all set. Let's now compute the RMSE obtained by our baseline model:

In [12]:
score(baseline)

1.2488234462885457

We obtain a score of 1.2488. For the models that we build in the subsequent sections, we will try to obtain an RMSE that is less than that obtained for the baseline.

# User Based Collaborative Filtering

User-based collaborative filters find users similar to a particular user and then recommend products that those users have liked to the first user.

### Ratings Matrix

In [13]:
# Build the ratings matrix using pivot_table function
r_matrix = X_train.pivot_table(values='rating', index='user_id', columns='movie_id')

# Movie x
r_matrix.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,
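
The same pivot can be tried on a handful of made-up ratings to make the structure of the matrix concrete. The DataFrame `toy_ratings` below is hypothetical; only the `pivot_table` call mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Five hypothetical ratings from three users on three movies
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3, 3],
    'movie_id': [10, 20, 10, 20, 30],
    'rating':   [5, 3, 4, 2, 1],
})

# Rows become users, columns become movies; unrated pairs become NaN
toy_matrix = toy_ratings.pivot_table(
    values='rating', index='user_id', columns='movie_id')
print(toy_matrix)
```

Every user-movie pair that has no rating shows up as NaN, which is why the real ratings matrix above is mostly empty.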


### Mean

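The simplest filter ignores who the user is and predicts the mean of all the ratings the movie received in the training set. It is possible that some movies are available only in the test set and not the training set (and consequently, not in our ratings matrix). In such cases, we will just default to a rating of 3.0, like the baseline model: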

In [14]:
#User Based Collaborative Filter using Mean Ratings
def cf_user_mean(user_id, movie_id):
    
    #Check if movie_id exists in r_matrix
    if movie_id in r_matrix:
        #Compute the mean of all the ratings given to the movie
        mean_rating = r_matrix[movie_id].mean()
    
    else:
        #Default to a rating of 3.0 in the absence of any information
        mean_rating = 3.0
    
    return mean_rating

In [15]:
#Compute RMSE for the Mean model
score(cf_user_mean)

1.0300824802393536

We see that the score obtained for this model is lower and therefore better than the baseline.

### Weighted Mean

<div style="text-align:center;">
    <img src='images/wm.jpg' width='500'>
</div>
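
Assuming the figure shows the usual similarity-weighted average, the prediction for user $u$ and movie $m$ can be written as follows. The symbol $U_m$ (the set of users who have rated $m$ in the training data) is notation introduced here for illustration.

```latex
r_{u,m} = \frac{\sum_{u' \in U_m} \mathrm{sim}(u, u') \, r_{u',m}}
               {\sum_{u' \in U_m} \left| \mathrm{sim}(u, u') \right|}
```

Users who are more similar to $u$ contribute more to the prediction, and the denominator rescales the weighted sum back onto the rating scale.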

For the sake of this exercise, we will use the cosine score as our similarity function (or sim). Recall how we constructed a movie cosine similarity
matrix while building our content-based engine. We will be building a very similar cosine similarity matrix for our users in this section.

However, scikit-learn's cosine_similarity function does not work with NaN values. Therefore, we will convert all missing values to zero in order to
compute our cosine similarity matrix:

In [16]:
#Create a dummy ratings matrix with all null values imputed to 0
r_matrix_dummy = r_matrix.copy().fillna(0)

In [17]:
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity matrix using the dummy ratings matrix
cosine_sim = cosine_similarity(r_matrix_dummy, r_matrix_dummy)
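
To show what this similarity measures once missing ratings are replaced by zeros, here is a NumPy-only sketch on a made-up 3 × 3 matrix. It is not scikit-learn's implementation, and it assumes every user has at least one rating (otherwise the norm would be zero).

```python
import numpy as np

# Rows are hypothetical users, columns are movies; 0 stands for "not rated"
R = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
])

# Cosine similarity: dot product of two rows divided by the product of their norms
norms = np.linalg.norm(R, axis=1, keepdims=True)
sim = (R @ R.T) / (norms * norms.T)
print(np.round(sim, 4))
```

Users 0 and 2 have no movie in common, so their similarity is exactly 0, while every user has a similarity of 1 with themselves.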

In [18]:
#Convert into pandas dataframe 
cosine_sim = pd.DataFrame(cosine_sim, index=r_matrix.index, columns=r_matrix.index)

cosine_sim.head(10)

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.108361,0.046638,0.029577,0.245753,0.335853,0.344724,0.191582,0.057149,0.251979,...,0.257073,0.069412,0.231643,0.108093,0.176842,0.104799,0.232472,0.051528,0.129555,0.256333
2,0.108361,1.0,0.057613,0.130237,0.054918,0.190552,0.079399,0.076146,0.167992,0.147376,...,0.136993,0.252887,0.255454,0.285193,0.232751,0.149088,0.102807,0.062386,0.109143,0.107686
3,0.046638,0.057613,1.0,0.139805,0.0,0.032485,0.043869,0.080968,0.022263,0.059925,...,0.027402,0.0,0.17506,0.010343,0.105635,0.019052,0.127099,0.023917,0.060392,0.0
4,0.029577,0.130237,0.139805,1.0,0.0,0.04519,0.088586,0.199526,0.135013,0.026919,...,0.055392,0.049773,0.076549,0.139382,0.113886,0.0,0.130343,0.077357,0.15789,0.063911
5,0.245753,0.054918,0.0,0.0,1.0,0.176443,0.28186,0.132205,0.03879,0.1342,...,0.183969,0.019305,0.073714,0.041807,0.081088,0.029743,0.188392,0.068342,0.055557,0.207259
6,0.335853,0.190552,0.032485,0.04519,0.176443,1.0,0.394725,0.143385,0.125126,0.372679,...,0.328643,0.070809,0.135806,0.17167,0.125446,0.086464,0.230566,0.095478,0.197307,0.185268
7,0.344724,0.079399,0.043869,0.088586,0.28186,0.394725,1.0,0.215861,0.121224,0.378723,...,0.339853,0.110866,0.096055,0.10469,0.126108,0.075012,0.270071,0.020036,0.236086,0.266571
8,0.191582,0.076146,0.080968,0.199526,0.132205,0.143385,0.215861,1.0,0.116173,0.169088,...,0.150048,0.064242,0.118297,0.053969,0.168057,0.095736,0.164157,0.076269,0.089871,0.210995
9,0.057149,0.167992,0.022263,0.135013,0.03879,0.125126,0.121224,0.116173,1.0,0.152694,...,0.082819,0.0644,0.127051,0.069251,0.095673,0.0,0.131458,0.106763,0.089297,0.089583
10,0.251979,0.147376,0.059925,0.026919,0.1342,0.372679,0.378723,0.169088,0.152694,1.0,...,0.279849,0.087828,0.131888,0.111841,0.094423,0.080883,0.255758,0.063461,0.169309,0.181031


With the user cosine similarity matrix in hand, we are now in a position to efficiently calculate the weighted mean scores for this model. However,
implementing this model in code is a little more nuanced than its simpler mean counterpart. This is because we need to only consider those cosine
similarity scores that have a corresponding, non-null rating. In other words, we need to avoid all users that have not rated movie m:

In [19]:
def cf_user_wmean(user_id, movie_id):
    # Check if movie_id exists in r_matrix
    if movie_id in r_matrix:
        # Get the similarity scores for the user in question with every other user
        sim_scores = cosine_sim[user_id]

        # Get the user ratings for the movie in question
        m_ratings = r_matrix[movie_id]

        # Extract the indices containing NaN in the m_ratings series
        idx = m_ratings[m_ratings.isnull()].index

        # Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()

        # Drop the corresponding cosine scores from the sim_scores series
        sim_scores = sim_scores.drop(idx)

        # Compute the final weighted mean
        if sim_scores.sum() != 0:
            wmean_rating = np.dot(sim_scores, m_ratings) / sim_scores.sum()
        else:
            # Default to a rating of 3.0 if sim_scores sum to zero
            wmean_rating = 3.0
    else:
        # Default to a rating of 3.0 in the absence of any information
        wmean_rating = 3.0

    return wmean_rating

In [20]:
score(cf_user_wmean)

1.0237210431087944

Running this code takes significantly more time than the previous model, and we achieve only a (very small) improvement in our RMSE score: 1.0237 against 1.0301.

Since all ratings are positive, the cosine similarity scores are never negative. Therefore, we do not need to take the modulus (absolute value) of each score while computing the normalizing factor (the denominator of the equation, which ensures the final rating is scaled back to between 1 and 5).

However, if you're working with a similarity metric that can be negative (for instance, the Pearson correlation score), it is important to factor in the modulus.
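
The effect of the modulus can be seen in a small sketch. The helper `weighted_mean` and the similarity values below are hypothetical and stand in for one row of a similarity matrix that may contain negative entries.

```python
import numpy as np

def weighted_mean(sims, ratings, default=3.0):
    """Similarity-weighted mean that normalizes by the absolute similarities."""
    sims = np.asarray(sims, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    # Keep only the users who actually rated the movie
    rated = ~np.isnan(ratings)
    denom = np.abs(sims[rated]).sum()
    if denom == 0:
        # No usable neighbours: fall back to the default rating
        return default
    return float(np.dot(sims[rated], ratings[rated]) / denom)

sims = [0.5, -0.25, 0.9]
ratings = [4.0, 2.0, np.nan]   # the third user has not rated the movie
pred = weighted_mean(sims, ratings)

# Same numerator divided by the signed sum of similarities, for comparison
signed = (0.5 * 4.0 + -0.25 * 2.0) / (0.5 + -0.25)
print(pred, signed)
```

With the modulus the prediction is 2.0, whereas dividing by the signed sum gives 6.0, which lies outside the rating scale. Filters based on signed similarities often also mean-centre the ratings first, which this sketch does not attempt.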

### Demographics

The basic intuition behind these filters is that users of the same demographic tend to have similar tastes. Therefore, their effectiveness depends on the assumption that women, or teenagers, or people from the same area will share the same taste in movies.

In [21]:
#Merge the original users dataframe with the training set 
merged_df = pd.merge(X_train, users)

merged_df.head()

Unnamed: 0,user_id,movie_id,rating,age,sex,occupation,zip_code
0,862,177,4,25,M,executive,13820
1,862,416,3,25,M,executive,13820
2,862,1093,5,25,M,executive,13820
3,862,168,4,25,M,executive,13820
4,862,568,3,25,M,executive,13820


Next, we need to compute the mean rating of each movie by gender.

In [22]:
#Compute the mean rating of every movie by gender
gender_mean = merged_df[['movie_id', 'sex', 'rating']].groupby(['movie_id', 'sex'])['rating'].mean()

gender_mean

movie_id  sex
1         F      3.797872
          M      3.888446
2         F      3.285714
          M      3.202703
3         F      2.916667
                   ...   
1677      F      3.000000
1679      M      3.000000
1680      M      2.000000
1681      M      3.000000
1682      M      3.000000
Name: rating, Length: 3047, dtype: float64

In [23]:
gender_mean.head(10)

movie_id  sex
1         F      3.797872
          M      3.888446
2         F      3.285714
          M      3.202703
3         F      2.916667
          M      3.245614
4         F      3.545455
          M      3.563025
5         F      3.714286
          M      3.155556
Name: rating, dtype: float64
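
The grouped means form a Series indexed by (movie_id, sex) pairs. The sketch below shows one way such a Series can be queried, on made-up data; the names `toy`, `toy_mean`, and `lookup` are assumptions for illustration.

```python
import pandas as pd

# Five hypothetical ratings; no woman has rated movie 2
toy = pd.DataFrame({
    'movie_id': [1, 1, 1, 2, 2],
    'sex':      ['F', 'M', 'M', 'M', 'M'],
    'rating':   [4, 3, 5, 2, 3],
})

# Mean rating per (movie_id, sex) pair, as a MultiIndex Series
toy_mean = toy.groupby(['movie_id', 'sex'])['rating'].mean()

def lookup(movie_id, sex, default=3.0):
    """Return the group mean, or the default when the group has no ratings."""
    key = (movie_id, sex)
    return float(toy_mean.loc[key]) if key in toy_mean.index else default

print(lookup(1, 'M'), lookup(2, 'M'), lookup(2, 'F'))
```

Groups with no ratings simply do not appear in the Series, so a membership check on the index is enough to decide when to fall back to the default.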

We are now in a position to define a function that identifies the gender of the user, extracts the average rating given to the movie in question by that particular gender, and returns that value as output:

In [24]:
#Set the index of the users dataframe to the user_id
users = users.set_index('user_id')

In [25]:
#Gender Based Collaborative Filter using Mean Ratings
def cf_gender(user_id, movie_id):
    
    #Check if movie_id exists in r_matrix (or training set)
    if movie_id in r_matrix:
        #Identify the gender of the user
        gender = users.loc[user_id]['sex']
        
        #Check if the gender has rated the movie
        if gender in gender_mean[movie_id]:
            
            #Compute the mean rating given by that gender to the movie
            gender_rating = gender_mean[movie_id][gender]
        
        else:
            gender_rating = 3.0
    
    else:
        #Default to a rating of 3.0 in the absence of any information
        gender_rating = 3.0
    
    return gender_rating

In [26]:
score(cf_gender)

1.0392906999935203

We see that this model actually performs worse than the standard mean ratings collaborative filter. This indicates that a user's gender isn't the
strongest indicator of their taste in movies.

Let's try building one more demographic filter, but this time using both **gender** and **occupation**:

In [27]:
#Compute the mean rating by gender and occupation
gen_occ_mean = merged_df[['sex', 'rating', 'movie_id', 'occupation']].pivot_table(
    values='rating', index='movie_id', columns=['occupation', 'sex'], aggfunc='mean')

gen_occ_mean.head()

occupation,administrator,administrator,artist,artist,doctor,educator,educator,engineer,engineer,entertainment,...,salesman,salesman,scientist,scientist,student,student,technician,technician,writer,writer
sex,F,M,F,M,M,F,M,F,M,F,...,F,M,F,M,F,M,F,M,F,M
movie_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,3.9375,3.75,5.0,3.4,3.666667,3.25,3.884615,4.0,4.083333,4.0,...,,4.0,3.5,4.0,4.043478,3.796296,4.0,3.75,4.0,3.0
2,3.0,3.666667,,,,4.0,3.5,,3.066667,,...,,,,3.0,2.666667,3.277778,,2.714286,,2.333333
3,3.5,4.0,,,,,2.0,,3.777778,,...,,,,,3.0,3.391304,,4.25,,1.0
4,3.666667,3.6,,4.666667,3.0,2.5,3.8,4.0,3.65,,...,4.0,4.0,,3.4,3.25,3.777778,,3.333333,4.25,3.25
5,4.0,2.333333,,,,4.0,2.333333,,3.5,,...,,,,4.0,4.333333,3.111111,,3.333333,4.0,2.0


In [28]:
#Gender and Occupation Based Collaborative Filter using Mean Ratings
def cf_gen_occ(user_id, movie_id):
    
    #Check if movie_id exists in gen_occ_mean
    if movie_id in gen_occ_mean.index:
        
        #Identify the user
        user = users.loc[user_id]
        
        #Identify the gender and occupation
        gender = user['sex']
        occ = user['occupation']
        
        #Check if the occupation has rated the movie
        if occ in gen_occ_mean.loc[movie_id]:
            
            #Check if the gender has rated the movie
            if gender in gen_occ_mean.loc[movie_id][occ]:
                
                #Extract the required rating
                rating = gen_occ_mean.loc[movie_id][occ][gender]
                
                #Default to 3.0 if the rating is null
                if np.isnan(rating):
                    rating = 3.0
                
                return rating
            
    #Return the default rating    
    return 3.0

In [29]:
score(cf_gen_occ)

1.1419651376788005

We see that this model performs the worst out of all the filters we've built so far, beating only the baseline. This strongly suggests that tinkering with user demographic data may not be the best way to go forward with the data that we are currently using. 

### Model Based Approaches

The collaborative filters we have built thus far are known as memory-based filters. This is because they only make use of similarity metrics to come up with their results. They do not learn any parameters from the data or assign classes/clusters to the data. In other words, they do not make use of machine learning algorithms.

Surprise is a scikit (or scientific kit) for building recommender systems in Python. You can think of it as scikit-learn's recommender systems counterpart. It is extremely robust and easy to use. It gives us ready-to-use implementations of most of the popular collaborative filtering algorithms and also allows us to integrate an algorithm of our own into the framework.

In [30]:
# Import the required classes and methods from the surprise library
from surprise import Reader, Dataset, KNNBasic
from surprise.model_selection import cross_validate

# The ratings DataFrame has the columns user_id, movie_id and rating,
# in the order that load_from_df expects (user, item, rating)
# Define a Reader object with the rating scale
reader = Reader(rating_scale=(1, 5))

# Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

# Define the algorithm object; in this case kNN
knn = KNNBasic()

# Evaluate the performance in terms of RMSE using cross-validation
results = cross_validate(knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# print(results)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9760  0.9782  0.9808  0.9796  0.9782  0.9786  0.0016  
MAE (testset)     0.7690  0.7742  0.7772  0.7726  0.7719  0.7730  0.0027  
Fit time          0.17    0.17    0.17    0.17    0.17    0.17    0.00    
Test time         1.24    1.23    1.25    1.22    1.20    1.23    0.02    


The output indicates that the filter is making use of a technique known as fivefold cross-validation. In a nutshell, this means that surprise divides the data into five equal parts. It then uses four parts as the training data and tests it on the fifth part. This is done five times, in such a way that every part plays the role of the test data once.
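
The splitting scheme described above can be sketched in a few lines of NumPy. This is an illustration of the idea only, not Surprise's own splitting code, and the function name `kfold_indices` is hypothetical.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs so that each fold is the test set once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Cut the shuffled indices into n_folds roughly equal parts
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, n_folds=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With ten samples and five folds, every round trains on eight samples and tests on the remaining two, and each sample is tested exactly once.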

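We see that the mean RMSE obtained by this model across the five folds is 0.9786, the best result we have achieved so far. Note that this figure comes from cross-validation on the full ratings DataFrame rather than from our X_test split, so it is not strictly comparable with the earlier scores.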

### Supervised Learning and Dimensionality Reduction

Let's now take a tour of some other model-based approaches to collaborative filtering and implement a few of them using the surprise library.

Consider our ratings matrix once again. It is of the m × n shape, where every row represents one of the m users and every column represents one of
the n items.

Let's now remove one of the n columns (say, the j-th column, n_j). We now have an m × (n-1) matrix. If we treat the m × (n-1) matrix as the predictor variables and n_j as the target variable, we can use supervised learning algorithms to train on the values available in n_j to predict the values that are missing. This can be repeated n times, once for every column, to eventually complete our matrix.

One big problem is that most supervised learning algorithms do not work with missing data. In standard problems, it is common practice to impute
the missing values with the mean or median of the column it belongs to.

However, our matrix suffers from heavy data sparsity. With 100,000 ratings spread over 943 users and 1,682 movies, roughly 94% of the entries in the matrix are unavailable. Therefore, it is simply not possible to impute values (such as the mean or median) without introducing a large bias.
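
The degree of sparsity in a ratings matrix is simply the fraction of its cells that are NaN. Here is a minimal sketch on a made-up 3 × 4 matrix; the name `toy_matrix` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings matrix: 3 users, 4 movies, only 4 known ratings
toy_matrix = pd.DataFrame(
    [[5.0, np.nan, np.nan, 1.0],
     [np.nan, np.nan, 4.0, np.nan],
     [np.nan, 2.0, np.nan, np.nan]],
    index=[1, 2, 3], columns=[10, 20, 30, 40])

# Fraction of cells with no rating
sparsity = toy_matrix.isna().to_numpy().mean()

# Column-mean imputation: every gap is filled with that movie's mean rating
imputed = toy_matrix.fillna(toy_matrix.mean())
print(round(sparsity, 3))
print(imputed)
```

After imputation, two thirds of the toy matrix consists of filled-in values rather than observed ratings, which illustrates how imputation can swamp the real signal. The same `isna().to_numpy().mean()` expression could be applied to `r_matrix`.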

One solution that may come to mind is to compress the predictor matrix in such a way that all the values are available. Unfortunately, dimensionality
reduction techniques, such as SVD and PCA, also do not work in an environment with missing values.

While working toward a solution for the Netflix Problem, Simon Funk came up with a solution that could be used to reduce the m × (n-1) matrix into a
lower-dimensional m × d matrix where d << n. He used standard dimensionality-reduction techniques (in his case, the SVD) but with slight tweaks. Explaining the technique is outside the scope of this book, but is presented in the Appendix for advanced readers. For the sake of this chapter,
we will treat this technique as a black box that converts an m × n sparse matrix into an m × d dense matrix where d << n, and call it SVD-like.
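
For readers who want a feel for what happens inside the black box, the sketch below factorizes a tiny ratings set by stochastic gradient descent over the observed ratings only. It is a simplified illustration of the general idea, not Simon Funk's exact method and not Surprise's implementation (Surprise's `SVD` also learns user and item bias terms by default). The function name `funk_mf`, the hyperparameter values, and the six toy ratings are all assumptions.

```python
import numpy as np

def funk_mf(triples, n_users, n_items, d=2, lr=0.05, reg=0.02,
            epochs=500, seed=0):
    """Learn dense user (P) and item (Q) factors from observed ratings only."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, d))   # m x d user factors
    Q = rng.normal(0, 0.1, (n_items, d))   # n x d item factors
    mu = float(np.mean([r for _, _, r in triples]))  # global mean rating
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for u, i, r in triples:
            # Error of the current prediction for this observed rating
            err = r - (mu + P[u] @ Q[i])
            sq_err += err ** 2
            pu = P[u].copy()  # keep the old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
        history.append(float(np.sqrt(sq_err / len(triples))))
    return mu, P, Q, history

# (user index, item index, rating) with zero-based indices; made-up data
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
mu, P, Q, history = funk_mf(triples, n_users=3, n_items=3)

# Prediction for a pair that was never observed
pred_unseen = mu + P[0] @ Q[2]
print(history[0], history[-1], pred_unseen)
```

The missing cells are never imputed: the loop only visits ratings that exist, yet the resulting P matrix is dense and has d columns per user, and the training RMSE recorded in `history` falls as the factors are learned.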

In [31]:
# Import the required classes and methods from the surprise library
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

# Define the SVD algorithm object
svd = SVD()

# Evaluate the performance in terms of RMSE using cross-validation
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9323  0.9375  0.9372  0.9345  0.9398  0.9362  0.0026  
MAE (testset)     0.7364  0.7396  0.7387  0.7377  0.7410  0.7387  0.0016  
Fit time          0.38    0.37    0.39    0.37    0.37    0.38    0.01    
Test time         0.10    0.05    0.05    0.10    0.05    0.07    0.02    


The SVD filter outperforms all other filters, with an RMSE score of 0.9362.