# <u> Movie Recommender : Explicit Recommender Systems </u>

## Introduction

*The aim of this project is to implement collaborative filtering and matrix factorization from scratch and to compare their results with those of a naive popularity-based recommender system.*

### *Explicit Ratings*
Explicit ratings are scores or grades that individuals give directly to express their opinion of a particular item or experience. Because the users have 'scored' the items themselves, explicit ratings often help us build a better recommender system.

### *Dataset*
The MovieLens dataset is a popular benchmark dataset in the field of recommender systems. It contains a large number of movie ratings provided by users, along with movie metadata such as title, genre, and release year. The dataset mainly consists of three files: </br>
* Movie data - which contains details of the movie like Genre, title, release year etc.
* Rating data - which contains the ratings provided by users to each movie
* User data - which contains the demographics of users
</br> </br>
For the purpose of this project, we will focus on the rating file and aim to predict the ratings assigned by each user to individual movies.

### *Algorithms*

Here we will try out the following three algorithms and compare their performance using the selected evaluation metrics: </br>
* Global average - The most naive way of predicting ratings, which does not take the individual user into consideration. Every user gets the same predicted rating for a movie, namely the average rating of that movie across all users. </br></br>
* User-user collaborative filtering - Here we find users who have similar tastes and preferences to the user for whom recommendations are being made. Once similar users are identified, a weighted average of their movie ratings is used to make recommendations for the target user.</br></br>
* Matrix Factorization - This algorithm breaks down the user-item rating matrix into two lower-dimensional matrices: one representing the users and the other representing the items. These matrices are designed to capture the underlying characteristics of users and items that contribute to their preferences and ratings. The rating of a user for an item they have not rated yet is then predicted as the dot product of the user vector and the item vector.</br>
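
To make the dot-product prediction concrete, here is a minimal sketch. The two latent vectors are hypothetical values chosen for illustration; they are not learned from the MovieLens data.

```python
import numpy as np

# Hypothetical 2-dimensional latent vectors (illustrative values only)
user_vector = np.array([1.0, 2.0])
movie_vector = np.array([1.5, 1.0])

# Predicted rating = dot product of the user and movie vectors
predicted_rating = float(np.dot(user_vector, movie_vector))
print(predicted_rating)  # 1.0 * 1.5 + 2.0 * 1.0 = 3.5
```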

### *Evaluation Metrics*

We will consider two metrics for evaluation:
* RMSE (Root Mean Square Error)
* MAPE (Mean Absolute Percentage Error)

## Code

In [188]:
"""
Import the required libraries
"""

import pandas as pd
import numpy as np
from numpy.linalg import norm
from scipy.sparse import coo_matrix

from tqdm import tqdm
from datetime import datetime as dt

# Setting a fixed seed for comparing results
np.random.seed(0)

### *Reading data*

The file 'ratings.dat' (downloadable from https://grouplens.org/datasets/movielens/1m/) contains data of the format </br>
$<UserID> :: <MovieID> :: <Rating> :: <Timestamp>$
</br>
Let's read and store the data into a Pandas DataFrame

In [198]:
"""
Read data
"""

data_df = pd.read_csv('ratings.dat', sep='::',
                      names=["UserID", "MovieID", "Rating", "Timestamp"],
                      engine='python')
print("Shape of Data : {} x {}\n".format(data_df.shape[0], data_df.shape[1]))
print("SAMPLE FROM ACTUAL DATASET ")
print("--------------------------")
data_df.head()

Shape of Data : 1000209 x 4

SAMPLE FROM ACTUAL DATASET 
--------------------------


| | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


Using SciPy's `coo_matrix`, we will arrange the ratings into a user-movie matrix and convert it to a dense NumPy array to facilitate faster computations. The resulting matrix has one row per user and one column per movie, and a value of 0 means that the user has not rated that movie.
</br></br>
We will also divide the data into training and testing sets to evaluate the accuracy of our predictions.
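
As a rough illustration of both steps, the sketch below builds a tiny rating matrix from (user, movie, rating) triples and splits the triples with a random boolean mask. The toy ids and ratings are made up, and the ids are assumed to be already re-indexed from 0.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy (user, movie, rating) triples with ids already re-indexed from 0
users = np.array([0, 0, 1, 2])
movies = np.array([0, 2, 1, 2])
ratings = np.array([5, 3, 4, 1])

# rows -> users, columns -> movies; unrated cells become 0
toy_mat = coo_matrix((ratings, (users, movies)),
                     shape=(3, 3)).astype(float).toarray()
print(toy_mat)

# Random boolean mask: True -> training sample, False -> test sample
rng = np.random.default_rng(0)
train_mask = rng.random(len(ratings)) <= 0.7
train_ratings = ratings[train_mask]
test_ratings = ratings[~train_mask]
```

Because every triple is either selected by the mask or by its negation, the two subsets cannot overlap.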

In [199]:
TRAIN_SIZE = 0.7


# First, generate dictionaries for mapping old id to new id for users
# and movies
unique_MovieID = data_df['MovieID'].unique()
unique_UserID = data_df['UserID'].unique()
user_old2new_id_dict = {u: j for j, u in enumerate(unique_UserID)}
movie_old2new_id_dict = {i: j for j, i in enumerate(unique_MovieID)}

# Then, use the generated dictionaries to reindex UserID and MovieID
# in the data_df
data_df['UserID'] = data_df['UserID'].map(user_old2new_id_dict)
data_df['MovieID'] = data_df['MovieID'].map(movie_old2new_id_dict)
# Duplicate column, kept only because the later cells drop 'movieID'
data_df['movieID'] = data_df['MovieID']

# Randomly assign roughly 70% of the samples to train_df and the rest to
# test_df; the boolean mask guarantees there is no overlap between them.
train_index = np.random.random(len(data_df)) <= TRAIN_SIZE
train_df = data_df[train_index]
test_df = data_df[~train_index]

# Number of Unique Users
num_user = len(data_df['UserID'].unique())   
# Number of Unique Movies
num_movie = len(data_df['MovieID'].unique())

# Generate train_mat and test_mat
train_mat = coo_matrix(
    (train_df['Rating'].values,
     (train_df['UserID'].values, train_df['MovieID'].values)),
    shape=(num_user, num_movie)
).astype(float).toarray()
test_mat = coo_matrix(
    (test_df['Rating'].values,
     (test_df['UserID'].values, test_df['MovieID'].values)),
    shape=(num_user, num_movie)
).astype(float).toarray()

# Visualize the new Matrix : rows -> users, columns -> movies
pd.DataFrame(train_mat, dtype = int)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3696,3697,3698,3699,3700,3701,3702,3703,3704,3705
0,5,3,0,0,5,3,5,0,4,4,...,0,0,0,0,0,0,0,0,0,0
1,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,3,5,0,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6035,5,0,3,4,4,5,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6036,4,0,0,0,0,0,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0
6037,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6038,0,3,4,0,0,0,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0


### *Evaluation metrics*

As mentioned earlier, we will implement the RMSE and MAPE metrics to evaluate our recommender models. Each metric takes an array of predicted ratings and an array of true ratings and calculates the error between them.

</br>
* $RMSE = \sqrt{\frac{\sum_{i} (P_i - R_i)^2}{N}}$
<br> where $P_i$ is the predicted rating, $R_i$ is the true rating and $N$ is the number of predictions</br></br>
* $MAPE = \frac{1}{N}\sum_{i} \frac{|P_i - R_i|}{R_i}$
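
As a quick sanity check of the two metrics, the sketch below evaluates them by hand on a made-up pair of arrays, taking the absolute error for MAPE. An entry of 0 in the true ratings is assumed to mean "not rated" and is ignored.

```python
import numpy as np

# Made-up predictions and true ratings; 0 in `true` means "not rated"
predicted = np.array([3.0, 4.0, 2.0, 5.0])
true = np.array([4.0, 4.0, 0.0, 4.0])

rated = true != 0
errors = predicted[rated] - true[rated]          # [-1, 0, 1]

rmse_value = np.sqrt(np.mean(errors ** 2))       # sqrt(2 / 3)
mape_value = np.mean(np.abs(errors) / true[rated])  # (0.25 + 0 + 0.25) / 3

print(round(rmse_value, 3), round(mape_value, 3))
```

Note that MAPE computed this way is a fraction rather than a percentage, so a value of 0.30 corresponds to an average error of 30% of the true rating.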

In [238]:
"""
Evaluation Metrics
"""


# Function to find the Root Mean Square Error
def rmse(rating_predictions, true_ratings):

    # Consider only those movies which the user has actually rated
    rated_movies = np.nonzero(true_ratings)[0]
    errors = rating_predictions[rated_movies] - true_ratings[rated_movies]
    return np.sqrt(np.mean(np.power(errors, 2)))


# Function to find the Mean Absolute Percentage Error
def mape(rating_predictions, true_ratings):

    # Consider only those movies which the user has actually rated
    rated_movies = np.nonzero(true_ratings)[0]
    errors = rating_predictions[rated_movies] - true_ratings[rated_movies]
    # The result is a fraction (0.30 means 30%), not a percentage
    return np.mean(np.abs(errors) / true_ratings[rated_movies])

For ease of evaluation, we will create a function that takes a recommender and prints its performance on the test set under both metrics, averaged over all users.

In [200]:
# Function to evaluate any recommender based on the RMSE and MAPE
def evaluate_recommender(model):

    rmses, mapes = [], []
    for user_ind in range(test_mat.shape[0]):
        predictions = model.predict_ratings_for_user(user_ind)
        true_ratings = test_mat[user_ind]
        rmses.append(rmse(predictions, true_ratings))
        mapes.append(mape(predictions, true_ratings))

    print("RMSE : {:.3f}\t MAPE : {:.3f}".
          format(np.mean(rmses), np.mean(mapes)))

### 1. Global Average Rating a.k.a. Popular Movie recommender ('Non-personalized')

Here we suggest movies based on their overall popularity: the movies with the highest average rating across users are recommended to everyone. It is a very naive approach, since every user receives the same predictions and no personalization is involved.
</br></br>
To determine a movie's popularity, we will calculate its mean rating.
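
The per-movie mean has to ignore the zeros that stand for "not rated". A minimal sketch on a made-up 3 x 3 matrix is shown below; using `np.divide` with `where=` is just one way to avoid dividing by zero for movies that nobody has rated.

```python
import numpy as np

# Made-up training matrix: rows -> users, columns -> movies, 0 -> not rated
toy_train = np.array([[5., 0., 3.],
                      [4., 0., 0.],
                      [0., 0., 1.]])

n_ratings = (toy_train != 0).sum(axis=0)   # ratings per movie: [2, 0, 2]
sum_ratings = toy_train.sum(axis=0)        # rating sums:       [9, 0, 4]

# Mean over rated entries only; movies without ratings keep a mean of 0
mean_ratings = np.divide(sum_ratings, n_ratings,
                         out=np.zeros_like(sum_ratings),
                         where=n_ratings != 0)
print(mean_ratings)  # [4.5 0.  2. ]
```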

In [209]:
n_users = train_mat.shape[0]
n_movies = train_mat.shape[1]


class Global_Average_Recommender():

    def __init__(self, train_mat):

        self.train_mat = train_mat
        self.n_users = train_mat.shape[0]
        self.n_movies = train_mat.shape[1]
        self.calculate_predictions()

    # Function to calculate predictions for all users and movies

    def calculate_predictions(self):

        # Score each movie by its mean training rating.
        # No of ratings received for each movie, shape (n_movies,)
        self.nonzero_ratings = (self.train_mat != 0).sum(axis=0)
        # Sum of the ratings received for each movie
        self.sum_ratings = self.train_mat.sum(axis=0)
        # Movies without any training rating give 0/0 = NaN (NumPy emits
        # a RuntimeWarning); their mean is set to 0
        self.mean_ratings = self.sum_ratings / self.nonzero_ratings
        self.mean_ratings[np.isnan(self.mean_ratings)] = 0

    # Function to retrieve prediction of all movies for a user
    def predict_ratings_for_user(self, user_ind):

        # Zero out the movies the user has already rated in the training set
        watched_movies = np.nonzero(self.train_mat[user_ind])[0]
        predicted_ratings = self.mean_ratings.copy()
        predicted_ratings[watched_movies] = 0
        return predicted_ratings

    # Function to retrieve prediction of any user-movie pair
    def predict_rating(self, user_indices, movie_indices):

        return self.mean_ratings[movie_indices]


In [212]:
global_average_recommender = Global_Average_Recommender(train_mat)

print("Evaluating Global average Recommender")
evaluate_recommender(global_average_recommender)

"""
Get an idea of the prediction
"""

sample_data = test_df.sample(10)
sample_data.drop(['Timestamp','movieID'], axis = 1, inplace=True)
user_indices = sample_data['UserID']
movie_indices = sample_data['MovieID']

sample_data['Global_Average_Recommender'] = [np.round(i,2) for i in global_average_recommender.predict_rating(user_indices, movie_indices)]
sample_data

  self.mean_ratings = self.sum_ratings / self.nonzero_ratings


Evaluating Global average Recommender
RMSE : 0.960	 MAPE : 0.303


| | UserID | MovieID | Rating | Global_Average_Recommender |
|---|---|---|---|---|
| 27417 | 191 | 1996 | 2 | 3.01 |
| 128483 | 831 | 633 | 3 | 4.15 |
| 459617 | 2831 | 379 | 5 | 3.75 |
| 167903 | 1067 | 1363 | 5 | 1.94 |
| 244769 | 1472 | 68 | 4 | 4.12 |
| 86030 | 562 | 1004 | 3 | 3.36 |
| 429277 | 2610 | 1883 | 2 | 2.95 |
| 336828 | 1982 | 2332 | 3 | 3.29 |
| 756303 | 4505 | 1070 | 4 | 3.96 |
| 583324 | 3561 | 2884 | 2 | 2.26 |


-------

### 2. User-User Collaborative Filtering

User-User Collaborative Filtering is a type of recommendation system based on the idea that people who had similar preferences in the past are likely to have similar preferences in the future. This technique uses the behavior and choices of other users to make recommendations to a target user.
</br></br>
To apply user-user collaborative filtering, the system identifies a target user and then looks for other users with similar tastes and preferences. It then analyzes the ratings and preferences of those similar users to recommend items that the target user has not yet seen or rated. The system can make use of various similarity metrics, such as Pearson correlation or cosine similarity, to identify the most similar users.
</br></br>
This technique has several advantages over other recommendation approaches, such as content-based filtering. User-user collaborative filtering can identify and recommend items that may be outside a user's typical preferences, but are enjoyed by similar users. It also doesn't rely on metadata or content information about items, which may be incomplete or inaccurate. However, this approach may face scalability issues with larger user bases and may require frequent updates to keep up with changing user preferences.
</br></br>
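
The idea can be sketched on a made-up 3-user matrix: compute the cosine similarity between users, then predict a rating as the similarity-weighted average of the ratings given by the other users. This is an illustrative toy that uses all other users as neighbours, not a top-N selection.

```python
import numpy as np
from numpy.linalg import norm

# Made-up ratings: rows -> users, columns -> movies, 0 -> not rated
toy_train = np.array([[5., 4., 0.],
                      [5., 4., 2.],
                      [1., 0., 5.]])

# Cosine similarity between every pair of users
norms = norm(toy_train, axis=1)
similarity = (toy_train @ toy_train.T) / np.outer(norms, norms)
np.fill_diagonal(similarity, 0.0)   # a user is not their own neighbour

target = 0
weights = similarity[target]                  # similarity to each user
rated = (toy_train > 0).astype(float)         # 1 where a rating exists

# Weighted average over the users who actually rated each movie
numerator = weights @ toy_train
denominator = weights @ rated
prediction = np.divide(numerator, denominator,
                       out=np.zeros_like(numerator),
                       where=denominator != 0)
print(np.round(prediction, 2))
```

User 0 is far more similar to user 1 than to user 2, so the prediction for the unrated third movie lands close to user 1's rating of 2 rather than user 2's rating of 5.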

In [159]:
class User_User_CF():

    def __init__(self, train_mat, n_similar):

        self.train_mat = train_mat
        self.n_users = train_mat.shape[0]
        self.n_movies = train_mat.shape[1]
        self.n_similar = n_similar    # no: of similar users to be considered
        self.calculate_predictions()

    def calculate_predictions(self):

        # Cosine similarity between every pair of users
        start_time = dt.now()
        self.user_norm = norm(self.train_mat, axis=1)
        self.user_user_similarity = np.matmul(
            self.train_mat, self.train_mat.T)
        # Divide entry (a, b) by ||a|| * ||b||
        self.user_user_similarity /= np.outer(self.user_norm, self.user_norm)
        # Zero the diagonal so that a user is never their own neighbour
        self.user_user_similarity -= np.eye(self.n_users)
        end_time = dt.now()
        print("Time taken : {}".format(end_time - start_time))

        # Identify top N similar users for each user
        self.top_N_similar_users = np.zeros(
            (self.n_users, self.n_similar), dtype=int)
        for user_ind, user in enumerate(self.user_user_similarity):
            self.top_N_similar_users[user_ind] = (
                -user).argsort()[:self.n_similar]

        # Predict each rating as the similarity-weighted average of the
        # ratings given by the top N similar users who rated that movie
        self.cf_recommendations = np.zeros((self.n_users, self.n_movies))

        for user_ind in range(self.n_users):
            movie_ratings = np.zeros(self.n_movies)
            norm_ratings = np.zeros(self.n_movies)
            for similar_user in self.top_N_similar_users[user_ind]:
                weight = self.user_user_similarity[user_ind, similar_user]
                movie_ratings += weight * self.train_mat[similar_user, :]
                # 1 where the similar user has rated the movie, else 0
                rated_vector = (self.train_mat[similar_user] > 0).astype(float)
                norm_ratings += weight * rated_vector
            # Movies rated by none of the neighbours give 0/0 = NaN
            movie_ratings = movie_ratings / norm_ratings
            movie_ratings[np.isnan(movie_ratings)] = 0
            self.cf_recommendations[user_ind] = movie_ratings

    def predict_ratings_for_user(self, user_ind):

        return self.cf_recommendations[user_ind]

    def predict_rating(self, user_indices, movie_indices):

        predictions = []
        for user_ind, movie_ind in zip(user_indices, movie_indices):
            predictions.append(self.cf_recommendations[user_ind][movie_ind])
        return predictions


In [215]:
collborative_recommender = User_User_CF(train_mat, n_similar = 1000)
evaluate_recommender(collborative_recommender)

"""
Get an idea of the prediction
"""

sample_data = test_df.sample(10)
sample_data.drop(['Timestamp','movieID'], axis = 1, inplace=True)
user_indices = sample_data['UserID']
movie_indices = sample_data['MovieID']

sample_data['CF_rating'] = [np.round(i,2) for i in 
                            collborative_recommender.predict_rating(
                                user_indices, movie_indices)]
sample_data

Time taken : 0:00:02.256699


  movie_ratings = movie_ratings / norm_ratings


RMSE : 0.950	 MAPE : 0.299


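| | UserID | MovieID | Rating | CF_rating |
|---|---|---|---|---|
| 683015 | 4085 | 556 | 5 | 3.26 |
| 655462 | 3946 | 664 | 4 | 3.83 |
| 310424 | 1849 | 4 | 5 | 4.04 |
| 142829 | 921 | 184 | 5 | 4.14 |
| 579482 | 3538 | 400 | 3 | 3.36 |
| 389745 | 2286 | 171 | 5 | 4.01 |
| 737564 | 4407 | 372 | 4 | 4.09 |
| 488198 | 3001 | 1641 | 4 | 3.87 |
| 772406 | 4604 | 169 | 3 | 3.69 |
| 155159 | 999 | 148 | 3 | 3.64 |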


------

In [177]:
"""
Get an idea of the prediction
"""

sample_data = test_df.sample(10)
sample_data.drop(['Timestamp','movieID'], axis = 1, inplace=True)
user_indices = sample_data['UserID']
movie_indices = sample_data['MovieID']

sample_data['CF_rating'] = [np.round(i,2) for i in collborative_recommender.predict_rating(user_indices, movie_indices)]
sample_data

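| | UserID | MovieID | Rating | CF_rating |
|---|---|---|---|---|
| 551940 | 3399 | 2011 | 4 | 4.18 |
| 712831 | 4276 | 2629 | 4 | 4.41 |
| 222218 | 1342 | 766 | 5 | 4.27 |
| 122088 | 786 | 68 | 5 | 4.11 |
| 915513 | 5534 | 51 | 5 | 4.41 |
| 53646 | 351 | 118 | 4 | 2.85 |
| 523201 | 3225 | 151 | 5 | 3.89 |
| 820317 | 4931 | 42 | 5 | 3.91 |
| 592851 | 3613 | 424 | 3 | 3.25 |
| 691535 | 4139 | 794 | 3 | 3.65 |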


### 3. Matrix Factorization

Matrix Factorization is a technique used in recommendation systems to model the preferences of users and items in a more efficient way. The idea behind this technique is to factorize a large user-item interaction matrix into two smaller matrices: a user matrix and an item matrix. The user matrix represents the preferences of each user, while the item matrix represents the attributes of each item. </br></br>

The factorization process can be achieved using various algorithms, such as Singular Value Decomposition (SVD), Non-Negative Matrix Factorization (NMF), and Alternating Least Squares (ALS). The effectiveness of matrix factorization depends on the quality of the data, the selection of an appropriate algorithm and hyperparameters, and the evaluation metrics used to measure the performance of the system. In this project, we minimize a regularized squared-error loss with stochastic gradient descent.
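
For intuition about what "factorizing into two smaller matrices" means, here is a small sketch using NumPy's SVD on a made-up, fully observed matrix of rank 2. It is only an illustration: plain SVD treats every entry, including zeros, as an observed rating, so it is not a drop-in replacement for a model trained on the rated entries only.

```python
import numpy as np

# Made-up factors: 3 users and 3 movies, each described by 2 latent values
user_factors = np.array([[1., 0.],
                         [0., 1.],
                         [1., 1.]])
movie_factors = np.array([[5., 1.],
                          [4., 2.],
                          [1., 5.]])
toy_ratings = user_factors @ movie_factors.T   # fully observed, rank 2

# Thin SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(toy_ratings, full_matrices=False)
k = 2
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

print(np.round(approx, 2))
```

Because the toy matrix has rank 2, two latent dimensions are enough to reproduce it exactly (up to floating-point error); real rating matrices are only approximated.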

The MF model can be mathematically represented as: 

<center>$\underset{\mathbf{P},\mathbf{Q}}{\text{min}}\,\,L=\sum_{(u,i)\in\mathcal{O}}(\mathbf{P}_u\cdot\mathbf{Q}^\top_i-r_{u,i})^2+\lambda(\lVert\mathbf{P}\rVert^2_{\text{F}}+\lVert\mathbf{Q}\rVert^2_{\text{F}})$,</center>
    
where $\mathbf{P}$ is the user latent factor matrix of size (#user, #latent); $\mathbf{Q}$ is the movie latent factor matrix of size (#movie, #latent); $\mathcal{O}$ is the set of all user-movie pairs that have ratings in train_mat; $r_{u,i}$ is the rating of user u for movie i; $\lambda(\lVert\mathbf{P}\rVert^2_{\text{F}}+\lVert\mathbf{Q}\rVert^2_{\text{F}})$ is the regularization term that counters overfitting; $\lambda$ is the regularization weight (a hyper-parameter set manually by the developer); and $\lVert\mathbf{P}\rVert^2_{\text{F}}=\sum_{x}\sum_{y}(\mathbf{P}_{x,y})^2$, $\lVert\mathbf{Q}\rVert^2_{\text{F}}=\sum_{x}\sum_{y}(\mathbf{Q}_{x,y})^2$. The function $L$ is called the **loss function** of the matrix factorization model. The goal of training an MF model is to find the $\mathbf{P}$ and $\mathbf{Q}$ that minimize the loss $L$.
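
One common way to minimize this loss is stochastic gradient descent on one rated pair at a time. As a sketch, assume the per-sample objective $L_{u,i}=e_{u,i}^2+\lambda(\lVert\mathbf{P}_u\rVert^2+\lVert\mathbf{Q}_i\rVert^2)$ with prediction error $e_{u,i}=\mathbf{P}_u\cdot\mathbf{Q}^\top_i-r_{u,i}$, and a learning rate $\eta$ (a symbol introduced here for illustration). The gradients and update rules are then:

<center>$\frac{\partial L_{u,i}}{\partial \mathbf{P}_u}=2e_{u,i}\mathbf{Q}_i+2\lambda\mathbf{P}_u,\qquad \frac{\partial L_{u,i}}{\partial \mathbf{Q}_i}=2e_{u,i}\mathbf{P}_u+2\lambda\mathbf{Q}_i$</center>

<center>$\mathbf{P}_u\leftarrow\mathbf{P}_u-\eta\,(2e_{u,i}\mathbf{Q}_i+2\lambda\mathbf{P}_u),\qquad \mathbf{Q}_i\leftarrow\mathbf{Q}_i-\eta\,(2e_{u,i}\mathbf{P}_u+2\lambda\mathbf{Q}_i)$</center>

The snippet below applies a single such step to hypothetical vectors. For simplicity it updates both vectors from the same error value, which is an assumption of this sketch.

```python
import numpy as np

lr, reg = 0.01, 0.01           # illustrative learning rate and lambda
p_u = np.array([0.5, 0.5])     # hypothetical user factors
q_i = np.array([0.5, 0.5])     # hypothetical movie factors
r_ui = 4.0                     # made-up true rating

# Prediction error before the update: 0.5 - 4.0 = -3.5
error_before = np.dot(p_u, q_i) - r_ui

# One gradient step on each vector, both using the same error
p_new = p_u - lr * (2 * error_before * q_i + 2 * reg * p_u)
q_new = q_i - lr * (2 * error_before * p_u + 2 * reg * q_i)

error_after = np.dot(p_new, q_new) - r_ui
print(abs(error_before), abs(error_after))
```

After one step the absolute error shrinks slightly; repeating such steps over all rated pairs for several epochs is what drives the loss down.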

In [234]:
class MF_explicit:
    def __init__(self, train_mat, test_mat, latent=5, lr=0.01, reg=0.01):
        # the training rating matrix of size (#user, #movie)
        self.train_mat = train_mat
        # the test rating matrix of size (#user, #movie)
        self.test_mat = test_mat

        self.latent = latent  # the latent dimension
        self.lr = lr  # learning rate
        self.reg = reg  # regularization weight, i.e., the lambda in the objective function

        self.num_user, self.num_movie = train_mat.shape

        # get the user-movie pairs having ratings in train_mat
        self.sample_user, self.sample_movie = self.train_mat.nonzero()
        # the number of user-movie pairs having ratings in train_mat
        self.num_sample = len(self.sample_user)

        # for each user, the indices of the movies rated in test_mat
        self.user_test_like = []
        for u in range(self.num_user):
            self.user_test_like.append(np.where(self.test_mat[u, :] > 0)[0])

        # latent factors for users, size (#user, self.latent), randomly
        # initialized
        self.P = np.random.random((self.num_user, self.latent))
        # latent factors for movies, size (#movie, self.latent), randomly
        # initialized
        self.Q = np.random.random((self.num_movie, self.latent))

    def train(self, epoch=20):

        for ep in tqdm(range(epoch)):

            users, movies = self.sample_user.copy(), self.sample_movie.copy()

            # Shuffle the data
            p = np.random.permutation(len(users))
            users, movies = users[p], movies[p]

            # Iterate over each user-movie pair and train
            for user, movie in zip(users, movies):
                r_ui = self.train_mat[user, movie]
                self.P[user,:] -= self.lr * \
                    (2 * (np.dot(self.P[user,:], self.Q[movie,:]) - r_ui) * \
                     self.Q[movie,:] + 2 * self.reg * self.P[user,:])
                self.Q[movie,:] -= self.lr *  \
                    (2 * (np.dot(self.P[user,:],self.Q[movie,:]) - r_ui) * \
                     self.P[user,:] + 2 * self.reg * self.Q[movie,:])

        self.predictions = np.matmul(self.P, self.Q.T)
        
        # Test the model
        self.test()

    def predict_ratings_for_user(self, user_ind):
        
        return self.predictions[user_ind]

    def test(self):

        predictions = np.matmul(self.P, self.Q.T)

        rmses = []
        for u in range(self.num_user):
            test_like = self.user_test_like[u]
            # Skip users without any rating in the test set
            if len(test_like) == 0:
                continue
            errors = self.test_mat[u, test_like] - predictions[u, test_like]
            rmses.append(np.sqrt(np.mean(np.power(errors, 2))))

        print("Mean RMSE : ", np.array(rmses).mean())

    def predict_rating(self, user_indices, movie_indices):

        complete_prediction = np.matmul(self.P, self.Q.T)
        predictions = []
        for user_ind, movie_ind in zip(user_indices, movie_indices):
            predictions.append(complete_prediction[user_ind][movie_ind])
        return predictions

In [239]:
mf_explicit= MF_explicit(train_mat, test_mat, latent=5, lr=0.001, reg=0.000001)
mf_explicit.train(epoch=30)
evaluate_recommender(mf_explicit)

"""
Get an idea of the prediction
"""

sample_data = test_df.sample(20)
sample_data.drop(['Timestamp','movieID'], axis = 1, inplace=True)
user_indices = sample_data['UserID']
movie_indices = sample_data['MovieID']

sample_data['Popular_rating'] = [np.round(i,2) for i in global_average_recommender.predict_rating(user_indices, movie_indices)]
sample_data['CF_rating'] = [np.round(i,2) for i in collborative_recommender.predict_rating(user_indices, movie_indices)]
sample_data['MF'] = [np.round(i,2) for i in mf_explicit.predict_rating(user_indices, movie_indices)]
sample_data

100%|█████████████████████████████████████████████| 30/30 [04:51<00:00,  9.70s/it]


Mean RMSE :  0.9092376771298393
RMSE : 0.909	 MAPE : 0.282


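| | UserID | MovieID | Rating | Popular_rating | CF_rating | MF |
|---|---|---|---|---|---|---|
| 777044 | 4641 | 283 | 1 | 3.59 | 3.54 | 3.41 |
| 541827 | 3333 | 243 | 4 | 3.79 | 3.72 | 3.69 |
| 187626 | 1163 | 369 | 4 | 4.0 | 3.96 | 3.69 |
| 784123 | 4681 | 2560 | 4 | 3.58 | 3.38 | 3.34 |
| 316662 | 1883 | 337 | 4 | 3.71 | 3.66 | 3.75 |
| 468420 | 2886 | 2132 | 3 | 2.65 | 2.66 | 2.6 |
| 622128 | 3767 | 1 | 3 | 3.48 | 3.66 | 3.51 |
| 457485 | 2817 | 855 | 5 | 3.54 | 3.79 | 3.43 |
| 909032 | 5500 | 2980 | 2 | 2.99 | 2.82 | 2.96 |
| 176901 | 1116 | 1000 | 4 | 3.59 | 3.44 | 3.84 |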


<!-- latent_factor - 6 -> 0.9082
latent_factor - 8 -> 0.9144
latent_factor - 10 -> 0.9153
latent_factor - 12 -> 0.9170
latent_factor - 14 -> 0.9220
latent_factor - 16 -> 0.9238
latent_factor - 18 -> 0.9245
latent_factor - 20 -> 0.93170
latent_factor - 22 -> 0.9350 -->


From the above results we can see that the naive recommender already provides a decent rating prediction (RMSE 0.960, MAPE 0.303) and that collaborative filtering improves on it only slightly (RMSE 0.950, MAPE 0.299). Matrix Factorization gives the best performance of the three (RMSE 0.909, MAPE 0.282), an improvement that we can also observe in the sampled predictions.

-----
-----

## The End