# Recommendation system with matrix factorization

This notebook presents a movie recommendation system based on matrix factorization. The method starts from a matrix M of size U*I, where U is the number of users and I is the number of items (in our case, movies). The element M[i,j] is the rating given to the j-th movie by the i-th user.

The matrix factorization method creates two matrices: one called the User Feature Vector (M1), of size U*K, and one called the Item Feature Vector (M2), of size I*K. They are chosen so that the values of Mnew = M1*M2^T (M2 transposed, so that the shapes match) differ as little as possible from the known values of the original M.

![](https://miro.medium.com/max/1400/1*Zhm1NMlmVywn0G18w3exog.png)

A recommendation for a user is made by returning the movies with the highest values in Mnew among those whose entries were empty in the original matrix.
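
As a minimal illustration of this idea, the toy example below builds a tiny rating matrix, multiplies two factor matrices, and recommends the unrated items with the highest reconstructed values. This is only a sketch: the ratings and the factor values in `M1` and `M2` are hand-picked assumptions, whereas in practice the factors are learned from the data.

```python
import numpy as np

# Toy rating matrix: 3 users x 4 items, np.nan marks a missing rating.
M = np.array([
    [5.0, np.nan, 1.0, np.nan],
    [4.0, 5.0, np.nan, 1.0],
    [np.nan, 1.0, 5.0, 4.0],
])

# Hypothetical factor matrices with K = 2 latent features (hand-picked, not learned).
M1 = np.array([[2.0, 0.5], [2.0, 0.4], [0.4, 2.0]])              # users x K
M2 = np.array([[2.2, 0.3], [2.4, 0.2], [0.2, 2.3], [0.3, 2.0]])  # items x K

# Reconstruct the full matrix; M2 is transposed so the shapes are (U x K) @ (K x I).
M_new = M1 @ M2.T

def recommend(user, n=1):
    """Return indices of the n unrated items with the highest reconstructed value."""
    unrated = np.isnan(M[user])
    # Already rated items get -inf so they can never be recommended.
    scores = np.where(unrated, M_new[user], -np.inf)
    n = min(n, int(unrated.sum()))
    return np.argsort(scores)[::-1][:n].tolist()

print(recommend(0, n=2))
```

For user 0 only items 1 and 3 are unrated, so only those two can be returned.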

The matrix factorization will be presented with 2 different methods:
* Singular value decomposition
* Kernelized matrix factorization

The main difference between the two methods is that the kernelized matrix factorization supports online learning: when new ratings or new users arrive, the whole matrix factorization does not have to be recalculated.
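
To make the idea of online learning more concrete, here is a minimal sketch of how a new user could be added while the item factors stay fixed. It is not the implementation used by the `matrix_factorization` package: the function name `fit_new_user`, the plain dot-product prediction and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def fit_new_user(item_factors, item_ids, ratings, global_mean,
                 n_epochs=200, lr=0.01, reg=0.01, seed=0):
    """Learn a bias and a factor vector for one new user with SGD.

    The item factors are not modified, so only K + 1 numbers are updated.
    """
    rng = np.random.default_rng(seed)
    n_factors = item_factors.shape[1]
    user_vec = rng.normal(0.0, 0.1, n_factors)
    user_bias = 0.0
    for _ in range(n_epochs):
        for item, rating in zip(item_ids, ratings):
            q = item_factors[item]
            pred = global_mean + user_bias + user_vec @ q
            err = rating - pred
            # Gradient step on the squared error with L2 regularization.
            user_bias += lr * (err - reg * user_bias)
            user_vec += lr * (err * q - reg * user_vec)
    return user_bias, user_vec

# Hypothetical item factors from an earlier training run (4 items, K = 2).
item_factors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
global_mean = 3.0

# The new user liked item 0 and disliked item 2.
bias, vec = fit_new_user(item_factors, item_ids=[0, 2], ratings=[5.0, 1.0],
                         global_mean=global_mean)
preds = global_mean + bias + item_factors @ vec
print(np.round(preds, 2))
```

Because item 1 resembles item 0 and item 3 resembles item 2 in this toy factor space, the new user's predicted rating for item 1 ends up higher than for item 3, without touching the item factors.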

In both cases the validation metric is the root mean squared error (RMSE), which is calculated as follows: take the squared error of each prediction, sum these, divide the sum by the number of predictions, and finally take the square root of the result. Formula:

![](https://miro.medium.com/max/966/1*lqDsPkfXPGen32Uem1PTNg.png)
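
The same calculation can be written in a few lines of NumPy. The helper name `rmse` and the sample values are assumptions made for this sketch.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 0.5, 0.0 and -1.0, so the mean squared error is (0.25 + 0 + 1) / 3.
print(rmse([4.0, 3.0, 5.0], [3.5, 3.0, 6.0]))
```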



## Preparations

We are using the movielens-25m-dataset, which contains data about more than 62k movies and 25M ratings. The file 'movies.csv' contains the movie titles and their genres, and 'ratings.csv' contains the ratings given by the users.

### Imports
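*  Pandas - for reading in the CSV files and manipulating their data
*  Numpy - linear algebra calculations
*  svds - computes the truncated singular value decomposition used for the matrix factorization
*  mean_squared_error - for measuring the quality of the result; the RMSE metric is derived from it
*  matrix_factorization - provides KernelMF and train_update_test_split for the kernelized matrix factorization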

In [None]:
!pip install matrix_factorization

In [None]:
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.metrics import mean_squared_error
from matrix_factorization import BaselineModel, KernelMF, train_update_test_split

In [None]:
base_path = "/kaggle/input/movielens-25m-dataset/ml-25m/"

### Reading in the necessary csv-s

The movies_df contains the movies and ratings_df the ratings.

In [None]:
movies_df = pd.read_csv(base_path + "movies.csv")
movies_df.head()

In [None]:
ratings_df = pd.read_csv(base_path + "ratings.csv")
ratings_df.drop(columns = ["timestamp"], inplace=True)
ratings_df

In [None]:
users = ratings_df["userId"].unique()
users_df = pd.DataFrame(users, columns=["userId"])
users_df

As there are more than 162k users and more than 62k movies, we keep only the users who have given more than 200 ratings and the movies which have received more than 200 ratings. With few ratings given by a user, or few ratings received by a movie, the algorithm cannot produce good results, and the calculation on the full dataset would take too much time.

In [None]:
# Count the ratings per user and per movie; the counts are used below to keep only the most active users and the most popular movies

ratings_df["user_freq"] = ratings_df.groupby("userId")["userId"].transform('count')
ratings_df["movie_freq"] = ratings_df.groupby("movieId")["movieId"].transform('count')

In [None]:
USER_FREQ_LIMIT = 200
MOVIE_FREQ_LIMIT = 200

ratings_df = ratings_df.loc[(ratings_df["user_freq"] > USER_FREQ_LIMIT) & (ratings_df["movie_freq"] > MOVIE_FREQ_LIMIT)]
ratings_df

In [None]:
print(ratings_df["movieId"].nunique())
print(ratings_df["userId"].nunique())

In [None]:
movies_for_mf_df = movies_df.loc[movies_df["movieId"].isin(ratings_df["movieId"])].reset_index()
movies_for_mf_df

In [None]:
users_df = users_df.loc[users_df["userId"].isin(ratings_df["userId"])].reset_index()
users_df

The dataframe below (ratings_df) contains only the ratings given by the most active users to the most rated films. It is the input for both of our matrix factorization methods.
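
Both methods need users and movies to be addressed by contiguous indices starting at 0, because the raw `userId` and `movieId` values have gaps after filtering. As an alternative sketch of that mapping (not the approach taken in this notebook, which derives the indices from the filtered dataframes), `pd.factorize` can assign such indices directly; the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [10, 10, 42, 97],
    "movieId": [500, 7, 500, 31],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# pd.factorize assigns 0, 1, 2, ... in order of first appearance
# and also returns the original labels for mapping back later.
toy["user_id"], user_labels = pd.factorize(toy["userId"])
toy["item_id"], movie_labels = pd.factorize(toy["movieId"])
print(toy)
```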

In [None]:
movies_for_mf_df["movieIndex"] = movies_for_mf_df.index
ratings_df = ratings_df.merge(movies_for_mf_df[["movieId", "movieIndex"]], on="movieId")

users_df["userIndex"] = users_df.index
ratings_df = ratings_df.merge(users_df[["userId", "userIndex"]], on = "userId")

ratings_df

In [None]:
ratings_df.rename(columns={"userIndex": "user_id", "movieIndex": "item_id"}, inplace=True)

In [None]:
X = ratings_df[["user_id", "item_id"]]
y = ratings_df["rating"]

## Kernelized matrix factorization

### Training

The data is split into training, update, and testing datasets.
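
The exact behaviour of `train_update_test_split` is defined by the `matrix_factorization` package. As a rough sketch of the idea only, the hypothetical helper below (`split_new_users`, not part of the package) holds out a fraction of the users as "new" users and divides their ratings between an update set and a test set. The half-and-half division is an assumption.

```python
import pandas as pd

def split_new_users(ratings, frac_new_users=0.2, seed=0):
    """Split ratings into initial-train, update-train and update-test frames."""
    users = pd.Series(ratings["user_id"].unique())
    new_users = set(users.sample(frac=frac_new_users, random_state=seed))
    is_new = ratings["user_id"].isin(new_users)

    # Ratings of all users that are not held out form the initial training set.
    train_initial = ratings[~is_new]
    update = ratings[is_new].sample(frac=1.0, random_state=seed)  # shuffle
    # Alternate the shuffled ratings of each new user between the two sets.
    position = update.groupby("user_id").cumcount()
    train_update = update[position % 2 == 0]
    test_update = update[position % 2 == 1]
    return train_initial, train_update, test_update

# Toy data: 5 users with 4 ratings each.
toy = pd.DataFrame({
    "user_id": [u for u in range(5) for _ in range(4)],
    "item_id": [i for _ in range(5) for i in range(4)],
    "rating": [3.0] * 20,
})
train_initial, train_update, test_update = split_new_users(toy, frac_new_users=0.2)
print(len(train_initial), len(train_update), len(test_update))
```

With 5 users and `frac_new_users=0.2`, one user is held out, so 16 ratings go to the initial training set and the held-out user's 4 ratings are divided evenly between the update and test sets.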

In [None]:

# Prepare data for online learning
(
    X_train_initial,
    y_train_initial,
    X_train_update,
    y_train_update,
    X_test_update,
    y_test_update,
) = train_update_test_split(ratings_df, frac_new_users=0.2)

We start the training with the initial training dataset, which contains the users that were not held out as new users (frac_new_users=0.2 holds out 20% of the users).
The RMSE for training is: 0.5935

In [None]:

matrix_fact = KernelMF(n_epochs=20, n_factors=100, verbose=1, lr=0.005, reg=0.005)
matrix_fact.fit(X_train_initial, y_train_initial)


UPDATE RMSE : 0.5883  
TEST RMSE : 0.7544

In [None]:

# Update model with new users
matrix_fact.update_users(
    X_train_update, y_train_update, lr=0.005, n_epochs=20, verbose=1
)
pred = matrix_fact.predict(X_test_update)
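# RMSE is the square root of the MSE; this avoids the `squared` argument, which newer scikit-learn versions no longer accept
rmse = np.sqrt(mean_squared_error(y_test_update, pred))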
print(f"\nTest RMSE: {rmse:.4f}")


### Predicting

The dataframe below shows the movies that user 480 rated the highest.

In [None]:

# Row index (user_id) of the user to inspect
user_row_number = 480

# Get the user's data and merge in the movie information.
user_data = ratings_df[ratings_df.user_id == (user_row_number)]
user_full = (user_data.merge(movies_for_mf_df, how = 'left', left_on = 'item_id', right_on = 'movieIndex').
                 sort_values(['rating'], ascending=False)
             )

print("User %i has already rated %i movies" % (user_row_number, user_full.shape[0]))
user_full.head(30)


In [None]:

items_known = X_train_initial.loc[X_train_initial["user_id"] == user_row_number]["item_id"]
recom = matrix_fact.recommend(user=user_row_number, items_known=items_known, amount=30)
items_known


The dataframe below shows the recommendations for user 480. From the dataframe above it is clear that the favourite genres of user 480 are drama, action and thriller, and most of the recommended movies fall into these categories.

In [None]:

recom = recom.merge(movies_for_mf_df, left_on="item_id", right_on="movieIndex")
recom


In [None]:
# Delete objects that are no longer needed, to free some memory

del recom
del pred 
del X_train_initial
del y_train_initial
del X_train_update
del y_train_update
del X_test_update
del y_test_update

## Singular value decomposition

### Training

First the data is prepared in the correct numpy matrix form. The following transformations are applied:
* The missing values are replaced by the average rating of the user
* For each user, the user's average is subtracted from every rating value, so it does not matter if a user is more critical than others.
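
The two transformations listed above can be illustrated on a toy matrix. The values are made up for this sketch; it only shows the arithmetic, not the notebook's data.

```python
import numpy as np

# Toy rating matrix: rows are users, np.nan marks a missing rating.
R = np.array([
    [5.0, 3.0, np.nan],
    [np.nan, 2.0, 4.0],
])

# Step 1: replace each missing value with that user's mean rating.
user_mean = np.nanmean(R, axis=1, keepdims=True)   # shape (2, 1)
R_filled = np.where(np.isnan(R), user_mean, R)

# Step 2: subtract the user's mean from every entry in the row.
R_centered = R_filled - R_filled.mean(axis=1, keepdims=True)
print(R_centered)
```

After both steps the originally missing entries are exactly 0, meaning "no deviation from this user's average", while the known ratings become positive or negative deviations.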

In [None]:
ratings_matrix_df = ratings_df.pivot(index='user_id', columns='item_id', values='rating')

In [None]:
orig_ratings_matrix_df = ratings_matrix_df

In [None]:
ratings_matrix_df = ratings_matrix_df.apply(lambda row: row.fillna(row.mean()), axis=1)

In [None]:
ratings_matrix = ratings_matrix_df.to_numpy()

In [None]:
ratings_matrix

The mean is subtracted in order to remove the bias of the users.

In [None]:
user_ratings_mean = np.mean(ratings_matrix, axis = 1)
R_demeaned = ratings_matrix - user_ratings_mean.reshape(-1, 1)
R_demeaned

The decomposition is done with the svds function from scipy.sparse.linalg. The R_demeaned matrix, which is the training matrix, is decomposed in the following way:  
R_prediction = U * sigma * Vt
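
A small sketch of this decomposition on a toy matrix follows. It uses `np.linalg.svd` and keeps the `k` largest singular values by hand instead of calling `svds`, so it illustrates the same kind of rank-`k` approximation rather than the notebook's exact call. The matrix values are made up.

```python
import numpy as np

# Toy de-meaned matrix (3 users x 4 items); the values are illustrative only.
R_toy = np.array([
    [ 1.0, -1.0,  0.5, -0.5],
    [ 0.8, -0.9,  0.4, -0.3],
    [-1.0,  1.0, -0.5,  0.5],
])

k = 1  # number of latent factors to keep

# Full SVD (singular values come sorted in descending order),
# then keep only the k largest singular values and their vectors.
U_full, s_full, Vt_full = np.linalg.svd(R_toy, full_matrices=False)
U_k, sigma_k, Vt_k = U_full[:, :k], np.diag(s_full[:k]), Vt_full[:k, :]

# Rank-k reconstruction: R_prediction = U * sigma * Vt.
R_prediction = U_k @ sigma_k @ Vt_k
print(U_k.shape, sigma_k.shape, Vt_k.shape)
print(np.round(R_prediction, 2))
```

The reconstruction error depends only on the discarded singular values, which is why a larger `k` fits the training matrix more closely.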

In [None]:
U, sigma, Vt = svds(R_demeaned, k = 100)
sigma = np.diag(sigma)

In [None]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = ratings_matrix_df.columns)
preds_df

The prediction is presented in the same format as in the previous case: first the dataframe of the already rated movies, then the dataframe of the recommended movies.  

The favourite genre of the user is drama, and the recommended movies are also dramas.

In [None]:
def recommend_movies(predictions_df, userID, num_recommendations=5):
    
    # Get and sort the user's predictions (userID is the row position in predictions_df)
    user_row_number = userID
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)

    # Get the user's data and merge in the movie information.
    user_data = ratings_df[ratings_df.user_id == (user_row_number)]
    user_full = (user_data.merge(movies_for_mf_df, how = 'left', left_on = 'item_id', right_on = 'movieIndex').
                     sort_values(['rating'], ascending=False)
                 )

    print("User %i has already rated %i movies" % (user_row_number, user_full.shape[0]))
    print("Recommending the highest %i predicted ratings movies not already rated." % (num_recommendations))

    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_for_mf_df[~movies_for_mf_df['movieIndex'].isin(user_full['movieIndex'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieIndex',
               right_on = 'item_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )
    

    return user_full, recommendations

In [None]:
already_rated, recommendations = recommend_movies(preds_df, 200, 10)

In [None]:
already_rated.head(30)

In [None]:
recommendations

In [None]:


# Calculate RMSE on the ratings that were actually given (non-missing entries)

rated_mask = orig_ratings_matrix_df.notna().to_numpy()
y_actual = orig_ratings_matrix_df.to_numpy()[rated_mask]
y_predicted = preds_df.to_numpy()[rated_mask]

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
rmse

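The RMSE is 0.7669, so the results of the two methods are very similar. Note that this value is computed on the same ratings that were used for the decomposition, whereas the 0.7544 of the kernelized method was measured on held-out test ratings.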

## Conclusion

Both algorithms give similar results. However, in a real production environment the kernelized matrix factorization would be more beneficial, because the model can be updated with new users without recalculating the whole factorization.