# Recommender System with Matrix Factorization Method
October 29, 2022

We have learned how to use the FunkSVD technique (Funk Singular Value Decomposition) from the Surprise library to provide recommendations (or, more accurately, rating predictions).

For today's kickoff, we will practice with a dataset of **anime ratings** from [this](https://www.kaggle.com/CooperUnion/anime-recommendations-database) Kaggle dataset and tune the hyperparameters with `GridSearchCV` from the same library.

Note: Since the original rating dataset has more than 7 million rows, I filtered the data to include only users and animes with at least 1,000 ratings.

In [2]:
# Import the basic packages
import numpy as np 
import pandas as pd 

# Import the surprise packages
from surprise import Dataset
from surprise.reader import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD

## Data loading

In [3]:
rating_df = pd.read_csv("anime_rating_merged.csv")

In [4]:
# Check the anime dataset
display(rating_df.head())
print('Shape: ', rating_df.shape)
print('Number of Unique Users:',len(rating_df['user_id'].unique()))
print('Number of Unique Animes:',len(rating_df['anime_id'].unique()))

Unnamed: 0,user_id,anime_id,rating,name,genre
0,1497,18,9,Initial D Fourth Stage,"Action, Cars, Drama, Seinen, Sports"
1,1497,22,10,Prince of Tennis,"Action, Comedy, School, Shounen, Sports"
2,1497,51,3,Tenshi Kinryouku,"Action, Drama, Romance, Shoujo, Supernatural"
3,1497,52,8,Kidou Tenshi Angelic Layer,"Comedy, Drama, Sci-Fi, Shounen, Sports"
4,1497,59,7,Chobits,"Comedy, Drama, Ecchi, Romance, Sci-Fi, Seinen"


Shape:  (114207, 5)
Number of Unique Users: 165
Number of Unique Animes: 1462


In [5]:
# Check the min and max value of rating columns
print(f'Min:{rating_df["rating"].min()}')
print(f'Max:{rating_df["rating"].max()}')

Min:1
Max:10


In [6]:
# check if there is null values in the rating
rating_df[rating_df['rating'].isna()]

Unnamed: 0,user_id,anime_id,rating,name,genre


**Goal**: In the rating dataset, every row is one user's rating of one anime, on a scale from 1 (negative) to 10 (positive). We will focus on user `1497`, look at which animes this user has rated, and predict which animes this user might be interested in.

In [7]:
# Check number of animes rated by user 1497
user_1497 = rating_df[rating_df['user_id']==1497]
user_1497.shape

(628, 5)

In [8]:
# Find the number of animes user 1497 rated as 10/10
sum(user_1497['rating'] == 10)

45

So out of the 1,462 animes in the dataset, user `1497` has rated 628 of them and gave a rating of 10 to 45 of them.

In [9]:
# sort the anime of user 1497 by the rating in descending order (10->1)
user_1497.sort_values('rating', ascending = False).head(10)

Unnamed: 0,user_id,anime_id,rating,name,genre
38,1497,427,10,Kaleido Star,"Comedy, Drama, Fantasy, Shoujo"
102,1497,3470,10,Special A,"Comedy, Romance, School, Shoujo"
34,1497,317,10,Final Fantasy VII: Advent Children,"Action, Fantasy, Super Power"
268,1497,10460,10,Kimi to Boku.,"Comedy, Drama, Romance, School, Shounen, Slice..."
417,1497,18115,10,Magi: The Kingdom of Magic,"Action, Adventure, Fantasy, Magic, Shounen"
341,1497,14345,10,Btooom!,"Action, Psychological, Sci-Fi, Seinen"
170,1497,6637,10,Higashi no Eden Movie II: Paradise Lost,"Action, Comedy, Drama, Mystery, Romance, Thriller"
41,1497,476,10,Ginban Kaleidoscope,"Drama, Romance, Sports"
423,1497,18245,10,White Album 2,"Drama, Music, Romance, Slice of Life"
277,1497,10793,10,Guilty Crown,"Action, Drama, Sci-Fi, Super Power"


In [10]:
# animes with lowest rating by user
user_1497.sort_values('rating', ascending = False).tail(10)

Unnamed: 0,user_id,anime_id,rating,name,genre
505,1497,23079,2,Glasslip,"Romance, Slice of Life, Supernatural"
60,1497,935,2,Witchblade,"Action, Sci-Fi, Super Power"
539,1497,25283,2,Kuusen Madoushi Kouhosei no Kyoukan,"Action, Drama, Fantasy, Magic, School"
555,1497,27891,1,Sword Art Online II: Debriefing,"Action, Adventure, Fantasy, Game"
212,1497,8861,1,"Yosuga no Sora: In Solitude, Where We Are Leas...","Drama, Ecchi, Harem, Romance"
361,1497,15085,1,Amnesia,"Fantasy, Josei, Mystery, Romance"
408,1497,17513,1,Diabolik Lovers,"Harem, School, Shoujo, Vampire"
621,1497,32379,1,Berserk (2016),"Action, Adventure, Demons, Drama, Fantasy, Hor..."
256,1497,10213,1,Maji de Watashi ni Koi Shinasai!,"Comedy, Ecchi, Harem, Martial Arts, Romance, S..."
17,1497,147,1,Kimi ga Nozomu Eien,"Drama, Romance, Slice of Life"


## Matrix Factorization: FunkSVD

With the traditional matrix factorization techniques, we are able to find the **latent features**, which are essentially the 'hidden factors' in our data related to the items and users.

Recall the movie review example from the Recommender Systems lecture.<br>
Rating matrix $R$:

|User  | Finding Nemo |  Thor  | Thor: Ragnarok |  Finding Dory  |
|------|--------------|--------|----------------|----------------|
|Bob   |   4          | 1      |   1            | 5              |
|Sally |   5          | 5      |   2            | 1              |
|Shila |   3          | 5      |   4            | 3              |

To find the latent features, we can decompose this matrix into two other matrices, $U$ (for users) and $M$ (for movies):

$$U \cdot M \approx R$$

User Matrix $U$:

|      | Latent Variable1 |  Latent Variable2  | Latent Variable3|
|------|------------------|--------------------|-----------------|
|Bob   |   0.29           |         2.89       |   0.46          |
|Sally |   2.33           | 0.61               |   0.0           |
|Shila |   0.56           | 0.0                |   1.87          |


Movie Matrix $M$:

|                 | Finding Nemo |  Thor  | Thor: Ragnarok |  Finding Dory  |
|-----------------|--------------|--------|----------------|----------------|
|Latent Variable1 |   1.78       | 2.13   |   0.75         | 0.0            |
|Latent Variable2 |   1.36       | 0.0    |   0.37         | 1.63           |
|Latent Variable3 |   0.0        | 2.03   |   1.91         | 0.53           |


From the movie matrix $M$, latent variable 1 seems to be concerned with original movies vs sequels (Finding Nemo vs Finding Dory, and Thor vs Thor: Ragnarok). Looking at the user matrix $U$, Sally has by far the largest weight on latent variable 1, which we can interpret as her caring mostly about originals vs sequels.
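As a quick check, the product of the two factor matrices can be computed with NumPy. This is only a sketch using the rounded values from the tables above; the variable names are chosen here for illustration. With these rounded factors the product reproduces most entries of $R$ closely, but not all of them (the largest gap is about 2 rating points), so the factorization should be read as an approximation.

```python
import numpy as np

# Rating matrix R (rows: Bob, Sally, Shila)
R = np.array([[4, 1, 1, 5],
              [5, 5, 2, 1],
              [3, 5, 4, 3]], dtype=float)

# User matrix U (rows: users, columns: latent variables 1 to 3)
U = np.array([[0.29, 2.89, 0.46],
              [2.33, 0.61, 0.00],
              [0.56, 0.00, 1.87]])

# Movie matrix M (rows: latent variables 1 to 3, columns: movies)
M = np.array([[1.78, 2.13, 0.75, 0.00],
              [1.36, 0.00, 0.37, 1.63],
              [0.00, 2.03, 1.91, 0.53]])

# Reconstruct the ratings and compare them with the original matrix
R_hat = U @ M
abs_error = np.abs(R - R_hat)

print(np.round(R_hat, 2))
print("Largest absolute error:", round(float(abs_error.max()), 2))
```

Each predicted rating is the dot product of one user row and one movie column, which is the same form of prediction FunkSVD uses later in this notebook.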

The traditional matrix factorization techniques seem to work fine here, but in real life, our rating matrix will look more like this:

|User  | Steins;Gate | Pokemon the Series: Sun & Moon | Haikyuu |
|------|-------------|--------------------------------|---------|
|1872  | 9           | 1                              | ?       |
|227   | ?           | 2                              | ?       |
|1982  | 10          | ?                              | 7       |

**So how do we deal with these missing values?**

We can use the **Funk Singular Value Decomposition**, which will ignore these missing values and find a way to compute latent factors only using the values we know. You can read up on how this algorithm works in detail [here](https://medium.com/datadriveninvestor/how-funk-singular-value-decomposition-algorithm-work-in-recommendation-engines-36f2fbf62cac).

FunkSVD allows us to predict a rating for every user-item pair. If we can predict ratings with low error, we can use the predicted ratings to find, for each user, the items with the highest predicted rating.
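To make the idea concrete, here is a minimal NumPy sketch of this kind of factorization: stochastic gradient descent over the observed entries only, with missing ratings stored as `NaN` and simply skipped. It is not Surprise's implementation. The function names `funk_svd` and `observed_rmse`, the hyperparameter values, and the regularization term `reg` are assumptions made for this illustration, and the toy matrix is the small sparse example above.

```python
import numpy as np

def funk_svd(R, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Factorize R (NaN = missing) with SGD over the observed entries only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors))  # item factors
    # Missing ratings never enter the training loop
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(R[u, i])]
    for _ in range(n_epochs):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]  # error on one known rating
            p_u = P[u].copy()            # keep old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

def observed_rmse(R, P, Q):
    """RMSE computed on the known ratings only."""
    mask = ~np.isnan(R)
    return float(np.sqrt(np.mean((R[mask] - (P @ Q.T)[mask]) ** 2)))

# Toy matrix from the table above (rows: users 1872, 227, 1982)
R_sparse = np.array([[9.0, 1.0, np.nan],
                     [np.nan, 2.0, np.nan],
                     [10.0, np.nan, 7.0]])

P0, Q0 = funk_svd(R_sparse, n_epochs=0)  # untrained random factors
P, Q = funk_svd(R_sparse)                # trained factors, same seed

print("RMSE before training:", round(observed_rmse(R_sparse, P0, Q0), 3))
print("RMSE after training: ", round(observed_rmse(R_sparse, P, Q), 3))
print(np.round(P @ Q.T, 2))  # includes estimates for the missing cells
```

The final matrix contains an estimate for every cell, including the ones marked `?`. With so few known ratings these estimates are not meaningful; the point is only that training never needs the missing values.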

Now let's see how we can use the FunkSVD technique to find which unwatched animes user `1497` would likely rate highly.

In the Surprise library, we need a `Reader` object to load the data into a `Dataset` object.

In [11]:
# Set the reader with accurate rating scale
my_reader = Reader(rating_scale=(1,10))

# Set the dataset
# Remember that the df parameter has to have 3 columns:
# User ids, Item ids (anime), Ratings
my_dataset = Dataset.load_from_df(rating_df[['user_id','anime_id','rating']], my_reader)
my_dataset

<surprise.dataset.DatasetAutoFolds at 0x2971a505908>

Next, the Surprise library provides `GridSearchCV` for hyperparameter optimization. We have to set up the parameter grid for SVD and define which evaluation metric we are going to use.

**Parameters**

- Number of latent factors: more factors could give better results, but can also lead to overfitting.
- Number of epochs: number of passes the algorithm makes over the training data.
- Learning rate: the speed at which the algorithm learns. Larger values give faster learning, while smaller values give slower but more stable learning.


**Evaluation Metric**

For evaluation, we will use the **Fraction of Concordant Pairs (FCP)**, which is the fraction of pairs whose relative ranking order is correct.

Consider two items, each with a true rating and a predicted rating: $(r_{1}, \hat{r_{1}})$ and $(r_{2}, \hat{r_{2}})$. The pair is concordant if $r_1 > r_2$ and $\hat{r_1} > \hat{r_2}$. This means that regardless of the predicted values, the order of the ratings is correct (i.e. item 1 is predicted to be preferred to item 2).

In the example below, the true and predicted values differ for anime 2, but since $r_1(6) > r_2(3)$ and $\hat{r_1}(6) > \hat{r_2}(5)$, the order of the ratings is still correct: anime 1 is predicted to be preferred to anime 2.

| | True ($r_{i}$) | Predicted ($\hat{r_{i}}$)|
|--|--|--|
|Anime 1|6|6|
|Anime 2|3|5|
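The counting behind this metric can be sketched in a few lines of plain Python. This is a simplified illustration for a single user's ratings, not Surprise's implementation (which counts pairs per user and then aggregates across users). The function name `fraction_concordant_pairs` is hypothetical, and the choice to skip pairs with equal true ratings is an assumption of this sketch.

```python
from itertools import combinations

def fraction_concordant_pairs(true_ratings, predicted_ratings):
    """Share of item pairs whose predicted order matches the true order."""
    concordant = 0
    discordant = 0
    pairs = combinations(zip(true_ratings, predicted_ratings), 2)
    for (t1, p1), (t2, p2) in pairs:
        if t1 == t2:
            continue  # no true preference between the two items, skip the pair
        if (t1 - t2) * (p1 - p2) > 0:
            concordant += 1  # predicted order agrees with the true order
        else:
            discordant += 1  # wrong order, or a tie in the predictions
    total = concordant + discordant
    return concordant / total if total else float("nan")

# The two animes from the table above: true ratings (6, 3), predictions (6, 5)
print(fraction_concordant_pairs([6, 3], [6, 5]))  # 1.0
```

If the predictions were swapped to (5, 6), the single pair would be discordant and the score would drop to 0.0, even though both predicted values would still be within a few points of the true ratings.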

In [12]:
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV

# Set the parameter grid
param_grid = {
    'n_factors': [100], 
    'n_epochs': [10, 20],
    'lr_all': [0.005],
    'biased': [False] }

# Set GridSearchCV with 2-fold cross-validation
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=2)

# Fit the model
GS.fit(my_dataset)

In [13]:
# Check the FCP accuracy score (1.0 is ideal and 0 is worst)
GS.best_score['fcp']

0.7065980145337932

In [14]:
# Check the best parameters
GS.best_params['fcp']

{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'biased': False}

Based on the grid search, the best parameters are `{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'biased': False}`.

We can now fit the algorithm with these parameters on a train set and check its accuracy on a test set by comparing the predicted values with the actual ratings.

In [17]:
# Import train_test_split
from surprise.model_selection import train_test_split

# Split train test set
trainset, testset = train_test_split(my_dataset, test_size=0.25)

# Set the algorithm
my_svd = FunkSVD(n_factors=100, 
                 n_epochs=20, 
                 lr_all=0.005, 
                 biased=False,
                 verbose=0)
# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

In [19]:
# Put my_pred result in a dataframe
df_prediction = pd.DataFrame(my_pred, columns=['user_id',
                                                     'anime_id',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

# Calculate the difference of actual and prediction into diff column
df_prediction['diff'] = abs(df_prediction['prediction'] - 
                            df_prediction['actual'])

In [20]:
# Check the df_prediction
df_prediction.head()

Unnamed: 0,user_id,anime_id,actual,prediction,details,diff
0,22394,5671,8.0,8.34528,{'was_impossible': False},0.34528
1,12431,9047,6.0,4.927884,{'was_impossible': False},1.072116
2,51270,1690,10.0,8.578622,{'was_impossible': False},1.421378
3,61110,21881,10.0,10.0,{'was_impossible': False},0.0
4,23247,20939,7.0,7.078663,{'was_impossible': False},0.078663


In [21]:
(df_prediction['diff'] == 0).mean()

0.00724992995236761

We can see that only about 0.7% of the predictions match the actual rating exactly. This is understandable, since the predicted ratings are floats. Instead, we will use a threshold of ±1 for the difference.

In [22]:
(df_prediction["diff"] <= 1).mean()

0.6700056038105912

About 67% of the predictions are within one point of the actual rating. Now we can fit the algorithm on the full train set.

In [23]:
# Build full trainset
full_trainset = my_dataset.build_full_trainset()

# Build the SVD algorithm
my_svd = FunkSVD(n_factors=100, 
                 n_epochs=20, 
                 lr_all=0.005,    
                 biased=False, 
                 verbose=0)

# Fit with full trainset
my_svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x29720113208>

Now we have to build a test set containing every user-anime pair that has no rating. We can use the `.build_anti_testset` method for this purpose.

This method allows us to create an “anti testset”, which is essentially the complement of the original dataset: all user-anime pairs that do not appear in the trainset, for every user. For user `1497`, who has rated 628 out of 1,462 animes, this means the 834 animes the user did not rate.

Then we can run the predictions on the anti testset with the `test` method (which returns results with a similar structure to the `predict` method). With this step, we have an estimated rating for all the user-item pairs that were missing from our data.
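The idea of an anti testset can be sketched in plain Python on a toy dataset. This is only an illustration of the concept, not Surprise's code: the function name `anti_testset_sketch` and the toy ratings are made up for this example, and the `fill` value stands in for the unknown true rating in the same way as `fill=-1` in the cell below.

```python
def anti_testset_sketch(ratings, fill=-1.0):
    """Return every (user, item, fill) pair that is absent from `ratings`."""
    users = sorted({user for user, _, _ in ratings})
    items = sorted({item for _, item, _ in ratings})
    rated = {(user, item) for user, item, _ in ratings}
    # Keep only the user-item combinations that have no known rating
    return [(user, item, fill)
            for user in users
            for item in items
            if (user, item) not in rated]

# Toy data (hypothetical): two users, three animes, four known ratings
toy_ratings = [
    (1497, 18, 9),
    (1497, 22, 10),
    (227, 18, 7),
    (227, 59, 6),
]

anti = anti_testset_sketch(toy_ratings)
print(anti)  # [(227, 22, -1.0), (1497, 59, -1.0)]
```

Two users and three animes give six possible pairs, four of which are rated, so two pairs remain. The same counting applied to a single user in the real data gives the number of animes that user has not rated.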

In [25]:
# Define the full test set
full_testset = full_trainset.build_anti_testset(fill=-1)

In [26]:
# Set the prediction
my_prediction = my_svd.test(full_testset)

In [27]:
# Put into a dataframe
df_prediction = pd.DataFrame(my_prediction, columns=['user_id',
                                                     'anime_id',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

In [31]:
# Check user id `1497` predictions
df_1497 = df_prediction[df_prediction['user_id']==1497]
df_1497 = df_1497.sort_values('prediction', ascending=False)

In [32]:
# Merge with the anime data
merge_df = df_1497.merge(rating_df[["anime_id", "name", "genre"]].drop_duplicates(), how='left', 
                    left_on=['anime_id'], right_on=['anime_id'])

# Check anime of user 1497
merge_df

Unnamed: 0,user_id,anime_id,actual,prediction,details,name,genre
0,1497,2034,-1.0,9.474843,{'was_impossible': False},Lovely★Complex,"Comedy, Romance, Shoujo"
1,1497,4181,-1.0,9.438204,{'was_impossible': False},Clannad: After Story,"Drama, Fantasy, Romance, Slice of Life, Supern..."
2,1497,2167,-1.0,9.306982,{'was_impossible': False},Clannad,"Comedy, Drama, Romance, School, Slice of Life,..."
3,1497,5114,-1.0,9.306035,{'was_impossible': False},Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
4,1497,11061,-1.0,9.239254,{'was_impossible': False},Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power"
...,...,...,...,...,...,...,...
829,1497,7017,-1.0,4.804937,{'was_impossible': False},K-On!: Ura-On!,Comedy
830,1497,9587,-1.0,4.796448,{'was_impossible': False},Oniichan no Koto nanka Zenzen Suki ja Nai n da...,"Comedy, Ecchi, Harem, Romance"
831,1497,19315,-1.0,4.205938,{'was_impossible': False},Pupa,"Fantasy, Horror, Psychological"
832,1497,15565,-1.0,3.744902,{'was_impossible': False},Maken-Ki! Two,"Action, Ecchi, Harem, Martial Arts, School, Su..."


Now we can recommend some animes to user `1497` based on the highest predicted ratings. The genres of those animes look quite similar to the genres of the user's favourite animes.
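As a final step, the sorted predictions can be turned into a short recommendation list. The sketch below works on plain tuples shaped like the prediction rows above (user id, anime id, actual, prediction, details) and uses five rows copied from the output table. The helper name `top_n_per_user` is hypothetical, and the choice of `n=3` is arbitrary.

```python
from collections import defaultdict

def top_n_per_user(predictions, n=3):
    """Group predictions by user and keep the n highest estimates for each."""
    by_user = defaultdict(list)
    for user_id, anime_id, _actual, estimate, _details in predictions:
        by_user[user_id].append((anime_id, estimate))
    return {
        user_id: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for user_id, pairs in by_user.items()
    }

# Rows copied from the output above, deliberately out of order
predictions = [
    (1497, 5114, -1.0, 9.306035, {"was_impossible": False}),
    (1497, 19315, -1.0, 4.205938, {"was_impossible": False}),
    (1497, 2034, -1.0, 9.474843, {"was_impossible": False}),
    (1497, 4181, -1.0, 9.438204, {"was_impossible": False}),
    (1497, 2167, -1.0, 9.306982, {"was_impossible": False}),
]

top = top_n_per_user(predictions, n=3)
print([anime_id for anime_id, _ in top[1497]])  # [2034, 4181, 2167]
```

The three ids correspond to Lovely★Complex, Clannad: After Story, and Clannad in the merged table.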