# Matrix Factorization

## Preparation


### Python library
In this practice, we use the **[scikit-surprise](https://surprise.readthedocs.io/en/stable/index.html)** library.
scikit-surprise is a Python library that provides efficient implementations of well-known recommendation algorithms.
Run the following code to install scikit-surprise in your environment.

In [1]:
!pip install scikit-surprise



After installing scikit-surprise, let's import the libraries used in this practice.


In [2]:
import numpy as np
from surprise import Dataset
from surprise import SVD as SimonMF
from surprise import KNNWithMeans as KNN
from surprise.model_selection import cross_validate

### Dataset
Fortunately, the scikit-surprise library has a function to load several sample datasets, including the MovieLens dataset that we used in the previous practices.

Run the following code on Google Colaboratory to load the MovieLens dataset (**[MovieLens 100K Dataset](https://grouplens.org/datasets/movielens/100k/)**).

In [3]:
dataset_100k = Dataset.load_builtin('ml-100k')

# If you want to try a larger dataset, run the following code:
# dataset_1m = Dataset.load_builtin('ml-1m')

## Prediction by using scikit-surprise

Let's use the **scikit-surprise** library to predict rating scores.
Here, we try to predict rating scores by using the **user-based collaborative filtering** algorithm.

First, we create an instance of the user-based CF algorithm as below.

In [4]:
ubcf = KNN(k=10, sim_options={'user_based': True, 'name': 'pearson'})

In the above code, we set the maximum number of nearest neighbors used for a prediction (`k`) to 10 and select the Pearson correlation coefficient as the similarity measure.
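
To make the similarity measure concrete, here is a minimal sketch of the Pearson correlation between two users, computed only over the items both have rated. The function name `pearson_similarity`, the dictionary format, and the sample users are assumptions for illustration; this is a simplified version, not the library's actual implementation.

```python
from math import sqrt


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation over the items that both users rated.

    ratings_u, ratings_v: dicts mapping item ID -> rating (hypothetical format).
    Returns 0.0 when fewer than two items are shared or a variance is zero.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / sqrt(var_u * var_v)


# Made-up users: bob rates every shared item one point lower than alice,
# while carol's tastes run in the opposite direction.
alice = {"A": 5, "B": 3, "C": 4}
bob = {"A": 4, "B": 2, "C": 3, "D": 5}
carol = {"A": 1, "B": 5, "C": 2}

print(pearson_similarity(alice, bob))    # 1.0
print(pearson_similarity(alice, carol))  # a negative value
```

Because the correlation is computed after subtracting each user's mean, a user who rates everything one point lower than another user is still treated as a perfect neighbor.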

Let's use this instance to predict the rating score of the user with ID 1 for the item with ID 1 in the dataset.
To do so, run the following code.

In [5]:
# All data in the dataset is used
trainset = dataset_100k.build_full_trainset()
ubcf.fit(trainset)


# Predict the rating score of user '1' for item '1' (IDs as they appear in the dataset)
predicted_score = ubcf.predict(uid='1', iid='1', verbose=False)
print("predicted score = ", predicted_score.est)

Computing the pearson similarity matrix...
Done computing similarity matrix.
predicted score =  4.392863701697205


## Comparison between several recommendation algorithms

The objective of this practice is to compare the algorithms which we learned in the lecture. 
The target algorithms are:
* User-based collaborative filtering
* Item-based collaborative filtering
* Simon Funk's matrix factorization

How do we evaluate and compare the performance of these algorithms?
Don't worry about that. I have prepared a function to evaluate the performance of an algorithm below (the function name is **evaluate_mean_absolute_error**).
The evaluation metric is **Mean Absolute Error (MAE)**, which captures the difference between actual values and predicted values. The definition of MAE is below:

$MAE = \frac{1}{|R|} \sum_{r_{ui} \in R}|r_{ui} - \hat{r}_{ui}|$

Here, $R$ is the set of rating scores used for evaluation and $|R|$ is the number of ratings in it. $r_{ui}$ is the actual rating score of user $u$ for item $i$. $\hat{r}_{ui}$ is the predicted rating score of user $u$ for item $i$.
Intuitively, the *MAE* value tells us how large the gap between actual values and predicted values is on average; a smaller value means more accurate predictions.
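
As a small worked example of the definition, the following sketch computes MAE for four made-up ratings. The helper name `compute_mae` and the sample values are assumptions for illustration only.

```python
def compute_mae(actual, predicted):
    """Average of the absolute differences between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual ratings and the corresponding predictions
actual_scores = [4, 3, 5, 2]
predicted_scores = [3.5, 3.0, 4.0, 3.0]

# Absolute errors are 0.5, 0.0, 1.0, and 1.0, so the mean is 2.5 / 4
print(compute_mae(actual_scores, predicted_scores))  # 0.625
```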

We use a cross-validation technique in the function **evaluate_mean_absolute_error**.
Cross-validation is a common evaluation procedure in the field of machine learning.
The next subsection is about cross-validation. If you are familiar with cross-validation or aren't interested in it, please skip it.


In [6]:
def evaluate_mean_absolute_error(algorithm, dataset):
    # Run 5-fold cross-validation and record the MAE of each fold
    result = cross_validate(algorithm, dataset,
                            measures=['MAE'], cv=5,
                            n_jobs=1, verbose=False)
    # Average the per-fold MAE values into a single score
    mean_absolute_error = np.mean(result['test_mae'])
    return mean_absolute_error

### Cross Validation
The following describes cross-validation (from [A Gentle Introduction to k-fold Cross-Validation](https://machinelearningmastery.com/k-fold-cross-validation/)):

> Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. 

> Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

> The general procedure is as follows:
> 1. Shuffle the dataset randomly.
> 2. Split the dataset into k groups
> 3. For each unique group:
>    1. Take the group as a hold out or test data set
>    2. Take the remaining groups as a training data set
>    3. Fit a model on the training set and evaluate it on the test set
>    4. Retain the evaluation score and discard the model
> 4. Summarize the skill of the model using the sample of model evaluation scores
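
The splitting part of this procedure can be sketched in plain Python as follows. The function name `k_fold_splits`, the fixed seed, and the toy data are assumptions for illustration; in this practice the splitting is handled by the library's `cross_validate` function.

```python
import random


def k_fold_splits(samples, k, seed=0):
    """Yield (train, test) pairs so that each sample is held out exactly once."""
    # Step 1: shuffle a copy of the data
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Step 2: split the shuffled data into k groups
    folds = [shuffled[i::k] for i in range(k)]
    # Step 3: use each group once as the test set and the rest as the training set
    for held_out in range(k):
        test = folds[held_out]
        train = [s for j, fold in enumerate(folds) if j != held_out for s in fold]
        yield train, test


# Toy data: ten rating records represented by their indices
toy_ratings = list(range(10))
for train, test in k_fold_splits(toy_ratings, k=5):
    print(len(train), len(test), sorted(test))
```

Fitting a model on each `train` list, scoring it on the matching `test` list, and averaging the scores completes steps 3 and 4.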

### User-based Collaborative Filtering (CF)

Let's evaluate the user-based CF for the MovieLens dataset.
The following code provides us with the MAE value of the user-based CF.

In [7]:
# Create an instance for the user-based CF
## The maximum number of nearest neighbors (k) = 10
ubcf = KNN(k=10, sim_options={'user_based': True, 'name': 'pearson'},
           verbose=False)

# Evaluation
mean_absolute_error = evaluate_mean_absolute_error(ubcf, dataset_100k)
print("Mean absolute error = ", mean_absolute_error)

Mean absolute error =  0.7675940474526761


### Item-based Collaborative Filtering (CF)

Let's evaluate the item-based CF for the MovieLens dataset.
The following code provides us with the MAE value of the item-based CF.

In [8]:
# Create an instance for the item-based CF
## The maximum number of nearest neighbors (k) = 10
## The cosine similarity is used as a similarity metric
ibcf = KNN(k=10, sim_options={'user_based': False, 'name': 'cosine'},
           verbose=False)

# Evaluation
mean_absolute_error = evaluate_mean_absolute_error(ibcf, dataset_100k)
print("Mean absolute error = ", mean_absolute_error)

Mean absolute error =  0.7708191754780656
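
For reference, the cosine similarity between two items can be sketched as below, using only the users who rated both items. The function name `cosine_similarity`, the dictionary format, and the sample items are assumptions for illustration; this is a simplified version, not the library's actual implementation.

```python
from math import sqrt


def cosine_similarity(ratings_i, ratings_j):
    """Cosine of the angle between two items' rating vectors.

    ratings_i, ratings_j: dicts mapping user ID -> rating (hypothetical format).
    Only users who rated both items are considered; returns 0.0 if there are none.
    """
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = sqrt(sum(ratings_j[u] ** 2 for u in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


# Made-up items: item_y receives exactly half of item_x's rating from each shared user
item_x = {"u1": 4, "u2": 2}
item_y = {"u1": 2, "u2": 1, "u3": 5}
item_z = {"u1": 1, "u2": 5}

print(cosine_similarity(item_x, item_y))  # approximately 1
print(cosine_similarity(item_x, item_z))  # between 0 and 1
```

Unlike the Pearson correlation, the cosine similarity does not subtract mean ratings, so with positive rating scores it never becomes negative.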


### Simon Funk's Matrix Factorization (MF)

Finally, let's evaluate Simon Funk's matrix factorization for the MovieLens dataset.
The following code provides us with the MAE value of Funk's MF.

In [9]:
# Create an instance of Simon Funk's MF
## The number of latent factors is set to 100
simon_mf = SimonMF(n_factors=100)

# Evaluation
mean_absolute_error = evaluate_mean_absolute_error(simon_mf, dataset_100k)
print("Mean absolute error = ", mean_absolute_error)

Mean absolute error =  0.7377233036075014
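
To illustrate the idea behind this kind of matrix factorization, here is a minimal sketch that learns user and item latent factors by stochastic gradient descent on a handful of made-up ratings. All names, hyperparameter values, and data here are assumptions for illustration. The sketch omits the bias terms that the library's `SVD` class uses by default, so it is a simplified variant rather than the library's implementation.

```python
import random


def predict_rating(p, q, user, item):
    """Predicted rating = dot product of the user's and the item's latent factors."""
    return sum(pf * qf for pf, qf in zip(p[user], q[item]))


def training_mae(ratings, p, q):
    """MAE of the factor model on the ratings it was trained on."""
    return sum(abs(r - predict_rating(p, q, u, i)) for u, i, r in ratings) / len(ratings)


def train_funk_mf(ratings, n_factors=2, n_epochs=500, lr=0.01, reg=0.02, seed=0):
    """Learn latent factors from (user, item, rating) tuples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Start from small random factors
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict_rating(p, q, u, i)
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Move both factors to reduce the squared error, with L2 regularization
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


# Made-up ratings; the pair ("u3", "i1") is deliberately left unobserved
toy_ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u1", "i3", 2),
               ("u2", "i1", 4), ("u2", "i2", 3), ("u2", "i3", 1),
               ("u3", "i2", 2), ("u3", "i3", 5)]

p0, q0 = train_funk_mf(toy_ratings, n_epochs=0)  # untrained factors
p, q = train_funk_mf(toy_ratings)                # trained factors
print("training MAE before:", training_mae(toy_ratings, p0, q0))
print("training MAE after: ", training_mae(toy_ratings, p, q))
print("prediction for (u3, i1):", predict_rating(p, q, "u3", "i1"))
```

Once the factors are learned, the same dot product yields a prediction for any user-item pair, including pairs that never appeared in the training data.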
