[View in Colaboratory](https://colab.research.google.com/github/ruxandraburtica/recommender-systems/blob/master/2_model_based_collaborative_filtering.ipynb)

# 2. Model-based collaborative filtering






# Model-based collaborative filtering

Model-based collaborative filtering methods first learn a model from the observed user-item ratings, and then use that model to make predictions.

### Types of models
- Probabilistic
- Classification
- Regression
- Clustering
- Rule-based


The best-known Netflix Prize solutions used model-based collaborative filtering.

Top 2 algorithms in the Netflix Prize, followed by their combination:
- SVD (matrix factorization) -- RMSE 0.8914
- RBM (Restricted Boltzmann Machines, a type of neural network) -- RMSE 0.8990
- Ensemble of the two -- RMSE 0.88
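
The scores above are root-mean-square errors (RMSE): the square root of the average squared difference between predicted and actual ratings, so lower is better. A minimal sketch of the metric follows; the ratings are made up for illustration and are not taken from the Netflix data.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, for illustration only.
actual = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]
print(round(rmse(predicted, actual), 4))  # 0.75
```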

# Using SVD:


### The mathematics:

The basic idea is that the matrix we start with, $X$, which holds the ratings users gave to items, is very sparse. We want to collapse it into something that has fewer dimensions and is much less sparse.

We do that by decomposing the original matrix $X$ into three matrices:
* $U$ == left singular matrix, representing the relationship between users and latent factors
* $S$ == diagonal matrix describing the strength of each latent factor
* $V$ == right singular matrix, indicating the similarity between items and latent factors. 
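
Written out with shapes, the decomposition looks as follows. Here $m$ (number of users) and $n$ (number of items) are notation introduced only for this illustration, and $r$ is the number of latent factors.

$$
X_{m \times n} \approx U_{m \times r} \, S_{r \times r} \, V^{T}_{r \times n}
$$

The product is exact when $r$ equals the rank of $X$ and becomes an approximation as $r$ is reduced.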

![alt text](https://cdn-images-1.medium.com/max/2000/1*haUDjEiQmG0RapR0SHos6Q.png =500x120)




$r$ is the number of latent factors kept in the decomposition.
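
The sketch below shows the decomposition, and the effect of keeping only $r$ factors, using NumPy's `np.linalg.svd`. The small rating matrix is made up, and treating unrated entries as 0 is a simplifying assumption for the example rather than how a recommender would normally handle missing ratings.

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix (0 = not rated), for illustration only.
X = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 4],
    [1, 0, 4, 5, 5],
], dtype=float)

# Thin SVD: U is (4, 4), s holds the diagonal of S, Vt is V transposed (4, 5).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the r strongest latent factors.
r = 2
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print("singular values:", np.round(s, 2))
print("rank-2 reconstruction error:", round(float(np.linalg.norm(X - X_r)), 3))
```

The singular values come back sorted from strongest to weakest, so slicing the first `r` columns of `U` and rows of `Vt` keeps the most informative factors.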



### What are latent factors?
Latent factors describe a property or concept that a user or an item has.
For instance, in music, a latent factor can correspond to the genre a piece of music belongs to. SVD reduces the dimension of the utility matrix by extracting its latent factors. Essentially, we map each user and each item into a latent space of dimension $r$. Users and items thereby become directly comparable, which helps us better understand the relationship between them. The figure below illustrates this idea.
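
As a rough illustration of why a shared latent space makes users and items directly comparable, the sketch below places two users and three songs in a hand-made two-dimensional space. The factor meanings ("rock", "jazz"), all values, and all identifiers are hypothetical; real latent factors are learned from data and rarely have such clean labels.

```python
import numpy as np

# Hypothetical latent space with r = 2 factors: [affinity to rock, affinity to jazz].
user_factors = {
    "alice": np.array([0.9, 0.1]),  # mostly likes rock
    "bob": np.array([0.2, 0.8]),    # mostly likes jazz
}
item_factors = {
    "rock_song": np.array([1.0, 0.0]),
    "jazz_song": np.array([0.0, 1.0]),
    "fusion_song": np.array([0.5, 0.5]),
}

def affinity(user, item):
    """Dot product of a user vector and an item vector in the same latent space."""
    return float(user_factors[user] @ item_factors[item])

def rank_items(user):
    """Item names sorted from highest to lowest affinity for the given user."""
    return sorted(item_factors, key=lambda item: affinity(user, item), reverse=True)

for user in user_factors:
    print(user, rank_items(user))
```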




## Hands-on

### 1. Get additional packages
In Jupyter notebooks and Colab, we can install additional packages by prefixing shell commands with `!`:

In [0]:
!pip uninstall -y scipy
!pip install scipy==1.0.0
!pip install surprise

**Note:** After installing, please restart the runtime (Runtime --> Restart runtime)


### 2. Import packages needed throughout the notebook

In [0]:
from surprise import Dataset, SVD
from surprise.model_selection import cross_validate, split

### 3. Get the data


The `surprise` package can download the MovieLens dataset for us, so we will use that.

In [0]:
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

### 4. Plug-in SVD

In [0]:
# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
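
Despite its name, Surprise's `SVD` class does not compute the exact decomposition described earlier. According to the Surprise documentation, it predicts a rating as $\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$ and learns the biases and factor vectors by stochastic gradient descent over the known ratings only. The NumPy sketch below illustrates that idea on a few made-up ratings. It is a simplified stand-in, not the library's implementation, and every name and hyperparameter value in it is an assumption chosen for the example.

```python
import numpy as np

def fit_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])      # global mean rating
    bu = np.zeros(n_users)                        # user biases
    bi = np.zeros(n_items)                        # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + Q[i] @ P[u])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            # Update both factor vectors from their values before this step.
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                          Q[i] + lr * (err * P[u] - reg * Q[i]))
    return mu, bu, bi, P, Q

def predict(model, u, i):
    """Predicted rating: global mean + user bias + item bias + factor dot product."""
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + Q[i] @ P[u]

def train_rmse(model, ratings):
    """RMSE of the model on the ratings it was trained on."""
    return float(np.sqrt(np.mean([(r - predict(model, u, i)) ** 2 for u, i, r in ratings])))

# Made-up (user, item, rating) triples, for illustration only.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
untrained = fit_mf(ratings, n_users=3, n_items=3, n_epochs=0)
trained = fit_mf(ratings, n_users=3, n_items=3)
print("training RMSE before:", round(train_rmse(untrained, ratings), 3))
print("training RMSE after: ", round(train_rmse(trained, ratings), 3))
```

Because only observed ratings enter the loop, missing entries never have to be filled in, which is the main practical difference from running a plain SVD on the full matrix.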

## Benchmarks

Surprise already implements multiple models; they are listed here: http://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

A benchmark of them can also be found at http://surpriselib.com

Try out a couple of them and identify which performs best on our dataset.
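
One way to keep track of such a comparison is to collect the dictionary that `cross_validate` returns for each algorithm and rank the algorithms by mean test RMSE. The helper below is a sketch with hypothetical names; it relies only on the `test_rmse` key of that dictionary, and the numbers in the example are placeholders rather than measured results.

```python
def rank_by_rmse(results_by_name):
    """Return (name, mean test RMSE) pairs, best (lowest RMSE) first.

    results_by_name maps an algorithm name to a cross_validate-style dict
    whose 'test_rmse' entry holds one RMSE value per fold.
    """
    means = {
        name: sum(res["test_rmse"]) / len(res["test_rmse"])
        for name, res in results_by_name.items()
    }
    return sorted(means.items(), key=lambda pair: pair[1])

# Placeholder per-fold scores, NOT real benchmark numbers.
example = {
    "algo_a": {"test_rmse": [1.00, 1.02, 0.98]},
    "algo_b": {"test_rmse": [0.95, 0.93, 0.97]},
}
for name, score in rank_by_rmse(example):
    print(f"{name}: {score:.3f}")
```

In the notebook, each entry could be filled with the result of `cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)` for the algorithm in question.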

## Extra step:

1. Vary the cross-validation class, the algorithms, or their parameters to obtain a smaller error.
2. Test these on a different dataset (e.g. jester) and compare the models' results.

