# Recommendation System

In this lab, we will use [Surprise](http://surpriselib.com/), an easy-to-use Python scikit for recommender systems. It includes several commonly used algorithms, including neighborhood-based [collaborative filtering](https://surprise.readthedocs.io/en/stable/knn_inspired.html) and [matrix factorization-based algorithms](https://surprise.readthedocs.io/en/stable/matrix_factorization.html).

In [1]:
# Install the package if it is not already available (uncomment the next line to run it).
# !pip3 install scikit-surprise

In [3]:
from surprise.prediction_algorithms.matrix_factorization import SVD
from surprise.prediction_algorithms.knns import KNNBasic
from surprise.prediction_algorithms.knns import KNNWithMeans
from surprise.prediction_algorithms.knns import KNNBaseline
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV


-----

## Load data from the Surprise package

First, we download the ml-100k dataset that ships with Surprise. The data will be saved in the `.surprise_data` folder in your home directory. We then use the package's API to sample a random trainset and testset, where the testset is made up of 20% of the ratings.

In [None]:
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# sample random trainset and testset where test set is made of 20% of the ratings.
trainset, testset = train_test_split(data, test_size=0.20)

In [None]:
print("Number of users: {}".format(trainset.n_users))
print("Number of items: {}".format(trainset.n_items))
print("Number of ratings: {}".format(trainset.n_ratings))

-----

## Collaborative Filtering

First, we will apply several flavors of collaborative filtering to this data and evaluate their performance using RMSE and MAE. For each of these algorithms, the actual number of neighbors that are aggregated to compute an estimation is necessarily less than or equal to `k`, since only neighbors with a known rating for the target item can contribute.
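
As a quick reference for the two metrics, here is a minimal pure-Python sketch of RMSE and MAE. The helper names `rmse` and `mae` and the `toy_pairs` values are made up for illustration; in the lab itself the `surprise.accuracy` module computes these from the prediction objects.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical (true, estimated) ratings on the 1-5 scale.
toy_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
print("RMSE:", rmse(toy_pairs))
print("MAE: ", mae(toy_pairs))
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does, which is why the two numbers can rank models differently.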

### The basic collaborative filtering algorithm

**TODO**: You will study the [KNNBasic](https://surprise.readthedocs.io/en/stable/knn_inspired.html) API, choose the number of neighbors and the similarity measure, train the model on the training dataset, and make predictions on the test dataset. Finally, you will evaluate the model performance based on RMSE and MAE.

Experiment with different numbers of neighbors and different similarity measures to see how they affect model performance.
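
To make the roles of the neighbor count and the similarity measure concrete, here is a simplified pure-Python sketch of a user-based nearest-neighbor prediction. It is not Surprise's implementation: the function names, the dictionary layout, and the `toy_ratings` data are assumptions made for illustration.

```python
import math

def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity computed over the items both users rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def knn_basic_predict(ratings, user, item, k=2, min_k=1):
    """Similarity-weighted average of the k most similar users who rated `item`."""
    # Only users who actually rated the item can be neighbors,
    # so fewer than k neighbors may end up being aggregated.
    candidates = [
        (cosine_sim(ratings[user], ratings[v]), ratings[v][item])
        for v in ratings
        if v != user and item in ratings[v]
    ]
    # Keep the k most similar candidates that have strictly positive similarity.
    neighbors = [(s, r) for s, r in sorted(candidates, reverse=True)[:k] if s > 0]
    if len(neighbors) < max(min_k, 1):
        return None  # not enough neighbors to estimate a rating
    return sum(s * r for s, r in neighbors) / sum(s for s, _ in neighbors)

# Hypothetical ratings: user -> {item: rating}.
toy_ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 4},
    "cat": {"A": 1, "B": 5, "D": 2},
    "dan": {"B": 4, "C": 3, "D": 5},
}
print(knn_basic_predict(toy_ratings, "ann", "D", k=2))
```

Changing `k` changes how many neighbors are averaged, and swapping `cosine_sim` for another measure changes which neighbors count as close; those are the two knobs this task asks you to vary.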

In [None]:
# Use the basic collaborative filtering algorithm.
# See https://surprise.readthedocs.io/en/stable/knn_inspired.html for more details.
# Similarity options; pass sim_2 instead of sim_1 below to compare them.
sim_1 = {'name': 'cosine', 'user_based': False}  # item-based cosine: MAE 0.816, RMSE 1.033
sim_2 = {'name': 'pearson_baseline', 'shrinkage': 0}  # user-based (the default): MAE 0.798, RMSE 1.009
model = KNNBasic(k=50, min_k=1, sim_options=sim_1, verbose=True)
preds = model.fit(trainset).test(testset)
# verbose=False keeps accuracy.* from printing each score a second time.
print("MAE: ", accuracy.mae(preds, verbose=False))
print("RMSE: ", accuracy.rmse(preds, verbose=False))


### The basic collaborative filtering algorithm with user mean ratings

**TODO**: A variation of the basic CF model takes into account the mean rating of each user. You will study the [KNNWithMeans](https://surprise.readthedocs.io/en/stable/knn_inspired.html) API, choose the number of neighbors and the similarity measure, train the model on the training dataset, and make predictions on the test dataset. Finally, you will evaluate the model performance based on RMSE and MAE.

Experiment with different numbers of neighbors and different similarity measures to see how they affect model performance.
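
The mean-centering idea can be sketched in a few lines of pure Python. This is an illustrative simplification, not Surprise's code: the function names are invented here, and the `toy_sims` similarity values are hypothetical numbers supplied by hand rather than computed from the ratings.

```python
def user_means(ratings):
    """Average rating given by each user."""
    return {u: sum(r.values()) / len(r) for u, r in ratings.items()}

def knn_with_means_predict(ratings, sims, user, item, k=2):
    """Target user's mean plus a similarity-weighted average of the
    neighbors' deviations from their own means."""
    means = user_means(ratings)
    candidates = [
        (sims[user][v], ratings[v][item] - means[v])
        for v in ratings
        if v != user and item in ratings[v]
    ]
    # Keep the k most similar candidates that have strictly positive similarity.
    neighbors = [(s, d) for s, d in sorted(candidates, reverse=True)[:k] if s > 0]
    if not neighbors:
        return means[user]  # sketch's fallback: the user's own mean
    offset = sum(s * d for s, d in neighbors) / sum(s for s, _ in neighbors)
    return means[user] + offset

# Hypothetical ratings and hand-picked similarities for the target user "ann".
toy_ratings = {
    "ann": {"A": 5, "B": 3},
    "bob": {"A": 4, "B": 2, "C": 5},
    "cat": {"A": 2, "B": 4, "C": 1},
}
toy_sims = {"ann": {"bob": 0.9, "cat": 0.3}}
print(knn_with_means_predict(toy_ratings, toy_sims, "ann", "C"))
```

Because each neighbor contributes how far above or below their own average they rated the item, a generous rater and a harsh rater are put on a comparable footing before their opinions are combined.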

In [None]:
# Use the basic collaborative filtering algorithm, taking into account the mean ratings of each user.
# See https://surprise.readthedocs.io/en/stable/knn_inspired.html for more details.

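# Similarity options; pass sim_2 instead of sim_1 below to compare them.
# Note: with 'user_based': False the means used are item means rather than user means.
sim_1 = {'name': 'cosine', 'user_based': False}  # MAE 0.7415, RMSE 0.9464
sim_2 = {'name': 'pearson_baseline', 'shrinkage': 0}  # MAE 0.7482, RMSE 0.9560
model = KNNWithMeans(k=50, min_k=1, sim_options=sim_1, verbose=True)
preds = model.fit(trainset).test(testset)
# verbose=False keeps accuracy.* from printing each score a second time.
print("MAE: ", accuracy.mae(preds, verbose=False))
print("RMSE: ", accuracy.rmse(preds, verbose=False))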

-----

## Matrix Factorization

Next, we will explore matrix factorization techniques for recommendation. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices. The famous SVD algorithm for matrix factorization was popularized by Simon Funk during the Netflix Prize.
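
As a sketch of what this decomposition looks like in practice, the biased SVD model described in the Surprise documentation estimates the rating of user $u$ for item $i$ as

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^\top p_u$$

where $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, and $p_u$, $q_i$ are the latent factor vectors of the user and the item (rows of the two low-dimensional matrices). The parameters are learned by minimizing the regularized squared error over the known ratings:

$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$

Here $\lambda$ is the regularization strength, and the length of $p_u$ and $q_i$ corresponds to the number of factors chosen for the model.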

**TODO**: In this task, you will use the famous SVD algorithm to implement the matrix factorization model. You will study the [SVD](https://surprise.readthedocs.io/en/stable/matrix_factorization.html) API, choose the number of factors, train the model on the training dataset, and make predictions on the test dataset. Finally, you will evaluate the model performance based on RMSE and MAE.

Experiment with different numbers of factors, and also try the [SVD++ algorithm](https://surprise.readthedocs.io/en/stable/matrix_factorization.html) and [Non-negative Matrix Factorization](https://surprise.readthedocs.io/en/stable/matrix_factorization.html) to see if you can improve the model performance.

In [None]:
# We'll use the famous SVD algorithm, then compare it with SVD++ and NMF.
from surprise.prediction_algorithms.matrix_factorization import SVDpp
from surprise.prediction_algorithms.matrix_factorization import NMF

# RMSE was 0.9423 with the default n_factors (note: Surprise's SVD default is 100, not 20).
model = SVD(n_factors=50)
cross_validate(model, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

# RMSE was 0.92 with the default n_factors=20.
ppmodel = SVDpp(n_factors=50)
cross_validate(ppmodel, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

# RMSE was 0.9788 with the default n_factors=15.
nmfmodel = NMF(n_factors=50)
cross_validate(nmfmodel, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

## [BONUS] 
Implement your own version of user-user or item-item collaborative filtering and compare its performance against the Surprise package's implementation.

In [None]:
# TODO

# End of Lab: Recommendation System

# Conceptual Overview
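KNN rests on the nearest-neighbor idea: points that lie close to each other in a feature space are likely to be related or similar.
The closeness of neighbors can be measured in different ways (Pearson correlation, cosine similarity, etc.). KNNWithMeans adjusts the predictions by using each user's average rating: neighbors' ratings are centered on their own means before being combined, and the target user's mean is added back. The means enter the prediction step, not the similarity computation.
SVD, as implemented in Surprise, learns latent user and item factors by stochastic gradient descent on the prediction error over the known ratings, and then uses those factors to estimate the missing entries of the user-item matrix.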
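
To illustrate the gradient descent idea, here is a minimal pure-Python sketch of Funk-style matrix factorization with biases. It is a teaching sketch under stated assumptions, not Surprise's implementation: `train_mf`, its default hyperparameters, and `toy_ratings` are all invented for this example.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i by stochastic gradient descent.

    `ratings` is a list of (user, item, rating) triples.
    Returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    # Small random initial factors.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter along the error gradient, with L2 shrinkage.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr * (err * qf - reg * pf)
                q[i][f] += lr * (err * pf - reg * qf)
    return predict

# Hypothetical (user, item, rating) triples; ("ann", "C") is deliberately missing.
toy_ratings = [
    ("ann", "A", 5), ("ann", "B", 3),
    ("bob", "A", 4), ("bob", "B", 2), ("bob", "C", 5),
    ("cat", "A", 1), ("cat", "C", 2),
]
predict = train_mf(toy_ratings)
print("Estimated rating for the missing (ann, C) entry:", predict("ann", "C"))
```

The training loop only ever visits known ratings, yet the learned biases and factors define a prediction for every user-item pair, which is how the missing entries of the matrix get filled in.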