[View in Colaboratory](https://colab.research.google.com/github/ruxandraburtica/recommender-systems/blob/master/1_Collaborative_filtering.ipynb)

## Install additional packages

In [0]:
!pip uninstall -y scipy
!pip install scipy==1.0.0
!pip install surprise

## Imports

In [0]:
import requests
import zipfile
from io import BytesIO

from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

## Get the data

The `surprise` package can download the MovieLens dataset for you, but we're going to download the data ourselves, since the data for your own projects will probably not be available through `surprise`.

We are going to use the `u.data` file, which contains all the user-item ratings.

Each line of `u.data` represents one rating given by a user to an item, together with the time at which the rating was made.

The format of each line is:
`userID itemID rating timestamp`, separated by tabs

To read the data, we create a `Reader` and define its format. In this case, each line holds the fields `user item rating timestamp`, separated by a tab (`\t`). Once the format is defined, we load the data into a `Dataset` object:

In [0]:
# Download the archive and extract its contents into the current directory.
ml_100k_url = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
r = requests.get(ml_100k_url)
r.raise_for_status()  # Stop here if the download failed
z = zipfile.ZipFile(BytesIO(r.content))
z.extractall()

# Define the format
reader = Reader(line_format='user item rating timestamp', sep='\t')

# Load the data from the file using the reader format
data = Dataset.load_from_file('./ml-100k/u.data', reader=reader)

# # Load the movielens-100k dataset (download it if needed).
# data = Dataset.load_builtin('ml-100k')
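
To make the line format concrete, here is a minimal sketch that parses a few tab-separated lines by hand, without `surprise`. The `Rating` tuple, the `parse_line` helper and the sample lines are all made up for illustration; they only mirror the `user item rating timestamp` layout described above.

```python
from collections import namedtuple

# Hypothetical container for one parsed line; not part of Surprise.
Rating = namedtuple('Rating', ['user', 'item', 'rating', 'timestamp'])

def parse_line(line):
    """Split one tab-separated line into its four fields."""
    user, item, rating, timestamp = line.rstrip('\n').split('\t')
    # IDs are kept as strings; rating and timestamp become numbers.
    return Rating(user, item, float(rating), int(timestamp))

# Made-up sample lines with the same shape as the lines in u.data.
sample = [
    '1\t10\t4\t880000000\n',
    '1\t20\t2\t880000100\n',
    '2\t10\t5\t880000200\n',
]

ratings = [parse_line(line) for line in sample]
for r in ratings:
    print(r)
```

Keeping the IDs as strings matches how IDs read from a file are usually treated: they are labels, not quantities.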

## Split data

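We're going to use cross-validation, splitting the data into 5 folds right from the beginning.

We will then train our model 5 times, each time on 4 of the folds, and test it on the remaining one. Older Surprise releases did the folding with `data.split(n_folds=5)`; recent releases no longer provide that method and define the folds with a `KFold` iterator instead. Passing `cv=5` to `cross_validate` builds the same kind of splitter internally.

In [0]:
# Define a 5-fold split (replaces the old data.split(n_folds=5))
from surprise.model_selection import KFold
kf = KFold(n_splits=5)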
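
The idea behind 5-fold splitting can be sketched in plain Python. This is only an illustration of the principle, not how Surprise implements it; the `k_fold_indices` helper and its parameters are hypothetical.

```python
import random

def k_fold_indices(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        # Every n_folds-th shuffled index, starting at k, forms the test fold.
        test = indices[k::n_folds]
        test_set = set(test)
        # Everything that is not in the test fold is used for training.
        train = [i for i in indices if i not in test_set]
        yield train, test

folds = list(k_fold_indices(n_items=20, n_folds=5))
for train, test in folds:
    print(len(train), len(test))
```

Each rating ends up in the test fold exactly once, so every rating is used for both training and testing across the 5 runs.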

## Training

Our model will try to make its predictions match the actual ratings as closely as possible.

Since we're predicting the rating for a given user-movie combination, we compare each prediction to the actual rating. The difference between the actual and the predicted rating is measured using classical error metrics such as root mean squared error (RMSE) and mean absolute error (MAE):


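$$RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}$$

$$MAE = \frac{1}{n}\sum_{t=1}^{n} |\hat{y}_t - y_t|$$

Here $n$ is the number of ratings being evaluated, $y_t$ is the actual rating and $\hat{y}_t$ is the predicted rating.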
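
As a quick worked example, both metrics can be computed by hand on a few numbers. The `rmse` and `mae` helpers and the sample values below are made up for illustration; Surprise ships its own implementations.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def mae(predicted, actual):
    """Mean absolute error between two equally long sequences."""
    n = len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n

# Made-up predictions and true ratings.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 3]

rmse_value = rmse(predicted, actual)  # sqrt((0.25 + 0 + 1 + 4) / 4)
mae_value = mae(predicted, actual)    # (0.5 + 0 + 1 + 2) / 4
print(rmse_value, mae_value)
```

RMSE penalizes large errors more heavily than MAE, which is why the single error of 2 pushes the RMSE above the MAE here.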

We're going to use SVD first (for background, see [this introduction to singular value decomposition](https://medium.com/data-science-group-iitr/singular-value-decomposition-elucidated-e97005fb82fa)):

In [0]:
algo = SVD()
# Run 5-fold cross-validation and print RMSE and MAE for each fold
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
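
To give a feel for what the `SVD` algorithm learns, here is a minimal pure-Python sketch of the matrix-factorization approach popularized by Simon Funk: a rating is estimated as a global mean, plus a user bias and an item bias, plus the dot product of a user factor vector and an item factor vector, all fitted with stochastic gradient descent. This is not Surprise's implementation; the `fit_sgd` helper, the toy ratings and the hyperparameter values are assumptions chosen for illustration.

```python
import math
import random

def predict(mu, bu, bi, p, q, u, i):
    """Estimate a rating: global mean + biases + factor dot product."""
    return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

def rmse(ratings, model):
    """Root mean squared error of the model on (user, item, rating) triples."""
    total = sum((r - predict(*model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))

def fit_sgd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors with stochastic gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    # Small random starting values for the latent factors.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(mu, bu, bi, p, q, u, i)
            # Move each parameter against the gradient of the regularized error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q

# Made-up (user, item, rating) triples.
toy_ratings = [
    ('a', 'x', 5), ('a', 'y', 3), ('a', 'z', 2),
    ('b', 'x', 4), ('b', 'z', 1),
    ('c', 'y', 2), ('c', 'z', 1),
]

model = fit_sgd(toy_ratings)
mu = model[0]
# Baseline: always predict the global mean.
baseline_rmse = math.sqrt(
    sum((r - mu) ** 2 for _, _, r in toy_ratings) / len(toy_ratings)
)
trained_rmse = rmse(toy_ratings, model)
print(baseline_rmse, trained_rmse)
```

The comparison is made on the training data only, so it shows that the model fits the ratings better than the global mean, not how well it generalizes; that is what the cross-validation above measures.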

In [0]:
# Build a trainset from the whole dataset and fit the model on it.
trainset = data.build_full_trainset()
algo.fit(trainset)

In [0]:
# Predict the rating a given user would give a given item.
# Raw IDs read from the file are strings.
userid = '196'
itemid = '302'
actual_rating = 4
print(algo.predict(userid, itemid, r_ui=actual_rating))