**INSTRUCTIONS**

For the assignment, you need to do the following steps :

1. Read the MovieLens dataset from a file (ratings.csv & movies.csv) instead of loading it directly with using the load_builtin method. For more informations, check the Surprise Dataset module documentation.

2. Create 2 model pipelines :

1st pipeline : Load data, Train test split, model training, prediction, evaluation.

2nd pipeline : Load data, cross validation.

3. Benchmark the User based and item based collaborative filtering models using the cosine and pearson correlation similarity metrics. In this step you need to use the data loaded in the 1st step.

**Notebook :**

Your notebook should be leasable, well organized and commented. It should contain 3 seperate parts :

- Data loading
- Model pipelines
- Model benchmarking


**Submission :**

The submission deadline is the 20 / 01 @ 17:42.

You need to push your code in a github repository and to send the link in the assignment tab.

Your repository hierarchy should be the same as the hierarchy used during the practical work (for more information check the shared github repository https://github.com/bachtn/recommender_system_practical_work_students)

**NB :** during the next session, I will verify that you are using a separate environment for the practical work. If not you will get a penalty on the practical work grade.

If you have any questions, you need to post them in the Discussions channel.

# Data loading

In [1]:
import pandas as pd
from pathlib import Path
from surprise import Dataset
from surprise import Reader
DATA_DIR = Path('../data/movielens/ml-latest')

reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)

ratings = Dataset.load_from_file(DATA_DIR / 'ratings.csv', reader=reader)


movies = pd.read_csv(DATA_DIR / 'movies.csv')

# Model Pipeline

## First Pipeline

**Load the data : For this we will use ml-latest-small**

In [2]:
DATA_DIR = Path('../data/movielens/ml-latest-small')

reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)

ratings_small = Dataset.load_from_file(DATA_DIR / 'ratings.csv', reader=reader)

data = ratings_small

movies_small = pd.read_csv(DATA_DIR / 'movies.csv')

**Train test split**

In [3]:
from surprise.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=42)
train.n_users, train.n_items

(610, 8928)

In [4]:
from surprise import KNNBasic

**Model Training user based**

In [5]:

# users based and cosine similarity
sim_options = {'name': 'cosine',
               'user_based': True  # compute  similarities between users
               }
algo_users_cos = KNNBasic(sim_options=sim_options)
model_users_cos = algo_users_cos.fit(train)
# users based and pearson correlation
sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo_users_pears = KNNBasic(sim_options=sim_options)
model_users_pears = algo_users_pears.fit(train)


Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


**Model Training item based**

In [6]:
# item based and cosine similarity
sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo_items_cos = KNNBasic(sim_options=sim_options)
model_items_cos = algo_items_cos.fit(train)

# user based and pearson correlation
sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo_items_pears = KNNBasic(sim_options=sim_options)
model_items_pears = algo_items_pears.fit(train)


Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


**Predictions**

In [7]:
predictions_users_cos = model_users_cos.test(test)
predictions_users_pears = model_users_pears.test(test)
predictions_items_cos = model_items_cos.test(test)
predictions_items_pears = model_items_pears.test(test)

**Evaluation**

In [8]:
from surprise import accuracy

accuracy.rmse(predictions=predictions_users_cos)
accuracy.rmse(predictions=predictions_users_pears)
accuracy.rmse(predictions=predictions_items_cos)
accuracy.rmse(predictions=predictions_items_pears)

RMSE: 0.9806
RMSE: 0.9900
RMSE: 0.9800
RMSE: 0.9900


0.9900277794148814

## Second Pipeline

In [9]:
from surprise.model_selection import cross_validate
# Load Data
data = data
# Cross Validation
cv_users_cos = cross_validate(model_users_cos, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
cv_users_pears = cross_validate(model_users_pears, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
cv_items_cos = cross_validate(model_items_cos, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
cv_items_pears = cross_validate(model_items_pears, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity mat

# Model benchmarking

**Load the data:  for this we will use ml-latest loaded in question 1**

In [10]:
data = ratings # This was loaded in the first part ( ml-latest)

**Let us use GridSearchCV to implement the different model, the cross validation and finally to perfom the benchmarking**

In [None]:
from surprise.model_selection import GridSearchCV

param_grid = {'sim_options': {'name': ['pearson_baseline', 'cosine'],
                               'user_based': [False, True]}
              }
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=5)

gs.fit(data)

Still processing................... !! the data is really huge

In [None]:
pd.DataFrame.from_dict(gs.cv_results)

In [None]:
# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])