For the assignment, you need to complete the following steps:

Read the MovieLens dataset from files (ratings.csv and movies.csv) instead of loading it directly with the load_builtin method. For more information, see the Surprise Dataset module documentation.

Create two model pipelines:

1st pipeline: load data, train/test split, model training, prediction, evaluation.

2nd pipeline: load data, cross-validation.

Benchmark the user-based and item-based collaborative filtering models using the cosine and Pearson correlation similarity metrics. For this step, use the data loaded in the first step.

Notebook:

Your notebook should be readable, well organized and commented. It should contain 3 separate parts:

Data loading
Model pipelines
Model benchmarking


Submission:

The submission deadline is 20/01 at 17:42.

You need to push your code to a GitHub repository and send the link in the assignment tab.

Your repository hierarchy should be the same as the hierarchy used during the practical work (for more information, check the shared GitHub repository https://github.com/bachtn/recommender_system_practical_work_students).

NB: during the next session, I will verify that you are using a separate environment for the practical work. If not, you will get a penalty on the practical work grade.

#### So basically, I am going to split the dataset into train and test sets, and predict the movie ratings in the test set based on the item-item relationship

For a new user and a movie, I will find the most similar users and calculate the rating he/she would give to this movie.
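To make the neighbour-based prediction concrete, here is a minimal sketch on a toy matrix. It is an illustration under assumptions, not the method used later in this notebook: it uses cosine similarity over co-rated movies and a similarity-weighted mean of the `k` nearest neighbours, and the names `cosine_sim` and `predict_rating` are hypothetical. Note that the toy matrix has users as rows, the transpose of the movie-by-user table built below.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity over the movies both users rated (NaN = not rated)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return 0.0
    x, y = a[mask], b[mask]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0


def predict_rating(ratings, user, item, k=2):
    """Similarity-weighted mean of the k nearest neighbours' ratings for `item`."""
    candidates = [
        (cosine_sim(ratings[user], ratings[other]), ratings[other, item])
        for other in range(ratings.shape[0])
        if other != user and not np.isnan(ratings[other, item])
    ]
    top = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(sim for sim, _ in top)
    if weight == 0:
        return float(np.nanmean(ratings[user]))  # fallback: the user's own mean
    return sum(sim * r for sim, r in top) / weight


# Toy matrix: rows are users, columns are movies, NaN marks a missing rating.
toy = np.array([
    [5.0, 4.0, np.nan],
    [5.0, 4.0, 2.0],
    [1.0, 2.0, 5.0],
])
pred = predict_rating(toy, user=0, item=2, k=2)
print(round(pred, 2))
```

The item-item variant works the same way on the transposed matrix: similarities are computed between movies, and the prediction averages the target user's ratings of the most similar movies.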

#### Load the dataset

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/movielens/ml-latest/rating.csv')

In [3]:
print(df.shape)
df.head()

(20000263, 4)


,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


#### Transform the dataframe structure: each row represents a unique movieId and each column is a userId; the values are ratings
That would be a 26744 x 138493 table, which has too many dimensions, so I'm going to remove some columns and keep only the most active userIds.

In [5]:
print(len(df['movieId'].unique()))
print(len(df['userId'].unique()))

26744
138493
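A quick back-of-the-envelope check shows why the full pivot is impractical. This sketch only redoes the arithmetic from the counts printed above and the row count from `df.shape`; the 8 bytes per cell assumes the ratings are stored as float64 in a dense table.

```python
# Counts taken from the cell outputs above.
n_movies, n_users, n_ratings = 26744, 138493, 20000263

n_cells = n_movies * n_users
dense_gb = n_cells * 8 / 1e9   # float64 uses 8 bytes per cell
density = n_ratings / n_cells  # fraction of cells that hold a rating

print(f"{n_cells:,} cells, ~{dense_gb:.1f} GB dense, {density:.2%} filled")
```

Most cells would be empty, so restricting the table to the most active users cuts the size sharply while keeping the densest part of the data.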


Filter the userIds by number of ratings and keep the 1000 most active users.

In [32]:
userId_list = list(df['userId'].value_counts()[:1000].index)
df = df[df['userId'].isin(userId_list)]
df.shape

(1819672, 4)

Drop duplicates by userId and movieId, since a user can't rate the same movie twice. The shape is unchanged, so there were no duplicates.

In [33]:
df = df.drop_duplicates(['movieId', 'userId'])
df.shape

(1819672, 4)

In [35]:
df_new = pd.pivot_table(df,index=['movieId'],columns=['userId'],values=['rating'])
df_new

rating (rows: movieId, columns: userId)
movieId,156,208,359,572,586,741,768,775,903,982,...,136875,136989,137037,137202,137277,137343,137686,137885,138208,138325
1,5.0,4.0,5.0,5.0,2.5,5.0,,4.5,4.0,3.0,...,4.0,2.0,4.0,4.5,4.0,4.0,5.0,5.0,3.0,5.0
2,5.0,,,3.5,3.0,3.0,3.0,2.0,4.0,2.0,...,3.0,2.0,3.0,2.5,2.5,2.5,3.0,3.0,2.0,3.0
3,2.0,,,3.5,2.0,3.0,,3.5,2.0,2.0,...,,,2.0,3.5,3.0,3.5,3.0,4.0,2.0,
4,3.0,,,,,,,,2.0,2.0,...,,,,,,,,2.0,2.0,
5,3.0,,,3.5,3.0,4.0,2.0,1.0,3.0,2.5,...,,4.5,3.0,,3.0,3.5,,3.0,2.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131172,,,,,,,,,,,...,,,,,,,,,,
131174,,,,,,,,,,,...,,,,,,,,,,
131176,,,,,,,,,,,...,,,,,,,,,,
131180,,,,,,,,,,,...,,,,,,,,,,


Drop the empty rows (movies none of the 1000 users rated); this shrinks the dataset to a 24188 x 1000 matrix.

In [36]:
# Drop movies with no rating from any of the remaining users.
df_new = df_new.dropna(how='all')
df_new.shape

(24188, 1000)

#### Split the dataset into train and test