# Collaborative Filtering
This Python notebook demonstrates a number of collaborative filtering (CF) techniques, applied to the MovieLens 100K dataset. The data we need is distributed across different files (e.g., users, items, ratings).

- CF is the most prominent approach to generating recommendations.
- The underlying assumption: people who agreed in their subjective evaluations in the past are likely to agree again in the future.
- The filtering decision in CF is based on human, not machine, analysis of the content.
    - Memory-based CF: operates over the entire user database to make predictions. It finds similar users or items from the user-item rating matrix and then recommends to the active user the items that similar users rated highly (user-based, item-based).
    - Model-based CF: requires a learning phase in advance to find the optimal model parameters before making a recommendation. Once the learning phase is finished, the model-based recommender system (RS) can easily predict the ratings of the active user.
    
- We will focus mainly on the user-based approach: it uses the user database to estimate or learn a model, which is then used for prediction.
     
     **Procedure**: Given an "active user" (Alice) and an item *i* not yet seen by Alice:
        
        1.  Find a set of users (peers/nearest neighbors) who liked the same items as Alice in the past **and** who have rated item *i*.
        2.  Use their ratings (e.g., their average) to predict whether Alice will like item *i*.
        3.  Do this for all items Alice has not seen and recommend the best-rated.

    - The same procedure can be applied to the item-based approach.
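
The three steps of the procedure above can be sketched in plain Python on a tiny, made-up dataset. Everything here is an assumption for illustration: the toy ratings, the helper names (`agreement`, `predict`, `recommend`), and in particular the peer score (one over one plus the mean absolute rating difference), which is a stand-in and not the similarity measure used later in this notebook.

```python
# Toy ratings: user -> {item: rating}. All names and values are invented.
toy = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
    "dave":  {"A": 4, "B": 3, "C": 5, "D": 5},
}

def agreement(u, v):
    """Hypothetical peer score: 1 / (1 + mean absolute difference) on shared items."""
    shared = set(toy[u]) & set(toy[v])
    if not shared:
        return 0.0
    mad = sum(abs(toy[u][i] - toy[v][i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + mad)

def predict(user, item, k=2):
    # Step 1: peers who rated the item, ranked by agreement with the active user
    peers = [v for v in toy if v != user and item in toy[v]]
    peers.sort(key=lambda v: agreement(user, v), reverse=True)
    top = peers[:k]
    if not top:
        return None
    # Step 2: use the peers' ratings (here: their plain average)
    return sum(toy[v][item] for v in top) / len(top)

def recommend(user, n=1):
    # Step 3: score every unseen item and return the best-rated ones
    all_items = {i for r in toy.values() for i in r}
    unseen = all_items - set(toy[user])
    scored = {i: predict(user, i) for i in unseen}
    scored = {i: s for i, s in scored.items() if s is not None}
    return sorted(scored, key=scored.get, reverse=True)[:n]

prediction = predict("alice", "D")
top_items = recommend("alice")
```

In this toy data Bob and Dave agree with Alice most closely, so their ratings for item D drive the prediction, while Carol's rating is ignored.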

## The Framework - Standard approach
#### Pre-processing the data

In [1]:
#Again we import the relevant packages
import pandas as pd
import numpy as np

In [None]:
#Load the u.user file into a dataframe
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code'] # define names for the columns

users = pd.read_csv('./u.user', sep='|', names=u_cols,      # read the data
 encoding='latin-1')

users.head()

In [None]:
#Load the u.item file into a dataframe
i_cols = ['movie_id', 'title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('u.item', sep='|', names=i_cols, encoding='latin-1')

movies.head()

In [4]:
#Remove all information except Movie ID and title :
movies = movies[['movie_id', 'title']] # specific features

In [None]:
#Load the u.data file into a dataframe
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']

ratings = pd.read_csv('./u.data', sep='\t', names=r_cols,
 encoding='latin-1')

ratings.head()

In [6]:
#Drop the timestamp column
ratings = ratings.drop('timestamp', axis=1)

#### Separating into training and test data
Eventually, we would like to build a prediction model based on our rating data, which runs from 1 to 5. A dichotomous classification model (true/false) does not care about the magnitude of the error in a prediction (e.g., if the true value is 5, predictions of 1 and 4 are equally wrong). Instead, we would like the model to be 'punished' in line with a regular regression model (e.g., a prediction of 4 for a true value of 5 is better than a prediction of 2).

To evaluate the model in this way, we first need to separate our data into training and test sets. In this example, we put 75% of the data in a training set and 25% in a test set. You are of course free to change these parameters below.

### Method below
The split is done in a slightly 'hacky' way: we assume that the user_id is the target variable (or y) and that our ratings dataframe comprises the predictor variables (or X). We then pass these two variables into scikit-learn's train_test_split function and stratify along y. This ensures that the proportion of each class (here, each user's share of the ratings) is the same in both the training and test datasets:

In [None]:
#Import the train_test_split function
from sklearn.model_selection import train_test_split

# Assign X as the original ratings dataframe and y as the user_id column of ratings.
X = ratings.copy()
y = ratings['user_id']

# Split into training and test datasets, stratified along user_id, with 25% of the data as test data
''' train_test_split(
        Predictor variables (X),
        Outcome variable (label: y),
        test_size,
        stratify:  split data in a stratified way,
        random_state:  reproducibility variable (to find the same results)
) '''

# random_state=42 is an arbitrary choice; any fixed integer makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
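
To make the effect of stratifying along user_id concrete, here is a minimal hand-rolled sketch in plain Python. It is not how scikit-learn implements `train_test_split`; the function name `stratified_split`, the toy rows, and the seed are assumptions for illustration. The point is only that every user contributes roughly the same fraction of ratings to the test set.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_size=0.25, seed=0):
    """Hold out about `test_size` of the rows of every group (here: every user)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    train, test = [], []
    for members in groups.values():
        members = members[:]          # copy so the input is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy (user_id, movie_id, rating) rows: 3 users with 8 ratings each
rows = [(u, m, (u + m) % 5 + 1) for u in (1, 2, 3) for m in range(8)]
train_rows, test_rows = stratified_split(rows, key=lambda r: r[0])

# Count how many test ratings each user contributes
test_per_user = defaultdict(int)
for user_id, _, _ in test_rows:
    test_per_user[user_id] += 1
```

Because each user appears in both sets, every user in the test set also has a row in the ratings matrix that is built from the training set.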

#### Evaluation

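Root Mean Squared Error (RMSE) is the most common metric for the error between predicted and actual values; the lower the RMSE, the better the model.
$$ \mathrm{RMSE} = \sqrt{\frac{\sum_{(u,i,r)\in R}{(\hat{r}_{u,i} - r_{u,i})^2}}{|R|}}$$

Here $R$ is the set of (user, item, rating) triples being evaluated, $r_{u,i}$ is the actual rating of user $u$ for item $i$, and $\hat{r}_{u,i}$ is the predicted rating.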

We will first use this one.
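
As a quick worked example of the formula, the sketch below computes RMSE by hand for two sets of predictions. The helper name `rmse_manual` and the numbers are made up for illustration. It shows the property we want: predictions that are far off are punished more heavily than predictions that are nearly right.

```python
import math

def rmse_manual(y_true, y_pred):
    # Mean of the squared errors, then the square root
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [5, 3, 4, 1]
close_pred = [4, 3, 4, 2]   # two predictions are off by 1
far_pred = [2, 3, 4, 4]     # the same two predictions are off by 3

rmse_close = rmse_manual(y_true, close_pred)   # sqrt(2 / 4)
rmse_far = rmse_manual(y_true, far_pred)       # sqrt(18 / 4)
```

Both prediction sets are wrong on exactly two ratings, so a true/false accuracy measure would treat them the same, while RMSE clearly prefers the first one.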

In [8]:
#Import the mean_squared_error function
from sklearn.metrics import mean_squared_error

#Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [9]:
# Define the baseline model to always return 3.
def baseline(user_id, movie_id):
    return 3.0

In [2]:
#Function to compute the RMSE score obtained on the testing set by a model
def score(cf_model):
    
    #Construct a list of user-movie tuples from the testing dataset
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    
    #Predict the rating for every user-movie tuple
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])
    print(y_pred)
    
    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['rating'])
    
    #Return the final RMSE score
    return rmse(y_true, y_pred)

In [None]:
# The RMSE score of our baseline model, which predicts 3 for all ratings

score(baseline)

## User Based Collaborative Filtering

### Ratings Matrix
The columns represent the movies, the rows represent the users. Each cell is the rating given by user i to movie j.

In [None]:
#Build the ratings matrix using pivot_table function
r_matrix = X_train.pivot_table(values='rating', index='user_id', columns='movie_id')

r_matrix.head()

### Mean
First, we will build the simplest collaborative filter possible: we compute the mean rating of each movie in our training dataset. In doing so, we give every user equal weight in determining the rating (which is of course not very accurate, right?).

If a movie has no ratings in the training dataset, we default to 3.0, the midpoint of the rating scale.

In [5]:
#User Based Collaborative Filter using Mean Ratings
def cf_user_mean(user_id, movie_id):
    
    #Check if movie_id exists in r_matrix
    if movie_id in r_matrix:
        #Compute the mean of all the ratings given to the movie
        mean_rating = r_matrix[movie_id].mean()
    
    else:
        #Default to a rating of 3.0 in the absence of any information
        mean_rating = 3.0
    
    return mean_rating

In [None]:
#Compute RMSE for the Mean model
score(cf_user_mean)

### Weighted Mean
The weighted mean is computed by multiplying each rating by a weight. As we've learned, in user-user CF this weight is the similarity score between two users.

The rating of user $u$ for movie $m$ can be predicted by

$$ \hat{r}_{u,m} = \frac{\sum_{u' \neq u} \mathrm{sim}(u,u') \cdot r_{u',m}}{\sum_{u' \neq u} |\mathrm{sim}(u,u')|} $$

where both sums run over the other users $u'$ who have rated movie $m$. In words: (similarity between two users * rating of the other user for the item), summed over those users, divided by the sum of the absolute similarity scores.

For the sake of the exercise, we will focus on cosine similarity. Since scikit-learn's cosine similarity can't handle missing values (i.e., NaN), we need to convert them to 0.
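
To see what the zero imputation does, here is a minimal sketch that computes cosine similarity by hand for two users. The helper names (`cosine`, `fill_missing`) and the ratings are made up for illustration, with `None` standing in for NaN.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the Euclidean lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def fill_missing(row):
    # Replace missing ratings (None) with 0, mirroring fillna(0)
    return [0 if r is None else r for r in row]

# Ratings of two users for three movies
user_a = [5, None, 4]
user_b = [5, 3, None]

sim = cosine(fill_missing(user_a), fill_missing(user_b))
sim_shared_only = cosine([5], [5])   # similarity on the single co-rated movie
```

The two users agree perfectly on the one movie they both rated, yet the zero-filled similarity is clearly below 1: movies rated by only one of them add to the vector lengths but not to the dot product. This is a known side effect of imputing zeros.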

In [15]:
#Create a dummy ratings matrix with all null values imputed to 0
r_matrix_dummy = r_matrix.copy().fillna(0)

In [16]:
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity matrix using the dummy ratings matrix
cosine_sim = cosine_similarity(r_matrix_dummy, r_matrix_dummy)

In [17]:
#Convert into pandas dataframe 
cosine_sim = pd.DataFrame(cosine_sim, index=r_matrix.index, columns=r_matrix.index)

cosine_sim.head(10)

| user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 934 | 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 | 943 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 0.118076 | 0.029097 | 0.011628 | 0.264677 | 0.312419 | 0.308729 | 0.224269 | 0.026017 | 0.286411 | ... | 0.308475 | 0.055872 | 0.197862 | 0.131367 | 0.152449 | 0.084456 | 0.293293 | 0.056765 | 0.103536 | 0.326491 |
| 2 | 0.118076 | 1.0 | 0.099097 | 0.10768 | 0.034279 | 0.152789 | 0.086705 | 0.078864 | 0.06894 | 0.092399 | ... | 0.086927 | 0.259636 | 0.289092 | 0.318824 | 0.149105 | 0.186347 | 0.168034 | 0.106748 | 0.136796 | 0.080358 |
| 3 | 0.029097 | 0.099097 | 1.0 | 0.252131 | 0.026893 | 0.062539 | 0.039767 | 0.089474 | 0.078162 | 0.03767 | ... | 0.040918 | 0.019031 | 0.065417 | 0.055373 | 0.086503 | 0.018418 | 0.096993 | 0.109631 | 0.092574 | 0.018987 |
| 4 | 0.011628 | 0.10768 | 0.252131 | 1.0 | 0.0 | 0.045543 | 0.078812 | 0.095354 | 0.059498 | 0.053879 | ... | 0.024226 | 0.050703 | 0.056561 | 0.107294 | 0.098892 | 0.0 | 0.1329 | 0.142798 | 0.097066 | 0.015176 |
| 5 | 0.264677 | 0.034279 | 0.026893 | 0.0 | 1.0 | 0.202843 | 0.299619 | 0.163724 | 0.038474 | 0.153021 | ... | 0.262547 | 0.048524 | 0.048312 | 0.022202 | 0.09191 | 0.066 | 0.156172 | 0.115842 | 0.124297 | 0.267574 |
| 6 | 0.312419 | 0.152789 | 0.062539 | 0.045543 | 0.202843 | 1.0 | 0.375963 | 0.131795 | 0.110944 | 0.400758 | ... | 0.287549 | 0.080312 | 0.162988 | 0.182856 | 0.114262 | 0.09209 | 0.261859 | 0.097606 | 0.206104 | 0.187637 |
| 7 | 0.308729 | 0.086705 | 0.039767 | 0.078812 | 0.299619 | 0.375963 | 1.0 | 0.211282 | 0.107795 | 0.328923 | ... | 0.290002 | 0.07417 | 0.094619 | 0.084235 | 0.11562 | 0.100625 | 0.233843 | 0.039199 | 0.224227 | 0.296332 |
| 8 | 0.224269 | 0.078864 | 0.089474 | 0.095354 | 0.163724 | 0.131795 | 0.211282 | 1.0 | 0.03704 | 0.183375 | ... | 0.165008 | 0.066843 | 0.058766 | 0.068759 | 0.087159 | 0.129381 | 0.188662 | 0.121223 | 0.08391 | 0.273238 |
| 9 | 0.026017 | 0.06894 | 0.078162 | 0.059498 | 0.038474 | 0.110944 | 0.107795 | 0.03704 | 1.0 | 0.155435 | ... | 0.011708 | 0.0 | 0.10171 | 0.034568 | 0.045002 | 0.052699 | 0.107486 | 0.055766 | 0.070065 | 0.088281 |
| 10 | 0.286411 | 0.092399 | 0.03767 | 0.053879 | 0.153021 | 0.400758 | 0.328923 | 0.183375 | 0.155435 | 1.0 | ... | 0.278558 | 0.04931 | 0.153506 | 0.065471 | 0.060088 | 0.033686 | 0.197107 | 0.085402 | 0.118945 | 0.162538 |


Using the cosine similarity matrix above, we are now in a position to efficiently calculate the weighted mean scores for this model. However, implementing this model in code is slightly more complex than the regular mean above, because we only need to consider the similarity scores of users who have a non-null rating for the movie. Hence, we need to exclude all users that have not rated a certain movie m. To do this, we cross-check the similarity scores against the ratings matrix built earlier.

In [6]:
#User Based Collaborative Filter using Weighted Mean Ratings
def cf_user_wmean(user_id, movie_id):
    
    #Check if movie_id exists in r_matrix
    if movie_id in r_matrix:
        
        #Get the similarity scores for the user in question with every other user
        sim_scores = cosine_sim[user_id]
        
        #Get the user ratings for the movie in question
        m_ratings = r_matrix[movie_id]
        
        #Extract the indices containing NaN in the m_ratings series
        idx = m_ratings[m_ratings.isnull()].index
        
        #Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()
        
        #Drop the corresponding cosine scores from the sim_scores series
        sim_scores = sim_scores.drop(idx)
        
        #Sum of the similarity scores of the users who rated the movie
        sim_sum = sim_scores.sum()
        
        #Guard against division by zero when no rater is similar to the user
        if sim_sum == 0:
            sim_sum = 1
        
        #Compute the final weighted mean
        wmean_rating = np.dot(sim_scores, m_ratings) / sim_sum
    
    else:
        #Default to a rating of 3.0 in the absence of any information
        wmean_rating = 3.0
    
    return wmean_rating

Since all ratings are positive, the cosine similarity scores are never negative, so we do not need to take the absolute value (modulus) of the weights in the denominator. As you will see, the improvement in RMSE is very small given the longer runtime of the model.

In [None]:
#compute the RMSE score for this model
score(cf_user_wmean)
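
The weighted mean can be checked by hand on a tiny example. The sketch below is illustrative only: the similarity values, the ratings, and the helper name `weighted_mean` are made up, and falling back to a default of 3.0 when no rater has a non-zero similarity is a choice made for this sketch.

```python
# Similarity of the active user to three neighbours (made-up values)
sims = {"u1": 0.9, "u2": 0.3, "u3": 0.5}
# The neighbours' ratings for movie m; None means "not rated"
movie_ratings = {"u1": 5, "u2": 2, "u3": None}

def weighted_mean(sims, movie_ratings, default=3.0):
    # Keep only the neighbours who actually rated the movie
    rated = [u for u, r in movie_ratings.items() if r is not None]
    denom = sum(abs(sims[u]) for u in rated)
    if denom == 0:
        return default   # assumption of this sketch: fall back to the scale midpoint
    return sum(sims[u] * movie_ratings[u] for u in rated) / denom

weighted = weighted_mean(sims, movie_ratings)   # (0.9*5 + 0.3*2) / (0.9 + 0.3)
plain = (5 + 2) / 2                             # unweighted mean of the same ratings
```

The weighted prediction lands closer to the rating of `u1`, the most similar neighbour, than the plain mean does. The neighbour `u3` has no rating for the movie and therefore contributes to neither the numerator nor the denominator.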

### Demographics
Demographic collaborative filters rely on the intuition that users with similar backgrounds (ages, sex, etc.) are more likely to have similar tastes. This means that we do not need to take all ratings of all users into account, but only the ratings of those that are relevant to another user. 

The first demographic filter we will build simply takes the gender of the user, computes the (weighted) mean rating of a movie by that particular gender, and returns that as the predicted value. To obtain this information, we need to merge our predictor set with the demographic dataframe.

In [20]:
#Merge the original users dataframe with the training set 
merged_df = pd.merge(X_train, users)

merged_df.head()

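|  | user_id | movie_id | rating | age | sex | occupation | zip_code |
|---|---|---|---|---|---|---|---|
| 0 | 889 | 684 | 2 | 24 | M | technician | 78704 |
| 1 | 889 | 279 | 2 | 24 | M | technician | 78704 |
| 2 | 889 | 29 | 3 | 24 | M | technician | 78704 |
| 3 | 889 | 190 | 3 | 24 | M | technician | 78704 |
| 4 | 889 | 232 | 3 | 24 | M | technician | 78704 |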


Compute the mean rating given by each gender.

In [21]:
#Compute the mean rating of every movie by gender
gender_mean = merged_df[['movie_id', 'sex', 'rating']].groupby(['movie_id', 'sex'])['rating'].mean()

## gender_mean

In [22]:
#Set the index of the users dataframe to the user_id
users = users.set_index('user_id')

In [23]:
#Gender Based Collaborative Filter using Mean Ratings
def cf_gender(user_id, movie_id):
    
    #Check if movie_id exists in r_matrix (or training set)
    if movie_id in r_matrix:
        #Identify the gender of the user
        gender = users.loc[user_id]['sex']
        
        #Check if the gender has rated the movie
        if gender in gender_mean[movie_id]:
            
            #Compute the mean rating given by that gender to the movie
            gender_rating = gender_mean[movie_id][gender]
        
        else:
            gender_rating = 3.0
    
    else:
        #Default to a rating of 3.0 in the absence of any information
        gender_rating = 3.0
    
    return gender_rating
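
The same gender-based lookup can be sketched without pandas, which makes the logic easy to verify by hand. All names (`users_demo`, `train_rows`, `predict_by_gender`) and values below are hypothetical and only mirror the structure of the dataframes used in this notebook.

```python
from collections import defaultdict

users_demo = {1: "F", 2: "M", 3: "F", 4: "M"}   # user_id -> sex (toy data)
train_rows = [                                   # (user_id, movie_id, rating)
    (1, 10, 5),
    (3, 10, 4),
    (2, 10, 2),
    (4, 11, 3),
]

# Accumulate [sum of ratings, number of ratings] per (movie, gender) group
totals = defaultdict(lambda: [0.0, 0])
for user_id, movie_id, rating in train_rows:
    key = (movie_id, users_demo[user_id])
    totals[key][0] += rating
    totals[key][1] += 1
group_mean = {key: s / n for key, (s, n) in totals.items()}

def predict_by_gender(user_id, movie_id, default=3.0):
    # Mean rating of the movie among users of the same gender, else the default
    return group_mean.get((movie_id, users_demo[user_id]), default)

pred_f = predict_by_gender(1, 10)         # mean of the two female ratings
pred_m = predict_by_gender(2, 10)         # the single male rating
pred_missing = predict_by_gender(1, 11)   # no female rating for movie 11
```

A group mean that rests on a single rating, as for `pred_m`, is very noisy, which is one reason why slicing the data by demographics does not automatically improve the predictions.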

In [None]:
#Compute the RMSE Score
score(cf_gender)

Since the RMSE is slightly worse than for the other approaches, we can assume that gender is probably not a good predictor of movie taste. Let's expand the filter by using gender and occupation simultaneously. Because... doctors must like doctor movies, etc.???

In [None]:
#Compute the mean rating by gender and occupation
gen_occ_mean = merged_df[['sex', 'rating', 'movie_id', 'occupation']].pivot_table(
    values='rating', index='movie_id', columns=['occupation', 'sex'], aggfunc='mean')

gen_occ_mean.head()

`pivot_table` is another way of expressing the groupby command, but is slightly more compact.

In [30]:
#Gender and Occupation Based Collaborative Filter using Mean Ratings
def cf_gen_occ(user_id, movie_id):
    
    #Check if movie_id exists in gen_occ_mean
    if movie_id in gen_occ_mean.index:
        
        #Identify the user
        user = users.loc[user_id]
        
        #Identify the gender and occupation
        gender = user['sex']
        occ = user['occupation']
        
        #Check if the occupation has rated the movie
        if occ in gen_occ_mean.loc[movie_id]:
            
            #Check if the gender has rated the movie
            if gender in gen_occ_mean.loc[movie_id][occ]:
                
                #Extract the required rating
                rating = gen_occ_mean.loc[movie_id][occ][gender]
                
                #Default to 3.0 if the rating is null
                if np.isnan(rating):
                    rating = 3.0
                
                return rating
            
    #Return the default rating    
    return 3.0

In [None]:
#RMSE Score
score(cf_gen_occ)

OK, this has been the smallest improvement over the baseline so far. Apparently, this is not the way forward to improve the model accuracy, but you are free to experiment with different demographic characteristics!

## Item based collaborative filtering
You could also focus on item-item CF and compute the pairwise similarity of every item in the inventory. We will again apply a weighted mean function to come up with our model, as we expect users to give similar ratings to movies for which we have computed that they are similar.
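
Since the notebook does not implement the item-based variant, here is a minimal sketch of how it could look in plain Python. The toy ratings and the names (`item_vector`, `predict_item_based`) are assumptions for illustration. Missing ratings are filled with 0, as in the user-based filter above, and falling back to 3.0 is again a choice made for this sketch.

```python
import math

# Toy data: user -> {movie: rating}; all values are invented
toy_ratings = {
    "u1": {"m1": 5, "m2": 4, "m3": 1},
    "u2": {"m1": 4, "m2": 5, "m3": 2},
    "u3": {"m1": 1, "m2": 2, "m3": 5},
    "u4": {"m1": 5, "m3": 1},
}
user_order = sorted(toy_ratings)

def item_vector(movie):
    # One entry per user; 0 where the user has not rated the movie
    return [toy_ratings[u].get(movie, 0) for u in user_order]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict_item_based(user, movie, default=3.0):
    # Weighted mean of the user's own ratings, weighted by item-item similarity
    target = item_vector(movie)
    num = den = 0.0
    for other_movie, rating in toy_ratings[user].items():
        sim = cosine(target, item_vector(other_movie))
        num += sim * rating
        den += abs(sim)
    return num / den if den else default

sim_m2_m1 = cosine(item_vector("m2"), item_vector("m1"))
sim_m2_m3 = cosine(item_vector("m2"), item_vector("m3"))
item_pred = predict_item_based("u4", "m2")
```

Movie `m2` is more similar to `m1` (which `u4` rated 5) than to `m3` (which `u4` rated 1), so the prediction is pulled above the midpoint of `u4`'s two ratings.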

## Model Based Approaches
The previous examples have been memory-based (think about why :-). The upcoming methods will make use of model-based approaches, in the sense that we are actually going to apply machine learning!

The previous example with demographics was a bit too simplistic. Now, we are going to move beyond the metadata that we have by using clustering algorithms such as k-means to group users into clusters, and then taking only the users from the same cluster into consideration when predicting ratings.
Here, we use k-nearest neighbors (kNN). The steps to predict the rating of user u for movie m are as follows:
1. Find the k-nearest neighbors of u who have rated movie m.
2. Output the average rating of the k users for the movie m.
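
The two steps above can be sketched in a few lines of plain Python. This is an illustrative toy version, not what the Surprise library does internally: the data, the helper names, and the choice to measure similarity with cosine on co-rated movies only are all assumptions of this sketch.

```python
import math

# Toy data: user -> {movie: rating}; all names and values are invented
toy_ratings = {
    "u":  {"m1": 5, "m2": 4},
    "n1": {"m1": 5, "m2": 4, "m": 5},
    "n2": {"m1": 4, "m2": 5, "m": 4},
    "n3": {"m1": 1, "m2": 5, "m": 1},
    "n4": {"m1": 5, "m2": 4},          # has not rated m, so never a candidate
}

def cosine_on_shared(a, b):
    # Cosine similarity restricted to the movies both users have rated
    shared = sorted(set(toy_ratings[a]) & set(toy_ratings[b]))
    if not shared:
        return 0.0
    va = [toy_ratings[a][x] for x in shared]
    vb = [toy_ratings[b][x] for x in shared]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm

def knn_predict(user, movie, k=2, default=3.0):
    # Step 1: the k most similar users among those who rated the movie
    candidates = [v for v in toy_ratings if v != user and movie in toy_ratings[v]]
    neighbours = sorted(candidates, key=lambda v: cosine_on_shared(user, v),
                        reverse=True)[:k]
    if not neighbours:
        return default, []
    # Step 2: the plain average of their ratings for the movie
    mean = sum(toy_ratings[v][movie] for v in neighbours) / len(neighbours)
    return mean, neighbours

knn_pred, knn_neighbours = knn_predict("u", "m")
```

Note that Surprise's `KNNBasic`, according to its documentation, weights each neighbour's rating by its similarity rather than taking a plain average, so its predictions will differ from this sketch.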

This simple approach happens to be among the most popular algorithms in use. We will implement it using a popular and robust library called 'Surprise', which is also a scikit. Surprise is an acronym of 'Simple Python Recommendation System Engine'. You might need to install it first for it to work! Use the following command in your command prompt environment:

sudo pip3 install scikit-surprise


In [None]:
#Import the required classes and methods from the surprise library
from surprise import Reader, Dataset, KNNBasic
from surprise.model_selection import cross_validate

#Define a Reader object
#The Reader object helps in parsing the file or dataframe containing ratings
reader = Reader()

#Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

#Define the algorithm object; in this case kNN
knn = KNNBasic()

#Evaluate the performance in terms of RMSE using  cross validation

cross_validate(knn, data, measures=['RMSE'])

From the output above, you can compute the mean RMSE by averaging the test_rmse values. The result is much better than our previous approaches!

#### Singular Value Decomposition
Remember from the previous slides what this is? It aims to reduce the number of dimensions in your data, as not every rating should be considered a unique dimension. It is closely related to Principal Component Analysis. You separate a single user-item rating matrix into three parts: a user rating part (user preferences for each relevant dimension), an item rating part (how well an item scores on each dimension), and a weights part (how relevant each dimension is).
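
As a compact reminder, the three parts described above can be written as follows. The symbols ($R$, $U$, $\Sigma$, $V$, $k$, $n_u$, $n_i$) are notation chosen here for illustration and do not come from the notebook:

$$ R \approx U \Sigma V^{T}, \qquad U \in \mathbb{R}^{n_u \times k},\quad \Sigma \in \mathbb{R}^{k \times k},\quad V \in \mathbb{R}^{n_i \times k} $$

$$ \hat{r}_{u,i} = \sum_{f=1}^{k} U_{u,f}\,\sigma_{f}\,V_{i,f} $$

Here $R$ is the $n_u \times n_i$ user-item rating matrix, row $u$ of $U$ holds the preferences of user $u$ for each of the $k$ dimensions, row $i$ of $V$ holds the scores of item $i$ on those dimensions, and the diagonal entries $\sigma_f$ of $\Sigma$ say how relevant each dimension is.

Note that the `SVD` class in Surprise is, according to its documentation, a matrix factorization model in this spirit rather than an exact decomposition: it learns the factors from the observed ratings only and predicts $\hat{r}_{u,i} = \mu + b_u + b_i + q_i^{T} p_u$, where $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, and $p_u$ and $q_i$ are the learned user and item factor vectors.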

In [None]:
#Import SVD
from surprise import SVD

#Define the SVD algorithm object
svd = SVD()

#Evaluate the performance in terms of RMSE
cross_validate(svd, data, measures=['RMSE'])

Again, we find a better RMSE score!

Surprise Library: https://surpriselib.com/