## Introduction

This tutorial introduces the basics of implementing a collaborative filtering recommender system on a Netflix dataset.

Recommender systems play a key part in many data science applications. In a nutshell, the goal of a recommender system is to maximize profit (or user happiness) by recommending items that users want. The biggest challenge in achieving this goal is estimating ratings for items that a user has not yet seen.

An example of recommender systems - Amazon:
[<img src="https://cdn-images-1.medium.com/max/2000/1*UEIb9b7VT0u5NMBeZajxjg.png">](https://cdn-images-1.medium.com/max/2000/1*UEIb9b7VT0u5NMBeZajxjg.png)

The problem is further complicated by the variety of items, which range from movies to songs, posts, ads, and products. Companies and researchers collect millions of user ratings for such items, so we need to know how to extract insights from the data and build computational models that achieve this goal.

There are generally two high-level approaches to modeling the recommender system problem - content filtering and collaborative filtering:

<img src="https://www.bizofit.com/blog/wp-content/uploads/2017/03/Picture1.png">

Content filtering requires side information (or attributes) about the items, which is unavailable or insufficient in most real-world problems. A big advantage of collaborative filtering is that it needs no such side information, only the users' preferences over the items. The essence of collaborative filtering is the assumption that personal tastes are correlated, i.e. we can predict a target user's preferences by looking at the preferences of a group of users who are "similar" to this user.
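To make this assumption concrete, here is a minimal sketch on a made-up rating matrix. The numbers and variable names are illustrative assumptions, not part of the Netflix data. It finds the two users most similar to a target user by cosine similarity and averages their ratings for a movie the target user has not rated.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],   # target user (movie 2 is unrated)
    [5, 5, 4, 1],   # similar tastes to the target user
    [4, 5, 5, 2],   # similar tastes to the target user
    [1, 1, 2, 5],   # opposite tastes
], dtype=float)

target = 0

# Cosine similarity between the target user and every user.
norms = np.linalg.norm(ratings, axis=1)
sims = ratings @ ratings[target] / (norms * norms[target])
sims[target] = -np.inf              # never pick the target as its own neighbor

neighbors = np.argsort(-sims)[:2]   # indices of the two most similar users
predicted = ratings[neighbors, 2].mean()
print(neighbors, predicted)
```

The two like-minded users are selected and the user with opposite tastes is ignored, which is exactly the behavior the correlation assumption relies on.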

This is the motivation for this tutorial: to demonstrate the effectiveness of collaborative filtering algorithms in tackling the challenges of building recommender systems.

In this tutorial, we compare the performance of several rating algorithms and examine their differences. In particular, we focus on the comparison between two nearest neighbor approaches: the memory-based approach (nearest neighbors based on user-user similarity) and the model-based approach (nearest neighbors based on item-item similarity).

Here's an illustration of the two approaches:
[<img src="http://www.salemmarafi.com/wp-content/uploads/2014/04/collaborativeFiltering-960x540.jpg">](http://www.salemmarafi.com/wp-content/uploads/2014/04/collaborativeFiltering-960x540.jpg)

This tutorial will also introduce a few parameter tuning tricks in the implementation of the algorithms, including tuning the number of neighbors and the similarity metric.

### Tutorial content

The Netflix dataset used in this tutorial can be downloaded from [here](https://drive.google.com/file/d/1GYL2Uy7XEX6-ziGNRxTwdMyyAcHXVLdG/view?usp=sharing).  

We will cover the following topics in this tutorial:
- Dataset Description

- Corpus Exploration

- User-User Similarity: The memory-based Collaborative Filtering approach

- Item-Item Similarity: The model-based Collaborative Filtering approach

- Pearson's Correlation Coefficient (PCC): The bias-reduced Collaborative Filtering approach

##  Dataset Description

The dataset has 3 components: train/development/test.  The development and testing sets were created by sampling points from the training set.

- Training set:

The training set is the input to the Collaborative Filtering system. The format of the data is as follows:

"MovieID","UserID","Rating","RatingDate"
MovieID1,UserID11,rating_score_for_UserID11_to_MovieID1,the_date_of_rating
MovieID1,UserID12,rating_score_for_UserID12_to_MovieID1,the_date_of_rating

where rating_score_for_UserID*_to_MovieID* are decimal values between 1.0 and 5.0 and dates have the format YYYY-MM-DD.

In [1]:
import pandas as pd
train_data = pd.read_csv('train.csv',header=None)
train_data.columns = ['movieid', 'userid', 'rating', 'date']
train_data.head()

Unnamed: 0,movieid,userid,rating,date
0,0,0,3,2001-03-05
1,0,1,3,2001-02-15
2,1,2,3,2000-01-22
3,1,3,4,2001-02-15
4,1,5,3,2000-01-21


- Development set:

The development set consists of movie-user pairs without a rating.  The format of the data is as follows:

"MovieID","UserID"
MovieID1,UserID11
MovieID1,UserID12

Our task is to predict the ratings of these pairs given the training data.

In [2]:
dev_data = pd.read_csv('dev.csv',header=None)
dev_data.columns = ['movieid', 'userid']
dev_data.head()

Unnamed: 0,movieid,userid
0,2,23
1,3,108
2,3,29
3,3,80
4,4,137


- Development set queries:
 
The user vectors from the training data corresponding to the users in the development set are in the format:

"UserID" "MovieID:Rating" "MovieID:Rating" ... "MovieID:Rating"

Note: when finding the k-nearest neighbors for a query of User_i, User_i does not count as one of the k-nearest neighbors.

In [3]:
dev_queries = pd.read_csv('dev.queries',header=None)
dev_queries.head()

Unnamed: 0,0
0,0 0:3 134:5 153:3 159:5 175:3 178:3 191:4 239:...
1,1 0:3 955:4 1092:4 1532:5 2168:4 2396:4 2987:4...
2,2 1:3 88:2 304:3 338:5 347:4 404:5 451:3 466:3...
3,4 18:4 19:4 32:2 62:5 66:3 70:5 80:5 85:2 89:3...
4,5 1:3 4:2 19:4 32:4 48:4 52:2 54:1 74:1 116:2 ...
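As a rough illustration of this query format, the hypothetical helper `parse_query_line` below (not part of the tutorial's code) splits one line into a user ID and a dictionary of movie ratings. The sample line reuses the first few entries of one of the rows shown above.

```python
def parse_query_line(line):
    """Parse 'UserID MovieID:Rating MovieID:Rating ...' into (user_id, {movie_id: rating})."""
    tokens = line.split()
    user_id = int(tokens[0])
    ratings = {}
    for token in tokens[1:]:
        movie_id, rating = token.split(":")
        ratings[int(movie_id)] = float(rating)
    return user_id, ratings

user_id, ratings = parse_query_line("4 18:4 19:4 32:2 62:5")
print(user_id, ratings)
```

A dictionary per user is only one possible representation. The same information could be loaded directly into a sparse row if the full user vector is needed.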


- Test set format:

The test set and test queries are given in the same format as the development set.

## Corpus Exploration

We begin the corpus exploration with a few basic statistics: 
- the total number of users
- the total number of movies
- the number of times any movie was rated '1'
- the number of times any movie was rated '3'
- the number of times any movie was rated '5'
- the average movie rating across all users and movies

In [4]:
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

In [5]:
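def readTrain(file):
    # build a sparse user x movie matrix from "MovieID,UserID,Rating,Date" rows
    user = []
    movie = []
    rating = []
    with open(file, 'r') as tr:
        for line in tr:
            t = line.split(",")
            user.append(int(t[1]))
            movie.append(int(t[0]))
            rating.append(int(t[2])) # raw rating, kept for the statistics below

    # pre-process here, option 2 (inferred from the "+ 3" added to every
    # prediction later on): store rating - 3 so that a missing entry (0)
    # behaves like a neutral rating of 3
    centered = [r - 3 for r in rating]
    m = coo_matrix((centered, (user, movie)), dtype = float)
    m = m.tocsr()
    return m, rating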

In [6]:
m, rating = readTrain('train.csv')
print("The total number of nonzero entries:", m.nnz)
print("The total number of users:", m.shape[0]) 
print("The total number of movies:",  m.shape[1])

The total number of nonzero entries: 820367
The total number of users: 10916
The total number of movies: 5392


Since 820367 << 10916 * 5392 (only about 1.4% of the user-movie pairs have a rating), we know that the matrix is very sparse.
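The sparsity claim can be checked with a few lines of arithmetic. The sketch below uses only the three numbers printed above. The 8-byte figure assumes a dense `float64` array, which is an assumption about how a dense version would be stored.

```python
# Values printed by the cell above.
n_users, n_movies, n_ratings = 10916, 5392, 820367

density = n_ratings / (n_users * n_movies)
dense_bytes = n_users * n_movies * 8   # assuming one float64 per cell

print(f"filled cells: {density:.2%}")
print(f"dense float64 matrix: {dense_bytes / 1e6:.0f} MB")
```

This is why the ratings are kept in a sparse format: a dense array would spend almost all of its memory on empty cells.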

In [7]:
r = np.array(rating)
print("the number of times any movie was rated '1':", np.where(r == 1)[0].shape[0])
print("the number of times any movie was rated '3':", np.where(r == 3)[0].shape[0])
print("the number of times any movie was rated '5':", np.where(r == 5)[0].shape[0])
print("the average movie rating across all users and movies:", np.average(r))

the number of times any movie was rated '1': 53852
the number of times any movie was rated '3': 260055
the number of times any movie was rated '5': 139429
the average movie rating across all users and movies: 3.3805674777264323


## User-User Similarity

We then proceed to the memory-based Collaborative Filtering approach. We start by reading the queries from the dev.csv file. Note that memory-based and model-based CF differ subtly in how similarity is calculated and how the queries are read, as shown in the similarity/similarityPair functions and the readQuery function below.

In the memory-based Collaborative Filtering approach, the similarity between the query user and all other users is computed for each query.

In [8]:
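from sklearn.preprocessing import normalize

def similarity(m, q, func):
    # m is the CSR user x movie matrix and q is the query user's row index
    if func == 'cosine':
        m = normalize(m) # scale every user row to unit L2 norm
    # sparse matrix-vector product: similarity of every user to user q
    res = m.dot(m[q].transpose()).toarray().flatten()
    res[q] = 0 # exclude query itself
    return res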
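The difference between the two similarity metrics is easiest to see on a tiny example. The sketch below uses a made-up matrix of centered ratings and computes the cosine by hand instead of calling `normalize`, so the variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical centered ratings: 3 users x 4 movies.
toy = csr_matrix(np.array([
    [ 2.0, 1.0, 0.0, -2.0],
    [ 1.0, 0.5, 0.0, -1.0],   # same direction as user 0, half the magnitude
    [-2.0, 0.0, 1.0,  2.0],
]))

q = 0  # query user

# Dot-product similarity: one sparse matrix-vector product.
dot_sim = toy.dot(toy[q].T).toarray().ravel()

# Cosine similarity: additionally divide by the L2 norms of both rows.
row_norms = np.sqrt(np.asarray(toy.multiply(toy).sum(axis=1)).ravel())
cos_sim = dot_sim / (row_norms * row_norms[q])

print(dot_sim)
print(cos_sim)
```

User 1 rates in the same direction as user 0 but less strongly. The dot product scores that user lower than user 0 itself, while the cosine treats the two as identical in taste.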

However, queries arrive in arbitrary order, which means the same user can appear in several queries with different movies. To avoid recomputing the similarity for the same user, we use a dictionary to group the queries by user.

In addition, a list of tuples keeps track of the original order of the queries, which is important in the evaluation step.

In [9]:
from collections import defaultdict

# read query file into a user-movies dictionary
def readQuery(file, model):
    output = None
    tuples = [] # input order maintenance
    with open(file, 'r') as f:
        if model == "memory":
            query = defaultdict(list) # user as key, list of movies as value
            for line in f:
                t = line.split(",") # each line is "MovieID,UserID"
                tuples.append((int(t[1]), int(t[0]))) # (user, movie)
                query[int(t[1])].append(int(t[0]))
            output = (query, tuples)
        elif model == "model":
            for line in f:
                t = line.split(",") # each line is "MovieID,UserID"
                tuples.append((int(t[0]),int(t[1]))) # (movie, user)
            output = tuples
    return output
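The grouping and the order bookkeeping can be sketched without touching any file. In the example below the query lines and the prediction values are made-up placeholders. Only the dictionary and tuple-list pattern mirrors the memory-based branch of readQuery.

```python
from collections import defaultdict

# Hypothetical "MovieID,UserID" query lines, in file order.
lines = ["2,23", "3,108", "3,23", "4,108"]

query = defaultdict(list)   # user -> movies requested for that user
tuples = []                 # (user, movie) pairs in the original order
for line in lines:
    movie, user = (int(x) for x in line.split(","))
    tuples.append((user, movie))
    query[user].append(movie)

# Placeholder values keyed by (user, movie), the same key shape memoryCF uses.
pred = {(u, m): u * 1000 + m for u in query for m in query[u]}

# Restore the original query order for evaluation.
ordered = [pred[pair] for pair in tuples]
print(dict(query))
print(ordered)
```

Each user appears once as a dictionary key, so the similarity for that user only needs to be computed once, and the tuple list puts the results back in file order.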

## Item-Item Similarity


On the other hand, the model-based Collaborative Filtering approach derives a model for pairwise similarity for further computation, as shown in the similarityPair function below. 

Therefore, we maintain only a tuple list to keep track of the queries, as shown in the readQuery function above.

In [10]:
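def similarityPair(m, func):
    # m is the dense user x movie matrix
    if func == 'cosine':
        # scale every movie column to unit L2 norm so the product is a cosine
        m = normalize(m, axis=0)
    res = m.transpose().dot(m) # (#movies, #movies) item-item similarity
    # exclude each movie's similarity to itself
    np.fill_diagonal(res, 0)
    return res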
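A small worked example may help show what the pairwise matrix contains. This is a sketch on made-up centered ratings, using plain NumPy rather than the tutorial's helpers.

```python
import numpy as np

# Hypothetical centered ratings: 3 users (rows) x 3 movies (columns).
toy = np.array([
    [ 2.0,  2.0, -1.0],
    [ 1.0,  1.0,  0.0],
    [-2.0, -2.0,  2.0],
])

# Scale every movie column to unit length, then multiply: entry (i, j)
# is the cosine between the rating columns of movies i and j.
col_norms = np.linalg.norm(toy, axis=0)
unit = toy / col_norms
item_sim = unit.T @ unit
np.fill_diagonal(item_sim, 0.0)   # a movie is not its own neighbor

print(np.round(item_sim, 3))
```

Movies 0 and 1 received identical ratings, so their similarity is 1, while movie 2 is rated in the opposite direction and gets a negative score. The matrix is computed once and then reused for every query, which is the sense in which this approach builds a model first.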

Note that the model argument in the readQuery function has two different values, indicating the two different approaches (memory-based, model-based). 

Below is a list of all the arguments we will experiment with in this tutorial:

- model (memory-based or model-based collaborative filtering): 
    "memory", "model"


- k (number of nearest neighbors): 
    10,100,500
    
    
- similarity (similarity metric used for knn): 
    "dotproduct", "cosine"
    
    
- weighing (approach for combining prediction given knn):
    "mean","weight"
    
    
- pcc (if standardization used):
    "True", "False"
    

In [11]:
import numpy as np
from sklearn.preprocessing import normalize
from numpy.linalg import norm
from scipy.sparse import csr_matrix
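The pcc option in the list above refers to standardizing the ratings before computing similarity. The tutorial's own PCC code is not shown, so the following is only a sketch of one common formulation, on a small made-up and fully observed matrix: subtract each user's mean rating, scale each row to unit length, and take dot products. With missing ratings, one would also have to decide which entries enter the mean and the norm, which this sketch does not address.

```python
import numpy as np

# Hypothetical fully observed ratings: 3 users x 4 movies.
ratings = np.array([
    [5.0, 4.0, 2.0, 1.0],   # generous rater
    [3.0, 2.0, 1.0, 1.0],   # stricter rater with a similar ordering
    [1.0, 2.0, 4.0, 5.0],   # opposite ordering
])

# Standardize each user: subtract the user's mean, divide by the user's norm.
centered = ratings - ratings.mean(axis=1, keepdims=True)
standardized = centered / np.linalg.norm(centered, axis=1, keepdims=True)

# The dot product of standardized rows is the Pearson correlation.
pcc = standardized @ standardized.T
print(np.round(pcc, 3))
```

Because each user's mean is removed, a generous rater and a strict rater with the same ordering of movies still come out as highly correlated, which is the bias reduction the option name hints at.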

The knn and knnWeight functions are shared by the two approaches:

In [12]:
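def knn(similarity, k):
    # indices of the k largest similarities (in no particular order)
    return np.argpartition(-similarity, k)[:k]

def knnWeight(similarity, knn):
    # map similarities from [-1, 1] to [0, 1] so they can be used as weights
    return (similarity[knn]+1.)/2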
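The following sketch shows how the partition-based selection behaves on a made-up similarity vector. The extra sorting step is not part of the tutorial's knn function and is included only to make the printed order deterministic.

```python
import numpy as np

sim = np.array([0.1, 0.9, -0.3, 0.7, 0.0, 0.8])  # made-up similarity scores
k = 3

# argpartition moves the k largest scores (the k smallest of -sim) into the
# first k slots without fully sorting; their internal order is not guaranteed.
top_k = np.argpartition(-sim, k)[:k]

# Sort only those k entries when a ranked list is needed.
ranked = top_k[np.argsort(-sim[top_k])]

# Map similarities from [-1, 1] to [0, 1] so they can serve as weights.
weights = (sim[ranked] + 1.0) / 2.0
print(ranked, weights)
```

Note that `np.argpartition(-sim, k)` requires k to be smaller than the number of candidates, and that the [0, 1] mapping is only meaningful when the similarities lie in [-1, 1], as cosine similarities do.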

The predict function handles both approaches, which differ mainly in whether the rating matrix is sliced by rows (users) or by columns (movies).

In [13]:
def predict(m, knn, similarity, func, model):
    if model == "memory":
        # row slicing and average
        if (func == 'weight') and (np.sum(knnWeight(similarity,knn)) != 0):
            res = np.average(m[knn].toarray(), axis = 0, weights = knnWeight(similarity, knn))
        else:
            res = np.mean(m[knn].toarray(), axis = 0)
    if model == "model":
        # column slicing and average
        if (func == 'weight') and (np.sum(knnWeight(similarity,knn)) != 0):
            res = np.average(m[:,knn], axis = 1, weights = knnWeight(similarity, knn))
        else:
            res = np.mean(m[:,knn], axis = 1)
    return res
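To see how the two weighing options differ, consider the sketch below. It assumes the matrix holds centered ratings (rating minus 3), as the "+ 3" in the prediction functions suggests, and all numbers are made up.

```python
import numpy as np

# Hypothetical centered ratings of k = 3 neighbors (rows) for 2 movies (columns).
neighbor_ratings = np.array([
    [ 2.0, -1.0],
    [ 1.0,  0.0],
    [-1.0,  2.0],
])
weights = np.array([0.9, 0.6, 0.5])  # made-up neighbor weights in [0, 1]

# "mean": every neighbor counts equally.
mean_pred = neighbor_ratings.mean(axis=0)

# "weight": more similar neighbors count more.
weighted_pred = np.average(neighbor_ratings, axis=0, weights=weights)

# Undo the centering and clip to the valid rating range.
final = np.clip(weighted_pred + 3.0, 1.0, 5.0)
print(mean_pred, weighted_pred, final)
```

`np.average` raises an error when the weights sum to zero, which is why predict falls back to the plain mean in that case.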

With the helper functions in place, we are now ready to implement the two collaborative filtering algorithms!

In [14]:
def memoryCF(m, query, k, func, func_w):
    res = {} # maps (user, movie) to the prediction for every query
    for user in query.keys():
        sim = similarity(m, user, func) # (#users,)
        temp = knn(sim, k)
        prediction = predict(m, temp, sim, func_w, "memory") # (#movies,)
        for movie in query[user]:
            pred = prediction[movie] + 3 # add back the neutral rating of 3
            # clip to the valid rating range [1, 5]
            if pred > 5:
                pred = 5
            elif pred < 1:
                pred = 1
            res[user, movie] = pred
    return res

def modelCF(m, tuples, k, func, func_w):
    res = [] # return the list of predictions given the query tuples
    m = m.toarray(order = 'F') # dense copy in column-major order for column slicing
    simPair = similarityPair(m, func) # (#movies, #movies)
    for pair in tuples: # pair = (movie, user)
        sim = simPair[pair[0]]
        temp = knn(sim, k)
        prediction = predict(m, temp, sim, func_w, "model") # (#users,)
        pred = prediction[pair[1]] + 3 # add back the neutral rating of 3
        # clip to the valid rating range [1, 5]
        if pred > 5:
            pred = 5
        elif pred < 1:
            pred = 1
        res.append(pred)
    return res

In [15]:
def getPrediction(model, trainFile, devFile, k, similarity, weighing):
    if model == 'memory':
        M, _ = readTrain(trainFile)
        query, tuples = readQuery(devFile, model)
        pred = memoryCF(M, query, k, similarity, weighing)
    elif model == 'model':
        M, _ = readTrain(trainFile)
        tuples = readQuery(devFile, model)
        pred = modelCF(M, tuples, k, similarity, weighing)
    return pred

Note: We leave out the weighing="weight" option because of space limitations in this tutorial.

In [None]:
getPrediction('memory', 'train.csv', 'dev.csv', 10, "dotproduct", "mean")
getPrediction('memory', 'train.csv', 'dev.csv', 100, "dotproduct", "mean")
getPrediction('memory', 'train.csv', 'dev.csv', 500, "dotproduct", "mean")

getPrediction('memory', 'train.csv', 'dev.csv', 10, "cosine", "mean")
getPrediction('memory', 'train.csv', 'dev.csv', 100, "cosine", "mean")
getPrediction('memory', 'train.csv', 'dev.csv', 500, "cosine", "mean")

getPrediction('model', 'train.csv', 'dev.csv', 10, "dotproduct", "mean")
getPrediction('model', 'train.csv', 'dev.csv', 100, "dotproduct", "mean")
getPrediction('model', 'train.csv', 'dev.csv', 500, "dotproduct", "mean")

getPrediction('model', 'train.csv', 'dev.csv', 10, "cosine", "mean")
getPrediction('model', 'train.csv', 'dev.csv', 100, "cosine", "mean")
getPrediction('model', 'train.csv', 'dev.csv', 500, "cosine", "mean")
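The twelve calls above form a grid over model, similarity metric, and k. One way to sweep such a grid and keep the results is sketched below. The `sweep` helper and the `fake_run` stand-in are hypothetical names introduced here so that the example runs without the dataset.

```python
from itertools import product

def sweep(run, models, similarities, ks, weighing="mean"):
    """Call run(model, k, similarity, weighing) for every combination and
    return the results keyed by (model, similarity, k)."""
    results = {}
    for model, sim, k in product(models, similarities, ks):
        results[(model, sim, k)] = run(model, k, sim, weighing)
    return results

# Stand-in for getPrediction so the sketch runs without the dataset.
def fake_run(model, k, similarity, weighing):
    return f"{model}-{similarity}-{k}-{weighing}"

results = sweep(fake_run, ["memory", "model"], ["dotproduct", "cosine"], [10, 100, 500])
print(len(results))
```

With the real data, `run` could be a small wrapper such as `lambda model, k, sim, w: getPrediction(model, 'train.csv', 'dev.csv', k, sim, w)`, so that every configuration's predictions stay available for evaluation.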

In the evaluation step, we use a single evaluation metric that is common in collaborative filtering, Root Mean Squared Error (RMSE). RMSE is calculated as follows:

    E = 0

    For each rating

        E = E + (true_rating - predicted_rating)^2

    RMSE = sqrt(E / num_ratings)
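The pseudocode above translates almost line by line into Python. The function name `rmse` and the sample ratings are assumptions made for this sketch. The gold-standard ratings used in the tutorial are not shown.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equally long rating sequences."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("both sequences must have the same length")
    squared_error = sum((t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings))
    return math.sqrt(squared_error / len(true_ratings))

print(rmse([4, 3, 5, 1], [3.5, 3.0, 4.0, 2.0]))  # prints 0.75
```

Because the errors are squared before averaging, a few large mistakes raise the RMSE more than many small ones.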

Due to space limitations, we leave out the output of the getPrediction function, which we use along with the gold standards behind the scenes to calculate the RMSE.

### Conclusions

A few conclusions about the choices of k, similarity metric, and the two approaches themselves:

Conclusion 1: Value of k

In our experiments, across all similarity metrics and weighting schemes, k = 10 always gives the best RMSE among the three values tested. As k grows, the RMSE also grows, indicating worse performance. This is not difficult to understand: given a query user, it is easier to find 10 similar users than 500. With a large k, a large share of the selected neighbors may not really be similar to the query user; they are selected only because k is overly large. The same holds for movies. Choosing a reasonable k is therefore important for k-nearest-neighbor prediction with both user-user and item-item similarity. With some further experiments, we find that the algorithm achieves a relatively low RMSE when k falls in [25,30].

Conclusion 2: Similarity metric: dot product vs. cosine similarity

Given the same value of k, for both user-user similarity and item-item similarity, choosing the dot product or cosine similarity makes little difference. Cosine similarity differs from the dot product only in that it normalizes the vectors first. The given dataset is so sparse that this normalization does little to change the predictions. Therefore, both metrics are good indicators of similarity.

Conclusion 3: Feature: user similarity vs. movie similarity

User similarity performs slightly better than movie similarity, which is easy to see when comparing results under the same similarity metric and the same value of k. Our understanding is that there is no large group of users who are very generous (always giving high ratings) while others are very strict (always giving low ratings). Movies, on the other hand, differ much more than users: some are highly rated and others are disliked. As a result, for this dataset, user similarity is a better indicator than movie similarity.

## References and Further Resources

1. Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer, 2009, 42(8).

2. Ricci F, Rokach L, Shapira B. Introduction to recommender systems handbook. In: Recommender Systems Handbook. Springer US, 2011: 1-35.

3. Resnick P, Varian H R. Recommender systems. Communications of the ACM, 1997, 40(3): 56-58.