* Authors: Andrea Jiménez Zuñiga, Valentina Díaz Torres and Isabel Afán de Ribera Olaso
* Date: 15/01/2020
* Institution: CUNEF

# 05. Collaborative Filtering


Collaborative filtering is applied in recommendation systems to improve their results and to mitigate the information overload that a digital environment can produce. Internet users have access to millions of items, but only a few are of interest to them, and sifting through the rest can lead to negative experiences and a significant loss of time. Collaborative filtering selects and processes the valuable information and builds from it a set of suggestions and recommendations in line with user expectations.


Here we use memory-based collaborative filtering to recommend movies to users. The idea is that users who are similar to me can be used to predict how much I would like a movie that they have liked and I have not watched.

There are two types of Collaborative Filtering: 

1. __User-User Collaborative Filtering:__ Its aim is to find users who are look-alikes of a given user.

2. __Item-Item Collaborative Filtering:__ Its aim is to find movies that are look-alikes of a given movie.



![Captura%20de%20pantalla%202021-01-07%20a%20las%202.15.15%20p.%C2%A0m..png](attachment:Captura%20de%20pantalla%202021-01-07%20a%20las%202.15.15%20p.%C2%A0m..png)

Whether we do user-user or item-item collaborative filtering, we need to build a similarity matrix. For the former, the matrix holds a similarity metric for every pair of users. For the latter, it holds a similarity metric for every pair of items.

We consider three similarity metrics:

1. ___Jaccard Similarity___
2. ___Cosine Similarity___
3. ___Pearson Similarity___

In this case we are going to use the **Pearson Similarity**, that is, the Pearson correlation coefficient between the two rating vectors.
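
To make the three metrics concrete, here is a minimal sketch that computes them for two hypothetical rating vectors. The helper names and the toy vectors `u` and `v` are illustrative assumptions, not part of the notebook, and the Jaccard version assumes that a value of 0 means "not rated".

```python
import numpy as np

def jaccard_similarity(a, b):
    """Share of items rated by either vector that were rated by both (0 = not rated)."""
    rated_a, rated_b = a > 0, b > 0
    union = np.logical_or(rated_a, rated_b).sum()
    return float(np.logical_and(rated_a, rated_b).sum() / union) if union else 0.0

def cosine_similarity(a, b):
    """Cosine of the angle between the two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_similarity(a, b):
    """Pearson correlation coefficient between the two rating vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical ratings over the same four movies
u = np.array([5.0, 4.0, 0.0, 1.0])
v = np.array([4.0, 5.0, 1.0, 0.0])

print(jaccard_similarity(u, v), cosine_similarity(u, v), pearson_similarity(u, v))
```

Pearson similarity is the cosine similarity of the mean-centered vectors, which is why it is less sensitive to users or movies whose ratings are systematically high or low.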

## Import Libraries


In [1]:
import pandas as pd 
import numpy as np

In this case we only need two datasets: _movies.csv_ and _ratings.csv_.

In [2]:
ratings = pd.read_csv('../../data/ratings.csv')
movies = pd.read_csv('../../data/movies.csv')

In [3]:
# We are only interested in the ratings and users, so we drop genres and timestamp

ratings = pd.merge(movies,ratings).drop(['genres','timestamp'], axis = 1)

In [5]:
ratings.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),2,3.5
1,1,Toy Story (1995),3,4.0
2,1,Toy Story (1995),4,3.0
3,1,Toy Story (1995),5,4.0
4,1,Toy Story (1995),8,4.0


In [4]:
# Keep at most the first 10,000,000 rows
ratings = ratings.head(10000000)

In [5]:
# Now we use the pivot_table method in pandas. Each row is a user and each column is a movie title;
# the values are the ratings that each user gave to each movie.

user_ratings = ratings.pivot_table(index = ['userId'], columns = ['title'],
                                  values = 'rating')
user_ratings.head()

title,'Til There Was You (1997),"'burbs, The (1989)",1-900 (06) (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (One Eight Seven) (1997),2 Days in the Valley (1996),2 ou 3 choses que je sais d'elle (2 or 3 Things I Know About Her) (1967),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),...,Year of the Horse (1997),"Yes, Madam (a.k.a. Police Assassins) (a.k.a. In the Line of Duty 2) (Huang gu shi jie) (1985)",You Can't Take It with You (1938),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,5.0,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,4.0,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
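
The reshaping done by `pivot_table` can be seen on a small example. This is a sketch on a hypothetical three-user frame (`toy` and `toy_wide` are illustrative names); it only shows how the long format turns into the wide user-by-title format, with NaN wherever a user has not rated a title.

```python
import pandas as pd

# Hypothetical ratings in long format: one row per (user, movie) pair
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["A", "B", "A", "B"],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# Wide format: one row per user, one column per title, NaN where no rating exists
toy_wide = toy.pivot_table(index="userId", columns="title", values="rating")
print(toy_wide)
```

If the same user had rated a title more than once, `pivot_table` would average those ratings, since its default aggregation is the mean.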


There are a lot of NaN values, so we need to make a decision. Movies rated by only a few users might add noise to our system, so we drop the movies that have fewer than 10 ratings and fill the remaining NaN values with 0.


In [6]:
user_ratings = user_ratings.dropna(thresh = 10, axis = 1).fillna(0)
user_ratings.head() # The number of columns shows how many movies we are left with


title,'Til There Was You (1997),"'burbs, The (1989)",1-900 (06) (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (One Eight Seven) (1997),2 Days in the Valley (1996),2 ou 3 choses que je sais d'elle (2 or 3 Things I Know About Her) (1967),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),...,Year of the Horse (1997),"Yes, Madam (a.k.a. Police Assassins) (a.k.a. In the Line of Duty 2) (Huang gu shi jie) (1985)",You Can't Take It with You (1938),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
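
The effect of `thresh` can be checked on a toy frame. The sketch below uses a hypothetical three-movie matrix and a threshold of 2 instead of 10; only the pandas calls are the same as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix: A has 2 ratings, B has 1, C has none
toy_wide = pd.DataFrame(
    {
        "A": [4.0, 5.0, np.nan],
        "B": [3.0, np.nan, np.nan],
        "C": [np.nan, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)

# Keep only columns (axis=1) with at least 2 non-NaN ratings, then fill the gaps with 0
filtered = toy_wide.dropna(thresh=2, axis=1).fillna(0)
print(filtered)
```

Note that after `fillna(0)` a movie the user has not rated is indistinguishable from a rating of 0 in any later computation.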


Now we build our similarity matrix with the Pearson similarity described above: each entry is the Pearson correlation coefficient between the rating vectors of two movies.


In [7]:
item_similarity_df = user_ratings.corr(method = 'pearson')
item_similarity_df.head(50)

title,'Til There Was You (1997),"'burbs, The (1989)",1-900 (06) (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (One Eight Seven) (1997),2 Days in the Valley (1996),2 ou 3 choses que je sais d'elle (2 or 3 Things I Know About Her) (1967),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),...,Year of the Horse (1997),"Yes, Madam (a.k.a. Police Assassins) (a.k.a. In the Line of Duty 2) (Huang gu shi jie) (1985)",You Can't Take It with You (1938),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Til There Was You (1997),1.0,0.047906,0.027405,0.051586,0.021262,0.026896,0.045499,0.01401,0.029385,0.01647,...,0.006427,0.0045,0.02631,0.025617,0.050339,0.043479,0.025779,0.056017,0.011935,0.070167
"'burbs, The (1989)",0.047906,1.0,0.006663,0.103757,0.089285,0.086664,0.101953,0.020389,0.106135,0.112069,...,0.012814,0.014366,0.059477,0.138555,0.198989,0.164613,0.042797,0.099648,0.011493,0.03283
1-900 (06) (1994),0.027405,0.006663,1.0,0.031192,0.002457,0.013331,0.034175,0.053222,0.004793,0.004051,...,0.013348,0.074564,0.007579,0.001755,0.002457,0.005737,0.041087,0.006002,0.157366,0.052635
101 Dalmatians (1996),0.051586,0.103757,0.031192,1.0,0.060308,0.044739,0.068065,0.013449,0.110464,0.067418,...,0.009901,0.017034,0.047614,0.092231,0.090296,0.070803,0.031117,0.034353,0.038465,0.097793
12 Angry Men (1957),0.021262,0.089285,0.002457,0.060308,1.0,0.042971,0.064653,0.031654,0.109552,0.264863,...,0.015258,0.008776,0.110153,0.199204,0.076112,0.041252,0.034163,0.07873,0.01734,0.007513
187 (One Eight Seven) (1997),0.026896,0.086664,0.013331,0.044739,0.042971,1.0,0.106135,0.024304,0.039061,0.042752,...,0.047434,0.014633,0.026211,0.036519,0.083552,0.092339,0.031907,0.076315,0.013063,0.019695
2 Days in the Valley (1996),0.045499,0.101953,0.034175,0.068065,0.064653,0.106135,1.0,0.020402,0.09436,0.094391,...,0.01675,0.023206,0.044188,0.1102,0.144933,0.137041,0.133489,0.170327,0.02118,0.036541
2 ou 3 choses que je sais d'elle (2 or 3 Things I Know About Her) (1967),0.01401,0.020389,0.053222,0.013449,0.031654,0.024304,0.020402,1.0,0.023513,0.051345,...,0.059446,0.021384,0.05805,0.039045,0.011481,0.007881,0.041261,0.027364,0.094822,0.020206
"20,000 Leagues Under the Sea (1954)",0.029385,0.106135,0.004793,0.110464,0.109552,0.039061,0.09436,0.023513,1.0,0.229542,...,0.016139,0.026561,0.081015,0.205126,0.16859,0.121673,0.0418,0.074257,0.015687,0.027238
2001: A Space Odyssey (1968),0.01647,0.112069,0.004051,0.067418,0.264863,0.042752,0.094391,0.051345,0.229542,1.0,...,0.02214,0.010303,0.084428,0.300235,0.127583,0.082724,0.067628,0.122272,0.018205,0.011464
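
`DataFrame.corr` correlates the columns of a frame pairwise, so with movies as columns the result is an item-item matrix. The sketch below, on a hypothetical four-user, three-movie frame, shows this and also how transposing first would give a user-user matrix instead; `toy`, `item_sim` and `user_sim` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies
toy = pd.DataFrame({
    "A": [5.0, 4.0, 0.0, 1.0],
    "B": [4.0, 5.0, 1.0, 0.0],
    "C": [0.0, 1.0, 5.0, 4.0],
})

item_sim = toy.corr(method="pearson")    # movie-by-movie (3 x 3)
user_sim = toy.T.corr(method="pearson")  # user-by-user (4 x 4)

print(item_sim.round(3))
```

Movies A and B are rated alike and end up with a high positive correlation, while A and C are rated in opposite ways and end up with a correlation of -1.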


* __Top Action Movie Recommendations:__

Now we make recommendations based on the model that we have created: 

In [8]:
# Recommend from the similarity matrix built above.
# The function takes a movie title and the rating the user gave it,
# and returns a score for every movie based on its similarity to that title.

def get_similar_movies(movie_name, user_rating): 
    # Take the similarity column of the movie the user has already seen and scale it by the
    # user's rating minus 2.5, the midpoint of a 0-to-5 scale (a rating of 5 multiplies every
    # similarity by 2.5). If the user rated the movie poorly, the factor is negative, so similar
    # movies move down the list; only ratings above 2.5 push similar movies towards the top.
    similar_score = item_similarity_df[movie_name]*(user_rating - 2.5)
    similar_score = similar_score.sort_values(ascending = False) # Sort in descending order 
    
    return similar_score 
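
The scaling step can be worked through by hand. In this sketch the similarity values and the names `similarity_to_rated`, `scale_by_rating` and `midpoint` are hypothetical; the arithmetic is the same as in `get_similar_movies`.

```python
import pandas as pd

# Hypothetical similarities of three movies to one movie the user has rated
similarity_to_rated = pd.Series({"Movie X": 0.40, "Movie Y": 0.10, "Movie Z": 0.25})

def scale_by_rating(similarity, user_rating, midpoint=2.5):
    """Weight each similarity by how far the rating is from the midpoint."""
    return (similarity * (user_rating - midpoint)).sort_values(ascending=False)

liked = scale_by_rating(similarity_to_rated, 5)     # factor +2.5
disliked = scale_by_rating(similarity_to_rated, 2)  # factor -0.5

print(liked)
print(disliked)
```

With a rating of 5 the most similar movie comes first; with a rating of 2 the order is reversed, because the most similar movie receives the most negative score.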

We proceed to test how well our recommendation system is working.

In [9]:
# The rating depends on how much action the movie has: a fast-paced action movie gets a 5,
# while a romantic movie gets a lower rating, such as 2 or 1.
# For example, an action lover has rated the movies below:

action_lover = [('Broken Arrow (1996)', 5),
                ('Eye for an Eye (1996)', 5), 
                ('Dead Presidents (1995)', 4),
                ('Father of the Bride Part II (1995)', 2)]


# Collect one Series of scores per rated movie and stack them as rows;
# reset_index(drop = True) replaces the movie-title row labels with 0, 1, 2, ...
similar_movies = pd.DataFrame(
    [get_similar_movies(movie, rating) for movie, rating in action_lover]
).reset_index(drop = True)
    

similar_movies.head()





Unnamed: 0,'Til There Was You (1997),"'burbs, The (1989)",1-900 (06) (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (One Eight Seven) (1997),2 Days in the Valley (1996),2 ou 3 choses que je sais d'elle (2 or 3 Things I Know About Her) (1967),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),...,Year of the Horse (1997),"Yes, Madam (a.k.a. Police Assassins) (a.k.a. In the Line of Duty 2) (Huang gu shi jie) (1985)",You Can't Take It with You (1938),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997)
0,0.101835,0.162148,0.068519,0.316694,0.035923,0.156593,0.363449,0.0231,0.214379,0.157662,...,0.017942,0.076368,0.028143,0.16178,0.404899,0.383543,0.155587,0.183288,0.046505,0.07504
1,0.152925,0.099461,0.086618,0.317165,4.1e-05,0.147764,0.254816,0.032661,0.061349,-0.015668,...,0.021065,0.075277,0.032731,0.025945,0.148834,0.174413,0.114472,0.064033,0.047556,0.111132
2,0.049168,0.131817,0.034792,0.055482,0.059718,0.162252,0.255523,0.037831,0.116367,0.109186,...,0.036147,0.067065,0.049317,0.119588,0.238449,0.253389,0.116169,0.160763,0.022265,0.036343
3,-0.027532,-0.039562,-0.013337,-0.095484,-0.001368,-0.014796,-0.038116,-0.001062,-0.025393,0.000651,...,-0.002084,-0.008953,-0.013957,-0.026009,-0.039496,-0.043448,-0.013948,-0.014877,-0.00654,-0.027698


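The top 20 recommended movies are listed below. The movies the user rated highly appear first, because every movie has a similarity of 1 with itself: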

In [10]:
similar_movies.sum().sort_values(ascending = False).head(20)


Broken Arrow (1996)                     3.058887
Eye for an Eye (1996)                   3.025922
Dead Presidents (1995)                  2.073822
Executive Decision (1996)               1.827838
Eraser (1996)                           1.763359
Twister (1996)                          1.633610
Juror, The (1996)                       1.552337
Rock, The (1996)                        1.492793
Sudden Death (1995)                     1.423407
Mission: Impossible (1996)              1.413466
Ransom (1996)                           1.381950
Heat (1995)                             1.364817
River Wild, The (1994)                  1.355941
Substitute, The (1996)                  1.352341
Independence Day (a.k.a. ID4) (1996)    1.328684
Time to Kill, A (1996)                  1.317868
Phenomenon (1996)                       1.294335
Fear (1996)                             1.285459
Primal Fear (1996)                      1.271886
Grumpier Old Men (1995)                 1.271673
dtype: float64
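
Since the list above still contains the movies the user has already rated, one possible refinement is to drop them before taking the top results. The following is only a sketch on a hypothetical four-movie similarity matrix; `recommend`, `toy_similarity` and the parameter names are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical symmetric item-item similarity matrix
titles = ["A", "B", "C", "D"]
toy_similarity = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.3],
     [0.6, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.4],
     [0.3, 0.5, 0.4, 1.0]],
    index=titles, columns=titles,
)

def recommend(history, similarity, top_n=20, midpoint=2.5):
    """Sum rating-weighted similarities and drop titles the user already rated."""
    scores = sum(similarity[title] * (rating - midpoint) for title, rating in history)
    seen = [title for title, _ in history]
    return scores.drop(labels=seen).sort_values(ascending=False).head(top_n)

# The user loved A and disliked C
toy_recs = recommend([("A", 5), ("C", 1)], toy_similarity)
print(toy_recs)
```

Summing the weighted Series directly gives the same totals as building a frame with one row per rated movie and calling `.sum()` on it.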

* __Top Comedy Movie Recommendations:__

For example, for a **comedy lover**:

In [13]:
comedy_lover = [("Clueless (1995)",5),("Father of the Bride Part II (1995)",4),
                ("Dangerous Minds (1995)",1),
                ("Flirting With Disaster (1996)",5)]

# Collect one Series of scores per rated movie and stack them as rows
similar_movies = pd.DataFrame(
    [get_similar_movies(movie, rating) for movie, rating in comedy_lover]
).reset_index(drop = True)


similar_movies.head(10)

Unnamed: 0,'Til There Was You (1997),"'burbs, The (1989)",1-900 (06) (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (One Eight Seven) (1997),2 Days in the Valley (1996),2 ou 3 choses que je sais d'elle (2 or 3 Things I Know About Her) (1967),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),...,Year of the Horse (1997),"Yes, Madam (a.k.a. Police Assassins) (a.k.a. In the Line of Duty 2) (Huang gu shi jie) (1985)",You Can't Take It with You (1938),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997)
0,0.146193,0.258208,0.017503,0.237657,0.14845,0.121268,0.27678,0.061603,0.177629,0.228963,...,0.043511,0.035482,0.176477,0.375255,0.306245,0.270187,0.159189,0.324292,0.033654,0.074912
1,0.082597,0.118685,0.04001,0.286451,0.004105,0.044388,0.114348,0.003186,0.076178,-0.001953,...,0.006251,0.02686,0.041872,0.078026,0.118489,0.130344,0.041845,0.04463,0.019621,0.083093
2,-0.065343,-0.097143,-0.017464,-0.108401,-0.046056,-0.097947,-0.094394,-0.021036,-0.061747,-0.024401,...,-0.006979,-0.022657,-0.03283,-0.038839,-0.131502,-0.127501,-0.032688,-0.041506,-0.016594,-0.034852
3,0.121849,0.176719,0.028501,0.072127,0.17971,0.124,0.421503,0.106843,0.142367,0.258401,...,0.083833,0.031364,0.222518,0.375206,0.207059,0.17125,0.282504,0.597426,0.052738,0.036411


The recommended movies selected are: