# Collaborative Filtering

In [637]:
import numpy as np 
import pandas as pd 

# Introduction

# Data

In [638]:
tags = pd.read_csv('datasets/ml-latest-small/tags.csv')
ratings = pd.read_csv('datasets/ml-latest-small/ratings.csv')
movies = pd.read_csv('datasets/ml-latest-small/movies.csv')
links = pd.read_csv('datasets/ml-latest-small/links.csv')

In [639]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [640]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


# Data Preparation

In [641]:
df = movies.merge(ratings, on = 'movieId')
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [642]:
# Aggregate by movie
agg_ratings = df.groupby('title').agg(mean_rating = ('rating', 'mean'),
                                                number_of_ratings = ('rating', 'count')).reset_index()

# Keep the movies with over 100 ratings
agg_ratings_GT100 = agg_ratings[agg_ratings['number_of_ratings']>100]

df = pd.merge(df, agg_ratings_GT100[['title']], on = 'title', how = 'inner')
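
The filtering step above can be checked on a small frame. The following is a minimal sketch, assuming a hypothetical toy dataset (`toy`) and a hypothetical cutoff `min_ratings`; it is not the notebook's data, only an illustration of the named aggregation followed by an inner merge.

```python
import pandas as pd

# Hypothetical stand-in for the merged movies/ratings frame.
toy = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "C"],
    "userId": [1, 2, 3, 1, 2, 1],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
})

# Named aggregation: one output column per (source column, function) pair.
toy_agg = toy.groupby("title").agg(
    mean_rating=("rating", "mean"),
    number_of_ratings=("rating", "count"),
).reset_index()

# Keep titles with more than `min_ratings` ratings (100 in the notebook, 1 here).
min_ratings = 1
popular = toy_agg[toy_agg["number_of_ratings"] > min_ratings]

# The inner merge drops every rating row whose title was filtered out.
toy_filtered = toy.merge(popular[["title"]], on="title", how="inner")
print(toy_filtered)
```

Title `C` has a single rating, so its row disappears from `toy_filtered` while all rows for `A` and `B` survive.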

# User-User Collaborative Filtering

The following is the user-item matrix we will be working with. Each row represents a user and each column represents a movie; each entry is the rating that user gave that movie. The matrix is very sparse: most entries are NaN, which is expected because any one viewer watches and rates only a small fraction of the available movies.

In [643]:
# Convert our DataFrame into a User x Movie matrix
userRatings = df.pivot_table(index = ['userId'], columns = ['title'],
                            values = 'rating')
userRatings.head()

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),American Pie (1999),Apocalypse Now (1979),...,True Lies (1994),"Truman Show, The (1998)",Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Up (2009),"Usual Suspects, The (1995)",WALL·E (2008),Waterworld (1995),Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,4.0,,,5.0,5.0,,4.0,...,,,,3.0,,5.0,,,5.0,5.0
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,4.0,,,,5.0,,,,...,,,2.0,,,,,,4.0,
5,,3.0,4.0,,,,,,,,...,2.0,,,,,4.0,,,,

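How sparse such a matrix is can be quantified as the fraction of missing entries. The following is a minimal sketch on hypothetical toy data (`toy`, `toy_matrix`), not on the notebook's `userRatings`.

```python
import pandas as pd

# Hypothetical ratings in long format.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title":  ["A", "B", "A", "C"],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Same pivot as in the notebook: users as rows, movies as columns.
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")

# Fraction of user/movie pairs with no rating.
sparsity = toy_matrix.isna().to_numpy().mean()
print(toy_matrix)
print(f"sparsity = {sparsity:.3f}")
```

Here 4 of the 9 cells are filled, so the sparsity is 5/9.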

## Similarity Metric

### Pearson Similarity

The Pearson Similarity metric measures how similar the rating vectors of two users are (computed row-wise on the user-item matrix). We will denote the two users as *u* and *v*. Additionally, $I_u$ and $I_v$ will denote the sets of items that users *u* and *v* have rated, respectively.

The first step in computing the Pearson Similarity is to compute the mean rating for each user. The mean rating of user *u* is computed with the following equation:

$$\mu_u = \frac{\sum_{k \in I_u} r_{uk}}{|I_u|}$$ 

$$\forall u \in \{1...m\}$$


where *k* is the index of the item, so $r_{uk}$ is the rating user *u* gave to item *k*, and *m* is the number of users.

The Pearson Similarity can then be computed between the two users:

$$Sim(u,v) = Pearson(u,v) = \frac{\sum_{k \in I_u \bigcap I_v} (r_{uk}-\mu_u)(r_{vk} - \mu_v)}{\sqrt{\sum_{k \in I_u \bigcap I_v} (r_{uk}-\mu_u)^2}\sqrt{\sum_{k \in I_u \bigcap I_v} (r_{vk}-\mu_v)^2}}$$

The Pearson Similarity is computed between a target user and all the other users. We can then find the *k* number of users with the highest Pearson Similarity with the target user.
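
To make the formula concrete, here is a minimal sketch that computes the similarity by hand for two hypothetical users and compares it with pandas. The helper name `pearson_on_common` and the toy ratings are assumptions for illustration. One caveat: pandas computes each mean over the co-rated items only, whereas $\mu_u$ above is defined over all of $I_u$, so the two definitions can differ slightly.

```python
import numpy as np
import pandas as pd

def pearson_on_common(r_u, r_v):
    """Pearson correlation restricted to items both users rated."""
    common = r_u.notna() & r_v.notna()
    if common.sum() < 2:
        return np.nan  # not enough overlap to define a correlation
    du = r_u[common] - r_u[common].mean()
    dv = r_v[common] - r_v[common].mean()
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return (du * dv).sum() / denom if denom > 0 else np.nan

# Hypothetical rating vectors over five items; NaN means "not rated".
r_u = pd.Series([5.0, 3.0, np.nan, 4.0, 1.0], index=list("ABCDE"))
r_v = pd.Series([4.0, np.nan, 2.0, 5.0, 2.0], index=list("ABCDE"))

manual = pearson_on_common(r_u, r_v)
builtin = r_u.corr(r_v)  # pandas drops pairs where either value is NaN
print(manual, builtin)
```

Only items A, D and E are rated by both users, so both computations use just those three pairs.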

In [644]:
# Perform Pearson Similarity on the users.
user_similarity_matrix = userRatings.T.corr(method = 'pearson')
user_similarity_matrix.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,,,0.391797,0.180151,-0.439941,-0.029894,0.464277,1.0,-0.037987,...,0.091574,0.254514,0.101482,-0.5,0.78002,0.303854,-0.012077,0.242309,-0.175412,0.071553
2,,1.0,,,,,,,,1.0,...,-0.583333,,-1.0,,,0.583333,,-0.229416,,0.765641
3,,,,,,,,,,,...,,,,,,,,,,
4,0.391797,,,1.0,-0.394823,0.421927,0.704669,0.055442,,0.360399,...,-0.239325,0.5625,0.162301,-0.158114,0.905134,0.021898,-0.020659,-0.286872,,-0.050868
5,0.180151,,,-0.394823,1.0,-0.006888,0.328889,0.030168,,-0.777714,...,0.0,0.231642,0.131108,0.068621,-0.245026,0.377341,0.228218,0.263139,0.384111,0.040582


We need to define the top *k* users most similar to the target user. Additionally, we need to set a similarity threshold, because without one the top *k* results could include users whose tastes are only weakly correlated, or even negatively correlated, with the target user's.

In [645]:
k = 10 # Number of similar users we want to retrieve
similarity_threshold = 0.3 # Threshold that needs to be met to be considered similar
user = 1 # The target user for whom we want to generate recommendations

# Note: drop() returns a new DataFrame and its result is not assigned here, so the target user still appears among the neighbours below.
user_similarity_matrix.drop(index = user)

# Return the top k similar users
k_Neighbours = user_similarity_matrix[user_similarity_matrix[user] > similarity_threshold][user].sort_values(ascending = False)[:k]
k_Neighbours

userId
550    1.000000
502    1.000000
1      1.000000
598    1.000000
108    1.000000
9      1.000000
401    0.942809
511    0.925820
366    0.872872
595    0.866025
Name: 1, dtype: float64
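
Because `DataFrame.drop` returns a new object rather than modifying the matrix in place, the target user can end up in their own neighbour list, as user 1 does above. The following is a minimal sketch of one way to exclude the target, assuming a hypothetical helper `top_k_neighbours` and a toy similarity matrix; it is an illustration, not the notebook's code.

```python
import numpy as np
import pandas as pd

def top_k_neighbours(similarity, target, k, threshold):
    """Top-k users most similar to `target`, excluding `target` itself."""
    scores = similarity[target].drop(index=target)  # drop() returns a copy
    scores = scores[scores > threshold]              # NaN compares as False
    return scores.sort_values(ascending=False).head(k)

# Hypothetical symmetric user-user similarity matrix.
ids = [1, 2, 3, 4]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.2, np.nan],
     [0.9, 1.0, 0.5, 0.4],
     [0.2, 0.5, 1.0, 0.7],
     [np.nan, 0.4, 0.7, 1.0]],
    index=ids, columns=ids,
)

neighbours = top_k_neighbours(toy_sim, target=1, k=2, threshold=0.3)
print(neighbours)
```

For user 1, only user 2 clears the threshold: user 3 is too dissimilar and user 4 has no defined similarity.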

Remove movies that our target user has already seen and keep movies that similar users have watched.

In [646]:
# get movies that target user has watched and rated
target_watched = userRatings[userRatings.index == user].dropna(axis = 1, how = 'all')
target_watched

title,Alien (1979),American Beauty (1999),American History X (1998),Apocalypse Now (1979),Back to the Future (1985),Batman (1989),"Big Lebowski, The (1998)",Braveheart (1995),Clear and Present Danger (1994),Clerks (1994),...,Star Wars: Episode IV - A New Hope (1977),Star Wars: Episode V - The Empire Strikes Back (1980),Star Wars: Episode VI - Return of the Jedi (1983),Stargate (1994),"Terminator, The (1984)",Toy Story (1995),Twister (1996),"Usual Suspects, The (1995)",Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,5.0,5.0,4.0,5.0,4.0,5.0,4.0,4.0,3.0,...,5.0,5.0,5.0,3.0,5.0,4.0,3.0,5.0,5.0,5.0


We need to remove the movies that the target user has watched from the movies that similar users have watched.

In [647]:
# drop movies that none of the similar users have watched
target_not_watched = userRatings[userRatings.index.isin(k_Neighbours.index)].dropna(axis = 1, how = 'all')
# remove movies that the target user has watched.
target_not_watched.drop(target_watched.columns, axis = 1, inplace = True, errors = 'ignore')
target_not_watched.head()

title,Aladdin (1992),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",Batman Begins (2005),"Beautiful Mind, A (2001)",Beauty and the Beast (1991),Blade Runner (1982),"Bourne Identity, The (2002)","Breakfast Club, The (1985)",Catch Me If You Can (2002),"Dark Knight, The (2008)",...,"Monsters, Inc. (2001)",Ocean's Eleven (2001),Pirates of the Caribbean: The Curse of the Black Pearl (2003),"Shawshank Redemption, The (1994)",Shrek (2001),Spider-Man (2002),Terminator 2: Judgment Day (1991),Titanic (1997),Up (2009),WALL·E (2008)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
108,,5.0,,5.0,,5.0,,4.0,5.0,,...,,,,,,5.0,,4.0,,
366,,,4.0,,,,,,,4.0,...,,,4.0,,,,4.0,,,
401,3.0,,,,3.0,,,,,,...,3.5,,3.5,,3.5,,,,4.0,4.0


#### Pearson Weighted Average

Once we have the *k* users most similar to the target user, we can use the Pearson Similarity scores and the user-item ratings of the similar users to calculate the Weighted Average for an item. 

The weighted average is calculated with the following formula:

$$\frac{\sum_{v \in P_{u(j)}} Sim(u,v)*r_{vj}}{\sum_{v \in P_{u(j)}} |Sim(u,v)|}$$

The numerator sums, over the similar users who rated item *j*, the product of each user's similarity to the target user and the rating that user gave item *j*. The denominator is the sum of the absolute values of those same similarity scores.

The items with the highest weighted average are the items that should be recommended to the target user.
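
As a small worked instance of the formula, the sketch below computes the weighted average for one item from hypothetical similarity scores and ratings. The function name `weighted_average` and the numbers are assumptions for illustration.

```python
import numpy as np

def weighted_average(similarities, ratings):
    """Similarity-weighted average of the ratings that are present."""
    similarities = np.asarray(similarities, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    rated = ~np.isnan(ratings)                  # neighbours who rated the item
    denom = np.abs(similarities[rated]).sum()
    if denom == 0:
        return np.nan                           # no usable neighbour
    return (similarities[rated] * ratings[rated]).sum() / denom

# Three neighbours; the third has not rated the item and is skipped.
score = weighted_average([0.9, 0.6, 0.8], [5.0, 3.0, np.nan])
print(score)
```

With these numbers the result is (0.9 * 5 + 0.6 * 3) / (0.9 + 0.6) = 4.2.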

In [648]:
u = 1

# Get list of movies similar users have watched but target has not.
movies = target_not_watched.columns
recommended_movie_list = []
predicted_rating_list = []

for j in movies:
    movie_ratings = target_not_watched
    rating_sum = 0
    similarity_sum = 0
    for v in movie_ratings.index :
        rating = movie_ratings.loc[v][j]
        similarity = user_similarity_matrix[u][v]
        if pd.isna(rating) == False:
            rating_sum = rating_sum + similarity*rating
            similarity_sum = similarity_sum + abs(similarity)  # |Sim(u,v)|, as in the formula
    weighted_average = rating_sum/similarity_sum
    recommended_movie_list.append(j)
    predicted_rating_list.append(weighted_average)

results = pd.DataFrame(list(zip(recommended_movie_list, predicted_rating_list)), 
                      columns = ['Movie', 'Weighted_Average']).sort_values('Weighted_Average', ascending = False)
results.head(10)

Unnamed: 0,Movie,Weighted_Average
12,Donnie Darko (2001),5.0
3,"Beautiful Mind, A (2001)",5.0
15,Harry Potter and the Chamber of Secrets (2002),5.0
5,Blade Runner (1982),5.0
13,Eternal Sunshine of the Spotless Mind (2004),5.0
29,"Shawshank Redemption, The (1994)",4.829108
16,Inception (2010),4.770218
1,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",4.519259
25,Minority Report (2002),4.5
8,Catch Me If You Can (2002),4.5


#### Mean-Centered Ratings

To reduce the effect of rating bias, we use mean-centered ratings. Different users rate on different scales: one user might be lenient and rate most things highly, whereas another is a tough critic who rarely gives high ratings. The ratings therefore need to be mean-centered before predicting ratings. To compute the mean-centered rating of user *u* on item *j*, subtract the average rating of user *u* from the rating user *u* gave item *j*.

$$s_{uj} = r_{uj} - \mu_u$$

$$\forall u \in \{1...m\}$$

#### Predicting Ratings

To get the predicted rating for each item, we add the target user's mean rating to the weighted average of the similar users' mean-centered ratings for that item.

$$\hat{r}_{uj} = \mu_u + \frac{\sum_{v \in P_{u(j)}} Sim(u,v)*s_{vj}}{\sum_{v \in P_{u(j)}} |Sim(u,v)|}$$
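
The prediction formula can be traced with a few numbers. The sketch below is a minimal illustration with a hypothetical helper `predict_rating` and made-up values. Because the prediction adds an offset to the target user's mean, it can land outside the rating scale; clipping to the 0.5 to 5.0 range of the MovieLens ratings is shown as one possible remedy and is an assumption, not something the notebook does.

```python
import numpy as np

def predict_rating(mu_u, similarities, neighbour_ratings, neighbour_means):
    """mu_u plus the similarity-weighted average of mean-centered ratings."""
    sims = np.asarray(similarities, dtype=float)
    centered = (np.asarray(neighbour_ratings, dtype=float)
                - np.asarray(neighbour_means, dtype=float))   # s_vj
    rated = ~np.isnan(centered)
    denom = np.abs(sims[rated]).sum()
    if denom == 0:
        return mu_u                     # fall back to the user's own mean
    return mu_u + (sims[rated] * centered[rated]).sum() / denom

# Two neighbours: one rated the item above their mean, one below.
raw = predict_rating(3.5, [0.8, 0.4], [5.0, 2.0], [4.0, 3.0])

# A generous target user plus an enthusiastic neighbour overshoots the scale.
overshoot = predict_rating(4.5, [1.0], [5.0], [3.0])
clipped = float(np.clip(overshoot, 0.5, 5.0))  # assumed post-processing step
print(raw, overshoot, clipped)
```

The first prediction is 3.5 + (0.8 * 1 + 0.4 * (-1)) / 1.2, roughly 3.83; the second evaluates to 6.5 before clipping.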

In [649]:
u = 1

movies = target_not_watched.columns
recommended_movie_list = []
predicted_rating_list = []

mu_u = userRatings[userRatings.index == u].T.mean()[u]

for j in movies:
    movie_ratings = target_not_watched
    rating_sum = 0
    similarity_sum = 0
    for v in movie_ratings.index :
        rating = movie_ratings.loc[v][j]
        similarity = user_similarity_matrix[u][v]
        if pd.isna(rating) == False:
            mu_v = userRatings[userRatings.index == v].T.mean()[v]
            mean_centered_rating = rating - mu_v
            rating_sum = rating_sum + similarity*mean_centered_rating
            similarity_sum = similarity_sum + abs(similarity)  # |Sim(u,v)|, as in the formula
    prediction_rating = mu_u + rating_sum/similarity_sum
    recommended_movie_list.append(j)
    predicted_rating_list.append(prediction_rating)

results = pd.DataFrame(list(zip(recommended_movie_list, predicted_rating_list)), 
                      columns = ['Movie', 'Weighted_Average']).sort_values('Weighted_Average', ascending = False)
results.head(10)

Unnamed: 0,Movie,Weighted_Average
15,Harry Potter and the Chamber of Secrets (2002),6.281746
13,Eternal Sunshine of the Spotless Mind (2004),6.281746
27,Ocean's Eleven (2001),5.281746
6,"Bourne Identity, The (2002)",5.281746
16,Inception (2010),5.117285
12,Donnie Darko (2001),4.859524
3,"Beautiful Mind, A (2001)",4.859524
5,Blade Runner (1982),4.859524
10,"Departed, The (2006)",4.686975
34,Up (2009),4.623668


## Putting it Together

In [650]:
def user_recommend_movie(u, k, threshold, num_recommendations):

    # Note: drop() returns a new DataFrame and its result is discarded here, so u can still take one of the k neighbour slots.
    user_similarity_matrix.drop(index = u)

    # Return the top k similar users
    k_Neighbours = user_similarity_matrix[user_similarity_matrix[u] > threshold][u].sort_values(ascending = False)[:k]
    # movies that the target user has watched and rated
    target_watched = userRatings[userRatings.index == u].dropna(axis = 1, how = 'all')
    # movies that at least one similar user has watched
    target_not_watched = userRatings[userRatings.index.isin(k_Neighbours.index)].dropna(axis = 1, how = 'all')
    # remove movies that the target user has watched.
    target_not_watched.drop(target_watched.columns, axis = 1, inplace = True, errors = 'ignore')
    
    movies = target_not_watched.columns
    recommended_movie_list = []
    predicted_rating_list = []

    mu_u = userRatings[userRatings.index == u].T.mean()[u]

    for j in movies:
        movie_ratings = target_not_watched
        rating_sum = 0
        similarity_sum = 0
        for v in movie_ratings.index :
            rating = movie_ratings.loc[v][j]
            similarity = user_similarity_matrix[u][v]
            if pd.isna(rating) == False:
                mu_v = userRatings[userRatings.index == v].T.mean()[v]
                mean_centered_rating = rating - mu_v
                rating_sum = rating_sum + similarity*mean_centered_rating
                similarity_sum = similarity_sum + abs(similarity)  # |Sim(u,v)|, as in the formula
        prediction_rating = mu_u + rating_sum/similarity_sum
        recommended_movie_list.append(j)
        predicted_rating_list.append(prediction_rating)

    results = pd.DataFrame(list(zip(recommended_movie_list, predicted_rating_list)), 
                          columns = ['Movie', 'Predicted_Rating']).sort_values('Predicted_Rating', ascending = False).head(num_recommendations)
    return results


In [651]:
user_recommend_movie(1, 10, 0.3, 10)

Unnamed: 0,Movie,Predicted_Rating
15,Harry Potter and the Chamber of Secrets (2002),6.281746
13,Eternal Sunshine of the Spotless Mind (2004),6.281746
27,Ocean's Eleven (2001),5.281746
6,"Bourne Identity, The (2002)",5.281746
16,Inception (2010),5.117285
12,Donnie Darko (2001),4.859524
3,"Beautiful Mind, A (2001)",4.859524
5,Blade Runner (1982),4.859524
10,"Departed, The (2006)",4.686975
34,Up (2009),4.623668


# Item-Based Collaborative Filtering

The following is the movie-user matrix we will be working with. Each row represents a movie and each column represents a user; each entry is the rating that user gave that movie. As before, the matrix is very sparse: most entries are NaN, because any one viewer watches and rates only a small fraction of the available movies.

In [186]:
# Convert our DataFrame into a Movie x User matrix
itemRatings = df.pivot_table(index = ['title'], columns = ['userId'],
                            values = 'rating')
itemRatings.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2001: A Space Odyssey (1968),,,,,,,4.0,,,,...,,,5.0,,,5.0,,3.0,,4.5
Ace Ventura: Pet Detective (1994),,,,,3.0,3.0,,,,,...,,2.0,,2.0,,,,3.5,,3.0
Aladdin (1992),,,,4.0,4.0,5.0,3.0,,,4.0,...,,,,3.0,3.5,,,3.0,,
Alien (1979),4.0,,,,,,,,,,...,,,5.0,,,4.0,3.0,4.0,,4.5
Aliens (1986),,,,,,,,,,,...,,,4.0,,,3.5,,4.5,,5.0


## Normalization

#### Mean-Centered Ratings

To reduce the effect of rating bias, we need to normalize the ratings. Different users rate on different scales: one user might be lenient and rate most things highly, whereas another is a tough critic who rarely gives high ratings. The ratings therefore need to be mean-centered before they are used to compute similarities. To compute the mean-centered rating of user *u* on item *j*, subtract the average rating of user *u* from the rating user *u* gave item *j*.

$$s_{uj} = r_{uj} - \mu_u$$

$$\forall u \in \{1...m\}$$

In [187]:
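# Subtract each movie's mean rating (the row mean) from its ratings. Note that this centers by item mean, whereas the formula above centers by the user mean.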
itemRatings_centered = itemRatings.subtract(itemRatings.mean(axis = 1), axis = 'rows')
itemRatings_centered.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2001: A Space Odyssey (1968),,,,,,,0.105505,,,,...,,,1.105505,,,1.105505,,-0.894495,,0.605505
Ace Ventura: Pet Detective (1994),,,,,-0.040373,-0.040373,,,,,...,,-1.040373,,-1.040373,,,,0.459627,,-0.040373
Aladdin (1992),,,,0.20765,0.20765,1.20765,-0.79235,,,0.20765,...,,,,-0.79235,-0.29235,,,-0.79235,,
Alien (1979),0.030822,,,,,,,,,,...,,,1.030822,,,0.030822,-0.969178,0.030822,,0.530822
Aliens (1986),,,,,,,,,,,...,,,0.035714,,,-0.464286,,0.535714,,1.035714


## Similarity Metric

### Adjusted Cosine

The Adjusted Cosine metric measures how similar the rating vectors of two items are (computed column-wise on the user-item matrix). We will denote the items as *i* and *j*. Additionally, $U_i$ and $U_j$ will denote the sets of users who have rated items *i* and *j*, respectively.

For Adjusted Cosine, the similarity between items is calculated using the mean-centered ratings discussed previously.

$$AdjustedCosine(i,j) = \frac{\sum_{u \in U_i \bigcap U_j} s_{ui}*s_{uj}}{\sqrt{\sum_{u \in U_i \bigcap U_j} s^2_{ui}}\sqrt{\sum_{u \in U_i \bigcap U_j} s^2_{uj}}}$$
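
The formula can be implemented directly on a small user-item matrix. The following is a minimal sketch, assuming a hypothetical toy matrix with users as rows and items as columns; ratings are centered by each user's mean, as in the definition of $s_{uj}$, and the sums run only over users who rated both items.

```python
import numpy as np
import pandas as pd

def adjusted_cosine(user_item, i, j):
    """Adjusted cosine between items i and j (columns of a user x item frame)."""
    # s_uj = r_uj - mu_u: subtract each user's (row) mean.
    centered = user_item.sub(user_item.mean(axis=1), axis=0)
    both = centered[i].notna() & centered[j].notna()   # users who rated both
    s_i, s_j = centered.loc[both, i], centered.loc[both, j]
    denom = np.sqrt((s_i ** 2).sum()) * np.sqrt((s_j ** 2).sum())
    return (s_i * s_j).sum() / denom if denom > 0 else np.nan

# Hypothetical ratings: users 1-3 (rows), items A-C (columns).
toy = pd.DataFrame(
    {"A": [5.0, 4.0, 1.0], "B": [4.0, 5.0, 2.0], "C": [1.0, np.nan, 5.0]},
    index=[1, 2, 3],
)

sim_ab = adjusted_cosine(toy, "A", "B")
sim_ac = adjusted_cosine(toy, "A", "C")
print(sim_ab, sim_ac)
```

Items A and B are rated similarly relative to each user's average, so their similarity is positive, while A and C are rated in opposite directions by the two users who rated both, giving a similarity of -1.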

In [188]:
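# Compute the similarity between item pairs. DataFrame.corr() computes the Pearson correlation over co-rated users, re-centering each movie by its own mean, so the result is an item-item Pearson correlation rather than an exact Adjusted Cosine.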
items_similarity_matrix = itemRatings_centered.T.corr()
items_similarity_matrix.head()

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),American Pie (1999),Apocalypse Now (1979),...,True Lies (1994),"Truman Show, The (1998)",Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Up (2009),"Usual Suspects, The (1995)",WALL·E (2008),Waterworld (1995),Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2001: A Space Odyssey (1968),1.0,-0.036319,0.017446,0.318523,0.317386,0.32415,0.193592,0.152405,0.01149,0.478877,...,-0.108291,-0.012451,-0.041791,-0.458642,0.152271,0.245279,0.100172,-0.447306,0.087803,-0.123862
Ace Ventura: Pet Detective (1994),-0.036319,1.0,0.302193,-0.208017,-0.107524,-0.030425,0.040435,0.065549,0.173855,0.245829,...,0.139896,0.188089,0.054408,0.17693,-0.007853,-0.06152,0.170717,0.176155,0.051239,0.045676
Aladdin (1992),0.017446,0.302193,1.0,0.026514,0.151152,0.445204,0.127764,0.262014,0.367076,0.015038,...,0.333687,0.562311,-0.069176,0.137215,0.17133,0.153934,0.272375,0.065342,0.164459,0.28548
Alien (1979),0.318523,-0.208017,0.026514,1.0,0.705925,0.387215,0.215751,0.035373,-0.006804,0.378709,...,0.199538,0.17862,0.108327,0.022007,-0.098813,0.350428,0.270697,0.119849,0.117749,0.030257
Aliens (1986),0.317386,-0.107524,0.151152,0.705925,1.0,0.540458,0.111452,0.139326,0.076674,0.22192,...,0.369971,0.287243,0.084792,0.092412,0.195581,0.296933,0.294852,-0.014274,0.111864,0.225923


#### Weighted Average (Predictions)

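The weighted average of the target user's raw ratings on the items most similar to the target item is reported as the predicted rating of user *u* for target item *t*. Here $Q_{t(u)}$ denotes the set of items most similar to item *t* that user *u* has rated.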

$$\hat{r}_{ut} = \frac{\sum_{j \in Q_{t(u)}} AdjustedCosine(j,t)*r_{uj}}{\sum_{j \in Q_{t(u)}} |AdjustedCosine(j,t)|}$$

We want to make movie recommendations for a particular user; we will take user 1. We first extract the movies they have watched and rated. We then find movies that are similar to the ones they have rated, and those similar movies that the user has not yet rated become the candidates for recommendation.
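
The item-based prediction for a single candidate movie can be sketched as follows. The helper `predict_item_based`, the choice of taking the *k* highest similarities as the neighbourhood, and the toy numbers are all assumptions for illustration rather than the notebook's implementation.

```python
import numpy as np
import pandas as pd

def predict_item_based(user_ratings, sims_to_target, k):
    """Weighted average of the user's raw ratings on the k items most
    similar to the target item, weighted by similarity."""
    rated = user_ratings.dropna()                       # items the user rated
    sims = sims_to_target.reindex(rated.index).dropna() # similarity to target
    top = sims.sort_values(ascending=False).head(k).index
    denom = sims[top].abs().sum()
    if denom == 0:
        return np.nan
    return (sims[top] * rated[top]).sum() / denom

# Hypothetical user: rated A, B and C, has not rated D.
user_ratings = pd.Series({"A": 5.0, "B": 3.0, "C": 1.0, "D": np.nan})
# Hypothetical similarity of each item to the target movie.
sims_to_target = pd.Series({"A": 0.8, "B": 0.2, "C": 0.1, "D": 0.9})

prediction = predict_item_based(user_ratings, sims_to_target, k=2)
print(prediction)
```

Item D is the most similar to the target but is ignored because the user never rated it; the prediction uses A and B and comes out as (0.8 * 5 + 0.2 * 3) / (0.8 + 0.2) = 4.6.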

#### Predicting Ratings

In [466]:
# Pick a user ID
user = 1

k = 5 # keep the k highest running predictions for each movie the target user has not watched
n = 3 # number of recommendations to output

# Get all the movies that target user has watched
target_watched = itemRatings_centered[user].dropna()

# Get all the movies that target user has not watched
target_not_watched = itemRatings_centered[itemRatings_centered[user].isna()][user]

results = pd.DataFrame(columns = ['Movie', 'Prediction'])

for t in target_not_watched.index:

    rating_sum = 0
    similarity_sum = 0
    recommended_movie_list = []
    predicted_rating_list = []
    
    for j in target_watched.index:
        rating = itemRatings[user][j]
        similarity = items_similarity_matrix[j][t]
        if pd.isna(rating) == False:
            rating_sum = rating_sum + rating*similarity
            similarity_sum = similarity_sum + abs(similarity)
            predicted_rating = rating_sum/similarity_sum
            recommended_movie_list.append(t)
            predicted_rating_list.append(predicted_rating)
        predicted_movies_df = pd.DataFrame(list(zip(recommended_movie_list,
                                                 predicted_rating_list)),
                                        columns = ['Movie', 'Prediction']).sort_values('Prediction', ascending = False).head(k)
    results = pd.concat([results, predicted_movies_df])  # DataFrame.append was removed in pandas 2.0
results.sort_values('Prediction', ascending = False).head()

Unnamed: 0,Movie,Prediction
1,Cliffhanger (1993),4.984929
4,Aladdin (1992),4.956856
2,Aladdin (1992),4.93631
2,Ocean's Eleven (2001),4.925916
3,Aladdin (1992),4.903666


Some movies appear more than once because up to *k* predictions are kept for each candidate movie. We can group by *Movie* and take the average of the predicted ratings for each movie.

In [467]:
results = results.groupby('Movie').mean().sort_values('Prediction', ascending = False).reset_index()
results.head()

Unnamed: 0,Movie,Prediction
0,Aladdin (1992),4.879762
1,Ocean's Eleven (2001),4.782858
2,"Shawshank Redemption, The (1994)",4.708112
3,American Pie (1999),4.700215
4,Catch Me If You Can (2002),4.694853


## Putting it Together

In [468]:
def item_recommend_movies(user, num_similar_items, num_recommendations):
    # Get all the movies that target user has watched
    target_watched = itemRatings_centered[user].dropna()

    # Get all the movies that target user has not watched
    target_not_watched = itemRatings_centered[itemRatings_centered[user].isna()][user]

    results = pd.DataFrame(columns = ['Movie', 'Prediction'])

    for t in target_not_watched.index:

        rating_sum = 0
        similarity_sum = 0
        recommended_movie_list = []
        predicted_rating_list = []

        for j in target_watched.index:
            rating = itemRatings[user][j]
            similarity = items_similarity_matrix[j][t]
            if pd.isna(rating) == False:
                rating_sum = rating_sum + rating*similarity
                similarity_sum = similarity_sum + abs(similarity)
                predicted_rating = rating_sum/similarity_sum
                recommended_movie_list.append(t)
                predicted_rating_list.append(predicted_rating)
            predicted_movies_df = pd.DataFrame(list(zip(recommended_movie_list,
                                                     predicted_rating_list)),
                                            columns = ['Movie', 'Prediction']).sort_values('Prediction', ascending = False).head(num_similar_items)
        results = pd.concat([results, predicted_movies_df])  # DataFrame.append was removed in pandas 2.0
    results = results.groupby('Movie').mean().sort_values('Prediction', ascending = False).reset_index().head(num_recommendations)
    return results


In [469]:
item_recommend_movies(1, 10, 5)

Unnamed: 0,Movie,Prediction
0,Aladdin (1992),4.715059
1,"Shawshank Redemption, The (1994)",4.642777
2,Batman Begins (2005),4.613685
3,"Monsters, Inc. (2001)",4.603398
4,Trainspotting (1996),4.575341


# Evaluation Metric

### Root Mean Square Error (RMSE)
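
RMSE is the square root of the mean squared difference between predicted and observed ratings. The following is a minimal sketch, assuming a hypothetical helper `rmse` that skips pairs where either value is missing; the numbers are made up for illustration.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error over pairs where both values are present."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    mask = ~np.isnan(predicted) & ~np.isnan(actual)
    return float(np.sqrt(np.mean((predicted[mask] - actual[mask]) ** 2)))

# Errors of -1, 0 and 2 give sqrt((1 + 0 + 4) / 3).
error = rmse([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])
print(error)
```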

# Baseline Model

# Matrix Factorization

Matrix Factorization methods are used to reduce the dimensionality of the matrix. This is especially helpful if we are dealing with a very sparse matrix such as our user-item rating matrix. Using Matrix Factorization, we are able to approximate our user-item rating matrix by a low-rank matrix.

Once this reduced-dimensional user-rating matrix is computed, we can use the reduced representation to calculate similarities. The calculations are more robust because all the values in this matrix are filled in. The calculations are also more efficient because of the lower dimensionality.

## Principal Component Analysis

The idea is to use PCA to transform a user-rating matrix *R* of size *m x n* into a lower-dimensional matrix of size *m x d*, where *d << n*. After this transformation, every entry of the user-rating matrix is filled, as opposed to the sparse matrix we had before.
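
The reduction from *m x n* to *m x d* can be sketched with NumPy alone. The example below is a minimal illustration, assuming the matrix has already been filled (random values stand in for it here) and using a hypothetical helper `pca_reduce`; it projects the column-centered data onto the top *d* eigenvectors of the covariance matrix.

```python
import numpy as np

def pca_reduce(R_f, d):
    """Project an m x n matrix onto its top-d principal components."""
    X = R_f - R_f.mean(axis=0)                 # center each column
    cov = X.T @ X / (X.shape[0] - 1)           # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]
    return X @ top                             # m x d representation

# Hypothetical filled matrix: 6 users, 4 items.
rng = np.random.default_rng(0)
R_f = rng.normal(size=(6, 4))

reduced = pca_reduce(R_f, d=2)
print(reduced.shape)
```

Every entry of `reduced` is defined, and its columns are uncorrelated because they are projections onto orthogonal eigenvectors.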


## Singular Value Decomposition

The first step of SVD is to fill in the incomplete user-rating matrix *R*. We will denote the resulting matrix as $R_f$. To avoid introducing bias, we will mean-center the user-rating matrix and fill in the missing values with 0, since a value of 0 corresponds to the average rating after mean-centering.

$R_f$ can be broken down into the following matrices.
 
$$R_f = Q\Sigma P^T$$

where:

* $Q$: *m x m* matrix whose columns are the *m* orthonormal eigenvectors of $R_fR_{f}^T$
* $P$: *n x n* matrix whose columns are the *n* orthonormal eigenvectors of $R_{f}^TR_f$
* $\Sigma$: *m x n* diagonal matrix whose diagonal entries are non-negative; the nonzero entries are the square roots of the nonzero eigenvalues of $R_fR_{f}^T$
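
The decomposition can be checked numerically with `numpy.linalg.svd`. The following is a minimal sketch on a hypothetical 3 x 4 rating matrix: missing values are filled with 0 after centering each user's ratings, the three factors are recovered, and a rank-*d* approximation is formed by keeping the largest *d* singular values. The toy matrix and the choice of *d* are assumptions for illustration.

```python
import numpy as np

# Hypothetical ratings: 3 users x 4 items, NaN means "not rated".
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 2.0, 1.0],
    [1.0, 2.0, 5.0, np.nan],
])

# Mean-center each user's observed ratings, then fill the gaps with 0.
user_means = np.nanmean(R, axis=1, keepdims=True)
R_f = np.where(np.isnan(R), 0.0, R - user_means)

# Full SVD: Q is m x m, Pt is P transposed (n x n), sigma holds the
# singular values in descending order.
Q, sigma, Pt = np.linalg.svd(R_f, full_matrices=True)

# Rebuild the m x n matrix Sigma from the singular values.
Sigma = np.zeros(R_f.shape)
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)
reconstructed = Q @ Sigma @ Pt

# Rank-d approximation from the d largest singular values.
d = 2
R_d = Q[:, :d] @ np.diag(sigma[:d]) @ Pt[:d, :]
print(np.round(R_d, 2))
```

`reconstructed` matches `R_f` up to floating-point error, and the squared singular values equal the eigenvalues of $R_fR_{f}^T$.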

In [None]:
def adjusted_cosine():
    pass  # placeholder: body not yet written

## Advantages and Disadvantages

The advantage of Item-Based Collaborative Filtering is that it often provides more relevant recommendations because it uses your own ratings to make recommendations. For example, a recommender system might look at a movie you've enjoyed and rated highly and recommend similar movies.

Item-Based similarities are also more stable to changes in ratings. This is because there are typically many more users than items. Two users will often have only a small number of items rated in common, whereas two items are much more likely to have a larger number of users who have rated both of them. As a result, for User-Based methods, adding just a few ratings can change a similarity score a lot, whereas Item-Based similarities are much more stable when new ratings are added.

The disadvantage of Item-Based Collaborative Filtering is that it may provide less diverse recommendations than User-Based Collaborative Filtering. Recommending more diverse items may lead to pleasant surprises or newfound interests. Without enough diversity, a user may get bored with recommendations that closely resemble items they have already been recommended.

An additional disadvantage is the problem of sparsity. For example, if none of a user's nearest neighbors has rated a particular item, it is not possible to predict a rating for that item. Another point to consider, though, is that if none of the similar users has rated that item, it is possible that the target user won't like it.