# Collaborative Filtering

In [124]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import sklearn

from surprise import KNNBasic, Reader, Dataset, SVD
from numpy.linalg import svd
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Introduction

# Data

In [159]:
tags = pd.read_csv('datasets/ml-latest-small/tags.csv')
ratings = pd.read_csv('datasets/ml-latest-small/ratings.csv')
movies = pd.read_csv('datasets/ml-latest-small/movies.csv')
links = pd.read_csv('datasets/ml-latest-small/links.csv')

In [160]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [161]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [162]:
df = movies.merge(ratings, on = 'movieId')
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,3.0,851866703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9,4.0,938629179
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,13,5.0,1331380058
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.0,997938310
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19,3.0,855190091


In [163]:
# Convert are DataFrame into a User X Movie Matrix
userRatings = df.pivot_table(index = ['userId'], columns = ['title'],
                            values = 'rating')
userRatings.head()

title,"""Great Performances"" Cats (1998)",$9.99 (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931),İtirazım Var (2014)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


The following is the User-Item Matrix we will be working with. Each user is represented row-wise and each movie is represented by a column. The values inside the matrix represent the rating a user gave for a particular movie. As you can see we are dealing with a very sparse matrix as there are a lot of NaN values which is to be expected as there are only so many movies one viewer can watch and rate.

# Matrix Factorization

Matrix Factorization methods are used to reduce the dimensionality of the matrix. This is especially helpful if we are dealing with a very sparse matrix such as our User-Item Rating Matrix. Using Matrix Factorization we are able to represent our User-Item Rating Matrix as a low-rank matrix. 

Once this reduced dimensional user-rating matrix is computed, we can use this reduced representation to calculate similarities. The calculations are more robust because all the values are filled in this matrix. The calculations are also more efficient because of the lower dimensionality

## Principal Component Analysis

The idea is to use PCA to transform a user-ratings matrix *R* that is of size *m x n* into a lower-dimension of *m x d* where *d << n*. With this transformation we will have a user-rating matrix where all the values inside the matrix are filled as oppose to the sparse matrix we had before.


## Singular Value Decomposition

The first step to SVD is to fill in the incomplete User-Rating Matrix R. We will denote this resulting matrix as $R_f$. To avoid introducing bias, we will performing mean-centering on the User-Rating Matrix and fill in the missing values with 0. A value of 0 would be the average rating after mean-centering.

$R_f$ can be broken down into the following matrices.
 
$$R_f = Q\Sigma P^T$$

where:

* Q: m x m matrix where the columns are the m orthonormal eigenvectors of $R_fR_{f}^T$
* P: n x n matrix where the columns are the n orthonormal eigenvectors of 
$R_{f}^TR_f$
* \Sigma: m x n diagonal matrix where the diagonal entries are non-zero and they contain the square-root of the nonzero eigenvalues of $R_fR_{f}^T$

# User-User Collaborative Filtering

## Normalization

#### Mean-Centered Ratings

To avoid dealing with bias, we need to perform some sort of normalization on our dataset. This is because different users rate things on different scale. For example, user *x* might be very lenient with their ratings and rate things highly whereas user *y* is a tough critique and rarely gives out high reviews. The ratings need to be mean-centered before predicting ratings. To compute the mean-centered rating for user *x* on item *i* you would simply substract the rating given to item *i* by user *x* with the average rating of user *x*.

$$s_{x,i} = r_{x,i} - \mu_x$$

In [164]:
# Calculate mean-centered User-Movie Matrix
userRatings_centered = userRatings.subtract(userRatings.mean(axis = 1), axis = 'rows')
userRatings_centered.head()

title,"""Great Performances"" Cats (1998)",$9.99 (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931),İtirazım Var (2014)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


## Similarity Metric

### Pearson Similarity

The Pearson Similarity metric is used to measure the rating vectors of two users (computing row-wise on the user-item matrix). We will denote the users as user *x* and user *y*. Additionally, $I_x$ and $I_y$ will denote the set of items user *x* and user *y* has rated respectfully.

The first step to computing the Pearson Similarity is computing the mean rating for each user. The mean rating of user *x* is computed with the following equation:

$$\mu_x = \frac{\sum_i x_{i}}{|I_x|}$$

where *i* is the index of the item therefore $x_{i}$ is the rating user *x* gave on item *i*

The pearson similarity can then be computed between the two users:

$$Pearson(x,y) = Sim(x,y) = \frac{\sum_i ((x_{i}-\mu_x)(y_{i} - \mu_y))}{\sqrt{\sum_i ((x_{i}-\mu_x)^2}\sqrt{\sum_i ((y_{i}-\mu_y)^2}}$$

The Pearson Similarity is computed between a target user and all the other users. We can then find the *k* number of users with the highest Pearson Similarity with the target user.

In [165]:
# Perform Pearson Similarity on the users.
similarity_matrix = userRatings.T.corr()
similarity_matrix.head()

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,,,0.068752,,,-0.912871,,,,...,,,1.0,0.132453,,,,,,
2,,1.0,-1.683415e-16,-0.070244,0.283473,,0.302032,-0.044901,4.6638700000000004e-17,-1.0,...,-0.042903,-1.0,0.296117,-0.368922,0.052083,-0.224189,-0.5,-1.0,-0.648204,0.460239
3,,-1.683415e-16,1.0,,-0.170881,-1.0,0.188056,0.061023,-0.5809475,0.559017,...,0.14342,,0.506024,0.114808,0.655596,0.69926,-0.693375,0.970725,-0.218218,0.354044
4,0.068752,-0.07024394,,1.0,0.084827,0.434057,0.270274,0.471954,,,...,0.179664,,0.181856,0.356573,0.254491,0.600736,0.57735,0.270765,0.027639,0.131904
5,,0.2834734,-0.1708812,0.084827,1.0,0.333333,-0.559017,-0.181601,0.2611165,1.0,...,0.389156,,0.002655,0.180346,-0.369175,-0.408248,,,-0.632456,-0.186872


We need to define the top *k* users similar to the target user. Additionally we need to set some sort of similarity threshold because the top *k* results could yield users that are drastically different.

In [166]:
k = 10 # Number of similar users we want to retrieve
similarity_threshold = 0.3 # Threshold that needs to be met to be considered similar
user_id = 1 # The target user for which we want to generate recommendations for

similarity_matrix.drop(index = user_id, inplace = True) # remove target user so that they are not amongst one of the similar users.

# Return the top k (10) similar users
top_k_users = similarity_matrix[similarity_matrix[user_id] > similarity_threshold][user_id].sort_values(ascending = False)[1: k + 1]
top_k_users

userId
243    1.0
458    1.0
539    1.0
428    1.0
420    1.0
403    1.0
574    1.0
582    1.0
594    1.0
268    1.0
Name: 1, dtype: float64

Remove movies that our target user has already seen and keep movies that similar users have watched.

In [167]:
# retrieve the movies watched by the target user. Do not want to recommend these movies
target_watched = userRatings_centered[userRatings_centered.index == user_id].dropna(axis = 1, how = 'all')
target_watched

title,Antz (1998),Beavis and Butt-Head Do America (1996),Ben-Hur (1959),Blazing Saddles (1974),Cape Fear (1991),Cinema Paradiso (Nuovo cinema Paradiso) (1989),Dangerous Minds (1995),"Deer Hunter, The (1978)",Dracula (Bram Stoker's Dracula) (1992),Dumbo (1941),Escape from New York (1981),"Fly, The (1986)","French Connection, The (1971)",Gandhi (1982),"Gods Must Be Crazy, The (1980)",Sleepers (1996),Star Trek: The Motion Picture (1979),Time Bandits (1981),Tron (1982),Willow (1988)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1,-0.55,-1.55,-0.55,0.45,-0.55,1.45,-0.05,-0.55,0.95,0.45,-0.55,-0.05,1.45,-0.55,0.45,0.45,-0.05,-1.55,1.45,-0.55


In [168]:
# drop movies that none of the similar users have watched
similar_watched = userRatings_centered[userRatings_centered.index.isin(top_k_users.index)].dropna(axis = 1, how = 'all')
similar_watched.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),...And Justice for All (1979),10 Things I Hate About You (1999),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),12 Angry Men (1997),127 Hours (2010),...,X2: X-Men United (2003),Yellow Submarine (1968),You Can't Take It with You (1938),Young Frankenstein (1974),Your Highness (2011),Zack and Miri Make a Porno (2008),Zodiac (2007),Zoolander (2001),loudQUIETloud: A Film About the Pixies (2006),xXx (2002)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
243,,,,,,,,,,,...,0.605863,,,,,,,,,
268,-2.125,0.875,-1.625,-0.125,-1.125,0.375,-2.625,1.375,-0.125,1.375,...,1.375,,-0.125,1.375,-3.125,,,,,
403,,,,,,,,,,,...,,,,,,,,,,
420,,,,,,,,,,,...,,,,0.193878,,,,,,
428,,0.262048,,,,,,,,,...,-0.737952,,,,,,,,0.762048,


We need to remove the movies that the target user has watched from the movies that similar users have watched.

In [169]:
# remove movies that the target user has watched.
similar_watched.drop(target_watched.columns, axis = 1, inplace = True, errors = 'ignore')
similar_watched.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),...And Justice for All (1979),10 Things I Hate About You (1999),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),12 Angry Men (1997),127 Hours (2010),...,X2: X-Men United (2003),Yellow Submarine (1968),You Can't Take It with You (1938),Young Frankenstein (1974),Your Highness (2011),Zack and Miri Make a Porno (2008),Zodiac (2007),Zoolander (2001),loudQUIETloud: A Film About the Pixies (2006),xXx (2002)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
243,,,,,,,,,,,...,0.605863,,,,,,,,,
268,-2.125,0.875,-1.625,-0.125,-1.125,0.375,-2.625,1.375,-0.125,1.375,...,1.375,,-0.125,1.375,-3.125,,,,,
403,,,,,,,,,,,...,,,,,,,,,,
420,,,,,,,,,,,...,,,,0.193878,,,,,,
428,,0.262048,,,,,,,,,...,-0.737952,,,,,,,,0.762048,


#### Weighted Average

Once we have the *k* users most similar to the target user, we can use the Pearson Similarity scores and the user-item ratings of the similar users to calculate the Weighted Average for an item. 

The weighted average is calculated with the following formula:

$$\frac{\sum_i Sim*s_{i}}{\sum Sim}$$

The above equation takes the sum of the product of the similarity scores of the target and similar users and the rating the similar users gave on item *i*, all divided by the sum of the similarity scores of the target and similar users. 

The items with the highest weighted average are the items that should be recommended to the target user.

In [170]:
# Compute the weighted Average

# Get list of movies that target has not watched
movies = similar_watched.columns.values.tolist()
score = {}
for i in movies:
    movie_ratings = similar_watched[i]
    rating_sum = 0
    similarity_sum = 0
    for user in movie_ratings.index:
        rating = float(movie_ratings.loc[user])
        similarity = float(top_k_users.loc[user])
        if pd.isna(rating) == False:
            rating_sum = rating_sum + rating*similarity
            similarity_sum = similarity_sum + similarity
    weighted_average = rating_sum/similarity_sum
    score.update({i:weighted_average})
    
recommendations = pd.DataFrame({'Movie': score.keys(), 'Weighted_Average': score.values()})
recommendations.sort_values('Weighted_Average', ascending = False)[:10]

Unnamed: 0,Movie,Weighted_Average
370,"Great Dictator, The (1940)",1.605863
579,Moulin Rouge (2001),1.496772
262,Dirty Dancing: Havana Nights (2004),1.387681
722,Save the Last Dance (2001),1.387681
706,"Rocky Horror Picture Show, The (1975)",1.387681
542,Mask (1985),1.387681
205,Chocolat (2000),1.387681
387,Hans Christian Andersen (1952),1.387681
831,The Count of Monte Cristo (2002),1.387681
403,"Hello, Dolly! (1969)",1.387681


#### Predicting Ratings

To get the predicted ratings, we simply add the mean rating given by the target user and add it to each weighted average for each item.

$$\hat{r}_{x,i} = \frac{\sum_i Sim*s_{i}}{\sum Sim} + \mu_x$$

In [171]:
# Compute the predicted rating

avg_rating = userRatings[userRatings.index == user_id].T.mean()[user_id]
recommendations['Predicted_Rating'] = recommendations['Weighted_Average'] + avg_rating
recommendations.sort_values('Weighted_Average', ascending = False)[:10]

Unnamed: 0,Movie,Weighted_Average,Predicted_Rating
370,"Great Dictator, The (1940)",1.605863,4.155863
579,Moulin Rouge (2001),1.496772,4.046772
262,Dirty Dancing: Havana Nights (2004),1.387681,3.937681
722,Save the Last Dance (2001),1.387681,3.937681
706,"Rocky Horror Picture Show, The (1975)",1.387681,3.937681
542,Mask (1985),1.387681,3.937681
205,Chocolat (2000),1.387681,3.937681
387,Hans Christian Andersen (1952),1.387681,3.937681
831,The Count of Monte Cristo (2002),1.387681,3.937681
403,"Hello, Dolly! (1969)",1.387681,3.937681


In [172]:
k = 10 # Number of similar users we want to retrieve
similarity_threshold = 0.3 # Threshold that needs to be met to be considered similar


def recommend_movies(user):
    similarity_matrix.drop(index = user, inplace = True)
    top_k_users = similarity_matrix[similarity_matrix[user] > similarity_threshold][user].sort_values(ascending = False)[1: k + 1]
    target_watched = userRatings_centered[userRatings_centered.index == user].dropna(axis = 1, how = 'all')
    similar_watched = userRatings_centered[userRatings_centered.index.isin(top_k_users.index)].dropna(axis = 1, how = 'all')
    similar_watched.drop(target_watched.columns, axis = 1, inplace = True, errors = 'ignore')

    movies = similar_watched.columns.values.tolist()
    score = {}
    
    for i in movies:
        movie_ratings = similar_watched[i]
        rating_sum = 0
        similarity_sum = 0
        for user in movie_ratings.index:
            rating = float(movie_ratings.loc[user])
            similarity = float(top_k_users.loc[user])
            if pd.isna(rating) == False:
                rating_sum = rating_sum + rating*similarity
                similarity_sum = similarity_sum + similarity
        weighted_average = rating_sum/similarity_sum
        score.update({i:weighted_average})
    
    recommendations = pd.DataFrame({'Movie': score.keys(), 'Weighted_Average': score.values()})
    avg_rating = userRatings[userRatings.index == user_id].T.mean()[user_id]
    recommendations['Predicted_Rating'] = recommendations['Weighted_Average'] + avg_rating
    recommendations = recommendations.sort_values('Weighted_Average', ascending = False)[:10]
    return recommendations

In [173]:
recommend_movies(3)

Unnamed: 0,Movie,Weighted_Average,Predicted_Rating
38,Arachnophobia (1990),2.023256,4.573256
36,Apollo 13 (1995),2.023256,4.573256
80,"Civil War, The (1990)",1.916667,4.466667
149,Inside Out (2015),1.916667,4.466667
311,You've Got Mail (1998),1.916667,4.466667
50,Bad News Bears (2005),1.916667,4.466667
138,"Grand Budapest Hotel, The (2014)",1.916667,4.466667
156,"Journey, The (El viaje) (1992)",1.916667,4.466667
287,Thursday (1998),1.875,4.425
55,Battle Royale (Batoru rowaiaru) (2000),1.875,4.425


# Evaluation Metric

### Root Mean Square Error (RMSE)

# Baseline Model

# Item-Based Collaborative Filtering

## Similarity Metric

### Adjusted Cosine

The Adjusted Cosine metric is used to measure the rating vectors of two items (computing column-wise on the user-item matrix). We will denote the items as item *1* and item *2*. Additionally, $U_1$ and $U_2$ will denote the set of users who have rated items *1* and *2*.

For Adjusted Cosine, the similarity between items are calculated using the mean-centered ratings which we discussed previously. 

$$AdjustedCosine(1,2) = \frac{\sum_{U_1 \bigcap U_2} s_{u,1}s_{u,2}}{\sqrt{\sum_{U_i \bigcap U_j} (s_{u,1})^2}\sqrt{\sum_{U_1 \bigcap U_2} (s_{u,2})^2}}$$

#### Predicting Ratings

Once we have the *k* items most similar to the target item for one user, we can use the Adjusted Cosine scores and the user-item ratings of the similar items to predict the score that target item would receive from the user. Afterwards you would just recommend the top items based off the predicted rated scores. 

For example, lets say we are trying to make recommendations for user *3* and we see that user *3* is missing ratings for items *1* and *6*. We want to make predictions on those items to see if they would be good recommendations for user *3*. Therefore, items 1 and 6 will be our target items. We will compute the Adjusted Cosine between the target item and every item to see which items are most similar to the target. For item *1* the most similar items are item *2* and *3*. For item *6* the most similar items are item *4* and *5*. We will use the Adjusted Cosine score between the target and the similar item aswell as the raw rating user *3* gave to the similar item. We can use the raw rating because mean-centered rating was accounted for during the calculating of Adjusted Cosine:

$$r_{3,1} = \frac{r_{3,2}*AdjustedCosine(1,2) + r_{3,3}*AdjustedCosine(1,3)}{AdjustedCosine(1,2) + AdjustedCosine(1,3)}$$

$$r_{3,6} = \frac{r_{3,4}*AdjustedCosine(6,4) + r_{3,5}*AdjustedCosine(6,5)}{AdjustedCosine(6,4) + AdjustedCosine(6,5)}$$

If $r_{3,1} > r_{3,6}$, then item *1* would get recommended to user *3*.


In [None]:
def adjusted_cosine():
    

## Advantages and Disadvantages

The advantage of Item-Based Collaborative Filtering is that it often provides more relevant recommendations because it using your OWN ratings to make recommendations. For example a recommonder system might look at amovie you've enjoyed and rated highly and recommend similar movies.

Item-Based ratings are also more stable to changes in ratings. This is because for User-Based ratings, there are a lot more users than items. This means that there will be cases where two users have a small number of the same items, but two items are much more likely to have a larger number of users who have rated both of them. This means that for User-Based ratings, just adding a few ratings can change the similarity score a lot, whereas for Item-Based it is much more stable to additions of new ratings.

The disadvantage of Item-Based Collaborative Filtering is that they may not provide more diverse recommendations as oppose to User-Based Collaborative Filtering. Recommending more diverse items may lead to pleasant surprises or new found interests. Without enough diversity, it is possible that a user can get border with similar recommendations to items they've been recommended.

An additional disadvantage is the problem of sparsity. For example if none of the nearest neighbors for one user has rated a particular item, it is not possible to predict a rating for that item. Though another to consider is that if none of the similar users have rated that item, it is possible that the target user won't like it.