# Memory-Based / Neighborhood-Based Collaborative Filtering

The key idea behind collaborative filtering is that similar users share similar interests and that users tend to like items that are similar to one another. Neighborhood-based collaborative filtering methods quantify how similar users and items are to one another and produce the top N recommendations based on that similarity metric.

In [187]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

from surprise.prediction_algorithms import knns
from surprise.similarities import cosine, msd, pearson
from surprise import accuracy

# importing relevant libraries
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV

from myFunctions import movie_rocommendation_system

import warnings
warnings.filterwarnings('ignore')

### Import Movie and Rating Dataframes

In [188]:
movies_df = pd.read_csv('../ml-latest-small/movies.csv').drop(['genres'],axis=1)
print('Dataset - Movies')
print('-------------------------')
print('Number of Rows: ' + str(movies_df.shape[0]))
print('Number of Columns: ' + str(movies_df.shape[1]))
movies_df.head()

Dataset - Movies
-------------------------
Number of Rows: 9742
Number of Columns: 2


Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [189]:
ratings_df = pd.read_csv('../ml-latest-small/ratings.csv',index_col=0).reset_index().drop(['timestamp'],axis=1)
print('Dataset - Ratings')
print('-------------------------')
print('Number of Rows: ' + str(ratings_df.shape[0]))
print('Number of Columns: ' + str(ratings_df.shape[1]))
ratings_df.head()

Dataset - Ratings
-------------------------
Number of Rows: 100836
Number of Columns: 3


Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [190]:
print('Number of Unique Movies: ', len(movies_df))
print('Number of Unique Users: ', ratings_df['userId'].nunique())

Number of Unique Movies:  9742
Number of Unique Users:  610


In [191]:
ratings_df

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0
...,...,...,...
100831,610,166534,4.0
100832,610,168248,5.0
100833,610,168250,5.0
100834,610,168252,5.0


### Create new input User and Ratings 

We're going to create a new user and give ratings to a few movies. The list can contain any number of rated movies.

In [192]:
# created a random list of movies for a user
user_movie_ratings = [
    {'title':'Breakfast Club, The (1985)', 'rating':5},
    {'title':'Toy Story (1995)', 'rating':3.5},
    {'title':'Jumanji (1995)', 'rating':2},
    {'title':'Akira (1988)', 'rating':4.5}
    ]
user_movie_ratings = pd.DataFrame(user_movie_ratings)
user_movie_ratings

Unnamed: 0,title,rating
0,"Breakfast Club, The (1985)",5.0
1,Toy Story (1995),3.5
2,Jumanji (1995),2.0
3,Akira (1988),4.5


In [193]:
# Merge personal user rating dataframe to get the movieId for each movie
user_movie_ratings = user_movie_ratings.merge(movies_df, left_on='title', right_on='title')
user_movie_ratings.head()

Unnamed: 0,title,rating,movieId
0,"Breakfast Club, The (1985)",5.0,1968
1,Toy Story (1995),3.5,1
2,Jumanji (1995),2.0,2
3,Akira (1988),4.5,1274
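The merge above is an inner join on the exact title string, so a title that is spelled differently from the one in `movies_df` is dropped without any warning. A minimal sketch of one way to detect that, using toy stand-ins for the two dataframes (the data and the names `movies`, `user_ratings`, `checked` and `unmatched` are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for movies_df and the new user's ratings (hypothetical data).
movies = pd.DataFrame({
    'movieId': [1, 2, 1968],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Breakfast Club, The (1985)'],
})
user_ratings = pd.DataFrame([
    {'title': 'Toy Story (1995)', 'rating': 3.5},
    {'title': 'The Breakfast Club (1985)', 'rating': 5.0},  # spelled differently
])

# A left merge keeps every input row; indicator=True adds a '_merge' column
# that records whether a matching title was found in movies.
checked = user_ratings.merge(movies, on='title', how='left', indicator=True)

# Titles that found no match would have been dropped by an inner merge.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'title'].tolist()
print(unmatched)
```

Any title that shows up in `unmatched` needs to be corrected to the dataset's spelling before its rating can contribute to the similarity calculation.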


Next we want to keep only the users who have rated at least one of the movies that the input user has rated.

In [194]:
# Keep only the ratings for movies that the input user has also rated
userSubset = ratings_df[ratings_df['movieId'].isin(user_movie_ratings['movieId'].tolist())]
userSubset

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
422,4,1968,4.0
516,5,1,4.0
560,6,2,4.0
874,7,1,4.5
...,...,...,...
98980,608,1968,4.0
99497,609,1,3.0
99534,610,1,5.0
99636,610,1274,5.0


In [195]:
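# group users by their userId (a scalar key keeps the group names as plain
# integers; newer pandas versions return 1-tuples when the key is a list)
userSubsetGroup = userSubset.groupby('userId')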

In [196]:
# Sort so that users with the most movies in common with the input user come first
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)
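To see how much overlap each user actually has with the input user, the group sizes can also be computed directly. This is only a sketch on made-up data; `ratings`, `input_movie_ids` and `overlap` are hypothetical names standing in for `ratings_df` and the input user's movie IDs.

```python
import pandas as pd

# Hypothetical ratings table with the same columns as ratings_df.
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3, 3],
    'movieId': [1, 2, 1, 1, 2, 1274],
    'rating':  [4.0, 3.0, 5.0, 3.5, 2.5, 4.0],
})
input_movie_ids = [1, 2, 1274]  # movies rated by the input user (assumed)

# Keep ratings for the input user's movies, then count them per user.
subset = ratings[ratings['movieId'].isin(input_movie_ids)]
overlap = subset.groupby('userId').size().sort_values(ascending=False)
print(overlap)
```

Users at the top of `overlap` share the most movies with the input user, which is the same ordering the `sorted` call produces for the groups.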

In [206]:
userSubsetGroup[0:10]

[(91,        userId  movieId  rating
  14121      91        1     4.0
  14122      91        2     3.0
  14316      91     1274     5.0
  14383      91     1968     3.0), (177,        userId  movieId  rating
  24900     177        1     5.0
  24901     177        2     3.5
  25069     177     1274     2.0
  25129     177     1968     3.5), (219,        userId  movieId  rating
  31524     219        1     3.5
  31525     219        2     2.5
  31628     219     1274     2.5
  31680     219     1968     3.0), (274,        userId  movieId  rating
  39229     274        1     4.0
  39230     274        2     3.5
  39448     274     1274     4.0
  39549     274     1968     4.0), (298,        userId  movieId  rating
  44535     298        1     2.0
  44536     298        2     0.5
  44620     298     1274     4.0
  44655     298     1968     3.5), (414,        userId  movieId  rating
  62294     414        1     4.0
  62295     414        2     3.0
  62769     414     1274     4.0
  62957  

### Creating Similarity Matrix - Similarities between Users to input User

We want to compare the input user with the users who have rated the same movies, using the Pearson correlation coefficient. It measures the strength of the linear relationship between two variables, here the two users' ratings of the movies they have in common. The formula yields a value between -1 and 1.
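For reference, the coefficient can be written in the computational form below. The symbols are chosen to match the variable names `Sxx`, `Syy` and `Sxy` in this notebook's code: $x_i$ is the input user's rating of movie $i$, $y_i$ is the other user's rating of the same movie, and $n$ is the number of movies they have in common (`nRatings`).

```latex
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \\
S_{yy} &= \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \\
S_{xy} &= \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
\end{aligned}
```

When $S_{xx}$ or $S_{yy}$ is zero (for example, when a user gave the same rating to every shared movie), the ratio is undefined and the code assigns a correlation of 0 instead.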

In [198]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}
#For every user group in our subset
for name, group in userSubsetGroup:
    #Let’s start by sorting the input and current user group so the values aren’t mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = user_movie_ratings.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #Let’s also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let’s calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/math.sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
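As a sanity check, the same sums can be wrapped in a small function and compared against NumPy. The helper name `pearson` is introduced here for illustration only; the ratings for user 91 are taken from the group output shown earlier, and the two-movie lists are made up.

```python
import math
import numpy as np

def pearson(x, y):
    """Pearson correlation using the same sums as the loop above."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n
    if sxx != 0 and syy != 0:
        return sxy / math.sqrt(sxx * syy)
    return 0

# Ratings for movieId 1, 2, 1274, 1968 (sorted by movieId).
input_ratings = [3.5, 2.0, 4.5, 5.0]   # the input user
user_91 = [4.0, 3.0, 5.0, 3.0]         # user 91

r_full = pearson(input_ratings, user_91)
r_numpy = np.corrcoef(input_ratings, user_91)[0, 1]  # should agree with r_full

# With only two movies in common, the two points always lie on a line,
# so the coefficient is +1 or -1 whenever both users' ratings differ.
r_two_same = pearson([3.5, 2.0], [4.0, 3.5])      # both prefer the first movie
r_two_opposite = pearson([3.5, 2.0], [3.5, 4.0])  # opposite preference

# With a single movie in common, Sxx and Syy are zero, so the result is 0.
r_one = pearson([3.5], [4.0])
```

The two-movie case is worth keeping in mind when reading the similarity table: a similarity of exactly 1.0 or -1.0 can come from a user who shares only two movies with the input user, not necessarily from a user with very similar taste.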

In [199]:
# pearsonCorrelationDict.items()

In [204]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))

# Sort top users by correlation similarity index
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)
topUsers

Unnamed: 0,similarityIndex,userId
116,1.0,590
118,1.0,597
29,1.0,434
103,1.0,489
121,1.0,605
...,...,...
91,-1.0,387
76,-1.0,282
77,-1.0,292
54,-1.0,140


In [152]:
# Merge with users' ratings
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.0,590,1,4.0
1,1.0,590,2,2.5
2,1.0,590,3,3.0
3,1.0,590,5,2.0
4,1.0,590,6,3.5


### Creating Weighted Ratings Matrix

The weighted ratings matrix represents the neighbors' opinions about which movies to recommend, giving more weight to the users who are more similar to the input user.

In [154]:
# Multiply the similarity by the user’s ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,590,1,4.0,4.0
1,1.0,590,2,2.5,2.5
2,1.0,590,3,3.0,3.0
3,1.0,590,5,2.0,2.0
4,1.0,590,6,3.5,3.5


### Aggregation of Weighted Ratings

Once we aggregate the weighted ratings, we can find what movies to recommend based on how similar other users are to the input user.

In [155]:
# Sum the similarity indices and weighted ratings of topUsersRating after grouping by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,41.762726,157.437936
2,36.116974,109.807481
3,11.960221,36.829883
4,1.0,2.0
5,7.791846,22.583691


In [156]:
# Creates an empty dataframe
recommendation_df = pd.DataFrame()
# Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

movieId,weighted average recommendation score,movieId
1,3.769819,1
2,3.040329,2
3,3.079365,3
4,2.0,4
5,2.898375,5
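The weighted average can be checked by hand on a tiny example. The sketch below uses made-up neighbors and ratings (the dataframe `neighbors` and its values are hypothetical) and follows the same three steps as the cells above: multiply, sum per movie, divide.

```python
import pandas as pd

# Hypothetical neighbors: their similarity to the input user and their ratings.
neighbors = pd.DataFrame({
    'userId':          [10, 10, 20, 20, 30],
    'similarityIndex': [1.0, 1.0, 0.5, 0.5, 0.25],
    'movieId':         [100, 200, 100, 200, 100],
    'rating':          [4.0, 2.0, 2.0, 5.0, 4.0],
})

# Step 1: weight each rating by the neighbor's similarity.
neighbors['weightedRating'] = neighbors['similarityIndex'] * neighbors['rating']

# Step 2: sum similarities and weighted ratings per movie.
sums = neighbors.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()

# Step 3: divide to get the similarity-weighted average rating per movie.
scores = sums['weightedRating'] / sums['similarityIndex']
print(scores)
```

For movie 100 the weighted ratings are 4.0, 1.0 and 1.0 and the similarities sum to 1.75, so the score is 6.0 / 1.75; for movie 200 it is 4.5 / 1.5. The most similar neighbor (user 10) pulls each score toward their own rating.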


In [157]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

movieId,weighted average recommendation score,movieId
633,5.0,633
6945,5.0,6945
83827,5.0,83827
67618,5.0,67618
67788,5.0,67788
5537,5.0,5537
1096,5.0,1096
3404,5.0,3404
3813,5.0,3813
2290,5.0,2290
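A score of exactly 5.0 is what the weighted average returns when every contributing neighbor rated the movie 5, which is most likely when only one or two neighbors rated it at all. One possible refinement, which is not part of this notebook's method, is to require a minimum number of contributing neighbors before a movie can be ranked. The sketch below uses made-up data, and `MIN_NEIGHBORS` is a hypothetical threshold.

```python
import pandas as pd

# Hypothetical rows in the same shape as topUsersRating.
rated = pd.DataFrame({
    'movieId':         [1, 1, 1, 633],
    'similarityIndex': [1.0, 0.8, 0.6, 1.0],
    'rating':          [4.0, 3.5, 4.5, 5.0],
})
rated['weightedRating'] = rated['similarityIndex'] * rated['rating']

# Aggregate per movie and also count how many neighbors rated it.
agg = rated.groupby('movieId').agg(
    sum_similarityIndex=('similarityIndex', 'sum'),
    sum_weightedRating=('weightedRating', 'sum'),
    n_neighbors=('rating', 'count'),
)
agg['score'] = agg['sum_weightedRating'] / agg['sum_similarityIndex']

MIN_NEIGHBORS = 3  # hypothetical threshold; tune for the dataset
filtered = agg[agg['n_neighbors'] >= MIN_NEIGHBORS].sort_values('score', ascending=False)
print(filtered)
```

Here movie 633 has the highest raw score but is backed by a single neighbor, so it is filtered out, while movie 1 remains with a score supported by three neighbors.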


### Recommendation Results

In [158]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title
536,633,Denise Calls Up (1995)
835,1096,Sophie's Choice (1982)
1703,2290,Stardust Memories (1980)
2543,3404,Titanic (1953)
2851,3813,Interiors (1978)
3936,5537,Satin Rouge (2002)
4646,6945,My Architect: A Son's Journey (2003)
6999,67618,Strictly Sexual (2008)
7003,67788,Confessions of a Shopaholic (2009)
7511,83827,Marwencol (2010)
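The `isin` lookup above returns the titles in the order they appear in `movies_df`, not in order of recommendation score. When the scores differ, a merge keeps the ranking and shows the score next to each title. The following is a sketch on toy data (`movies`, `recs` and `top` are illustrative names); it resets the index first because `recommendation_df` carries `movieId` both as its index name and as a column, which pandas treats as ambiguous in a merge.

```python
import pandas as pd

# Toy stand-ins for movies_df and the sorted recommendation_df (hypothetical data).
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A (2000)', 'B (2001)', 'C (2002)'],
})
recs = pd.DataFrame(
    {'weighted average recommendation score': [4.8, 4.2], 'movieId': [3, 1]},
    index=pd.Index([3, 1], name='movieId'),  # same index/column clash as in the notebook
)

# Drop the clashing index, then left-merge so the ranking order is preserved.
top = recs.head(10).reset_index(drop=True).merge(movies, on='movieId', how='left')
print(top)
```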
