# First, we'll use the content-based model:

In [145]:
import pandas as pd
import numpy as np


In [146]:
dataset = pd.read_csv(r"C:\Users\h.rahnavard\Downloads\IMDB\movies.csv")
dataset.head(15)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


# We should remove the year from the title column and store it in a new column, then split the movie genres and use one-hot encoding to turn them into binary values:
First we use a regex to extract the year from the title and then remove it. dataset['title'] is a pandas Series, and string operations are exposed through the ".str" accessor, so we add ".str" before calling extract and replace. The expand argument decides whether extract returns a Series (expand=False) or a data frame (expand=True). Instead of the first two lines of the corresponding block in the sample, I used r'\((\d{4})\)' as the pattern, so that the capture group holds only the 4 digits and the parentheses are excluded automatically. I also applied str.strip on the same line that removes the year from the movie titles:

In [147]:
dataset['year'] = dataset['title'].str.extract(r'\((\d{4})\)', expand=False)
dataset['title'] = dataset['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()
dataset.head(15)


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
5,6,Heat,Action|Crime|Thriller,1995
6,7,Sabrina,Comedy|Romance,1995
7,8,Tom and Huck,Adventure|Children,1995
8,9,Sudden Death,Action,1995
9,10,GoldenEye,Action|Adventure|Thriller,1995
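
To see the two string operations in isolation, here is a minimal sketch on a few toy titles (made up for illustration, including one without a year). It shows that extract returns a Series when expand=False and leaves a missing value where the pattern does not match:

```python
import pandas as pd

# Toy titles, not taken from movies.csv; the last one has no year
titles = pd.Series(["Toy Story (1995)", "Heat (1995)", "Untitled Project"])

# One capture group + expand=False -> a Series holding only the 4 digits
years = titles.str.extract(r'\((\d{4})\)', expand=False)

# Remove the "(YYYY)" part, then trim the whitespace left behind
clean_titles = titles.str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(years.tolist())
print(clean_titles.tolist())
```

Note that the extracted years are strings, so a numeric conversion would be needed before doing arithmetic on them.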


Now we should separate the genres:

In [148]:
dataset['genres'] = dataset['genres'].str.split('|')
dataset.head(15)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995
5,6,Heat,"[Action, Crime, Thriller]",1995
6,7,Sabrina,"[Comedy, Romance]",1995
7,8,Tom and Huck,"[Adventure, Children]",1995
8,9,Sudden Death,[Action],1995
9,10,GoldenEye,"[Action, Adventure, Thriller]",1995


To implement one-hot encoding we can of course use the method in the sample, but we can also use scikit-learn's built-in tools. In this case that must be MultiLabelBinarizer, since each row in genres has multiple values:

In [149]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
genre_binary = mlb.fit_transform(dataset['genres'])
genre_binary

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(9742, 20))

So we see that the outcome is an array. We need to convert it to a data frame and join it with the original data frame. We also need to make sure that the new column names correspond to the actual movie genres (hence columns=mlb.classes_) and that the index lines up with our "dataset" (hence index=dataset.index). Here I stored the result in a new data frame (dataset2), since I will need the version before encoding later on:

In [150]:
genre_pd = pd.DataFrame(genre_binary,columns=mlb.classes_, index=dataset.index)
dataset2 = dataset.join(genre_pd)
dataset2.head(15)

Unnamed: 0,movieId,title,genres,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,0,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II,[Comedy],1995,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,6,Heat,"[Action, Crime, Thriller]",1995,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
6,7,Sabrina,"[Comedy, Romance]",1995,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
7,8,Tom and Huck,"[Adventure, Children]",1995,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
8,9,Sudden Death,[Action],1995,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,10,GoldenEye,"[Action, Adventure, Thriller]",1995,0,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
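
As an alternative sketch that stays inside pandas, Series.str.get_dummies can produce the same kind of 0/1 genre columns. It assumes the genres column is still the original pipe-delimited string (that is, before the str.split step), and the toy frame below is made up for illustration:

```python
import pandas as pd

# Toy data with pipe-delimited genres, as in the raw movies.csv format
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Comedy", "Comedy|Romance", "Action"],
})

# One 0/1 column per distinct genre, split on "|"
genre_flags = toy["genres"].str.get_dummies(sep="|")

# join lines the flags up with the original rows by index
encoded = toy.join(genre_flags)
print(encoded)
```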


We start working with ratings data:

In [151]:
ratings = pd.read_csv(r"C:\Users\h.rahnavard\Downloads\IMDB\ratings.csv")
ratings.head(15)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


We need to drop the timestamp column, so we use the drop method and pass axis=1 to tell it to look for that label among the columns rather than the rows (axis=0):

In [152]:
ratings = ratings.drop('timestamp',axis=1)
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


Since there was no user input attached on the website other than the one in the sample, I made up an arbitrary one. As you see by the formatting here, writing user input as a list of multiple dictionaries makes turning the entire data into a data frame easier and adding the key 'title' helps us with having a column with the same name as the one in our original dataset:

In [153]:
user_input = [
    {'title': 'Shawshank Redemption, The', 'rating': 5.0},
    {'title': 'Inception', 'rating': 4.5},
    {'title': 'Forrest Gump', 'rating': 4.0},
    {'title': 'City Hall', 'rating': 3.0},
    {'title': 'Pulp Fiction', 'rating': 4.5},
    {'title': 'East of Eden', 'rating': 4.0},
    {'title': 'Fight Club', 'rating': 4.0},
    {'title': 'Godfather, The', 'rating': 5.0},
    {'title': 'Interstellar', 'rating': 4.0},
    {'title': 'Lust for Life', 'rating': 3.5}
]
user_likes = pd.DataFrame(user_input)
user_likes

Unnamed: 0,title,rating
0,"Shawshank Redemption, The",5.0
1,Inception,4.5
2,Forrest Gump,4.0
3,City Hall,3.0
4,Pulp Fiction,4.5
5,East of Eden,4.0
6,Fight Club,4.0
7,"Godfather, The",5.0
8,Interstellar,4.0
9,Lust for Life,3.5


Now that we have the user_likes data frame, and since the ratings are keyed by movie id, we need to find the movie ids of the movies whose titles match titles in our original data set. We pass the titles in user_likes to isin as a list (hence .tolist(), although isin also accepts a Series directly). dataset['title'].isin(user_likes['title'].tolist()) generates a Series of boolean values, True or False depending on whether the title in each row of dataset matches a title in user_likes, and indexing dataset with it keeps only the rows for which the value is True:

In [154]:
movieids = dataset[dataset['title'].isin(user_likes['title'].tolist())]
movieids

Unnamed: 0,movieId,title,genres,year
88,100,City Hall,"[Drama, Thriller]",1996
257,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994
277,318,"Shawshank Redemption, The","[Crime, Drama]",1994
314,356,Forrest Gump,"[Comedy, Drama, Romance, War]",1994
659,858,"Godfather, The","[Crime, Drama]",1972
729,949,East of Eden,[Drama],1955
2226,2959,Fight Club,"[Action, Crime, Drama, Thriller]",1999
5117,8153,Lust for Life,[Drama],1956
7372,79132,Inception,"[Action, Crime, Drama, Mystery, Sci-Fi, Thrill...",2010
8376,109487,Interstellar,"[Sci-Fi, IMAX]",2014


So now we need to merge the above data frame with user_likes, so we can have access to their ratings submitted by user as well and we drop the columns we do not need now:

In [155]:
user_likes2 = pd.merge(movieids,user_likes)
user_likes2 = user_likes2.drop('genres', axis=1).drop('year',axis=1)
user_likes2

Unnamed: 0,movieId,title,rating
0,100,City Hall,3.0
1,296,Pulp Fiction,4.5
2,318,"Shawshank Redemption, The",5.0
3,356,Forrest Gump,4.0
4,858,"Godfather, The",5.0
5,949,East of Eden,4.0
6,2959,Fight Club,4.0
7,8153,Lust for Life,3.5
8,79132,Inception,4.5
9,109487,Interstellar,4.0
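
The filter-then-merge pattern used in the last two cells can be sketched on toy data (the frames and values below are made up for illustration). The sketch passes on='title' explicitly, which is an assumption about intent: without it, pd.merge joins on whatever columns the two frames share.

```python
import pandas as pd

movies = pd.DataFrame({"movieId": [10, 20, 30], "title": ["A", "B", "C"]})
likes = pd.DataFrame({"title": ["C", "A"], "rating": [4.0, 5.0]})

# Boolean mask: True where the movie title appears in the user's list
mask = movies["title"].isin(likes["title"])
matched = movies[mask]

# An inner merge attaches each rating to its movie id
merged = matched.merge(likes, on="title", how="inner")
print(merged)
```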


Now we need to see which genres the movies our user has rated fall into. Again we pass the movie ids to isin as a list: dataset2['movieId'].isin(user_likes2['movieId'].tolist()) generates a Series of boolean values, True or False depending on whether the movie id in each row of dataset2 matches a movie id in user_likes2, and indexing dataset2 with it returns only the rows for which the value is True:

In [156]:
user_movies = dataset2[dataset2['movieId'].isin(user_likes2['movieId'].tolist())]
user_movies

Unnamed: 0,movieId,title,genres,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
88,100,City Hall,"[Drama, Thriller]",1996,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
257,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
277,318,"Shawshank Redemption, The","[Crime, Drama]",1994,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
314,356,Forrest Gump,"[Comedy, Drama, Romance, War]",1994,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,1,0
659,858,"Godfather, The","[Crime, Drama]",1972,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
729,949,East of Eden,[Drama],1955,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2226,2959,Fight Club,"[Action, Crime, Drama, Thriller]",1999,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5117,8153,Lust for Life,[Drama],1956,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7372,79132,Inception,"[Action, Crime, Drama, Mystery, Sci-Fi, Thrill...",2010,0,1,0,0,0,0,...,0,0,1,0,1,0,1,1,0,0
8376,109487,Interstellar,"[Sci-Fi, IMAX]",2014,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0


As you see above, the row numbers are not consecutive, since the rows keep the index labels they had in the original dataset. We call reset_index to renumber the rows from 0. The argument drop=True discards the previous index instead of keeping it as a column:

In [157]:
user_movies = user_movies.reset_index(drop=True)
user_movies

Unnamed: 0,movieId,title,genres,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,100,City Hall,"[Drama, Thriller]",1996,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,318,"Shawshank Redemption, The","[Crime, Drama]",1994,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,356,Forrest Gump,"[Comedy, Drama, Romance, War]",1994,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,1,0
4,858,"Godfather, The","[Crime, Drama]",1972,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,949,East of Eden,[Drama],1955,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,2959,Fight Club,"[Action, Crime, Drama, Thriller]",1999,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,8153,Lust for Life,[Drama],1956,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,79132,Inception,"[Action, Crime, Drama, Mystery, Sci-Fi, Thrill...",2010,0,1,0,0,0,0,...,0,0,1,0,1,0,1,1,0,0
9,109487,Interstellar,"[Sci-Fi, IMAX]",2014,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0


Out of the data frame above, we no longer need genres, title, movie id, or year, so we drop them all:

In [158]:
user_genres = user_movies.drop('genres', axis=1).drop('title',axis=1).drop('year', axis=1).drop('movieId', axis=1)
user_genres

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0
4,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
6,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
8,0,1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,1,1,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
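
The chained drop calls can also be written as a single call. A minimal sketch on a one-row toy frame (made up for illustration), assuming the columns keyword of DataFrame.drop is available in the installed pandas version:

```python
import pandas as pd

toy = pd.DataFrame({
    "movieId": [1],
    "title": ["A"],
    "genres": [["Drama"]],
    "year": ["1995"],
    "Drama": [1],
})

# Chained form, as in the cell above
chained = toy.drop('genres', axis=1).drop('title', axis=1).drop('year', axis=1).drop('movieId', axis=1)

# Single call with a list of column labels
single = toy.drop(columns=['genres', 'title', 'year', 'movieId'])

print(single.columns.tolist())
```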


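We need to calculate the user's weighted profile: we multiply each movie's genre flags by the rating the user gave that movie and sum the results per genre. Transposing user_genres and taking the dot product with user_likes2['rating'] does both in one step: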

In [159]:
userprofile = user_genres.T.dot(user_likes2['rating'])
userprofile

(no genres listed)     0.0
Action                 8.5
Adventure              0.0
Animation              0.0
Children               0.0
Comedy                 8.5
Crime                 23.0
Documentary            0.0
Drama                 37.5
Fantasy                0.0
Film-Noir              0.0
Horror                 0.0
IMAX                   8.5
Musical                0.0
Mystery                4.5
Romance                4.0
Sci-Fi                 8.5
Thriller              16.0
War                    4.0
Western                0.0
dtype: float64
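
A small worked example makes the dot product concrete. The three movies and ratings below are made up for illustration; each genre's weight is the sum of the ratings of the movies that carry that genre:

```python
import pandas as pd

# Rows are movies, columns are genre flags
genres = pd.DataFrame({
    "Drama":  [1, 1, 0],
    "Crime":  [0, 1, 0],
    "Sci-Fi": [0, 0, 1],
})
# The user's rating for each movie, in the same row order
ratings = pd.Series([4.0, 5.0, 3.0])

# Transpose to genres x movies, then weight each movie by its rating
profile = genres.T.dot(ratings)
print(profile)
```

Drama appears in the first two movies, so its weight is 4.0 + 5.0. The dot product aligns on index labels, which is why the row order and index of the two objects must match.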

To decide which movies from the original data set to recommend, we first need the genre flags of every movie. This data frame should contain only numbers, so we set movieId as the index (which lets us look up titles by movie id later) and drop the movieId, title, genres and year columns:

In [160]:
data_genre = dataset2.set_index(dataset2['movieId'])
data_genre = data_genre.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)
data_genre

Unnamed: 0_level_0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
193583,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
193585,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
193587,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [161]:
data_genre.shape

(9742, 20)

We need to multiply data_genre by userprofile, and since each movie may be categorized in more than one genre, we need to add up the results of the multiplication for each movie. With sum(axis=1), the sum function goes through each row and adds up its values. Then we normalize these numbers by dividing them by the sum of the userprofile values:

In [162]:
recom_table = ((data_genre*userprofile).sum(axis=1))/(userprofile.sum())
recom_table                                           

movieId
1         0.069106
2         0.000000
3         0.101626
4         0.406504
5         0.069106
            ...   
193581    0.138211
193583    0.069106
193585    0.304878
193587    0.069106
193609    0.069106
Length: 9742, dtype: float64
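
The scoring step can be checked by hand on a toy profile and three candidate movies (all values made up for illustration). Multiplying a data frame by a Series matches the Series index against the column names, so each genre flag is scaled by that genre's weight:

```python
import pandas as pd

profile = pd.Series({"Drama": 9.0, "Crime": 5.0, "Sci-Fi": 3.0})

candidates = pd.DataFrame(
    {"Drama": [1, 0, 1], "Crime": [1, 0, 0], "Sci-Fi": [0, 1, 0]},
    index=pd.Index([101, 102, 103], name="movieId"),
)

# Weighted genre flags, summed per movie, normalized by the total weight
scores = (candidates * profile).sum(axis=1) / profile.sum()

# The same result as a single dot product
scores_dot = candidates.dot(profile) / profile.sum()
print(scores)
```

A movie that carried every genre in the profile would score exactly 1, which is why the values in the recommendation table stay between 0 and 1.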

Now we sort the recom table in descending order, so we can find and recommend the ones that have the highest compatibility with our user's preferences: 

In [163]:
recom_table = recom_table.sort_values(ascending = False)
recom_table

movieId
79132     0.865854
198       0.796748
26701     0.796748
81132     0.796748
4719      0.792683
            ...   
7096      0.000000
6945      0.000000
190219    0.000000
190221    0.000000
193573    0.000000
Length: 9742, dtype: float64

Earlier we changed the index of the data frame that eventually became recom_table to movieId, so the index of recom_table corresponds to the movie ids in our original data. We take the first 20 entries of recom_table and use .keys() to access their index, which holds the movie ids. dataset['movieId'].isin(recom_table.head(20).keys()) generates a Series of boolean values, True or False depending on whether the movie id in each row of dataset appears among those 20 ids, and .loc[] then returns only the rows for which the value is True:

In [164]:
dataset.loc[dataset['movieId'].isin(recom_table.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
19,20,Money Train,"[Action, Comedy, Crime, Drama, Thriller]",1995
118,145,Bad Boys,"[Action, Comedy, Crime, Drama, Thriller]",1995
167,198,Strange Days,"[Action, Crime, Drama, Mystery, Sci-Fi, Thriller]",1995
454,519,RoboCop 3,"[Action, Crime, Drama, Sci-Fi, Thriller]",1993
1103,1432,Metro,"[Action, Comedy, Crime, Drama, Thriller]",1997
2248,2985,RoboCop,"[Action, Crime, Drama, Sci-Fi, Thriller]",1987
3460,4719,Osmosis Jones,"[Action, Animation, Comedy, Crime, Drama, Roma...",2001
3657,5027,Another 48 Hrs.,"[Action, Comedy, Crime, Drama, Thriller]",1990
3989,5628,Wasabi,"[Action, Comedy, Crime, Drama, Thriller]",2001
4693,7007,"Last Boy Scout, The","[Action, Comedy, Crime, Drama, Thriller]",1991
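
One thing to keep in mind with the isin lookup is that the rows come back in the order of dataset, not in order of score. If the ranking should be preserved, a merge starting from the scores is one option. A minimal sketch on toy data (names and values made up for illustration):

```python
import pandas as pd

movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
scores = pd.Series(
    [0.2, 0.5, 0.9],
    index=pd.Index([1, 2, 3], name="movieId"),
    name="score",
)

top = scores.sort_values(ascending=False).head(2)

# isin keeps the row order of `movies`
by_isin = movies[movies["movieId"].isin(top.index)]

# A left merge from the ranked scores keeps the score order
ranked = top.reset_index().merge(movies, on="movieId", how="left")
print(ranked)
```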


# Now let's try collaborative filtering. We're going to use the same dataset and the same user input. First we need to drop genres column from our original dataset:

In [165]:
dataset3 = dataset.drop('genres', axis=1)
dataset3.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


We already have the data frame containing the movie ids of the movies our user has watched (the ratings they submitted are in user_likes2):

In [166]:
movieids

Unnamed: 0,movieId,title,genres,year
88,100,City Hall,"[Drama, Thriller]",1996
257,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994
277,318,"Shawshank Redemption, The","[Crime, Drama]",1994
314,356,Forrest Gump,"[Comedy, Drama, Romance, War]",1994
659,858,"Godfather, The","[Crime, Drama]",1972
729,949,East of Eden,[Drama],1955
2226,2959,Fight Club,"[Action, Crime, Drama, Thriller]",1999
5117,8153,Lust for Life,[Drama],1956
7372,79132,Inception,"[Action, Crime, Drama, Mystery, Sci-Fi, Thrill...",2010
8376,109487,Interstellar,"[Sci-Fi, IMAX]",2014


Out of the dataframe above, we do not need year or genre column, so we drop them:

In [167]:
movieids2 = movieids.drop('year',axis=1).drop('genres',axis=1)
movieids2

Unnamed: 0,movieId,title
88,100,City Hall
257,296,Pulp Fiction
277,318,"Shawshank Redemption, The"
314,356,Forrest Gump
659,858,"Godfather, The"
729,949,East of Eden
2226,2959,Fight Club
5117,8153,Lust for Life
7372,79132,Inception
8376,109487,Interstellar


So now we need to find the users in our ratings data frame who have seen the same movies. Again we pass movieids2['movieId'] to isin as a list: ratings['movieId'].isin(movieids2['movieId'].tolist()) generates a Series of boolean values, True or False depending on whether the movie id in each row of ratings matches a movie id in movieids2, and indexing ratings with it returns only the rows for which the value is True:

In [168]:
usersimilar = ratings[ratings['movieId'].isin(movieids2['movieId'].tolist())]
usersimilar

Unnamed: 0,userId,movieId,rating
16,1,296,3.0
20,1,356,4.0
192,1,2959,5.0
232,2,318,3.0
246,2,79132,4.0
...,...,...,...
99559,610,356,3.0
99586,610,858,5.0
99699,610,2959,5.0
100452,610,79132,4.0


Since we need to find the users who have the most movies in common with our user, we first group the usersimilar data frame by userId. This creates an iterable object of small data frames, each containing all the rows submitted by one user. Passing this object to sorted turns it into a list of (userId, data frame of that user's ratings) tuples. As the sort key we use len(x[1]), the number of rows in each data frame, which is the number of the input user's movies that user has rated. reverse=True makes the order descending, so the users with the most movies in common appear at the top:

In [169]:
usersimilar_grouped = usersimilar.groupby(['userId'])
usersimilar_sorted = sorted(usersimilar_grouped, key=lambda x: len(x[1]), reverse=True)
usersimilar_sorted[:5]

[((474,),
         userId  movieId  rating
  73129     474      100     2.0
  73172     474      296     4.0
  73179     474      318     5.0
  73190     474      356     3.0
  73279     474      858     5.0
  73326     474      949     3.5
  73936     474     2959     4.0
  74872     474     8153     2.5),
 ((599,),
         userId  movieId  rating
  92671     599      100     2.0
  92742     599      296     5.0
  92749     599      318     4.0
  92769     599      356     3.5
  92939     599      858     4.0
  93545     599     2959     5.0
  94766     599    79132     3.0
  94951     599   109487     3.5),
 ((15,),
        userId  movieId  rating
  1442      15      296     4.0
  1443      15      318     5.0
  1445      15      356     5.0
  1453      15      858     4.0
  1479      15     2959     2.5
  1531      15    79132     3.5
  1547      15   109487     4.0),
 ((18,),
        userId  movieId  rating
  1796      18      296     4.0
  1797      18      318     5.0
  1801    
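
The group-then-sort step can be sketched on a small toy ratings frame (values made up for illustration). The sketch groups by the string 'userId' rather than the one-element list ['userId']; with the list form, recent pandas versions yield 1-tuple keys such as (474,), as the output above shows:

```python
import pandas as pd

toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5],
})

# Iterating a GroupBy yields (key, sub-frame) pairs; sort by sub-frame length
groups = sorted(toy.groupby("userId"), key=lambda pair: len(pair[1]), reverse=True)

ordered_ids = [int(user_id) for user_id, _ in groups]
print(ordered_ids)
```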

To measure how similar two users' opinions are, we use the Pearson correlation, which captures the linear relationship between their ratings: whether their ratings for the same movies move in the same direction (similar tastes) or in opposite directions (different tastes). The result lies between 1 (perfect positive correlation, perfect similarity) and -1 (perfect negative correlation, perfect dissimilarity). Subtracting each user's mean rating from their ratings accounts for different rating styles: for one user a 4 may mean "good", while for another a 3 means the same thing. The formula then sums, over the shared movies, the product of the two users' deviations from their own means, and divides that sum by the product of the square roots of each user's sum of squared deviations (proportional to their standard deviations). We will use the first 100 users in usersimilar_sorted and compute the Pearson correlation of each of them with our user:

In [170]:
from math import sqrt
usersimilarset = usersimilar_sorted[:100]
pearsonCorrelationDict = {}
for user, df in usersimilarset:
    # Sort both frames by movieId so the two rating lists line up:
    df = df.sort_values(by='movieId')
    user_likes3 = user_likes2.sort_values(by='movieId')
    # Keep only the movies rated by both users:
    temp_userlist = user_likes3[user_likes3['movieId'].isin(df['movieId'].tolist())]
    temp_df = df[df['movieId'].isin(user_likes3['movieId'].tolist())]
    # Number of shared ratings used in the Pearson correlation:
    nRatings = len(temp_df)
    # Ratings of our input user and of the comparison user as lists:
    tempuserrating = temp_userlist['rating'].tolist()
    tempcomparison = temp_df['rating'].tolist()
    # Sum of squared deviations from the mean, via the shortcut
    # sum(x^2) - (sum(x))^2 / n (the variance before dividing by n):
    uservar = sum([i**2 for i in tempuserrating]) - pow(sum(tempuserrating),2)/float(nRatings)
    comparisonvar = sum([i**2 for i in tempcomparison]) - pow(sum(tempcomparison),2)/float(nRatings)
    # Sum of cross-deviations, via the shortcut
    # sum(x*y) - sum(x)*sum(y) / n (the covariance before dividing by n):
    n = sum( i*j for i, j in zip(tempuserrating, tempcomparison)) - sum(tempuserrating)*sum(tempcomparison)/float(nRatings)
    # Guard against a ZeroDivisionError when calculating the Pearson correlation:
    if uservar != 0 and comparisonvar != 0:
        pearsonCorrelationDict[user] = n/sqrt(uservar*comparisonvar)
    else:
        # Zero variance (e.g. a single shared movie or identical ratings): no usable correlation
        pearsonCorrelationDict[user] = 0

In [171]:
pearsonCorrelationDict.items()

dict_items([((474,), 0.9639723416734176), ((599,), 0.545544725589981), ((15,), 0.32084447395987287), ((18,), -0.038348249442359456), ((50,), 0.6447150063759698), ((62,), -0.5717718748968712), ((103,), 0.3150630189063124), ((105,), 0.6333004963811282), ((122,), -0.07001400420138301), ((125,), -0.21404317236952947), ((227,), -0.20797258270192795), ((233,), 0.27116307227332265), ((247,), -0.4451319072597332), ((249,), -0.42008402520838867), ((279,), 0.28301965507089494), ((298,), 0.29854071701326274), ((305,), 0.7134772412317459), ((317,), 0.10090091909945025), ((339,), -0.07763376175556151), ((414,), 0.4200840252084095), ((417,), 0.6316299207100884), ((483,), -0.3363363969981611), ((573,), -0.1084652289093456), ((610,), 0.11354659116073401), ((16,), -0.21320071635561041), ((17,), 0.4330127018922132), ((21,), -0.18650096164806057), ((63,), 0.14852213144650128), ((68,), 0.20701966780270634), ((76,), 0.37977726265637507), ((132,), -0.18257418583505536), ((141,), -0.7575724468926281), ((177,

In [172]:
print(len(pearsonCorrelationDict))

100
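
The shortcut formulas used in the loop above can be checked in isolation. The helper below is a hypothetical stand-alone function, not part of the notebook, applied to made-up rating lists whose correlations are known in advance:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists (hypothetical helper)."""
    n = len(xs)
    if n == 0:
        return 0.0
    # Sums of squared deviations: sum(x^2) - (sum(x))^2 / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    # Sum of cross-deviations: sum(x*y) - sum(x)*sum(y) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    # A user whose shared ratings are all identical has zero variance
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

same_taste = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # ratings rise together
opposite_taste = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # ratings move in opposite directions
flat_rater = pearson([1.0, 2.0, 3.0], [4.0, 4.0, 4.0])      # no variance, falls back to 0
print(same_taste, opposite_taste, flat_rater)
```

The second list in the first call is just the first list doubled, which illustrates that the correlation ignores differences in rating scale and only tracks whether the ratings move together.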


We need to convert pearsonCorrelationDict into a data frame. orient='index' turns the keys of pearsonCorrelationDict, the user ids, into the row index. We then name the single column Correlation, add a userId column filled from the index, and finally replace the index with a regular range index of the data frame's length:

In [173]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['Correlation']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,Correlation,userId
0,0.963972,"(474,)"
1,0.545545,"(599,)"
2,0.320844,"(15,)"
3,-0.038348,"(18,)"
4,0.644715,"(50,)"
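
A possible shorter route from a dictionary to the same two-column layout goes through a Series. This sketch assumes the keys are plain integer user ids rather than the 1-tuples shown above, and the correlation values are made up for illustration:

```python
import pandas as pd

# Hypothetical correlations keyed by integer user id
correlations = {474: 0.96, 599: 0.55, 18: -0.04}

pearson_frame = (
    pd.Series(correlations, name="Correlation")
    .rename_axis("userId")   # name the index so it becomes a column
    .reset_index()           # move userId into a column, use a 0..n-1 index
)

top_two = pearson_frame.sort_values(by="Correlation", ascending=False).head(2)
print(top_two)
```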


In [174]:
print(len(pearsonDF))

100


Now we sort pearsonDF by correlation in descending order and keep the top 50 users, those most similar to our input user:

In [175]:
top_users = pearsonDF.sort_values(by='Correlation', ascending = False)
top_50_users = top_users[:50]
top_50_users

Unnamed: 0,Correlation,userId
0,0.963972,"(474,)"
81,0.912871,"(307,)"
60,0.896421,"(64,)"
53,0.821584,"(448,)"
65,0.801784,"(131,)"
70,0.766965,"(193,)"
85,0.763763,"(331,)"
16,0.713477,"(305,)"
99,0.707107,"(462,)"
56,0.707107,"(522,)"


Our goal is to recommend new movies from our movie data set to our input user, based on the movies that the most similar users rated and on their Pearson correlations. So first we need to see which movies have been rated by these 50 most similar users. Their user ids are available in both top_50_users and the original ratings data frame, so we merge the two data frames. left_on='userId', right_on='userId' means that rows are matched on userId in both the left data frame (top_50_users) and the right one (ratings), and how='inner' means that we keep only the rows whose userId exists in both data frames. But before that I needed to convert the user ids in top_50_users (currently tuples) to integers:

In [176]:
top_50_users.loc[:, 'userId'] = top_50_users['userId'].apply(lambda x: int(x[0]) if isinstance(x, tuple) else int(x))
SimilarRatings=top_50_users.merge(ratings, left_on='userId', right_on='userId', how='inner')
SimilarRatings.head()

Unnamed: 0,Correlation,userId,movieId,rating
0,0.963972,474,1,4.0
1,0.963972,474,2,3.0
2,0.963972,474,5,1.5
3,0.963972,474,6,3.0
4,0.963972,474,7,3.0


Now we need to multiply each movie rating by correlation column to get the weighted ratings:

In [177]:
SimilarRatings['WeightedRatings'] = SimilarRatings['Correlation']*SimilarRatings['rating']
SimilarRatings.head()

Unnamed: 0,Correlation,userId,movieId,rating,WeightedRatings
0,0.963972,474,1,4.0,3.855889
1,0.963972,474,2,3.0,2.891917
2,0.963972,474,5,1.5,1.445959
3,0.963972,474,6,3.0,2.891917
4,0.963972,474,7,3.0,2.891917


Now we need to sum up the weighted ratings and divide them by the sum of the correlation indexes, so we start by grouping our data frame by movieId and summing up the Correlation and WeightedRatings columns:

In [178]:
SimilarRatings = SimilarRatings.groupby('movieId').sum()[['Correlation','WeightedRatings']]
SimilarRatings.head()

Unnamed: 0_level_0,Correlation,WeightedRatings
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,16.651852,61.622422
2,7.735387,23.171577
3,4.083033,12.828001
4,0.607143,1.821429
5,4.027686,10.325059


As we can see, we now have 2 columns. We can rename them so they are easier to understand for our purposes:

In [179]:
SimilarRatings.columns = ['Sum of Correlation indexes','Sum of weighted ratings']
SimilarRatings.head()

Unnamed: 0_level_0,Sum of Correlation indexes,Sum of weighted ratings
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,16.651852,61.622422
2,7.735387,23.171577
3,4.083033,12.828001
4,0.607143,1.821429
5,4.027686,10.325059


Now we divide the sum of weighted ratings by the sum of correlation indexes to create the weighted average recommendation score, an estimate of how our input user will likely rate each movie. We can create a new data frame named recommendation_df:

In [180]:
recommendation_df = pd.DataFrame()
recommendation_df['weighted average recommendation score'] = SimilarRatings['Sum of weighted ratings']/SimilarRatings['Sum of Correlation indexes']
recommendation_df['movieId'] = SimilarRatings.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.700635,1
2,2.995529,2
3,3.141782,3
4,3.0,4
5,2.563521,5


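It's time to sort the data frame by the weighted average recommendation score. sort_values sorts in ascending order by default, so as written the lowest scores come first; passing ascending=False would put the highest scores at the top: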

In [181]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score')
recommendation_df.head(20)

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
122898,0.5,122898
6557,0.5,6557
122627,0.5,122627
3973,0.5,3973
575,0.5,575
3774,0.5,3774
137517,0.5,137517
138798,0.5,138798
157172,0.5,157172
5700,0.5,5700
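
The whole weighting pipeline, from merge to final ranking, fits in a few lines on toy data. The two neighbours, their correlations and their ratings below are made up for illustration, and the final sort uses ascending=False so that the strongest recommendations come first:

```python
import pandas as pd

neighbours = pd.DataFrame({"userId": [1, 2], "Correlation": [0.9, 0.5]})
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 2.0, 5.0, 3.0],
})

# Attach each neighbour's correlation to every rating they submitted
merged = neighbours.merge(toy_ratings, on="userId", how="inner")
merged["WeightedRating"] = merged["Correlation"] * merged["rating"]

# Per movie: sum of weights and sum of weighted ratings
sums = merged.groupby("movieId")[["Correlation", "WeightedRating"]].sum()

# Weighted average, highest score first
scores = (sums["WeightedRating"] / sums["Correlation"]).sort_values(ascending=False)
print(scores)
```

Movie 10 was rated by both neighbours, so its score is (0.9 * 4.0 + 0.5 * 5.0) / (0.9 + 0.5); the other two movies simply inherit the single rating they received.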


Now we look up the titles of the movies in the first 20 rows of recommendation_df. Because the sort above is ascending, these rows hold the lowest weighted average recommendation scores; sorting with ascending=False would list the highest-scoring movies instead. We pass the movie ids of those 20 rows to isin as a list: dataset['movieId'].isin(recommendation_df.head(20)['movieId'].tolist()) generates a Series of boolean values, True or False depending on whether the movie id in each row of dataset appears among those ids, and .loc[] then returns only the rows for which the value is True:

In [182]:
print(dataset.loc[dataset['movieId'].isin(recommendation_df.head(20)['movieId'].tolist())])

      movieId                                     title  \
497       575                       Little Rascals, The   
2825     3774                             House Party 2   
2964     3973            Book of Shadows: Blair Witch 2   
4030     5700                               The Pumaman   
4053     5771                       My Bloody Valentine   
4439     6557                           Born to Be Wild   
7678    89118  Skin I Live In, The (La piel que habito)   
7943    95796                   Anaconda: The Offspring   
8091   100163            Hansel & Gretel: Witch Hunters   
8143   102025            Yongary: Monster from the Deep   
8168   102735                           Captain America   
8171   102749        Captain America II: Death Too Soon   
8680   122627                      Oblivion 2: Backlash   
8688   122898                            Justice League   
8760   128512                               Paper Towns   
8782   129250                                Superfast! 

And we're done :)