<b>USER BASED RECOMMENDER SYSTEM</b>

Steps in a user-based recommendation system:

1. Select a user together with the movies that user has rated.
2. Based on the user's movie ratings, find the top x most similar users (neighbours).
3. For each neighbour, get the record of movies that neighbour has rated.
4. Calculate a score for each candidate movie from the neighbours' ratings, weighted by similarity (Pearson correlation in this notebook).
5. Recommend the movies with the highest scores.

In [1]:
import pandas as pd
from math import sqrt
import numpy as np


In [2]:
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
print(movies_df.info())
print(movies_df.head())
print(ratings_df.head())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1 

In [3]:
userInput = [{'title':'Toy Story (1995)', 'rating':4},
             {'title':'Jumanji (1995)', 'rating':4},
             {'title':'Grumpier Old Men (1995)', 'rating':4},
             {'title':'Waiting to Exhale (1995)', 'rating':5},
             {'title':'Father of the Bride Part II (1995)', 'rating':5}]
inputMovies = pd.DataFrame(userInput)
print(inputMovies)

                                title  rating
0                    Toy Story (1995)       4
1                      Jumanji (1995)       4
2             Grumpier Old Men (1995)       4
3            Waiting to Exhale (1995)       5
4  Father of the Bride Part II (1995)       5


In [4]:
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputMovies = pd.merge(inputId, inputMovies)
#inputMovies = inputMovies.drop('year', 1) #we don't really need this at the moment
inputMovies = inputMovies[['movieId','title','rating']]
print(inputMovies)

   movieId                               title  rating
0        1                    Toy Story (1995)       4
1        2                      Jumanji (1995)       4
2        3             Grumpier Old Men (1995)       4
3        4            Waiting to Exhale (1995)       5
4        5  Father of the Bride Part II (1995)       5
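
The merge above silently drops any input title that has no exact match in `movies.csv`, for example a title typed without its release year. Below is a minimal sketch of one way to surface such titles. The small catalogue `toy_movies`, the input `toy_input`, and the year-less title are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv (illustrative data only)
toy_movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})

# Hypothetical user input; the second title lacks the year, so it will not match
toy_input = pd.DataFrame([
    {'title': 'Toy Story (1995)', 'rating': 4},
    {'title': 'Jumanji', 'rating': 4},
])

# A left merge keeps every input row; indicator=True adds a '_merge' column
checked = toy_input.merge(toy_movies, on='title', how='left', indicator=True)

# Rows marked 'left_only' are input titles with no catalogue match
unmatched = checked.loc[checked['_merge'] == 'left_only', 'title'].tolist()

# Matched rows; movieId became float because of the NaN, so cast it back
matched = checked.loc[checked['_merge'] == 'both', ['movieId', 'title', 'rating']]
matched = matched.astype({'movieId': 'int64'})

print(unmatched)
print(matched)
```

Listing the unmatched titles before going further makes it easier to spot a typo than noticing later that a row is missing.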


#### With the `movieId` values for our input, we can now get the subset of users who have watched and rated the same movies, and then find the users with similar taste.

In [5]:
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
print(userSubset.groupby('movieId').count())

         userId  rating  timestamp
movieId                           
1           215     215        215
2           110     110        110
3            52      52         52
4             7       7          7
5            49      49         49


In [6]:
#groupby splits the rows into one sub-DataFrame per value of the grouping column (here userId)
userSubsetGroup = userSubset.groupby(['userId'])

def common_movie_count(x):
    #x is a (key, sub-DataFrame) pair; its length is the number of input movies this user rated
    return len(x[1])

#Sort so that users with the most movies in common with the input get priority
userSubsetGroup = sorted(userSubsetGroup, key=common_movie_count, reverse=True)

userSubsetGroup = userSubsetGroup[0:100]
print(userSubsetGroup[0:5])


[((6,),      userId  movieId  rating  timestamp
560       6        2     4.0  845553522
561       6        3     5.0  845554296
562       6        4     3.0  845554349
563       6        5     5.0  845553938), ((68,),        userId  movieId  rating   timestamp
10360      68        1     2.5  1158531426
10361      68        2     2.5  1158532776
10362      68        3     2.0  1158533415
10363      68        5     2.0  1158533624), ((169,),        userId  movieId  rating   timestamp
24321     169        1     4.5  1059427918
24322     169        2     4.0  1078284713
24323     169        3     5.0  1078284750
24324     169        5     5.0  1078284788), ((288,),        userId  movieId  rating   timestamp
42114     288        1     4.5  1054568869
42115     288        2     2.0   978467973
42116     288        3     4.0   975691635
42117     288        5     2.0   978622571), ((414,),        userId  movieId  rating  timestamp
62294     414        1     4.0  961438127
62295     414       
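
Each row of `userSubset` is one rating of an input movie, so the number of rows per `userId` is the number of movies that user shares with the input. Below is a rough sketch of ranking users by that count without sorting the groups by hand. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for userSubset (made-up ratings, not taken from ratings.csv)
toy_subset = pd.DataFrame({
    'userId':  [6, 6, 6, 68, 68, 9],
    'movieId': [2, 3, 4, 1, 2, 5],
    'rating':  [4.0, 5.0, 3.0, 2.5, 2.5, 3.0],
})

# Count rated input movies per user; value_counts sorts by count, descending
common_counts = toy_subset['userId'].value_counts()

# Keep the n users with the most movies in common (n=2 here, 100 in the notebook)
top_user_ids = common_counts.head(2).index.tolist()

# Restrict the ratings to those users
top_subset = toy_subset[toy_subset['userId'].isin(top_user_ids)]

print(common_counts)
print(top_subset)
```

When several users share the same count, `value_counts` may order them differently from `sorted`, so the cut-off at n could select a slightly different set of users.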

In [7]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:

    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')

    #Get the N for the formula
    nRatings = len(group)

    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]

    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
   
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
   
    
    #Now calculate the Pearson correlation manually between the two users: x (the input user) and y (the current user); see the formula on the week 7 slide
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    #If both Sxx and Syy are non-zero, divide; otherwise set the correlation to 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
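
The loop above computes r = Sxy / sqrt(Sxx * Syy) for each user. As a sanity check, the same formula can be wrapped in a small function and compared with NumPy. This is a sketch only: the helper name `pearson` is an assumption, and the two rating lists are read off the printed outputs above (the input user's ratings for movies 2, 3, 4, 5 and user 6's ratings for the same movies).

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists; 0 if either list is constant."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx != 0 and syy != 0:
        return sxy / sqrt(sxx * syy)
    return 0

# Input user's ratings for movies 2, 3, 4, 5 and user 6's ratings for the same movies
input_ratings = [4, 4, 5, 5]
user6_ratings = [4.0, 5.0, 3.0, 5.0]

r = pearson(input_ratings, user6_ratings)

# Cross-check against the off-diagonal entry of NumPy's correlation matrix
r_numpy = np.corrcoef(input_ratings, user6_ratings)[0, 1]

print(round(r, 6), round(r_numpy, 6))  # -0.301511 -0.301511
```

With only four shared movies, the coefficient is very sensitive to a single rating, which is worth keeping in mind when reading the similarity values.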
    


In [8]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
print(pearsonDF.head())


   similarityIndex  userId
0        -0.301511    (6,)
1        -0.577350   (68,)
2         0.522233  (169,)
3        -0.570352  (288,)
4        -0.870388  (414,)


In [9]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers['userId'] = topUsers['userId'].apply(lambda x: x[0])
print(topUsers.head())

    similarityIndex  userId
47         1.000000      58
65         1.000000     200
2          0.522233     169
59         0.000000     151
69         0.000000     232


In [10]:
topUsersRating = topUsers.merge(ratings_df, on='userId', how='inner')
print(topUsersRating.head(100))

    similarityIndex  userId  movieId  rating  timestamp
0               1.0      58        3     3.0  847719397
1               1.0      58        5     4.0  847719151
2               1.0      58        7     5.0  847719397
3               1.0      58       19     1.0  847718718
4               1.0      58       21     4.0  847718718
..              ...     ...      ...     ...        ...
95              1.0      58      552     3.0  847719108
96              1.0      58      555     3.0  847719035
97              1.0      58      587     4.0  847718674
98              1.0      58      588     5.0  847718433
99              1.0      58      589     4.0  847718617

[100 rows x 5 columns]


In [11]:
#Multiply each neighbour's rating by that neighbour's similarity to the input user
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
print(topUsersRating.head())

   similarityIndex  userId  movieId  rating  timestamp  weightedRating
0              1.0      58        3     3.0  847719397             3.0
1              1.0      58        5     4.0  847719151             4.0
2              1.0      58        7     5.0  847719397             5.0
3              1.0      58       19     1.0  847718718             1.0
4              1.0      58       21     4.0  847718718             4.0


In [12]:
#Group the ratings by movieId and sum the similarity and weighted-rating columns
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
print(tempTopUsersRating.head())

         sum_similarityIndex  sum_weightedRating
movieId                                         
1                   1.522233            5.850048
2                   0.522233            2.088932
3                   1.522233            5.611165
4                   0.000000            0.000000
5                   2.522233           10.611165


In [13]:
#Create an empty DataFrame to hold the scores
recommendation_df = pd.DataFrame()

#Weighted average: sum of weighted ratings divided by sum of similarities (a zero similarity sum gives NaN)
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
print(recommendation_df.head(10))

         weighted average recommendation score  movieId
movieId                                                
1                                     3.843070        1
2                                     4.000000        2
3                                     3.686141        3
4                                          NaN        4
5                                     4.207052        5
6                                          NaN        6
7                                     4.828465        7
8                                          NaN        8
9                                          NaN        9
10                                    4.500000       10
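
The score for movie 3 above can be reproduced by hand. According to the printed outputs, it was rated 3.0 by user 58 (similarity 1.0) and 5.0 by user 169 (similarity 0.522233). The sketch below recomputes that weighted average and shows why a movie whose raters all have zero similarity ends up as NaN. The movie 4 row uses a made-up rating purely to illustrate the 0/0 case.

```python
import pandas as pd

toy = pd.DataFrame({
    'movieId':         [3, 3, 4],
    'similarityIndex': [1.0, 0.522233, 0.0],
    # First two ratings come from the outputs above; the third is hypothetical
    'rating':          [3.0, 5.0, 3.0],
})

# Weight each rating by the rater's similarity to the input user
toy['weightedRating'] = toy['similarityIndex'] * toy['rating']

# Sum similarities and weighted ratings per movie
sums = toy.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()

# Weighted average; 0.0 / 0.0 evaluates to NaN for movie 4
score = sums['weightedRating'] / sums['similarityIndex']

# Drop movies without a usable score before ranking
ranked = score.dropna().sort_values(ascending=False)

print(score)
print(ranked)
```

Note that the neighbour set in this notebook can include users with zero or negative similarity, which is where the NaN rows (and potentially odd scores) come from. Keeping only neighbours with positive similarity is one common way to avoid this, at the cost of fewer candidates.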


In [14]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
print(recommendation_df)


         weighted average recommendation score  movieId
movieId                                                
1136                                       5.0     1136
342                                        5.0      342
337                                        5.0      337
7318                                       5.0     7318
31878                                      5.0    31878
...                                        ...      ...
185029                                     NaN   185029
185435                                     NaN   185435
187593                                     NaN   187593
187595                                     NaN   187595
188301                                     NaN   188301

[5155 rows x 2 columns]
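
The sorted frame above holds only IDs and ends in NaN rows. Filtering a catalogue with `isin` returns rows in catalogue order rather than score order, so the ranking is lost at that point. Below is one possible way to keep the ranking while attaching titles, sketched on a few rows taken from the outputs in this notebook. The names `toy_scores`, `toy_movies`, `already_rated`, and `top` are illustrative assumptions.

```python
import pandas as pd

# A few scores as printed above (movieId 8 has no usable score)
toy_scores = pd.DataFrame({
    'movieId': [1, 7, 8, 10],
    'score':   [3.843070, 4.828465, float('nan'), 4.500000],
})

# Matching catalogue rows
toy_movies = pd.DataFrame({
    'movieId': [1, 7, 8, 10],
    'title': ['Toy Story (1995)', 'Sabrina (1995)',
              'Tom and Huck (1995)', 'GoldenEye (1995)'],
})

# Movies the input user has already rated
already_rated = [1]

top = (
    toy_scores.dropna(subset=['score'])                   # discard NaN scores
    .loc[lambda df: ~df['movieId'].isin(already_rated)]   # skip rated movies
    .sort_values('score', ascending=False)                # best score first
    .merge(toy_movies, on='movieId', how='left')          # left merge keeps this order
    .head(10)                                             # top-N cut-off
)

print(top[['title', 'score']])
```

Merging the sorted scores onto the catalogue, rather than filtering the catalogue by ID, is what preserves the ranking here.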


In [15]:
recommended_movie=movies_df.loc[movies_df['movieId'].isin(recommendation_df['movieId'])]

#Exclude the movies the input user has already rated
recommended_movie=recommended_movie.loc[~recommended_movie.movieId.isin(userSubset['movieId'])]

print(recommended_movie)

      movieId                           title  \
5           6                     Heat (1995)   
6           7                  Sabrina (1995)   
7           8             Tom and Huck (1995)   
8           9             Sudden Death (1995)   
9          10                GoldenEye (1995)   
...       ...                             ...   
9699   185029            A Quiet Place (2018)   
9703   185435          Game Over, Man! (2018)   
9709   187593               Deadpool 2 (2018)   
9710   187595  Solo: A Star Wars Story (2018)   
9713   188301     Ant-Man and the Wasp (2018)   

                                      genres  
5                      Action|Crime|Thriller  
6                             Comedy|Romance  
7                         Adventure|Children  
8                                     Action  
9                  Action|Adventure|Thriller  
...                                      ...  
9699                   Drama|Horror|Thriller  
9703                           Acti