<b>USER-BASED RECOMMENDER SYSTEM</b>

Steps in a user-based recommendation system:

1. Select a user together with the books the user has rated.
2. Based on the user's ratings, find the top x neighbours (the users with the most similar taste).
3. For each neighbour, get the record of the books that neighbour has rated.
4. Calculate a similarity score using some formula (the Pearson correlation is used here).
5. Recommend the items with the highest score.
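
The steps above can be sketched end to end on a tiny, made-up ratings table. Everything in this block is an assumption for illustration: the `ratings` matrix, the user labels (`target`, `u1`, `u2`) and the book names are hypothetical, and keeping only positively correlated neighbours is one possible design choice, not necessarily the one used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy data: rows are users, columns are books, NaN means "not rated".
ratings = pd.DataFrame(
    {
        "Book A": [5, 4, 1],
        "Book B": [3, 2, 5],
        "Book C": [4, 5, 2],
        "Book D": [None, 5, 1],
    },
    index=["target", "u1", "u2"],
)

# Step 1: the selected user and everyone else.
target = ratings.loc["target"]
others = ratings.drop(index="target")

# Steps 2-4: Pearson correlation with each other user over co-rated books;
# keep only positively correlated users as neighbours (an assumption).
similarity = others.apply(lambda row: row.corr(target), axis=1)
neighbours = similarity[similarity > 0]

# Step 5: similarity-weighted average rating for books the target has not rated.
unseen = target[target.isna()].index
unseen_ratings = others.loc[neighbours.index, unseen]
weighted_sum = unseen_ratings.mul(neighbours, axis=0).sum()
weight_sum = unseen_ratings.notna().astype(float).mul(neighbours, axis=0).sum()
scores = (weighted_sum / weight_sum).sort_values(ascending=False)

print(neighbours)
print(scores)
```

Here only `u1` correlates positively with the target, so the score for the unrated book is simply `u1`'s rating of it.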

In [3]:
import pandas as pd
from math import sqrt
import numpy as np

In [4]:
ratings_df = pd.read_csv('joinedUserReview.csv')

In [5]:
userInput = [{'title':'1984', 'rating':4},
             {'title':'A Child Called "It"', 'rating':3},
             {'title':'A Conjuring of Light', 'rating':2},
             {'title':'Mistborn', 'rating':5},
             {'title':'The Ocean at the End of the Lane', 'rating':1}]
inputBooks = pd.DataFrame(userInput)
print(inputBooks)

                              title  rating
0                              1984       4
1               A Child Called "It"       3
2              A Conjuring of Light       2
3                          Mistborn       5
4  The Ocean at the End of the Lane       1


In [6]:
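# Rows of ratings_df whose title matches one of the input books
inputId = ratings_df[ratings_df['Title'].isin(inputBooks['title'].tolist())]
print(inputId)
# Keep only the columns needed for the similarity calculation
inputBooks = inputBooks[['title','rating']]
print(inputBooks)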

      UserID                             Title          ISBN  Rating
0          1                          Mistborn           NaN       5
68         1              A Conjuring of Light           NaN       2
516        1                              1984           NaN       4
554        2                          Mistborn  9.780765e+12       5
633        2                              1984           NaN       4
1124       2               A Child Called "It"           NaN       5
1695       3              A Conjuring of Light  9.780765e+12       3
1835       3  The Ocean at the End of the Lane  9.781472e+12       4
1951       3                              1984  9.780141e+12       5
3218       3               A Child Called "It"           NaN       4
7177       5              A Conjuring of Light  9.780765e+12       5
7670       5  The Ocean at the End of the Lane           NaN       5
                              title  rating
0                              1984       4
1               A Child Called "It"       3
2              A Conjuring of Light       2
3                          Mistborn       5
4  The Ocean at the End of the Lane       1

#### With the titles from our input, we can now get the subset of users who have read and rated the same books, and then find the users whose taste is similar.

In [7]:
userSubset = ratings_df[ratings_df['Title'].isin(inputBooks['title'].tolist())]
print(userSubset.groupby('Title').count())

                                  UserID  ISBN  Rating
Title                                                 
1984                                   3     1       3
A Child Called "It"                    2     0       2
A Conjuring of Light                   3     2       3
Mistborn                               2     1       2
The Ocean at the End of the Lane       2     1       2


In [8]:
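# Group the rows by user; grouping on a plain column name keeps each group key a scalar UserID
userSubsetGroup = userSubset.groupby('UserID')

def group_size(x):
    # x is a (UserID, DataFrame) pair; its size is the number of input books that user has rated
    return len(x[1])

# Users who share the most books with the input come first
userSubsetGroup = sorted(userSubsetGroup, key=group_size, reverse=True)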

userSubsetGroup = userSubsetGroup[0:100]
print(userSubsetGroup[0:5])


[(3,       UserID                             Title          ISBN  Rating
1695       3              A Conjuring of Light  9.780765e+12       3
1835       3  The Ocean at the End of the Lane  9.781472e+12       4
1951       3                              1984  9.780141e+12       5
3218       3               A Child Called "It"           NaN       4), (1,      UserID                 Title  ISBN  Rating
0         1              Mistborn   NaN       5
68        1  A Conjuring of Light   NaN       2
516       1                  1984   NaN       4), (2,       UserID                Title          ISBN  Rating
554        2             Mistborn  9.780765e+12       5
633        2                 1984           NaN       4
1124       2  A Child Called "It"           NaN       5), (5,       UserID                             Title          ISBN  Rating
7177       5              A Conjuring of Light  9.780765e+12       5
7670       5  The Ocean at the End of the Lane           NaN       5)]
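
The ranking step above (users ordered by how many of the input books they have rated) can also be expressed with `groupby(...).size()`. This is a minimal sketch on made-up data: the `user_subset` frame and its single-letter titles are hypothetical, and the cap of 100 users simply mirrors the slice used above.

```python
import pandas as pd

# Hypothetical subset of ratings restricted to the input books.
user_subset = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Title": ["a", "b", "c", "a", "b", "a", "b", "c", "d"],
    "Rating": [5, 2, 4, 5, 4, 3, 4, 5, 4],
})

# Number of input books each user has rated, largest overlap first.
overlap = user_subset.groupby("UserID").size().sort_values(ascending=False)

# Keep at most the 100 users with the largest overlap.
top_ids = overlap.head(100).index.tolist()
print(top_ids)
```

Working with the counts alone avoids materialising every group as a DataFrame when only the ordering is needed.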


In [9]:
#Store the Pearson correlation in a dictionary, where the key is the user ID and the value is the coefficient
pearsonCorrelationDict = {}

#Sort the input once by title so its ratings line up with each user's sorted ratings
inputBooks = inputBooks.sort_values(by='title')

#For every user group in our subset
for name, group in userSubsetGroup:

    #Sort the current user's rows by title so the two rating lists are in the same order
    group = group.sort_values(by='Title')

    #Get the N for the formula (the number of books both users have rated)
    nRatings = len(group)

    #Get the input ratings for the books that both users have in common
    temp_df = inputBooks[inputBooks['title'].isin(group['Title'].tolist())]

    #Store them in a list to simplify the calculations below
    tempRatingList = temp_df['rating'].tolist()

    #Put the current user's ratings in a list of floats as well
    tempGroupList = [float(r) for r in group['Rating'].tolist()]

    #Calculate the Pearson correlation between the two users, x (input) and y (current user), manually (see the formula in the week 7 slides)
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    #If the denominator is non-zero, divide; otherwise set the correlation to 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
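
As a sanity check on the manual formula, r = Sxy / sqrt(Sxx * Syy), the sketch below wraps the same arithmetic in a helper and compares it with `np.corrcoef`. The function name `pearson` is hypothetical. The two rating lists are taken from the tables printed earlier: the input ratings and user 3's ratings for the four books they share, both ordered by title.

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation from the sums of squares, mirroring the loop above."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A constant rating list has zero variance, so the correlation is undefined;
    # like the notebook, fall back to 0 in that case.
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Input ratings vs. user 3 for: 1984, A Child Called "It",
# A Conjuring of Light, The Ocean at the End of the Lane.
x = [4, 3, 2, 1]
y = [5, 4, 3, 4]

r_manual = pearson(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 6), round(r_numpy, 6))
```

Both values should agree with the 0.632456 reported for user 3 in the similarity table.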
    


In [10]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['UserID'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
print(pearsonDF.head())


   similarityIndex  UserID
0         0.632456       3
1         1.000000       1
2         0.000000       2
3         0.000000       5


In [11]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
print(topUsers.head())

   similarityIndex  UserID
1         1.000000       1
0         0.632456       3
2         0.000000       2
3         0.000000       5


In [12]:
topUsersRating=topUsers.merge(ratings_df, left_on='UserID', right_on='UserID', how='inner')
print(topUsersRating.head(100))
topUsersRating.to_csv('test.csv')

    similarityIndex  UserID                                      Title  \
0               1.0       1                                   Mistborn   
1               1.0       1  Assassin's Quest: The Illustrated Edition   
2               1.0       1                           The Blade Itself   
3               1.0       1                     Last Argument of Kings   
4               1.0       1                                 Warbreaker   
..              ...     ...                                        ...   
95              1.0       1                             The Great Hunt   
96              1.0       1                                 The Heroes   
97              1.0       1                            Fool's Assassin   
98              1.0       1                            The Poison Song   
99              1.0       1                       The Eye of the World   

            ISBN  Rating  
0            NaN       5  
1   9.780593e+12       2  
2   9.780575e+12       5  
3  

In [13]:
#Multiplies the similarity by the user’s ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['Rating']
print(topUsersRating.head())

   similarityIndex  UserID                                      Title  \
0              1.0       1                                   Mistborn   
1              1.0       1  Assassin's Quest: The Illustrated Edition   
2              1.0       1                           The Blade Itself   
3              1.0       1                     Last Argument of Kings   
4              1.0       1                                 Warbreaker   

           ISBN  Rating  weightedRating  
0           NaN       5             5.0  
1  9.780593e+12       2             2.0  
2  9.780575e+12       5             5.0  
3  9.780575e+12       5             5.0  
4  9.781939e+12       5             5.0  


In [14]:
#Sums similarityIndex and weightedRating for each book after grouping by Title
tempTopUsersRating = topUsersRating.groupby('Title')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
print(tempTopUsersRating.head())

                         sum_similarityIndex  sum_weightedRating
Title                                                           
0 の奏香師 [0 no Soukoushi]             0.632456            2.529822
01 Monkey                           0.632456            1.264911
0x0 Memories                        0.632456            1.264911
1/2 Ceremony                        0.632456            2.529822
1/2 Engaged                         0.632456            2.529822


In [15]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()

#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['Title'] = tempTopUsersRating.index
print(recommendation_df.head(10))

                         weighted average recommendation score  \
Title                                                            
0 の奏香師 [0 no Soukoushi]                                    4.0   
01 Monkey                                                  2.0   
0x0 Memories                                               2.0   
1/2 Ceremony                                               4.0   
1/2 Engaged                                                4.0   
1/2 Fairy                                                  2.0   
1/2 Honeymoon                                              4.0   
1/2 Wedding                                                4.0   
1/3 Romantica                                              3.0   
1/6000 HONESTY                                             4.0   

                                           Title  
Title                                             
0 の奏香師 [0 no Soukoushi]  0 の奏香師 [0 no Soukoushi]  
01 Monkey                              01 Monkey  
0x0
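
To make the weighted average concrete, here is a small worked example with made-up numbers. The frame `toy` and its titles are hypothetical. Each book's score is the sum of (similarity × rating) over the neighbours who rated it, divided by the sum of those neighbours' similarities.

```python
import pandas as pd

# Hypothetical neighbour ratings: Book X is rated by two neighbours, Book Y by one.
toy = pd.DataFrame({
    "Title": ["Book X", "Book X", "Book Y"],
    "similarityIndex": [1.0, 0.5, 0.5],
    "Rating": [5, 2, 3],
})

# Weight each rating by the similarity of the neighbour who gave it.
toy["weightedRating"] = toy["similarityIndex"] * toy["Rating"]

# Per-book sums, then the weighted average.
sums = toy.groupby("Title")[["similarityIndex", "weightedRating"]].sum()
score = sums["weightedRating"] / sums["similarityIndex"]
print(score)
```

For Book X this gives (1.0 × 5 + 0.5 × 2) / (1.0 + 0.5) = 4.0, so the more similar neighbour pulls the score toward their rating of 5.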

In [16]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
print(recommendation_df)


                                            weighted average recommendation score  \
Title                                                                               
Natsir: Politik Santun di Antara Dua Rezim                                    5.0   
Rose of Versailles Vol. 4                                                     5.0   
Kings of Heaven                                                               5.0   
Kings of the Wyld                                                             5.0   
Lady Mitsuko                                                                  5.0   
...                                                                           ...   
Zoe's Tale                                                                    NaN   
ハピネス 1                                                                        NaN   
그와 그와 그 [Geuwa Geuwa Geu]                                                     NaN   
술탄의 꽃 [Sultan'eui Ggoch]                                         
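
The `NaN` scores at the bottom of the sorted output most likely come from books rated only by users whose similarity is 0: both sums are then 0, and 0/0 evaluates to `NaN` in pandas. The sketch below reproduces this on made-up numbers and shows one possible way to filter such rows out before ranking. The frame `sums` and its titles are hypothetical.

```python
import pandas as pd

# Hypothetical per-book sums; Book Z was rated only by zero-similarity users.
sums = pd.DataFrame(
    {"sum_similarityIndex": [1.5, 0.0], "sum_weightedRating": [6.0, 0.0]},
    index=pd.Index(["Book X", "Book Z"], name="Title"),
)

# Dividing directly yields NaN for Book Z (0.0 / 0.0).
raw = sums["sum_weightedRating"] / sums["sum_similarityIndex"]

# One option: keep only books with a positive similarity total before dividing.
valid = sums[sums["sum_similarityIndex"] > 0]
scores = (valid["sum_weightedRating"] / valid["sum_similarityIndex"]).sort_values(ascending=False)
print(raw)
print(scores)
```

Calling `dropna()` on the final scores would have a similar effect; filtering first makes the reason for the exclusion explicit.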

In [17]:
recommended_books=ratings_df.loc[ratings_df['Title'].isin(recommendation_df['Title'])]
recommended_books=recommended_books.loc[~recommended_books.Title.isin(userSubset['Title'])]
print(recommended_books)

      UserID                                      Title          ISBN  Rating
1          1  Assassin's Quest: The Illustrated Edition  9.780593e+12       2
2          1                           The Blade Itself  9.780575e+12       5
3          1                     Last Argument of Kings  9.780575e+12       5
4          1                                 Warbreaker  9.781939e+12       5
5          1                          A Game of Thrones  9.780008e+12       5
...      ...                                        ...           ...     ...
7955       5  At The End of the Day I Burst Into Flames  9.781948e+12       5
7956       5                              Christmas Eve           NaN       4
7957       5                    A Cosmology of Monsters  9.781525e+12       5
7958       5                            Halloween Fiend           NaN       4
7959       5                     Walk the Darkness Down  9.781951e+12       5

[7726 rows x 4 columns]
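
Note that the selection above keeps every rating row whose title appears anywhere in `recommendation_df`, so the ranking by score is not used, which is why several thousand rows remain. If the goal is a short list of the highest-scoring books (step 5), one possible approach is sketched below on made-up data. The frame `recommendation`, the column name `score`, the list `already_rated` and the cut-off of 2 are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical scored books; None marks a book with no usable score.
recommendation = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "score": [5.0, 4.5, None, 3.0],
})

# Titles the user has already rated and should not be recommended again.
already_rated = ["Book A"]

top_n = (
    recommendation
    .dropna(subset=["score"])                           # drop books without a score
    .loc[lambda df: ~df["Title"].isin(already_rated)]   # drop books already rated
    .sort_values("score", ascending=False)              # best first
    .head(2)                                            # keep the top 2
)
print(top_n)
```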


In [18]:
recommended_books.to_csv('recommended.csv')