# Collaborative Filtering - Movies dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt

In [2]:
# Read the data

movies_df = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

In [3]:
## Preprocessing (a cleaner approach is implemented in RecSys_1.ipynb)

# Use a regular expression to find a year stored between parentheses.
# Matching the parentheses avoids picking up years that are part of a title.
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
# Remove the parentheses from the extracted year
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
# Remove the year from the 'title' column
# (regex=True is needed because str.replace treats the pattern literally by default in pandas 2.0+)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
# Strip any leading or trailing whitespace left behind
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())

In [4]:
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
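
The year-extraction step can be checked on a few titles without loading movies.csv. This is a minimal sketch that uses Python's `re` module rather than the pandas string methods; `split_title_year` is a hypothetical helper, and the sample titles are illustrative strings in the "Title (Year)" form, not rows read from the dataset.

```python
import re

# Matches a four-digit year wrapped in parentheses, e.g. "(1995)".
YEAR_IN_PARENS = re.compile(r"\((\d{4})\)")

def split_title_year(raw_title):
    """Return (title_without_year, year) for a 'Title (Year)' string.

    year is None when no parenthesised four-digit year is present.
    """
    match = YEAR_IN_PARENS.search(raw_title)
    year = match.group(1) if match else None
    title = YEAR_IN_PARENS.sub("", raw_title).strip()
    return title, year

samples = [
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",  # digits inside the title are left alone
    "Untitled Project",              # no year at all
]
for raw in samples:
    print(split_title_year(raw))
```

Requiring the parentheses in the pattern is what keeps the "2001" in the second title from being mistaken for the release year.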


In [5]:
# Dropping the genres column (the positional axis argument was removed in pandas 2.0)
movies_df = movies_df.drop(columns='genres')

In [6]:
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [7]:
ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


### User-Based Filtering - Pearson Correlation Function

The process for creating a user-based recommendation system is as follows:
- Select a user along with the movies that user has watched
- Based on the user's movie ratings, find the top X neighbours
- Get the watched-movie records of each neighbour
- Calculate a similarity score using some formula
- Recommend the items with the highest score

In [8]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,rating,title
0,5.0,"Breakfast Club, The"
1,3.5,Toy Story
2,2.0,Jumanji
3,5.0,Pulp Fiction
4,4.5,Akira


In [9]:
## Adding movieId to inputMovies

inputId = movies_df[movies_df["title"].isin(inputMovies["title"].tolist())]
inputMovies = pd.merge(inputId, inputMovies)

# We won't use year
inputMovies = inputMovies.drop(columns="year")
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


In [10]:
## Getting the subset of users who have watched the same movies ##

userSubset = ratings_df[ratings_df["movieId"].isin(inputMovies["movieId"].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0
681,14,296,2.0


In [11]:
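# Group the rows by user ID
# (a scalar key keeps each group name a plain userId; a one-element list yields 1-tuples in pandas 2.0+)

userSubsetGroup = userSubset.groupby("userId")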

In [12]:
# Sort the groups so that users who share the most movies with the input user come first
# This gives richer recommendations, since we won't go through every user

userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)

In [13]:
# Looking at the first user 
userSubsetGroup[0]

(75,       userId  movieId  rating
 7507      75        1     5.0
 7508      75        2     3.5
 7540      75      296     5.0
 7633      75     1274     4.5
 7673      75     1968     5.0)

### Similarity of users to input user

Next, we compare a subset of users (not all of them) to the input user and find those who are most similar.  
We measure how similar each user is to the input user with the __Pearson Correlation Coefficient__, which measures the strength of a linear association between two variables. The formula for this coefficient between sets X and Y with N values is shown in the image below.

Why Pearson Correlation?

Pearson correlation is invariant to scaling and shifting, i.e. multiplying all elements by a positive constant or adding any constant to all elements. For example, given two vectors X and Y, pearson(X, Y) == pearson(X, 2 * Y + 3). This property matters in recommendation systems because two users may rate the same items on very different absolute scales (one generously, the other harshly) and still have similar tastes; Pearson correlation treats such users as similar.

![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0 "Pearson Correlation")

The values given by the formula range from r = -1 to r = 1, where 1 indicates a perfect positive correlation between the two entities and -1 indicates a perfect negative correlation.

**In our case, a 1 means that the two users have similar tastes while a -1 means the opposite.**
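
The invariance property can be checked numerically. The sketch below implements the coefficient directly from its definition in plain Python; `pearson` is a hypothetical helper, `x` reuses the five ratings of the input user, and `y` is an invented second user.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson r for two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in xs))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (norm_x * norm_y)

x = [5.0, 3.5, 2.0, 5.0, 4.5]  # the input user's ratings
y = [4.0, 3.0, 1.5, 4.5, 4.0]  # hypothetical second user, same five movies

r = pearson(x, y)
r_shifted = pearson(x, [2 * v + 3 for v in y])   # positive scale plus offset
r_flipped = pearson(x, [-2 * v + 3 for v in y])  # negative scale flips the sign

print(round(r, 6), round(r_shifted, 6), round(r_flipped, 6))
```

The first two values agree, while a negative multiplier reverses the sign of the coefficient, which is why the invariance holds only for positive scaling.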

In [14]:
# We're going to iterate over the first 100 users
userSubsetGroup = userSubsetGroup[0:100]

In [15]:
## Pearson Correlation ##
from scipy.stats import pearsonr

# Store the correlation in a dict
pearsonDict = {}

# Sort the input once so its ratings line up with each group's ratings by movieId
inputMovies = inputMovies.sort_values(by="movieId")

for name, group in userSubsetGroup:
    # Sort the current user's ratings by movieId as well
    group = group.sort_values(by="movieId")
    
    # Keep the input ratings for the movies that both users have in common
    temp_df = inputMovies[inputMovies["movieId"].isin(group["movieId"].tolist())]
    
    # Convert both rating columns to lists
    tempRatings = temp_df["rating"].tolist()
    tempGroupList = group["rating"].tolist()
    
    # pearsonr returns (correlation, p-value); keep only the correlation
    pearsonDict[name] = pearsonr(tempRatings, tempGroupList)[0]

In [16]:
# Correlations dataframe

pearson_df = pd.DataFrame.from_dict(pearsonDict, orient='index')
pearson_df = pearson_df.iloc[:, :1]
pearson_df['userId'] = pearson_df.index
pearson_df.index = range(len(pearson_df))
pearson_df.columns = ["similarityIndex", "userId"]
pearson_df.head()

Unnamed: 0,similarityIndex,userId
0,0.827278,75
1,0.586009,106
2,0.83205,686
3,0.576557,815
4,0.943456,1040


In [17]:
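## Rank the users by similarity to the input user, most similar first. ##
# To keep only the top 50, assign the slice: top_users = top_users[0:50].
# As written, the cells below use all 100 users.

top_users = pearson_df.sort_values(by='similarityIndex', ascending=False)
top_users.head()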

Unnamed: 0,similarityIndex,userId
64,0.961678,12325
34,0.961538,6207
55,0.961538,10707
67,0.960769,13053
4,0.943456,1040


In [18]:
top_users_rating = top_users.merge(ratings_df, left_on="userId", right_on="userId", how="inner")
top_users_rating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.961678,12325,1,3.5
1,0.961678,12325,2,1.5
2,0.961678,12325,3,3.0
3,0.961678,12325,5,0.5
4,0.961678,12325,6,2.5


In [19]:
## Take weighted average ##

# Multiply each rating by the similarity of the user who gave it
top_users_rating['weightedRating'] = top_users_rating['similarityIndex'] * top_users_rating['rating']
top_users_rating.head()

# Group by movieId, then sum the similarities and the weighted ratings

tempTopUsersRating = top_users_rating.groupby('movieId')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex', 'sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,47.110148,172.862258
2,47.110148,122.131292
3,14.570089,39.216419
4,0.454111,1.399461
5,13.899388,32.820691


In [20]:
# Creates an empty dataframe
recommendation_df = pd.DataFrame()

# Now we take the weighted average
recommendation_df['recommendationScore'] = tempTopUsersRating['sum_weightedRating'] / tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,recommendationScore,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.669321,1
2,2.592463,2
3,2.69157,3
4,3.081759,4
5,2.361305,5
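
The recommendation score is a similarity-weighted mean: the sum of similarity times rating divided by the sum of similarities. The sketch below works through that arithmetic in plain Python. `weighted_score` is a hypothetical helper and the neighbour pairs are invented; only the two sums for movieId 1 are taken from the tables above.

```python
def weighted_score(pairs):
    """Similarity-weighted mean rating: sum(sim * rating) / sum(sim)."""
    total_weight = sum(sim for sim, _ in pairs)
    return sum(sim * rating for sim, rating in pairs) / total_weight

# (similarity to the input user, rating given to one movie) per neighbour.
# Hypothetical values chosen for illustration.
neighbours = [(0.96, 4.0), (0.83, 3.5), (0.58, 5.0)]
score = weighted_score(neighbours)

# The same division applied to the movieId 1 sums shown above.
movie_1 = 172.862258 / 47.110148

# A negative similarity shrinks the denominator, so the result can fall
# outside the range of the ratings being averaged.
out_of_range = weighted_score([(0.9, 5.0), (-0.4, 1.0)])

print(round(score, 4), round(movie_1, 4), round(out_of_range, 4))
```

With only positive similarities the score stays between the lowest and highest rating in the group. Once negative similarities enter the sums, as they can when users with negative correlation are kept, that guarantee no longer holds; dividing by the sum of absolute similarities is one common alternative.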


In [21]:
# Getting top 10 movies

recommendation_df = recommendation_df.sort_values(by="recommendationScore", ascending=False)

In [22]:
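# Replace rows whose score is not below 5 with 0.
# Note that DataFrame.where fills every column of those rows, movieId included.
recommendation_df = recommendation_df.where(recommendation_df["recommendationScore"] < 5, 0)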
recommendation_df.head(10)

Unnamed: 0_level_0,recommendationScore,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
4111,0.0,0
103655,0.0,0
3048,0.0,0
75803,0.0,0
57464,0.0,0
72919,0.0,0
97860,0.0,0
60046,0.0,0
1797,0.0,0
3516,0.0,0


In [23]:
recommendation_df = recommendation_df.sort_values(by="recommendationScore", ascending=False)
show = recommendation_df.head(10)
show

Unnamed: 0_level_0,recommendationScore,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
136485,5.0,136485
7043,5.0,7043
73587,5.0,73587
139098,5.0,139098
38499,4.999525,38499
8014,4.989287,8014
4227,4.985986,4227
82765,4.957488,82765
3090,4.956485,3090
4383,4.949001,4383


In [24]:
# Take a copy so that adding a column later does not write to a slice of movies_df
show_movies = movies_df.loc[movies_df['movieId'].isin(show['movieId'].tolist())].copy()

In [25]:
show = show.reset_index(drop=True)
show = show.sort_values(by="movieId")
show

Unnamed: 0,recommendationScore,movieId
8,4.956485,3090
6,4.985986,4227
9,4.949001,4383
1,5.0,7043
5,4.989287,8014
4,4.999525,38499
2,5.0,73587
7,4.957488,82765
0,5.0,136485
3,5.0,139098


In [26]:
show = show.reset_index(drop=True)
scores = show["recommendationScore"]
show_movies["recommendationScore"] = np.asanyarray(scores)
show_movies

Unnamed: 0,movieId,title,year,recommendationScore
3004,3090,Matewan,1987,4.956485
4134,4227,"Brothers, The",2001,4.985986
4289,4383,"Crimson Rivers, The (Rivières pourpres, Les)",2000,4.949001
6932,7043,Vivre sa vie: Film en douze tableaux (My Life ...,1962,5.0
7620,8014,"Spring, Summer, Fall, Winter... and Spring (Bo...",2003,4.989287
10483,38499,Angels in America,2003,4.999525
14750,73587,Soul Kitchen,2009,5.0
16431,82765,Little Big Soldier (Da bing xiao jiang),2010,4.957488
30005,136485,Robot Chicken: Star Wars,2007,5.0
30704,139098,Four Days in October,2010,5.0


In [27]:
## FINAL RESULT ##

show_movies = show_movies.sort_values(by="recommendationScore", ascending=False)
show_movies = show_movies.reset_index(drop=True)
show_movies

Unnamed: 0,movieId,title,year,recommendationScore
0,7043,Vivre sa vie: Film en douze tableaux (My Life ...,1962,5.0
1,73587,Soul Kitchen,2009,5.0
2,136485,Robot Chicken: Star Wars,2007,5.0
3,139098,Four Days in October,2010,5.0
4,38499,Angels in America,2003,4.999525
5,8014,"Spring, Summer, Fall, Winter... and Spring (Bo...",2003,4.989287
6,4227,"Brothers, The",2001,4.985986
7,82765,Little Big Soldier (Da bing xiao jiang),2010,4.957488
8,3090,Matewan,1987,4.956485
9,4383,"Crimson Rivers, The (Rivières pourpres, Les)",2000,4.949001
