# Collaborative Filtering

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

In [2]:
tags = pd.read_csv('datasets/ml-latest-small/tags.csv')
ratings = pd.read_csv('datasets/ml-latest-small/ratings.csv')
movies = pd.read_csv('datasets/ml-latest-small/movies.csv')
links = pd.read_csv('datasets/ml-latest-small/links.csv')

In [4]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [5]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
df = pd.merge(movies , ratings)

In [8]:
df.head()

,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,3.0,851866703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9,4.0,938629179
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,13,5.0,1331380058
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.0,997938310
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19,3.0,855190091


Now we will pivot this table to build a matrix with one row per user and one column per movie. NaN indicates missing data, that is, a movie that a given user did not rate.

In [11]:
userRatings = df.pivot_table(index = ['userId'], columns=['title'],
                                 values = 'rating')
userRatings.head()

userId,"""Great Performances"" Cats (1998)",$9.99 (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931),İtirazım Var (2014)
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
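
To make the pivot step easier to see, here is a minimal sketch on a made-up long-format table. The `toy` data and the names `toy` and `toy_matrix` are assumptions for illustration only and are not part of the MovieLens data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three users, three movies, five ratings in long format.
toy = pd.DataFrame({
    'userId': [1, 1, 2, 3, 3],
    'title':  ['A', 'B', 'A', 'B', 'C'],
    'rating': [4.0, 3.0, 5.0, 2.0, 1.0],
})

# One row per user, one column per title; user/title pairs with no rating become NaN.
toy_matrix = toy.pivot_table(index='userId', columns='title', values='rating')
print(toy_matrix)
```

Note that `pivot_table` aggregates with the mean by default, so if a user had rated the same title twice the cell would hold the average of those ratings.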


Pandas has a built-in corr() method that computes a correlation score for every pair of columns in the matrix. This gives us a correlation score between every pair of movies, provided enough users rated both movies for the correlation to be defined (at least two, with some variation in their ratings); otherwise NaN shows up.

If we were doing user-user we could compute correlation between users to find similar users.
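
As a rough sketch of that user-user variant, one option is to transpose the matrix so that users become the columns before calling `corr()`. The toy matrix and the name `user_sims` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy user x movie matrix (every user rated every movie here).
toy_matrix = pd.DataFrame(
    {'A': [5.0, 4.0, 1.0], 'B': [4.0, 5.0, 2.0], 'C': [1.0, 2.0, 5.0]},
    index=pd.Index([1, 2, 3], name='userId'),
)

# corr() works column-wise, so transpose to put users on the columns.
user_sims = toy_matrix.T.corr()
print(user_sims)
```

In this toy data users 1 and 2 have similar tastes and get a positive score, while user 3 rates in exactly the opposite pattern to user 1 and gets a correlation of -1.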

Since we are doing item-based filtering, we are more interested in the relationships between columns. The correlation between any two columns is the correlation score for that pair of movies.

In [12]:
corrMatrix = userRatings.corr()
corrMatrix.head()

title,"""Great Performances"" Cats (1998)",$9.99 (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931),İtirazım Var (2014)
"""Great Performances"" Cats (1998)",1.0,,,,,,,,,,...,,,,,,,,,,
$9.99 (2008),,1.0,,,,,,,,1.0,...,,,,,,,,,,
'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,
'Neath the Arizona Skies (1934),,,,,,,,,,,...,,,,,,,,,,
'Round Midnight (1986),,,,,1.0,,,,,1.0,...,,,,,,,,,,


We now have a new DataFrame with every movie on both the rows and the columns. The cell at the intersection of any two movies holds their correlation score, based on the user rating data.

Movies are perfectly correlated with themselves, which is why the diagonal is 1.0.
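
The same structure can be seen on a tiny made-up matrix. This is only a sketch; the data and the name `toy_corr` are assumptions, chosen so that movie C shares no raters with A or B.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix: users 0-2 rated A and B, users 3-4 rated only C.
toy_matrix = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, np.nan, np.nan],
    'B': [4.0, 5.0, 2.0, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, 3.0, 4.0],
})

# Pairwise Pearson correlation between columns (movies).
toy_corr = toy_matrix.corr()
print(toy_corr)

# Look up the score for one movie pair at the row/column intersection.
print(toy_corr.loc['A', 'B'])
```

Here the A/B cell is positive, the diagonal is 1.0, and every cell pairing C with A or B is NaN because no user rated both.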

*his Notes*

However, we want to avoid spurious results that come from just a handful of users who happened to rate the same pair of movies. To restrict our results to movies that many people rated together, which also gives us more popular and more easily recognizable results, we'll use the min_periods argument to throw out results where fewer than 100 users rated a given movie pair:

In [13]:
corrMatrix = userRatings.corr(method = 'pearson', min_periods = 100)
corrMatrix.head()

title,"""Great Performances"" Cats (1998)",$9.99 (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931),İtirazım Var (2014)
"""Great Performances"" Cats (1998)",,,,,,,,,,,...,,,,,,,,,,
$9.99 (2008),,,,,,,,,,,...,,,,,,,,,,
'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,
'Neath the Arizona Skies (1934),,,,,,,,,,,...,,,,,,,,,,
'Round Midnight (1986),,,,,,,,,,,...,,,,,,,,,,


Rather than throwing out movies that were rated by fewer than 100 people, we are throwing out movie similarities where fewer than 100 people rated both of those movies.
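
A small sketch of that distinction, using a made-up matrix and a toy threshold of 3 instead of 100 (the names `loose` and `strict` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical matrix: four users rated A and B, but only two of them rated C.
toy_matrix = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0],
    'B': [4.0, 5.0, 2.0, 1.0],
    'C': [np.nan, np.nan, 3.0, 4.0],
})

loose = toy_matrix.corr(min_periods=1)   # any overlap is accepted
strict = toy_matrix.corr(min_periods=3)  # a pair needs at least 3 co-raters

print(loose)
print(strict)
```

With the stricter setting the A/B score survives because four users rated both, while every pair involving C becomes NaN. Column C itself is still present in the result; only its pairwise scores are masked.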

*his notes*

Now let's produce some movie recommendations for a test-case user whose ratings are added by hand. This user really likes Star Wars and The Empire Strikes Back, but hated Gone with the Wind. In the original notes, these ratings are extracted from the userRatings DataFrame with dropna() to get rid of missing data, leaving only a Series of the movies actually rated; here the same kind of Series is built directly.

In [46]:
movies = {'Star Wars: Episode V - The Empire Strikes Back (1980)': 5,
          'Gone with the Wind (1939)': 1,
          'Star Wars: Episode IV - A New Hope (1977)': 5}

index = ['Star Wars: Episode V - The Empire Strikes Back (1980)',
        'Gone with the Wind (1939)',
        'Star Wars: Episode IV - A New Hope (1977)']
myRatings = pd.Series(data = movies, index = index)

myRatings.index.name = 'title'


In [47]:
# myRatings = userRatings.loc[8].dropna()
# myRatings

myRatings

title
Star Wars: Episode V - The Empire Strikes Back (1980)    5
Gone with the Wind (1939)                                1
Star Wars: Episode IV - A New Hope (1977)                5
dtype: int64

What we want to do with this data is recommend movies for people. To do that, we look at all the ratings for a given person and find movies similar to the ones they rated; those are the candidate recommendations for that person.

*his notes*

Now, let's go through each movie I rated, one at a time, and build up a list of possible recommendations based on the movies similar to the ones I rated.

So for each movie I rated, I will retrieve the list of similar movies from our correlation matrix. I'll then scale those correlation scores by how well I rated the movie they are similar to, so movies similar to ones I liked count more than movies similar to ones I hated.
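
Written as a formula, and assuming that scores for a title appearing more than once are summed (as is done further down), the score given to a candidate movie $j$ is roughly the following. The notation is my own shorthand, not something from the original notes.

$$
\text{score}(j) = \sum_{i \in R} r_i \cdot \text{corr}(i, j)
$$

Here $R$ is the set of movies I rated, $r_i$ is my rating for movie $i$, and $\text{corr}(i, j)$ is the entry of the correlation matrix for the pair $(i, j)$; pairs whose entry is NaN are skipped. One consequence of this weighting is that a low rating still adds a small positive amount for positively correlated movies rather than subtracting from them.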

In [48]:
simParts = []

for title, rating in myRatings.items():
    print("adding sims for " + title + "...")
    # retrieve the movies similar to this one that I rated
    sims = corrMatrix[title].dropna()
    # scale each similarity by how well I rated this movie
    sims = sims * rating
    # add the scores to the list of similarity candidates
    simParts.append(sims)

# Series.append was removed in pandas 2.0, so combine the pieces with pd.concat
simCandidates = pd.concat(simParts)

# glance at our results so far
print("sorting.....")
simCandidates.sort_values(inplace = True, ascending = False)
print(simCandidates.head(10))

adding sims for Star Wars: Episode V - The Empire Strikes Back (1980)...
adding sims for Gone with the Wind (1939)...
adding sims for Star Wars: Episode IV - A New Hope (1977)...
sorting.....
Star Wars: Episode V - The Empire Strikes Back (1980)                             5.000000
Star Wars: Episode IV - A New Hope (1977)                                         5.000000
Star Wars: Episode VI - Return of the Jedi (1983)                                 3.817034
Star Wars: Episode VI - Return of the Jedi (1983)                                 3.738871
Star Wars: Episode IV - A New Hope (1977)                                         3.503949
Star Wars: Episode V - The Empire Strikes Back (1980)                             3.503949
Lord of the Rings: The Fellowship of the Ring, The (2001)                         2.387912
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    2.382210
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)  

*his note*

This is starting to look like something useful. Note that some of the movies came up more than once, because they were similar to more than one movie I rated. We will use groupby() to add together the scores from movies that show up more than once, so they'll count more.

In [49]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()

In [50]:
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)

Star Wars: Episode IV - A New Hope (1977)                                         8.503949
Star Wars: Episode V - The Empire Strikes Back (1980)                             8.503949
Star Wars: Episode VI - Return of the Jedi (1983)                                 7.555905
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    4.762813
Lord of the Rings: The Fellowship of the Ring, The (2001)                         4.166083
Lord of the Rings: The Two Towers, The (2002)                                     4.027989
Lord of the Rings: The Return of the King, The (2003)                             3.628191
E.T. the Extra-Terrestrial (1982)                                                 3.354069
Men in Black (a.k.a. MIB) (1997)                                                  3.243457
Jurassic Park (1993)                                                              2.986512
dtype: float64
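
The effect of the groupby step is easiest to see on a few made-up numbers. The short titles and scores below are hypothetical, not values from the run above.

```python
import pandas as pd

# Hypothetical scaled scores; 'Jedi' appears twice because it was
# similar to two different movies that were rated.
candidates = pd.Series(
    [3.8, 3.7, 2.4],
    index=['Jedi', 'Jedi', 'Raiders'],
)

# Group by the index label (the title) and add up each title's scores.
totals = candidates.groupby(candidates.index).sum().sort_values(ascending=False)
print(totals)
```

After grouping, each title appears once and a title that matched several rated movies carries the combined score.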

The last thing we need to do is filter out movies I've already rated.

In [54]:
filterSims = simCandidates.drop("Star Wars: Episode V - The Empire Strikes Back (1980)")
filterSims.head(10)

Star Wars: Episode IV - A New Hope (1977)                                         8.503949
Star Wars: Episode VI - Return of the Jedi (1983)                                 7.555905
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    4.762813
Lord of the Rings: The Fellowship of the Ring, The (2001)                         4.166083
Lord of the Rings: The Two Towers, The (2002)                                     4.027989
Lord of the Rings: The Return of the King, The (2003)                             3.628191
E.T. the Extra-Terrestrial (1982)                                                 3.354069
Men in Black (a.k.a. MIB) (1997)                                                  3.243457
Jurassic Park (1993)                                                              2.986512
Pulp Fiction (1994)                                                               2.945942
dtype: float64
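
The cell above drops a single title, so A New Hope, which was also rated, is still in the list. One way to remove every rated title at once is to pass the whole index of rated movies to `drop`. The sketch below uses made-up stand-ins (`scores`, `rated`, `unseen`) for `simCandidates` and `myRatings`.

```python
import pandas as pd

# Hypothetical stand-ins for simCandidates and myRatings.
scores = pd.Series({'Movie A': 8.5, 'Movie B': 8.5, 'Movie C': 7.6, 'Movie D': 4.8})
rated = pd.Series({'Movie A': 5, 'Movie B': 5, 'Movie X': 1})

# Drop every title that was already rated. errors='ignore' skips rated
# titles that never made it into the candidate list (here 'Movie X').
unseen = scores.drop(rated.index, errors='ignore')
print(unseen)
```

Applied to the notebook's own variables this would presumably read `simCandidates.drop(myRatings.index, errors='ignore')`, though that line has not been run against the data here.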