# Item-Based Collaborative Filtering for Movies

Start by importing the MovieLens 20M data set into a pandas DataFrame:

In [4]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ratings.csv', 
                      names=r_cols, 
                      usecols=range(3), 
                      header=None, 
                      low_memory=False, 
                      dtype={'user_id':'int', 
                             'movie_id':'int',
                             'rating':'float'})
ratings.head()

,user_id,movie_id,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [5]:
m_cols = ['movie_id', 'title']
movies = pd.read_csv('movies.csv', 
                     names=m_cols, 
                     usecols=range(2), 
                     header=None, 
                     low_memory=False, 
                     dtype={'movie_id':'int',
                            'title':'str'})

movies.head()

,movie_id,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [6]:
# join each rating to its movie title on the shared movie_id column
ratings = pd.merge(movies, ratings, on='movie_id')

ratings.head()

,movie_id,title,user_id,rating
0,1,Toy Story (1995),3,4.0
1,1,Toy Story (1995),6,5.0
2,1,Toy Story (1995),8,4.0
3,1,Toy Story (1995),10,4.0
4,1,Toy Story (1995),11,4.5


In [7]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns (total 4 columns):
movie_id    500000 non-null int64
title       500000 non-null object
user_id     500000 non-null int64
rating      500000 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 19.1+ MB


In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 2 columns):
movie_id    27278 non-null int64
title       27278 non-null object
dtypes: int64(1), object(1)
memory usage: 426.3+ KB


Create a pivot table with one row per user and one column per movie title, where each cell holds the rating that user gave that movie.

NaN marks a movie that a given user has not rated.

In [9]:
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
userRatings.head()

user_id,"""Great Performances"" Cats (1998)",$5 a Day (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu Dawn (1979),Zus & Zo (2001),[REC] (2007),[REC]² (2009),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
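To make the shape of this table concrete, here is a minimal sketch on made-up data. The `toy` frame and its single-letter titles are hypothetical and are not part of MovieLens. Note that `pivot_table` averages by default, so a user with duplicate ratings for the same title would get their mean.

```python
import pandas as pd

# Hypothetical toy ratings: three users, three movies (not MovieLens data)
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'title':   ['A', 'B', 'A', 'C', 'B'],
    'rating':  [4.0, 3.0, 5.0, 2.0, 1.0],
})

# One row per user, one column per title; cells with no rating become NaN
toy_pivot = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_pivot)
```

User 1 never rated movie C, so that cell is NaN, just like most cells in the real matrix.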


The **corr()** method computes a correlation score for every pair of columns in the matrix.

This gives a correlation score for every pair of movies, computed from the users who rated both. Pairs that no user rated in common come out as NaN.

In [None]:
corrMatrix = userRatings.corr()
corrMatrix.head()
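As a small illustration of how the pairwise computation treats missing ratings, the sketch below runs `corr()` on a hypothetical four-user matrix. The values and column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix; NaN means "not rated"
toy = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, np.nan],
    'B': [4.0, 5.0, 2.0, np.nan],
    'C': [np.nan, np.nan, np.nan, 3.0],
}, index=[1, 2, 3, 4])

# Each pair of columns is correlated using only the rows where both are present
toy_corr = toy.corr()
print(toy_corr)
```

Movies A and B share three raters and get a positive score, while A and C share none, so that entry is NaN.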

Instead, pass **min_periods** to restrict the results to pairs of movies that many people have rated. Without it, the matrix fills up with scores for obscure movies that only a few people have rated, and a correlation built on a handful of ratings is unreliable.

Discard any pair of movies that fewer than 100 people have rated in common.

In [10]:
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()

title,"""Great Performances"" Cats (1998)",$5 a Day (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu Dawn (1979),Zus & Zo (2001),[REC] (2007),[REC]² (2009),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
"""Great Performances"" Cats (1998)",,,,,,,,,,,...,,,,,,,,,,
$5 a Day (2008),,,,,,,,,,,...,,,,,,,,,,
'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,
'Neath the Arizona Skies (1934),,,,,,,,,,,...,,,,,,,,,,
'Round Midnight (1986),,,,,,,,,,,...,,,,,,,,,,
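The effect of `min_periods` is easier to see on a tiny example. This is a sketch on hypothetical data, with a threshold of 3 standing in for the 100 used above.

```python
import numpy as np
import pandas as pd

# Hypothetical matrix: A and B share four raters, but C was rated by only two users
toy = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0],
    'B': [4.0, 5.0, 2.0, 1.0],
    'C': [5.0, 3.0, np.nan, np.nan],
})

loose = toy.corr(min_periods=1)   # any overlap is enough
strict = toy.corr(min_periods=3)  # require at least 3 users in common

print(loose.loc['A', 'C'])   # a "perfect" score built on just two users
print(strict.loc['A', 'C'])  # NaN: too few common raters to trust
```

With the looser setting, A and C look perfectly correlated on the strength of two ratings. The stricter setting turns that entry into NaN and leaves the well-supported A and B score in place.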


Let's show some movie recommendations for user ID 1, who I've added for validation. User ID 1 likes Star Wars and The Empire Strikes Back, but hates Gone with the Wind.

Use **dropna()** to keep only the movies this user has actually rated:

In [13]:
myRatings = userRatings.loc[1].dropna()
myRatings

title
2001: A Space Odyssey (1968)                                       3.5
28 Days Later (2002)                                               3.5
7th Voyage of Sinbad, The (1958)                                   4.0
Adventures of Baron Munchausen, The (1988)                         4.0
Alien (1979)                                                       4.0
Aliens (1986)                                                      4.0
American Werewolf in London, An (1981)                             4.0
Apocalypse Now (1979)                                              3.5
Army of Darkness (1993)                                            4.0
Austin Powers: The Spy Who Shagged Me (1999)                       3.5
Beastmaster, The (1982)                                            3.0
Beetlejuice (1988)                                                 4.0
Bill & Ted's Bogus Journey (1991)                                  3.5
Bill & Ted's Excellent Adventure (1989)                            4.0


Next, let's go through each movie I rated one at a time, and build up a list of possible recommendations based on the movies similar to the ones I rated.

For each movie I rated, I'll retrieve the list of similar movies from our correlation matrix. I'll then scale those correlation scores by how well I rated the movie they are similar to, so movies similar to ones I liked count more than movies similar to ones I hated:

In [16]:
similar_scores = []
for title, rating in myRatings.items():
    print("Adding similar movies for " + title + "...")
    # retrieve the movies similar to this one that I rated
    similar_movies = corrMatrix[title].dropna()
    # scale each similarity by how well I rated this movie
    similar_movies = similar_movies * rating
    # add the scores to the list of similar candidates
    similar_scores.append(similar_movies)

# Series.append was removed in pandas 2.0, so combine everything with one concat
similar_candidates = pd.concat(similar_scores)

print("Sort recommendations...")
similar_candidates.sort_values(inplace=True, ascending=False)
similar_candidates.head(10)

Adding similar movies for 2001: A Space Odyssey (1968)...
Adding similar movies for 28 Days Later (2002)...
Adding similar movies for 7th Voyage of Sinbad, The (1958)...
Adding similar movies for Adventures of Baron Munchausen, The (1988)...
Adding similar movies for Alien (1979)...
Adding similar movies for Aliens (1986)...
Adding similar movies for American Werewolf in London, An (1981)...
Adding similar movies for Apocalypse Now (1979)...
Adding similar movies for Army of Darkness (1993)...
Adding similar movies for Austin Powers: The Spy Who Shagged Me (1999)...
Adding similar movies for Beastmaster, The (1982)...
Adding similar movies for Beetlejuice (1988)...
Adding similar movies for Bill & Ted's Bogus Journey (1991)...
Adding similar movies for Bill & Ted's Excellent Adventure (1989)...
Adding similar movies for Birds, The (1963)...
Adding similar movies for Blade Runner (1982)...
Adding similar movies for Borrowers, The (1997)...
Adding similar movies for Brotherhood of the Wo

Lord of the Rings: The Two Towers, The (2002)                                     5.000000
Lord of the Rings: The Fellowship of the Ring, The (2001)                         5.000000
Lord of the Rings: The Return of the King, The (2003)                             5.000000
Star Wars: Episode V - The Empire Strikes Back (1980)                             4.500000
Spider-Man 2 (2004)                                                               4.500000
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    4.500000
Lord of the Rings: The Two Towers, The (2002)                                     4.400666
Lord of the Rings: The Return of the King, The (2003)                             4.400666
Lord of the Rings: The Return of the King, The (2003)                             4.353071
Lord of the Rings: The Fellowship of the Ring, The (2001)                         4.353071
dtype: float64

Notice some of the same movies came up more than once, because they were similar to more than one movie I rated. Let's use **groupby()** to add together the scores from movies that show up more than once, so they'll count more:

In [18]:
similar_candidates = similar_candidates.groupby(similar_candidates.index).sum()

In [19]:
similar_candidates.sort_values(inplace = True, ascending = False)
similar_candidates.head(10)

Aliens (1986)                                                                     104.549547
Terminator, The (1984)                                                            102.166397
Toy Story 2 (1999)                                                                100.207342
Alien (1979)                                                                       98.187615
Ghostbusters (a.k.a. Ghost Busters) (1984)                                         98.031349
Men in Black (a.k.a. MIB) (1997)                                                   96.197739
Star Wars: Episode IV - A New Hope (1977)                                          94.046253
Toy Story (1995)                                                                   93.177287
Back to the Future (1985)                                                          93.003814
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)     92.876196
dtype: float64
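The arithmetic behind that step is simple to check by hand. In this sketch the scores are hypothetical and the titles are shortened for readability.

```python
import pandas as pd

# Hypothetical scores with repeated labels, as produced by concatenating
# the similarity lists of several rated movies
scores = pd.Series([4.5, 2.0, 3.0, 1.5],
                   index=['Alien', 'Aliens', 'Alien', 'Aliens'])

# Group by label and add up each movie's scores
totals = scores.groupby(scores.index).sum().sort_values(ascending=False)
print(totals)  # Alien 7.5, Aliens 3.5
```

Because the scores are summed, a movie that is moderately similar to many rated movies can outrank one that is strongly similar to a single rated movie.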

Filter out movies I've already rated, as recommending a movie I've already watched isn't helpful:
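The notebook's own cell for this step is not shown here, so what follows is only one plausible way to do it, using `Series.drop`. The two Series are small hypothetical stand-ins for the notebook's `similar_candidates` and `myRatings`.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's similar_candidates and myRatings
similar_candidates = pd.Series({
    'Aliens (1986)': 104.5,
    'Terminator, The (1984)': 102.2,
    'Toy Story 2 (1999)': 100.2,
})
myRatings = pd.Series({'Aliens (1986)': 4.0, 'Beetlejuice (1988)': 4.0})

# Drop every title I have rated; errors='ignore' skips rated titles
# that never appeared among the candidates
filtered_candidates = similar_candidates.drop(myRatings.index, errors='ignore')
print(filtered_candidates.head(10))
```

Without `errors='ignore'`, `drop` raises a `KeyError` as soon as one rated title is missing from the candidate list.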

## Further Enhancements

+ Experiment with similarity measures other than Pearson correlation
+ Try a different value of min_periods in the correlation computation
+ How should we handle movies similar to the ones I've hated? Perhaps penalize them outright, rather than merely scaling their scores down!
+ Some users have rated a very large number of movies, which may have a disproportionate effect on the results. Consider removing these outliers.
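One way to approach the idea of penalizing movies similar to ones I hated is to center my ratings on a neutral value before scaling, so that low ratings contribute negative scores. This is only a sketch: the `NEUTRAL` value of 3.0, the movie names, and the similarity numbers are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ratings and similarity scores, not taken from MovieLens
my_ratings = pd.Series({'Liked Movie': 5.0, 'Hated Movie': 1.0})
similarity = pd.DataFrame(
    {'Liked Movie': [1.0, 0.1, 0.8, 0.2],
     'Hated Movie': [0.1, 1.0, 0.1, 0.9]},
    index=['Liked Movie', 'Hated Movie', 'Candidate X', 'Candidate Y'])

NEUTRAL = 3.0  # assumed "no opinion" rating; tune for your data

# Ratings above NEUTRAL become positive weights, ratings below become negative
weights = my_ratings - NEUTRAL

# Weight each similarity column by the centered rating, then sum per candidate
scores = similarity.mul(weights, axis=1).sum(axis=1)

# Remove the movies I already rated and rank what is left
ranked = scores.drop(my_ratings.index).sort_values(ascending=False)
print(ranked)
```

Candidate X, which resembles the liked movie, ends up with a positive score, while Candidate Y, which resembles the hated one, is pushed below zero instead of merely receiving a smaller positive score.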