# Item-Based Collaborative Filtering

As before, we'll start by importing the MovieLens 100K data set into a pandas DataFrame:

In [2]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('/Users/Zia/Google Drive/Bootcamp/Bootcamp Notes/\
Day 6 Recommendation Systems Notes/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()

   user_id  movie_id  rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3


In [3]:
m_cols = ['movie_id', 'title']
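# u.item is not UTF-8 encoded; reading it as ISO-8859-1 keeps accented titles intact
movies = pd.read_csv('/Users/Zia/Google Drive/Bootcamp/Bootcamp Notes/\
Day 6 Recommendation Systems Notes/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2),
                     encoding='ISO-8859-1')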
movies.head()

   movie_id              title
0         1   Toy Story (1995)
1         2   GoldenEye (1995)
2         3  Four Rooms (1995)
3         4  Get Shorty (1995)
4         5     Copycat (1995)


In [4]:
ratings = pd.merge(movies, ratings)

ratings.head()

   movie_id             title  user_id  rating
0         1  Toy Story (1995)      308       4
1         1  Toy Story (1995)      287       5
2         1  Toy Story (1995)      148       4
3         1  Toy Story (1995)      280       4
4         1  Toy Story (1995)       66       3


Now we'll pivot this table to construct a matrix with one row per user and one column per movie. NaN indicates missing data, that is, a movie that a given user did not rate:

In [5]:
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
userRatings.head(20)

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,
6,,,,4.0,,,,5.0,,,...,,,,4.0,,,,,,
7,,,,4.0,,,5.0,5.0,,4.0,...,,,,5.0,3.0,,3.0,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,4.0,...,,,,,,,,,,
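
To make the pivot step concrete, here is a minimal sketch on a made-up table of three users and three movies. The names `toy` and `toyMatrix` and every value in them are illustrative assumptions, not part of the MovieLens data. It shows how user/movie pairs with no rating turn into NaN:

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['A', 'B', 'A', 'B', 'C'],
    'rating':  [5, 3, 4, 2, 1],
})

# One row per user, one column per movie; unrated pairs become NaN
toyMatrix = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toyMatrix)
```

Note that pivot_table() aggregates with the mean by default, so if a user had rated the same title twice, the cell would hold the average of those ratings.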


Now the magic happens: pandas has a built-in corr() method that computes a correlation score for every pair of columns in the matrix. Since each column is a movie, this gives us a correlation score for every pair of movies, computed only from the users who rated both. Pairs with too few common raters to compute a correlation show up as NaN. That's amazing!

In [6]:
corrMatrix = userRatings.corr()
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Til There Was You (1997),1.0,,-1.0,-0.5,-0.5,0.522233,,-0.426401,,,...,,,,,,,,,,
1-900 (1994),,1.0,,,,,,-0.981981,,,...,,,,-0.944911,,,,,,
101 Dalmatians (1996),-1.0,,1.0,-0.04989,0.269191,0.048973,0.266928,-0.043407,,0.111111,...,,-1.0,,0.15884,0.119234,0.680414,0.0,0.707107,,
12 Angry Men (1957),-0.5,,-0.04989,1.0,0.666667,0.256625,0.274772,0.178848,,0.457176,...,,,,0.096546,0.068944,-0.361961,0.144338,1.0,1.0,
187 (1997),-0.5,,0.269191,0.666667,1.0,0.596644,,-0.5547,,1.0,...,,0.866025,,0.455233,-0.5,0.5,0.475327,,,


In [7]:
userRatings[["12 Angry Men (1957)","'Til There Was You (1997)"]].corr()

title                      12 Angry Men (1957)  'Til There Was You (1997)
title
12 Angry Men (1957)                        1.0                       -0.5
'Til There Was You (1997)                 -0.5                        1.0


In [8]:
userRatings[["Star Wars (1977)","Empire Strikes Back, The (1980)"]].corr()

title                            Star Wars (1977)  Empire Strikes Back, The (1980)
title
Star Wars (1977)                         1.000000                         0.748353
Empire Strikes Back, The (1980)          0.748353                         1.000000
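
As a sanity check on what corr() is doing, the sketch below compares it with a by-hand Pearson correlation restricted to the users who rated both movies. The small `toyRatings` matrix is invented for illustration and is not taken from MovieLens:

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix; NaN means "not rated"
toyRatings = pd.DataFrame({
    'A': [5.0, 4.0, np.nan, 1.0, 2.0],
    'B': [4.0, 5.0, 3.0, np.nan, 1.0],
})

# pandas ignores any user who is missing either rating
pandasCorr = toyRatings.corr().loc['A', 'B']

# Same calculation by hand on the users who rated both movies
both = toyRatings.dropna()
manualCorr = np.corrcoef(both['A'], both['B'])[0, 1]

print(pandasCorr, manualCorr)
```

The two numbers agree, which confirms that each movie pair is scored only on its shared raters (three users in this toy case).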


However, we want to avoid spurious results that come from just a handful of users who happened to rate the same pair of movies. To restrict our results to movies that lots of people rated together, which also gives us more popular and more easily recognizable results, we'll use the min_periods argument to throw out results where fewer than 100 users rated a given movie pair:

In [9]:
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head(10)

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,
2 Days in the Valley (1996),,,,,,,,,,,...,,,,,,,,,,
"20,000 Leagues Under the Sea (1954)",,,,,,,,,,,...,,,,,,,,,,
2001: A Space Odyssey (1968),,,,,,,,1.0,,,...,,,,-0.001307,,,,,,
3 Ninjas: High Noon At Mega Mountain (1998),,,,,,,,,,,...,,,,,,,,,,
"39 Steps, The (1935)",,,,,,,,,,,...,,,,,,,,,,
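
The effect of min_periods is easier to see on a tiny example. In this sketch (the `toyRatings` values and the threshold of 3 are assumptions chosen for illustration), movies A and C share only two raters, so without a threshold they get a "perfect" correlation that we should not trust:

```python
import numpy as np
import pandas as pd

# Hypothetical matrix: A and B share 4 raters, A and C share only 2
toyRatings = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0, 3.0],
    'B': [4.0, 5.0, 2.0, 1.0, np.nan],
    'C': [np.nan, np.nan, np.nan, 1.0, 5.0],
})

loose = toyRatings.corr()                # any overlap is accepted
strict = toyRatings.corr(min_periods=3)  # need at least 3 common raters

print(loose.loc['A', 'C'])   # a "perfect" score from just two users
print(strict.loc['A', 'C'])  # NaN: too little overlap to trust
print(strict.loc['A', 'B'])  # kept: four users rated both
```

The same rule applies to the diagonal: movie C has only two ratings in total, so `strict.loc['C', 'C']` is NaN as well. That is why most diagonal entries in the output above are NaN rather than 1.0.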


Now let's produce some movie recommendations for user ID 0, a test user I manually added to the data set. This user really likes Star Wars and The Empire Strikes Back, but hated Gone with the Wind. I'll extract user 0's ratings from the userRatings DataFrame and use dropna() to get rid of missing data, leaving only a Series of the movies that were actually rated. From here on, I'll refer to these as "my" ratings:

In [15]:
userRatings.loc[0]

title
'Til There Was You (1997)                                  NaN
1-900 (1994)                                               NaN
101 Dalmatians (1996)                                      NaN
12 Angry Men (1957)                                        NaN
187 (1997)                                                 NaN
2 Days in the Valley (1996)                                NaN
20,000 Leagues Under the Sea (1954)                        NaN
2001: A Space Odyssey (1968)                               NaN
3 Ninjas: High Noon At Mega Mountain (1998)                NaN
39 Steps, The (1935)                                       NaN
8 1/2 (1963)                                               NaN
8 Heads in a Duffel Bag (1997)                             NaN
8 Seconds (1994)                                           NaN
A Chef in Love (1996)                                      NaN
Above the Rim (1994)                                       NaN
Absolute Power (1997)                            

In [10]:
myRatings = userRatings.loc[0].dropna()
myRatings

title
Empire Strikes Back, The (1980)    5.0
Gone with the Wind (1939)          1.0
Star Wars (1977)                   5.0
Name: 0, dtype: float64

In [11]:
myRatings.index

Index([u'Empire Strikes Back, The (1980)', u'Gone with the Wind (1939)',
       u'Star Wars (1977)'],
      dtype='object', name=u'title')

In [12]:
myRatings.values

array([ 5.,  1.,  5.])

In [13]:
# Use .iloc for positional access; the Series is indexed by movie title
myRatings.iloc[0], myRatings.iloc[1], myRatings.iloc[2]

(5.0, 1.0, 5.0)

In [14]:
corrMatrix['Empire Strikes Back, The (1980)']

title
'Til There Was You (1997)                                        NaN
1-900 (1994)                                                     NaN
101 Dalmatians (1996)                                            NaN
12 Angry Men (1957)                                              NaN
187 (1997)                                                       NaN
2 Days in the Valley (1996)                                      NaN
20,000 Leagues Under the Sea (1954)                              NaN
2001: A Space Odyssey (1968)                                0.141598
3 Ninjas: High Noon At Mega Mountain (1998)                      NaN
39 Steps, The (1935)                                             NaN
8 1/2 (1963)                                                     NaN
8 Heads in a Duffel Bag (1997)                                   NaN
8 Seconds (1994)                                                 NaN
A Chef in Love (1996)                                            NaN
Above the Rim (1994)        

Now, let's go through each movie I rated one at a time, and build up a list of possible recommendations based on the movies similar to the ones I rated.

So for each movie I rated, I'll retrieve the list of similar movies from our correlation matrix. I'll then scale those correlation scores by the rating I gave the movie they are similar to, so movies similar to ones I liked count more than movies similar to ones I hated:

In [15]:
scaledSims = []
for title, rating in myRatings.items():
    print("Adding sims for " + title + "...")
    # Retrieve the movies similar to this one that I rated
    sims = corrMatrix[title].dropna()
    print(sims)
    # Now scale each similarity by how well I rated this movie
    sims = sims * rating
    # Add the scores to the list of similarity candidates
    scaledSims.append(sims)

# Combine the per-movie results into a single Series
# (Series.append was removed in pandas 2.0, so we use pd.concat)
simCandidates = pd.concat(scaledSims)

#Glance at our results so far:
#print(simCandidates.head(20))

Adding sims for Empire Strikes Back, The (1980)...
title
2001: A Space Odyssey (1968)                    0.141598
Abyss, The (1989)                               0.277867
African Queen, The (1951)                       0.231657
Air Force One (1997)                            0.165620
Aladdin (1992)                                  0.311063
Alien (1979)                                    0.201669
Aliens (1986)                                   0.292577
Amadeus (1984)                                  0.149328
American President, The (1995)                  0.213057
Annie Hall (1977)                              -0.002235
Apocalypse Now (1979)                           0.084026
Apollo 13 (1995)                                0.196901
Babe (1995)                                     0.109333
Back to the Future (1985)                       0.345285
Batman (1989)                                   0.300169
Batman Forever (1995)                           0.112007
Batman Returns (1992)          

In [16]:
#Glance at our results so far:
print(simCandidates.head(20))

title
2001: A Space Odyssey (1968)      0.707991
Abyss, The (1989)                 1.389334
African Queen, The (1951)         1.158286
Air Force One (1997)              0.828101
Aladdin (1992)                    1.555313
Alien (1979)                      1.008343
Aliens (1986)                     1.462883
Amadeus (1984)                    0.746641
American President, The (1995)    1.065284
Annie Hall (1977)                -0.011175
Apocalypse Now (1979)             0.420130
Apollo 13 (1995)                  0.984504
Babe (1995)                       0.546663
Back to the Future (1985)         1.726427
Batman (1989)                     1.500844
Batman Forever (1995)             0.560036
Batman Returns (1992)             0.667613
Beauty and the Beast (1991)       0.786928
Ben-Hur (1959)                    1.052943
Birdcage, The (1996)              0.460942
dtype: float64
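
To check the scaling step by hand, this small sketch takes three of the correlations printed above for "Empire Strikes Back, The (1980)" and multiplies them by my rating of 5.0. The correlations are copied from the rounded six-decimal output, so the products can differ from the notebook's values in the last digit:

```python
import pandas as pd

# Correlations with "Empire Strikes Back, The (1980)", copied from the
# printed output above (rounded to six decimals)
sims = pd.Series({
    '2001: A Space Odyssey (1968)': 0.141598,
    'Abyss, The (1989)': 0.277867,
    'Back to the Future (1985)': 0.345285,
})
rating = 5.0  # my rating for "Empire Strikes Back, The (1980)"

# Multiplying a Series by a scalar scales every entry at once
scaled = sims * rating
print(scaled)
```

The results line up with the corresponding rows of simCandidates shown above, for example roughly 1.7264 for Back to the Future.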


In [17]:
print("sorting...")
simCandidates.sort_values(inplace = True, ascending = False)
print(simCandidates.head(20))

sorting...
title
Empire Strikes Back, The (1980)                       5.000000
Star Wars (1977)                                      5.000000
Empire Strikes Back, The (1980)                       3.741763
Star Wars (1977)                                      3.741763
Return of the Jedi (1983)                             3.606146
Return of the Jedi (1983)                             3.362779
Raiders of the Lost Ark (1981)                        2.693297
Raiders of the Lost Ark (1981)                        2.680586
Austin Powers: International Man of Mystery (1997)    1.887164
Sting, The (1973)                                     1.837692
Bridge on the River Kwai, The (1957)                  1.783717
Indiana Jones and the Last Crusade (1989)             1.750535
Cinderella (1950)                                     1.749598
Back to the Future (1985)                             1.726427
Terminator 2: Judgment Day (1991)                     1.667662
Frighteners, The (1996)               

This is starting to look like something useful! Note that some of the same movies came up more than once, because they were similar to more than one movie I rated. We'll use groupby() to add together the scores from movies that show up more than once, so they'll count more:

In [18]:
#simCandidates.index, simCandidates.values
simCandidates.head()
#simCandidates.nunique()

title
Empire Strikes Back, The (1980)    5.000000
Star Wars (1977)                   5.000000
Empire Strikes Back, The (1980)    3.741763
Star Wars (1977)                   3.741763
Return of the Jedi (1983)          3.606146
dtype: float64

In [19]:
simCandidates.index

Index([u'Empire Strikes Back, The (1980)', u'Star Wars (1977)',
       u'Empire Strikes Back, The (1980)', u'Star Wars (1977)',
       u'Return of the Jedi (1983)', u'Return of the Jedi (1983)',
       u'Raiders of the Lost Ark (1981)', u'Raiders of the Lost Ark (1981)',
       u'Austin Powers: International Man of Mystery (1997)',
       u'Sting, The (1973)',
       ...
       u'Courage Under Fire (1996)', u'What's Eating Gilbert Grape (1993)',
       u'Murder at 1600 (1997)', u'This Is Spinal Tap (1984)',
       u'Brazil (1985)', u'Real Genius (1985)', u'Annie Hall (1977)',
       u'Remains of the Day, The (1993)', u'Piano, The (1993)',
       u'First Wives Club, The (1996)'],
      dtype='object', name=u'title', length=515)

In [20]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates

title
12 Angry Men (1957)                                   0.921447
2001: A Space Odyssey (1968)                          1.867302
Absolute Power (1997)                                 0.427199
Abyss, The (1989)                                     2.407877
African Queen, The (1951)                             2.310987
Air Force One (1997)                                  1.393921
Aladdin (1992)                                        2.513417
Alien (1979)                                          2.195566
Aliens (1986)                                         2.791258
Amadeus (1984)                                        2.021675
American President, The (1995)                        1.631229
Annie Hall (1977)                                    -0.511775
Apocalypse Now (1979)                                 0.563009
Apollo 13 (1995)                                      2.124853
Army of Darkness (1993)                               0.519211
Austin Powers: International Man of Mystery (1997

In [21]:
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)

title
Empire Strikes Back, The (1980)              8.877450
Star Wars (1977)                             8.870971
Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
dtype: float64
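
The groupby() step can be tried in isolation on a few made-up scores. In this sketch, the `candidates` Series and its values are hypothetical; the only point is that rows sharing an index label are collapsed into a single summed row:

```python
import pandas as pd

# Hypothetical scaled scores; 'Return of the Jedi' appears twice because
# it was similar to two different movies I rated
candidates = pd.Series(
    [3.6, 3.4, 2.7, 1.7],
    index=['Return of the Jedi', 'Return of the Jedi',
           'Raiders of the Lost Ark', 'Back to the Future'],
)

# Group rows that share an index label and add their scores together
combined = candidates.groupby(candidates.index).sum()
combined = combined.sort_values(ascending=False)
print(combined)
```

In effect, each candidate's final score is the sum, over the movies I rated, of my rating times that movie's correlation with the candidate.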

The last thing we have to do is filter out movies I've already rated, as recommending a movie I've already watched isn't helpful:

In [22]:
filteredSims = simCandidates.drop(myRatings.index)
filteredSims.head(10)

title
Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
Wizard of Oz, The (1939)                     3.200268
Dumbo (1941)                                 2.981645
dtype: float64
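
For reuse, the steps in this section could be collected into one function. The sketch below is one possible arrangement rather than a definitive implementation. The function name `recommend_for_user`, its parameters, and the toy matrix used to exercise it are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def recommend_for_user(userRatings, user_id, min_periods=100, top_n=10):
    """Hypothetical helper: score movies the given user has not rated."""
    corrMatrix = userRatings.corr(method='pearson', min_periods=min_periods)
    myRatings = userRatings.loc[user_id].dropna()

    scaled = []
    for title, rating in myRatings.items():
        # Similar movies, weighted by how well this user rated `title`
        scaled.append(corrMatrix[title].dropna() * rating)
    if not scaled:
        return pd.Series(dtype='float64')

    # Sum scores for movies that were similar to several rated movies
    scores = pd.concat(scaled)
    scores = scores.groupby(scores.index).sum()

    # Drop movies the user already rated, then rank what is left
    scores = scores.drop(myRatings.index, errors='ignore')
    return scores.sort_values(ascending=False).head(top_n)

# Toy data (made up): user 0 has rated only movie A, and loved it
toy = pd.DataFrame({
    'A': [5.0, 5.0, 4.0, 1.0, 2.0],
    'B': [np.nan, 5.0, 5.0, 1.0, 1.0],
    'C': [np.nan, 1.0, 2.0, 5.0, 4.0],
}, index=[0, 1, 2, 3, 4])

result = recommend_for_user(toy, 0, min_periods=3)
print(result)
```

On the toy data, B tracks A closely and C runs opposite to it, so B is ranked first with a positive score and C comes last with a negative one. The `errors='ignore'` argument is a defensive choice for the case where a rated movie never made it into the candidate list.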

There we have it!