# Item-based collaborative filtering

Start by importing the MovieLens 100k data set.

Based on this video:

https://www.youtube.com/watch?v=PA1XIDSHldc&t=347s

In [1]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, usecols=range(3))

m_cols = ['movie_id', 'title']
# Raw string so that '\|' reaches the regex engine as an escaped pipe
movies = pd.read_csv('u.item', sep=r'\|', engine='python', names=m_cols, usecols=range(2))

ratings = pd.merge(movies, ratings)

ratings.head()


,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3


Now we will pivot this table to construct a matrix of users and the movies they rated. NaN indicates missing data, that is, a movie the user did not rate.

In [5]:
userRatings = ratings.pivot_table(index = ['user_id'], columns = ['title'], values = 'rating')
userRatings.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,
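
To see what the pivot does on a smaller scale, here is a minimal sketch on made-up data. The frame `toy` and its titles are hypothetical and not taken from MovieLens; only `pivot_table` itself is the same call as above.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as the merged table
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['A', 'B', 'A', 'B', 'C'],
    'rating':  [5, 3, 4, 2, 1],
})

# One row per user, one column per title; unrated cells become NaN
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_matrix)
```

Note that `pivot_table` aggregates with the mean by default, so if a user had rated the same title twice, the cell would hold the average of those ratings.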


Now the magic happens: pandas has a built-in corr() method that computes a correlation score for every pair of columns in the matrix. This gives us a correlation score between every pair of movies. A score is only defined where enough users rated both movies (a single shared rater is not enough to compute a correlation); otherwise the entry is NaN.

Note that the diagonal, where a movie is compared with itself, should be 1.0, which means a perfect correlation. A negative number means that users who rated one movie highly tended to rate the other one low.

In [6]:
corrMatrix = userRatings.corr()
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title,,,,,,,,,,,,,,,,,,,,,
'Til There Was You (1997),1.0,,-1.0,-0.5,-0.5,0.522233,,-0.426401,,,...,,,,,,,,,,
1-900 (1994),,1.0,,,,,,-0.981981,,,...,,,,-0.944911,,,,,,
101 Dalmatians (1996),-1.0,,1.0,-0.04989,0.269191,0.048973,0.266928,-0.043407,,0.111111,...,,-1.0,,0.15884,0.119234,0.680414,0.0,0.707107,,
12 Angry Men (1957),-0.5,,-0.04989,1.0,0.666667,0.256625,0.274772,0.178848,,0.457176,...,,,,0.096546,0.068944,-0.361961,0.144338,1.0,1.0,
187 (1997),-0.5,,0.269191,0.666667,1.0,0.596644,,-0.5547,,1.0,...,,0.866025,,0.455233,-0.5,0.5,0.475327,,,
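
The behaviour of corr() on missing data is easier to follow on a tiny matrix. The sketch below uses a hypothetical `toy_matrix` with three made-up titles; it only illustrates that each pair is correlated over the users who rated both columns.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix; NaN means "not rated"
toy_matrix = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, np.nan],
    'B': [4.0, 5.0, 2.0, np.nan],
    'C': [1.0, np.nan, np.nan, 3.0],
}, index=pd.Index([1, 2, 3, 4], name='user_id'))

# Pairwise Pearson correlation; each pair uses only rows where both are present
toy_corr = toy_matrix.corr()
print(toy_corr)

# A and B share three raters with similar tastes, so the score is positive.
# A and C share only user 1, so no correlation can be computed and the entry is NaN.
```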


Now to refine our work

We want to avoid spurious results that come from just a handful of users who happened to rate the same pair of movies. To restrict our results to movies that lots of people rated together, which also gives us more popular and more easily recognizable results, we will use the min_periods argument to throw out results where fewer than 100 users rated a given movie pair:

In [8]:
corrMatrix = userRatings.corr(method = 'pearson', min_periods = 100)
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title,,,,,,,,,,,,,,,,,,,,,
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,
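
The effect of min_periods can be checked on a small example. This is a sketch on a hypothetical `toy_matrix`, with a threshold of 3 instead of 100 so that the made-up data is large enough to show the cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical matrix: A and B have four raters in common, C has only two raters
toy_matrix = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0],
    'B': [4.0, 5.0, 2.0, 1.0],
    'C': [1.0, 3.0, np.nan, np.nan],
})

loose = toy_matrix.corr(min_periods=1)   # keep every pair that can be computed
strict = toy_matrix.corr(min_periods=3)  # require at least 3 shared raters

print(loose.loc['A', 'C'])   # based on only two shared raters
print(strict.loc['A', 'C'])  # NaN, too few shared raters
print(strict.loc['C', 'C'])  # NaN as well, C itself has fewer than 3 ratings
```

This also explains why some diagonal entries in the output above are NaN: a movie with fewer than 100 ratings cannot meet the threshold even when compared with itself.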


There is no user 0 in this data set; in the video, the presenter added that user to the data himself.

Let's take a user, extract that user's ratings from the userRatings DataFrame, and use dropna() to get rid of missing data, leaving only a Series of the movies the user actually rated.

In [9]:
myRatings = userRatings.loc[3].dropna()
myRatings

title
187 (1997)                                                     2.0
Air Force One (1997)                                           2.0
Alien: Resurrection (1997)                                     3.0
Apostle, The (1997)                                            4.0
Bean (1997)                                                    2.0
Boogie Nights (1997)                                           5.0
Chasing Amy (1997)                                             3.0
Conspiracy Theory (1997)                                       5.0
Contact (1997)                                                 2.0
Cop Land (1997)                                                4.0
Crash (1996)                                                   1.0
Critical Care (1997)                                           1.0
Dante's Peak (1997)                                            2.0
Deconstructing Harry (1997)                                    3.0
Deep Rising (1998)                                      

Now let's go through each movie I rated, one at a time, and build up a list of possible recommendations based on the movies similar to the ones I rated.

So for each movie I rated, I will retrieve the list of similar movies from our correlation matrix. I will then scale those correlation scores by how well I rated the movie they are similar to, so movies similar to ones I liked count more than movies similar to ones I hated:

In [13]:
simList = []
for title, rating in myRatings.items():
    print("Adding sims for " + title + "...")
    # Retrieve movies similar to this one that I rated
    sims = corrMatrix[title].dropna()
    # Now scale each similarity by how well I rated this movie
    sims = sims * rating
    # Add the scores to the list of similarity candidates
    simList.append(sims)

# Series.append() was removed in pandas 2.0, so combine everything with pd.concat()
simCandidates = pd.concat(simList)

# Glance at our results so far:
print("sorting...")
simCandidates.sort_values(inplace=True, ascending=False)
print(simCandidates.head(10))

Adding sims for 187 (1997)...
Adding sims for Air Force One (1997)...
Adding sims for Alien: Resurrection (1997)...
Adding sims for Apostle, The (1997)...
Adding sims for Bean (1997)...
Adding sims for Boogie Nights (1997)...
Adding sims for Chasing Amy (1997)...
Adding sims for Conspiracy Theory (1997)...
Adding sims for Contact (1997)...
Adding sims for Cop Land (1997)...
Adding sims for Crash (1996)...
Adding sims for Critical Care (1997)...
Adding sims for Dante's Peak (1997)...
Adding sims for Deconstructing Harry (1997)...
Adding sims for Deep Rising (1998)...
Adding sims for Desperate Measures (1998)...
Adding sims for Devil's Advocate, The (1997)...
Adding sims for Devil's Own, The (1997)...
Adding sims for Edge, The (1997)...
Adding sims for Event Horizon (1997)...
Adding sims for Everyone Says I Love You (1996)...
Adding sims for Fallen (1998)...
Adding sims for G.I. Jane (1997)...
Adding sims for Game, The (1997)...
Adding sims for Good Will Hunting (1997)...
Adding sims for

The output lists the top recommended movies, including ones the user has already rated.

Some movies show up more than once because they are similar to several of the movies I rated. We will use groupby() to add together the scores of those movies, so they count more. This also removes the duplicates:

In [14]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()
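
As a small illustration of this step, the sketch below runs the same groupby-and-sum on a hypothetical Series with a repeated index label. The titles and scores are made up.

```python
import pandas as pd

# Hypothetical candidate scores; 'Movie X' was suggested by two different rated movies
toy_candidates = pd.Series(
    [4.0, 1.5, 2.5, -1.0],
    index=['Movie X', 'Movie Y', 'Movie X', 'Movie Z'],
)

# Group by the index label and sum, so each title appears once with its total score
toy_totals = toy_candidates.groupby(toy_candidates.index).sum()
print(toy_totals.sort_values(ascending=False))
```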

In [15]:
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)

Air Force One (1997)             13.308921
Conspiracy Theory (1997)         12.609345
Liar Liar (1997)                 11.880811
Return of the Jedi (1983)        10.529740
Independence Day (ID4) (1996)     9.517384
Game, The (1997)                  9.505438
Rock, The (1996)                  9.395350
Scream (1996)                     9.065357
Saint, The (1997)                 9.057600
Murder at 1600 (1997)             8.891054
dtype: float64
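
The list still contains titles such as Air Force One (1997) and Conspiracy Theory (1997), which this user has already rated. A possible next step, not part of the cells above, is to drop the already-rated titles before looking at the top results. The sketch below shows the idea on hypothetical stand-ins (`my_ratings`, `candidates`) for myRatings and simCandidates.

```python
import pandas as pd

# Hypothetical stand-ins for myRatings and simCandidates
my_ratings = pd.Series({'Movie X': 5.0, 'Movie Y': 2.0})
candidates = pd.Series({
    'Movie X': 13.3,
    'Movie W': 11.9,
    'Movie Y': 9.5,
    'Movie Z': 8.9,
})

# Keep only titles the user has not rated yet;
# errors='ignore' skips rated titles that never became candidates
unseen = candidates.drop(my_ratings.index, errors='ignore')
print(unseen.head(10))
```

With the real data, the equivalent call would presumably be `simCandidates.drop(myRatings.index, errors='ignore')`.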

To compare the input ratings with the final result in Excel, export both:

In [17]:
#Input user
#myRatings
myRatings.to_excel('InputUser.xlsx')

In [18]:
#Results
#simCandidates
simCandidates.to_excel('OutputUser.xlsx')