# Item-based collaborative filtering

Start by importing the MovieLens 100k data set.

Based on this video:

https://www.youtube.com/watch?v=PA1XIDSHldc&t=347s


The last user ID in the data set is 943, so I add my own ratings later as user 944.

In [1]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()


,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [2]:
m_cols = ['movie_id', 'title']
movies = pd.read_csv('u.item', sep=r'\|', engine='python', names=m_cols, usecols=range(2))
movies.head()

,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [3]:
ratings = pd.merge(movies, ratings)
ratings.tail()

,movie_id,title,user_id,rating
99995,1678,Mat' i syn (1997),863,1
99996,1679,B. Monkey (1998),863,3
99997,1680,Sliding Doors (1998),863,2
99998,1681,You So Crazy (1994),896,3
99999,1682,Scream of Stone (Schrei aus Stein) (1991),916,3


In [4]:
# Here are my own ratings, stored as user_id 944
loadTy = pd.read_excel('loadTy.xlsx')
loadTy.head()

,movie_id,title,user_id,rating
0,225,101 Dalmatians (1996),944,4.0
1,178,12 Angry Men (1957),944,5.0
2,330,187 (1997),944,
3,1353,1-900 (1994),944,
4,1011,2 Days in the Valley (1996),944,


In [5]:
# Append my own ratings to the MovieLens ratings
ratings = pd.concat([ratings, loadTy])

In [6]:
#View my data
ratings.tail()

,movie_id,title,user_id,rating
1659,208,Young Frankenstein (1974),944,
1660,232,Young Guns (1988),944,
1661,1188,Young Guns II (1990),944,
1662,547,"Young Poisoner's Handbook, The (1995)",944,
1663,1164,Zeus and Roxanne (1997),944,


Now we will pivot this table to construct a matrix of users and the movies they rated. NaN indicates missing data.

The index is user_id, the column headers are the title values, and the table values are the ratings themselves.
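To make the shape of the pivoted matrix concrete, here is a minimal sketch on made-up data. The `toy` frame and its movie titles are hypothetical and are not taken from MovieLens.

```python
import pandas as pd

# Hypothetical toy data: three users, two movies (not from MovieLens).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie B'],
    'rating': [5, 3, 4, 2],
})

# One row per user, one column per title; unrated pairs become NaN.
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_matrix)
```

Note that pivot_table aggregates with the mean by default, so if a user had rated the same title twice, the cell would hold the average of those ratings.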

In [7]:
userRatings = ratings.pivot_table(index = ['user_id'], columns = ['title'], values = 'rating')
userRatings.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,


Now the magic happens: pandas has a built-in corr() method that computes a correlation score for every pair of columns in the matrix, which gives us a correlation score between every pair of movies. Each score is computed only from the users who rated both movies; if there are no such users, or too few to compute a correlation, the result is NaN.

Note that a movie's correlation with itself (the same movie on both the X and Y axes) should be 1.0, which means a perfect correlation. A negative score means users tended to rate the two movies in opposite ways.
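As a small illustration of how corr() treats missing ratings, here is a sketch on a hypothetical four-user matrix. The titles and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies.
toy = pd.DataFrame({
    'Movie A': [5.0, 4.0, 1.0, np.nan],
    'Movie B': [4.0, 5.0, 2.0, 3.0],
    'Movie C': [1.0, 2.0, 5.0, np.nan],
    'Movie D': [np.nan, np.nan, np.nan, 4.0],
})

# Pearson correlation for every pair of columns, using only the rows
# where both columns have a rating.
toy_corr = toy.corr()
print(toy_corr)
```

In this toy data, Movie A and Movie C are rated in exactly opposite ways, so their score is -1.0. Movie A and Movie D share no raters, so their score is NaN.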

In [8]:
corrMatrix = userRatings.corr()
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title
'Til There Was You (1997),1.0,,-1.0,-0.5,-0.5,0.522233,,-0.426401,,,...,,,,,,,,,,
1-900 (1994),,1.0,,,,,,-0.981981,,,...,,,,-0.944911,,,,,,
101 Dalmatians (1996),-1.0,,1.0,-4.8168240000000004e-17,0.269191,0.048973,0.266928,-0.098801,,0.111111,...,,-1.0,,0.15884,0.119234,0.680414,0.0,0.707107,,
12 Angry Men (1957),-0.5,,-4.8168240000000004e-17,1.0,0.666667,0.256625,0.274772,0.137362,,0.457176,...,,,,0.096546,0.068944,-0.361961,0.144338,1.0,1.0,
187 (1997),-0.5,,0.269191,0.6666667,1.0,0.596644,,-0.5547,,1.0,...,,0.866025,,0.455233,-0.5,0.5,0.475327,,,


Now to refine our work

We want to avoid spurious results that come from just a handful of users who happened to rate the same pair of movies. To restrict our results to movies that lots of people rated together, and to get more popular results that are more easily recognizable, we will use the min_periods argument to throw out results where fewer than 100 users rated a given movie pair:
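The effect of min_periods is easier to see on a tiny matrix. The sketch below uses invented ratings and a threshold of 4 instead of 100, purely so the toy data can trip it.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies.
toy = pd.DataFrame({
    'Movie A': [5.0, 4.0, 1.0, 2.0, np.nan],
    'Movie B': [4.0, 5.0, 2.0, 1.0, 3.0],
    'Movie C': [1.0, np.nan, np.nan, 5.0, 4.0],
})

loose = toy.corr(min_periods=1)   # any overlap counts
strict = toy.corr(min_periods=4)  # need at least 4 users in common

print(loose)
print(strict)
```

Movie A and Movie B share four raters, so their score survives the strict threshold. Movie A and Movie C share only two, so that score becomes NaN. The threshold applies to the diagonal as well: Movie C has only three ratings in total, so even its correlation with itself is NaN under min_periods=4.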

In [9]:
corrMatrix = userRatings.corr(method = 'pearson', min_periods = 100)
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,


The stock data set has no user 0; the presenter in the video added that user himself. Here I use my own ratings, stored as user_id 944, instead.

Let's take that user, extract the ratings from the userRatings DataFrame, and use dropna() to get rid of missing data, leaving only a Series of the movies the user actually rated.

In [10]:
myRatings = userRatings.loc[944].dropna()
myRatings

title
101 Dalmatians (1996)                    4.0
12 Angry Men (1957)                      5.0
2001: A Space Odyssey (1968)             1.0
Ace Ventura: Pet Detective (1994)        1.0
Ace Ventura: When Nature Calls (1995)    1.0
Air Bud (1997)                           3.0
Air Force One (1997)                     4.0
Akira (1988)                             4.0
Aladdin (1992)                           4.0
Alien (1979)                             4.0
Alien 3 (1992)                           2.0
Aliens (1986)                            5.0
All Dogs Go to Heaven 2 (1996)           4.0
Back to the Future (1985)                3.0
Die Hard (1988)                          5.0
Dumbo (1941)                             1.0
Godfather, The (1972)                    5.0
Godfather: Part II, The (1974)           5.0
GoodFellas (1990)                        5.0
Next Karate Kid, The (1994)              4.0
Platoon (1986)                           4.0
Return of the Jedi (1983)                5.0
Star

Now let's go through each movie I rated, one at a time, and build up a list of possible recommendations based on the movies similar to the ones I rated.

So for each movie I rated, I will retrieve the list of similar movies from our correlation matrix. I will then scale those correlation scores by how well I rated the movie they are similar to, so movies similar to ones I liked count more than movies similar to ones I hated:
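The scaling step is plain multiplication. Here is a minimal sketch with hypothetical similarity scores, not values from the real correlation matrix.

```python
import pandas as pd

# Hypothetical similarities to two movies I rated.
sims_liked = pd.Series({'Movie X': 0.6, 'Movie Y': 0.2})     # similar to a movie I rated 5
sims_disliked = pd.Series({'Movie X': 0.1, 'Movie Z': 0.6})  # similar to a movie I rated 1

# Scale each similarity by the rating of the movie it is similar to.
scaled_liked = sims_liked * 5
scaled_disliked = sims_disliked * 1

print(scaled_liked)
print(scaled_disliked)
```

Movie X and Movie Z have the same raw similarity of 0.6, but Movie X scores 3.0 because it resembles a movie rated 5, while Movie Z scores only 0.6. Note that a movie rated 1 still contributes a small positive score under this scheme; it is down-weighted, not penalized.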

In [11]:
simParts = []
for title, rating in myRatings.items():
    print("Adding sims for " + title + "...")
    # Retrieve the movies similar to this one that I rated
    sims = corrMatrix[title].dropna()
    # Scale each similarity by how well I rated this movie
    sims = sims * rating
    # Add the scores to the list of similarity candidates
    simParts.append(sims)

# Series.append was removed in pandas 2.0, so combine with pd.concat instead
simCandidates = pd.concat(simParts)

# Glance at our results so far:
print("sorting...")
simCandidates.sort_values(inplace=True, ascending=False)
print(simCandidates.head(10))

Adding sims for 101 Dalmatians (1996)...
Adding sims for 12 Angry Men (1957)...
Adding sims for 2001: A Space Odyssey (1968)...
Adding sims for Ace Ventura: Pet Detective (1994)...
Adding sims for Ace Ventura: When Nature Calls (1995)...
Adding sims for Air Bud (1997)...
Adding sims for Air Force One (1997)...
Adding sims for Akira (1988)...
Adding sims for Aladdin (1992)...
Adding sims for Alien (1979)...
Adding sims for Alien 3 (1992)...
Adding sims for Aliens (1986)...
Adding sims for All Dogs Go to Heaven 2 (1996)...
Adding sims for Back to the Future (1985)...
Adding sims for Die Hard (1988)...
Adding sims for Dumbo (1941)...
Adding sims for Godfather, The (1972)...
Adding sims for Godfather: Part II, The (1974)...
Adding sims for GoodFellas (1990)...
Adding sims for Next Karate Kid, The (1994)...
Adding sims for Platoon (1986)...
Adding sims for Return of the Jedi (1983)...
Adding sims for Star Wars (1977)...
sorting...
Return of the Jedi (1983)         5.0
12 Angry Men (1957)   

The output lists the top-scoring movies, including ones the user already rated and likes.

We will use groupby() to add together the scores of movies that show up more than once, so that they count more. This also removes the duplicate entries.
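A minimal sketch of that grouping step, using a hypothetical candidate list in which one title appears twice:

```python
import pandas as pd

# Hypothetical candidate scores; 'Movie X' shows up twice.
candidates = pd.Series(
    [3.0, 1.0, 0.1, 0.6],
    index=['Movie X', 'Movie Y', 'Movie X', 'Movie Z'],
)

# Group by the index labels (the titles) and sum the scores per title.
totals = candidates.groupby(candidates.index).sum()
print(totals)
```

After grouping, each title appears once, and Movie X carries the combined score of both of its entries.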

In [12]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()

In [13]:
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)

Empire Strikes Back, The (1980)    16.802675
Star Wars (1977)                   16.775034
Raiders of the Lost Ark (1981)     16.502822
Die Hard (1988)                    16.385364
Aliens (1986)                      16.018912
GoodFellas (1990)                  15.935898
Godfather: Part II, The (1974)     15.487069
Back to the Future (1985)          15.137290
Return of the Jedi (1983)          14.770081
Godfather, The (1972)              13.758054
dtype: float64
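The top ten above still contains titles I already rated, such as Star Wars and Die Hard. One possible follow-up, which is not part of the notebook above, is to drop the rated titles before reading off recommendations. The sketch below uses hypothetical stand-ins (`scores`, `rated`) for simCandidates and myRatings.

```python
import pandas as pd

# Hypothetical stand-ins for simCandidates and myRatings.
scores = pd.Series({'Movie X': 3.1, 'Movie Y': 1.0, 'Movie Z': 0.6})
rated = pd.Series({'Movie X': 5.0, 'Movie Q': 2.0})

# Drop every title I already rated; errors='ignore' skips rated titles
# that have no candidate score (here 'Movie Q').
unseen = scores.drop(rated.index, errors='ignore')
print(unseen)
```

Applied to the real data, the equivalent call would presumably be simCandidates.drop(myRatings.index, errors='ignore').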

I want to compare the input ratings and the final result in Excel, so I export both.

In [14]:
#Input user
#myRatings
myRatings.to_excel('InputUser.xlsx')

In [15]:
#Results
#simCandidates
simCandidates.to_excel('OutputUser.xlsx')