This notebook shows two simple examples of collaborative filtering, one user-based and one item-based. The source data is the MovieLens `ml-latest-small` dataset, available [here](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip).

### Import packages

In [1]:
import pandas as pd
import numpy as np
import random
from sklearn.metrics.pairwise import pairwise_distances

### Set-up

In [2]:
# location and filenames of the input data
ratings_file = 'https://bitbucket.org/vishal_derive/vcu-data-mining/raw/37a416e794c656e5c84dd87149cdbbf3c0d8737b/data/ratings.csv'
movies_file = 'https://bitbucket.org/vishal_derive/vcu-data-mining/raw/37a416e794c656e5c84dd87149cdbbf3c0d8737b/data/movies.csv'

### Read data

In [3]:
# ratings data set
ratings_df = pd.read_csv(ratings_file)
ratings_df.shape

(100836, 4)

In [4]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
# movies data set
movies_df = pd.read_csv(movies_file)
movies_df.shape

(9742, 3)

In [6]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### Prepare data

Let's drop columns that we don't need, and then combine those two datasets.

In [7]:
ratings_df = ratings_df.drop('timestamp', axis=1)
movies_df = movies_df.drop('genres', axis=1)

df = ratings_df.merge(movies_df, on='movieId', how='inner')
df.shape

(100836, 4)
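
The merged frame has the same number of rows as the ratings data, which suggests that every rating matched exactly one movie. A minimal sketch of how that assumption could be checked explicitly is shown below. It uses made-up toy frames (`toy_ratings` and `toy_movies` are illustrative names, not part of the notebook) and the `validate` argument of `merge`.

```python
import pandas as pd

# toy stand-ins for ratings_df and movies_df (illustrative values only)
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating': [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})

# validate='many_to_one' raises pandas.errors.MergeError if a movieId
# appears more than once on the right-hand (movies) side
toy_merged = toy_ratings.merge(toy_movies, on='movieId', how='inner',
                               validate='many_to_one')

# every rating found exactly one title, so no rows were lost or duplicated
rows_preserved = len(toy_merged) == len(toy_ratings)
print(rows_preserved)
print(toy_merged)
```

With an inner join, a rating whose `movieId` is missing from the movies table would be dropped silently, so comparing row counts before and after the merge is a cheap sanity check.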

In [8]:
df.head()

Unnamed: 0,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,5,1,4.0,Toy Story (1995)
2,7,1,4.5,Toy Story (1995)
3,15,1,2.5,Toy Story (1995)
4,17,1,4.5,Toy Story (1995)


### Matrix representation

Collaborative Filtering requires the data to be in a user-item matrix format.

In [9]:
user_item_df = df.pivot_table(index='userId', columns='title', values='rating')
user_item_df.shape

(610, 9719)

In [10]:
user_item_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
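
To make the reshaping concrete, here is a small sketch on made-up data (the frame `toy_long` and its values are assumptions for illustration). It shows that unrated cells become `NaN`, and that `pivot_table` averages rows sharing the same user and title, because its default `aggfunc` is the mean.

```python
import pandas as pd

# long format: one row per (user, title, rating); values are illustrative
toy_long = pd.DataFrame({
    'userId': [1, 1, 2, 3, 3],
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie B', 'Movie B'],
    'rating': [4.0, 3.0, 5.0, 2.0, 4.0],
})

# wide format: one row per user, one column per title
toy_matrix = toy_long.pivot_table(index='userId', columns='title', values='rating')
print(toy_matrix)

# user 2 never rated 'Movie B', so that cell is NaN
# user 3 has two rows for 'Movie B' (2.0 and 4.0), which are averaged
print(toy_matrix.loc[3, 'Movie B'])
```

Because the notebook pivots on `title` rather than `movieId`, any distinct movies that happen to share a title would be combined in the same way.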


For the purpose of this analysis, we will restrict the number of movies to 500.

In [11]:
# take the 500 most frequently rated movies
top_500_movies = df['title'].value_counts()[:500].index
top_500_movies[:10]

Index(['Forrest Gump (1994)', 'Shawshank Redemption, The (1994)',
       'Pulp Fiction (1994)', 'Silence of the Lambs, The (1991)',
       'Matrix, The (1999)', 'Star Wars: Episode IV - A New Hope (1977)',
       'Jurassic Park (1993)', 'Braveheart (1995)',
       'Terminator 2: Judgment Day (1991)', 'Schindler's List (1993)'],
      dtype='object')

In [12]:
# subset the user_item_df dataframe by taking only those top 500 titles (columns)
user_item_df = user_item_df[top_500_movies]
user_item_df.head()

title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,"Terminal, The (2004)",Beverly Hills Cop (1984),Sneakers (1992),"Pianist, The (2002)",M*A*S*H (a.k.a. MASH) (1970),"Simpsons Movie, The (2007)",Fear and Loathing in Las Vegas (1998),Adaptation (2002),Phenomenon (1996),Gran Torino (2008)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,3.0,4.0,5.0,5.0,4.0,4.0,,5.0,...,,,3.0,,5.0,,,,,
2,,3.0,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,0.5,...,,,,,,,,,,
4,,,1.0,5.0,1.0,5.0,,,,,...,,,,,,,,,,
5,,3.0,5.0,,,,,4.0,3.0,5.0,...,,,,,,,,,,


In [13]:
# reset the index so that userId becomes a regular column
user_item_df = user_item_df.reset_index()
user_item_df.head()

title,userId,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),...,"Terminal, The (2004)",Beverly Hills Cop (1984),Sneakers (1992),"Pianist, The (2002)",M*A*S*H (a.k.a. MASH) (1970),"Simpsons Movie, The (2007)",Fear and Loathing in Las Vegas (1998),Adaptation (2002),Phenomenon (1996),Gran Torino (2008)
0,1,4.0,,3.0,4.0,5.0,5.0,4.0,4.0,,...,,,3.0,,5.0,,,,,
1,2,,3.0,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,4,,,1.0,5.0,1.0,5.0,,,,...,,,,,,,,,,
4,5,,3.0,5.0,,,,,4.0,3.0,...,,,,,,,,,,


Let's take a quick look at the movies with the highest average rating.

In [14]:
all_titles = [x for x in user_item_df.columns if x != 'userId']
all_titles[:5]

['Forrest Gump (1994)',
 'Shawshank Redemption, The (1994)',
 'Pulp Fiction (1994)',
 'Silence of the Lambs, The (1991)',
 'Matrix, The (1999)']

In [15]:
avg_ratings = user_item_df[all_titles].mean().sort_values(ascending=False)
avg_ratings.head(20)

title
Shawshank Redemption, The (1994)                                                  4.429022
Godfather, The (1972)                                                             4.289062
Fight Club (1999)                                                                 4.272936
Cool Hand Luke (1967)                                                             4.271930
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)       4.268041
Rear Window (1954)                                                                4.261905
Godfather: Part II, The (1974)                                                    4.259690
Departed, The (2006)                                                              4.252336
Goodfellas (1990)                                                                 4.250000
Casablanca (1942)                                                                 4.240000
Dark Knight, The (2008)                                                           4.

Let's check the ratings of a specific movie.

In [16]:
avg_ratings[avg_ratings.index.str.contains('Groundhog')]

title
Groundhog Day (1993)    3.944056
dtype: float64

What fraction of all users saw (rated) this movie?

In [17]:
user_item_df['Groundhog Day (1993)'].notnull().sum() / len(user_item_df)

0.23442622950819672

How many users have *not* watched (rated) this movie?

In [18]:
user_item_df['Groundhog Day (1993)'].isnull().sum()

467
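
The two numbers above are complementary: the users who rated the movie and those who did not add up to the full population. A small sketch on a made-up column (`toy_col` is an illustrative name) shows both counts, and that taking the mean of a boolean mask gives the same fraction as dividing the sum by the length.

```python
import numpy as np
import pandas as pd

# one movie's ratings for five hypothetical users; NaN means "not rated"
toy_col = pd.Series([4.0, np.nan, 3.5, np.nan, np.nan], name='Some Movie')

# fraction of users who rated the movie (mean of a boolean mask)
rated_fraction = toy_col.notnull().mean()

# number of users who did not rate the movie
n_not_rated = int(toy_col.isnull().sum())

print(rated_fraction, n_not_rated)
```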

### User-based Collaborative Filtering

Let's randomly choose one user who has not watched *Groundhog Day*. We will then predict their rating for this movie.

In [19]:
target_movie = 'Groundhog Day (1993)'

# first take all users who have not watched this movie
target_users = user_item_df[user_item_df[target_movie].isnull()]
target_users.shape

(467, 501)

In [20]:
# randomly select one user from this group
random.seed(5)
target_user = random.sample(list(target_users['userId']), 1)[0]
target_user

419

This is our **target** user, for whom we wish to predict the rating for *Groundhog Day*.

Let's calculate the distance from this user to all other users. In order to calculate the distance, let's first collect all movies that the target user has rated.

In [21]:
movies_rated_by_target_user = (
    user_item_df.loc[user_item_df['userId'] == target_user, all_titles]
    .dropna(axis=1)
)
movies_rated_by_target_user.T

Unnamed: 0_level_0,418
title,Unnamed: 1_level_1
Forrest Gump (1994),4.5
"Shawshank Redemption, The (1994)",5.0
Pulp Fiction (1994),5.0
"Silence of the Lambs, The (1991)",3.0
"Matrix, The (1999)",4.0
...,...
Super Size Me (2004),5.0
James and the Giant Peach (1996),1.0
Big Daddy (1999),3.5
Troy (2004),4.5


In [22]:
# all movies rated by the target user
rated_movies_by_target_user = movies_rated_by_target_user.columns

# create a mask that excludes the target user and keeps only users who have rated the target movie
mask = (user_item_df['userId'] != target_user) & (user_item_df[target_movie].notnull())

# apply the filter -- take all user IDs that satisfy these criteria
user_search_space = user_item_df[mask]['userId'].values

# take all users from this search space, and get their ratings for all movies that our target user rated
# (missing ratings are filled with 0)
X = user_item_df[mask][rated_movies_by_target_user].fillna(0)

# take the target user, and get their ratings for those same movies (the target movie is not among them)
y = movies_rated_by_target_user

len(X), len(y)

(143, 1)

In [23]:
user_dist = pairwise_distances(X, pd.DataFrame(y), metric='cosine')
user_dist[:10]

array([[0.40874643],
       [0.72447801],
       [0.90237203],
       [0.38965766],
       [0.50749988],
       [0.45438661],
       [0.54907574],
       [0.76357013],
       [0.41574712],
       [0.31965346]])
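
For intuition about these numbers, the cosine distance between two rating vectors is one minus the cosine of the angle between them. The sketch below computes it by hand with NumPy on made-up vectors. The helper `cosine_distance` and the example ratings are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = [4.0, 5.0, 3.0]

same_ratings = [4.0, 5.0, 3.0]    # identical ratings
scaled_ratings = [2.0, 2.5, 1.5]  # same pattern at half the scale
sparse_ratings = [4.0, 0.0, 0.0]  # two unrated movies filled with 0

d_same = cosine_distance(target, same_ratings)
d_scaled = cosine_distance(target, scaled_ratings)
d_sparse = cosine_distance(target, sparse_ratings)

print(d_same, d_scaled, d_sparse)
```

Two things stand out. The distance depends only on the direction of the vectors, so a user who rates everything half as high is still at distance (almost exactly) zero. And because missing ratings are filled with 0, a user who rated few of the target user's movies ends up farther away even when the ratings they do share agree.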

Find the user that is "closest" to the target user.

In [24]:
min_dist = user_dist.min()
min_dist

0.04178191608089288

In [25]:
closest_user = user_search_space[np.argmin(user_dist)]
closest_user

414

In [26]:
# the closest user and their ratings
user_item_df[user_item_df['userId'] == closest_user].T[:10]

Unnamed: 0_level_0,413
title,Unnamed: 1_level_1
userId,414.0
Forrest Gump (1994),5.0
"Shawshank Redemption, The (1994)",5.0
Pulp Fiction (1994),5.0
"Silence of the Lambs, The (1991)",4.0
"Matrix, The (1999)",5.0
Star Wars: Episode IV - A New Hope (1977),5.0
Jurassic Park (1993),4.0
Braveheart (1995),5.0
Terminator 2: Judgment Day (1991),5.0


In [27]:
# target user's ratings (for comparison)
user_item_df[user_item_df['userId'] == target_user].T[:10]

Unnamed: 0_level_0,418
title,Unnamed: 1_level_1
userId,419.0
Forrest Gump (1994),4.5
"Shawshank Redemption, The (1994)",5.0
Pulp Fiction (1994),5.0
"Silence of the Lambs, The (1991)",3.0
"Matrix, The (1999)",4.0
Star Wars: Episode IV - A New Hope (1977),
Jurassic Park (1993),3.5
Braveheart (1995),4.0
Terminator 2: Judgment Day (1991),


In [28]:
predicted_rating = user_item_df[user_item_df['userId'] == closest_user]\
    [target_movie]
predicted_rating

413    4.0
Name: Groundhog Day (1993), dtype: float64

This is a demonstration of how user-based distance can be used to predict ratings (and recommend movies).
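
The user-based steps can be gathered into one function, sketched here on a made-up matrix. The function name `predict_user_based` and the toy data are assumptions for illustration. Unlike the notebook, the sketch keeps `userId` as the index and computes the cosine distance directly with NumPy.

```python
import numpy as np
import pandas as pd

def predict_user_based(matrix, target_user, target_movie):
    """Copy the target movie's rating from the nearest user (cosine distance)."""
    # movies the target user has rated
    rated = matrix.loc[target_user].dropna().index
    # search space: other users who have rated the target movie
    candidates = matrix[matrix[target_movie].notnull()]
    candidates = candidates.drop(index=target_user, errors='ignore')
    # ratings on the target user's movies, with missing values filled with 0
    X = candidates[rated].fillna(0).to_numpy()
    y = matrix.loc[target_user, rated].to_numpy()
    # cosine distance from each candidate to the target user
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(y)
    norms[norms == 0] = 1.0  # an all-zero vector gets distance 1
    dist = 1.0 - (X @ y) / norms
    closest = candidates.index[int(np.argmin(dist))]
    return closest, matrix.loc[closest, target_movie]

toy_users = pd.DataFrame(
    {'Movie A': [5.0, 5.0, 1.0, 4.0],
     'Movie B': [4.0, 4.0, 5.0, np.nan],
     'Target': [np.nan, 4.5, 2.0, np.nan]},
    index=pd.Index([1, 2, 3, 4], name='userId'),
)

nearest_user, user_based_prediction = predict_user_based(toy_users, 1, 'Target')
print(nearest_user, user_based_prediction)
```

Here user 2 rated both shared movies exactly as user 1 did, so user 2 is the nearest neighbour and their rating of the target movie becomes the prediction.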

____________

### Item-based Collaborative Filtering

In [30]:
# list of all movies other than the target movie
all_other_movies = [col for col in user_item_df.columns if col not in (target_movie, 'userId')]

X = user_item_df[all_other_movies].fillna(0)
y = user_item_df[target_movie].fillna(0)

In [31]:
item_dist = pairwise_distances(X.T, y.values.reshape(1, -1), metric='cosine')
item_dist[:10]

array([[0.4774823 ],
       [0.50736902],
       [0.52567167],
       [0.53229855],
       [0.46384951],
       [0.43052535],
       [0.52281557],
       [0.59504138],
       [0.54181274],
       [0.55983412]])

In [32]:
min_item_dist = item_dist.min()
min_item_dist

0.38764197403915945

In [33]:
# list of all movies sorted by distance (shortest first)
[title for d, title in sorted(zip(item_dist.ravel(), all_other_movies))][:10]

['Back to the Future (1985)',
 'Monty Python and the Holy Grail (1975)',
 'Men in Black (a.k.a. MIB) (1997)',
 'Princess Bride, The (1987)',
 "Ferris Bueller's Day Off (1986)",
 'Being John Malkovich (1999)',
 'Ghostbusters (a.k.a. Ghost Busters) (1984)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Indiana Jones and the Last Crusade (1989)',
 'Star Wars: Episode IV - A New Hope (1977)']
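
An alternative way to rank titles by distance is `np.argsort`, which returns the positions that would sort the distances in ascending order. The sketch below uses made-up distances and titles (`toy_dist` and `toy_titles` are illustrative names) in the same column-vector shape that `pairwise_distances` returns.

```python
import numpy as np

# distances as a column vector, one row per title
toy_dist = np.array([[0.48], [0.39], [0.51]])
toy_titles = ['Movie A', 'Movie B', 'Movie C']

# flatten to 1-D, then get the positions in ascending order of distance
order = np.argsort(toy_dist.ravel())
ranked_titles = [toy_titles[i] for i in order]

print(ranked_titles)
```

Flattening first avoids comparing one-element arrays during the sort, and the same `order` array can be reused to look up the sorted distances.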

In [34]:
closest_item = all_other_movies[np.argmin(item_dist)]
closest_item

'Back to the Future (1985)'

Calculate the average rating for this title. This will become our predicted rating for *Groundhog Day*.

In [35]:
predicted_rating_2 = user_item_df[closest_item].mean()
round(predicted_rating_2, 1)

4.0

This is a demonstration of how item-based distance can be used to predict ratings (and recommend movies).
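
The item-based steps can likewise be collected into one function, sketched here on a made-up matrix. The name `predict_item_based` and the toy data are assumptions for illustration. The sketch keeps `userId` as the index and computes the cosine distance with NumPy.

```python
import numpy as np
import pandas as pd

def predict_item_based(matrix, target_movie):
    """Predict with the mean rating of the movie nearest to target_movie."""
    filled = matrix.fillna(0)
    others = [col for col in matrix.columns if col != target_movie]
    # one row per candidate movie, one column per user
    X = filled[others].to_numpy().T
    y = filled[target_movie].to_numpy()
    # cosine distance from each candidate movie to the target movie
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(y)
    norms[norms == 0] = 1.0  # an all-zero vector gets distance 1
    dist = 1.0 - (X @ y) / norms
    closest = others[int(np.argmin(dist))]
    # the mean skips NaN, so only actual ratings are averaged
    return closest, matrix[closest].mean()

toy_items = pd.DataFrame(
    {'Movie A': [5.0, 4.0, np.nan, 1.0],
     'Movie B': [1.0, np.nan, 5.0, 5.0],
     'Target': [5.0, 4.0, np.nan, 2.0]},
    index=pd.Index([1, 2, 3, 4], name='userId'),
)

nearest_item, item_based_prediction = predict_item_based(toy_items, 'Target')
print(nearest_item, round(item_based_prediction, 2))
```

As in the notebook, the prediction is the nearest movie's average over all users who rated it, so it is the same for every user rather than personalized.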