# Item-Item Collaborative Filtering Example

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.metrics.pairwise import cosine_similarity

## Initial Setup
This example uses the MovieLens dataset, which contains user ratings for 9,724 unique movies.

Source: https://grouplens.org/datasets/movielens/
Last updated 9/2018.

In [3]:
# Relative path reference that contains the dataset
path = "data/"

# Reading and displaying the list of movies that were rated
movie_names = pd.read_csv(path + "movies.csv")
movie_names.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Dropping the `genres` column from the movie DataFrame, as it is not required for collaborative filtering:

In [4]:
movie_names.drop('genres', axis = 1, inplace = True)
movie_names.head()

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


Reading the user ratings for the corresponding movie IDs:

In [5]:
ratings = pd.read_csv(path+'ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Dropping the `timestamp` column from the user ratings, as it is not required for collaborative filtering:

In [6]:
ratings.drop('timestamp',axis = 1, inplace = True)
print("Unique users:" + str(ratings.userId.unique().size) + ", Unique movies = " + str(ratings.movieId.unique().size))

Unique users:610, Unique movies = 9724


### Creating the Utility Matrix
The utility matrix holds the movie ratings (columns) for each user (rows).
If a user has rated the same movie more than once, the aggregation function averages those ratings; movies a user has not rated are filled with 0.

In [7]:
ratings_matrix = pd.crosstab(ratings.userId, ratings.movieId, ratings.rating, aggfunc=np.mean).fillna(value=0)
ratings_matrix.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
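
As a small illustration of the averaging and zero-filling behaviour, the sketch below builds a utility matrix from a made-up four-row ratings table (`toy_ratings` is hypothetical, not taken from MovieLens) in which user 1 rated movie 10 twice.

```python
import pandas as pd

# Hypothetical ratings: user 1 rated movie 10 twice (3.0 and 5.0).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2],
    "movieId": [10, 10, 20, 20],
    "rating":  [3.0, 5.0, 4.0, 2.0],
})

# Users as rows, movies as columns, duplicates averaged, missing cells set to 0.
toy_matrix = pd.crosstab(
    index=toy_ratings.userId,
    columns=toy_ratings.movieId,
    values=toy_ratings.rating,
    aggfunc="mean",
).fillna(0)

print(toy_matrix)
```

After `fillna(0)`, a 0 stands for "not rated", and cosine similarity will treat it the same as a rating of zero.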


### Calculation of Cosine Similarity

In [8]:
# Example
a = [[1,0,1,0],
    [0,1,1,0]]

cosine_similarity(a)

array([[1. , 0.5],
       [0.5, 1. ]])
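
To see where the 0.5 comes from, here is a minimal hand-written version of the same calculation: the dot product of the two vectors divided by the product of their lengths. The helper `cosine` is written for this illustration only, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)

a = [[1, 0, 1, 0],
     [0, 1, 1, 0]]

# The vectors share one non-zero position and each has length sqrt(2).
similarity = cosine(a[0], a[1])
print(similarity)
```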

In [9]:
similarity_matrix = pd.DataFrame(cosine_similarity(ratings_matrix.T))

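Why are we using `ratings_matrix.T`? `cosine_similarity` compares the rows of its input, so transposing the user-by-movie matrix puts one movie on each row and yields movie-to-movie (item-item) similarities.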
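
A minimal sketch of the transposed computation on a hypothetical 3x3 utility matrix follows. It also passes the movieId labels to the result, because wrapping a bare NumPy array in `pd.DataFrame` gives a positional `RangeIndex` rather than movieId labels. The helper `labeled_item_similarity` and the toy data are illustrative assumptions, and the cosine is computed with NumPy here so the snippet does not depend on scikit-learn.

```python
import numpy as np
import pandas as pd

# Toy utility matrix: rows are users, columns are (non-contiguous) movieIds.
toy_ratings = pd.DataFrame(
    [[4.0, 0.0, 4.0],
     [0.0, 5.0, 0.0],
     [4.0, 0.0, 0.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([1, 3, 6], name="movieId"),
)

def labeled_item_similarity(utility):
    """Cosine similarity between the columns of `utility`, labeled by column name."""
    items = utility.T.to_numpy(dtype=float)        # one row per movie
    norms = np.linalg.norm(items, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                        # avoid dividing by zero for unrated movies
    unit = items / norms
    return pd.DataFrame(unit @ unit.T, index=utility.columns, columns=utility.columns)

toy_similarity = labeled_item_similarity(toy_ratings)
print(toy_similarity)
```

With labels in place, `toy_similarity[6]` refers to movieId 6 rather than to the seventh column.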

### Input configuration example:

#### Example:
Let's say user ID 500 has watched and rated the movie "Toy Story (1995)".
Based on this user's latest selection and previously rated movies, we want to use item-item collaborative filtering to recommend 10 other movies to this user.

***NOTE:*** You are encouraged to build a user interface to test different user-item input combinations and display appropriate recommendations for each input pair.

In [24]:
selected_Movie = 'Toy Story (1995)'
selected_User = 500
number_of_recos = 10

Retrieving the 10 most similar movies to the selected movie:

In [25]:
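# Take the first matching row explicitly; calling int() on a whole Series is deprecated in recent pandas
selected_MovieId = int(movie_names.loc[movie_names['title'] == selected_Movie, 'movieId'].iloc[0])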
selected_MovieId

1
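
A title lookup like the one above assumes the title exists exactly once. The sketch below shows one way to make that assumption explicit; `lookup_movie_id` and the three-row `toy_movies` table are hypothetical names introduced for illustration.

```python
import pandas as pd

def lookup_movie_id(movies, title):
    """Return the movieId for an exact title match, or raise if the match is not unique."""
    matches = movies.loc[movies["title"] == title, "movieId"]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match for {title!r}, found {len(matches)}")
    return int(matches.iloc[0])

toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

toy_id = lookup_movie_id(toy_movies, "Toy Story (1995)")
print(toy_id)
```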

In [26]:
similarity_matrix.index

RangeIndex(start=0, stop=9724, step=1)

In [27]:
similarity_df = pd.DataFrame(similarity_matrix[selected_MovieId])
similarity_df = similarity_df.sort_values(by = selected_MovieId, ascending= False)

In [29]:
similarity_df.shape

(9724, 1)

In [30]:
similar_items = similarity_df[selected_MovieId].index.tolist()[:number_of_recos+1]
similar_items

[1, 322, 436, 325, 418, 504, 483, 506, 512, 18, 276]

Retrieving the similarity score vectors for the similar movies:

In [32]:
similar_item_scores = similarity_matrix.loc[similar_items,selected_MovieId]
similar_item_scores

1      1.000000
322    0.588438
436    0.549818
325    0.544981
418    0.538046
504    0.524876
483    0.518161
506    0.515620
512    0.507458
18     0.497560
276    0.497368
Name: 1, dtype: float64
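
Note that the selected movie itself tops the list with a similarity of 1.0, which is why `number_of_recos+1` entries are taken. As an alternative sketch, the selected item can be dropped before picking the top entries; `top_similar` and the values in `toy_column` are made up for this example.

```python
import pandas as pd

# Hypothetical similarity column for one selected item (label 1).
toy_column = pd.Series({1: 1.0, 2: 0.2, 3: 0.9, 4: 0.5, 5: 0.7})

def top_similar(column, selected, n):
    """Labels and scores of the n items most similar to `selected`, excluding itself."""
    return column.drop(labels=selected).nlargest(n)

toy_neighbours = top_similar(toy_column, selected=1, n=3)
print(toy_neighbours)
```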

**For user ID 500 (the selected user), getting the user's ratings of the similar items (if any):**

In [18]:
# reindex (rather than .loc with a list) tolerates ids that are not columns of ratings_matrix
user_vector = ratings_matrix.loc[selected_User].reindex(similar_items).fillna(0)
user_vector

movieId
1      4.0
322    4.0
436    3.0
325    0.0
418    0.0
504    0.0
483    0.0
506    0.0
512    0.0
18     0.0
276    0.0
Name: 500, dtype: float64
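
Some of the requested ids may not exist as columns of the ratings matrix. A minimal sketch of how `reindex` handles that case is shown below; the row `toy_user_row` and the id 999 are hypothetical.

```python
import pandas as pd

# Hypothetical ratings row for one user, indexed by movieId.
toy_user_row = pd.Series({1: 4.0, 322: 4.0, 436: 3.0}, name=500)

# Labels 483 and 999 are not in the row: reindex inserts NaN for them,
# and fillna(0) then records them as "no rating".
wanted = [1, 322, 483, 999]
toy_user_vector = toy_user_row.reindex(wanted).fillna(0)
print(toy_user_vector)
```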

### Ad-hoc rating score computation steps:

- Calculating the final scores for the selected user as a similarity-weighted average of the user's ratings of the movies similar to the selected movie:

    (dot product of the user ratings vector and the similarity scores) / (sum of the similarity scores)


- Taking the top 10 movies in descending order of score.
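
The weighted average described in the steps above can be sketched in plain Python. The function `weighted_score` and the toy numbers are assumptions for illustration, and returning 0.0 when no similarity mass is available is a placeholder choice.

```python
def weighted_score(ratings, similarities):
    """Similarity-weighted average: sum(r_i * s_i) / sum(s_i)."""
    total_similarity = sum(similarities)
    if total_similarity == 0:
        return 0.0  # no usable neighbours; placeholder value
    return sum(r * s for r, s in zip(ratings, similarities)) / total_similarity

# Hypothetical user ratings of three similar movies (0.0 = not rated)
# and the similarity of each of those movies to the target movie.
toy_user_ratings = [4.0, 0.0, 3.0]
toy_similarities = [1.0, 0.5, 0.5]

toy_score = weighted_score(toy_user_ratings, toy_similarities)
print(toy_score)  # (4.0*1.0 + 0.0*0.5 + 3.0*0.5) / 2.0 = 2.75
```

Because unrated movies are stored as 0, they still add their similarity to the denominator and pull the score down. A common variant divides only by the similarities of the movies the user actually rated.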


In [19]:
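# Series.dot(Series) returns a scalar, so take the full similarity rows of the
# similar items to get one weighted-average score per movie.
neighbour_similarities = similarity_matrix.loc[similar_items]
score = user_vector.dot(neighbour_similarities).div(neighbour_similarities.sum())
score = score.sort_values(ascending=False)
score.head(10)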


9118    4.0
5872    4.0
8383    4.0
8335    4.0
8313    4.0
5598    4.0
9345    4.0
9346    4.0
9363    4.0
9385    4.0
dtype: float64

In [35]:
score.shape

(9724,)

### Filtering the top 10 scored movies and displaying the recommended items:

In [33]:
top_10_scored = score.index.tolist()[:number_of_recos]
for i in top_10_scored:
    print(i,movie_names[movie_names['movieId'] == int(i)]['title'].values)

9118 []
5872 ['Die Another Day (2002)']
8383 ['Hope Springs (2003)']
8335 ['Make Way for Tomorrow (1937)']
8313 []
5598 []
9345 []
9346 []
9363 []
9385 []
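
The empty brackets above are ids with no matching `movieId` in `movie_names`. A minimal sketch for resolving ids to titles in one step, and for listing the ids that have no title, is shown below; `toy_movies`, `title_by_id`, and `candidate_ids` are hypothetical names.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movieId": [1, 2, 5872],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Die Another Day (2002)"],
})

# Series that maps movieId -> title.
title_by_id = toy_movies.set_index("movieId")["title"]

candidate_ids = [9118, 5872, 1]

# reindex keeps the requested order and yields NaN for ids without a title.
titles = title_by_id.reindex(candidate_ids)
known = titles.dropna()
unknown = titles[titles.isna()].index.tolist()

print(known)
print("ids without a title:", unknown)
```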


In [34]:
for i in similar_items:
    print(i,movie_names[movie_names['movieId'] == int(i)]['title'].values)

1 ['Toy Story (1995)']
322 ['Swimming with Sharks (1995)']
436 ['Color of Night (1994)']
325 ["National Lampoon's Senior Trip (1995)"]
418 ['Being Human (1993)']
504 ['No Escape (1994)']
483 []
506 ['Orlando (1992)']
512 ['Puppet Masters, The (1994)']
18 ['Four Rooms (1995)']
276 ['Milk Money (1994)']
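
To try different user and movie combinations, as the note earlier suggests, the steps of this notebook could be wrapped in one function. The sketch below is a hypothetical wrapper on toy data, not the notebook's exact code: it labels the similarity matrix by movieId, computes the cosine with NumPy, and additionally filters out movies the user has already rated, which is an assumption added here.

```python
import numpy as np
import pandas as pd

def recommend(utility, movies, selected_title, selected_user, n_recos=10, n_neighbours=10):
    """Titles of the n_recos highest-scoring movies the user has not rated yet."""
    # Item-item cosine similarity, labeled by movieId.
    items = utility.T.to_numpy(dtype=float)
    norms = np.linalg.norm(items, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = items / norms
    sim = pd.DataFrame(unit @ unit.T, index=utility.columns, columns=utility.columns)

    selected_id = int(movies.loc[movies["title"] == selected_title, "movieId"].iloc[0])
    # Neighbours of the selected movie (the movie itself is included).
    neighbours = sim[selected_id].nlargest(n_neighbours + 1).index

    user_ratings = utility.loc[selected_user]
    neighbour_sims = sim.loc[neighbours]           # rows: neighbours, columns: all movies
    weights = neighbour_sims.sum()
    # Weighted average per movie; zero weights become NaN instead of dividing by zero.
    scores = user_ratings.reindex(neighbours).dot(neighbour_sims) / weights.where(weights > 0)

    # Keep only unrated movies, best score first.
    scores = scores[user_ratings == 0].dropna().sort_values(ascending=False)
    titles = movies.set_index("movieId")["title"]
    return titles.reindex(scores.head(n_recos).index).dropna().tolist()

# Hypothetical data: 4 users, 4 movies; user 4 has only rated movie 1.
toy_utility = pd.DataFrame(
    [[5, 4, 0, 0], [4, 5, 0, 1], [0, 0, 5, 4], [5, 0, 0, 0]],
    index=pd.Index([1, 2, 3, 4], name="userId"),
    columns=pd.Index([1, 2, 3, 4], name="movieId"),
    dtype=float,
)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
})

toy_recos = recommend(toy_utility, toy_movies, "Movie A", selected_user=4,
                      n_recos=2, n_neighbours=2)
print(toy_recos)
```

A user interface would then only need to collect the title, the user id, and the number of recommendations, and pass them to such a function.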
