# Collaborative Filtering

In this notebook we build a movie recommender system based on item-item collaborative filtering. <br><br>

We will be using the MovieLens Dataset (https://files.grouplens.org/datasets/movielens/ml-latest-small.zip)<br><br>
Citation: <br>
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

## EDA and data cleaning

In [1]:
import pandas as pd 

### movies.csv

In [2]:
movies = pd.read_csv('ml-latest-small/movies.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
movies.shape

(9742, 3)

In [5]:
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [6]:
movies.title.nunique(), movies.movieId.nunique()

(9737, 9742)

#### There are some duplicated titles

In [7]:
duplicated_movies_title = movies[movies.duplicated('title')].title.values
duplicated_movies_title

array(['Emma (1996)', 'War of the Worlds (2005)',
       'Confessions of a Dangerous Mind (2002)', 'Eros (2004)',
       'Saturn 3 (1980)'], dtype=object)

#### Five titles each appear twice (under two different movieIds)

### ratings.csv

In [8]:
ratings = pd.read_csv('ml-latest-small/ratings.csv')

In [9]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [10]:
ratings = ratings.drop('timestamp', axis=1)

In [11]:
ratings.shape

(100836, 3)

In [12]:
print(f'No. of users: {ratings.userId.nunique()}') 
print(f'No. of movies rated: {ratings.movieId.nunique()}')

No. of users: 610
No. of movies rated: 9724


In [13]:
print(f'No. of movies rated: {ratings.movieId.nunique()}')
# We are not concerned with duplicate titles, for now
print(f'No. of movies total: {movies.movieId.nunique()}')

No. of movies rated: 9724
No. of movies total: 9742


18 movies (9742 - 9724) have not been rated.

In [14]:
### Movies that have not been rated 
tmp = ratings.merge(movies.drop('genres', axis=1), how='outer')
unrated_movies = tmp[tmp.isna().any(axis=1)]
unrated_movies.head()

Unnamed: 0,userId,movieId,rating,title
100836,,1076,,"Innocents, The (1961)"
100837,,2939,,Niagara (1953)
100838,,3338,,For All Mankind (1989)
100839,,3456,,"Color of Paradise, The (Rang-e khoda) (1999)"
100840,,4194,,I Know Where I'm Going! (1945)


In [15]:
print(unrated_movies[['movieId', 'title']])
print('Number of movies not rated:', unrated_movies.shape[0])

        movieId                                         title
100836     1076                         Innocents, The (1961)
100837     2939                                Niagara (1953)
100838     3338                        For All Mankind (1989)
100839     3456  Color of Paradise, The (Rang-e khoda) (1999)
100840     4194                I Know Where I'm Going! (1945)
100841     5721                            Chosen, The (1981)
100842     6668   Road Home, The (Wo de fu qin mu qin) (1999)
100843     6849                                Scrooge (1970)
100844     7020                                  Proof (1991)
100845     7792                     Parallax View, The (1974)
100846     8765                      This Gun for Hire (1942)
100847    25855                  Roaring Twenties, The (1939)
100848    26085                   Mutiny on the Bounty (1962)
100849    30892            In the Realms of the Unreal (2004)
100850    32160                      Twentieth Century (1934)
100851  

We can ignore these movies: with no ratings, they carry no signal for collaborative filtering.
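
An alternative way to list unrated movies is an anti-join with `isin`, which avoids the outer merge and the NaN check. The sketch below runs on small made-up frames; `movies_toy` and `ratings_toy` are hypothetical stand-ins for `movies` and `ratings`, not the real data.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (values are illustrative only)
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A (1990)", "B (1991)", "C (1992)", "D (1993)"],
})
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Anti-join: keep the movies whose movieId never appears in the ratings
unrated_toy = movies_toy[~movies_toy["movieId"].isin(ratings_toy["movieId"])]
print(unrated_toy)
```

Because `DataFrame.merge` defaults to an inner join, merging ratings with movies without `how='outer'` leaves such unrated movies out of the result.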

### Merging the dataframe

In [16]:
ratings = ratings.merge(movies.drop('genres', axis=1))
ratings.sample(3)

Unnamed: 0,userId,movieId,rating,title
34470,265,1784,4.0,As Good as It Gets (1997)
92520,182,4052,3.5,Antitrust (2001)
3569,608,500,2.0,Mrs. Doubtfire (1993)


#### Let's see what to do with the duplicated movie titles in movies.csv

In [17]:
duplicated_movies = movies[movies.title.isin(duplicated_movies_title)]
duplicated_movies

Unnamed: 0,movieId,title,genres
650,838,Emma (1996),Comedy|Drama|Romance
2141,2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller
4169,6003,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Thriller
5601,26958,Emma (1996),Romance
5854,32600,Eros (2004),Drama
5931,34048,War of the Worlds (2005),Action|Adventure|Sci-Fi|Thriller
6932,64997,War of the Worlds (2005),Action|Sci-Fi
9106,144606,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Romance|Thriller
9135,147002,Eros (2004),Drama|Romance
9468,168358,Saturn 3 (1980),Sci-Fi|Thriller


In [18]:
duplicated_movieId = duplicated_movies.movieId.values
duplicated_movieId

array([   838,   2851,   6003,  26958,  32600,  34048,  64997, 144606,
       147002, 168358])

In [19]:
duplicated_movies = ratings[ratings.movieId.isin(duplicated_movieId)]
duplicated_movies

Unnamed: 0,userId,movieId,rating,title
18651,3,2851,5.0,Saturn 3 (1980)
18652,217,2851,3.0,Saturn 3 (1980)
18653,288,2851,2.0,Saturn 3 (1980)
18654,469,2851,3.0,Saturn 3 (1980)
33502,6,838,4.0,Emma (1996)
...,...,...,...,...
89385,111,144606,4.0,Confessions of a Dangerous Mind (2002)
96980,318,147002,4.0,Eros (2004)
99604,509,26958,3.5,Emma (1996)
99664,514,168358,2.5,Saturn 3 (1980)


In [20]:
duplicated_movies.groupby(['title','movieId'])['rating'].agg(['mean','count'])

title,movieId,mean,count
Confessions of a Dangerous Mind (2002),6003,3.6,15
Confessions of a Dangerous Mind (2002),144606,4.0,1
Emma (1996),838,3.916667,30
Emma (1996),26958,3.5,1
Eros (2004),32600,3.5,1
Eros (2004),147002,4.0,1
Saturn 3 (1980),2851,3.25,4
Saturn 3 (1980),168358,2.5,1
War of the Worlds (2005),34048,3.15,50
War of the Worlds (2005),64997,3.0,2


### Why should we fix this?
The same movie could be recommended twice, because the model treats movieId 6003 and movieId 144606 as two different movies.
### How can we fix this?
We could simply delete the duplicate records. For War of the Worlds this would hardly matter, since the duplicate holds only 2 of its 52 ratings. <br>
But Eros has been rated only twice, once under each movieId, so deleting a duplicate would discard half of its data. <br>
A better approach is to merge the duplicates by remapping every duplicated movieId to a single ID.

In [21]:
duplicated_movies_title

array(['Emma (1996)', 'War of the Worlds (2005)',
       'Confessions of a Dangerous Mind (2002)', 'Eros (2004)',
       'Saturn 3 (1980)'], dtype=object)

In [22]:
id_mapping = {}

# duplicated_movies holds the ratings rows for every duplicated movieId (cell 19)
for title in duplicated_movies_title:
    title_rows = duplicated_movies[duplicated_movies['title'] == title]
    
    # Use the smallest movieId among the duplicates as the canonical one
    min_movie_id = title_rows['movieId'].min()
    
    # Map each duplicated movieId to the canonical one
    for movie_id in title_rows['movieId'].unique():
        id_mapping[movie_id] = min_movie_id

# Rewrite the ratings so both copies of a movie share a single movieId
ratings['movieId'] = ratings['movieId'].map(lambda x: id_mapping.get(x, x))
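
One thing worth checking after the remap is whether any user rated both copies of a movie, because that user would then have two rows for the same (userId, movieId) pair. `csr_matrix` sums duplicate entries when it is built from (data, (row, col)) triplets, so such a pair would end up as the sum of the two ratings. The sketch below shows one possible check and one possible fix (averaging) on made-up data; `ratings_toy` and `id_mapping_toy` are hypothetical, and averaging is an assumption rather than the notebook's method.

```python
import pandas as pd

# Toy ratings: user 1 rated both copies (movieId 10 and 20) of the same movie
ratings_toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 20],
    "rating":  [4.0, 2.0, 3.0, 5.0],
})
id_mapping_toy = {10: 10, 20: 10}  # hypothetical: 20 is a duplicate of 10

# Same remapping step as in the notebook
ratings_toy["movieId"] = ratings_toy["movieId"].map(lambda x: id_mapping_toy.get(x, x))

# Rows where a user now has more than one rating for the same movie
clashes = ratings_toy[ratings_toy.duplicated(["userId", "movieId"], keep=False)]
print(clashes)

# One option: collapse each (userId, movieId) pair to its mean rating
deduped = ratings_toy.groupby(["userId", "movieId"], as_index=False)["rating"].mean()
print(deduped)
```

If `clashes` is empty on the real data, the remap alone is enough and no averaging is needed.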

## Collaborative filtering

### Utility matrix
#### We need to create a user-movie matrix for collaborative filtering

In [23]:
import numpy as np 
from scipy.sparse import csr_matrix

n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()

user_mapper = dict(zip(np.unique(ratings.userId), range(n_users)))
movie_mapper = dict(zip(np.unique(ratings.movieId), range(n_movies)))

In [24]:
X = csr_matrix((ratings.rating,
                ([user_mapper[i] for i in ratings.userId], [movie_mapper[i] for i in ratings.movieId])),
               shape=(n_users,n_movies))

In [25]:
X.shape

(610, 9719)
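
To make the mapping step concrete, here is a minimal sketch of the same construction on a tiny made-up ratings table. The `_toy` names are hypothetical; the idea is only that raw IDs are mapped to contiguous row and column indices before the sparse matrix is built.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings with non-contiguous IDs (values are illustrative only)
ratings_toy = pd.DataFrame({
    "userId":  [10, 10, 20, 30],
    "movieId": [100, 300, 100, 200],
    "rating":  [4.0, 5.0, 3.0, 2.5],
})

# Sorted unique IDs -> contiguous indices 0..n-1
user_ids = np.unique(ratings_toy["userId"])
movie_ids = np.unique(ratings_toy["movieId"])
user_mapper_toy = {uid: i for i, uid in enumerate(user_ids)}
movie_mapper_toy = {mid: j for j, mid in enumerate(movie_ids)}

rows = ratings_toy["userId"].map(user_mapper_toy).to_numpy()
cols = ratings_toy["movieId"].map(movie_mapper_toy).to_numpy()

# Rows are users, columns are movies; unrated cells are simply not stored
X_toy = csr_matrix((ratings_toy["rating"].to_numpy(), (rows, cols)),
                   shape=(len(user_ids), len(movie_ids)))
dense = X_toy.toarray()
print(dense)
```

Calling `toarray()` is only sensible at this toy size; the full matrix is kept sparse in the notebook.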

#### Sparsity

In [26]:
1 - X.nnz / (X.shape[0] * X.shape[1])

0.9829922460483859

About 98.3% of the matrix is empty. As a rough rule of thumb, sparsity below 99.5% is generally considered workable for collaborative filtering.
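
The overall sparsity hides how unevenly ratings are spread. As a small sketch on a made-up matrix (`X_toy` is hypothetical), `getnnz` gives the number of stored ratings per movie and per user alongside the global figure:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy user x movie matrix; zeros mean "not rated" and are not stored
X_toy = csr_matrix(np.array([
    [4.0, 0.0, 5.0],
    [3.0, 0.0, 0.0],
    [0.0, 2.5, 0.0],
]))

# Fraction of empty cells, same formula as in the notebook
n_total = X_toy.shape[0] * X_toy.shape[1]
sparsity = 1 - X_toy.nnz / n_total

# Number of stored ratings per column (movie) and per row (user)
ratings_per_movie = X_toy.getnnz(axis=0)
ratings_per_user = X_toy.getnnz(axis=1)
print(sparsity, ratings_per_movie, ratings_per_user)
```

Movies with very few ratings tend to produce unreliable neighbours, so these counts can be a useful thing to inspect before fitting.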

### Making the Collaborative filtering model

In [27]:
user_inv_mapper = dict(zip(list(range(n_users)), np.unique(ratings["userId"])))
movie_inv_mapper = dict(zip(list(range(n_movies)), np.unique(ratings["movieId"])))
movie_titles = dict(zip(movies['movieId'], movies['title']))

In [28]:
from sklearn.neighbors import NearestNeighbors

kNN = NearestNeighbors(n_neighbors=6, algorithm="brute", metric='cosine')
X_T = X.T
kNN.fit(X_T)
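
To illustrate what `metric='cosine'` measures, here is a small NumPy sketch of cosine distance (1 minus cosine similarity) between movie rating vectors. The vectors and the helper `cosine_distance` are made up for illustration and are not part of the notebook.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero rating vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Each vector holds one movie's ratings across three users (0 = not rated)
movie_a = np.array([5.0, 0.0, 4.0])
movie_b = np.array([2.5, 0.0, 2.0])  # same direction as movie_a, half the scale
movie_c = np.array([0.0, 3.0, 0.0])  # shares no raters with movie_a

d_ab = cosine_distance(movie_a, movie_b)  # approximately 0: same rating pattern
d_ac = cosine_distance(movie_a, movie_c)  # 1: no overlap at all
print(d_ab, d_ac)
```

Cosine distance ignores vector length, so a movie rated consistently lower by the same users can still count as a close neighbour. Missing ratings enter the computation as zeros.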

In [29]:
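def find_similar_movies(movie_id, k):
    """Return the IDs of the movies most similar to movie_id.

    k counts the query movie itself, so k - 1 IDs are returned.
    """
    movie_index = movie_mapper[movie_id]
    # Indexing the sparse matrix already yields a 2-D (1, n_users) row vector
    movie_vector = X_T[movie_index]

    # Pass k explicitly; otherwise the n_neighbors=6 set on kNN is always used
    neighbour = kNN.kneighbors(movie_vector, n_neighbors=k, return_distance=False)
    neighbour_ids = [movie_inv_mapper[n] for n in neighbour[0]]

    # The query movie is normally its own nearest neighbour; drop it
    return [i for i in neighbour_ids if i != movie_id][:k - 1]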

In [30]:
movie_id = 1
similar_movies = find_similar_movies(movie_id, 6)
movie_title = movie_titles[movie_id]

print(f"Movies similar to {movie_title}:\n")
for i in similar_movies:
    print(movie_titles[i])

Movies similar to Toy Story (1995):

Toy Story 2 (1999)
Jurassic Park (1993)
Independence Day (a.k.a. ID4) (1996)
Star Wars: Episode IV - A New Hope (1977)
Forrest Gump (1994)
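
It can also help to look at the distances, not just the neighbour IDs, to judge how close each recommendation really is. The sketch below does this on a tiny made-up movie x user matrix; `item_user` and `knn_toy` are hypothetical names, and the numbers are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy matrix: rows are movies, columns are users (0 = not rated)
item_user = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],   # movie 0
    [4.0, 5.0, 0.0, 0.0],   # movie 1: rated by the same users as movie 0
    [0.0, 0.0, 3.0, 4.0],   # movie 2: no raters in common with movie 0
    [5.0, 0.0, 0.0, 1.0],   # movie 3: one rater in common with movie 0
]))

knn_toy = NearestNeighbors(algorithm="brute", metric="cosine").fit(item_user)

# return_distance defaults to True, so we get (distances, indices)
distances, indices = knn_toy.kneighbors(item_user[0], n_neighbors=3)

for dist, idx in zip(distances[0], indices[0]):
    print(f"movie {idx}: cosine distance {dist:.3f}")
```

The first neighbour is the query movie itself at distance close to 0, which is why the notebook's function drops it from the result.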
