# Hybrid Recommender System : Content and Colloborative
### Brief:
Some hybrids predict using content and collaborative filtering techniques separately to produce results. <br>
Some others introduce content-based techniques into collaborative filters and vice versa

### Objective: 
To display a list of recommendations on the side pane when watching a movie.<br>
Content-based system would be appropriate (for a 'more like this' feature), but all items with similar content may not be good ones. Ex: 'The Dark Knight' would be good but user may not be interested in 'Batman and Robin'. <br> 
Hence, here, we use collaborative filters to predict the items recommended by content-based systems.

### Workflow : 
1. Take in a movie title and user as input
2. Use a content-based model to compute the 25 most similar movies
3. Compute the predicted ratings that the user might give these 25 movies using a collaborative filter
4. Return the top 10 movies with the highest predicted rating

### Exploration and Transformations

In [1]:
import pandas as pd
import numpy as np
from surprise import SVD, Reader, Dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel # for dot product
from surprise import SVD, Reader, Dataset

In [2]:
# Data from: https://www.kaggle.com/rounakbanik/the-movies-dataset/downloads/ratings_small.csv/7
ratings = pd.read_csv('ratings_small.csv')

# reove timestamp
ratings = ratings.drop('timestamp', axis=1)

ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [3]:
# Data from: https://drive.google.com/file/d/14I-ANsvQ-U1rv8eoBL9LVnGlO8AipOIV/view?usp=sharing
id_map = pd.read_csv('movie_ids.csv')

id_map.head()

Unnamed: 0,title,movieId,id
0,Toy Story,1,862.0
1,Jumanji,2,8844.0
2,Grumpier Old Men,3,15602.0
3,Waiting to Exhale,4,31357.0
4,Father of the Bride Part II,5,11862.0


In [4]:
# Data from : https://drive.google.com/file/d/1_giDCXnyn0tL5WCHuUd08NTdgdHm1fU6/view?usp=sharing
metadata = pd.read_csv('movies_metadata.csv', dtype={'popularity':str})

metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [5]:
# since df['id'] is not clean
def clean_ids(x):
    try:
        return int(x)
    except:
        return np.nan

#Clean the ids of df
metadata['id'] = metadata['id'].apply(clean_ids)

#Filter all rows that have a null ID
metadata = metadata[metadata['id'].notnull()]

In [6]:
# convert id_map[id] to integer
id_map['id'] = id_map['id'].astype(int)

# merge id_map and metadata based on id
id_map_overview = id_map.merge(metadata, on='id')

id_map_overview.head()

Unnamed: 0,title_x,movieId,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,...,release_date,revenue,runtime,spoken_languages,status,tagline,title_y,video,vote_average,vote_count
0,Toy Story,1,862,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,tt0114709,en,...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,Jumanji,2,8844,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,tt0113497,en,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,Grumpier Old Men,3,15602,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,tt0113228,en,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,Waiting to Exhale,4,31357,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0114885,en,...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,Father of the Bride Part II,5,11862,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,tt0113041,en,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [7]:
id_map_overview = id_map_overview[['title_x', 'id', 'overview']]
id_map_overview.head()

Unnamed: 0,title_x,id,overview
0,Toy Story,862,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,8844,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,15602,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,31357,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,11862,Just when George Banks has recovered from his ...


In [8]:
#Define a TF-IDF Vectorizer Object. Remove all english stopwords
tfidf = TfidfVectorizer(stop_words='english', lowercase=True)

#Replace NaN with an empty string
id_map_overview['overview'] = id_map_overview['overview'].fillna('')

#Construct the required TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(id_map_overview['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(9355, 29727)

In [9]:
# Compute the cosine similarity matrix
# liner kernel (X,Y) = (X Transpose) * Y
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [10]:
cosine_sim = pd.DataFrame(cosine_sim)
cosine_sim = cosine_sim.set_index(id_map_overview['id'])
cosine_sim.columns = list(id_map_overview['id'])
cosine_sim.head()

Unnamed: 0_level_0,862,8844,15602,31357,11862,949,11860,45325,9091,710,...,373348,338766,390734,314420,390989,159550,392572,402672,315011,391698
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
862,1.0,0.018119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01629,0.0,0.0,0.0,0.0,0.011803
8844,0.018119,1.0,0.049213,0.0,0.0,0.050304,0.0,0.0,0.100976,0.0,...,0.005412,0.0,0.019912,0.0,0.021216,0.0,0.0,0.0,0.005903,0.011174
15602,0.0,0.049213,1.0,0.0,0.025893,0.0,0.0,0.006723,0.0,0.0,...,0.0,0.015003,0.0,0.0,0.0,0.0,0.0,0.0,0.00753,0.0
31357,0.0,0.0,0.0,1.0,0.0,0.007807,0.0,0.009349,0.0,0.0,...,0.0,0.0,0.010951,0.060062,0.0,0.012607,0.0,0.0,0.0,0.0
11862,0.0,0.0,0.025893,0.0,1.0,0.0,0.033066,0.0,0.033625,0.0,...,0.0,0.0,0.006832,0.0,0.0,0.019206,0.011132,0.0,0.0,0.0


In [11]:
id_map_overview = id_map_overview.set_index('title_x')
id_map_overview.head()

Unnamed: 0_level_0,id,overview
title_x,Unnamed: 1_level_1,Unnamed: 2_level_1
Toy Story,862,"Led by Woody, Andy's toys live happily in his ..."
Jumanji,8844,When siblings Judy and Peter discover an encha...
Grumpier Old Men,15602,A family wedding reignites the ancient feud be...
Waiting to Exhale,31357,"Cheated on, mistreated and stepped on, the wom..."
Father of the Bride Part II,11862,Just when George Banks has recovered from his ...


In [12]:
reverse_index = id_map.set_index('id')
reverse_index.head()

Unnamed: 0_level_0,title,movieId
id,Unnamed: 1_level_1,Unnamed: 2_level_1
862,Toy Story,1
8844,Jumanji,2
15602,Grumpier Old Men,3
31357,Waiting to Exhale,4
11862,Father of the Bride Part II,5


In [13]:
# Fitting SVD for collab filter
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
data.split(n_folds=5)
svd = SVD()
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f526744fcf8>

In [14]:
# the function
def hybrid(user_id, title):
    # index of title in the similarity matrix
    ind = id_map_overview.loc[title, 'id']
    
    # similarity vector of the movie with other movies
    sim_vector = np.array(cosine_sim.loc[ind])
    
    # keep track the column names
    tup_list = list(zip(list(id_map_overview['id']), sim_vector))
    
    # sort the vector keeping track of column names
    sorted_vector = sorted(tup_list, key=lambda x: x[1], reverse=True)
    
    # take the 25 most similar items
    content_movies = sorted_vector[1:26]
    
    # get the 25 indices
    content_ind = [x[0] for x in content_movies]
    
    # predict the ratings using svd
    pred_ratings = [svd.predict(user_id, ind)[3] for ind in content_ind]
    
    # zip them up
    pred_tup = list(zip(content_ind, pred_ratings))
    
    # sort the final movies list
    sorted_pred = sorted(pred_tup, key=lambda x: x[1], reverse=True)
    
    final_ind_list = [x[0] for x in sorted_pred][:10]
    
    # return the movie titles and ids
    return reverse_index.loc[final_ind_list]

In [15]:
# Recommendations for user 1
hybrid(1, 'Alien')

Unnamed: 0_level_0,title,movieId
id,Unnamed: 1_level_1,Unnamed: 2_level_1
7453,The Hitchhiker's Guide to the Galaxy,33004
200,Star Trek: Insurrection,2393
830,Forbidden Planet,1301
10127,Critters 2,4493
11260,Meet Dave,60516
11076,Fly Away Home,986
22777,The Million Dollar Duck,2031
19185,Night of the Living Dead,8225
10384,Supernova,3190
7249,Futurama: Bender's Big Score,56251


In [16]:
# Recommendations for user 2
hybrid(2, 'Alien')

Unnamed: 0_level_0,title,movieId
id,Unnamed: 1_level_1,Unnamed: 2_level_1
8410,The Wild Blue Yonder,44671
10127,Critters 2,4493
11260,Meet Dave,60516
11076,Fly Away Home,986
22777,The Million Dollar Duck,2031
19185,Night of the Living Dead,8225
10384,Supernova,3190
7249,Futurama: Bender's Big Score,56251
11127,Starship Troopers 3: Marauder,64508
286217,The Martian,134130


#### Remarks:
The recommendations are dominated by the content-based component as expected, but with a noticable difference in the recommendations for different users watching the same movie