## Content-based recommendation systems

This approach recommends items based on the similarity between items: if a user likes item A, the user may also like a similar item B. It requires the content or properties of each item in order to compute a similarity measure. Content-based recommendation usually applies when a user's previous actions are available and we have reliable metadata for the items.

We can use TF-IDF to extract features from movie titles, genres, and tags. The similarity between two movies can then be computed as the cosine similarity of their TF-IDF vectors.
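Before working with the real data, it may help to see the idea on a tiny example. The sketch below uses a made-up three-document corpus (the strings and the `toy_` names are assumptions for illustration, not part of the dataset) and the same two scikit-learn tools imported in the next cell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: each string stands in for one item's metadata.
toy_docs = [
    "space adventure robots",
    "space adventure aliens",
    "romantic comedy wedding",
]

# Each row of the matrix is the TF-IDF vector of one document.
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Pairwise cosine similarity; entry [i, j] compares document i with document j.
toy_sim = cosine_similarity(toy_matrix)

print(toy_sim.round(2))
```

The first two documents share two of their three tokens, so they should score higher with each other than with the third document, which shares no tokens with either and therefore scores zero.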

In [1]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

## load datasets

In [39]:
movies_df = pd.read_csv('./datasets/ml-latest-small/movies.csv')
movies_df.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings_df = pd.read_csv('./datasets/ml-latest-small/ratings.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
tags_df = pd.read_csv('./datasets/ml-latest-small/tags.csv')
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## process data

In [40]:
# remove the parentheses in movie titles
movies_df['title'] = movies_df.title.str.replace(r'[()]', '', regex=True)
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story 1995,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji 1995,Adventure|Children|Fantasy
2,3,Grumpier Old Men 1995,Comedy|Romance
3,4,Waiting to Exhale 1995,Comedy|Drama|Romance
4,5,Father of the Bride Part II 1995,Comedy


In [41]:
# replace the pipe separator in genres with a space
movies_df['genres'] = movies_df.genres.str.replace('|', ' ', regex=False)
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story 1995,Adventure Animation Children Comedy Fantasy
1,2,Jumanji 1995,Adventure Children Fantasy
2,3,Grumpier Old Men 1995,Comedy Romance
3,4,Waiting to Exhale 1995,Comedy Drama Romance
4,5,Father of the Bride Part II 1995,Comedy
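Both cleaning steps rely on `Series.str.replace`, whose `regex` argument decides whether the pattern is read as a regular expression or as a literal string. Newer pandas releases treat the pattern as a literal by default, so stating the flag explicitly is the safer choice. A minimal sketch on a made-up two-element Series (the `toy` data is an assumption for illustration):

```python
import pandas as pd

toy = pd.Series(["Toy Story (1995)", "Adventure|Children|Fantasy"])

# regex=True: the pattern is a regular expression; [()] matches either parenthesis.
no_parens = toy.str.replace(r"[()]", "", regex=True)

# regex=False: the pattern is a literal string, so "|" needs no escaping.
no_pipes = toy.str.replace("|", " ", regex=False)

print(no_parens[0])  # Toy Story 1995
print(no_pipes[1])   # Adventure Children Fantasy
```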


In [55]:
# merge all tags for each movie
movie_tags = tags_df[['movieId', 'tag']].groupby('movieId').agg(lambda x: ' '.join(set(x.tolist())))
movie_tags.head()

movieId,tag
1,fun pixar
2,game fantasy Robin Williams magic board game
3,old moldy
5,remake pregnancy
7,remake
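Note that `set` does not preserve insertion order, so the order of the merged tags can differ between runs. If a stable order matters, one possible alternative is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each tag. The sketch below runs on a hypothetical miniature of `tags_df`; `join_unique` and the `toy_` names are illustrative, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of tags_df.
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "tag": ["pixar", "fun", "pixar", "fantasy"],
})

def join_unique(tags):
    # dict.fromkeys keeps the first occurrence of each tag, in original order.
    return " ".join(dict.fromkeys(tags))

toy_movie_tags = toy_tags.groupby("movieId")["tag"].agg(join_unique)
print(toy_movie_tags)
```

For TF-IDF the token order does not change the resulting vectors, so this only affects reproducibility of the displayed text.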


In [56]:
# join movies_df and movie_tags
movies_df = movies_df.set_index('movieId').join(movie_tags)
movies_df.head()

movieId,title,genres,tag
1,Toy Story 1995,Adventure Animation Children Comedy Fantasy,fun pixar
2,Jumanji 1995,Adventure Children Fantasy,game fantasy Robin Williams magic board game
3,Grumpier Old Men 1995,Comedy Romance,old moldy
4,Waiting to Exhale 1995,Comedy Drama Romance,
5,Father of the Bride Part II 1995,Comedy,remake pregnancy


In [57]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 1 to 193609
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   9742 non-null   object
 1   genres  9742 non-null   object
 2   tag     1572 non-null   object
dtypes: object(3)
memory usage: 624.4+ KB


In [58]:
# there are some NaN values in tag, let's replace them with an empty string
movies_df['tag'] = movies_df['tag'].fillna('')
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 1 to 193609
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   9742 non-null   object
 1   genres  9742 non-null   object
 2   tag     9742 non-null   object
dtypes: object(3)
memory usage: 624.4+ KB
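The reason the missing tags matter becomes clear once the text columns are concatenated: adding a string to NaN yields NaN, and NaN is not a usable text document for a vectorizer. A small sketch on made-up frames (all `toy_` names and values are assumptions for illustration):

```python
import pandas as pd

toy_movies = pd.DataFrame(
    {"title": ["A 1995", "B 1996"], "genres": ["Comedy", "Drama"]},
    index=pd.Index([1, 2], name="movieId"),
)
toy_movie_tags = pd.DataFrame({"tag": ["fun"]}, index=pd.Index([1], name="movieId"))

# DataFrame.join is a left join by default: movie 2 has no tags, so its tag is NaN.
joined = toy_movies.join(toy_movie_tags)

# String concatenation with NaN yields NaN for that row.
broken = joined["title"] + " " + joined["tag"]

# After fillna(''), every row concatenates to a proper string.
fixed = joined["title"] + " " + joined["tag"].fillna("")

print(broken.isna().tolist())  # [False, True]
print(fixed.tolist())          # ['A 1995 fun', 'B 1996 ']
```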


In [70]:
movies_df['meta_info'] = movies_df['title'] + ' ' + movies_df['genres'] + ' ' + movies_df['tag']
movies_df.iloc[0, 3]

'Toy Story 1995 Adventure Animation Children Comedy Fantasy fun pixar'

In [71]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 1 to 193609
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      9742 non-null   object
 1   genres     9742 non-null   object
 2   tag        9742 non-null   object
 3   meta_info  9742 non-null   object
dtypes: object(4)
memory usage: 700.5+ KB


## Calculate TF-IDF and similarity

In [62]:
# tf-idf matrix
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['meta_info'])

In [63]:
type(tfidf_matrix)

scipy.sparse.csr.csr_matrix

In [64]:
tfidf_matrix.shape

(9742, 9949)

In [74]:
# cosine similarities
similarity = cosine_similarity(tfidf_matrix)
similarity_df = pd.DataFrame(similarity, 
                             index=movies_df.title.values, 
                             columns=movies_df.title.values)

In [75]:
similarity_df.head()

Unnamed: 0,Toy Story 1995,Jumanji 1995,Grumpier Old Men 1995,Waiting to Exhale 1995,Father of the Bride Part II 1995,Heat 1995,Sabrina 1995,Tom and Huck 1995,Sudden Death 1995,GoldenEye 1995,...,Gintama: The Movie 2010,anohana: The Flower We Saw That Day - The Movie 2013,Silver Spoon 2014,Love Live! The School Idol Movie 2015,Jon Stewart Has Left the Building 2015,Black Butler: Book of the Atlantic 2017,No Game No Life: Zero 2017,Flint 2017,Bungo Stray Dogs: Dead Apple 2018,Andrew Dice Clay: Dice Rules 1991
Toy Story 1995,1.0,0.157302,0.066106,0.100834,0.084246,0.116035,0.107809,0.186388,0.099625,0.149215,...,0.051843,0.044272,0.015815,0.049437,0.0,0.099439,0.123299,0.0,0.040626,0.008693
Jumanji 1995,0.157302,1.0,0.040092,0.061153,0.051093,0.082907,0.065384,0.133174,0.071182,0.106614,...,0.0,0.0,0.0,0.0,0.0,0.058027,0.330943,0.0,0.0,0.0
Grumpier Old Men 1995,0.066106,0.040092,1.0,0.10839,0.069006,0.095045,0.115888,0.074427,0.081603,0.085388,...,0.014828,0.0,0.012955,0.0,0.0,0.010193,0.012638,0.0,0.0,0.007121
Waiting to Exhale 1995,0.100834,0.061153,0.10839,1.0,0.105257,0.144975,0.176767,0.113525,0.124472,0.130245,...,0.022617,0.01267,0.036633,0.0,0.0,0.015547,0.019278,0.022173,0.0,0.010861
Father of the Bride Part II 1995,0.084246,0.051093,0.069006,0.105257,1.0,0.121125,0.377426,0.09485,0.103995,0.108819,...,0.018896,0.0,0.016509,0.0,0.0,0.012989,0.016106,0.0,0.0,0.009074
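The full similarity matrix is dense: for 9742 movies it holds 9742 × 9742 entries, roughly 760 MB if stored as 8-byte floats. When memory is a concern, one option is to compute a single row on demand instead. The sketch below checks on a made-up corpus that the on-demand row matches the corresponding row of the full matrix (the `toy_` names are assumptions for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_docs = [
    "toy story adventure animation",
    "jumanji adventure fantasy",
    "heat action crime thriller",
    "toy soldiers action drama",
]
toy_mat = TfidfVectorizer().fit_transform(toy_docs)

# Full matrix: n x n dense floats.
toy_full = cosine_similarity(toy_mat)

# One row on demand: similarity of item 0 against all items, shape (n,).
toy_row0 = cosine_similarity(toy_mat[0], toy_mat).ravel()

print(np.allclose(toy_full[0], toy_row0))  # True
```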


In [77]:
# movie list
movie_list = similarity_df.columns.values

# sample movie
sample_movie = 'Toy Story 1995'

# number of top recommendations
top_n = 10

# movie similarity records
movie_sim = similarity_df[similarity_df.index == sample_movie].values[0]

# sort by similarity
sorted_movie_ids = np.argsort(movie_sim)[::-1]

# recommendations
recommended_movies = movie_list[sorted_movie_ids[1:top_n+1]]

print(f'Top recommendations for {sample_movie} are: \n {recommended_movies}')

Top recommendations for Toy Story 1995 are: 
 ['Toy Story 3 2010' 'Toy Story 2 1999' "Bug's Life, A 1998"
 'Toy, The 1982' 'Fun 1994' "We're Back! A Dinosaur's Story 1993"
 'Now and Then 1995' 'Toy Soldiers 1991' 'NeverEnding Story, The 1984'
 'Wild, The 2006']
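Slicing from position 1 assumes that the query movie itself sorts first. That usually holds, but if another movie has identical metadata and therefore also scores 1.0, the tie could be broken either way. A slightly more defensive sketch removes the query by its index instead; `top_n_excluding_self` and the toy row are hypothetical names and data for illustration.

```python
import numpy as np

def top_n_excluding_self(sim_row, self_idx, top_n=10):
    """Return indices of the top_n most similar items, skipping self_idx."""
    order = np.argsort(sim_row)[::-1]
    # Drop the query item wherever it landed in the ordering.
    order = order[order != self_idx]
    return order[:top_n]

# Item 2 ties with the query item 0 at similarity 1.0.
toy_row = np.array([1.0, 0.2, 1.0, 0.7, 0.0])
print(top_n_excluding_self(toy_row, self_idx=0, top_n=2))  # [2 3]
```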


In [80]:
# create a wrapper function for recommendation

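def content_movie_recommender(input_movie,
                              similarity_database=similarity_df,
                              movie_database=movie_list,
                              top_n=10):
    """Return the top_n titles most similar to input_movie."""
    # similarity of input_movie to every movie
    movie_sim = similarity_database[similarity_database.index == input_movie].values[0]
    # indices sorted from most to least similar
    sorted_movie_ids = np.argsort(movie_sim)[::-1]
    # skip the first entry, which is expected to be input_movie itself
    recommended_movies = movie_database[sorted_movie_ids][1:top_n+1]
    return recommended_movies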

In [81]:
# test 
sample_movies = ['Heat 1995', 'Tom and Huck 1995']

for m in sample_movies:
    rm = content_movie_recommender(m)
    print(f"{m}'s recommendation: {rm}")
    print()

Heat 1995's recommendation: ['Heat, The 2013' 'Body Heat 1981' 'Red Heat 1988' 'City Heat 1984'
 'Dead Heat 1988' 'White Heat 1949' 'In the Heat of the Night 1967'
 'Assassins 1995' 'Bad Boys 1995' 'Hackers 1995']

Tom and Huck 1995's recommendation: ['Adventures of Huck Finn, The 1993' 'Now and Then 1995' 'Tom & Viv 1994'
 'Casper 1995' 'Tom Jones 1963' 'Amazing Panda Adventure, The 1995'
 'Two Much 1995' 'Tom Horn 1980' 'Balto 1995' 'Peeping Tom 1960']
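The recommender only works for titles that appear exactly as stored, e.g. `'Heat 1995'` with the year and without parentheses. For any other string the boolean filter selects no rows and `.values[0]` fails with an `IndexError`. One possible variant, sketched here on a made-up 3×3 similarity frame, raises a clearer error instead; `recommend_or_raise` and `toy_sim_df` are hypothetical names.

```python
import numpy as np
import pandas as pd

def recommend_or_raise(input_movie, similarity_database, top_n=10):
    """Return the top_n most similar titles, or raise KeyError for unknown titles."""
    matches = similarity_database.index == input_movie
    if not matches.any():
        raise KeyError(f"unknown title: {input_movie!r}")
    movie_sim = similarity_database[matches].values[0]
    order = np.argsort(movie_sim)[::-1]
    titles = similarity_database.columns.values
    # Skip the first entry, which is expected to be input_movie itself.
    return titles[order][1:top_n + 1]

toy_sim_df = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)
print(recommend_or_raise("A", toy_sim_df, top_n=2))  # ['B' 'C']
```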

