Movie Recommender System to Give Personalized Recommendations

Built a hybrid movie recommender system using the metadata for 45,000 movies and ratings from 270,000 users.

Built a content-based recommender from plot descriptions, using Doc2Vec (paragraph vectors) to find movies with similar plots.

Combined the content-based engine with a collaborative-filtering engine (SVD) into a hybrid recommender that gives personalized recommendations to each user.

The example below shows that the hybrid recommender gives different recommendations to different users, based on their previous ratings of other movies.

In [101]:
hybrid(1, 'The Godfather')

Unnamed: 0,title,vote_count,vote_average,year,id,est
700,Dead Man,397.0,7.2,1995,922,2.820908
29532,Mother's Day,126.0,6.3,2010,101669,2.70381
24296,Charlie Chan at the Opera,14.0,6.6,1936,28044,2.70381
39342,Les Tuche 2: Le rêve américain,239.0,5.7,2016,369776,2.70381
39623,Batman: The Killing Joke,485.0,6.2,2016,382322,2.70381
30725,Hallettsville,3.0,5.0,2009,9935,2.70381
38167,Moonshine County Express,0.0,0.0,1977,99846,2.70381
10352,Bookies,19.0,6.8,2003,14759,2.70381
37019,Eve's Christmas,3.0,4.5,2004,69016,2.70381


In [102]:
hybrid(50, 'The Godfather')

Unnamed: 0,title,vote_count,vote_average,year,id,est
29532,Mother's Day,126.0,6.3,2010,101669,3.310722
39342,Les Tuche 2: Le rêve américain,239.0,5.7,2016,369776,3.310722
24296,Charlie Chan at the Opera,14.0,6.6,1936,28044,3.310722
39623,Batman: The Killing Joke,485.0,6.2,2016,382322,3.310722
30725,Hallettsville,3.0,5.0,2009,9935,3.310722
38167,Moonshine County Express,0.0,0.0,1977,99846,3.310722
10352,Bookies,19.0,6.8,2003,14759,3.310722
44685,Hurdy-Gurdy Hare,2.0,6.5,1950,236112,3.310722
700,Dead Man,397.0,7.2,1995,922,3.204096


In [103]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD, evaluate
import gensim
from gensim.models.doc2vec import Doc2Vec
import warnings; warnings.simplefilter('ignore')

In [2]:
os.chdir(r'C:/movie project')

Simple Recommender

In [3]:
md = pd.read_csv('movies_metadata.csv')
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
m = vote_counts.quantile(0.95)
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')

def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
qualified['wr'] = qualified.apply(weighted_rating, axis=1)
qualified = qualified.sort_values('wr', ascending=False).head(250)
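
The weighted_rating function above shrinks each movie's average rating R toward the overall mean C, with the vote count v deciding how much weight R gets relative to the cutoff m. A minimal standalone sketch of that behavior follows. The values of m and C here are made up for illustration; the notebook derives its own from the data.

```python
def weighted_rating(v, R, m, C):
    """Shrink a movie's mean rating R toward the global mean C.

    v: number of votes for the movie
    R: the movie's average rating
    m: vote-count cutoff, which also acts as the shrinkage weight
    C: mean rating across all movies
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical values, chosen only for illustration.
m, C = 400, 5.0

# Same average rating, very different vote counts.
few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=40000, R=9.0, m=m, C=C)
print(round(few_votes, 3), round(many_votes, 3))
```

With few votes the score stays close to C; with many votes it approaches R. A movie with exactly m votes lands halfway between the two.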

In [4]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)

def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified
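
The genre handling in the two cells above (parse the stringified genres column, keep the names, then produce one row per movie and genre) can be sketched in plain Python. The rows below are toy data, and the 'id' key inside each genre entry is an assumption; the notebook only relies on the 'name' key.

```python
from ast import literal_eval

# Toy rows mimicking the stringified `genres` column (hypothetical data).
rows = [
    {"title": "Movie A",
     "genres": "[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name': 'Crime'}]"},
    {"title": "Movie B", "genres": "[]"},
    {"title": "Movie C", "genres": "[{'id': 35, 'name': 'Comedy'}]"},
]


def genre_names(raw):
    """Parse the stringified list of dicts and keep only the genre names."""
    parsed = literal_eval(raw)
    return [g["name"] for g in parsed] if isinstance(parsed, list) else []


# One (title, genre) pair per genre. Movies without genres yield no pair here,
# whereas the notebook's join keeps them with a missing genre.
exploded = [(r["title"], g) for r in rows for g in genre_names(r["genres"])]
print(exploded)
```

Filtering these pairs by genre is what build_chart does before applying the weighted rating within that genre.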

In [5]:
qualified.head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.1081,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.2135,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.8696,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.95,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.6454,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.3072,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.4235,"[Adventure, Fantasy, Action]",7.851924


Content Based Recommender

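Drop rows 19730, 29503 and 35587 from the metadata loaded above, then cast id to int. These rows presumably hold malformed id values that would break the cast.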

In [6]:
md = md.drop([19730, 29503, 35587])
md['id'] = md['id'].astype('int')

In [46]:
smd = md
smd['description'] = smd['overview'].fillna('')
smd['description'] = smd['description'].astype('str').apply(lambda x: x.replace(",", "").replace(".", "").lower())
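
Doc2Vec generally works best when the text passed to infer_vector is tokenized the same way as the training text. One way to keep the two consistent is to route both through a single helper. The sketch below is a hypothetical helper, not part of the notebook, and the sample sentence is made up. It uses split() rather than split(' '), so it drops empty tokens, which differs slightly from the training cell.

```python
def normalize(text):
    """Lowercase and drop commas and periods, mirroring the cell above."""
    return text.replace(",", "").replace(".", "").lower()


def tokenize(text):
    """Whitespace tokenization after normalization; empty tokens are dropped."""
    return normalize(text).split()


sample = "Framed in the 1940s, a banker is sentenced to life."
tokens = tokenize(sample)
print(tokens)
```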

In [48]:
def getText():
    discuss_train=list(smd['description'])
    return discuss_train
 
text=getText()

TaggedDocument = gensim.models.doc2vec.TaggedDocument

def X_train(cut_sentence):
    # Wrap each description in a TaggedDocument whose tag is its row position
    x_train = []
    for i, text in enumerate(cut_sentence):
        word_list = text.split(' ')
        word_list[-1] = word_list[-1].strip()  # trim trailing whitespace on the last token
        document = TaggedDocument(word_list, tags=[i])
        x_train.append(document)
    return x_train

c=X_train(text)

def train(x_train):
    model=Doc2Vec(x_train, min_count=1, size=200)
    return model

model_dm=train(c)

md = md.reset_index()
titles = md['title']
indices = pd.Series(md.index, index=md['title'])


In [56]:
def get_recommendations(title):
    idx = indices[title]
    strl= md['overview'].iloc[idx]
    test_text=strl.split(' ')
    inferred_vector=model_dm.infer_vector(doc_words=test_text,alpha=0.025, min_alpha = 0.001, steps=10000)
    sims = model_dm.docvecs.most_similar([inferred_vector],topn=11)
    movie_indices = [i[0] for i in sims[1:10]]
    return titles.iloc[movie_indices]
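
Conceptually, most_similar ranks the stored document vectors by cosine similarity to the query vector. The sketch below reproduces that ranking in plain Python on toy 3-dimensional vectors. The vectors and tags are invented for illustration, and this is not gensim's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, doc_vectors, topn=3):
    """Return the topn (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy document vectors keyed by tag (hypothetical data).
doc_vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [-1.0, 0.0, 0.0],
}
query = [1.0, 0.05, 0.0]
ranked = most_similar(query, doc_vectors, topn=3)
print([tag for tag, _ in ranked])
```

Because the query vector is inferred rather than looked up, the top hit is not guaranteed to be the query movie itself, so skipping the first result may occasionally discard a genuine recommendation.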

Check the recommendations for The Shawshank Redemption:

In [57]:
get_recommendations('The Shawshank Redemption')

24186                           The Uncertainty Principle
5443                                             Quitting
20581                                    The Great Gatsby
42936                                  Unexpected Journey
32789                                           12 Chairs
25888    Johan Falk: GSI - Gruppen för särskilda insatser
1216                                         Evil Dead II
22654                     Better Living Through Chemistry
11773                                  The Devil Commands
Name: title, dtype: object

Popularity and Ratings

In [71]:
smd = md
smd['description'] = smd['overview'].fillna('')
def improved_recommendations(title):
    idx = indices[title]
    strl= smd['description'].iloc[idx]
    test_text=strl.split(' ')
    inferred_vector=model_dm.infer_vector(doc_words=test_text,alpha=0.025, min_alpha = 0.001, steps=10000)
    sims = model_dm.docvecs.most_similar([inferred_vector],topn=50)
    movie_indices = [i[0] for i in sims[1:50]]
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [73]:
improved_recommendations('The Shawshank Redemption')

Unnamed: 0,title,vote_count,vote_average,year,wr
43641,Baby Driver,2083,7,2017,6.697372
23169,The Raid 2,832,7,2014,6.398329
10332,Transporter 2,1076,6,2005,5.78297
16051,Undisputed III : Redemption,182,7,2010,5.76345
10369,Domino,450,6,2005,5.629282
10919,Magic,59,7,1978,5.454939
9086,Pusher,162,6,1996,5.450143
30409,Curfew,33,7,2012,5.368919
6424,Avanti!,49,6,1972,5.321501
18839,Madhouse,21,6,1974,5.279748


Collaborative Filtering

In [74]:
reader = Reader()

In [75]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [76]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
data.split(n_folds=5)

In [77]:
svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8907
MAE:  0.6876
------------
Fold 2
RMSE: 0.8920
MAE:  0.6853
------------
Fold 3
RMSE: 0.9027
MAE:  0.6968
------------
Fold 4
RMSE: 0.9049
MAE:  0.6984
------------
Fold 5
RMSE: 0.8990
MAE:  0.6905
------------
------------
Mean RMSE: 0.8979
Mean MAE : 0.6917
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'mae': [0.6876289845916422,
                             0.6853187852842678,
                             0.6968463315260166,
                             0.6983666668898275,
                             0.6905333288418136],
                            'rmse': [0.8907329120066598,
                             0.8919991071163405,
                             0.9027361586391015,
                             0.9048594539655962,
                             0.8990288481051337]})
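
For reference, RMSE and MAE can be computed by hand as follows. The actual and predicted ratings are toy values; the per-fold RMSE figures are the rounded ones printed above, and averaging them reproduces the reported mean.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy ratings (hypothetical), only to show the arithmetic.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted), mae(actual, predicted))

# Per-fold RMSE values as printed in the evaluation output above.
fold_rmse = [0.8907, 0.8920, 0.9027, 0.9049, 0.8990]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(round(mean_rmse, 4))
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two in every fold.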

In [78]:
trainset = data.build_full_trainset()
svd.train(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a286175a58>

In [79]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [80]:
svd.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.7111648412283342, details={'was_impossible': False})
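
As documented for Surprise's SVD, the estimate is the global mean plus a user bias, an item bias and the dot product of the user and item factors, clipped to the rating scale. For an item that is not in the training set, the item bias and factors are treated as zero, so the estimate reduces to the global mean plus the user bias. The sketch below illustrates this with made-up numbers; it is not the library's code, and the parameter names are assumptions.

```python
def predict(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    est = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    return min(high, max(low, est))


# Hypothetical learned parameters for one user.
mu = 3.5            # global mean rating
b_u = -0.8          # this user rates below average
p_u = [0.1, -0.2]   # user factors

# An item seen in training has its own bias and factors.
known = predict(mu, b_u, b_i=0.3, p_u=p_u, q_i=[0.5, 0.4])

# Items unseen in training: bias and factors are zero.
unseen_a = predict(mu, b_u, b_i=0.0, p_u=p_u, q_i=[0.0, 0.0])
unseen_b = predict(mu, b_u, b_i=0.0, p_u=p_u, q_i=[0.0, 0.0])
print(known, unseen_a, unseen_b)
```

This may be why many rows in the hybrid output share an identical est for a given user: any movie absent from ratings_small.csv would receive the same fallback value.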

Hybrid Recommender

In [82]:
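def convert_int(x):
    # Return NaN for values that cannot be converted to int (e.g. a missing tmdbId)
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        return np.nan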

In [83]:
# Map MovieLens movieId to TMDB id, then attach titles from the metadata
id_map = pd.read_csv('links.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')

In [84]:
indices_map = id_map.set_index('id')

In [96]:
def hybrid(userId, title):
    idx = indices[title]
    tmdbId = id_map.loc[title]['id']
    movie_id = id_map.loc[title]['movieId']
    strl= smd['description'].iloc[idx]
    test_text=strl.split(' ')
    inferred_vector=model_dm.infer_vector(doc_words=test_text,alpha=0.025, min_alpha = 0.001, steps=10000)
    sims = model_dm.docvecs.most_similar([inferred_vector])
    movie_indices = [i[0] for i in sims[1:50]]

    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)
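
Stripped of the pandas and model details, hybrid takes the content-based candidates and re-ranks them by the rating the collaborative filter predicts for the given user. A minimal sketch of that re-ranking step follows, with a lookup table standing in for svd.predict. The function names and all values are hypothetical.

```python
def hybrid_rank(user_id, candidates, predict_rating, topn=10):
    """Re-rank content-based candidates by a user's predicted rating.

    candidates: movie ids produced by the content-based step
    predict_rating: callable (user_id, movie_id) -> estimated rating
    """
    scored = [(movie_id, predict_rating(user_id, movie_id)) for movie_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical predicted ratings per (user, movie) pair.
toy_predictions = {
    (1, "A"): 2.8, (1, "B"): 2.7, (1, "C"): 3.1,
    (50, "A"): 3.2, (50, "B"): 3.9, (50, "C"): 3.3,
}


def toy_predict(user_id, movie_id):
    """Stand-in for the collaborative-filtering model."""
    return toy_predictions[(user_id, movie_id)]


candidates = ["A", "B", "C"]  # same content-based candidates for both users
for_user_1 = [m for m, _ in hybrid_rank(1, candidates, toy_predict)]
for_user_50 = [m for m, _ in hybrid_rank(50, candidates, toy_predict)]
print(for_user_1, for_user_50)
```

The candidate set depends only on the movie title, so the two users see the same movies in a different order.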

In [101]:
hybrid(1, 'The Godfather')

Unnamed: 0,title,vote_count,vote_average,year,id,est
700,Dead Man,397.0,7.2,1995,922,2.820908
29532,Mother's Day,126.0,6.3,2010,101669,2.70381
24296,Charlie Chan at the Opera,14.0,6.6,1936,28044,2.70381
39342,Les Tuche 2: Le rêve américain,239.0,5.7,2016,369776,2.70381
39623,Batman: The Killing Joke,485.0,6.2,2016,382322,2.70381
30725,Hallettsville,3.0,5.0,2009,9935,2.70381
38167,Moonshine County Express,0.0,0.0,1977,99846,2.70381
10352,Bookies,19.0,6.8,2003,14759,2.70381
37019,Eve's Christmas,3.0,4.5,2004,69016,2.70381


In [102]:
hybrid(50, 'The Godfather')

Unnamed: 0,title,vote_count,vote_average,year,id,est
29532,Mother's Day,126.0,6.3,2010,101669,3.310722
39342,Les Tuche 2: Le rêve américain,239.0,5.7,2016,369776,3.310722
24296,Charlie Chan at the Opera,14.0,6.6,1936,28044,3.310722
39623,Batman: The Killing Joke,485.0,6.2,2016,382322,3.310722
30725,Hallettsville,3.0,5.0,2009,9935,3.310722
38167,Moonshine County Express,0.0,0.0,1977,99846,3.310722
10352,Bookies,19.0,6.8,2003,14759,3.310722
44685,Hurdy-Gurdy Hare,2.0,6.5,1950,236112,3.310722
700,Dead Man,397.0,7.2,1995,922,3.204096


Thanks to Rounak Banik on Kaggle (https://www.kaggle.com/rounakbanik/movie-recommender-systems) for all of the code except the Doc2Vec part.