# Movies Recommender System

In [2]:
pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 14.4 MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1619419 sha256=da62977ac040ca348c78fe7b3603ad09c2d6a3fd2682c39d10730852ebdf887d
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [4]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD 
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

## Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea is that movies that are more popular and more critically acclaimed are more likely to be liked by the average audience. This model does not give personalized recommendations.

The implementation is straightforward. All we have to do is sort our movies by rating and popularity and display the top of the list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

In [63]:
df= pd.read_csv('movies_metadata_womandirectorfilter.csv')
df.head()
print(df.shape)

(45466, 26)


In [65]:
f = df[df['Female Director']== 1]
md = f.copy()
print(md.shape)

(1886, 26)


In [66]:
md.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1886 entries, 17 to 45408
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  1886 non-null   object 
 1   belongs_to_collection  65 non-null     object 
 2   budget                 1886 non-null   object 
 3   genres                 1886 non-null   object 
 4   homepage               406 non-null    object 
 5   id                     1886 non-null   object 
 6   imdb_id                1885 non-null   object 
 7   original_language      1884 non-null   object 
 8   original_title         1886 non-null   object 
 9   overview               1848 non-null   object 
 10  popularity             1886 non-null   object 
 11  poster_path            1866 non-null   object 
 12  production_companies   1886 non-null   object 
 13  production_countries   1886 non-null   object 
 14  release_date           1883 non-null   object 
 15  re

In [67]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I use the TMDB ratings to build the **Top Movies Chart**, scoring each movie with IMDB's *weighted rating* formula. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole dataset
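As a quick check of the formula, the arithmetic can be worked through for a single movie. The sketch below plugs in values that appear in this notebook's own outputs further down (*m* = 109, *C* ≈ 5.18, and the City of God row with *v* = 1852 and *R* = 8). The function name `weighted_rating_value` and the low-vote example are only illustrative.

```python
def weighted_rating_value(v, R, m, C):
    """Blend a movie's own rating R with the global mean C, weighted by its vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Values copied from the notebook outputs (95th-percentile cutoff and mean vote).
m = 109.0
C = 5.179745493107105

# City of God: many votes, so the score stays close to its own rating of 8.
wr_many_votes = weighted_rating_value(v=1852, R=8, m=m, C=C)
print(round(wr_many_votes, 6))  # 7.843239, the wr value shown for City of God

# A hypothetical movie with the same rating but only 10 votes is pulled toward C.
wr_few_votes = weighted_rating_value(v=10, R=8, m=m, C=C)
print(wr_few_votes < wr_many_votes)  # True
```

The second call shows why the formula is used instead of the raw average: a high rating backed by few votes counts for much less.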

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use **95th percentile** as our cutoff. In other words, for a movie to feature in the charts, it must have more votes than at least 95% of the movies in the list.
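To make the cutoff concrete, here is a minimal sketch on made-up vote counts (the numbers and the `toy_` names are assumptions for illustration, not data from this notebook). It computes the 95th percentile with `Series.quantile` and keeps only the entries at or above it.

```python
import pandas as pd

# Hypothetical vote counts: four obscure movies and one widely rated movie.
toy_votes = pd.Series([10, 20, 30, 40, 1000])

# pandas interpolates linearly between the two nearest values by default.
toy_m = toy_votes.quantile(0.95)

# Only movies with at least toy_m votes make it onto the chart.
toy_qualified = toy_votes[toy_votes >= toy_m]
print(toy_m)
print(toy_qualified.tolist())
```

With a distribution this skewed, the cutoff lands well above most movies, which is the intended effect: only titles with a substantial number of votes are ranked.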

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [68]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.179745493107105

In [69]:
m = vote_counts.quantile(0.95)
m

109.0

In [70]:
# `x != np.nan` is always True, so missing dates are detected with pd.notnull instead
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [71]:
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(96, 6)

In [72]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [73]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [74]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

### Top Movies

In [75]:
qualified.head(15)

| | title | year | vote_count | vote_average | popularity | genres | wr |
|---|---|---|---|---|---|---|---|
| 5878 | City of God | 2002 | 1852 | 8 | 14.9593 | [Drama, Crime] | 7.843239 |
| 31783 | Mustang | 2015 | 378 | 8 | 6.49255 | [Drama] | 7.368773 |
| 33356 | Wonder Woman | 2017 | 5025 | 7 | 294.337037 | [Action, Adventure, Fantasy] | 6.961354 |
| 4178 | Shrek | 2001 | 4183 | 7 | 17.9877 | [Adventure, Animation, Comedy, Family, Fantasy] | 6.953773 |
| 36223 | Me Before You | 2016 | 2674 | 7 | 34.34759 | [Drama, Romance] | 6.928707 |
| 6562 | Lost in Translation | 2003 | 1943 | 7 | 11.6094 | [Drama] | 6.90331 |
| 32117 | The Intern | 2015 | 1926 | 7 | 15.6517 | [Comedy] | 6.902502 |
| 17454 | One Day | 2011 | 1006 | 7 | 11.239 | [Drama, Romance] | 6.822056 |
| 39773 | The Edge of Seventeen | 2016 | 952 | 7 | 9.51953 | [Comedy, Drama] | 6.812999 |
| 10386 | Green Street Hooligans | 2005 | 652 | 7 | 7.77262 | [Crime, Drama] | 6.73928 |


In [76]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
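The cell above gives every movie one row per genre so that a chart can be filtered by a single genre. As a hedged alternative, `DataFrame.explode` (available in pandas 0.25 and later) produces the same kind of long table in one call. The sketch below runs on a two-row toy frame; the `toy` names and rows are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for md: each movie carries a list of genre names.
toy = pd.DataFrame({
    "title": ["Mustang", "One Day"],
    "genres": [["Drama"], ["Drama", "Romance"]],
})

# explode() repeats the other columns once per list element and keeps the original index.
toy_gen = toy.explode("genres").rename(columns={"genres": "genre"})

# Filtering on a single genre now works like the gen_md lookup in build_chart.
toy_romance = toy_gen[toy_gen["genre"] == "Romance"]
print(toy_gen)
print(toy_romance["title"].tolist())
```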

In [77]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the top 15 Romance movies (Romance barely featured in our generic top chart despite being one of the most popular movie genres).

### Top Romance Movies

In [78]:
build_chart('Romance').head(15)

| | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 36223 | Me Before You | 2016 | 2674 | 7 | 34.34759 | 6.963042 |
| 17454 | One Day | 2011 | 1006 | 7 | 11.239 | 6.905048 |
| 3844 | Pay It Forward | 2000 | 447 | 7 | 10.4797 | 6.799735 |
| 11942 | Across the Universe | 2007 | 419 | 7 | 9.16158 | 6.787955 |
| 33869 | Miss You Already | 2015 | 232 | 7 | 7.634145 | 6.650749 |
| 257 | Little Women | 1994 | 222 | 7 | 9.77499 | 6.638231 |
| 36038 | My King | 2015 | 207 | 7 | 5.88082 | 6.617676 |
| 13798 | The Proposal | 2009 | 1858 | 6 | 8.05834 | 5.977058 |
| 11411 | The Holiday | 2006 | 1259 | 6 | 14.0434 | 5.966616 |
| 6080 | Bend It Like Beckham | 2002 | 593 | 6 | 6.26268 | 5.93241 |


## Content Based Recommender



In [79]:
links_small = f.copy()
#links_small = links_small[links_small['imdb_id'].notnull()]['imdb_id'].astype('int')

In [34]:
#md = md.drop([19730, 29503, 35587])

In [33]:
#Check EDA Notebook for how and why I got these indices.
#md['id'] = md['id'].astype('int')

In [81]:
smd = md.copy()
smd.shape

(1886, 27)

### Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.
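One detail to watch when combining two text columns: adding a missing value to a string yields a missing value, and joining without a separator fuses the last word of one field with the first word of the next. The sketch below shows both effects on a toy frame and one possible way around them. The toy rows are made up, and this is a variant for comparison, not the code the notebook runs.

```python
import pandas as pd

# Toy stand-in for smd: the second movie has a tagline but no overview.
toy = pd.DataFrame({
    "overview": ["A quiet drama.", None],
    "tagline": ["Love hurts", "Only tagline"],
})

# Plain concatenation: no space between the fields, and a missing overview wipes out the tagline.
naive = toy["overview"] + toy["tagline"].fillna("")

# Filling both columns first and joining with a space keeps all of the text.
description = (toy["overview"].fillna("") + " " + toy["tagline"].fillna("")).str.strip()

print(naive.isna().tolist())
print(description.tolist())
```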

In [82]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')

In [83]:
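# min_df=1 keeps every term, like the original min_df=0, which newer scikit-learn releases may reject as an invalid integer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')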
tfidf_matrix = tf.fit_transform(smd['description'])

In [84]:
tfidf_matrix.shape

(1886, 68426)

#### Cosine Similarity

I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Since the TF-IDF Vectorizer L2-normalizes every row by default, each vector has unit length and the dot product of two rows is already their cosine similarity. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity**, since it skips the redundant normalization and is faster.
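The equivalence between the dot product and cosine similarity can be checked on a tiny corpus. The sketch below relies on `TfidfVectorizer`'s default `norm='l2'` setting; the three sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus standing in for the movie descriptions.
docs = [
    "a young woman moves to tokyo",
    "two strangers meet in tokyo",
    "a heist goes wrong at christmas",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# With the default norm='l2', every row of the matrix has unit length.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # normalizes the rows again, then takes dot products

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and `cosine_similarity` would be the correct call.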

In [85]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [86]:
cosine_sim[0]

array([1.        , 0.00295543, 0.        , ..., 0.        , 0.00516445,
       0.00677522])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.
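The lookup amounts to taking one row of the similarity matrix, sorting it in descending order, and skipping the query movie itself. Here is a minimal NumPy sketch of that idea on a hand-written 4×4 matrix. The helper name `top_n_similar` and the toy scores are assumptions. It excludes the query by its index instead of assuming it always sorts first, which can matter when two movies have identical descriptions.

```python
import numpy as np


def top_n_similar(sim_matrix, idx, n=3):
    """Return (index, score) pairs for the n rows most similar to row idx, excluding idx."""
    scores = np.asarray(sim_matrix[idx], dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    order = order[order != idx]                 # drop the query itself
    return [(int(i), float(scores[i])) for i in order[:n]]


# Hypothetical symmetric similarity matrix for four movies.
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

result = top_n_similar(toy_sim, 0, n=2)
print(result)  # [(2, 0.7), (1, 0.2)]
```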

In [87]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [88]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [89]:
get_recommendations('The Holiday').head(10)

1705          A Holiday Heist
959                      Stay
1309           The Ocean Waif
698     My Sucky Teen Romance
1010                      Eat
569               The Snowman
273       Living 'til the End
946       Crazy for Christmas
1383         Pete's Christmas
89                 Love & Sex
Name: title, dtype: object

In [90]:
get_recommendations('Lost in Translation').head(10)

28                                     Private Parts
151                                         Besotted
917                                Stockholm Stories
805     Concepción Arenal, la visitadora de cárceles
633                              All About Actresses
1175                                   Freaky Friday
356                                 Cadillac Records
111                                The Man Who Cried
656                                Middle of Nowhere
541                               An Unlikely Weapon
Name: title, dtype: object

In [91]:
print(md.iloc[28]['genres'])
print(md.iloc[1175]['genres'])
print(md.iloc[917]['genres'])
print(md.iloc[656]['genres'])

['Comedy', 'Drama']
['TV Movie', 'Comedy', 'Family', 'Fantasy']
['Drama']
['Drama']
