In [6]:
import os
import numpy as np
import pandas as pd

In [7]:
os.getcwd()

'C:\\Users\\srish\\01.Projects\\Recommendation-Systems'

In [9]:
metadata = pd.read_csv('D:\\EverythingDS\\DataSets\\the-movies-dataset\\movies_metadata.csv', low_memory=False)

metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


##### Simple Recommender

•There are some caveats when choosing a metric. While it may make sense to rank movies by their overall rating, popularity is equally important.

•We therefore need a weighted rating that takes into account both a movie's average rating and the number of votes it has received.

•Such a score ensures that a movie with an 8.5 rating from 100,000 voters ranks above a movie with a higher rating but far fewer voters.

Weighted rating formula

WR = [(v/(v+m)) * R] + [(m/(v+m)) * C]

v = number of votes for the movie

m = minimum number of votes required to be considered for the top list

R = average rating of the movie

C = mean rating of all movies
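
To make the effect of the formula concrete, here is a minimal sketch that scores two hypothetical movies. The values of `m_assumed` and `C_assumed`, and both movies, are made up for illustration and are not taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating WR = v/(v+m) * R + m/(v+m) * C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, chosen only for illustration.
m_assumed = 160.0   # assumed vote threshold
C_assumed = 5.6     # assumed mean rating across all movies

# A widely voted movie keeps a score close to its own rating ...
popular = weighted_rating(v=100_000, R=8.5, m=m_assumed, C=C_assumed)
# ... while a movie with few votes is pulled towards the global mean C.
obscure = weighted_rating(v=50, R=9.0, m=m_assumed, C=C_assumed)

print(f"popular: {popular:.3f}, obscure: {obscure:.3f}")
```

With few votes the `m/(v+m)` term dominates, so the score stays near `C` no matter how high the raw rating is.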

In [11]:
C = metadata['vote_average'].mean()
print(C)

5.618207215133889


In [12]:
m = metadata['vote_count'].quantile(0.90)
print(m)

160.0


In [13]:
# Filter out all qualified movies into a new DataFrame

In [14]:
q_movies = metadata.loc[metadata['vote_count'] >= m].copy()
q_movies.shape

(4555, 24)

In [8]:
def weighted_rating(x, m=m, C=C):
    # x is one row of the DataFrame
    v = x['vote_count']
    R = x['vote_average']
    # Weighted rating: WR = v/(v+m) * R + m/(v+m) * C
    return (v/(v+m) * R) + (m/(v+m) * C)

In [9]:
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
q_movies['score'].head()

0    7.640253
1    6.820293
4    5.660700
5    7.537201
8    5.556626
Name: score, dtype: float64
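
The row-wise `apply` above can also be written as column arithmetic, which pandas evaluates for all rows at once. The sketch below uses a made-up three-row frame (`toy`, `m_toy` and `C_toy` are illustrative names and values) and checks that both approaches agree.

```python
import pandas as pd

# Toy stand-in for q_movies; the numbers are made up for illustration.
toy = pd.DataFrame({
    "vote_count": [200.0, 1000.0, 5000.0],
    "vote_average": [9.0, 7.0, 8.0],
})
m_toy = 160.0  # assumed threshold
C_toy = 5.6    # assumed mean rating

# Column-wise arithmetic computes every row's score in one pass.
v = toy["vote_count"]
R = toy["vote_average"]
toy["score"] = (v / (v + m_toy)) * R + (m_toy / (v + m_toy)) * C_toy

# The same formula applied row by row, for comparison.
row_wise = toy.apply(
    lambda x: (x["vote_count"] / (x["vote_count"] + m_toy)) * x["vote_average"]
    + (m_toy / (x["vote_count"] + m_toy)) * C_toy,
    axis=1,
)

print(toy)
```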

In [10]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

q_movies[['title', 'vote_count', 'vote_average', 'score']].head(15)

                             title  vote_count  vote_average     score
314       The Shawshank Redemption      8358.0           8.5  8.445869
834                  The Godfather      6024.0           8.5  8.425439
10309  Dilwale Dulhania Le Jayenge       661.0           9.1  8.421453
12481              The Dark Knight     12269.0           8.3  8.265477
2843                    Fight Club      9678.0           8.3  8.256385
292                   Pulp Fiction      8670.0           8.3  8.251406
522               Schindler's List      4436.0           8.3  8.206639
23673                     Whiplash      4376.0           8.3  8.205404
5481                 Spirited Away      3968.0           8.3  8.196055
2211             Life Is Beautiful      3643.0           8.3  8.187171


##### Content Based Recommender

The idea here is to build a system that recommends movies that are similar to a particular movie.

More specifically, we compute pairwise similarity scores for all movies based on their plot descriptions and recommend the movies with the highest scores.

In [11]:
#Print plot overviews of the first 5 movies
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In this raw form we cannot compute similarities between overviews; we first need to convert each overview (document) into a word vector.

Here we compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document.

This gives us a matrix in which each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie.

The TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which that word occurs.

This reduces the importance of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.
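
The down-weighting can be seen on a very small corpus. The three overviews below are made up for illustration; the sketch only relies on the `vocabulary_` and `idf_` attributes of a fitted `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up overviews, for illustration only.
docs = [
    "a detective hunts a killer in the city",
    "a detective falls in love in the city",
    "toys come alive when the boy leaves",
]

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(docs)   # sparse matrix: one row per document

vocab = vec.vocabulary_            # maps each word to its column index
idf = vec.idf_                     # inverse document frequency per column

# "detective" appears in two documents and "toys" in only one,
# so "toys" receives the larger IDF weight.
print(matrix.shape)
print(idf[vocab["detective"]], idf[vocab["toys"]])
```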

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')

metadata['overview'] = metadata['overview'].fillna('')

tfidf_matrix = tfidf.fit_transform(metadata['overview'])

tfidf_matrix.shape

(45466, 75827)

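After English stop words are removed, 75,827 distinct words are used to describe the 45,466 movies.

To calculate similarity we use cosine similarity, since it is independent of vector magnitude and is relatively easy and fast to calculate. Because TfidfVectorizer L2-normalizes each row by default, the dot product of two rows is already their cosine similarity, which is why linear_kernel is used below.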
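
The magnitude independence of cosine similarity can be checked with a few hand-made vectors. This is a small NumPy sketch; the vectors `a`, `b`, `c` and the helper `cosine` are illustrative and not part of the notebook.

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# After scaling to unit length, a plain dot product equals the cosine.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print(cosine(a, b), cosine(a, c), float(np.dot(a_unit, b_unit)))
```

Vectors that point the same way score 1 regardless of length, and orthogonal vectors score 0.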

In [13]:
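#Increase the system paging (swap) size if needed to accommodate the calculation: the result is a dense 45466 x 45466 float64 matrix (roughly 16.5 GB)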

In [14]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix via dot product
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
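
`linear_kernel` returns a dense N x N array by default, which is large for tens of thousands of movies. One alternative, sketched here on a made-up four-document corpus, is to compute only the row for the queried movie when it is needed. The names `docs`, `toy_matrix` and `idx` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up overviews, for illustration only.
docs = [
    "batman fights crime in gotham",
    "batman returns to gotham",
    "a family wedding goes wrong",
    "toys come alive",
]
toy_matrix = TfidfVectorizer().fit_transform(docs)

# Full pairwise matrix: N x N dense array.
full = linear_kernel(toy_matrix, toy_matrix)

# Single row: similarities of document idx to every document, length N.
idx = 0
row = linear_kernel(toy_matrix[idx:idx + 1], toy_matrix).ravel()

print(np.allclose(row, full[idx]))
```

This trades a little time per query for memory that grows with N instead of N squared.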

In [24]:
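#Construct a reverse map from movie titles to row indices
indices = pd.Series(metadata.index, index=metadata['title'])
#drop_duplicates() compares values, not labels, so deduplicate on the title index instead
indices = indices[~indices.index.duplicated(keep='first')]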

In [16]:
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64
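
Titles in a movie catalogue are not guaranteed to be unique, so a title-to-index Series can return several positions for one label. The sketch below uses a made-up three-row frame (`toy` and `lookup` are hypothetical names) to show that behaviour and one way to keep only the first match.

```python
import pandas as pd

# Made-up titles with one repeat, for illustration only.
toy = pd.DataFrame({"title": ["Batman", "Heat", "Batman"]})
lookup = pd.Series(toy.index, index=toy["title"])

# Series.drop_duplicates compares the values (row positions 0, 1, 2),
# which are all distinct, so nothing is dropped.
same_len = len(lookup.drop_duplicates())

# A repeated label returns a Series with every matching position.
matches = lookup["Batman"]

# Deduplicating on the index keeps one row position per title.
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]

print(same_len, list(matches), unique_lookup["Batman"])
```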

Pairwise labels or “edges” indicating point similarity or dissimilarity
are used to learn a transformation of the data such that similar
points are “close” to one another and dissimilar points are
distant in the transformed space.

In [17]:
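def get_recommendations(title, cosine_sim=cosine_sim):
    # Row index of the movie that matches the title
    idx = indices[title]
    # Pair every movie index with its similarity to the chosen movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (normally the movie itself) and keep the next 10
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    # Return the titles of the 10 most similar movies
    return metadata['title'].iloc[movie_indices]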

In [18]:
get_recommendations('The Dark Knight Rises')

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object

In [19]:
get_recommendations('The Departed')

26931                     Southie
2626                     No Mercy
7551             Infernal Affairs
36333              The Anarchists
23647      Shoot First, Die Later
19985                The Enforcer
40680                 Line Walker
28150                 Super Bitch
7191     The Wrong Arm of the Law
44005         Pitbull Tough Women
Name: title, dtype: object