<br><br><center><h1>Movie Recommendation System</h1><br><br><br>
<center><img src="imgs/MoviePoster4.jpg" width="700" height="400"><br>

<h2>About the Dataset</h2>
<p>The dataset files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset, covering movies released on or before July 2017. Features include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.

These features could be used to train machine learning models for content-based and collaborative filtering.
</p><br>
This dataset consists of the following files:
<ul>
<li>movies_metadata.csv: This file contains information on ~45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, genre, revenue, release dates, languages, production countries, and companies.</li>
<li>keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.</li>
<li>credits.csv: Consists of Cast and Crew Information for all the movies. Available in the form of a stringified JSON Object.</li>
<li>links.csv: This file contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.</li>
<li>links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.</li>
<li>ratings_small.csv: A subset of 100,000 ratings from roughly 700 users on 9,000 movies.</li>
</ul>

<br><br>
<h2>We are going to make 3 types of Recommender Systems here:</h2>
<br>
<ol><h3>
    <li> Simple recommenders </li><br>
    <li> Content-based recommenders</li><br>
    <li> Collaborative filtering engines</li></h3>
</ol>
  
<p><img src="imgs/MoviePoster6.png" width="400" height="200"><br>

### Import Libraries

In [178]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from ast import literal_eval

### Import Metadata

In [2]:
metadata = pd.read_csv("Datasets/movies_metadata.csv")
metadata.head(3)



Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [214]:
metadata[['title','id','vote_average','vote_count','overview','genres','cast', 'director', 'keywords']].head(6)

Unnamed: 0,title,id,vote_average,vote_count,overview,genres,cast,director,keywords
0,Toy Story,862,7.7,5415.0,"Led by Woody, Andy's toys live happily in his ...","[animation, comedy, family]","[tomhanks, timallen, donrickles]",johnlasseter,"[jealousy, toy, boy]"
1,Jumanji,8844,6.9,2413.0,When siblings Judy and Peter discover an encha...,"[adventure, fantasy, family]","[robinwilliams, jonathanhyde, kirstendunst]",joejohnston,"[boardgame, disappearance, basedonchildren'sbook]"
2,Grumpier Old Men,15602,6.5,92.0,A family wedding reignites the ancient feud be...,"[romance, comedy]","[waltermatthau, jacklemmon, ann-margret]",howarddeutch,"[fishing, bestfriend, duringcreditsstinger]"
3,Waiting to Exhale,31357,6.1,34.0,"Cheated on, mistreated and stepped on, the wom...","[comedy, drama, romance]","[whitneyhouston, angelabassett, lorettadevine]",forestwhitaker,"[basedonnovel, interracialrelationship, single..."
4,Father of the Bride Part II,11862,5.7,173.0,Just when George Banks has recovered from his ...,[comedy],"[stevemartin, dianekeaton, martinshort]",charlesshyer,"[baby, midlifecrisis, confidence]"
5,Heat,949,7.7,1886.0,"Obsessive master thief, Neil McCauley leads a ...","[action, crime, drama]","[alpacino, robertdeniro, valkilmer]",michaelmann,"[robbery, detective, bank]"


In [3]:
metadata.iloc[:,9]

0        Led by Woody, Andy's toys live happily in his ...
1        When siblings Judy and Peter discover an encha...
2        A family wedding reignites the ancient feud be...
3        Cheated on, mistreated and stepped on, the wom...
4        Just when George Banks has recovered from his ...
                               ...                        
45461          Rising and falling between a man and woman.
45462    An artist struggles to finish his work while a...
45463    When one of her hits goes wrong, a professiona...
45464    In a small town live two brothers, one a minis...
45465    50 years after decriminalisation of homosexual...
Name: overview, Length: 45466, dtype: object

In [4]:
metadata.tail(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0
45465,False,,0,[],,461257,tt6980792,en,Queerama,50 years after decriminalisation of homosexual...,...,2017-06-09,0.0,75.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Queerama,False,0.0,0.0


In [5]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [6]:
metadata.describe()

Unnamed: 0,revenue,runtime,vote_average,vote_count
count,45460.0,45203.0,45460.0,45460.0
mean,11209350.0,94.128199,5.618207,109.897338
std,64332250.0,38.40781,1.924216,491.310374
min,0.0,0.0,0.0,0.0
25%,0.0,85.0,5.0,3.0
50%,0.0,95.0,6.0,10.0
75%,0.0,107.0,6.8,34.0
max,2787965000.0,1256.0,10.0,14075.0


<h2>Data Visualization</h2>

In [184]:
px.histogram(metadata, x= metadata.vote_average ,  barmode='group' , title='Vote Average' ,  width=700 , height=500 )

In [209]:
px.histogram(metadata, x= metadata.original_language ,  barmode='group' , title='language' ,  width=700 , height=500 )

In [210]:
px.histogram(metadata, x= metadata.adult ,  barmode='group' , title='Adult Movies' ,  width=700 , height=500 )

<br><h2>1. Simple Recommender</h2>

<br><p><font size = 4><b>Simple recommenders</b> offer generalized recommendations to every user, based on movie popularity and/or genre. The basic idea is that movies that are more popular and critically acclaimed have a higher probability of being liked by the average audience. IMDB's top-10 lists are an example.</font></p>

<p><font size = 3>Here we have the vote average and the vote count. The problem is that one movie may have a rating of 10 from only 5 voters, while another has a rating of 9.5 from 1,000 voters. Sorting by rating alone puts the first movie on top, even though most of us would trust the second one more. To account for this, we compute a weighted rating that takes the number of votes into account.</font></p><br><br>
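
To make the effect concrete, here is a minimal sketch of the IMDB-style weighted rating applied to the two hypothetical movies described above. The values of `m` and `c` are illustrative assumptions, not numbers taken from the dataset.

```python
def weighted_rating(v, r, m, c):
    """IMDB-style weighted rating.

    v: number of votes for the movie
    r: average rating of the movie
    m: minimum votes required to be listed
    c: mean rating across all movies
    """
    return (v / (v + m)) * r + (m / (v + m)) * c


# Illustrative values (assumptions, not taken from the dataset)
m, c = 50, 5.6

few_votes = weighted_rating(v=5, r=10.0, m=m, c=c)      # rated 10 by 5 people
many_votes = weighted_rating(v=1000, r=9.5, m=m, c=c)   # rated 9.5 by 1000 people

print(round(few_votes, 2), round(many_votes, 2))  # 6.0 9.31
```

With few votes the score is pulled strongly toward the overall mean `c`, so the widely rated movie ends up ranked higher.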

In [190]:
sr_movies = metadata[['id','title','vote_average','vote_count']]
sr_movies.head()

Unnamed: 0,id,title,vote_average,vote_count
0,862,Toy Story,7.7,5415.0
1,8844,Jumanji,6.9,2413.0
2,15602,Grumpier Old Men,6.5,92.0
3,31357,Waiting to Exhale,6.1,34.0
4,11862,Father of the Bride Part II,5.7,173.0


In [191]:
sr_movies[sr_movies.vote_average==0]

Unnamed: 0,id,title,vote_average,vote_count
83,188588,Last Summer in the Hamptons,0.0,0.0
107,96357,Headless Body in Topless Bar,0.0,0.0
126,290157,Jupiter's Wife,0.0,0.0
132,124636,Sonic Outlaws,0.0,0.0
137,124639,Target,0.0,0.0
...,...,...,...,...
46594,323132,Altar of Fire,0.0,0.0
46596,325439,The Wonders of Aladdin,0.0,0.0
46614,276895,Deep Hearts,0.0,0.0
46626,227506,Satan Triumphant,0.0,0.0


In [192]:
# Values needed to calculate the weighted rating:
# v - number of votes for the movie (vote_count)
# r - average rating of the movie (vote_average)
# m - minimum votes required to be listed in the chart
# c - mean vote across the whole dataset

c = metadata['vote_average'].mean()
# Use the 80th percentile of vote_count as the cutoff, keeping roughly the top 20% of movies
m = metadata['vote_count'].quantile(0.80)

print(c,'\n',m)

5.6117278654770075 
 49.0


In [193]:
# Keep only movies with at least m votes; copy so a new column can be added safely later
sr_movies = sr_movies.loc[sr_movies['vote_count'] >= m].copy()
sr_movies

Unnamed: 0,id,title,vote_average,vote_count
0,862,Toy Story,7.7,5415.0
1,8844,Jumanji,6.9,2413.0
2,15602,Grumpier Old Men,6.5,92.0
4,11862,Father of the Bride Part II,5.7,173.0
5,949,Heat,7.7,1886.0
...,...,...,...,...
46501,430365,With Open Arms,5.2,94.0
46505,248705,The Visitors: Bastille Day,4.0,167.0
46510,44918,Titanic 2,3.4,55.0
46599,455661,In a Heartbeat,8.3,146.0


In [194]:
# function to compute weighted average

def weighted_rating(x, m=m, c=c):
    v = x['vote_count']
    r = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * r) + (m/(m+v) * c)

In [195]:
sr_movies['weighted_rating'] = sr_movies.apply(weighted_rating, axis=1)

In [196]:
sr_movies

Unnamed: 0,id,title,vote_average,vote_count,weighted_rating
0,862,Toy Story,7.7,5415.0,7.681273
1,8844,Jumanji,6.9,2413.0,6.874360
2,15602,Grumpier Old Men,6.5,92.0,6.191310
4,11862,Father of the Bride Part II,5.7,173.0,5.680517
5,949,Heat,7.7,1886.0,7.647119
...,...,...,...,...,...
46501,430365,With Open Arms,5.2,94.0,5.341082
46505,248705,The Visitors: Bastille Day,4.0,167.0,4.365623
46510,44918,Titanic 2,3.4,55.0,4.442064
46599,455661,In a Heartbeat,8.3,146.0,7.624485


In [197]:
#Sort movies based on score calculated above
sr_movies = sr_movies.sort_values('weighted_rating', ascending=False)

#Print the top 20 movies
sr_movies.head(20)

Unnamed: 0,id,title,vote_average,vote_count,weighted_rating
10397,19404,Dilwale Dulhania Le Jayenge,9.1,661.0,8.85926
314,278,The Shawshank Redemption,8.5,8358.0,8.483166
841,238,The Godfather,8.5,6024.0,8.476696
41418,372058,Your Name.,8.5,1030.0,8.368837
12589,155,The Dark Knight,8.3,12269.0,8.289306
2870,550,Fight Club,8.3,9678.0,8.286458
292,680,Pulp Fiction,8.3,8670.0,8.284892
522,424,Schindler's List,8.3,4436.0,8.27063
23868,244786,Whiplash,8.3,4376.0,8.270232
5529,129,Spirited Away,8.3,3968.0,8.267208


In [198]:
# bottom 20 movies
sr_movies.tail(20)

Unnamed: 0,id,title,vote_average,vote_count,weighted_rating
13606,7916,Far Cry,3.1,73.0,4.108809
5004,11535,Rollerball,3.4,106.0,4.099191
30975,63315,The Hottie & The Nottie,2.9,65.0,4.065567
12466,7278,Meet the Spartans,3.8,370.0,4.011873
21437,205321,Sharknado,3.8,484.0,3.966557
5470,9544,FearDotCom,3.2,106.0,3.962417
10904,10073,Date Movie,3.6,225.0,3.959762
24615,218043,Left Behind,3.7,396.0,3.910505
4025,580,Jaws: The Revenge,3.5,227.0,3.874908
9798,10214,Son of the Mask,3.6,346.0,3.849556


<br><br><br><h2>2. Content Based Recommender</h2>

<p><font size = 4 ><b>Content-based recommender </b> is a system that recommends movies that are similar to a particular movie. To achieve this, we compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.</font></p><br>

In [15]:
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [16]:
metadata['overview'].isnull().sum()

954

In [17]:
# About 2% of the overviews are missing (954 of 45,466), so we replace them with empty strings.
# Build TF-IDF vectors from the plot overviews, ignoring English stop words.

tfidf = TfidfVectorizer(stop_words='english')
metadata['overview'] = metadata['overview'].fillna('')
tfidf_matrix = tfidf.fit_transform(metadata['overview'])
tfidf_matrix.shape

(45466, 75827)

In [18]:
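# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf.get_feature_names_out()[4000:4020].tolist()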

['arijit',
 'arin',
 'arindam',
 'arion',
 'ariosto',
 'aris',
 'arisan',
 'arise',
 'arisen',
 'arises',
 'arisia',
 'arising',
 'aristakisyan',
 'aristide',
 'aristocats',
 'aristocracy',
 'aristocrat',
 'aristocratic',
 'aristocrats',
 'aristophanes']

<br><p>We will use cosine similarity to compute a numeric score for how similar two movies are. We use the <b><font size = 4>cosine similarity</font></b> score because it is independent of magnitude and is relatively easy and fast to calculate. Mathematically, it is defined as follows:</p><br><img src = "imgs/cosine_sim.webp">

<p>Because TfidfVectorizer L2-normalizes each row by default, the dot product of two TF-IDF vectors already equals their cosine similarity. We therefore use <b>linear_kernel()</b> instead of <b>cosine_similarity()</b>, which gives the same scores while skipping the extra normalization step.</p><br>
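
The equivalence can be checked on a tiny example. This is a sketch on made-up plot descriptions (not rows from the dataset); `tfidf_demo` and the other names are introduced here only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy plot descriptions (made up for illustration)
docs = [
    "a cowboy toy feels jealous of a new space ranger toy",
    "two siblings discover an enchanted board game",
    "a space ranger fights an evil emperor",
]

# TfidfVectorizer uses norm='l2' by default, so each row has unit length
tfidf_demo = TfidfVectorizer(stop_words="english").fit_transform(docs)

row_norms = np.sqrt(tfidf_demo.multiply(tfidf_demo).sum(axis=1))
dot_products = linear_kernel(tfidf_demo, tfidf_demo)
cosines = cosine_similarity(tfidf_demo, tfidf_demo)

print(np.allclose(row_norms, 1.0))         # rows are unit vectors
print(np.allclose(dot_products, cosines))  # dot product matches cosine similarity
```

If the vectorizer were created with `norm=None`, the two results would generally differ, and `cosine_similarity()` would be the one to use.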

In [19]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [20]:
cosine_sim.shape

(45466, 45466)

In [21]:
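# Build a reverse mapping from movie title to row index.
# Note: drop_duplicates() compares the index values, not the titles, so repeated titles stay in the mapping.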

indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64
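
The mapping above still has 45,466 entries, which suggests no titles were dropped. A small sketch of why, and of one possible way to keep only the first row per title, using a made-up frame (`movies`, `title_to_idx`, and `unique_titles` are hypothetical names):

```python
import pandas as pd

# Toy frame with a repeated title (hypothetical data)
movies = pd.DataFrame({"title": ["Heat", "Jumanji", "Heat"]})

title_to_idx = pd.Series(movies.index, index=movies["title"])

# drop_duplicates() compares the values (0, 1, 2), so nothing is removed
print(len(title_to_idx.drop_duplicates()))  # 3

# Keep only the first row for each title by filtering on the index labels
unique_titles = title_to_idx[~title_to_idx.index.duplicated(keep="first")]
print(unique_titles["Heat"])  # 0
```

Without this filtering, looking up a repeated title returns a Series of several positions rather than a single integer.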

In [22]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping position 0 (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

In [226]:
get_recommendations('Toy Story')

15348                                Bitten
2997                               Scrooged
10301          Karol: A Man Who Became Pope
24523                 The Return of Dracula
23843                     So Evil, So Young
29202                             Flodder 3
43427    The Policeman of the 16th Precinct
38476                   Phileine Says Sorry
42721                   Feuten: Het Feestje
8327                             Veer-Zaara
Name: title, dtype: object

In [24]:
get_recommendations('Kurukshetra')

36204                        Yaadein
34356                Raja Hindustani
36356                        Humraaz
35192     The Legend of Bhagat Singh
10309    Dilwale Dulhania Le Jayenge
34427      Har Dil Jo Pyar Karega...
39420                    The Journey
36205                         Indian
15020                  Chalte Chalte
34316                       Tahkhana
Name: title, dtype: object

<br><p><font size = 3>So far we have built a recommender based on the <b><font size = 5>plot description</font></b>. However, people often like a movie because of its cast or crew, and in those cases this system is of little use. Next, we will build a recommender based on <b><font size = 5>crew, cast</font></b> and <b><font size = 5>genres</font></b>.</font></p>
<p><font size = 3>The metadata file does not contain cast, crew, or plot keyword information, so we need to load additional datasets.</font></p><br>

In [25]:
credits = pd.read_csv('Datasets/credits.csv')
keywords = pd.read_csv('Datasets/keywords.csv')



In [26]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [27]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
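
The `cast`, `crew`, and `keywords` columns hold Python-style literals stored as strings. A minimal sketch of turning one such cell into a list of names with `ast.literal_eval`; the string below is a shortened, illustrative cell, not an exact row from the file.

```python
from ast import literal_eval

# A shortened, illustrative keywords cell in the same stringified format
raw = "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]"

parsed = literal_eval(raw)                  # str -> list of dicts
names = [item["name"] for item in parsed]   # keep only the 'name' fields
print(names)  # ['jealousy', 'toy']
```

Because the cells use single quotes, `json.loads` would reject them, whereas `literal_eval` parses them as Python literals without executing arbitrary code.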


In [28]:
# Remove rows whose ids cannot be converted to int

metadata = metadata.drop([35587, 19730, 29503])

# To merge credits and keywords into metadata,
# the id columns must share the same integer dtype

keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Now merge the datasets

metadata = metadata.merge(credits , on = 'id')
metadata = metadata.merge(keywords , on = 'id')
metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [29]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46628 entries, 0 to 46627
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  46628 non-null  object 
 1   belongs_to_collection  4574 non-null   object 
 2   budget                 46628 non-null  object 
 3   genres                 46628 non-null  object 
 4   homepage               8009 non-null   object 
 5   id                     46628 non-null  int32  
 6   imdb_id                46611 non-null  object 
 7   original_language      46617 non-null  object 
 8   original_title         46628 non-null  object 
 9   overview               46628 non-null  object 
 10  popularity             46624 non-null  object 
 11  poster_path            46229 non-null  object 
 12  production_companies   46624 non-null  object 
 13  production_countries   46624 non-null  object 
 14  release_date           46540 non-null  object 
 15  re

In [30]:
# convert the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

In [31]:
# Return the director's name from the crew list, or NaN if no director is listed

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [32]:
# Extract the names of the top 3 elements from each feature column

def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing data
    return []

In [33]:
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


In [34]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [35]:
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)



In [36]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])


In [37]:
# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)

In [38]:
metadata[['soup']].head(2)

Unnamed: 0,soup
0,jealousy toy boy tomhanks timallen donrickles ...
1,boardgame disappearance basedonchildren'sbook ...


In [39]:
# Use CountVectorizer instead of TF-IDF here: we don't want to down-weight
# an actor or director just because they appear in many movies.
# Count vectors are not normalized, so we use cosine_similarity() rather than linear_kernel().

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

In [40]:
count_matrix.shape

(46628, 73881)

In [41]:
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

<p>If you get this error:<br><br>
    <font color='Red'>MemoryError:</font> Unable to allocate 16.2 GiB for an array with shape (46628, 46628) and data type float64
    <br><br>Follow these steps:</p>
    <img src='imgs/error1.png'>
<p><font size='4.5'>This is a memory allocation problem. Code for working around it is available <a href="https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type">here</a>.</font></p>
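
Another option, shown here only as a sketch and not as the approach used in this notebook, is to avoid building the full square matrix and compute one row of similarities at query time. The `soups` strings and the `top_similar` helper are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "soup" strings (hypothetical, same shape as the soup feature)
soups = [
    "batman gotham christianbale christophernolan action crime",
    "batman joker christianbale christophernolan action crime",
    "toy jealousy tomhanks johnlasseter animation comedy",
]
demo_matrix = CountVectorizer(stop_words="english").fit_transform(soups)


def top_similar(idx, matrix, n=10):
    """Return the positions of the n rows most similar to row idx."""
    # One (1 x n_movies) row of similarities instead of the full square matrix
    scores = cosine_similarity(matrix[idx], matrix).ravel()
    ranked = scores.argsort()[::-1]
    return [int(i) for i in ranked if i != idx][:n]


print(top_similar(0, demo_matrix, n=2))
```

Each query then needs memory proportional to the number of movies rather than its square, at the cost of recomputing the row on every call.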

In [42]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

In [43]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

12589      The Dark Knight
10210        Batman Begins
9311                Shiner
9874       Amongst Friends
7772              Mitchell
516      Romeo Is Bleeding
11463         The Prestige
24090            Quicksand
25038             Deadfall
41063                 Sara
Name: title, dtype: object

In [44]:
get_recommendations('Awakenings', cosine_sim2)

1191                                GoodFellas
1648                          Ill Gotten Gains
3487                Jails, Hospitals & Hip-Hop
20553                                  Sundome
22937                        Viper In The Fist
27277                          Home Sweet Home
32471                                     Joni
34336                               Yolngu Boy
34890                                  Το γάλα
44025    National Geographic American Blackout
Name: title, dtype: object

In [45]:
# For comparison: recommendations from plot overviews only (no genres, cast, or crew)
get_recommendations('Awakenings')

11381                     Gridiron Gang
35747                 The Bohemian Girl
25271          Alien Nation: Millennium
27280                       On Approval
14571    Battlestar Galactica: The Plan
27293                       The Hunters
1329           Star Trek: First Contact
23096                          Non-Stop
8440                               Toni
45461              Neither Wolf Nor Dog
Name: title, dtype: object

<br><br>
<h2>3. Collaborative filtering engines</h2> 

<p> <font size = 4>Our content-based engine has some severe limitations. It can only suggest movies that are close to a given movie, so it cannot capture tastes or provide recommendations across genres.

The engine is also not personal: it does not capture the tastes and biases of an individual user. Anyone querying it with a given movie receives the same recommendations, regardless of who they are.<br><br>
    Collaborative filtering is a technique that selects items a user might like on the basis of reactions by similar users. It works by searching a large group of people to find a smaller set of users whose tastes resemble those of a particular user. It then looks at the items those users like and combines them into a ranked list of suggestions.</font></p>
<br><img src= 'imgs/collab_filter.png'><br>
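
The idea of "users with similar tastes" can be sketched with a tiny neighbourhood-based example. This is not the SVD model used below; it is a simplified illustration on made-up ratings, and `predict_user_based` is a hypothetical helper.

```python
import numpy as np

# Rows are users, columns are movies; 0 means "not rated" (made-up ratings)
ratings_demo = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 1],
    [1, 1, 2, 5],
], dtype=float)


def predict_user_based(R, user, item):
    """Similarity-weighted average of other users' ratings for item."""
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user])   # cosine similarity to `user`
    # Only consider other users who actually rated the item
    rated = (R[:, item] > 0) & (np.arange(len(R)) != user)
    return float(sims[rated] @ R[rated, item] / sims[rated].sum())


print(round(predict_user_based(ratings_demo, user=0, item=2), 2))
```

User 0's ratings resemble user 1's far more than user 2's, so the prediction for the unseen movie lands closer to user 1's rating of 5 than to user 2's rating of 2.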

In [46]:
# Import the SVD algorithm and data helpers from the Surprise library
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

In [47]:
ratings = pd.read_csv("Datasets/ratings_small.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [48]:
reader = Reader()

In [49]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

<br><p><font size = 3>We will use the Surprise library, which provides matrix factorization algorithms such as <b>Singular Value Decomposition (SVD)</b> that are trained to keep the rating prediction error low. We evaluate the model with RMSE (Root Mean Square Error) and MAE (Mean Absolute Error) under cross-validation.</font></p>
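
For reference, the Surprise documentation describes the prediction of its SVD model as the sum of a global mean, two bias terms, and a dot product of latent factors. The notation below summarizes that description; it is not derived in this notebook.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
```

Here $\mu$ is the mean of all ratings, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the latent factor vectors of user $u$ and item $i$. When an item is unknown to the model, its bias and factors are treated as zero, so the estimate falls back to $\mu + b_u$.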

In [50]:
# Use the SVD algorithm
svd = SVD()

# Run 5-fold cross-validation and then print results
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8963  0.8929  0.8918  0.9020  0.8996  0.8965  0.0038  
MAE (testset)     0.6887  0.6890  0.6889  0.6931  0.6928  0.6905  0.0020  
Fit time          7.10    7.19    7.38    7.64    8.85    7.63    0.64    
Test time         0.46    0.20    0.20    0.21    0.27    0.27    0.10    


{'test_rmse': array([0.89628527, 0.89287876, 0.89184656, 0.90196862, 0.89956196]),
 'test_mae': array([0.68873543, 0.68896638, 0.68893301, 0.69307429, 0.6927891 ]),
 'fit_time': (7.096038103103638,
  7.187553882598877,
  7.380761623382568,
  7.635105133056641,
  8.852182388305664),
 'test_time': (0.4554595947265625,
  0.20484471321105957,
  0.2047426700592041,
  0.20697426795959473,
  0.2704792022705078)}

In [51]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x15a8edd6dd8>

In [52]:
ratings['userId'].value_counts()

547    2391
564    1868
624    1735
15     1700
73     1610
       ... 
221      20
444      20
484      20
35       20
485      20
Name: userId, Length: 671, dtype: int64

In [53]:
ratings[ratings['userId']==484]

Unnamed: 0,userId,movieId,rating,timestamp
70088,484,1,3.0,851345204
70089,484,3,4.0,851345272
70090,484,17,5.0,851345205
70091,484,25,2.0,851345205
70092,484,32,5.0,851345204
70093,484,41,4.0,851345432
70094,484,62,5.0,851345205
70095,484,65,1.0,851345432
70096,484,76,5.0,851345499
70097,484,104,3.0,851345320


In [175]:
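# Predict the rating user 484 would give movie 569.
# The third argument (r_ui) is an optional true rating kept for reference; it does not change the estimate.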
svd.predict(484, 569 ,5.0)

Prediction(uid=484, iid=569, r_ui=5.0, est=3.653915673891676, details={'was_impossible': False})

In [166]:
def collab(userId, title):
    # The title is only looked up in the index; the ranking below does not depend on it
    idx = indices[title]

    # Copy the selected columns so adding 'est' does not trigger SettingWithCopyWarning
    movies = metadata[['title', 'vote_count', 'vote_average', 'id']].copy()
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, x).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [225]:
collab(336 , 'Avatar')

Unnamed: 0,title,vote_count,vote_average,id,est
952,The 39 Steps,217.0,7.4,260,4.17707
286,Once Were Warriors,106.0,7.6,527,4.171292
2674,The Sixth Sense,3223.0,7.7,745,4.153448
3085,Galaxy Quest,722.0,6.9,926,4.048102
334,While You Were Sleeping,340.0,6.5,2064,4.039774
8399,Pandora's Box,46.0,7.6,905,3.939042
18446,"Don't Worry, I'm Fine",168.0,7.3,1254,3.926769
8629,Murder She Said,31.0,7.0,750,3.898549
3832,The Bank Dick,28.0,6.8,911,3.893195
2675,The Thomas Crown Affair,349.0,6.7,913,3.887722


<br><h2>Hybrid Filtering</h2>

<p><font size = 4>Each method has its strengths, and combining them can give better recommendations. This leads to the hybrid method: here we take the 30 movies most similar to a given title according to the content-based engine, then rank them by the rating that the collaborative filtering model predicts for a given user.</font></p><br>

In [172]:
def hybrid(userId, title):
    idx = indices[title]

    # Content-based step: rank all movies by plot similarity to the given title
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 30 closest movies, skipping position 0 (the movie itself)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]

    # Collaborative step: re-rank the candidates by the rating predicted for this user
    movies = metadata.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'id']]
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId,x).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [223]:
hybrid(467,'The Godfather')

Unnamed: 0,title,vote_count,vote_average,id,est
1379,Michael,174.0,5.5,2928,3.910114
25374,The Fearmakers,2.0,6.0,27446,3.786634
12194,Feast of Love,53.0,5.9,14313,3.786634
41014,Breaking Glass,12.0,7.0,26803,3.786634
3415,Black and White,20.0,4.6,38809,3.786634
40152,Graduation,54.0,6.7,374458,3.786634
21365,Back to 1942,14.0,5.3,139329,3.786634
28702,JFK: The Smoking Gun,1.0,0.0,232610,3.786634
33622,Ghost Writer,3.0,6.3,51428,3.786634
22679,River Queen,9.0,6.7,24671,3.786634
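
One caveat worth checking: the `id` column in movies_metadata.csv holds TMDB ids, while ratings_small.csv identifies movies by MovieLens `movieId`, so passing `id` straight to `svd.predict` may look up a different movie or one the model has never seen. Many identical `est` values in a result are consistent with that. Below is a minimal sketch of translating ids through links_small.csv, assuming it has `movieId` and `tmdbId` columns; the frames are small made-up stand-ins and `tmdb_to_movielens` is a hypothetical name.

```python
import pandas as pd

# Stand-ins for links_small.csv and the metadata frame (made-up rows);
# the column names movieId / tmdbId are assumed from the MovieLens links files
links = pd.DataFrame({"movieId": [1, 2], "tmdbId": [862.0, 8844.0]})
movies = pd.DataFrame({
    "id": [862, 8844, 15602],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
})

# Map each TMDB id to its MovieLens movieId
links = links.dropna(subset=["tmdbId"]).astype({"tmdbId": int})
tmdb_to_movielens = links.set_index("tmdbId")["movieId"]

movies["movieId"] = movies["id"].map(tmdb_to_movielens)

# Movies without a MovieLens id have no ratings, so drop them before predicting
rated = movies.dropna(subset=["movieId"]).astype({"movieId": int})
print(rated)
```

With such a column in place, `svd.predict(userId, movieId)` would be called with the MovieLens id rather than the TMDB id.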
