<a href="https://colab.research.google.com/github/vijaylokith/MovieRecommendationEngine_A-Hybrid-Approach/blob/main/Movie_Recommendation_system_Hybrid.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Movie Recommendation System: A Hybrid Approach**

As part of this project, three recommendation systems have been designed.
> Popularity-Based Recommendation(for tackling cold start problem)

> Content-Based Recommendation

> Collaborative Filtering Based Recommendation

And finally, all three types have been stacked together to produce a powerful Hybrid Recommendation system.

In [309]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.simplefilter("ignore")

In [368]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## **# Data Reading**

In [311]:
metadata = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Movie Recommendation System/data set/movies_metadata.csv")
print(metadata.shape)
metadata.head()

(45466, 24)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [312]:
sub = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Movie Recommendation System/data set/links_small.csv")
print(sub.shape)
sub.head()

(9125, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [313]:
keyword = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Movie Recommendation System/data set/keywords.csv")
print(keyword.shape)
keyword.head()

(46419, 2)


Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [314]:
credit = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Movie Recommendation System/data set/credits.csv")
print(credit.shape)
credit.head()

(45476, 3)


Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


## **# Basic Exploratory Data Analysis**

**1) Metadata**

In [315]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

Observations:
> This table has a total of 24 features, of which we will only be using 4/5 important features.

> There are some features which have missing values, but those features which has missing values are not of great important. All the important features have no NULL/NAN values.

> Data type of all the features or either object or float, we need to convert some of the features to int as we will be performing some mathematical operations on them.

In [316]:
metadata.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
revenue,45460.0,11209350.0,64332250.0,0.0,0.0,0.0,0.0,2787965000.0
runtime,45203.0,94.1282,38.40781,0.0,85.0,95.0,107.0,1256.0
vote_average,45460.0,5.618207,1.924216,0.0,5.0,6.0,6.8,10.0
vote_count,45460.0,109.8973,491.3104,0.0,3.0,10.0,34.0,14075.0


In [317]:
metadata = metadata.drop([19730, 29503, 35587])
meatadata = metadata.reset_index(drop=True)

**2) Sub**

In [318]:
sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9125 non-null   int64  
 1   imdbId   9125 non-null   int64  
 2   tmdbId   9112 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 214.0 KB


In [319]:
sub.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
movieId,9125.0,31123.291836,40782.633604,1.0,2850.0,6290.0,56274.0,164979.0
imdbId,9125.0,479824.392329,743177.360844,417.0,88846.0,119778.0,428441.0,5794766.0
tmdbId,9112.0,39104.545544,62814.519801,2.0,9451.75,15852.0,39160.5,416437.0


In [352]:
sub["tmdbId"].isnull().values.sum()

0

In [351]:
sub["tmdbId"].fillna(method="ffill",inplace=True)

In [353]:
sub["id"] = sub["tmdbId"].astype('int64')

In [354]:
sub["id"].dtype

dtype('int64')

In [322]:
sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9125 non-null   int64  
 1   imdbId   9125 non-null   int64  
 2   tmdbId   9112 non-null   float64
 3   id       9112 non-null   float64
dtypes: float64(2), int64(2)
memory usage: 285.3 KB


**3) Keyword**

In [323]:
keyword.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


**4) Credit**

In [324]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


## **# Popularity-Based Recommendation(for tackling cold start problem)**

> We will be computing the list based on two criteria weighted rating and popularity (both will be given equal weightage).

> Weighted rating will be calculated using the IMDB Weighted Rating formula 
"weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C where: R = average for the movie (mean) = (Rating); v = number of votes for the movie = (votes); m = minimum votes required to be listed in the Top 250 (currently 25000)"

In [325]:
# weighted average
v = metadata["vote_count"]
r = metadata["vote_average"]
c = metadata["vote_average"].mean()
m = metadata["vote_count"].quantile(0.90)

metadata['weighted_avg'] = ((r*v)+(c*m))/(v+m)

In [326]:
metadata["weighted_avg"].dtype

dtype('float64')

In [327]:
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,weighted_avg
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.640253
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.820293


> As we have both weighted average and popularity, now we should compute a new metric which has equal bias for both weighted rating and popularity.
Note: We should scale before computing the metric as both weighted rating and popularity are in different scale and range.

In [328]:
# Min-Max Scaling

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(metadata[['weighted_avg', 'popularity']])

In [329]:
movie_normalized = pd.DataFrame(scaled_data, columns=['weighted_avg', 'popularity'])
movie_normalized.head()

Unnamed: 0,weighted_avg,popularity
0,0.834268,0.040087
1,0.665586,0.031079
2,0.484519,0.021394
3,0.435663,0.007049
4,0.427034,0.01532


In [330]:
metadata[['normalized_weighted_avg', 'normalized_popularity']] = movie_normalized
metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,weighted_avg,normalized_weighted_avg,normalized_popularity
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.640253,0.834268,0.040087
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.820293,0.665586,0.031079
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,5.940132,0.484519,0.021394
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,5.702645,0.435663,0.007049
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,5.6607,0.427034,0.01532


In [331]:
# now we can create that new metric

metadata["score"] = (0.5 * metadata["normalized_weighted_avg"]) + (0.5 * metadata["normalized_popularity"])

metadata = metadata.sort_values(["score"], ascending=False)

In [332]:
metadata[["id","original_title","score"]].head(10)

Unnamed: 0,id,original_title,score
30698,16318,색즉시공,0.78693
33354,47211,"Deprisa, deprisa",0.635636
24454,36523,Felix The Cat: The Movie,0.623299
292,680,Pulp Fiction,0.608722
12481,155,The Dark Knight,0.593929
42219,362141,Köpek,0.589625
23674,213635,Luther,0.579152
26565,284053,Thor: Ragnarok,0.57575
43641,203835,Amityville: The Awakening,0.568497
26563,166424,Fantastic Four,0.56146


This is the list which will be recommended to any new user.

## **# Content Based Recommendation**

In [333]:
# Before anything we should convert the data type of "id" from object to int.
metadata["id"] = metadata["id"].astype('int')

In [355]:
sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9125 non-null   int64  
 1   imdbId   9125 non-null   int64  
 2   tmdbId   9125 non-null   float64
 3   id       9125 non-null   int64  
dtypes: float64(1), int64(3)
memory usage: 285.3 KB


In [335]:
# sub.columns = ["movieId","imdbId","id"]

In [357]:
smd = metadata[metadata['id'].isin(sub['id'])]
smd.shape

(9099, 28)

In [360]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,weighted_avg,normalized_weighted_avg,normalized_popularity,score
292,False,,8000000,"[{'id': 53, 'name': 'Thriller'}, {'id': 80, 'n...",,680,tt0110912,en,Pulp Fiction,"A burger-loving hit man, his philosophical par...",...,Released,Just because you are a character doesn't mean ...,Pulp Fiction,False,8.3,8670.0,8.251406,0.959995,0.257449,0.608722
12481,False,"{'id': 263, 'name': 'The Dark Knight Collectio...",185000000,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",http://thedarkknight.warnerbros.com/dvdsite/,155,tt0468569,en,The Dark Knight,Batman raises the stakes in his war on crime. ...,...,Released,Why So Serious?,The Dark Knight,False,8.3,12269.0,8.265477,0.96289,0.224968,0.593929
26563,False,,120000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.fantasticfourmovie.com/,166424,tt1502712,en,Fantastic Four,Four young outsiders teleport to a dangerous u...,...,Released,Change is coming.,Fantastic Four,False,4.4,2322.0,4.478531,0.779789,0.343132,0.56146
314,False,,25000000,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",,278,tt0111161,en,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,...,Released,Fear can hold you prisoner. Hope can set you f...,The Shawshank Redemption,False,8.5,8358.0,8.445869,1.0,0.094332,0.547166
2843,False,,63000000,"[{'id': 18, 'name': 'Drama'}]",http://www.foxmovies.com/movies/fight-club,550,tt0137523,en,Fight Club,A ticking-time-bomb insomniac and a slippery s...,...,Released,Mischief. Mayhem. Soap.,Fight Club,False,8.3,9678.0,8.256385,0.961019,0.116659,0.538839


In [365]:
smd['overview'] = smd['overview'].fillna('')

In [366]:
vectoriser = TfidfVectorizer(ngram_range=(1, 2),min_df=0, stop_words='english')
word_vectors = vectoriser.fit_transform(smd['overview'])

In [367]:
word_vectors.shape

(9099, 244199)

In [None]:
cosine_sim = cosine_similarity(word_vectors,word_vectors)