# MOVIE RECOMMENDATION MODEL

## Introduction
This is a mini project to try out and optimize content-based filtering, a technique used in recommender systems.

The data comes from the Kaggle __[TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)__, which contains movies released up to 2017.



## Implementation

### Import necessary packages

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

### Preparing Dataset

We are given two datasets:
* tmdb_5000_movies.csv: metadata about the movies (genres, title, keywords, etc.)
* tmdb_5000_credits.csv: information about the cast and crew behind the movies

In [2]:
df_mov = pd.read_csv('tmdb_5000_movies.csv')
df_cre = pd.read_csv('tmdb_5000_credits.csv')

In [3]:
df_mov.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [4]:
df_mov.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

Keep only the information our model needs. We keep the columns that describe the type or genres of the movies, plus the identifier, status, and vote statistics used in later steps.

In [5]:
features_mov = ['genres','id', 'keywords','overview', 'production_companies', 'status', 'tagline', 'title', 'vote_count', 'vote_average']
df_mov = df_mov.filter(features_mov)

In [6]:
df_cre.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [7]:
df_cre.columns

Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')

Looking at the credits dataset, we can see that it is linked to the movies dataset through *movie_id*.

We drop the title from this dataset and rename the *movie_id* column to *id* so that we can later merge it with the movies dataset.
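
As a minimal sketch of this rename-then-merge step, the toy frames below (made-up values, not rows from the TMDB files) show that `pd.merge` with `on="id"` performs an inner join by default, so only ids present in both frames survive:

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables (hypothetical values)
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits = pd.DataFrame({"movie_id": [1, 2, 4], "cast": ["x", "y", "z"]})

# Align the key name, then join on it (inner join is the default)
credits = credits.rename(columns={"movie_id": "id"})
merged = pd.merge(movies, credits, on="id")

print(merged)
```

Ids 3 and 4 appear in only one of the two toy frames, so they are absent from `merged`.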

In [8]:
features_cre = ['movie_id', 'cast', "crew"]
df_cre = df_cre.filter(features_cre)
df_cre.rename(columns={"movie_id": "id"}, inplace = True)

In [9]:
df = pd.merge(df_mov, df_cre, on="id")

After merging, we will work with this dataset to recommend appropriate movies.

### Preprocessing Data


#### Remove NaN values and check types

In [10]:
df.info()
# Assign the result back instead of calling fillna with inplace=True on a column accessor
df["tagline"] = df["tagline"].fillna("")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   genres                4803 non-null   object 
 1   id                    4803 non-null   int64  
 2   keywords              4803 non-null   object 
 3   overview              4800 non-null   object 
 4   production_companies  4803 non-null   object 
 5   status                4803 non-null   object 
 6   tagline               3959 non-null   object 
 7   title                 4803 non-null   object 
 8   vote_count            4803 non-null   int64  
 9   vote_average          4803 non-null   float64
 10  cast                  4803 non-null   object 
 11  crew                  4803 non-null   object 
dtypes: float64(1), int64(2), object(9)
memory usage: 487.8+ KB


In [11]:
df.status.value_counts()

Released           4795
Rumored               5
Post Production       3
Name: status, dtype: int64

Apply two filters so that the model works better:
* Only consider *Released* movies.
* Only consider *popular* movies, i.e. those whose vote count is above a minimum threshold (the 0.1 quantile of *vote_count*).
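
To illustrate how a quantile-based threshold behaves, here is a small sketch on made-up vote counts, using the same 0.1 quantile as the filtering cell in this notebook. The frame `toy` and its values are assumptions for illustration only:

```python
import pandas as pd

# Toy vote counts (hypothetical values, not taken from the TMDB data)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "vote_count": [0, 10, 20, 30, 40],
})

# Vote count below which a movie is treated as too obscure to keep
min_votes = toy["vote_count"].quantile(0.1)

# Keep only movies with strictly more votes than the threshold
popular = toy[toy["vote_count"] > min_votes]

print(min_votes)
print(popular["title"].tolist())
```

With pandas' default linear interpolation the threshold falls between the two lowest counts, so only movie "A" is dropped here.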

In [12]:
import plotly.express as px

fig = px.scatter(df, x='vote_count', y ="vote_average")
fig.show()


In [13]:
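# Keep only released movies
df = df[df.status == "Released"]

# Keep movies whose vote count is above the 0.1 quantile; copy so that
# later column assignments do not trigger SettingWithCopyWarning
df = df[df.vote_count > df.vote_count.quantile(0.1)].copy()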

In [14]:
# Same scatter plot after filtering (plotly.express is already imported as px)

fig = px.scatter(df, x='vote_count', y ="vote_average")
fig.show()


In [15]:
df.head(5)

Unnamed: 0,genres,id,keywords,overview,production_companies,status,tagline,title,vote_count,vote_average,cast,crew
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...",Released,Enter the World of Pandora.,Avatar,11800,7.2,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,4500,6.9,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",Released,A Plan No One Escapes,Spectre,4466,6.3,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",Released,The Legend Ends,The Dark Knight Rises,9106,7.6,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",Released,"Lost in our world, found in another.",John Carter,2124,6.1,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


Some columns are stored as JSON-like strings because the dataset was collected from TMDB. We need to parse them:
* **genres**: We only need the *name* of each genre.
* **keywords**: We only need the *name* of each keyword.
* **production_companies**: We only need the *name* of each production company.
* **cast**: We only need the *name* of each cast member.
* **crew**: We only need the *name* of the director of the movie.
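
As a rough sketch of what this parsing does to a single cell, the strings below are shortened, made-up values in the same shape as the dataset's columns (the crew names are hypothetical):

```python
from ast import literal_eval

# A shortened cell value shaped like the genres column
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns the string into a real list of dicts
parsed = literal_eval(raw_genres)
names = [item["name"] for item in parsed]
print(names)

# Same idea for crew: take the first member whose job is "Director"
raw_crew = ('[{"job": "Editor", "name": "Jane Doe"}, '
            '{"job": "Director", "name": "John Roe"}]')
crew = literal_eval(raw_crew)
director = next((m["name"] for m in crew if m["job"] == "Director"), "")
print(director)
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read from a file.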

In [16]:
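# genres, keywords, production_companies, cast: keep each item's name
# crew: keep the director's name
from ast import literal_eval

def get_director_name(x):
    # Return the name of the first crew member whose job is "Director"
    for member in x:
        if member["job"] == "Director":
            return member["name"]
    # No director listed for this movie
    return ""


def get_name(x):
    # Return the "name" field of every item in the list
    return [item["name"] for item in x]
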

In [17]:
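# The crew column will be reduced to the director's name, so rename it
df = df.rename(columns={"crew": "director"})
dirty_features = ["genres", "keywords", "production_companies", "cast", "director"]

# Parse the stringified lists of dicts into Python objects
for item in dirty_features:
    df[item] = df[item].apply(literal_eval)

# Keep only the names for every feature except the director
for item in dirty_features[:-1]:
    df[item] = df[item].apply(get_name)

# Keep only the director's name from the crew list
df["director"] = df["director"].apply(get_director_name)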


Here we convert each value to lowercase and, more importantly, remove the white space between words. Otherwise the vectorizer would split "science fiction" into the separate tokens "science" and "fiction", and that "science" token would match any other movie whose features contain the word "science".
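
To see the effect of removing white space on tokenization, the sketch below compares the vocabulary that scikit-learn's `CountVectorizer` learns from two tiny, made-up documents. The helper `tokens` is a hypothetical name introduced only for this illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

def tokens(docs):
    # Return the sorted vocabulary CountVectorizer learns from docs
    vec = CountVectorizer()
    vec.fit(docs)
    return sorted(vec.vocabulary_)

# With spaces, both documents share the token "science"
with_spaces = tokens(["science fiction", "science"])

# Without spaces, the two features become distinct tokens
without_spaces = tokens(["sciencefiction", "science"])

print(with_spaces)
print(without_spaces)
```

In the second case "sciencefiction" and "science" no longer overlap, which is the behaviour the cleaning step is after.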

In [18]:
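def clean_data(x):
    # Lists (genres, keywords, ...): clean every element
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    # Plain strings (director): clean the string itself
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    # Any other type: fall back to an empty string
    return ""

for item in dirty_features:
    df[item] = df[item].apply(clean_data)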

In [19]:
df

Unnamed: 0,genres,id,keywords,overview,production_companies,status,tagline,title,vote_count,vote_average,cast,director
0,"[action, adventure, fantasy, sciencefiction]",19995,"[cultureclash, future, spacewar, spacecolony, ...","In the 22nd century, a paraplegic Marine is di...","[ingeniousfilmpartners, twentiethcenturyfoxfil...",Released,Enter the World of Pandora.,Avatar,11800,7.2,"[samworthington, zoesaldana, sigourneyweaver, ...",jamescameron
1,"[adventure, fantasy, action]",285,"[ocean, drugabuse, exoticisland, eastindiatrad...","Captain Barbossa, long believed to be dead, ha...","[waltdisneypictures, jerrybruckheimerfilms, se...",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,4500,6.9,"[johnnydepp, orlandobloom, keiraknightley, ste...",goreverbinski
2,"[action, adventure, crime]",206647,"[spy, basedonnovel, secretagent, sequel, mi6, ...",A cryptic message from Bond’s past sends him o...,"[columbiapictures, danjaq, b24]",Released,A Plan No One Escapes,Spectre,4466,6.3,"[danielcraig, christophwaltz, léaseydoux, ralp...",sammendes
3,"[action, crime, drama, thriller]",49026,"[dccomics, crimefighter, terrorist, secretiden...",Following the death of District Attorney Harve...,"[legendarypictures, warnerbros., dcentertainme...",Released,The Legend Ends,The Dark Knight Rises,9106,7.6,"[christianbale, michaelcaine, garyoldman, anne...",christophernolan
4,"[action, adventure, sciencefiction]",49529,"[basedonnovel, mars, medallion, spacetravel, p...","John Carter is a war-weary, former military ca...",[waltdisneypictures],Released,"Lost in our world, found in another.",John Carter,2124,6.1,"[taylorkitsch, lynncollins, samanthamorton, wi...",andrewstanton
...,...,...,...,...,...,...,...,...,...,...,...,...
4790,"[drama, foreign]",13898,[],Various women struggle to function in the oppr...,[jafarpanahifilmproductions],Released,,The Circle,17,6.6,"[nargessmamizadeh, maryiampalvinalmani, mojgan...",jafarpanahi
4792,"[crime, horror, mystery, thriller]",36095,"[japan, prostitute, hotel, basedonnovel, hallu...",A wave of gruesome murders is sweeping Tokyo. ...,[daieistudios],Released,Madness. Terror. Murder.,Cure,63,7.4,"[kojiyakusho, masatohagiwara, tsuyoshiujiki, a...",kiyoshikurosawa
4796,"[sciencefiction, drama, thriller]",14337,"[distrust, garage, identitycrisis, timetravel,...",Friends/fledgling entrepreneurs invent a devic...,[thinkfilm],Released,What happens if it actually works?,Primer,658,6.9,"[shanecarruth, davidsullivan, caseygooden, ana...",shanecarruth
4798,"[action, crime, thriller]",9367,"[unitedstates–mexicobarrier, legs, arms, paper...",El Mariachi just wants to play his guitar and ...,[columbiapictures],Released,"He didn't come looking for trouble, but troubl...",El Mariachi,238,6.6,"[carlosgallardo, jaimedehoyos, petermarquardt,...",robertrodriguez


### Build Recommender System

In [20]:
def create_soup(x):
    # Join all features into one space-separated string (at most 10 keywords)
    return ' '.join([
        ' '.join(x['genres']),
        ' '.join(x['keywords'][:10]),
        ' '.join(x['production_companies']),
        ' '.join(x['cast']),
        x['director'],
    ])

df['soup'] = df.apply(create_soup, axis=1)

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Raw term counts
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup'])

# TF-IDF weights, which down-weight terms that appear in many movies
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['soup'])

In [22]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, eval_method):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    scores = list(enumerate(eval_method[idx]))

    # Sort the movies based on the similarity scores
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies, skipping the first
    # entry, which is normally the movie itself
    scores = scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in scores]

    # Return the top 10 most similar movies
    return scores, df['title'].iloc[movie_indices]

In [23]:
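# Reset to a 0..n-1 index so row positions line up with the similarity matrices
df = df.reset_index(drop=True)
# Map each title to its row position
indices = pd.Series(df.index, index=df['title'])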

In [24]:

df.iloc[indices['Street Fighter: The Legend of Chun-Li']].soup

'action adventure sciencefiction thriller martialarts revenge streetfighter basedonvideogame twentiethcenturyfoxfilmcorporation hydeparkfilms kristinkreuk chrisklein nealmcdonough michaelclarkeduncan moonbloodgood robinshou josieho andrzejbartkowiak'

In [25]:
df.iloc[indices['Inception']].soup

'action thriller sciencefiction mystery adventure lossoflover dream kidnapping sleep subconsciousness heist redemption femalehero legendarypictures warnerbros. syncopy leonardodicaprio josephgordon-levitt ellenpage tomhardy kenwatanabe cillianmurphy marioncotillard michaelcaine dileeprao tomberenger petepostlethwaite lukashaas talulahriley tohorumasamune taylorgeare clairegeare johnathangeare yujiokumoto earlcameron ryanhayward mirandanolan russfega timkelleher coraliededykere silvielaguna virgilebramly nicolasclerc jean-micheldagory marcraducci christophernolan'

In [26]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim1 = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
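
For intuition about the values `cosine_similarity` produces, here is a small sketch on three made-up soups (the strings are assumptions, not rows from the dataset). Two soups that share two of their three tokens get a score of 2/3 with raw counts, and soups with no token in common get 0:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy soups (hypothetical values)
soups = [
    "action spy danielcraig",
    "action spy tomcruise",
    "drama romance",
]

# Count vectors over the shared vocabulary
toy_matrix = CountVectorizer().fit_transform(soups)

# sim[i, j] is the cosine of the angle between soup i and soup j
sim = cosine_similarity(toy_matrix, toy_matrix)

print(sim.round(3))
```

The diagonal is 1 because every soup is identical to itself, which is why the recommendation function skips the top-ranked entry.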

In [27]:
def get_movies(movie_name):
    scores1, movies1 = get_recommendations(movie_name, cosine_sim1)
    scores2, movies2 = get_recommendations(movie_name, cosine_sim2)
    
    print("\t \t Using TF-IDF")
    for i, movie in enumerate(movies1):
        print(f'{movie}: \t {scores1[i][1]}')
    print("\t \t Using Count")
    for i, movie in enumerate(movies2):
        print(f'{movie}: \t {scores2[i][1]}')

In [28]:
movie_name = 'Star Wars'
get_movies(movie_name)

	 	 Using TF-IDF
The Empire Strikes Back: 	 0.1751464325721276
Return of the Jedi: 	 0.10144807025057402
Star Wars: Episode III - Revenge of the Sith: 	 0.06868283562135262
Star Wars: Episode II - Attack of the Clones: 	 0.06569805684669368
Raiders of the Lost Ark: 	 0.05870481842481695
The Elephant Man: 	 0.04858540911258092
Star Wars: Episode I - The Phantom Menace: 	 0.04625878156874888
Krull: 	 0.04319052546312355
American Graffiti: 	 0.04269907285426981
Time Bandits: 	 0.03708888453184478
	 	 Using Count
The Empire Strikes Back: 	 0.23094749811403895
Return of the Jedi: 	 0.1464582583879156
Star Wars: Episode III - Revenge of the Sith: 	 0.10888420966802473
Star Wars: Episode II - Attack of the Clones: 	 0.10438335009588318
Raiders of the Lost Ark: 	 0.08772689266130253
Titan A.E.: 	 0.08572408331227328
Street Fighter: The Legend of Chun-Li: 	 0.08399210511316162
Star Wars: Episode I - The Phantom Menace: 	 0.08341483294531393
Star Wars: Clone Wars: Volume 1: 	 0.08271527783091086