


# Movie Recommender System

This project builds a **content-based** [**recommender system**](https://en.wikipedia.org/wiki/Recommender_system) for movies.
The data comes from the [TMDB 5000 movie dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata) on Kaggle.

The second part of the project is a website built with [Streamlit](https://streamlit.io/) and deployed on Heroku.







In [1]:
import numpy as np
import pandas as pd
import ast
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

## Exploring our dataset

There are two datasets:
  - movies: features of each movie
  - credits: cast and crew of each movie

In [2]:
movies = pd.read_csv("data/tmdb_5000_movies.csv")
credits = pd.read_csv("data/tmdb_5000_credits.csv")

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


The two datasets are merged into a single DataFrame, `movies`, on the `title` column. This simplifies the rest of the work.
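
One thing to watch with a merge on `title` is that distinct movies sharing a title get matched with each other, so the result can have more rows than either input. The sketch below uses two tiny made-up frames (the ids, titles and cast values are hypothetical, not taken from the dataset) to show the effect and how joining on the id columns avoids it.

```python
import pandas as pd

# Hypothetical toy data: two different movies share the title "Batman".
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title pairs every "Batman" row with every "Batman" row.
by_title = toy_movies.merge(toy_credits, on="title")

# Joining on the id columns keeps exactly one row per movie.
by_id = toy_movies.merge(
    toy_credits.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title))  # 5 rows: 2 x 2 for "Batman" plus 1 for "Avatar"
print(len(by_id))     # 3 rows
```

Whether this matters for the recommender depends on how many titles repeat in the real data; comparing row counts before and after the merge is a quick check.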

In [5]:
movies = movies.merge(credits, on = "title")

In [6]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

The dataset has many features. Only a subset is used for the recommender system:

  - movie_id
  - title
  - overview
  - genres
  - keywords
  - cast
  - crew

In [7]:
movies = movies[["movie_id","title","overview","genres","keywords","cast","crew"]]
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


Next, we check the data for null values and duplicates.

In [8]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [9]:
movies.dropna(inplace = True) #dropping null values
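
Note that `dropna` keeps the original index labels, so after rows are dropped the labels are no longer consecutive. This matters whenever an index label is later used as a position into a NumPy array. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with one missing overview.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["first", None, "third"],
})

cleaned = toy.dropna()
print(list(cleaned.index))      # [0, 2]: label 1 is gone, labels are not renumbered

# reset_index(drop=True) renumbers the rows so labels and positions agree again.
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))   # [0, 1]
```

Resetting the index is one option; looking rows up by position instead of by label is another.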

In [10]:
movies.duplicated().sum()

0

In [11]:
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64



---






## Creating custom features for our Recommender System

 - Genres and keywords: each value is a string holding a list of dictionaries. We extract the `name` of every entry into a plain list, which is easier to turn into features.

 - Cast: we keep the first 3 cast members of each movie.

 - Crew: we keep only the director(s) from each movie's crew list.
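
To make the parsing step concrete, here is a minimal sketch on shortened sample strings. The genre string has the same shape as the raw `genres` column; the crew string and its names are hypothetical and only illustrate the filtering on `job`.

```python
import ast

# Shortened sample in the same shape as the raw 'genres' column.
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw_genres)          # string -> list of dicts
names = [entry["name"] for entry in parsed]
print(names)  # ['Action', 'Adventure']

# Hypothetical crew string: keep only entries whose job is 'Director'.
raw_crew = '[{"job": "Editor", "name": "Person A"}, {"job": "Director", "name": "Person B"}]'
directors = [c["name"] for c in ast.literal_eval(raw_crew) if c["job"] == "Director"]
print(directors)  # ['Person B']
```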

In [12]:
#helper function to create features

def convert_to_list(genre):
  genre_list = []
  for ele in ast.literal_eval(genre):
    genre_list.append(ele['name'])

  return genre_list

def top_3_cast(cast_list):
  # keep the names of the first three cast members
  return [member['name'] for member in ast.literal_eval(cast_list)[:3]]

def fetch_director(crew_list):
  crew = []
  for i in ast.literal_eval(crew_list):
    if i['job'] == 'Director':
      crew.append(i['name'])

  return crew

In [13]:
movies['genres'] = movies['genres'].apply(convert_to_list)

In [14]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [15]:
movies['keywords'] = movies['keywords'].apply(convert_to_list)
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [16]:
movies['cast'] = movies['cast'].apply(top_3_cast)
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [17]:
movies['crew'] = movies['crew'].apply(fetch_director)
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


Next, we remove the spaces inside each name or phrase so that it becomes a single unique token.

For example, Robert Pattinson and Robert Downey Jr are different people and should be treated as different entities. If the spaces are kept, both names produce the token 'Robert', which may suggest to the recommender system that the two are related.

To avoid this, we convert Robert Pattinson to RobertPattinson.

In [18]:
#helper function to create unique tokens

def remove_space(obj):

  tokens = []
  for i in obj:
    tokens.append(i.replace(" ",""))
  return tokens


In [19]:
movies['cast'] = movies['cast'].apply(remove_space)
movies['crew'] = movies['crew'].apply(remove_space)
movies['genres'] = movies['genres'].apply(remove_space)
movies['keywords'] = movies['keywords'].apply(remove_space)

We create a single `description` feature for each movie by combining overview, genres, keywords, cast and crew. The overview is first split into a list of words so that all five columns are lists and can be concatenated.

In [20]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [21]:
movies['description'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [22]:
df = movies.drop(columns=['overview','genres','keywords','cast','crew'])

df['description'] = df['description'].apply(lambda x: " ".join(x))
df.head()

Unnamed: 0,movie_id,title,description
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [23]:
# A sample movie description

df.iloc[100].description

'Tells the story of Benjamin Button, a man who starts aging backwards with bizarre consequences. Fantasy Drama Thriller Mystery Romance diary navy funeral tea travel hospital CateBlanchett BradPitt TildaSwinton DavidFincher'



---






## Building the Recommender System

The description feature is converted into count vectors using `CountVectorizer`. These vectors are used to compute cosine similarity scores, and the 5 movies with the highest scores are recommended.

All words in the description are stemmed, and English stop words are removed during vectorization, since they carry little information about a movie.

Because the combined descriptions contain a large vocabulary, only the 5000 most frequent words in the corpus are kept when vectorizing. This reduces the dimensionality of the problem.
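
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on which words two descriptions share rather than on how long they are. A minimal sketch on a hypothetical three-document corpus (the documents are made up and stand in for the `description` column):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for the 'description' column.
docs = [
    "space war future alien",
    "space colony alien planet",
    "romance ship ocean",
]

toy_cv = CountVectorizer(max_features=5000, stop_words="english")
toy_vectors = toy_cv.fit_transform(docs).toarray()   # one row of word counts per document

toy_similarity = cosine_similarity(toy_vectors)      # 3 x 3 matrix, 1.0 on the diagonal
print(toy_similarity.round(2))
```

The first two documents share two of their four words, giving a score of 0.5, while the third shares none and scores 0 against both.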

In [24]:
stemmer = PorterStemmer()

In [25]:
def stemming(desc):
    stemmed = []
    for word in desc.split():
        stemmed.append(stemmer.stem(word))
    return " ".join(stemmed)

In [26]:
df['description'] = df['description'].apply(stemming)

In [36]:
df.head()

Unnamed: 0,movie_id,title,description
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ to be dead, ha c..."
2,206647,Spectre,a cryptic messag from bond’ past send him on a...
3,49026,The Dark Knight Rises,follow the death of district attorney harvey d...
4,49529,John Carter,"john carter is a war-weary, former militari ca..."


In [28]:
cv = CountVectorizer(max_features=5000,stop_words='english') # Vectorizing

vector = cv.fit_transform(df['description']).toarray()

In [29]:
vector.shape   # 4806 count vectors, one per description, each with 5000 dimensions


(4806, 5000)

In [30]:
similarity = cosine_similarity(vector)
similarity

array([[1.        , 0.08346223, 0.0860309 , ..., 0.04499213, 0.        ,
        0.        ],
       [0.08346223, 1.        , 0.06063391, ..., 0.02378257, 0.        ,
        0.02615329],
       [0.0860309 , 0.06063391, 1.        , ..., 0.02451452, 0.        ,
        0.        ],
       ...,
       [0.04499213, 0.02378257, 0.02451452, ..., 1.        , 0.03962144,
        0.04229549],
       [0.        , 0.        , 0.        , ..., 0.03962144, 1.        ,
        0.08714204],
       [0.        , 0.02615329, 0.        , ..., 0.04229549, 0.08714204,
        1.        ]])

In [31]:
similarity[1]  # cosine similarity between the movie at position 1 and every movie in the dataset

array([0.08346223, 1.        , 0.06063391, ..., 0.02378257, 0.        ,
       0.02615329])

In [32]:
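# Function to recommend movies
def recommend(movie):

    # Use the row position, not the index label: dropna() left gaps in the
    # labels, while `similarity` is indexed by position
    matches = np.flatnonzero((df['title'] == movie).to_numpy())
    if len(matches) == 0:
        print(f"{movie!r} was not found in the dataset")
        return
    index = matches[0]

    # Pair each position with its score and sort by score, highest first
    scores = sorted(enumerate(similarity[index]), reverse=True, key=lambda x: x[1])

    # Skip the first entry (normally the movie itself) and print the next 5 titles
    for position, _ in scores[1:6]:
        print(df.iloc[position].title)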
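
The ranking step can also be sketched with NumPy alone. The similarity matrix, the titles and the helper name `top_similar` below are hypothetical stand-ins, not part of the notebook; the sketch excludes the query movie explicitly instead of assuming it always ranks first.

```python
import numpy as np

# Hypothetical 4 x 4 similarity matrix and matching titles (row i <-> titles[i]).
titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
])

def top_similar(position, k=2):
    """Return the k titles most similar to the row at `position`, excluding itself."""
    order = np.argsort(sim[position])[::-1]          # positions sorted by descending score
    order = [i for i in order if i != position][:k]  # drop the query movie explicitly
    return [titles[i] for i in order]

print(top_similar(0))  # ['C', 'B']
```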
        

### Sample movie recommendations 

In [33]:
recommend("Titanic")

The Notebook
Under the Same Moon
Ghost Ship
The Bounty
Pirates of the Caribbean: On Stranger Tides


In [34]:
recommend("Jurassic Park")

Jurassic World
The Lost World: Jurassic Park
Jurassic Park III
Sea Rex 3D: Journey to a Prehistoric World
Land of the Lost


In [35]:
recommend("Troy")

About Schmidt
1982
Kingdom of Heaven
I Served the King of England
The Legend of Hercules


In [37]:
with open('movie_list.pkl', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(df, f)

In [38]:
with open('similarity.pkl', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(similarity, f)
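
The pickled files are presumably what the Streamlit app reads at start-up. A minimal round-trip sketch, using small stand-in objects and a temporary file (both hypothetical) rather than the real `df` and `similarity`:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in objects; in the notebook these would be `df` and `similarity`.
toy_titles = ["A", "B"]
toy_sim = [[1.0, 0.5], [0.5, 1.0]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    # Write with a context manager so the file handle is closed promptly.
    with open(path, "wb") as f:
        pickle.dump((toy_titles, toy_sim), f)
    # Read it back, as a separate app would do before serving recommendations.
    with open(path, "rb") as f:
        loaded_titles, loaded_sim = pickle.load(f)

print(loaded_titles == toy_titles and loaded_sim == toy_sim)  # True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.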

--------------------