In [1]:
import pandas as pd
import numpy as np
import json
import ast

<h2>Load Data</h2>

MovieLens Dataset - https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data
<br/>
This dataset is an ensemble of data collected from TMDB and GroupLens.
The Movie Details, Credits and Keywords have been collected from the TMDB Open API.
<br/>
Other resources: https://grouplens.org/datasets/movielens/latest/
<br/>https://developer.themoviedb.org/docs/getting-started

In [2]:
df_credits = pd.read_csv("../datasets/origin/credits.csv")
df_movies = pd.read_csv("../datasets/origin/movies_metadata.csv")


In [3]:
df_credits.shape, df_movies.shape

((45476, 3), (45466, 24))

<h2>Read Data</h2>

The first dataset contains the following features:
<ul>
<li>cast: The names of the lead and supporting actors.</li>
<li>crew: The names of the director, editor, composer, writer, etc.</li>
<li>id: A unique identifier for each movie.</li>
</ul>

In [4]:
df_credits.dtypes

cast    object
crew    object
id       int64
dtype: object

The second dataset has the following features:
<ul>
<li>adult: Indicates whether the movie is classified as adult content.</li>
<li>belongs_to_collection</li>
<li>budget: The budget with which the movie was made.</li>
<li>genres: The genres of the movie, e.g. Action, Comedy, Thriller.</li>
<li>homepage: A link to the homepage of the movie.</li>
<li>id: The movie ID, matching the id column of the first dataset.</li>
<li>imdb_id: The unique identifier assigned to the movie on the IMDb website.</li>
<li>original_language: The language in which the movie was made.</li>
<li>original_title: The title of the movie before translation or adaptation.</li>
<li>overview: A brief description of the movie.</li>
<li>popularity: A numeric quantity specifying the movie popularity.</li>
<li>poster_path</li>
<li>production_companies: The production house of the movie.</li>
<li>production_countries: The country in which it was produced.</li>
<li>release_date: The date on which it was released.</li>
<li>revenue: The worldwide revenue generated by the movie.</li>
<li>runtime: The running time of the movie in minutes.</li>
<li>spoken_languages</li>
<li>status: "Released" or "Rumored".</li>
<li>tagline: Movie's tagline.</li>
<li>title: Title of the movie.</li>
<li>video: Specifies whether the content is classified as a video or a movie.</li>
<li>vote_average: The average rating the movie received.</li>
<li>vote_count: The number of votes received.</li>
</ul>

In [5]:
df_movies.dtypes

adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object

In [6]:
df_credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [7]:
df_movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


<h2>Data Preprocessing</h2>

<ul>
<li><b>Apply Filters:</b> keep only English-language movies, filter out movies without a runtime, and clean the data by removing duplicate records and handling missing values.</li>
<li><b>Transform Data:</b> convert data types as necessary.</li>
<li><b>Integrate Data:</b> merge the structured data (movie metadata) and the JSON-like credits data into a single structured dataset.</li>
<li><b>Reduce Data:</b> remove stop words, punctuation, and special characters from the text data, and keep only the relevant features of the dataset.</li>
</ul>

In [8]:
# sample for bad id
# row_19730 = df_movies.iloc[19730]
# row_19730

In [9]:
# Drop the rows whose 'id' field holds a date string instead of a numeric ID
df_movies = df_movies.drop([19730, 29503, 35587])
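
The index labels above are hard-coded. As a minimal sketch, the same rows could be located programmatically by coercing the id column to numbers and keeping track of what fails to parse. The toy frame and the names `toy_movies`, `bad_rows`, and `toy_clean` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the movies metadata; the date string mimics a malformed id.
toy_movies = pd.DataFrame({
    "id": ["862", "8844", "1997-08-20", "15602"],
    "title": ["Toy Story", "Jumanji", None, "Grumpier Old Men"],
})

# Non-numeric ids become NaN instead of raising an error.
numeric_id = pd.to_numeric(toy_movies["id"], errors="coerce")

# Index labels of the rows whose id could not be parsed as a number.
bad_rows = toy_movies.index[numeric_id.isna()]

# Drop those rows, then cast the remaining ids to int.
toy_clean = toy_movies.drop(bad_rows).copy()
toy_clean["id"] = toy_clean["id"].astype(int)

print(list(bad_rows), toy_clean["id"].tolist())
```

This avoids depending on row positions, which would shift if the source file were re-exported.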

In [10]:
# Convert 'id' column to int
df_movies['id'] = df_movies['id'].astype('int')

In [11]:
# Merge movies and credits data by using id column
df = df_movies.merge(df_credits, on='id')
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de..."
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de..."


In [12]:
df.shape

(45538, 26)
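
The merged frame has more rows (45538) than either input (45476 and 45466), which usually points to repeated id values on one or both sides of the join. The sketch below uses toy data to show how duplicate keys multiply rows in a merge, and how dropping duplicates first keeps one row per id. Keeping the first occurrence is an assumption here; the frames `left` and `right` are hypothetical.

```python
import pandas as pd

# Toy frames in which id 2 appears twice on both sides.
left = pd.DataFrame({"id": [1, 2, 2], "title": ["A", "B", "B"]})
right = pd.DataFrame({"id": [1, 2, 2], "cast": ["x", "y", "y"]})

# Each duplicate on the left pairs with each duplicate on the right,
# so id 2 contributes 2 * 2 rows to the result.
merged = left.merge(right, on="id")

# Drop repeated ids before merging; validate raises if any key is still duplicated.
left_unique = left.drop_duplicates(subset="id")
right_unique = right.drop_duplicates(subset="id")
merged_unique = left_unique.merge(right_unique, on="id", validate="one_to_one")

print(len(merged), len(merged_unique))
```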

In [13]:
# Kept only English-language movies
df = df[df['original_language'] == 'en']
df.shape

(32316, 26)

In [14]:
# Filtered out movies without a runtime
df = df[df['runtime'] > 0]
df.shape

(31558, 26)

In [15]:
# Convert 'release_date' column to datetime
df['release_date'] = pd.to_datetime(df['release_date'])

In [16]:
# Checking min, max year
min_year = df['release_date'].dt.year.min()
max_year = df['release_date'].dt.year.max()

print(f"Minimum Year: {min_year}")
print(f"Maximum Year: {max_year}")

Minimum Year: 1878.0
Maximum Year: 2018.0
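
The years print as floats (1878.0, 2018.0). This is consistent with release_date containing missing values (the NaN count further down reports 43 of them): when a datetime column contains NaT, `.dt.year` returns a float column. A small sketch on toy dates, with illustrative names, showing the effect and one way to print whole years:

```python
import pandas as pd

# Toy dates with one missing value, mirroring the missing release dates.
dates = pd.to_datetime(pd.Series(["1995-10-30", None, "2017-01-01"]))

# The missing date becomes NaT, so the extracted years are floats.
years = dates.dt.year

# min() and max() skip NaN by default, so the results can be cast to int.
min_year = int(years.min())
max_year = int(years.max())

print(f"Minimum Year: {min_year}")
print(f"Maximum Year: {max_year}")
```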


In [17]:
# Check for null values in the genres, cast, and crew columns
df[df['genres'].isnull()].shape, df[df['cast'].isnull()].shape, df[df['crew'].isnull()].shape

((0, 26), (0, 26), (0, 26))

In [18]:
df.iloc[0]

adult                                                                False
belongs_to_collection    {'id': 10194, 'name': 'Toy Story Collection', ...
budget                                                            30000000
genres                   [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
homepage                              http://toystory.disney.com/toy-story
id                                                                     862
imdb_id                                                          tt0114709
original_language                                                       en
original_title                                                   Toy Story
overview                 Led by Woody, Andy's toys live happily in his ...
popularity                                                       21.946943
poster_path                               /rhIRbceoE9lR4veEXuwCC2wARtG.jpg
production_companies        [{'name': 'Pixar Animation Studios', 'id': 3}]
production_countries     

In [19]:
df.head(10)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de..."
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de..."
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,"[{'cast_id': 25, 'character': 'Lt. Vincent Han...","[{'credit_id': '52fe4292c3a36847f802916d', 'de..."
6,False,,58000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,11860,tt0114319,en,Sabrina,An ugly duckling having undergone a remarkable...,...,127.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,You are cordially invited to the most surprisi...,Sabrina,False,6.2,141.0,"[{'cast_id': 1, 'character': 'Linus Larrabee',...","[{'credit_id': '52fe44959251416c75039da9', 'de..."
7,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,45325,tt0112302,en,Tom and Huck,"A mischievous young boy, Tom Sawyer, witnesses...",...,97.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The Original Bad Boys.,Tom and Huck,False,5.4,45.0,"[{'cast_id': 2, 'character': 'Tom Sawyer', 'cr...","[{'credit_id': '52fe46bdc3a36847f810f797', 'de..."
8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0,"[{'cast_id': 1, 'character': 'Darren Francis T...","[{'credit_id': '52fe44dbc3a36847f80ae0f1', 'de..."
9,False,"{'id': 645, 'name': 'James Bond Collection', '...",58000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",http://www.mgm.com/view/movie/757/Goldeneye/,710,tt0113189,en,GoldenEye,James Bond must unmask the mysterious head of ...,...,130.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,No limits. No fears. No substitutes.,GoldenEye,False,6.6,1194.0,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '52fe426ec3a36847f801e14b', 'de..."


In [20]:
df.iloc[23]['genres']

"[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name': 'Fantasy'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 53, 'name': 'Thriller'}]"

In [21]:
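# Extract genre names as a comma-separated string.
# ast.literal_eval parses the Python-style list literal directly, so it does
# not depend on rewriting single quotes into double quotes.
df['genres'] = df['genres'].apply(
    lambda x: ', '.join(genre['name'] for genre in ast.literal_eval(x)) if isinstance(x, str) else None
)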

df.iloc[23]['genres']

'Drama, Fantasy, Science Fiction, Thriller'

In [22]:
# def normalize_genres(genre_str):
#     return ', '.join(sorted([g.strip() for g in genre_str.split(',')]))

# df['genres'] = df['genres'].apply(normalize_genres)
# df.iloc[23]['genres']

In [23]:
# extract cast names with a comma-separated string

def get_names_from_cast(df):
  names = []
  for index, row in df.iterrows():
    cast_list = ast.literal_eval(row['cast'])
    names_per_row = []
    for item in cast_list:
      names_per_row.append(item['name'])
    names.append(', '.join(names_per_row))
  return names

df['cast'] = get_names_from_cast(df)

df.iloc[23]['cast']

'Mary Steenburgen, Sean Patrick Flanery, Lance Henriksen, Jeff Goldblum, Brandon Smith'

In [24]:
# Extract the names of crew members whose job is Director, as a comma-separated string

def get_directors_from_crew(df):
  directors = []
  for index, row in df.iterrows():
    crew_list = ast.literal_eval(row['crew'])
    director_per_row = []
    for item in crew_list:
      if item['job'] == 'Director':
        director_per_row.append(item['name'])
    directors.append(', '.join(director_per_row))
  return directors

df['director'] = get_directors_from_crew(df)

df.iloc[23]['director']


'Victor Salva'

In [25]:
# Extract the names of crew members whose job is Producer, as a comma-separated string

def get_producers_from_crew(df):
  producers = []
  for index, row in df.iterrows():
    crew_list = ast.literal_eval(row['crew'])
    producer_per_row = []
    for item in crew_list:
      if item['job'] == 'Producer':
        producer_per_row.append(item['name'])
    producers.append(', '.join(producer_per_row))
  return producers

df['producer'] = get_producers_from_crew(df)

df.iloc[23]['producer']


'Roger Birnbaum, Daniel Grodnik'
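
The cast, director, and producer cells above share one pattern: parse the stringified list, optionally filter by job, and join the names. A minimal sketch of a single helper that covers all three with `Series.apply` follows. The function name `extract_names` and the toy frame are hypothetical, and returning an empty string for non-string input is an assumption.

```python
import ast
import pandas as pd

def extract_names(raw, job=None):
    """Join the 'name' fields of a stringified list of dicts.

    If job is given, keep only entries whose 'job' equals it.
    """
    if not isinstance(raw, str):
        return ""  # assumption: treat missing values as "no names"
    items = ast.literal_eval(raw)
    return ", ".join(
        item["name"] for item in items
        if job is None or item.get("job") == job
    )

# Toy rows shaped like the credits columns.
toy = pd.DataFrame({
    "cast": [
        "[{'cast_id': 1, 'name': 'Tom Hanks'}, {'cast_id': 2, 'name': 'Tim Allen'}]",
        "[]",
    ],
    "crew": [
        "[{'job': 'Director', 'name': 'John Lasseter'}, {'job': 'Producer', 'name': 'Bonnie Arnold'}]",
        "[]",
    ],
})

toy["cast_names"] = toy["cast"].apply(extract_names)
toy["director"] = toy["crew"].apply(extract_names, job="Director")
toy["producer"] = toy["crew"].apply(extract_names, job="Producer")
print(toy[["cast_names", "director", "producer"]])
```

Each column is parsed once per call, and the same helper would apply to any other column that stores a list of dictionaries with a name key.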

In [26]:
# Drop the original 'crew' column
df = df.drop('crew', axis=1)

In [27]:
df.shape

(31558, 27)

In [28]:
df.iloc[1]['spoken_languages']

"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]"

In [29]:
# Extract spoken language names as a comma-separated string

def get_names_from_spoken_languages(df):
  names = []
  for index, row in df.iterrows():
    language_list = ast.literal_eval(row['spoken_languages'])
    names_per_row = []
    for item in language_list:
      names_per_row.append(item['name'])
    names.append(', '.join(names_per_row))
  return names

df['spoken_languages'] = get_names_from_spoken_languages(df)

df.iloc[23]['spoken_languages']

'English'

In [30]:
df.iloc[1]['spoken_languages']

'English, Français'

In [31]:
df.iloc[0]['belongs_to_collection']

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

In [32]:
# extract collection's name
def safe_json_extract(x, key='name'):
    if isinstance(x, str):
        try:
            return json.loads(x.replace("'",'"')).get(key)
        except json.JSONDecodeError:
            return None
    return None

df['belongs_to_collection'] = df['belongs_to_collection'].apply(safe_json_extract)

df.iloc[0]['belongs_to_collection']

'Toy Story Collection'
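
One caveat with `safe_json_extract`: rewriting every single quote as a double quote only yields valid JSON when no value contains an apostrophe and no field is `None`. Otherwise the parse fails and the function returns None, which is indistinguishable from a movie with no collection. The sketch below contrasts this with `ast.literal_eval` on a hypothetical record; whether any rows of the real data are affected is not checked here.

```python
import ast
import json

def name_via_json(x, key="name"):
    # Same approach as safe_json_extract: swap quotes, then parse as JSON.
    if isinstance(x, str):
        try:
            return json.loads(x.replace("'", '"')).get(key)
        except json.JSONDecodeError:
            return None
    return None

def name_via_literal_eval(x, key="name"):
    # Parse the string as a Python literal instead of as JSON.
    if isinstance(x, str):
        try:
            value = ast.literal_eval(x)
        except (ValueError, SyntaxError):
            return None
        return value.get(key) if isinstance(value, dict) else None
    return None

# Record taken from the notebook output: both approaches agree.
plain = "{'id': 10194, 'name': 'Toy Story Collection'}"

# Hypothetical record with an apostrophe in the name and a None field.
tricky = "{'id': 1, 'name': \"Ocean's Collection\", 'poster_path': None}"

print(name_via_json(plain), "|", name_via_literal_eval(plain))
print(name_via_json(tricky), "|", name_via_literal_eval(tricky))
```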

In [33]:
# Extract production country names as a comma-separated string

def get_names_from_production_countries(df):
  names = []
  for index, row in df.iterrows():
    country_list = ast.literal_eval(row['production_countries'])
    names_per_row = []
    for item in country_list:
      names_per_row.append(item['name'])
    names.append(', '.join(names_per_row))
  return names

df['production_countries'] = get_names_from_production_countries(df)

df.iloc[23]['production_countries']

'United States of America'

In [34]:
df.iloc[6]['production_countries']

'Germany, United States of America'

In [35]:
# Extract production company names as a comma-separated string

def get_names_from_production_companies(df):
  names = []
  for index, row in df.iterrows():
    company_list = ast.literal_eval(row['production_companies'])
    names_per_row = []
    for item in company_list:
      names_per_row.append(item['name'])
    names.append(', '.join(names_per_row))
  return names

df['production_companies'] = get_names_from_production_companies(df)

df.iloc[23]['production_companies']

'Caravan Pictures, Hollywood Pictures'

In [36]:
df.iloc[6]['production_companies']

'Paramount Pictures, Scott Rudin Productions, Mirage Enterprises, Sandollar Productions, Constellation Entertainment, Worldwide, Mont Blanc Entertainment GmbH'

In [37]:
# Check for NaN values in each column
nan_counts = df.isna().sum()

# Print the count of NaN values for each column
print(nan_counts)

adult                        0
belongs_to_collection    29339
budget                       0
genres                       0
homepage                 25465
id                           0
imdb_id                      9
original_language            0
original_title               0
overview                    42
popularity                   0
poster_path                231
production_companies         0
production_countries         0
release_date                43
revenue                      0
runtime                      0
spoken_languages             0
status                      58
tagline                  13715
title                        0
video                        0
vote_average                 0
vote_count                   0
cast                         0
director                     0
producer                     0
dtype: int64


In [38]:
# Dropped missing overviews to avoid empty descriptions in NLP
df.dropna(subset = ['overview'], inplace=True)
df.shape

(31516, 27)

In [39]:
df.iloc[0]

adult                                                                False
belongs_to_collection                                 Toy Story Collection
budget                                                            30000000
genres                                           Animation, Comedy, Family
homepage                              http://toystory.disney.com/toy-story
id                                                                     862
imdb_id                                                          tt0114709
original_language                                                       en
original_title                                                   Toy Story
overview                 Led by Woody, Andy's toys live happily in his ...
popularity                                                       21.946943
poster_path                               /rhIRbceoE9lR4veEXuwCC2wARtG.jpg
production_companies                               Pixar Animation Studios
production_countries     

In [40]:
# We create a feature called "text_corpus." 
# This feature will include all relevant information from our data, 
# such as the title, genres, cast, director, producer and overview. 
# We will use this new column as the basis for our recommendation system.

def create_text_corpus(x):
    text_corpus = ""
    text_corpus += x['title'] + ", " + x['genres'] + ", " + x['cast'] + ", " + x['director'] + ", " + x['producer'] + ", " + x['overview']
    return text_corpus 
    # return ' '.join(x['title']) +' '.join(x['genres']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['producer']) + ' ' + ' '.join(x['overview'])

df['text_corpus'] = df.apply(create_text_corpus, axis=1)

In [41]:
df['text_corpus'].head()

0    Toy Story, Animation, Comedy, Family, Tom Hank...
1    Jumanji, Adventure, Fantasy, Family, Robin Wil...
2    Grumpier Old Men, Romance, Comedy, Walter Matt...
3    Waiting to Exhale, Comedy, Drama, Romance, Whi...
4    Father of the Bride Part II, Comedy, Steve Mar...
Name: text_corpus, dtype: object
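
Because `create_text_corpus` concatenates every field unconditionally, a field that is an empty string leaves an empty slot (", , ") in the result, and a missing value would raise a TypeError. A minimal sketch of a variant that skips empty or missing parts follows. The helper name `join_nonempty` and the toy rows are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Toy rows: the second has an empty genres field and a missing overview.
toy = pd.DataFrame({
    "title": ["Toy Story", "Satan Triumphant"],
    "genres": ["Animation, Comedy", ""],
    "overview": ["Led by Woody...", None],
})
corpus_columns = ["title", "genres", "overview"]

def join_nonempty(row):
    # Keep only non-empty string fields, then join them with ", ".
    parts = [value for value in row if isinstance(value, str) and value.strip()]
    return ", ".join(parts)

toy["text_corpus"] = toy[corpus_columns].apply(join_nonempty, axis=1)
print(toy["text_corpus"].tolist())
```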

In [42]:
# Check for NaN values in each column
nan_counts = df.isna().sum()

# Print the count of NaN values for each column
print(nan_counts)

adult                        0
belongs_to_collection    29297
budget                       0
genres                       0
homepage                 25424
id                           0
imdb_id                      9
original_language            0
original_title               0
overview                     0
popularity                   0
poster_path                226
production_companies         0
production_countries         0
release_date                43
revenue                      0
runtime                      0
spoken_languages             0
status                      56
tagline                  13674
title                        0
video                        0
vote_average                 0
vote_count                   0
cast                         0
director                     0
producer                     0
text_corpus                  0
dtype: int64


<h2>Saving Cleaned Movies Dataset</h2>

In [43]:
df.to_pickle("../datasets/clean/movies_df.pkl") 

<h2>Load Cleaned Movies Dataset</h2>

In [44]:
df = pd.read_pickle("../datasets/clean/movies_df.pkl")  

In [45]:
df.shape

(31516, 28)

In [46]:
df

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,cast,director,producer,text_corpus
0,False,Toy Story Collection,30000000,"Animation, Comedy, Family",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,"Tom Hanks, Tim Allen, Don Rickles, Jim Varney,...",John Lasseter,"Bonnie Arnold, Ralph Guggenheim","Toy Story, Animation, Comedy, Family, Tom Hank..."
1,False,,65000000,"Adventure, Fantasy, Family",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"Robin Williams, Jonathan Hyde, Kirsten Dunst, ...",Joe Johnston,"Scott Kroopf, William Teitler","Jumanji, Adventure, Fantasy, Family, Robin Wil..."
2,False,Grumpy Old Men Collection,0,"Romance, Comedy",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"Walter Matthau, Jack Lemmon, Ann-Margret, Soph...",Howard Deutch,,"Grumpier Old Men, Romance, Comedy, Walter Matt..."
3,False,,16000000,"Comedy, Drama, Romance",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"Whitney Houston, Angela Bassett, Loretta Devin...",Forest Whitaker,"Ronald Bass, Ezra Swerdlow, Deborah Schindler,...","Waiting to Exhale, Comedy, Drama, Romance, Whi..."
4,False,Father of the Bride Collection,0,Comedy,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"Steve Martin, Diane Keaton, Martin Short, Kimb...",Charles Shyer,Nancy Meyers,"Father of the Bride Part II, Comedy, Steve Mar..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45531,False,,0,Science Fiction,,222848,tt0112613,en,Caged Heat 3000,It's the year 3000 AD. The world's most danger...,...,Released,,Caged Heat 3000,False,3.5,1.0,"Lisa Boyle, Kena Land, Zaneta Polard, Don Yana...",Aaron Osborne,Mike Upton,"Caged Heat 3000, Science Fiction, Lisa Boyle, ..."
45532,False,,0,"Drama, Action, Romance",,30840,tt0102797,en,Robin Hood,"Yet another version of the classic epic, with ...",...,Released,,Robin Hood,False,5.7,26.0,"Patrick Bergin, Uma Thurman, David Morrissey, ...",John Irvin,Sarah Radclyffe,"Robin Hood, Drama, Action, Romance, Patrick Be..."
45535,False,,0,"Action, Drama, Thriller",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,Released,A deadly game of wits.,Betrayal,False,3.8,6.0,"Erika Eleniak, Adam Baldwin, Julie du Page, Ja...",Mark L. Lester,,"Betrayal, Action, Drama, Thriller, Erika Eleni..."
45536,False,,0,,,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,Released,,Satan Triumphant,False,0.0,0.0,"Iwan Mosschuchin, Nathalie Lissenko, Pavel Pav...",Yakov Protazanov,Joseph N. Ermolieff,"Satan Triumphant, , Iwan Mosschuchin, Nathalie..."
