# Supplemental Data - Additional Metadata

1. [Introduction](#intro)
2. [Load & Inspect Metadata Dataset](#metadata-loading)
3. [Clean Metadata Dataset](#metadata-cleaning)

## Introduction <a name="intro"></a>

In this notebook we will be cleaning a dataset from TMDB. This will be the primary source of metadata for our list of ~52k movies from our MovieLens user review text dataset. We verified that the majority of movies from the MovieLens dataset are contained within this dataset.

This dataset has rich information on title, synopsis, year of release, budget, revenue, popularity, original language in which the movie/TV show was produced, production companies, production countries, user vote averages, runtime, release date, tagline, actors & directors. It has a pipe delimiter and a complex structure, with JSON-like objects inside each row. To simplify things, we'll filter for the columns we are interested in.

Link to dataset: https://www.kaggle.com/datasets/kakarlaramcharan/tmdb-data-0920

### Load & Inspect Metadata Dataset <a name="metadata-loading"></a>

In [1]:
# Importing libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

# To read dataset in JSON format
import json

In [2]:
# Read in TMDB file with only the columns we need
metadata_df = pd.read_csv("data/movie_data_tmbd.csv", delimiter="|", usecols=["adult", "budget", "genres", "imdb_id", "original_language", "original_title", "production_companies", "release_date", "revenue", "runtime", "title", "vote_average", "vote_count"])

ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.


This gives us an error, so we will try limiting the number of rows we import to find out roughly where the problem starts.

In [4]:
# Limit rows until we get an error
metadata_df = pd.read_csv("data/movie_data_tmbd.csv", delimiter="|", usecols=["adult", "budget", "genres", "imdb_id", "original_language", "original_title", "production_companies", "revenue", "runtime", "title", "vote_average", "vote_count"], nrows=17000)

We can import up to 17,000 rows before getting an error. If we change nrows to 18,000, we get an error. 
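
One way to narrow this down further is to count the fields in each row and flag the rows whose width differs from the header. The sketch below is only an illustration: `field_count_report` is a hypothetical helper, the in-memory sample stands in for the real file, and we are assuming (not confirming) that inconsistent row widths are what trips up the parser.

```python
import csv
import io


def field_count_report(handle, delimiter="|"):
    """Return the header width and a {line_number: width} map of rows that differ from it."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    odd_rows = {}
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the physical line on which this record ended
            odd_rows[reader.line_num] = len(row)
    return expected, odd_rows


# Tiny in-memory stand-in for the real file (hypothetical data)
sample = io.StringIO("a|b|c\n1|2|3\n4|5\n6|7|8|9\n")
expected, odd_rows = field_count_report(sample)
print(expected, odd_rows)
```

On the real data, the same function could be called on a handle opened with `open(path, newline="")`.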

Let's try using Python's built-in `csv` module instead: we read the file line by line, append each parsed row to a list, and then create a DataFrame from that list.

In [179]:
import csv

lines = []
header = None

# newline="" lets the csv module handle line breaks inside quoted fields
with open("data/movie_data_tmbd.csv", "r", newline="") as f:
    csv_reader = csv.reader(f, delimiter="|")
    for i, line in enumerate(csv_reader):
        if i == 0:
            # The first row holds the column names
            header = line
        else:
            # list.append cannot fail on a parsed row, so no try/except is needed here
            lines.append(line)
                
# Create DataFrame from list
metadata_df = pd.DataFrame(lines, columns=header)

# Filter for columns we need
metadata_df = metadata_df[["adult", "budget", "genres", "imdb_id", "original_language", "original_title", "production_companies", "release_date", "revenue", "runtime", "title", "vote_average", "vote_count", "cast", "directors"]]

In [180]:
# Check file was imported correctly
metadata_df.head()

Unnamed: 0,adult,budget,genres,imdb_id,original_language,original_title,production_companies,release_date,revenue,runtime,title,vote_average,vote_count,cast,directors
0,False,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",tt0055827,fr,Le Caporal épinglé,"[{'id': 16059, 'logo_path': None, 'name': 'Les...",1962-05-23,0,90,The Elusive Corporal,5.9,10,"[{'cast_id': 3, 'character': 'Caporal', 'credi...","[{'credit_id': '52fe4626c3a36847f80ef68b', 'de..."
1,False,0,"[{'id': 18, 'name': 'Drama'}]",tt0055910,fr,Cybèle ou les dimanches de ville d'Avray,"[{'id': 7808, 'logo_path': None, 'name': 'Fidè...",1962-11-12,0,110,Sundays and Cybele,7.4,28,"[{'cast_id': 4, 'character': 'Pierre', 'credit...","[{'credit_id': '52fe4626c3a36847f80ef6c7', 'de..."
2,False,0,"[{'id': 18, 'name': 'Drama'}, {'id': 37, 'name...",tt0056195,en,Lonely Are the Brave,"[{'id': 3810, 'logo_path': None, 'name': 'Joel...",1962-05-24,0,107,Lonely Are the Brave,7.5,70,"[{'cast_id': 1, 'character': 'John W. ""Jack"" B...","[{'credit_id': '52fe4626c3a36847f80ef733', 'de..."
3,False,0,"[{'id': 99, 'name': 'Documentary'}]",tt0072962,fr,Vérités et Mensonges,"[{'id': 36547, 'logo_path': None, 'name': 'SAC...",1975-03-12,0,89,F for Fake,7.5,178,"[{'cast_id': 3, 'character': 'Himself', 'credi...","[{'credit_id': '52fe4626c3a36847f80ef75b', 'de..."
4,False,500000,"[{'id': 18, 'name': 'Drama'}, {'id': 36, 'name...",tt0056196,en,Long Day's Journey Into Night,"[{'id': 77598, 'logo_path': None, 'name': 'Fir...",1962-10-09,0,174,Long Day's Journey Into Night,6.9,32,"[{'cast_id': 1, 'character': 'Mary Tyrone', 'c...","[{'credit_id': '52fe4626c3a36847f80ef791', 'de..."


The errors were resolved! We still have the JSON-like objects stored inside, so we'll need to extract the parts that we will be using for analysis.

In [181]:
# Check the shape
metadata_df.shape

(119938, 15)

In [182]:
# Check dtypes and nulls
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119938 entries, 0 to 119937
Data columns (total 15 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   adult                 119938 non-null  object
 1   budget                119394 non-null  object
 2   genres                119394 non-null  object
 3   imdb_id               119394 non-null  object
 4   original_language     119394 non-null  object
 5   original_title        119394 non-null  object
 6   production_companies  119073 non-null  object
 7   release_date          119073 non-null  object
 8   revenue               119073 non-null  object
 9   runtime               118752 non-null  object
 10  title                 118752 non-null  object
 11  vote_average          118752 non-null  object
 12  vote_count            118752 non-null  object
 13  cast                  118752 non-null  object
 14  directors             118752 non-null  object
dtypes: object(15)

In [183]:
# Check null values in each column
metadata_df.isna().sum()

adult                      0
budget                   544
genres                   544
imdb_id                  544
original_language        544
original_title           544
production_companies     865
release_date             865
revenue                  865
runtime                 1186
title                   1186
vote_average            1186
vote_count              1186
cast                    1186
directors               1186
dtype: int64
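
The null counts grow in steps from left to right, which is the pattern we would expect if some rows in the file were shorter than the header. That is only a guess about this dataset, but the mechanism itself is easy to demonstrate: when `pd.DataFrame` is built from lists of unequal length, the missing trailing cells are filled with nulls. The rows below are made-up stand-ins.

```python
import pandas as pd

header = ["adult", "budget", "title"]
rows = [
    ["False", "0", "Full row"],
    ["False", "500000"],  # one field short
    ["False"],            # two fields short
]

# Short rows are padded on the right with missing values
demo_df = pd.DataFrame(rows, columns=header)
null_counts = demo_df.isna().sum()
print(null_counts)
```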

In [184]:
# Check for duplicates
metadata_df.duplicated().sum()

0

Let's drop the rows with a null `imdb_id`, since we won't be able to easily join to our other dataset without it.

In [185]:
# Create a new DataFrame without the rows that have a null imdb_id
# (passing a list to `subset` also works on older pandas versions)
metadata_df_dropna = metadata_df.dropna(subset=['imdb_id'])

In [186]:
# Sanity check
metadata_df_dropna.isna().sum()

adult                     0
budget                    0
genres                    0
imdb_id                   0
original_language         0
original_title            0
production_companies    321
release_date            321
revenue                 321
runtime                 642
title                   642
vote_average            642
vote_count              642
cast                    642
directors               642
dtype: int64

The duplicate check found nothing to drop, so we'll create a `clean_metadata_df` copy and proceed with the rest of the cleaning.

In [187]:
# Create a copy before altering our dataframe
clean_metadata_df = metadata_df_dropna.copy()

In [188]:
# Check shape

clean_metadata_df.shape

(119394, 15)

### Clean Metadata Dataset <a name="metadata-cleaning"></a>

As mentioned, we found no duplicate rows and dropped the rows with a null `imdb_id`; several other columns still contain nulls. Now we need to extract the appropriate info from the `genres`, `production_companies`, `directors`, and `cast` columns.

In [189]:
import ast

# Convert the 'genres' column to Python objects

clean_metadata_df['genres'] = clean_metadata_df['genres'].apply(lambda x: ast.literal_eval(x) if x is not None and isinstance(x, str) else x)

# Extract the genre names, skipping None values

clean_metadata_df['genre_names'] = clean_metadata_df['genres'].apply(lambda x: [d['name'] for d in x] if x is not None else x)

In [190]:
# Check that it worked

clean_metadata_df.sample()

Unnamed: 0,adult,budget,genres,imdb_id,original_language,original_title,production_companies,release_date,revenue,runtime,title,vote_average,vote_count,cast,directors,genre_names
29511,False,0,"[{'id': 99, 'name': 'Documentary'}]",tt4009728,en,United Skates,"[{'id': 110433, 'logo_path': None, 'name': 'Sw...",2018-04-19,0,89,United Skates,7.8,13,"[{'cast_id': 3, 'character': 'Himself', 'credi...","[{'credit_id': '5aa5e04e0e0a263dc1000b88', 'de...",[Documentary]


In [191]:
# Convert the 'production_companies' column to Python objects

clean_metadata_df['production_companies'] = clean_metadata_df['production_companies'].apply(lambda x: ast.literal_eval(x) if x is not None and isinstance(x, str) else x)

# Extract the company names, adding additional check for list type

clean_metadata_df['company_names'] = clean_metadata_df['production_companies'].apply(lambda x: [d['name'] for d in x] if x is not None and isinstance(x, list) else x)

In [192]:
# Check that it worked

clean_metadata_df.sample()

Unnamed: 0,adult,budget,genres,imdb_id,original_language,original_title,production_companies,release_date,revenue,runtime,title,vote_average,vote_count,cast,directors,genre_names,company_names
25596,False,0,"[{'id': 18, 'name': 'Drama'}]",tt0084390,ja,楢山節考,"[{'id': 5822, 'logo_path': '/qyTbRgCyU9NLKvKai...",1983-04-29,0,130,The Ballad of Narayama,7.2,77,"[{'cast_id': 2, 'character': 'Tatsuhei', 'cred...","[{'credit_id': '52fe45eec3a36847f80e2acd', 'de...",[Drama],"[Toei Company, Ltd.]"


In [193]:
# Convert the 'directors' column to Python objects

clean_metadata_df['directors'] = clean_metadata_df['directors'].apply(lambda x: ast.literal_eval(x) if x is not None and isinstance(x, str) else x)

# Extract the director names
clean_metadata_df['director_names'] = clean_metadata_df['directors'].apply(lambda x: [d['name'] for d in x] if x is not None else x)

In [194]:
# Check that it worked

clean_metadata_df.sample()

Unnamed: 0,adult,budget,genres,imdb_id,original_language,original_title,production_companies,release_date,revenue,runtime,title,vote_average,vote_count,cast,directors,genre_names,company_names,director_names
32859,False,0,"[{'id': 18, 'name': 'Drama'}]",tt1552436,pt,Olhos Azuis,"[{'id': 2425, 'logo_path': None, 'name': 'Coev...",2010-05-28,37470,111,Blue Eyes,6.5,11,"[{'cast_id': 1, 'character': 'Marshall', 'cred...","[{'credit_id': '52fe49a0c3a36847f81a4781', 'de...",[Drama],[Coevos Filmes],[José Joffily]


In [195]:
# Convert the 'cast' column to Python objects

clean_metadata_df['cast'] = clean_metadata_df['cast'].apply(lambda x: ast.literal_eval(x) if x is not None and isinstance(x, str) else x)

# Extract the actor names

clean_metadata_df['actor_names'] = clean_metadata_df['cast'].apply(lambda x: [d['name'] for d in x] if x is not None else x)

In [196]:
# Check that it worked

clean_metadata_df.sample()

Unnamed: 0,adult,budget,genres,imdb_id,original_language,original_title,production_companies,release_date,revenue,runtime,title,vote_average,vote_count,cast,directors,genre_names,company_names,director_names,actor_names
45785,False,0,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",tt0095728,it,Il nido del ragno,"[{'id': 69093, 'logo_path': None, 'name': 'Spl...",1988-08-25,0,86,The Spider Labyrinth,7.0,11,"[{'cast_id': 2, 'character': 'Professor Alan W...","[{'credit_id': '52fe4a32c3a36847f81c0177', 'de...","[Horror, Thriller]","[Splendida, Reteitalia]",[Gianfranco Giagni],"[Roland Wybenga, Claudio Capone, Paola Rinaldi..."
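
The four extraction cells above repeat the same parse-then-extract pattern, so they could be folded into one helper. This is a sketch, not the code we ran: `extract_names` is a hypothetical function, and unlike the cells above it returns `None` for cells that cannot be parsed or are not lists instead of passing them through unchanged.

```python
import ast

import pandas as pd


def extract_names(value, key="name"):
    """Parse a stringified list of dicts and return the values stored under `key`."""
    if isinstance(value, str):
        try:
            value = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return None  # cell is not a valid Python literal
    if not isinstance(value, list):
        return None  # covers missing cells and unexpected types
    return [d[key] for d in value if isinstance(d, dict) and key in d]


# Made-up rows: one valid cell, one missing cell, one malformed cell
demo = pd.DataFrame({
    "genres": [
        "[{'id': 18, 'name': 'Drama'}, {'id': 37, 'name': 'Western'}]",
        None,
        "not a list",
    ],
})
demo["genre_names"] = demo["genres"].apply(extract_names)
print(demo["genre_names"].tolist())
```

With a helper like this, each of the four columns would need only a single `apply` call.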


Great, now we've extracted the needed info from the genre, production company, director, and actor columns and created new columns to store the information. We can drop the other columns from our clean df.

In [197]:
# Drop the redundant messy columns

clean_metadata_df = clean_metadata_df.drop(['genres', 'production_companies', 'cast', 'directors'], axis=1)

In [198]:
# Sanity check

clean_metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119394 entries, 0 to 119937
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   adult              119394 non-null  object
 1   budget             119394 non-null  object
 2   imdb_id            119394 non-null  object
 3   original_language  119394 non-null  object
 4   original_title     119394 non-null  object
 5   release_date       119073 non-null  object
 6   revenue            119073 non-null  object
 7   runtime            118752 non-null  object
 8   title              118752 non-null  object
 9   vote_average       118752 non-null  object
 10  vote_count         118752 non-null  object
 11  genre_names        119394 non-null  object
 12  company_names      119073 non-null  object
 13  director_names     118752 non-null  object
 14  actor_names        118752 non-null  object
dtypes: object(15)
memory usage: 14.6+ MB


Next, we will filter out any titles where `adult` is True, since we won't be using those titles for this analysis. Every column was read in as a string, so we compare against the string `'False'` and keep only those rows.

In [212]:
# Filter to only include rows where 'adult' is 'False'

clean_metadata_df = clean_metadata_df[clean_metadata_df['adult'] == 'False']
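
The quotes around `'False'` matter here: comparing a column of strings against the boolean `False` matches nothing. The sketch below shows this on made-up rows, along with an alternative that first maps the strings to real booleans.

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "adult": ["False", "True", "False"],  # strings, as read by the csv module
})

# A string never equals the boolean False, so this comparison matches no rows
naive_matches = int((demo["adult"] == False).sum())

# Map the strings to booleans; any unexpected value becomes NaN
is_adult = demo["adult"].map({"True": True, "False": False})
non_adult = demo[is_adult.eq(False)]
print(naive_matches, non_adult["title"].tolist())
```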

We'll convert revenue, runtime, and budget to float values.

In [215]:
# Convert 'runtime' column to float

clean_metadata_df['runtime'] = clean_metadata_df['runtime'].astype(float)

In [216]:
# Convert 'budget' column to float

clean_metadata_df['budget'] = clean_metadata_df['budget'].astype(float)

In [217]:
# Convert 'revenue' column to float

clean_metadata_df['revenue'] = clean_metadata_df['revenue'].astype(float)
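
`astype(float)` raises an error if a column contains a string that is not a number. If that ever happens, `pd.to_numeric` with `errors="coerce"` is a more forgiving alternative that turns such values into `NaN`. A minimal sketch on made-up values, converting several columns in one loop:

```python
import pandas as pd

demo = pd.DataFrame({
    "runtime": ["90", None, "110"],
    "budget": ["0", "500000", "n/a"],  # "n/a" is a made-up bad value
})

for col in ["runtime", "budget"]:
    # Unparseable strings and missing values both become NaN
    demo[col] = pd.to_numeric(demo[col], errors="coerce")

print(demo.dtypes)
```

The trade-off is that coercion hides bad values silently, so it is worth comparing null counts before and after the conversion.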

Let's convert `release_date` to datetime and extract the year.

In [220]:
# Convert 'release_date' to datetime

clean_metadata_df['release_date'] = pd.to_datetime(clean_metadata_df['release_date'], errors='coerce')

In [223]:
# Extract just the year and create a new column for it

clean_metadata_df['year'] = clean_metadata_df['release_date'].dt.year
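
Because some dates could not be parsed, the extracted year is stored as a float (for example `1962.0`) so that it can hold `NaN`. If whole-number years are preferred, pandas' nullable `Int64` dtype is one option. This is a sketch on made-up values, not a step we ran above:

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["1962-05-23", "not a date", None]), errors="coerce"
)

year_float = dates.dt.year                 # float64, because of the missing dates
year_int = dates.dt.year.astype("Int64")   # nullable integer, missing values become <NA>
print(year_float.dtype, year_int.dtype)
```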

In [224]:
# Sanity check

clean_metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119073 entries, 0 to 119937
Data columns (total 17 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   adult              119073 non-null  object        
 1   budget             119073 non-null  float64       
 2   imdb_id            119073 non-null  object        
 3   original_language  119073 non-null  object        
 4   original_title     119073 non-null  object        
 5   release_date       118323 non-null  datetime64[ns]
 6   revenue            118752 non-null  float64       
 7   runtime            118752 non-null  float64       
 8   title              118752 non-null  object        
 9   vote_average       118752 non-null  object        
 10  vote_count         118752 non-null  object        
 11  genre_names        119073 non-null  object        
 12  company_names      118752 non-null  object        
 13  director_names     118752 non-null  object       

Let's drop the `adult` and `original_title` columns, since we don't need them. We'll also re-order and rename some of our columns.

In [225]:
# Drop adult column 

clean_metadata_df = clean_metadata_df.drop(['adult'], axis=1)

In [226]:
# Drop original_title column 

clean_metadata_df = clean_metadata_df.drop(['original_title'], axis=1)

In [227]:
# Check 

clean_metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119073 entries, 0 to 119937
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   budget             119073 non-null  float64       
 1   imdb_id            119073 non-null  object        
 2   original_language  119073 non-null  object        
 3   release_date       118323 non-null  datetime64[ns]
 4   revenue            118752 non-null  float64       
 5   runtime            118752 non-null  float64       
 6   title              118752 non-null  object        
 7   vote_average       118752 non-null  object        
 8   vote_count         118752 non-null  object        
 9   genre_names        119073 non-null  object        
 10  company_names      118752 non-null  object        
 11  director_names     118752 non-null  object        
 12  actor_names        118752 non-null  object        
 13  release_year       118323 non-null  float64      

In [228]:
# Rename columns 

clean_metadata_df = clean_metadata_df.rename(columns={
    'director_names': 'directors',
    'actor_names': 'actors', 
    'vote_average': 'rating',
    'vote_count': 'votes',
    'company_names': 'production_companies',
    'genre_names': 'genres',
    'revenue': 'box_office_gross'
})

In [235]:
# Save new column order 

new_column_order = ['imdb_id', 
                    'title', 
                    'actors',
                    'directors',
                    'genres',
                    'original_language',
                    'year',
                    'runtime',   
                    'budget', 
                    'box_office_gross',
                    'production_companies',
                    'votes',
                    'rating'
                   ]

In [236]:
# Change column order 

clean_metadata_df = clean_metadata_df[new_column_order]

In [239]:
# Check 

clean_metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119073 entries, 0 to 119937
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   imdb_id               119073 non-null  object 
 1   title                 118752 non-null  object 
 2   actors                118752 non-null  object 
 3   directors             118752 non-null  object 
 4   genres                119073 non-null  object 
 5   original_language     119073 non-null  object 
 6   year                  118323 non-null  float64
 7   runtime               118752 non-null  float64
 8   budget                119073 non-null  float64
 9   box_office_gross      118752 non-null  float64
 10  production_companies  118752 non-null  object 
 11  votes                 118752 non-null  object 
 12  rating                118752 non-null  object 
dtypes: float64(4), object(9)
memory usage: 12.7+ MB


Let's filter this DataFrame based on the list of IDs that we extracted and saved to a text file in our main MovieLens cleaning notebook.

In [240]:
with open('data/cleaned_csv/final_ids.txt', 'r') as f:
    final_ids = [line.strip() for line in f.readlines()]

In [241]:
# Filter the DataFrame to only include rows whose 'imdb_id' is in the list 'final_ids'
filtered_df = clean_metadata_df[clean_metadata_df['imdb_id'].isin(final_ids)]
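
After a filter like this it can be useful to check how many of the requested IDs were actually found. A minimal sketch with made-up data (the ID `tt9999999` is a hypothetical placeholder for a movie missing from the metadata):

```python
import pandas as pd

metadata = pd.DataFrame({
    "imdb_id": ["tt0055827", "tt0055910", "tt0056195"],
    "title": ["The Elusive Corporal", "Sundays and Cybele", "Lonely Are the Brave"],
})
wanted_ids = ["tt0055827", "tt0056195", "tt9999999"]  # last ID is hypothetical

# Keep only the rows whose ID was requested
filtered = metadata[metadata["imdb_id"].isin(wanted_ids)]

# IDs we asked for but did not find in the metadata
missing_ids = sorted(set(wanted_ids) - set(metadata["imdb_id"]))
print(len(filtered), missing_ids)
```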

In [246]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52078 entries, 0 to 52077
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   imdb_id               52078 non-null  object 
 1   title                 52078 non-null  object 
 2   actors                51764 non-null  object 
 3   directors             52013 non-null  object 
 4   genres                52078 non-null  object 
 5   original_language     51346 non-null  object 
 6   year                  52077 non-null  float64
 7   runtime               52030 non-null  float64
 8   budget                18166 non-null  float64
 9   box_office_gross      21525 non-null  float64
 10  production_companies  50883 non-null  object 
 11  votes                 52078 non-null  int64  
 12  rating                52078 non-null  float64
dtypes: float64(5), int64(1), object(7)
memory usage: 5.2+ MB


Let's now save `filtered_df` to a JSON file so we can preserve the cleaning we've done. We'll be using this to fill in missing values from our main data.

In [247]:
# RUN ONLY ONCE
# Save filtered_df to a JSON file

filtered_df.to_json('data/cleaned_csv/clean_metadata.json', orient='records', lines=True)
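
With `orient='records'` and `lines=True`, each row is written as one JSON object per line, which keeps list-valued columns such as `genres` as real lists rather than strings. The sketch below round-trips a made-up frame through a temporary file to show how it could be read back; the file name is a placeholder.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({
    "imdb_id": ["tt0055827", "tt0055910"],
    "genres": [["Comedy", "Drama"], ["Drama"]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_metadata_demo.json")  # placeholder name
    demo.to_json(path, orient="records", lines=True)
    # lines=True must be passed again when reading the file back
    restored = pd.read_json(path, lines=True)

print(restored["genres"].tolist())
```

Note that `read_json` infers dtypes again on load, so numeric columns may not come back with exactly the dtypes they were saved with.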

Great! Now we have our reasonably clean metadata. We'll open the JSON file in our main notebook to create our final dataframe for EDA and modeling.