# Data Prep for Supplemental IMDB Data

In this notebook, we wrestle with a [massive dataset](https://www.kaggle.com/datasets/ashirwadsangwan/imdb-dataset?select=title.akas.tsv) from Kaggle containing movie information scraped from IMDB. We found the provided IMDB SQL dataset lacking, but we liked the profitability measures we calculated from The Numbers data. The movies in that dataset are therefore our target list: the films whose features we want to fill in.

In [1]:
# Import pandas, obviously...
import pandas as pd

## Get data for movies in our target list

In [2]:
# Load in the basics table
title_basics_df = pd.read_csv('../zippedData/extra.imdb/title.basics.tsv/data.tsv.gz',
                              delimiter='\t',
                              usecols=['tconst', 'titleType', 'primaryTitle', 'originalTitle',
                                       'startYear', 'runtimeMinutes', 'genres'])

In [3]:
title_basics_df.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,1894,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,1892,5,"Animation,Short"


In [4]:
# Load in the ratings table
title_ratings_df = pd.read_csv('../zippedData/extra.imdb/title.ratings.tsv/data.tsv.gz', 
                              delimiter='\t', usecols=['tconst', 'averageRating', 'numVotes'])

In [5]:
title_ratings_df.head(2)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1990
1,tt0000002,5.8,265


In [6]:
# Convert release year to a numeric year (parsed via datetime); unparseable values become NaN
title_basics_df['startYear'] = pd.to_datetime(title_basics_df['startYear'], errors='coerce', format='%Y').dt.year
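
The same coercion can be done without the datetime round-trip. A minimal sketch, assuming the raw `startYear` column holds year strings plus IMDB's `\N` placeholder (the sample values below are illustrative, not taken from the file): `pd.to_numeric` with `errors='coerce'` turns anything unparseable into `NaN`, and the optional cast to the nullable `Int64` dtype avoids years displaying as `2019.0`.

```python
import pandas as pd

# Illustrative raw values: two valid years and IMDB's missing-value marker
raw_years = pd.Series(['1894', '2001', '\\N'])

# Unparseable entries become NaN, so the result is float64
years_float = pd.to_numeric(raw_years, errors='coerce')

# Optional: the nullable integer dtype keeps whole-number years and <NA> for gaps
years_int = years_float.astype('Int64')
```

Either form works with the `>= 2000` filter in the next cell, since comparisons against missing values evaluate to False (or `<NA>`) and those rows are dropped.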

In [7]:
# Filter out titles released before the year 2000 and anything that is not a movie
movies_df = title_basics_df[(title_basics_df['startYear'] >= 2000) & (title_basics_df['titleType'] == 'movie')]

In [8]:
# Merge filtered data with ratings table 
movies_df = movies_df.merge(title_ratings_df, on='tconst', how='left')

In [9]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 309901 entries, 0 to 309900
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          309901 non-null  object 
 1   titleType       309901 non-null  object 
 2   primaryTitle    309901 non-null  object 
 3   originalTitle   309901 non-null  object 
 4   startYear       309901 non-null  float64
 5   runtimeMinutes  309901 non-null  object 
 6   genres          309901 non-null  object 
 7   averageRating   169053 non-null  float64
 8   numVotes        169053 non-null  float64
dtypes: float64(3), object(6)
memory usage: 23.6+ MB
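
The `info()` output shows ratings for only 169,053 of the 309,901 movies, which is what a left join should produce: titles without a row in the ratings table are kept and get `NaN`. A small sketch of how that could be checked explicitly, using made-up rows and IDs. `indicator=True` and `validate` are standard `merge` options, but using them here is an optional extra rather than part of the workflow above.

```python
import pandas as pd

# Made-up rows: one movie with a rating and one without
movies = pd.DataFrame({'tconst': ['ttA', 'ttB'],
                       'primaryTitle': ['Rated film', 'Unrated film']})
ratings = pd.DataFrame({'tconst': ['ttA'],
                        'averageRating': [6.4],
                        'numVotes': [179]})

# validate raises if tconst is not unique in the ratings table;
# indicator adds a '_merge' column describing where each row matched
merged = movies.merge(ratings, on='tconst', how='left',
                      validate='many_to_one', indicator=True)

# Movies that found no rating
unrated = merged[merged['_merge'] == 'left_only']
```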


In [10]:
movies_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0011801,movie,Tötet nicht mehr,Tötet nicht mehr,2019.0,\N,"Action,Crime",,
1,tt0013274,movie,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,2021.0,94,Documentary,6.8,58.0
2,tt0015414,movie,La tierra de los toros,La tierra de los toros,2000.0,60,\N,5.2,16.0
3,tt0035423,movie,Kate & Leopold,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",6.4,87464.0
4,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,2020.0,70,Drama,6.4,179.0
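
The `\N` entries in `runtimeMinutes` and `genres` are IMDB's marker for missing values, which is why `runtimeMinutes` loads as `object` rather than as a number. One way to handle this at load time (not done in the cells above) is the `na_values` argument of `read_csv`. A sketch on a small in-memory sample built from two of the rows shown above:

```python
import io
import pandas as pd

# Two rows from the table above, written as tab-separated text with IMDB's \N marker
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\tgenres\n"
    "tt0011801\tmovie\tTötet nicht mehr\t2019\t\\N\tAction,Crime\n"
    "tt0015414\tmovie\tLa tierra de los toros\t2000\t60\t\\N\n"
)

# Treat the literal string \N as missing while parsing
sample_df = pd.read_csv(io.StringIO(sample_tsv), delimiter='\t', na_values='\\N')
```

With the marker mapped to `NaN`, `runtimeMinutes` parses as a float column and missing genres show up in `isna()` checks instead of as a string.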


Looks great so far! Now let's filter out everything not in our target list. We will need to read in the data from The Numbers and get a target list using that.

In [11]:
# Read in The Numbers dataset
tn_budgets = pd.read_csv('../zippedData/tn.movie_budgets.csv.gz', parse_dates=['release_date'], encoding='UTF-8')

# Get year of release
tn_budgets['release_year'] = tn_budgets['release_date'].dt.year

# Filter out movies released before 2000
tn_clean = tn_budgets[tn_budgets['release_year'] >= 2000]

# Get list of movies whose attributes we want to look at
# I.e. the movies that we actually have profitability measures for
target_list = tn_clean['movie'].values.tolist()

# Normalizer from the other cleaning notebook: replaces the mis-encoded sequences for
# the curly apostrophe, the em dash, and "é" with plain ASCII stand-ins
def normalize_text(text):
    return text.replace('â\x80\x99', "'").replace('â\x80\x94', " ").replace('Ã©', "e")

# Get normalized target list
target_list_normalized = [normalize_text(i) for i in target_list]
target_list_normalized

['Avatar',
 'Pirates of the Caribbean: On Stranger Tides',
 'Dark Phoenix',
 'Avengers: Age of Ultron',
 'Star Wars Ep. VIII: The Last Jedi',
 'Star Wars Ep. VII: The Force Awakens',
 'Avengers: Infinity War',
 "Pirates of the Caribbean: At World's End",
 'Justice League',
 'Spectre',
 'The Dark Knight Rises',
 'Solo: A Star Wars Story',
 'The Lone Ranger',
 'John Carter',
 'Tangled',
 'Spider-Man 3',
 'Captain America: Civil War',
 'Batman v Superman: Dawn of Justice',
 'The Hobbit: An Unexpected Journey',
 'Harry Potter and the Half-Blood Prince',
 'The Hobbit: The Desolation of Smaug',
 'The Hobbit: The Battle of the Five Armies',
 'The Fate of the Furious',
 'Superman Returns',
 'Pirates of the Caribbean: Dead Men Tell No Tales',
 'Quantum of Solace',
 'The Avengers',
 "Pirates of the Caribbean: Dead Man's Chest",
 'Man of Steel',
 'The Chronicles of Narnia: Prince Caspian',
 'The Amazing Spider-Man',
 'Battleship',
 'Transformers: The Last Knight',
 'Jurassic World',
 'Men in Blac
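
The sequences handled by `normalize_text` are what UTF-8 text looks like after it has been decoded as Latin-1. As an alternative to listing each sequence by hand, the sketch below reverses that mis-decoding in general. `repair_mojibake` is a hypothetical helper, not part of the original notebooks, and it restores the real characters (such as `’` and `é`) instead of the ASCII stand-ins used above, so titles may still need the apostrophe mapped to `'` before matching against IMDB.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    Strings that are not mojibake are returned unchanged.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes are
        # not valid UTF-8: in both cases assume it was fine to begin with
        return text


fixed_title = repair_mojibake('Pirates of the Caribbean: At Worldâ\x80\x99s End')
untouched_title = repair_mojibake('Avatar')
```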

In [12]:
# Slice our DataFrame to exclude movies not in our target list
movies_clean = movies_df[movies_df['primaryTitle'].isin(target_list_normalized)]
movies_clean

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes
48,tt0118589,movie,Glitter,Glitter,2001.0,104,"Drama,Music,Romance",2.4,23898.0
69,tt0120166,movie,The Sorcerer's Apprentice,The Sorcerer's Apprentice,2001.0,86,"Adventure,Family,Fantasy",4.2,653.0
80,tt0120630,movie,Chicken Run,Chicken Run,2000.0,84,"Adventure,Animation,Comedy",7.1,201714.0
81,tt0120667,movie,Fantastic Four,Fantastic Four,2005.0,106,"Action,Adventure,Fantasy",5.7,338503.0
83,tt0120679,movie,Frida,Frida,2002.0,123,"Biography,Drama,Romance",7.3,93279.0
...,...,...,...,...,...,...,...,...,...
309635,tt9889072,movie,The Promise,The Promise,2017.0,\N,Drama,,
309684,tt9892546,movie,Aladdin,Aladdin,2020.0,104,"Drama,Musical,Romance",4.5,46.0
309695,tt9893078,movie,Sublime,Sublime,2019.0,93,"Biography,Documentary,Music",8.0,8.0
309758,tt9899880,movie,Columbus,Columbus,2018.0,82,"Comedy,Drama",4.1,352.0


Much more manageable! It looks like we have a few more records than there are titles in our target list, which suggests that some titles match more than one IMDB record. We will investigate that further in a moment. Next, let's figure out how to use the Principals and Names tables to add features for director, writer, and so on.

...This could get tricky.
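
One way to look into the extra records mentioned above: matching on title alone keeps every IMDB movie that shares a name with a target film. The sketch below uses made-up rows and IDs to show how repeated titles could be listed, and how matching on title plus release year would narrow them down. The year-based merge is an assumption about what might work, not something this notebook does.

```python
import pandas as pd

# Made-up IMDB rows: two different films share the title 'Aladdin'
imdb = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'],
                     'primaryTitle': ['Aladdin', 'Aladdin', 'Frida'],
                     'startYear': [2019, 2020, 2002]})

# Made-up target list with release years, as in tn_clean
targets = pd.DataFrame({'movie': ['Aladdin', 'Frida'],
                        'release_year': [2019, 2002]})

# Title-only matching, as in the cell above
by_title = imdb[imdb['primaryTitle'].isin(targets['movie'])]

# Titles that appear more than once after the filter
repeated = by_title[by_title.duplicated('primaryTitle', keep=False)]

# Matching on title and year keeps only the intended film
by_title_and_year = imdb.merge(targets, how='inner',
                               left_on=['primaryTitle', 'startYear'],
                               right_on=['movie', 'release_year'])
```

Release years can differ by a year between sources, so a strict year match may drop some genuine matches.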

## Who was involved?

In [13]:
# Load in Names table
name_basics_df = pd.read_csv('../zippedData/extra.imdb/name.basics.tsv/data.tsv.gz', 
                             delimiter='\t', usecols=['nconst', 'primaryName'])

In [14]:
name_basics_df.head(2)

Unnamed: 0,nconst,primaryName
0,nm0000001,Fred Astaire
1,nm0000002,Lauren Bacall


In [15]:
# Load in Principals table
title_principals_df = pd.read_csv('../zippedData/extra.imdb/title.principals.tsv/data.tsv.gz', 
                              delimiter='\t', usecols=['tconst', 'ordering', 'nconst', 'category'])

In [16]:
title_principals_df.head(2)

Unnamed: 0,tconst,ordering,nconst,category
0,tt0000001,1,nm1588970,self
1,tt0000001,2,nm0005690,director


In [17]:
# Merge Principals with Names table 
people_df = title_principals_df.merge(name_basics_df, on='nconst', how='left')

In [18]:
# Get list of movie IDs ('tconst') whose attributes we want to look at
target_id_list = movies_clean['tconst'].values.tolist()

# Slice our DataFrame to exclude records for movies not in our target list
people_clean = people_df[people_df['tconst'].isin(target_id_list)]
people_clean

Unnamed: 0,tconst,ordering,nconst,category,primaryName
1015780,tt0118589,10,nm0801005,cinematographer,Geoffrey Simpson
1015781,tt0118589,1,nm0001014,actress,Mariah Carey
1015782,tt0118589,2,nm0073160,actor,Eric Benét
1015783,tt0118589,3,nm0066586,actor,Max Beesley
1015784,tt0118589,4,nm0004771,actress,Da Brat
...,...,...,...,...,...
57989129,tt9899880,9,nm6962385,production_designer,Kamyab Aminashayeri
58006461,tt9906218,1,nm0932216,director,Nick Willing
58006462,tt9906218,2,nm0131204,producer,Michele Camarda
58006463,tt9906218,3,nm8687848,composer,Madison Willing


So these are the people involved with the movies we want to know about. We have multiple records for each film, which means we can't just do a simple join. Hmmmm... Let's start with directors.
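
To see why a direct join is a problem, here is a small sketch using a few rows taken from the tables in this notebook. Merging the people records straight onto the movies repeats each film once per person, while collapsing to one row per film first keeps one row per movie.

```python
import pandas as pd

movies = pd.DataFrame({'tconst': ['tt0120630', 'tt0120667'],
                       'primaryTitle': ['Chicken Run', 'Fantastic Four']})
people = pd.DataFrame({'tconst': ['tt0120630', 'tt0120630', 'tt0120667'],
                       'category': ['director', 'director', 'director'],
                       'primaryName': ['Peter Lord', 'Nick Park', 'Tim Story']})

# A direct merge repeats 'Chicken Run' once for each of its directors
naive = movies.merge(people, on='tconst', how='left')

# Collapse to a single delimited string per film, then merge
per_film = (people.groupby('tconst')['primaryName']
                  .agg(', '.join)
                  .reset_index(name='Directors'))
collapsed = movies.merge(per_film, on='tconst', how='left')
```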

#### Directors

In [19]:
# Get dataframe of just directors
directors = people_clean.loc[people_clean['category'] == 'director', ['tconst', 'primaryName']]

# Rename column
directors = directors.rename(columns={'primaryName': 'Directors'})

# Combine multiple directors into one comma-separated string per film
directors = directors.groupby('tconst').agg(','.join)
directors.head()

tconst,Directors
tt0118589,Vondie Curtis-Hall
tt0120166,David Lister
tt0120630,"Peter Lord,Nick Park"
tt0120667,Tim Story
tt0120679,Julie Taymor


Nice! Now we can merge this into our movies_clean dataframe.

In [20]:
# Left join with movies_clean on the left
movies_clean = movies_clean.merge(directors, on='tconst', how='left')
movies_clean.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Directors
0,tt0118589,movie,Glitter,Glitter,2001.0,104,"Drama,Music,Romance",2.4,23898.0,Vondie Curtis-Hall
1,tt0120166,movie,The Sorcerer's Apprentice,The Sorcerer's Apprentice,2001.0,86,"Adventure,Family,Fantasy",4.2,653.0,David Lister
2,tt0120630,movie,Chicken Run,Chicken Run,2000.0,84,"Adventure,Animation,Comedy",7.1,201714.0,"Peter Lord,Nick Park"
3,tt0120667,movie,Fantastic Four,Fantastic Four,2005.0,106,"Action,Adventure,Fantasy",5.7,338503.0,Tim Story
4,tt0120679,movie,Frida,Frida,2002.0,123,"Biography,Drama,Romance",7.3,93279.0,Julie Taymor


Looks great. Who else do we care about?

In [21]:
people_clean['category'].value_counts()

actor                  15599
producer                9804
actress                 9355
writer                  8316
director                6865
composer                3898
cinematographer         3135
editor                  1845
self                     928
production_designer      511
archive_footage           61
archive_sound              1
Name: category, dtype: int64

Let's add producers, writers, and actors. For other jobs we likely don't have enough data.

#### Producers

In [22]:
# Get dataframe of just producers
producers = people_clean.loc[people_clean['category'] == 'producer', ['tconst', 'primaryName']]

# Rename column
producers = producers.rename(columns={'primaryName': 'Producers'})

# Combine multiple producers into one listing per film
producers = producers.groupby('tconst').agg(', '.join)

# Left join with movies_clean on the left
movies_clean = movies_clean.merge(producers, on='tconst', how='left')
movies_clean.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Directors,Producers
0,tt0118589,movie,Glitter,Glitter,2001.0,104,"Drama,Music,Romance",2.4,23898.0,Vondie Curtis-Hall,Laurence Mark
1,tt0120166,movie,The Sorcerer's Apprentice,The Sorcerer's Apprentice,2001.0,86,"Adventure,Family,Fantasy",4.2,653.0,David Lister,"Elizabeth Matthews, Peter H. Matthews"


#### Writers

In [23]:
# Get dataframe of just writers
writers = people_clean.loc[people_clean['category'] == 'writer', ['tconst', 'primaryName']]

# Rename column
writers = writers.rename(columns={'primaryName': 'Writers'})

# Combine multiple writers into one listing per film
writers = writers.groupby('tconst').agg(', '.join)

# Left join with movies_clean on the left
movies_clean = movies_clean.merge(writers, on='tconst', how='left')
movies_clean.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Directors,Producers,Writers
0,tt0118589,movie,Glitter,Glitter,2001.0,104,"Drama,Music,Romance",2.4,23898.0,Vondie Curtis-Hall,Laurence Mark,"Cheryl L. West, Kate Lanier"
1,tt0120166,movie,The Sorcerer's Apprentice,The Sorcerer's Apprentice,2001.0,86,"Adventure,Family,Fantasy",4.2,653.0,David Lister,"Elizabeth Matthews, Peter H. Matthews",Brett Morris


#### Actors

In [24]:
# Get dataframe of just actors, including people listed as 'actor', 'actress', or 'self'
actors = people_clean.loc[people_clean['category'].isin(['actor', 'actress', 'self']),
                          ['tconst', 'primaryName']]

# Rename column
actors = actors.rename(columns={'primaryName': 'Actors'})

# Combine multiple actors into one listing per film
actors = actors.groupby('tconst').agg(', '.join)

# Left join with movies_clean on the left
movies_clean = movies_clean.merge(actors, on='tconst', how='left')
movies_clean.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Directors,Producers,Writers,Actors
0,tt0118589,movie,Glitter,Glitter,2001.0,104,"Drama,Music,Romance",2.4,23898.0,Vondie Curtis-Hall,Laurence Mark,"Cheryl L. West, Kate Lanier","Mariah Carey, Eric Benét, Max Beesley, Da Brat"
1,tt0120166,movie,The Sorcerer's Apprentice,The Sorcerer's Apprentice,2001.0,86,"Adventure,Family,Fantasy",4.2,653.0,David Lister,"Elizabeth Matthews, Peter H. Matthews",Brett Morris,"Robert Davi, Kelly LeBrock, Byron Taylor, Roxa..."
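
The four role sections above repeat the same filter, rename, group, and merge steps. A sketch of how they could be folded into one function follows. `add_role_column` is a hypothetical helper that is not used in this notebook, and the sample rows are taken from the tables above.

```python
import pandas as pd

def add_role_column(movies, people, categories, column_name, sep=', '):
    """Attach one delimited string of names per film for the given categories."""
    selected = people.loc[people['category'].isin(categories),
                          ['tconst', 'primaryName']]
    names = (selected.groupby('tconst')['primaryName']
                     .agg(sep.join)
                     .reset_index(name=column_name))
    # Left join so films with nobody in these categories are kept (as NaN)
    return movies.merge(names, on='tconst', how='left')


movies = pd.DataFrame({'tconst': ['tt0118589', 'tt9899880']})
people = pd.DataFrame({
    'tconst': ['tt0118589', 'tt0118589', 'tt0118589', 'tt9899880'],
    'category': ['actress', 'actor', 'director', 'production_designer'],
    'primaryName': ['Mariah Carey', 'Eric Benét', 'Vondie Curtis-Hall',
                    'Kamyab Aminashayeri'],
})

result = add_role_column(movies, people, ['director'], 'Directors')
result = add_role_column(result, people, ['actor', 'actress', 'self'], 'Actors')
```

Passing the separator as an argument would also make it easy to use the same delimiter for every role column.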


## Clean up columns to match other IMDB data

In [25]:
# Let's drop the two unnecessary columns
movies_clean = movies_clean.drop(columns=['titleType', 'originalTitle'])

In [26]:
# Let's use a dictionary to rename our columns to match our other dataset
rename_dict = {'tconst':'ID', 'primaryTitle':'Title', 'startYear':'Year', 
               'runtimeMinutes':'Runtime', 'genres':'Genres', 'averageRating':'AvgRating',
               'numVotes':'VoteCount'}

movies_clean = movies_clean.rename(columns=rename_dict)

In [27]:
movies_clean

Unnamed: 0,ID,Title,Year,Runtime,Genres,AvgRating,VoteCount,Directors,Producers,Writers,Actors
0,tt0118589,Glitter,2001.0,104,"Drama,Music,Romance",2.4,23898.0,Vondie Curtis-Hall,Laurence Mark,"Cheryl L. West, Kate Lanier","Mariah Carey, Eric Benét, Max Beesley, Da Brat"
1,tt0120166,The Sorcerer's Apprentice,2001.0,86,"Adventure,Family,Fantasy",4.2,653.0,David Lister,"Elizabeth Matthews, Peter H. Matthews",Brett Morris,"Robert Davi, Kelly LeBrock, Byron Taylor, Roxa..."
2,tt0120630,Chicken Run,2000.0,84,"Adventure,Animation,Comedy",7.1,201714.0,"Peter Lord,Nick Park",David Sproxton,"Karey Kirkpatrick, Mark Burton, John O'Farrell","Mel Gibson, Julia Sawalha, Phil Daniels, Lynn ..."
3,tt0120667,Fantastic Four,2005.0,106,"Action,Adventure,Fantasy",5.7,338503.0,Tim Story,Avi Arad,"Mark Frost, Michael France, Stan Lee, Jack Kirby","Ioan Gruffudd, Michael Chiklis, Chris Evans, J..."
4,tt0120679,Frida,2002.0,123,"Biography,Drama,Romance",7.3,93279.0,Julie Taymor,,"Anna Thomas, Hayden Herrera, Clancy Sigal, Dia...","Salma Hayek, Alfred Molina, Geoffrey Rush, Mía..."
...,...,...,...,...,...,...,...,...,...,...,...
6740,tt9889072,The Promise,2017.0,\N,Drama,,,Edwine Dorival,"Dorival Farlone, James-Paul Gannaway, Dorival ...",Mackenson Dorival,"Bekenya Jane Augustin, Lidwine Berthil, Nathan..."
6741,tt9892546,Aladdin,2020.0,104,"Drama,Musical,Romance",4.5,46.0,,"Josh Menning, David L. Walker, Greg Williams",Christina Sussmann,"Natasha Alam, Vitaliy Versace, Yuliya Zelenska..."
6742,tt9893078,Sublime,2019.0,93,"Biography,Documentary,Music",8.0,8.0,Bill Guttentag,"Dave Kaplan, Terry Leonard",Nayeema Raza,"Kenji Easley, Floyd Gaugh, Sublime, Eric Wilson"
6743,tt9899880,Columbus,2018.0,82,"Comedy,Drama",4.1,352.0,Hatef Alimardani,,,"Farhad Aslani, Majid Salehi, Saeed Poursamimi,..."


Awesome! Now let's write this data into a new file to bring into our main data cleaning notebook.

## Save to a new file

In [28]:
movies_clean.to_csv('../cleanedData/imdb_supplemental_cleaned.csv')
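
Note that `to_csv` writes the DataFrame index as an unnamed first column by default, so the file will come back with an extra `Unnamed: 0` column unless it is read with `index_col=0` or written with `index=False`. A small sketch of the difference, using an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

df = pd.DataFrame({'ID': ['tt0118589'], 'Title': ['Glitter']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)
```

Which option is right depends on how the main data cleaning notebook reads the file.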