# Data Cleaning

This notebook covers initial data preparation and cleaning for datasets from The Numbers, TMDB, and Box Office Mojo. An additional large movie dataset from Kaggle is explored at the end.

### To-do list

- Read and load data files into dataframes
- Combine dataframes as needed (most likely matching on movie title; check for other possible keys)
- Clean up null/missing values

In [127]:
import pandas as pd 
from zipfile import ZipFile 
import fuzzywuzzy
import rapidfuzz


## The Numbers

In [207]:
# Load data from The Numbers into dataframe
df_tn = pd.read_csv('../data/processed/tn.movie_budgets.csv', sep=';')
df_tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [227]:
df_tn[:20]

Unnamed: 0,id,release_date,title,production_budget,domestic_gross,worldwide_gross,year,month,day
0,1,2009-12-18,Avatar,"$425,000,000","$760,507,625","$2,776,345,279",2009,12,18
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",2011,5,20
2,3,2019-06-07,Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350",2019,6,7
3,4,2015-05-01,Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",2015,5,1
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",2017,12,15
5,6,2015-12-18,Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,053,311,220",2015,12,18
6,7,2018-04-27,Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,134,200",2018,4,27
7,8,2007-05-24,Pirates of the Caribbean: At World’s End,"$300,000,000","$309,420,425","$963,420,425",2007,5,24
8,9,2017-11-17,Justice League,"$300,000,000","$229,024,295","$655,945,209",2017,11,17
9,10,2015-11-06,Spectre,"$300,000,000","$200,074,175","$879,620,923",2015,11,6
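
The three money columns are stored as strings such as `$425,000,000`, so they need converting before any arithmetic can be done on them. Below is a minimal sketch of one way to do that. The helper name `parse_dollars` is hypothetical and is not part of the original notebook, and the sketch assumes every value follows the `$1,234,567` pattern.

```python
def parse_dollars(value: str) -> int:
    """Convert a string like '$425,000,000' to the integer 425000000."""
    # Strip the dollar sign and thousands separators, then cast to int
    return int(value.replace("$", "").replace(",", "").strip())


# Sample values taken from the table above
print(parse_dollars("$425,000,000"))
print(parse_dollars("$42,762,350"))
```

If the assumption holds, the helper could be applied column-wise, for example `df_tn['production_budget'].apply(parse_dollars)`.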


In [210]:
# Create a mapping of month names to numbers
month_mapping = {
    "Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04",
    "May": "05", "Jun": "06", "Jul": "07", "Aug": "08",
    "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"
}

In [211]:
# Define a function to replace the month name with a number in each string
def replace_month_name(date_string):
    for month, num in month_mapping.items():
        date_string = date_string.replace(month, num)
    return date_string

In [212]:
# Apply the function to the 'release_date' to replace month names with numbers
df_tn['release_date'] = df_tn['release_date'].apply(replace_month_name)

In [213]:
# And finally, convert the 'release_date' column into datetime
df_tn['release_date'] = pd.to_datetime(df_tn['release_date'], format='%m %d, %Y')
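
The month-name replacement above may not be strictly necessary, since `pd.to_datetime` understands the `%b` directive for abbreviated month names. The following is a sketch only, assuming the raw values look like `Dec 18, 2009` (inferred from the format string used above) and that the session uses an English locale.

```python
import pandas as pd

# Hypothetical raw values in the assumed 'Mon D, YYYY' layout
raw_dates = pd.Series(["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"])

# %b matches abbreviated month names, so no manual mapping is needed
parsed = pd.to_datetime(raw_dates, format="%b %d, %Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```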

In [214]:
# Split the newly converted 'release_date' column into separate columns 'year', 'month', and 'day' 
df_tn['year'] = df_tn['release_date'].dt.year 
df_tn['month'] = df_tn['release_date'].dt.month 
df_tn['day'] = df_tn['release_date'].dt.day 

In [215]:
df_tn['year'].value_counts()[:10]

year
2015    338
2010    274
2008    264
2006    260
2014    255
2011    254
2009    239
2013    238
2012    235
2005    223
Name: count, dtype: int64

In [226]:
# Rename 'movie' column to 'title' so it matches the TMDb table
df_tn.rename(columns={'movie': 'title'}, inplace=True)

#### Columns still needed from other sources

- runtime
- genre
- studio?

If time allows, do feature engineering on actors, directors, and writers.

#### Merging df_tn and df_tmdb on movie title

In [252]:
# Merge df_tn and df_tmdb on movie title
# (df_tmdb is prepared in the TMDB section below, which must be run first)
df_tn_tmdb = pd.merge(df_tn, df_tmdb, how='inner', on='title')
df_tn_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2320 entries, 0 to 2319
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 2320 non-null   int64         
 1   release_date       2320 non-null   datetime64[ns]
 2   title              2320 non-null   object        
 3   production_budget  2320 non-null   object        
 4   domestic_gross     2320 non-null   object        
 5   worldwide_gross    2320 non-null   object        
 6   year               2320 non-null   int32         
 7   month              2320 non-null   int32         
 8   day                2320 non-null   int32         
 9   genre_ids          2320 non-null   object        
 10  original_language  2320 non-null   object        
dtypes: datetime64[ns](1), int32(3), int64(1), object(6)
memory usage: 172.3+ KB


In [256]:
df_tn_tmdb[df_tn_tmdb.duplicated(['title'],keep=False)][:30]

Unnamed: 0,id,release_date,title,production_budget,domestic_gross,worldwide_gross,year,month,day,genre_ids,original_language
4,9,2017-11-17,Justice League,"$300,000,000","$229,024,295","$655,945,209",2017,11,17,"[28, 12, 14, 878]",en
5,9,2017-11-17,Justice League,"$300,000,000","$229,024,295","$655,945,209",2017,11,17,"[28, 12, 14, 878]",en
6,10,2015-11-06,Spectre,"$300,000,000","$200,074,175","$879,620,923",2015,11,6,"[28, 12, 80]",en
7,10,2015-11-06,Spectre,"$300,000,000","$200,074,175","$879,620,923",2015,11,6,"[28, 12, 80]",en
20,27,2012-05-04,The Avengers,"$225,000,000","$623,279,547","$1,517,935,897",2012,5,4,"[878, 28, 12]",en
28,39,2010-05-14,Robin Hood,"$210,000,000","$105,487,148","$322,459,006",2010,5,14,"[28, 12, 18]",en
29,39,2010-05-14,Robin Hood,"$210,000,000","$105,487,148","$322,459,006",2010,5,14,"[12, 16, 10751]",en
30,39,2010-05-14,Robin Hood,"$210,000,000","$105,487,148","$322,459,006",2010,5,14,"[12, 28, 53]",en
31,42,2018-02-16,Black Panther,"$200,000,000","$700,059,566","$1,348,258,224",2018,2,16,"[28, 16]",en
32,42,2018-02-16,Black Panther,"$200,000,000","$700,059,566","$1,348,258,224",2018,2,16,"[28, 12, 14, 878]",en
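
Some of these duplicates (Robin Hood, for example, appears with three different genre lists) look like distinct TMDb entries that happen to share a title. One possible way to reduce such false matches is to merge on both title and release year. The sketch below uses small toy frames with a few values borrowed from the table above; the second Robin Hood year is made up for illustration, and the approach assumes the TMDb frame keeps a `year` column, which the notebook currently drops before merging.

```python
import pandas as pd

# Toy stand-ins for df_tn and df_tmdb (hypothetical rows, not the real data)
tn = pd.DataFrame({"title": ["Robin Hood", "Spectre"], "year": [2010, 2015]})
tmdb = pd.DataFrame({
    "title": ["Robin Hood", "Robin Hood", "Spectre"],
    "year": [2010, 2018, 2015],
    "genre_ids": ["[28, 12, 18]", "[12, 28, 53]", "[28, 12, 80]"],
})

# Title-only merge: every same-titled TMDb row is attached
by_title = pd.merge(tn, tmdb[["title", "genre_ids"]], how="inner", on="title")

# Title-and-year merge: only rows whose year also agrees are kept
by_title_year = pd.merge(tn, tmdb, how="inner", on=["title", "year"])

print(len(by_title), len(by_title_year))
```

Release years can differ slightly between sources, so a strict year match may also drop some genuine pairs.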


In [254]:
# Count unique titles; fewer than the 2320 merged rows means some titles are duplicated
df_tn_tmdb['title'].nunique()

1900

In [257]:
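# Drop duplicate titles, keeping the first occurrence of each
# Note: this also discards distinct films that share a title (e.g. Robin Hood)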
df_tn_tmdb.drop_duplicates(subset='title', inplace=True)
df_tn_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1900 entries, 0 to 2319
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 1900 non-null   int64         
 1   release_date       1900 non-null   datetime64[ns]
 2   title              1900 non-null   object        
 3   production_budget  1900 non-null   object        
 4   domestic_gross     1900 non-null   object        
 5   worldwide_gross    1900 non-null   object        
 6   year               1900 non-null   int32         
 7   month              1900 non-null   int32         
 8   day                1900 non-null   int32         
 9   genre_ids          1900 non-null   object        
 10  original_language  1900 non-null   object        
dtypes: datetime64[ns](1), int32(3), int64(1), object(6)
memory usage: 155.9+ KB


In [258]:
df_tn_tmdb.to_csv('../data/processed/tn_tmdb_merged.csv', index=False)

---

## Box Office Mojo

In [6]:
# Load Box Office Mojo data into dataframe
df_bom = pd.read_csv('../data/raw/bom.movie_gross.csv.gz', compression='gzip')
df_bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
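
`foreign_gross` is loaded as `object` rather than a numeric type, which suggests that some values are not plain numbers. Here is a hedged sketch of a conversion, assuming the offending values contain thousands separators (this has not been checked against the file). Anything that still cannot be parsed becomes `NaN`.

```python
import pandas as pd

# Hypothetical sample: two plain values, one with a separator, one missing
foreign = pd.Series(["664300000", "960500000", "1,131.6", None])

# Remove separators literally (regex=False), then coerce to a numeric dtype
cleaned = pd.to_numeric(foreign.str.replace(",", "", regex=False), errors="coerce")
print(cleaned.tolist())
```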


In [219]:
df_bom.loc[df_bom['title'].str.contains('Harry Potter')]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
328,Harry Potter and the Deathly Hallows Part 2,WB,381000000.0,960500000,2011


In [220]:
df_bom['year'].value_counts()

year
2015    450
2016    436
2012    400
2011    399
2014    395
2013    350
2010    328
2017    321
2018    308
Name: count, dtype: int64

In [221]:
df_tmdb['title'].isin(df_bom['title']).value_counts()

title
False    21357
True      2681
Name: count, dtype: int64

In [222]:
df_tmdb['original_title'].isin(df_bom['title']).value_counts()

original_title
False    21779
True      2259
Name: count, dtype: int64

In [223]:
df_bom['title'].isin(df_tmdb['title']).value_counts()

title
True     2299
False    1088
Name: count, dtype: int64

In [224]:
df_bom['title'].isin(df_tmdb['original_title']).value_counts()

title
True     1938
False    1449
Name: count, dtype: int64
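
Exact `isin` matching misses titles that differ only in case, punctuation, or accents (a curly versus a straight apostrophe, for instance). One option is to normalize titles before comparing them. This is a sketch only; `normalize_title` is a hypothetical helper that does not appear in the original notebook.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Return a lowercase, accent-free, punctuation-free version of a title."""
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace runs of non-word characters (and underscores) with one space
    text = re.sub(r"[\W_]+", " ", text.lower())
    return text.strip()


print(normalize_title("Pirates of the Caribbean: At World’s End"))
```

The normalized strings could be stored in a separate key column in each dataframe and used for `isin` checks or merges, leaving the display titles untouched.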

---

## TMDB

In [245]:
# Load TMDb data into dataframe
df_tmdb = pd.read_csv('../data/raw/tmdb.movies.csv.gz', compression='gzip')
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [246]:
# Convert release_date column to datetime
df_tmdb['release_date'] = pd.to_datetime(df_tmdb['release_date'], format='%Y-%m-%d')

In [106]:
df_tmdb.sample(10)

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
6385,6385,"[18, 10751]",154537,en,Arcadia,1.365,2012-10-13,Arcadia,6.1,11
4167,4167,[],410534,en,Path to Glory: The Rise and Rise of the Polish...,0.827,2011-04-01,Path to Glory: The Rise and Rise of the Polish...,10.0,1
17,17,"[10402, 10749]",52371,nl,LelleBelle,18.586,2010-10-09,LelleBelle,4.5,36
9722,9722,[18],82527,en,Dirty Dancing,0.806,2013-12-05,Dirty Dancing,9.3,6
5145,5145,[53],64767,en,Deadrise,0.6,2011-04-05,Deadrise,1.5,2
23984,23984,"[35, 10402]",353616,en,Pitch Perfect 3,14.89,2017-12-22,Pitch Perfect 3,6.5,1884
18820,18820,[99],379500,en,Under the Gun,1.424,2016-05-13,Under the Gun,5.8,11
4931,4931,[99],240863,pt,Laura,0.6,2011-10-14,Laura,5.0,1
12091,12091,[27],336294,en,American Guinea Pig: Bouquet of Guts and Gore,2.148,2015-05-05,American Guinea Pig: Bouquet of Guts and Gore,4.7,13
17653,17653,"[18, 80, 53]",393717,fr,Braqueurs,8.826,2016-09-24,The Crew,6.2,134


In [109]:
type(df_tmdb['genre_ids'][4166])

str
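
Because `genre_ids` is stored as a string such as `'[28, 12, 14, 878]'`, it has to be parsed before individual genres can be counted or filtered. Below is a minimal sketch using the standard library's `ast.literal_eval`; the helper name `parse_genre_ids` is an assumption introduced here for illustration.

```python
import ast
from typing import List


def parse_genre_ids(raw: str) -> List[int]:
    """Turn a string like '[28, 12]' into the list [28, 12]."""
    # literal_eval only evaluates Python literals, so it is safer than eval
    return list(ast.literal_eval(raw))


print(parse_genre_ids("[28, 12, 14, 878]"))
print(parse_genre_ids("[]"))
```

Applied with `df_tmdb['genre_ids'].apply(parse_genre_ids)`, this would give a column of real lists.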

In [110]:
df_tmdb.loc[df_tmdb['genre_ids'] == '[]']

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
517,517,[],31059,ru,Наша Russia: Яйца судьбы,3.867,2010-01-21,Nasha Russia: Yaytsa sudby,4.3,25
559,559,[],151316,en,Shrek’s Yule Log,3.424,2010-12-07,Shrek’s Yule Log,4.7,9
589,589,[],75828,en,Erratum,3.154,2010-09-16,Erratum,6.6,7
689,689,[],150782,en,Bikini Frankenstein,2.625,2010-01-18,Bikini Frankenstein,6.0,4
731,731,[],200946,en,Weakness,2.451,2010-10-24,Weakness,4.5,2
...,...,...,...,...,...,...,...,...,...,...
26495,26495,[],556601,en,Recursion,0.600,2018-08-28,Recursion,2.0,1
26497,26497,[],514045,en,The Portuguese Kid,0.600,2018-02-14,The Portuguese Kid,2.0,1
26498,26498,[],497839,en,The 23rd Annual Critics' Choice Awards,0.600,2018-01-11,The 23rd Annual Critics' Choice Awards,2.0,1
26500,26500,[],561932,en,Two,0.600,2018-02-04,Two,1.0,1


In [247]:
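# Drop rows with no genre information; .copy() avoids SettingWithCopyWarning when columns are added later
df_tmdb = df_tmdb.loc[df_tmdb['genre_ids'] != '[]'].copy()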
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24038 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Unnamed: 0         24038 non-null  int64         
 1   genre_ids          24038 non-null  object        
 2   id                 24038 non-null  int64         
 3   original_language  24038 non-null  object        
 4   original_title     24038 non-null  object        
 5   popularity         24038 non-null  float64       
 6   release_date       24038 non-null  datetime64[ns]
 7   title              24038 non-null  object        
 8   vote_average       24038 non-null  float64       
 9   vote_count         24038 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(3), object(4)
memory usage: 2.0+ MB


In [248]:
# Split the 'release_date' column into separate columns 'year', 'month', and 'day' 
df_tmdb['year'] = df_tmdb['release_date'].dt.year 
df_tmdb['month'] = df_tmdb['release_date'].dt.month 
df_tmdb['day'] = df_tmdb['release_date'].dt.day 

In [249]:
df_tmdb['year'].value_counts()[:15]

year
2016    2953
2015    2952
2017    2944
2013    2793
2014    2782
2018    2480
2011    2407
2012    2330
2010    2114
2019      63
2009      40
2008      14
1986      10
2004      10
2005       9
Name: count, dtype: int64

In [117]:
df_tmdb['title'].isin(df_tn['movie']).value_counts()

title
False    21799
True      2239
Name: count, dtype: int64

In [118]:
df_tmdb['original_title'].isin(df_tn['movie']).value_counts()

original_title
False    21858
True      2180
Name: count, dtype: int64

In [250]:
df_tmdb.drop(columns=['id', 'Unnamed: 0', 'original_title', 'popularity', 'vote_average', 'vote_count', 'release_date', 'year', 'month', 'day'], inplace=True)

In [251]:
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24038 entries, 0 to 26516
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   genre_ids          24038 non-null  object
 1   original_language  24038 non-null  object
 2   title              24038 non-null  object
dtypes: object(3)
memory usage: 751.2+ KB


---

### Kaggle Movies dataset

Source: https://www.kaggle.com/datasets/mohammedalsubaie/movies

In [192]:
df_movies = pd.read_csv('../data/raw/movies.csv')

In [193]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 18 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   movie_id              1048575 non-null  int64  
 1   title                 1048562 non-null  object 
 2   vote_average          1048575 non-null  float64
 3   vote_count            1048575 non-null  int64  
 4   status                1048575 non-null  object 
 5   release_date          899153 non-null   object 
 6   revenue               1048575 non-null  int64  
 7   adult                 1048575 non-null  bool   
 8   budget                1048575 non-null  int64  
 9   imdb_id               576991 non-null   object 
 10  original_language     1048575 non-null  object 
 11  original_title        1048562 non-null  object 
 12  overview              850351 non-null   object 
 13  popularity            1048575 non-null  float64
 14  genres                646292 non-n

In [198]:
df_movies.head()

Unnamed: 0,title,release_date,revenue,adult,budget,imdb_id,original_language,original_title,genres,production_companies,production_countries
0,Inception,7/15/2010,825532764,False,160000000,tt1375666,en,Inception,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America"
1,Interstellar,11/5/2014,701729206,False,165000000,tt0816692,en,Interstellar,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America"
2,The Dark Knight,7/16/2008,1004558444,False,185000000,tt0468569,en,The Dark Knight,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America"
3,Avatar,12/15/2009,2923706026,False,237000000,tt0499549,en,Avatar,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom"
4,The Avengers,4/25/2012,1518815515,False,220000000,tt0848228,en,The Avengers,"Science Fiction, Action, Adventure",Marvel Studios,United States of America


In [197]:
columns_to_drop = ['imdb_id', 'movie_id', 'vote_average', 'vote_count', 'status', 'overview', 'popularity', 'spoken_languages']
df_movies.drop(columns=columns_to_drop, inplace=True)

In [203]:
df_movies.drop(columns=['original_title', 'production_countries'], inplace=True)

In [204]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 8 columns):
 #   Column                Non-Null Count    Dtype 
---  ------                --------------    ----- 
 0   title                 1048562 non-null  object
 1   release_date          899153 non-null   object
 2   revenue               1048575 non-null  int64 
 3   adult                 1048575 non-null  bool  
 4   budget                1048575 non-null  int64 
 5   original_language     1048575 non-null  object
 6   genres                646292 non-null   object
 7   production_companies  487889 non-null   object
dtypes: bool(1), int64(2), object(5)
memory usage: 57.0+ MB
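
`release_date` is still an `object` column here, and roughly 150,000 rows have no date at all. A possible next step, sketched under the assumption that all dates follow the `M/D/YYYY` layout seen in the `head()` output, is to parse them and let missing or malformed entries become `NaT`.

```python
import pandas as pd

# Hypothetical sample in the layout seen above, plus one missing value
dates = pd.Series(["7/15/2010", "11/5/2014", None])

# errors="coerce" turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
print(parsed.tolist())
```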
