# Data Cleaning

This notebook covers initial data preparation and basic cleaning ahead of the EDA.

### To-do list

- Read and load data files into dataframes
- Combine dataframes as needed (most likely matching on movie title; check for other usable keys)
- Clean up null/missing values

In [1]:
import pandas as pd 
from zipfile import ZipFile 

### The Numbers

In [86]:
# Load data from The Numbers into dataframe
df_tn = pd.read_csv('../data/raw/tn.movie_budgets.csv.gz', compression='gzip', encoding='latin1')
df_tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [97]:
df_tn.sample(10)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,year,month,day
3454,55,2015-10-30,Dancin' It's On,"$12,000,000",$0,$0,2015,10,30
1727,28,2009-12-18,The Young Victoria,"$35,000,000","$11,001,272","$31,878,891",2009,12,18
1187,88,1994-07-01,Baby's Day Out,"$50,000,000","$16,581,575","$16,581,575",1994,7,1
109,10,1995-07-28,Waterworld,"$175,000,000","$88,246,220","$264,246,220",1995,7,28
2322,23,2017-12-31,Matilda,"$25,000,000",$0,"$9,370,285",2017,12,31
3105,6,2016-09-23,Queen of Katwe,"$15,000,000","$8,874,389","$10,055,481",2016,9,23
1914,15,2002-08-16,Blue Crush,"$30,000,000","$40,118,420","$51,618,420",2002,8,16
1105,6,2006-03-24,Inside Man,"$50,000,000","$88,634,237","$185,798,265",2006,3,24
5713,14,1962-08-10,The Brain That Wouldn't Die,"$60,000",$0,$0,1962,8,10
2778,79,1997-05-02,Austin Powers: International Man of Mystery,"$18,000,000","$53,883,989","$67,683,989",1997,5,2
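
The three money columns are still `object` dtype because the values are strings such as `"$12,000,000"`. A minimal sketch of one way to make them numeric, run on a toy frame built from rows in the sample above; the helper name `money_to_int` is my own, not something defined elsewhere in this notebook:

```python
import pandas as pd

# Toy rows copied from the sample above; the real frame is df_tn.
toy = pd.DataFrame({
    "production_budget": ["$12,000,000", "$35,000,000", "$60,000"],
    "domestic_gross": ["$0", "$11,001,272", "$0"],
    "worldwide_gross": ["$0", "$31,878,891", "$0"],
})

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]


def money_to_int(series):
    """Strip '$' and ',' from strings like '$12,000,000' and cast to int64."""
    return series.str.replace(r"[$,]", "", regex=True).astype("int64")


# DataFrame.apply calls the helper once per column.
toy[money_cols] = toy[money_cols].apply(money_to_int)
print(toy.dtypes)
```

This assumes every value follows the `$1,234` pattern with no nulls, which matches the `info()` output above but is worth re-checking on the full column before casting.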


In [88]:
# Create a mapping of month names to numbers
month_mapping = {
    "Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04",
    "May": "05", "Jun": "06", "Jul": "07", "Aug": "08",
    "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"
}

In [89]:
# Define a function to replace the month name with a number in each string
def replace_month_name(date_string):
    for month, num in month_mapping.items():
        date_string = date_string.replace(month, num)
    return date_string

In [90]:
# Apply the function to the 'release_date' to replace month names with numbers
df_tn['release_date'] = df_tn['release_date'].apply(replace_month_name)

In [94]:
# And finally, convert the 'release_date' column into datetime
df_tn['release_date'] = pd.to_datetime(df_tn['release_date'], format='%m %d, %Y')

In [96]:
# Split the newly converted 'release_date' column into separate columns 'year', 'month', and 'day' 
df_tn['year'] = df_tn['release_date'].dt.year 
df_tn['month'] = df_tn['release_date'].dt.month 
df_tn['day'] = df_tn['release_date'].dt.day 
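
The month-name replacement step could likely be skipped, since the `%b` directive already matches abbreviated month names. A minimal sketch, assuming the raw strings look like `Dec 18, 2009` (inferred from the `'%m %d, %Y'` format used above) and that month names are parsed under an English locale:

```python
import pandas as pd

# Hypothetical raw values in the assumed 'Mon D, YYYY' layout.
raw_dates = pd.Series(["Dec 18, 2009", "Jul 1, 1994", "Oct 30, 2015"])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(raw_dates, format="%b %d, %Y")

parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
})
print(parts)
```

Parsing directly also avoids the side effect of `str.replace` rewriting a month abbreviation that happens to appear elsewhere in a string.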

#### Columns that I will need from elsewhere

- runtime
- genre
- studio?

If time allows, do feature engineering on actors, directors, and writers.

---

### Box Office Mojo

In [6]:
# Load Box Office Mojo data into dataframe
df_bom = pd.read_csv('../data/raw/bom.movie_gross.csv.gz', compression='gzip')
df_bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
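
For the "combine dataframes" item on the to-do list, one possible approach is a left merge on a normalized title plus year, with `indicator=True` to show which rows found a match. This is only a sketch on toy frames: the studio labels are made up, and the `title_key` helper and column are my own names, not part of either dataset.

```python
import pandas as pd

# Toy stand-ins for df_tn and df_bom; studio values are hypothetical.
tn = pd.DataFrame({
    "movie": ["Inside Man", "Blue Crush", "Waterworld"],
    "year": [2006, 2002, 1995],
})
bom = pd.DataFrame({
    "title": ["inside man ", "Blue Crush"],
    "year": [2006, 2002],
    "studio": ["Studio A", "Studio B"],
})


def title_key(titles):
    """Lowercase and trim titles so trivial differences do not block a match."""
    return titles.str.strip().str.lower()


tn["title_key"] = title_key(tn["movie"])
bom["title_key"] = title_key(bom["title"])

# A left merge keeps every row of tn; _merge reports 'both' or 'left_only'.
merged = tn.merge(bom, on=["title_key", "year"], how="left", indicator=True)
print(merged[["movie", "year", "studio", "_merge"]])
```

Including `year` in the key is a guard against remakes that share a title; whether release years line up across the real sources still needs checking.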


---

### TMDb

In [9]:
# Load TMDb data into dataframe
df_tmdb = pd.read_csv('../data/raw/tmdb.movies.csv.gz', compression='gzip')
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB
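
The `Unnamed: 0` column looks like a row index that was written out when the file was saved. A small sketch of how `index_col=0` would absorb it at load time, using an in-memory CSV with a trimmed-down set of columns as a stand-in for the real file:

```python
import io

import pandas as pd

# In-memory stand-in for the TMDb file: the first header cell is empty.
csv_text = (
    ",genre_ids,id,title\n"
    '0,"[35, 18]",356298,Don\'t Think Twice\n'
    "1,[99],230154,Cairo Drive\n"
)

# Default load: pandas names the blank header 'Unnamed: 0'.
with_col = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the index instead.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_col.columns))
print(list(as_index.columns))
```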


In [25]:
# Convert release_date column to datetime
df_tmdb['release_date'] = pd.to_datetime(df_tmdb['release_date'], format='%Y-%m-%d')

In [104]:
df_tmdb.sample(10)

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
25367,25367,"[35, 18, 10751]",511562,en,To the Beat,2.157,2018-03-13,To the Beat,5.1,4
10971,10971,"[53, 27]",301199,en,Bible Belt Slasher: The Holy Terror,0.6,2013-04-13,Bible Belt Slasher: The Holy Terror,1.5,1
2893,2893,[53],68280,fr,Requiem Pour Une Tueuse,6.213,2011-06-01,Requiem for a Killer,4.7,49
19580,19580,[],409371,en,Home,0.672,2016-08-01,Home,7.0,1
751,751,[35],48132,en,Kevin Smith: Too Fat For 40,2.36,2010-10-22,Kevin Smith: Too Fat For 40,6.5,15
17783,17783,"[35, 18]",356298,en,Don't Think Twice,7.304,2016-07-22,Don't Think Twice,6.4,240
18644,18644,[35],387054,en,Jimmy Carr: Funny Business,1.824,2016-03-18,Jimmy Carr: Funny Business,6.9,51
13459,13459,[99],230154,ar,Cairo Drive,0.6,2014-09-30,Cairo Drive,8.0,1
21269,21269,[18],430406,en,Running Wild,5.691,2017-02-10,Running Wild,6.4,26
12777,12777,[99],269761,zh,Diaoyu Islands: The Truth,0.938,2014-03-11,Diaoyu Islands: The Truth,5.5,3


In [73]:
# Spot-check: list every title with a release date of 2013-10-11
df_tmdb.loc[df_tmdb['release_date'] == '2013-10-11']

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
7922,7922,"[28, 18, 53]",109424,en,Captain Phillips,15.483,2013-10-11,Captain Phillips,7.6,3909
8012,8012,"[28, 80, 53]",106747,en,Machete Kills,10.36,2013-10-11,Machete Kills,5.5,1079
8020,8020,"[27, 9648, 53]",9022,en,All the Boys Love Mandy Lane,10.03,2013-10-11,All the Boys Love Mandy Lane,5.5,326
8210,8210,"[18, 10402]",111479,en,CBGB,7.056,2013-10-11,CBGB,6.5,70
8236,8236,"[53, 27]",227877,en,Torment,6.826,2013-10-11,Torment,4.7,52
8312,8312,[18],158908,en,The Inevitable Defeat of Mister & Pete,6.115,2013-10-11,The Inevitable Defeat of Mister & Pete,7.6,44
8348,8348,"[27, 14]",158752,en,Escape from Tomorrow,5.743,2013-10-11,Escape from Tomorrow,4.8,117
8438,8438,[18],152745,en,As I Lay Dying,4.835,2013-10-11,As I Lay Dying,5.7,34
9022,9022,[35],169642,en,Zero Charisma,1.815,2013-10-11,Zero Charisma,6.6,25
9149,9149,[18],254918,en,Louder Than Words,1.519,2013-10-11,Louder Than Words,5.7,26
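
Because the data comes from a CSV, `genre_ids` holds strings such as `"[35, 18, 10751]"` rather than real lists. A minimal sketch of turning them into lists with `ast.literal_eval`, shown on values copied from the samples above; the variable names are illustrative only:

```python
import ast

import pandas as pd

# String values as they appear in the genre_ids column.
genre_ids = pd.Series(["[35, 18, 10751]", "[53]", "[]"])

# literal_eval safely evaluates each string as a Python literal (here, a list).
parsed = genre_ids.apply(ast.literal_eval)

# Number of genre ids attached to each title.
n_genres = parsed.apply(len)
print(n_genres)
```

Mapping the numeric ids to genre names would need a separate lookup table, which is not part of the files loaded here.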


---

### Rotten Tomatoes

In [28]:
# Load Rotten Tomatoes reviews into dataframe
df_rt_reviews = pd.read_csv('../data/raw/rt.reviews.tsv.gz', sep='\t', compression='gzip', encoding='latin1')
df_rt_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [29]:
# Load Rotten Tomatoes movie info into dataframe
df_rt_movie_info = pd.read_csv('../data/raw/rt.movie_info.tsv.gz', sep='\t', compression='gzip')
df_rt_movie_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB
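
Several of these columns are mostly empty (for example, `box_office` has 340 non-null values out of 1560). For the "clean up null/missing values" to-do item, a small helper like the one below could rank columns by missingness before deciding what to drop or fill. The name `null_summary` and the toy values are hypothetical:

```python
import pandas as pd


def null_summary(df):
    """Return missing-value counts and percentages per column, worst first."""
    missing = df.isna().sum()
    pct = (missing / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": missing, "pct_missing": pct})
    return summary.sort_values("missing", ascending=False)


# Toy frame with made-up values; on the real data this would be df_rt_movie_info.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "studio": [None, "Studio A", None, None],
    "runtime": ["104 minutes", None, "90 minutes", "108 minutes"],
})

summary = null_summary(toy)
print(summary)
```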
