# Data Cleaning
In this notebook, we inspect the provided data files to identify data cleaning needs and execute all data cleaning tasks. Cleaned data is then exported to the data_cleaned folder for consumption by the analysis notebooks.

## Initial setup

### Importing libraries

In [11]:
import pandas as pd

### Data inspection function
Below we define the function `inspect_dataframe`, which provides a summary of the data contained in the input data frame.

In [12]:
def inspect_dataframe(df):
    print(f"There are {len(df)} rows in this data frame.\n")
    print('Column Names and Types:\n')
    print(df.dtypes,'\n')
    print('The table below shows the number of missing values in each column:\n')
    print(df.isna().sum(),'\n')
    for columnName in df.columns:
        if df[columnName].dtype in (object, str):
            print(f"There are {len(df[columnName].unique())} unique values in {columnName}. The first five are listed below:\n")
            for value in df[columnName].unique()[:5]:
                print(value,)
            print('\n')
        elif df[columnName].dtype in (int, float):
            column_min = df[columnName].min()
            column_max = df[columnName].max()
            print(f"The column {columnName} ranges between a minimum value of {column_min} and a maximum value of {column_max}.\n")

## The `imdb.title.basics.csv` data source

### Loading `imdb.title.basics.csv`

In [13]:
title_basics_df = pd.read_csv('data/imdb.title.basics.csv')

### Inspecting `imdb.title.basics.csv`
From the summary below we note the following initial impressions:
 * The variable `tconst` appears to be a primary key.
 * The variable `primary_title` is the expected title of the movie formated as a string.
 * The variable `original_title` seems like it might encode the name of the original release of a movie that has been re-released. This may be most useful when the original title is in the original language of the movie.
 * The variable `start_year` ranges from 2010 to 2115, indicating that this column most likely needs to inspected more closely to identify malformed data.
 * The variable `runtime_minutes` ranges from 1 to 51420, indicating that this column most likely needs to inspected more closely to identify malformed data.
 * The variable `genres` contains a comma separated list of applicable genres. We should is identify a list of unique genres and re-encode these lists as distinct indicator variables. 

In [14]:
inspect_dataframe(title_basics_df)

There are 146144 rows in this data frame.

Column Names and Types:

tconst              object
primary_title       object
original_title      object
start_year           int64
runtime_minutes    float64
genres              object
dtype: object 

The table below shows the number of missing values in each column:

tconst                 0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64 

There are 146144 unique values in tconst. The first five are listed below:

tt0063540
tt0066787
tt0069049
tt0069204
tt0100275


There are 136071 unique values in primary_title. The first five are listed below:

Sunghursh
One Day Before the Rainy Season
The Other Side of the Wind
Sabse Bada Sukh
The Wandering Soap Opera


There are 137774 unique values in original_title. The first five are listed below:

Sunghursh
Ashad Ka Ek Din
The Other Side of the Wind
Sabse Bada Sukh
La Telenovela Errante


The column start_year

### Cleaning `imdb.title.basic.csv`

#### Dropping movies with `start_year` in the future
Below we drop rows with movies that have a `start_year` after 2019. This ensures that our analysis covers full years and avoids incorporating data from unreleased movies. 

In [24]:
title_basics_df = title_basics_df[title_basics_df['start_year']>2019]

#### Dropping movies with extreme run times

In [28]:
title_basics_df[title_basics_df['runtime_minutes']> 120]

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
3669,tt10356650,Dream: The Life of Two Tales,Dream: The Life of Two Tales,2020,128.0,"Action,Comedy,Drama"
81669,tt4695264,Lawrence: After Arabia,Lawrence: After Arabia,2020,126.0,Drama
127678,tt7830722,Rashna:The Ray of Light,Rashna:The Ray of Light,2020,150.0,"Mystery,Thriller"
136826,tt8741304,Variance,Variance,2020,134.0,Drama


## The `imdb.title.ratings.csv` data source

### Loading the `imdb.title.ratings.csv` data source

In [15]:
title_ratings_df = pd.read_csv('data/imdb.title.ratings.csv')

### Inspecting `imdb.title.ratings.csv`
This file seems to contain ratings information for the films listed in `imdb.title.basic.csv` data source. All of the variables seem to be properly formatted and there are no missing values. This source requires no further action. 

In [16]:
inspect_dataframe(title_ratings_df)

There are 73856 rows in this data frame.

Column Names and Types:

tconst            object
averagerating    float64
numvotes           int64
dtype: object 

The table below shows the number of missing values in each column:

tconst           0
averagerating    0
numvotes         0
dtype: int64 

There are 73856 unique values in tconst. The first five are listed below:

tt10356526
tt10384606
tt1042974
tt1043726
tt1060240


The column averagerating ranges between a minimum value of 1.0 and a maximum value of 10.0.

The column numvotes ranges between a minimum value of 5 and a maximum value of 1841066.



## The `bom.movie_gross.csv` data source

### Loading the `bom.movie_gross.csv` data source

In [17]:
title_ratings_df = pd.read_csv('data/bom.movie_gross.csv')

### Inspecting `bom.movie_gross.csv`
From the summary below we note the following initial impressions:
 * The `title` variable seems to be properly formatted and is nearly unique. Further analysis should identify the one duplicate value and determine if the year separates the duplicate titles.
 * The `studio` variable is has very few missing values. It seems to be formatted as an abbreviation of the studio name. It would be nice to identify the full names of studios and map this variable to one that provides the full name.
 * The `domestic_gross` variable appears to be properly formatted with a small number of missing values. The maximum value is large but plausible. This variable should be inspected for outliers. It may be necessary to correct this variable for inflation.  
 * The `foreign_gross` variable seems to be improperly formatted as a string. This should be converted to a float after confirming that there is a uniform currency. 
 * The `year` variable is properly formatted and has no missing values. 
 
Since this data source does not share a key with the imdb sources, it will be necessary to match titles, most likely on the combination of title and year.  

In [18]:
inspect_dataframe(title_ratings_df)

There are 3387 rows in this data frame.

Column Names and Types:

title              object
studio             object
domestic_gross    float64
foreign_gross      object
year                int64
dtype: object 

The table below shows the number of missing values in each column:

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64 

There are 3386 unique values in title. The first five are listed below:

Toy Story 3
Alice in Wonderland (2010)
Harry Potter and the Deathly Hallows Part 1
Inception
Shrek Forever After


There are 258 unique values in studio. The first five are listed below:

BV
WB
P/DW
Sum.
Par.


The column domestic_gross ranges between a minimum value of 100.0 and a maximum value of 936700000.0.

There are 1205 unique values in foreign_gross. The first five are listed below:

652000000
691300000
664300000
535700000
513900000


The column year ranges between a minimum value of 2010 and a maximum v