# Data Cleaning
In this notebook, we inspect the provided data files to identify data cleaning needs and then carry out the cleaning tasks. Cleaned data is exported to the `data/cleaned` folder for use by the analysis notebooks.

## Initial setup

### Importing libraries

In [1]:
import pandas as pd
import os

### Setting path to data

In [2]:
raw_data_path = '/home/schart/Flatiron/DataScience/Phase1/Project/Movie_Analysis/data/raw/'
cleaned_data_path = '/home/schart/Flatiron/DataScience/Phase1/Project/Movie_Analysis/data/cleaned/'

### Data import function

In [28]:
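def get_data(raw_data_path, file_name):
    """Load a raw .csv or .tsv file into a DataFrame."""
    file_path = os.path.join(raw_data_path, file_name)
    if file_name.endswith('.csv'):
        df = pd.read_csv(file_path)
    elif file_name.endswith('.tsv'):
        try:
            df = pd.read_csv(file_path, sep='\t')
        except UnicodeDecodeError:
            # Fall back to Windows-1252 when the file is not valid UTF-8.
            df = pd.read_csv(file_path, sep='\t', encoding='windows-1252')
    else:
        # Fail loudly instead of returning an undefined variable.
        raise ValueError(f"Unsupported file type: {file_name}")
    return df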

In [29]:
get_data(raw_data_path=raw_data_path, file_name='rt.reviews.tsv')

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
...,...,...,...,...,...,...,...,...
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


### Data saving function

In [4]:
def save_cleaned_data(cleaned_data_path, file_name, df):
    file_path = os.path.join(cleaned_data_path, file_name)
    df.to_csv(file_path)

### Data inspection function
Below we define the function `inspect_dataframe`, which provides a summary of the data contained in the input data frame.

In [25]:
def inspect_dataframe(df):
    print(f"There are {len(df)} rows in this data frame.\n")
    print('Column Names and Types:\n')
    print(df.dtypes,'\n')
    print('The table below shows the number of missing values in each column:\n')
    print(df.isna().sum(),'\n')
    #for columnName in df.columns:
    #    if df[columnName].dtype in (object, str):
    #        print(f"There are {len(df[columnName].unique())} unique values in {columnName}. The first five are listed below:\n")
    #        for value in df[columnName].unique()[:5]:
    #            print(value,)
    #        print('\n')
    #    elif df[columnName].dtype in (int, float):
    #        column_min = df[columnName].min()
    #        column_max = df[columnName].max()
    #        print(f"The column {columnName} ranges between a minimum value of {column_min} and a maximum value of {column_max}.\n")

## Listing data files

In [30]:
for file in sorted(os.listdir(raw_data_path)):
    print(f"==== {file} ====")
    try:
        # inspect_dataframe prints its own summary and returns None,
        # so it is called directly rather than wrapped in print().
        inspect_dataframe(get_data(raw_data_path, file))
    except Exception as error:
        print(f"Could not inspect file: {error}")

==== bom.movie_gross.csv ====
There are 3387 rows in this data frame.

Column Names and Types:

title              object
studio             object
domestic_gross    float64
foreign_gross      object
year                int64
dtype: object 

The table below shows the number of missing values in each column:

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64 

==== imdb.name.basics.csv ====
There are 606648 rows in this data frame.

Column Names and Types:

nconst                 object
primary_name           object
birth_year            float64
death_year            float64
primary_profession     object
known_for_titles       object
dtype: object 

The table below shows the number of missing values in each column:

nconst                     0
primary_name               0
birth_year            523912
death_year            599865
primary_profession     51340
known_for_titles       30204
dtype: int64 


In [1]:
name_basics_df = get_data(raw_data_path=raw_data_path, file_name='imdb.name.basics.csv')
name_basics_df.head()

## The `imdb.title.basics.csv` data source

### Loading `imdb.title.basics.csv`

In [8]:
title_basics_df = get_data(raw_data_path, 'imdb.title.basics.csv')

### Inspecting `imdb.title.basics.csv`
From the summary below we note the following initial impressions:
 * The variable `tconst` appears to be a primary key.
 * The variable `primary_title` is the expected title of the movie, formatted as a string.
 * The variable `original_title` may encode the title of a movie's original release, for example when the movie has been re-released. This may be most useful when the original title is in the movie's original language.
 * The variable `start_year` ranges from 2010 to 2115, so this column needs to be inspected more closely to identify malformed data.
 * The variable `runtime_minutes` ranges from 1 to 51420, so this column also needs to be inspected more closely to identify malformed data.
 * The variable `genres` contains a comma-separated list of applicable genres. We should identify the unique genres and re-encode these lists as distinct indicator variables.
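
The ranges noted above can be checked directly before deciding on cutoffs. The following is a minimal sketch, assuming a small made-up frame `toy_df` in place of `title_basics_df`; the thresholds 2019 and 240 are illustrative values, and the rows are invented for the example.

```python
import pandas as pd

# Hypothetical stand-in for title_basics_df; the values are invented.
toy_df = pd.DataFrame({
    "start_year": [2010, 2015, 2115],
    "runtime_minutes": [1.0, 95.0, 51420.0],
})

# Minimum and maximum of each numeric column of interest.
ranges = toy_df[["start_year", "runtime_minutes"]].agg(["min", "max"])
print(ranges)

# Rows whose values look suspicious and deserve a manual check.
suspect = toy_df[(toy_df["start_year"] > 2019) | (toy_df["runtime_minutes"] > 240)]
print(suspect)
```

Listing the suspicious rows, rather than only their count, makes it easier to verify a few of them by hand.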

In [9]:
inspect_dataframe(title_basics_df)

There are 146144 rows in this data frame.

Column Names and Types:

tconst              object
primary_title       object
original_title      object
start_year           int64
runtime_minutes    float64
genres              object
dtype: object 

The table below shows the number of missing values in each column:

tconst                 0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64 

There are 146144 unique values in tconst. The first five are listed below:

tt0063540
tt0066787
tt0069049
tt0069204
tt0100275


There are 136071 unique values in primary_title. The first five are listed below:

Sunghursh
One Day Before the Rainy Season
The Other Side of the Wind
Sabse Bada Sukh
The Wandering Soap Opera


There are 137774 unique values in original_title. The first five are listed below:

Sunghursh
Ashad Ka Ek Din
The Other Side of the Wind
Sabse Bada Sukh
La Telenovela Errante


The column start_year

### Cleaning `imdb.title.basics.csv`

#### Dropping movies with `start_year` in the future
Below we drop rows with movies that have a `start_year` after 2019. This ensures that our analysis covers full years and avoids incorporating data from unreleased movies. 

In [82]:
cleaned_title_basics_df = title_basics_df.query('start_year <= 2019')

#### Inspecting movies with extreme `runtime_minutes`
Manually verifying the runtime of a few of the longest movies in the list shows that these values are accurate. For the purposes of this analysis, we drop movies that run longer than four hours (240 minutes). Movies with extremely long runtimes tend to be experimental and are unlikely to be relevant to our analysis, which will focus on commercially viable films.


In [83]:
title_basics_df.query('runtime_minutes > 240').sort_values('runtime_minutes')

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
3918,tt10374170,Friends with Awkwardness,Friends with Awkwardness,2019,241.0,Comedy
118879,tt7131170,Rideshare,Rideshare,2016,242.0,"Horror,Thriller"
77066,tt4417796,Les choses et les mots de Mudimbe,Les choses et les mots de Mudimbe,2015,243.0,Documentary
36099,tt2321493,Romanian Art Scene 2009-2011,Romanian Art Scene 2009-2011,2012,243.0,Documentary
125405,tt7639166,Just Tell Her,Just Tell Her,2017,243.0,Action
...,...,...,...,...,...,...
88717,tt5136218,London EC1,London EC1,2015,5460.0,"Comedy,Drama,Mystery"
87264,tt5068890,Hunger!,Hunger!,2015,6000.0,"Documentary,Drama"
123467,tt7492094,Nari,Nari,2017,6017.0,Documentary
44840,tt2659636,Modern Times Forever,Modern Times Forever,2011,14400.0,Documentary


#### Dropping movies with extreme `runtime_minutes`

In [84]:
# Note: rows with a missing runtime_minutes fail this comparison and are dropped as well.
cleaned_title_basics_df = cleaned_title_basics_df.query('runtime_minutes <= 240')
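
A comparison against a missing value evaluates to false, so a filter such as `runtime_minutes <= 240` removes rows with a missing runtime along with the long movies. The sketch below illustrates this on a made-up frame (`toy_df` and its values are hypothetical) and shows one way to keep the missing rows if that were preferred.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one normal runtime, one missing, one over four hours.
toy_df = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "runtime_minutes": [90.0, np.nan, 300.0],
})

# NaN <= 240 is False, so the row with the missing runtime is dropped too.
kept = toy_df.query("runtime_minutes <= 240")

# Alternative: keep rows with an unknown runtime by allowing NaN explicitly.
kept_with_missing = toy_df[
    toy_df["runtime_minutes"].isna() | (toy_df["runtime_minutes"] <= 240)
]

print(kept["tconst"].tolist())
print(kept_with_missing["tconst"].tolist())
```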

#### Re-encoding the `genres` variable
We noticed that `genres` is a comma-separated list of applicable genres, which may be empty. Below we re-encode this data using indicator variables.

In [31]:
def get_unique_genres(df):
    """Return a sorted list of every genre that appears in df['genres']."""
    unique_genres = set()
    for title_genres in df['genres']:
        try:
            unique_genres.update(title_genres.split(','))
        except AttributeError:
            # Missing genres are read as NaN (a float), which has no split method.
            continue
    return sorted(unique_genres)

In [15]:
def genre_indicator(title_genres, genre):
    """Return True if genre appears in the comma-separated title_genres."""
    try:
        return genre in title_genres.split(',')
    except AttributeError:
        # Missing genres are read as NaN (a float), which has no split method.
        return False

In [16]:
def test_genre_indicator():
    test_title_genres = "Drama,Comedy"
    test_genre1 = "Drama"
    test_genre2 = "Musical"
    return genre_indicator(test_title_genres, test_genre1) == True and genre_indicator(test_title_genres, test_genre2) == False

test_genre_indicator()

True

In [88]:
def make_genre_indicators_df(df):
    df_out = pd.DataFrame()
    df_out['tconst'] = df['tconst']
    unique_genres = get_unique_genres(df)
    for genre in unique_genres:
        genre_name = genre.lower().replace('-','_')
        column_name = "genre_"+genre_name
        df_out[column_name] = df['genres'].apply(lambda x: genre_indicator(x, genre))
    return df_out
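
As a possible alternative to the loop above, pandas provides `Series.str.get_dummies`, which splits a delimited string column into indicator columns in one call. This is a sketch on a made-up frame (`toy_df` is hypothetical), not a replacement for the function defined in this notebook; it follows the same `genre_` naming scheme.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing genres value.
toy_df = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "genres": ["Drama,Comedy", "Sci-Fi", np.nan],
})

# One 0/1 column per genre; missing values yield a row of zeros.
dummies = toy_df["genres"].str.get_dummies(sep=",").astype(bool)

# Match the naming used by make_genre_indicators_df, e.g. 'Sci-Fi' -> 'genre_sci_fi'.
dummies.columns = ["genre_" + c.lower().replace("-", "_") for c in dummies.columns]

indicators = pd.concat([toy_df[["tconst"]], dummies], axis=1)
print(indicators)
```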

In [32]:
get_unique_genres(title_basics_df)

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [89]:
make_genre_indicators_df(cleaned_title_basics_df)

Unnamed: 0,tconst,genre_action,genre_adult,genre_adventure,genre_animation,genre_biography,genre_comedy,genre_crime,genre_documentary,genre_drama,...,genre_news,genre_reality_tv,genre_romance,genre_sci_fi,genre_short,genre_sport,genre_talk_show,genre_thriller,genre_war,genre_western
0,tt0063540,True,False,False,False,False,False,True,False,True,...,False,False,False,False,False,False,False,False,False,False
1,tt0066787,False,False,False,False,True,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
2,tt0069049,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
4,tt0100275,False,False,False,False,False,True,False,False,True,...,False,False,False,False,False,False,False,False,False,False
5,tt0111414,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146135,tt9916170,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
146136,tt9916186,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
146137,tt9916190,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
146139,tt9916538,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False


### Inspecting cleaned data

In [90]:
inspect_dataframe(cleaned_title_basics_df)

There are 114109 rows in this data frame.

Column Names and Types:

tconst              object
primary_title       object
original_title      object
start_year           int64
runtime_minutes    float64
genres              object
dtype: object 

The table below shows the number of missing values in each column:

tconst                0
primary_title         0
original_title        4
start_year            0
runtime_minutes       0
genres             2164
dtype: int64 

There are 114109 unique values in tconst. The first five are listed below:

tt0063540
tt0066787
tt0069049
tt0100275
tt0111414


There are 107188 unique values in primary_title. The first five are listed below:

Sunghursh
One Day Before the Rainy Season
The Other Side of the Wind
The Wandering Soap Opera
A Thin Life


There are 108641 unique values in original_title. The first five are listed below:

Sunghursh
Ashad Ka Ek Din
The Other Side of the Wind
La Telenovela Errante
A Thin Life


The column start_year ranges betwee

### Saving cleaned data

In [91]:
save_cleaned_data(cleaned_data_path, 'imdb.title.basics.csv', cleaned_title_basics_df)

## The `imdb.title.ratings.csv` data source

### Loading the `imdb.title.ratings.csv` data source

In [92]:
title_ratings_df = get_data(raw_data_path, 'imdb.title.ratings.csv')

### Inspecting `imdb.title.ratings.csv`
This file seems to contain ratings information for the films listed in the `imdb.title.basics.csv` data source. All of the variables seem to be properly formatted and there are no missing values. It would be useful to join this table with `imdb.title.basics.csv`.
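
Before relying on such a join, it can help to confirm that `tconst` is unique on both sides and to count how many titles have no rating. The sketch below uses `DataFrame.merge` with its `validate` and `indicator` options on two made-up frames (`basics` and `ratings` are hypothetical stand-ins with invented values).

```python
import pandas as pd

# Hypothetical stand-ins for the two IMDb tables.
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primary_title": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt3"],
    "averagerating": [7.1, 5.4],
    "numvotes": [120, 15],
})

# validate raises MergeError if tconst is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
joined = basics.merge(
    ratings, on="tconst", how="left", validate="one_to_one", indicator=True
)

# Titles present in basics but absent from ratings.
unmatched = (joined["_merge"] == "left_only").sum()
print(unmatched)
print(joined.dtypes)
```

Note that an integer column such as `numvotes` becomes `float64` after a left join that introduces missing values.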

In [93]:
inspect_dataframe(title_ratings_df)

There are 73856 rows in this data frame.

Column Names and Types:

tconst            object
averagerating    float64
numvotes           int64
dtype: object 

The table below shows the number of missing values in each column:

tconst           0
averagerating    0
numvotes         0
dtype: int64 

There are 73856 unique values in tconst. The first five are listed below:

tt10356526
tt10384606
tt1042974
tt1043726
tt1060240


The column averagerating ranges between a minimum value of 1.0 and a maximum value of 10.0.

The column numvotes ranges between a minimum value of 5 and a maximum value of 1841066.



In [99]:
left = cleaned_title_basics_df.set_index('tconst')
right = title_ratings_df.set_index('tconst')
title_basic_ratings_df = left.join(right, how='left')
inspect_dataframe(title_basic_ratings_df)

There are 114109 rows in this data frame.

Column Names and Types:

primary_title       object
original_title      object
start_year           int64
runtime_minutes    float64
genres              object
averagerating      float64
numvotes           float64
dtype: object 

The table below shows the number of missing values in each column:

primary_title          0
original_title         4
start_year             0
runtime_minutes        0
genres              2164
averagerating      47952
numvotes           47952
dtype: int64 

There are 107188 unique values in primary_title. The first five are listed below:

Sunghursh
One Day Before the Rainy Season
The Other Side of the Wind
The Wandering Soap Opera
A Thin Life


There are 108641 unique values in original_title. The first five are listed below:

Sunghursh
Ashad Ka Ek Din
The Other Side of the Wind
La Telenovela Errante
A Thin Life


The column start_year ranges between a minimum value of 2010 and a maximum value of 2019.

The column run

### Saving joined data

In [102]:
save_cleaned_data(cleaned_data_path=cleaned_data_path, file_name='imdb.title.basic_join_ratings.csv', df=title_basic_ratings_df)

## The `bom.movie_gross.csv` data source

### Loading the `bom.movie_gross.csv` data source

In [10]:
movie_gross_df = get_data(raw_data_path, 'bom.movie_gross.csv')

### Inspecting `bom.movie_gross.csv`
From the summary below we note the following initial impressions:
 * The `title` variable seems to be properly formatted and is nearly unique. Further analysis should identify the one duplicated title and determine whether the year distinguishes the two entries.
 * The `studio` variable has very few missing values. It seems to be formatted as an abbreviation of the studio name. It would be useful to identify the full studio names and map this variable to one that provides them.
 * The `domestic_gross` variable appears to be properly formatted, with a small number of missing values. The maximum value is large but plausible. This variable should be inspected for outliers. It may be necessary to adjust this variable for inflation.
 * The `foreign_gross` variable seems to be improperly formatted as a string. It should be converted to a float after confirming that a single currency is used throughout.
 * The `year` variable is properly formatted and has no missing values.
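
The string-to-float conversion suggested for `foreign_gross` could look like the sketch below. It assumes, as a guess about why the column was read as text, that some values contain thousands separators; the series `toy_gross` and its second value are invented for illustration.

```python
import pandas as pd

# Hypothetical values: a plain digit string, one with a comma, and a missing entry.
toy_gross = pd.Series(["652000000", "1,131.6", None], name="foreign_gross")

# Strip commas, then convert; anything still unparseable becomes NaN.
foreign_gross_num = pd.to_numeric(
    toy_gross.str.replace(",", "", regex=False), errors="coerce"
)
print(foreign_gross_num)
```

Using `errors="coerce"` hides unparseable values as NaN, so comparing the count of missing values before and after the conversion is a reasonable sanity check.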
 
Since this data source does not share a key with the IMDb sources, it will be necessary to match titles, most likely on the combination of title and year.
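
One way such a title-and-year match could be sketched is shown below. The helper `normalize_title`, the `title_key` column, and both frames are hypothetical; real titles will likely need more careful normalization than lower-casing and trimming.

```python
import pandas as pd

def normalize_title(title: pd.Series) -> pd.Series:
    """Lower-case and trim titles so trivial differences do not block a match."""
    return title.str.lower().str.strip()

# Hypothetical stand-ins for the box office and IMDb tables.
gross = pd.DataFrame({
    "title": ["Movie A", "movie b "],
    "year": [2010, 2011],
    "domestic_gross": [100.0, 200.0],
})
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primary_title": ["Movie A", "Movie B", "Movie B"],
    "start_year": [2010, 2011, 2015],
})

gross["title_key"] = normalize_title(gross["title"])
basics["title_key"] = normalize_title(basics["primary_title"])

# Match on normalized title plus year; the year separates same-titled films.
matched = gross.merge(
    basics,
    left_on=["title_key", "year"],
    right_on=["title_key", "start_year"],
    how="left",
)
print(matched[["title", "year", "tconst"]])
```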

In [12]:
inspect_dataframe(movie_gross_df)

There are 3387 rows in this data frame.

Column Names and Types:

title              object
studio             object
domestic_gross    float64
foreign_gross      object
year                int64
dtype: object 

The table below shows the number of missing values in each column:

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64 

There are 3386 unique values in title. The first five are listed below:

Toy Story 3
Alice in Wonderland (2010)
Harry Potter and the Deathly Hallows Part 1
Inception
Shrek Forever After


There are 258 unique values in studio. The first five are listed below:

BV
WB
P/DW
Sum.
Par.


The column domestic_gross ranges between a minimum value of 100.0 and a maximum value of 936700000.0.

There are 1205 unique values in foreign_gross. The first five are listed below:

652000000
691300000
664300000
535700000
513900000


The column year ranges between a minimum value of 2010 and a maximum v

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
