# Exploratory Data Analysis

## Setup
Import the relevant packages.


In [144]:
import json
import pandas as pd

We have two data files, `tmdb_5000_credits.csv` and `tmdb_5000_movies.csv`. The 
data files can all be found in the directory `raw-data`. For simplicity, we 
refer to the files as `credits.csv` and `movies.csv` respectively. 

We first look at what type of data we are dealing with. 

In [145]:
# load in csv as dataframes
movies = pd.read_csv("raw-data/tmdb_5000_movies.csv")
credits = pd.read_csv("raw-data/tmdb_5000_credits.csv")

In [146]:
# get key information
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [147]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


## Data Cleaning
The columns `genres`, `keywords`, `production_companies`, `production_countries`, `spoken_languages` in `movies.csv` and `cast` and `crew` in `credits.csv` are lists of dicts, so we load them in as JSON objects. 

In [148]:
# parse dates
movies["release_date"] = pd.to_datetime(movies["release_date"]).apply(lambda x: x.date())

# parse columns with nested JSON objects as actual JSON objects instead of str
# movies.csv
movies_json_columns = [
    "genres", "keywords", "production_companies", "production_countries", 
    "spoken_languages"
]
for col in movies_json_columns:
    movies[col] = movies[col].apply(json.loads)

# credits.csv
credits_json_columns = ["cast", "crew"]
for col in credits_json_columns:
    credits[col] = credits[col].apply(json.loads)

We also require the datasets in long form. First we work with `credits.csv`.

In [149]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{'cast_id': 242, 'character': 'Jake Sully', '...","[{'credit_id': '52fe48009251416c750aca23', 'de..."
1,285,Pirates of the Caribbean: At World's End,"[{'cast_id': 4, 'character': 'Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de..."
2,206647,Spectre,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de..."
3,49026,The Dark Knight Rises,"[{'cast_id': 2, 'character': 'Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de..."
4,49529,John Carter,"[{'cast_id': 5, 'character': 'John Carter', 'c...","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de..."


In [191]:
list_of_casts = []
for i in range(0, credits.shape[0], 1):
    cast = pd.json_normalize(credits.cast[i])
    cast["movie_id"] = credits.iloc[i, 0]
    cast["title"] = credits.iloc[i,1]

    list_of_casts.append(cast)

flat_casts = pd.concat(list_of_casts)
flat_casts.sample(5)


Unnamed: 0,cast_id,character,credit_id,gender,id,name,order,movie_id,title
22,23.0,Susana Parrado,52fe4479c3a36847f8098351,1.0,33834.0,Ele Keats,22.0,7305,Alive
7,1017.0,Palach,53e225e7c3a3684860000787,0.0,1265313.0,Juris Laucinsh,7.0,110402,Hard to Be a God
12,27.0,Blind Man,52fe44bc9251416c7503f1a7,0.0,170759.0,Wayne Federman,12.0,12133,Step Brothers
9,11.0,Mr. Piln,52fe4532c3a36847f80c193d,0.0,59455.0,Steve Uzzell,9.0,9813,The Quiet
49,57.0,Kid C,5470b1029251414f54001b5e,0.0,1387892.0,Robert Roser,49.0,11592,Serial Mom


We do the same to flatten crew

In [192]:
list_of_crews = []
for i in range(0, credits.shape[0], 1):
    crew = pd.json_normalize(credits.crew[i])
    crew["movie_id"] = credits.iloc[i, 0]
    crew["title"] = credits.iloc[i,1]

    list_of_crews.append(crew)

flat_crew = pd.concat(list_of_casts)
flat_crew.sample(5)

Unnamed: 0,cast_id,character,credit_id,gender,id,name,order,movie_id,title
13,15.0,Randy Brandston,57b7c69b9251414ec6001a14,0.0,1213527.0,Connor Matheus,13.0,15489,Snow Day
18,38.0,Custodian,56395d7fc3a3681b5c02247d,2.0,47944.0,Anthony O'Donnell,18.0,116,Match Point
1,12.0,Akasha,52fe44af9251416c7503d587,1.0,21352.0,Aaliyah,1.0,11979,Queen of the Damned
89,141.0,Student (uncredited),5919af8cc3a368423c05a415,0.0,1817318.0,Steve Crawford,89.0,11615,The Life of David Gale
1,14.0,Peter Colt,52fe448c9251416c75038aff,2.0,6162.0,Paul Bettany,1.0,11823,Wimbledon


Now we join the cast and crew tables to arrive at `credits.csv` in long form. 

In [193]:
flat_casts.columns = ["cast_order" if x == "order" else x for x in flat_casts.columns]
flat_crew.columns = ["crew_order" if x == "order" else x for x in flat_crew.columns]

In [197]:
# flattened credits completely such that every row is exactly one observation
# an observation is 
flat_credits = pd.concat([flat_casts, flat_crew])
flat_credits.sample(5)

Unnamed: 0,cast_id,character,credit_id,gender,id,name,cast_order,movie_id,title,crew_order
42,52.0,Dancer,58c2b4c792514104f701143b,1.0,1773686.0,Leslie Geldbach,,10571,Boys and Girls,42.0
16,22.0,Tomb Robber,547dd54ec3a3685aed00631f,0.0,1061054.0,Julian Black Antelope,,68737,Seventh Son,16.0
12,21.0,Interviewee,52fe473c9251416c750923b9,0.0,1086506.0,Morena Busa Sesatsa,12.0,17654,District 9,
37,53.0,Mrs. Start (voice),58dae204c3a3686cbb00029c,0.0,56105.0,Renée Taylor,39.0,950,Ice Age: The Meltdown,
3,4.0,Celimene,52fe441cc3a368484e00ff69,1.0,4390.0,Ludivine Sagnier,3.0,21494,Moliere,


Now we work with `movies.csv`. Recall its structure:

In [198]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.avatarmovie.com/,19995,"[{'id': 1463, 'name': 'culture clash'}, {'id':...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{'name': 'Ingenious Film Partners', 'id': 289...","[{'iso_3166_1': 'US', 'name': 'United States o...",2009-12-10,2787965087,162.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",http://disney.go.com/disneypictures/pirates/,285,"[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{'name': 'Walt Disney Pictures', 'id': 2}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",2007-05-19,961000000,169.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2015-10-26,880674609,148.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",http://www.thedarkknightrises.com/,49026,"[{'id': 849, 'name': 'dc comics'}, {'id': 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{'name': 'Legendary Pictures', 'id': 923}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-07-16,1084939099,165.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://movies.disney.com/john-carter,49529,"[{'id': 818, 'name': 'based on novel'}, {'id':...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{'name': 'Walt Disney Pictures', 'id': 2}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-03-07,284139100,132.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


First we drop the `homepage` column, as it is of no use. 

In [201]:
movies.drop(columns=["homepage"], inplace=True)
movies.head()

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",19995,"[{'id': 1463, 'name': 'culture clash'}, {'id':...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{'name': 'Ingenious Film Partners', 'id': 289...","[{'iso_3166_1': 'US', 'name': 'United States o...",2009-12-10,2787965087,162.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",285,"[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{'name': 'Walt Disney Pictures', 'id': 2}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",2007-05-19,961000000,169.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",206647,"[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2015-10-26,880674609,148.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",49026,"[{'id': 849, 'name': 'dc comics'}, {'id': 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{'name': 'Legendary Pictures', 'id': 923}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-07-16,1084939099,165.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",49529,"[{'id': 818, 'name': 'based on novel'}, {'id':...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{'name': 'Walt Disney Pictures', 'id': 2}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-03-07,284139100,132.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


Now we work with the genres. Note that we would like to turn each genre into 
a column, i.e. one-hot encoding. Code from:
https://stackoverflow.com/questions/48213149/creating-one-hot-encodings-from-a-column-of-dictionaries-with-pandas 