# Exploratory Data Analysis

## Setup
Import the relevant packages.


In [1]:
import json
import pandas as pd

We have a single file `20220310.csv`. Let's first look at what type of data we 
are dealing with. 

In [10]:
# load the CSV into a DataFrame
# lineterminator="\n" is set explicitly because the file has some line-separation issues
raw_data = pd.read_csv("raw-data/20220310.csv", lineterminator="\n")

# get summary information on the data frame, including columns and data types
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  5000 non-null   bool   
 1   backdrop_path          4874 non-null   object 
 2   belongs_to_collection  1742 non-null   object 
 3   budget                 5000 non-null   int64  
 4   genres                 5000 non-null   object 
 5   homepage               2418 non-null   object 
 6   id                     5000 non-null   int64  
 7   imdb_id                4938 non-null   object 
 8   original_language      5000 non-null   object 
 9   original_title         5000 non-null   object 
 10  overview               4963 non-null   object 
 11  popularity             5000 non-null   float64
 12  poster_path            4995 non-null   object 
 13  production_companies   5000 non-null   object 
 14  production_countries   5000 non-null   object 
 15  rele

Let's also look at the first few rows to see what we are dealing with. 

In [11]:
raw_data.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,credits,keywords
0,False,/iQFcwSGbZXMkeyKrxbPnwnRo5fl.jpg,"{'id': 531241, 'name': 'Spider-Man (Avengers) ...",200000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",https://www.spidermannowayhome.movie,634649,tt10872600,en,Spider-Man: No Way Home,...,148.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The Multiverse unleashed.,Spider-Man: No Way Home,False,8.3,9509,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 242, 'name': 'new york ci..."
1,False,/5P8SmMzSNYikXpxil6BYzJ16611.jpg,"{'id': 948485, 'name': 'The Batman Collection'...",185000000,"[{'id': 80, 'name': 'Crime'}, {'id': 9648, 'na...",https://www.thebatman.com,414906,tt1877830,en,The Batman,...,176.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Unmask the truth.,The Batman,False,8.0,1738,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 849, 'name': 'dc comics'}..."
2,False,/tAztR7AXEesMQAAi5ncFPSZtYlI.jpg,,0,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",https://www.hulu.com/movie/no-exit-4800d468-b5...,833425,tt7550014,en,No Exit,...,96.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Who will survive the dead of winter?,No Exit,False,6.5,208,"{'cast': [{'adult': False, 'gender': 1, 'id': ...","{'keywords': [{'id': 818, 'name': 'based on no..."
3,False,/3G1Q5xF40HkUBJXxt2DQgQzKTp5.jpg,,50000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",https://movies.disney.com/encanto,568124,tt2953050,en,Encanto,...,102.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a little magic in all of us ...almost ...,Encanto,False,7.7,5282,"{'cast': [{'adult': False, 'gender': 1, 'id': ...","{'keywords': [{'id': 2343, 'name': 'magic'}, {..."
4,False,/7CamWBejQ9JQOO5vAghZfrFpMXY.jpg,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",https://www.netflix.com/title/81424708,928381,tt14465894,fr,Sans répit,...,95.0,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,,Restless,False,5.9,152,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 9714, 'name': 'remake'}]}"


## Data Cleaning

First we remove columns that we will obviously not be using, e.g. `backdrop_path`,
`homepage`, etc. 

In [15]:
useless_columns = ["backdrop_path", "belongs_to_collection", "homepage", 
"poster_path", "video"]

raw_data.drop(useless_columns, axis=1, inplace=True)
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   adult                 5000 non-null   bool   
 1   budget                5000 non-null   int64  
 2   genres                5000 non-null   object 
 3   id                    5000 non-null   int64  
 4   imdb_id               4938 non-null   object 
 5   original_language     5000 non-null   object 
 6   original_title        5000 non-null   object 
 7   overview              4963 non-null   object 
 8   popularity            5000 non-null   float64
 9   production_companies  5000 non-null   object 
 10  production_countries  5000 non-null   object 
 11  release_date          4980 non-null   object 
 12  revenue               5000 non-null   int64  
 13  runtime               4999 non-null   float64
 14  spoken_languages      5000 non-null   object 
 15  status               


The columns `genres`, `keywords`, `production_companies`, `production_countries`, `spoken_languages` in `movies.csv` and `cast` and `crew` in `credits.csv` are lists of dicts, so we load them in as JSON objects. 

In [None]:
# parse dates
movies["release_date"] = pd.to_datetime(movies["release_date"]).apply(lambda x: x.date())

# parse columns with nested JSON objects as actual JSON objects instead of str
# movies.csv
movies_json_columns = [
    "genres", "keywords", "production_companies", "production_countries", 
    "spoken_languages"
]
for col in movies_json_columns:
    movies[col] = movies[col].apply(json.loads)

# credits.csv
credits_json_columns = ["cast", "crew"]
for col in credits_json_columns:
    credits[col] = credits[col].apply(json.loads)
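
One caveat: in the `head()` output above, the nested columns use single-quoted keys and Python literals such as `False`. `json.loads` rejects strings in that form. The sketch below is one possible workaround, not the notebook's method: it tries `json.loads` first and falls back to `ast.literal_eval`. The helper name `parse_nested` and the toy rows are assumptions made for illustration.

```python
import ast
import json

import pandas as pd

# Toy frame mimicking the single-quoted strings shown by raw_data.head();
# the rows are illustrative, not copied from the real file.
toy = pd.DataFrame({
    "genres": [
        "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}]",
        "[{'id': 80, 'name': 'Crime'}]",
    ]
})


def parse_nested(value):
    """Hypothetical helper: turn a stringified list/dict into Python objects."""
    if not isinstance(value, str):
        return value  # leave NaN or already-parsed values untouched
    try:
        return json.loads(value)  # valid JSON: double quotes, true/false
    except json.JSONDecodeError:
        return ast.literal_eval(value)  # Python literal: single quotes, True/False


toy["genres"] = toy["genres"].apply(parse_nested)
first_genre_names = [g["name"] for g in toy.loc[0, "genres"]]
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts, booleans, `None`), so it is a safer choice than `eval` for this kind of data.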

We also require the datasets in long form. First we work with `credits.csv`.

In [None]:
credits.head()

In [None]:
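# build one small DataFrame per movie, then stack them
list_of_casts = []
for i in range(credits.shape[0]):
    # one row per cast member of movie i
    cast = pd.json_normalize(credits.cast[i])
    # tag each row with the movie's id and title (columns 0 and 1 of credits)
    cast["movie_id"] = credits.iloc[i, 0]
    cast["title"] = credits.iloc[i, 1]

    list_of_casts.append(cast)

flat_casts = pd.concat(list_of_casts)
flat_casts.sample(5)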


We do the same to flatten the crew.

In [None]:
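# same approach as for the cast: one small DataFrame per movie
list_of_crews = []
for i in range(credits.shape[0]):
    # one row per crew member of movie i
    crew = pd.json_normalize(credits.crew[i])
    # tag each row with the movie's id and title (columns 0 and 1 of credits)
    crew["movie_id"] = credits.iloc[i, 0]
    crew["title"] = credits.iloc[i, 1]

    list_of_crews.append(crew)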

flat_crew = pd.concat(list_of_crews)
flat_crew.sample(5)
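
Since the cast and crew loops differ only in the column they read, the flattening could also be written once. The sketch below is an alternative under stated assumptions, not the notebook's code: it assumes the nested column has already been parsed into lists of dicts and that the table has `movie_id` and `title` columns. The helper name `flatten_column` and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the parsed credits table; the rows are invented.
toy_credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["A", "B"],
    "cast": [
        [{"name": "Ann", "order": 0}, {"name": "Bob", "order": 1}],
        [{"name": "Cy", "order": 0}],
    ],
})


def flatten_column(df, col):
    """Hypothetical helper: return one row per dict found in df[col]."""
    # explode yields one row per list element, repeating movie_id and title;
    # movies with an empty list produce NaN and are dropped
    exploded = df[["movie_id", "title", col]].explode(col).dropna(subset=[col])
    exploded = exploded.reset_index(drop=True)
    # expand each dict into its own columns, then re-attach the identifiers
    details = pd.json_normalize(exploded[col].tolist())
    return pd.concat([details, exploded[["movie_id", "title"]]], axis=1)


flat_toy_cast = flatten_column(toy_credits, "cast")
```

Calling the same helper with `"crew"` would flatten the crew column, which avoids copy-paste slips between the two loops.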

Now we stack the cast and crew tables to arrive at `credits.csv` in long form. The `order` column is renamed first so that cast and crew ordering stay distinguishable.

In [None]:
flat_casts = flat_casts.rename(columns={"order": "cast_order"})
flat_crew = flat_crew.rename(columns={"order": "crew_order"})

In [None]:
# fully flattened credits: every row is exactly one observation,
# i.e. one cast or crew credit for one movie
flat_credits = pd.concat([flat_casts, flat_crew])
flat_credits.sample(5)
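
After stacking, columns that exist in only one of the two tables are filled with `NaN` for rows from the other table. The sketch below shows this on toy data and adds a label for the table each row came from. The `credit_type` column, the `character` and `job` column names, and the rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy cast and crew tables; column names and rows are invented.
toy_cast = pd.DataFrame(
    {"name": ["Ann"], "character": ["Hero"], "order": [0], "movie_id": [1]}
)
toy_crew = pd.DataFrame({"name": ["Dee"], "job": ["Director"], "movie_id": [1]})

toy_cast = toy_cast.rename(columns={"order": "cast_order"})

# label each row with its source table before stacking;
# ignore_index=True gives the result a fresh 0..n-1 index
toy_flat = pd.concat(
    [toy_cast.assign(credit_type="cast"), toy_crew.assign(credit_type="crew")],
    ignore_index=True,
)
```

With such a label, cast and crew rows can later be separated again with a simple filter such as `toy_flat[toy_flat["credit_type"] == "cast"]`.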

Now we work with `movies.csv`. Recall its structure:

In [None]:
movies.head()

First we drop the `homepage` column, as it is of no use. 

In [None]:
movies.drop(columns=["homepage"], inplace=True)
movies.head()

Now we work with the genres. We want each genre to become its own 0/1 indicator
column, i.e. a one-hot encoding. Code from:
https://stackoverflow.com/questions/48213149/creating-one-hot-encodings-from-a-column-of-dictionaries-with-pandas 
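
To make the target shape concrete, here is a minimal sketch of one-hot encoding a column of genre dicts on toy data. It is not the code from the linked answer; it assumes `genres` has already been parsed into lists of dicts with a `name` key, and the toy rows and variable names are invented for illustration.

```python
import pandas as pd

# Toy movies table; the rows are invented.
toy_movies = pd.DataFrame({
    "title": ["A", "B"],
    "genres": [
        [{"id": 28, "name": "Action"}, {"id": 53, "name": "Thriller"}],
        [{"id": 28, "name": "Action"}],
    ],
})

# keep only the genre names, one list per movie
genre_names = toy_movies["genres"].apply(lambda gs: [g["name"] for g in gs])

# explode to one row per (movie, genre), build indicator columns,
# then collapse back to one row per movie via the original index
one_hot = pd.get_dummies(genre_names.explode()).groupby(level=0).max().astype(int)

toy_encoded = toy_movies.drop(columns=["genres"]).join(one_hot)
```

Each movie keeps a single row, and a movie with several genres has a 1 in several columns.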