# Exploratory Data Analysis

## Setup
Import the relevant packages.


In [1]:
import json
import pandas as pd

We have a single file `20220310.csv`. Let's first look at what type of data we 
are dealing with. 

In [5]:
# load the CSV into a DataFrame
raw_data = pd.read_csv("raw-data/20220310.csv")

# get summary information on the data frame, including columns and data types
raw_data.info()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 26, saw 376
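
The `ParserError` above means the parser inferred one field per row from the top of the file and then met a line with 376 fields at line 26. One common cause is a block of preamble lines ahead of the real header. The following is a minimal sketch of how that could be checked, using a small in-memory stand-in rather than the real file; the sample contents and the `header_row` heuristic are assumptions for illustration.

```python
import csv
import io

import pandas as pd

# Toy stand-in (assumed layout): two single-field preamble lines, then a table.
sample = "report generated 2022-03-10\nsource: export\nid,name,value\n1,a,10\n2,b,20\n"

# Count the fields on each line to see where the table starts.
field_counts = [len(row) for row in csv.reader(io.StringIO(sample))]
print(field_counts)

# Heuristic (assumption): the first line as wide as the widest line is the header.
header_row = field_counts.index(max(field_counts))

# Skip the preamble so the parser sees the real header first.
table = pd.read_csv(io.StringIO(sample), skiprows=header_row)
print(table)
```

If the real file follows the same pattern, passing the file path and a suitable `skiprows` value to `pd.read_csv` would be the analogous call.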


In [None]:
# get key information
movies.info()

In [None]:
credits.info()

## Data Cleaning
The columns `genres`, `keywords`, `production_companies`, `production_countries`, and `spoken_languages` in `movies.csv`, and `cast` and `crew` in `credits.csv`, hold lists of dicts serialized as JSON strings, so we parse them into Python objects.

In [None]:
# parse release dates into datetime.date objects
movies["release_date"] = pd.to_datetime(movies["release_date"]).dt.date

# parse JSON-encoded string columns into Python lists of dicts
# movies.csv
movies_json_columns = [
    "genres", "keywords", "production_companies", "production_countries", 
    "spoken_languages"
]
for col in movies_json_columns:
    movies[col] = movies[col].apply(json.loads)

# credits.csv
credits_json_columns = ["cast", "crew"]
for col in credits_json_columns:
    credits[col] = credits[col].apply(json.loads)

We also require the datasets in long form. First we work with `credits.csv`.

In [None]:
credits.head()

In [None]:
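# flatten the nested cast lists into one row per (movie, cast member)
list_of_casts = []
for i in range(len(credits)):
    # expand this movie's list of cast dicts into a DataFrame
    cast = pd.json_normalize(credits["cast"].iloc[i])
    # attach the movie identifiers (the first two columns of `credits`)
    cast["movie_id"] = credits.iloc[i, 0]
    cast["title"] = credits.iloc[i, 1]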

    list_of_casts.append(cast)

flat_casts = pd.concat(list_of_casts)
flat_casts.sample(5)
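
The row-by-row loop above can also be written without explicit iteration. The following is a minimal sketch on toy data, assuming the `cast` column already holds parsed lists of dicts; `toy_credits` and `toy_flat_casts` are hypothetical names, and the toy fields are invented for illustration.

```python
import pandas as pd

# Toy stand-in for `credits` after the JSON columns have been parsed.
toy_credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["A", "B"],
    "cast": [
        [{"name": "X", "order": 0}, {"name": "Y", "order": 1}],
        [{"name": "Z", "order": 0}],
    ],
})

# One row per (movie, cast member); each `cast` cell is now a single dict.
# Movies with an empty cast list explode to NaN and are dropped here.
exploded = (
    toy_credits[["movie_id", "title", "cast"]]
    .explode("cast")
    .dropna(subset=["cast"])
    .reset_index(drop=True)
)

# Expand the dicts into columns and reattach the movie identifiers.
toy_flat_casts = pd.concat(
    [pd.json_normalize(exploded["cast"].tolist()), exploded[["movie_id", "title"]]],
    axis=1,
)
print(toy_flat_casts)
```

This builds one DataFrame instead of one per movie, which tends to be faster on large tables, at the cost of silently dropping movies that have no cast entries.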


We do the same to flatten `crew`.

In [None]:
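# flatten the nested crew lists into one row per (movie, crew member)
list_of_crews = []
for i in range(len(credits)):
    # expand this movie's list of crew dicts into a DataFrame
    crew = pd.json_normalize(credits["crew"].iloc[i])
    # attach the movie identifiers (the first two columns of `credits`)
    crew["movie_id"] = credits.iloc[i, 0]
    crew["title"] = credits.iloc[i, 1]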

    list_of_crews.append(crew)

flat_crew = pd.concat(list_of_crews)
flat_crew.sample(5)

Now we stack the cast and crew tables to arrive at `credits.csv` in long form. Any `order` column is first renamed per table so that the two stay distinguishable.

In [None]:
flat_casts = flat_casts.rename(columns={"order": "cast_order"})
flat_crew = flat_crew.rename(columns={"order": "crew_order"})

In [None]:
# flatten credits completely so that every row is exactly one observation
# an observation is one person's cast or crew credit on one movie
flat_credits = pd.concat([flat_casts, flat_crew])
flat_credits.sample(5)
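
Because the cast and crew tables do not share all of their columns, `pd.concat` takes the union of the columns and fills the missing cells with `NaN`. The following is a minimal sketch on toy data; the toy fields and the `credit_type` column are assumptions added for illustration and are not part of the original tables.

```python
import pandas as pd

# Toy stand-ins for the flattened cast and crew tables.
toy_casts = pd.DataFrame({"name": ["X"], "character": ["Hero"], "movie_id": [1]})
toy_crew = pd.DataFrame({"name": ["Y"], "job": ["Director"], "movie_id": [1]})

# Hypothetical `credit_type` column records which table each row came from.
toy_flat_credits = pd.concat(
    [toy_casts.assign(credit_type="cast"), toy_crew.assign(credit_type="crew")],
    ignore_index=True,  # avoid repeated index labels in the stacked table
)
print(toy_flat_credits)
```

In the output, the cast row has `NaN` under `job` and the crew row has `NaN` under `character`, so a label such as `credit_type` is one way to tell the two kinds of row apart without relying on which cells are missing.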

Now we work with `movies.csv`. Recall its structure:

In [None]:
movies.head()

First we drop the `homepage` column, as it is not needed for this analysis.

In [None]:
movies.drop(columns=["homepage"], inplace=True)
movies.head()

Now we work with the genres. We would like to turn each genre into its own
indicator column, i.e. one-hot encode the `genres` column. Code from:
https://stackoverflow.com/questions/48213149/creating-one-hot-encodings-from-a-column-of-dictionaries-with-pandas
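
As a rough illustration of the idea, here is a minimal sketch of one-hot encoding a column of lists of dicts. It runs on toy data and is not the code from the linked answer; `toy_movies`, `genre_dummies`, and `toy_encoded` are hypothetical names, and the sketch assumes each genre dict has a `name` key.

```python
import pandas as pd

# Toy stand-in for `movies` after the `genres` column has been parsed.
toy_movies = pd.DataFrame({
    "title": ["A", "B"],
    "genres": [
        [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}],
        [{"id": 28, "name": "Action"}],
    ],
})

# Reduce each list of dicts to a list of genre names.
genre_names = toy_movies["genres"].apply(lambda genres: [g["name"] for g in genres])

# One row per (movie, genre), then one indicator column per genre,
# collapsed back to one row per movie via the original index.
genre_dummies = (
    pd.get_dummies(genre_names.explode())
    .groupby(level=0)
    .max()
    .astype(int)
)

# Replace the nested column with the indicator columns.
toy_encoded = toy_movies.drop(columns=["genres"]).join(genre_dummies)
print(toy_encoded)
```

Each genre name becomes a 0/1 column, and a movie with several genres has a 1 in each matching column.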