# Exploratory Data Analysis

## Setup
Import the relevant packages.


In [24]:
import json
import pandas as pd

We have two data files, `tmdb_5000_credits.csv` and `tmdb_5000_movies.csv`, 
both located in the `raw-data` directory. For simplicity, we refer to them as 
`credits.csv` and `movies.csv`, respectively. 

We first inspect the structure of each file: its columns, data types, and non-null counts. 

In [21]:
# load the CSV files as DataFrames
movies = pd.read_csv("raw-data/tmdb_5000_movies.csv")
credits = pd.read_csv("raw-data/tmdb_5000_credits.csv")

In [22]:
# summarize columns, dtypes, and non-null counts
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [None]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
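
The `info()` output for `movies` shows a few columns with fewer than 4803 non-null entries (`homepage`, `overview`, `release_date`, `runtime`). A compact way to see this at a glance is a per-column missing-value summary. The sketch below is only illustrative: `missing_summary` is a hypothetical helper, and the toy frame stands in for `movies` with made-up values.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and share for every column that has missing values."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy stand-in for `movies`; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "homepage": [None, "http://example.com", None, None],
    "runtime": [120.0, None, 95.0, 101.0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(movies)` should list the same columns that `movies.info()` reports as partially filled.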


## Data Cleaning
The main goal is to clean up the columns that hold nested JSON objects so that 
the data follow tidy data principles. 
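
To make the tidy-data goal concrete, here is a minimal sketch of one way to turn a JSON-list column into a long table with one row per (movie, item) pair. It assumes each cell holds a JSON array of objects with `id` and `name` keys, which the text above does not show; `explode_json_column` is a hypothetical helper, and the toy frame is invented.

```python
import json
import pandas as pd

# Toy stand-in for `movies`; the `id`/`name` keys inside each object are assumed.
toy = pd.DataFrame({
    "id": [10, 20],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 18, "name": "Drama"}]',
    ],
})

def explode_json_column(df, key, col):
    """Return a long table with one row per (key, item) pair from a JSON-list column."""
    parsed = df[[key, col]].copy()
    parsed[col] = parsed[col].apply(json.loads)
    # one row per list element; empty lists become NaN and are dropped
    long_df = parsed.explode(col).dropna(subset=[col]).reset_index(drop=True)
    # expand each dict into columns, prefixed so they do not clash with `key`
    items = pd.json_normalize(long_df[col].tolist()).add_prefix(f"{col}_")
    return pd.concat([long_df[[key]], items], axis=1)

movie_genres = explode_json_column(toy, "id", "genres")
print(movie_genres)
```

The result keeps the movie `id` as a key, so each nested column can live in its own table and be joined back when needed.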

In [27]:
# parse dates (the one missing release_date stays missing as NaT)
movies["release_date"] = pd.to_datetime(movies["release_date"]).dt.date

# parse columns with JSON objects
json_columns = [
    "genres", "keywords", "production_countries", "production_companies", 
    "spoken_languages"
]
for col in json_columns:
    movies[col] = movies[col].apply(json.loads)

movies.head(5)

TypeError: the JSON object must be str, bytes or bytearray, not list
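
`json.loads` raises this `TypeError` when it receives a list instead of a string. A likely cause, though this is an assumption based on the out-of-order execution counts, is that the cell was run a second time after the columns had already been parsed. A minimal sketch of a parser that is safe to rerun (`parse_json_cell` is a hypothetical helper, not part of the original notebook):

```python
import json

def parse_json_cell(value):
    """Parse a JSON string; return values that are already parsed unchanged."""
    if isinstance(value, (str, bytes, bytearray)):
        return json.loads(value)
    return value

raw = '[{"id": 28, "name": "Action"}]'
once = parse_json_cell(raw)     # str -> list of dicts
twice = parse_json_cell(once)   # second pass is a no-op instead of a TypeError
print(twice)
```

In the loop above, this would mean writing `movies[col] = movies[col].apply(parse_json_cell)`. Reloading the CSV files before rerunning the cell avoids the error as well.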