# Exploratory Data Analysis

## Setup
Import the relevant packages.


In [1]:
import ast
import pandas as pd

We have a single file `20220310.csv`. Let's first look at what type of data we 
are dealing with. 

In [2]:
# load the csv as a dataframe
# lineterminator option is used as there are some line separation issues.
raw_data = pd.read_csv("raw-data/20220310.csv", lineterminator="\n")

# get summary information on the data frame, including columns and data types
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  5000 non-null   bool   
 1   backdrop_path          4874 non-null   object 
 2   belongs_to_collection  1742 non-null   object 
 3   budget                 5000 non-null   int64  
 4   genres                 5000 non-null   object 
 5   homepage               2418 non-null   object 
 6   id                     5000 non-null   int64  
 7   imdb_id                4938 non-null   object 
 8   original_language      5000 non-null   object 
 9   original_title         5000 non-null   object 
 10  overview               4963 non-null   object 
 11  popularity             5000 non-null   float64
 12  poster_path            4995 non-null   object 
 13  production_companies   5000 non-null   object 
 14  production_countries   5000 non-null   object 
 15  rele

Let's also look at the first row, transposed so that each column appears on its own line.

In [3]:
# show the first row as a column so that long values stay readable
raw_data.iloc[[0]].transpose()

|  | 0 |
|---|---|
| adult | False |
| backdrop_path | /iQFcwSGbZXMkeyKrxbPnwnRo5fl.jpg |
| belongs_to_collection | {'id': 531241, 'name': 'Spider-Man (Avengers) ... |
| budget | 200000000 |
| genres | [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... |
| homepage | https://www.spidermannowayhome.movie |
| id | 634649 |
| imdb_id | tt10872600 |
| original_language | en |
| original_title | Spider-Man: No Way Home |


## Data Cleaning

First we remove columns that we will obviously not be using, e.g. `backdrop_path`,
`homepage`, etc. 

In [4]:
useless_columns = [
    "backdrop_path", "belongs_to_collection", "homepage",
    "poster_path", "video",
]

raw_data.drop(useless_columns, axis=1, inplace=True)
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   adult                 5000 non-null   bool   
 1   budget                5000 non-null   int64  
 2   genres                5000 non-null   object 
 3   id                    5000 non-null   int64  
 4   imdb_id               4938 non-null   object 
 5   original_language     5000 non-null   object 
 6   original_title        5000 non-null   object 
 7   overview              4963 non-null   object 
 8   popularity            5000 non-null   float64
 9   production_companies  5000 non-null   object 
 10  production_countries  5000 non-null   object 
 11  release_date          4980 non-null   object 
 12  revenue               5000 non-null   int64  
 13  runtime               4999 non-null   float64
 14  spoken_languages      5000 non-null   object 
 15  status               

Next we parse `release_date` as a column of date objects, and convert `status` into a Boolean column (`True` for released, `False` otherwise). 

In [5]:
# parse dates
raw_data["release_date"] = pd.to_datetime(raw_data["release_date"])
# convert status column to Boolean: True for released films, False otherwise
raw_data["status"] = raw_data["status"] == "Released"
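The two conversions above can be checked on toy data. The sketch below uses made-up values (the real `status` categories in the CSV are not shown here) to contrast a dictionary `map`, which leaves unmatched values as `NaN`, with an equality comparison, which gives a strict `True`/`False` column. It also shows `errors="coerce"`, an optional argument that is an assumption on our part and not something the notebook uses.

```python
import pandas as pd

# Toy stand-in for the `status` column (values are illustrative only).
status = pd.Series(["Released", "Post Production", "Released", None])

# Mapping through a dict leaves values missing from the dict as NaN.
mapped = status.map({"Released": True})

# An equality comparison yields a strict Boolean column instead.
is_released = status == "Released"

# errors="coerce" turns unparseable or missing dates into NaT instead of raising.
dates = pd.to_datetime(
    pd.Series(["2021-12-15", "not a date", None]), errors="coerce"
)

print(mapped.isna().sum(), is_released.tolist(), dates.isna().tolist())
```

Which behaviour is preferable depends on whether missing statuses should count as "not released" or stay missing.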

The columns `genres`, `production_companies`, `production_countries`, `spoken_languages`,
`credits`, and `keywords` hold JSON-like strings. Because they use single quotes,
they are not valid JSON, so we parse them as Python literals with
`ast.literal_eval` and store the resulting lists and dictionaries in the dataframe.
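To see why a JSON parser is not enough here, consider a minimal sketch on a single illustrative string shaped like a `genres` cell (the string is typed by hand, not read from the CSV):

```python
import ast
import json

# Illustrative cell value, shaped like the `genres` strings in the CSV.
raw = "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}]"

# Single-quoted keys and strings are not valid JSON, so json.loads raises.
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval only evaluates Python literals; it does not run arbitrary code.
parsed = ast.literal_eval(raw)
names = [g["name"] for g in parsed]

print(json_ok, names)
```

Note that `ast.literal_eval` also accepts `None`, `True`, and `False`, which appear in these columns (e.g. `'logo_path': None`), whereas a naive replacement of single quotes by double quotes would not handle them.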

In [6]:
# parse columns holding nested JSON-like strings into Python objects instead of str
json_cols = [
    "genres", "production_companies", "production_countries", "spoken_languages", 
    "credits", "keywords"
]

for col in json_cols:
    raw_data[col] = raw_data[col].apply(ast.literal_eval)

raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   adult                 5000 non-null   bool          
 1   budget                5000 non-null   int64         
 2   genres                5000 non-null   object        
 3   id                    5000 non-null   int64         
 4   imdb_id               4938 non-null   object        
 5   original_language     5000 non-null   object        
 6   original_title        5000 non-null   object        
 7   overview              4963 non-null   object        
 8   popularity            5000 non-null   float64       
 9   production_companies  5000 non-null   object        
 10  production_countries  5000 non-null   object        
 11  release_date          4980 non-null   datetime64[ns]
 12  revenue               5000 non-null   int64         
 13  runtime           

Due to the use of `ast.literal_eval`, we have ended up with the following columns:

- `genres`, `production_companies`, `production_countries`, `spoken_languages` are lists
- `credits` and `keywords` are dictionaries.

Moreover, `credits` has keys `cast` and `crew`, while `keywords` has a single
key `keywords`. The corresponding values have the same list-of-dictionaries structure
as above. Hence, we clean this data by creating two new columns
for `cast` and `crew`, and reassigning the `keywords` column to the list
stored under its `keywords` key.
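As a rough sketch of this unpacking step, the same result can be obtained with `Series.apply` on a toy frame. The data below is invented to mimic the structure described above; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Toy frame mimicking the parsed `credits` and `keywords` columns.
df = pd.DataFrame({
    "credits": [
        {"cast": [{"name": "A"}], "crew": [{"name": "B", "job": "Director"}]},
        {"cast": [], "crew": [{"name": "C", "job": "Writer"}]},
    ],
    "keywords": [
        {"keywords": [{"id": 1, "name": "hero"}]},
        {"keywords": []},
    ],
})

# Pull each nested list out of its wrapping dictionary, row by row.
df["cast"] = df["credits"].apply(lambda c: c["cast"])
df["crew"] = df["credits"].apply(lambda c: c["crew"])
df["keywords"] = df["keywords"].apply(lambda k: k["keywords"])

print(df[["cast", "crew", "keywords"]])
```

Because `apply` returns a Series with the same index as its input, this form does not rely on the dataframe having a default `RangeIndex`.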

In [7]:
list_cols = ["genres", "production_companies", "production_countries", "spoken_languages",
                "cast", "crew", "keywords"]

cast = []
crew = []
for movie in raw_data["credits"]:
    cast.append(movie["cast"])
    crew.append(movie["crew"])
raw_data["cast"] = pd.Series(cast)
raw_data["crew"] = pd.Series(crew)

new_keywords = []
for movie in raw_data["keywords"]:
    new_keywords.append(movie["keywords"])
raw_data["keywords"] = pd.Series(new_keywords)

def helper(col):
    """Return a Series in which every element of every list is a plain dict."""
    new_col = []
    for val in col:
        new_val = []
        for x in val:
            new_val.append(dict(x))
        
        new_col.append(new_val)
    
    return pd.Series(new_col)

for col in list_cols:
    raw_data[col] = helper(raw_data[col])
    print(col + " done")

genres done
production_companies done
production_countries done
spoken_languages done
cast done
crew done
keywords done


We can now drop the `credits` column. 

In [8]:
raw_data.drop("credits", axis=1, inplace=True)
raw_data.head()

Unnamed: 0,adult,budget,genres,id,imdb_id,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,keywords,cast,crew
0,False,200000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",634649,tt10872600,en,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,10552.154,"[{'id': 420, 'logo_path': '/hUzeosd33nzE5MCNsZ...",...,148.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",True,The Multiverse unleashed.,Spider-Man: No Way Home,8.3,9509,"[{'id': 242, 'name': 'new york city'}, {'id': ...","[{'adult': False, 'gender': 2, 'id': 1136406, ...","[{'adult': False, 'gender': 1, 'id': 2519, 'kn..."
1,False,185000000,"[{'id': 80, 'name': 'Crime'}, {'id': 9648, 'na...",414906,tt1877830,en,The Batman,"In his second year of fighting crime, Batman u...",3956.45,"[{'id': 101405, 'logo_path': None, 'name': '6t...",...,176.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",True,Unmask the truth.,The Batman,8.0,1738,"[{'id': 849, 'name': 'dc comics'}, {'id': 853,...","[{'adult': False, 'gender': 2, 'id': 11288, 'k...","[{'adult': False, 'gender': 2, 'id': 2122, 'kn..."
2,False,0,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",833425,tt7550014,en,No Exit,Stranded at a rest stop in the mountains durin...,3434.009,"[{'id': 127928, 'logo_path': '/h0rjX5vjW5r8yEn...",...,96.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",True,Who will survive the dead of winter?,No Exit,6.5,208,"[{'id': 818, 'name': 'based on novel or book'}...","[{'adult': False, 'gender': 1, 'id': 2378813, ...","[{'adult': False, 'gender': 2, 'id': 2199, 'kn..."
3,False,50000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",568124,tt2953050,en,Encanto,"The tale of an extraordinary family, the Madri...",3275.706,"[{'id': 6125, 'logo_path': '/tVPmo07IHhBs4Huil...",...,102.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",True,There's a little magic in all of us ...almost ...,Encanto,7.7,5282,"[{'id': 2343, 'name': 'magic'}, {'id': 4344, '...","[{'adult': False, 'gender': 1, 'id': 968367, '...","[{'adult': False, 'gender': 0, 'id': 8159, 'kn..."
4,False,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",928381,tt14465894,fr,Sans répit,After going to extremes to cover up an acciden...,2371.051,"[{'id': 152208, 'logo_path': None, 'name': 'Br...",...,95.0,"[{'english_name': 'French', 'iso_639_1': 'fr',...",True,,Restless,5.9,152,"[{'id': 9714, 'name': 'remake'}]","[{'adult': False, 'gender': 2, 'id': 1077537, ...","[{'adult': False, 'gender': 0, 'id': 64130, 'k..."


In the `genres` column, each film has a list of dictionaries, e.g. 
`[{'id': 28, 'name': 'Action'}, ...]`. We don't need the `id` values, so we 
keep only a list of the genre names for each film. The same applies to the 
columns `production_companies` and `keywords`; for `spoken_languages` we keep 
the `iso_639_1` language code instead of the name. 

We don't apply this to `cast` and `crew` because every production team member 
has additional information such as their gender, role, etc. 
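The extraction described above can also be written column-wise. The following is a sketch on invented data; `pluck` is a hypothetical helper name introduced only for this illustration.

```python
import pandas as pd

# Toy frame with the list-of-dictionaries structure described above.
df = pd.DataFrame({
    "genres": [
        [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}],
        [{"id": 27, "name": "Horror"}],
    ],
    "spoken_languages": [
        [{"english_name": "English", "iso_639_1": "en", "name": "English"}],
        [{"english_name": "French", "iso_639_1": "fr", "name": "Français"}],
    ],
})

def pluck(column: pd.Series, key: str) -> pd.Series:
    """Replace each list of dicts with the list of values stored under `key`."""
    return column.apply(lambda items: [item[key] for item in items])

df["genres"] = pluck(df["genres"], "name")
df["spoken_languages"] = pluck(df["spoken_languages"], "iso_639_1")

print(df)
```

This avoids indexing individual cells in a Python loop, which tends to be slow on larger frames.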

In [9]:
# read from a copy while writing to raw_data, to avoid SettingWithCopyWarning
raw_data_copy = raw_data.copy()
for i in range(raw_data.shape[0]):
    # extract the relevant values of these columns
    genres_names = [x["name"] for x in raw_data_copy.iloc[i]["genres"]]
    production_companies_names = [x["name"] for x in raw_data_copy.iloc[i]["production_companies"]]
    spoken_languages_names = [x["iso_639_1"] for x in raw_data_copy.iloc[i]["spoken_languages"]]
    keywords_names = [x["name"] for x in raw_data_copy.iloc[i]["keywords"]]

    # reassign entry values
    raw_data.at[i, "genres"] = genres_names
    raw_data.at[i, "production_companies"] = production_companies_names
    raw_data.at[i, "spoken_languages"] = spoken_languages_names
    raw_data.at[i, "keywords"] = keywords_names

In [10]:
raw_data[["genres", "production_companies", "spoken_languages","keywords"]].head()

|  | genres | production_companies | spoken_languages | keywords |
|---|---|---|---|---|
| 0 | [Action, Adventure, Science Fiction] | [Marvel Studios, Pascal Pictures, Columbia Pic... | [en, tl] | [new york city, hero, villain, comic book, seq... |
| 1 | [Crime, Mystery, Thriller] | [6th & Idaho, Dylan Clark Productions, DC Film... | [en] | [dc comics, crime fighter, secret identity, vi... |
| 2 | [Horror, Thriller] | [20th Century Studios, Flitcraft] | [en] | [based on novel or book, winter, isolation, bl... |
| 3 | [Animation, Comedy, Family, Fantasy] | [Walt Disney Animation Studios, Walt Disney Pi... | [en, es] | [magic, musical, forest, family relationships,... |
| 4 | [Action, Thriller, Crime] | [Bright Lights Films, uMedia, Mahi Films] | [fr] | [remake] |


We finally have lists of plain strings inside these columns. We are now ready to flatten the data and perform one-hot encoding. Whether a column should be flattened or one-hot encoded depends on the model being used. Hence, we define functions that do either, for future use. 

In [11]:
# creates a flat dataframe based on any column
def flatten(df: pd.DataFrame, cols: list):
    """
    INPUTS:
        df:     (pd.DataFrame) a dataframe
        cols:   (list) list of column names in df
    OUTPUTS:
                (pd.DataFrame) a dataframe flattened according to cols
    """

    for col in cols:
        df = df.explode(col)
    
    return df

# creates a new one-hot encoded dataframe based on any column
def one_hot(df: pd.DataFrame, cols: list):
    """
    INPUTS:
        df:     (pd.DataFrame) a dataframe
        cols:   (list) list of column names in df
    OUTPUTS:
                (pd.DataFrame) a dataframe one-hot encoded according to cols
    """
    # flatten data on col
    flat_df = flatten(df, cols)
    
    # one-hot encode flat_df using pd.get_dummies()
    return pd.get_dummies(flat_df, columns=cols).groupby(flat_df.index).max()
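
A short usage sketch on a toy dataframe may help to see what the two functions return. The functions are repeated from the cell above so that the snippet runs on its own; the `toy` frame and its values are invented for illustration.

```python
import pandas as pd

def flatten(df, cols):
    # Same logic as above: one output row per list element.
    for col in cols:
        df = df.explode(col)
    return df

def one_hot(df, cols):
    # Same logic as above: dummy columns, collapsed back to one row per film.
    flat_df = flatten(df, cols)
    return pd.get_dummies(flat_df, columns=cols).groupby(flat_df.index).max()

toy = pd.DataFrame({
    "title": ["A", "B"],
    "genres": [["Action", "Thriller"], ["Thriller"]],
})

flat = flatten(toy, ["genres"])     # three rows; film "A" appears twice
encoded = one_hot(toy, ["genres"])  # two rows; one indicator column per genre

print(flat)
print(encoded)
```

Two caveats follow from how the functions are written. Flattening several list columns in sequence produces every combination of their elements for each film, so row counts can grow quickly. Also, `one_hot` takes the group-wise maximum over all remaining columns, so it is probably safest to pass it a dataframe restricted to columns for which a maximum is meaningful.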
    
    

We store the cleaned data for use in the next notebook. 

In [12]:
%store raw_data

Stored 'raw_data' (DataFrame)
