## DATA CLEANING
First, we need to clean the data before exploring and making predictions with it!

In [1]:
import pandas as pd
import numpy as np

In [2]:
studio_ghibli = pd.read_csv("../data/Studio Ghibli.csv")

studio_ghibli.head()

|   | Name | Year | Director | Screenplay | Budget | Revenue | Genre 1 | Genre 2 | Genre 3 | Duration |
|---|------|------|----------|------------|--------|---------|---------|---------|---------|----------|
| 0 | When Marnie Was There\n (2014) | 2014 | Hiromasa Yonebayashi | Joan G. Robinson | $1150000000.00 | $34949567.00 | Animation | Drama |  | 1h 43m |
| 1 | The Tale of The Princess Kaguya\n (2013) | 2013 | Isao Takahata | Riko Sakaguchi | $49300000.00 | $24366656.00 | Animation | Drama | Fantasy | 2h 17m |
| 2 | The Wind Rises\n (2013) | 2013 | Hayao Miyazaki | Tatsuo Hori | $30000000.00 | $117932401.00 | Drama | Animation | Romance | 2h 6m |
| 3 | From Up on Poppy Hill\n (2011) | 2011 | Goro Miyazaki | Hayao Miyazaki | $22000000.00 | $61037844.00 | Animation | Drama |  | 1h 31m |
| 4 | The Secret World of Arrietty\n (2010) | 2010 | Hiromasa Yonebayashi | Mary Norton | $23000000.00 | $149480483.00 | Fantasy | Animation | Family | 1h 34m |


In [3]:
# Looking at the shape
studio_ghibli.shape

(23, 10)

In [4]:
# And the data types
studio_ghibli.dtypes

Name          object
Year           int64
Director      object
Screenplay    object
Budget        object
Revenue       object
Genre 1       object
Genre 2       object
Genre 3       object
Duration      object
dtype: object

Most of these data types need to be changed: every column except `Year` was loaded as a generic `object`.

### Changing data types

In [5]:
# Changing the dtypes that are supposed to be strings
studio_ghibli["Name"] = studio_ghibli["Name"].astype("string")
studio_ghibli["Director"] = studio_ghibli["Director"].astype("string")
studio_ghibli["Screenplay"] = studio_ghibli["Screenplay"].astype("string")
studio_ghibli["Genre 1"] = studio_ghibli["Genre 1"].astype("string")
studio_ghibli["Genre 2"] = studio_ghibli["Genre 2"].astype("string")
studio_ghibli["Genre 3"] = studio_ghibli["Genre 3"].astype("string")

studio_ghibli.dtypes

Name          string[python]
Year                   int64
Director      string[python]
Screenplay    string[python]
Budget                object
Revenue               object
Genre 1       string[python]
Genre 2       string[python]
Genre 3       string[python]
Duration              object
dtype: object
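
As a side note, the six casts above could also be written as a single call, because `DataFrame.astype` accepts a mapping from column name to dtype. The sketch below runs on a small stand-in frame; `demo` and `string_columns` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Illustrative stand-in for the real DataFrame (values taken from the preview above)
demo = pd.DataFrame({
    "Name": ["The Wind Rises\n (2013)"],
    "Year": [2013],
    "Director": ["Hayao Miyazaki"],
    "Genre 3": ["Romance"],
})

# Columns that should hold text (hypothetical subset for this sketch)
string_columns = ["Name", "Director", "Genre 3"]

# astype accepts a {column: dtype} mapping, so one call covers every listed column
demo = demo.astype({column: "string" for column in string_columns})
print(demo.dtypes)
```

Columns missing from the mapping, such as `Year` here, keep their existing dtype.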

Next come the columns that should be `int64`. Note first that the money values are stored as text with a leading `$` and a trailing `.00`. To convert them to integers, we need to strip those characters.

In [6]:
# Budget column
budget_array = []
for budget in studio_ghibli["Budget"]:
    budget = budget.replace("$", "") 
    budget = budget.replace(".00", "")
    budget_array = np.append(budget_array, budget)

updated_budget = pd.DataFrame({"Budget": budget_array})
studio_ghibli.update(updated_budget)

In [7]:
# Revenue column
revenue_array = []
for revenue in studio_ghibli["Revenue"]:
    revenue = revenue.replace("$", "") 
    revenue = revenue.replace(".00", "")
    revenue_array = np.append(revenue_array, revenue)

updated_revenue = pd.DataFrame({"Revenue": revenue_array})
studio_ghibli.update(updated_revenue)

In [8]:
# And now, changing them to int64
studio_ghibli["Budget"] = studio_ghibli["Budget"].astype("int64")
studio_ghibli["Revenue"] = studio_ghibli["Revenue"].astype("int64")
studio_ghibli.dtypes

Name          string[python]
Year                   int64
Director      string[python]
Screenplay    string[python]
Budget                 int64
Revenue                int64
Genre 1       string[python]
Genre 2       string[python]
Genre 3       string[python]
Duration              object
dtype: object
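
The loop-based cleaning above can also be expressed with pandas' vectorized string methods. This is only a sketch on a few sample values from the preview, assuming every entry follows the `$<digits>.00` pattern. `Series.str.removeprefix` and `Series.str.removesuffix` require pandas 1.4 or newer and only touch the ends of each string, so a `.00` in the middle of a value would be left alone.

```python
import pandas as pd

# Sample values in the same "$<digits>.00" format as the Budget column
money = pd.Series(["$49300000.00", "$30000000.00", "$22000000.00"])

cleaned = (
    money
    .str.removeprefix("$")    # drop the leading currency symbol
    .str.removesuffix(".00")  # drop the trailing cents
    .astype("int64")          # digit-only strings convert directly to integers
)
print(cleaned.tolist())  # → [49300000, 30000000, 22000000]
```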

For the `Year` column, `int64` is unnecessary: release years fit comfortably in `int16` (which covers -32768 to 32767), so let's downcast it.

In [9]:
studio_ghibli["Year"] = studio_ghibli["Year"].astype("int16")
studio_ghibli.dtypes

Name          string[python]
Year                   int16
Director      string[python]
Screenplay    string[python]
Budget                 int64
Revenue                int64
Genre 1       string[python]
Genre 2       string[python]
Genre 3       string[python]
Duration              object
dtype: object
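
Before downcasting, it can be worth confirming that every value fits in the smaller type, since `astype` on out-of-range integers may wrap around silently. A minimal sketch, using the years from the preview as sample data (`years` and `years_small` are illustrative names):

```python
import numpy as np
import pandas as pd

# Sample years from the preview, stored as int64 like the original column
years = pd.Series([2014, 2013, 2013, 2011, 2010], dtype="int64")

# np.iinfo reports the representable range of an integer type
limits = np.iinfo(np.int16)
assert years.between(limits.min, limits.max).all()

years_small = years.astype("int16")

# Bytes used by the values alone: 8 per int64 entry versus 2 per int16 entry
print(years.memory_usage(index=False), years_small.memory_usage(index=False))  # → 40 10
```

On a 23-row dataset the saving is negligible, so this is mostly a matter of good habit.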

Lastly, let's convert `Duration` from text such as `1h 43m` to a time data type (`timedelta64[ns]`).

In [10]:
duration_array = []
for duration in studio_ghibli["Duration"]:
    # Splitting hours from minutes
    hours_minutes = duration.split()

    hour = "0" + hours_minutes[0].replace("h", "")
    minutes = hours_minutes[1].replace("m", "")
    
    if len(minutes) == 1:
        minutes = "0" + minutes 

    time = hour + ":" + minutes

    duration_array = np.append(duration_array, time)


In [17]:
updated_duration = pd.DataFrame({"Duration": duration_array})
updated_duration["Duration"] = pd.to_timedelta(updated_duration["Duration"] + ':00')
studio_ghibli.update(updated_duration)

studio_ghibli["Duration"] = studio_ghibli["Duration"].astype("timedelta64[ns]")
studio_ghibli.dtypes



Name           string[python]
Year                    int16
Director       string[python]
Screenplay     string[python]
Budget                  int64
Revenue                 int64
Genre 1        string[python]
Genre 2        string[python]
Genre 3        string[python]
Duration      timedelta64[ns]
dtype: object
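
The same conversion can be sketched without an explicit loop by extracting the hour and minute counts with a regular expression. This assumes every entry contains both an hour and a minute part, as in the preview. The sample values and the names `durations`, `parts`, and `parsed` are illustrative.

```python
import pandas as pd

# Sample durations in the same "<H>h <M>m" format as the Duration column
durations = pd.Series(["1h 43m", "2h 17m", "2h 6m"])

# Named groups become the column names of the extracted DataFrame
parts = durations.str.extract(r"(?P<hours>\d+)h\s*(?P<minutes>\d+)m").astype("int64")

# Build timedeltas from each component and add them together
parsed = (
    pd.to_timedelta(parts["hours"], unit="h")
    + pd.to_timedelta(parts["minutes"], unit="min")
)
print(parsed)
```

Working with integer components avoids the zero-padding step, because no `HH:MM` string has to be assembled.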

### Handling Invalid or Missing Values

In [19]:
# Checking which columns contain invalid or missing values
studio_ghibli.isna().sum()

Name          0
Year          0
Director      0
Screenplay    9
Budget        0
Revenue       0
Genre 1       0
Genre 2       0
Genre 3       4
Duration      0
dtype: int64

In [20]:
# And checking whether there are duplicated rows
studio_ghibli.duplicated().sum()

np.int64(0)

Perfect! There is no duplicated data. However, we do have some `NaN` values in `Screenplay` and `Genre 3`. In practice, we need to decide whether to delete the affected rows entirely or to fill the blanks with a substitute value such as an average or zero.
Because our dataset is small (23 rows), deleting rows could be counterproductive, so let's try filling in the blanks.
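
Both affected columns hold text, so an average is not meaningful for them. One possible approach is a placeholder label, sketched below on a small stand-in frame. The labels `"Unknown"` and `"None"` are hypothetical choices for illustration, not values from the dataset.

```python
import pandas as pd

# Illustrative stand-in with the two columns that contain missing values
demo = pd.DataFrame({
    "Screenplay": pd.array(["Riko Sakaguchi", pd.NA], dtype="string"),
    "Genre 3": pd.array(["Fantasy", pd.NA], dtype="string"),
})

# fillna accepts a {column: value} mapping, so each column gets its own placeholder
# ("Unknown" and "None" are hypothetical labels chosen for this sketch)
demo = demo.fillna({"Screenplay": "Unknown", "Genre 3": "None"})

print(demo.isna().sum().sum())  # → 0
```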