## DATA CLEANING
First, we need to clean the data before exploring and making predictions with it!

In [1]:
import pandas as pd
import numpy as np

In [2]:
studio_ghibli = pd.read_csv("../data/Studio Ghibli.csv")

studio_ghibli.head()

|   | Name | Year | Director | Screenplay | Budget | Revenue | Genre 1 | Genre 2 | Genre 3 | Duration |
|---|------|------|----------|------------|--------|---------|---------|---------|---------|----------|
| 0 | When Marnie Was There\n (2014) | 2014 | Hiromasa Yonebayashi | Joan G. Robinson | $1150000000.00 | $34949567.00 | Animation | Drama | NaN | 1h 43m |
| 1 | The Tale of The Princess Kaguya\n (2013) | 2013 | Isao Takahata | Riko Sakaguchi | $49300000.00 | $24366656.00 | Animation | Drama | Fantasy | 2h 17m |
| 2 | The Wind Rises\n (2013) | 2013 | Hayao Miyazaki | Tatsuo Hori | $30000000.00 | $117932401.00 | Drama | Animation | Romance | 2h 6m |
| 3 | From Up on Poppy Hill\n (2011) | 2011 | Goro Miyazaki | Hayao Miyazaki | $22000000.00 | $61037844.00 | Animation | Drama | NaN | 1h 31m |
| 4 | The Secret World of Arrietty\n (2010) | 2010 | Hiromasa Yonebayashi | Mary Norton | $23000000.00 | $149480483.00 | Fantasy | Animation | Family | 1h 34m |


In [3]:
# Looking at the shape
studio_ghibli.shape

(23, 10)

In [4]:
# And the data types
studio_ghibli.dtypes

Name          object
Year           int64
Director      object
Screenplay    object
Budget        object
Revenue       object
Genre 1       object
Genre 2       object
Genre 3       object
Duration      object
dtype: object

We can see that every column except `Year` was read as `object`, so most of the data types need to be changed and some values need to be reformatted.

### Changing data types and formatting the dataset

First, let's convert the columns that hold text to the `string` data type.

In [5]:
studio_ghibli["Name"] = studio_ghibli["Name"].astype("string")
studio_ghibli["Director"] = studio_ghibli["Director"].astype("string")
studio_ghibli["Screenplay"] = studio_ghibli["Screenplay"].astype("string")
studio_ghibli["Genre 1"] = studio_ghibli["Genre 1"].astype("string")
studio_ghibli["Genre 2"] = studio_ghibli["Genre 2"].astype("string")
studio_ghibli["Genre 3"] = studio_ghibli["Genre 3"].astype("string")

studio_ghibli.dtypes

Name          string[python]
Year                   int64
Director      string[python]
Screenplay    string[python]
Budget                object
Revenue               object
Genre 1       string[python]
Genre 2       string[python]
Genre 3       string[python]
Duration              object
dtype: object
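The six `astype` calls above can also be written as a single call that takes a column-to-dtype mapping. Here is a minimal sketch on a toy frame; the rows and the `text_columns` name are made up for illustration, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with invented rows; only the column names mirror the dataset.
df = pd.DataFrame({
    "Name": ["Film A", "Film B"],
    "Director": ["Director A", "Director B"],
    "Year": [2010, 2014],
})

# Hypothetical helper list: every column that should hold text.
text_columns = ["Name", "Director"]

# DataFrame.astype accepts a {column: dtype} mapping and leaves other columns alone.
df = df.astype({column: "string" for column in text_columns})

print(df.dtypes)
```

Keeping the column names in one list makes it easy to add or remove a column later without repeating the conversion line.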

We can also see that `Name` contains a newline character (`\n`) followed by the release year. Because the year of the movie is already stored in its own column, we can remove it from the name.

In [6]:
studio_ghibli["Name"] = studio_ghibli["Name"].str.split("\n").str[0]
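To see what the split does, here is a small sketch on two titles copied from the `head()` output. The final `astype("string")` is an optional extra step, not part of the cell above: the dtype listings that follow show `Name` as `object` after the split, and casting again would restore the `string` dtype.

```python
import pandas as pd

# Two raw titles as they appear in the original file.
names = pd.Series(
    ["When Marnie Was There\n (2014)", "The Wind Rises\n (2013)"],
    dtype="string",
)

# Keep only the text before the newline, trim stray spaces,
# and cast back to the string dtype (optional extra step).
titles = names.str.split("\n").str[0].str.strip().astype("string")

print(titles.tolist())
```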

Next, the columns that should be `int64`. Note that the money values are stored as text with a `$` prefix and a `.00` suffix, so we need to remove those characters before converting them to integers.

In [7]:
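# regex=False makes "$" and ".00" literal text instead of regular expressions
studio_ghibli["Budget"] = studio_ghibli["Budget"].str.replace("$", "", regex=False).str.replace(".00", "", regex=False)
studio_ghibli["Revenue"] = studio_ghibli["Revenue"].str.replace("$", "", regex=False).str.replace(".00", "", regex=False)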

In [8]:
# And now, changing them to int64
studio_ghibli["Budget"] = studio_ghibli["Budget"].astype("int64")
studio_ghibli["Revenue"] = studio_ghibli["Revenue"].astype("int64")
studio_ghibli.dtypes

Name                  object
Year                   int64
Director      string[python]
Screenplay    string[python]
Budget                 int64
Revenue                int64
Genre 1       string[python]
Genre 2       string[python]
Genre 3       string[python]
Duration              object
dtype: object
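As an alternative sketch, the conversion can avoid relying on every amount ending in `.00`: strip only the `$` sign, let `pd.to_numeric` parse the decimal number, and then cast to `int64`. The two sample amounts are taken from the `head()` output; this is just one possible approach, and it would truncate any non-zero cents.

```python
import pandas as pd

# Two amounts copied from the original Budget and Revenue columns.
money = pd.Series(["$49300000.00", "$24366656.00"])

# Remove the literal dollar sign, then parse the remaining text as a number.
amounts = pd.to_numeric(money.str.replace("$", "", regex=False), errors="raise")

# The parsed values are floats; cast them to whole numbers.
as_int = amounts.astype("int64")

print(as_int.tolist())
```

With `errors="raise"`, any value that is not a valid number stops the conversion instead of silently becoming `NaN`.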

The `Year` column does not need `int64`; `int16` is enough to store a year, so let's change it.

In [9]:
studio_ghibli["Year"] = studio_ghibli["Year"].astype("int16")
studio_ghibli.dtypes

Name                  object
Year                   int16
Director      string[python]
Screenplay    string[python]
Budget                 int64
Revenue                int64
Genre 1       string[python]
Genre 2       string[python]
Genre 3       string[python]
Duration              object
dtype: object
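Before downcasting, it can be worth confirming that every value fits in the smaller type. A minimal sketch, using two years from the `head()` output and `np.iinfo` to read the limits of `int16`:

```python
import numpy as np
import pandas as pd

# Two release years copied from the dataset preview.
years = pd.Series([2010, 2014], dtype="int64")

# np.iinfo reports the smallest and largest value an integer type can hold.
limits = np.iinfo(np.int16)
fits = years.between(limits.min, limits.max).all()

# Only downcast when every value is inside the int16 range.
years_small = years.astype("int16") if fits else years

print(limits.min, limits.max)
print(years.memory_usage(index=False), years_small.memory_usage(index=False))
```

On a dataset with 23 rows the memory saving is tiny, so this is more about good habits than performance.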

Lastly, let's change `Duration` to a timedelta data type.

In [10]:
studio_ghibli["Duration"] = pd.to_timedelta(studio_ghibli["Duration"].str.replace(" ", ""))
studio_ghibli.dtypes


Name                   object
Year                    int16
Director       string[python]
Screenplay     string[python]
Budget                  int64
Revenue                 int64
Genre 1        string[python]
Genre 2        string[python]
Genre 3        string[python]
Duration      timedelta64[ns]
dtype: object
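A timedelta column is convenient for arithmetic, but for plots or models a plain number of minutes is often easier to work with. Here is a small sketch on two durations from the `head()` output; the `minutes` name is only illustrative.

```python
import pandas as pd

# Two durations copied from the original Duration column.
durations = pd.Series(["1h 43m", "2h 17m"])

# Same conversion as the cell above: drop the space, then parse.
as_timedelta = pd.to_timedelta(durations.str.replace(" ", "", regex=False))

# Convert each timedelta to whole minutes.
minutes = (as_timedelta.dt.total_seconds() // 60).astype("int64")

print(minutes.tolist())
```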

### Handling Invalid or Missing Values

In [11]:
# Checking where the invalid or missing values are
studio_ghibli.isna().sum()

Name          0
Year          0
Director      0
Screenplay    9
Budget        0
Revenue       0
Genre 1       0
Genre 2       0
Genre 3       4
Duration      0
dtype: int64

In [12]:
# And checking whether there are duplicated rows
studio_ghibli.duplicated().sum()

np.int64(0)

Perfect! There is no duplicated data. However, we have some `NaN` values. In practice, we need to decide whether to delete the affected rows entirely or to fill in the blanks, typically with the column average or zero for numeric columns.
Because our dataset is small, deleting rows could be counterproductive, so let's try to fill in the blanks.
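The two options look like this in pandas. This is a minimal sketch on a toy frame with invented values (`Score` and `Writer` are not columns of our dataset), just to compare `dropna` with `fillna`.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row; all values are made up.
toy = pd.DataFrame({
    "Name": ["Film A", "Film B", "Film C"],
    "Score": [7.0, np.nan, 9.0],
    "Writer": ["Writer A", None, "Writer C"],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill the gaps instead.
filled = toy.assign(
    Score=toy["Score"].fillna(toy["Score"].mean()),  # numeric: column average
    Writer=toy["Writer"].fillna("Unknown"),          # text: placeholder label
)

print(len(dropped), len(filled))
```

Dropping costs a third of this toy frame, which is why filling is attractive for small datasets.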

My first approach will be to search for the missing values on Google, as these movies are famous and easy to find. And because this dataset is small, filling in the blanks won't take much of my time.

Now, let's first check which movies don't have a screenplay writer assigned. 

In [13]:
studio_ghibli['Name'][studio_ghibli["Screenplay"].isna()]

5                        Ponyo
8               Only Yesterday
9                Spirited Away
10    My Neighbors the Yamadas
13          My Neighbor Totoro
14           Princess Mononoke
16           Castle in the Sky
18                    Pom Poko
19                 Porco Rosso
Name: Name, dtype: object

From what I found, this data will be added to the dataset:

In [14]:
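# One writer per missing row, in index order (5, 8, 9, 10, 13, 14, 16, 18, 19).
# Index 8 is Only Yesterday, whose screenplay was written by Isao Takahata.
screenplay_writer = [
    "Hayao Miyazaki", "Isao Takahata", "Hayao Miyazaki", 
    "Isao Takahata", "Hayao Miyazaki", "Hayao Miyazaki", 
    "Hayao Miyazaki", "Isao Takahata", "Hayao Miyazaki"
    ]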

# Getting the indexes so we can know where to add these values
screenplay_index = (studio_ghibli['Name'][studio_ghibli["Screenplay"].isna()]).index.tolist() 

studio_ghibli.loc[screenplay_index,"Screenplay"] = screenplay_writer
studio_ghibli.isna().sum()


Name          0
Year          0
Director      0
Screenplay    0
Budget        0
Revenue       0
Genre 1       0
Genre 2       0
Genre 3       4
Duration      0
dtype: int64

We can see that all of the `NaN` values in `Screenplay` are gone!
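Filling from a list depends on the list being in exactly the same order as the missing rows. A sketch of an alternative that maps each title to its writer, so the order no longer matters; the three rows reuse values from this notebook, and the `writers_by_title` name is only illustrative.

```python
import pandas as pd

# Three rows reusing values from the notebook; two have no screenplay writer.
films = pd.DataFrame({
    "Name": ["Ponyo", "Pom Poko", "The Wind Rises"],
    "Screenplay": [None, None, "Tatsuo Hori"],
})

# Hypothetical lookup table: title -> writer found by manual research.
writers_by_title = {"Ponyo": "Hayao Miyazaki", "Pom Poko": "Isao Takahata"}

# Fill only the rows that are missing, looking each one up by its title.
missing = films["Screenplay"].isna()
films.loc[missing, "Screenplay"] = films.loc[missing, "Name"].map(writers_by_title)

print(films["Screenplay"].tolist())
```

A title that is absent from the lookup simply stays missing, so a final `isna().sum()` still shows whether anything was overlooked.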

Lastly, let's follow the same process for `Genre 3`, starting by checking where those `NaN` values show up.

In [15]:
studio_ghibli['Name'][studio_ghibli["Genre 3"].isna()]

0        When Marnie Was There
3        From Up on Poppy Hill
10    My Neighbors the Yamadas
22       The Boy and the Heron
Name: Name, dtype: object

To avoid duplicated genres, let's check the `Genre 1` and `Genre 2` values for these rows.

In [16]:
genre_index = studio_ghibli['Name'][studio_ghibli["Genre 3"].isna()].index.tolist()
studio_ghibli.loc[ genre_index , ["Genre 1", "Genre 2"]]

|    | Genre 1 | Genre 2 |
|----|---------|---------|
| 0  | Animation | Drama |
| 3  | Animation | Drama |
| 10 | Animation | Family |
| 22 | Fantasy | Adventure |


Now, this is the data I found, which will be added to the dataset:

In [17]:
genre_3 = [
    "Thriller", "Romance",
    "Comedy", "Animation"
    ]

studio_ghibli.loc[genre_index,"Genre 3"] = genre_3
studio_ghibli.isna().sum()

Name          0
Year          0
Director      0
Screenplay    0
Budget        0
Revenue       0
Genre 1       0
Genre 2       0
Genre 3       0
Duration      0
dtype: int64
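Since the goal was to avoid repeating a genre within a movie, a quick check can confirm it. A minimal sketch on two of the filled rows (indexes 0 and 22), counting the distinct genres in each row:

```python
import pandas as pd

# Genres of rows 0 and 22 after filling in Genre 3.
genres = pd.DataFrame(
    {
        "Genre 1": ["Animation", "Fantasy"],
        "Genre 2": ["Drama", "Adventure"],
        "Genre 3": ["Thriller", "Animation"],
    },
    index=[0, 22],
)

# Count the distinct values in each row; fewer than 3 means a repeated genre.
distinct_per_row = genres.nunique(axis=1)
rows_with_repeats = genres[distinct_per_row < genres.shape[1]]

print(rows_with_repeats.empty)
```

Running the same two lines on the three genre columns of the full dataset would list any movie that carries the same genre twice.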

Great! The data is clean and we are ready to explore it. But first, let's save this clean dataset so we can use it in future notebooks.

### Saving the clean dataset

In [18]:
studio_ghibli.to_csv("../data/Studio Ghibli - Clean.csv", index=False)

In [19]:
studio_ghibli.head()

|   | Name | Year | Director | Screenplay | Budget | Revenue | Genre 1 | Genre 2 | Genre 3 | Duration |
|---|------|------|----------|------------|--------|---------|---------|---------|---------|----------|
| 0 | When Marnie Was There | 2014 | Hiromasa Yonebayashi | Joan G. Robinson | 1150000000 | 34949567 | Animation | Drama | Thriller | 0 days 01:43:00 |
| 1 | The Tale of The Princess Kaguya | 2013 | Isao Takahata | Riko Sakaguchi | 49300000 | 24366656 | Animation | Drama | Fantasy | 0 days 02:17:00 |
| 2 | The Wind Rises | 2013 | Hayao Miyazaki | Tatsuo Hori | 30000000 | 117932401 | Drama | Animation | Romance | 0 days 02:06:00 |
| 3 | From Up on Poppy Hill | 2011 | Goro Miyazaki | Hayao Miyazaki | 22000000 | 61037844 | Animation | Drama | Romance | 0 days 01:31:00 |
| 4 | The Secret World of Arrietty | 2010 | Hiromasa Yonebayashi | Mary Norton | 23000000 | 149480483 | Fantasy | Animation | Family | 0 days 01:34:00 |
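One thing to keep in mind for future notebooks: a CSV file stores only text, so the data types set here are not saved with it. A minimal sketch of how the types could be restored on load, using an in-memory buffer in place of the real file and a single row for illustration:

```python
import io

import pandas as pd

# One-row stand-in for the clean dataset, with the dtypes set in this notebook.
clean = pd.DataFrame({
    "Name": pd.Series(["The Wind Rises"], dtype="string"),
    "Year": pd.Series([2013], dtype="int16"),
    "Duration": pd.to_timedelta([126], unit="min"),
})

# Write to an in-memory buffer instead of "../data/Studio Ghibli - Clean.csv".
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# Ask read_csv for the text and integer dtypes, then parse the durations.
reloaded = pd.read_csv(buffer, dtype={"Name": "string", "Year": "int16"})
reloaded["Duration"] = pd.to_timedelta(reloaded["Duration"])

print(reloaded.dtypes)
```

Without the `dtype` argument and the `to_timedelta` call, `Year` would come back as `int64` and `Duration` as plain text.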
