# Processing Anime Data 

## Background

We will use pyjanitor to show how method chaining lets us clean a dataset in one pass. pandas DataFrame methods and pyjanitor methods are chained together to complete the cleaning. A one-shot script comes first, followed by a step-by-step walk-through of each part of the method chain.

We have adapted a [TidyTuesday analysis](https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-04-23/readme.md) that was originally performed in R. The original text from TidyTuesday will be shown in blockquotes.

Note: TidyTuesday is based on the principles discussed and made popular by Hadley Wickham in his paper [Tidy Data](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf).

Here is a description of the Anime data set that we will use.

>This week's data comes from [Tam Nguyen](https://github.com/tamdrashtri) and [MyAnimeList.net via Kaggle](https://www.kaggle.com/aludosan/myanimelist-anime-dataset-as-20190204). [According to Wikipedia](https://en.wikipedia.org/wiki/MyAnimeList) - "MyAnimeList, often abbreviated as MAL, is an anime and manga social networking and social cataloging application website. The site provides its users with a list-like system to organize and score anime and manga. It facilitates finding users who share similar tastes and provides a large database on anime and manga. The site claims to have 4.4 million anime and 775,000 manga entries. In 2015, the site received 120 million visitors a month."
>
>Anime without rankings or popularity scores were excluded. Producers, genre, and studio were converted from lists to tidy observations, so there will be repetitions of shows with multiple producers, genres, etc. The raw data is also uploaded.
>
>Lots of interesting ways to explore the data this week!

## Import libraries and load data

In [20]:
# Import pyjanitor and pandas
import janitor
import pandas as pd

In [21]:
# Suppress the user warnings raised when our custom pandas flavor functions are overwritten.
# This may no longer be necessary.
import warnings
warnings.filterwarnings('ignore')

## One-Shot

In [22]:
filename = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-23/raw_anime.csv'
df = pd.read_csv(filename)

clean_df = (df
            # Patterns are raw strings; regex=True is explicit because pandas 2.0
            # changed the str.replace default to literal matching.
            .process_text(column='producers', string_function='replace', pat=r"(\[|\])|'", repl="", regex=True)
            .process_text(column='producers', string_function='split', pat=',')
            .explode('producers')
            .process_text(column='producers', string_function='strip')
            .process_text(column='genre', string_function='replace', pat=r"(\[|\])|'", repl="", regex=True)
            .process_text(column='genre', string_function='split', pat=',')
            .explode('genre')
            .process_text(column='genre', string_function='strip')
            .process_text(column='studio', string_function='replace', pat=r"(\[|\])|'", repl="", regex=True)
            .process_text(column='studio', string_function='split', pat=',')
            .explode('studio')
            .process_text(column='studio', string_function='strip')
            .process_text(column='aired', string_function='replace', pat=r"\{|\}|'from':\s*|'to':\s*", repl="", regex=True)
            .process_text(column='aired', string_function='split', pat=',')
            .process_text(column='aired', string_function='slice', start=0, stop=2)
            .process_text(column='aired', string_function='join', sep=',')
            .deconcatenate_column(column_name="aired", new_column_names=["start_date", "end_date"], sep=",")
            .remove_columns(column_names=["aired"])
            .process_text(column='start_date', string_function='replace', pat="'", repl="")
            .process_text(column='start_date', string_function='slice', start=0, stop=10)
            .process_text(column='end_date', string_function='replace', pat="'", repl="")
            # stop=11 because end_date still carries the space that followed the comma
            .process_text(column='end_date', string_function='slice', start=0, stop=11)
            .to_datetime("start_date", format="%Y-%m-%d", errors="coerce")
            .to_datetime("end_date", format="%Y-%m-%d", errors="coerce")
            .fill_empty(column_names=["rank", "popularity"], value=0)
            .filter_on("rank != 0 & popularity != 0")
           )

clean_df.head()

| | animeID | name | title_english | title_japanese | title_synonyms | type | source | producers | genre | studio | ... | popularity | members | favorites | synopsis | background | premiered | broadcast | related | start_date | end_date |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Action | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Adventure | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Comedy | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Drama | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Sci-Fi | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |


## Multi-Step

>Data Dictionary
>
>Heads up the dataset is about 97 mb - if you want to free up some space, drop the synopsis and background, they are long strings, or broadcast, premiered, related as they are redundant or less useful.
>
>|variable       |class     |description |
>|:--------------|:---------|:-----------|
>|animeID        |double    | Anime ID (as in https://myanimelist.net/anime/animeID) |
>|name           |character | anime title - extracted from the site. |
>|title_english  |character | title in English (sometimes is different, sometimes is missing) |
>|title_japanese |character | title in Japanese (if Anime is Chinese or Korean, the title, if available, in the respective language) |
>|title_synonyms |character | other variants of the title |
>|type           |character | anime type (e.g. TV, Movie, OVA) |
>|source         |character | source of anime (i.e original, manga, game, music, visual novel etc.) |
>|producers      |character | producers |
>|genre          |character | genre |
>|studio         |character | studio |
>|episodes       |double    | number of episodes |
>|status         |character | Aired or not aired |
>|airing         |logical   | True/False is still airing |
>|start_date     |double    | Start date (ymd) |
>|end_date       |double    | End date (ymd) |
>|duration       |character | Per episode duration or entire duration, text string |
>|rating         |character | Age rating |
>|score          |double    | Score (higher = better) |
>|scored_by      |double    | Number of users that scored |
>|rank           |double    | Rank - weight according to MyAnimeList formula |
>|popularity     |double    | based on how many members/users have the respective anime in their list |
>|members        |double    | number members that added this anime in their list |
>|favorites      |double    | number members that favorites these in their list |
>|synopsis       |character | long string with anime synopsis |
>|background     |character | long string with production background and other things |
>|premiered      |character | anime premiered on season/year |
>|broadcast      |character | when is (regularly) broadcasted |
>|related        |character | dictionary: related animes, series, games etc. |
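
Following the heads-up about file size, one option is to skip the long text columns while reading the file. The sketch below uses a tiny in-memory CSV as a stand-in for the real file; the rows and values are made up, and only the column names come from the data dictionary.

```python
import io

import pandas as pd

# Toy stand-in for raw_anime.csv (hypothetical rows, real column names).
toy_csv = io.StringIO(
    "animeID,name,synopsis,background,score\n"
    "1,Show A,long text,more long text,1.0\n"
    "2,Show B,long text,more long text,2.0\n"
)

# Columns the TidyTuesday note flags as long or less useful.
skip = {"synopsis", "background", "broadcast", "premiered", "related"}

# A callable passed to usecols keeps only the columns for which it returns True.
slim_df = pd.read_csv(toy_csv, usecols=lambda col: col not in skip)
print(slim_df.columns.tolist())
```

The rest of this walk-through keeps all columns, so this step is optional.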

### Step 0: Load data

In [23]:
filename = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-23/raw_anime.csv'
df = pd.read_csv(filename)

In [24]:
df.head(3).T

| | 0 | 1 | 2 |
|:--|:--|:--|:--|
| animeID | 1 | 5 | 6 |
| name | Cowboy Bebop | Cowboy Bebop: Tengoku no Tobira | Trigun |
| title_english | Cowboy Bebop | Cowboy Bebop: The Movie | Trigun |
| title_japanese | カウボーイビバップ | カウボーイビバップ 天国の扉 | トライガン |
| title_synonyms | [] | ["Cowboy Bebop: Knockin' on Heaven's Door"] | [] |
| type | TV | Movie | TV |
| source | Original | Original | Manga |
| producers | ['Bandai Visual'] | ['Sunrise', 'Bandai Visual'] | ['Victor Entertainment'] |
| genre | ['Action', 'Adventure', 'Comedy', 'Drama', 'Sc... | ['Action', 'Drama', 'Mystery', 'Sci-Fi', 'Space'] | ['Action', 'Sci-Fi', 'Adventure', 'Comedy', 'D... |
| studio | ['Sunrise'] | ['Bones'] | ['Madhouse'] |


### Step 1: Clean `producers` column

The first step cleans up the `producers` column by removing the brackets (`[]`) and trimming surrounding whitespace.

>```
>clean_df <- raw_df %>%
>  # Producers
>  mutate(producers = str_remove(producers, "\\["),
>         producers = str_remove(producers, "\\]"))
>```

What is `mutate`? The pandas [comparison with R](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html) likens R's `mutate` to pandas' `df.assign`: both add or overwrite columns while keeping the existing ones, and both return a new data frame rather than modifying the original.
Because the R pipeline assigns its result back to `clean_df`, the net effect in this example is the same as overwriting a column with `df['col'] = X`, and that is how we will read `mutate` here.
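
A minimal sketch of the two pandas spellings mentioned above, on a hypothetical two-row frame (the values are placeholders, not rows from the anime data). It only illustrates that `assign` leaves its input untouched while item assignment overwrites the column on the object it is applied to.

```python
import pandas as pd

# Hypothetical stand-in for the producers column.
toy = pd.DataFrame({"producers": ["['A']", "['B', 'C']"]})

# assign returns a new DataFrame; `toy` itself is left untouched.
stripped = toy.assign(producers=toy["producers"].str.strip("[]"))

# Item assignment overwrites the column on the object it is applied to.
in_place = toy.copy()
in_place["producers"] = in_place["producers"].str.strip("[]")

print(toy["producers"].tolist())
print(stripped["producers"].tolist())
```

Either spelling yields the same column; `assign` is the one that slots into a method chain.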

As we can see, this looks like a list of items but in string form

In [25]:
# Let's see what we are trying to remove
df.loc[df['producers'].str.contains(r"\[", na=False), 'producers'].head()

0               ['Bandai Visual']
1    ['Sunrise', 'Bandai Visual']
2        ['Victor Entertainment']
3               ['Bandai Visual']
4          ['TV Tokyo', 'Dentsu']
Name: producers, dtype: object

Let's use the `process_text` function to *process* this (pun intended).

In [26]:
clean_df = (
    df
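    # Raw-string pattern; regex=True is explicit because pandas 2.0 made str.replace default to literal matching
    .process_text(column='producers', string_function='replace', pat=r"(\[|\])|'", repl="", regex=True)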
)

The brackets and single quotes (`'`) are now removed.

In [27]:
clean_df['producers'].head()

0             Bandai Visual
1    Sunrise, Bandai Visual
2      Victor Entertainment
3             Bandai Visual
4          TV Tokyo, Dentsu
Name: producers, dtype: object
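
For comparison, the same replacement can be sketched in plain pandas, without pyjanitor. The two sample strings are copied from the output above. The pattern is written as a character class, which matches the same three characters as the alternation used in the chain. `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

toy = pd.Series(
    ["['Bandai Visual']", "['Sunrise', 'Bandai Visual']"],
    name="producers",
)

# Raw string so the backslashes reach the regex engine unchanged.
# The character class matches "[", "]" or "'".
pattern = r"[\[\]']"

cleaned = toy.str.replace(pattern, "", regex=True)
print(cleaned.tolist())
```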

With the brackets removed, we move on to the next part of the R pipeline:
>```
>  separate_rows(producers, sep = ",") %>% 
>```

`separate_rows` appears to go through each value of the column and, where a value holds several delimited items, create one row per item while repeating the values of the remaining columns. pandas calls this operation `explode`, and we will use `DataFrame.explode` for it. First, we turn each string into a list with the `split` string method.

In [28]:
clean_df = (
    clean_df
    .process_text(column = 'producers', string_function = 'split', pat = ',')
    .explode(column='producers')
)

Now every producer is its own row.

In [29]:
clean_df['producers'].head()

0           Bandai Visual
1                 Sunrise
1           Bandai Visual
2    Victor Entertainment
3           Bandai Visual
Name: producers, dtype: object
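
To make the row duplication concrete, here is a small plain-pandas sketch on a hypothetical two-show frame (the names `Show A`, `P1`, and so on are placeholders). It shows that the other columns are repeated and that the original index label is kept for every new row.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Show A", "Show B"],
        "producers": ["P1, P2", "P3"],
    }
)

exploded = (
    toy
    .assign(producers=toy["producers"].str.split(","))  # strings -> lists
    .explode("producers")                               # one row per list element
)
print(exploded)
```

The index label `0` appears twice in the result, just like the repeated `1` in the output above. Call `reset_index(drop=True)` afterwards if unique labels are needed.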

The next part of the R code removes single quotes and trims whitespace. We already removed the single quotes earlier, so all that is left is to trim with the pandas `strip` string method.
>```
>  mutate(producers = str_remove(producers, "\\'"),
>         producers = str_remove(producers, "\\'"),
>         producers = str_trim(producers)) %>%
>```

In [30]:
clean_df = clean_df.process_text(column = 'producers', string_function = 'strip')

Finally, here is our cleaned `producers` column.

In [31]:
clean_df['producers'].head()

0           Bandai Visual
1                 Sunrise
1           Bandai Visual
2    Victor Entertainment
3           Bandai Visual
Name: producers, dtype: object
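
The printed Series looks the same before and after `strip` because the default display hides leading whitespace. A short sketch, using one of the producer strings shown earlier, makes the difference visible by printing the values as a list:

```python
import pandas as pd

# Splitting on "," alone leaves the space that followed the comma.
pieces = pd.Series("Sunrise, Bandai Visual".split(","))

print(pieces.tolist())              # second item starts with a space
print(pieces.str.strip().tolist())  # whitespace trimmed
```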

### Step 2: Clean `genre` and `studio` Columns

Let's repeat the same process for the `genre` and `studio` columns.

>```
>  # Genre
>  mutate(genre = str_remove(genre, "\\["),
>         genre = str_remove(genre, "\\]")) %>%
>  separate_rows(genre, sep = ",") %>%
>  mutate(genre = str_remove(genre, "\\'"),
>         genre = str_remove(genre, "\\'"),
>         genre = str_trim(genre)) %>%
>  # Studio
>  mutate(studio = str_remove(studio, "\\["),
>         studio = str_remove(studio, "\\]")) %>%
>  separate_rows(studio, sep = ",") %>%
>  mutate(studio = str_remove(studio, "\\'"),
>         studio = str_remove(studio, "\\'"),
>         studio = str_trim(studio)) %>%
>```

In [32]:
clean_df = (
    clean_df
    # Perform the operation for genre.
    # Raw-string patterns; regex=True is explicit because pandas 2.0 made str.replace default to literal matching.
    .process_text(column='genre', string_function='replace', pat=r"(\[|\])|'", repl="", regex=True)
    .process_text(column='genre', string_function='split', pat=',')
    .explode('genre')
    .process_text(column='genre', string_function='strip')
    # Now do it for studio.
    .process_text(column='studio', string_function='replace', pat=r"(\[|\])|'", repl="", regex=True)
    .process_text(column='studio', string_function='split', pat=',')
    .explode('studio')
    .process_text(column='studio', string_function='strip')
)

Here are the resulting cleaned columns.

In [33]:
clean_df[['genre', 'studio']].head()

| | genre | studio |
|:--|:--|:--|
| 0 | Action | Sunrise |
| 0 | Adventure | Sunrise |
| 0 | Comedy | Sunrise |
| 0 | Drama | Sunrise |
| 0 | Sci-Fi | Sunrise |
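
Since `producers`, `genre`, and `studio` all go through the same four steps, the repetition could be folded into one function and applied with `DataFrame.pipe`. The sketch below is plain pandas. `split_list_column` is a hypothetical helper, not part of pyjanitor or pandas, and the toy frame uses placeholder values.

```python
import pandas as pd


def split_list_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Turn a stringified-list column into one row per item (hypothetical helper)."""
    cleaned = (
        df[column]
        .str.replace(r"[\[\]']", "", regex=True)  # drop brackets and quotes
        .str.split(",")                            # string -> list of pieces
    )
    return (
        df.assign(**{column: cleaned})
        .explode(column)                                      # one row per piece
        .assign(**{column: lambda d: d[column].str.strip()})  # trim whitespace
    )


toy = pd.DataFrame(
    {
        "name": ["Show A"],
        "genre": ["['Action', 'Drama']"],
        "studio": ["['S1', 'S2']"],
    }
)

tidy = toy.pipe(split_list_column, "genre").pipe(split_list_column, "studio")
print(tidy)
```

Exploding two columns in a row produces every genre and studio combination for a show, which is why the cleaned data repeats each title many times.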


### Step 3: Clean `aired` column

The `aired` column needs a little more work. In addition to the usual string removal and whitespace trimming, we want to split the values into two separate columns, `start_date` and `end_date`.

>```r
>  # Aired
>  mutate(aired = str_remove(aired, "\\{"),
>         aired = str_remove(aired, "\\}"),
>         aired = str_remove(aired, "'from': "),
>         aired = str_remove(aired, "'to': "),
>         aired = word(aired, start = 1, 2, sep = ",")) %>%
>  separate(aired, into = c("start_date", "end_date"), sep = ",") %>%
>  mutate(start_date = str_remove_all(start_date, "'"),
>         start_date = str_sub(start_date, 1, 10),
>         end_date = str_remove_all(start_date, "'"),
>         end_date = str_sub(end_date, 1, 10)) %>%
>  mutate(start_date = lubridate::ymd(start_date),
>         end_date = lubridate::ymd(end_date)) %>%
>```

We will use the `process_text` function, as well as pyjanitor's `deconcatenate_column` to clean up.

In [34]:
# Currently looks like this
clean_df['aired'].head()

0    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
0    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
0    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
0    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
0    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
Name: aired, dtype: object

In [35]:
clean_df = (
    clean_df
    # Strip the braces and the 'from': / 'to': keys (raw-string pattern;
    # regex=True is explicit because pandas 2.0 made str.replace default to literal matching).
    .process_text(column='aired', string_function='replace', pat=r"\{|\}|'from':\s*|'to':\s*", repl="", regex=True)
    # Keep only the first two comma-separated pieces: the start and end timestamps.
    .process_text(column='aired', string_function='split', pat=',')
    .process_text(column='aired', string_function='slice', start=0, stop=2)
    .process_text(column='aired', string_function='join', sep=',')
    # Split into two columns and drop the original.
    .deconcatenate_column(column_name="aired", new_column_names=["start_date", "end_date"], sep=",")
    .remove_columns(column_names=["aired"])
    # Remove the quotes and keep only the date part.
    .process_text(column='start_date', string_function='replace', pat="'", repl="")
    .process_text(column='start_date', string_function='slice', start=0, stop=10)
    .process_text(column='end_date', string_function='replace', pat="'", repl="")
    # stop=11 because end_date still carries the space that followed the comma
    .process_text(column='end_date', string_function='slice', start=0, stop=11)
    .to_datetime("start_date", format="%Y-%m-%d", errors="coerce")
    .to_datetime("end_date", format="%Y-%m-%d", errors="coerce")
)
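
The string steps in this cell can be traced on a single value with plain Python. The sample below is hypothetical: it follows the shape of the `aired` strings shown above, but the dates and the trailing `'extra'` key are made up for illustration. It also shows why the two slices use different `stop` values.

```python
import re

# Hypothetical value shaped like the `aired` strings; dates and 'extra' key are made up.
raw = (
    "{'from': '2001-01-05T00:00:00+00:00', "
    "'to': '2001-06-29T00:00:00+00:00', 'extra': 'ignored'}"
)

# 1. Remove the braces and the 'from': / 'to': keys.
step1 = re.sub(r"\{|\}|'from':\s*|'to':\s*", "", raw)

# 2. Split on commas, keep the first two pieces, and join them back.
step2 = ",".join(step1.split(",")[0:2])

# 3. Separate into two strings, mirroring the deconcatenate_column call.
start, end = step2.split(",")

# 4. Drop the quotes and keep only the date part.
start = start.replace("'", "")[0:10]
end = end.replace("'", "")[0:11]  # one extra character: the space left after the comma

print(repr(start), repr(end))
```

The end date keeps a leading space at this point, which is the reason for the extra character in its slice.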

In [36]:
# Resulting 'start_date' and 'end_date' columns with 'aired' column removed
clean_df[['start_date', 'end_date']].head()

| | start_date | end_date |
|:--|:--|:--|
| 0 | 1998-04-03 | 1999-04-24 |
| 0 | 1998-04-03 | 1999-04-24 |
| 0 | 1998-04-03 | 1999-04-24 |
| 0 | 1998-04-03 | 1999-04-24 |
| 0 | 1998-04-03 | 1999-04-24 |
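
The `errors="coerce"` argument matters for rows whose text is not a valid date. The sketch below calls `pd.to_datetime` directly, on the assumption that pyjanitor's `to_datetime` forwards its keyword arguments to it. The two invalid strings are hypothetical examples of what a missing end date might look like after the string steps.

```python
import pandas as pd

# One valid date and two hypothetical strings that do not match the format.
raw_dates = pd.Series(["1998-04-03", "None", "not a date"])

# With errors="coerce", unparseable values become NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
print(parsed)
```

Rows that end up as `NaT` can then be handled like any other missing value.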


### Step 4: Filter out unranked and unpopular series

Finally, let's drop the series that have no rank or no popularity score, using pyjanitor's `filter_on`.

In [37]:
# First fill any NA values with 0 and then filter != 0
clean_df = clean_df.fill_empty(column_names=["rank", "popularity"], value=0).filter_on(
    "rank != 0 & popularity != 0"
)
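
For readers without pyjanitor at hand, the same two steps can be sketched in plain pandas with `fillna` and `query`. The three-row frame is hypothetical; only the column names `rank` and `popularity` come from the data set.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Show A", "Show B", "Show C"],
        "rank": [1.0, None, 3.0],
        "popularity": [10.0, 20.0, None],
    }
)

kept = (
    toy
    .fillna({"rank": 0, "popularity": 0})   # missing scores become 0
    .query("rank != 0 & popularity != 0")   # keep rows where both are non-zero
)
print(kept["name"].tolist())
```

If the data contains no genuine zeros in these columns, `toy.dropna(subset=["rank", "popularity"])` gives the same rows in one step.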

### End Result

In [38]:
clean_df.head()

| | animeID | name | title_english | title_japanese | title_synonyms | type | source | producers | genre | studio | ... | popularity | members | favorites | synopsis | background | premiered | broadcast | related | start_date | end_date |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Action | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Adventure | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Comedy | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Drama | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | [] | TV | Original | Bandai Visual | Sci-Fi | Sunrise | ... | 39.0 | 795733.0 | 43460.0 | In the year 2071, humanity has colonized sever... | When Cowboy Bebop first aired in spring of 199... | Spring 1998 | Saturdays at 01:00 (JST) | {'Adaptation': [{'mal_id': 173, 'type': 'manga... | 1998-04-03 | 1999-04-24 |
