# Tidy Up Web-Scraped Media Franchise Data

## Background
This example combines [pyjanitor](https://anaconda.org/conda-forge/pyjanitor) and [pandas-flavor](https://anaconda.org/conda-forge/pandas-flavor) to showcase an explicit, and therefore reproducible, workflow built on dataframe __method chaining__.

The data-cleaning workflow largely follows the [R example](https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-07-02/revenue.R) from [the tidytuesday project](https://github.com/rfordatascience/tidytuesday). The raw data is scraped from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_highest-grossing_media_franchises) titled "*List of highest-grossing media franchises*". The workflow is presented in both multi-step (Section 1) and one-shot (Section 2) form.

More specifically, you will find several data-cleaning techniques that come up frequently in web-scraping tasks, including:

* String operations with regular expressions (with `pyjanitor`)
* Data type changes (with `pyjanitor`)
* Split strings in cells into separate rows (with `pyjanitor` and `pandas`)
* Split strings in cells into separate columns (with `pyjanitor`)
* Filter dataframe values based on substring pattern (with `pyjanitor`)
* Column value remapping with fuzzy substring matching (with `pyjanitor` + `pandas-flavor`)
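
As a small illustration of what method chaining looks like, here is a minimal sketch that uses plain `pandas` only, so it runs without `pyjanitor` installed. The dataframe `toy` and its values are made up for illustration; the footnote-stripping pattern mirrors the one applied to `original_media` later in this example.

```python
import pandas as pd

# Toy data standing in for a scraped table (hypothetical rows).
toy = pd.DataFrame(
    {
        "Franchise": ["Hello Kitty", "Winnie the Pooh"],
        "Original media": ["Cartoon character[38]", "Book[59]"],
    }
)

tidy = (
    toy
    # Normalize column names: lower case, spaces to underscores.
    .rename(columns=lambda c: c.lower().replace(" ", "_"))
    # Drop everything from the first "[" onward (footnote markers).
    .assign(
        original_media=lambda d: d["original_media"].str.replace(
            r"\[.+", "", regex=True
        )
    )
)
print(tidy)
```

Each step returns a new dataframe, so the whole transformation reads top to bottom and no intermediate variable is mutated along the way.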

Data visualization is not included in this example, but if you are looking for inspiration, [here](https://www.reddit.com/r/dataisbeautiful/comments/c53540/highest_grossing_media_franchises_oc/) is a good example.

---

## Structural convention
### 1. Annotation system in code comments
This example includes both `pyjanitor` and `pandas-flavor` methods. As you step through it, code comments use the following annotations to indicate where each method comes from:

* `[pyjanitor]` denotes `pyjanitor` methods
* `[pandas-flavor]` denotes custom `pandas-flavor` methods
* `[pyjanitor + pandas-flavor]` denotes `pandas-flavor` methods built on top of `pyjanitor` functions

### 2. R counterpart snippets and Python code in tandem
The multi-step workflow alternates the original R snippets (from tidytuesday) with the corresponding Python implementations.

---

## Python implementation

### Preparation

In [1]:
# Import pyjanitor, pandas, and pandas-flavor
import janitor
import pandas as pd
import pandas_flavor as pf
from typing import List

In [2]:
# Suppress warnings raised when we overwrite our custom pandas-flavor functions
import warnings
warnings.filterwarnings('ignore')

---

### Section 1 Multi-Step
#### Load data

R snippet:
```R
url <- "https://en.wikipedia.org/wiki/List_of_highest-grossing_media_franchises"
df <- url %>% 
  read_html() %>% 
  html_table(fill = TRUE) %>% 
  .[[2]]
```

In [3]:
fileurl = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_media_franchises'
df_raw = pd.read_html(fileurl)[3]
df_raw.head(3)

Unnamed: 0,Franchise,Year of inception,Total revenue (USD),Revenue breakdown (est.),Original media,Creator(s),Owner(s)
0,Pokémon,1996,est. $95 billion[a],Licensed merchandise – $64.015 billion[b] Vide...,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
1,Hello Kitty,1974,est. $86 billion,Merchandise sales – $85.832 billion[j] Manga m...,Cartoon character[38],Yuko Shimizu Shintaro Tsuji,Sanrio
2,Winnie the Pooh,1924,est. $76 billion,Retail sales – $76.19 billion[n] DVD & Blu-ray...,Book[59],A. A. Milne E. H. Shepard,The Walt Disney Company


#### Rename columns
R snippet:
```R
clean_money <- df %>% 
  set_names(nm = c("franchise", "year_created", "total_revenue", "revenue_items",
                   "original_media", "creators", "owners"))
```

In [4]:
colnames = (
    'franchise',
    'year_created',
    'total_revenue',
    'revenue_items',
    'original_media',
    'creators',
    'owners'
)
df_dirty = df_raw.rename(columns = dict(zip(df_raw.columns, colnames)))

df_dirty.head(3)

Unnamed: 0,franchise,year_created,total_revenue,revenue_items,original_media,creators,owners
0,Pokémon,1996,est. $95 billion[a],Licensed merchandise – $64.015 billion[b] Vide...,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
1,Hello Kitty,1974,est. $86 billion,Merchandise sales – $85.832 billion[j] Manga m...,Cartoon character[38],Yuko Shimizu Shintaro Tsuji,Sanrio
2,Winnie the Pooh,1924,est. $76 billion,Retail sales – $76.19 billion[n] DVD & Blu-ray...,Book[59],A. A. Milne E. H. Shepard,The Walt Disney Company
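
One caveat with the `dict(zip(...))` renaming pattern above is that `zip` stops at the shorter of its two inputs, so a mismatch between the number of scraped columns and the number of new names would go unnoticed. The following minimal sketch adds a length check before renaming; the two-column frame `raw` is a hypothetical stand-in for `df_raw`.

```python
import pandas as pd

# Hypothetical two-column stand-in for the scraped table.
raw = pd.DataFrame([["Pokémon", 1996]], columns=["Franchise", "Year of inception"])
new_names = ("franchise", "year_created")

# zip() silently truncates to the shorter input, so check the lengths first.
assert len(raw.columns) == len(new_names), "column count changed upstream"

renamed = raw.rename(columns=dict(zip(raw.columns, new_names)))
print(list(renamed.columns))
```

Because Wikipedia tables change over time, a check like this can surface a layout change early instead of producing a partially renamed dataframe.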


#### Clean up `total_revenue` column
R snippet:
```R
clean_money <- df %>% ... %>%
mutate(total_revenue = str_remove(total_revenue, "est."),
     total_revenue = str_trim(total_revenue),
     total_revenue = str_remove(total_revenue, "[$]"),
     total_revenue = word(total_revenue, 1, 1),
     total_revenue = as.double(total_revenue))
```

In [5]:
column_name = 'total_revenue'
df_clean_money = (
    df_dirty
    # [pyjanitor] remove the "est." prefix and the dollar sign
    .process_text(column=column_name, string_function='replace', pat=r'(est\.)|(\$)', repl='', regex=True)
    .process_text(column=column_name, string_function='strip')  # [pyjanitor]
    # [pyjanitor] keep only the first two characters (assumes a two-digit amount in billions)
    .process_text(column=column_name, string_function='slice', stop=2)
    .change_type(column_name, float)  # [pyjanitor]
)
df_clean_money.head(3)

Unnamed: 0,franchise,year_created,total_revenue,revenue_items,original_media,creators,owners
0,Pokémon,1996,95.0,Licensed merchandise – $64.015 billion[b] Vide...,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
1,Hello Kitty,1974,86.0,Merchandise sales – $85.832 billion[j] Manga m...,Cartoon character[38],Yuko Shimizu Shintaro Tsuji,Sanrio
2,Winnie the Pooh,1924,76.0,Retail sales – $76.19 billion[n] DVD & Blu-ray...,Book[59],A. A. Milne E. H. Shepard,The Walt Disney Company
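
For comparison, the same cleanup can be sketched with plain `pandas` string methods. This version takes the first whitespace-delimited word, as the R `word(total_revenue, 1, 1)` call does, rather than a fixed two-character slice. The series `s` is illustrative, and its third value is hypothetical, added only to show the difference.

```python
import pandas as pd

# Illustrative values; the third one is hypothetical.
s = pd.Series(["est. $95 billion[a]", "est. $86 billion", "est. $4.5 billion"])

cleaned = (
    s
    .str.replace(r"est\.|\$", "", regex=True)  # drop "est." and the dollar sign
    .str.strip()                               # trim surrounding whitespace
    .str.split().str[0]                        # first whitespace-delimited word
    .astype(float)
)
print(cleaned.tolist())
```

With a two-character slice, a value such as `4.5 billion` would be cut to `4.` and parsed as `4.0`, so taking the first word is the safer choice when amounts are not guaranteed to have exactly two digits.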


#### Split column `revenue_items` into `revenue_category` and `revenue`
R snippet:
```R
clean_category <- clean_money %>% 
    separate_rows(revenue_items, sep = "\\[") %>% 
    filter(str_detect(revenue_items, "illion")) %>% 
    separate(revenue_items, into = c("revenue_category", "revenue"), sep = "[$]") %>% 
    mutate(revenue_category = str_remove(revenue_category, " – "),
         revenue_category = str_remove(revenue_category, regex(".*\\]")),
         revenue_category = str_remove(revenue_category, "\n")) 
```

In [6]:
column_name = 'revenue_items'
df_clean_category = (
    df_clean_money
    # [pyjanitor] split on footnote markers such as "[b]", giving a list of items per cell
    .process_text(column=column_name, string_function='split', pat=r'\[.+?\]')
    .explode(column_name)  # one row per revenue item
    .filter_string(column_name, 'illion')  # [pyjanitor] keep items that state an amount
    .deconcatenate_column(column_name, new_column_names=['revenue_category', 'revenue'], sep=r'\$', preserve_position=True)  # [pyjanitor]
    .process_text(column='revenue_category', string_function='replace', pat=r'(\s–\s)|(.*\])', repl='', regex=True)  # [pyjanitor]
    .process_text(column='revenue_category', string_function='strip')  # [pyjanitor]
)

df_clean_category.head(3)

Unnamed: 0,franchise,year_created,total_revenue,revenue_category,revenue,original_media,creators,owners
0,Pokémon,1996,95.0,Licensed merchandise,64.015 billion,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
0,Pokémon,1996,95.0,Video games,19.533 billion,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
0,Pokémon,1996,95.0,Card game,11.491 billion,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
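
The split, explode, filter, and separate sequence above can be hard to picture, so here is a minimal sketch of the same idea in plain `pandas`, assuming pandas 1.4 or later for the `regex` argument of `str.split`. The one-row frame `toy` is a shortened, illustrative version of the Pokémon row.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "franchise": ["Pokémon"],
        "revenue_items": [
            "Licensed merchandise – $64.015 billion[b] Video games – $19.533 billion[c]"
        ],
    }
)

items = (
    toy
    # Split each cell on footnote markers like "[b]" into a list of items.
    .assign(revenue_items=lambda d: d["revenue_items"].str.split(r"\[.+?\]", regex=True))
    # One row per list element.
    .explode("revenue_items")
    # Keep only fragments that state an amount ("million" or "billion").
    .loc[lambda d: d["revenue_items"].str.contains("illion", na=False)]
    .reset_index(drop=True)
)

# Separate "category – $amount" at the literal dollar sign.
parts = items["revenue_items"].str.split("$", n=1, expand=True, regex=False)
result = items.drop(columns="revenue_items").assign(
    revenue_category=parts[0].str.replace("–", "", regex=False).str.strip(),
    revenue=parts[1].str.strip(),
)
print(result)
```

The empty fragment left after the final footnote marker is what the `illion` filter removes, which is why the filter comes right after `explode`.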


#### Clean up `revenue_category` column
R snippet:
```R
clean_df <- clean_category %>% 
  mutate(revenue_category = case_when(
    str_detect(str_to_lower(revenue_category), "box office") ~ "Box Office",
    str_detect(str_to_lower(revenue_category), "dvd|blu|vhs|home video|video rentals|video sales|streaming|home entertainment") ~ "Home Video/Entertainment",
    str_detect(str_to_lower(revenue_category), "video game|computer game|mobile game|console|game|pachinko|pet|card") ~ "Video Games/Games",
    str_detect(str_to_lower(revenue_category), "comic|manga") ~ "Comic or Manga",
    str_detect(str_to_lower(revenue_category), "music|soundtrack") ~ "Music",
    str_detect(str_to_lower(revenue_category), "tv") ~ "TV",
    str_detect(str_to_lower(revenue_category), "merchandise|licens|mall|stage|retail") ~ "Merchandise, Licensing & Retail",
    TRUE ~ revenue_category))
```

In [7]:
# pandas-flavor helper functions


# [pyjanitor + pandas-flavor]
@pf.register_dataframe_method
def fuzzy_match_replace(df, column_name: str, mapper: dict = None):
    """Remap values in a column by regular-expression substring matching.

    Parameters
    ----------
    df: pd.DataFrame
        Input dataframe to be modified
    column_name: str
        Name of the column to be operated on
    mapper: dict, default None
        {pattern_0: newstr_0, pattern_1: newstr_1, ..., pattern_n: newstr_n}
        Each key is a pattern passed to `str.contains`; matching cells are
        replaced with the corresponding value.

    Returns
    -------
    df: pd.DataFrame

    """
    if mapper is None:  # nothing to remap
        return df
    for k, v in mapper.items():
        # na=False treats missing values as non-matching
        condition = df[column_name].str.contains(k, na=False)
        # [pyjanitor] update_where: update value when condition is True
        df = df.update_where(condition, column_name, v)
    return df

In [8]:
# Value mapper `revenue_category`
value_mapper = {
    'box office': 'Box Office',
    'dvd|blu|vhs|home video|video rentals|video sales|streaming|home entertainment': 'Home Video/Entertainment',
    'video game|computer game|mobile game|console|game|pachinko|pet|card': 'Video Games/Games',
    'comic|manga': 'Comic or Manga',
    'music|soundtrack': 'Music',
    'tv': 'TV',
    'merchandise|licens|mall|stage|retail': 'Merchandise, Licensing & Retail',
}

column_name = 'revenue_category'
df_clean_category = (
    df_clean_category
    .process_text(column=column_name, string_function='lower')  
    .reset_index(drop=True)
    .fuzzy_match_replace(column_name, mapper=value_mapper)  # [pyjanitor + pandas-flavor]
)
df_clean_category.head(3)

Unnamed: 0,franchise,year_created,total_revenue,revenue_category,revenue,original_media,creators,owners
0,Pokémon,1996,95.0,"Merchandise, Licensing & Retail",64.015 billion,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
1,Pokémon,1996,95.0,Video Games/Games,19.533 billion,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
2,Pokémon,1996,95.0,Video Games/Games,11.491 billion,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
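
The R `case_when` call assigns the label of the first pattern that matches. The loop in `fuzzy_match_replace` re-tests the already updated column on every pass, and it gives the same result here because the lower-case patterns do not match the capitalized replacement labels. A minimal plain-`pandas` sketch that makes the first-match-wins rule explicit is shown below; the function name `remap_first_match`, the shortened `rules` dictionary, and the sample values are all hypothetical.

```python
import pandas as pd


def remap_first_match(series: pd.Series, rules: dict) -> pd.Series:
    """Replace each value with the label of the first pattern it matches."""
    remapped = series.copy()
    # Tracks rows that no earlier rule has claimed yet.
    unmatched = pd.Series(True, index=series.index)
    for pattern, label in rules.items():
        # Always test the ORIGINAL values, so new labels are never re-matched.
        hit = unmatched & series.str.contains(pattern, na=False)
        remapped[hit] = label
        unmatched &= ~hit
    return remapped


categories = pd.Series(["box office", "dvd & blu-ray", "card game", "museum"])
rules = {
    "box office": "Box Office",
    "dvd|blu|vhs": "Home Video/Entertainment",
    "game|card": "Video Games/Games",
}
print(remap_first_match(categories, rules).tolist())
```

Values that match no rule, such as `museum`, pass through unchanged, which corresponds to the `TRUE ~ revenue_category` fallback in the R snippet.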


#### Clean up `revenue` column
R snippet:
```R
%>% 
mutate(revenue = str_remove(revenue, "illion"),
     revenue = str_trim(revenue),
     revenue = str_remove(revenue, " "),
     revenue = case_when(str_detect(revenue, "m") ~ paste0(str_extract(revenue, "[:digit:]+"), "e-3"),
                         str_detect(revenue, "b") ~ str_extract(revenue, "[:digit:]+"),
                         TRUE ~ NA_character_),
     revenue = format(revenue, scientific = FALSE),
     revenue = parse_number(revenue)) %>%
mutate(original_media = str_remove(original_media, "\\[.+"))
```

In [9]:
# Clean up revenue values
column_name = 'revenue'
df_clean = (
    df_clean_category.copy()
    .process_text(column=column_name, string_function='replace', pat='illion', repl='')
    .process_text(column=column_name, string_function='strip')
    .process_text(column=column_name, string_function='replace', pat=' ', repl='')
    # billions are the base unit, so drop the trailing "b"
    .process_text(column=column_name, string_function='replace', pat=r'\s*b', repl='', regex=True)
    # express millions in billions by appending the exponent "e-3"
    .process_text(column=column_name, string_function='replace', pat=r'\s*m', repl='e-3', regex=True)
    # strip footnote markers such as "[38]"
    .process_text(column='original_media', string_function='replace', pat=r'\[.+', repl='', regex=True)
    .process_text(column='year_created', string_function='slice', stop=4)
    .transform(pd.to_numeric, errors='ignore')
)
df_clean.head(3)

Unnamed: 0,franchise,year_created,total_revenue,revenue_category,revenue,original_media,creators,owners
0,Pokémon,1996,95.0,"Merchandise, Licensing & Retail",64.015,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
1,Pokémon,1996,95.0,Video Games/Games,19.533,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
2,Pokémon,1996,95.0,Video Games/Games,11.491,Video game,Satoshi Tajiri Ken Sugimori,Nintendo (trademark) The Pokémon Company (Nint...
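
The string manipulation above converts every amount to billions by turning a trailing `m` into the exponent `e-3`. The same conversion can be sketched numerically in plain `pandas`, as below; the sample values in `revenue` are illustrative, and the `"unknown"` entry is hypothetical, included to show how unparseable values end up as `NaN`.

```python
import pandas as pd

revenue = pd.Series(["64.015 billion", "500 million", "unknown"])

# Numeric part of each string; anything unparseable becomes NaN.
number = pd.to_numeric(
    revenue.str.extract(r"([\d.]+)", expand=False), errors="coerce"
)
# Scale factor relative to billions: 1 for "billion", 1e-3 for "million".
scale = revenue.str.extract(r"(b|m)illion", expand=False).map({"b": 1.0, "m": 1e-3})

in_billions = number * scale
print(in_billions.tolist())
```

Keeping the number and the unit in separate steps makes it easier to add further units later, at the cost of being slightly longer than the string-based approach.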


---

### Section 2 One-Shot

In [10]:
df_clean = (
    pd.read_html(fileurl)[3]
    .rename(columns = dict(zip(df_raw.columns, colnames)))
    .process_text(column='total_revenue', string_function='replace', pat=r'(est\.)|(\$)', repl='', regex=True)
    .process_text(column='total_revenue', string_function='strip')
    .process_text(column='total_revenue', string_function='slice', stop=2)
    .change_type('total_revenue', float)  # [pyjanitor]
    .process_text(column='revenue_items', string_function='split', pat=r'\[.+?\]')
    .explode('revenue_items')
    .filter_string('revenue_items', 'illion')  # [pyjanitor]
    .deconcatenate_column('revenue_items', new_column_names=['revenue_category', 'revenue'], sep=r'\$', preserve_position=True)
    .process_text(column='revenue_category', string_function='replace', pat=r'(\s–\s)|(.*\])', repl='', regex=True)
    .process_text(column='revenue_category', string_function='strip')
    .process_text(column='revenue_category', string_function='lower')  # [pyjanitor] convert to lower case
    .reset_index(drop=True)
    .fuzzy_match_replace('revenue_category', mapper=value_mapper)  # [pyjanitor + pandas-flavor]
    .process_text(column='revenue', string_function='replace', pat='illion', repl='')
    .process_text(column='revenue', string_function='strip')
    .process_text(column='revenue', string_function='replace', pat=' ', repl='')
    .process_text(column='revenue', string_function='replace', pat=r'\s*b', repl='', regex=True)
    .process_text(column='revenue', string_function='replace', pat=r'\s*m', repl='e-3', regex=True)
    .process_text(column='original_media', string_function='replace', pat=r'\[.+', repl='', regex=True)
    .process_text(column='year_created', string_function='slice', stop=4)  # same step as in Section 1
    .transform(pd.to_numeric, errors='ignore')
)

---

### Final aggregation and join
R snippet:
```R
sum_df <- clean_df %>%
  group_by(franchise, revenue_category) %>% 
  summarize(revenue = sum(revenue))

metadata_df <- clean_df %>% 
  select(franchise:revenue_category, original_media:owners, -total_revenue)

final_df <- left_join(sum_df, metadata_df, 
                      by = c("franchise", "revenue_category")) %>% 
  distinct(.keep_all = TRUE)

final_df
```

In [11]:
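# Sum only the numeric revenue columns; recent pandas versions would
# otherwise concatenate the string columns within each group
df_sum = (
    df_clean
    .groupby(['franchise', 'revenue_category'])[['total_revenue', 'revenue']]
    .sum()
    .reset_index()
)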
df_sum.head(3)

Unnamed: 0,franchise,revenue_category,total_revenue,revenue
0,Anpanman,Box Office,60.0,0.067
1,Anpanman,"Merchandise, Licensing & Retail",60.0,60.375
2,Anpanman,museum,60.0,0.025
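
The R snippet sums only `revenue` within each group. A minimal sketch of that narrower aggregation, using named aggregation in plain `pandas`, is shown below. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: franchise "A" has two box-office entries to be summed.
toy = pd.DataFrame(
    {
        "franchise": ["A", "A", "B"],
        "revenue_category": ["Box Office", "Box Office", "Music"],
        "revenue": [1.5, 2.0, 0.25],
        "creators": ["x", "x", "y"],
    }
)

summed = (
    toy
    .groupby(["franchise", "revenue_category"], as_index=False)
    # Named aggregation: output column "revenue" is the sum of input "revenue".
    .agg(revenue=("revenue", "sum"))
)
print(summed)
```

Naming the aggregated column explicitly keeps non-numeric columns such as `creators` out of the result, regardless of the pandas version in use.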


In [12]:
df_metadata = df_clean[
    ['franchise', 'revenue_category', 'original_media', 'creators']
]
df_metadata.head(3)

Unnamed: 0,franchise,revenue_category,original_media,creators
0,Pokémon,"Merchandise, Licensing & Retail",Video game,Satoshi Tajiri Ken Sugimori
1,Pokémon,Video Games/Games,Video game,Satoshi Tajiri Ken Sugimori
2,Pokémon,Video Games/Games,Video game,Satoshi Tajiri Ken Sugimori


---
### Final Dataframe

In [13]:
# Generate final dataframe
df_final = (
    pd.merge(
        df_sum, df_metadata, how='left', on=['franchise', 'revenue_category']
    )
    .drop_duplicates(keep='first')
    .reset_index(drop=True)
)
df_final.head(3)

Unnamed: 0,franchise,revenue_category,total_revenue,revenue,original_media,creators
0,Anpanman,Box Office,60.0,0.067,Manga,Takashi Yanase
1,Anpanman,"Merchandise, Licensing & Retail",60.0,60.375,Manga,Takashi Yanase
2,Anpanman,museum,60.0,0.025,Manga,Takashi Yanase
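
An alternative to dropping duplicates after the join is to deduplicate the metadata first, so the merge cannot multiply rows. The sketch below illustrates this with plain `pandas`; the frames `sums` and `meta` are hypothetical, and `validate="one_to_one"` is an optional safeguard that raises an error if either side still has repeated keys.

```python
import pandas as pd

keys = ["franchise", "revenue_category"]

# Hypothetical aggregated revenue: one row per key.
sums = pd.DataFrame(
    {"franchise": ["A"], "revenue_category": ["Box Office"], "revenue": [3.5]}
)
# Hypothetical metadata with a repeated row, as happens after exploding items.
meta = pd.DataFrame(
    {
        "franchise": ["A", "A"],
        "revenue_category": ["Box Office", "Box Office"],
        "original_media": ["Film", "Film"],
    }
)

final = sums.merge(
    meta.drop_duplicates(),  # remove exact duplicate metadata rows first
    how="left",
    on=keys,
    validate="one_to_one",   # fail loudly if keys are still duplicated
)
print(final)
```

If the metadata differs between rows that share the same keys, the validation fails instead of silently keeping the first row, which may or may not be the behavior you want for this dataset.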
