## Using the OpenRefine divide-and-conquer approach to cleaning

[OpenRefine](https://openrefine.org) is a great tool for inspecting and cleaning data, in no small part due to a workflow that makes heavy use of filters and facets.  In this lecture, we will develop and discuss tools for replicating these filters and facets in `Python` with `more_dfply`.

In [2]:
import numpy as np  # np.nan is used in the exercise cells
import pandas as pd
from dfply import *

In [3]:
artists = pd.read_csv("./data/Artists.csv")

## Performing `mutate` using `text_facet` and `case_when` or `ifelse`

Filters and facets are not just used to inspect data in OpenRefine, but are also used for a divide-and-conquer style of transformation.  When performing a transformation while a filter or facet is present, **that transformation only affects the *visible* rows, leaving the others unchanged**.

This style of transformation can be replicated by (for example) combining `text_filter` with the `more_dfply.ifelse` and `more_dfply.case_when` functions.
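To make the "only the visible rows change" idea concrete, here is a minimal sketch in plain pandas. It does not use `more_dfply`; the toy data and the column name `Nationality_clean` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the Nationality column (values invented for illustration).
toy = pd.DataFrame({"Nationality": ["American", "Spanish", "american", "Danish"]})

# The "filter": a boolean mask marking the rows that would be visible.
visible = toy["Nationality"].str.contains("american", case=False)

# The "transformation": replace values only where the mask is True,
# leaving every other row exactly as it was.
toy["Nationality_clean"] = toy["Nationality"].mask(visible, "American")
```

The Spanish and Danish rows pass through untouched, which is the behavior OpenRefine gives when a transformation is applied under a filter.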

#### Binary choice with `ifelse`

In [4]:
from more_dfply import ifelse, case_when
from more_dfply.facets import text_filter, text_facet

df = (artists
 >> select(X.Nationality)
 >> mutate(is_american = ifelse(text_filter(X.Nationality, 'American'), 1, 0))
 >> head
)

df

Unnamed: 0,Nationality,is_american
0,American,1
1,Spanish,0
2,American,1
3,American,1
4,Danish,0


#### Any number of conditions with `case_when`

In [5]:
(artists
 >> select(X.Nationality)
 >> mutate(north_american = case_when((text_filter(X.Nationality, 'American'), 1),
                                      (text_filter(X.Nationality, 'Canadian'), 1),
                                      (text_filter(X.Nationality, 'Mexican'), 1),
                                      (True, 0)))
 >> head(10)
)

Unnamed: 0,Nationality,north_american
0,American,1.0
1,Spanish,0.0
2,American,1.0
3,American,1.0
4,Danish,0.0
5,Italian,0.0
6,American,1.0
7,American,1.0
8,American,1.0
9,French,0.0


#### Performing `mutate`s based on `text_facet`s with `ifelse` and `case_when`

In [6]:
from more_dfply import ifelse, case_when
na = ['American', 'Canadian', 'Mexican']

(artists
 >> select(X.Nationality)
 >> mutate(north_american = ifelse(text_facet(X.Nationality, na), 1, 0))
 >> head
)

Unnamed: 0,Nationality,north_american
0,American,1
1,Spanish,0
2,American,1
3,American,1
4,Danish,0


#### Divide-and-conquer with `case_when`

In [7]:
na = ['American', 'Canadian', 'Mexican']
scand = ['Finnish', 'Icelandic', 'Norwegian', 'Danish', 'Swedish', 'Greenlandic']

(artists
 >> select(X.Nationality)
 >> mutate(region = case_when((text_facet(X.Nationality, na), "North American"), 
                              (text_facet(X.Nationality, scand), "Scandinavian"),
                              ( True, "Other")))
 >> head
)

Unnamed: 0,Nationality,region
0,American,North American
1,Spanish,Other
2,American,North American
3,American,North American
4,Danish,Scandinavian
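For comparison, the same labelling can be sketched with plain pandas and NumPy. This assumes that a text facet on a list of labels behaves like `Series.isin` and that the first matching condition wins; both are assumptions about `more_dfply`, not statements about its implementation. The toy series is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for artists.Nationality (values chosen for illustration only).
nationality = pd.Series(["American", "Spanish", "Danish", "Canadian", None])

na = ["American", "Canadian", "Mexican"]
scand = ["Finnish", "Icelandic", "Norwegian", "Danish", "Swedish", "Greenlandic"]

# Assumption: a facet on a list of labels is equivalent to membership testing.
conditions = [nationality.isin(na), nationality.isin(scand)]
labels = ["North American", "Scandinavian"]

# np.select uses the label of the first condition that is True for each row
# and falls back to the default when no condition holds.
region = pd.Series(np.select(conditions, labels, default="Other"),
                   index=nationality.index)
```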


#### Use `facet_by_label_count` to filter/mutate based on label frequency

In [8]:
from more_dfply.facets import facet_by_label_count

(artists
 >> select(X.Nationality)
 >> mutate(Nationality_trimmed = ifelse(facet_by_label_count(X.Nationality, from_ = 500), 
                           X.Nationality, 
                           "Other"))
)

Unnamed: 0,Nationality,Nationality_trimmed
0,American,American
1,Spanish,Other
2,American,American
3,American,American
4,Danish,Other
...,...,...
15217,American,American
15218,American,American
15219,German,German
15220,,Other
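A rough plain-pandas equivalent of filtering by label frequency is sketched below. It assumes the threshold is inclusive (labels seen at least `threshold` times are kept); whether `from_` is inclusive in `facet_by_label_count` is not confirmed here. The toy data and the small threshold are made up so the result is easy to check by eye.

```python
import pandas as pd

# Toy data: 5 American, 3 German, 1 Danish, 1 missing.
nationality = pd.Series(["American"] * 5 + ["German"] * 3 + ["Danish", None])

# Count how often each label occurs (missing values are not counted).
counts = nationality.value_counts()

# Look up each row's label count; rows with a missing label get a count of 0.
row_counts = nationality.map(counts).fillna(0)

# Keep labels seen at least `threshold` times; everything else becomes "Other".
threshold = 3  # stands in for from_ = 500 on the full data set
trimmed = nationality.where(row_counts >= threshold, "Other")
```

As in the output above, rare labels and missing values both end up as `"Other"`.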


## Cleaning Case Study -- Cleaning up `FIRST.APPEARANCE`

In this case study, we will illustrate the iterative process of cleaning up a text column by focusing on extracting the year from the `FIRST.APPEARANCE` column of the comic book wiki data.

In [9]:
comics = pd.read_csv('./data/Comic_Data_Messy.csv')
comics

Unnamed: 0,page_id,urlslug,ID,ALIGN,SEX,ALIVE,APPEARANCES,FIRST.APPEARANCE,comic,PHYSICAL
0,666101,\/Jonathan_Dillon_(Earth-616),Public Identity,,Male Characters,Living Characters,4.0,Apr-97,marvel,"Blue Eyes , Brown Hair"
1,280850,\/John_(Mutant)_(Earth-616),Public Identity,,Male Characters,Deceased Characters,,Oct-01,marvl,"Blue Eyes , Blond Hair"
2,129267,\/wiki\/Gene_LaBostrie_(New_Earth),Public Identity,Good Characters,Male Characters,Living Characters,15.0,"1987, September",DC comics,", Black Hair"
3,157368,\/wiki\/Reemuz_(New_Earth),Public Identity,Good Characters,Male Characters,Deceased Characters,15.0,"1992, September",DC,"Black Eyes ,"
4,16171,\/Aquon_(Earth-616),Secret Identity,Bad Characters,Male Characters,Living Characters,1.0,Jul-73,marvl,","
...,...,...,...,...,...,...,...,...,...,...
23267,183949,\/wiki\/Queen_of_Hearts_IV_(New_Earth),,Bad Characters,Female Characters,Living Characters,2.0,"2009, October",DC Comics,","
23268,30345,\/Regent_(Earth-616),Secret Identity,Bad Characters,Male Characters,Living Characters,3.0,Dec-93,marvel,", Brown Hair"
23269,432532,\/Malcolm_Monroe_(Earth-616),Secret Identity,Good Characters,Male Characters,Deceased Characters,1.0,Mar-11,marvel,"Blue Eyes , Black Hair"
23270,16723,\/Tether_(Earth-616),Secret Identity,Neutral Characters,Female Characters,Living Characters,2.0,Oct-97,Marvel Comics,", No Hair"


### Note on workflow and illustrating cell evolution

When performing this task, I would use two notebook cells.  

**View Cell:** The first, which I have labeled `# View Cell`, is used to explore the various patterns in this column.  I start by identifying a pattern and adding a filter to this cell's pipe that shows only the corresponding rows, partly to debug my regular expressions and partly to inspect all the matching values.  After a pattern has been dealt with, the filter is negated to show the remaining cases not yet dealt with.

**Transform Cell:** The second cell, labeled `# Transform Cell`, uses `mutate`, `case_when`, and the filter patterns from the View cell to apply the divide-and-conquer technique.  Once I am satisfied with a filter pattern in the View cell, I add a tuple to `case_when` to transform the matching entries.

**Marking cell evolution:** To illustrate the evolution of these two cells, I have added a comment that counts the evolution steps.  *To be clear, in practice this is all accomplished in two cells.*

In [10]:
# View Cell
# Evolution 1

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE
0,Apr-97
1,Oct-01
2,"1987, September"
3,"1992, September"
4,Jul-73
...,...
23267,"2009, October"
23268,Dec-93
23269,Mar-11
23270,Oct-97


In [11]:
# Transform Cell
# Evolution 1

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> mutate(year = case_when((True, "Just getting started")))
 >> filter_by(X.year.notna())
 )

Unnamed: 0,FIRST.APPEARANCE,year
0,Apr-97,Just getting started


### Case 1 - `Mon-YY` from the 2000's

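The first case I identified and fixed was the rows with the `Mon-YY` pattern that came from the 2000s.  The filter `[a-zA-Z]{3}-[01]\d` matches two-digit years from `00` to `19`, which we read as 2000 through 2019 (on the assumption that there were no comics from the 1900s or 1910s).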
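Before putting a pattern into the pipe, it can help to try it on a few hand-picked values. The sketch below uses only the standard `re` module and assumes the filter performs an unanchored search (as `Series.str.contains` does); the sample values are taken from the column shown above.

```python
import re

# Three letters, a hyphen, then a two-digit year starting with 0 or 1.
case_1 = re.compile(r"[a-zA-Z]{3}-[01]\d")

samples = ["Oct-01", "Feb-11", "Apr-97", "1987, September", "1993"]

# True where the pattern is found anywhere in the string.
matches = {s: bool(case_1.search(s)) for s in samples}
```

Only `Oct-01` and `Feb-11` match; `Apr-97` is rejected because its year starts with `9`.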

#### Build a `text_filter`

In [12]:
# View Cell
# Evolution 2

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE
1,Oct-01
8,Jul-06
11,May-04
13,Jan-09
19,Aug-05
...,...
23205,Jul-04
23221,Feb-11
23224,Mar-10
23251,May-00


#### Transform these cases

In [13]:
# Transform Cell
# Evolution 2

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> mutate(year = case_when((text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True)
                            , '20' + X['FIRST.APPEARANCE'].str.split('-').str.get(1)),
                           ))
 >> filter_by(X.year.notna())
 )

Unnamed: 0,FIRST.APPEARANCE,year
1,Oct-01,2001
8,Jul-06,2006
11,May-04,2004
13,Jan-09,2009
19,Aug-05,2005
...,...,...
23205,Jul-04,2004
23221,Feb-11,2011
23224,Mar-10,2010
23251,May-00,2000


#### Hide these cases

In [14]:
# View Cell
# Evolution 3

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE
0,Apr-97
2,"1987, September"
3,"1992, September"
4,Jul-73
5,"2002, April"
...,...
23266,Jan-61
23267,"2009, October"
23268,Dec-93
23270,Oct-97


### Case 2 - The rest of the  `Mon-YY` rows

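Next I fixed the rest of the rows with the `Mon-YY` pattern, which came from the 1900s (again assuming there were no comics from the 1900s or 1910s, so every remaining two-digit year falls between 1920 and 1999).  The broader pattern `[a-zA-Z]{3}-\d\d` also matches the Case 1 rows, which is why its tuple is listed after the Case 1 tuple in the Transform cell.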
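Because the broader `Mon-YY` pattern overlaps with Case 1, the order in which the cases are tried matters. The pure-Python sketch below assumes the first matching rule wins; the helper `first_match_year` is hypothetical and is not part of `more_dfply`.

```python
import re

def first_match_year(value, rules):
    """Return century + two-digit year for the first rule that matches, else None."""
    for pattern, century in rules:
        if re.search(pattern, value):
            return century + value.split("-")[1]
    return None

# Specific pattern first, general pattern second (the order used in this lecture).
specific_first = [(r"[a-zA-Z]{3}-[01]\d", "20"), (r"[a-zA-Z]{3}-\d\d", "19")]

# Same rules in the opposite order, to show what goes wrong.
general_first = list(reversed(specific_first))

good = [first_match_year(v, specific_first) for v in ["Oct-01", "Apr-97"]]
bad = [first_match_year(v, general_first) for v in ["Oct-01", "Apr-97"]]
```

With the specific rule first, `Oct-01` becomes `2001`; with the general rule first, it is captured too early and becomes `1901`.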

#### Build a `text_filter`

In [15]:
# View Cell
# Evolution 4

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True))
 >> filter_by(text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE
0,Apr-97
4,Jul-73
7,Oct-86
9,Aug-89
12,Dec-99
...,...
23264,Aug-75
23266,Jan-61
23268,Dec-93
23270,Oct-97


#### Transform these cases

In [16]:
# Transform Cell
# Evolution 3

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> mutate(year = case_when((text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True)
                            , '20' + X['FIRST.APPEARANCE'].str.split('-').str.get(1)),
                            (text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True)
                            , '19' + X['FIRST.APPEARANCE'].str.split('-').str.get(1)),
                           ))
 >> filter_by(X.year.notna())
 )

Unnamed: 0,FIRST.APPEARANCE,year
0,Apr-97,1997
1,Oct-01,2001
4,Jul-73,1973
7,Oct-86,1986
8,Jul-06,2006
...,...,...
23266,Jan-61,1961
23268,Dec-93,1993
23269,Mar-11,2011
23270,Oct-97,1997


#### Hide these cases

In [17]:
# View Cell
# Evolution 6

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE
2,"1987, September"
3,"1992, September"
5,"2002, April"
6,"1999, May"
10,"2004, July"
...,...
23260,"1999, September"
23261,"2010, November"
23262,"2007, August"
23265,"1968, October"


### Case 3 - `YYYY, Month`

After excluding the two patterns dealt with so far, the next pattern that jumps out is the `YYYY, Month` pattern.

#### Build a `text_filter`

In [18]:
# View Cell
# Evolution 7

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True))
 >> filter_by(text_filter(X['FIRST.APPEARANCE'], r'\d{4}, ', regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE
2,"1987, September"
3,"1992, September"
5,"2002, April"
6,"1999, May"
10,"2004, July"
...,...
23260,"1999, September"
23261,"2010, November"
23262,"2007, August"
23265,"1968, October"


#### Transform these cases

In [19]:
# Transform Cell
# Evolution 4

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> mutate(year = case_when((text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True)
                            , '20' + X['FIRST.APPEARANCE'].str.split('-').str.get(1)),
                             (text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True)
                             , '19' + X['FIRST.APPEARANCE'].str.split('-').str.get(1)),
                             (text_filter(X['FIRST.APPEARANCE'], r'\d{4}, ', regex=True)
                             , X['FIRST.APPEARANCE'].str.split(',').str.get(0)),
                        
                           ))
 >> filter_by(X.year.notna())
 )

Unnamed: 0,FIRST.APPEARANCE,year
0,Apr-97,1997
1,Oct-01,2001
2,"1987, September",1987
3,"1992, September",1992
4,Jul-73,1973
...,...,...
23267,"2009, October",2009
23268,Dec-93,1993
23269,Mar-11,2011
23270,Oct-97,1997


#### Hide these cases

In [20]:
# View Cell
# Evolution 8

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'\d{4}, ', regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE
191,1987
242,1997
387,1993
831,1999
957,1993
...,...
22873,1992
22928,1993
22989,1987
23063,1997


### Case 4 - `YYYY`

Notice that excluding the third pattern brought out a pattern that we haven't seen yet, namely `YYYY`.  It is often the case that some patterns are buried in a large data set, and it is for this reason that we evolve the View cell by excluding the patterns we have already cleaned up.

#### Build a `text_filter`

In [21]:
# View Cell
# Evolution 9

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'\d{4}, ', regex=True))
 >> filter_by(text_filter(X['FIRST.APPEARANCE'], r'^\d{4}$', regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE
191,1987
242,1997
387,1993
831,1999
957,1993
...,...
22873,1992
22928,1993
22989,1987
23063,1997


#### Transform these cases

In [22]:
# Transform Cell
# Evolution 5

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> mutate(year = case_when((text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True)
                            , '20' + X['FIRST.APPEARANCE'].str.split('-').str.get(1)),
                             (text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True)
                             , '19' + X['FIRST.APPEARANCE'].str.split('-').str.get(1)),
                             (text_filter(X['FIRST.APPEARANCE'], r'\d{4}, ', regex=True)
                             , X['FIRST.APPEARANCE'].str.split(',').str.get(0)),
                             (text_filter(X['FIRST.APPEARANCE'], r'^\d{4}$', regex=True)
                             , X['FIRST.APPEARANCE'])
                           ))
 >> filter_by(X.year.notna())
 )

Unnamed: 0,FIRST.APPEARANCE,year
0,Apr-97,1997
1,Oct-01,2001
2,"1987, September",1987
3,"1992, September",1992
4,Jul-73,1973
...,...,...
23267,"2009, October",2009
23268,Dec-93,1993
23269,Mar-11,2011
23270,Oct-97,1997


#### Hide these cases

In [23]:
# View Cell
# Evolution 10

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'\d{4}, ', regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r'^\d{4}$', regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE


#### Note

The empty result in the last evolution of the View cell indicates that we have dealt with all of the non-empty cases.
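The same "nothing left over" check can be sketched in plain pandas, assuming each filter is an unanchored regex search like `Series.str.contains`. The short `values` series is a stand-in for the real column, using values that appear in the outputs above.

```python
import pandas as pd

# Stand-in for comics['FIRST.APPEARANCE'], including one missing value.
values = pd.Series(["Apr-97", "Oct-01", "1987, September", "1993", None])

# The four patterns handled in Cases 1 through 4.
patterns = [r"[a-zA-Z]{3}-[01]\d", r"[a-zA-Z]{3}-\d\d", r"\d{4}, ", r"^\d{4}$"]

# Ignore missing values, then mark a row as covered if any pattern matches it.
non_missing = values.dropna()
covered = pd.Series(False, index=non_missing.index)
for pattern in patterns:
    covered |= non_missing.str.contains(pattern, regex=True)

# Any value listed here would be a case that still needs its own rule.
leftovers = non_missing[~covered]
```

An empty `leftovers` corresponds to the empty View cell output above.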

### Coda - Refactoring the code

Now that we have accomplished our task, we should take a moment to clean up our code with a little refactoring.  We do this by moving the filter and transformation expressions into variables, which makes the `case_when` transformation more readable for the next programmer (who is likely to be you!).  Note that the `YYYY` case no longer needs its own filter: the final `(True, leave_unchanged)` tuple passes those rows through untouched.

#### Name the filters and transformations

In [24]:
# Filters
mmm_yy_2000s = text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-[01]\d', regex=True)
mmm_yy = text_filter(X['FIRST.APPEARANCE'], r'[a-zA-Z]{3}-\d\d', regex=True)
yyyy_comma = text_filter(X['FIRST.APPEARANCE'], r'\d{4}, ', regex=True)

# Transforms
split_hyphen =  X['FIRST.APPEARANCE'].str.split('-')
split_hyphen_get_year =  split_hyphen.str.get(1)
split_comma = X['FIRST.APPEARANCE'].str.split(',')
split_comma_get_year = split_comma.str.get(0)
leave_unchanged = X['FIRST.APPEARANCE']

In [25]:
(comics
 >> select(X['FIRST.APPEARANCE'])
 >> mutate(year = case_when((mmm_yy_2000s, '20' + split_hyphen_get_year),
                            (mmm_yy, '19' + split_hyphen_get_year),
                            (yyyy_comma, split_comma_get_year),
                            (True, leave_unchanged)))
 )

Unnamed: 0,FIRST.APPEARANCE,year
0,Apr-97,1997
1,Oct-01,2001
2,"1987, September",1987
3,"1992, September",1992
4,Jul-73,1973
...,...,...
23267,"2009, October",2009
23268,Dec-93,1993
23269,Mar-11,2011
23270,Oct-97,1997
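Once the `year` column exists, a quick sanity check can confirm that every extracted value is a plausible number. The sketch below uses plain pandas on a small stand-in series; the bounds 1920 and 2019 follow from the assumptions made in Cases 1 and 2 and are not a property of the data itself.

```python
import pandas as pd

# Stand-in for the extracted year column (strings, with one missing value).
year = pd.Series(["1997", "2001", "1987", None, "1993"])

# Convert to numbers; anything unparseable becomes NaN instead of raising.
year_num = pd.to_numeric(year, errors="coerce")

# Rows whose text was present but did not parse as a number.
unparsed = year[year.notna() & year_num.isna()]

# Values outside the range implied by the case-study assumptions.
out_of_range = year_num[(year_num < 1920) | (year_num > 2019)]
```

Both `unparsed` and `out_of_range` should be empty if the cleaning rules covered every case.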


## <font color="red"> Exercise 4.5.1 - Extracting the month </font>

To practice the workflow illustrated in the case study, apply this technique to extract the month from the `FIRST.APPEARANCE` column--with the resulting column containing the full name of the month.

**Hint:** You might want to use the `Series.map` method with a translation dictionary when the month name is incomplete.
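As a small illustration of the hint, `Series.map` looks each value up in a dictionary and returns `NaN` for values that have no entry. The dictionary below is deliberately partial and named `mon_to_month_sample` to mark it as a made-up example.

```python
import pandas as pd

# A few abbreviations plus one value that is not in the dictionary.
abbrev = pd.Series(["Apr", "Oct", "Jul", "Xyz"])

# Only three entries shown; a full dictionary would cover all twelve months.
mon_to_month_sample = {"Apr": "April", "Oct": "October", "Jul": "July"}

# Values with no dictionary entry come back as NaN.
full = abbrev.map(mon_to_month_sample)
```

The `NaN` for unmapped values makes gaps or typos in the dictionary keys easy to spot with `isna()`.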

**Tasks.** 

1. Iteratively add lines to both the View and Transform cells provided below as you explore and transform each case found in the `FIRST.APPEARANCE` column.
2. Refactor your code by moving the various predicate expressions and transformations to `lambda` functions with meaningful names.

In [31]:
# View Cell
mon_to_month = {'Jan':'January', 'Feb':'February', 'Mar':'March','Apr':'April','May':'May','Jun':'June','Jul':'July',
                'Aug':'August','Sep':'September','Oct':'October','Nov':'November','Dec':'December'}

(comics
 >> select(X['FIRST.APPEARANCE'])
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r"[A-Za-z]{3}-\d{2}", regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], ", [A-Za-z]*$", regex=True))
 >> filter_by(~text_filter(X['FIRST.APPEARANCE'], r"^\d{4}$", regex=True))
 >> filter_by(X['FIRST.APPEARANCE'].notna())
 )

Unnamed: 0,FIRST.APPEARANCE


In [32]:
# Transform Cell

(comics
    >> select(X['FIRST.APPEARANCE'])
    >> mutate(month = case_when((text_filter(X['FIRST.APPEARANCE'], r"[A-Za-z]{3}-\d{2}", regex=True),
                                    X['FIRST.APPEARANCE'].str.split("-").str.get(0).map(mon_to_month)),
                                (text_filter(X['FIRST.APPEARANCE'], ", [A-Za-z]*$", regex=True),
                                    X['FIRST.APPEARANCE'].str.split(", ").str.get(1)),
                                (text_filter(X['FIRST.APPEARANCE'], r"^\d{4}$", regex=True),
                                    np.nan),
                                (True, np.nan))
             )
)

Unnamed: 0,FIRST.APPEARANCE,month
0,Apr-97,April
1,Oct-01,October
2,"1987, September",September
3,"1992, September",September
4,Jul-73,July
...,...,...
23267,"2009, October",October
23268,Dec-93,December
23269,Mar-11,March
23270,Oct-97,October


In [35]:
# Filters
mmm_yy = text_filter(X['FIRST.APPEARANCE'], r"[A-Za-z]{3}-\d{2}", regex=True)
year_comma_fullmonth = text_filter(X['FIRST.APPEARANCE'], ", [A-Za-z]*$", regex=True)

# Transforms
split = lambda sep: X["FIRST.APPEARANCE"].str.split(sep)
split_and_get = lambda sep, idx: split(sep).str.get(idx)
get_month_from_hyphen = split_and_get("-", 0)
get_month_from_comma = split_and_get(", ", 1)
default_v = np.nan 

(comics
    >> select(X['FIRST.APPEARANCE'])
    >> mutate(month = case_when((mmm_yy, get_month_from_hyphen.map(mon_to_month)),
                                (year_comma_fullmonth, get_month_from_comma),
                                (True, default_v))
             )
)

Unnamed: 0,FIRST.APPEARANCE,month
0,Apr-97,April
1,Oct-01,October
2,"1987, September",September
3,"1992, September",September
4,Jul-73,July
...,...,...
23267,"2009, October",October
23268,Dec-93,December
23269,Mar-11,March
23270,Oct-97,October
