In [1]:
!pip install more_itertools

Collecting more_itertools
  Downloading more_itertools-8.14.0-py3-none-any.whl (52 kB)
Installing collected packages: more-itertools
Successfully installed more-itertools-8.14.0


In [2]:
!pip install more_dfply



## OpenRefine-like Filters and Facets

[OpenRefine](https://openrefine.org) is a great tool for inspecting and cleaning data, in no small part because its workflow makes heavy use of filters and facets.  In this lecture, we will develop and discuss tools for replicating these filters and facets in Python with `more_dfply`.

In [3]:
import pandas as pd
from dfply import *

In [4]:
artists = pd.read_csv("./data/Artists.csv")

### Text filter

<img src="./img/text_filter.png"/>

First, we will create a function to mimic the text filter.  This function will need to

1. Accept a column intention (e.g. `X.Nationality`) as the first argument.
2. Accept the search text as the second argument.
3. Allow for the following optional arguments: `case=False`, `invert=False`, `regex=False`
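
A minimal sketch of how such a filter could be built on plain pandas, assuming the column is passed as a `Series` rather than a `dfply` intention. The name `text_filter_sketch` is hypothetical; this is not the `more_dfply` implementation.

```python
import pandas as pd

def text_filter_sketch(col, text, case=False, invert=False, regex=False):
    """Return a boolean mask marking the rows of `col` that contain `text`.

    Hypothetical stand-in for illustration; missing values never match.
    """
    mask = col.str.contains(text, case=case, regex=regex, na=False)
    return ~mask if invert else mask

nationality = pd.Series(["American", "Cameroonian", None, "Spanish"])
insensitive = text_filter_sketch(nationality, "american")       # default: ignore case
sensitive = text_filter_sketch(nationality, "amer", case=True)  # lowercase 'amer' only
```

With `case=True`, the lowercase `'amer'` no longer matches `American` but still matches `Cameroonian`.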

#### Performing a text filter

In [5]:
from more_dfply.facets import text_filter

(artists
 >> filter_by(text_filter(X.Nationality, 'american'))
 >> head
)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
6,7,Bill Aron,"American, born 1941",American,Male,1941,0,,
7,9,David Aronson,"American, born Lithuania 1923",American,Male,1923,0,Q5230870,500003363.0


#### Inverting the search with `~`

In [6]:
(artists
 >> filter_by(~text_filter(X.Nationality, 'american'))
 >> head
)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,
5,6,Danilo Aroldi,"Italian, born 1925",Italian,Male,1925,0,,
9,11,Jean (Hans) Arp,"French, born Germany (Alsace). 1886–1966",French,Male,1886,1966,Q153739,500031000.0
10,12,Jüri Arrak,"Estonian, born 1936",Estonian,Male,1936,0,,
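
On a plain boolean `Series`, `~` flips every entry. The sketch below uses `Series.str.contains` directly on made-up data; whether `text_filter` treats missing values the same way is an assumption, so rows with a missing `Nationality` are worth checking separately.

```python
import pandas as pd

nationality = pd.Series(["American", "Spanish", None, "Danish"])

# Missing values are treated as non-matches here (na=False) ...
is_american = nationality.str.contains("american", case=False, na=False)
# ... so they end up selected once the mask is flipped with ~.
not_american = ~is_american
```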


#### Switching to a case-sensitive search

In [7]:
(artists
 >> filter_by(text_filter(X.Nationality, 'amer', case=True))
 >> head
)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
12313,38243,Barthelemy Toguo,"Cameroonian, born 1967",Cameroonian,Male,1967,0,,


#### Using a regular expression

In [8]:
(artists
 >> filter_by(text_filter(X.ArtistBio, r', born \d{4}', regex=True))
 >> head
)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,
5,6,Danilo Aroldi,"Italian, born 1925",Italian,Male,1925,0,,


In [9]:
(artists
 >> filter_by(X.ArtistBio.str.contains(r', born \d{4}', regex=True, na=False))
 >> head
)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,
5,6,Danilo Aroldi,"Italian, born 1925",Italian,Male,1925,0,,
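
Patterns that contain backslashes are safest written as raw strings, so that Python passes `\d` through to the regex engine unchanged. A small sketch with the standard `re` module on three bio strings taken from the output above:

```python
import re

pattern = re.compile(r", born \d{4}")  # raw string keeps \d as a regex escape

bios = ["Spanish, born 1936", "American, 1930–1992", "American, born Lithuania 1923"]
# True only where ", born " is followed directly by four digits.
matches = [bool(pattern.search(bio)) for bio in bios]
```

The third bio does not match because a place name sits between `born` and the year.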


#### You can combine `text_filter`s with `&` and `|`

In [10]:
(artists
 >> filter_by((text_filter(X.ArtistBio, r', born \d{4}', regex=True)
               & text_filter(X.Nationality, 'American')))
 >> head
)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
6,7,Bill Aron,"American, born 1941",American,Male,1941,0,,
19,24,Donald Ashcraft,"American, born 1927",American,Male,1927,0,,
38,44,Robert Abel,"American, born 1937",American,Male,1937,0,Q7341326,
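
Each filter evaluates to a boolean mask, so `&` and `|` combine them element-wise. A sketch in plain pandas on a three-row toy frame (the toy data and variable names are made up for illustration):

```python
import pandas as pd

bios = pd.DataFrame({
    "ArtistBio": ["American, 1930–1992", "Spanish, born 1936", "American, born 1941"],
    "Nationality": ["American", "Spanish", "American"],
})

born_only = bios.ArtistBio.str.contains(r", born \d{4}", regex=True, na=False)
american = bios.Nationality.str.contains("american", case=False, na=False)

both = bios[born_only & american]    # rows satisfying both conditions
either = bios[born_only | american]  # rows satisfying at least one condition
```

Because `&` and `|` bind more tightly than comparison operators, wrap each condition in parentheses when it is written inline.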


## Inspecting and filtering with `text_facet`s

<img src="./img/text_facet.png"/>

Text facets are another powerful OpenRefine tool. They allow you to

1. Inspect all the unique labels in a column along with their frequencies.
2. Filter rows by including/excluding various labels.
3. Perform transformations on the filtered rows.
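
The first item amounts to a frequency table. A minimal sketch with `Series.value_counts`, assuming facets are reported as `(label, count)` pairs with the most frequent label first; `get_text_facets_sketch` is a hypothetical name, not the library function.

```python
import pandas as pd

def get_text_facets_sketch(col):
    """Return (label, count) pairs, most frequent label first."""
    counts = col.value_counts()  # drops missing values and sorts by count
    return list(zip(counts.index, counts.tolist()))

labels = pd.Series(["American", "German", "American", None, "French", "American", "German"])
facets = get_text_facets_sketch(labels)
```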

#### Inspecting text facets with `get_text_facets`

In [35]:
from more_dfply.facets import get_text_facets
from composable.sequence import slice

(get_text_facets(artists.Nationality) 
 #>> slice(0,5) # Like head but for lists
)

[('American', 5194),
 ('German', 969),
 ('British', 854),
 ('French', 848),
 ('Italian', 539),
 ('Japanese', 509),
 ('Swiss', 294),
 ('Dutch', 277),
 ('Russian', 267),
 ('Austrian', 242),
 ('Canadian', 193),
 ('Nationality unknown', 180),
 ('Brazilian', 163),
 ('Spanish', 160),
 ('Argentine', 140),
 ('Polish', 130),
 ('Swedish', 130),
 ('Mexican', 129),
 ('Danish', 119),
 ('Belgian', 93),
 ('Chinese', 80),
 ('Czech', 78),
 ('Israeli', 74),
 ('Chilean', 72),
 ('South African', 68),
 ('Cuban', 63),
 ('Finnish', 61),
 ('Venezuelan', 60),
 ('Australian', 55),
 ('Colombian', 54),
 ('Hungarian', 53),
 ('Norwegian', 47),
 ('Indian', 38),
 ('Peruvian', 37),
 ('Korean', 34),
 ('Croatian', 27),
 ('Uruguayan', 24),
 ('Yugoslav', 23),
 ('Turkish', 22),
 ('Irish', 22),
 ('Scottish', 20),
 ('Romanian', 20),
 ('New Zealander', 17),
 ('Haitian', 16),
 ('Portuguese', 13),
 ('Greek', 12),
 ('Icelandic', 12),
 ('Iranian', 11),
 ('Ukrainian', 11),
 ('Serbian', 11),
 ('Slovenian', 9),
 ('Slovak', 8),
 ('Bo

#### Filtering rows using `text_facet`

In [12]:
from more_dfply.facets import text_facet

(artists
 >> filter_by(text_facet(X.Nationality, 'American','Canadian','Mexican'))
 >> head(10)
)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
6,7,Bill Aron,"American, born 1941",American,Male,1941,0,,
7,9,David Aronson,"American, born Lithuania 1923",American,Male,1923,0,Q5230870,500003363.0
8,10,Irene Aronson,"American, born Germany 1918",American,Female,1918,0,Q19748568,500042413.0
11,13,J. Arrelano Fischer,"Mexican, 1911–1995",Mexican,Male,1911,1995,,
15,19,Richard Artschwager,"American, 1923–2013",American,Male,1923,2013,Q568262,500114981.0
16,21,Ruth Asawa,"American, 1926–2013",American,Female,1926,2013,Q7382874,500077806.0
19,24,Donald Ashcraft,"American, born 1927",American,Male,1927,0,,


#### `text_facet` takes any combination of strings and lists of strings as arguments

In [13]:
(artists
 >> filter_by(text_facet(X.Nationality, 'American',['Canadian','Mexican']))
 >> head(10)
)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
6,7,Bill Aron,"American, born 1941",American,Male,1941,0,,
7,9,David Aronson,"American, born Lithuania 1923",American,Male,1923,0,Q5230870,500003363.0
8,10,Irene Aronson,"American, born Germany 1918",American,Female,1918,0,Q19748568,500042413.0
11,13,J. Arrelano Fischer,"Mexican, 1911–1995",Mexican,Male,1911,1995,,
15,19,Richard Artschwager,"American, 1923–2013",American,Male,1923,2013,Q568262,500114981.0
16,21,Ruth Asawa,"American, 1926–2013",American,Female,1926,2013,Q7382874,500077806.0
19,24,Donald Ashcraft,"American, born 1927",American,Male,1927,0,,
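
One way to accept a mix of strings and lists is to flatten the arguments and hand the result to `Series.isin`. The sketch below assumes a plain `Series` as input; `text_facet_sketch` is a hypothetical name and may differ from how `more_dfply` does it.

```python
import pandas as pd

def text_facet_sketch(col, *labels):
    """Mark rows whose value equals any label; labels may be strings or lists of strings."""
    flat = []
    for label in labels:
        if isinstance(label, str):
            flat.append(label)
        else:
            flat.extend(label)  # unpack a list (or tuple) of labels
    return col.isin(flat)

nationality = pd.Series(["American", "Spanish", "Mexican", "Canadian", None])
mask = text_facet_sketch(nationality, "American", ["Canadian", "Mexican"])
```

Unlike a text filter, this is an exact match on whole labels, so `'Amer'` would select nothing.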


## <font color="red"> Exercise 4.4.1 - The Super Hero Dating Game - Part 3</font>

Let's redo Exercise 4.2.1, but this time using our new tools.  Here is the prompt for that previous exercise.

> Yesterday, you noticed another singles ad in the local paper, which read

>> SBiM looking for SyFy super hero (will also consider Star Wars (George Lucas), Star Trek, or NBC - Heroes ... check the `Publisher` column).  Eye color must be either blue or brown and last name must start with either B or P.

Rewrite the query using `text_filter` and `text_facet`.

In [16]:
from more_dfply import fix_names
from dfply import *
heroes_raw = pd.read_csv('./data/heroes_information.csv', na_values=['-', '-99.0', ''])
heroes = (heroes_raw >> fix_names)
heroes.head()

Unnamed: 0,Unnamed_0,name,Gender,Eye_color,Race,Hair_color,Height,Publisher,Skin_color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
4,4,Abraxas,Male,blue,Cosmic Entity,Black,,Marvel Comics,,bad,


In [34]:
# Your code here
(heroes
 >> filter_by(text_facet(X.Publisher, 'George Lucas','Star Trek','NBC - Heroes'),
              text_filter(X.Eye_color, 'blue|brown', regex=True),
              text_filter(X.name, r'\s[BP]', regex=True, case=True)
             )
)

Unnamed: 0,Unnamed_0,name,Gender,Eye_color,Race,Hair_color,Height,Publisher,Skin_color,Alignment,Weight
177,177,Claire Bennet,Female,blue,,Blond,,NBC - Heroes,,good,
238,238,Elle Bishop,Female,blue,,Blond,,NBC - Heroes,,bad,
486,486,Nathan Petrelli,Male,brown,,,,NBC - Heroes,,good,
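
When a pattern mixes `|` with other tokens, keep in mind that alternation has the lowest precedence in a regular expression. A sketch on made-up names comparing `\sB|P` with the character class `\s[BP]`:

```python
import re

# Toy names for illustration only; they are not taken from the heroes data.
names = ["Claire Bennet", "Peter Parker", "Pamela Isley", "Bruce Wayne"]

loose = [n for n in names if re.search(r"\sB|P", n)]    # '\sB' OR a 'P' anywhere
strict = [n for n in names if re.search(r"\s[BP]", n)]  # whitespace, then 'B' or 'P'
```

The loose pattern also keeps `Pamela Isley`, whose first name starts with P. Neither pattern matches single-word names, since both rely on whitespace before the last name.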
