# Advanced Applications of Mutate

## Three helpful functions

In [1]:
# Install stuff
!pip install unpythonic



In [2]:
import numpy as np  # used as `np` in the coalesce examples
import pandas as pd
from dfply import *
import matplotlib.pylab as plt
%matplotlib inline

## Hiding stack traceback

We hide the exception traceback for didactic reasons (code source: [see this post](https://stackoverflow.com/questions/46222753/how-do-i-suppress-tracebacks-in-jupyter)).  Don't run this cell if you want to see a full traceback.

In [3]:
import sys
ipython = get_ipython()

def hide_traceback(exc_tuple=None, filename=None, tb_offset=None,
                   exception_only=False, running_compiled_code=False):
    etype, value, tb = sys.exc_info()
    return ipython._showtraceback(etype, value, ipython.InteractiveTB.get_exception_only(etype, value))

ipython.showtraceback = hide_traceback

## Data set

We will be using two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture.  Make sure that you have downloaded each repository.  [Download Instructions](./get_MOMA_data.ipynb)

#### MoMA Exhibitions

In [4]:
exhib_url = "https://github.com/MuseumofModernArt/exhibitions/raw/master/MoMAExhibitions1929to1989.csv"
dat_cols = ['ExhibitionBeginDate', 'ExhibitionEndDate', 'ConstituentBeginDate', 'ConstituentEndDate']
exhibitions = pd.read_csv(exhib_url, 
                          encoding="ISO-8859-1",
                          parse_dates=dat_cols)
exhibitions.head(2)

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Curator,Director,...,,American,1902,1981,"American, 19021981",Male,109252853.0,Q711362,500241556.0,moma.org/artists/9168
1,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1839,1906,"French, 18391906",Male,39374836.0,Q35548,500004793.0,moma.org/artists/1053


#### MoMA Artists

In [5]:
artists = pd.read_csv("./data/Artists.csv")
artists.head(2)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,


#### MoMA Artwork

In [6]:
from more_dfply import fix_names

artwork = (pd.read_csv("./data/Artworks.csv")
           >> fix_names
           >> mutate(id = X.index + 1)
          )
artwork.head(2)

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference_cm,Depth_cm,Diameter_cm,Height_cm,Length_cm,Weight_kg,Width_cm,Seat_Height_cm,Duration_sec,id
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,,,,48.6,,,168.9,,,1
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,,,,40.6401,,,29.8451,,,2


# Three helpful column functions

In this section, we will focus on three useful column functions: `ifelse`, `case_when`, and `coalesce`.

# Branching with `ifelse`

The function `ifelse`

* allows us to pick between two options in a `mutate`.
* has the following syntax: `ifelse(cond, then, else)`.
* returns `then` where `cond == True`.
* returns `else` where `cond == False`.

In [7]:
from more_dfply import ifelse

(artwork
 >> select(X.Gender)
 >> mutate(recode_gender = ifelse(X.Gender == '(Male)',
                                  'm',
                                  'f'))
 >> head
)

Unnamed: 0,Gender,recode_gender
0,(Male),m
1,(Male),m
2,(Male),m
3,(Male),m
4,(Male),m
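
For comparison, the same two-way branch can be approximated in plain pandas with `numpy.where`. This is only a sketch of the idea, not the `ifelse` implementation, and the small `gender` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the artwork Gender column.
gender = pd.Series(['(Male)', '(Female)', '(Male)'])

# np.where picks 'm' where the condition holds and 'f' everywhere else.
recoded = pd.Series(np.where(gender == '(Male)', 'm', 'f'), index=gender.index)
print(recoded.tolist())
```

In this sketch, anything that is not exactly `'(Male)'`, including a missing value, lands in the else branch, so a two-way recode like this is only safe when the column really has two categories.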


### `then` and `else` conform to `len(cond)`

* Singletons are repeated.
* Short vectors are tiled.
* Series/lists that are too long are truncated.

#### Some example conditions

In [8]:
from numpy import repeat, arange
all_true = repeat(True, 5)
all_false = repeat(False, 5)
all_true, all_false

(array([ True,  True,  True,  True,  True]),
 array([False, False, False, False, False]))

#### Series that are too long or too short

In [9]:
short = arange(1,4,1)
long = arange(1,10,1)
short, long

(array([1, 2, 3]), array([1, 2, 3, 4, 5, 6, 7, 8, 9]))

#### Singletons are repeated

In [10]:
ifelse(all_true, 'singleton', long)

0    singleton
1    singleton
2    singleton
3    singleton
4    singleton
dtype: object

#### Short sequences are tiled

The elements of a short sequence are repeated, over and over, until the result has the same length as `cond`.

In [11]:
ifelse(all_true, short, long)

0    1
1    2
2    3
3    1
4    2
dtype: int64

#### Long sequences are truncated

A long sequence is cut off once it reaches the length of `cond`; the remaining elements are dropped.

In [12]:
ifelse(all_false, short, long)

0    1
1    2
2    3
3    4
4    5
dtype: int64
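
The three conforming rules (repeat, tile, truncate) can be imitated with `numpy.resize`, which cycles through its input until the requested length is reached. The helper name `conform` is hypothetical, and this is a sketch of the behavior shown in the outputs, not the code `ifelse` actually uses.

```python
import numpy as np

def conform(values, n):
    """Hypothetical helper: repeat, tile, or truncate `values` to length n."""
    # atleast_1d turns a singleton into a one-element array;
    # np.resize then cycles through the elements until n are produced.
    return np.resize(np.atleast_1d(values), n)

print(conform('singleton', 3))        # singleton repeated
print(conform(np.arange(1, 4), 5))    # short vector tiled
print(conform(np.arange(1, 10), 5))   # long vector truncated
```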

### `then` and `else` are only evaluated if needed (when they are `Intention`s)

* If `cond` is all true, then an `else` `Intention` will not be evaluated.
* Similarly, if `cond` is all false, then a `then` `Intention` will not be evaluated.

#### It is important that conditionals don't evaluate the "other" expression

In [13]:
'safe' if True else 1/0

'safe'

In [14]:
def my_ifelse(cond, then, else_):
    return then if cond else else_

In [15]:
my_ifelse(True, 'safe', 1/0)

ZeroDivisionError: division by zero

#### An expression that will crash if `else` is evaluated

In [16]:
(X/0).evaluate(2)

ZeroDivisionError: division by zero

#### No crash $\Rightarrow$ `else` was not evaluated

In [17]:
ifelse(all_true, 'safe', X.height_cm/0)

<dfply.base.Intention at 0x7f6a229eb3d0>

In [18]:
ifelse(all_true, 'safe', X.height_cm/0).evaluate(2)

0    safe
1    safe
2    safe
3    safe
4    safe
dtype: object
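
The same deferral can be reproduced in plain Python by passing a branch as a zero-argument function (a thunk) and calling only the one that is selected. This scalar sketch uses a hypothetical `lazy_ifelse`; it illustrates the principle rather than how `Intention`s are implemented.

```python
def lazy_ifelse(cond, then, else_):
    """Scalar sketch: either branch may be a zero-argument callable."""
    chosen = then if cond else else_
    # Only the selected branch is called, so the other one never runs.
    return chosen() if callable(chosen) else chosen

# The division by zero is wrapped in a lambda and never executed.
result = lazy_ifelse(True, 'safe', lambda: 1 / 0)
print(result)
```

Unlike `my_ifelse(True, 'safe', 1/0)`, where Python evaluates `1/0` before the function is even called, the lambda postpones the computation until it is asked for.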

## <font color="red"> Exercise 3.4.1 </font>

Consider the `Nationality` column of the `exhibitions` table.  We would like to make a new column, titled `"American"`, that contains `1` if the artist is of American descent and `0` otherwise.

In [20]:
exhibitions.Nationality.unique()

array(['American', 'French', 'Dutch', 'Italian', nan, 'Spanish', 'German',
       'Mexican', 'Austrian', 'Finnish', 'Swedish', 'architect', 'Swiss',
       'British', 'Czech', 'Belgian', 'Russian', 'Guatemalan',
       'Russian-Lithuanian', 'English', 'Nationality unknown', 'Greek',
       'Norwegian', 'Georgian', 'Latvian', 'Polish', 'Japanese',
       'Milanese', 'Danish', 'Netherlandish', 'Romanian', 'Flemish',
       'Israeli', 'Scottish', 'Hungarian', 'Yugoslav', 'Brazilian',
       'Ukrainian', 'Catalan', 'Florentine', 'Venetian', 'Peruvian',
       'Canadian', 'Bolivian', 'Cuban', 'Irish', 'Chinese', 'Argentine',
       'Chilean', 'Colombian', 'Uruguayan', 'Ecuadorian', 'Venezuelan',
       'Australian', 'Haitian', 'Indian', 'Korean', 'Turkish', 'New',
       'Tanzanian', 'New Zealander', 'South', 'Icelandic', 'Iranian',
       'Panamanian', 'Rhodesian', 'Sudanese', 'Moroccan and American',
       'Canadian Inuit', 'Slovene', 'Bosnian', 'South African',
       'Croatian', 'Luxem

In [24]:
from more_dfply import ifelse

(exhibitions
  >> select(X.Nationality)
  >> mutate(recode_nationality = ifelse(X.Nationality == "American", 1, 0))
  >> head
)

Unnamed: 0,Nationality,recode_nationality
0,American,1
1,French,0
2,French,0
3,Dutch,0
4,French,0


## Generalizing `ifelse` with `case_when`

`case_when` takes one or more `(pred, then)` tuples
* `pred` is a `bool` expression
* `then` supplies the value for rows where `pred == True` and no earlier pair has already matched

This is similar to the R `case_when` from `dplyr`. See [case_when docs](https://dplyr.tidyverse.org/reference/case_when.html)

In [25]:
from more_dfply import case_when

#### Some example conditions

In [26]:
df = pd.DataFrame({'cat':['a','b','b','c','c'],
                   'val':[ 1,  1,  2,  1, 2]})
df

Unnamed: 0,cat,val
0,a,1
1,b,1
2,b,2
3,c,1
4,c,2


#### `case_when` with one predicate pair

Unmatched values are `nan`

In [27]:
(df
 >> mutate(new = case_when((X.cat == 'a', df.val + 1))))

Unnamed: 0,cat,val,new
0,a,1,2.0
1,b,1,
2,b,2,
3,c,1,
4,c,2,


#### Left-hand pairs have precedence

In [28]:
(df
 >> mutate(new = case_when((X.cat == 'a', df.val + 1),
                           (X.cat == 'b', df.val + 2))))

Unnamed: 0,cat,val,new
0,a,1,2.0
1,b,1,3.0
2,b,2,4.0
3,c,1,
4,c,2,


#### Singletons are accepted

In [29]:
(df
 >> mutate(new =  case_when((X.cat == 'a', df.val + 1),
                            (X.cat == 'b', df.val + 2),
                            (X.cat == 'c', 18))))

Unnamed: 0,cat,val,new
0,a,1,2.0
1,b,1,3.0
2,b,2,4.0
3,c,1,18.0
4,c,2,18.0
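
A close analogue outside `dfply` is `numpy.select`, which also checks conditions in order and lets the first match win. The sketch below rebuilds the `cat`/`val` example with plain arrays and adds an overlapping predicate to make the precedence visible; it is a comparison, not the `case_when` implementation.

```python
import numpy as np

cat = np.array(['a', 'b', 'b', 'c', 'c'])
val = np.array([1, 1, 2, 1, 2])

# Conditions are checked in order; unmatched rows get the default (NaN).
new = np.select([cat == 'a', cat == 'b', cat == 'c'],
                [val + 1, val + 2, 18],
                default=np.nan)

# Overlapping predicates: every row satisfies the second condition,
# but the row that also satisfies the first keeps the first choice.
overlap = np.select([cat == 'a', val >= 1], [100, 0], default=np.nan)

print(new.tolist())
print(overlap.tolist())
```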


## <font color="red"> Exercise 3.4.2 </font>

Consider the `Nationality` column of the `exhibitions` table.  We would like to make a new column that reclassifies this column as `"North American"`, `"European"`, or `"Other"`.  Use `case_when` to accomplish this task.

In [30]:
exhibitions.Nationality.unique()

array(['American', 'French', 'Dutch', 'Italian', nan, 'Spanish', 'German',
       'Mexican', 'Austrian', 'Finnish', 'Swedish', 'architect', 'Swiss',
       'British', 'Czech', 'Belgian', 'Russian', 'Guatemalan',
       'Russian-Lithuanian', 'English', 'Nationality unknown', 'Greek',
       'Norwegian', 'Georgian', 'Latvian', 'Polish', 'Japanese',
       'Milanese', 'Danish', 'Netherlandish', 'Romanian', 'Flemish',
       'Israeli', 'Scottish', 'Hungarian', 'Yugoslav', 'Brazilian',
       'Ukrainian', 'Catalan', 'Florentine', 'Venetian', 'Peruvian',
       'Canadian', 'Bolivian', 'Cuban', 'Irish', 'Chinese', 'Argentine',
       'Chilean', 'Colombian', 'Uruguayan', 'Ecuadorian', 'Venezuelan',
       'Australian', 'Haitian', 'Indian', 'Korean', 'Turkish', 'New',
       'Tanzanian', 'New Zealander', 'South', 'Icelandic', 'Iranian',
       'Panamanian', 'Rhodesian', 'Sudanese', 'Moroccan and American',
       'Canadian Inuit', 'Slovene', 'Bosnian', 'South African',
       'Croatian', 'Luxem

In [31]:
na = {"American", "Mexican", "Guatemalan", "Canadian", "Cuban", "Haitian", "Panamanian", "Canadian Inuit", "Native American", "American and Mexican"}

eu = {"French", "Dutch", "Italian", "Spanish", "German", "Austrian", "Finnish", "Swedish", "Swiss", "British", "Czech", "Belgian", "Russian", "Russian-Lithuanian", "English", 
"Greek", "Norwegian", "Georgian", "Latvian", "Polish", "Milanese", "Danish", "Netherlandish", "Romanian", "Flemish", "Scottish", "Hungarian", "Yugoslav", "Ukrainian", "Catalan",
"Florentine", "Venetian", "Irish", "Turkish", "Icelandic", "Slovene", "Bosnian", "Croatian", "Luxembourgish"}

other = {"Japanese", "Israeli", "Brazilian", "Peruvian", "Bolivian", "Chinese", "Argentine", "Chilean", "Colombian", "Uruguayan", "Ecuadorian", "Venezuelan", "Australian", 
"Indian", "Korean", "Tanzanian", "New Zealander", "South", "Iranian", "Rhodesian", "Sudanese", "Moroccan and American", "South African"}

In [35]:
exhibitions.Nationality.head()

0    American
1      French
2      French
3       Dutch
4      French
Name: Nationality, dtype: object

In [41]:
(exhibitions
 >> select(X.Nationality)
 >> mutate(new_column = case_when((X.Nationality.isin(na), "North American"),
                                  (X.Nationality.isin(eu), "European"),
                                  (X.Nationality.isin(other), "Other")))
)

Unnamed: 0,Nationality,new_column
0,American,North American
1,French,European
2,French,European
3,Dutch,European
4,French,European
...,...,...
34553,Japanese,Other
34554,Japanese,Other
34555,Japanese,Other
34556,Japanese,Other
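
An alternative worth knowing for this kind of recoding is a lookup table with `Series.map`. The sketch below uses small illustrative subsets rather than the full sets from the exercise, and it makes one deliberate assumption: anything not found in the lookup, including missing data, is labelled `"Other"`.

```python
import pandas as pd

# Small illustrative subsets, not the full sets used in the exercise.
na = {"American", "Mexican", "Canadian"}
eu = {"French", "Dutch", "Italian"}

# Invert the sets into a single value -> label lookup table.
lookup = {**{n: "North American" for n in na},
          **{n: "European" for n in eu}}

nationality = pd.Series(["American", "French", "Japanese", None])

# Values absent from the lookup (and missing values) become "Other".
region = nationality.map(lookup).fillna("Other")
print(region.tolist())
```

With this approach the `"Other"` set does not need to be enumerated, at the cost of no longer distinguishing an unlisted nationality from a missing one.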


# Using `coalesce` to remove missing values

* Syntax: `coalesce(col1, col2, ...)`
* Returns a `pd.Series`
* Each entry is the first non-missing value from `col1`, `col2`, ... (working left to right).

In [28]:
from more_dfply import coalesce

In [29]:
df = pd.DataFrame({'cat':['a','b','b','c','c'],
                   'val':[ 1,  1,  2,  1, 2]})
df

Unnamed: 0,cat,val
0,a,1
1,b,1
2,b,2
3,c,1
4,c,2


#### Example `df` with some missing values

In [30]:
df = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=list('abc'))
df.loc[::2, 'a'] = np.nan
df.loc[::3, 'b'] = np.nan
df

Unnamed: 0,a,b,c
0,,,7
1,5.0,6.0,2
2,,3.0,5
3,4.0,,3
4,,9.0,8


#### `coalesce` the first two columns

In [31]:
(df
>> mutate(d = coalesce(df.a, df.b)))

Unnamed: 0,a,b,c,d
0,,,7,
1,5.0,6.0,2,5.0
2,,3.0,5,3.0
3,4.0,,3,4.0
4,,9.0,8,9.0


#### `coalesce` all three columns

In [32]:
(df
>> mutate(d = coalesce(df.a, df.b, df.c)))

Unnamed: 0,a,b,c,d
0,,,7,7.0
1,5.0,6.0,2,5.0
2,,3.0,5,3.0
3,4.0,,3,4.0
4,,9.0,8,9.0
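
In plain pandas, the left-to-right fill can be sketched by chaining `Series.combine_first`. The helper name `coalesce_sketch` is hypothetical, and the frame is written out with the same values as the example so the result does not depend on the random draw.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Same values as the example frame, written out explicitly.
df = pd.DataFrame({'a': [np.nan, 5, np.nan, 4, np.nan],
                   'b': [np.nan, 6, 3, np.nan, 9],
                   'c': [7, 2, 5, 3, 8]})

def coalesce_sketch(*cols):
    """Hypothetical helper: first non-missing value per row, left to right."""
    # combine_first fills the gaps of the left series from the right one.
    return reduce(lambda left, right: left.combine_first(right), cols)

print(coalesce_sketch(df.a, df.b).tolist())
print(coalesce_sketch(df.a, df.b, df.c).tolist())
```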


#### `coalesce` handles `dfply.Intention`s

In [33]:
(df
>> mutate(d = coalesce(X.a, X.b)))

Unnamed: 0,a,b,c,d
0,,,7,
1,5.0,6.0,2,5.0
2,,3.0,5,3.0
3,4.0,,3,4.0
4,,9.0,8,9.0


#### `coalesce` ignores unnecessary arguments

In [35]:
(df
 >> mutate(d = coalesce(X.a, df.b, df.c, 
                        X/0 # Ignored
                       )))

Unnamed: 0,a,b,c,d
0,,,7,7.0
1,5.0,6.0,2,5.0
2,,3.0,5,3.0
3,4.0,,3,4.0
4,,9.0,8,9.0
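
One way such short-circuiting could work is to stop as soon as no missing values remain, so later arguments are never touched. The sketch below is an assumption about the behavior, not the `more_dfply` source: `lazy_coalesce` is a hypothetical helper, and deferred arguments are modelled as zero-argument callables instead of `Intention`s.

```python
import numpy as np
import pandas as pd

def lazy_coalesce(first, *rest):
    """Hypothetical helper: later arguments may be zero-argument callables."""
    result = first
    for arg in rest:
        if result.notna().all():
            break  # nothing left to fill, so skip the remaining arguments
        candidate = arg() if callable(arg) else arg
        result = result.combine_first(candidate)
    return result

a = pd.Series([np.nan, 5.0, np.nan])
c = pd.Series([7.0, 2.0, 5.0])

def crash():
    # Would raise if it were ever called.
    raise ZeroDivisionError("never reached")

filled = lazy_coalesce(a, c, crash)
print(filled.tolist())
```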
