# Advanced Applications of Mutate

## Map and apply

In [2]:
import numpy as np  # np.nan is used in a later cell
import pandas as pd
from dfply import *
import matplotlib.pylab as plt
%matplotlib inline

## Hiding stack traceback

We hide the exception traceback for didactic reasons (code source: [see this post](https://stackoverflow.com/questions/46222753/how-do-i-suppress-tracebacks-in-jupyter)).  Don't run this cell if you want to see a full traceback.

In [3]:
import sys
ipython = get_ipython()

def hide_traceback(exc_tuple=None, filename=None, tb_offset=None,
                   exception_only=False, running_compiled_code=False):
    etype, value, tb = sys.exc_info()
    return ipython._showtraceback(etype, value, ipython.InteractiveTB.get_exception_only(etype, value))

ipython.showtraceback = hide_traceback

## Data set

We will be using two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture.  Make sure that you have downloaded each repository.  [Download Instructions](./get_MOMA_data.ipynb)

#### MoMA Artists

In [4]:
artists = pd.read_csv("./data/Artists.csv")
artists.head(2)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,


# Transforming columns with the `map` and `apply` methods

Next, we will take a look at two useful `pandas Series` methods that allow us to apply very general transformations: `map` and `apply`.

## Transforming a column with `map`

`df.col.map` can be used to

* Apply a translation `dict`
* Apply a function
* Apply a `pd.Series`
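
A minimal sketch of the three uses listed above. It uses plain `pandas` rather than `dfply`, and the `gender`, `lookup`, and `codes` objects are hypothetical toy data, not the MoMA file.

```python
import pandas as pd

# Hypothetical toy column standing in for artists.Gender
gender = pd.Series(['Male', 'Female', 'male', 'Non-Binary'])

# 1. A translation dict: values missing from the dict keys become NaN
lookup = {'Male': 'm', 'Female': 'f'}
by_dict = gender.map(lookup)

# 2. A function: called once per element
by_func = gender.map(str.lower)

# 3. A pd.Series: its index plays the role of the dict keys
codes = pd.Series(['m', 'f'], index=['Male', 'Female'])
by_series = gender.map(codes)
```

Mapping a `pd.Series` behaves like mapping a `dict` whose keys are the index labels, so `by_dict` and `by_series` hold the same values here.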

In [5]:
artists.Gender.value_counts()

Male          9762
Female        2300
male            15
Non-Binary       2
female           1
Non-binary       1
Name: Gender, dtype: int64

#### `map`ping a translation `dict`

In [6]:
new_gender = {'Male':'m', 'Female':'f', 'male':'m', 'female':'f', 'Non-Binary':'nb', 'Non-binary':'nb'}
(artists
 >> select(X.Gender)
 >> mutate(new_gender = X.Gender.map(new_gender))
 >> head(9)
)

Unnamed: 0,Gender,new_gender
0,Male,m
1,Male,m
2,Male,m
3,Male,m
4,Male,m
5,Male,m
6,Male,m
7,Male,m
8,Female,f


#### Setting a default with `collections.defaultdict`

In [7]:
from collections import defaultdict

from_america = defaultdict(lambda: 'Not America')
from_america.update({'American':'America'})

#### Applying the `defaultdict`

In [8]:
(artists
 >> select(X.Nationality)
 >> mutate(from_america = X.Nationality.map(from_america))
 >> head(3)
)

Unnamed: 0,Nationality,from_america
0,American,America
1,Spanish,Not America
2,American,America
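
To see why the `defaultdict` matters, the sketch below compares it with a plain `dict` on a hypothetical toy column, again in plain `pandas`. With a plain `dict`, unmatched values become `NaN`. With a `dict` subclass that defines `__missing__`, such as `defaultdict`, `map` uses the default instead.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical toy column standing in for artists.Nationality
nationality = pd.Series(['American', 'Spanish', 'American'])

plain = {'American': 'America'}
with_default = defaultdict(lambda: 'Not America', plain)

plain_result = nationality.map(plain)           # 'Spanish' -> NaN
default_result = nationality.map(with_default)  # 'Spanish' -> 'Not America'
```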


#### `map`ping a simple function

In [9]:
(artists
 >> select(X.Nationality)
 >> mutate(from_USA = X.Nationality.map(lambda n: 'USA' if n == 'American' else 'Other'))
 >> head(3)
)

Unnamed: 0,Nationality,from_USA
0,American,USA
1,Spanish,Other
2,American,USA


## Be sure to `apply` yourself!

* `df.col.apply` applies any function to each element of a column.
    * Extra positional and keyword arguments are passed through to the function
* Because the function is arbitrary, `apply` can express essentially *any* element-wise mutation

#### Applying a unary function

In [10]:
century = lambda year_string: (int(year_string)//100)*100

(artists
 >> select(X.BeginDate)
 >> mutate(century_of_birth = X.BeginDate.apply(century))
 >> head(3)
)

Unnamed: 0,BeginDate,century_of_birth
0,1930,1900
1,1936,1900
2,1941,1900


## Using anonymous functions

* There is no need to name a `lambda`
* An embedded `lambda` is called an **anonymous function**

In [11]:
(artists
 >> select(X.EndDate)
 >> mutate(new_end_date = (X.EndDate
                           .apply(lambda y: y if int(y) > 0 else np.nan)
                           .astype('Int64')))
 >> head(2)
)

Unnamed: 0,EndDate,new_end_date
0,1992,1992.0
1,0,
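
For comparison, the same replacement can be written without a `lambda`. The sketch below swaps in `Series.where`, a vectorized `pandas` method that the cell above does not use, and checks it against the `apply` version on hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for artists.EndDate (0 means no end date)
end_date = pd.Series([1992, 0, 1985])

# Element-wise version, as in the cell above
via_apply = (end_date
             .apply(lambda y: y if int(y) > 0 else np.nan)
             .astype('Int64'))

# Vectorized version: keep values where the condition holds, NaN elsewhere
via_where = end_date.where(end_date > 0).astype('Int64')
```

Both produce a nullable `Int64` column with a missing value in place of the `0`.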


## `apply` or `map`

* Use `map` for simple functions
* Use `apply` when adding additional arguments
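
A small sketch of the difference on hypothetical toy data in plain `pandas`: `apply` forwards extra arguments through `args`, whereas `map` expects a one-argument callable, so the extra argument has to be bound first (here with a `lambda` or `functools.partial`).

```python
from functools import partial
import pandas as pd

# Hypothetical toy column of heights in centimeters
heights = pd.Series([48.6, 40.6401, 34.3])

# apply forwards extra positional arguments to the function
via_apply = heights.apply(round, args=(1,))

# map takes a one-argument callable, so bind the extra argument first
via_lambda = heights.map(lambda h: round(h, 1))
via_partial = heights.map(partial(round, ndigits=1))
```

All three give the same rounded values, so the choice is mostly about readability.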

#### MoMA Artwork

In [12]:
from more_dfply import fix_names

artwork = (pd.read_csv("./data/Artworks.csv")
           >> fix_names
           >> mutate(id = X.index + 1)
          )
artwork.head(2)

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference_cm,Depth_cm,Diameter_cm,Height_cm,Length_cm,Weight_kg,Width_cm,Seat_Height_cm,Duration_sec,id
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,,,,48.6,,,168.9,,,1
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,,,,40.6401,,,29.8451,,,2


#### Setting a positional argument

We want to apply `round(val, 1)`


In [13]:
(artwork
 >> select(X.Height_cm)
 >> mutate(rounded_height = X.Height_cm.apply(round, args=(1,)))
 >> head(3)
)

Unnamed: 0,Height_cm,rounded_height
0,48.6,48.6
1,40.6401,40.6
2,34.3,34.3


#### Setting a keyword argument

We want to apply `log1p(val, base=n)`

In [14]:
from math import log, e

log1p = lambda num, base=e: log(num + 1, base)
(artwork
 >> select(X.Height_cm)
 >> mutate(log10_plus_1 = X.Height_cm.apply(log1p, base = 10),
           log2_plus_1 = X.Height_cm.apply(log1p, base = 2),
           ln_plus_1 = X.Height_cm.apply(log1p, base = e))
 >> head(3)
)

Unnamed: 0,Height_cm,log10_plus_1,log2_plus_1,ln_plus_1
0,48.6,1.695482,5.632268,3.903991
1,40.6401,1.619512,5.379902,3.729064
2,34.3,1.547775,5.141596,3.563883


## <font color="red"> Exercise 3.3.1 </font>

An **indicator column** for a category contains 1 for the rows that match that label and 0 otherwise.  Using the `exhibitions` dataframe, complete the following tasks.

1. Use `exhibitions.ExhibitionRole.unique()` to get the unique categories in the column.
2. Use `mutate` and `map` with `defaultdict` to create an indicator column for each category (ignore missing rows).
3. Comment on the quality of your solution, especially in light of the [DRY principle](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)

#### MoMA Exhibitions

In [15]:
dat_cols = ['ExhibitionBeginDate', 'ExhibitionEndDate', 'ConstituentBeginDate' ,'ConstituentEndDate']
exhibitions = pd.read_csv('./data/MoMAExhibitions1929to1989.csv', 
                          encoding="ISO-8859-1",
                          parse_dates=dat_cols)
exhibitions.head(2)

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Curator,Director,...,,American,1902,1981,"American, 19021981",Male,109252853.0,Q711362,500241556.0,moma.org/artists/9168
1,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1839,1906,"French, 18391906",Male,39374836.0,Q35548,500004793.0,moma.org/artists/1053


In [16]:
exhibitions.ExhibitionRole.unique()

array(['Curator', 'Artist', nan, 'Arranger', 'Installer',
       'Competition Judge', 'Designer', 'Preparer'], dtype=object)

In [27]:
from collections import defaultdict

c = defaultdict(lambda: 0)
c.update({'Curator': 1})

In [31]:
art = defaultdict(lambda: 0)
art.update({'Artist': 1})

In [32]:
ar = defaultdict(lambda: 0)
ar.update({'Arranger': 1})

In [33]:
i = defaultdict(lambda: 0)
i.update({'Installer': 1})

In [34]:
cj = defaultdict(lambda: 0)
cj.update({'Competition Judge': 1})

In [35]:
d = defaultdict(lambda: 0)
d.update({'Designer': 1})

In [36]:
p = defaultdict(lambda: 0)
p.update({'Preparer': 1})

In [37]:
(exhibitions
>> select(X.ExhibitionRole)
>> mutate(c = X.ExhibitionRole.map(c))
>> mutate(art = X.ExhibitionRole.map(art))
>> mutate(ar = X.ExhibitionRole.map(ar))
>> mutate(i = X.ExhibitionRole.map(i))
>> mutate(cj = X.ExhibitionRole.map(cj))
>> mutate(d = X.ExhibitionRole.map(d))
>> mutate(p = X.ExhibitionRole.map(p))
)



Unnamed: 0,ExhibitionRole,c,art,ar,i,cj,d,p
0,Curator,1,0,0,0,0,0,0
1,Artist,0,1,0,0,0,0,0
2,Artist,0,1,0,0,0,0,0
3,Artist,0,1,0,0,0,0,0
4,Artist,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...
34553,Artist,0,1,0,0,0,0,0
34554,Artist,0,1,0,0,0,0,0
34555,Artist,0,1,0,0,0,0,0
34556,Artist,0,1,0,0,0,0,0


My solution doesn't follow the DRY principle very well. I repeated the same `defaultdict` setup and the same `mutate` line for every category, changing only the label and the variable name each time.
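
One possible way to remove the repetition is sketched below. It is not the official solution. It uses plain `pandas` (`assign` instead of `mutate`) on hypothetical toy data, and the helper `indicator_for` is a made-up name.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical toy data standing in for exhibitions.ExhibitionRole
roles = pd.DataFrame(
    {'ExhibitionRole': ['Curator', 'Artist', 'Competition Judge', 'Artist']}
)

def indicator_for(label):
    """Return a defaultdict mapping `label` to 1 and anything else to 0."""
    return defaultdict(int, {label: 1})

# dropna() removes missing labels, if any, before listing the categories
categories = roles.ExhibitionRole.dropna().unique()

# Build every indicator column in one loop instead of one cell per category
indicators = {
    cat.replace(' ', '_'): roles.ExhibitionRole.map(indicator_for(cat))
    for cat in categories
}
result = roles.assign(**indicators)
```

Each category now appears once, in the data, rather than in hand-written code. `pandas` also provides `pd.get_dummies`, which builds indicator columns directly.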