# More on Piping, Intentions, and Column Expressions

In [1]:
import numpy as np
import pandas as pd
from dfply import *
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
artists = pd.read_csv("./data/Artists.csv")
artwork = pd.read_csv("./data/Artworks.csv")

In [3]:
# carried over from the last lecture
bad_lbls = (artists >> 
             filter_by(X.Nationality.str.lower().str.startswith('nation').astype('bool')) >>
             pull('Nationality')).unique()
recode_bad_lbls = {old_lbl:'Nationality unknown' for old_lbl in bad_lbls}
replace_zero = {0:np.nan}  # a BeginDate of 0 means the year is missing

## Why do we love piping?

### Reason 1: Composition Baby!

It is very easy to put separate pipes together.

In [5]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID'))
artists_new = (artists_renamed >>
                mutate(Nationality = X.Nationality.replace(recode_bad_lbls)))
artists_new = (artists_new >>
                mutate(BeginDate = X.BeginDate.replace(replace_zero)))

## To compose separate pipes

1. Switch ending `)` to `>>`
2. Remove the next assignment
3. ??
4. Profit!

In [5]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >> #)
#artists_new = (artists_renamed >>
                mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >> #)
#artists_new = (artists_new >>
                mutate(BeginDate = X.BeginDate.replace(replace_zero)))

## End product ... full process in a single pipe

In [6]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >>
                    mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    mutate(BeginDate = X.BeginDate.replace(replace_zero)))
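
To see why separate pipes compose so easily, here is a minimal sketch of how a `>>` pipe could be built on Python's `__rrshift__` hook. This is a toy, not `dfply`'s actual implementation: the names `Verb`, `rename_cols`, `with_column`, and the two-row `toy` data frame are all hypothetical and exist only for illustration.

```python
import pandas as pd


class Verb:
    """Toy pipeable step: wraps a function that takes and returns a DataFrame."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, df):
        # Python falls back to this method for `df >> verb`
        # because the left-hand DataFrame does not know how to handle a Verb.
        return self.func(df)


def rename_cols(**new_to_old):
    """Hypothetical rename verb: rename_cols(new_name='old name')."""
    mapping = {old: new for new, old in new_to_old.items()}
    return Verb(lambda df: df.rename(columns=mapping))


def with_column(name, fn):
    """Hypothetical mutate-like verb: fn receives the DataFrame and returns a column."""
    return Verb(lambda df: df.assign(**{name: fn(df)}))


# Made-up data standing in for the artists table
toy = pd.DataFrame({"Wiki QID": ["Q1", "Q2"], "BeginDate": [0, 1930]})


def zero_to_nan(df):
    return df.BeginDate.replace({0: float("nan")})


# Two separate pipes ...
step_by_step = toy >> rename_cols(Wiki_QID="Wiki QID")
step_by_step = step_by_step >> with_column("BeginDate", zero_to_nan)

# ... and the same work as a single pipe
single_pipe = (toy >>
               rename_cols(Wiki_QID="Wiki QID") >>
               with_column("BeginDate", zero_to_nan))
```

Each step takes a data frame and returns a data frame, so the output of one pipe is always a valid input to the next. That is what makes the "switch `)` to `>>`" recipe safe.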

## Why do we love piping?

### Reason 2: Textual Gravity!

A pipe clearly expresses the intention of our code by

1. Reading left-to-right and top-to-bottom
2. Putting the verbs up front

In [7]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >>
                    mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    mutate(BeginDate = X.BeginDate.replace(replace_zero)))

## Why do we love piping?

### Reason 3: Easy debugging

Comments make it easy to debug a pipe.

## Debugging Step 1 - Start at the top

Use comments to remove every part of the chain after the data frame. Starting each line with `>>` makes each step easy to comment out.

*Don't forget the ending `)`*

In [11]:
artists_renamed = (artists
                    >> rename(Wiki_QID = 'Wiki QID')
                    >> mutate(Nationality = X.Nationality.replace(recode_bad_lbls))
                    >> mutate(BeginDate = X.BeginDate.replace(replace_zero))
                  )

## Debugging Step 2 - Work your way down the pipe

Add in each part, one-at-a-time, checking the results

*Don't forget the ending `)`*

In [9]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') ) #>>
                    #mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    #mutate(BeginDate = X.BeginDate.replace(replace_zero)))

In [10]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >>
                    mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) ) #>>
                    #mutate(BeginDate = X.BeginDate.replace(replace_zero)))

In [11]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >>
                    mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    mutate(BeginDate = X.BeginDate.replace(replace_zero)))

# More about Intentions 

## `X` is an `Intention`

<img src="img/dfply_X_intention_1.png" width = 800>

Think of it as recording an expression for later evaluation

In [13]:
expr = X.BeginDate.head()
expr

<dfply.base.Intention at 0x11868bcc0>

## Use `evaluate` to apply the expression

We can apply an expression *later* using `evaluate` with a dataframe.

In [15]:
expr.evaluate(artists)

0    1930
1    1936
2    1941
3    1946
4    1941
Name: BeginDate, dtype: int64

## Intention expressions are reusable!

In [16]:
# Reusable!
expr.evaluate(artwork)

0    (1841)
1    (1944)
2    (1876)
3    (1944)
4    (1876)
Name: BeginDate, dtype: object
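
The record-now, evaluate-later idea can be sketched in a few lines of plain Python. The `Deferred` class below is a hypothetical stand-in written for this illustration, not `dfply`'s `Intention` class, and the two small data frames are made up.

```python
import pandas as pd


class Deferred:
    """Toy deferred expression: stores a function to apply to data later."""

    def __init__(self, func=lambda data: data):
        self._func = func

    def __getattr__(self, name):
        # Record "look up attribute `name`" on top of what was recorded so far.
        return Deferred(lambda data: getattr(self._func(data), name))

    def __call__(self, *args, **kwargs):
        # Record "call the result with these arguments".
        return Deferred(lambda data: self._func(data)(*args, **kwargs))

    def evaluate(self, data):
        # Replay everything that was recorded against real data.
        return self._func(data)


D = Deferred()                     # plays the role of X
first_two = D.BeginDate.head(2)    # nothing is computed yet

# Made-up frames that share a column name but not a dtype
toy_artists = pd.DataFrame({"BeginDate": [1930, 1936, 1941]})
toy_artwork = pd.DataFrame({"BeginDate": ["(1841)", "(1944)", "(1876)"]})

artists_result = first_two.evaluate(toy_artists)
artwork_result = first_two.evaluate(toy_artwork)   # same expression, different data
```

Because the expression only stores the steps to perform, it makes no assumption about the data until `evaluate` is called, which is why one expression can be reused across data frames.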

## <font color="red"> Exercise 1 </font>
    
**Task:** Use `X` to create an expression that replaces spaces in column names with underscores.  Apply this expression to fresh instances of `artists` and `artwork`.

## <font color="blue"> Key </font>

In [17]:
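# Replace spaces with underscores, then strip parentheses and periods.
# regex=True is needed because '[().]' is a character class, not literal text.
replace_spaces = X.columns.str.replace(' ', '_').str.replace('[().]', '', regex=True)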
replace_spaces

<dfply.base.Intention at 0x118687748>

In [18]:
replace_spaces.evaluate(artists)

Index(['ConstituentID', 'DisplayName', 'ArtistBio', 'Nationality', 'Gender',
       'BeginDate', 'EndDate', 'Wiki_QID', 'ULAN'],
      dtype='object')

In [19]:
replace_spaces.evaluate(artwork)

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference_cm', 'Depth_cm', 'Diameter_cm', 'Height_cm', 'Length_cm',
       'Weight_kg', 'Width_cm', 'Seat_Height_cm', 'Duration_sec'],
      dtype='object')
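
Evaluating the expression only returns the cleaned `Index`; the data frame itself still has its old column names. One way to finish the job, sketched here with plain `pandas` on made-up frames, is to map each old name to its cleaned version. The helper name `clean_names` and the toy column names are assumptions for this illustration.

```python
import pandas as pd


def clean_names(df):
    """Return a copy of df whose column names have spaces replaced by
    underscores and parentheses and periods removed."""
    cleaned = (df.columns
                 .str.replace(' ', '_', regex=False)
                 .str.replace('[().]', '', regex=True))
    return df.rename(columns=dict(zip(df.columns, cleaned)))


# Made-up frames with the kinds of column names seen above
toy_artists = pd.DataFrame({"DisplayName": ["A"], "Wiki QID": ["Q1"]})
toy_artwork = pd.DataFrame({"Title": ["T"],
                            "Height (cm)": [10.0],
                            "Duration (sec.)": [3.5]})

clean_artists = clean_names(toy_artists)
clean_artwork = clean_names(toy_artwork)
```

The originals are left untouched, so the cleaned frames can be compared against them while debugging.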

## Not just for data frames ... works for any* expression

In [20]:
double, line = 2*X, 3*X + 5

In [21]:
double.evaluate(3), line.evaluate(6)

(6, 23)

## We can make functions lazy too!

Decorate a function with `make_symbolic` so that calling it with an `Intention` argument returns a new `Intention` instead of evaluating immediately.

In [22]:
from math import log
log = make_symbolic(log)

In [23]:
log(8, 2) # Works as expected with numbers

3.0

## Passing in `X` now makes a `log` expression

In [25]:
expr = log(X, 2) # Passing in X makes it lazy/symbolic
expr

<dfply.base.Intention at 0x11867f9e8>

In [26]:
expr.evaluate(16) # Evaluate later

4.0
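
A rough sketch of what such a decorator might do under the hood: if any argument is a deferred expression, return a new deferred expression; otherwise call the function right away. The names `Later`, `make_lazy`, and `Y` are hypothetical and chosen so they do not clash with `dfply`; this is not how `make_symbolic` is actually written.

```python
import functools
import math


class Later:
    """Toy deferred value that supports `*` and `+` with plain numbers."""

    def __init__(self, func=lambda value: value):
        self._func = func

    def evaluate(self, value):
        return self._func(value)

    def __mul__(self, other):
        return Later(lambda value: self._func(value) * other)

    __rmul__ = __mul__  # lets `2 * Y` work as well as `Y * 2`

    def __add__(self, other):
        return Later(lambda value: self._func(value) + other)


def make_lazy(func):
    """Wrap func so it defers evaluation when given a Later argument."""
    @functools.wraps(func)
    def wrapper(*args):
        if any(isinstance(arg, Later) for arg in args):
            # Build a new deferred expression that fills in the Later
            # arguments once a value is supplied.
            return Later(lambda value: func(
                *[arg.evaluate(value) if isinstance(arg, Later) else arg
                  for arg in args]))
        return func(*args)  # plain arguments: behave like the original
    return wrapper


Y = Later()                       # plays the role of X
double, line = 2 * Y, 3 * Y + 5   # arithmetic is recorded, not computed
log = make_lazy(math.log)
lazy_log = log(Y, 2)              # deferred because Y is a Later
```

With plain numbers the wrapped function behaves exactly like the original, so decorating it does not break existing calls.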

## `pyspark.sql` set up

In [27]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, column, mean

spark1 = SparkSession.builder.appName('Ops').getOrCreate()

## `pyspark.sql` columns also generate expressions

In [30]:
column('height')

Column<b'height'>

In [31]:
column('height') > 3

Column<b'(height > 3)'>

## `col == column`

In [34]:
5*col('height') + 2 # col is an alias for column

Column<b'((height * 5) + 2)'>

## Column expressions can combine columns

In [42]:
col('height') + col('weight')

Column<b'(height + weight)'>

## Columns work with other `pyspark.sql.functions`

In [35]:
mean(col('height'))

Column<b'avg(height)'>

## `sqlalchemy` columns generate expressions too

In [36]:
from sqlalchemy import func as f
f.column('height')

<sqlalchemy.sql.functions.Function at 0x11a802a20; column>

In [37]:
f.column('height') > 3

<sqlalchemy.sql.elements.BinaryExpression object at 0x1181d3128>

## Arithmetic builds expressions too

In [38]:
5*f.col('height') + 2 # f.<name> builds a SQL function call named <name>, so f.col is not an alias for f.column

<sqlalchemy.sql.elements.BinaryExpression object at 0x1181d34a8>

## Column expressions can combine columns

In [39]:
f.col('height') + f.col('weight')

<sqlalchemy.sql.elements.BinaryExpression object at 0x1181d3710>

## Columns work with other `sqlalchemy` functions

In [40]:
f.avg(f.col('height'))

<sqlalchemy.sql.functions.Function at 0x10ca1c9b0; avg>
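
Note that the reprs above say `Function`, not column: every attribute of `sqlalchemy.func` builds a SQL function call with that name. To refer to an actual column, `sqlalchemy` provides a separate `column` constructor. The sketch below imports it as `sa_column` (an alias chosen here so that it does not shadow the `pyspark` import) and converts each expression to a string to show the SQL fragment it stands for.

```python
from sqlalchemy import column as sa_column, func

height = sa_column("height")
weight = sa_column("weight")

# Each of these is an expression object; str() renders it as a SQL fragment.
combined = str(height + weight)
average = str(func.avg(height))
comparison = str(height > 3)               # the literal 3 becomes a bound parameter

# For contrast: func.column(...) is a call to a SQL function named "column".
as_function = str(func.column("height"))

print(combined)
print(average)
print(comparison)
print(as_function)
```

As with `X` and `pyspark` columns, nothing is computed here; the expressions only describe work to be run later against a database.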

## Up Next

Now we will continue on to [Lecture 2.4](./2_4_pandas_dtypes.ipynb).