# It's all the same!

When releasing the `dplyr` library, Wickham proclaimed the functions in the library to be the *Grammar of data manipulation*.  While I might quibble with this claim, he makes a strong point: **The verbs presented in `dplyr` provide a platform-independent language for manipulating tables.**

In this lecture, we will illustrate this point by constructing wrapper functions for `pyspark` that provide the same interface and functionality as the corresponding `dplyr` functions.

## Set up

Let's read a data set into `pyspark`.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean

spark = SparkSession.builder.appName('Ops').getOrCreate()
df_spark = spark.read.csv('data/heroes_information.csv', inferSchema=True, header=True)

## Understanding `pipeable`

Before we start, we must understand `pipe`s.

<img src="./img/pipe_meaning.png" width=400>
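To make the idea concrete, here is a minimal pure-Python sketch of how a pipe can work. The `Pipeable` class below is a hypothetical stand-in written for illustration. It is not the implementation behind the `pipeable` decorator used in this lecture, and it assumes only that `x >> f` should mean `f(x)`, with the piped value filling the last positional slot.

```python
from functools import partial


class Pipeable:
    """Hypothetical stand-in for a pipeable function (illustration only)."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        # Supplying the leading arguments returns a new Pipeable
        # that is still waiting for the piped value.
        return Pipeable(partial(self.func, *args, **kwargs))

    def __rrshift__(self, left):
        # Python evaluates `left >> self` by calling this method when
        # `left` does not know how to handle `>>` itself.
        return self.func(left)


@Pipeable
def first_k(num, rows):
    return rows[:num]


@Pipeable
def total(rows):
    return sum(rows)


# Reads left to right: take the first two items, then add them up.
piped = [10, 20, 30, 40] >> first_k(2) >> total

# The same computation written as ordinary nested calls.
nested = sum([10, 20, 30, 40][:2])
```

Both `piped` and `nested` hold the same value. The piped form simply lists the steps in the order they happen.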

## Example 1 - `headk`

Since we naturally use `df_spark.take` to test our code, we start by implementing a `pipeable` wrapper for this function, which we will call `headk`.  To start, get help on `df_spark.take`.  Pay attention to:

1. The signature of the method
2. The description of the arguments, etc.

In [2]:
?df_spark.take

## Understanding the relationship between `df_spark.take` and `headk`

<img src="./img/take.png" width=800>



#### Creating `headk`

<img src="./img/take.png" width=500>

In [3]:
from functoolz import pipeable
@pipeable
def headk(num, df):
    return df.take(num)

#### Testing `headk`

In [4]:
from more_pyspark import to_pandas

(df_spark
 >> headk(3)
 >> to_pandas)

Unnamed: 0,Id,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0


## Giving `num` a default value

In `pandas`/`dfply`, `head` returns 5 rows by default.  Let's give `num` a default value of 5 to mirror this behavior.  Because parameters with defaults must follow those without, `df` now comes first in the signature.

In [5]:
from functoolz import pipeable

@pipeable
def headk(df, num = 5): # Keywords must follow positional arguments
    return df.take(num)

#### Testing `headk` with the default value

In [6]:
(df_spark
 >> headk
 >> to_pandas)

Unnamed: 0,Id,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,-,bad,441.0
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,-,bad,-99.0


#### Testing `headk` with an explicit `num`

Notice that currying forces us to specify `num` as a keyword.

In [7]:
(df_spark
 >> headk(num = 2)
 >> to_pandas)

Unnamed: 0,Id,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
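The keyword requirement follows from how partial application binds arguments. The sketch below uses `functools.partial` and a hypothetical list-based `head_rows` function as stand-ins. It assumes that `pipeable` fills the remaining positional slot with the piped data frame in a similar way, which is consistent with the behavior shown above but is not taken from the library's source.

```python
from functools import partial


def head_rows(rows, num=5):
    """Hypothetical list-based stand-in for headk(df, num=5)."""
    return rows[:num]


rows = list(range(10))

# Binding `num` by keyword leaves the first positional slot open for the data.
by_keyword = partial(head_rows, num=2)
keyword_result = by_keyword(rows)

# Binding 2 by position fills the `rows` slot, so the data lands in `num`.
by_position = partial(head_rows, 2)
try:
    by_position(rows)
    positional_failed = False
except TypeError:
    # An int cannot be sliced, so the call fails.
    positional_failed = True
```

Once `df` sits in the first position, any extra argument supplied by position takes the slot meant for the data, so it has to be named.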


## Example 2 - Implementing `selectk` for `pyspark`

As our second example, we will implement `selectk` for `pyspark`.  **Note:** To avoid name conflicts, I will append a `k` to all `pyspark` versions of these functions. This function will be a wrapper around the `select` method on a `pyspark` data frame.

#### Step 1 - Get help on `df_spark.select`

In [8]:
?df_spark.select

#### Notes on `*cols`

* `df_spark.select` accepts a variable number of columns through `*cols`
* `pipeable` uses currying, which
    * puts the `df` in the last position
    * makes it hard to use `*cols`
* **Consequences:**
    * We will pass `cols` as a single list and unpack it inside the wrapper.
    * `df` is necessarily in the last position
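The difficulty with `*cols` comes from Python itself: any parameter declared after `*cols` becomes keyword-only, so a piped data frame cannot fill it positionally. The sketch below shows this with hypothetical stand-in functions that operate on a list of dictionaries rather than a `pyspark` data frame.

```python
def select_star(*cols, df):
    """Hypothetical variant: `df` after `*cols` is keyword-only."""
    return [{c: row[c] for c in cols} for row in df]


def select_list(cols, df):
    """Hypothetical variant: columns arrive as one list, `df` stays positional."""
    return select_star(*cols, df=df)


rows = [{"name": "A-Bomb", "Gender": "Male", "Height": 203.0}]

# Works: the list is one positional argument and the data is the last.
picked = select_list(["name", "Height"], rows)

# Fails: `rows` is swallowed by *cols, so `df` is never supplied.
try:
    select_star("name", "Height", rows)
    star_failed = False
except TypeError:
    star_failed = True
```

Taking the columns as one list keeps the signature at exactly two positional parameters, which is the shape a curried wrapper needs.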

#### Creating `selectk`

<img src="./img/select.png" width=500>

In [9]:
@pipeable
def selectk(cols, df):
    return df.select(*cols)

#### Testing `selectk`

Note that we test each way of specifying a `pyspark.sql` column: `'col_name'`, `col('col_name')`, and `df_spark.col_name`, plus a calculated column saved as an expression.

In [13]:
from pyspark.sql.functions import col

height_in = (df_spark.Height*0.3937).alias('height_in')

(df_spark
 >> selectk(['name', # Test col_name strings
            col('Eye Color'), # Test col expressions
            df_spark.Gender, # Test df attributes
            height_in]) # Test calculated col saved as an expression
 >> headk(num=3)
 >> to_pandas)

Unnamed: 0,name,Eye Color,Gender,height_in
0,A-Bomb,yellow,Male,79.9211
1,Abe Sapien,blue,Male,75.1967
2,Abin Sur,blue,Male,72.8345


## <font color="red" > Exercise 1 - Construct `filter_byk` </font>

* This should be a wrapper around `df_spark.where`
    * Explore using help.  How many additional arguments?  What are they?
* Must be `pipeable` with `df` in the last position.
* Include test(s) of your function.

In [39]:
# Your function definition here

In [40]:
# Your test expression here

## Up Next

Stuff