# Activity 5.3 - Returning to the Nest

## Processing Dr. Bergen's Eagle Data in `pyspark`

In a previous homework, you performed a data management task for Dr. Bergen, Director of the WSU Statistical Consulting Center.  The associated data can be found in the `data` folder of this repository.  Below, you will find the instructions for the original task.

>    Dr. Bergen had the following to say about the data.
>
>     - One row = one GPS measurement.  
>     - Subsample of 10K GPS points from a couple bald eagles in Iowa. 
>     - **Context.** We need to use the flight characteristics to perform $k$-means clustering of the flight points.  
>
>    Variables to be used for clustering include
>
>    - `KPH` (km per hour; an instantaneous measure of speed; measured by the GPS device);
>    - `Sn` (an average speed; computed from two time points and the locations at those times);
>    - `AGL0` (meters above ground level);
>    - `VerticalRate` (change in AGL between two time points; large negative if descending quickly; large positive if ascending quickly);
>    - `absVR` (absolute value of VerticalRate); and
>    - `abs_angle` (absolute value of turn angle, in radians; larger values equal more “tortuous”, i.e., twisty flight).
>
>    All variables except for `VerticalRate` are skewed and all variables need to be mean-centered and standardized prior to clustering.
>
>    <img src="./img/summary_of_features.png"/>
>
>    Note that data is 
>
>    - *mean-centered* by subtracting the mean of the column from each entry.
>    - *standardized* by dividing each entry by the standard deviation of the column.
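
The two definitions above, combined with the square-root transform used for the skewed variables, can be sketched in plain Python. This is a minimal illustration on five `KPH` values copied from the first rows of the data, not the `pyspark` solution itself; the helper name `standardize` is hypothetical, and the sketch assumes the sample standard deviation (the $n - 1$ denominator) is the one intended.

```python
import math
import statistics

# A few KPH readings copied from the first rows of the eagle data.
kph = [32.81, 29.63, 35.42, 32.87, 35.37]


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the sample standard deviation.

    Hypothetical helper for illustration only.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]


# Square-root transform first (to reduce skew), then mean-center and standardize.
sqrt_kph = [math.sqrt(v) for v in kph]
z_kph = standardize(sqrt_kph)

print([round(z, 3) for z in z_kph])
```

After this step, the transformed column has mean 0 and standard deviation 1 (up to floating-point error), which is what puts all six features on a comparable scale for $k$-means.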

### Tasks

In this activity, you will redo the following tasks in `pyspark` using the STACK-TRANSFORM-UNSTACK trick.

- Read the data into `pyspark` and ensure that the columns have the correct type.  Define a schema as needed.
- Apply `sqrt` transform to `KPH`, `Sn`, `AGL0`, `absVR` and `abs_angle`.  
- Mean-center and standardize the transformed variables from above, as well as `VerticalRate`.
- Visualize the transformed features.  
    - Because `pyspark` lacks visualization tools, you should convert the results back to a `pandas.DataFrame` and then use a [seaborn multi-plot grid](https://seaborn.pydata.org/tutorial/axis_grids.html) to plot all the variables on the same panel.  **HINT.** To make this work, you will need to stack all of the transformed features.
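
The STACK-TRANSFORM-UNSTACK idea can be sketched without Spark. The example below is a plain-Python illustration on hypothetical toy rows: the function names `stack` and `unstack` and the `id` column are assumptions made for the sketch, not part of any library. Stacking turns several columns into one `value` column, so a single transformation covers all of them at once.

```python
import math

# Hypothetical toy rows; the real data has one row per GPS measurement.
rows = [
    {"id": 0, "KPH": 36.0, "AGL0": 9.0},
    {"id": 1, "KPH": 25.0, "AGL0": 4.0},
]


def stack(rows, id_col, cols):
    """Wide -> long: one (id, variable, value) record per stacked cell."""
    return [
        {id_col: r[id_col], "variable": c, "value": r[c]}
        for r in rows
        for c in cols
    ]


def unstack(long_rows, id_col):
    """Long -> wide: rebuild one dict per id, with one key per variable."""
    wide = {}
    for r in long_rows:
        wide.setdefault(r[id_col], {id_col: r[id_col]})[r["variable"]] = r["value"]
    return list(wide.values())


# STACK the columns, TRANSFORM the single value column, then UNSTACK.
long_rows = stack(rows, "id", ["KPH", "AGL0"])
transformed = [{**r, "value": math.sqrt(r["value"])} for r in long_rows]
result = unstack(transformed, "id")

print(result)
```

The long format is also the shape a multi-plot grid needs: one column naming the variable and one column holding the values.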

**Deliverables.** You should keep any code cells you used to test or figure out the solution, but the end result should be three cells:

1. A cell loading spark and reading in the data frame.
2. A second cell containing all the code and data management in one dot chain, along with any other objects used in the pipe.
3. A third cell containing all the code needed to convert the data frame back to pandas and create your visualization.

Note that these three cells should work independently of the rest of your code: if I restart the kernel and run only these cells, everything should work.



In [42]:
# Hint 1.  pyspark includes sqrt, mean, and stddev functions.
from pyspark.sql import functions as fn
from pyspark.sql import *
from more_pyspark import get_spark_types, to_pandas

spark = SparkSession.builder.appName('Ops').getOrCreate()

In [3]:
# Hint 2.  The Apache Arrow library allows fast conversion of data frames back to pandas.  
!pip install pyarrow

# The `toPandas` method effectively replaces `collect`. 
# Example:
# pandas_df = spark_df.toPandas() # <== requires pyarrow


Collecting pyarrow
  Downloading pyarrow-9.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.3 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-9.0.0


In [14]:
from composable import pipeable
from pyspark.sql.functions import *

@pipeable
def gather(var_lbl, val_lbl, cols_to_stack, df):
    make_array = lambda var_name, val_name, cols: (array(*(struct(lit(c).alias(var_name), 
                                                                  col(c).alias(val_name))
                                                           for c in cols)))
    return (df
            .withColumn('var_val_array', 
                        make_array(var_lbl, 
                                   val_lbl, 
                                   cols_to_stack))
            .withColumn("vars_and_vals", explode(col('var_val_array')))
            .withColumn(var_lbl, col("vars_and_vals").getItem(var_lbl))
            .withColumn(val_lbl, col("vars_and_vals").getItem(val_lbl))
            .drop(*(cols_to_stack + ['var_val_array', "vars_and_vals"])))

@pipeable
def spread(var_col, val_col, group_by_col, df):
    # Pivot the labels in var_col into columns, filling each with the sum of val_col per group.
    return  (df
             .groupBy(group_by_col)
             .pivot(var_col)
             .sum(val_col))

In [11]:
eagles =  spark.read.csv('data/bald_eagle_subsample.csv', inferSchema=True, header=True)
eagles.printSchema()

root
 |-- Animal_ID: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age2: string (nullable = true)
 |-- LocalTime: string (nullable = true)
 |-- KPH: double (nullable = true)
 |-- Sn: double (nullable = true)
 |-- AGL0: double (nullable = true)
 |-- VerticalRate: double (nullable = true)
 |-- abs_angle: double (nullable = true)
 |-- absVR: double (nullable = true)



In [12]:
eagles.toPandas()

| | Animal_ID | Sex | Age2 | LocalTime | KPH | Sn | AGL0 | VerticalRate | abs_angle | absVR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 105 | F | Fledgling | 7/4/19 9:01 | 32.81 | 6.89 | 0.02 | -0.002167 | 0.006277 | 0.002167 |
| 1 | 105 | F | Fledgling | 7/4/19 9:01 | 29.63 | 7.79 | 0.00 | -0.120000 | 0.570000 | 0.120000 |
| 2 | 106 | F | Fledgling | 7/6/19 7:02 | 35.42 | 8.58 | 13.15 | 0.490000 | 2.010000 | 0.490000 |
| 3 | 106 | F | Fledgling | 7/6/19 7:02 | 32.87 | 9.13 | 10.88 | -0.450000 | 1.100000 | 0.450000 |
| 4 | 106 | F | Fledgling | 7/6/19 7:02 | 35.37 | 10.01 | 7.28 | -0.720000 | 0.370000 | 0.720000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 106 | F | Juvenile | 12/27/19 11:33 | 28.39 | 7.98 | 159.85 | 0.140000 | 0.120000 | 0.140000 |
| 9996 | 106 | F | Juvenile | 12/27/19 11:33 | 34.15 | 8.49 | 154.67 | -0.860000 | 0.470000 | 0.860000 |
| 9997 | 106 | F | Juvenile | 12/27/19 11:33 | 30.15 | 8.43 | 152.39 | -0.370000 | 0.960000 | 0.370000 |
| 9998 | 106 | F | Juvenile | 12/27/19 11:33 | 55.43 | 11.30 | 142.03 | -1.720000 | 0.050000 | 1.720000 |


In [34]:
eagles.columns[-6:]

['KPH', 'Sn', 'AGL0', 'VerticalRate', 'abs_angle', 'absVR']

In [31]:
all_vars_cols = eagles.columns[-6:]
vars_except_vr = all_vars_cols.copy()
vars_except_vr.remove('VerticalRate')
vars_except_vr

['KPH', 'Sn', 'AGL0', 'abs_angle', 'absVR']

In [32]:
eagles.columns[:4]

['Animal_ID', 'Sex', 'Age2', 'LocalTime']

In [56]:
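# STACK-TRANSFORM-UNSTACK, applied twice: once for the sqrt transform, once for the z-scores.
# Caveat: the first spread groups by the four ID columns plus VerticalRate, so any rows
# that share all five of those values would be summed into a single row.
( 
    ( ( (eagles
       # Stack the five skewed features into (variable, value) pairs.
       >> gather("variable","value",vars_except_vr)
      )
      # Square-root transform every stacked value.
      .withColumn("value", fn.sqrt(col("value")))
      # Unstack back to one column per feature.
      >> spread("variable","value", eagles.columns[:4] + ['VerticalRate'])
    )
      # Add a row id so the second spread can rebuild one row per observation.
      .withColumn('id', fn.monotonically_increasing_id())
      # Stack all six features, including VerticalRate.
      >> gather("variable","value",all_vars_cols)
     )
    # Per-feature mean and standard deviation, computed over a window for each variable.
    .withColumn("groupmean", fn.mean(col("value")).over(Window.partitionBy('variable')))
    .withColumn("groupstddev", fn.stddev(col("value")).over(Window.partitionBy('variable')))
    .withColumn('zscore', (col('value') - col('groupmean'))/col("groupstddev"))
    # Unstack the z-scores to one column per feature.
    >> spread("variable","zscore", "id")
 ).toPandas()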

| | id | AGL0 | KPH | Sn | VerticalRate | absVR | abs_angle |
|---|---|---|---|---|---|---|---|
| 0 | 26 | -0.520699 | -0.115110 | -0.391198 | 0.593960 | -0.154204 | 0.771576 |
| 1 | 29 | 1.297387 | 0.079510 | -0.640294 | 0.461491 | -0.397826 | 0.674643 |
| 2 | 474 | -0.684463 | -0.258202 | -0.770569 | 0.426630 | -0.467964 | 1.054636 |
| 3 | 964 | -0.892091 | -1.225493 | -0.303698 | 2.099923 | 1.627251 | 1.432634 |
| 4 | 1677 | 0.938968 | -0.320269 | -0.877572 | 1.493354 | 1.033920 | 0.602981 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9809 | 8577 | -0.990025 | -1.676586 | -0.842300 | 0.022252 | -1.885098 | 0.610245 |
| 9810 | 8826 | -0.619960 | -0.998742 | -0.713811 | 0.391770 | -0.541253 | -0.785608 |
| 9811 | 8954 | -0.804471 | 0.654474 | 0.219969 | 2.943541 | 2.322067 | -1.505808 |
| 9812 | 9663 | -1.359070 | -0.084581 | -0.167300 | -0.382127 | -0.439554 | -0.654005 |
