# Stacking and unstacking data

In [6]:
import pandas as pd
from dfply import *

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Ops').getOrCreate()

## Reshaping data

Data can be reshaped in two directions:

* We can **stack** data into a *tall* (long) format.
* We can **unstack** data into a *wide* format.

## (totally real and not at all made-up) Example - Quarterly Auto Sales

**Note** that the last four columns share

* the same measurement
* the same units

#### `pandas`

In [8]:
sales = pd.read_csv("./data/auto_sales.csv")
sales

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9


## Stacking measurements of the same type/units

<img src="./img/stack_in_action.gif" width=600>

When the column labels themselves carry information (here, the car type), we can move that information into its own column by stacking the data with `gather`.

## A Stack by any other name ...

The act of stacking similar columns goes by various names.

* JMP and Minitab call this *stack*
* `pandas` calls this *melt*
* Wickham/`tidyr`/`dfply` call this *gather*

I prefer **stack**, primarily because it makes it clear we are *melting*/*gathering* data vertically.

## Stacking a `pandas` data frame with `dfply`'s `gather`

Syntax: `gather(lbl_col_name, val_col_name, cols_to_stack)`

In [10]:
sales_cols = ['Compact', 'Sedan', 'SUV', 'Truck']
sales_stacked = (sales 
                 >> gather("CarType","QrtSales", sales_cols))
sales_stacked >> head

Unnamed: 0,Salesperson,CarType,QrtSales
0,Ann,Compact,22
1,Bob,Compact,19
2,Yolanda,Compact,19
3,Xerxes,Compact,12
4,Ann,Sedan,18
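
For comparison, the same stack can be sketched in plain `pandas` with `melt`, the name noted earlier. This is a minimal sketch and not the `dfply` implementation: the frame is rebuilt inline from the table shown above rather than read from `./data/auto_sales.csv`, and the name `sales_melted` is only illustrative.

```python
import pandas as pd

# Rebuild the sales table shown above inline (assumption: stands in for auto_sales.csv).
sales = pd.DataFrame({
    "Salesperson": ["Ann", "Bob", "Yolanda", "Xerxes"],
    "Compact": [22, 19, 19, 12],
    "Sedan": [18, 12, 8, 23],
    "SUV": [15, 17, 32, 18],
    "Truck": [12, 20, 15, 9],
})
sales_cols = ["Compact", "Sedan", "SUV", "Truck"]

# melt keeps id_vars as they are and stacks value_vars into a label/value pair of columns.
sales_melted = sales.melt(
    id_vars="Salesperson",
    value_vars=sales_cols,
    var_name="CarType",
    value_name="QrtSales",
)
print(sales_melted.head())
```

The argument names differ, but the roles match `gather`: `var_name` is the label column, `value_name` is the value column, and `value_vars` lists the columns to stack.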


## Unstacking Data with `spread`

Syntax: `spread(split_by_col, to_split_col)`

In [12]:
(sales_stacked
 >> spread(X.CarType, X.QrtSales))

Unnamed: 0,Salesperson,Compact,SUV,Sedan,Truck
0,Ann,22,15,18,12
1,Bob,19,17,12,20
2,Xerxes,12,18,23,9
3,Yolanda,19,32,8,15
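
A plain-`pandas` counterpart of `spread` is `pivot`. The sketch below assumes the stacked table from the previous step, rebuilt inline here so the block runs on its own; `sales_wide` is an illustrative name.

```python
import pandas as pd

# Stacked data (assumption: rebuilt inline to match the gather output above).
sales_stacked = pd.DataFrame({
    "Salesperson": ["Ann", "Bob", "Yolanda", "Xerxes"] * 4,
    "CarType": ["Compact"] * 4 + ["Sedan"] * 4 + ["SUV"] * 4 + ["Truck"] * 4,
    "QrtSales": [22, 19, 19, 12, 18, 12, 8, 23, 15, 17, 32, 18, 12, 20, 15, 9],
})

# Each distinct CarType becomes a column; QrtSales fills the cells.
sales_wide = (
    sales_stacked
    .pivot(index="Salesperson", columns="CarType", values="QrtSales")
    .reset_index()
)
sales_wide.columns.name = None  # drop the leftover "CarType" axis label
print(sales_wide)
```

As in the `spread` output, the rows and the new columns come back sorted, so their order differs from the original table.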


## Safely working with `gather` and `spread`


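We were lucky the last example worked: every salesperson has a distinct name, so each original row could be identified.  Note that

* `spread` needs a column that uniquely identifies each original row to work properly.
* `gather` will add such a column (`_ID`) when called with `add_id=True`.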

In [13]:
sales_stacked = sales >> gather("CarType","QrtSales", sales_cols, add_id=True)
sales_stacked >> head(2)

Unnamed: 0,Salesperson,_ID,CarType,QrtSales
0,Ann,0,Compact,22
1,Bob,1,Compact,19


In [14]:
sales_stacked >> spread(X.CarType, X.QrtSales) >> head(2)

Unnamed: 0,Salesperson,_ID,Compact,SUV,Sedan,Truck
0,Ann,0,22,15,18,12
1,Bob,1,19,17,12,20
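
To see why a unique identifier matters, consider a hypothetical table in which two salespeople share a name. The sketch uses plain `pandas` (`melt` and `pivot`) rather than `dfply`, and the duplicated "Ann" rows and the `row_id` column are invented for illustration. Without an identifier the unstack is ambiguous and `pivot` raises a `ValueError`; with one, the data round-trips.

```python
import pandas as pd

# Hypothetical data: two different people named Ann.
dup = pd.DataFrame({
    "Salesperson": ["Ann", "Ann"],
    "Compact": [22, 5],
    "Sedan": [18, 7],
})
stacked = dup.melt(id_vars="Salesperson", var_name="CarType", value_name="QrtSales")

# Without an id, ("Ann", "Compact") appears twice, so pivot cannot place the values.
try:
    stacked.pivot(index="Salesperson", columns="CarType", values="QrtSales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Adding a row id before stacking makes every (row_id, CarType) pair unique.
with_id = dup.reset_index().rename(columns={"index": "row_id"})
stacked_id = with_id.melt(
    id_vars=["row_id", "Salesperson"], var_name="CarType", value_name="QrtSales"
)
wide = stacked_id.pivot(
    index=["row_id", "Salesperson"], columns="CarType", values="QrtSales"
).reset_index()
wide.columns.name = None
print(pivot_failed)
print(wide)
```

Here `row_id` plays the same role as the `_ID` column that `gather` adds with `add_id=True`.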


## Why Stack?

* Perform transformations on many columns.
* Fix problems with the Golden Rule

## Example - Switching Units on All Sales

Suppose your manager wants these quarterly numbers reported as *monthly* sales, that is, divided by 3.  You could

1. Adjust each column with a separate formula
2. Stack --> Transform once --> Unstack

#### Method 1 - Column Transformations

In [16]:
(sales
 >> mutate(Compact = X.Compact/3,
           SUV =   X.SUV/3,
           Sedan = X.Sedan/3,
           Truck = X.Truck/3)
 >> head(2))

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,7.333333,6.0,5.0,4.0
1,Bob,6.333333,4.0,5.666667,6.666667


#### Method 2 - Stack-Transform-Unstack

In [17]:
(sales 
 >> gather("CarType","QrtSales", sales_cols)
 >> mutate(MonSales = X.QrtSales/3)
 >> drop(X.QrtSales)
 >> spread(X.CarType, X.MonSales)
 >> head(2))

Unnamed: 0,Salesperson,Compact,SUV,Sedan,Truck
0,Ann,7.333333,5.0,6.0,4.0
1,Bob,6.333333,5.666667,4.0,6.666667


## Comparing the two methods

**Method 1:**
* More straightforward
* Lots of repeated code
* Doesn't scale ... imagine 100+ columns

**Method 2:**
* More complicated
* Scales well
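
The scaling argument can be sketched in plain `pandas` as well. This is an illustrative equivalent of Method 2 using `melt` and `pivot` (not the `dfply` pipeline above), with the sales table rebuilt inline. The transformation is written once, however many columns are stacked.

```python
import pandas as pd

# Sales table rebuilt inline (assumption: stands in for auto_sales.csv).
sales = pd.DataFrame({
    "Salesperson": ["Ann", "Bob", "Yolanda", "Xerxes"],
    "Compact": [22, 19, 19, 12],
    "Sedan": [18, 12, 8, 23],
    "SUV": [15, 17, 32, 18],
    "Truck": [12, 20, 15, 9],
})

monthly = (
    sales
    # Stack: every non-id column is gathered, so no column list is needed.
    .melt(id_vars="Salesperson", var_name="CarType", value_name="QrtSales")
    # Transform once, on the single stacked value column.
    .assign(MonSales=lambda d: d["QrtSales"] / 3)
    # Unstack: one column per car type, filled with the monthly values.
    .pivot(index="Salesperson", columns="CarType", values="MonSales")
    .reset_index()
)
monthly.columns.name = None
print(monthly.head(2))
```

Adding a fifth or a hundredth car-type column to `sales` would not change this code, whereas Method 1 would need one more line per column.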


## <font color="red"> Exercise 1 </font>
    
**Task:** Load the `health_survey.csv` data and use the Stack-Transform-Unstack trick to transform the responses to a Likert scale, with *Strongly Agree* mapped to 5 and *Strongly Disagree* mapped to 1.


In [18]:
survey = pd.read_csv("./data/health_survey.csv")
survey.head(2)

Unnamed: 0.1,Unnamed: 0,F1,F5,F2,F1.1,F2.1,F6,F4,F3,F5.1,...,F2.9,F3.4,F4.3,F2.10,F1.7,F6.4,F4.4,F5.7,F3.5,F2.11
0,1,Somewhat Agree,Somewhat Disagree,Somewhat Agree,Somewhat Agree,Somewhat Agree,Somewhat Disagree,Somewhat Agree,Somewhat Agree,Somewhat Agree,...,Somewhat Agree,Somewhat Disagree,Neither Agree nor Disagree,Somewhat Agree,Somewhat Agree,Somewhat Agree,Somewhat Agree,Somewhat Agree,Somewhat Agree,Somewhat Agree
1,2,Somewhat Agree,Somewhat Disagree,Somewhat Agree,Somewhat Agree,Somewhat Agree,Somewhat Disagree,Somewhat Agree,Neither Agree nor Disagree,Neither Agree nor Disagree,...,Somewhat Agree,Somewhat Agree,Neither Agree nor Disagree,Somewhat Agree,Somewhat Agree,Somewhat Disagree,Neither Agree nor Disagree,Somewhat Agree,Neither Agree nor Disagree,Somewhat Agree


In [170]:
# Your code here

## Up Next

Stuff