# Stacking and Unstacking Data

In [1]:
import polars as pl

## Reshaping data

There are two ways to reshape data:

* We can **stack** data into a *tall* format.
* We can **unstack** data into a *wide* format.
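As a minimal sketch of the two shapes, here is the same toy data stored both ways in plain Python (no `polars` involved). The names `wide` and `tall` and the values are made up for illustration.

```python
# Unstacked format: one row per subject, one column per measurement.
wide = [
    {"name": "A", "x": 1, "y": 2},
    {"name": "B", "x": 3, "y": 4},
]

# Stacked (tall) format: one row per (subject, measurement) pair.
# The old column labels "x" and "y" become data in the "variable" column.
tall = [
    {"name": row["name"], "variable": key, "value": row[key]}
    for key in ("x", "y")
    for row in wide
]

print(len(wide), "unstacked rows become", len(tall), "stacked rows")
for row in tall:
    print(row)
```

Both structures hold the same four values; only the layout differs.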

## (totally real and not at all made-up) Example - Quarterly Auto Sales

**Note** that the last four columns share the

* same measurements
* same units

In [2]:
sales = pl.read_csv("./data/auto_sales.csv")
sales

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15.0,12
"""Bob""",19,12,17.0,20
"""Doug""",20,13,,20
"""Yolanda""",19,8,32.0,15
"""Xerxes""",12,23,18.0,9


## Stacking measurements of the same type/units

<img src="./img/stack_in_action.gif" width=600>

When column labels carry information (here, the car type), we can move that information into the data by stacking the columns with `melt`.

## A Stack by any other name ...

The act of stacking similar columns goes by various names.

* `polars` calls this `melt`
* JMP and Minitab call this *stack*
* Wickham/`tidyr`/`dfply` call this *gather*

I prefer **stack**, primarily because it makes it clear we are *melting*/*gathering* data vertically.

## Stacking data in `polars` with `melt`

Syntax: `df.melt(id_vars, value_vars, variable_name, value_name)`
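To make the four arguments concrete, here is a plain-Python sketch of what stacking does to a list of row dictionaries. The helper `melt_rows` is hypothetical and written only for illustration; it is not how `polars` implements `melt`, and it uses just the first two rows and two car types of the sales table.

```python
def melt_rows(rows, id_vars, value_vars, variable_name, value_name):
    """Stack the value_vars columns into label/value pairs, repeating id_vars."""
    stacked = []
    for col in value_vars:              # one block of output rows per stacked column
        for row in rows:
            new_row = {k: row[k] for k in id_vars}
            new_row[variable_name] = col        # the old column label becomes data
            new_row[value_name] = row[col]      # the old cell value
            stacked.append(new_row)
    return stacked


# First two rows of the sales table, Compact and Sedan only.
rows = [
    {"Salesperson": "Ann", "Compact": 22, "Sedan": 18},
    {"Salesperson": "Bob", "Compact": 19, "Sedan": 12},
]

stacked = melt_rows(rows, ["Salesperson"], ["Compact", "Sedan"], "CarType", "QrtSales")
for r in stacked:
    print(r)
```

Each input row contributes one output row per stacked column, so 2 rows and 2 columns yield 4 stacked rows.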

In [22]:
?sales.melt

In [25]:
len(sales)

5

In [12]:
sales_cols = ['Compact', 'Sedan', 'SUV', 'Truck']
sales_stacked = (sales 
                 .melt('Salesperson', sales_cols, "CarType","QrtSales")
                )
sales_stacked

Salesperson,CarType,QrtSales
str,str,i64
"""Ann""","""Compact""",22.0
"""Bob""","""Compact""",19.0
"""Doug""","""Compact""",20.0
"""Yolanda""","""Compact""",19.0
"""Xerxes""","""Compact""",12.0
"""Ann""","""Sedan""",18.0
"""Bob""","""Sedan""",12.0
"""Doug""","""Sedan""",13.0
"""Yolanda""","""Sedan""",8.0
"""Xerxes""","""Sedan""",23.0


## Unstacking Data with `pivot`

Syntax: `df.pivot(values, index, columns, aggregate_fn='first')`
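The roles of `values`, `index`, and `columns` can be sketched in plain Python. The helper `pivot_rows` below is hypothetical and only mimics the behavior of keeping the first value seen for each cell; it is an illustration under that assumption, not the library's implementation.

```python
def pivot_rows(rows, values, index, columns):
    """Unstack: one output row per distinct index value, one column per label."""
    wide = {}
    for row in rows:
        # Find (or create) the output row for this index value.
        out = wide.setdefault(row[index], {index: row[index]})
        # 'first' behavior: keep the first value seen, ignore later duplicates.
        out.setdefault(row[columns], row[values])
    return list(wide.values())


stacked = [
    {"Salesperson": "Ann", "CarType": "Compact", "QrtSales": 22},
    {"Salesperson": "Bob", "CarType": "Compact", "QrtSales": 19},
    {"Salesperson": "Ann", "CarType": "Sedan", "QrtSales": 18},
    {"Salesperson": "Bob", "CarType": "Sedan", "QrtSales": 12},
]

unstacked = pivot_rows(stacked, "QrtSales", "Salesperson", "CarType")
for r in unstacked:
    print(r)
```

The labels in the `columns` argument become new column names, and the entries in `values` fill the cells.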

In [18]:
?sales.pivot

In [27]:
(sales_stacked
 .pivot('QrtSales', 'Salesperson', 'CarType')
)

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15.0,12
"""Bob""",19,12,17.0,20
"""Doug""",20,13,,20
"""Yolanda""",19,8,32.0,15
"""Xerxes""",12,23,18.0,9


## Safely STACK then UNSTACK


If we want to ensure we can unstack after stacking,

* Add an `ID`/`index` column of unique values
* Use this column as one of the index columns.
* Use `'first'` as the `aggregate_fn`.
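To see why the unique `ID` column matters, consider a sketch in plain Python where two different salespeople share a name. The second "Ann" is made-up data, and `pivot_first` is a hypothetical helper that keeps the first value per cell.

```python
def pivot_first(rows, values, index, columns):
    """Unstack on one or more index columns, keeping the first value per cell."""
    wide = {}
    for row in rows:
        key = tuple(row[c] for c in index)
        out = wide.setdefault(key, {c: row[c] for c in index})
        out.setdefault(row[columns], row[values])   # later duplicates are dropped
    return list(wide.values())


# Two distinct people who happen to share a name (illustrative data).
stacked = [
    {"ID": 0, "Salesperson": "Ann", "CarType": "Compact", "QrtSales": 22},
    {"ID": 1, "Salesperson": "Ann", "CarType": "Compact", "QrtSales": 19},
]

without_id = pivot_first(stacked, "QrtSales", ["Salesperson"], "CarType")
with_id = pivot_first(stacked, "QrtSales", ["ID", "Salesperson"], "CarType")

print(without_id)   # the two Anns collapse into one row
print(with_id)      # the ID column keeps them apart
```

Without the `ID`, the second row is silently lost, so the round trip no longer reproduces the original table.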


In [39]:
(sales 
 .with_column(pl.arange(0, len(sales)).alias('ID'))
 .melt(['ID', 'Salesperson'], sales_cols, "CarType","QrtSales")
 .pivot('QrtSales', ['ID','Salesperson'], 'CarType')
)

ID,Salesperson,Compact,Sedan,SUV,Truck
i64,str,i64,i64,i64,i64
0,"""Ann""",22,18,15.0,12
1,"""Bob""",19,12,17.0,20
2,"""Doug""",20,13,,20
3,"""Yolanda""",19,8,32.0,15
4,"""Xerxes""",12,23,18.0,9


## Why Stack?

* Perform transformations on many columns.
* Fix problems with the Golden Rule

## Example - Switching Units on All Sales

Suppose your manager wants these quarterly numbers as *monthly* sales, that is, each value divided by 3.  You could

1. Adjust each column with a separate formula
2. Stack --> Transform once --> Unstack
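The second option can be sketched in plain Python before turning to `polars`. This is only an illustration of the three steps on the first two rows of the sales table; the variable names are assumptions.

```python
rows = [
    {"Salesperson": "Ann", "Compact": 22, "Sedan": 18},
    {"Salesperson": "Bob", "Compact": 19, "Sedan": 12},
]
car_cols = ["Compact", "Sedan"]

# 1. Stack: one row per (salesperson, car type) pair.
stacked = [
    {"Salesperson": r["Salesperson"], "CarType": c, "QrtSales": r[c]}
    for c in car_cols
    for r in rows
]

# 2. Transform once: a single formula covers every former column.
for s in stacked:
    s["MonSales"] = s.pop("QrtSales") / 3

# 3. Unstack: rebuild one row per salesperson.
by_person = {}
for s in stacked:
    out = by_person.setdefault(s["Salesperson"], {"Salesperson": s["Salesperson"]})
    out[s["CarType"]] = s["MonSales"]
monthly = list(by_person.values())

for m in monthly:
    print(m)
```

The division appears exactly once, no matter how many car-type columns the table has.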

#### Method 1 - Brute-force Column Transformations

In [45]:
# Opt in to keyword-argument syntax for with_columns (used in Method 3)
pl.Config.with_columns_kwargs = True

(sales
 .with_columns([(pl.col('Compact')/3).alias('Compact'),
                (pl.col('SUV')/3).alias('SUV'),
                (pl.col('Sedan')/3).alias('Sedan'),
                (pl.col('Truck')/3).alias('Truck')
               ])
)

Salesperson,Compact,Sedan,SUV,Truck
str,f64,f64,f64,f64
"""Ann""",7.333333,6.0,5.0,4.0
"""Bob""",6.333333,4.0,5.666667,6.666667
"""Doug""",6.666667,4.333333,,6.666667
"""Yolanda""",6.333333,2.666667,10.666667,5.0
"""Xerxes""",4.0,7.666667,6.0,3.0


#### Method 2 - Refactored with a list comprehension

In [47]:
(sales
 .with_columns([(pl.col(c)/3).alias(c)
                for c in sales_cols])
)

Salesperson,Compact,Sedan,SUV,Truck
str,f64,f64,f64,f64
"""Ann""",7.333333,6.0,5.0,4.0
"""Bob""",6.333333,4.0,5.666667,6.666667
"""Doug""",6.666667,4.333333,,6.666667
"""Yolanda""",6.333333,2.666667,10.666667,5.0
"""Xerxes""",4.0,7.666667,6.0,3.0


#### Method 3 - Stack-Transform-Unstack

In [48]:
(sales 
 .melt('Salesperson', sales_cols, "CarType","QrtSales")
 .with_columns(MonSales = pl.col('QrtSales')/3)
 .drop('QrtSales')
 .pivot('MonSales', 'Salesperson', 'CarType')
)

Salesperson,Compact,Sedan,SUV,Truck
str,f64,f64,f64,f64
"""Ann""",7.333333,6.0,5.0,4.0
"""Bob""",6.333333,4.0,5.666667,6.666667
"""Doug""",6.666667,4.333333,,6.666667
"""Yolanda""",6.333333,2.666667,10.666667,5.0
"""Xerxes""",4.0,7.666667,6.0,3.0


## Comparing the three methods

**Method 1:**
* More straightforward
* Lots of repeated code
* Doesn't scale ... imagine 100+ columns

**Method 2:**
* More complicated
* Requires very similar column expressions

**Method 3:**
* More complicated
* Scales well
* Easier with more complicated transformations

## <font color="red"> Exercise 2.6.1 </font>
    
**Task:** Load the `Artworks.csv` data and use the Stack-Transform-Unstack trick to convert all measurements in cm to mm.

**Hints.**
1. You will need to fix the `dtypes` for some of the measurement columns.
2. You will need to add an `ID` column.
3. `pivot` can't group by float columns, so you need to stack all measurements.
4. To process only the `cm` columns, use a `pl.when(cond).then(expr).otherwise(expr)` expression.
5. You should also replace `cm` with `mm` in the measurement labels, using the same trick as in the previous hint.
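The conditional logic in the last two hints can be sketched in plain Python on toy stacked rows. This is not the `polars` solution: the column names `Measurement` and `Value` and the row values are assumptions for illustration. The inline `if`/`else` plays the role that a when/then/otherwise expression plays in `polars`.

```python
# Toy stacked rows; labels follow the Artworks column names, values are illustrative.
stacked = [
    {"ID": 0, "Measurement": "Height (cm)", "Value": 48.6},
    {"ID": 0, "Measurement": "Weight (kg)", "Value": 2.0},
]

converted = []
for row in stacked:
    is_cm = row["Measurement"].endswith("(cm)")   # the condition
    converted.append({
        "ID": row["ID"],
        # Relabel only the cm rows; leave every other label untouched.
        "Measurement": row["Measurement"].replace("(cm)", "(mm)") if is_cm else row["Measurement"],
        # Rescale only the cm rows: 1 cm = 10 mm.
        "Value": row["Value"] * 10 if is_cm else row["Value"],
    })

for row in converted:
    print(row)
```

The same condition drives both the relabeling and the rescaling, which keeps labels and values consistent.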

In [57]:
artwork = pl.read_csv("./data/Artworks.csv")
artwork.head(2)

Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,str,str,f64,str,str
"""Ferdinandsbrüc...","""Otto Wagner""","""6210""","""(Austrian, 184...","""(Austrian)""","""(1841)""","""(1918)""","""(Male)""","""1896""","""Ink and cut-an...","""19 1/8 x 66 1/...","""Fractional and...","""885.1996""","""Architecture""","""Architecture &...","""1996-04-09""","""Y""",2,"""http://www.mom...","""http://www.mom...",,,,48.6,,,168.9,,
"""City of Music,...","""Christian de P...","""7470""","""(French, born ...","""(French)""","""(1944)""","""(0)""","""(Male)""","""1987""","""Paint and colo...","""16 x 11 3/4"" (...","""Gift of the ar...","""1.1995""","""Architecture""","""Architecture &...","""1995-01-17""","""Y""",3,"""http://www.mom...","""http://www.mom...",,,,40.6401,,,29.8451,,


In [58]:
artwork.dtypes

[polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Int64,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Float64,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Float64,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8]

In [11]:
# Your code here