# Standardizing variables with window functions in `polars`

## Window functions in `Python`/`polars`

To make a computation with a *window function* in `polars`,

1. Use `select` or `with_columns` to create a new column,
2. Dot-chain `pl.col('col name')` into a summary method like `mean`, then finally
3. Use `over` after the summary method to compute the window summary for each row.

In [1]:
import polars as pl

In [2]:
data = pl.DataFrame({'Group':['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c',],
                     'Data' :[5, 3, 4, 1, 2, 3, 5, 3]})
data

Group,Data
str,i64
"""a""",5
"""a""",3
"""b""",4
"""b""",1
"""b""",2
"""c""",3
"""c""",5
"""c""",3


## Window functions over the default partition

To compute statistics over the default partition, e.g., the grand mean or grand total, simply apply the desired summary method to the column expression.

### Example 1 - Computing the grand mean, total, & SD

#### Simple aggregation

In [3]:
( data
  .select(grand_mean = pl.col('Data').mean(),
          grand_total = pl.col('Data').sum(),
          grand_SD = pl.col('Data').std(),
               )
)

grand_mean,grand_total,grand_SD
f64,i64,f64
3.25,26,1.38873


#### Column statistics

In [4]:
( data
  .with_columns(grand_mean = pl.col('Data').mean(),
                grand_total = pl.col('Data').sum(),
                grand_SD = pl.col('Data').std(),
               )
)

Group,Data,grand_mean,grand_total,grand_SD
str,i64,f64,i64,f64
"""a""",5,3.25,26,1.38873
"""a""",3,3.25,26,1.38873
"""b""",4,3.25,26,1.38873
"""b""",1,3.25,26,1.38873
"""b""",2,3.25,26,1.38873
"""c""",3,3.25,26,1.38873
"""c""",5,3.25,26,1.38873
"""c""",3,3.25,26,1.38873


### Standardizing variables using window functions

To standardize a column, write a single expression that combines the column with its windowed summaries, e.g., subtract the column mean from each value.

#### Example 2 - Various standardized fields

In [14]:
std_data = (data
            .with_columns(mean_centered = pl.col('Data') - pl.col('Data').mean(),
                          z_score = (pl.col('Data') - pl.col('Data').mean())/pl.col('Data').std(),
                          percent_of_total = 100*pl.col('Data')/pl.col('Data').sum(),
                          range_scale = (pl.col('Data') - pl.col('Data').min())/(pl.col('Data').max() - pl.col('Data').min()),
                         )
           )
std_data

Group,Data,mean_centered,z_score,percent_of_total,range_scale
str,i64,f64,f64,f64,f64
"""a""",5,1.75,1.260144,19.230769,1.0
"""a""",3,-0.25,-0.180021,11.538462,0.5
"""b""",4,0.75,0.540062,15.384615,0.75
"""b""",1,-2.25,-1.620185,3.846154,0.0
"""b""",2,-1.25,-0.900103,7.692308,0.25
"""c""",3,-0.25,-0.180021,11.538462,0.5
"""c""",5,1.75,1.260144,19.230769,1.0
"""c""",3,-0.25,-0.180021,11.538462,0.5


#### Double-checking the standardization

In [25]:
(std_data
 .select(([pl.mean(c).alias(f'mean of {c}') for c in ('mean_centered', 'z_score')]
          + [pl.std(c).alias(f'SD of {c}') for c in ('z_score',)] 
          + [pl.sum(c).alias(f'total of {c}') for c in ('percent_of_total',)]
          + [pl.min(c).alias(f'min of {c}') for c in ('range_scale',)]
         )
 )
)

mean of mean_centered,mean of z_score,SD of z_score,total of percent_of_total,min of range_scale
f64,f64,f64,f64,f64
0.0,-4.8572e-17,1.0,100.0,0.0


## <font color="red"> Exercise 5.7.1 - Range-scaling the variable.</font>

Use window functions to range-scale the `Data` column using the default partition.

$$y_{range\;scaled} = \frac{y - \min{y}}{\max{y} - \min{y}}$$

Double-check that the new minimum and maximum are zero and one, respectively.

In [27]:
(std_data
 .select((
           [pl.min(c).alias(f'min of {c}') for c in ('range_scale',)]
         + [pl.max(c).alias(f'max of {c}') for c in ('range_scale',)]
         )
 )
)

min of range_scale,max of range_scale
f64,f64
0.0,1.0


## Computing summaries `over` a partition

To use a partition other than the default, we chain the `.over` method after the summary method in the column expression.

### Example 3 - Computing the group mean over `Group`

#### Grouped aggregation

In [50]:
(data
 .group_by('Group')
 .agg(group_mean = pl.col('Data').mean())
)

Group,group_mean
str,f64
"""c""",3.666667
"""b""",2.333333
"""a""",4.0

#### Column statistics

In [52]:
(data
 .with_columns(group_mean = (pl.col('Data')
                               .mean()          # Summary first
                               .over('Group')), # over MUST follow the summary method
               BAD = (pl.col('Data')
                        .over('Group')
                        .mean()), # over is applied before mean, so this is just the grand mean :(
              )
)

Group,Data,group_mean,BAD
str,i64,f64,f64
"""a""",5,4.0,3.25
"""a""",3,4.0,3.25
"""b""",4,2.333333,3.25
"""b""",1,2.333333,3.25
"""b""",2,2.333333,3.25
"""c""",3,3.666667,3.25
"""c""",5,3.666667,3.25
"""c""",3,3.666667,3.25


### Example 4 - Standardizing by group

In [28]:
std_by_grp = ( data
              .with_columns(mean_centered = pl.col('Data') - pl.col('Data').mean().over('Group'),
                             z_score = (pl.col('Data') - pl.col('Data').mean().over('Group'))/pl.col('Data').std().over('Group'),
                             percent_of_total = 100*pl.col('Data')/pl.col('Data').sum().over('Group'),
                            range_scale = (pl.col('Data')-pl.col('Data').min())/( pl.col('Data').max()-pl.col('Data').min()).over('Group'),
                            )
             )
std_by_grp

Group,Data,mean_centered,z_score,percent_of_total,range_scale
str,i64,f64,f64,f64,f64
"""a""",5,1.0,0.707107,62.5,2.0
"""a""",3,-1.0,-0.707107,37.5,1.0
"""b""",4,1.666667,1.091089,57.142857,1.0
"""b""",1,-1.333333,-0.872872,14.285714,0.0
"""b""",2,-0.333333,-0.218218,28.571429,0.333333
"""c""",3,-0.666667,-0.57735,27.272727,1.0
"""c""",5,1.333333,1.154701,45.454545,2.0
"""c""",3,-0.666667,-0.57735,27.272727,1.0


In [29]:
(std_by_grp
 .group_by('Group')
 .agg(([pl.mean(c).alias(f'mean of {c}') for c in ('mean_centered', 'z_score')]
          + [pl.std(c).alias(f'SD of {c}') for c in ('z_score',)] 
          + [pl.sum(c).alias(f'total of {c}') for c in ('percent_of_total',)]
          +[pl.min(c).alias(f'min of {c}') for c in ('range_scale',)]
         + [pl.max(c).alias(f'max of {c}') for c in ('range_scale',)]
         )
     )
)

Group,mean of mean_centered,mean of z_score,SD of z_score,total of percent_of_total,min of range_scale,max of range_scale
str,f64,f64,f64,f64,f64,f64
"""b""",-1.4803e-16,-7.4015e-17,1.0,100.0,0.0,1.0
"""a""",0.0,0.0,1.0,100.0,1.0,2.0
"""c""",1.4803e-16,1.4803e-16,1.0,100.0,1.0,2.0

Note that `range_scale` runs from 1.0 to 2.0 in groups `a` and `c` instead of from 0 to 1. In Example 4 only the denominator is evaluated `over('Group')`; the `pl.col('Data').min()` in the numerator has no `over`, so it is still the grand minimum.

## <font color="red"> Exercise 5.7.2 - Range-scaling `Data` by `Group`</font>

Now range-scale the `Data` column using `Group` as the partition.

$$y_{range\;scaled} = \frac{y - \min{y}}{\max{y} - \min{y}}$$

Double-check that, within each group, the new minimum and maximum are zero and one, respectively.

In [30]:
# Your code here
(std_by_grp
 .group_by('Group')
 .agg(([pl.mean(c).alias(f'mean of {c}') for c in ('mean_centered', 'z_score')]
          + [pl.std(c).alias(f'SD of {c}') for c in ('z_score',)] 
          + [pl.sum(c).alias(f'total of {c}') for c in ('percent_of_total',)]
          +[pl.min(c).alias(f'min of {c}') for c in ('range_scale',)]
         + [pl.max(c).alias(f'max of {c}') for c in ('range_scale',)]
         )
     )
)

Group,mean of mean_centered,mean of z_score,SD of z_score,total of percent_of_total,min of range_scale,max of range_scale
str,f64,f64,f64,f64,f64,f64
"""b""",-1.4803e-16,-7.4015e-17,1.0,100.0,0.0,1.0
"""a""",0.0,0.0,1.0,100.0,1.0,2.0
"""c""",1.4803e-16,1.4803e-16,1.0,100.0,1.0,2.0
