## Lab 11: Window Functions - grouping, but better!

Let's start by reviewing something we already know how to do - grouping rows by the values in a column and then calculating summaries like the mean or the maximum.

We learnt how to do this using `groupby()` in lab 10 (before our data viz work!) - let's revise quickly.

In [None]:
import pandas as pd

In [None]:
# let's play with our old toy dataset for now
data = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'],
        ['312', 'A2', 0.37, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'],
        ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'],
        ['314', 'A2', 0.29, 'LEFT'],
        ['314', 'B1', 0.14, 'RIGHT'],
        ['314', 'C2', 0.73, 'RIGHT'],
        ['711', 'A1', 4.01, 'RIGHT'],
        ['712', 'A2', 3.29, 'LEFT'],
        ['713', 'B1', 5.74, 'LEFT'],
        ['714', 'B2', 3.32, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)
data

In [None]:
# mean response time by condition?

In [None]:
# max response time by subject?

What happens to the output - how does the size and shape change? What if we don't want our output to be a different shape or size? 
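
As a quick refresher, here is a minimal sketch of those two aggregations and the shape change they cause. It uses a cut-down copy of the toy dataset (the name `toy` is just for illustration) so that it runs on its own.

```python
import pandas as pd

# A cut-down copy of the toy dataset, rebuilt here so the sketch is self-contained
toy = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12],
        ['312', 'A2', 0.37],
        ['313', 'A1', 0.07],
        ['313', 'B1', 0.08],
        ['314', 'A2', 0.29],
    ],
    columns=['subject_id', 'condition_id', 'response_time'],
)

# groupby() returns one row per group, so the output is shorter than the input
mean_by_condition = toy.groupby('condition_id')['response_time'].mean()
max_by_subject = toy.groupby('subject_id')['response_time'].max()

print(toy.shape)                # 5 rows go in
print(mean_by_condition.shape)  # one row per condition comes out
print(max_by_subject.shape)     # one row per subject comes out
```

The input has five rows, but each result has only three: one per condition or per subject.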

Or what if we want to group by something that isn't a categorical variable? What if we want to aggregate across a particular time window? Or across a fixed number of rows in a sliding fashion?

E.g. what if we wanted to calculate the sum of every 5 rows at a time? 

Of course, we could do this with a for loop, but what about a more efficient approach (for our big health datasets!)? Time for "windows"!
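
To make the comparison concrete, here is a rough sketch of a "sum of every 5 rows" done both ways. The numbers are made up, and `rolling()` is used here ahead of its proper introduction, so treat it as a preview rather than the full story.

```python
import pandas as pd

values = pd.Series(range(1, 11))  # made-up numbers 1..10

# Loop version: sum the current row and the 4 rows before it
loop_sums = []
for i in range(len(values)):
    if i < 4:
        loop_sums.append(float('nan'))  # not enough rows yet for a full window
    else:
        loop_sums.append(values.iloc[i - 4:i + 1].sum())
loop_sums = pd.Series(loop_sums)

# Window version: the same calculation in one line
rolling_sums = values.rolling(5).sum()

# Put the two side by side to compare them
print(pd.DataFrame({'loop': loop_sums, 'rolling': rolling_sums}))
```

Both columns should agree row for row. The window version is shorter and avoids a Python-level loop, which matters more as the data gets bigger.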

## Rolling Windows

Imagine you are a public health researcher with a dataset which spans many months and years and tracks some metrics of health across different groups or regions (sound familiar? most of your project datasets will likely look a bit like this). 

Now imagine you want to calculate some values of these metrics as they change over time. How can we do that? 

For example, let's say your data is the number of COVID-19 infections recorded in a clinic. On any day with positive cases the clinic logged them; if there were no cases then nothing was logged. You want to aggregate over different periods of time, perhaps over sliding windows of time, to understand the average infection rates in a more dynamic way and get a grasp of the disease dynamics themselves.

Let's build a toy version of that dataset and see if we can figure out how to do that using "window" functions.

In [None]:
times = ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-29'] # let's set up some dates

s = pd.Series(range(5), index=pd.DatetimeIndex(times)) # and add an integer to each date that will represent our "recorded positive cases". 
s

Okay, let's say we want to start summing over some number of these rows to build a more "big picture" view of these numbers. Remember the rolling weekly infection rates from 2020?

Jump here for a quick look: https://www.nyc.gov/site/doh/covid/covid-19-data.page#daily 

Let's try and get some of those kinds of numbers... 


The rolling() function lets us do just that: we specify a "window" (for now, a number of rows) and it applies a calculation such as a sum to each window as it slides down the data.
Let's say we want to start simple, with a rolling sum over every two days...

In [None]:
# Window with 2 observations



What do we notice? 

What happened to the shape of our output? How is this different from something like `groupby()`?
What about the first row, what happened there? 

But is this actually summing over a certain number of days? Look more carefully. 
Our days aren't evenly separated: some are consecutive and some skip many days, so we aren't actually getting a "two-day rolling sum" like we hoped.

Let's get more specific with our window function - it is really good at handling date items and understands units of time...
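
Here is a small sketch of the difference, rebuilding the same toy series so it runs on its own. An integer window counts rows, while an offset string such as `'2D'` counts elapsed time.

```python
import pandas as pd

times = ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-29']
s = pd.Series(range(5), index=pd.DatetimeIndex(times))

# Integer window: always the current row plus the one before it,
# no matter how far apart the dates are
by_rows = s.rolling(2).sum()

# Offset window: only rows whose date falls within the 2 days
# ending at (and including) the current row's date
by_days = s.rolling('2D').sum()

print(pd.DataFrame({'cases': s, 'two_rows': by_rows, 'two_days': by_days}))
```

Look at the last row: the two-row window pairs 29 January with 5 January, while the two-day window finds only 29 January. Notice as well that the first row is NaN for the integer window but not for the offset window.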

In [None]:
# Window with 2 days worth of observations


Note that the unit of time here is days. It cannot be months, because months have different numbers of days: if you tried something like "1M" you would get an error (try it and see what the error message is!)
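
If you would rather inspect the error without stopping the notebook, one option is to catch it. This sketch assumes the failure surfaces as a `ValueError`, which is the case in the pandas versions I am aware of; the exact wording of the message differs between versions, so it is printed rather than quoted here.

```python
import warnings

import pandas as pd

times = ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-29']
s = pd.Series(range(5), index=pd.DatetimeIndex(times))

error_message = None
with warnings.catch_warnings():
    # Some pandas versions also emit a deprecation warning for the 'M' alias
    warnings.simplefilter('ignore')
    try:
        s.rolling('1M').sum()
    except ValueError as err:  # assumption: pandas raises ValueError here
        error_message = str(err)

print(error_message)
```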

In [None]:
# "1M": this will throw an error - examine it as part of improving your debugging!

If you want to center the window on the current row, set `center=True`. Spot the difference between these outputs to understand what it does.
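
A minimal sketch of the comparison, again rebuilding the toy series so it runs on its own:

```python
import pandas as pd

times = ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-29']
s = pd.Series(range(5), index=pd.DatetimeIndex(times))

# Default: the window is the current row plus the 2 rows before it
trailing = s.rolling(3).sum()

# center=True: the window is the row before, the current row, and the row after
centered = s.rolling(3, center=True).sum()

print(pd.DataFrame({'cases': s, 'trailing': trailing, 'centered': centered}))
```

The sums themselves are the same (3, 6, 9), but they sit one row earlier when centered, and the NaN values move from the first two rows to the first and last rows.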

In [None]:
s  # these are the starting values

In [None]:
# sum on 3 obs window

In [None]:
# add our center argument

## Expanding windows

Okay - what if we want to watch positive cases accumulating over time - with each new row being added to the count? Again we could write a for loop - but that's rarely the best option!

For this, we have the expanding() function which is used for cumulative or expanding window calculations. Unlike rolling(), which applies operations over a fixed-size window, expanding() grows the window size as it moves through the data. It starts at the first element and includes all prior elements up to the current one.
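
As a small sketch of what that looks like on the toy series (rebuilt here so the block is self-contained):

```python
import pandas as pd

times = ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-29']
s = pd.Series(range(5), index=pd.DatetimeIndex(times))

# Each row's window covers everything from the first row up to that row
running_total = s.expanding(min_periods=1).sum()   # 0, 1, 3, 6, 10
running_mean = s.expanding(min_periods=1).mean()   # 0, 0.5, 1, 1.5, 2

print(pd.DataFrame({'cases': s, 'total': running_total, 'mean': running_mean}))
```

The output keeps the same length as the input, just like rolling(), so it can be added straight back as a new column.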

In [None]:
s # remember what we have already... 

In [None]:
# expanding() with min_periods set to 1 and sum()

In [None]:
# expanding() with min_periods set to 1 and mean()

Note - these expanding calculations are a special case of rolling statistics. We could achieve an equivalent output with the following rolling() call...

In [None]:
# window plus min periods defined here

Why does this happen? What is rolling() capturing at each step to achieve this?
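
One way to check the equivalence for yourself is sketched below. The choice of `window=len(s)` is mine: any window at least as long as the data would behave the same way.

```python
import pandas as pd

times = ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-29']
s = pd.Series(range(5), index=pd.DatetimeIndex(times))

via_expanding = s.expanding(min_periods=1).sum()

# A window as long as the whole series never drops an old row, and
# min_periods=1 lets it report a value before the window is full
via_rolling = s.rolling(window=len(s), min_periods=1).sum()

print(pd.DataFrame({'expanding': via_expanding, 'rolling': via_rolling}))
```

Because the window is never smaller than the number of rows seen so far, nothing ever falls out of it, which is exactly what an expanding window does.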

## Shifting rows

The .shift() function is used to shift the values of a column or index by a specified number of periods, either forward or backward. It's useful for creating lagged or lead features and for comparing data points across different time steps (also good for time series analysis...).

Basic use case - you can use .shift() to calculate differences between current and past values.
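
A minimal sketch of that idea follows. The column names `previous`, `next` and `change` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'values': [10, 20, 30, 40, 50]})

# shift(1) moves values down one row, so each row sees the previous value
df['previous'] = df['values'].shift(1)

# shift(-1) moves values up one row, so each row sees the next value
df['next'] = df['values'].shift(-1)

# Difference between the current row and the one before it
df['change'] = df['values'] - df['values'].shift(1)

print(df)
```

The rows that have nothing to shift in (the first row for `previous` and `change`, the last row for `next`) are filled with NaN.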

In [None]:
data = {'values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df

In [None]:
# Shifting values down by 1 period (default)


In [None]:
# Shifting values up by 1 period


In [None]:
# Create a new column for the difference between the current and previous row


## Exercise/Challenge: Back to our neural data 

For each patcher, compute the average number of days they waited between experiments.

Here is how to proceed:
1. Use a window function to compute the number of days that elapse between experiments (i.e., the difference between consecutive `date` values), for each `patcher`. Add that as a new column, `'days from prev'`
2. Compute the average `'days from prev'` per patcher

With your new awesome vectorization skills, it should only take two lines! (Though you may have to play around first to get there!)
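
If you get stuck, here is one possible shape for the answer, sketched on a tiny made-up frame rather than the real data. Only the column names `patcher` and `date` come from the exercise; the patcher labels and dates are invented, and this is just one of several ways to do it.

```python
import pandas as pd

# Made-up stand-in for the experiment data
toy = pd.DataFrame({
    'patcher': ['A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime(
        ['2021-01-01', '2021-01-04', '2021-01-10', '2021-02-01', '2021-02-03']
    ),
})

# Sort first so that "previous row" means "previous experiment"
toy = toy.sort_values(['patcher', 'date'])

# Step 1: gap in days since the same patcher's previous experiment
toy['days from prev'] = (toy['date'] - toy.groupby('patcher')['date'].shift(1)).dt.days

# Step 2: average gap per patcher (NaN first rows are skipped by mean())
mean_wait = toy.groupby('patcher')['days from prev'].mean()

print(mean_wait)
```

The shift happens inside each group, so the first experiment of every patcher has no previous date and gets NaN instead of borrowing one from another patcher.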

In [None]:
# Set some Pandas options: maximum number of rows/columns it's going to display
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)

In [None]:
df = pd.read_csv('experiment_data.csv', parse_dates=['date']) # load in the data

In [None]:
df.head()

In [None]:
df['patcher'].unique()  # we have two non-NaN values here...

In [None]:
# can we view the data more intuitively?

In [None]:
# then calculate those means