# Sliding Window Functions

### Introduction

So far we have worked with window frames that are cumulative.  That is, when we created our window, we did this based on some category or attribute of each row, and the aggregate accumulated over every row in the window up to the current one.  For example, in the window function below, we have a window for each store number.

```sql
SELECT date, store_nbr, transactions,
SUM(transactions) OVER (partition by store_nbr ORDER BY date) as running_total
FROM store_transactions;
```

This allows us to calculate aggregates within each store.  But sometimes we'll want a sliding window.  With a sliding window we can, for each day, calculate a three-day average.  So with a sliding window, the set of rows in the window changes for each row.  We'll see how to create a sliding window, and various use cases for it.
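To make the contrast concrete, here is a minimal sketch that computes a cumulative total and a sliding three-row total side by side.  It assumes a throwaway in-memory SQLite database (`demo_conn` is a name chosen just for this illustration), four sample rows for a single store, and a SQLite version with window function support (3.25 or later).

```python
import sqlite3

# Throwaway in-memory database with a few sample rows for one store (illustrative data).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE store_transactions (date TEXT, store_nbr INTEGER, transactions INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO store_transactions VALUES (?, ?, ?)",
    [
        ("2013-01-02", 1, 2111),
        ("2013-01-03", 1, 1833),
        ("2013-01-04", 1, 1863),
        ("2013-01-05", 1, 1509),
    ],
)

query = """
SELECT date,
       transactions,
       -- cumulative frame: from the start of the partition up to the current row
       SUM(transactions) OVER (PARTITION BY store_nbr ORDER BY date) AS running_total,
       -- sliding frame: only the current row and the two rows before it
       SUM(transactions) OVER (PARTITION BY store_nbr ORDER BY date
                               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sliding_total
FROM store_transactions
ORDER BY date;
"""

result = demo_conn.execute(query).fetchall()
for row in result:
    print(row)
```

The two totals agree for the first three rows and then diverge on the fourth, once the sliding frame starts dropping the oldest row.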

### Loading our Data

Let's again use the data from the [Favorita Kaggle competition](https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data).

We begin by reading this data from a CSV file.

In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/data-eng-10-21/window-functions/main/favorita_transactions.csv"
df = pd.read_csv(url)
df[:2]

| | id | date | store_nbr | transactions |
|---|---|---|---|---|
| 0 | 0 | 2013-01-01 | 25 | 770 |
| 1 | 1 | 2013-01-02 | 1 | 2111 |


And then we can load this data into our database.

In [2]:
import sqlite3
conn = sqlite3.connect('grocery.db')

In [4]:
df.to_sql('store_transactions', conn, index = False, if_exists = 'replace')

### Calculating a Running Average

OK, now let's calculate a running average of transactions across the current row and the two preceding rows.

In [3]:
import pandas as pd


query = """SELECT date, store_nbr, transactions,
AVG(transactions) over (partition by store_nbr order by date rows between 2 preceding and current row) as avg_transactions
FROM store_transactions WHERE store_nbr = 1 LIMIT 5;
"""

pd.read_sql(query, conn)

| | date | store_nbr | transactions | avg_transactions |
|---|---|---|---|---|
| 0 | 2013-01-02 | 1 | 2111 | 2111.0 |
| 1 | 2013-01-03 | 1 | 1833 | 1972.0 |
| 2 | 2013-01-04 | 1 | 1863 | 1935.666667 |
| 3 | 2013-01-05 | 1 | 1509 | 1735.0 |
| 4 | 2013-01-06 | 1 | 520 | 1297.333333 |


So if we perform the calculation for the row at index 2 by hand, we can indeed see that it matches the average across the two preceding rows and the current row.

In [6]:
(2111 + 1833 + 1863)/3

1935.6666666666667

The same holds for the row at index 3.

In [7]:
(1833 + 1863 + 1509)/3

1735.0
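We can also check the whole column at once.  The sketch below reimplements the `2 preceding and current row` frame in plain Python, using the five transaction values from the query output above.  The helper name `trailing_average` is made up for this illustration.

```python
# Transaction counts for store 1, copied from the query output above.
transactions = [2111, 1833, 1863, 1509, 520]


def trailing_average(values, preceding=2):
    """Average each value with up to `preceding` values that come before it."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - preceding)  # the frame is clipped at the first row
        frame = values[start:i + 1]    # preceding rows plus the current row
        averages.append(sum(frame) / len(frame))
    return averages


averages = trailing_average(transactions)
print([round(a, 6) for a in averages])
```

The first two rows have fewer than two preceding rows, so their averages are taken over one and two values respectively, which is why the first `avg_transactions` value equals the first `transactions` value.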

Now let's use a window that spans the previous row, the current row, and the following row.  We can do so with the following.

In [4]:
import pandas as pd

query = """SELECT date, store_nbr, transactions,
AVG(transactions) over (partition by store_nbr order by date rows between 1 preceding and 1 following) as avg_transactions
FROM store_transactions WHERE store_nbr = 1 LIMIT 5;
"""

pd.read_sql(query, conn)

| | date | store_nbr | transactions | avg_transactions |
|---|---|---|---|---|
| 0 | 2013-01-02 | 1 | 2111 | 1972.0 |
| 1 | 2013-01-03 | 1 | 1833 | 1935.666667 |
| 2 | 2013-01-04 | 1 | 1863 | 1735.0 |
| 3 | 2013-01-05 | 1 | 1509 | 1297.333333 |
| 4 | 2013-01-06 | 1 | 520 | 1278.666667 |


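So one way to think about window functions is that they allow us to access data from other rows within a specific window.  Notice that the frame is clipped at the edge of the partition: the first row has no preceding row, so its average covers only two rows, (2111 + 1833)/2 = 1972.0.  Also, because `LIMIT` is applied after the window function is evaluated, the average for the last displayed row still includes the following day's row, even though that row is not shown.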
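To see how many rows each frame actually contains, here is a small sketch that adds a `COUNT(*)` over the same `1 preceding and 1 following` frame.  It assumes an in-memory SQLite database (`demo_conn` is an illustrative name) loaded with only the five store 1 rows shown above.  Because this demo table stops at 2013-01-06, the last row has no following row, so its average here differs from the lesson's output, where the full table supplies a following day.

```python
import sqlite3

# Throwaway in-memory database holding only the five rows displayed in the lesson.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE store_transactions (date TEXT, store_nbr INTEGER, transactions INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO store_transactions VALUES (?, ?, ?)",
    [
        ("2013-01-02", 1, 2111),
        ("2013-01-03", 1, 1833),
        ("2013-01-04", 1, 1863),
        ("2013-01-05", 1, 1509),
        ("2013-01-06", 1, 520),
    ],
)

query = """
SELECT date,
       transactions,
       AVG(transactions) OVER (PARTITION BY store_nbr ORDER BY date
                               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS avg_transactions,
       -- number of rows that fall inside the frame for this row
       COUNT(*) OVER (PARTITION BY store_nbr ORDER BY date
                      ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS rows_in_frame
FROM store_transactions
ORDER BY date;
"""

centered = demo_conn.execute(query).fetchall()
for row in centered:
    print(row)
```

The first and last rows report a frame of two rows, while the rows in between report three.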

### Calculating Daily Increase

So we saw how we can use window functions to reference previous rows.  One other way of referencing prior rows is by using the `LAG` function.  Below we use `LAG` to find the transaction value of the previous row.

In [13]:
query = """SELECT store_nbr, date, transactions, LAG(transactions, 1) OVER ( 
PARTITION BY store_nbr
ORDER BY store_nbr, date ) previous_transactions
FROM store_transactions
LIMIT 3
"""

pd.read_sql(query, conn)

| | store_nbr | date | transactions | previous_transactions |
|---|---|---|---|---|
| 0 | 1 | 2013-01-02 | 2111 | NaN |
| 1 | 1 | 2013-01-03 | 1833 | 2111.0 |
| 2 | 1 | 2013-01-04 | 1863 | 1833.0 |


So this is useful for comparing each day to the prior day.  Notice that for the first row, we get a value of `NaN` (a SQL `NULL`), because for the first row of each partition there is no prior row.  We can specify a default value of 0 for when no prior row exists by passing a third argument to `LAG`, as follows.

In [14]:
query = """SELECT store_nbr, date, transactions, LAG(transactions, 1, 0) OVER ( 
PARTITION BY store_nbr
ORDER BY store_nbr, date ) previous_transactions
FROM store_transactions
LIMIT 3
"""

pd.read_sql(query, conn)

| | store_nbr | date | transactions | previous_transactions |
|---|---|---|---|---|
| 0 | 1 | 2013-01-02 | 2111 | 0 |
| 1 | 1 | 2013-01-03 | 1833 | 2111 |
| 2 | 1 | 2013-01-04 | 1863 | 1833 |
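The sketch below puts both forms of `LAG` next to each other and also shows that the lag restarts for each partition.  It assumes an in-memory SQLite database (`demo_conn` is an illustrative name) with three rows for store 1 and the single store 25 row seen when we first loaded the data.

```python
import sqlite3

# Throwaway in-memory database with rows for two stores (values taken from the lesson).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE store_transactions (date TEXT, store_nbr INTEGER, transactions INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO store_transactions VALUES (?, ?, ?)",
    [
        ("2013-01-01", 25, 770),
        ("2013-01-02", 1, 2111),
        ("2013-01-03", 1, 1833),
        ("2013-01-04", 1, 1863),
    ],
)

query = """
SELECT store_nbr,
       date,
       transactions,
       -- no default: the first row of each partition gets NULL
       LAG(transactions, 1)    OVER (PARTITION BY store_nbr ORDER BY date) AS prev_null,
       -- default of 0: the first row of each partition gets 0 instead
       LAG(transactions, 1, 0) OVER (PARTITION BY store_nbr ORDER BY date) AS prev_zero
FROM store_transactions
ORDER BY store_nbr, date;
"""

lagged = demo_conn.execute(query).fetchall()
for row in lagged:
    print(row)
```

A default of 0 is convenient for display, but it can be misleading in arithmetic, since subtracting it makes the first day look like an increase from zero.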


As we know, the `LAG` function allows us to compare the current row with values from the previous row.  So to find the *change* from the previous row, we can simply subtract the previous number of transactions from the current number of transactions like so.

In [18]:
import pandas as pd

query = """SELECT date, transactions,
transactions - LAG(transactions, 1) OVER ( 
PARTITION BY store_nbr
ORDER BY date ) diff_transactions
FROM store_transactions
LIMIT 3
"""

pd.read_sql(query, conn)

| | date | transactions | diff_transactions |
|---|---|---|---|
| 0 | 2013-01-02 | 2111 | NaN |
| 1 | 2013-01-03 | 1833 | -278.0 |
| 2 | 2013-01-04 | 1863 | 30.0 |
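For a daily change, the window needs to be ordered by `date`, so that `LAG` returns the previous day's value rather than the next-smallest transaction count.  Here is a minimal sketch of that, assuming an in-memory SQLite database (`demo_conn` is an illustrative name) with the five store 1 rows shown earlier in the lesson.

```python
import sqlite3

# Throwaway in-memory database holding the five store 1 rows from the lesson.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE store_transactions (date TEXT, store_nbr INTEGER, transactions INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO store_transactions VALUES (?, ?, ?)",
    [
        ("2013-01-02", 1, 2111),
        ("2013-01-03", 1, 1833),
        ("2013-01-04", 1, 1863),
        ("2013-01-05", 1, 1509),
        ("2013-01-06", 1, 520),
    ],
)

query = """
SELECT date,
       transactions,
       -- current day's value minus the previous day's value within the same store
       transactions - LAG(transactions, 1) OVER (PARTITION BY store_nbr ORDER BY date)
           AS diff_transactions
FROM store_transactions
ORDER BY store_nbr, date;
"""

daily_change = demo_conn.execute(query).fetchall()
diffs = [row[2] for row in daily_change]
print(diffs)
```

The first difference is `NULL` (shown as `None` in Python) because subtracting a `NULL` lag value yields `NULL`, and a negative value means transactions fell compared with the previous day.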


### Summary

In this lesson, we saw how to use window functions to reference other rows.  One way of doing this is with a sliding window.  With a sliding window, the window is defined relative to the current row with something like the following:

```sql
AVG(transactions) over (order by date rows between 2 preceding and current row) as avg_transactions
```

So the above calculates the average number of transactions across the two preceding rows and the current row.

And the next query calculates the average of the previous row, current row, and succeeding row.

```sql
AVG(transactions) over (order by date rows between 1 preceding and 1 following) as avg_transactions
```

Then, we saw how we can calculate the difference between values in different rows with the lag function.   

```sql
transactions - LAG(transactions, 1) OVER (PARTITION BY store_nbr
ORDER BY date ) diff_transactions
```

### Resources

[Snowflake Window Functions](https://docs.snowflake.com/en/user-guide/functions-window-using.html#rank-related-window-functions)

[SQLite Window Functions](https://www.sqlite.org/windowfunctions.html)