# Reviewing Window Functions

### Introduction 

In this lesson, we'll review window functions in SQL.  Let's get started.

### Loading our Data

We can begin by loading our data from the [Favorita Kaggle competition](https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data).

In [1]:
import pandas as pd
url = "./favorita_transactions.csv"
df = pd.read_csv(url)
df[:2]

Unnamed: 0,id,date,store_nbr,transactions
0,0,2013-01-01,25,770
1,1,2013-01-02,1,2111


And then load the data into our database.

In [2]:
import sqlite3
conn = sqlite3.connect('grocery.db')

In [6]:
df.to_sql('store_transactions', conn, index = False, if_exists = 'replace')

83488

### Introducing Window Functions

Before getting to window functions, let's quickly review our group by functions.  We start with a table that has the number of transactions per store per day.

In [9]:
pd.read_sql('SELECT * FROM store_transactions LIMIT 3', conn)

Unnamed: 0,id,date,store_nbr,transactions
0,0,2013-01-01,25,770
1,1,2013-01-02,1,2111
2,2,2013-01-02,2,2358


And if we want to calculate the average number of transactions per day across all stores, we can do something like the following.

In [8]:
query = '''SELECT AVG(transactions)
            as avg_transactions 
            FROM store_transactions'''
pd.read_sql(query, conn)

Unnamed: 0,avg_transactions
0,1694.602158


So above, we just calculated the average number of transactions across our entire dataset.  Notice that we reduced the number of rows to just one.

Now let's see a similar query, but this time we'll use a window function.

In [11]:
query = '''SELECT date, store_nbr, transactions, 
AVG(transactions) OVER ()
            as avg_transactions 
            FROM store_transactions
            LIMIT 3'''
pd.read_sql(query, conn)

Unnamed: 0,date,store_nbr,transactions,avg_transactions
0,2013-01-01,25,770,1694.602158
1,2013-01-02,1,2111,1694.602158
2,2013-01-02,2,2358,1694.602158


So with our window function, we add the `OVER` keyword.  And we can see that as a result, SQL returns the same average number of transactions, but it does not reduce the number of rows.
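
To check this row-preserving behavior on numbers small enough to verify by hand, here is a minimal sketch.  It assumes a hypothetical four-row table (not the Favorita data) in an in-memory database named `toy_conn`, and a SQLite build of version 3.25 or later, which is when window functions were added.

```python
import sqlite3

# Hypothetical toy rows (date, store_nbr, transactions), not the Favorita data.
rows = [
    ("2013-01-02", 1, 100),
    ("2013-01-02", 2, 300),
    ("2013-01-03", 1, 200),
    ("2013-01-03", 2, 500),
]

toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE store_transactions "
    "(date TEXT, store_nbr INTEGER, transactions INTEGER)"
)
toy_conn.executemany("INSERT INTO store_transactions VALUES (?, ?, ?)", rows)

# Plain aggregate: the whole table collapses to a single row.
aggregate_rows = toy_conn.execute(
    "SELECT AVG(transactions) FROM store_transactions"
).fetchall()

# Window function: every row is kept, and the overall average repeats on each.
window_rows = toy_conn.execute(
    """SELECT transactions, AVG(transactions) OVER () AS avg_transactions
       FROM store_transactions
       ORDER BY date, store_nbr"""
).fetchall()

print(aggregate_rows)  # [(275.0,)]
print(window_rows)     # [(100, 275.0), (300, 275.0), (200, 275.0), (500, 275.0)]
```

The average is (100 + 300 + 200 + 500) / 4 = 275 in both queries; only the number of rows returned differs.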

### Reviewing the Syntax

What makes it a window function is the *calculation* (in this case `AVG`) and then the `OVER` keyword. 

```sql
SELECT date, store_nbr, transactions, 
AVG(transactions) OVER ()
            as avg_transactions 
            FROM store_transactions
```

In the parentheses after `OVER`, we define the "window".  The window just means the group of rows to consider.

In the above query, we do not specify a subset of rows, so SQL calculates the average across *all* of the rows.

However, if we change the window to be the `store_nbr` -- the store number -- then this time we will calculate the average number of transactions per store.  Let's see this. 

In [12]:
query = '''SELECT date, store_nbr, 
transactions, AVG(transactions) OVER (partition by store_nbr)
            as avg_by_store
            FROM store_transactions
            WHERE store_nbr = 1 
            LIMIT 2'''
pd.read_sql(query, conn)

Unnamed: 0,date,store_nbr,transactions,avg_by_store
0,2013-01-02,1,2111,1523.844272
1,2013-01-03,1,1833,1523.844272


In [13]:
query = '''SELECT date, store_nbr, 
transactions, AVG(transactions) OVER (partition by store_nbr)
            as avg_by_store
            FROM store_transactions
            WHERE store_nbr = 2 
            LIMIT 2'''
pd.read_sql(query, conn)

Unnamed: 0,date,store_nbr,transactions,avg_by_store
0,2013-01-02,2,2358,1920.036374
1,2013-01-03,2,2033,1920.036374


So this time, we can see that we get different averages based on the store.  And to achieve this, we added the phrase `partition by` to our window function. 

```sql
AVG(transactions) OVER (partition by store_nbr)
```

So as we can see in the last set of parentheses, we specify how we want to partition our rows, here by store number.
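
As a rough check of what `partition by` does, the sketch below runs the same kind of query against a hypothetical four-row table, where the per-store averages are easy to compute by hand.  The table contents and the `toy_conn` name are assumptions for illustration, and the example needs SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical toy rows (date, store_nbr, transactions), not the Favorita data.
rows = [
    ("2013-01-02", 1, 100),
    ("2013-01-02", 2, 300),
    ("2013-01-03", 1, 200),
    ("2013-01-03", 2, 500),
]

toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE store_transactions "
    "(date TEXT, store_nbr INTEGER, transactions INTEGER)"
)
toy_conn.executemany("INSERT INTO store_transactions VALUES (?, ?, ?)", rows)

# Each row receives the average of its own store's rows only.
query = """
SELECT store_nbr, transactions,
       AVG(transactions) OVER (partition by store_nbr) as avg_by_store
FROM store_transactions
ORDER BY store_nbr, date
"""
result = toy_conn.execute(query).fetchall()

for row in result:
    print(row)
# (1, 100, 150.0)
# (1, 200, 150.0)
# (2, 300, 400.0)
# (2, 500, 400.0)
```

Store 1 averages (100 + 200) / 2 = 150 and store 2 averages (300 + 500) / 2 = 400, and each value is repeated on every row of its partition.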

> This time try it on your own, defining the window as the date instead of the store number.  Write it without copying the above.

In [27]:
query = """
SELECT * FROM store_transactions             
where date = '2013-01-02'
order by date
LIMIT 3;
"""

pd.read_sql(query, conn)

# date	store_nbr	avg_txns
# 0	2013-01-02	1	2026.413043
# 1	2013-01-02	2	2026.413043
# 2	2013-01-02	3	2026.413043

Unnamed: 0,date,store_nbr,avg_txns
0,2013-01-02,1,2026.413043
1,2013-01-02,2,2026.413043
2,2013-01-02,3,2026.413043


### Comparing with Group By

Notice how the above differs from using a group by function.  With the window function, every row that we query is still returned.  This is why we see the same value for `avg_txns` multiple times.

If we use group by, by contrast, we only see one row per grouping.  So moving the above to a group by, we will see only one row per date.

In [31]:
query = """
SELECT date, AVG(transactions) as avg_transactions
FROM store_transactions             
group by date
order by date
LIMIT 3;
"""

pd.read_sql(query, conn)

Unnamed: 0,date,avg_transactions
0,2013-01-01,770.0
1,2013-01-02,2026.413043
2,2013-01-03,1706.608696
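
The contrast between the two approaches can be sketched side by side on a small table.  The rows below are hypothetical (not the Favorita data), `toy_conn` is an assumed in-memory connection, and SQLite 3.25 or later is required for the window query.

```python
import sqlite3

# Hypothetical toy rows (date, store_nbr, transactions), not the Favorita data.
rows = [
    ("2013-01-02", 1, 100),
    ("2013-01-02", 2, 300),
    ("2013-01-03", 1, 200),
    ("2013-01-03", 2, 500),
]

toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE store_transactions "
    "(date TEXT, store_nbr INTEGER, transactions INTEGER)"
)
toy_conn.executemany("INSERT INTO store_transactions VALUES (?, ?, ?)", rows)

# Group by: one row per date.
group_rows = toy_conn.execute(
    """SELECT date, AVG(transactions) as avg_transactions
       FROM store_transactions
       group by date
       order by date"""
).fetchall()

# Window function: one row per original row, with the date average repeated.
window_rows = toy_conn.execute(
    """SELECT date, AVG(transactions) OVER (partition by date) as avg_txns
       FROM store_transactions
       order by date, store_nbr"""
).fetchall()

print(len(group_rows), len(window_rows))  # 2 4

# Removing the repeated rows from the window result recovers the group by result.
print(sorted(set(window_rows)) == group_rows)  # True
```

Both queries compute the same averages per date; the group by returns each one once, while the window function attaches it to every row for that date.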


### Summary

In this lesson, we saw how window functions allow us to perform calculations within a specified window.  Unlike our aggregate functions, window functions do not reduce the number of rows that are returned.  This can make them useful for comparing a specific row against the calculated value in that window -- like deviation from the average.

We saw that we create a window function with the `OVER` keyword, and that in the parentheses after the `OVER`, the window is specified.  If we leave it blank, the window is all of the queried rows.

```sql
SELECT date, store_nbr, transactions, 
AVG(transactions) OVER ()
            as avg_transactions 
            FROM store_transactions
```

Or we can partition our dataset into different windows by a specified criterion.

```sql
SELECT date, store_nbr, 
transactions, AVG(transactions) OVER (partition by store_nbr)
            as avg_by_store
FROM store_transactions
```
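
The summary mentions comparing a row against its window's value, such as the deviation from the average.  One way this could look is sketched below; the four rows are hypothetical, and the `toy_conn` connection and the `diff_from_store_avg` column name are assumptions made for illustration.  It needs SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical toy rows (date, store_nbr, transactions), not the Favorita data.
rows = [
    ("2013-01-02", 1, 100),
    ("2013-01-02", 2, 300),
    ("2013-01-03", 1, 200),
    ("2013-01-03", 2, 500),
]

toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE store_transactions "
    "(date TEXT, store_nbr INTEGER, transactions INTEGER)"
)
toy_conn.executemany("INSERT INTO store_transactions VALUES (?, ?, ?)", rows)

# Because the window function keeps every row, we can subtract the store's
# average from each day's transactions within the same SELECT.
query = """
SELECT date, store_nbr, transactions,
       transactions - AVG(transactions) OVER (partition by store_nbr)
           as diff_from_store_avg
FROM store_transactions
ORDER BY store_nbr, date
"""
deviations = toy_conn.execute(query).fetchall()

for row in deviations:
    print(row)
# ('2013-01-02', 1, 100, -50.0)
# ('2013-01-03', 1, 200, 50.0)
# ('2013-01-02', 2, 300, -100.0)
# ('2013-01-03', 2, 500, 100.0)
```

A plain group by could not express this in a single step, since it would collapse the daily rows before they could be compared with the average.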

### Answers

In [28]:
query = """
SELECT date, store_nbr, AVG(transactions) OVER (partition by date) as avg_txns 
FROM store_transactions             
where date = '2013-01-02'
order by date
LIMIT 3;
"""

pd.read_sql(query, conn)

### Resources

[Snowflake Window Functions](https://docs.snowflake.com/en/user-guide/functions-window-using.html)

[Data School Window Functions](https://dataschool.com/how-to-teach-people-sql/how-window-functions-work/)

[Kaggle Analytic Window Functions](https://www.kaggle.com/alexisbcook/analytic-functions)

[StrataScratch Window Functions](https://www.stratascratch.com/blog/types-of-window-functions-in-sql-and-questions-asked-by-airbnb-netflix-twitter-and-uber/)

[Instacart Data](https://www.kaggle.com/c/instacart-market-basket-analysis/data)

[Chartio Window Functions](https://chartio.com/resources/tutorials/using-window-functions/)