![Digital Futures](https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/datascience-notebook-header.png?raw=true)

## Learner Stories

```txt
As a DATA PROFESSIONAL,  
I want to be able to use Window Functions like ROW_NUMBER(), RANK(), and LAG(),  
so that I can analyse data across specific windows or partitions
```

# What are Window Functions?

Window functions are a SQL feature that performs calculations across a set of table rows related to the current row. Like aggregate functions, they compute a value over several rows, but they do not collapse those rows into a single output row. Instead, they return a value for every row, computed over that row's window.
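To make the contrast with aggregate functions concrete, here is a minimal sketch. The `staff` table and its three rows are hypothetical and exist only for this illustration, and it assumes the SQLite library bundled with Python is version 3.25 or later, which is when SQLite gained window functions.

```python
import sqlite3

# Hypothetical sample data, used only for this illustration
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE staff (name TEXT, department_id INTEGER, salary INTEGER)')
conn_demo.executemany(
    'INSERT INTO staff VALUES (?, ?, ?)',
    [('Ann', 1, 100), ('Ben', 1, 200), ('Cat', 2, 300)],
)

# Aggregate function: GROUP BY collapses each department into one output row
grouped = conn_demo.execute('''
SELECT department_id, AVG(salary)
FROM staff
GROUP BY department_id
ORDER BY department_id;
''').fetchall()

# Window function: every input row is kept, with the department average alongside it
windowed = conn_demo.execute('''
SELECT name, department_id, salary,
       AVG(salary) OVER (PARTITION BY department_id) AS dept_avg
FROM staff
ORDER BY department_id, name;
''').fetchall()

print(grouped)   # one row per department
print(windowed)  # one row per staff member
```

The grouped query returns two rows (one per department), while the windowed query returns all three staff rows, each carrying its department's average.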

---

## Why Use Window Functions?

Window Functions are useful for a number of reasons:

1. **Comparing Rows**: You can compare the current row with other rows in the table.
2. **Ranking Rows**: You can assign a rank to each row based on a specific column.
3. **Calculating Running Totals**: You can calculate running totals for a column.
4. **Calculating Percentiles**: You can calculate percentiles for a column.
5. **Calculating Moving Averages**: You can calculate moving averages for a column.
6. **Calculating Differences**: You can calculate differences between rows.
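As a sketch of the moving-average case, the example below averages each row with the one before it. The `daily_sales` table and its values are hypothetical, and the two-row frame (`ROWS BETWEEN 1 PRECEDING AND CURRENT ROW`) is just one possible choice of window size.

```python
import sqlite3

# Hypothetical sample data, used only for this illustration
conn_ma = sqlite3.connect(':memory:')
conn_ma.execute('CREATE TABLE daily_sales (day INTEGER, amount INTEGER)')
conn_ma.executemany(
    'INSERT INTO daily_sales VALUES (?, ?)',
    [(1, 10), (2, 20), (3, 30), (4, 40)],
)

# The frame clause limits the window to the previous row and the current row
moving_avg_rows = conn_ma.execute('''
SELECT day, amount,
       AVG(amount) OVER (
           ORDER BY day
           ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
       ) AS moving_avg
FROM daily_sales
ORDER BY day;
''').fetchall()

for row in moving_avg_rows:
    print(row)
```

The first row has no preceding row, so its moving average is simply its own amount.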

---

## Use Cases for Window Functions

Here are some common use cases for window functions:

1. **Ranking**: You can rank rows based on a specific column.
2. **Partitioning**: You can partition rows into groups based on a specific column.
3. **Aggregating**: You can aggregate data within a window.
4. **Filtering**: You can filter rows on the result of a window function, typically by computing it in a subquery and filtering in the outer query.

Here is an example of a window function that calculates the running total of a column:

```sql
SELECT
    order_id,
    order_date,
    order_amount,
    SUM(order_amount) OVER (ORDER BY order_date) AS running_total
FROM
    orders;
```

This query does the following:

1. It selects the `order_id`, `order_date`, and `order_amount` columns from the `orders` table.
2. It applies `SUM(order_amount)` as a window function by attaching an `OVER` clause, so the rows are not collapsed into one.
3. The `ORDER BY order_date` inside `OVER` orders the window: for each row, the sum covers every row up to and including the current row's `order_date`.
4. It names the result `running_total`.
5. It returns one output row for each row in the `orders` table, containing the `order_id`, `order_date`, `order_amount`, and `running_total` columns.
6. Note that the `ORDER BY` inside `OVER` only controls how the running total is computed. To guarantee the order of the returned rows, add an `ORDER BY order_date` at the end of the query.
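The running total query can be tried out with a small in-memory table. This is a sketch only: the `orders` rows below are hypothetical sample data, and a final `ORDER BY order_date` has been added so the output order is guaranteed.

```python
import sqlite3

# Hypothetical orders, used only to exercise the running total query
conn_orders = sqlite3.connect(':memory:')
conn_orders.execute(
    'CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, order_amount INTEGER)'
)
conn_orders.executemany(
    'INSERT INTO orders VALUES (?, ?, ?)',
    [(1, '2024-01-01', 100), (2, '2024-01-02', 50), (3, '2024-01-03', 25)],
)

running_total_rows = conn_orders.execute('''
SELECT order_id, order_date, order_amount,
       SUM(order_amount) OVER (ORDER BY order_date) AS running_total
FROM orders
ORDER BY order_date;
''').fetchall()

# Each row keeps its own amount and gains the cumulative sum so far
for row in running_total_rows:
    print(row)
```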

---

## Examples of Window Functions

Here are some examples of window functions:

1. **`ROW_NUMBER()`**: Assigns a unique sequential number to each row in a partition.
2. **`RANK()`**: Assigns a rank to each row in a partition, giving tied rows the same rank and leaving gaps after ties.
3. **`LAG()`**: Accesses data from a previous row in a partition.
4. **`DENSE_RANK()`**: Assigns a rank to each row in a partition, giving tied rows the same rank without leaving gaps.
5. **`NTILE()`**: Divides the rows of a partition into a specified number of groups.
6. **`LEAD()`**: Accesses data from a subsequent row in a partition.
7. **`FIRST_VALUE()`**: Returns the first value in the window frame.
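The demos in this notebook cover `ROW_NUMBER()`, `RANK()`, and `LAG()`. As a quick sketch of three of the others, the example below uses a hypothetical `scores` table with distinct values, so the ordering is unambiguous.

```python
import sqlite3

# Hypothetical sample data, used only for this illustration
conn_fn = sqlite3.connect(':memory:')
conn_fn.execute('CREATE TABLE scores (player TEXT, score INTEGER)')
conn_fn.executemany(
    'INSERT INTO scores VALUES (?, ?)',
    [('A', 40), ('B', 30), ('C', 20), ('D', 10)],
)

function_rows = conn_fn.execute('''
SELECT player, score,
       LEAD(score, 1)     OVER (ORDER BY score DESC) AS next_score,  -- value from the following row
       FIRST_VALUE(score) OVER (ORDER BY score DESC) AS top_score,   -- first value in the frame
       NTILE(2)           OVER (ORDER BY score DESC) AS half         -- split rows into 2 groups
FROM scores
ORDER BY score DESC;
''').fetchall()

for row in function_rows:
    print(row)
```

The last row has no following row, so `LEAD()` returns `NULL` for it, which Python shows as `None`.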

---

## Demo Database Set Up

To see some examples in action, we'll create an in-memory SQLite database and populate it with some sample data.

In [29]:
import sqlite3

In [30]:
# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create the employees table
cursor.execute('''
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    salary INTEGER
);
''')

# Populate the employees table with sample data
employees = [
    (1, 'Alice', 1, 70000),
    (2, 'Bob', 1, 80000),
    (3, 'Charlie', 1, 70000),
    (4, 'David', 2, 90000),
    (5, 'Eve', 2, 85000),
    (6, 'Frank', 2, 90000)
]

cursor.executemany('INSERT INTO employees (employee_id, name, department_id, salary) VALUES (?, ?, ?, ?)', employees)

<sqlite3.Cursor at 0x29ddd4a72c0>

---

## DEMOS

### `ROW_NUMBER()` SQL Demo

Assign a unique sequential integer to rows within a partition of a result set, starting at 1 for the first row in each partition.

In [31]:
row_number_demo_query = '''
SELECT 
    employee_id, 
    name, 
    department_id, 
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS row_num
FROM 
    employees;
'''

#### Explanation of the Query

1. ***`SELECT` Clause***:
   - `employee_id`, `name`, `department_id`, `salary`: These columns are selected from the employees table.
   - `ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS row_num`: This is the key part of the query where the `ROW_NUMBER()` function is used.
2. ***`ROW_NUMBER()` Function***:
   - `ROW_NUMBER()` is a ***window function*** that assigns a unique sequential integer to rows within a partition of a result set.
   - `OVER (PARTITION BY department_id ORDER BY salary DESC)`: This clause defines the ***window*** for the `ROW_NUMBER()` function.
     - `PARTITION BY department_id`: This means that the rows are ***partitioned*** (grouped) by the `department_id` column. Each department will have its own sequence of row numbers.
     - `ORDER BY salary DESC`: Within each ***partition*** (department), the rows are ordered by the `salary` column in descending order. The highest `salary` in each department will get the row number `1`, the second highest will get row number `2`, and so on.
3. ***Result***:
   - The query will return a result set with the original columns (`employee_id`, `name`, `department_id`, `salary`) and an additional column `row_num` that contains the *row number* for each employee within their department, ordered by `salary` in *descending* order.

#### Run the Query and put the result into a Pandas DataFrame

In [32]:
# Import the pandas library
import pandas as pd

In [33]:
# Run the query using the read_sql function
df = pd.read_sql(row_number_demo_query, conn)

# Print the DataFrame
df

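|   | employee_id | name | department_id | salary | row_num |
|---|---|---|---|---|---|
| 0 | 2 | Bob | 1 | 80000 | 1 |
| 1 | 1 | Alice | 1 | 70000 | 2 |
| 2 | 3 | Charlie | 1 | 70000 | 3 |
| 3 | 4 | David | 2 | 90000 | 1 |
| 4 | 6 | Frank | 2 | 90000 | 2 |
| 5 | 5 | Eve | 2 | 85000 | 3 |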


#### Explanation of the Result

- For `department_id = 1`:
  - ***Bob*** has the highest salary (80000), so he gets `row_num = 1`.
  - ***Alice*** and ***Charlie*** both have the *same* salary (70000), so they are tied in the `ORDER BY`. `ROW_NUMBER()` still gives them different numbers: here ***Alice*** gets `row_num = 2` and ***Charlie*** gets `row_num = 3`. SQL does not guarantee the order of tied rows, so add a tie-breaker such as `employee_id` to the `ORDER BY` if the order matters.

- For `department_id = 2`:
  - ***David*** and ***Frank*** both have the highest salary (90000). They are tied in the same way: here ***David*** gets `row_num = 1` and ***Frank*** gets `row_num = 2`.
  - ***Eve*** has the next highest salary (85000), so she gets `row_num = 3`.

This query helps in assigning a unique sequential number to each row within a partition, which can be useful for ranking, pagination, and other analytical purposes.
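One common application is keeping only the top row of each partition. A window function cannot be referenced in the `WHERE` clause of the query that computes it, so the sketch below computes `row_num` in a subquery and filters in the outer query. It recreates the demo data on a separate connection (`conn_top` is a name chosen for this example) and adds `employee_id` as a tie-breaker so the result is deterministic.

```python
import sqlite3

# Recreate the demo employees table on a separate connection
conn_top = sqlite3.connect(':memory:')
conn_top.execute('''
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    salary INTEGER
);
''')
conn_top.executemany('INSERT INTO employees VALUES (?, ?, ?, ?)', [
    (1, 'Alice', 1, 70000), (2, 'Bob', 1, 80000), (3, 'Charlie', 1, 70000),
    (4, 'David', 2, 90000), (5, 'Eve', 2, 85000), (6, 'Frank', 2, 90000),
])

# Number the rows in a subquery, then keep only the first row per department
top_earners = conn_top.execute('''
SELECT name, department_id, salary
FROM (
    SELECT name, department_id, salary,
           ROW_NUMBER() OVER (
               PARTITION BY department_id
               ORDER BY salary DESC, employee_id  -- employee_id breaks salary ties
           ) AS row_num
    FROM employees
)
WHERE row_num = 1
ORDER BY department_id;
''').fetchall()

print(top_earners)
```

Changing the filter to `row_num <= 2` would keep the top two rows per department instead.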

---

### `RANK()` SQL Demo

Assign a rank to each row within a partition of a result set, with gaps in the ranking.

In [34]:
rank_demo_query = '''
SELECT 
    employee_id, 
    name, 
    department_id, 
    salary,
    RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM 
    employees;
'''

#### Explanation of the Query

1. ***`SELECT` Clause***:
   - `employee_id`, `name`, `department_id`, `salary`: These columns are selected from the employees table.
   - `RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank`: This is the key part of the query where the `RANK()` function is used.
2. ***`RANK()` Function***:
   - `RANK()` is a ***window function*** that assigns a rank to each row within a partition of a result set.
   - `OVER (PARTITION BY department_id ORDER BY salary DESC)`: This clause defines the *window* for the `RANK()` function.
     - `PARTITION BY department_id`: This means that the rows are *partitioned* (grouped) by the `department_id` column. Each department will have its own sequence of ranks.
     - `ORDER BY salary DESC`: Within each *partition* (department), the rows are ordered by the `salary` column in *descending* order. The highest salary in each department will get the rank `1`, the second highest will get rank `2`, and so on.
   - If there are *ties* (i.e., multiple rows with the same value in the `ORDER BY` clause), they will receive the *same rank*, and the *next rank* will be skipped.
3. ***Result:***
   - The query will return a result set with the original columns (`employee_id`, `name`, `department_id`, `salary`) and an additional column `rank` that contains the *rank* for each employee within their department, ordered by `salary` in *descending* order.

In [35]:
# Run the query using the read_sql function
df = pd.read_sql(rank_demo_query, conn)

# Print the DataFrame
df

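|   | employee_id | name | department_id | salary | rank |
|---|---|---|---|---|---|
| 0 | 2 | Bob | 1 | 80000 | 1 |
| 1 | 1 | Alice | 1 | 70000 | 2 |
| 2 | 3 | Charlie | 1 | 70000 | 2 |
| 3 | 4 | David | 2 | 90000 | 1 |
| 4 | 6 | Frank | 2 | 90000 | 1 |
| 5 | 5 | Eve | 2 | 85000 | 3 |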


#### Explanation of the Result

- For `department_id = 1`:
  - ***Bob*** has the highest salary (80000), so he gets `rank = 1`.
  - ***Alice*** and ***Charlie*** both have the *same* salary (70000). They ***both*** receive the *same* rank (`rank = 2`), and the next rank (`3`) is skipped.

- For `department_id = 2`:
  - ***David*** and ***Frank*** both have the highest salary (90000). They both receive the same rank (`rank = 1`), and the next rank (`2`) is skipped.
  - ***Eve*** has the next highest salary (85000), so she gets `rank = 3`.

This query helps in assigning ranks to rows within a partition, which can be useful for ranking, competition results, and other analytical purposes where ties need to be handled appropriately.
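To see what "gaps in the ranking" means in practice, the sketch below puts `RANK()` next to `DENSE_RANK()` on the same demo data. The connection name `conn_rank` and the column aliases are chosen for this example only.

```python
import sqlite3

# Recreate the demo employees table on a separate connection
conn_rank = sqlite3.connect(':memory:')
conn_rank.execute('''
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    salary INTEGER
);
''')
conn_rank.executemany('INSERT INTO employees VALUES (?, ?, ?, ?)', [
    (1, 'Alice', 1, 70000), (2, 'Bob', 1, 80000), (3, 'Charlie', 1, 70000),
    (4, 'David', 2, 90000), (5, 'Eve', 2, 85000), (6, 'Frank', 2, 90000),
])

# Same window for both functions; only the handling of ties differs
rank_rows = conn_rank.execute('''
SELECT name, department_id, salary,
       RANK()       OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank_with_gaps,
       DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank_without_gaps
FROM employees
ORDER BY department_id, salary DESC, name;
''').fetchall()

for row in rank_rows:
    print(row)
```

In department 2, Eve follows two employees tied at rank 1, so `RANK()` gives her 3 while `DENSE_RANK()` gives her 2.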

---

### `LAG()` SQL Demo

Provide access to a row at a given physical offset that comes before the current row.

In [36]:
lag_demo_query = '''
SELECT 
    employee_id, 
    name, 
    department_id, 
    salary,
    LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY salary DESC) AS prev_salary
FROM 
    employees;
'''

#### Explanation of the Query

1. ***`SELECT` Clause***:
   - `employee_id`, `name`, `department_id`, `salary`: These columns are selected from the employees table.
   - `LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY salary DESC) AS prev_salary`: This is the key part of the query where the `LAG()` function is used.
2. ***`LAG()` Function***:
   - `LAG()` is a ***window function*** that provides access to a row at a given physical offset that comes before the current row.
   - `salary`: This is the column from which the previous value is retrieved.
   - `1`: This is the offset that specifies how many rows before the current row to look for the value.
   - `OVER (PARTITION BY department_id ORDER BY salary DESC)`: This clause defines the *window* for the `LAG()` function.
     - `PARTITION BY department_id`: This means that the rows are *partitioned* (grouped) by the `department_id` column. Each department will have its own sequence of previous salaries.
     - `ORDER BY salary DESC`: Within each *partition* (department), the rows are ordered by the `salary` column in *descending* order.
3. ***Result:***
   - The query will return a result set with the original columns (`employee_id`, `name`, `department_id`, `salary`) and an additional column `prev_salary` that contains the *salary* of the previous employee within their department, ordered by `salary` in *descending* order.

In [37]:
# Run the query using the read_sql function
df = pd.read_sql(lag_demo_query, conn)

# Print the DataFrame
df

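|   | employee_id | name | department_id | salary | prev_salary |
|---|---|---|---|---|---|
| 0 | 2 | Bob | 1 | 80000 | NaN |
| 1 | 1 | Alice | 1 | 70000 | 80000.0 |
| 2 | 3 | Charlie | 1 | 70000 | 70000.0 |
| 3 | 4 | David | 2 | 90000 | NaN |
| 4 | 6 | Frank | 2 | 90000 | 90000.0 |
| 5 | 5 | Eve | 2 | 85000 | 90000.0 |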


#### Explanation of the Result

- For `department_id = 1`:
  - ***Bob*** has the highest salary (80000), so there is no previous row in his partition. `LAG()` returns `NULL`, which pandas displays as `NaN`. Because the column now contains a missing value, pandas stores the other salaries as floats (for example `80000.0`).
  - ***Alice*** and ***Charlie*** both have the *same* salary (70000). In this result ***Alice*** is ordered first, so her previous salary is ***Bob's*** (80000) and the previous salary for ***Charlie*** is ***Alice's*** (70000). The order of tied rows is not guaranteed unless the `ORDER BY` includes a tie-breaker.

- For `department_id = 2`:
  - ***David*** and ***Frank*** both have the highest salary (90000). In this result ***David*** is ordered first, so there is no previous salary for him and `prev_salary` is `NaN`.
  - The previous salary for ***Frank*** is ***David's*** salary (90000).
  - ***Eve*** has the next highest salary (85000), so the previous salary for ***Eve*** is ***Frank's*** salary (90000).

This query helps in accessing the value of a column from a previous row within a partition, which can be useful for calculating differences, trends, and other analytical purposes where you need to compare the current row with a previous row.
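As a sketch of the "calculating differences" use, the example below subtracts each salary from the previous one in the window. The connection name `conn_lag` and the `gap_to_prev` alias are chosen for this example, and `employee_id` is added as a tie-breaker so the row order is deterministic.

```python
import sqlite3

# Recreate the demo employees table on a separate connection
conn_lag = sqlite3.connect(':memory:')
conn_lag.execute('''
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    salary INTEGER
);
''')
conn_lag.executemany('INSERT INTO employees VALUES (?, ?, ?, ?)', [
    (1, 'Alice', 1, 70000), (2, 'Bob', 1, 80000), (3, 'Charlie', 1, 70000),
    (4, 'David', 2, 90000), (5, 'Eve', 2, 85000), (6, 'Frank', 2, 90000),
])

# gap_to_prev is how far each salary sits below the previous row's salary
gap_rows = conn_lag.execute('''
SELECT name, department_id, salary,
       LAG(salary, 1) OVER (
           PARTITION BY department_id ORDER BY salary DESC, employee_id
       ) AS prev_salary,
       LAG(salary, 1) OVER (
           PARTITION BY department_id ORDER BY salary DESC, employee_id
       ) - salary AS gap_to_prev
FROM employees
ORDER BY department_id, salary DESC, employee_id;
''').fetchall()

for row in gap_rows:
    print(row)
```

The first row of each department has no previous row, so both `prev_salary` and `gap_to_prev` are `NULL` there (`None` in Python).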

---

---

## Activities

### Database Set Up



In [38]:
# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create the departments table
cursor.execute('''CREATE TABLE departments (
    department_id INTEGER PRIMARY KEY,
    name TEXT
);
''')

# Create the employees table
cursor.execute('''
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    salary INTEGER
);
''')

# Create the salaries table
cursor.execute('''
CREATE TABLE salaries (
    employee_id INTEGER PRIMARY KEY,
    salary INTEGER,
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id)
);
''')

# Define department data
departments = [
    (1, 'HR'),
    (2, 'Engineering'),
    (3, 'Marketing')
]

# Insert data into the departments table
cursor.executemany('INSERT INTO departments (department_id, name) VALUES (?, ?)', departments)

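# Define employee data
# Note: the fourth value is stored in the employees.salary column, but the values
# here look like manager IDs rather than salaries. The activities below read the
# actual salaries from the salaries table (s.salary), not from this column.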
employees = [
    (1, 'Alice', 1, None),
    (2, 'Bob', 1, 1),
    (3, 'Charlie', 1, 1),
    (4, 'David', 2, None),
    (5, 'Eve', 2, 4),
    (6, 'Frank', 2, 4),
    (7, 'Grace', 3, None),
    (8, 'Hank', 3, 7),
    (9, 'Ivy', 3, 7)
]

# Insert employee data into the employees table
cursor.executemany('INSERT INTO employees (employee_id, name, department_id, salary) VALUES (?, ?, ?, ?)', employees)

# Define salary data
salaries = [
    (1, 70000),
    (2, 80000),
    (3, 70000),
    (4, 90000),
    (5, 85000),
    (6, 90000),
    (7, 95000),
    (8, 75000),
    (9, 72000)
]

# Insert salary data into the salaries table
cursor.executemany('INSERT INTO salaries (employee_id, salary) VALUES (?, ?)', salaries)

<sqlite3.Cursor at 0x29ddd4d6fc0>

---

### Activity 1 - `ROW_NUMBER()`

#### User Story

> **As a** Data Professional,  
> **I want to** assign a unique sequential integer to each employee within their department based on their salary,  
> **so that** I can identify the order of employees by salary within each department.

#### Definition of Done

- [ ] - Write a SQL query that assigns a unique sequential integer to each employee within their department based on their salary.
- [ ] - Run the query and display the results in a Pandas DataFrame.
- [ ] - Explain the query and the results in a markdown cell.

In [39]:
# Define your query, run it, place the results in a DataFrame and output the DataFrame
query1 = '''SELECT
    e.employee_id, 
    e.name, 
    e.department_id, 
    s.salary,
    ROW_NUMBER() OVER (PARTITION BY e.department_id ORDER BY s.salary DESC) AS row_num
FROM 
    employees e
JOIN 
    salaries s ON e.employee_id = s.employee_id;
'''

df1 = pd.read_sql(query1, conn)
df1

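|   | employee_id | name | department_id | salary | row_num |
|---|---|---|---|---|---|
| 0 | 2 | Bob | 1 | 80000 | 1 |
| 1 | 1 | Alice | 1 | 70000 | 2 |
| 2 | 3 | Charlie | 1 | 70000 | 3 |
| 3 | 4 | David | 2 | 90000 | 1 |
| 4 | 6 | Frank | 2 | 90000 | 2 |
| 5 | 5 | Eve | 2 | 85000 | 3 |
| 6 | 7 | Grace | 3 | 95000 | 1 |
| 7 | 8 | Hank | 3 | 75000 | 2 |
| 8 | 9 | Ivy | 3 | 72000 | 3 |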


#### Explanation of Query and Results



The query partitions the employees by `department_id`, then numbers the employees in each department from the highest salary to the lowest. When two employees are tied on salary they still get different numbers. Here the employee who appeared first in the original data got the smaller number, but that order is not guaranteed without a tie-breaker in the `ORDER BY`.

---

### Activity 2 - `RANK()`

#### User Story

> **As a** Data Professional,  
> **I want to** rank employees within their department based on their salary,  
> **so that** I can identify employees with the same salary and see their relative standing.

#### Definition of Done

- [ ] - Write a SQL query that ranks employees within their department based on their salary.
- [ ] - Run the query and display the results in a Pandas DataFrame.
- [ ] - Explain the query and the results in a markdown cell.

In [40]:
# Define your query, run it, place the results in a DataFrame and output the DataFrame
query2 = '''SELECT
    e.employee_id, 
    e.name, 
    e.department_id, 
    s.salary,
    RANK() OVER (PARTITION BY e.department_id ORDER BY s.salary DESC) AS rank
FROM 
    employees e
JOIN 
    salaries s ON e.employee_id = s.employee_id;
'''
df2 = pd.read_sql(query2, conn)
df2

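|   | employee_id | name | department_id | salary | rank |
|---|---|---|---|---|---|
| 0 | 2 | Bob | 1 | 80000 | 1 |
| 1 | 1 | Alice | 1 | 70000 | 2 |
| 2 | 3 | Charlie | 1 | 70000 | 2 |
| 3 | 4 | David | 2 | 90000 | 1 |
| 4 | 6 | Frank | 2 | 90000 | 1 |
| 5 | 5 | Eve | 2 | 85000 | 3 |
| 6 | 7 | Grace | 3 | 95000 | 1 |
| 7 | 8 | Hank | 3 | 75000 | 2 |
| 8 | 9 | Ivy | 3 | 72000 | 3 |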


#### Explanation of Query and Results



Similar to `ROW_NUMBER()`, but `RANK()` assigns the same rank to employees who are tied on salary and then skips the following rank, for example 1, 1, 3 in department 2.

---

### Activity 3 - `LAG()`

#### User Story

> **As a** Data Professional,  
> **I want to** access the previous employee's salary within each department based on their salary,  
> **so that** I can analyse salary trends and differences.

#### Definition of Done

- [ ] - Write a SQL query that accesses the previous employee's salary within each department based on their salary.
- [ ] - Run the query and display the results in a Pandas DataFrame.
- [ ] - Explain the query and the results in a markdown cell.

In [41]:
query3 = '''
SELECT 
    e.employee_id, 
    e.name, 
    e.department_id, 
    s.salary,
    LAG(s.salary, 1) OVER (PARTITION BY e.department_id ORDER BY s.salary DESC) AS prev_salary
FROM 
    employees e
JOIN 
    salaries s ON e.employee_id = s.employee_id;
'''

df3 = pd.read_sql(query3, conn)
df3

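|   | employee_id | name | department_id | salary | prev_salary |
|---|---|---|---|---|---|
| 0 | 2 | Bob | 1 | 80000 | NaN |
| 1 | 1 | Alice | 1 | 70000 | 80000.0 |
| 2 | 3 | Charlie | 1 | 70000 | 70000.0 |
| 3 | 4 | David | 2 | 90000 | NaN |
| 4 | 6 | Frank | 2 | 90000 | 90000.0 |
| 5 | 5 | Eve | 2 | 85000 | 90000.0 |
| 6 | 7 | Grace | 3 | 95000 | NaN |
| 7 | 8 | Hank | 3 | 75000 | 95000.0 |
| 8 | 9 | Ivy | 3 | 72000 | 75000.0 |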


#### Explanation of Query and Results



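`LAG(s.salary, 1)` returns the salary from the row one position earlier in the same department, with rows ordered by salary in descending order. The first row in each department has no earlier row, so its `prev_salary` is `NULL` (shown as `NaN` in the DataFrame).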

---

## Query Solutions

#### Activity 1

In [None]:
SELECT
    e.employee_id, 
    e.name, 
    e.department_id, 
    s.salary,
    ROW_NUMBER() OVER (PARTITION BY e.department_id ORDER BY s.salary DESC) AS row_num
FROM 
    employees e
JOIN 
    salaries s ON e.employee_id = s.employee_id;

#### Activity 2

In [None]:
SELECT 
    e.employee_id, 
    e.name, 
    e.department_id, 
    s.salary,
    RANK() OVER (PARTITION BY e.department_id ORDER BY s.salary DESC) AS rank
FROM 
    employees e
JOIN 
    salaries s ON e.employee_id = s.employee_id;

#### Activity 3

In [None]:
SELECT 
    e.employee_id, 
    e.name, 
    e.department_id, 
    s.salary,
    LAG(s.salary, 1) OVER (PARTITION BY e.department_id ORDER BY s.salary DESC) AS prev_salary
FROM 
    employees e
JOIN 
    salaries s ON e.employee_id = s.employee_id;

---
---

## Could we do this all with Pandas?

Absolutely! Pandas has a number of functions that can be used to achieve similar results to SQL window functions.

### Set Up a Database

In [None]:
# Create another in-memory SQLite database
conn2 = sqlite3.connect(':memory:')
cursor = conn2.cursor()

# Create the employees table
cursor.execute('''
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    salary INTEGER
);
''')

# Populate the employees table with sample data
employees = [
    (1, 'Alice', 1, 70000),
    (2, 'Bob', 1, 80000),
    (3, 'Charlie', 1, 70000),
    (4, 'David', 2, 90000),
    (5, 'Eve', 2, 85000),
    (6, 'Frank', 2, 90000)
]

cursor.executemany('INSERT INTO employees (employee_id, name, department_id, salary) VALUES (?, ?, ?, ?)', employees)



### `ROW_NUMBER` Equivalent in Pandas


In [None]:
# Load the employees table into a DataFrame
df = pd.read_sql('SELECT * FROM employees', conn2)
df

In [None]:
# ROW_NUMBER() equivalent in Pandas of
# SELECT employee_id, name, department_id, salary, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS row_num FROM employees;
# Rank salaries within each department in descending order
df['row_num'] = df.groupby('department_id')['salary'].rank(method='first', ascending=False).astype(int)

# Sort by department_id and row_num to match the expected order
df = df.sort_values(by=['department_id', 'row_num']).reset_index(drop=True)

df

### `RANK()` Equivalent in Pandas

In [None]:
# Load the employees table into a DataFrame
df2 = pd.read_sql('SELECT * FROM employees', conn2)
df2


In [None]:
# RANK() equivalent in Pandas
# SELECT employee_id, name, department_id, salary, RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank FROM employees;

# Rank salaries within each department in descending order using 'min' method
df2['rank'] = df2.groupby('department_id')['salary'].rank(method='min', ascending=False).astype(int)

# Sort by department_id and rank to match the expected order
df2 = df2.sort_values(by=['department_id', 'rank']).reset_index(drop=True)

df2

---

### `LAG()` Equivalent in Pandas

In [None]:
# Load the employees table into a DataFrame
df3 = pd.read_sql('SELECT * FROM employees', conn2)
df3


In [None]:
# LAG() equivalent in Pandas
# SELECT employee_id, name, department_id, salary, LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY salary DESC) AS prev_salary FROM employees;

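# Sort by department and descending salary first, so shift() sees the rows
# in the same order as the SQL window (ORDER BY salary DESC)
df3 = df3.sort_values(by=['department_id', 'salary'], ascending=[True, False]).reset_index(drop=True)

# Calculate the previous salary within each department
df3['prev_salary'] = df3.groupby('department_id')['salary'].shift(1)

df3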