<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/3_Ranking.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Ranking Functions

## Overview

### 🥅 Analysis Goals

- **Running Order & Avg Revenue:** Get the running order count and running average revenue for each customer's orders.
- **Rank Customers by Order Quantity:** Calculate each customer's total order quantity and rank customers by it.

### 📘 Concepts Covered

- `ORDER BY`
- Ranking
    - `ROW_NUMBER`
    - `RANK`
    - `DENSE_RANK`

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## ORDER BY

### 📝 Notes

`ORDER BY`

- **ORDER BY**: Orders rows within each partition for the window function.
- The sort direction can be `ASC` (the default) or `DESC`.
- Syntax
    ```sql
    SELECT
        window_function() OVER (
            PARTITION BY partition_expression
            ORDER BY column_name --DESC or ASC
        ) AS window_column_alias
    FROM table_name;
    ```

#### Importance of `ORDER BY` in a Window Function  

1. Controls Row Processing Order 🏛️
    - Defines how rows are sequentially evaluated within their partition.
    - Essential for functions like cumulative sums, moving averages, and rankings.  
2. Required for Certain Window Functions 🪟
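    - Ranking functions such as `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()`, and offset functions such as `LAG()`/`LEAD()`, need an `ORDER BY` inside `OVER()` to produce a meaningful sequence. PostgreSQL accepts them without one, but the result is then arbitrary or uninformative.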
3. Affects Aggregation Window 📊
    - For cumulative functions (`SUM()`, `AVG()`, etc.), it determines how values are accumulated row by row.
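
The effect of `ORDER BY` on an aggregate window can be sketched on toy data. This is a minimal illustration, not the course database: it assumes an in-memory SQLite database (version 3.25 or later, for window function support) in place of PostgreSQL, and the `toy_sales` table and its values are hypothetical.

```python
import sqlite3

# Hypothetical toy data: three orders for a single customer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (customerkey INTEGER, orderdate TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [(1, "2024-01-01", 100.0), (1, "2024-01-02", 50.0), (1, "2024-01-03", 25.0)],
)

rows = conn.execute(
    """
    SELECT
        orderdate,
        -- No ORDER BY: the window is the whole partition
        SUM(revenue) OVER (PARTITION BY customerkey) AS partition_total,
        -- With ORDER BY: the window grows row by row
        SUM(revenue) OVER (
            PARTITION BY customerkey
            ORDER BY orderdate
        ) AS running_total
    FROM toy_sales
    ORDER BY orderdate
    """
).fetchall()

for row in rows:
    print(row)
```

Without `ORDER BY`, every row sees the same partition total; with it, the same `SUM()` becomes a running total.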

### 📈 Analysis

- Get the running order count for each order. 
- Calculate the running average revenue for each order. 

#### Running Total of Orders for Customers

**`ORDER BY`**

1. Get the running order count, ordered by `orderdate`, using `ORDER BY` in the window function.
    - Get net revenue per order `(quantity * netprice * exchangerate)`
    - Track running count of orders using `COUNT(*) OVER (PARTITION BY customerkey ORDER BY orderdate) AS running_order_count`
    - Order the running count by `orderdate` for each customer

In [2]:
%%sql

SELECT 
    customerkey,
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue,
    COUNT(*) OVER (
        PARTITION BY customerkey 
        ORDER BY orderdate
    ) AS running_order_count
FROM sales
LIMIT 15

Unnamed: 0,customerkey,orderdate,net_revenue,running_order_count
0,15,2021-03-08,2217.41,1
1,180,2018-07-28,525.31,1
2,180,2023-08-28,1913.55,3
3,180,2023-08-28,71.36,3
4,185,2019-06-01,1395.52,1
5,243,2016-05-19,287.67,1
6,387,2018-12-21,45.62,4
7,387,2018-12-21,97.05,4
8,387,2018-12-21,1608.1,4
9,387,2018-12-21,619.77,4
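
In the output above, customer 180 jumps from 1 to 3 with no row showing 2. Rows that share the same `orderdate` are peers, and the default window frame (`RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`) includes all peers of the current row. The sketch below reproduces this with the three customer 180 rows, assuming an in-memory SQLite database (3.25+) rather than PostgreSQL; the `toy_sales` table name is hypothetical. It also shows a `ROWS` frame as one possible way to count strictly row by row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (customerkey INTEGER, orderdate TEXT, net_revenue REAL)"
)
# The three rows for customer 180 from the output above.
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        (180, "2018-07-28", 525.31),
        (180, "2023-08-28", 1913.55),
        (180, "2023-08-28", 71.36),
    ],
)

rows = conn.execute(
    """
    SELECT
        orderdate,
        -- Default frame: tied dates are peers and share one count
        COUNT(*) OVER (
            PARTITION BY customerkey
            ORDER BY orderdate
        ) AS default_frame_count,
        -- Explicit ROWS frame: counts one physical row at a time
        COUNT(*) OVER (
            PARTITION BY customerkey
            ORDER BY orderdate
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rows_frame_count
    FROM toy_sales
    ORDER BY orderdate, rows_frame_count
    """
).fetchall()

for row in rows:
    print(row)
```

With the `ROWS` frame, the order in which tied rows are numbered is arbitrary unless the `ORDER BY` includes a tie-breaker.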


#### Running Average Net Revenue for Customer

**`ORDER BY`**

1. Get the running average revenue, ordered by `orderdate`, using `ORDER BY` in the window function.
    - Get net revenue per order `(quantity * netprice * exchangerate)`
    - Calculate the running average revenue using `AVG(quantity * netprice * exchangerate) OVER (PARTITION BY customerkey ORDER BY orderdate) AS running_avg_revenue`
    - Order the running average by `orderdate` for each customer

In [3]:
%%sql

SELECT 
    customerkey,
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue,
    AVG(quantity * netprice * exchangerate) OVER (
        PARTITION BY customerkey 
        ORDER BY orderdate
    ) AS running_avg_revenue
FROM sales
LIMIT 15

Unnamed: 0,customerkey,orderdate,net_revenue,running_avg_revenue
0,15,2021-03-08,2217.41,2217.41
1,180,2018-07-28,525.31,525.31
2,180,2023-08-28,1913.55,836.74
3,180,2023-08-28,71.36,836.74
4,185,2019-06-01,1395.52,1395.52
5,243,2016-05-19,287.67,287.67
6,387,2018-12-21,1608.1,592.64
7,387,2018-12-21,619.77,592.64
8,387,2018-12-21,45.62,592.64
9,387,2018-12-21,97.05,592.64
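
The arithmetic behind the customer 180 rows above can be checked by hand. The following is a rough pure-Python sketch of what the running average computes under the default frame; the function name `running_average` is hypothetical and the revenue values are the rounded figures from the output.

```python
def running_average(rows):
    """Running average for ONE customer's (orderdate, net_revenue) rows.

    Mimics the default window frame: each row averages every row whose
    orderdate is on or before its own, so tied dates share one value.
    """
    result = []
    for orderdate, _ in rows:
        frame = [revenue for date, revenue in rows if date <= orderdate]
        result.append(round(sum(frame) / len(frame), 2))
    return result


# Customer 180, taken from the output above (ISO dates sort as strings).
customer_180 = [
    ("2018-07-28", 525.31),
    ("2023-08-28", 1913.55),
    ("2023-08-28", 71.36),
]

running = running_average(customer_180)
print(running)
```

Both 2023-08-28 rows get the average of all three orders, which is why `running_avg_revenue` repeats for tied dates.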


---
## ROW_NUMBER

### 📝 Notes

`ROW_NUMBER`

- **ROW NUMBER**: Assigns a unique number to each row within a partition.
- Syntax:
    ```sql
    ROW_NUMBER() OVER(
         PARTITION BY partition_expression
         ORDER BY column_name
    ) AS window_column_alias
    ```

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Customer Order Rank: Unique position assigned to each customer based on their order metrics
  - Sequential Numbering: Process of assigning unique, consecutive numbers to orders or customers
  - Order Frequency: How often a customer makes purchases over a given time period
- **💡 Why It Matters**: Assigns unique identifiers to orders, enabling customer behavior analysis
- **🎯 Common Use Cases**: Customer order tracking, monthly cohort rankings
- **📈 Related KPIs**: Order frequency, customer engagement metrics


### 📈 Analysis

- Give a daily order number to each order. 

#### Assign Row Numbers Without ORDER BY (DEMO ONLY)

1. Assign a row number to each row in the sales table.
   - Use `ROW_NUMBER()` without any `ORDER BY` clause.
      - ⚠️ This will assign arbitrary numbers that may change between query runs.
   - Not recommended as it provides no meaningful sequence.

In [2]:
%%sql

SELECT 
    ROW_NUMBER() OVER() AS row_num,
    *
FROM sales
LIMIT 10

Unnamed: 0,row_num,orderkey,linenumber,orderdate,deliverydate,customerkey,storekey,productkey,quantity,unitprice,netprice,unitcost,currencycode,exchangerate
0,1,1000,0,2015-01-01,2015-01-01,947009,400,48,1,112.46,98.97,57.34,GBP,0.64
1,2,1000,1,2015-01-01,2015-01-01,947009,400,460,1,749.75,659.78,382.25,GBP,0.64
2,3,1001,0,2015-01-01,2015-01-01,1772036,430,1730,2,54.38,54.38,25.0,USD,1.0
3,4,1002,0,2015-01-01,2015-01-01,1518349,660,955,4,315.04,286.69,144.88,USD,1.0
4,5,1002,1,2015-01-01,2015-01-01,1518349,660,62,7,135.75,135.75,62.43,USD,1.0
5,6,1002,2,2015-01-01,2015-01-01,1518349,660,1050,3,499.2,434.3,229.57,USD,1.0
6,7,1002,3,2015-01-01,2015-01-01,1518349,660,1608,1,65.99,58.73,33.65,USD,1.0
7,8,1003,0,2015-01-01,2015-01-01,1317097,510,85,3,74.99,74.99,34.48,USD,1.0
8,9,1004,0,2015-01-01,2015-01-01,254117,80,128,2,114.72,113.57,58.49,CAD,1.16
9,10,1004,1,2015-01-01,2015-01-01,254117,80,2079,1,499.45,499.45,165.48,CAD,1.16


#### Assign Row Numbers With ORDER BY

1. Assign a row number to each row in the sales table.
   - Use `ROW_NUMBER()` with `ORDER BY orderdate, orderkey, linenumber`.
      - ✅ This will assign consistent numbers based on chronological order.
   - Provides meaningful sequence based on order date and line items.

In [6]:
%%sql

SELECT
    ROW_NUMBER() OVER(
        ORDER BY 
            orderdate,
            orderkey,
            linenumber
    ) AS row_num,
    *
FROM sales
LIMIT 10

Unnamed: 0,row_num,orderkey,linenumber,orderdate,deliverydate,customerkey,storekey,productkey,quantity,unitprice,netprice,unitcost,currencycode,exchangerate
0,1,1000,0,2015-01-01,2015-01-01,947009,400,48,1,112.46,98.97,57.34,GBP,0.64
1,2,1000,1,2015-01-01,2015-01-01,947009,400,460,1,749.75,659.78,382.25,GBP,0.64
2,3,1001,0,2015-01-01,2015-01-01,1772036,430,1730,2,54.38,54.38,25.0,USD,1.0
3,4,1002,0,2015-01-01,2015-01-01,1518349,660,955,4,315.04,286.69,144.88,USD,1.0
4,5,1002,1,2015-01-01,2015-01-01,1518349,660,62,7,135.75,135.75,62.43,USD,1.0
5,6,1002,2,2015-01-01,2015-01-01,1518349,660,1050,3,499.2,434.3,229.57,USD,1.0
6,7,1002,3,2015-01-01,2015-01-01,1518349,660,1608,1,65.99,58.73,33.65,USD,1.0
7,8,1003,0,2015-01-01,2015-01-01,1317097,510,85,3,74.99,74.99,34.48,USD,1.0
8,9,1004,0,2015-01-01,2015-01-01,254117,80,128,2,114.72,113.57,58.49,CAD,1.16
9,10,1004,1,2015-01-01,2015-01-01,254117,80,2079,1,499.45,499.45,165.48,CAD,1.16


> **NOTE:** We can add `DESC` to reverse the sort order, so numbering starts from the most recent row.

In [7]:
%%sql

SELECT
    ROW_NUMBER() OVER(
        ORDER BY 
            orderdate DESC,
            orderkey DESC,
            linenumber DESC
    ) AS row_num,
    *
FROM sales
LIMIT 10

Unnamed: 0,row_num,orderkey,linenumber,orderdate,deliverydate,customerkey,storekey,productkey,quantity,unitprice,netprice,unitcost,currencycode,exchangerate
0,1,3398035,2,2024-04-20,2024-04-22,267690,999999,1693,6,6.88,6.88,3.16,CAD,1.38
1,2,3398035,1,2024-04-20,2024-04-22,267690,999999,415,5,326.0,293.4,166.2,CAD,1.38
2,3,3398035,0,2024-04-20,2024-04-22,267690,999999,1575,2,60.99,53.67,28.05,CAD,1.38
3,4,3398034,2,2024-04-20,2024-04-21,664396,999999,1646,1,159.99,159.99,73.57,EUR,0.94
4,5,3398034,1,2024-04-20,2024-04-21,664396,999999,1651,7,159.99,139.19,73.57,EUR,0.94
5,6,3398034,0,2024-04-20,2024-04-21,664396,999999,1511,1,229.0,199.23,105.31,EUR,0.94
6,7,3398033,2,2024-04-20,2024-04-20,635184,160,1206,3,1560.0,1388.4,516.86,EUR,0.94
7,8,3398033,1,2024-04-20,2024-04-20,635184,160,991,3,268.0,235.84,88.79,EUR,0.94
8,9,3398033,0,2024-04-20,2024-04-20,635184,160,1681,7,6.89,5.93,3.17,EUR,0.94
9,10,3398032,1,2024-04-20,2024-04-25,852158,999999,1651,2,159.99,139.19,73.57,EUR,0.94


#### Assign Daily Order Numbers With PARTITION BY

1. Assign a row number to each row in the sales table, partitioned by date.
   - Use `ROW_NUMBER()` with `PARTITION BY orderdate` and `ORDER BY orderdate, orderkey, linenumber`.
      - ✅ This will assign numbers that restart each day, ordered by order ID and line number.
   - Provides meaningful sequence of orders within each day.
   - WHERE clause included to show orders from the next day, demonstrating how numbering restarts.

In [8]:
%%sql

SELECT
    ROW_NUMBER() OVER(
        PARTITION BY
            orderdate
        ORDER BY 
            orderdate,
            orderkey,
            linenumber
    ) AS daily_order_num,
    *
FROM sales
WHERE orderdate > '2015-01-01'  -- WHERE included only to demonstrate numbering restarts every day
LIMIT 10

Unnamed: 0,daily_order_num,orderkey,linenumber,orderdate,deliverydate,customerkey,storekey,productkey,quantity,unitprice,netprice,unitcost,currencycode,exchangerate
0,1,2000,0,2015-01-02,2015-01-02,1639738,530,1613,5,65.99,59.39,33.65,USD,1.0
1,2,2001,0,2015-01-02,2015-01-15,2085372,999999,2182,2,1237.5,1237.5,410.01,USD,1.0
2,3,2002,0,2015-01-02,2015-01-02,1732602,510,1822,2,22.4,22.4,11.42,USD,1.0
3,4,2002,1,2015-01-02,2015-01-02,1732602,510,49,5,149.96,149.96,68.96,USD,1.0
4,5,2003,0,2015-01-02,2015-01-02,728917,300,1674,2,4.89,4.89,2.49,EUR,0.83
5,6,2003,1,2015-01-02,2015-01-02,728917,300,369,1,1747.5,1555.28,803.6,EUR,0.83
6,7,2004,0,2015-01-02,2015-01-02,1724183,570,1654,2,155.99,155.99,51.68,USD,1.0
7,8,2005,0,2015-01-02,2015-01-02,2054699,480,460,1,749.75,712.26,382.25,USD,1.0
8,1,3000,0,2015-01-03,2015-01-03,1793739,500,108,3,99.74,97.75,45.87,USD,1.0
9,2,3000,1,2015-01-03,2015-01-03,1793739,500,1684,3,11.82,11.0,3.92,USD,1.0
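
To make the restart behavior concrete, here is a rough pure-Python equivalent of `ROW_NUMBER() OVER (PARTITION BY orderdate ORDER BY orderkey, linenumber)`. It is only a sketch: it uses a handful of `(orderdate, orderkey, linenumber)` tuples picked from the output above, so the resulting numbers differ from the full-table result.

```python
from itertools import groupby

# A few (orderdate, orderkey, linenumber) rows, deliberately unsorted.
rows = [
    ("2015-01-02", 2005, 0),
    ("2015-01-03", 3000, 1),
    ("2015-01-02", 2002, 1),
    ("2015-01-03", 3000, 0),
    ("2015-01-02", 2002, 0),
]

# ORDER BY orderdate, orderkey, linenumber (tuples sort element by element).
rows_sorted = sorted(rows)

numbered = []
# PARTITION BY orderdate: one group per date, numbering restarts at 1.
for orderdate, group in groupby(rows_sorted, key=lambda row: row[0]):
    for daily_order_num, row in enumerate(group, start=1):
        numbered.append((daily_order_num, *row))

for row in numbered:
    print(row)
```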



#### Rank Customers by Order Quantity

**`ROW_NUMBER`**

1. By customer, assign a row number based on the total orders each customer made.
   - Use `COUNT(orderkey)` to calculate the total number of orders for each customer. Each row in `sales` is one order line, so an order with several line numbers is counted once per line.
   - Group by `customerkey` to ensure the order count is calculated for each individual customer.
   - Use `ROW_NUMBER() OVER (ORDER BY COUNT(orderkey) DESC)` to assign a unique number to each customer based on their total orders, in descending order.
   - Select `customerkey`, `total_orders`, and the row number (`total_orders_row_num`) in the output.


In [9]:
%%sql
SELECT 
    customerkey,
    COUNT(orderkey) AS total_orders,
    ROW_NUMBER() OVER (ORDER BY COUNT(orderkey) DESC) AS total_orders_row_num
FROM sales
GROUP BY customerkey
LIMIT 10


Unnamed: 0,customerkey,total_orders,total_orders_row_num
0,1834524,31,1
1,1375597,30,2
2,249557,27,3
3,1495941,26,4
4,459519,26,5
5,1801215,26,6
6,1219056,25,7
7,1876222,24,8
8,1427444,24,9
9,759419,24,10
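
Note that three customers above share 26 orders, yet `ROW_NUMBER()` gives them 4, 5, and 6. Which tied customer gets which number is arbitrary unless the window's `ORDER BY` contains a tie-breaker. One possible way to make the numbering deterministic is to add `customerkey` as a second sort key. The sketch below assumes an in-memory SQLite database (3.25+) and a hypothetical pre-aggregated `customer_totals` table holding four rows from the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding already-aggregated totals from the output above.
conn.execute(
    "CREATE TABLE customer_totals (customerkey INTEGER, total_orders INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_totals VALUES (?, ?)",
    [(1495941, 26), (459519, 26), (1801215, 26), (1219056, 25)],
)

rows = conn.execute(
    """
    SELECT
        customerkey,
        total_orders,
        -- customerkey breaks ties, so the numbering is repeatable
        ROW_NUMBER() OVER (
            ORDER BY total_orders DESC, customerkey
        ) AS total_orders_row_num
    FROM customer_totals
    ORDER BY total_orders_row_num
    """
).fetchall()

for row in rows:
    print(row)
```

The outer `ORDER BY` is also what guarantees the row order of the result; a window `ORDER BY` alone only controls how the numbers are assigned.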


---
## RANK

### 📝 Notes

`RANK`

- **RANK**: Assigns the same rank to rows with identical values but skips ranks after ties (e.g., 1, 2, 2, 4).
- Syntax:
    ```sql
    RANK() OVER(
         PARTITION BY partition_expression
         ORDER BY column_name
    ) AS window_column_alias
    ```

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Customer Order Ranking: Position of a customer based on their order volume or value
  - Tied Rankings: When multiple customers share the same rank due to identical metrics
  - Order Volume: Total number of orders placed by a customer
- **💡 Why It Matters**: Identifies high-volume customers while preserving tied rankings
- **🎯 Common Use Cases**: Customer segmentation, identifying top customers
- **📈 Related KPIs**: Customer order volume, tier distribution

### 📈 Analysis

- Rank customers by their total number of orders.


#### Rank Customers by Order Quantity

**`RANK`**

1. By customer, assign a rank to the total orders each customer made (same as the previous example, but using `RANK`).
   - Use `COUNT(orderkey)` to calculate the total number of orders for each customer.
   - Group by `customerkey` to ensure the order count is calculated for each individual customer.
   - 🔔 Use `RANK() OVER (ORDER BY COUNT(orderkey) DESC)` to assign a rank to each customer based on their total orders, in descending order. Customers with the same total share a rank.
   - Select `customerkey`, `total_orders`, the row number (`total_orders_row_num`), and the rank (`total_orders_rank`) in the output.

In [69]:
%%sql
SELECT 
    customerkey,
    COUNT(orderkey) AS total_orders,
    ROW_NUMBER() OVER (ORDER BY COUNT(orderkey) DESC) AS total_orders_row_num,
    RANK() OVER (ORDER BY COUNT(orderkey) DESC) AS total_orders_rank
FROM sales
GROUP BY customerkey
LIMIT 10

Unnamed: 0,customerkey,total_orders,total_orders_row_num,total_orders_rank
0,1834524,31,1,1
1,1375597,30,2,2
2,249557,27,3,3
3,1495941,26,4,4
4,459519,26,5,4
5,1801215,26,6,4
6,1219056,25,7,7
7,1876222,24,8,8
8,1427444,24,9,8
9,759419,24,10,8


> **NOTE:** With `RANK()`, customers with the same `total_orders` get the same rank (e.g., the three customers with 26 orders all get rank 4, and the next customer gets rank 7), while `ROW_NUMBER()` assigns unique sequential numbers even for ties (4, 5, 6). `RANK()` shows true ties, while `ROW_NUMBER()` forces a unique ordering.

---
## DENSE_RANK

### 📝 Notes

`DENSE_RANK`

- **DENSE_RANK**: Similar to `RANK()`, it assigns the same rank to rows with identical values but does not skip ranks after ties (e.g., 1, 2, 2, 3).
- Syntax:
    ```sql
    DENSE_RANK() OVER(
         PARTITION BY partition_expression
         ORDER BY column_name
    ) AS window_column_alias
    ```

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Continuous Customer Ranking: Ranking system without gaps, even when ties exist
  - Order Volume Tiers: Groupings of customers based on their order quantities
  - Customer Segmentation: Process of dividing customers into groups based on similar characteristics
- **💡 Why It Matters**: Creates consecutive rankings for customer segmentation without gaps
- **🎯 Common Use Cases**: Customer tiering, continuous rank analysis
- **📈 Related KPIs**: Customer tier metrics, order volume distribution

### 📈 Analysis

- Rank customers by their total number of orders.



#### Rank Customers by Order Quantity

**`DENSE_RANK`**

1. By customer, assign a rank to the total orders each customer made (same as the previous example, but adding `DENSE_RANK`).
   - Use `COUNT(orderkey)` to calculate the total number of orders for each customer.
   - Group by `customerkey` to ensure the order count is calculated for each individual customer.
   - 🔔 Use `DENSE_RANK() OVER (ORDER BY COUNT(orderkey) DESC)` to assign a rank to each customer based on their total orders, in descending order. Customers with the same total share a rank, and no ranks are skipped.
   - Select `customerkey`, `total_orders`, the row number (`total_orders_row_num`), the rank (`total_orders_rank`), and the dense rank (`total_orders_dense_rank`) in the output.

In [73]:
%%sql

SELECT 
    customerkey,
    COUNT(orderkey) AS total_orders,
    ROW_NUMBER() OVER (ORDER BY COUNT(orderkey) DESC) AS total_orders_row_num,
    RANK() OVER (ORDER BY COUNT(orderkey) DESC) AS total_orders_rank,
    DENSE_RANK() OVER (ORDER BY COUNT(orderkey) DESC) AS total_orders_dense_rank
FROM sales
GROUP BY customerkey
LIMIT 10


Unnamed: 0,customerkey,total_orders,total_orders_row_num,total_orders_rank,total_orders_dense_rank
0,1834524,31,1,1,1
1,1375597,30,2,2,2
2,249557,27,3,3,3
3,1495941,26,4,4,4
4,459519,26,5,4,4
5,1801215,26,6,4,4
6,1219056,25,7,7,5
7,1876222,24,8,8,6
8,1427444,24,9,8,6
9,759419,24,10,8,6


> **NOTE:** DENSE_RANK differs from RANK in that it assigns consecutive ranks without gaps when there are ties, while RANK leaves gaps after ties.
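
The logic of the two functions can be sketched in a few lines of pure Python. This is only an illustration of the tie rules, not how PostgreSQL implements them; the function name `rank_and_dense_rank` is hypothetical, and the input is the `total_orders` column from the output above, already sorted in descending order.

```python
def rank_and_dense_rank(sorted_values):
    """Return (ranks, dense_ranks) for values already sorted for ranking."""
    ranks, dense_ranks = [], []
    for position, value in enumerate(sorted_values, start=1):
        is_tie = position > 1 and value == sorted_values[position - 2]
        if is_tie:
            # Ties repeat the previous rank in both schemes.
            ranks.append(ranks[-1])
            dense_ranks.append(dense_ranks[-1])
        else:
            # RANK jumps to the row position; DENSE_RANK just adds one.
            ranks.append(position)
            dense_ranks.append(dense_ranks[-1] + 1 if dense_ranks else 1)
    return ranks, dense_ranks


total_orders = [31, 30, 27, 26, 26, 26, 25, 24, 24, 24]
ranks, dense_ranks = rank_and_dense_rank(total_orders)
print(ranks)
print(dense_ranks)
```

The two lists match the `total_orders_rank` and `total_orders_dense_rank` columns in the query output.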

### 💡 What's the difference between `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()`?

1. `ROW_NUMBER()` 
    - Even if two rows have the same value, they will get different, consecutive ranks.
    - Example: If three products have the same sales amount, they’ll be ranked 1, 2, and 3 in sequence.

| Sales | ROW_NUMBER() |
|-------|--------------|
| 500   | 1            |
| 500   | 2            |
| 400   | 3            |
| 300   | 4            |  
  

2. `RANK()`
    - Rows with identical values receive the same rank, and the rank after a tie skips ahead by the number of tied rows.
    - Example: If three products have the same highest sales amount, they all get rank 1, and the next product will get rank 4.

| Sales | RANK() |
|-------|--------|
| 500   | 1            |
| 500   | 1            |
| 400   | 3            |
| 300   | 4            |


3. `DENSE_RANK()`
    - Rows with identical values receive the same rank, and the next rank continues sequentially without gaps.
    - Example: If three products have the same highest sales amount, they all get rank 1, and the next product will get rank 2.

| Sales | DENSE_RANK() |
|-------|--------------|
| 500   | 1            |
| 500   | 1            |
| 400   | 2            |
| 300   | 3            |

**Alternative note format**

- Same info as above but in a different format. 

| Function     | Description                                                                              | Tie Handling                        | Example Sales Values (500, 500, 400, 300) |
|--------------|------------------------------------------------------------------------------------------|-------------------------------------|-------------------------------------------|
| ROW_NUMBER() | Assigns a unique, sequential number to each row without regard for ties.                 | No ties; each row gets a unique number | 1, 2, 3, 4                             |
| RANK()       | Assigns the same rank to identical values but skips ranks after ties.                    | Same rank for ties; skips next ranks | 1, 1, 3, 4                               |
| DENSE_RANK() | Assigns the same rank to identical values and continues sequentially without skipping.   | Same rank for ties; no skipped ranks | 1, 1, 2, 3                               |
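
The example values in the table can be checked by running all three functions side by side. The sketch below assumes an in-memory SQLite database (3.25+) instead of PostgreSQL, and a hypothetical `product_sales` table holding the four sales values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with the example sales values from the table above.
conn.execute("CREATE TABLE product_sales (sales INTEGER)")
conn.executemany(
    "INSERT INTO product_sales VALUES (?)", [(500,), (500,), (400,), (300,)]
)

rows = conn.execute(
    """
    SELECT
        sales,
        ROW_NUMBER() OVER (ORDER BY sales DESC) AS sales_row_num,
        RANK()       OVER (ORDER BY sales DESC) AS sales_rank,
        DENSE_RANK() OVER (ORDER BY sales DESC) AS sales_dense_rank
    FROM product_sales
    ORDER BY sales_row_num
    """
).fetchall()

for row in rows:
    print(row)
```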