# 02 — Window Functions

**Learning Objectives**

| # | Goal |
|---|------|
| 1 | Understand the difference between `GROUP BY` (collapses rows) and `OVER()` (keeps rows) |
| 2 | Use ranking functions: `ROW_NUMBER`, `RANK`, `DENSE_RANK` |
| 3 | Compare rows with `LAG` and `LEAD` |
| 4 | Compute running totals and moving averages with frame clauses |
| 5 | Apply `NTILE` for percentile bucketing |

---

### Mental Model: What Is a Window Function?

```
GROUP BY  ──▶  Many rows  →  ONE row per group   (rows disappear)
OVER()    ──▶  Many rows  →  SAME rows + new col  (rows stay)
```

A window function **looks at a set of related rows** (the "window") and computes a value for each row without collapsing them.

**Anatomy:**
```sql
FUNCTION(col) OVER (
    PARTITION BY group_col   -- defines the window (like GROUP BY)
    ORDER BY sort_col        -- ordering inside the window
    ROWS BETWEEN ...         -- optional frame clause
)
```
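
To make the contrast concrete, here is a minimal sketch that runs without the `hotel_bookings` table. The `bookings` CTE and its three rows are hypothetical sample data, used only for illustration.

```sql
-- GROUP BY: three input rows collapse to one row per hotel (2 rows out).
WITH bookings(hotel, adr) AS (
    VALUES ('City', 100), ('City', 140), ('Resort', 90)   -- hypothetical data
)
SELECT hotel, AVG(adr) AS hotel_avg_adr
FROM bookings
GROUP BY hotel;

-- OVER: the same three rows come back, each with its hotel's average attached (3 rows out).
WITH bookings(hotel, adr) AS (
    VALUES ('City', 100), ('City', 140), ('Resort', 90)   -- hypothetical data
)
SELECT
    hotel,
    adr,
    AVG(adr) OVER (PARTITION BY hotel) AS hotel_avg_adr
FROM bookings
ORDER BY hotel, adr;
```

The second form is what lets you compare each booking against its group (for example `adr - AVG(adr) OVER (...)`) in a single pass.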

In [None]:
%load_ext sql
%sql postgresql://admin:password@postgres:5432/mastery_db

---
## 1. Ranking Functions

| Function | Ties | Gaps |
|----------|------|------|
| `ROW_NUMBER()` | Breaks ties arbitrarily | No gaps |
| `RANK()` | Same rank for ties | Gaps after ties |
| `DENSE_RANK()` | Same rank for ties | No gaps |

**Business Question:** Rank bookings by ADR (average daily rate) within each hotel: who paid the most?

In [None]:
%%sql
SELECT
    hotel,
    country,
    adr,
    ROW_NUMBER() OVER (PARTITION BY hotel ORDER BY adr DESC) AS row_num,
    RANK()       OVER (PARTITION BY hotel ORDER BY adr DESC) AS rank_val,
    DENSE_RANK() OVER (PARTITION BY hotel ORDER BY adr DESC) AS dense_rank_val
FROM hotel_bookings
WHERE adr > 0
ORDER BY hotel, adr DESC   -- without this, LIMIT returns rows in no guaranteed order
LIMIT 15;

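> **Observe:** When two rows share the same `adr`, `RANK` gives them the same number but *skips* the next value (e.g. 1, 1, 3). `DENSE_RANK` doesn't skip (1, 1, 2). `ROW_NUMBER` is always unique, but which of the tied rows gets the lower number is arbitrary unless the `ORDER BY` includes a tie-breaker column.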
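
The tie behaviour is easiest to see on a handful of rows. The sketch below uses a hypothetical four-row `sample_bookings` CTE (not part of the dataset) with one deliberate tie at 200.

```sql
WITH sample_bookings(adr) AS (
    VALUES (200), (200), (150), (100)   -- hypothetical data with one tie
)
SELECT
    adr,
    ROW_NUMBER() OVER (ORDER BY adr DESC) AS row_num,
    RANK()       OVER (ORDER BY adr DESC) AS rank_val,
    DENSE_RANK() OVER (ORDER BY adr DESC) AS dense_rank_val
FROM sample_bookings
ORDER BY adr DESC, row_num;
-- adr | row_num | rank_val | dense_rank_val
-- 200 |       1 |        1 |              1
-- 200 |       2 |        1 |              1
-- 150 |       3 |        3 |              2
-- 100 |       4 |        4 |              3
```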

### Practical Use: Top-N per Group

A very common pattern — get the **top 3 highest-paying bookings per hotel** using a subquery.

In [None]:
%%sql
SELECT *
FROM (
    SELECT
        hotel,
        country,
        customer_type,
        adr,
        ROW_NUMBER() OVER (PARTITION BY hotel ORDER BY adr DESC) AS rn
    FROM hotel_bookings
    WHERE adr > 0
) ranked
WHERE rn <= 3
ORDER BY hotel, rn;

> **Key Pattern:** Wrap the window function in a subquery or CTE, then filter on the rank in the outer query. You **cannot** put `WHERE rn <= 3` in the same SELECT: window functions are evaluated after `WHERE`, so `rn` does not exist yet when the filter runs.
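
The same pattern reads a little more linearly as a CTE. This is a sketch on hypothetical data (the `bookings` CTE below stands in for `hotel_bookings`); the structure is what matters.

```sql
WITH bookings(hotel, adr) AS (
    -- hypothetical data: four City bookings, two Resort bookings
    VALUES ('City', 300), ('City', 250), ('City', 200), ('City', 150),
           ('Resort', 400), ('Resort', 100)
),
ranked AS (
    SELECT
        hotel,
        adr,
        ROW_NUMBER() OVER (PARTITION BY hotel ORDER BY adr DESC) AS rn
    FROM bookings
)
SELECT hotel, adr, rn
FROM ranked
WHERE rn <= 3          -- allowed here: rn is an ordinary column of the CTE
ORDER BY hotel, rn;
-- Returns 5 rows: City 300, 250, 200 and Resort 400, 100.
```

Trying to write the filter directly, as in `WHERE ROW_NUMBER() OVER (...) <= 3`, fails in PostgreSQL with an error stating that window functions are not allowed in `WHERE`.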

---
## 2. LAG & LEAD — Compare Adjacent Rows

| Function | Returns |
|----------|---------|
| `LAG(col, n)` | Value from **n rows before** the current row (default n=1) |
| `LEAD(col, n)` | Value from **n rows after** the current row (default n=1) |

**Business Question:** How does the monthly average ADR change month-over-month for each hotel?

In [None]:
%%sql
WITH monthly_adr AS (
    SELECT
        hotel,
        arrival_date_year  AS yr,
        arrival_date_month AS mo,
        ROUND(AVG(adr)::numeric, 2) AS avg_adr
    FROM hotel_bookings
    WHERE adr > 0
    GROUP BY hotel, arrival_date_year, arrival_date_month
)
SELECT
    hotel,
    yr,
    mo,
    avg_adr,
    LAG(avg_adr) OVER (PARTITION BY hotel ORDER BY yr, mo)  AS prev_month_adr,
    ROUND(
        avg_adr - LAG(avg_adr) OVER (PARTITION BY hotel ORDER BY yr, mo), 2
    ) AS mom_change
FROM monthly_adr
ORDER BY hotel, yr, mo
LIMIT 15;

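> **Note:** The first row in each partition returns `NULL` for `LAG` because there is no previous row, so `mom_change` is `NULL` there as well. This is expected. If you need a default, use `COALESCE(LAG(...), 0)` or the three-argument form `LAG(avg_adr, 1, 0)`, but remember that a default of 0 makes the first month's change equal to its full `avg_adr`.
>
> **Check the sort key:** `ORDER BY yr, mo` yields calendar order only if `arrival_date_month` is numeric. If the column stores month names as text, the months sort alphabetically and `LAG` compares the wrong rows; in that case, order by a numeric month instead.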
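
A small sketch of `LAG` and `LEAD` side by side, including the optional third argument that supplies a default. The `monthly` CTE here is hypothetical three-row data with a numeric month, not the CTE from the query above.

```sql
WITH monthly(mo, avg_adr) AS (
    VALUES (1, 100), (2, 110), (3, 95)   -- hypothetical data
)
SELECT
    mo,
    avg_adr,
    LAG(avg_adr)       OVER (ORDER BY mo) AS prev_adr,       -- NULL on the first row
    LAG(avg_adr, 1, 0) OVER (ORDER BY mo) AS prev_adr_or_0,  -- default 0 instead of NULL
    LEAD(avg_adr)      OVER (ORDER BY mo) AS next_adr        -- NULL on the last row
FROM monthly
ORDER BY mo;
-- mo | avg_adr | prev_adr | prev_adr_or_0 | next_adr
--  1 |     100 |     NULL |             0 |      110
--  2 |     110 |      100 |           100 |       95
--  3 |      95 |      110 |           110 |     NULL
```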

---
## 3. Running Totals

A running total (cumulative sum) adds up values from the first row to the current row.

**Business Question:** What is the cumulative booking count over time?

In [None]:
%%sql
SELECT
    arrival_date_year  AS yr,
    arrival_date_month AS mo,
    COUNT(*) AS monthly_bookings,
    SUM(COUNT(*)) OVER (
        ORDER BY arrival_date_year, arrival_date_month
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM hotel_bookings
GROUP BY arrival_date_year, arrival_date_month
ORDER BY yr, mo;

> **Frame Clause Explained:**  
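> `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` means "from the very first row of the partition up to this one". It is not quite the default: with only `ORDER BY`, PostgreSQL uses `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which also includes every row that ties with the current row on the `ORDER BY` columns. The two give the same result here because each (year, month) pair appears exactly once after `GROUP BY`; being explicit avoids surprises when the sort key has duplicates.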
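
The difference between a `ROWS` frame and the default `RANGE` frame only shows up when the sort key has duplicates. The sketch below uses a hypothetical `yearly` CTE in which 2016 appears twice.

```sql
WITH yearly(yr, bookings) AS (
    VALUES (2015, 10), (2016, 20), (2016, 20), (2017, 40)   -- hypothetical data, 2016 duplicated
)
SELECT
    yr,
    bookings,
    -- No frame clause: RANGE ... CURRENT ROW, so both 2016 rows see each other.
    SUM(bookings) OVER (ORDER BY yr) AS default_frame,
    -- Explicit ROWS frame: strictly row by row.
    SUM(bookings) OVER (
        ORDER BY yr
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS rows_frame
FROM yearly
ORDER BY yr, rows_frame;
--   yr | bookings | default_frame | rows_frame
-- 2015 |       10 |            10 |         10
-- 2016 |       20 |            50 |         30
-- 2016 |       20 |            50 |         50
-- 2017 |       40 |            90 |         90
```

With the default frame the running total jumps straight to 50 on both 2016 rows; with `ROWS` it grows one row at a time.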

---
## 4. Moving Averages

A moving average smooths out noise by averaging over a sliding window.

**Business Question:** What is the 3-month moving average ADR per hotel?

In [None]:
%%sql
WITH monthly AS (
    SELECT
        hotel,
        arrival_date_year  AS yr,
        arrival_date_month AS mo,
        ROUND(AVG(adr)::numeric, 2) AS avg_adr
    FROM hotel_bookings
    WHERE adr > 0
    GROUP BY hotel, arrival_date_year, arrival_date_month
)
SELECT
    hotel,
    yr,
    mo,
    avg_adr,
    ROUND(
        AVG(avg_adr) OVER (
            PARTITION BY hotel
            ORDER BY yr, mo
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        )::numeric, 2
    ) AS moving_avg_3m
FROM monthly
ORDER BY hotel, yr, mo;

> **`ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`** = current row + 2 rows before = 3-row window.  
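> For the first 2 rows of each partition the window is smaller (1 or 2 rows), so those values average fewer than 3 months: the first equals that month's `avg_adr`, and the second is a 2-month average.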
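
To see the shrinking window at the start of a partition, it helps to count the rows in the frame alongside the average. This sketch uses a hypothetical four-month `monthly` CTE and a named window (`WINDOW w AS ...`) so the frame is written only once.

```sql
WITH monthly(mo, avg_adr) AS (
    VALUES (1, 90), (2, 120), (3, 150), (4, 60)   -- hypothetical data
)
SELECT
    mo,
    avg_adr,
    COUNT(*)                   OVER w AS rows_in_window,
    ROUND(AVG(avg_adr) OVER w, 2)     AS moving_avg_3m
FROM monthly
WINDOW w AS (ORDER BY mo ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY mo;
-- mo | avg_adr | rows_in_window | moving_avg_3m
--  1 |      90 |              1 |         90.00
--  2 |     120 |              2 |        105.00
--  3 |     150 |              3 |        120.00
--  4 |      60 |              3 |        110.00
```

If partial windows are misleading for your analysis, one option is to return `NULL` until `rows_in_window` reaches 3, for example with a `CASE` expression in an outer query.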

---
## 5. NTILE — Percentile Buckets

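`NTILE(n)` divides the ordered rows into **n buckets of near-equal size**, numbered 1 through n. When the row count is not divisible by n, the lower-numbered buckets each get one extra row. Buckets are assigned by position, so rows with identical values can land in different buckets.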

**Business Question:** Split bookings into ADR quartiles (4 buckets) per hotel.

In [None]:
%%sql
WITH quartiles AS (
    SELECT
        hotel,
        adr,
        NTILE(4) OVER (PARTITION BY hotel ORDER BY adr) AS adr_quartile
    FROM hotel_bookings
    WHERE adr > 0
)
SELECT
    hotel,
    adr_quartile,
    COUNT(*)                          AS bookings,
    ROUND(MIN(adr)::numeric, 2)       AS min_adr,
    ROUND(AVG(adr)::numeric, 2)       AS avg_adr,
    ROUND(MAX(adr)::numeric, 2)       AS max_adr
FROM quartiles
GROUP BY hotel, adr_quartile
ORDER BY hotel, adr_quartile;

> **Use Case:** NTILE is great for segmenting customers (top 25% spenders vs bottom 25%) or creating "A/B/C/D" tiers.
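
As a sketch of the tiering idea, the query below buckets ten hypothetical ADR values (10, 20, ..., 100, generated with `generate_series`) into quartiles and maps them to letter tiers. The tier labels and the `sample_adr` CTE are illustrative assumptions, not part of the dataset.

```sql
WITH sample_adr(adr) AS (
    SELECT n * 10 FROM generate_series(1, 10) AS g(n)   -- hypothetical data: 10, 20, ..., 100
),
bucketed AS (
    SELECT adr, NTILE(4) OVER (ORDER BY adr) AS adr_quartile
    FROM sample_adr
)
SELECT
    adr_quartile,
    CASE adr_quartile WHEN 4 THEN 'A' WHEN 3 THEN 'B' WHEN 2 THEN 'C' ELSE 'D' END AS tier,
    COUNT(*) AS bookings,
    MIN(adr) AS min_adr,
    MAX(adr) AS max_adr
FROM bucketed
GROUP BY adr_quartile
ORDER BY adr_quartile;
-- adr_quartile | tier | bookings | min_adr | max_adr
--            1 | D    |        3 |      10 |      30
--            2 | C    |        3 |      40 |      60
--            3 | B    |        2 |      70 |      80
--            4 | A    |        2 |      90 |     100
```

Ten rows do not divide evenly by four, so the first two buckets hold three rows and the last two hold two.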

---
## Exercises

**Exercise 1:** For each hotel, find the booking with the **longest lead time**. Return hotel, country, lead_time, and rank.

<details><summary>Hint</summary>

```sql
SELECT * FROM (
    SELECT hotel, country, lead_time,
           ROW_NUMBER() OVER (PARTITION BY hotel ORDER BY lead_time DESC) AS rn
    FROM hotel_bookings
) t WHERE rn = 1;
```
</details>

In [None]:
%%sql
-- Exercise 1: Your query here


**Exercise 2:** Calculate the **percentage of total bookings** each country represents, using a window function (no subquery needed).

<details><summary>Hint</summary>

```sql
SELECT country, COUNT(*) AS bookings,
       ROUND(COUNT(*)::numeric / SUM(COUNT(*)) OVER () * 100, 2) AS pct_of_total
FROM hotel_bookings
GROUP BY country
ORDER BY bookings DESC
LIMIT 10;
```
</details>

In [None]:
%%sql
-- Exercise 2: Your query here


**Exercise 3:** For each market segment, compute a running total of cancellations ordered by year and month.

<details><summary>Hint</summary>

```sql
SELECT market_segment, arrival_date_year, arrival_date_month,
       SUM(is_canceled) AS monthly_cancels,
       SUM(SUM(is_canceled)) OVER (
           PARTITION BY market_segment
           ORDER BY arrival_date_year, arrival_date_month
       ) AS running_cancels
FROM hotel_bookings
GROUP BY market_segment, arrival_date_year, arrival_date_month
ORDER BY market_segment, arrival_date_year, arrival_date_month;
```
</details>

In [None]:
%%sql
-- Exercise 3: Your query here


---
## Key Takeaways

| Function | Purpose | Frame Default |
|----------|---------|---------------|
| `ROW_NUMBER()` | Unique sequential number | N/A |
| `RANK()` / `DENSE_RANK()` | Ranking with/without gaps | N/A |
| `LAG()` / `LEAD()` | Access previous / next row | N/A |
| `SUM() OVER(ORDER BY ...)` | Running total | `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` |
| `AVG() OVER(... ROWS BETWEEN n PRECEDING AND CURRENT ROW)` | Moving average | Explicit frame |
| `NTILE(n)` | Split into n near-equal buckets | N/A |

**Golden Rule:** Window functions execute *after* `WHERE`, `GROUP BY`, and `HAVING`. You cannot filter on a window function's result at the same query level; wrap it in a subquery or CTE and filter in the outer query.

**Next:** [03_performance_tuning.ipynb](./03_performance_tuning.ipynb) — indexes, EXPLAIN ANALYZE, and making queries fast.