# 04 — Complex Aggregations

**Learning Objectives**

| # | Goal |
|---|------|
| 1 | Generate multiple grouping levels in one query with `GROUPING SETS` |
| 2 | Create hierarchical subtotals with `ROLLUP` |
| 3 | Generate all possible grouping combinations with `CUBE` |
| 4 | Pivot data using the `FILTER` clause |
| 5 | Combine CTEs with aggregations for multi-step analysis |

---

### Why Go Beyond GROUP BY?

Standard `GROUP BY` gives you **one level** of aggregation. But reports often need:
- Subtotals by category **and** a grand total in the same query
- Hierarchical drill-downs (Year → Month → Day)
- Pivot-table-style columns

SQL provides three `GROUP BY` extensions that cover the first two needs (pivot-style columns are handled by `FILTER` in section 4):

```
GROUPING SETS  →  You choose exactly which groupings
ROLLUP         →  Hierarchical subtotals (most → least granular)
CUBE           →  Every possible combination of groupings
```

In [None]:
%load_ext sql
%sql postgresql://admin:password@postgres:5432/mastery_db

---
## 1. GROUPING SETS — Custom Grouping Combinations

Instead of running 3 separate `GROUP BY` queries and `UNION ALL`-ing them, use `GROUPING SETS` to define exactly which groupings you want.

**Business Question:** Show total bookings by **hotel**, by **market segment**, and the **grand total** — all in one result.

In [None]:
%%sql
SELECT
    hotel,
    market_segment,
    COUNT(*) AS bookings
FROM hotel_bookings
GROUP BY GROUPING SETS (
    (hotel),              -- Subtotal per hotel
    (market_segment),     -- Subtotal per segment
    ()                    -- Grand total
)
ORDER BY
    GROUPING(hotel, market_segment),  -- Bitmask: 1 = per hotel, 2 = per segment, 3 = grand total (last)
    hotel NULLS LAST,
    market_segment NULLS LAST;
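
To see what `GROUPING SETS` saves you from writing, here is a minimal Python sketch of the long-hand `UNION ALL` form: one `GROUP BY` query per grouping set, with `NULL` standing in for the column that is aggregated away. It uses the standard-library `sqlite3` module with an in-memory table, since SQLite does not support `GROUPING SETS`. The table and column names mirror `hotel_bookings`, but the four rows are made up for illustration.

```python
import sqlite3

# Made-up rows for illustration only: (hotel, market_segment)
rows = [
    ("City Hotel", "Online TA"),
    ("City Hotel", "Direct"),
    ("Resort Hotel", "Online TA"),
    ("Resort Hotel", "Online TA"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel_bookings (hotel TEXT, market_segment TEXT)")
conn.executemany("INSERT INTO hotel_bookings VALUES (?, ?)", rows)

# One SELECT per grouping set; NULL fills the column that is not grouped on.
union_all_sql = """
SELECT hotel, NULL AS market_segment, COUNT(*) AS bookings
FROM hotel_bookings
GROUP BY hotel
UNION ALL
SELECT NULL, market_segment, COUNT(*)
FROM hotel_bookings
GROUP BY market_segment
UNION ALL
SELECT NULL, NULL, COUNT(*)
FROM hotel_bookings
"""

result = conn.execute(union_all_sql).fetchall()
for row in result:
    print(row)
```

The long-hand form scans the table once per grouping set and repeats the aggregate list three times, which is the duplication `GROUPING SETS` removes.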

### Understanding NULLs in the Output

| `hotel` | `market_segment` | Meaning |
|---------|------------------|---------|
| `City Hotel` | `NULL` | Subtotal for City Hotel (all segments) |
| `NULL` | `Online TA` | Subtotal for Online TA (all hotels) |
| `NULL` | `NULL` | Grand total |

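Use `GROUPING(col)` to distinguish real NULLs in the data from subtotal NULLs: it returns 1 when `col` is aggregated away in that row's grouping set (so the NULL is a subtotal placeholder) and 0 when `col` is part of the grouping.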
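
With more than one argument, as in the `ORDER BY GROUPING(hotel, market_segment)` used in the query above, PostgreSQL returns a bitmask in which the rightmost argument is the least-significant bit. The following Python sketch mimics that rule to show why the grand total sorts last; `grouping_mask` is a hypothetical helper written for this illustration, not a database function.

```python
def grouping_mask(args, grouping_set):
    """Mimic GROUPING(args...): one bit per argument, leftmost argument = highest bit.

    A bit is 0 when the column is part of the grouping set and 1 when it is
    aggregated away (the subtotal case).
    """
    mask = 0
    for col in args:
        bit = 0 if col in grouping_set else 1
        mask = (mask << 1) | bit
    return mask


args = ("hotel", "market_segment")
masks = {gs: grouping_mask(args, gs) for gs in [("hotel",), ("market_segment",), ()]}
for gs, mask in masks.items():
    print(gs, "->", mask)
# ('hotel',) -> 1
# ('market_segment',) -> 2
# () -> 3
```

Sorting by this value therefore puts per-hotel rows first, per-segment rows second, and the grand total last.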

### Using GROUPING() for Clean Labels

In [None]:
%%sql
SELECT
    CASE WHEN GROUPING(hotel) = 1 THEN '** ALL HOTELS **' ELSE hotel END AS hotel,
    CASE WHEN GROUPING(market_segment) = 1 THEN '** ALL SEGMENTS **' ELSE market_segment END AS market_segment,
    COUNT(*) AS bookings
FROM hotel_bookings
GROUP BY GROUPING SETS (
    (hotel),
    (market_segment),
    ()
)
ORDER BY GROUPING(hotel, market_segment), hotel NULLS LAST, market_segment NULLS LAST;

---
## 2. ROLLUP — Hierarchical Subtotals

`ROLLUP(A, B, C)` automatically generates:
- `(A, B, C)` — most detailed
- `(A, B)` — subtotal removing C
- `(A)` — subtotal removing B and C
- `()` — grand total

**Perfect for time hierarchies:** Year → Month → Grand Total.
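
The expansion rule listed above is simply "every prefix of the column list, longest first". A small Python sketch makes that concrete; `rollup_sets` is a hypothetical helper for illustration and only enumerates the grouping sets, it does not aggregate anything.

```python
def rollup_sets(cols):
    """Grouping sets that ROLLUP(cols) expands to: each prefix, most detailed first."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


print(rollup_sets(["A", "B", "C"]))
# [('A', 'B', 'C'), ('A', 'B'), ('A',), ()]

# n columns always yield n + 1 grouping sets
print(len(rollup_sets(["arrival_date_year", "arrival_date_month"])))
# 3
```

Because only prefixes are generated, column order matters: `ROLLUP(A, B)` gives subtotals per `A`, never per `B`.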

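**Business Question:** Total revenue by year and month, with year subtotals and a grand total. (Here `SUM(adr)` stands in for revenue; since `adr` is a daily rate, treat it as a proxy rather than an exact revenue figure.)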

In [None]:
%%sql
SELECT
    arrival_date_year  AS yr,
    arrival_date_month AS mo,
    COUNT(*)           AS bookings,
    ROUND(SUM(adr)::numeric, 2)  AS total_revenue,
    ROUND(AVG(adr)::numeric, 2)  AS avg_daily_rate
FROM hotel_bookings
WHERE adr > 0
GROUP BY ROLLUP (arrival_date_year, arrival_date_month)
ORDER BY yr NULLS LAST, mo NULLS LAST;

> **Reading the result:**
> - Rows where `mo IS NULL` but `yr` has a value → **Year subtotal**
> - Row where both `yr` and `mo` are `NULL` → **Grand total**

---
## 3. CUBE — All Possible Combinations

`CUBE(A, B)` generates **all** `2^n` combinations of its `n` columns (here `2^2 = 4`):
- `(A, B)`, `(A)`, `(B)`, `()`

By comparison, `ROLLUP` over the same `n` columns generates only `n + 1` levels.
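
The "every combination" rule can be sketched in a few lines of Python with `itertools.combinations`. `cube_sets` is a hypothetical helper for illustration; it only enumerates the grouping sets and shows how quickly their number grows.

```python
from itertools import combinations


def cube_sets(cols):
    """Grouping sets that CUBE(cols) expands to: every subset, largest first."""
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]


print(cube_sets(["A", "B"]))
# [('A', 'B'), ('A',), ('B',), ()]

# The count doubles with each extra column: 2, 4, 8, 16
print([len(cube_sets(list("ABCD")[:n])) for n in range(1, 5)])
# [2, 4, 8, 16]
```

That doubling is the practical reason to reach for `CUBE` only when every cross-combination is actually needed.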

**Business Question:** Bookings by hotel and customer type — with subtotals in every direction.

In [None]:
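%%sql
SELECT
    -- GROUPING() labels only subtotal rows; COALESCE would also relabel real NULLs in the data
    CASE WHEN GROUPING(hotel) = 1 THEN '** ALL **' ELSE hotel END                 AS hotel,
    CASE WHEN GROUPING(customer_type) = 1 THEN '** ALL **' ELSE customer_type END AS customer_type,
    COUNT(*)                    AS bookings,
    ROUND(AVG(adr)::numeric, 2) AS avg_adr
FROM hotel_bookings
WHERE adr > 0
GROUP BY CUBE (hotel, customer_type)
-- Sort on GROUPING() so the '** ALL **' rows come after the detail rows they summarize
ORDER BY GROUPING(hotel), hotel, GROUPING(customer_type), customer_type;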

### ROLLUP vs CUBE Comparison

| | ROLLUP(A, B) | CUBE(A, B) |
|-|-------------|------------|
| `(A, B)` | ✅ | ✅ |
| `(A)` | ✅ | ✅ |
| `(B)` | ❌ | ✅ |
| `()` | ✅ | ✅ |
| **Groups generated** | `n + 1` | `2^n` |

> Use **ROLLUP** for hierarchies (Year > Month).  
> Use **CUBE** when you need *every* cross-combination.

---
## 4. The FILTER Clause — SQL Pivoting

Instead of messy `CASE WHEN ... THEN 1 ELSE 0 END` inside `SUM()`, use the cleaner `FILTER (WHERE ...)` syntax.

**Business Question:** For each hotel, show total bookings, canceled bookings, successful bookings, and the cancellation rate — as separate columns.

In [None]:
%%sql
SELECT
    hotel,
    COUNT(*) AS total_bookings,
    COUNT(*) FILTER (WHERE is_canceled = 1) AS canceled,
    COUNT(*) FILTER (WHERE is_canceled = 0) AS successful,
    ROUND(
        COUNT(*) FILTER (WHERE is_canceled = 1)::numeric / COUNT(*) * 100, 2
    ) AS cancel_rate_pct
FROM hotel_bookings
GROUP BY hotel;
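
As a quick check that `FILTER` and the `CASE WHEN` idiom count the same rows, the Python sketch below runs both forms against an in-memory SQLite table. It assumes the SQLite bundled with your Python is version 3.30 or newer (the release that added `FILTER` for aggregates), and the four rows are made up for illustration.

```python
import sqlite3

# Made-up rows for illustration only: (hotel, is_canceled)
rows = [
    ("City Hotel", 1),
    ("City Hotel", 0),
    ("City Hotel", 1),
    ("Resort Hotel", 0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel_bookings (hotel TEXT, is_canceled INTEGER)")
conn.executemany("INSERT INTO hotel_bookings VALUES (?, ?)", rows)

# Condition attached to the aggregate
filter_sql = """
SELECT hotel, COUNT(*) FILTER (WHERE is_canceled = 1) AS canceled
FROM hotel_bookings
GROUP BY hotel
ORDER BY hotel
"""

# Same condition buried inside the aggregate's argument
case_sql = """
SELECT hotel, SUM(CASE WHEN is_canceled = 1 THEN 1 ELSE 0 END) AS canceled
FROM hotel_bookings
GROUP BY hotel
ORDER BY hotel
"""

with_filter = conn.execute(filter_sql).fetchall()
with_case = conn.execute(case_sql).fetchall()
print(with_filter)
print(with_filter == with_case)
```

Both queries return one row per hotel with the number of canceled bookings; the `FILTER` version keeps the aggregate (`COUNT(*)`) and the condition visibly separate.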

### Advanced Pivot — Bookings by Meal Type

Turn row values into columns using FILTER.

In [None]:
%%sql
SELECT
    hotel,
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE meal = 'BB') AS bed_breakfast,
    COUNT(*) FILTER (WHERE meal = 'HB') AS half_board,
    COUNT(*) FILTER (WHERE meal = 'FB') AS full_board,
    COUNT(*) FILTER (WHERE meal = 'SC' OR meal = 'Undefined') AS self_catering
FROM hotel_bookings
GROUP BY hotel;
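
When a pivot has many values, typing each `FILTER` column by hand gets repetitive. One option is to generate the select list in the notebook's Python kernel. The sketch below is a hypothetical helper (`pivot_columns` is not part of any library) and is meant only for trusted, hard-coded values and aliases, since it builds SQL by string formatting.

```python
def pivot_columns(column, aliases):
    """Build one COUNT(*) FILTER (...) select-list entry per value.

    `aliases` maps each value of `column` to the output column name.
    Hypothetical helper: use only with trusted, hard-coded identifiers.
    """
    parts = []
    for value, alias in aliases.items():
        literal = value.replace("'", "''")  # escape single quotes in the SQL literal
        parts.append(f"COUNT(*) FILTER (WHERE {column} = '{literal}') AS {alias}")
    return ",\n    ".join(parts)


meal_aliases = {"BB": "bed_breakfast", "HB": "half_board", "FB": "full_board"}
select_list = pivot_columns("meal", meal_aliases)
query = f"SELECT\n    hotel,\n    {select_list}\nFROM hotel_bookings\nGROUP BY hotel;"
print(query)
```

The printed text can be pasted into a `%%sql` cell; the values still have to be known up front, because a SQL query cannot create a variable number of output columns on its own.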

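> **Compared to CASE WHEN:** `FILTER` is standard SQL (introduced in SQL:2003), not a Postgres-only extension; PostgreSQL has supported it since 9.4, though not every database implements it. Its main benefit is readability: the condition sits next to the aggregate instead of inside its argument. Any speed difference versus `CASE WHEN` is usually small, so measure before relying on it.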

---
## 5. Combining CTEs with Aggregations

For complex multi-step analysis, use CTEs (Common Table Expressions) to build up the query in layers.

**Business Question:** Find the cancellation rate per market segment per year, and flag segments where the rate exceeds 40%.

In [None]:
%%sql
WITH segment_stats AS (
    SELECT
        market_segment,
        arrival_date_year AS yr,
        COUNT(*) AS total_bookings,
        SUM(is_canceled) AS cancellations,
        ROUND(SUM(is_canceled)::numeric / COUNT(*) * 100, 2) AS cancel_rate
    FROM hotel_bookings
    GROUP BY market_segment, arrival_date_year
)
SELECT
    market_segment,
    yr,
    total_bookings,
    cancellations,
    cancel_rate,
    CASE
        WHEN cancel_rate > 40 THEN 'HIGH RISK'
        WHEN cancel_rate > 25 THEN 'MODERATE'
        ELSE 'LOW'
    END AS risk_level
FROM segment_stats
ORDER BY cancel_rate DESC;

> **Why CTEs?** They make complex queries readable and testable. You can run just the CTE portion to validate intermediate results before building the final SELECT.
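
The "run just the CTE portion" workflow can be sketched in Python by keeping the CTE body in its own string, executing it alone, and only then wrapping it in `WITH`. This example uses an in-memory SQLite table with made-up rows; the column names follow the query above, and `* 100.0` replaces the Postgres-specific `::numeric` cast.

```python
import sqlite3

# Made-up rows for illustration only: (market_segment, arrival_date_year, is_canceled)
rows = [
    ("Groups", 2016, 1), ("Groups", 2016, 1), ("Groups", 2016, 0),
    ("Direct", 2016, 1), ("Direct", 2016, 0), ("Direct", 2016, 0),
    ("Corporate", 2016, 0), ("Corporate", 2016, 0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hotel_bookings "
    "(market_segment TEXT, arrival_date_year INTEGER, is_canceled INTEGER)"
)
conn.executemany("INSERT INTO hotel_bookings VALUES (?, ?, ?)", rows)

# Step 1: the CTE body on its own, so the intermediate result can be inspected.
cte_body = """
    SELECT market_segment,
           arrival_date_year AS yr,
           COUNT(*) AS total_bookings,
           SUM(is_canceled) AS cancellations,
           ROUND(SUM(is_canceled) * 100.0 / COUNT(*), 2) AS cancel_rate
    FROM hotel_bookings
    GROUP BY market_segment, arrival_date_year
"""
intermediate = conn.execute(cte_body).fetchall()

# Sanity check before building on it: cancellations can never exceed bookings.
assert all(cancellations <= total for _, _, total, cancellations, _ in intermediate)

# Step 2: wrap the validated body in WITH and add the risk flag.
full_sql = f"""
WITH segment_stats AS ({cte_body})
SELECT market_segment,
       CASE WHEN cancel_rate > 40 THEN 'HIGH RISK'
            WHEN cancel_rate > 25 THEN 'MODERATE'
            ELSE 'LOW'
       END AS risk_level
FROM segment_stats
ORDER BY cancel_rate DESC
"""
final = conn.execute(full_sql).fetchall()
print(final)
# [('Groups', 'HIGH RISK'), ('Direct', 'MODERATE'), ('Corporate', 'LOW')]
```

In a `%%sql` cell the same habit applies: run the inner `SELECT` by itself first, then paste it into the `WITH` clause once the numbers look right.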

---
## Exercises

**Exercise 1:** Use `ROLLUP` to calculate total revenue (`SUM(adr)`) by `hotel → market_segment`, with subtotals per hotel and a grand total.

<details><summary>Hint</summary>

```sql
SELECT hotel, market_segment, ROUND(SUM(adr)::numeric, 2) AS revenue
FROM hotel_bookings
WHERE adr > 0
GROUP BY ROLLUP (hotel, market_segment)
ORDER BY hotel NULLS LAST, market_segment NULLS LAST;
```
</details>

In [None]:
%%sql
-- Exercise 1: Your query here


**Exercise 2:** Create a pivot report using `FILTER` that shows, for each `arrival_date_year`, the number of bookings from each `deposit_type` as separate columns.

<details><summary>Hint</summary>

```sql
SELECT
    arrival_date_year,
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE deposit_type = 'No Deposit') AS no_deposit,
    COUNT(*) FILTER (WHERE deposit_type = 'Non Refund') AS non_refund,
    COUNT(*) FILTER (WHERE deposit_type = 'Refundable') AS refundable
FROM hotel_bookings
GROUP BY arrival_date_year
ORDER BY arrival_date_year;
```
</details>

In [None]:
%%sql
-- Exercise 2: Your query here


**Exercise 3:** Write a CTE-based query that:
1. Calculates the average lead time per country
2. Ranks countries by average lead time
3. Returns only the top 10 countries with the longest average lead time

<details><summary>Hint</summary>

```sql
WITH country_lead AS (
    SELECT country, ROUND(AVG(lead_time)::numeric, 1) AS avg_lead, COUNT(*) AS bookings
    FROM hotel_bookings
    WHERE country IS NOT NULL
    GROUP BY country
    HAVING COUNT(*) >= 100  -- filter out rare countries
),
ranked AS (
    SELECT *, RANK() OVER (ORDER BY avg_lead DESC) AS rk
    FROM country_lead
)
SELECT * FROM ranked WHERE rk <= 10;
```
</details>

In [None]:
%%sql
-- Exercise 3: Your query here


---
## Key Takeaways

| Concept | What It Does | When to Use |
|---------|-------------|-------------|
| `GROUPING SETS` | Custom list of groupings | When you need specific, non-hierarchical groupings |
| `ROLLUP(A, B)` | `(A,B)`, `(A)`, `()` | Time hierarchies, org charts |
| `CUBE(A, B)` | All `2^n` combinations | Cross-tabulation reports |
| `GROUPING(col)` | Distinguishes real NULL from subtotal NULL | Labeling and sorting subtotal rows from GROUPING SETS/ROLLUP/CUBE |
| `FILTER (WHERE ...)` | Conditional aggregation | Pivoting, cleaner than CASE WHEN |
| CTEs | Named temporary result sets | Multi-step analysis, readability |

---

### Congratulations!

You've completed the SQL Mastery series. You now have hands-on experience with:

1. **Setup & Exploration** — connecting, loading data, quality checks
2. **Window Functions** — ranking, LAG/LEAD, running totals, NTILE
3. **Performance Tuning** — EXPLAIN, indexes, monitoring
4. **Complex Aggregations** — GROUPING SETS, ROLLUP, CUBE, FILTER

**Next steps:**
- Practice with your own business questions on this dataset
- Try combining window functions with GROUPING SETS
- Explore recursive CTEs for hierarchical data