# 09 — Data Analysis Applications

This is where everything comes together. These are the queries hotel analysts, revenue managers, and data engineers write every day: pivoting data for dashboards, computing rolling KPIs, deduplicating messy feeds, comparing year over year, and segmenting bookings into cohorts.

**What You'll Practice:**
- Pivoting with CASE WHEN and FILTER
- Cumulative sums (running totals)
- Moving averages with frame clauses
- Deduplication with ROW_NUMBER
- Year-over-Year comparison
- Cohort analysis

In [None]:
%load_ext sql
%sql postgresql://admin:password@postgres:5432/mastery_db

---
## Section A — Pivoting

Pivoting turns **row values** into **columns**. Two approaches in Postgres:
- `CASE WHEN` inside aggregates (works everywhere)
- `FILTER (WHERE ...)` — cleaner Postgres syntax
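
As a minimal sketch of the two approaches side by side, the query below pivots a tiny inline table both ways. The `sales` CTE and its columns are made up for illustration and are not part of the course datasets.

```sql
-- Hypothetical inline data: `sales` exists only inside this query.
WITH sales (region, channel, amount) AS (
    VALUES
        ('North', 'Online',  100),
        ('North', 'Offline',  40),
        ('South', 'Online',   70),
        ('South', 'Online',   30)
)
SELECT
    region,
    -- Approach 1: CASE WHEN inside the aggregate (portable)
    SUM(CASE WHEN channel = 'Online'  THEN amount ELSE 0 END) AS online_case,
    SUM(CASE WHEN channel = 'Offline' THEN amount ELSE 0 END) AS offline_case,
    -- Approach 2: FILTER; COALESCE covers groups with no matching rows
    COALESCE(SUM(amount) FILTER (WHERE channel = 'Online'),  0) AS online_filter,
    COALESCE(SUM(amount) FILTER (WHERE channel = 'Offline'), 0) AS offline_filter
FROM sales
GROUP BY region
ORDER BY region;
```

One difference to keep in mind: `SUM(...) FILTER (...)` returns NULL when no row in the group matches (here, South has no Offline rows), while the `CASE ... ELSE 0` form returns 0. `COUNT(*) FILTER (...)` returns 0 in that case, so it needs no `COALESCE`.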

---

### Quiz 1 — Monthly Revenue by Hotel (Pivot Table)

> **From: CFO**  
> *I need a table where each ROW is a month and each COLUMN is a hotel type, showing total revenue. Like a spreadsheet pivot table but in SQL. Include a total column.*

**Skills:** CASE WHEN pivot

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT
    arrival_date_month AS month,
    ROUND(SUM(CASE WHEN hotel = 'Resort Hotel' THEN adr ELSE 0 END)::numeric, 0) AS resort_revenue,
    ROUND(SUM(CASE WHEN hotel = 'City Hotel'   THEN adr ELSE 0 END)::numeric, 0) AS city_revenue,
    ROUND(SUM(adr)::numeric, 0) AS total_revenue
FROM hotel_bookings
WHERE adr > 0 AND is_canceled = 0
GROUP BY arrival_date_month
ORDER BY total_revenue DESC;
```
</details>

---

### Quiz 2 — Booking Status Breakdown by Market Segment (FILTER)

> **From: Commercial Director**  
> *For each market segment, pivot the `reservation_status` into columns: how many Check-Outs, Canceled, and No-Shows? Use the FILTER clause. Add a no-show rate percentage.*

**Skills:** FILTER clause pivot, percentage calculation
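
A percentage built from two counts needs a cast, because dividing one integer by another truncates in Postgres. The sketch below uses literal values only, so it runs without any table:

```sql
-- Literal values only; no table is needed to see the difference.
SELECT
    1 / 4                          AS integer_division,  -- integer / integer truncates to 0
    1::numeric / 4                 AS numeric_division,  -- 0.25 as numeric
    ROUND(1::numeric / 4 * 100, 2) AS pct;               -- 25.00
```

Casting either operand to `numeric` before dividing is enough; casting the result afterwards is too late.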

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT
    market_segment,
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE reservation_status = 'Check-Out') AS checked_out,
    COUNT(*) FILTER (WHERE reservation_status = 'Canceled')  AS canceled,
    COUNT(*) FILTER (WHERE reservation_status = 'No-Show')   AS no_show,
    ROUND(
        COUNT(*) FILTER (WHERE reservation_status = 'No-Show')::numeric / COUNT(*) * 100, 2
    ) AS no_show_rate_pct
FROM hotel_bookings
GROUP BY market_segment
ORDER BY no_show_rate_pct DESC;
```
</details>

---

### Quiz 3 — Room Type Demand Heatmap by Month

> **From: Revenue Manager**  
> *Create a pivot where rows are room types (A–G) and columns are months (use `arrival_month`). Values = booking count. This is our demand heatmap. Use the OTA table (`hotel_reservations`) and FILTER.*

**Skills:** FILTER pivot on hotel_reservations, multiple columns

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT
    room_type_reserved,
    COUNT(*) FILTER (WHERE arrival_month = 1)  AS jan,
    COUNT(*) FILTER (WHERE arrival_month = 2)  AS feb,
    COUNT(*) FILTER (WHERE arrival_month = 3)  AS mar,
    COUNT(*) FILTER (WHERE arrival_month = 4)  AS apr,
    COUNT(*) FILTER (WHERE arrival_month = 5)  AS may,
    COUNT(*) FILTER (WHERE arrival_month = 6)  AS jun,
    COUNT(*) FILTER (WHERE arrival_month = 7)  AS jul,
    COUNT(*) FILTER (WHERE arrival_month = 8)  AS aug,
    COUNT(*) FILTER (WHERE arrival_month = 9)  AS sep,
    COUNT(*) FILTER (WHERE arrival_month = 10) AS oct,
    COUNT(*) FILTER (WHERE arrival_month = 11) AS nov,
    COUNT(*) FILTER (WHERE arrival_month = 12) AS dec,
    COUNT(*) AS total
FROM hotel_reservations
GROUP BY room_type_reserved
ORDER BY room_type_reserved;
```
</details>

---
## Section B — Rolling Calculations

---

### Quiz 4 — Cumulative Revenue Over Time

> **From: Finance**  
> *Show me monthly revenue and a running (cumulative) total over time. Use `reservation_status_date` truncated to month. The running total should never reset; I just want one overall cumulative figure.*

**Skills:** SUM() OVER(ORDER BY ...) cumulative sum
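
As a warm-up, here is a minimal sketch of a running total on three made-up rows. The `monthly` CTE and its values are hypothetical and only illustrate the window syntax.

```sql
-- Hypothetical inline data, not taken from the course tables.
WITH monthly (month, revenue) AS (
    VALUES
        (DATE '2024-01-01', 100),
        (DATE '2024-02-01', 150),
        (DATE '2024-03-01',  50)
)
SELECT
    month,
    revenue,
    -- With ORDER BY and no explicit frame, the window runs from the
    -- first row of the partition up to the current row (and its peers).
    SUM(revenue) OVER (ORDER BY month) AS running_total
FROM monthly
ORDER BY month;
-- running_total: 100, 250, 300
```

The default frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, so rows that tie on the `ORDER BY` value share the same running total. Aggregating to one row per month first avoids such ties.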

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH monthly AS (
    SELECT
        DATE_TRUNC('month', reservation_status_date::date) AS month,
        ROUND(SUM(adr)::numeric, 0) AS monthly_revenue
    FROM hotel_bookings
    WHERE adr > 0 AND is_canceled = 0
    GROUP BY DATE_TRUNC('month', reservation_status_date::date)
)
SELECT
    month,
    monthly_revenue,
    SUM(monthly_revenue) OVER (ORDER BY month) AS cumulative_revenue
FROM monthly
ORDER BY month;
```
</details>

---

### Quiz 5 — 3-Month Moving Average ADR

> **From: Revenue Manager**  
> *Monthly ADR is noisy. Give me a 3-month moving average per hotel so I can see the trend. Use ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.*

**Skills:** AVG() OVER with frame clause, PARTITION BY
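
To see what the frame clause does before applying it to ADR, the sketch below runs a 3-row moving average over a made-up `readings` table with two series. All names and values here are hypothetical.

```sql
-- Hypothetical inline data: two independent series, 'a' and 'b'.
WITH readings (series, period, value) AS (
    VALUES
        ('a', 1, 10.0), ('a', 2, 20.0), ('a', 3, 30.0), ('a', 4, 40.0),
        ('b', 1,  5.0), ('b', 2, 15.0)
)
SELECT
    series,
    period,
    value,
    AVG(value) OVER (
        PARTITION BY series                        -- restart the window per series
        ORDER BY period
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW   -- current row plus up to 2 before it
    ) AS moving_avg_3
FROM readings
ORDER BY series, period;
-- series a: 10, 15, 20, 30   series b: 5, 10
```

The first two rows of each series average fewer than three values, because the frame cannot reach before the start of the partition.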

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH monthly AS (
    SELECT
        hotel,
        arrival_date_year AS yr,
        arrival_date_month AS mo,
        ROUND(AVG(adr)::numeric, 2) AS avg_adr
    FROM hotel_bookings
    WHERE adr > 0
    GROUP BY hotel, arrival_date_year, arrival_date_month
)
SELECT
    hotel, yr, mo, avg_adr,
    ROUND(
        AVG(avg_adr) OVER (
            PARTITION BY hotel
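            -- If arrival_date_month stores month names ('January', ...), ORDER BY mo
            -- sorts alphabetically; use EXTRACT(MONTH FROM TO_DATE(mo, 'Month')) instead.
            ORDER BY yr, mo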
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        )::numeric, 2
    ) AS moving_avg_3m
FROM monthly
ORDER BY hotel, yr, mo;
```
</details>

---

### Quiz 6 — Cumulative Cancellation Count per Hotel (Partition Reset)

> **From: Operations VP**  
> *Show the cumulative cancellation count per hotel over time (by year + month). The running total should reset for each hotel. I want to see which hotel's cancellations are growing faster.*

**Skills:** SUM() OVER(PARTITION BY ... ORDER BY ...) — cumulative with partition

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH monthly_cancels AS (
    SELECT
        hotel,
        arrival_date_year AS yr,
        arrival_date_month AS mo,
        SUM(is_canceled) AS monthly_cancellations
    FROM hotel_bookings
    GROUP BY hotel, arrival_date_year, arrival_date_month
)
SELECT
    hotel, yr, mo,
    monthly_cancellations,
    SUM(monthly_cancellations) OVER (
        PARTITION BY hotel
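        -- If arrival_date_month stores month names ('January', ...), ORDER BY mo
        -- sorts alphabetically; use EXTRACT(MONTH FROM TO_DATE(mo, 'Month')) instead.
        ORDER BY yr, mo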
    ) AS running_cancellations
FROM monthly_cancels
ORDER BY hotel, yr, mo;
```
</details>

---
## Section C — Deduplication

---

### Quiz 7 — Remove Duplicate Guest Records

> **From: CRM Manager**  
> *I suspect our `hotel_bookings` has duplicate guest entries — same name AND same email. Using ROW_NUMBER, identify duplicates and keep only the most recent booking (by `reservation_status_date`). Show the count of duplicates found and the cleaned result.*

**Skills:** ROW_NUMBER for dedup, PARTITION BY multiple columns
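
The pattern can be tried first on a toy table. This is a minimal sketch with a made-up `contacts` CTE (hypothetical names and dates) that keeps the newest row per name and email pair.

```sql
-- Hypothetical inline data: Ana appears twice, Ben once.
WITH contacts (name, email, updated_at) AS (
    VALUES
        ('Ana Silva', 'ana@example.com', DATE '2024-01-05'),
        ('Ana Silva', 'ana@example.com', DATE '2024-03-10'),
        ('Ben Okoye', 'ben@example.com', DATE '2024-02-01')
),
numbered AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY name, email      -- one numbering sequence per guest
            ORDER BY updated_at DESC      -- newest row gets rn = 1
        ) AS rn
    FROM contacts
)
SELECT name, email, updated_at
FROM numbered
WHERE rn = 1                              -- keep only the newest row per guest
ORDER BY name;
-- Ana Silva (2024-03-10), Ben Okoye (2024-02-01)
```

If two rows for the same guest share the same date, `ROW_NUMBER` picks one arbitrarily; adding a tiebreaker column to the `ORDER BY` makes the result deterministic.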

In [None]:
%%sql
-- Step 1: Find how many duplicates exist


In [None]:
%%sql
-- Step 2: Show deduplicated records (keep latest per name+email)


<details><summary>Hint</summary>

```sql
-- Step 1: Count duplicates
WITH numbered AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY name, email
            ORDER BY reservation_status_date DESC
        ) AS rn
    FROM hotel_bookings
    WHERE name IS NOT NULL AND email IS NOT NULL
)
SELECT
    COUNT(*) FILTER (WHERE rn = 1) AS unique_guests,
    COUNT(*) FILTER (WHERE rn > 1) AS duplicate_rows
FROM numbered;

-- Step 2: Deduplicated result
WITH numbered AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY name, email
            ORDER BY reservation_status_date DESC
        ) AS rn
    FROM hotel_bookings
    WHERE name IS NOT NULL AND email IS NOT NULL
)
SELECT hotel, name, email, country, adr, reservation_status_date
FROM numbered
WHERE rn = 1
ORDER BY reservation_status_date DESC
LIMIT 15;
```
</details>

---
## Section D — Year-over-Year & Cohort Analysis

---

### Quiz 8 — Year-over-Year ADR Comparison with LAG

> **From: Owner**  
> *For each month, show the average ADR in 2015 and 2016 side by side, and calculate the YoY change (%). Use LAG or a self-join approach — your choice.*

**Skills:** LAG for YoY, percentage change
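
A small sketch of the LAG approach on made-up numbers follows. The `yearly` CTE is hypothetical, and months are stored as integers here purely to keep the example short.

```sql
-- Hypothetical inline data: two months observed in two years.
WITH yearly (yr, mo, avg_adr) AS (
    VALUES
        (2015, 7, 100.00),
        (2016, 7, 110.00),
        (2015, 8, 120.00),
        (2016, 8, 114.00)
),
with_prev AS (
    SELECT yr, mo, avg_adr,
        -- Same month, previous year: partition by month, order by year
        LAG(avg_adr) OVER (PARTITION BY mo ORDER BY yr) AS prev_adr
    FROM yearly
)
SELECT
    yr, mo, avg_adr, prev_adr,
    ROUND((avg_adr - prev_adr) / NULLIF(prev_adr, 0) * 100, 2) AS yoy_change_pct
FROM with_prev
ORDER BY mo, yr;
-- 2016 rows: +10.00 for month 7, -5.00 for month 8; 2015 rows are NULL
```

Computing `LAG` once in a CTE and reusing the column keeps the percentage formula readable. The 2015 rows have no earlier year to compare against, so both `prev_adr` and `yoy_change_pct` are NULL.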

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH yearly AS (
    SELECT
        arrival_date_year AS yr,
        arrival_date_month AS mo,
        ROUND(AVG(adr)::numeric, 2) AS avg_adr
    FROM hotel_bookings
    WHERE adr > 0 AND arrival_date_year IN (2015, 2016)
    GROUP BY arrival_date_year, arrival_date_month
)
SELECT
    yr, mo, avg_adr,
    LAG(avg_adr) OVER (PARTITION BY mo ORDER BY yr) AS prev_year_adr,
    ROUND(
        (avg_adr - LAG(avg_adr) OVER (PARTITION BY mo ORDER BY yr)) /
        NULLIF(LAG(avg_adr) OVER (PARTITION BY mo ORDER BY yr), 0) * 100, 2
    ) AS yoy_change_pct
FROM yearly
ORDER BY mo, yr;
```
</details>

---

### Quiz 9 — Lead Time Cohort Performance

> **From: Revenue Strategy**  
> *Segment bookings into lead time cohorts (0–7 days = 'Last Minute', 8–30 = 'Short', 31–90 = 'Medium', 91+ = 'Long Advance'). For each cohort, show: booking count, cancellation rate, average ADR, average total nights, and estimated revenue. Rank cohorts by revenue.*

**Skills:** CASE WHEN cohorts, multiple aggregates, RANK()
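
The bucketing step relies on `CASE` evaluating its branches top to bottom, so each branch only needs an upper bound. A minimal sketch on a made-up `orders` CTE (hypothetical, with three buckets instead of four) shows how boundary values fall:

```sql
-- Hypothetical inline data chosen to sit on the bucket boundaries.
WITH orders (days_ahead) AS (
    VALUES (0), (7), (8), (30), (31), (120)
)
SELECT
    CASE
        WHEN days_ahead <= 7  THEN 'Last Minute'   -- 0 to 7
        WHEN days_ahead <= 30 THEN 'Short'         -- 8 to 30, since <= 7 already matched above
        ELSE 'Long'                                -- everything else
    END AS bucket,
    COUNT(*) AS n
FROM orders
GROUP BY 1
ORDER BY MIN(days_ahead);
-- Last Minute: 2, Short: 2, Long: 2
```

Ordering by `MIN(days_ahead)` lists the buckets in their natural sequence rather than alphabetically.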

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH cohorts AS (
    SELECT *,
        CASE
            WHEN lead_time <= 7  THEN 'Last Minute (0-7d)'
            WHEN lead_time <= 30 THEN 'Short (8-30d)'
            WHEN lead_time <= 90 THEN 'Medium (31-90d)'
            ELSE 'Long Advance (91+d)'
        END AS lead_cohort
    FROM hotel_bookings
    WHERE adr > 0
)
SELECT
    lead_cohort,
    COUNT(*) AS bookings,
    -- Canceled bookings must stay in the input, otherwise this rate is always 0
    ROUND(AVG(is_canceled)::numeric * 100, 1) AS cancel_rate_pct,
    ROUND(AVG(adr)::numeric, 2) AS avg_adr,
    ROUND(AVG(stays_in_weekend_nights + stays_in_week_nights)::numeric, 1) AS avg_nights,
    -- Revenue counts only bookings that were not canceled
    ROUND(SUM(CASE WHEN is_canceled = 0
                   THEN adr * (stays_in_weekend_nights + stays_in_week_nights)
                   ELSE 0 END)::numeric, 0) AS est_revenue,
    RANK() OVER (
        ORDER BY SUM(CASE WHEN is_canceled = 0
                          THEN adr * (stays_in_weekend_nights + stays_in_week_nights)
                          ELSE 0 END) DESC
    ) AS revenue_rank
FROM cohorts
GROUP BY lead_cohort
ORDER BY revenue_rank;
```
</details>

---

### Quiz 10 — Deposit Type Impact Analysis

> **From: Finance Director**  
> *Build a comprehensive analysis of deposit types: for each type, pivot reservation_status into columns, calculate revenue, and show the percentage of total revenue each deposit type contributes. Use a CTE chain.*

**Skills:** CTE chain + FILTER pivot + window function for % of total
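
The percent-of-total step uses an empty `OVER ()`, which makes the aggregate span every row of the result. A minimal sketch with a made-up `totals` CTE (hypothetical categories and amounts):

```sql
-- Hypothetical inline data: three categories summing to 1000.
WITH totals (category, revenue) AS (
    VALUES ('A', 600.0), ('B', 300.0), ('C', 100.0)
)
SELECT
    category,
    revenue,
    -- SUM(...) OVER () is the grand total, repeated on every row
    ROUND(revenue / SUM(revenue) OVER () * 100, 2) AS pct_of_total
FROM totals
ORDER BY revenue DESC;
-- pct_of_total: 60.00, 30.00, 10.00
```

Because the window has no `PARTITION BY`, there is no need for a separate subquery or join to fetch the grand total.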

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH deposit_analysis AS (
    SELECT
        deposit_type,
        COUNT(*) AS total_bookings,
        COUNT(*) FILTER (WHERE reservation_status = 'Check-Out') AS checked_out,
        COUNT(*) FILTER (WHERE reservation_status = 'Canceled')  AS canceled,
        COUNT(*) FILTER (WHERE reservation_status = 'No-Show')   AS no_show,
        ROUND(SUM(CASE WHEN is_canceled = 0 THEN adr * (stays_in_weekend_nights + stays_in_week_nights) ELSE 0 END)::numeric, 0) AS realized_revenue
    FROM hotel_bookings
    GROUP BY deposit_type
)
SELECT
    *,
    ROUND(realized_revenue::numeric / SUM(realized_revenue) OVER () * 100, 2) AS pct_of_total_revenue
FROM deposit_analysis
ORDER BY realized_revenue DESC;
```
</details>

---
## Bonus — Free Play

In [None]:
%%sql


In [None]:
%%sql


---
**Next:** [10_final_project.ipynb](./10_final_project.ipynb) — the capstone. No hints. Just you, two hotel datasets, and real business questions.