# 06 — JOINs & Multi-Table Analysis

In real hotel tech, data lives across **many tables** — bookings, guests, rooms, agents, pricing. JOINs are how you bring it all together.

**What You'll Practice:**
- INNER, LEFT, RIGHT, FULL OUTER JOIN
- Self JOINs (comparing rows within the same table)
- CROSS JOINs (generating combinations)
- UNION / UNION ALL
- Multi-column JOINs

---

### JOIN Cheat Sheet

```
INNER JOIN  →  Only rows that match in BOTH tables
LEFT JOIN   →  ALL rows from LEFT + matches from RIGHT (NULLs if no match)
RIGHT JOIN  →  ALL rows from RIGHT + matches from LEFT (NULLs if no match)
FULL OUTER  →  ALL rows from BOTH (NULLs on either side if no match)
CROSS JOIN  →  Every row × every row (cartesian product)
SELF JOIN   →  Table joined with itself
```
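
To make the row counts in the cheat sheet concrete, here is a minimal sketch on toy data. It uses Python's built-in `sqlite3` as a stand-in for the course's PostgreSQL database; the `bookings` and `agents` tables below and their values are invented for illustration and are not the course tables.

```python
import sqlite3

# Toy data (hypothetical): 3 bookings, 2 agents in the directory.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE bookings (booking_id INTEGER, agent_id INTEGER);
    CREATE TABLE agents   (agent_id INTEGER, agent_name TEXT);
    INSERT INTO bookings VALUES (1, 9), (2, 9), (3, 99);  -- agent 99 is not in the directory
    INSERT INTO agents   VALUES (9, 'TravelBee'), (14, 'BookDirect EU');
""")

def count_rows(join_clause):
    """Return how many rows a given join between the toy tables produces."""
    sql = f"SELECT COUNT(*) FROM bookings b {join_clause}"
    return con.execute(sql).fetchone()[0]

inner_rows = count_rows("INNER JOIN agents a ON b.agent_id = a.agent_id")
left_rows  = count_rows("LEFT JOIN agents a ON b.agent_id = a.agent_id")
cross_rows = count_rows("CROSS JOIN agents a")

print(inner_rows, left_rows, cross_rows)  # -> 2 3 6
```

The INNER JOIN keeps only the two bookings whose agent exists, the LEFT JOIN keeps all three bookings, and the CROSS JOIN pairs every booking with every agent (3 × 2).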

---
## Setup — Load Data & Create Helper Tables

In [1]:
%load_ext sql
%sql postgresql://admin:password@postgres:5432/mastery_db

In [2]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://admin:password@postgres:5432/mastery_db")

# Load hotel_bookings (119k rows — the PMS/legacy system)
df_booking = pd.read_csv('/app/data/hotel_booking.csv')
df_booking.to_sql('hotel_bookings', engine, if_exists='replace', index=False)
print(f"hotel_bookings: {len(df_booking):,} rows")

# Load hotel_reservations (36k rows — the OTA/channel manager system)
df_res = pd.read_csv('/app/data/hotel_reservation.csv')
df_res.columns = df_res.columns.str.lower()
df_res.to_sql('hotel_reservations', engine, if_exists='replace', index=False)
print(f"hotel_reservations: {len(df_res):,} rows")

hotel_bookings: 119,390 rows
hotel_reservations: 36,275 rows


In [None]:
%%sql
-- Room rate card (used for CROSS JOIN and pricing exercises)
DROP TABLE IF EXISTS room_rates CASCADE;
CREATE TABLE room_rates (
    room_type   VARCHAR(5),
    season      VARCHAR(10),
    base_rate   NUMERIC(8,2)
);
INSERT INTO room_rates VALUES
    ('A', 'low',     80.00),  ('A', 'mid',    110.00), ('A', 'high',   150.00),
    ('B', 'low',     95.00),  ('B', 'mid',    130.00), ('B', 'high',   180.00),
    ('C', 'low',    120.00),  ('C', 'mid',    160.00), ('C', 'high',   220.00),
    ('D', 'low',    150.00),  ('D', 'mid',    200.00), ('D', 'high',   280.00),
    ('E', 'low',    200.00),  ('E', 'mid',    260.00), ('E', 'high',   350.00),
    ('F', 'low',    250.00),  ('F', 'mid',    320.00), ('F', 'high',   420.00),
    ('G', 'low',    300.00),  ('G', 'mid',    400.00), ('G', 'high',   520.00);

-- Agent directory
DROP TABLE IF EXISTS agents CASCADE;
CREATE TABLE agents (
    agent_id       NUMERIC,
    agent_name     VARCHAR(50),
    region         VARCHAR(30),
    commission_pct NUMERIC(4,2)
);
INSERT INTO agents VALUES
    (9,   'TravelBee',        'Europe',   0.12),
    (14,  'BookDirect EU',    'Europe',   0.10),
    (7,   'SunHolidays',      'Europe',   0.15),
    (240, 'AsiaGetaway',      'Asia',     0.13),
    (304, 'GlobalTrips',      'Americas', 0.11),
    (250, 'QuickBook',        'Americas', 0.09),
    (241, 'TravelMax',        'Asia',     0.14),
    (1,   'PremiumStays',     'Europe',   0.08);

-- Country lookup
DROP TABLE IF EXISTS country_lookup CASCADE;
CREATE TABLE country_lookup (
    country_code VARCHAR(5),
    country_name VARCHAR(60),
    continent    VARCHAR(20)
);
INSERT INTO country_lookup VALUES
    ('PRT', 'Portugal',       'Europe'),
    ('GBR', 'United Kingdom', 'Europe'),
    ('FRA', 'France',         'Europe'),
    ('ESP', 'Spain',          'Europe'),
    ('DEU', 'Germany',        'Europe'),
    ('ITA', 'Italy',          'Europe'),
    ('IRL', 'Ireland',        'Europe'),
    ('BEL', 'Belgium',        'Europe'),
    ('BRA', 'Brazil',         'South America'),
    ('USA', 'United States',  'North America'),
    ('CHN', 'China',          'Asia'),
    ('AGO', 'Angola',         'Africa'),
    ('ISR', 'Israel',         'Asia'),
    ('NLD', 'Netherlands',    'Europe'),
    ('AUT', 'Austria',        'Europe');

SELECT 'Helper tables created' AS status;

---
## Section A — INNER & LEFT JOINs

---

### Quiz 1 — Agent Performance Report

> **From: Revenue Manager**  
> *Hey, I need a report showing each agent's name, region, total bookings, and the commission they'd earn (commission_pct × total ADR). Can you join the bookings with our agent directory? Only include agents that appear in both tables.*

**Skills:** INNER JOIN, GROUP BY, aggregate math

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT a.agent_name, a.region,
       COUNT(*) AS total_bookings,
       ROUND(SUM(b.adr * a.commission_pct)::numeric, 2) AS total_commission
FROM hotel_bookings b
INNER JOIN agents a ON b.agent = a.agent_id
GROUP BY a.agent_name, a.region
ORDER BY total_commission DESC;
```
</details>
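
One thing worth checking in a report like this is what happens to bookings that have no agent. The sketch below runs a similar query on invented rows, using Python's `sqlite3` as a stand-in for PostgreSQL. It assumes a missing agent is stored as NULL; such rows never satisfy `b.agent = a.agent_id`, so the INNER JOIN leaves them out without any warning.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE bookings (booking_id INTEGER, agent REAL, adr REAL);
    CREATE TABLE agents   (agent_id INTEGER, agent_name TEXT, commission_pct REAL);
    -- Booking 3 has no agent (NULL), mimicking a blank cell in the CSV
    INSERT INTO bookings VALUES (1, 9.0, 100.0), (2, 9.0, 50.0), (3, NULL, 80.0), (4, 14.0, 200.0);
    INSERT INTO agents   VALUES (9, 'TravelBee', 0.12), (14, 'BookDirect EU', 0.10);
""")

report = con.execute("""
    SELECT a.agent_name,
           COUNT(*)                                AS total_bookings,
           ROUND(SUM(b.adr * a.commission_pct), 2) AS total_commission
    FROM bookings b
    INNER JOIN agents a ON b.agent = a.agent_id
    GROUP BY a.agent_name
    ORDER BY a.agent_name
""").fetchall()

all_bookings    = con.execute("SELECT COUNT(*) FROM bookings").fetchone()[0]
joined_bookings = sum(row[1] for row in report)

for row in report:
    print(row)
print(all_bookings, joined_bookings)  # -> 4 3
```

Comparing the booking count before and after the join is a quick way to see how many rows the INNER JOIN dropped.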

---

### Quiz 2 — Country Name Enrichment

> **From: Marketing**  
> *Our bookings table only has 3-letter country codes. Can you give me total bookings and average ADR per country with the FULL country name and continent? I also want to see countries that have bookings but aren't in our lookup table yet — show them as 'Unknown'.*

**Skills:** LEFT JOIN, COALESCE

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT
    b.country AS code,
    COALESCE(c.country_name, 'Unknown') AS country_name,
    COALESCE(c.continent, 'Unknown')    AS continent,
    COUNT(*)                            AS bookings,
    ROUND(AVG(b.adr)::numeric, 2)       AS avg_adr
FROM hotel_bookings b
LEFT JOIN country_lookup c ON b.country = c.country_code
WHERE b.country IS NOT NULL
GROUP BY b.country, c.country_name, c.continent
ORDER BY bookings DESC
LIMIT 15;
```
</details>
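
As a small illustration of the LEFT JOIN + COALESCE pattern, the sketch below uses invented rows and Python's `sqlite3` as a stand-in for PostgreSQL. The code `'XYZ'` is a made-up country code that is deliberately absent from the toy lookup table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE bookings       (country TEXT, adr REAL);
    CREATE TABLE country_lookup (country_code TEXT, country_name TEXT);
    INSERT INTO bookings       VALUES ('PRT', 100.0), ('PRT', 120.0), ('XYZ', 90.0);
    INSERT INTO country_lookup VALUES ('PRT', 'Portugal'), ('GBR', 'United Kingdom');
""")

rows = con.execute("""
    SELECT b.country,
           COALESCE(c.country_name, 'Unknown') AS country_name,
           COUNT(*)                            AS n_bookings
    FROM bookings b
    LEFT JOIN country_lookup c ON b.country = c.country_code
    GROUP BY b.country, c.country_name
    ORDER BY n_bookings DESC, b.country
""").fetchall()

print(rows)  # -> [('PRT', 'Portugal', 2), ('XYZ', 'Unknown', 1)]
```

The unmatched code survives the join with a NULL name, which COALESCE turns into `'Unknown'`. `'GBR'` does not appear because a LEFT JOIN only preserves rows from the left (bookings) side.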

---

### Quiz 3 — Agents with Zero Bookings

> **From: Partnerships Team**  
> *We signed up 8 agents but I suspect some haven't produced any bookings at all. Can you find which agents in our directory have ZERO bookings?*

**Skills:** LEFT JOIN + WHERE ... IS NULL (anti-join pattern)

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT a.agent_id, a.agent_name, a.region
FROM agents a
LEFT JOIN hotel_bookings b ON a.agent_id = b.agent
WHERE b.agent IS NULL;
```
</details>
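
The anti-join can also be written with `NOT EXISTS`. The sketch below, on invented rows with Python's `sqlite3` standing in for PostgreSQL, checks that both forms return the same agents.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE agents   (agent_id INTEGER, agent_name TEXT);
    CREATE TABLE bookings (booking_id INTEGER, agent INTEGER);
    INSERT INTO agents   VALUES (1, 'PremiumStays'), (9, 'TravelBee'), (14, 'BookDirect EU');
    INSERT INTO bookings VALUES (1, 9), (2, 9), (3, NULL);
""")

# Form 1: LEFT JOIN, then keep only the rows where nothing matched
left_join_version = con.execute("""
    SELECT a.agent_id, a.agent_name
    FROM agents a
    LEFT JOIN bookings b ON a.agent_id = b.agent
    WHERE b.agent IS NULL
    ORDER BY a.agent_id
""").fetchall()

# Form 2: correlated NOT EXISTS
not_exists_version = con.execute("""
    SELECT a.agent_id, a.agent_name
    FROM agents a
    WHERE NOT EXISTS (SELECT 1 FROM bookings b WHERE b.agent = a.agent_id)
    ORDER BY a.agent_id
""").fetchall()

print(left_join_version)  # -> [(1, 'PremiumStays'), (14, 'BookDirect EU')]
print(left_join_version == not_exists_version)  # -> True
```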

---
## Section B — FULL OUTER JOIN & System Reconciliation

---

### Quiz 4 — Room Type Mismatch Audit

> **From: Revenue Manager**  
> *We have a rate card (`room_rates`) with prices for room types A–G, and bookings that reference `reserved_room_type`. I need to know: (1) which room types have rates but no bookings, and (2) which room types appear in bookings but have no rate card entry. Use a FULL OUTER JOIN.*

**Skills:** FULL OUTER JOIN, DISTINCT, NULL filtering

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH booked_types AS (
    SELECT DISTINCT reserved_room_type AS room_type
    FROM hotel_bookings
),
rate_types AS (
    SELECT DISTINCT room_type
    FROM room_rates
)
SELECT
    COALESCE(r.room_type, b.room_type) AS room_type,
    CASE
        WHEN r.room_type IS NULL THEN 'No rate card'
        WHEN b.room_type IS NULL THEN 'No bookings'
        ELSE 'Matched'
    END AS status
FROM rate_types r
FULL OUTER JOIN booked_types b ON r.room_type = b.room_type
ORDER BY room_type;
```
</details>
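
To see what the three statuses look like on a tiny example, the sketch below uses invented room types and Python's `sqlite3`. Because older SQLite builds may not support `FULL OUTER JOIN`, it emulates one as a LEFT JOIN plus the unmatched rows from the other side; that is a substitute technique for this sketch only, and in PostgreSQL you can write `FULL OUTER JOIN` directly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rate_types   (room_type TEXT);
    CREATE TABLE booked_types (room_type TEXT);
    INSERT INTO rate_types   VALUES ('A'), ('B'), ('G');  -- 'G' is never booked
    INSERT INTO booked_types VALUES ('A'), ('B'), ('H');  -- 'H' has no rate
""")

audit = con.execute("""
    -- Every rate-card type, matched or not
    SELECT r.room_type AS room_type,
           CASE WHEN b.room_type IS NULL THEN 'No bookings' ELSE 'Matched' END AS status
    FROM rate_types r
    LEFT JOIN booked_types b ON r.room_type = b.room_type
    UNION ALL
    -- Plus booked types that have no rate-card entry
    SELECT b.room_type, 'No rate card'
    FROM booked_types b
    LEFT JOIN rate_types r ON r.room_type = b.room_type
    WHERE r.room_type IS NULL
    ORDER BY 1
""").fetchall()

print(audit)
# -> [('A', 'Matched'), ('B', 'Matched'), ('G', 'No bookings'), ('H', 'No rate card')]
```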

---
## Section C — Self JOINs

---

### Quiz 5 — Guests with Similar Pricing

> **From: Pricing Analyst**  
> *I'm investigating rate parity. Can you find pairs of bookings from DIFFERENT countries but the same hotel where the ADR is within $2 of each other? Limit to 20 pairs. Make sure each pair only shows once (no duplicates reversed).*

**Skills:** Self JOIN, ABS, inequality condition to avoid duplicates

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT
    b1.hotel, b1.country AS country_1, b1.adr AS adr_1,
    b2.country AS country_2, b2.adr AS adr_2,
    ROUND(ABS(b1.adr - b2.adr)::numeric, 2) AS price_diff
FROM hotel_bookings b1
INNER JOIN hotel_bookings b2
    ON b1.hotel = b2.hotel
    AND b1.country < b2.country
    AND ABS(b1.adr - b2.adr) <= 2
WHERE b1.adr > 0 AND b2.adr > 0
LIMIT 20;
```
</details>
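
The effect of the inequality in the join condition is easiest to see on a handful of rows. In this sketch (invented data, Python's `sqlite3` standing in for PostgreSQL), `<>` returns each pair twice, once in each direction, while `<` returns it once.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE bookings (hotel TEXT, country TEXT, adr REAL);
    INSERT INTO bookings VALUES
        ('City Hotel', 'PRT', 100.0),
        ('City Hotel', 'GBR', 101.0),
        ('City Hotel', 'FRA', 150.0);
""")

def pairs(country_condition):
    """Self-join the toy bookings and return the matching country pairs."""
    sql = f"""
        SELECT b1.country, b2.country
        FROM bookings b1
        JOIN bookings b2
          ON b1.hotel = b2.hotel
         AND {country_condition}
         AND ABS(b1.adr - b2.adr) <= 2
        ORDER BY b1.country, b2.country
    """
    return con.execute(sql).fetchall()

both_directions = pairs("b1.country <> b2.country")
each_pair_once  = pairs("b1.country < b2.country")

print(both_directions)  # -> [('GBR', 'PRT'), ('PRT', 'GBR')]
print(each_pair_once)   # -> [('GBR', 'PRT')]
```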

---

### Quiz 6 — Upsell Opportunity: Room Upgrade Pairs

> **From: Front Desk Manager**  
> *When the reserved room type differs from the assigned room type, that's an upgrade or downgrade. Find all bookings where the guest was assigned a DIFFERENT room type than reserved. Show the reserved type, assigned type, and the rate difference (from `room_rates` at 'mid' season).*

**Skills:** Joining `room_rates` twice under different aliases (once per room-type column), filtering mismatches

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT
    b.hotel,
    b.reserved_room_type,
    b.assigned_room_type,
    rr_res.base_rate AS reserved_rate,
    rr_asgn.base_rate AS assigned_rate,
    rr_asgn.base_rate - rr_res.base_rate AS rate_diff,
    COUNT(*) AS occurrences
FROM hotel_bookings b
JOIN room_rates rr_res  ON b.reserved_room_type = rr_res.room_type  AND rr_res.season = 'mid'
JOIN room_rates rr_asgn ON b.assigned_room_type = rr_asgn.room_type AND rr_asgn.season = 'mid'
WHERE b.reserved_room_type <> b.assigned_room_type
GROUP BY b.hotel, b.reserved_room_type, b.assigned_room_type,
         rr_res.base_rate, rr_asgn.base_rate
ORDER BY occurrences DESC
LIMIT 10;
```
</details>
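
A point to keep in mind when joining the rate card twice: an inner join drops any booking whose reserved or assigned type has no rate. The sketch below uses invented rows and Python's `sqlite3` as a stand-in for PostgreSQL; room type `'K'` is assumed, for this toy example only, to be missing from the rate card. It compares an inner join with a LEFT JOIN on the assigned side.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE room_rates (room_type TEXT, season TEXT, base_rate INTEGER);
    CREATE TABLE bookings   (reserved_room_type TEXT, assigned_room_type TEXT);
    INSERT INTO room_rates VALUES ('A', 'mid', 110), ('D', 'mid', 200), ('D', 'high', 280);
    INSERT INTO bookings   VALUES ('A', 'D'), ('A', 'A'), ('A', 'K');  -- 'K' has no rate
""")

# Same query twice; only the join type on the assigned-room alias changes.
query = """
    SELECT b.reserved_room_type, b.assigned_room_type,
           rr_asgn.base_rate - rr_res.base_rate AS rate_diff
    FROM bookings b
    JOIN room_rates rr_res
      ON b.reserved_room_type = rr_res.room_type AND rr_res.season = 'mid'
    {join_type} room_rates rr_asgn
      ON b.assigned_room_type = rr_asgn.room_type AND rr_asgn.season = 'mid'
    WHERE b.reserved_room_type <> b.assigned_room_type
    ORDER BY b.assigned_room_type
"""
inner_result = con.execute(query.format(join_type="JOIN")).fetchall()
left_result  = con.execute(query.format(join_type="LEFT JOIN")).fetchall()

print(inner_result)  # -> [('A', 'D', 90)]
print(left_result)   # -> [('A', 'D', 90), ('A', 'K', None)]
```

The `season = 'mid'` condition in each `ON` clause also matters: without it, the `('D', 'high')` rate row would match as well and duplicate the booking.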

---
## Section D — CROSS JOIN

---

### Quiz 7 — Generate Full Rate Card

> **From: Revenue Manager**  
> *I need a complete rate card that shows every combination of room type × season × hotel type (Resort Hotel, City Hotel). That means 7 rooms × 3 seasons × 2 hotels = 42 rows. Use a CROSS JOIN to generate it, then add the base_rate from `room_rates`.*

**Skills:** CROSS JOIN, generating combinations

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH hotels AS (
    SELECT DISTINCT hotel FROM hotel_bookings
)
SELECT
    h.hotel,
    rr.room_type,
    rr.season,
    rr.base_rate
FROM hotels h
CROSS JOIN room_rates rr
ORDER BY h.hotel, rr.room_type, rr.season;
```
</details>
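
An alternative reading of the request is to build the grid of combinations first and attach rates afterwards, which makes gaps in the rate card visible. The sketch below does that on a deliberately incomplete toy rate card, using Python's `sqlite3` as a stand-in for PostgreSQL; the CTE names `hotels`, `room_types`, and `seasons` are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE room_rates (room_type TEXT, season TEXT, base_rate REAL);
    -- Toy rate card with a gap: ('B', 'high') is missing
    INSERT INTO room_rates VALUES ('A', 'low', 80), ('A', 'high', 150), ('B', 'low', 95);
""")

grid = con.execute("""
    WITH hotels(hotel)         AS (VALUES ('City Hotel'), ('Resort Hotel')),
         room_types(room_type) AS (VALUES ('A'), ('B')),
         seasons(season)       AS (VALUES ('low'), ('high'))
    SELECT h.hotel, t.room_type, s.season, rr.base_rate
    FROM hotels h
    CROSS JOIN room_types t
    CROSS JOIN seasons s
    LEFT JOIN room_rates rr
           ON rr.room_type = t.room_type AND rr.season = s.season
    ORDER BY h.hotel, t.room_type, s.season
""").fetchall()

# Combinations that exist in the grid but have no rate attached
missing = [row for row in grid if row[3] is None]

print(len(grid))     # -> 8  (2 hotels x 2 room types x 2 seasons)
print(len(missing))  # -> 2  (('B', 'high') once per hotel)
```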

---
## Section E — UNION / UNION ALL

---

### Quiz 8 — Unified Booking Pipeline

> **From: CTO**  
> *We have two booking systems — `hotel_bookings` (PMS) and `hotel_reservations` (OTA channel). I need a unified view with these columns from BOTH tables: `source` (label which system), `lead_time`, `market_segment`, `is_canceled` (1/0), `price` (adr or avg_price_per_room). Stack them with UNION ALL.*

**Skills:** UNION ALL, column aliasing to match schemas

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
SELECT
    'PMS' AS source,
    lead_time,
    market_segment,
    is_canceled,
    adr AS price
FROM hotel_bookings

UNION ALL

SELECT
    'OTA' AS source,
    lead_time,
    market_segment_type AS market_segment,
    CASE WHEN booking_status = 'Canceled' THEN 1 ELSE 0 END AS is_canceled,
    avg_price_per_room AS price
FROM hotel_reservations
LIMIT 20;  -- LIMIT applies to the combined UNION ALL result, not only to the OTA branch
```
</details>
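
Two details of stacking queries are easy to miss: `UNION` removes duplicate rows while `UNION ALL` keeps them, and a trailing `LIMIT` applies to the whole stacked result. The sketch below shows both on invented prices, with Python's `sqlite3` standing in for PostgreSQL; the `pms` and `ota` table names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pms (adr REAL);
    CREATE TABLE ota (avg_price_per_room REAL);
    INSERT INTO pms VALUES (100.0), (100.0), (120.0);
    INSERT INTO ota VALUES (90.0), (90.0), (95.0);
""")

# {op} is swapped between UNION ALL and UNION below.
stacked = (
    "SELECT 'PMS' AS source, adr AS price FROM pms "
    "{op} "
    "SELECT 'OTA', avg_price_per_room FROM ota"
)

union_all_rows = con.execute(stacked.format(op="UNION ALL")).fetchall()
union_rows     = con.execute(stacked.format(op="UNION")).fetchall()
limited_rows   = con.execute(stacked.format(op="UNION ALL") + " LIMIT 5").fetchall()

print(len(union_all_rows))  # -> 6  (every row kept)
print(len(union_rows))      # -> 4  (duplicate (source, price) rows removed)
print(len(limited_rows))    # -> 5  (LIMIT counts rows from both branches)
```

For a booking pipeline, `UNION ALL` is usually the safer choice, since two distinct bookings can legitimately share the same values.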

---

### Quiz 9 — Compare Systems Side by Side

> **From: Data Engineering Lead**  
> *Using the unified view from Quiz 8 (make it a CTE), calculate the total bookings, cancellation rate, and average price PER source system. Which system has the higher cancellation rate?*

**Skills:** CTE wrapping a UNION ALL, GROUP BY on the combined result

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH unified AS (
    SELECT 'PMS' AS source, lead_time, market_segment, is_canceled, adr AS price
    FROM hotel_bookings
    UNION ALL
    SELECT 'OTA', lead_time, market_segment_type,
           CASE WHEN booking_status = 'Canceled' THEN 1 ELSE 0 END,
           avg_price_per_room
    FROM hotel_reservations
)
SELECT
    source,
    COUNT(*) AS total_bookings,
    ROUND(AVG(is_canceled)::numeric * 100, 2) AS cancel_rate_pct,
    ROUND(AVG(price)::numeric, 2) AS avg_price
FROM unified
GROUP BY source;
```
</details>
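
Averaging a 0/1 flag works as a rate because the mean of the flag equals the share of rows where it is 1. The sketch below shows this on four invented rows, with Python's `sqlite3` standing in for PostgreSQL, and contrasts it with dividing two integers, which truncates in SQLite. Whether the same truncation appears in PostgreSQL depends on the column types involved, so treat the second column as a cautionary example rather than a statement about the course tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE unified (source TEXT, is_canceled INTEGER);
    -- 1 canceled booking out of 4
    INSERT INTO unified VALUES ('PMS', 1), ('PMS', 0), ('PMS', 0), ('PMS', 0);
""")

avg_rate, int_div_rate = con.execute("""
    SELECT AVG(is_canceled) * 100            AS cancel_rate_pct,
           SUM(is_canceled) / COUNT(*) * 100 AS truncated_rate_pct
    FROM unified
""").fetchone()

print(avg_rate)      # -> 25.0
print(int_div_rate)  # -> 0  (1 / 4 is integer division here)
```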

---
## Section F — Multi-Column JOINs

---

### Quiz 10 — Seasonal Rate vs Actual ADR

> **From: Revenue Manager**  
> *Join bookings to `room_rates` on room type AND season. Map months June–August to 'high', March–May and September–November to 'mid', and December–February to 'low'. Then compare the actual ADR against the base_rate. Are we over- or under-charging?*

**Skills:** Multi-column JOIN with a computed season column, CASE WHEN

In [None]:
%%sql


<details><summary>Hint</summary>

```sql
WITH bookings_with_season AS (
    SELECT *,
        CASE
            WHEN arrival_date_month IN ('June','July','August') THEN 'high'
            WHEN arrival_date_month IN ('March','April','May','September','October','November') THEN 'mid'
            ELSE 'low'
        END AS season
    FROM hotel_bookings
    WHERE adr > 0
)
SELECT
    bs.reserved_room_type,
    bs.season,
    rr.base_rate,
    ROUND(AVG(bs.adr)::numeric, 2) AS actual_avg_adr,
    ROUND(AVG(bs.adr)::numeric - rr.base_rate, 2) AS diff_vs_rate_card,
    COUNT(*) AS bookings
FROM bookings_with_season bs
JOIN room_rates rr
    ON bs.reserved_room_type = rr.room_type
    AND bs.season = rr.season
GROUP BY bs.reserved_room_type, bs.season, rr.base_rate
ORDER BY bs.reserved_room_type, bs.season;
```
</details>
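
The reason the join needs both columns is that `room_rates` has one row per room type and season. The sketch below counts joined rows with and without the season condition, using invented bookings and Python's `sqlite3` as a stand-in for PostgreSQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE bookings   (reserved_room_type TEXT, arrival_date_month TEXT, adr REAL);
    CREATE TABLE room_rates (room_type TEXT, season TEXT, base_rate REAL);
    INSERT INTO bookings   VALUES ('A', 'July', 160.0), ('A', 'July', 150.0), ('A', 'January', 70.0);
    INSERT INTO room_rates VALUES ('A', 'low', 80.0), ('A', 'mid', 110.0), ('A', 'high', 150.0);
""")

# Same month-to-season mapping as described in the quiz
season_case = """CASE
    WHEN arrival_date_month IN ('June', 'July', 'August') THEN 'high'
    WHEN arrival_date_month IN ('March', 'April', 'May',
                                'September', 'October', 'November') THEN 'mid'
    ELSE 'low' END"""

def joined_rows(extra_condition=""):
    """Count rows produced by joining bookings to room_rates."""
    sql = f"""
        WITH bs AS (SELECT *, {season_case} AS season FROM bookings)
        SELECT COUNT(*)
        FROM bs
        JOIN room_rates rr
          ON bs.reserved_room_type = rr.room_type {extra_condition}
    """
    return con.execute(sql).fetchone()[0]

type_only       = joined_rows()
type_and_season = joined_rows("AND bs.season = rr.season")

print(type_only)        # -> 9  (each booking matches all 3 seasons)
print(type_and_season)  # -> 3  (each booking matches exactly one rate)
```

Joining on room type alone triples every booking, which would inflate counts and blur the comparison against `base_rate`.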

---
## Bonus — Free Play

Write any JOIN query you want against the hotel tables.

In [None]:
%%sql


In [None]:
%%sql


---
**Next:** [07_subqueries_and_ctes.ipynb](./07_subqueries_and_ctes.ipynb) — nested queries, EXISTS, ANY/ALL, and chaining CTEs.