# 03 — Performance Tuning (Indexes & EXPLAIN)

**Learning Objectives**

| # | Goal |
|---|------|
| 1 | Read and interpret `EXPLAIN ANALYZE` output |
| 2 | Understand Seq Scan vs Index Scan vs Bitmap Scan |
| 3 | Create single-column and composite indexes |
| 4 | Know when indexes help and when they hurt |
| 5 | Use `pg_stat_user_indexes` to check index usage |

---

### Why Performance Matters

Writing correct SQL is step one. Writing **fast** SQL is what separates a junior from a senior engineer.  
On a 100-row table, everything is fast. On 100 million rows, a missing index can turn a 2 ms query into a 2-minute one.

### How Postgres Executes a Query

```
SQL query
   ↓
Parser → Planner → Executor → Results
              ↑
         The planner chooses the cheapest execution plan
         based on table statistics and available indexes.
```

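`EXPLAIN` shows the estimated plan without running the query. `EXPLAIN ANALYZE` actually executes the query and adds real row counts and timings, so wrap any data-modifying statement in a transaction that you roll back.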

In [None]:
%load_ext sql
%sql postgresql://admin:password@postgres:5432/mastery_db

---
## 1. Baseline — The Sequential Scan

Without a suitable index, Postgres must read **every row** in the table to find matches: a `Seq Scan`.

**Question:** Find all bookings made through agent 9.

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT hotel, country, adr
FROM hotel_bookings
WHERE agent = 9;

### How to Read the Output

| Field | Meaning |
|-------|---------|
| `Seq Scan on hotel_bookings` | Scanning every row |
| `cost=0.00..N` | Estimated startup & total cost (arbitrary units) |
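| `rows=N` | Estimated number of rows the node returns |
| `actual time=X..Y` | Measured time in ms until the first row (X) and the last row (Y) were returned, per loop |
| `Execution Time` | Total time spent executing the plan (ms); planning is reported separately as `Planning Time` |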

> **Note the Execution Time.** We will compare it after adding an index.
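
The fields above can also be pulled out of a plan line programmatically, which helps when comparing many runs. A minimal Python sketch follows; the sample line and its numbers are invented for illustration (not output from this notebook's database), and `parse_plan_line` is a hypothetical helper, not part of any library.

```python
import re

# Invented plan line in the usual EXPLAIN ANALYZE text layout.
SAMPLE = (
    "Seq Scan on hotel_bookings  (cost=0.00..3500.00 rows=1200 width=20) "
    "(actual time=0.015..12.345 rows=1180 loops=1)"
)

NUM = r"\d+(?:\.\d+)?"  # integer or decimal number
PLAN_RE = re.compile(
    rf"\(cost=({NUM})\.\.({NUM}) rows=({NUM}) width=\d+\)"
    rf" \(actual time=({NUM})\.\.({NUM}) rows=({NUM}) loops=(\d+)\)"
)


def parse_plan_line(line):
    """Return the estimates and measurements of one plan node, or None."""
    m = PLAN_RE.search(line)
    if m is None:
        return None
    startup, total, est_rows, first_ms, last_ms, actual_rows, loops = m.groups()
    return {
        # Everything before "(cost=" is the node name, minus any "->" arrow.
        "node": line[: m.start()].strip().lstrip("->").strip(),
        "startup_cost": float(startup),
        "total_cost": float(total),
        "est_rows": float(est_rows),
        "first_row_ms": float(first_ms),
        "last_row_ms": float(last_ms),
        "actual_rows": float(actual_rows),
        "loops": int(loops),
    }


info = parse_plan_line(SAMPLE)
print(info["node"], info["est_rows"], info["actual_rows"], info["last_row_ms"])
```

A large gap between `est_rows` and `actual_rows` usually points to stale table statistics; running `ANALYZE hotel_bookings;` refreshes them.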

---
## 2. Creating a B-Tree Index

A **B-Tree index** is a sorted data structure that lets Postgres jump directly to matching rows in `O(log n)` time instead of scanning all `n` rows.

```
Without index:  Scan all 119,000 rows  →  O(n)
With index:     Binary search in tree   →  O(log n)
```
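
The gap between those two lines can be counted directly. The following is a rough Python sketch in which a sorted list stands in for the B-Tree. A real B-Tree has a far higher fan-out than a binary search, so it typically needs even fewer steps. The agent ids are synthetic, and `seq_scan` and `sorted_lookup` are hypothetical names.

```python
def seq_scan(rows, target):
    """Examine every row, like a Seq Scan. Returns (match_count, rows_examined)."""
    examined = 0
    matches = 0
    for agent in rows:
        examined += 1
        if agent == target:
            matches += 1
    return matches, examined


def sorted_lookup(sorted_keys, target):
    """Binary-search for the first key >= target. Returns (position, comparisons)."""
    lo, hi, comparisons = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo, comparisons


N = 119_000
rows = [i % 500 for i in range(N)]  # synthetic agent ids 0..499
sorted_keys = sorted(rows)          # stand-in for the index's sorted keys

matches, examined = seq_scan(rows, 9)
position, comparisons = sorted_lookup(sorted_keys, 9)
print(f"seq scan examined {examined} rows and found {matches} matches")
print(f"sorted lookup needed {comparisons} comparisons to find the first match")
```

Once the first match is located, the remaining matches sit right next to it in key order, so the lookup cost grows with the number of matches rather than with the size of the table.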

In [None]:
%%sql
CREATE INDEX idx_agent ON hotel_bookings(agent);

### Now Run the Same Query Again

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT hotel, country, adr
FROM hotel_bookings
WHERE agent = 9;

### What Changed?

| Before | After |
|--------|-------|
| `Seq Scan` | `Bitmap Heap Scan` or `Index Scan` |
| Scanned all rows | Only fetched matching rows |
| Execution time: ~10–50 ms | Execution time: ~0.5–2 ms |

> Postgres may choose `Bitmap Heap Scan` (for many matches) or `Index Scan` (for few matches). Both use the index.

---
## 3. Composite (Multi-Column) Indexes

If you frequently filter on **two columns together**, one composite index is usually more efficient than two separate single-column indexes, which Postgres would have to combine with a bitmap AND.

**Important rule:** The composite index `(agent, is_canceled)` can serve queries that filter on:
- `agent` alone ✅ (leftmost prefix)
- `agent AND is_canceled` ✅
- `is_canceled` alone ❌ (not a prefix, so Postgres generally cannot use the index efficiently)
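
The leftmost-prefix rule follows from how composite keys are sorted: first by `agent`, then by `is_canceled` within each agent. A small Python sketch, using a sorted list of tuples as a stand-in for the index (the entries are made up, and the helper names are hypothetical):

```python
from bisect import bisect_left, bisect_right

# Made-up index entries (agent, is_canceled), kept in sorted order
# the way a composite B-Tree orders its keys.
entries = sorted(
    (agent, canceled)
    for agent in (3, 9, 14, 240)
    for canceled in (0, 1)
    for _ in range(3)  # three bookings per combination
)


def range_for_agent(agent):
    """Contiguous slice for `agent = ?` (leftmost prefix)."""
    return bisect_left(entries, (agent,)), bisect_left(entries, (agent + 1,))


def range_for_both(agent, canceled):
    """Contiguous slice for `agent = ? AND is_canceled = ?`."""
    key = (agent, canceled)
    return bisect_left(entries, key), bisect_right(entries, key)


def positions_for_canceled(canceled):
    """Positions matching `is_canceled = ?` alone; needs a pass over all entries."""
    return [i for i, (_, c) in enumerate(entries) if c == canceled]


def is_contiguous(positions):
    return positions == list(range(positions[0], positions[-1] + 1))


print(range_for_agent(9))                        # one contiguous slice
print(range_for_both(9, 1))                      # a narrower contiguous slice
print(is_contiguous(positions_for_canceled(1)))  # matches are scattered
```

Matches for `agent` alone or for both columns form one contiguous run that a search can jump to, while matches for `is_canceled` alone are spread across every agent's run.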

In [None]:
%%sql
-- First, let's see the plan WITHOUT a composite index
EXPLAIN ANALYZE
SELECT hotel, country, adr
FROM hotel_bookings
WHERE agent = 9 AND is_canceled = 1;

In [None]:
%%sql
-- Create the composite index
CREATE INDEX idx_agent_canceled ON hotel_bookings(agent, is_canceled);

In [None]:
%%sql
-- Same query, now with composite index available
EXPLAIN ANALYZE
SELECT hotel, country, adr
FROM hotel_bookings
WHERE agent = 9 AND is_canceled = 1;

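> **Observation:** With the composite index, both conditions should appear in the plan's `Index Cond` instead of one of them being applied afterwards as a `Filter`. Postgres still visits the table to fetch `hotel`, `country`, and `adr`, because those columns are not part of the index.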

---
## 4. When Indexes DON'T Help

Indexes aren't free. Understand the trade-offs:

| Scenario | Index Useful? | Why |
|----------|--------------|-----|
| `WHERE agent = 9` (few matches) | ✅ Yes | High selectivity — only fetches a few rows |
| `WHERE is_canceled = 1` (50% of rows) | ❌ Often not | Low selectivity — Seq Scan may be faster |
| `SELECT * FROM table` (no WHERE) | ❌ No | Full table scan anyway |
| Frequent INSERTs/UPDATEs | ⚠️ Trade-off | Every write must update the index too |
| Small tables (< 1000 rows) | ❌ No | Seq Scan is fast enough |

Let's verify — query a low-selectivity column:

In [None]:
%%sql
-- Even with an index on is_canceled, Postgres may ignore it
CREATE INDEX idx_canceled ON hotel_bookings(is_canceled);

EXPLAIN ANALYZE
SELECT hotel, adr
FROM hotel_bookings
WHERE is_canceled = 1;

> Even though the index exists, the planner may still choose `Seq Scan`: fetching ~50% of the table through an index means many random page reads, which costs more than one straight sequential pass.
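
The planner's reasoning can be approximated with a toy cost model in Python. The three constants are the Postgres defaults for `seq_page_cost`, `random_page_cost`, and `cpu_tuple_cost`. Everything else is an assumption: the page count is hypothetical, and `naive_index_cost` is a deliberately crude stand-in (one random page read per matching row, capped at the table size), not the planner's actual formula.

```python
SEQ_PAGE_COST = 1.0     # Postgres default for seq_page_cost
RANDOM_PAGE_COST = 4.0  # Postgres default for random_page_cost
CPU_TUPLE_COST = 0.01   # Postgres default for cpu_tuple_cost

TABLE_ROWS = 119_000
TABLE_PAGES = 2_000     # hypothetical page count, for illustration only


def seq_scan_cost(pages, rows):
    """Read every page sequentially and process every row."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST


def naive_index_cost(pages, rows, selectivity):
    """Crude assumption: each matching row costs one random page read,
    but never more page reads than the table has pages."""
    matching = rows * selectivity
    pages_fetched = min(matching, pages)
    return pages_fetched * RANDOM_PAGE_COST + matching * CPU_TUPLE_COST


def cheaper_plan(selectivity):
    seq = seq_scan_cost(TABLE_PAGES, TABLE_ROWS)
    idx = naive_index_cost(TABLE_PAGES, TABLE_ROWS, selectivity)
    return "index" if idx < seq else "seq scan"


for selectivity in (0.001, 0.5):
    print(f"selectivity {selectivity}: {cheaper_plan(selectivity)}")
```

The exact break-even point of this toy model is not meaningful. The useful part is the shape: random page reads are priced higher than sequential ones, so an index only wins while the matching fraction stays small.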

---
## 5. Monitoring Index Usage

In production, unused indexes waste disk and slow down writes. Postgres tracks usage statistics.

In [None]:
%%sql
SELECT
    indexrelname AS index_name,
    idx_scan     AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;

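> **Rule of Thumb:** If `times_used = 0` after a week of production traffic, consider dropping the index, unless it enforces a `PRIMARY KEY` or `UNIQUE` constraint. The counters restart whenever statistics are reset, so confirm that the window really covers a full week.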
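
The rule of thumb can be applied to the query result with a few lines of Python. This is a sketch over hypothetical rows: the index names and counts are invented, and the `is_unique` flag is an assumption that would have to come from `pg_index.indisunique`, which the query above does not select.

```python
# Invented rows shaped like the query result, plus an assumed `is_unique` flag.
stats = [
    {"index_name": "hotel_bookings_pkey", "times_used": 0, "is_unique": True},
    {"index_name": "idx_agent", "times_used": 42, "is_unique": False},
    {"index_name": "idx_canceled", "times_used": 0, "is_unique": False},
]


def drop_candidates(rows):
    """Names of never-scanned indexes that do not enforce uniqueness."""
    return [
        row["index_name"]
        for row in rows
        if row["times_used"] == 0 and not row["is_unique"]
    ]


print(drop_candidates(stats))
```

Unique indexes are excluded because they exist to enforce a constraint, whether or not any query ever scans them.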

---
## 6. EXPLAIN for JOINs and Subqueries

`EXPLAIN` is especially valuable for complex queries. Let's see the plan for a query that joins `hotel_bookings` against an aggregate of itself.

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT
    b.hotel,
    b.country,
    b.adr,
    avg_tbl.avg_adr
FROM hotel_bookings b
JOIN (
    SELECT hotel, ROUND(AVG(adr)::numeric, 2) AS avg_adr
    FROM hotel_bookings
    WHERE adr > 0
    GROUP BY hotel
) avg_tbl ON b.hotel = avg_tbl.hotel
WHERE b.adr > avg_tbl.avg_adr * 2
LIMIT 10;

### Reading Nested Plans

The output is a tree. Key join types:

| Join Type | When Used |
|-----------|-----------|
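| `Nested Loop` | Small outer input, with cheap (ideally indexed) lookups on the inner side |
| `Hash Join` | Equality joins; the smaller input is hashed, ideally within `work_mem` |
| `Merge Join` | Both inputs already sorted on the join key (or cheap to sort), large tables |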

Read from the **most indented** node outward: the deepest nodes produce rows first and feed them to the parent node above.
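
Finding the deepest node can be automated by measuring indentation. The Python sketch below works on a hand-written plan shape, which is not real output from this notebook, and assumes the `Planning Time` and `Execution Time` footer lines have been removed. `plan_nodes` is a hypothetical helper.

```python
# Hand-written plan shape for illustration only.
PLAN = """\
Limit
  ->  Hash Join
        Hash Cond: (b.hotel = avg_tbl.hotel)
        ->  Seq Scan on hotel_bookings b
        ->  Hash
              ->  Subquery Scan on avg_tbl
                    ->  HashAggregate
                          ->  Seq Scan on hotel_bookings
                                Filter: (adr > 0)
"""


def plan_nodes(plan_text):
    """Return (indent, node_name) per plan node, skipping detail lines."""
    nodes = []
    for line in plan_text.splitlines():
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        if stripped.startswith("->"):
            name = stripped[2:].strip()   # child node, marked with an arrow
        elif stripped and indent == 0:
            name = stripped               # root node has no arrow
        else:
            continue                      # e.g. "Hash Cond: ..." or "Filter: ..."
        # Drop a trailing "(cost=...)" section if one is present.
        nodes.append((indent, name.split("  (")[0]))
    return nodes


nodes = plan_nodes(PLAN)
deepest = max(nodes, key=lambda node: node[0])
print("deepest node:", deepest[1])
for indent, name in sorted(nodes, key=lambda node: -node[0]):
    print(" " * indent + name)
```

Sorting by indentation is only a rough guide to execution order, since sibling nodes at the same depth run in an order decided by their parent.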

---
## 7. Cleanup

Always clean up indexes created for experimentation.

In [None]:
%%sql
DROP INDEX IF EXISTS idx_agent;
DROP INDEX IF EXISTS idx_agent_canceled;
DROP INDEX IF EXISTS idx_canceled;

---
## Exercises

**Exercise 1:** Run `EXPLAIN ANALYZE` on `SELECT * FROM hotel_bookings WHERE country = 'PRT'`. Create an index on `country`, run it again, and compare.

<details><summary>Hint</summary>

```sql
-- Before
EXPLAIN ANALYZE SELECT * FROM hotel_bookings WHERE country = 'PRT';

-- Create index
CREATE INDEX idx_country ON hotel_bookings(country);

-- After
EXPLAIN ANALYZE SELECT * FROM hotel_bookings WHERE country = 'PRT';

-- Cleanup
DROP INDEX idx_country;
```
</details>

In [None]:
%%sql
-- Exercise 1: Your queries here


**Exercise 2:** Will an index on `hotel` (only 2 distinct values) speed up `WHERE hotel = 'City Hotel'`? Form a hypothesis, then test it.

<details><summary>Hint</summary>

Low cardinality (2 values) means each value matches a large share of the rows (about 50% if they are evenly distributed). Postgres will likely ignore the index and stick with Seq Scan.
</details>

In [None]:
%%sql
-- Exercise 2: Your queries here


**Exercise 3:** Create an index that would speed up this query:
```sql
SELECT * FROM hotel_bookings
WHERE market_segment = 'Online TA' AND arrival_date_year = 2016;
```
Decide: single-column or composite? Test with EXPLAIN ANALYZE.

<details><summary>Hint</summary>

A composite index `(market_segment, arrival_date_year)` is usually the best fit for this two-column equality filter, because both conditions can be resolved in a single index lookup.
</details>

In [None]:
%%sql
-- Exercise 3: Your queries here


---
## Key Takeaways

| Concept | Summary |
|---------|---------|
| `EXPLAIN` | Shows the query plan without running the query |
| `EXPLAIN ANALYZE` | Runs the query and shows actual timing |
| `Seq Scan` | Reads every row — fine for small tables or low selectivity |
| `Index Scan` | Uses B-Tree to jump to matching rows — great for high selectivity |
| Composite Index | `(A, B)` serves `WHERE A`, `WHERE A AND B`, but NOT `WHERE B` alone |
| Trade-off | Indexes speed up reads but slow down writes and consume disk |
| `pg_stat_user_indexes` | Monitor which indexes are actually being used |

**Next:** [04_complex_aggregations.ipynb](./04_complex_aggregations.ipynb) — GROUPING SETS, ROLLUP, CUBE, and FILTER.