# 01 — Setup & Data Exploration

**Learning Objectives**

| # | Goal |
|---|------|
| 1 | Connect Jupyter to PostgreSQL using `jupysql` |
| 2 | Load a real-world CSV dataset into a Postgres table |
| 3 | Explore the schema — column names, types, row count |
| 4 | Run basic data-quality checks (NULLs, duplicates, distributions) |
| 5 | Write your first analytical queries |

---

### Dataset: Hotel Booking Demand

We use a cleaned version of the [Hotel Booking Demand](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand) dataset.  
It contains **~119 000 bookings** across two hotel types (Resort Hotel & City Hotel) from 2015–2017.

<details>
<summary><strong>Column Reference (click to expand)</strong></summary>

| Column | Description |
|--------|-------------|
| `hotel` | Resort Hotel or City Hotel |
| `is_canceled` | 1 = canceled, 0 = not canceled |
| `lead_time` | Days between booking date and arrival |
| `arrival_date_year` | Year of arrival (2015–2017) |
| `arrival_date_month` | Month of arrival |
| `arrival_date_week_number` | ISO week number |
| `arrival_date_day_of_month` | Day of month |
| `stays_in_weekend_nights` | Weekend nights in the stay |
| `stays_in_week_nights` | Weekday nights in the stay |
| `adults` | Number of adults |
| `children` | Number of children |
| `babies` | Number of babies |
| `meal` | Meal package (BB, HB, FB, SC) |
| `country` | Country of origin (ISO 3166-1 alpha-3 code, e.g. `PRT`, `GBR`) |
| `market_segment` | Market segment (Online TA, Offline TA/TO, Direct, etc.) |
| `distribution_channel` | Booking channel |
| `is_repeated_guest` | 1 = returning guest |
| `previous_cancellations` | Prior cancellations by this guest |
| `previous_bookings_not_canceled` | Prior successful bookings |
| `reserved_room_type` | Room type originally reserved |
| `assigned_room_type` | Room type actually assigned |
| `booking_changes` | Number of booking modifications |
| `deposit_type` | No Deposit / Non Refund / Refundable |
| `agent` | Travel agent ID |
| `company` | Company ID |
| `days_in_waiting_list` | Days on waiting list before confirmation |
| `customer_type` | Transient, Contract, Group, Transient-Party |
| `adr` | Average Daily Rate (revenue per room-night) |
| `required_car_parking_spaces` | Parking spaces requested |
| `total_of_special_requests` | Number of special requests |
| `reservation_status` | Check-Out, Canceled, No-Show |
| `reservation_status_date` | Date of last status change |

</details>

---
## Part 1 — Connect to PostgreSQL

In [1]:
# Load the jupysql extension so we can write %%sql cells
%load_ext sql

In [2]:
# Connect to the Postgres instance defined in docker-compose.yml
# Format: postgresql://USER:PASSWORD@HOST:PORT/DATABASE
%sql postgresql://admin:password@postgres:5432/mastery_db

If you see a success message, the connection is ready. All `%%sql` cells below will run against this database.
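
The connection string above hard-codes the credentials from `docker-compose.yml`. As a minimal sketch, the same URL could be assembled from environment variables instead. The variable names (`PG_USER`, `PG_PASSWORD`, and so on) and the helper `build_pg_url` are hypothetical, and the defaults simply mirror the values used in this notebook.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(env=None):
    """Assemble a postgresql:// URL from environment-style settings.

    The variable names are illustrative assumptions; neither jupysql
    nor Postgres requires them.
    """
    env = os.environ if env is None else env
    user = env.get("PG_USER", "admin")
    password = env.get("PG_PASSWORD", "password")
    host = env.get("PG_HOST", "postgres")
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DATABASE", "mastery_db")
    # quote_plus escapes characters such as '@' or '/' that would
    # otherwise break the structure of the URL.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# An empty mapping falls back to the notebook's default values.
pg_url = build_pg_url({})
print(pg_url)
```

The resulting string has the same `postgresql://USER:PASSWORD@HOST:PORT/DATABASE` shape as the one used above, so it can also be handed to SQLAlchemy's `create_engine`.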

---
## Part 2 — Load the CSV into Postgres

We use **pandas** to read the CSV and **SQLAlchemy** to push it into a Postgres table called `hotel_bookings`.  
This only needs to run once; subsequent notebooks can query the table directly. Because of `if_exists='replace'`, re-running the cell drops and recreates the table.

In [3]:
import pandas as pd
from sqlalchemy import create_engine

# 1. Read the CSV
csv_path = '/app/data/hotel_booking.csv'
df = pd.read_csv(csv_path)
print(f"Loaded {len(df):,} rows  x  {len(df.columns)} columns")

# 2. Push to Postgres (replace if table exists)
engine = create_engine("postgresql://admin:password@postgres:5432/mastery_db")
df.to_sql('hotel_bookings', engine, if_exists='replace', index=False)
print("Table 'hotel_bookings' created successfully.")

Loaded 119,390 rows  x  36 columns
Table 'hotel_bookings' created successfully.
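
A quick sanity check after any bulk load is to compare the number of rows in the file with the number of rows in the table. The sketch below illustrates the idea with Python's built-in `csv` and `sqlite3` modules on a three-row inline sample. SQLite stands in for Postgres here, and the sample rows and the table name `bookings_sample` are made up for illustration.

```python
import csv
import io
import sqlite3

# Tiny inline CSV standing in for hotel_booking.csv (values are illustrative).
sample_csv = io.StringIO(
    "hotel,is_canceled,lead_time\n"
    "Resort Hotel,0,7\n"
    "City Hotel,1,30\n"
    "City Hotel,0,2\n"
)
rows = list(csv.DictReader(sample_csv))

# In-memory SQLite database as a stand-in for the Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings_sample (hotel TEXT, is_canceled INTEGER, lead_time INTEGER)"
)
conn.executemany(
    "INSERT INTO bookings_sample VALUES (:hotel, :is_canceled, :lead_time)",
    rows,
)

# The table should hold exactly as many rows as the file had data lines.
(table_rows,) = conn.execute("SELECT COUNT(*) FROM bookings_sample").fetchone()
assert table_rows == len(rows), "row count mismatch between file and table"
print(f"file rows = {len(rows)}, table rows = {table_rows}")
```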


---
## Part 3 — Verify & Explore the Table

From here on, **everything is pure SQL**. No more Python needed.

### 3.1 Row Count

In [5]:
%%sql
SELECT COUNT(*) AS total_rows FROM hotel_bookings;

total_rows
119390


### 3.2 Preview a Few Rows

Always peek at your data before writing queries.  
Tip: use `LIMIT` to avoid flooding the output.

In [6]:
%%sql
SELECT
    hotel,
    is_canceled,
    lead_time,
    arrival_date_year,
    arrival_date_month,
    country,
    market_segment,
    adr
FROM hotel_bookings
LIMIT 5;

hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,country,market_segment,adr
Resort Hotel,0,342,2015,July,PRT,Direct,0.0
Resort Hotel,0,737,2015,July,PRT,Direct,0.0
Resort Hotel,0,7,2015,July,GBR,Direct,75.0
Resort Hotel,0,13,2015,July,GBR,Corporate,75.0
Resort Hotel,0,14,2015,July,GBR,Online TA,98.0


### 3.3 Table Schema (Column Names & Types)

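Postgres exposes table metadata through the standard `information_schema` views, which is how you inspect a table programmatically. The output below stops after 10 of the table's columns, most likely because jupysql truncates displayed results (its `displaylimit` option defaults to 10).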

In [11]:
%%sql
SELECT
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_name = 'hotel_bookings'
ORDER BY ordinal_position;

column_name,data_type,is_nullable
hotel,text,YES
is_canceled,bigint,YES
lead_time,bigint,YES
arrival_date_year,bigint,YES
arrival_date_month,text,YES
arrival_date_week_number,bigint,YES
arrival_date_day_of_month,bigint,YES
stays_in_weekend_nights,bigint,YES
stays_in_week_nights,bigint,YES
adults,bigint,YES


---
## Part 4 — Data Quality Checks

Before any analysis, **always** check for NULLs, unexpected values, and duplicates.  
Bad data leads to wrong conclusions.

### 4.1 NULL Counts per Column

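This pattern counts NULLs for several columns in a single pass: `COUNT(*)` counts every row, while `COUNT(col)` counts only the rows where `col` is not NULL, so the difference is the number of NULLs.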

In [12]:
%%sql
SELECT
    COUNT(*) AS total_rows,
    COUNT(*) - COUNT(children)   AS children_nulls,
    COUNT(*) - COUNT(country)    AS country_nulls,
    COUNT(*) - COUNT(agent)      AS agent_nulls,
    COUNT(*) - COUNT(company)    AS company_nulls
FROM hotel_bookings;

total_rows,children_nulls,country_nulls,agent_nulls,company_nulls
119390,4,488,16340,112593
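
To see the `COUNT(*) - COUNT(col)` pattern on data small enough to check by hand, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for Postgres. The table `t` and its four rows are invented for illustration; the aggregate behaves the same way in both databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (agent REAL, company REAL)")
# Four illustrative rows: agent is NULL twice, company is NULL three times.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(9.0, None), (None, None), (240.0, 40.0), (None, None)],
)

total, agent_nulls, company_nulls = conn.execute(
    """
    SELECT
        COUNT(*)                  AS total_rows,
        COUNT(*) - COUNT(agent)   AS agent_nulls,
        COUNT(*) - COUNT(company) AS company_nulls
    FROM t
    """
).fetchone()
print(total, agent_nulls, company_nulls)  # 4 2 3
```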


> **Insight:** `agent` is NULL in 16,340 rows (about 13.7%) and `company` in 112,593 rows (about 94.3%), most likely because those bookings were made without an intermediary. Keep this in mind when joining or filtering on these columns.

### 4.2 Distinct Value Counts (Cardinality)

Understanding how many unique values each categorical column has helps you decide how to GROUP BY and what indexes to create later.

In [13]:
%%sql
SELECT
    COUNT(DISTINCT hotel)                AS hotels,
    COUNT(DISTINCT arrival_date_year)    AS years,
    COUNT(DISTINCT arrival_date_month)   AS months,
    COUNT(DISTINCT country)              AS countries,
    COUNT(DISTINCT market_segment)       AS market_segments,
    COUNT(DISTINCT customer_type)        AS customer_types,
    COUNT(DISTINCT deposit_type)         AS deposit_types,
    COUNT(DISTINCT reserved_room_type)   AS room_types
FROM hotel_bookings;

hotels,years,months,countries,market_segments,customer_types,deposit_types,room_types
2,3,12,177,8,4,3,10
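
Once a column turns out to have only a handful of distinct values, a natural follow-up is to list those values with their frequencies. The sketch below shows both steps on a made-up ten-row sample, again using `sqlite3` as a stand-in for Postgres; the table name `bookings_sample` and the counts are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings_sample (deposit_type TEXT)")
conn.executemany(
    "INSERT INTO bookings_sample VALUES (?)",
    [("No Deposit",)] * 6 + [("Non Refund",)] * 3 + [("Refundable",)] * 1,
)

# Step 1: cardinality, i.e. how many distinct values exist.
(distinct_types,) = conn.execute(
    "SELECT COUNT(DISTINCT deposit_type) FROM bookings_sample"
).fetchone()

# Step 2: the frequency of each value, most common first.
value_counts = conn.execute(
    """
    SELECT deposit_type, COUNT(*) AS bookings
    FROM bookings_sample
    GROUP BY deposit_type
    ORDER BY bookings DESC
    """
).fetchall()

print(distinct_types)  # 3
print(value_counts)    # [('No Deposit', 6), ('Non Refund', 3), ('Refundable', 1)]
```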


### 4.3 Numeric Summary Statistics

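Use `MIN`, `MAX`, `AVG`, and `PERCENTILE_CONT` to understand the distribution of key numeric columns. The `adr` aggregates are cast to `numeric` because Postgres's two-argument `ROUND` is defined for `numeric`, not for `double precision` (the type pandas creates for float columns).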

In [14]:
%%sql
SELECT
    ROUND(MIN(adr)::numeric, 2)  AS min_adr,
    ROUND(AVG(adr)::numeric, 2)  AS avg_adr,
    ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY adr)::numeric, 2) AS median_adr,
    ROUND(MAX(adr)::numeric, 2)  AS max_adr,
    MIN(lead_time)               AS min_lead_time,
    ROUND(AVG(lead_time)::numeric, 0) AS avg_lead_time,
    MAX(lead_time)               AS max_lead_time
FROM hotel_bookings;

min_adr,avg_adr,median_adr,max_adr,min_lead_time,avg_lead_time,max_lead_time
-6.38,101.83,94.58,5400.0,0,104,737
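
To make `PERCENTILE_CONT` less of a black box: it sorts the values, locates the fractional position `fraction * (n - 1)`, and interpolates linearly between the two neighbouring values. The pure-Python sketch below mirrors that rule as described in the Postgres documentation. The helper name `percentile_cont` is illustrative, and the first sample list reuses the five `adr` values from the preview in 3.2.

```python
import statistics


def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbours."""
    ordered = sorted(values)
    if not ordered:
        return None
    # Fractional zero-based position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


preview_adr = [0.0, 0.0, 75.0, 75.0, 98.0]  # adr values from the 3.2 preview
print(percentile_cont(preview_adr, 0.5))    # 75.0 (odd count: the middle value)
print(percentile_cont([75.0, 98.0], 0.5))   # 86.5 (even count: interpolated)

# For fraction 0.5 the result agrees with the standard-library median.
assert percentile_cont(preview_adr, 0.5) == statistics.median(preview_adr)
```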


> **Tip:** The output shows two red flags: `max_adr` is 5400.0 against a median of 94.58, and `min_adr` is negative (-6.38). Values like these are likely data-entry errors. In production you'd filter or cap them.

---
## Part 5 — Basic Analytical Queries

Let's answer a few real business questions to warm up.

### 5.1 Bookings by Hotel Type and Year

In [15]:
%%sql
SELECT
    hotel,
    arrival_date_year,
    COUNT(*)                                    AS total_bookings,
    SUM(is_canceled)                            AS cancellations,
    ROUND(SUM(is_canceled)::numeric / COUNT(*) * 100, 1) AS cancel_rate_pct
FROM hotel_bookings
GROUP BY hotel, arrival_date_year
ORDER BY hotel, arrival_date_year;

hotel,arrival_date_year,total_bookings,cancellations,cancel_rate_pct
City Hotel,2015,13682,6004,43.9
City Hotel,2016,38140,15407,40.4
City Hotel,2017,27508,11691,42.5
Resort Hotel,2015,8314,2138,25.7
Resort Hotel,2016,18567,4930,26.6
Resort Hotel,2017,13179,4054,30.8
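
As a sanity check, the percentages above can be recomputed from the two count columns, and the per-group totals should add up to the 119,390 rows counted in 3.1. A small Python sketch using the numbers copied from the output above:

```python
# (hotel, year, total_bookings, cancellations, reported cancel_rate_pct),
# copied from the query output.
results = [
    ("City Hotel", 2015, 13682, 6004, 43.9),
    ("City Hotel", 2016, 38140, 15407, 40.4),
    ("City Hotel", 2017, 27508, 11691, 42.5),
    ("Resort Hotel", 2015, 8314, 2138, 25.7),
    ("Resort Hotel", 2016, 18567, 4930, 26.6),
    ("Resort Hotel", 2017, 13179, 4054, 30.8),
]

# Recompute each rate the same way the query does: canceled / total * 100.
for hotel, year, total, canceled, reported in results:
    recomputed = round(canceled / total * 100, 1)
    assert recomputed == reported, (hotel, year, recomputed)

grand_total = sum(total for _, _, total, _, _ in results)
total_canceled = sum(canceled for _, _, _, canceled, _ in results)
overall_rate = round(total_canceled / grand_total * 100, 1)

print(grand_total)     # 119390, matching the row count from 3.1
print(total_canceled)  # 44224
print(overall_rate)    # 37.0
```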


### 5.2 Top 10 Countries by Booking Volume

In [4]:
%%sql
SELECT
    country,
    COUNT(*)                  AS total_bookings,
    ROUND(AVG(adr)::numeric, 2) AS avg_daily_rate,
    ROUND(AVG(lead_time)::numeric, 0) AS avg_lead_time_days
FROM hotel_bookings
WHERE country IS NOT NULL
GROUP BY country
ORDER BY total_bookings DESC
LIMIT 10;

country,total_bookings,avg_daily_rate,avg_lead_time_days
PRT,48590,92.04,116
GBR,12129,96.02,127
FRA,10415,109.62,82
ESP,8568,117.0,55
DEU,7287,104.4,137
ITA,3766,113.95,91
IRL,3375,98.19,120
BEL,2342,113.85,100
BRA,2224,111.01,83
NLD,2104,108.09,81


### 5.3 Average Stay Duration by Customer Type

In [6]:
%%sql
SELECT
    customer_type,
    COUNT(*) AS bookings,
    ROUND(AVG(stays_in_weekend_nights + stays_in_week_nights)::numeric, 1) AS avg_total_nights,
    ROUND(AVG(adr)::numeric, 2) AS avg_daily_rate
FROM hotel_bookings
GROUP BY customer_type
ORDER BY bookings DESC;

customer_type,bookings,avg_total_nights,avg_daily_rate
Transient,89613,3.4,107.01
Transient-Party,25124,3.1,86.08
Contract,4076,5.3,87.55
Group,577,2.9,83.49


---
## Exercises

Try these on your own. Each exercise comes with a collapsed hint and a worked solution cell; attempt the query yourself before looking at either.

**Exercise 1:** What is the cancellation rate for each `market_segment`? Which segment cancels the most?

<details><summary>Hint</summary>

```sql
SELECT market_segment, ... , ROUND(SUM(is_canceled)::numeric / COUNT(*) * 100, 1) AS cancel_rate
FROM hotel_bookings
GROUP BY market_segment
ORDER BY cancel_rate DESC;
```
</details>

In [4]:
%%sql
-- Exercise 1: Your query here
SELECT 
    market_segment,
    COUNT(*) AS total_bookings,
    SUM(CASE WHEN is_canceled = 1 THEN 1 ELSE 0 END) AS total_canceled,
    ROUND(
        SUM(CASE WHEN is_canceled = 1 THEN 1 ELSE 0 END)::numeric 
        / NULLIF(COUNT(*), 0) * 100, 
        1
    ) AS cancel_rate
FROM hotel_bookings
GROUP BY market_segment
ORDER BY cancel_rate DESC;



market_segment,total_bookings,total_canceled,cancel_rate
Undefined,2,2,100.0
Groups,19811,12097,61.1
Online TA,56477,20739,36.7
Offline TA/TO,24219,8311,34.3
Aviation,237,52,21.9
Corporate,5295,992,18.7
Direct,12606,1934,15.3
Complementary,743,97,13.1


**Exercise 2:** What is the average lead time by `deposit_type`? Do non-refundable deposits tend to be booked further in advance?

<details><summary>Hint</summary>

```sql
SELECT deposit_type, ROUND(AVG(lead_time)::numeric, 0) AS avg_lead_time, COUNT(*) AS bookings
FROM hotel_bookings
GROUP BY deposit_type
ORDER BY avg_lead_time DESC;
```
</details>

In [5]:
%%sql
-- Exercise 2: Your query here
SELECT deposit_type, ROUND(AVG(lead_time)::numeric, 0) AS avg_lead_time, COUNT(*) AS bookings
FROM hotel_bookings
GROUP BY deposit_type
ORDER BY avg_lead_time DESC;


deposit_type,avg_lead_time,bookings
Non Refund,213,14587
Refundable,152,162
No Deposit,89,104641


**Exercise 3:** Find months with the highest average ADR. Is there a seasonal pattern?

<details><summary>Hint</summary>

```sql
SELECT arrival_date_month, ROUND(AVG(adr)::numeric, 2) AS avg_adr, COUNT(*) AS bookings
FROM hotel_bookings
WHERE adr > 0
GROUP BY arrival_date_month
ORDER BY avg_adr DESC;
```
</details>

In [6]:
%%sql
-- Exercise 3: Your query here
SELECT arrival_date_month, ROUND(AVG(adr)::numeric, 2) AS avg_adr, COUNT(*) AS bookings
FROM hotel_bookings
WHERE adr > 0
GROUP BY arrival_date_month
ORDER BY avg_adr DESC;

arrival_date_month,avg_adr,bookings
August,141.81,13711
July,128.51,12491
June,117.97,10819
May,110.38,11611
September,106.64,10351
April,101.63,10953
October,89.77,10929
December,83.78,6561
March,81.96,9641
November,75.5,6641


---
## Key Takeaways

| Concept | What You Learned |
|---------|------------------|
| `%load_ext sql` | Enables `%%sql` magic cells in Jupyter |
| `%sql connection_string` | Connects to the database |
| `information_schema` | Query Postgres metadata (columns, types) |
| `COUNT(*) - COUNT(col)` | Quick NULL counting pattern |
| `COUNT(DISTINCT col)` | Cardinality check |
| `PERCENTILE_CONT` | Compute median / percentiles |
| `GROUP BY` + aggregates | Foundation of most analytical SQL |

**Next:** [02_window_functions.ipynb](./02_window_functions.ipynb) — ranking, lead/lag, moving averages.