# 04 — Numpy, Pandas & Parquet Deep Dive

Master the core data processing stack every data engineer relies on.

### What you'll learn
| # | Topic | Key Takeaway |
|---|-------|--------------|
| 1 | **Numpy** | Arrays, vectorization, `np.where`, `np.select`, `np.unique`, `np.isnan` |
| 2 | **Pandas IO** | `read_csv`, `read_parquet`, `to_parquet` — loading & saving data |
| 3 | **Pandas Transformation** | `apply`, `map`, `astype` — reshaping values |
| 4 | **Pandas Cleaning** | `fillna`, `dropna`, `drop_duplicates` — data quality |
| 5 | **Pandas Joins** | `merge` (SQL-style), `concat` (stacking) |
| 6 | **Pandas Aggregations** | `groupby`, `agg`, `pivot_table` — summarizing data |
| 7 | **Parquet** | Columnar storage, compression, and efficient IO |

---
## 1. Numpy — The Engine Behind Pandas

Numpy provides **fixed-type arrays** and **vectorized operations** (no Python loops).  
Pandas DataFrames are built on top of Numpy arrays, so understanding Numpy = understanding Pandas performance.

In [None]:
import numpy as np

# ============================================================
# 1a. Array Basics
# ============================================================

# np.array() creates an array from a Python list.
# Unlike lists, ALL elements must be the same type → enables fast math.
prices = np.array([120.0, 180.0, 90.0, 200.0, 160.0])

print(f"Array  : {prices}")
print(f"Type   : {type(prices)}")         # numpy.ndarray
print(f"dtype  : {prices.dtype}")          # float64 — the element data type
print(f"shape  : {prices.shape}")          # (5,) — dimensions (rows,) for 1D
print(f"ndim   : {prices.ndim}")           # 1 — number of dimensions
print(f"size   : {prices.size}")           # 5 — total element count

In [None]:
# ============================================================
# 1b. Vectorization — why Numpy is fast
# ============================================================
# "Vectorized" means the operation is applied to ALL elements at once
# in compiled C code — no Python for-loop needed.

nights = np.array([2, 3, 1, 4, 2])

# Element-wise math: each operation runs over the whole array in compiled code
revenue_per_night = prices / nights        # divide each price by its night count
discounted = prices * 0.9                  # 10% discount on all prices
total_with_tax = prices * 1.11             # 11% tax added

print(f"Prices          : {prices}")
print(f"Nights          : {nights}")
print(f"Rev/night       : {revenue_per_night}")
print(f"10% discount    : {discounted}")
print(f"With 11% tax    : {total_with_tax}")
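To make "vectorized" concrete, the sketch below computes the same per-night revenue twice: once with an explicit Python loop and once with a single array expression. The sample values mirror the arrays above; the names `looped` and `vectorized` are illustrative only.

```python
import numpy as np

prices = np.array([120.0, 180.0, 90.0, 200.0, 160.0])
nights = np.array([2, 3, 1, 4, 2])

# Explicit Python loop: the interpreter handles one element per iteration
looped = np.empty_like(prices)
for i in range(len(prices)):
    looped[i] = prices[i] / nights[i]

# Vectorized: the same element-wise division, executed inside Numpy
vectorized = prices / nights

print(f"Loop       : {looped}")
print(f"Vectorized : {vectorized}")
print(f"Same result: {np.array_equal(looped, vectorized)}")
```

Both versions produce identical values; the vectorized form is shorter and avoids per-element interpreter overhead, which matters more as arrays grow.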

In [None]:
# ============================================================
# 1c. Aggregation Functions
# ============================================================
# These reduce an array to a single value (or along an axis).

print(f"sum()   : {prices.sum()}")     # sum of all elements
print(f"mean()  : {prices.mean()}")    # arithmetic average
print(f"std()   : {prices.std():.2f}") # standard deviation
print(f"min()   : {prices.min()}")     # minimum value
print(f"max()   : {prices.max()}")     # maximum value
print(f"argmin(): {prices.argmin()}")  # INDEX of the minimum value
print(f"argmax(): {prices.argmax()}")  # INDEX of the maximum value

# cumsum() returns a running total (cumulative sum)
print(f"cumsum(): {prices.cumsum()}")  # [120, 300, 390, 590, 750]

In [None]:
# ============================================================
# 1d. Boolean Indexing (Filtering)
# ============================================================
# Comparison operators on arrays return boolean arrays.
# Use boolean arrays as a MASK to select elements.

mask = prices > 150          # element-wise comparison → [False, True, False, True, True]
print(f"Mask (>150)     : {mask}")
print(f"Premium prices  : {prices[mask]}")  # only elements where mask is True

# Combine conditions with & (and), | (or), ~ (not)
# IMPORTANT: wrap each condition in parentheses!
mid_range = prices[(prices >= 100) & (prices <= 180)]
print(f"Mid-range (100-180): {mid_range}")

# Count matches
n_premium = (prices > 150).sum()  # True=1, False=0, so sum() counts Trues
print(f"Count > 150     : {n_premium}")
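The cell above only exercises `&`. A short sketch of the other two operators, `|` and `~`, on the same sample prices (the thresholds are arbitrary illustration values):

```python
import numpy as np

prices = np.array([120.0, 180.0, 90.0, 200.0, 160.0])

# | (or): keep prices that are either cheap (< 100) or expensive (> 190)
extremes = prices[(prices < 100) | (prices > 190)]

# ~ (not): invert a mask to select everything that is NOT premium
premium_mask = prices > 150
not_premium = prices[~premium_mask]

print(f"Extremes    : {extremes}")     # selects 90.0 and 200.0
print(f"Not premium : {not_premium}")  # selects 120.0 and 90.0
```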

In [None]:
# ============================================================
# 1e. np.where — vectorized if-else
# ============================================================
# np.where(condition, value_if_true, value_if_false)
# Like: [true_val if cond else false_val for each element]
# But it runs in compiled code, so it is typically far faster on large arrays.

# Classify each price as "Premium" or "Standard"
categories = np.where(prices > 150, "Premium", "Standard")
print(f"Prices     : {prices}")
print(f"Categories : {categories}")

# Numeric example: cap prices at 180 (any price above 180 → 180)
capped = np.where(prices > 180, 180, prices)
print(f"Capped     : {capped}")

In [None]:
# ============================================================
# 1f. np.select — multiple conditions (like CASE WHEN in SQL)
# ============================================================
# np.select(condlist, choicelist, default)
#   condlist  : list of boolean arrays (conditions)
#   choicelist: list of values (one per condition)
#   default   : value when no condition is True
#
# Equivalent SQL:
#   CASE WHEN price > 200 THEN 'Luxury'
#        WHEN price > 120 THEN 'Premium'
#        ELSE 'Budget' END

conditions = [
    prices > 200,     # condition 1
    prices > 120,     # condition 2
]
choices = [
    "Luxury",         # if condition 1 is True
    "Premium",        # if condition 2 is True (and condition 1 is False)
]

tiers = np.select(conditions, choices, default="Budget")
print(f"Prices : {prices}")
print(f"Tiers  : {tiers}")
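Like `CASE WHEN`, `np.select` uses the first condition that matches, so the order of `condlist` matters. The sketch below shows what happens when the broader condition is listed first. One sample price is changed to 250 here (an assumption for illustration) so that the "Luxury" tier is actually reached.

```python
import numpy as np

prices = np.array([120.0, 180.0, 90.0, 250.0, 160.0])

# Most specific condition first: 250 is classified as Luxury
tiers_ok = np.select(
    [prices > 200, prices > 120],
    ["Luxury", "Premium"],
    default="Budget",
)

# Broader condition first: it shadows the narrower one, so Luxury is never assigned
tiers_bad = np.select(
    [prices > 120, prices > 200],
    ["Premium", "Luxury"],
    default="Budget",
)

print(f"Correct order : {tiers_ok}")
print(f"Wrong order   : {tiers_bad}")
```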

In [None]:
# ============================================================
# 1g. np.unique — find distinct values
# ============================================================
# np.unique(array) returns sorted unique values.
# Like SQL: SELECT DISTINCT ... ORDER BY ...

countries = np.array(["PRT", "GBR", "FRA", "PRT", "GBR", "PRT", "DEU", "FRA"])

# Basic unique values
unique_countries = np.unique(countries)
print(f"Unique : {unique_countries}")

# return_counts=True also gives the count of each unique value
values, counts = np.unique(countries, return_counts=True)
print("\nValue counts:")
for v, c in zip(values, counts):  # zip() pairs elements from two iterables
    print(f"  {v}: {c}")

In [None]:
# ============================================================
# 1h. np.isnan — detect missing/NaN values
# ============================================================
# NaN (Not a Number) = Numpy/Pandas representation of missing data.
# You CANNOT check NaN with == (NaN != NaN is True by IEEE standard).
# Use np.isnan() instead.

data = np.array([120.0, np.nan, 90.0, np.nan, 160.0])

print(f"Data         : {data}")
print(f"isnan mask   : {np.isnan(data)}")       # [False, True, False, True, False]
print(f"Count NaN    : {np.isnan(data).sum()}") # 2

# np.nanmean / np.nansum — aggregate IGNORING NaN values
print(f"mean (w/ NaN): {np.mean(data)}")     # nan (any NaN poisons the result)
print(f"nanmean      : {np.nanmean(data)}")  # 123.33 (ignores NaN)
print(f"nansum       : {np.nansum(data)}")   # 370.0
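A small sketch of why `==` fails for NaN, plus one caveat: `np.isnan` only accepts numeric input. For object arrays (for example strings mixed with `None`), `pd.isna` is a common alternative. The `mixed` array is made up for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([120.0, np.nan, 90.0])

# Comparing with == never matches, because NaN is not equal to anything
eq_mask = data == np.nan
isnan_mask = np.isnan(data)

print(f"data == np.nan : {eq_mask}")
print(f"np.isnan(data) : {isnan_mask}")

# np.isnan raises TypeError on object arrays; pd.isna handles them
mixed = np.array(["PRT", None, "GBR"], dtype=object)
na_mask = pd.isna(mixed)
print(f"pd.isna(mixed) : {na_mask}")
```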

---
## 2. Pandas IO — Loading & Saving Data

Pandas is the workhorse of data processing in Python.  
It adds **labeled columns**, **mixed types**, and **rich IO** on top of Numpy.

In [None]:
import pandas as pd
from pathlib import Path

# ============================================================
# 2a. read_csv — load CSV into a DataFrame
# ============================================================
# pd.read_csv() is the most common way to load data.
# It returns a DataFrame — a 2D labeled table with typed columns.

hotel_csv = Path("data/hotel_booking.csv")

df = pd.read_csv(
    hotel_csv,
    # Key parameters:
    # sep=','           → column delimiter (default: comma)
    # header=0          → which row is the header (default: first row)
    # dtype={'col': str} → force specific column types
    # parse_dates=['col'] → auto-parse date columns
    # usecols=[...]     → only load specific columns (saves memory!)
    # nrows=1000        → only load first N rows (great for testing)
    # na_values=['?', 'missing'] → extra strings to treat as NaN ('' and 'NULL' already are by default)
)

# --- Quick inspection methods ---
print(f"Shape   : {df.shape}")        # (rows, columns) tuple
print(f"Columns : {df.columns.tolist()[:10]}...")  # column names as a list
print(f"Dtypes  :\n{df.dtypes.head(10)}")  # data type of each column
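The parameters listed in the comments above can be tried without touching the real file. The sketch below parses a tiny in-memory CSV; the column names and values are invented for illustration and are not taken from the hotel dataset.

```python
import io
import pandas as pd

# A made-up, semicolon-delimited CSV held in memory
raw = io.StringIO(
    "booking_id;arrival;adr;agent;notes\n"
    "1;2017-07-01;120.5;?;late check-in\n"
    "2;2017-07-02;;A12;\n"
    "3;2017-07-03;95.0;A07;\n"
)

sample = pd.read_csv(
    raw,
    sep=";",                                            # non-default delimiter
    usecols=["booking_id", "arrival", "adr", "agent"],  # skip the 'notes' column
    dtype={"booking_id": "int32"},                      # force a narrower integer type
    parse_dates=["arrival"],                            # parse as datetime
    na_values=["?"],                                    # treat '?' as missing
    nrows=2,                                            # read only the first 2 data rows
)

print(sample)
print(sample.dtypes)
```

The empty `adr` field in the second row becomes NaN without any extra configuration, while `?` only does so because of `na_values`.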

In [None]:
# --- Exploring the DataFrame ---

# head(n) — show first n rows (default 5)
# tail(n) — show last n rows
df.head(3)

In [None]:
# info() — summary of columns, dtypes, non-null counts, and memory usage
# Great for spotting missing data and wrong types at a glance.
df.info()

In [None]:
# describe() — statistical summary for numeric columns
# Shows count, mean, std, min, 25%, 50% (median), 75%, max
df.describe()

In [None]:
# ============================================================
# 2b. read_csv with optimization — load only what you need
# ============================================================
# On large files, loading ALL columns wastes memory.
# usecols selects only the columns you need.

cols_needed = [
    "hotel", "is_canceled", "lead_time",
    "arrival_date_year", "arrival_date_month", "arrival_date_day_of_month",
    "stays_in_weekend_nights", "stays_in_week_nights",
    "adults", "children", "country",
    "adr", "reservation_status",
]

df = pd.read_csv(
    hotel_csv,
    usecols=cols_needed,           # only load these columns
    dtype={"country": "category"},  # 'category' type saves memory for low-cardinality strings
)

print(f"Shape          : {df.shape}")
print(f"Memory (subset): {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# memory_usage(deep=True) measures actual memory including string content
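To see why `category` helps for low-cardinality strings, the sketch below compares memory for the same synthetic column stored as plain strings and as a categorical. The data is generated for illustration; exact byte counts depend on the pandas version.

```python
import numpy as np
import pandas as pd

# 100,000 country codes drawn from only 4 distinct values
codes = np.array(["PRT", "GBR", "FRA", "DEU"])
as_strings = pd.Series(np.tile(codes, 25_000))
as_category = as_strings.astype("category")

string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"Plain strings : {string_bytes / 1024**2:.2f} MB")
print(f"Category      : {category_bytes / 1024**2:.2f} MB")
```

A categorical stores each distinct string once and keeps a small integer code per row, which is why the saving grows with the number of repeated values.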

---
## 3. Pandas Transformation — Reshaping Values

Transform columns: cast types, map values, compute new columns.

In [None]:
# ============================================================
# 3a. astype — cast column data types
# ============================================================
# astype(new_type) converts a column to a different data type.
# Essential for: fixing auto-detected types, optimizing memory.

print("--- Before ---")
print(df[["is_canceled", "lead_time", "adr"]].dtypes)

# Cast is_canceled from int64 → bool (it's really a 0/1 flag)
df["is_canceled"] = df["is_canceled"].astype(bool)

# Cast lead_time from int64 → int32 (values are small, saves 50% memory)
df["lead_time"] = df["lead_time"].astype("int32")

# Cast adr from float64 → float32 (sufficient precision for prices)
df["adr"] = df["adr"].astype("float32")

print("\n--- After ---")
print(df[["is_canceled", "lead_time", "adr"]].dtypes)

In [None]:
# ============================================================
# 3b. map — transform values using a dictionary or function
# ============================================================
# Series.map(dict_or_func) replaces each value.
# Like a lookup table / VLOOKUP / CASE WHEN.

# Map month names to quarter numbers
month_to_quarter = {
    "January": "Q1", "February": "Q1", "March": "Q1",
    "April": "Q2", "May": "Q2", "June": "Q2",
    "July": "Q3", "August": "Q3", "September": "Q3",
    "October": "Q4", "November": "Q4", "December": "Q4",
}

# map() replaces each value using the dictionary
# Values not in the dict become NaN
df["quarter"] = df["arrival_date_month"].map(month_to_quarter)

print(df[["arrival_date_month", "quarter"]].drop_duplicates().sort_values("quarter"))
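Because unmapped values silently become NaN, it can be worth checking for them right after a `map()`. A minimal sketch with a deliberately misspelled month (the sample Series is made up):

```python
import pandas as pd

months = pd.Series(["January", "July", "Juli", "December"])
month_to_quarter = {"January": "Q1", "July": "Q3", "December": "Q4"}

quarters = months.map(month_to_quarter)

# Values missing from the dictionary come back as NaN
unmapped = months[quarters.isna()]
print(f"Unmapped values: {unmapped.tolist()}")

# One option: replace the gaps with an explicit label
quarters_filled = quarters.fillna("Unknown")
print(quarters_filled.tolist())
```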

In [None]:
# map() with a function — apply a function to each value
# str.upper is a built-in string method
df["hotel_upper"] = df["hotel"].map(str.upper)
print(df["hotel_upper"].unique())

# Clean up the temp column
df.drop(columns=["hotel_upper"], inplace=True)
# drop(columns=[...]) removes columns
# inplace=True modifies the DataFrame directly instead of returning a copy

In [None]:
# ============================================================
# 3c. apply — row-wise or column-wise custom logic
# ============================================================
# DataFrame.apply(func, axis=1) applies a function to EACH ROW.
#   axis=0 → apply to each column (default)
#   axis=1 → apply to each row
#
# NOTE: apply(axis=1) is SLOW because it loops in Python.
# Prefer vectorized operations (np.where, np.select) when possible.
# Use apply only for complex logic that can't be vectorized.

def compute_total_nights(row):
    """
    Compute total stay length from weekend + weekday nights.
    `row` is a Series with column names as index.
    """
    return row["stays_in_weekend_nights"] + row["stays_in_week_nights"]


# Slow way: apply (loops row by row in Python)
# df["total_nights"] = df.apply(compute_total_nights, axis=1)

# Fast way: vectorized addition (Numpy under the hood)
df["total_nights"] = df["stays_in_weekend_nights"] + df["stays_in_week_nights"]

# Compute total revenue (vectorized)
df["total_revenue"] = df["adr"] * df["total_nights"]

print(df[["adr", "total_nights", "total_revenue"]].head())

In [None]:
# ============================================================
# 3d. Vectorized classification with np.select (preferred over apply)
# ============================================================

import numpy as np

# Classify bookings by ADR tier — much faster than apply + if/elif
conditions = [
    df["adr"] > 200,
    df["adr"] > 100,
    df["adr"] > 0,
]
choices = ["Luxury", "Premium", "Budget"]

df["adr_tier"] = np.select(conditions, choices, default="Free/Unknown")

# value_counts() returns the count of each unique value, sorted descending
print(df["adr_tier"].value_counts())

---
## 4. Pandas Cleaning — Data Quality

Real data is messy. These methods handle **missing values** and **duplicates**.

In [None]:
# ============================================================
# 4a. Inspect missing data
# ============================================================

# isnull() returns a boolean DataFrame (True = missing)
# .sum() counts True per column
missing = df.isnull().sum()

# Filter to only columns with missing values
missing_only = missing[missing > 0].sort_values(ascending=False)
print("Columns with missing values:")
print(missing_only)

# Percentage missing
print(f"\n% missing children: {df['children'].isnull().mean() * 100:.2f}%")
# .mean() of booleans = proportion of True values

In [None]:
# ============================================================
# 4b. fillna — fill missing values
# ============================================================
# fillna(value) replaces NaN with the specified value.

# Fill missing children count with 0 (no children)
df["children"] = df["children"].fillna(0)

# Common fill strategies:
#   fillna(0)              → fill with a constant
#   ffill()                → forward fill (use previous row's value)
#   bfill()                → backward fill (use next row's value)
#                            (fillna(method=...) is deprecated since pandas 2.1)
#   fillna(df['col'].mean()) → fill with column mean
#   fillna(df['col'].median()) → fill with column median

print(f"Children NaN after fill: {df['children'].isnull().sum()}")

# Fill missing country with "UNKNOWN"
df["country"] = df["country"].cat.add_categories("UNKNOWN")  # add new category first
# cat.add_categories() adds new valid categories to a categorical column
# (must do this before fillna because categorical columns can only hold defined categories)
df["country"] = df["country"].fillna("UNKNOWN")
print(f"Country NaN after fill : {df['country'].isnull().sum()}")
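The fill strategies listed in the comments can be compared side by side on a small Series. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan])

constant = s.fillna(0)               # every gap becomes 0
forward = s.ffill()                  # carry the last seen value forward
backward = s.bfill()                 # pull the next value backward
with_median = s.fillna(s.median())   # median() skips NaN, so this is 25.0

print(pd.DataFrame({
    "original": s,
    "constant": constant,
    "ffill": forward,
    "bfill": backward,
    "median": with_median,
}))
```

Note that `bfill()` leaves the final gap unfilled because there is no later value to copy from.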

In [None]:
# ============================================================
# 4c. dropna — remove rows/columns with missing data
# ============================================================
# dropna() removes rows (or columns) that contain NaN.
#   axis=0     → drop ROWS (default)
#   axis=1     → drop COLUMNS
#   how='any'  → drop if ANY value is NaN (default)
#   how='all'  → drop only if ALL values are NaN
#   subset=[cols] → only check specific columns for NaN
#   thresh=N   → keep rows with at least N non-NaN values

print(f"Rows before dropna: {len(df):,}")

# Drop rows where 'adr' or 'country' is missing (country was already filled above, so this is mostly a demo)
df_no_missing = df.dropna(subset=["adr", "country"])
print(f"Rows after dropna : {len(df_no_missing):,}")
print(f"Rows dropped      : {len(df) - len(df_no_missing):,}")
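The `how`, `subset`, and `thresh` options are easier to compare on a tiny frame where the missing cells are known. `df_demo` below is a made-up example, not the hotel data.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "adr":     [100.0, np.nan, np.nan, 80.0],
    "country": ["PRT", "GBR", None, None],
    "agent":   [9.0, np.nan, np.nan, 3.0],
})

any_missing = df_demo.dropna()                # drop rows with ANY NaN
all_missing = df_demo.dropna(how="all")       # drop rows where EVERY value is NaN
by_subset = df_demo.dropna(subset=["adr"])    # only look at 'adr'
at_least_two = df_demo.dropna(thresh=2)       # keep rows with >= 2 non-NaN values

print(f"how='any'      keeps rows {any_missing.index.tolist()}")
print(f"how='all'      keeps rows {all_missing.index.tolist()}")
print(f"subset=['adr'] keeps rows {by_subset.index.tolist()}")
print(f"thresh=2       keeps rows {at_least_two.index.tolist()}")
```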

In [None]:
# ============================================================
# 4d. drop_duplicates — remove duplicate rows
# ============================================================
# drop_duplicates() removes rows where ALL values are identical.
#   subset=[cols] → only check these columns for duplicates
#   keep='first'  → keep the first occurrence (default)
#   keep='last'   → keep the last occurrence
#   keep=False    → drop EVERY row that has a duplicate (no copy is kept)

print(f"Rows before dedup: {len(df):,}")

# Check for full-row duplicates
n_dupes = df.duplicated().sum()  # duplicated() marks True for duplicate rows
print(f"Full-row duplicates: {n_dupes:,}")

# Remove duplicates
df = df.drop_duplicates()
print(f"Rows after dedup : {len(df):,}")

# Dedup by specific columns (e.g., keep one booking per country + arrival year)
df_unique_country_year = df.drop_duplicates(
    subset=["country", "arrival_date_year"],
    keep="first"
)
print(f"\nUnique country-year combos: {len(df_unique_country_year):,}")
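The three `keep` settings differ only in which rows of a duplicate group survive. A sketch on a made-up `bookings` frame where rows 0 and 1 share the same key:

```python
import pandas as pd

bookings = pd.DataFrame({
    "country": ["PRT", "PRT", "GBR", "PRT"],
    "year":    [2016, 2016, 2017, 2017],
    "adr":     [100.0, 120.0, 90.0, 110.0],
})
keys = ["country", "year"]

first = bookings.drop_duplicates(subset=keys, keep="first")  # keeps row 0 of the pair
last = bookings.drop_duplicates(subset=keys, keep="last")    # keeps row 1 of the pair
neither = bookings.drop_duplicates(subset=keys, keep=False)  # drops both rows of the pair

print(f"keep='first' : rows {first.index.tolist()}")
print(f"keep='last'  : rows {last.index.tolist()}")
print(f"keep=False   : rows {neither.index.tolist()}")
```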

---
## 5. Pandas Joins — Combining DataFrames

- `merge()` — SQL-style joins on keys  
- `concat()` — stack DataFrames vertically or horizontally

In [None]:
# ============================================================
# 5a. merge — SQL-style joins
# ============================================================
# pd.merge(left, right, on=..., how=...)
#   on       : column(s) to join on (must exist in both DataFrames)
#   left_on / right_on : use when column names differ
#   how      : join type
#     'inner' → only matching rows (default) — like SQL INNER JOIN
#     'left'  → all left rows + matching right — like SQL LEFT JOIN
#     'right' → all right rows + matching left
#     'outer' → all rows from both sides — like SQL FULL OUTER JOIN

# Create a dimension table: hotel metadata
hotel_dim = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel"],
    "city": ["Lagos", "Lisbon"],
    "star_rating": [5, 4],
})

print("--- Hotel Dimension ---")
print(hotel_dim)

# LEFT JOIN: keep all bookings, enrich with hotel metadata
df_enriched = df.merge(
    hotel_dim,
    on="hotel",      # join column
    how="left",      # keep all rows from left (df)
)

print(f"\nAfter merge: {df_enriched.shape}")
print(df_enriched[["hotel", "city", "star_rating", "adr"]].head())

In [None]:
# ============================================================
# 5b. merge with different column names
# ============================================================

# Country codes dimension table (column name differs: 'code' vs 'country')
country_dim = pd.DataFrame({
    "code": ["PRT", "GBR", "FRA", "ESP", "DEU", "ITA", "USA", "BRA"],
    "country_name": ["Portugal", "United Kingdom", "France", "Spain",
                     "Germany", "Italy", "United States", "Brazil"],
    "region": ["Europe", "Europe", "Europe", "Europe",
               "Europe", "Europe", "Americas", "Americas"],
})

# Use left_on / right_on when column names don't match
df_with_country = df.merge(
    country_dim,
    left_on="country",    # column in left DataFrame
    right_on="code",      # column in right DataFrame
    how="left",
)

# Show some results (many countries won't match — that's expected)
print(df_with_country[["country", "country_name", "region"]].drop_duplicates().head(10))
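When a left join leaves some rows unmatched, `merge` can report which ones. The sketch below uses two of its optional parameters, `indicator` and `validate`, on small made-up frames; it is one possible way to audit a join, not the only one.

```python
import pandas as pd

bookings = pd.DataFrame({
    "country": ["PRT", "GBR", "CHN", "PRT"],
    "adr":     [100.0, 90.0, 150.0, 80.0],
})
country_dim = pd.DataFrame({
    "code":   ["PRT", "GBR"],
    "region": ["Europe", "Europe"],
})

joined = bookings.merge(
    country_dim,
    left_on="country",
    right_on="code",
    how="left",
    indicator=True,            # adds a '_merge' column: both / left_only / right_only
    validate="many_to_one",    # raises if 'code' is not unique in the dimension table
)

# Countries that found no match in the dimension table
unmatched = joined.loc[joined["_merge"] == "left_only", "country"].unique().tolist()

print(joined)
print(f"Unmatched countries: {unmatched}")
```

`validate` guards against a common join bug: duplicate keys in the dimension table silently multiplying rows on the left side.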

In [None]:
# ============================================================
# 5c. concat — stack DataFrames
# ============================================================
# pd.concat([df1, df2, ...], axis=0/1)
#   axis=0 → stack vertically (add rows)    — like SQL UNION ALL
#   axis=1 → stack horizontally (add columns)
#   ignore_index=True → reset the row index after stacking

# Split data by hotel, then recombine (simulates merging daily batches)
resort = df[df["hotel"] == "Resort Hotel"].head(3)
city = df[df["hotel"] == "City Hotel"].head(3)

print(f"Resort rows: {len(resort)}, City rows: {len(city)}")

# Stack vertically (UNION ALL)
combined = pd.concat([resort, city], axis=0, ignore_index=True)
print(f"Combined   : {len(combined)} rows")
print(combined[["hotel", "country", "adr"]])
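The cell above covers `axis=0`. With `axis=1`, `concat` aligns rows by index label rather than by position, which can introduce NaN when the indexes differ. A sketch with two made-up frames whose indexes only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"hotel": ["Resort Hotel", "City Hotel"]}, index=[0, 1])
right = pd.DataFrame({"adr": [95.0, 120.0]}, index=[1, 2])

# Default join='outer': union of index labels, gaps filled with NaN
side_by_side = pd.concat([left, right], axis=1)

# join='inner': keep only labels present in every input
overlap_only = pd.concat([left, right], axis=1, join="inner")

print(side_by_side)
print(overlap_only)
```

If the rows are meant to line up by position, resetting the index on each input first (`reset_index(drop=True)`) avoids the misalignment.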

---
## 6. Pandas Aggregations — Summarizing Data

`groupby` + `agg` is the Pandas equivalent of SQL's `GROUP BY` with aggregate functions.

In [None]:
# ============================================================
# 6a. groupby + agg — custom aggregations
# ============================================================
# df.groupby(column).agg(new_col=(source_col, func))
#
# Named aggregation syntax:
#   new_column_name = ("source_column", "agg_function")
#
# Common agg functions: 'sum', 'mean', 'median', 'min', 'max',
#                        'count', 'nunique', 'std', 'first', 'last'

hotel_summary = df.groupby("hotel").agg(
    total_bookings=("hotel", "count"),         # COUNT(hotel): non-null values per group
    avg_adr=("adr", "mean"),                    # AVG(adr)
    max_adr=("adr", "max"),                     # MAX(adr)
    avg_lead_time=("lead_time", "mean"),        # AVG(lead_time)
    unique_countries=("country", "nunique"),    # COUNT(DISTINCT country)
    cancel_rate=("is_canceled", "mean"),        # AVG(is_canceled) = cancel ratio
).reset_index()  # reset_index() moves the groupby key back to a regular column

print(hotel_summary.to_string(index=False))  # to_string() for clean printing
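One detail worth knowing when translating SQL: the `"count"` aggregation skips NaN, while `"size"` counts every row in the group. A sketch on a made-up frame with one missing `adr`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "adr":   [100.0, np.nan, 80.0],
})

summary = df_demo.groupby("hotel").agg(
    rows=("adr", "size"),           # all rows in the group, like COUNT(*)
    non_null_adr=("adr", "count"),  # only non-missing values, like COUNT(adr)
)

print(summary)
```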

In [None]:
# ============================================================
# 6b. groupby with multiple keys
# ============================================================
# Like SQL: GROUP BY hotel, arrival_date_year

yearly_summary = df.groupby(["hotel", "arrival_date_year"]).agg(
    bookings=("hotel", "count"),
    avg_adr=("adr", "mean"),
    total_revenue=("total_revenue", "sum"),
).reset_index()

# round() rounds all numeric columns to N decimal places
print(yearly_summary.round(2).to_string(index=False))

In [None]:
# ============================================================
# 6c. Custom aggregation functions
# ============================================================
# You can pass your OWN function to agg().
# The function receives a Series (all values in the group for that column).

def revenue_range(series):
    """Compute the difference between max and min values."""
    return series.max() - series.min()


custom_agg = df.groupby("hotel").agg(
    adr_range=("adr", revenue_range),            # custom function
    adr_p90=("adr", lambda s: s.quantile(0.9)),  # 90th percentile via lambda
    # quantile(0.9) returns the value below which 90% of data falls
).reset_index()

print(custom_agg.round(2).to_string(index=False))

In [None]:
# ============================================================
# 6d. pivot_table — spreadsheet-style aggregation
# ============================================================
# pivot_table() creates a cross-tabulation (rows × columns × values).
#   index   : row labels
#   columns : column labels
#   values  : what to aggregate
#   aggfunc : aggregation function (default: 'mean')
#   fill_value : value for missing combinations
#   margins : add an "All" row/column computed with aggfunc (true totals only for sum/count)

pivot = df.pivot_table(
    index="hotel",               # rows
    columns="quarter",           # columns
    values="adr",                # what to aggregate
    aggfunc="mean",              # how to aggregate
    fill_value=0,                # fill missing combos with 0
    margins=True,                # add "All" row and column (here: overall means, not sums)
)

print("--- Average ADR by Hotel × Quarter ---")
print(pivot.round(2))
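With `aggfunc="mean"`, the "All" margins are means over the underlying rows, not averages of the cell values. The sketch below checks this on a made-up frame where the two differ.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "hotel":   ["City Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "adr":     [100.0, 200.0, 60.0, 80.0],
})

pivot_demo = df_demo.pivot_table(
    index="hotel",
    columns="quarter",
    values="adr",
    aggfunc="mean",
    margins=True,
)

print(pivot_demo)

# City Hotel cells are 150 (Q1) and 60 (Q2); their simple average would be 105,
# but the margin is the mean of the three underlying rows: (100 + 200 + 60) / 3
city_margin = pivot_demo.loc["City Hotel", "All"]
overall = pivot_demo.loc["All", "All"]
print(f"City Hotel margin: {city_margin}, overall: {overall}")
```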

In [None]:
# ============================================================
# 6e. Booking count pivot with margins
# ============================================================

booking_pivot = df.pivot_table(
    index="hotel",
    columns="arrival_date_year",
    values="lead_time",          # any column without NaN works (count skips missing values)
    aggfunc="count",             # count rows
    fill_value=0,
    margins=True,
)

print("--- Booking Count by Hotel × Year ---")
print(booking_pivot)

---
## 7. Parquet — Columnar Storage for Data Engineering

**Parquet** is a widely used columnar file format for analytics and data lakes.  

| Feature | CSV | Parquet |
|---------|-----|--------|
| Storage | Row-based (text) | Columnar (binary) |
| Types | Everything is text | Preserves int, float, date, etc. |
| Size | Large (no compression) | Small (built-in compression) |
| Read speed | Must scan entire file | Can read only needed columns |
| Append | Easy | Not easy (immutable files) |
| Human readable | Yes | No (binary) |

In [None]:
import pandas as pd
from pathlib import Path

# ============================================================
# 7a. Writing Parquet with to_parquet
# ============================================================
# DataFrame.to_parquet(path, engine, compression, index)
#   engine      : 'pyarrow' (fast, full-featured) or 'fastparquet'
#   compression : 'snappy' (fast, default), 'gzip' (smaller), 'zstd' (balanced), None
#   index       : whether to include the DataFrame index in the file

output_dir = Path("data/output")
output_dir.mkdir(parents=True, exist_ok=True)

# Write with default compression (snappy)
parquet_snappy = output_dir / "hotel_bookings_snappy.parquet"
df.to_parquet(
    parquet_snappy,
    engine="pyarrow",       # pyarrow is the standard Parquet engine
    compression="snappy",   # snappy: fast compress/decompress, decent ratio
    index=False,            # don't save the row index
)

# Write with gzip compression (smaller but slower)
parquet_gzip = output_dir / "hotel_bookings_gzip.parquet"
df.to_parquet(parquet_gzip, engine="pyarrow", compression="gzip", index=False)

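# Compare file sizes
# NOTE: df holds only a subset of the CSV's columns (plus derived ones) and was
# deduplicated above, so these ratios overstate what the format alone achieves.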
csv_size = Path("data/hotel_booking.csv").stat().st_size / 1024**2
snappy_size = parquet_snappy.stat().st_size / 1024**2
gzip_size = parquet_gzip.stat().st_size / 1024**2

print(f"{'Format':<20} {'Size (MB)':>10} {'Ratio':>8}")
print(f"{'-'*20} {'-'*10} {'-'*8}")
print(f"{'CSV (original)':<20} {csv_size:>10.2f} {'1.00x':>8}")
print(f"{'Parquet (snappy)':<20} {snappy_size:>10.2f} {csv_size/snappy_size:>7.1f}x")
print(f"{'Parquet (gzip)':<20} {gzip_size:>10.2f} {csv_size/gzip_size:>7.1f}x")

In [None]:
# ============================================================
# 7b. Reading Parquet with read_parquet
# ============================================================
# pd.read_parquet(path, engine, columns)
#   columns : list of column names to read (HUGE performance win!)
#             Because Parquet is columnar, it only reads the columns you need.

# Read ALL columns
df_full = pd.read_parquet(parquet_snappy, engine="pyarrow")
print(f"Full read  : {df_full.shape} — {df_full.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Read only 3 columns (much faster on wide tables)
df_partial = pd.read_parquet(
    parquet_snappy,
    engine="pyarrow",
    columns=["hotel", "country", "adr"],  # only these columns are read from disk
)
print(f"Partial read: {df_partial.shape} — {df_partial.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

In [None]:
# ============================================================
# 7c. Parquet preserves data types (CSV does not)
# ============================================================
# When you write a CSV, everything becomes text.
# When you read it back, Pandas must GUESS the types.
# Parquet stores the EXACT types — no guessing, no data loss.

print("--- Dtypes from Parquet (preserved exactly) ---")
print(df_full.dtypes)

print("\n--- Dtypes from CSV (auto-inferred, may differ from what you intended) ---")
df_csv = pd.read_csv("data/hotel_booking.csv", usecols=["hotel", "is_canceled", "adr", "children"])
print(df_csv.dtypes)
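The CSV half of this claim can be checked without any files. The sketch below writes a small typed frame to an in-memory CSV and reads it back; the frame is made up for illustration.

```python
import io
import pandas as pd

typed = pd.DataFrame({
    "hotel": pd.Series(["City Hotel", "Resort Hotel"], dtype="category"),
    "lead_time": pd.Series([7, 30], dtype="int32"),
    "arrival": pd.to_datetime(["2017-07-01", "2017-07-02"]),
})

# Round-trip through CSV text held in memory
buffer = io.StringIO()
typed.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print("--- Before CSV ---")
print(typed.dtypes)
print("\n--- After CSV round-trip ---")
print(roundtrip.dtypes)
```

After the round-trip, the categorical column comes back as plain strings, `int32` widens to `int64`, and the dates are read as text unless `parse_dates` is passed.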

In [None]:
# ============================================================
# 7d. Inspecting Parquet metadata with PyArrow
# ============================================================
# PyArrow lets you read Parquet metadata WITHOUT loading the data.
# Useful for: checking schema, row count, compression before loading.

import pyarrow.parquet as pq

# pq.read_metadata() reads ONLY the file footer (fast, no data loaded)
meta = pq.read_metadata(parquet_snappy)

print(f"Num rows       : {meta.num_rows:,}")
print(f"Num columns    : {meta.num_columns}")
print(f"Num row groups : {meta.num_row_groups}")  # Parquet splits data into row groups
print(f"Created by     : {meta.created_by}")
print(f"Format version : {meta.format_version}")

# pq.read_schema() reads the column schema (names + types)
schema = pq.read_schema(parquet_snappy)
print(f"\n--- Schema ---")
for i in range(len(schema)):
    field = schema.field(i)  # field(i) returns the i-th column definition
    print(f"  {field.name:<30} {str(field.type):<15}")

In [None]:
# ============================================================
# 7e. Practical — CSV to Parquet conversion pipeline
# ============================================================
# A common DE task: convert raw CSV → optimized Parquet for downstream use.

import time
from pathlib import Path

csv_source = Path("data/hotel_booking.csv")
parquet_dest = Path("data/output/hotel_bookings_clean.parquet")

# --- Step 1: Read CSV with optimized types ---
start = time.perf_counter()

df_raw = pd.read_csv(
    csv_source,
    dtype={
        "hotel": "category",               # low-cardinality → category
        "country": "category",
        "market_segment": "category",
        "reserved_room_type": "category",
        "is_canceled": "int8",              # 0/1 flag → int8 (1 byte vs 8)
        "lead_time": "int32",               # small ints → int32 (4 bytes vs 8)
        "adr": "float32",                   # prices → float32 (4 bytes vs 8)
    },
)

csv_time = time.perf_counter() - start
print(f"CSV read     : {csv_time:.3f}s, {df_raw.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# --- Step 2: Write Parquet ---
start = time.perf_counter()
df_raw.to_parquet(parquet_dest, engine="pyarrow", compression="snappy", index=False)
write_time = time.perf_counter() - start
print(f"Parquet write: {write_time:.3f}s, {parquet_dest.stat().st_size / 1024**2:.1f} MB on disk")

# --- Step 3: Read Parquet back ---
start = time.perf_counter()
df_pq = pd.read_parquet(parquet_dest, engine="pyarrow")
pq_time = time.perf_counter() - start
print(f"Parquet read : {pq_time:.3f}s, {df_pq.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

print(f"\nRead speedup : {csv_time / pq_time:.1f}x faster than CSV")

---
## Key Takeaways

| Concept | Key Functions | When to Use |
|---------|--------------|-------------|
| **Numpy** | `np.where`, `np.select`, `np.unique`, `np.isnan`, `np.nanmean` | Vectorized math, conditional logic, missing data |
| **read_csv** | `usecols`, `dtype`, `nrows`, `parse_dates` | Load CSVs with memory optimization |
| **astype / map** | `astype('category')`, `map(dict)` | Cast types, map values |
| **apply** | `apply(func, axis=1)` | Complex row logic (prefer vectorized!) |
| **fillna / dropna** | `fillna(0)`, `dropna(subset=[...])` | Handle missing data |
| **drop_duplicates** | `drop_duplicates(subset=[...])` | Remove duplicate rows |
| **merge** | `merge(on=..., how='left')` | SQL-style joins between DataFrames |
| **concat** | `pd.concat([df1, df2])` | Stack DataFrames (UNION ALL) |
| **groupby + agg** | `.groupby().agg(new=(col, func))` | GROUP BY with named aggregations |
| **pivot_table** | `pivot_table(index, columns, values)` | Cross-tabulation summaries |
| **Parquet** | `to_parquet`, `read_parquet`, `pq.read_schema` | Fast, typed, compressed columnar storage |

---

## Next Steps

Continue with **`10_sql_analytics.ipynb`** — DuckDB: run SQL directly on CSVs, Parquet files, and DataFrames.