# 03b — Python Standard Libraries for Data Engineering

Python ships with powerful **"batteries-included"** modules. Mastering them means fewer  
external dependencies and more reliable, portable code.

### What you'll learn
| # | Module | Why it matters for DE |
|---|--------|-----------------------|
| 1 | **datetime** | Parse, format, and compute dates/times/timestamps/timezones |
| 2 | **os & sys** | Read environment variables, interact with the OS and runtime |
| 3 | **logging** | Production-grade logging with levels, formatting, and file output |
| 4 | **collections** | Specialized containers: `Counter`, `defaultdict`, `namedtuple` |
| 5 | **glob** | Pattern-based file discovery across directories |

---
## 1. datetime — Dates, Times, Timestamps & Timezones

The `datetime` module is the foundation for all time-related work.  
In data engineering, you'll constantly parse date strings from CSVs/APIs,  
compute durations, and convert between timezones.

In [1]:
from datetime import date, datetime, time, timedelta, timezone

# ============================================================
# 1a. Core Types
# ============================================================

# date(year, month, day) — a calendar date with NO time component
today = date.today()          # today() returns the current local date
specific = date(2025, 7, 15)  # construct a specific date
print(f"Today       : {today}")
print(f"Specific    : {specific}")
print(f"Year        : {specific.year}")     # .year, .month, .day are integer attributes
print(f"Day of week : {specific.weekday()}")  # weekday() → 0=Monday ... 6=Sunday
print(f"ISO weekday : {specific.isoweekday()}")  # isoweekday() → 1=Monday ... 7=Sunday

Today       : 2026-02-13
Specific    : 2025-07-15
Year        : 2025
Day of week : 1
ISO weekday : 2


In [None]:
# datetime(year, month, day, hour, minute, second, microsecond)
# Combines date + time into a single object.
now = datetime.now()            # now() returns current local date+time
utc_now = datetime.now(timezone.utc)  # now(tz) returns current time in a specific timezone

print(f"Local now : {now}")
print(f"UTC now   : {utc_now}")

# Access individual components
print(f"Hour      : {now.hour}")
print(f"Minute    : {now.minute}")
print(f"Date part : {now.date()}")   # .date() extracts just the date
print(f"Time part : {now.time()}")   # .time() extracts just the time

In [None]:
# time(hour, minute, second, microsecond) — time of day with NO date
checkout_time = time(11, 0, 0)  # 11:00:00 AM
print(f"Checkout time: {checkout_time}")
print(f"Hour         : {checkout_time.hour}")

In [None]:
# ============================================================
# 1b. Parsing Strings → datetime (strptime)
# ============================================================
# strptime(string, format) parses a date STRING into a datetime object.
# "strptime" = "string parse time"
#
# Common format codes:
#   %Y = 4-digit year (2025)      %y = 2-digit year (25)
#   %m = month (01-12)            %B = full month name (July)
#   %d = day (01-31)              %b = abbreviated month (Jul)
#   %H = hour 24h (00-23)         %I = hour 12h (01-12)
#   %M = minute (00-59)           %S = second (00-59)
#   %p = AM/PM

# ISO format (most common in databases/APIs)
dt1 = datetime.strptime("2025-07-15 14:30:00", "%Y-%m-%d %H:%M:%S")
print(f"Parsed ISO     : {dt1}")

# US-style format
dt2 = datetime.strptime("07/15/2025", "%m/%d/%Y")
print(f"Parsed US date : {dt2}")

# From the hotel dataset: arrival date looks like "July" + "2015" + "1"
# We can reconstruct and parse it:
date_str = "July 1, 2015"
dt3 = datetime.strptime(date_str, "%B %d, %Y")
print(f"Parsed hotel   : {dt3}")
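
Real feeds often mix several date formats in one column. The sketch below tries a list of formats in order. The helper name `parse_flexible` and the `KNOWN_FORMATS` tuple are illustrative assumptions, not part of the hotel dataset or the standard library.

```python
from datetime import datetime

# Hypothetical list of formats expected in the source data; adjust to your feed.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def parse_flexible(value: str) -> datetime:
    """Try each known format in order; raise ValueError if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # wrong format, try the next one
    raise ValueError(f"Unrecognized date format: {value!r}")


parsed = [parse_flexible(s) for s in ("2025-07-15", "07/15/2025", "July 15, 2025")]
print(parsed)
```

The first matching format wins, so ambiguous inputs such as `03/04/2025` are read according to whichever format is listed first.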

In [None]:
# ============================================================
# 1c. Formatting datetime → String (strftime)
# ============================================================
# strftime(format) formats a datetime INTO a string.
# "strftime" = "string format time"

now = datetime.now()

print(f"ISO format     : {now.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Date only      : {now.strftime('%Y-%m-%d')}")
print(f"Human readable : {now.strftime('%B %d, %Y at %I:%M %p')}")
print(f"Compact (logs) : {now.strftime('%Y%m%d_%H%M%S')}")

# isoformat() is a shorthand for the standard ISO 8601 format
print(f"isoformat()    : {now.isoformat()}")
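
When both the writer and the reader are Python, `isoformat()` pairs with `datetime.fromisoformat()` for a lossless round trip. A small sketch with an illustrative fixed timestamp:

```python
from datetime import datetime, timezone

original = datetime(2025, 7, 15, 14, 30, 0, tzinfo=timezone.utc)

text = original.isoformat()               # serialize, UTC offset included
restored = datetime.fromisoformat(text)   # parse the same string back

print(text)
print(restored == original)
```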

In [None]:
# ============================================================
# 1d. timedelta — Date/Time Arithmetic
# ============================================================
# timedelta represents a DURATION (difference between two dates/times).
# You can add/subtract timedeltas to/from dates and datetimes.

from datetime import timedelta

today = date.today()

# Adding durations
tomorrow = today + timedelta(days=1)        # timedelta(days=N) creates an N-day duration
next_week = today + timedelta(weeks=1)      # weeks is a shorthand for days=7
two_hours_later = datetime.now() + timedelta(hours=2, minutes=30)

print(f"Today          : {today}")
print(f"Tomorrow       : {tomorrow}")
print(f"Next week      : {next_week}")
print(f"+2h30m         : {two_hours_later.strftime('%H:%M:%S')}")

# Subtracting dates gives a timedelta
checkin = date(2025, 7, 10)
checkout = date(2025, 7, 15)
stay = checkout - checkin
print(f"\nCheck-in       : {checkin}")
print(f"Check-out      : {checkout}")
print(f"Stay duration  : {stay.days} days")   # .days extracts the day count as int
print(f"Total seconds  : {stay.total_seconds()}")  # total_seconds() converts to float seconds
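
Date arithmetic also makes it easy to enumerate the days in a range, for example when backfilling daily partitions. A minimal sketch, assuming a hypothetical helper named `daterange` with an exclusive end date:

```python
from datetime import date, timedelta
from typing import Iterator


def daterange(start: date, end: date) -> Iterator[date]:
    """Yield each date from start (inclusive) to end (exclusive)."""
    for offset in range((end - start).days):
        yield start + timedelta(days=offset)


# One entry per night of the stay used in the cell above
nights = list(daterange(date(2025, 7, 10), date(2025, 7, 15)))
print(len(nights), nights[0], nights[-1])
```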

In [None]:
# ============================================================
# 1e. Timezones
# ============================================================
# In production pipelines, ALWAYS store timestamps in UTC.
# Convert to local time only for display.
#
# timezone.utc          → the UTC timezone (built-in, no extra library)
# timezone(timedelta())  → create a fixed-offset timezone (no DST rules; the
#                          stdlib zoneinfo module, Python 3.9+, handles named zones)

from datetime import timezone, timedelta

# Create timezone-AWARE datetimes ("aware" = knows its timezone)
utc_now = datetime.now(timezone.utc)
print(f"UTC now       : {utc_now}")
print(f"UTC isoformat : {utc_now.isoformat()}")

# Create a custom timezone: UTC+7 (Jakarta / WIB)
wib = timezone(timedelta(hours=7))   # fixed offset from UTC
jakarta_now = utc_now.astimezone(wib) # astimezone() converts to another timezone
print(f"Jakarta (WIB) : {jakarta_now}")

# UTC+9 (Tokyo / JST)
jst = timezone(timedelta(hours=9))
tokyo_now = utc_now.astimezone(jst)
print(f"Tokyo (JST)   : {tokyo_now}")

# Naive vs Aware:
# "Naive" datetime has NO timezone info → dangerous in production!
naive = datetime.now()              # naive — no tz info
aware = datetime.now(timezone.utc)  # aware — knows it's UTC
print(f"\nNaive tzinfo  : {naive.tzinfo}")   # None
print(f"Aware tzinfo  : {aware.tzinfo}")     # UTC
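
Source systems often deliver naive timestamps whose timezone is only known from documentation. The sketch below assumes the source clock runs on a fixed UTC+7 offset; that offset is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone

wib = timezone(timedelta(hours=7))  # fixed UTC+7 offset, as in the cell above

# A naive timestamp parsed from a source that is ASSUMED to be in WIB
naive_local = datetime.strptime("2025-07-15 14:30:00", "%Y-%m-%d %H:%M:%S")

# replace(tzinfo=...) labels the value without shifting the clock time
aware_local = naive_local.replace(tzinfo=wib)

# astimezone() converts the same instant to UTC for storage
as_utc = aware_local.astimezone(timezone.utc)
print(as_utc.isoformat())

# Ordering a naive value against an aware one raises TypeError
try:
    naive_local < as_utc
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Use `replace(tzinfo=...)` only to label a naive value. To move an aware value between zones, use `astimezone()`.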

In [2]:
# ============================================================
# 1f. Timestamps (Unix Epoch)
# ============================================================
# A Unix timestamp is the number of seconds since Jan 1, 1970 00:00 UTC.
# It's the universal exchange format for time data between systems.

from datetime import datetime, timezone

now_utc = datetime.now(timezone.utc)

# datetime → timestamp (float seconds since epoch)
ts = now_utc.timestamp()  # timestamp() returns a float
print(f"Unix timestamp : {ts}")
print(f"As integer     : {int(ts)}")

# timestamp → datetime
# fromtimestamp(ts, tz) converts back; ALWAYS pass tz=timezone.utc
restored = datetime.fromtimestamp(ts, tz=timezone.utc)
print(f"Restored       : {restored}")
print(f"Match          : {restored == now_utc}")

Unix timestamp : 1771021966.457106
As integer     : 1771021966
Restored       : 2026-02-13 22:32:46.457106+00:00
Match          : True
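
Many APIs and message queues send epoch timestamps in milliseconds rather than seconds. A short sketch of the conversion, using an illustrative value:

```python
from datetime import datetime, timezone

ts_ms = 1_700_000_000_000  # illustrative millisecond timestamp

# fromtimestamp() expects SECONDS, so divide by 1000 first
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(dt.isoformat())

# Going the other way: whole milliseconds since the epoch
back_to_ms = int(dt.timestamp() * 1000)
print(back_to_ms == ts_ms)
```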


In [6]:
# ============================================================
# 1g. Practical — Parse hotel booking dates
# ============================================================
# The hotel dataset has: arrival_date_year, arrival_date_month,
# arrival_date_day_of_month as SEPARATE columns.
# Let's combine them into proper date objects.

import csv
from pathlib import Path
from datetime import date, datetime, timedelta  # date and timedelta are used below


def parse_hotel_date(row: dict) -> date:
    """
    Combine year + month_name + day columns into a date object.
    Example: 2015, 'July', 1 → date(2015, 7, 1)
    """
    date_str = f"{row['arrival_date_month']} {row['arrival_date_day_of_month']}, {row['arrival_date_year']}"
    return datetime.strptime(date_str, "%B %d, %Y").date()  # .date() drops the time part


# Parse the first 10 bookings
hotel_csv = Path("data/hotel_booking.csv")

with hotel_csv.open("r", newline="") as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        if i >= 10:
            break
        arrival = parse_hotel_date(row)
        lead_days = int(row["lead_time"])  # days between booking and arrival
        booking_date = arrival - timedelta(days=lead_days)
        print(
            f"  Booking #{i+1}: booked {booking_date} → arrives {arrival}"
            f"  (lead time: {lead_days} days)"
        )

  Booking #1: booked 2014-07-24 → arrives 2015-07-01  (lead time: 342 days)
  Booking #2: booked 2013-06-24 → arrives 2015-07-01  (lead time: 737 days)
  Booking #3: booked 2015-06-24 → arrives 2015-07-01  (lead time: 7 days)
  Booking #4: booked 2015-06-18 → arrives 2015-07-01  (lead time: 13 days)
  Booking #5: booked 2015-06-17 → arrives 2015-07-01  (lead time: 14 days)
  Booking #6: booked 2015-06-17 → arrives 2015-07-01  (lead time: 14 days)
  Booking #7: booked 2015-07-01 → arrives 2015-07-01  (lead time: 0 days)
  Booking #8: booked 2015-06-22 → arrives 2015-07-01  (lead time: 9 days)
  Booking #9: booked 2015-04-07 → arrives 2015-07-01  (lead time: 85 days)
  Booking #10: booked 2015-04-17 → arrives 2015-07-01  (lead time: 75 days)


---
## 2. os & sys — Operating System & Runtime Interaction

- `os` — environment variables, process info, filesystem operations  
- `sys` — Python runtime info, interpreter settings, command-line args  

Critical for making pipelines **configurable** (dev vs prod) via env vars.

In [7]:
import os

# ============================================================
# 2a. Environment Variables
# ============================================================
# In production, pipelines read config from environment variables
# (database URLs, API keys, feature flags) — NEVER hardcode secrets.

# os.environ is a dict-like object of ALL environment variables
print(f"Total env vars : {len(os.environ)}")

# os.getenv(key, default) safely reads an env var.
# Returns `default` if the key doesn't exist (instead of raising KeyError).
home = os.getenv("HOME", "/unknown")        # user's home directory
user = os.getenv("USER", "unknown")          # current username
db_url = os.getenv("DATABASE_URL", "sqlite:///local.db")  # fallback to local DB

print(f"HOME         : {home}")
print(f"USER         : {user}")
print(f"DATABASE_URL : {db_url} (fallback — not set)")

Total env vars : 18
HOME         : /root
USER         : unknown
DATABASE_URL : sqlite:///local.db (fallback — not set)


In [None]:
# Setting environment variables at runtime
# os.environ[key] = value sets a variable for THIS process and its children.
os.environ["PIPELINE_ENV"] = "development"
os.environ["BATCH_SIZE"] = "5000"  # values MUST be strings

# Read them back
env = os.getenv("PIPELINE_ENV")
batch = int(os.getenv("BATCH_SIZE", "1000"))  # cast to int after reading

print(f"PIPELINE_ENV : {env}")
print(f"BATCH_SIZE   : {batch} (type: {type(batch).__name__})")

# Clean up — remove a variable
# os.environ.pop(key, default) removes the key, returns its value
os.environ.pop("PIPELINE_ENV", None)
print(f"After pop    : {os.getenv('PIPELINE_ENV', 'NOT SET')}")
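
Reading and casting every variable in one place keeps the rest of the pipeline free of string handling. A minimal sketch, assuming a hypothetical `PipelineConfig` dataclass and a hypothetical `DRY_RUN` flag. The mapping is passed in explicitly so the example does not depend on the real environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PipelineConfig:
    env: str
    batch_size: int
    dry_run: bool

    @classmethod
    def from_env(cls, environ: Mapping[str, str] = os.environ) -> "PipelineConfig":
        """Build a config from a dict-like of strings (defaults to os.environ)."""
        return cls(
            env=environ.get("PIPELINE_ENV", "development"),
            batch_size=int(environ.get("BATCH_SIZE", "1000")),
            # Env vars are strings, so "false" would be truthy without parsing
            dry_run=environ.get("DRY_RUN", "false").strip().lower() in {"1", "true", "yes"},
        )


config = PipelineConfig.from_env({"PIPELINE_ENV": "production", "BATCH_SIZE": "5000"})
print(config)
```

Calling `PipelineConfig.from_env()` with no argument reads from `os.environ`.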

In [None]:
# ============================================================
# 2b. Process & Filesystem Info (os)
# ============================================================

# os.getcwd() returns the Current Working Directory as a string
print(f"CWD          : {os.getcwd()}")

# os.getpid() returns the Process ID of the current Python process
print(f"Process ID   : {os.getpid()}")

# os.cpu_count() returns the number of logical CPUs, or None if it cannot be
# determined (useful for sizing parallelism)
print(f"CPU cores    : {os.cpu_count()}")

# os.listdir(path) returns a list of filenames in a directory
# (Pathlib's iterdir() is generally preferred, but os.listdir is still common)
files = os.listdir("data")
print(f"\nFiles in data/ ({len(files)}):")
for name in sorted(files):
    print(f"  {name}")

In [None]:
import sys

# ============================================================
# 2c. Python Runtime Info (sys)
# ============================================================

# sys.version — full Python version string
print(f"Python version  : {sys.version}")

# sys.version_info — structured version (major, minor, micro, ...)
v = sys.version_info
print(f"Version tuple   : {v.major}.{v.minor}.{v.micro}")

# sys.platform — OS identifier ('linux', 'darwin' for macOS, 'win32')
print(f"Platform        : {sys.platform}")

# sys.executable — path to the Python interpreter binary
print(f"Executable      : {sys.executable}")

# sys.path — list of directories Python searches for imports
# (useful for debugging "module not found" errors)
print(f"\nFirst 3 sys.path entries:")
for p in sys.path[:3]:
    print(f"  {p}")

In [None]:
# ============================================================
# 2d. sys.getsizeof — check object memory usage
# ============================================================
# sys.getsizeof(obj) returns the memory size of a Python object in BYTES.
# Helpful for understanding memory consumption in pipelines.

small_list = list(range(100))
big_list = list(range(1_000_000))
sample_dict = {"booking_id": "B-1001", "guest": "Alya", "revenue": 120.0}
sample_str = "A" * 10_000

print(f"list(100)        : {sys.getsizeof(small_list):>10,} bytes")
print(f"list(1_000_000)  : {sys.getsizeof(big_list):>10,} bytes")
print(f"dict (3 keys)    : {sys.getsizeof(sample_dict):>10,} bytes")
print(f"str (10K chars)  : {sys.getsizeof(sample_str):>10,} bytes")

# Note: getsizeof only measures the CONTAINER, not nested objects.
# For deep measurement, you'd need a recursive function or `pympler`.
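
The recursive approach mentioned above can be sketched as follows. `deep_getsizeof` is a hypothetical helper that only descends into dicts, lists, tuples, and sets, so treat its result as an approximation.

```python
import sys


def deep_getsizeof(obj, _seen=None) -> int:
    """Approximate total size in bytes of obj plus the objects it contains."""
    seen = _seen if _seen is not None else set()
    if id(obj) in seen:          # avoid double counting shared objects and cycles
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


record = {"booking_id": "B-1001", "guest": "Alya", "revenue": 120.0}
print(f"shallow: {sys.getsizeof(record)} bytes")
print(f"deep   : {deep_getsizeof(record)} bytes")
```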

---
## 3. logging — Production-Grade Logging

**`print()` is for debugging. `logging` is for production.**

The `logging` module provides:
- **Severity levels** — DEBUG, INFO, WARNING, ERROR, CRITICAL  
- **Customizable format** — timestamps, module names, line numbers  
- **Multiple outputs** — console, file, remote servers  
- **Filtering** — show only WARNING+ in prod, DEBUG+ in dev

In [8]:
import logging

# ============================================================
# 3a. Basic Setup with basicConfig
# ============================================================
# logging.basicConfig() configures the ROOT logger (global default).
#   level  : minimum severity to display (everything below is ignored)
#   format : template for log messages
#   force  : True resets any existing config (needed in notebooks)
#
# Severity levels (from lowest to highest):
#   DEBUG    (10) — detailed diagnostic info (verbose)
#   INFO     (20) — confirmation that things work as expected
#   WARNING  (30) — something unexpected but not an error (default)
#   ERROR    (40) — a serious problem; some functionality failed
#   CRITICAL (50) — the program may not be able to continue

logging.basicConfig(
    level=logging.DEBUG,  # show everything from DEBUG and above
    format="%(asctime)s | %(levelname)-8s | %(message)s",
    # Format codes:
    #   %(asctime)s   — timestamp string
    #   %(levelname)s — severity level name (e.g. 'INFO')
    #   %(message)s   — the actual log message
    #   %(name)s      — logger name (default: 'root')
    #   %(filename)s  — source file name
    #   %(lineno)d    — line number in source file
    #   -8 in %(levelname)-8s left-aligns the value and pads it to 8 chars
    datefmt="%Y-%m-%d %H:%M:%S",  # custom timestamp format
    force=True,  # reset any prior config (important in Jupyter)
)

# Each level has its own function
logging.debug("Loading configuration...")       # verbose detail
logging.info("Pipeline started successfully")    # normal operation
logging.warning("CSV file has 15 empty rows")    # unexpected but recoverable
logging.error("Failed to connect to database")   # something broke
logging.critical("Out of disk space — aborting")  # fatal

2026-02-13 22:36:06 | DEBUG    | Loading configuration...
2026-02-13 22:36:06 | INFO     | Pipeline started successfully
2026-02-13 22:36:06 | WARNING  | CSV file has 15 empty rows
2026-02-13 22:36:06 | ERROR    | Failed to connect to database
2026-02-13 22:36:06 | CRITICAL | Out of disk space — aborting


In [9]:
# ============================================================
# 3b. Named Loggers — scoped logging for modules
# ============================================================
# In real projects, each module gets its OWN named logger.
# This lets you control log levels per module.
#
# logging.getLogger(name) creates or retrieves a named logger.
# Convention: use __name__ so the logger matches the module name.

logger = logging.getLogger("hotel_pipeline")

# You can set a DIFFERENT level for this specific logger
logger.setLevel(logging.INFO)  # only INFO and above for this logger

logger.debug("This won't appear — level is INFO")  # below INFO → suppressed
logger.info("Processing hotel_booking.csv")
logger.warning("Found 42 bookings with negative ADR")

2026-02-13 22:36:11 | INFO     | Processing hotel_booking.csv
2026-02-13 22:36:11 | WARNING  | Found 42 bookings with negative ADR


In [None]:
# ============================================================
# 3c. Logging with Variables — f-string vs % formatting
# ============================================================
# Two ways to include variables in log messages:

total_rows = 119_390
skipped = 42

# Method 1: %-style (traditional, lazy evaluation — only formats if level is shown)
logger.info("Processed %d rows, skipped %d", total_rows, skipped)

# Method 2: f-string (modern, always evaluates even if log is suppressed)
logger.info(f"Processed {total_rows:,} rows, skipped {skipped}")

# Best practice: Use %-style for hot paths (millions of calls),
# f-strings are fine for most DE pipelines.
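
The difference in evaluation can be observed directly. In this sketch, the `Expensive` class and the `lazy_demo` logger are made up for the demonstration; the class counts how often its `__str__` runs.

```python
import logging


class Expensive:
    """Stand-in for a value that is costly to render as text."""
    calls = 0

    def __str__(self) -> str:
        Expensive.calls += 1
        return "expensive"


demo_logger = logging.getLogger("lazy_demo")
demo_logger.setLevel(logging.WARNING)        # INFO messages are suppressed
demo_logger.addHandler(logging.NullHandler())
demo_logger.propagate = False

obj = Expensive()

demo_logger.info("value=%s", obj)            # suppressed: __str__ is never called
lazy_calls = Expensive.calls

demo_logger.info(f"value={obj}")             # f-string is built BEFORE info() runs
eager_calls = Expensive.calls - lazy_calls

print(f"%-style calls : {lazy_calls}")
print(f"f-string calls: {eager_calls}")
```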

In [None]:
# ============================================================
# 3d. Logging to a File
# ============================================================
# FileHandler writes log messages to a file on disk.
# You can have MULTIPLE handlers: one for console, one for file.

from pathlib import Path

log_dir = Path("data/output")
log_dir.mkdir(parents=True, exist_ok=True)

# Create a fresh logger for this demo
file_logger = logging.getLogger("file_demo")
file_logger.setLevel(logging.DEBUG)

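# Remove any existing handlers (cleanup for re-running in notebook)
file_logger.handlers.clear()
# Stop records from also reaching the root logger's console handler (set up in 3a),
# which would otherwise print every message, including DEBUG, a second time
file_logger.propagate = False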

# FileHandler(filename, mode) — mode 'a' appends, 'w' overwrites
fh = logging.FileHandler(log_dir / "pipeline.log", mode="w")
fh.setLevel(logging.DEBUG)

# Formatter controls the output format for this handler
formatter = logging.Formatter(
    "%(asctime)s | %(name)s | %(levelname)-8s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
fh.setFormatter(formatter)     # attach the format to the handler
file_logger.addHandler(fh)     # attach the handler to the logger

# Also add a StreamHandler for console output
sh = logging.StreamHandler()   # StreamHandler() writes to stderr by default
sh.setLevel(logging.INFO)      # console only shows INFO+
sh.setFormatter(formatter)
file_logger.addHandler(sh)

# Now log some messages
file_logger.debug("Debug: loading config")   # → file only (below console level)
file_logger.info("Info: pipeline started")    # → file + console
file_logger.warning("Warning: 15 null rows")  # → file + console

# Show the log file contents
print("\n--- pipeline.log ---")
print((log_dir / "pipeline.log").read_text())
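
A plain `FileHandler` grows without limit in long-running pipelines. The standard library's `logging.handlers.RotatingFileHandler` caps the file size and keeps a fixed number of backups. A sketch with deliberately tiny limits and a temporary directory; the logger name, limits, and message are illustrative.

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path

rotating_logger = logging.getLogger("rotating_demo")
rotating_logger.setLevel(logging.INFO)
rotating_logger.propagate = False   # keep demo output out of the root logger
rotating_logger.handlers.clear()

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "pipeline.log"

    # Roll over once the file would exceed ~1 KB; keep at most 2 old files
    handler = RotatingFileHandler(log_path, maxBytes=1024, backupCount=2)
    handler.setFormatter(logging.Formatter("%(asctime)s | %(levelname)-8s | %(message)s"))
    rotating_logger.addHandler(handler)

    for i in range(100):
        rotating_logger.info("Processed batch %d", i)

    # Detach and close the handler before the temp directory is removed
    rotating_logger.removeHandler(handler)
    handler.close()
    log_files = sorted(p.name for p in Path(tmp).iterdir())

print(log_files)
```

In production, `maxBytes` is usually set to megabytes rather than bytes.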

In [None]:
# ============================================================
# 3e. Logging Exceptions with traceback
# ============================================================
# logger.exception() logs at ERROR level and INCLUDES the full traceback.
# Always use it inside an `except` block.

try:
    result = int("not_a_number")
except ValueError:
    # exception() automatically captures the traceback from the current exception
    file_logger.exception("Failed to parse value")
    # This logs:
    #   ERROR | Failed to parse value
    #   Traceback (most recent call last): ...
    #   ValueError: invalid literal for int() ...
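
In a batch job you usually want to log the failure and keep processing the remaining rows. A minimal sketch that writes to an in-memory stream so the log text can be inspected; the logger name and sample values are illustrative.

```python
import io
import logging

buffer = io.StringIO()  # in-memory stream so the demo can inspect the log text
row_logger = logging.getLogger("row_demo")
row_logger.setLevel(logging.INFO)
row_logger.propagate = False
row_logger.handlers.clear()
row_logger.addHandler(logging.StreamHandler(buffer))

raw_values = ["120.5", "n/a", "95"]   # illustrative ADR strings
parsed, failed = [], 0

for line_no, raw in enumerate(raw_values, start=1):
    try:
        parsed.append(float(raw))
    except ValueError:
        failed += 1
        # Logs at ERROR level with the traceback, then the loop continues
        row_logger.exception("Row %d: could not parse %r", line_no, raw)

row_logger.info("Parsed %d values, %d failed", len(parsed), failed)
print(buffer.getvalue())
```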

---
## 4. collections — Specialized Data Structures

The `collections` module provides high-performance alternatives to plain dicts/lists.  
These are the ones you'll use **constantly** in DE pipelines.

In [10]:
from collections import Counter

# ============================================================
# 4a. Counter — count occurrences of anything
# ============================================================
# Counter is a dict subclass: {element: count}.
# Like SQL: SELECT value, COUNT(*) FROM ... GROUP BY value

# Count from a list
room_types = ["Suite", "Deluxe", "Standard", "Suite", "Deluxe", "Suite", "Standard"]
room_counts = Counter(room_types)
print(f"Room counts: {room_counts}")
# Counter({'Suite': 3, 'Deluxe': 2, 'Standard': 2})

# most_common(n) returns the n most frequent elements as (element, count) tuples
print(f"Most common: {room_counts.most_common(2)}")

# Access count for a specific element (returns 0 if not found — never KeyError!)
print(f"Suite count : {room_counts['Suite']}")
print(f"Penthouse   : {room_counts['Penthouse']}")  # 0, not KeyError

Room counts: Counter({'Suite': 3, 'Deluxe': 2, 'Standard': 2})
Most common: [('Suite', 3), ('Deluxe', 2)]
Suite count : 3
Penthouse   : 0


In [11]:
# ============================================================
# 4b. Counter — practical: count countries in hotel CSV
# ============================================================
import csv
from pathlib import Path
from collections import Counter

hotel_csv = Path("data/hotel_booking.csv")

# Stream the CSV and count countries using a generator expression
with hotel_csv.open("r", newline="") as f:
    reader = csv.DictReader(f)
    # Counter() accepts any iterable — here, a generator of country values
    country_counts = Counter(row["country"] for row in reader)

print(f"Unique countries: {len(country_counts)}")
print(f"\nTop 10 countries:")
for country, count in country_counts.most_common(10):
    print(f"  {country:>5} : {count:>6,} bookings")

Unique countries: 178

Top 10 countries:
    PRT : 48,590 bookings
    GBR : 12,129 bookings
    FRA : 10,415 bookings
    ESP :  8,568 bookings
    DEU :  7,287 bookings
    ITA :  3,766 bookings
    IRL :  3,375 bookings
    BEL :  2,342 bookings
    BRA :  2,224 bookings
    NLD :  2,104 bookings
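
When the data arrives in batches, `Counter.update()` lets you keep a running tally without holding every row in memory. A sketch with made-up chunks standing in for batches read from a large file:

```python
from collections import Counter

# Illustrative chunks, standing in for batches read from a large file
chunks = [
    ["PRT", "GBR", "PRT"],
    ["FRA", "PRT", "GBR"],
    ["ESP"],
]

running = Counter()
for chunk in chunks:
    running.update(chunk)   # update() ADDS to existing counts; it does not replace them

print(running.most_common(2))
```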


In [12]:
# Counter arithmetic — combine or subtract counts
july_rooms = Counter({"Suite": 50, "Deluxe": 30, "Standard": 80})
aug_rooms = Counter({"Suite": 45, "Deluxe": 55, "Standard": 70})

# Addition: combine counts
total = july_rooms + aug_rooms
print(f"Total (Jul+Aug) : {total}")

# Subtraction: difference (zero and negative counts are dropped)
diff = aug_rooms - july_rooms
print(f"Aug gained over Jul: {diff}")

# total() returns the sum of all counts (Python 3.10+)
try:
    print(f"Grand total     : {total.total()}")
except AttributeError:
    print(f"Grand total     : {sum(total.values())}")

Total (Jul+Aug) : Counter({'Standard': 150, 'Suite': 95, 'Deluxe': 85})
Aug gained over Jul: Counter({'Deluxe': 25})
Grand total     : 330


In [None]:
from collections import defaultdict

# ============================================================
# 4c. defaultdict — auto-initialize missing keys
# ============================================================
# A regular dict raises KeyError on missing keys.
# defaultdict(factory) automatically creates a default value
# using the factory function when you access a missing key.
#
# Common factories:
#   defaultdict(int)   → missing key = 0    (great for counting)
#   defaultdict(list)  → missing key = []   (great for grouping)
#   defaultdict(set)   → missing key = set()
#   defaultdict(float) → missing key = 0.0

# --- Example: Group bookings by hotel type ---
# Without defaultdict: you'd need `if key not in d: d[key] = []` every time.

bookings_by_hotel = defaultdict(list)  # missing key → empty list

with hotel_csv.open("r", newline="") as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        if i >= 20:  # just first 20 rows for demo
            break
        hotel = row["hotel"]
        # No need to check if key exists — defaultdict creates [] automatically
        bookings_by_hotel[hotel].append(row["country"])

for hotel, countries in bookings_by_hotel.items():
    print(f"{hotel}: {len(countries)} bookings from {countries}")

In [None]:
# --- defaultdict(float) and defaultdict(int) for running totals ---
# Counter covers plain counting; defaultdict helps when you also accumulate sums or need custom logic.

revenue_by_hotel = defaultdict(float)  # missing key → 0.0
count_by_hotel = defaultdict(int)      # missing key → 0

with hotel_csv.open("r", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        hotel = row["hotel"]
        try:
            adr = float(row["adr"])
        except (ValueError, KeyError):
            adr = 0.0
        revenue_by_hotel[hotel] += adr
        count_by_hotel[hotel] += 1

print("Hotel Revenue Summary:")
for hotel in revenue_by_hotel:
    avg = revenue_by_hotel[hotel] / count_by_hotel[hotel]
    print(f"  {hotel:>15} : {count_by_hotel[hotel]:>6,} bookings, avg ADR ${avg:.2f}")
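
Factories can be nested for two-level grouping, such as bookings per hotel per country. A sketch using a few in-memory rows with the same column names as the hotel dataset; the row values are invented for the example.

```python
from collections import defaultdict

# Illustrative rows with the same column names as the hotel dataset
rows = [
    {"hotel": "Resort Hotel", "country": "PRT"},
    {"hotel": "Resort Hotel", "country": "GBR"},
    {"hotel": "City Hotel", "country": "PRT"},
    {"hotel": "Resort Hotel", "country": "PRT"},
]

# Outer missing key -> new inner defaultdict(int); inner missing key -> 0
counts = defaultdict(lambda: defaultdict(int))
for row in rows:
    counts[row["hotel"]][row["country"]] += 1

# Convert to plain dicts before serializing or returning the result
plain = {hotel: dict(by_country) for hotel, by_country in counts.items()}
print(plain)
```

Converting to plain dicts at the end prevents later lookups from silently creating new keys.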

In [13]:
from collections import namedtuple

# ============================================================
# 4d. namedtuple — lightweight, immutable data records
# ============================================================
# namedtuple creates a tuple subclass with named fields and a few helper
# methods (_asdict(), _replace(), _make()).
# Like a dict but:
#   - Immutable (can't change values after creation)
#   - Uses less memory than a dict
#   - Access by NAME (record.field) and by INDEX (record[0])
#
# Great for representing structured records in pipelines.

# Define a namedtuple type (like defining a class)
# namedtuple(typename, field_names)
Booking = namedtuple("Booking", ["booking_id", "guest", "hotel", "adr", "country"])

# Create instances
b1 = Booking("B-1001", "Alya", "Resort Hotel", 120.0, "IDN")
b2 = Booking("B-1002", "Rafi", "City Hotel", 95.0, "PRT")

# Access by name (preferred — more readable)
print(f"Guest: {b1.guest}, Hotel: {b1.hotel}, ADR: ${b1.adr}")

# Access by index (like a regular tuple)
print(f"First field: {b1[0]}")

# Unpack like a tuple
bid, guest, hotel, adr, country = b2
print(f"Unpacked: {guest} at {hotel}")

# Convert to dict
print(f"As dict: {b1._asdict()}")

# _replace() creates a NEW namedtuple with some fields changed (immutable!)
b1_updated = b1._replace(adr=150.0)
print(f"Updated ADR: {b1_updated.adr} (original: {b1.adr})")

Guest: Alya, Hotel: Resort Hotel, ADR: $120.0
First field: B-1001
Unpacked: Rafi at City Hotel
As dict: {'booking_id': 'B-1001', 'guest': 'Alya', 'hotel': 'Resort Hotel', 'adr': 120.0, 'country': 'IDN'}
Updated ADR: 150.0 (original: 120.0)
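
Rows from `csv.reader` arrive as lists of strings, and `_make()` turns each one into a record. A sketch that reads from an in-memory string instead of a file; the CSV text is invented for the example.

```python
import csv
import io
from collections import namedtuple

Booking = namedtuple("Booking", ["booking_id", "guest", "hotel", "adr", "country"])

# Illustrative CSV text standing in for a file on disk
csv_text = (
    "booking_id,guest,hotel,adr,country\n"
    "B-1001,Alya,Resort Hotel,120.0,IDN\n"
    "B-1002,Rafi,City Hotel,95.0,PRT\n"
)

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row

# _make() builds a Booking from any iterable with the right number of items
bookings = [Booking._make(row) for row in reader]

# csv yields strings, so numeric fields still need converting
bookings = [b._replace(adr=float(b.adr)) for b in bookings]
print(bookings[0])
```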


In [None]:
from collections import OrderedDict, deque

# ============================================================
# 4e. Other Useful collections
# ============================================================

# --- OrderedDict ---
# Dict that remembers insertion order.
# Note: regular dicts preserve order since Python 3.7+,
# but OrderedDict has extra features like move_to_end().
od = OrderedDict()
od["step_1"] = "extract"
od["step_2"] = "transform"
od["step_3"] = "load"

# move_to_end(key, last=True/False) reorders a key
od.move_to_end("step_1", last=True)  # move step_1 to the end
print(f"OrderedDict: {list(od.items())}")

# --- deque (double-ended queue) ---
# O(1) append/pop at BOTH ends; a list is O(1) only at the right end
# (list.insert(0, x) and list.pop(0) are O(n)).
# Great for: sliding windows, recent-N caches, BFS algorithms.

# deque(maxlen=N) automatically drops the oldest item when full
recent_errors = deque(maxlen=3)  # keep only last 3 errors

recent_errors.append("Error at row 100")   # append() adds to the right
recent_errors.append("Error at row 250")
recent_errors.append("Error at row 500")
recent_errors.append("Error at row 800")   # oldest (row 100) is dropped!

print(f"\nRecent errors (max 3): {list(recent_errors)}")

# appendleft() adds to the left (front); on a full deque the RIGHTMOST item (row 800) is dropped
recent_errors.appendleft("URGENT: row 1")
print(f"After appendleft    : {list(recent_errors)}")
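
The sliding-window use case mentioned above can be sketched as a moving average. `moving_average` is a hypothetical helper and the numbers are made up.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(values: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` values once the window is full."""
    recent = deque(maxlen=window)   # the oldest value falls out automatically
    for value in values:
        recent.append(value)
        if len(recent) == window:
            yield sum(recent) / window


daily_adr = [100.0, 110.0, 120.0, 90.0, 130.0]   # illustrative numbers
averages = list(moving_average(daily_adr, window=3))
print(averages)
```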

---
## 5. glob — Pattern-Based File Discovery

The `glob` module finds files matching shell-style wildcard patterns.  
While `Path.glob()` from `pathlib` is preferred for new code (see notebook 03),  
the standalone `glob` module is still widely used in legacy code and scripts.

In [14]:
import glob as glob_module  # alias keeps the module distinct from any variable or method named glob

# ============================================================
# 5a. Basic Patterns
# ============================================================
# glob.glob(pattern) returns a list of matching file paths as strings.
#
# Wildcard patterns:
#   *     → matches any run of characters within ONE path segment
#           (it does not cross '/' and skips names starting with '.')
#   ?     → matches any single character
#   [abc] → matches one character from the set
#   **    → matches any depth of directories (requires recursive=True)

# Find all CSV files in data/
csvs = sorted(glob_module.glob("data/*.csv"))
print("CSV files in data/:")
for f in csvs:
    print(f"  {f}")

# Find all JSON and JSONL files
json_files = sorted(glob_module.glob("data/**/*.json*", recursive=True))
print(f"\nJSON/JSONL files (recursive):")
for f in json_files:
    print(f"  {f}")

CSV files in data/:
  data/bookings.csv
  data/hotel_booking.csv

JSON/JSONL files (recursive):
  data/booking_summary.json
  data/bookings.json


In [15]:
# ============================================================
# 5b. iglob — Lazy Iterator (memory efficient)
# ============================================================
# glob.glob() returns a full list (loads all matches into memory).
# glob.iglob() returns an ITERATOR — yields one match at a time.
# Preferred when you might have thousands of matches.

print("All files in data/ (iglob iterator):")
for filepath in glob_module.iglob("data/**/*", recursive=True):
    print(f"  {filepath}")

All files in data/ (iglob iterator):
  data/booking_summary.json
  data/bookings.json
  data/hotel_booking.csv
  data/bookings.csv


In [None]:
# ============================================================
# 5c. Practical — discover and summarize data files
# ============================================================
import os
from pathlib import Path
from collections import Counter


def summarize_data_directory(directory: str) -> None:
    """
    Scan a directory recursively and summarize files by extension.
    Combines pathlib's rglob (discovery and file info) with Counter (aggregation).
    """
    # Use pathlib's rglob for recursive discovery
    data_dir = Path(directory)
    all_files = list(data_dir.rglob("*"))  # rglob('*') yields every file AND directory below data_dir

    # Filter to files only (not directories)
    files_only = [f for f in all_files if f.is_file()]

    # Count by extension
    ext_counts = Counter(f.suffix.lower() for f in files_only)

    # Sum sizes by extension
    ext_sizes: dict[str, int] = {}
    for f in files_only:
        ext = f.suffix.lower()
        ext_sizes[ext] = ext_sizes.get(ext, 0) + f.stat().st_size

    print(f"Directory: {data_dir.resolve()}")
    print(f"Total files: {len(files_only)}\n")
    print(f"{'Extension':>10} | {'Count':>5} | {'Size':>10}")
    print(f"{'-'*10:>10} | {'-'*5:>5} | {'-'*10:>10}")

    for ext, count in ext_counts.most_common():
        size_kb = ext_sizes[ext] / 1024
        unit = "KB" if size_kb < 1024 else "MB"
        size_val = size_kb if size_kb < 1024 else size_kb / 1024
        print(f"{ext:>10} | {count:>5} | {size_val:>7.1f} {unit}")


summarize_data_directory("data")

In [None]:
# ============================================================
# 5d. Comparing glob approaches
# ============================================================
# Summary of the 3 ways to find files by pattern:
#
# | Approach            | Returns   | Recursive       | Import           |
# |---------------------|-----------|-----------------|------------------|
# | glob.glob()         | list[str] | recursive=True  | import glob      |
# | glob.iglob()        | iterator  | recursive=True  | import glob      |
# | Path.glob() / rglob | generator | rglob() = **/*  | from pathlib ... |
#
# Recommendation: use Path.rglob() for new code. It yields Path objects
# with all the useful methods (.stat(), .suffix, .name, etc.).

from pathlib import Path

# pathlib equivalent of glob.glob('data/**/*.csv', recursive=True)
csv_files = list(Path("data").rglob("*.csv"))
print("CSV files (pathlib):")
for f in csv_files:
    size_kb = f.stat().st_size / 1024
    print(f"  {f.name:>25} — {size_kb:.1f} KB")

---
## Key Takeaways

| Module | Key Functions | DE Use Case |
|--------|--------------|-------------|
| **datetime** | `strptime`, `strftime`, `timedelta`, `timezone` | Parse dates from CSVs/APIs, compute durations, store in UTC |
| **os** | `getenv`, `environ`, `getcwd`, `cpu_count` | Config via env vars, detect runtime environment |
| **sys** | `version`, `platform`, `getsizeof`, `path` | Debug imports, check memory, log runtime info |
| **logging** | `basicConfig`, `getLogger`, `FileHandler` | Replace `print()` with structured, leveled log output |
| **collections** | `Counter`, `defaultdict`, `namedtuple`, `deque` | Count, group, and structure data efficiently |
| **glob** | `glob()`, `iglob()`, pathlib `rglob()` | Discover data files by pattern across directories |

---

## Next Steps

Continue with **`04_pandas_deep_dive.ipynb`** — NumPy, Pandas transformations, and Parquet format.