# Lesson 1 - Exercise 2 (Python): Connect to Cassandra and Preview Events

## Goal

Connect to a Cassandra cluster that stores `raw_events` and run a quick
schema and data check, so you know how to normalize the data into `stg.events_raw` later.

## What to build

A Jupyter notebook that:

1.  Connects using `cassandra-driver`.
2.  Reads a bounded number of rows from a keyspace/table.
3.  Computes simple metrics: row counts by `event_type` and by `mode`, and
    the share of rows with `from_station_id`/`to_station_id` populated.
4.  Writes `/tmp/events_preview.jsonl` and prints a compact summary.

------------------------------------------------------------------------

### Step 1: Imports and Configuration

Load required libraries and set up connection parameters from environment variables.

In [None]:
import os
import json
import pandas as pd
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# --- Configuration (from environment variables or defaults) ---
CASSANDRA_HOSTS = os.environ.get("CASSANDRA_HOSTS", "127.0.0.1").split(",")
CASSANDRA_PORT = int(os.environ.get("CASSANDRA_PORT", "9042"))
CASSANDRA_KEYSPACE = os.environ.get("CASSANDRA_KEYSPACE", "transit")
CASSANDRA_TABLE = os.environ.get("CASSANDRA_TABLE", "raw_events")

LIMIT = 200
OUT_FILE = "/tmp/events_preview.jsonl"

print(f"Cassandra Hosts: {CASSANDRA_HOSTS}")
print(f"Port: {CASSANDRA_PORT}")
print(f"Keyspace: {CASSANDRA_KEYSPACE}")
print(f"Table: {CASSANDRA_TABLE}")
print(f"Row limit: {LIMIT}")
print(f"Output file: {OUT_FILE}")
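
The env-driven configuration above can also be wrapped in small helpers, which makes the parsing easy to check without a running cluster. The following is a minimal sketch; `parse_hosts` and `load_config` are hypothetical helper names, not part of the exercise, and the port range check is an added assumption.

```python
import os


def parse_hosts(raw):
    """Split a comma-separated host string, dropping blanks and whitespace."""
    return [h.strip() for h in raw.split(",") if h.strip()]


def load_config(env=os.environ):
    """Read the same variables as the cell above, with the same defaults."""
    port = int(env.get("CASSANDRA_PORT", "9042"))
    if not 1 <= port <= 65535:
        raise ValueError(f"CASSANDRA_PORT out of range: {port}")
    return {
        "hosts": parse_hosts(env.get("CASSANDRA_HOSTS", "127.0.0.1")),
        "port": port,
        "keyspace": env.get("CASSANDRA_KEYSPACE", "transit"),
        "table": env.get("CASSANDRA_TABLE", "raw_events"),
    }


# Passing a plain dict instead of os.environ keeps the example reproducible.
config = load_config({"CASSANDRA_HOSTS": " 10.0.0.1, ,10.0.0.2 "})
print(config)
```

Accepting the environment as a parameter is a design choice for testability; in the notebook you would simply call `load_config()`.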

### Step 2: Connect to Cassandra

Establish a connection to the Cassandra cluster and set the keyspace.

In [None]:
# Clean up host list (remove empty strings, strip whitespace)
hosts = [h.strip() for h in CASSANDRA_HOSTS if h.strip()]

cluster = Cluster(hosts, port=CASSANDRA_PORT)
session = cluster.connect(CASSANDRA_KEYSPACE)

print("Successfully connected to Cassandra!")
print(f"Cluster name: {cluster.metadata.cluster_name}")

### Step 3: Fetch Sample Rows

Query the events table to get a bounded sample of rows.

**TODO**: Write a CQL query that selects all columns from the table with a `LIMIT`. Execute it and convert the results to a list of dictionaries. Hint: with the driver's default row factory each row is a named tuple, so `dict(row._asdict())` converts it to a plain dictionary.

In [None]:
# TODO: Build the CQL query string using CASSANDRA_TABLE and LIMIT
query = ""
stmt = SimpleStatement(query)

# TODO: Execute the query using session.execute()
rows = None

# TODO: Convert rows to a list of dictionaries
# Hint: Use dict(row._asdict()) for each row
records = []

print(f"Fetched {len(records)} rows from {CASSANDRA_KEYSPACE}.{CASSANDRA_TABLE}")
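
One possible shape for the query-building and row-conversion parts of this step is sketched below. CQL does not accept a table name as a bound parameter, so the name is interpolated into the string after a basic identifier check. `build_preview_query` and `rows_to_records` are hypothetical helpers, and the `Row` named tuple with its values is made-up stand-in data that only mimics what the driver returns.

```python
import re
from collections import namedtuple

# Unquoted CQL identifiers: a letter followed by letters, digits, or underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def build_preview_query(table, limit):
    """Build a bounded SELECT, rejecting names that are not plain identifiers."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"Unsafe table name: {table!r}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError(f"LIMIT must be a positive integer, got {limit!r}")
    return f"SELECT * FROM {table} LIMIT {limit}"


def rows_to_records(rows):
    """Convert named-tuple rows into plain dictionaries."""
    return [dict(row._asdict()) for row in rows]


# Stand-in rows (illustrative values only, not real data).
Row = namedtuple("Row", ["event_id", "event_type"])
fake_rows = [Row("e1", "tap_on"), Row("e2", "tap_off")]

query = build_preview_query("raw_events", 200)
records = rows_to_records(fake_rows)
print(query)
print(records)
```

In the notebook, the string from `build_preview_query(CASSANDRA_TABLE, LIMIT)` would be wrapped in `SimpleStatement` and passed to `session.execute()`, and the returned rows handed to `rows_to_records`.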

### Step 4: Convert to DataFrame and Preview

Load the records into a pandas DataFrame for easier analysis.

In [None]:
if not records:
    print("WARNING: No rows fetched. Check keyspace/table name or data loader.")
    df = pd.DataFrame()
else:
    df = pd.DataFrame(records)
    print(f"DataFrame shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print("\nSample rows:")
    display(df.head(10))

### Step 5: Profile Event Types

Count the distribution of `event_type` values to understand what kinds of events are captured.

**TODO**: Use pandas `value_counts()` to get the top 5 most common event types. Display the results as a DataFrame with a "count" column.

In [None]:
if "event_type" in df.columns:
    print("Top 5 Event Types:")
    print("-" * 40)
    # TODO: Get value counts for top 5 event types and display as a DataFrame
    event_counts = None
    display(event_counts)
else:
    print("Column 'event_type' not found in data")
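
As a rough illustration of the `value_counts()` pattern, the sketch below counts values in a small made-up DataFrame and returns the top entries as a DataFrame with a `count` column. `top_counts` is a hypothetical helper and the sample values are invented for the example.

```python
import pandas as pd


def top_counts(df, column, n=5):
    """Return the n most frequent values of a column as a one-column DataFrame."""
    # value_counts() sorts by frequency (descending) and skips nulls by default.
    return df[column].value_counts().head(n).to_frame("count")


# Made-up sample data, not taken from the real table.
sample = pd.DataFrame(
    {"event_type": ["tap_on", "tap_on", "tap_on", "tap_off", "tap_off", "transfer"]}
)

event_counts = top_counts(sample, "event_type")
print(event_counts)
```

Because nulls are skipped by default, passing `dropna=False` to `value_counts()` is worth considering if you also want to see how many rows have no value at all.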

### Step 6: Profile Transport Modes

Count the distribution of `mode` values (bus, train, ferry, etc.).

**TODO**: Use the same approach as Step 5 to get the top 5 transport modes.

In [None]:
if "mode" in df.columns:
    print("Top 5 Transport Modes:")
    print("-" * 40)
    # TODO: Get value counts for top 5 modes and display as a DataFrame
    mode_counts = None
    display(mode_counts)
else:
    print("Column 'mode' not found in data")

### Step 7: Check Station ID Presence

Calculate what percentage of rows have station IDs populated (important for trip mapping).

**TODO**: For each station column, calculate the percentage of non-null values. Hint: Use `.notna().mean() * 100` to get the percentage.

In [None]:
print("Station ID Presence:")
print("-" * 40)

station_cols = ["from_station_id", "to_station_id"]
for col in station_cols:
    if col in df.columns:
        # TODO: Calculate the percentage of non-null values (rounded to 2 decimal places)
        pct_present = None
        print(f"{col}: {pct_present}% populated")
    else:
        print(f"{col}: column not found")
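
The `.notna().mean() * 100` hint works because `notna()` yields booleans and their mean is the fraction of `True` values. A minimal sketch on invented data follows; `pct_populated` is a hypothetical helper name.

```python
import pandas as pd


def pct_populated(df, column):
    """Percentage of rows where the column is not null, rounded to 2 decimals."""
    return round(float(df[column].notna().mean() * 100), 2)


# Made-up sample: 3 of 4 origins and 1 of 4 destinations are present.
sample = pd.DataFrame(
    {
        "from_station_id": ["S1", None, "S3", "S4"],
        "to_station_id": [None, None, "S9", None],
    }
)

for col in ["from_station_id", "to_station_id"]:
    print(f"{col}: {pct_populated(sample, col)}% populated")
```

Note that this treats empty strings as present; if the source uses `""` for missing stations, that would need a separate check.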

### Step 8: Check Timestamp Quality

Verify the `event_ts` column has valid timestamps with minimal nulls.

**TODO**: Calculate the null percentage for `event_ts` and find the min/max timestamp range.

In [None]:
if "event_ts" in df.columns:
    # TODO: Calculate null percentage (rounded to 2 decimal places)
    null_pct = None
    print(f"Timestamp (event_ts) null %: {null_pct}%")
    
    # TODO: Get valid (non-null) timestamps and print min/max range
    valid_ts = None
    if valid_ts is not None and len(valid_ts) > 0:
        print(f"Timestamp range: {valid_ts.min()} to {valid_ts.max()}")
else:
    print("Column 'event_ts' not found in data")
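
The timestamp check can be sketched the same way on a small invented Series. `timestamp_quality` is a hypothetical helper, and the sketch assumes `event_ts` values arrive as Python `datetime` objects or nulls.

```python
from datetime import datetime

import pandas as pd


def timestamp_quality(series):
    """Return (null percentage, earliest, latest) for a timestamp Series."""
    null_pct = round(float(series.isna().mean() * 100), 2)
    valid = series.dropna()
    if valid.empty:
        return null_pct, None, None
    return null_pct, valid.min(), valid.max()


# Made-up sample with one missing timestamp out of four.
sample = pd.Series(
    [
        datetime(2024, 1, 1, 8, 0),
        None,
        datetime(2024, 1, 3, 17, 30),
        datetime(2024, 1, 2, 12, 0),
    ],
    name="event_ts",
)

null_pct, ts_min, ts_max = timestamp_quality(sample)
print(f"null %: {null_pct}, range: {ts_min} to {ts_max}")
```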

### Step 9: Review All Column Data Types

Check data types for all columns to plan the staging table DDL.

**TODO**: Loop through all columns in the DataFrame and print each column's name, data type, and null count.

In [None]:
if not df.empty:
    print("All Column Data Types:")
    print("-" * 40)
    # TODO: Loop through df.dtypes and print column info with null counts
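
One way to gather the per-column information is to iterate over `df.dtypes` and pair each entry with its null count. The sketch below uses invented data, and `column_profile` is a hypothetical helper.

```python
import pandas as pd


def column_profile(df):
    """Return (column name, dtype as string, null count) for every column."""
    return [
        (name, str(dtype), int(df[name].isna().sum()))
        for name, dtype in df.dtypes.items()
    ]


# Made-up sample data.
sample = pd.DataFrame({"event_id": [1, 2, 3], "mode": ["bus", None, "train"]})

profile = column_profile(sample)
for name, dtype, nulls in profile:
    print(f"{name:<20} {dtype:<12} nulls={nulls}")
```

The dtypes shown are pandas types, not Cassandra types, so they are only a starting point for the staging DDL.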


### Step 10: Write Sample to JSONL

Save the sample events to a JSON Lines file for reference.

**TODO**: Write each record to a JSONL file (one JSON object per line). Use `json.dumps(record, default=str)` to handle non-serializable types like timestamps. Then verify by counting lines in the file.

In [None]:
# TODO: Write records to JSONL file
with open(OUT_FILE, "w", encoding="utf-8") as f:
    pass  # Replace with your code

print(f"Wrote {len(records)} rows to {OUT_FILE}")

# TODO: Verify by counting lines in the file
line_count = None
print(f"Verified: {line_count} lines in output file")
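
A minimal sketch of the write-then-verify pattern is shown below, using only the standard library. `write_jsonl` and `count_lines` are hypothetical helpers, the records are invented, and the demo writes to a separate file in the system temp directory rather than to `OUT_FILE`.

```python
import json
import os
import tempfile
from datetime import datetime


def write_jsonl(records, path):
    """Write one JSON object per line; default=str handles datetimes and similar."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, default=str) + "\n")
    return len(records)


def count_lines(path):
    """Count lines by streaming the file instead of loading it whole."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Made-up records, including a timestamp that json cannot serialize natively.
demo_records = [
    {"event_id": "e1", "event_ts": datetime(2024, 1, 1, 8, 0)},
    {"event_id": "e2", "event_ts": None},
]
demo_path = os.path.join(tempfile.gettempdir(), "events_preview_demo.jsonl")

written = write_jsonl(demo_records, demo_path)
line_count = count_lines(demo_path)
print(f"Wrote {written} rows, verified {line_count} lines")
```

With `default=str`, timestamps are stored as strings, so a downstream reader has to parse them back into datetimes.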

### Step 11: Clean Up

Close the Cassandra session and cluster connections.

In [None]:
session.shutdown()
cluster.shutdown()
print("Cassandra connection closed.")
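
If an earlier cell raises, this clean-up cell may never run. Outside a notebook, one option is to wrap the connect and shutdown calls in a context manager so the shutdown always happens. The sketch below is an assumption-laden illustration: `open_session` is a hypothetical helper, and `FakeCluster`/`FakeSession` are stand-ins that only mimic the `connect()` and `shutdown()` calls used in this exercise so the example runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def open_session(cluster, keyspace):
    """Yield a session and always shut down the session and cluster afterwards."""
    session = None
    try:
        session = cluster.connect(keyspace)
        yield session
    finally:
        if session is not None:
            session.shutdown()
        cluster.shutdown()


class FakeSession:
    """Stand-in for a driver session (illustration only)."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


class FakeCluster:
    """Stand-in for a driver cluster (illustration only)."""

    def __init__(self):
        self.closed = False
        self.session = FakeSession()

    def connect(self, keyspace):
        return self.session

    def shutdown(self):
        self.closed = True


fake = FakeCluster()
try:
    with open_session(fake, "transit") as session:
        raise RuntimeError("simulated failure while profiling")
except RuntimeError:
    pass

print(f"cluster closed: {fake.closed}, session closed: {fake.session.closed}")
```

With the real driver, the `cluster` argument would be the `Cluster(hosts, port=CASSANDRA_PORT)` object created in Step 2.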

------------------------------------------------------------------------

## Summary

In this exercise, you:

1. Connected to Cassandra using the `cassandra-driver` Python package
2. Fetched a bounded sample of event rows
3. Profiled the `event_type` distribution to understand event categories
4. Profiled transport mode distribution
5. Checked station ID presence (important for trip mapping)
6. Verified timestamp quality
7. Reviewed all column data types for staging planning
8. Exported a preview JSONL file for reference

These patterns (env-driven config, bounded reads, quick stats, JSONL outputs) are the same ones you'll reuse when building **ETL pipelines**.