# Lesson 1 — Exercise 2 (Python): Connect to Cassandra and Preview Events

## Goal

Connect to a Cassandra cluster that stores `raw_events` and run a quick
schema and data check so you know how to normalize the rows into
`stg.events_raw` later.

## What to build

A Jupyter notebook that:

1.  Connects using `cassandra-driver`.
2.  Reads a bounded number of rows from a keyspace/table.
3.  Computes simple metrics: counts by `event_type` and `mode`, and the
    presence of `from_station_id`/`to_station_id`.
4.  Writes `/tmp/events_preview.jsonl` and prints a compact summary.

### Acceptance criteria

-   Reads `CASSANDRA_HOSTS` (comma-separated), `CASSANDRA_KEYSPACE`, and
    `CASSANDRA_TABLE` from the environment.
-   Pulls at least 200 rows (or all rows, if fewer are available).
-   Prints top 5 `event_type` and `mode` values, and % rows with station
    IDs.

------------------------------------------------------------------------

## Lesson 1 Exercise 2: Connect to Cassandra and Preview Events Solution

Populate the Cassandra database by running the provided script:

In [1]:
!python populate-cassandra.py

Cassandra is up!
Ensured keyspace.table: transit.raw_events
Inserted 500 …
Inserted 1000 …
Inserted 1500 …
Inserted 2000 …
Inserted 2500 …
Inserted 2500 rows into transit.raw_events


### Step 1: Imports and Configuration

Load required libraries and set up connection parameters from environment variables.

In [2]:
import os
import json
import pandas as pd
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# --- Configuration (from environment variables or defaults) ---
CASSANDRA_HOSTS = os.environ.get("CASSANDRA_HOSTS", "127.0.0.1").split(",")
CASSANDRA_PORT = int(os.environ.get("CASSANDRA_PORT", "9042"))
CASSANDRA_KEYSPACE = os.environ.get("CASSANDRA_KEYSPACE", "transit")
CASSANDRA_TABLE = os.environ.get("CASSANDRA_TABLE", "raw_events")

LIMIT = 200
OUT_FILE = "/tmp/events_preview.jsonl"

print(f"Cassandra Hosts: {CASSANDRA_HOSTS}")
print(f"Port: {CASSANDRA_PORT}")
print(f"Keyspace: {CASSANDRA_KEYSPACE}")
print(f"Table: {CASSANDRA_TABLE}")
print(f"Row limit: {LIMIT}")
print(f"Output file: {OUT_FILE}")

Cassandra Hosts: ['127.0.0.1']
Port: 9042
Keyspace: transit
Table: raw_events
Row limit: 200
Output file: /tmp/events_preview.jsonl
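
The same settings can be gathered by a small helper so that parsing and validation happen in one place. The sketch below is only an illustration and is not part of the original notebook: the `load_config` name and the layout of the returned dictionary are assumptions, while the environment variable names and defaults are the ones used in this step.

```python
import os


def load_config(env=None):
    """Read Cassandra settings from an environment-like mapping.

    `load_config` is a hypothetical helper for illustration; the variable
    names and defaults mirror the configuration cell in this step.
    """
    env = os.environ if env is None else env

    # Split the comma-separated host list, dropping blanks and stray spaces.
    raw_hosts = env.get("CASSANDRA_HOSTS", "127.0.0.1")
    hosts = [h.strip() for h in raw_hosts.split(",") if h.strip()]
    if not hosts:
        raise ValueError("CASSANDRA_HOSTS did not contain any host names")

    # Fail early with a clear message if the port is not a valid TCP port.
    port = int(env.get("CASSANDRA_PORT", "9042"))
    if not 1 <= port <= 65535:
        raise ValueError(f"CASSANDRA_PORT out of range: {port}")

    return {
        "hosts": hosts,
        "port": port,
        "keyspace": env.get("CASSANDRA_KEYSPACE", "transit"),
        "table": env.get("CASSANDRA_TABLE", "raw_events"),
    }


config = load_config({"CASSANDRA_HOSTS": "10.0.0.1, 10.0.0.2,"})
print(config["hosts"])
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.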


### Step 2: Connect to Cassandra

Establish a connection to the Cassandra cluster and set the keyspace.

In [3]:
# Clean up host list (remove empty strings, strip whitespace)
hosts = [h.strip() for h in CASSANDRA_HOSTS if h.strip()]

cluster = Cluster(hosts, port=CASSANDRA_PORT)
session = cluster.connect(CASSANDRA_KEYSPACE)

print("Successfully connected to Cassandra!")
print(f"Cluster name: {cluster.metadata.cluster_name}")

Successfully connected to Cassandra!
Cluster name: Test Cluster


### Step 3: Fetch Sample Rows

Query the events table to get a bounded sample of rows.

In [4]:
query = f"SELECT * FROM {CASSANDRA_TABLE} LIMIT {LIMIT}"
stmt = SimpleStatement(query)

rows = session.execute(stmt)

# Convert to list of dictionaries
records = [dict(row._asdict()) for row in rows]

print(f"Fetched {len(records)} rows from {CASSANDRA_KEYSPACE}.{CASSANDRA_TABLE}")

Fetched 200 rows from transit.raw_events
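
Because the table name comes from an environment variable and is interpolated directly into the CQL string, it may be worth validating it first. CQL bind markers stand in for values, not for table names, so interpolation is hard to avoid here. The sketch below shows one possible guard. `build_preview_query` is a hypothetical helper, not something the notebook or the driver provides, and the pattern only accepts unquoted identifiers.

```python
import re

# Unquoted CQL identifiers: a letter followed by letters, digits, or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def build_preview_query(table, limit):
    """Build a bounded SELECT after validating the interpolated parts.

    Hypothetical helper for illustration; not part of the notebook.
    """
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"Unsafe table name: {table!r}")
    limit = int(limit)
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {table} LIMIT {limit}"


print(build_preview_query("raw_events", 200))
# SELECT * FROM raw_events LIMIT 200
```

The returned string can be passed to `SimpleStatement` exactly as in the cell in this step.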


### Step 4: Convert to DataFrame and Preview

Load the records into a pandas DataFrame for easier analysis.

In [5]:
if not records:
    print("WARNING: No rows fetched. Check keyspace/table name or data loader.")
    df = pd.DataFrame()
else:
    df = pd.DataFrame(records)
    print(f"DataFrame shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print("\nSample rows:")
    display(df.head(10))

DataFrame shape: (200, 25)
Columns: ['event_id', 'city', 'country', 'device_type', 'door_open', 'dwell_seconds', 'event_ts', 'event_type', 'from_station_id', 'incident_code', 'inspection_outcome', 'inspector_id', 'latitude', 'load_factor', 'longitude', 'mode', 'os', 'passenger_delta', 'province', 'route_id', 'session_id', 'speed_kmh', 'to_station_id', 'validator_id', 'vehicle_id']

Sample rows:


|   | event_id | city | country | device_type | door_open | dwell_seconds | event_ts | event_type | from_station_id | incident_code | ... | mode | os | passenger_delta | province | route_id | session_id | speed_kmh | to_station_id | validator_id | vehicle_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E200962 | Vancouver | CA | onboard_sensor | False | 40 | 2025-04-19 12:56:10 | gps_ping | S007 | NONE | ... | bus | Android | 0 | BC | R058 | S7420484713 | 20.3114 | S009 |  | V0424 |
| 1 | E200302 | Vancouver | CA | vehicle_avl_unit | False | 40 | 2025-03-23 22:14:01 | gps_ping | S014 | NONE | ... | bus | Embedded | 0 | BC | R009 | S7542933036 | 21.820635 | S027 |  | V0151 |
| 2 | E200393 | Vancouver | CA | validator_gate | False | 24 | 2025-01-21 23:11:20 | door_close | S008 | NONE | ... | bus | iOS | 0 | BC | R013 | S1626288629 | 34.295534 | S012 |  | V0067 |
| 3 | E200379 | Vancouver | CA | cctv | False | 7 | 2024-06-11 05:25:18 | tap_on | S026 | MEDICAL | ... | bus | Android | 5 | BC | R025 | S3748584027 | 20.259986 | S002 | VAL01586 | V0781 |
| 4 | E201160 | Vancouver | CA | handheld_validator | False | 28 | 2024-05-09 14:05:40 | gps_ping | S025 | NONE | ... | skytrain | RTOS | 0 | BC | R076 | S6573640626 | 31.971919 | S029 |  | V0199 |
| 5 | E202487 | Vancouver | CA | validator_gate | True | 16 | 2024-05-30 22:14:20 | door_open | S021 | NONE | ... | bus | Android | 0 | BC | R075 | S1878303982 | 31.394312 | S019 |  | V0446 |
| 6 | E201514 | Vancouver | CA | vehicle_avl_unit | False | 24 | 2024-11-04 23:32:09 | gps_ping | S008 | NONE | ... | bus | RTOS | 0 | BC | R084 | S8916375602 | 39.298645 | S015 |  | V0674 |
| 7 | E201911 | Vancouver | CA | validator_gate | False | 27 | 2025-06-20 13:48:45 | fare_inspection | S016 | NONE | ... | seabus | Android | 0 | BC | R066 | S1914021258 | 29.375672 | S021 | VAL01977 | V0693 |
| 8 | E200420 | Vancouver | CA | cctv | False | 30 | 2025-04-16 21:39:10 | tap_on | S016 | NONE | ... | bus | Embedded | 4 | BC | R016 | S6050738248 | 16.113458 | S020 | VAL01529 | V0433 |
| 9 | E200790 | Vancouver | CA | validator_gate | False | 28 | 2024-11-07 05:00:49 | incident_report | S009 | NONE | ... | bus | Linux | 0 | BC | R022 | S6801281590 | 17.178918 | S007 |  | V0485 |


### Step 5: Profile Event Types

Count the distribution of `event_type` values to understand what kinds of events are captured.

In [6]:
if "event_type" in df.columns:
    print("Top 5 Event Types:")
    print("-" * 40)
    event_counts = df["event_type"].value_counts().head(5)
    display(event_counts.to_frame("count"))
else:
    print("Column 'event_type' not found in data")

Top 5 Event Types:
----------------------------------------


| event_type | count |
|---|---|
| gps_ping | 63 |
| tap_on | 47 |
| tap_off | 19 |
| departure | 15 |
| arrival | 12 |
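
When pandas is not available, or when a percentage share is wanted next to each count, the same top-N profile can be computed directly from the list of row dictionaries. The sketch below is a minimal standard-library alternative. `top_values` is a hypothetical helper name and the sample rows are made up for illustration. It works the same way for `mode` or any other column.

```python
from collections import Counter


def top_values(records, key, n=5):
    """Return (value, count, percent) tuples for the n most common values.

    Rows that lack the key are counted under None.
    """
    values = [r.get(key) for r in records]
    total = len(values)
    counts = Counter(values)
    return [
        (value, count, round(count / total * 100, 1))
        for value, count in counts.most_common(n)
    ]


# Made-up sample rows, used only to show the output shape.
sample = [
    {"event_type": "gps_ping"},
    {"event_type": "gps_ping"},
    {"event_type": "tap_on"},
    {"event_type": "tap_off"},
]
for value, count, pct in top_values(sample, "event_type"):
    print(f"{value:<12} {count:>4} {pct:>6}%")
```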


### Step 6: Profile Transport Modes

Count the distribution of `mode` values (in this sample: `bus`, `skytrain`, `seabus`, and `wce`).

In [7]:
if "mode" in df.columns:
    print("Top 5 Transport Modes:")
    print("-" * 40)
    mode_counts = df["mode"].value_counts().head(5)
    display(mode_counts.to_frame("count"))
else:
    print("Column 'mode' not found in data")

Top 5 Transport Modes:
----------------------------------------


| mode | count |
|---|---|
| bus | 102 |
| skytrain | 71 |
| seabus | 14 |
| wce | 13 |


### Step 7: Check Station ID Presence

Calculate what percentage of rows have station IDs populated (important for trip mapping).

In [8]:
print("Station ID Presence:")
print("-" * 40)

station_cols = ["from_station_id", "to_station_id"]
for col in station_cols:
    if col in df.columns:
        pct_present = round(df[col].notna().mean() * 100, 2)
        print(f"{col}: {pct_present}% populated")
    else:
        print(f"{col}: column not found")

Station ID Presence:
----------------------------------------
from_station_id: 100.0% populated
to_station_id: 100.0% populated
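
Note that `notna()` counts an empty string as present. Whether this dataset ever contains blank station IDs is not known from the sample, so the following is only a cautious sketch: it treats `None` and blank strings as missing and can also report the share of rows where both IDs are populated. The helper names `is_populated` and `pct_populated` and the sample rows are assumptions made for illustration.

```python
def is_populated(value):
    """Treat None and blank (or whitespace-only) strings as missing."""
    if value is None:
        return False
    if isinstance(value, str) and not value.strip():
        return False
    return True


def pct_populated(records, *keys):
    """Percent of records in which every listed key is populated."""
    if not records:
        return 0.0
    hits = sum(all(is_populated(r.get(k)) for k in keys) for r in records)
    return round(hits / len(records) * 100, 2)


# Made-up sample rows, used only to show the behavior.
sample = [
    {"from_station_id": "S007", "to_station_id": "S009"},
    {"from_station_id": "S014", "to_station_id": ""},
    {"from_station_id": None, "to_station_id": "S012"},
    {"from_station_id": "S026", "to_station_id": "S002"},
]
print(pct_populated(sample, "from_station_id"))                   # 75.0
print(pct_populated(sample, "from_station_id", "to_station_id"))  # 50.0
```

The combined figure is usually the more relevant one for trip mapping, since a trip needs both ends.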


### Step 8: Check Timestamp Quality

Verify the `event_ts` column has valid timestamps with minimal nulls.

In [9]:
if "event_ts" in df.columns:
    null_pct = round(df["event_ts"].isna().mean() * 100, 2)
    print(f"Timestamp (event_ts) null %: {null_pct}%")
    
    # Show timestamp range if available
    valid_ts = df["event_ts"].dropna()
    if len(valid_ts) > 0:
        print(f"Timestamp range: {valid_ts.min()} to {valid_ts.max()}")
else:
    print("Column 'event_ts' not found in data")

Timestamp (event_ts) null %: 0.0%
Timestamp range: 2024-01-01 06:50:30 to 2025-06-27 03:38:49
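
A null check and a min/max range do not catch timestamps that are syntactically valid but implausible, such as events dated in the future. The sketch below adds that check using only the standard library. `timestamp_report` is a hypothetical helper, the sample values are made up, and the code assumes naive `datetime` values throughout, since comparing naive and timezone-aware datetimes raises a `TypeError`.

```python
from datetime import datetime


def timestamp_report(values, now):
    """Summarize nulls, range, and future-dated values in a timestamp list."""
    present = [v for v in values if v is not None]
    total = len(values)
    return {
        "null_pct": round((total - len(present)) / total * 100, 2) if total else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "future_count": sum(1 for v in present if v > now),
    }


# Made-up sample values: one null and one future-dated timestamp.
sample = [
    datetime(2024, 1, 1, 6, 50, 30),
    None,
    datetime(2025, 6, 27, 3, 38, 49),
    datetime(2030, 1, 1),
]
report = timestamp_report(sample, now=datetime(2025, 7, 1))
print(report)
```

Passing `now` explicitly keeps the check reproducible when the notebook is re-run later.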


### Step 9: Review All Column Data Types

Check data types for all columns to plan the staging table DDL.

In [10]:
if not df.empty:
    print("All Column Data Types:")
    print("-" * 40)
    for col, dtype in df.dtypes.items():
        null_count = df[col].isna().sum()
        print(f"{col:25} {str(dtype):15} (nulls: {null_count})")

All Column Data Types:
----------------------------------------
event_id                  object          (nulls: 0)
city                      object          (nulls: 0)
country                   object          (nulls: 0)
device_type               object          (nulls: 0)
door_open                 bool            (nulls: 0)
dwell_seconds             int64           (nulls: 0)
event_ts                  datetime64[ns]  (nulls: 0)
event_type                object          (nulls: 0)
from_station_id           object          (nulls: 0)
incident_code             object          (nulls: 0)
inspection_outcome        object          (nulls: 192)
inspector_id              object          (nulls: 192)
latitude                  float64         (nulls: 0)
load_factor               float64         (nulls: 0)
longitude                 float64         (nulls: 0)
mode                      object          (nulls: 0)
os                        object          (nulls: 0)
passenger_delta           int64
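
The dtype listing can be turned into a first draft of the staging DDL. The exercise does not say which database hosts `stg.events_raw`, so the sketch below assumes PostgreSQL-style type names. Both the `DTYPE_TO_SQL` mapping and the `draft_ddl` helper are illustrative assumptions, and the generated statement is meant as a starting point to review by hand.

```python
# Assumed mapping from pandas dtype names to PostgreSQL-style column types.
# The target database is not specified in this exercise; adjust as needed.
DTYPE_TO_SQL = {
    "object": "TEXT",
    "bool": "BOOLEAN",
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "datetime64[ns]": "TIMESTAMP",
}


def draft_ddl(table_name, dtypes):
    """Draft a CREATE TABLE statement from {column: dtype_name} pairs."""
    lines = []
    for column, dtype_name in dtypes.items():
        # Unknown dtypes fall back to TEXT so the draft is always complete.
        sql_type = DTYPE_TO_SQL.get(str(dtype_name), "TEXT")
        lines.append(f"    {column} {sql_type}")
    body = ",\n".join(lines)
    return f"CREATE TABLE {table_name} (\n{body}\n);"


# A few columns and dtypes taken from the listing in this step.
ddl = draft_ddl("stg.events_raw", {
    "event_id": "object",
    "door_open": "bool",
    "dwell_seconds": "int64",
    "event_ts": "datetime64[ns]",
    "latitude": "float64",
})
print(ddl)
```

With the DataFrame from this notebook, the mapping could be built with `df.dtypes.astype(str).to_dict()`.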

### Step 10: Write Sample to JSONL

Save the sample events to a JSON Lines file for reference.

In [11]:
with open(OUT_FILE, "w", encoding="utf-8") as f:
    for record in records:
        # Convert non-serializable types to strings
        f.write(json.dumps(record, default=str) + "\n")

print(f"Wrote {len(records)} rows to {OUT_FILE}")

# Verify the file
with open(OUT_FILE, "r", encoding="utf-8") as f:
    line_count = sum(1 for _ in f)
print(f"Verified: {line_count} lines in output file")

Wrote 200 rows to /tmp/events_preview.jsonl
Verified: 200 lines in output file
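
Counting lines confirms the file length but not that each line is valid JSON. The sketch below parses every line back and reports the line number of any bad record. `read_jsonl` is a hypothetical helper, and the example round-trips a one-row sample through a temporary file so it does not touch `/tmp/events_preview.jsonl`.

```python
import json
import os
import tempfile
from datetime import datetime


def read_jsonl(path):
    """Parse a JSON Lines file, reporting the line number of any bad record."""
    parsed = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                parsed.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"Invalid JSON on line {line_no}: {exc}") from exc
    return parsed


# Round-trip a tiny sample through a temporary file.
sample = [{"event_id": "E200962", "event_ts": datetime(2025, 4, 19, 12, 56, 10)}]
path = os.path.join(tempfile.mkdtemp(), "events_preview.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record, default=str) + "\n")

print(read_jsonl(path))
# [{'event_id': 'E200962', 'event_ts': '2025-04-19 12:56:10'}]
```

Because `default=str` serializes datetimes as plain strings, they come back as `str` and need to be parsed again by whatever reads the file.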


### Step 11: Clean Up

Close the Cassandra session and cluster connections.

In [12]:
session.shutdown()
cluster.shutdown()
print("Cassandra connection closed.")

Cassandra connection closed.


------------------------------------------------------------------------

## Summary

In this exercise, you:

1. Connected to Cassandra using the `cassandra-driver` Python package
2. Fetched a bounded sample of event rows
3. Profiled the `event_type` distribution to understand event categories
4. Profiled transport mode distribution
5. Checked station ID presence (important for trip mapping)
6. Verified timestamp quality
7. Reviewed all column data types for staging planning
8. Exported a preview JSONL file for reference

These patterns (env-driven config, bounded reads, quick stats, JSONL outputs) are the same ones you'll reuse when building **ETL pipelines**.