# Build a Streaming Data Pipeline in Python: Dynamic Tables

## Objective
Build a real-time analytics pipeline using Snowflake Dynamic Tables to transform raw streaming ski resort data into actionable insights.

## What You'll Learn
- Create and chain Dynamic Tables for incremental data processing.
- Build multi-level aggregation hierarchies (hourly → daily → weekly).
- Implement real-time operational dashboards.
- Apply best practices for streaming data transformations.

## Architecture Overview
The data processing follows a layered approach:
1.  **Raw Streaming Data**: The initial source of information.
2.  **Hourly Aggregations**: First level of transformation, providing granular summaries.
3.  **Daily Summaries**: Aggregation of hourly data for daily insights.
4.  **Weekly Reports**: Highest level of aggregation for trend analysis.

Operational views and KPIs are derived from these layers for real-time analytics.
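
As a rough illustration of the layering, the sketch below rolls additive counts up from hourly to daily to weekly buckets in plain Python. The sample timestamps are made up and the week is assumed to start on Monday; it mirrors the idea behind the pipeline, not how Snowflake executes it.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical lift-ride timestamps (not taken from the lab's data).
events = [
    datetime(2024, 1, 1, 9, 5),
    datetime(2024, 1, 1, 9, 40),
    datetime(2024, 1, 1, 10, 15),
    datetime(2024, 1, 3, 14, 0),
    datetime(2024, 1, 8, 9, 0),
]

# Level 1: count rides per hour by truncating each timestamp to the hour.
hourly = Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in events)

# Level 2: roll hourly counts up to days (counts are additive, so SUM is safe).
daily = Counter()
for hour_start, ride_count in hourly.items():
    daily[hour_start.date()] += ride_count

# Level 3: roll daily counts up to weeks, assuming weeks start on Monday.
weekly = Counter()
for day, ride_count in daily.items():
    week_start = day - timedelta(days=day.weekday())
    weekly[week_start] += ride_count

for week_start, ride_count in sorted(weekly.items()):
    print(week_start, ride_count)
```

Each level reads only from the level directly beneath it, which is the same shape the Dynamic Tables in this lab follow.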

## Lab Setup

### Start Streaming Data
1. Navigate to your GitHub Codespace.
2. Run `docker-compose up` to begin streaming.
3. Follow instructions in `README.md`.
4. Keep the streamer running throughout the lab (stop with Ctrl+C).

### Initialize Environment

In [None]:
USE ROLE STREAMING_INGEST;
USE SCHEMA STREAMING_INGEST.STREAMING_INGEST;
USE WAREHOUSE STREAMING_INGEST;

## Explore Raw Data

Examine the streaming data structure from three core tables:

In [None]:
-- Lift usage events (core activity data)
SELECT 
    RESORT,
    LIFT,
    RIDE_TIME,
    RFID
FROM LIFT_RIDE 
ORDER BY RIDE_TIME DESC
LIMIT 10;

In [None]:
-- Day ticket purchases
SELECT 
    RESORT,
    PURCHASE_TIME,
    PRICE_USD,
    DAYS,
    NAME
FROM RESORT_TICKET 
ORDER BY PURCHASE_TIME DESC
LIMIT 10;

In [None]:
-- Season pass purchases
SELECT 
    PURCHASE_TIME,
    PRICE_USD,
    NAME,
    EXPIRATION_TIME
FROM SEASON_PASS 
ORDER BY PURCHASE_TIME DESC
LIMIT 10;

## Part 1: Basic Transformations

These foundational Dynamic Tables provide quick insights, serve as building blocks for more complex analytics, and feed real-time dashboards. Each uses a 10-minute target lag, so Snowflake aims to keep results no more than about 10 minutes behind the source tables.

In [None]:
-- Foundation table: Daily lift usage patterns
CREATE OR REPLACE DYNAMIC TABLE LIFT_RIDES_BY_DAY 
TARGET_LAG='10 minutes' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE_TRUNC(day, RIDE_TIME) as RIDE_DATE,
    RESORT, 
    LIFT, 
    COUNT(*) as RIDE_COUNT
FROM LIFT_RIDE 
GROUP BY all;

In [None]:
-- Real-time operational view: Today's busiest lifts
CREATE OR REPLACE DYNAMIC TABLE BUSIEST_LIFTS_TODAY 
TARGET_LAG='10 minutes' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    RESORT, 
    LIFT, 
    RIDE_COUNT 
FROM LIFT_RIDES_BY_DAY 
WHERE RIDE_DATE = CURRENT_DATE() 
ORDER BY RIDE_COUNT DESC 
LIMIT 10;

In [None]:
-- Capacity management: Unique visitors per resort/day
CREATE OR REPLACE DYNAMIC TABLE RESORT_VISITORS_BY_DAY 
TARGET_LAG='10 minutes' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE_TRUNC(day, RIDE_TIME) as RIDE_DATE, 
    RESORT, 
    COUNT(DISTINCT RFID) as VISITOR_COUNT
FROM LIFT_RIDE 
GROUP BY all;

In [None]:
-- Financial tracking: Daily ticket revenue
CREATE OR REPLACE DYNAMIC TABLE RESORT_REVENUE_BY_DAY 
TARGET_LAG='10 minutes' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE_TRUNC(day, PURCHASE_TIME) as PURCHASE_DATE, 
    RESORT, 
    SUM(PRICE_USD) as REVENUE
FROM RESORT_TICKET 
GROUP BY all;

In [None]:
-- Long-term revenue: Season pass sales
CREATE OR REPLACE DYNAMIC TABLE SEASON_PASS_REVENUE_BY_MONTH 
TARGET_LAG='10 minutes' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE_TRUNC(month, PURCHASE_TIME) as PURCHASE_MONTH, 
    SUM(PRICE_USD) as REVENUE
FROM SEASON_PASS 
GROUP BY all;

### Verify Your Basic Transformations
Check that data is flowing and updating:

In [None]:
-- Check current busiest lifts
SELECT * FROM BUSIEST_LIFTS_TODAY;

In [None]:
-- Check visitor counts
SELECT * FROM RESORT_VISITORS_BY_DAY 
WHERE RIDE_DATE = CURRENT_DATE()
ORDER BY VISITOR_COUNT DESC;

## Part 2: Reference Data & Capacity Planning

### Resort Capacity Configuration
Create reference data for capacity calculations and operational limits:

In [None]:
-- Reference table: Resort operational capacity
CREATE OR REPLACE TABLE RESORT_CAPACITY (
    RESORT VARCHAR(100) PRIMARY KEY,
    MAX_CAPACITY INTEGER,
    HOURLY_CAPACITY INTEGER,
    BASE_LIFT_COUNT INTEGER
);

INSERT INTO RESORT_CAPACITY VALUES
('Vail', 7000, 1100, 34),
('Beaver Creek', 5500, 900, 25),
('Breckenridge', 6500, 1000, 35),
('Keystone', 4500, 700, 21),
('Heavenly', 5000, 800, 27);

## Part 3: Advanced Analytics Pipeline

### Hierarchical Aggregation Strategy
This pipeline processes data in stages:
- **Level 0 (Raw Data)**: `LIFT_RIDE`, `RESORT_TICKET`, `SEASON_PASS`
- **Level 1 (Hourly)**: Aggregations by hour (e.g., `LIFT_RIDES_BY_HOUR`). Foundation for time-based analysis.
- **Level 2 (Comprehensive Hourly)**: Combined hourly metrics (e.g., `RESORT_HOURLY_SUMMARY`).
- **Level 3 (Daily)**: Roll-up of hourly data to daily summaries (e.g., `RESORT_DAILY_SUMMARY`).
- **Level 4 (Weekly)**: Aggregation of daily data for weekly trends (e.g., `RESORT_WEEKLY_SUMMARY`).

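With `TARGET_LAG='downstream'`, a Dynamic Table has no lag target of its own: it refreshes only when a downstream Dynamic Table that depends on it needs to refresh. The intermediate tables below therefore inherit their refresh cadence from the tables with explicit lags at the end of the chain.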
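
To picture the dependency chain, the following sketch models part of this lab's table graph and derives an order in which upstream tables come before the tables that read from them. It is a conceptual model only: the table names come from the lab, but the `refresh_order` helper is hypothetical and says nothing about how Snowflake's scheduler is actually implemented.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it reads from (per the lab's SQL).
DEPENDENCIES = {
    "RESORT_VISITORS_BY_HOUR": {"LIFT_RIDE"},
    "HOURLY_TICKET_REVENUE": {"RESORT_TICKET"},
    "HOURLY_PASS_USAGE": {"LIFT_RIDE", "SEASON_PASS"},
    "RESORT_HOURLY_SUMMARY": {
        "RESORT_VISITORS_BY_HOUR",
        "HOURLY_TICKET_REVENUE",
        "HOURLY_PASS_USAGE",
        "RESORT_CAPACITY",
    },
    "RESORT_DAILY_SUMMARY": {"RESORT_HOURLY_SUMMARY"},
    "RESORT_WEEKLY_SUMMARY": {"RESORT_DAILY_SUMMARY"},
}


def refresh_order(dependencies):
    """Return table names so that every table follows all of its sources."""
    return list(TopologicalSorter(dependencies).static_order())


order = refresh_order(DEPENDENCIES)
for position, table in enumerate(order, start=1):
    print(position, table)
```

The exact order among independent tables may vary, but every summary table always appears after the tables it depends on.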

In [None]:
-- Level 1: Hourly lift usage
CREATE OR REPLACE DYNAMIC TABLE LIFT_RIDES_BY_HOUR 
TARGET_LAG='downstream' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE(RIDE_TIME) as RIDE_DATE,
    HOUR(RIDE_TIME) as RIDE_HOUR,
    RESORT, 
    LIFT,
    COUNT(*) as RIDE_COUNT,
    COUNT(DISTINCT RFID) as UNIQUE_VISITORS
FROM LIFT_RIDE 
GROUP BY all;

In [None]:
-- Level 1: Hourly visitor patterns
CREATE OR REPLACE DYNAMIC TABLE RESORT_VISITORS_BY_HOUR 
TARGET_LAG='downstream' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE(RIDE_TIME) as RIDE_DATE,
    HOUR(RIDE_TIME) as RIDE_HOUR,
    RESORT,
    COUNT(DISTINCT RFID) as VISITOR_COUNT,
    COUNT(*) as TOTAL_RIDES
FROM LIFT_RIDE 
GROUP BY all;

In [None]:
-- Level 1: Hourly ticket sales
CREATE OR REPLACE DYNAMIC TABLE HOURLY_TICKET_REVENUE 
TARGET_LAG='downstream' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE(PURCHASE_TIME) as PURCHASE_DATE,
    HOUR(PURCHASE_TIME) as PURCHASE_HOUR,
    RESORT,
    SUM(PRICE_USD) as TICKET_REVENUE,
    COUNT(*) as TICKETS_SOLD
FROM RESORT_TICKET 
GROUP BY all;

In [None]:
-- Level 1: Season pass usage
CREATE OR REPLACE DYNAMIC TABLE HOURLY_PASS_USAGE 
TARGET_LAG='downstream' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE(lr.RIDE_TIME) as RIDE_DATE,
    HOUR(lr.RIDE_TIME) as RIDE_HOUR,
    lr.RESORT,
    COUNT(DISTINCT sp.RFID) as ACTIVE_PASSES,
    COUNT(*) as PASS_RIDES
FROM LIFT_RIDE lr
JOIN SEASON_PASS sp ON lr.RFID = sp.RFID
GROUP BY all;

### Comprehensive Hourly Summary (Level 2)
Combine all hourly metrics into a single operational view:

In [None]:
-- Level 2: Comprehensive hourly resort summary
CREATE OR REPLACE DYNAMIC TABLE RESORT_HOURLY_SUMMARY 
TARGET_LAG='downstream' 
WAREHOUSE = STREAMING_INGEST AS
WITH pass_revenue_allocation AS (
    SELECT 5.50 as HOURLY_VALUE_USD  -- Assumed revenue allocated per active pass per hour
)
SELECT 
    v.RIDE_DATE,
    v.RIDE_HOUR,
    v.RESORT,
    v.VISITOR_COUNT,
    v.TOTAL_RIDES,
    COALESCE(t.TICKET_REVENUE, 0) as TICKET_REVENUE,
    COALESCE(t.TICKETS_SOLD, 0) as TICKETS_SOLD,
    COALESCE(p.ACTIVE_PASSES, 0) as ACTIVE_PASSES,
    -- Calculate capacity percentage
    (v.VISITOR_COUNT / rc.MAX_CAPACITY * 100) as CAPACITY_PCT,
    -- Calculate total revenue (tickets + allocated pass value)
    COALESCE(t.TICKET_REVENUE, 0) + 
    (COALESCE(p.ACTIVE_PASSES, 0) * pra.HOURLY_VALUE_USD) as TOTAL_REVENUE
FROM RESORT_VISITORS_BY_HOUR v
LEFT JOIN HOURLY_TICKET_REVENUE t 
    ON v.RIDE_DATE = t.PURCHASE_DATE 
    AND v.RIDE_HOUR = t.PURCHASE_HOUR 
    AND v.RESORT = t.RESORT
LEFT JOIN HOURLY_PASS_USAGE p 
    ON v.RIDE_DATE = p.RIDE_DATE 
    AND v.RIDE_HOUR = p.RIDE_HOUR 
    AND v.RESORT = p.RESORT
JOIN RESORT_CAPACITY rc ON v.RESORT = rc.RESORT  -- Inner join: resorts missing from RESORT_CAPACITY are dropped
CROSS JOIN pass_revenue_allocation pra;
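
The two derived columns can be checked by hand. The sketch below reproduces the `CAPACITY_PCT` and `TOTAL_REVENUE` arithmetic in Python for a single hour; the function name and the input figures are hypothetical, while the 5.50 allocation and Vail's capacity of 7000 come from the lab.

```python
HOURLY_VALUE_USD = 5.50  # per-pass allocation used in the SQL above


def hourly_summary_metrics(visitor_count, max_capacity,
                           ticket_revenue=None, active_passes=None):
    """Mirror CAPACITY_PCT and TOTAL_REVENUE for one resort-hour.

    None stands in for SQL NULL from the LEFT JOINs and is treated as 0,
    as COALESCE(..., 0) does in the query.
    """
    ticket_revenue = 0 if ticket_revenue is None else ticket_revenue
    active_passes = 0 if active_passes is None else active_passes
    capacity_pct = visitor_count / max_capacity * 100
    total_revenue = ticket_revenue + active_passes * HOURLY_VALUE_USD
    return capacity_pct, total_revenue


# Made-up hour at Vail: 3500 visitors, $1200 in tickets, 100 active passes.
capacity_pct, total_revenue = hourly_summary_metrics(3500, 7000, 1200, 100)
print(capacity_pct, total_revenue)

# An hour with rides but no ticket or pass rows still yields a result.
quiet_pct, quiet_revenue = hourly_summary_metrics(700, 7000)
print(quiet_pct, quiet_revenue)
```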

### Daily Aggregations (Level 3)
Roll up hourly data into daily business insights:

In [None]:
-- Level 3: Daily resort summary (from hourly data)
CREATE OR REPLACE DYNAMIC TABLE RESORT_DAILY_SUMMARY 
TARGET_LAG='10 minutes' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    RIDE_DATE,
    RESORT,
    MAX(VISITOR_COUNT) as PEAK_VISITORS,      -- Distinct visitors in the busiest hour
    SUM(TOTAL_RIDES) as TOTAL_RIDES,
    SUM(TOTAL_REVENUE) as TOTAL_REVENUE,
    SUM(TICKETS_SOLD) as TICKETS_SOLD,
    SUM(ACTIVE_PASSES) as TOTAL_PASS_USES,    -- Sum of hourly distinct passes (pass-hours)
    AVG(CAPACITY_PCT) as AVG_CAPACITY_PCT,
    MAX(CAPACITY_PCT) as PEAK_CAPACITY_PCT,
    COUNT(*) as OPERATION_HOURS
FROM RESORT_HOURLY_SUMMARY
GROUP BY all;

In [None]:
-- Level 3: Daily lift performance rankings
CREATE OR REPLACE DYNAMIC TABLE LIFT_DAILY_SUMMARY 
TARGET_LAG='1 minute' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    RIDE_DATE,
    RESORT,
    LIFT,
    SUM(RIDE_COUNT) as DAILY_RIDES,
    -- Note: summing hourly distinct counts counts a rider once per hour they rode
    SUM(UNIQUE_VISITORS) as DAILY_VISITORS,
    COUNT(*) as OPERATION_HOURS,
    RANK() OVER (PARTITION BY RESORT, RIDE_DATE 
                 ORDER BY SUM(RIDE_COUNT) DESC) as USAGE_RANK
FROM LIFT_RIDES_BY_HOUR
GROUP BY all;
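
One subtlety when rolling up distinct counts: unlike ride counts, they are not additive. The small sketch below, using made-up RFID values, shows how summing hourly distinct visitors differs from counting distinct visitors over the whole day.

```python
# Hypothetical (rfid, hour) ride events for one lift on one day.
rides = [("A", 9), ("A", 10), ("B", 9), ("C", 11), ("A", 11)]

# Distinct riders per hour, as COUNT(DISTINCT RFID) grouped by hour would give.
riders_by_hour = {}
for rfid, hour in rides:
    riders_by_hour.setdefault(hour, set()).add(rfid)

# Rolling up by SUM counts rider "A" once for each hour they appear in.
sum_of_hourly_distinct = sum(len(riders) for riders in riders_by_hour.values())

# Counting distinct riders across the whole day counts each rider once.
distinct_for_day = len({rfid for rfid, _ in rides})

print(sum_of_hourly_distinct, distinct_for_day)
```

The summed figure is better read as "visitor-hours"; an exact daily distinct count would need to be computed from the raw `LIFT_RIDE` rows.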

### Weekly Trend Analysis (Level 4)
Aggregate daily data for business intelligence:

In [None]:
-- Level 4: Weekly trends and patterns
CREATE OR REPLACE DYNAMIC TABLE RESORT_WEEKLY_SUMMARY 
TARGET_LAG='30 minutes' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    DATE_TRUNC(week, RIDE_DATE) as WEEK_START_DATE,
    RESORT,
    MAX(PEAK_VISITORS) as WEEK_PEAK_VISITORS,
    AVG(PEAK_VISITORS) as AVG_DAILY_VISITORS,  -- Mean of each day's peak-hour visitors
    SUM(TOTAL_RIDES) as WEEK_TOTAL_RIDES,
    SUM(TOTAL_REVENUE) as WEEK_TOTAL_REVENUE,
    AVG(TOTAL_REVENUE) as AVG_DAILY_REVENUE,
    SUM(TICKETS_SOLD) as WEEK_TICKETS_SOLD,
    SUM(TOTAL_PASS_USES) as WEEK_PASS_USES,
    AVG(AVG_CAPACITY_PCT) as AVG_WEEK_CAPACITY_PCT,
    MAX(PEAK_CAPACITY_PCT) as WEEK_PEAK_CAPACITY_PCT,
    COUNT(*) as OPERATION_DAYS
FROM RESORT_DAILY_SUMMARY
GROUP BY all;

## Part 4: Real-Time Operational Views

### Mission-Critical Dashboards
Create live operational intelligence for resort management:

In [None]:
-- Near real-time resort status (1-minute target lag)
CREATE OR REPLACE DYNAMIC TABLE CURRENT_RESORT_STATUS 
TARGET_LAG='1 minute' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    RESORT,
    VISITOR_COUNT as CURRENT_VISITORS,
    CAPACITY_PCT as CURRENT_CAPACITY_PCT,
    TOTAL_RIDES as CURRENT_HOUR_RIDES,
    TOTAL_REVENUE as CURRENT_HOUR_REVENUE,
    CASE 
        WHEN CAPACITY_PCT > 90 THEN 'HIGH'
        WHEN CAPACITY_PCT > 70 THEN 'MODERATE'
        ELSE 'NORMAL'
    END as CAPACITY_STATUS
FROM RESORT_HOURLY_SUMMARY
WHERE RIDE_DATE = CURRENT_DATE()
AND RIDE_HOUR = HOUR(CURRENT_TIMESTAMP());
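
The `CASE` expression applies its thresholds in order, with strict comparisons. A minimal Python equivalent (the function name is hypothetical) makes the boundary behavior easy to check:

```python
def capacity_status(capacity_pct):
    """Mirror the CASE expression used for CAPACITY_STATUS.

    None stands in for SQL NULL: both comparisons fail to be true,
    so the ELSE branch applies.
    """
    if capacity_pct is None:
        return "NORMAL"
    if capacity_pct > 90:
        return "HIGH"
    if capacity_pct > 70:
        return "MODERATE"
    return "NORMAL"


for pct in (95, 90, 75, 70, 20, None):
    print(pct, capacity_status(pct))
```

Because the comparisons are strict, exactly 90 is `MODERATE` and exactly 70 is `NORMAL`.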

In [None]:
-- Today's top performing lifts by resort
CREATE OR REPLACE DYNAMIC TABLE TOP_LIFTS_TODAY 
TARGET_LAG='1 minute' 
WAREHOUSE = STREAMING_INGEST AS
SELECT 
    RESORT,
    LIFT,
    DAILY_RIDES,
    DAILY_VISITORS,
    USAGE_RANK
FROM LIFT_DAILY_SUMMARY
WHERE RIDE_DATE = CURRENT_DATE()
AND USAGE_RANK <= 5
ORDER BY RESORT, USAGE_RANK;

In [None]:
-- Daily revenue performance vs targets
CREATE OR REPLACE DYNAMIC TABLE REVENUE_PERFORMANCE_DAILY 
TARGET_LAG='10 minutes' 
WAREHOUSE = STREAMING_INGEST AS
WITH daily_targets AS (
    SELECT 
        rds.RESORT,
        rds.RIDE_DATE,
        (rc.MAX_CAPACITY * 0.7 * 100) as REVENUE_TARGET_USD  -- Illustrative target: 0.7 x MAX_CAPACITY x 100 (490,000 for Vail)
    FROM RESORT_DAILY_SUMMARY rds
    JOIN RESORT_CAPACITY rc ON rds.RESORT = rc.RESORT
)
SELECT 
    rds.RIDE_DATE,
    rds.RESORT,
    rds.TOTAL_REVENUE,
    dt.REVENUE_TARGET_USD,
    (rds.TOTAL_REVENUE / dt.REVENUE_TARGET_USD * 100) as REVENUE_TARGET_PCT,
    CASE 
        WHEN rds.TOTAL_REVENUE >= dt.REVENUE_TARGET_USD THEN 'ABOVE_TARGET'
        WHEN rds.TOTAL_REVENUE >= dt.REVENUE_TARGET_USD * 0.9 THEN 'NEAR_TARGET'
        ELSE 'BELOW_TARGET'
    END as PERFORMANCE_STATUS
FROM RESORT_DAILY_SUMMARY rds
JOIN daily_targets dt ON rds.RESORT = dt.RESORT AND rds.RIDE_DATE = dt.RIDE_DATE;
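
To sanity-check the target logic, the sketch below recomputes the target, the percentage, and the status in Python. It uses `Decimal` to stay close to SQL fixed-point arithmetic; the function name and the revenue figures are hypothetical, while the `0.7 * 100` factor and Vail's capacity of 7000 come from the lab.

```python
from decimal import Decimal

# Same constants as the SQL: MAX_CAPACITY * 0.7 * 100.
TARGET_FACTOR = Decimal("0.7") * 100


def revenue_performance(total_revenue, max_capacity):
    """Return (target, percent of target, status) for one resort-day."""
    target = max_capacity * TARGET_FACTOR
    pct_of_target = Decimal(total_revenue) / target * 100
    if total_revenue >= target:
        status = "ABOVE_TARGET"
    elif total_revenue >= target * Decimal("0.9"):
        status = "NEAR_TARGET"
    else:
        status = "BELOW_TARGET"
    return target, pct_of_target, status


# Made-up daily revenues for a resort with MAX_CAPACITY = 7000.
for revenue in (500_000, 441_000, 245_000):
    target, pct_of_target, status = revenue_performance(revenue, 7000)
    print(revenue, target, pct_of_target, status)
```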

## Part 5: Testing & Validation

### Monitor Your Pipeline
Use these queries to verify your Dynamic Tables are working correctly:

In [None]:
-- Check current operational status
SELECT 
    RESORT,
    CURRENT_VISITORS,
    CURRENT_CAPACITY_PCT,
    CAPACITY_STATUS,
    CURRENT_HOUR_REVENUE
FROM CURRENT_RESORT_STATUS
ORDER BY CURRENT_CAPACITY_PCT DESC;

In [None]:
-- View today's top lifts
SELECT * FROM TOP_LIFTS_TODAY
ORDER BY RESORT, USAGE_RANK;

In [None]:
-- Check revenue performance
SELECT 
    RESORT,
    TOTAL_REVENUE,
    REVENUE_TARGET_USD,
    REVENUE_TARGET_PCT,
    PERFORMANCE_STATUS
FROM REVENUE_PERFORMANCE_DAILY
WHERE RIDE_DATE = CURRENT_DATE()
ORDER BY REVENUE_TARGET_PCT DESC;

### Monitor Dynamic Table Health
Track refresh history and performance:

In [None]:
-- Check refresh history for performance monitoring
SELECT 
    name, 
    refresh_start_time, 
    state, 
    refresh_end_time,
    DATEDIFF('second', refresh_start_time, refresh_end_time) as duration_seconds
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY())
WHERE name LIKE '%RESORT_%'
ORDER BY refresh_start_time DESC
LIMIT 20;

In [None]:
-- Check Dynamic Table status and refresh modes
SHOW DYNAMIC TABLES;
SELECT 
    "name", 
    "rows", 
    "target_lag", 
    "refresh_mode", 
    "refresh_mode_reason"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "name" NOT LIKE '%LIFT_RIDES_BY_DAY%';  -- Exclude the LIFT_RIDES_BY_DAY table

### Explore Hourly Insights
Investigate patterns in your aggregated data:

In [None]:
-- Analyze hourly visitor patterns
SELECT 
    RIDE_HOUR,
    RESORT,
    AVG(VISITOR_COUNT) as AVG_VISITORS,
    AVG(CAPACITY_PCT) as AVG_CAPACITY
FROM RESORT_HOURLY_SUMMARY
WHERE RIDE_DATE = CURRENT_DATE()
GROUP BY RIDE_HOUR, RESORT
ORDER BY RIDE_HOUR, RESORT;

In [None]:
-- Identify peak hours by resort
SELECT 
    RESORT,
    RIDE_HOUR,
    VISITOR_COUNT,
    CAPACITY_PCT,
    TOTAL_REVENUE
FROM RESORT_HOURLY_SUMMARY
WHERE RIDE_DATE = CURRENT_DATE()
QUALIFY ROW_NUMBER() OVER (PARTITION BY RESORT ORDER BY VISITOR_COUNT DESC) = 1
ORDER BY VISITOR_COUNT DESC;
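
The `QUALIFY ROW_NUMBER() ... = 1` filter keeps one row per resort: the hour with the most visitors. A minimal Python analogue, using made-up rows and a hypothetical helper name, might look like this. Ties are broken by first occurrence here, whereas `ROW_NUMBER()` breaks them arbitrarily.

```python
# Hypothetical hourly rows for two resorts on one day.
rows = [
    {"resort": "Vail", "ride_hour": 10, "visitor_count": 900},
    {"resort": "Vail", "ride_hour": 11, "visitor_count": 1200},
    {"resort": "Keystone", "ride_hour": 9, "visitor_count": 400},
    {"resort": "Keystone", "ride_hour": 13, "visitor_count": 650},
]


def peak_hour_per_resort(rows):
    """Keep the row with the highest visitor_count for each resort."""
    best = {}
    for row in rows:
        current = best.get(row["resort"])
        if current is None or row["visitor_count"] > current["visitor_count"]:
            best[row["resort"]] = row
    # Match the query's final ORDER BY VISITOR_COUNT DESC.
    return sorted(best.values(), key=lambda r: r["visitor_count"], reverse=True)


peaks = peak_hour_per_resort(rows)
for row in peaks:
    print(row["resort"], row["ride_hour"], row["visitor_count"])
```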

## Lab Management

### Reset Simulation (if needed)
Use this to clear the streamed data and restart the simulation from empty tables:

In [None]:
-- WARNING: This deletes all streaming data.
-- Only run if you need to reset the simulation.
/*
DELETE FROM LIFT_RIDE;
DELETE FROM RESORT_TICKET;
DELETE FROM SEASON_PASS;
*/

## Lab Complete!

### What You've Built
- Multi-level aggregation pipeline (Raw → Hourly → Daily → Weekly).
- Real-time operational dashboards with a 1-minute target lag.
- Capacity management with live status indicators.
- Revenue tracking with performance targets.
- Hierarchical data dependencies using `TARGET_LAG` strategies.

### Key Takeaways
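- Dynamic Tables refresh automatically, processing changes incrementally where the query supports it (check `refresh_mode` to confirm).
- `TARGET_LAG='downstream'` lets intermediate tables refresh only when the tables that depend on them need fresh data.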
- Near real-time analytics can be implemented in a declarative style.

### Next Steps
- Extend the pipeline with additional analytics.
- Build Streamlit dashboards using these Dynamic Tables.

---
*Great job completing the notebook portion of this lab!*