# Streaming Data Pipeline with Snowpark Python and Dynamic Tables

## Objective
This notebook demonstrates an enhanced approach to building a real-time analytics pipeline using Snowflake Dynamic Tables, Snowpark Python procedures, and Triggered Tasks. It focuses on transforming raw streaming ski resort data into actionable insights, with improved daily visit tracking and a structured aggregation hierarchy.

## 1. Setup and Initialization

Import the required Python packages and initialize the Snowpark environment.

In [None]:
# Import python packages
import streamlit as st
import pandas as pd
from snowflake.core import Root

# Grab active Snowpark session
from snowflake.snowpark.context import get_active_session
session = get_active_session()

# Initialize Snowflake Python API for object management
root = Root(session)

## 2. Initial Data Exploration

Before building transformations, let's examine the structure of our raw streaming data. This helps in understanding the source tables we'll be working with.

In [None]:
-- Lift usage events (core activity data)
SELECT * FROM LIFT_RIDE LIMIT 20;

In [None]:
-- Day ticket purchases
SELECT * FROM RESORT_TICKET LIMIT 20;

In [None]:
-- Season pass purchases
SELECT * FROM SEASON_PASS LIMIT 20;

## 3. Initial Data Pipeline Setup

This section covers the additional setup required for this use case, including creating streams and reference tables.

### 3.1. Create Stream on Raw Lift Ride Data

A stream is created on the `LIFT_RIDE` table to capture new lift ride events. This stream will be the source for the Snowpark procedure that populates daily visit information.

In [None]:
CREATE OR REPLACE STREAM LIFT_RIDE_STREAM ON TABLE LIFT_RIDE APPEND_ONLY = TRUE SHOW_INITIAL_ROWS = TRUE;

### 3.2. Resort Capacity Reference Table

Create and populate a reference table for resort capacities, which will be used in downstream calculations.

In [None]:
-- Reference table for resort capacity
CREATE OR REPLACE TABLE RESORT_CAPACITY (
    RESORT VARCHAR(100) PRIMARY KEY,
    MAX_CAPACITY INTEGER,
    HOURLY_CAPACITY INTEGER,
    BASE_LIFT_COUNT INTEGER,
    IANA_TIMEZONE VARCHAR(50) 
);

INSERT INTO RESORT_CAPACITY (RESORT, MAX_CAPACITY, HOURLY_CAPACITY, BASE_LIFT_COUNT, IANA_TIMEZONE) VALUES
('Vail', 7000, 1100, 34, 'America/Denver'),
('Beaver Creek', 5500, 900, 25, 'America/Denver'),
('Breckenridge', 6500, 1000, 35, 'America/Denver'),
('Keystone', 4500, 700, 21, 'America/Denver'),
('Heavenly', 5000, 800, 27, 'America/Los_Angeles');

## 4. Automated Daily Visit Processing with Snowpark

This section details the setup for accurately tracking daily visits using a Snowpark Stored Procedure and a Task to automate its execution.

### 4.1. `DAILY_VISITS` Table

This table will store unique daily visits per RFID at each resort, along with their first ride details and season pass status. It is populated by a Snowpark procedure.

In [None]:
CREATE OR REPLACE TABLE DAILY_VISITS (
    VISIT_DATE DATE,
    RESORT STRING,
    RFID STRING,
    NAME STRING,
    FIRST_RIDE_TIME DATETIME,
    FIRST_LIFT STRING,
    HAS_SEASON_PASS BOOLEAN,
    PURCHASE_PRICE_USD DECIMAL(7,2),    
    ACTIVATION_USAGE_COUNT INTEGER,
    TICKET_ORIGINAL_DURATION INTEGER
);

### 4.2. Stage for Deployed Snowpark Code

Create a stage to store Snowpark Python code for stored procedures.

In [None]:
create stage if not exists snowpark_apps;

### 4.3. Snowpark Python Function: `populate_daily_visits`

This Python function will ultimately be deployed as a Python Stored Procedure. It processes new records from `LIFT_RIDE_STREAM`, identifies the first ride for each visitor per day at each resort, enriches the data with customer details and pass status, and inserts new, unique daily visits into the `DAILY_VISITS` table.
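
The deduplication and anti-join logic described above can be illustrated without a Snowflake session. The following is a minimal plain-Python sketch, not the deployed procedure: the sample events, lift names, and the helper names `first_rides` and `new_visits` are hypothetical, and only the field names mirror the `LIFT_RIDE` columns used in this notebook.

```python
from datetime import date, datetime

# Hypothetical sample events; only the field names come from LIFT_RIDE.
rides = [
    {"RESORT": "Vail", "RFID": "A1", "LIFT": "Lift A", "RIDE_TIME": datetime(2025, 1, 4, 9, 15)},
    {"RESORT": "Vail", "RFID": "A1", "LIFT": "Lift B", "RIDE_TIME": datetime(2025, 1, 4, 8, 50)},
    {"RESORT": "Vail", "RFID": "A1", "LIFT": "Lift A", "RIDE_TIME": datetime(2025, 1, 5, 10, 0)},
    {"RESORT": "Keystone", "RFID": "B2", "LIFT": "Lift C", "RIDE_TIME": datetime(2025, 1, 4, 9, 30)},
]


def first_rides(events):
    """Keep the earliest ride per (RESORT, RFID, calendar date)."""
    earliest = {}
    for event in events:
        key = (event["RESORT"], event["RFID"], event["RIDE_TIME"].date())
        if key not in earliest or event["RIDE_TIME"] < earliest[key]["RIDE_TIME"]:
            earliest[key] = event
    return earliest


def new_visits(events, existing_keys):
    """Drop first rides whose key is already recorded (the anti-join step)."""
    return {k: v for k, v in first_rides(events).items() if k not in existing_keys}


visits = first_rides(rides)

# Pretend one visit was already written to DAILY_VISITS by an earlier run.
already_recorded = {("Vail", "A1", date(2025, 1, 4))}
fresh = new_visits(rides, already_recorded)
```

Because the key includes the calendar date, the same RFID produces one visit per day, and re-processing overlapping events does not create duplicates as long as the existing keys are checked first.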

In [None]:
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, row_number, coalesce, when
from snowflake.snowpark.window import Window

def populate_daily_visits(session: Session) -> str:
    """
    Populate DAILY_VISITS table using Snowpark Python
    Handles data from any date in the stream, deduplicates by RFID per resort per day
    This process is designed to be run frequently from a triggered task
    """
    
    # Step 1: Get new rides from stream
    lift_ride_stream = session.table("LIFT_RIDE_STREAM")
    
    # Deduplicate by RFID per resort per day - get earliest ride time
    window_spec = Window.partition_by(
        col("RESORT"), 
        col("RFID"), 
        col("VISIT_DATE")
    ).order_by(col("RIDE_TIME").asc())
    
    first_rides_df = lift_ride_stream.select(
        col("RESORT"),
        col("RFID"),
        col("LIFT").alias("FIRST_LIFT"),
        col("RIDE_TIME").alias("FIRST_RIDE_TIME"),        
        col("RIDE_TIME").cast('DATE').alias("VISIT_DATE"),
        col("ACTIVATION_DAY_COUNT").alias("ACTIVATION_USAGE_COUNT"), # Ride data includes total number of days ticket or pass has been activated
        row_number().over(window_spec).alias("rn")
    )
    # Filter to only first ride of each day for each RFID at each resort
    first_rides_df = first_rides_df.filter(col("rn") == 1) #.drop(col("rn"))
    
    # Step 2: Join with customer data to get customer details and determine visit type
    season_pass_df = session.table("SEASON_PASS")
    resort_ticket_df = session.table("RESORT_TICKET")
    
    # Left join with season pass
    first_rides_df = first_rides_df.join(season_pass_df, col("RFID") == col("RFID_PASS"), "left", rsuffix="_PASS")
    
    # Left join with resort ticket
    first_rides_df = first_rides_df.join(resort_ticket_df, col("RFID") == col("RFID_TICKET"), "left", rsuffix="_TICKET")
     
    first_rides_df = first_rides_df.select(
        first_rides_df.col("RESORT"),
        first_rides_df.col("RFID"),
        first_rides_df.col("FIRST_LIFT"),
        first_rides_df.col("FIRST_RIDE_TIME"),
        first_rides_df.col("VISIT_DATE"), 
        coalesce(season_pass_df.col("NAME"), resort_ticket_df.col("NAME")).alias("NAME"), # Name on ticket or pass
        when(season_pass_df.col("RFID").is_not_null(), True).otherwise(False).alias("HAS_SEASON_PASS"),
        coalesce(season_pass_df.col("PRICE_USD"), resort_ticket_df.col("PRICE_USD")).alias("PURCHASE_PRICE_USD"), # Price of ticket or pass        
        first_rides_df.col("ACTIVATION_USAGE_COUNT"),
        resort_ticket_df.col("DAYS").alias("TICKET_ORIGINAL_DURATION") #Will be null for passes
    )
    
    # Step 3: Anti-join with existing DAILY_VISITS
    daily_visits_df = session.table("DAILY_VISITS").select(
            col("VISIT_DATE"),
            col("RESORT"),
            col("RFID")
    )
    # Create the anti-join condition - check for any existing record for this RFID/resort/date combination
    new_visits_df = first_rides_df.join(daily_visits_df, 
        ((col("VISIT_DATE") == col("VISIT_DATE_DV")) &
        (col("RESORT") == col("RESORT_DV")) &
        (col("RFID") == col("RFID_DV"))), "left", rsuffix="_DV").filter(col("RESORT_DV").is_null())  # Anti-join condition        
    new_visits_df = new_visits_df.select(
        first_rides_df.col("VISIT_DATE"),
        first_rides_df.col("RESORT"),
        first_rides_df.col("RFID"),
        first_rides_df.col("NAME"),
        first_rides_df.col("FIRST_RIDE_TIME"),
        first_rides_df.col("FIRST_LIFT"),
        first_rides_df.col("HAS_SEASON_PASS"),
        first_rides_df.col("PURCHASE_PRICE_USD"),
        first_rides_df.col("ACTIVATION_USAGE_COUNT"),
        first_rides_df.col("TICKET_ORIGINAL_DURATION") 
    )
    
    # Step 4: Append new visits into DAILY_VISITS table
    try:
        # Write the data to the table
        new_visits_df.write.mode("append").save_as_table("DAILY_VISITS", column_order="name")        
        return "OK"
    except Exception as e:
        return f"ERROR: {str(e)}"

### 4.4. Manually invoke Python function (for testing/setup)

Prior to deploying as a Snowflake task, let's run the Python function to make sure it's working properly. This step will backfill initial data if `SHOW_INITIAL_ROWS=TRUE` was used for the stream and it's the first run.

In [None]:
populate_daily_visits(session)

### 4.5. Create Triggered Task to Automate `populate_daily_visits`

Define and create a Snowflake Triggered Task to automatically run `populate_daily_visits` as a Python stored procedure when new data arrives in the `LIFT_RIDE_STREAM`.

In [None]:
from snowflake.core.task import StoredProcedureCall, Task

populate_dv_task = Task(
    "populate_daily_visits",
    StoredProcedureCall(populate_daily_visits, stage_location="@snowpark_apps"),
    warehouse="STREAMING_INGEST", 
    condition="SYSTEM$STREAM_HAS_DATA('lift_ride_stream')",
    allow_overlapping_execution=False
)
populate_dv_task_res = root.databases['streaming_ingest'].schemas['streaming_ingest'].tasks["populate_daily_visits"]
populate_dv_task_res.create_or_alter(populate_dv_task)

## 5. Task Management

Commands to manage the `populate_daily_visits` task, such as suspending, checking parameters, altering, and resuming.

In [None]:
populate_dv_task_res = root.databases['streaming_ingest'].schemas['streaming_ingest'].tasks["populate_daily_visits"]

In [None]:
populate_dv_task_res.suspend()

In [None]:
SHOW PARAMETERS LIKE 'USER_TASK_MINIMUM_TRIGGER_INTERVAL_IN_SECONDS' IN TASK populate_daily_visits;

In [None]:
-- Note: USER_TASK_MINIMUM_TRIGGER_INTERVAL_IN_SECONDS controls the minimum execution interval for triggered tasks.
-- Setting it to 10 seconds lets the task run at its maximum frequency.
ALTER TASK populate_daily_visits SET USER_TASK_MINIMUM_TRIGGER_INTERVAL_IN_SECONDS = 10;

In [None]:
# Resume the task to start its execution based on the stream condition
# Ensure populate_dv_task_res is defined from the PY_SUSPEND_POPULATE_DAILY_VISITS_TASK cell
populate_dv_task_res.resume()

In [None]:
describe task populate_daily_visits;

## 6. Dynamic Table Aggregation Pipeline

Define a series of Dynamic Tables to perform hierarchical aggregations (hourly, daily, weekly) on the ski resort data. These tables will automatically refresh as new data arrives.

### 6.1. Define Hourly Aggregations using SQL

These Dynamic Tables provide the first level of aggregation, summarizing data on an hourly basis. Let's start by using SQL to define the DTs.

In [None]:
-- Summarize hourly lift activity
-- Currently requires one join to determine how many riders have season passes and pass vs ticket rides
CREATE OR REPLACE DYNAMIC TABLE HOURLY_LIFT_ACTIVITY
TARGET_LAG='downstream' --This table will not be queried directly, so we can use downstream lag
WAREHOUSE = STREAMING_INGEST
REFRESH_MODE = incremental
AS
SELECT
    DATE(lr.RIDE_TIME) as RIDE_DATE,
    HOUR(lr.RIDE_TIME) as RIDE_HOUR,
    DATE_TRUNC('hour', lr.RIDE_TIME) as RIDE_HOUR_TIMESTAMP,
    lr.RESORT,
    COUNT(*) as TOTAL_RIDES,
    COUNT(DISTINCT lr.RFID) as VISITOR_COUNT,
    -- Use DAILY_VISITS to determine pass usage
    COUNT(DISTINCT CASE WHEN dv.HAS_SEASON_PASS = TRUE THEN lr.RFID END) as ACTIVE_PASSES,
    COUNT(CASE WHEN dv.HAS_SEASON_PASS = TRUE THEN 1 END) as PASS_RIDES
FROM LIFT_RIDE lr
LEFT JOIN DAILY_VISITS dv ON lr.RFID = dv.RFID
    AND DATE(lr.RIDE_TIME) = dv.VISIT_DATE
    AND lr.RESORT = dv.RESORT
GROUP BY RIDE_DATE, RIDE_HOUR, RIDE_HOUR_TIMESTAMP, lr.RESORT;

In [None]:
-- Let's see what the data looks like
select * from HOURLY_LIFT_ACTIVITY 
order by RIDE_HOUR_TIMESTAMP desc 
limit 100;

In [None]:
-- Determine hourly revenue based on the activation of tickets and season passes
-- Revenue from multi day tickets and season passes is prorated based on inline business rules
-- NOTE: This revenue does not include hourly sales of tickets or passes - we would need an additional DT for that
CREATE OR REPLACE DYNAMIC TABLE HOURLY_AMORTIZED_REVENUE
TARGET_LAG = 'downstream'
WAREHOUSE = STREAMING_INGEST
REFRESH_MODE = INCREMENTAL
AS
SELECT
    dv.VISIT_DATE AS RIDE_DATE,
    HOUR(dv.FIRST_RIDE_TIME) AS RIDE_HOUR,
    DATE_TRUNC('hour', dv.FIRST_RIDE_TIME) AS RIDE_HOUR_TIMESTAMP,
    dv.RESORT,
    SUM(CASE
        WHEN NOT dv.HAS_SEASON_PASS -- It's a ticket
        THEN (dv.PURCHASE_PRICE_USD / GREATEST(dv.TICKET_ORIGINAL_DURATION, 1)) -- Use ticket duration to prorate revenue
        ELSE 0
    END) AS RECOGNIZED_TICKET_REVENUE,

    COUNT(DISTINCT CASE WHEN NOT dv.HAS_SEASON_PASS THEN dv.RFID END) AS TICKET_ACTIVATIONS,

    SUM(CASE
        WHEN dv.HAS_SEASON_PASS AND dv.ACTIVATION_USAGE_COUNT <= 20 -- Prorate pass revenue across first 20 pass activations
        THEN (dv.PURCHASE_PRICE_USD / 20) -- Recognize pass fractional revenue 
        ELSE 0
    END) AS RECOGNIZED_PASS_REVENUE,

    COUNT(DISTINCT CASE WHEN dv.HAS_SEASON_PASS THEN dv.RFID END) AS PASS_ACTIVATIONS
FROM DAILY_VISITS dv
GROUP BY RIDE_HOUR_TIMESTAMP, dv.RESORT, dv.VISIT_DATE, RIDE_HOUR;

In [None]:
select * from HOURLY_AMORTIZED_REVENUE 
order by RIDE_HOUR_TIMESTAMP desc 
limit 100;
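
The proration rules embedded in `HOURLY_AMORTIZED_REVENUE` can be checked row by row in plain Python. This is a sketch of the per-visit rule only, under the assumption that each row carries the same fields as `DAILY_VISITS`; the function name `recognized_revenue` and the constant `PASS_PRORATION_VISITS` are hypothetical names introduced for illustration.

```python
from decimal import Decimal

# Mirrors the literal 20 used in the SQL above.
PASS_PRORATION_VISITS = 20


def recognized_revenue(price_usd, has_season_pass, activation_usage_count,
                       ticket_duration_days=None):
    """Revenue recognized for one daily visit.

    Passes: 1/20 of the pass price for each of the first 20 activations.
    Tickets: the ticket price divided by its duration in days (at least 1).
    Returns None when a ticket has no duration, mirroring the SQL NULL
    that SUM() would skip.
    """
    if has_season_pass:
        if activation_usage_count <= PASS_PRORATION_VISITS:
            return price_usd / PASS_PRORATION_VISITS
        return Decimal("0")
    if ticket_duration_days is None:
        return None
    return price_usd / max(ticket_duration_days, 1)


three_day_ticket = recognized_revenue(Decimal("300.00"), False, 1, 3)
pass_visit_5 = recognized_revenue(Decimal("1000.00"), True, 5)
pass_visit_21 = recognized_revenue(Decimal("1000.00"), True, 21)
```

With these rules a three-day ticket contributes a third of its price on each visit, and a season pass stops contributing revenue after its twentieth activation.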

In [None]:
-- Now let's combine the first two intermediate DTs into a new consolidated DT that is easy to query
CREATE OR REPLACE DYNAMIC TABLE HOURLY_RESORT_SUMMARY
TARGET_LAG = '1 minute'
WAREHOUSE = STREAMING_INGEST
REFRESH_MODE = INCREMENTAL
AS
SELECT
    activity.RIDE_DATE,
    activity.RIDE_HOUR,
    activity.RIDE_HOUR_TIMESTAMP,
    activity.RESORT,
    activity.VISITOR_COUNT,
    activity.TOTAL_RIDES,
    -- Ticket and pass activation counts
    revenue.TICKET_ACTIVATIONS,
    revenue.PASS_ACTIVATIONS,
    -- General activity metrics 
    activity.ACTIVE_PASSES,
    activity.PASS_RIDES,
    (activity.VISITOR_COUNT - activity.ACTIVE_PASSES) AS ACTIVE_TICKETS,
    (activity.TOTAL_RIDES - activity.PASS_RIDES) AS TICKET_RIDES,
    -- Recognized Revenue from HOURLY_AMORTIZED_REVENUE
    COALESCE(revenue.RECOGNIZED_TICKET_REVENUE, 0) AS RECOGNIZED_TICKET_REVENUE,
    COALESCE(revenue.RECOGNIZED_PASS_REVENUE, 0) AS RECOGNIZED_PASS_REVENUE,
    -- New Total Recognized Revenue
    (COALESCE(revenue.RECOGNIZED_TICKET_REVENUE, 0) + COALESCE(revenue.RECOGNIZED_PASS_REVENUE, 0)) AS TOTAL_RECOGNIZED_REVENUE,
    -- Calculate capacity percentage
    ROUND((activity.VISITOR_COUNT / rc.MAX_CAPACITY * 100), 1) AS CAPACITY_PCT,
    -- Capacity status
    CASE
        WHEN (activity.VISITOR_COUNT / rc.MAX_CAPACITY * 100) > 90 THEN 'HIGH'
        WHEN (activity.VISITOR_COUNT / rc.MAX_CAPACITY * 100) > 70 THEN 'MODERATE'
        ELSE 'NORMAL'
        END AS CAPACITY_STATUS
FROM HOURLY_LIFT_ACTIVITY activity
         LEFT JOIN HOURLY_AMORTIZED_REVENUE revenue
                   ON activity.RIDE_DATE = revenue.RIDE_DATE
                       AND activity.RIDE_HOUR = revenue.RIDE_HOUR
                       AND activity.RESORT = revenue.RESORT
         JOIN RESORT_CAPACITY rc ON activity.RESORT = rc.RESORT;

In [None]:
select * from HOURLY_RESORT_SUMMARY 
where resort = 'Vail'
order by RIDE_DATE desc, RIDE_HOUR desc
limit 100;
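
The capacity thresholds in `HOURLY_RESORT_SUMMARY` are strict inequalities, so exactly 70% is still `NORMAL` and exactly 90% is still `MODERATE`. A small Python sketch of the same classification follows; the function name `capacity_status` is hypothetical, and Python's `round` can differ from Snowflake's `ROUND` on exact ties, so treat the percentage as illustrative.

```python
def capacity_status(visitor_count, max_capacity):
    """Return (capacity_pct, status) using the thresholds from the SQL above."""
    # Integer comparisons avoid floating-point surprises at the boundaries.
    if visitor_count * 100 > 90 * max_capacity:
        status = "HIGH"
    elif visitor_count * 100 > 70 * max_capacity:
        status = "MODERATE"
    else:
        status = "NORMAL"
    pct = round(visitor_count * 100 / max_capacity, 1)
    return pct, status


# Vail's MAX_CAPACITY of 7000 comes from the RESORT_CAPACITY reference table.
busy_hour = capacity_status(6500, 7000)
boundary_hour = capacity_status(4900, 7000)
```

Here `busy_hour` is classified `HIGH`, while `boundary_hour` sits exactly at 70% and therefore stays `NORMAL`.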

### 6.2. Define Daily Aggregations using Snowpark Python
 
This next Dynamic Table rolls up the hourly data into daily totals. It is very efficient to refresh because it's calculated solely from already-aggregated data.

***While defining dynamic tables in SQL is nice, we can also use Python!***

In [None]:
from snowflake.snowpark.functions import col, max as max_, sum as sum_, avg, count, round as round_, lit

# Read from the HOURLY_RESORT_SUMMARY dynamic table
hourly_summary_df = session.table("HOURLY_RESORT_SUMMARY")

# Group by RIDE_DATE and RESORT, then aggregate
daily_summary_df = (hourly_summary_df.group_by(
    col("RIDE_DATE"), 
    col("RESORT")
).agg(
    max_(col("VISITOR_COUNT")).alias("PEAK_HOURLY_VISITORS"), # Peak visitors in any single hour        
    sum_(col("VISITOR_COUNT")).alias("TOTAL_VISITOR_HOURS"), # Sum of all visitor-hours    
    sum_(col("TOTAL_RIDES")).alias("TOTAL_RIDES"), # Total rides across all hours    
    # Revenue aggregations
    sum_(col("RECOGNIZED_TICKET_REVENUE")).alias("TOTAL_TICKET_REVENUE"),
    sum_(col("RECOGNIZED_PASS_REVENUE")).alias("TOTAL_PASS_REVENUE"),
    sum_(col("TOTAL_RECOGNIZED_REVENUE")).alias("TOTAL_REVENUE"),
    # Visitor activations
    sum_(col("TICKET_ACTIVATIONS")).alias("TOTAL_TICKET_ACTIVATIONS"),
    sum_(col("PASS_ACTIVATIONS")).alias("TOTAL_PASS_ACTIVATIONS"),
    # Ride type breakdowns
    sum_(col("PASS_RIDES")).alias("TOTAL_PASS_RIDES"),
    sum_(col("TICKET_RIDES")).alias("TOTAL_TICKET_RIDES"),
    # Capacity metrics
    round_(avg(col("CAPACITY_PCT")), 1).alias("AVG_CAPACITY_PCT"),
    max_(col("CAPACITY_PCT")).alias("PEAK_CAPACITY_PCT"),
    # Operation hours (count of hourly records)
    count(lit(1)).alias("OPERATION_HOURS")
)
    # Calculate total visitors as sum of ticket and pass activations
    .with_column("TOTAL_VISITORS", col("TOTAL_TICKET_ACTIVATIONS") + col("TOTAL_PASS_ACTIVATIONS")))
    
# Now we can use the Snowpark dataframe to deploy the next DT! 
daily_summary_df.create_or_replace_dynamic_table(
    name="DAILY_RESORT_SUMMARY",
    warehouse="STREAMING_INGEST",
    lag="1 minute",
    refresh_mode="INCREMENTAL"
)

In [None]:
select * from DAILY_RESORT_SUMMARY 
order by RIDE_DATE desc
limit 100;

### 6.3. Define Weekly Aggregation using Python

This last Dynamic Table aggregates daily summaries to provide weekly insights. It is also very efficient to refresh because it is calculated entirely from daily data.
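
The weekly grouping key is the week start derived from `RIDE_DATE`. As a rough illustration of how days map to weeks, here is a plain-Python sketch that assumes weeks start on Monday, which matches Snowflake's behavior under the default `WEEK_START` session parameter; a different `WEEK_START` setting would shift the boundaries. The helper name `week_start` is hypothetical.

```python
from datetime import date, timedelta


def week_start(ride_date):
    """Return the Monday on or before ride_date."""
    return ride_date - timedelta(days=ride_date.weekday())


# A Wednesday, a Monday, and a Sunday around the start of 2025.
midweek = week_start(date(2025, 1, 8))
monday = week_start(date(2025, 1, 6))
sunday = week_start(date(2025, 1, 5))
```

Under this assumption a Sunday belongs to the week that began six days earlier, so weekend days at a resort roll up with the preceding weekdays.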

In [None]:
from snowflake.snowpark.functions import col, max as max_, sum as sum_, avg, round as round_, date_trunc, count_distinct

# Read from the DAILY_RESORT_SUMMARY dynamic table
daily_summary_df = session.table("DAILY_RESORT_SUMMARY")

# Group by WEEK_START_DATE (derived from RIDE_DATE) and RESORT, then aggregate
weekly_summary_df = (daily_summary_df.group_by(
    date_trunc('week', col("RIDE_DATE")).alias("WEEK_START_DATE"),
    col("RESORT")
).agg(
    max_(col("TOTAL_VISITORS")).alias("MAX_DAILY_UNIQUE_VISITORS"), # Peak unique visitors on any single day in the week
    round_(avg(col("TOTAL_VISITORS")), 0).alias("AVG_DAILY_UNIQUE_VISITORS"), # Average daily unique visitors
    sum_(col("TOTAL_VISITORS")).alias("WEEK_TOTAL_VISITORS"), # Sum of daily unique visitors (visitor-days)
    sum_(col("TOTAL_RIDES")).alias("WEEK_TOTAL_RIDES"),
    sum_(col("TOTAL_PASS_RIDES")).alias("WEEK_TOTAL_PASS_RIDES"),
    sum_(col("TOTAL_TICKET_RIDES")).alias("WEEK_TOTAL_TICKET_RIDES"),
    sum_(col("TOTAL_TICKET_REVENUE")).alias("WEEK_TOTAL_TICKET_REVENUE"),
    sum_(col("TOTAL_PASS_REVENUE")).alias("WEEK_TOTAL_PASS_REVENUE"),
    sum_(col("TOTAL_REVENUE")).alias("WEEK_TOTAL_REVENUE"),
    round_(avg(col("TOTAL_REVENUE")), 0).alias("AVG_DAILY_REVENUE"),
    sum_(col("TOTAL_TICKET_ACTIVATIONS")).alias("WEEK_TOTAL_TICKET_ACTIVATIONS"),
    sum_(col("TOTAL_PASS_ACTIVATIONS")).alias("WEEK_TOTAL_PASS_ACTIVATIONS"),
    round_(avg(col("AVG_CAPACITY_PCT")), 1).alias("AVG_WEEK_CAPACITY_PCT"), # Average of the daily average capacities
    max_(col("PEAK_CAPACITY_PCT")).alias("WEEK_PEAK_CAPACITY_PCT"), # Peak hourly capacity reached during the week
    count_distinct(col("RIDE_DATE")).alias("OPERATION_DAYS") # Count of distinct days with operations in the week
))

# Again, we can use the Snowpark dataframe to deploy our last DT
weekly_summary_df.create_or_replace_dynamic_table(
    name="WEEKLY_RESORT_SUMMARY",
    warehouse="STREAMING_INGEST",
    lag="1 minute",
    refresh_mode="INCREMENTAL"
)

In [None]:
select * from WEEKLY_RESORT_SUMMARY 
order by WEEK_START_DATE desc
limit 100;

## 7. Analytical Views for Reporting

Create views on top of base tables and/or dynamic tables for easier querying and dashboarding. Views can easily be defined using SQL or Snowpark.

In [None]:
-- ========================================
-- VIEW: V_DAILY_REVENUE_PERFORMANCE
-- Daily revenue vs targets, derived from DAILY_RESORT_SUMMARY and RESORT_CAPACITY
-- ========================================
CREATE OR REPLACE VIEW V_DAILY_REVENUE_PERFORMANCE AS
WITH daily_targets AS (
    SELECT
        RESORT,
        (MAX_CAPACITY * 0.7 * 100) as REVENUE_TARGET_USD -- Example target: 70% of max capacity value, assuming $100 per visitor
    FROM RESORT_CAPACITY
)
SELECT
    d.RIDE_DATE,
    d.RESORT,
    d.TOTAL_REVENUE,
    t.REVENUE_TARGET_USD,
    CASE
        WHEN t.REVENUE_TARGET_USD > 0 THEN ROUND((d.TOTAL_REVENUE / t.REVENUE_TARGET_USD * 100), 1)
        ELSE NULL
        END as REVENUE_TARGET_PCT,
    CASE
        WHEN d.TOTAL_REVENUE >= t.REVENUE_TARGET_USD THEN 'ABOVE_TARGET'
        WHEN d.TOTAL_REVENUE >= t.REVENUE_TARGET_USD * 0.9 THEN 'NEAR_TARGET'
        ELSE 'BELOW_TARGET'
        END as PERFORMANCE_STATUS
FROM DAILY_RESORT_SUMMARY d
         JOIN daily_targets t ON d.RESORT = t.RESORT;

In [None]:
select * from V_DAILY_REVENUE_PERFORMANCE order by RIDE_DATE DESC LIMIT 100;
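
The target arithmetic in `V_DAILY_REVENUE_PERFORMANCE` is simple enough to verify by hand. The sketch below reproduces it in plain Python with `Decimal` to avoid floating-point drift; the function names and the keyword parameters `utilization` and `revenue_per_visitor` are hypothetical, with defaults taken from the 70% and $100 example values in the view.

```python
from decimal import Decimal


def revenue_target_usd(max_capacity, utilization=Decimal("0.7"),
                       revenue_per_visitor=Decimal("100")):
    """Example daily target: share of max capacity times revenue per visitor."""
    return Decimal(max_capacity) * utilization * revenue_per_visitor


def performance_status(total_revenue, target):
    """Classify daily revenue against the target, as in the view's CASE."""
    if total_revenue >= target:
        return "ABOVE_TARGET"
    if total_revenue >= target * Decimal("0.9"):
        return "NEAR_TARGET"
    return "BELOW_TARGET"


# Vail's MAX_CAPACITY of 7000 comes from the RESORT_CAPACITY reference table.
vail_target = revenue_target_usd(7000)
near = performance_status(Decimal("441000"), vail_target)
```

For Vail this gives a daily target of $490,000, so any day at or above $441,000 counts as at least `NEAR_TARGET`.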

In [None]:
from snowflake.snowpark.functions import col, sum as sum_, avg, round as round_, count_distinct

daily_summary_df = session.table("DAILY_RESORT_SUMMARY")
# Group by RIDE_DATE and aggregate
v_daily_network_metrics_df = (daily_summary_df
    .group_by(col("RIDE_DATE"))
    .agg(
        sum_(col("TOTAL_VISITORS")).alias("TOTAL_NETWORK_VISITORS"),
        sum_(col("TOTAL_REVENUE")).alias("TOTAL_NETWORK_REVENUE"),
        round_(avg(col("AVG_CAPACITY_PCT")), 1).alias("AVG_NETWORK_CAPACITY_PCT"),
        sum_(col("TOTAL_RIDES")).alias("TOTAL_NETWORK_RIDES"),
        count_distinct(col("RESORT")).alias("ACTIVE_RESORTS")
    ))

# Create or replace the view using the Snowpark dataframe
v_daily_network_metrics_df.create_or_replace_view("V_DAILY_NETWORK_METRICS")

In [None]:
select * from V_DAILY_NETWORK_METRICS order by RIDE_DATE DESC LIMIT 100;

## 8. Analytical Stored Procedures for Reporting

In some cases, obtaining analytical results requires more flexibility than a `VIEW` provides. Here is an example of a tabular Python stored procedure that calculates lift performance for a single resort over the 30 minutes leading up to that resort's most recent ride. The procedure queries the `LIFT_RIDE` base table directly, which is common in many streaming use cases.

**NOTE:** This logic could also be written as a SQL UDTF, or embedded directly in a dashboard, but this example demonstrates how to encapsulate dynamic Python query logic for reusability.

In [None]:
# Required imports for Snowpark operations and types
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, lit, max, count, count_distinct, min, when, dateadd, datediff, round as snowpark_round, row_number, sproc
from snowflake.snowpark.window import Window
from snowflake.snowpark.types import StructType, StructField, StringType, LongType, TimestampType, DoubleType, IntegerType

# Define the schema for the output table of the stored procedure
# This must match the structure of the DataFrame being returned.
output_schema = StructType([
    StructField("RESORT", StringType(), nullable=False),
    StructField("LIFT", StringType(), nullable=False),
    StructField("RIDES", LongType(), nullable=False),
    StructField("UNIQUE_VISITORS", LongType(), nullable=False),
    StructField("FIRST_ACTIVITY_TIME", TimestampType(), nullable=True), # Can be null if no rides
    StructField("LAST_ACTIVITY_TIME", TimestampType(), nullable=True),  # Can be null if no rides
    StructField("USAGE_RANK_IN_RESORT", IntegerType(), nullable=False), # Ranks are integers
    StructField("OVERALL_USAGE_RANK", IntegerType(), nullable=False),   # Ranks are integers
    StructField("RIDES_PER_HOUR", DoubleType(), nullable=True)         # Can be null or decimal
])

# Use the @sproc decorator with a struct return type to register a tabular stored procedure
# This is similar to using a SQL UDTF, except this approach provides full access to a Snowpark session
# TODO: Also accept time range args, so this logic can be used to analyze other time periods
@sproc(
    name="get_resort_lift_performance",
    return_type=output_schema,
    input_types=[StringType()],
    packages=['snowflake-snowpark-python'],
    is_permanent=True, # Creates a permanent stored procedure
    replace=True,      # Allows replacing an existing SP with the same name
    stage_location = "@snowpark_apps" 
)
def get_resort_lift_stats_sp(snowpark_session: Session, resort_name_input: str):
    """
    Snowpark Stored Procedure to get lift ride statistics for a specific resort.

    Args:
        snowpark_session: The Snowpark session object (implicitly provided).
        resort_name_input: The name of the resort to filter by.

    Returns:
        A Snowpark DataFrame with the lift ride statistics, matching output_schema.
    """

    # Reference the LIFT_RIDE table
    lift_ride_df = snowpark_session.table("LIFT_RIDE")

    # Get last ride for resort
    resort_last_ride_df = lift_ride_df.filter(col("RESORT") == resort_name_input) \
                                      .agg(max(col("RIDE_TIME")).alias("last_ride_time"))

    # Main query logic
    # First filter lift_ride for the specific resort
    lr_filtered_df = lift_ride_df.filter(col("RESORT") == resort_name_input)

    # Cross join with last ride data
    joined_df = lr_filtered_df.join(resort_last_ride_df, how="cross")

    # Apply the time filter
    # Ensure last_ride_time is not null before attempting dateadd
    thirty_minutes_before_last_ride = dateadd("minute", lit(-30), col("last_ride_time"))
    filtered_rides_df = joined_df.filter(
        (col("last_ride_time").is_not_null()) & # Ensure last_ride_time exists
        (col("RIDE_TIME") > thirty_minutes_before_last_ride)
    )
    # If there was no last ride time, filtered_rides_df will be empty.

    # Group by and aggregate
    pre_aggregated_df = filtered_rides_df.group_by(col("RESORT"), col("LIFT")) \
                                        .agg(
                                            count(lit(1)).alias("RIDES"),
                                            count_distinct(col("RFID")).alias("UNIQUE_VISITORS"),
                                            min(col("RIDE_TIME")).alias("FIRST_ACTIVITY_TIME"),
                                            max(col("RIDE_TIME")).alias("LAST_ACTIVITY_TIME")
                                        )

    # Define window specifications for ranking based on the aggregated "RIDES"
    window_resort = Window.partition_by(col("RESORT")).order_by(col("RIDES").desc())
    window_overall = Window.order_by(col("RIDES").desc())

    # Apply window functions and calculate RIDES_PER_HOUR
    # Ensure columns from pre_aggregated_df are used here
    final_df = pre_aggregated_df.select(
        col("RESORT"),
        col("LIFT"),
        col("RIDES"),
        col("UNIQUE_VISITORS"),
        col("FIRST_ACTIVITY_TIME"),
        col("LAST_ACTIVITY_TIME"),
        row_number().over(window_resort).alias("USAGE_RANK_IN_RESORT"),
        row_number().over(window_overall).alias("OVERALL_USAGE_RANK"),
        #If activity range is <1min of data, set rides_per_hour to null to avoid divide by zero
        #Otherwise calculate rides per hour across activity range 
        when(datediff("minute", col("FIRST_ACTIVITY_TIME"), col("LAST_ACTIVITY_TIME")) == 0, lit(None).cast(DoubleType()))
        .otherwise(
            snowpark_round(
                col("RIDES") / (datediff("minute", col("FIRST_ACTIVITY_TIME"), col("LAST_ACTIVITY_TIME")) / 60.0),
                1
            )
        ).alias("RIDES_PER_HOUR")
    )

    # Ensure the DataFrame schema matches the defined output_schema, especially nullable properties and types
    # Snowpark will try to map, but explicit casting or selection order helps.
    # The select statement above should produce columns in the correct order and type.
    # If any column might be missing due to no data, default values and schema alignment is needed.

    return final_df

In [None]:
# Get top 10 lifts for Vail in the last 30 minutes
session.table_function('get_resort_lift_performance', lit('Vail'))\
                  .filter(col("USAGE_RANK_IN_RESORT") <= 10)\
                  .order_by(col("USAGE_RANK_IN_RESORT"))
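
The `RIDES_PER_HOUR` calculation depends on how `DATEDIFF('minute', ...)` counts: it counts minute boundaries crossed rather than elapsed seconds, so two rides within the same clock minute yield 0 and the procedure returns NULL. The following plain-Python sketch mimics that behavior; the helper names are hypothetical, and Python's `round` may differ from Snowflake's rounding on exact ties.

```python
from datetime import datetime


def minute_boundaries_crossed(start, end):
    """Approximate DATEDIFF('minute', start, end) by truncating to the minute."""
    start_minute = start.replace(second=0, microsecond=0)
    end_minute = end.replace(second=0, microsecond=0)
    return int((end_minute - start_minute).total_seconds() // 60)


def rides_per_hour(rides, first_activity, last_activity):
    """Return rides per hour, or None when the activity range is under a minute."""
    minutes = minute_boundaries_crossed(first_activity, last_activity)
    if minutes == 0:
        return None
    return round(rides / (minutes / 60.0), 1)


steady = rides_per_hour(45, datetime(2025, 1, 4, 10, 0, 30), datetime(2025, 1, 4, 10, 30, 10))
too_short = rides_per_hour(3, datetime(2025, 1, 4, 10, 0, 5), datetime(2025, 1, 4, 10, 0, 50))
```

In this example 45 rides across a 30-minute range extrapolate to 90 rides per hour, while the second call returns `None` because both timestamps fall in the same minute.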

## 9. Schema Verification

Show tables and views to verify the created objects.

In [None]:
-- List base tables
SHOW TABLES;
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())) WHERE "is_dynamic" = 'N';

In [None]:
-- List dynamic tables
SHOW DYNAMIC TABLES;

In [None]:
-- List views
SHOW VIEWS;

In [None]:
-- List procedures
SHOW PROCEDURES;
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())) WHERE "schema_name" = 'STREAMING_INGEST';

## 10. Dynamic Table Observability

Monitor the health, refresh history, and status of your Dynamic Tables.

In [None]:
-- Check refresh history for performance monitoring
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(NAME_PREFIX => 'STREAMING_INGEST.STREAMING_INGEST.'))
ORDER BY refresh_start_time DESC;

## 11. Conclusion and Next Steps

This notebook has established an end-to-end streaming data pipeline incorporating Snowpark for complex transformations (`DAILY_VISITS`) and a hierarchy of Dynamic Tables for efficient, incremental aggregations.

**Key Features Implemented:**
- Automated daily unique visitor tracking using a Snowpark procedure and Task.
- Multi-level aggregation pipeline (Hourly → Daily → Weekly) using Dynamic Tables.
- Analytical views for simplified reporting and dashboarding.
- Observability queries for monitoring Dynamic Table performance and health.

**Potential Next Steps:**
- Build Streamlit applications or connect BI tools to these views and Dynamic Tables for visualization.
- Extend the pipeline with more advanced analytics, such as anomaly detection or predictive modeling.
- Implement alerting based on DT status or data quality checks.