# Overview

This notebook guides you through the setup and execution of the GOLD layer in your data pipeline using Snowflake and Iceberg tables. The GOLD layer holds deduplicated, analytics-ready tables built from the SILVER layer and enriched with location reference data from the BRONZE layer.

You will:
- Import required Python packages and establish a Snowflake session.
- Set up user-specific variables, roles, databases, and schemas for the GOLD layer.
- Create dynamic Iceberg tables for key business entities and aggregations.
- Join and enrich data from the SILVER and BRONZE layers to produce final reporting tables.
- Create a summary view for analytics and dashboarding.

Follow the instructions and code cells to complete the GOLD pipeline and prepare your data for business intelligence and advanced analytics.

In [None]:
# Import python packages
import streamlit as st
import pandas as pd

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()


# Set User Number

Update the cell below with your unique user number for this lab. This ensures that all resources you create are isolated and do not conflict with those of other users.

For example: **usernum = '999'**

In [None]:
usernum = '<INSERT USER NUMBER>'
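
Because `usernum` is substituted as plain text into the SQL cells that follow, a quick check can catch a placeholder that was left in place. The following is a minimal sketch, assuming user numbers consist of digits only, as in the `'999'` example. The helper name `check_usernum` is hypothetical and not part of the lab.

```python
def check_usernum(value: str) -> str:
    """Return the user number unchanged if it looks valid, else raise ValueError."""
    # Assumption: lab user numbers are strings of digits, like '999'.
    if not isinstance(value, str) or not value.isdigit():
        raise ValueError(f"usernum must be a string of digits, got {value!r}")
    return value


# Example with the sample value shown above.
usernum_checked = check_usernum("999")
```

Calling `check_usernum('<INSERT USER NUMBER>')` raises a `ValueError`, which flags the unedited placeholder before any SQL cell runs.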

# Set User-Specific Variables

This section defines variables for your username, role, database, and schema. These variables will be used throughout the notebook to ensure all operations are performed in your dedicated environment.

In [None]:


-- Quote the substituted value so it is treated as text (keeps any leading zeros).
SET USERNAME = 'HOL_USER_' || '{{usernum}}';
SELECT $USERNAME;


In [None]:
SET HOLROLE = $USERNAME || '_FULL_ROLE';
SET DB_NAME = $USERNAME || '_DB';
SET SCHEMANAME = 'GOLD';

In [None]:
USE ROLE IDENTIFIER($HOLROLE);
USE DATABASE IDENTIFIER($DB_NAME);
USE SCHEMA IDENTIFIER($SCHEMANAME);
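
To make the resulting names concrete, the sketch below mirrors the string concatenation performed by the `SET` statements above in plain Python. It is an illustration only: the function name `derive_names` is hypothetical, and the actual values are held in Snowflake session variables, not in Python.

```python
def derive_names(usernum: str) -> dict:
    """Mirror the concatenation done by the SET statements in the SQL cells."""
    username = "HOL_USER_" + usernum            # SET USERNAME
    return {
        "USERNAME": username,
        "HOLROLE": username + "_FULL_ROLE",     # SET HOLROLE
        "DB_NAME": username + "_DB",            # SET DB_NAME
        "SCHEMANAME": "GOLD",                   # SET SCHEMANAME
    }


# Example with the sample user number '999'.
names = derive_names("999")
```

For user number `999`, the role is `HOL_USER_999_FULL_ROLE` and the database is `HOL_USER_999_DB`, and every statement in this notebook runs in that database's `GOLD` schema.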

# Creating the SCHEDULE Table

This step creates a dynamic Iceberg table named `SCHEDULE` in the GOLD schema. The table keeps exactly one record per schedule key from the SILVER layer, so downstream joins on `SCHEDULE_KEY` never fan out. Note that the query ranks duplicates by `ORIGIN_LAT_LON` rather than by a timestamp, so the retained row is simply the first in that ordering, not necessarily the most recent one.

In [None]:
CREATE OR REPLACE DYNAMIC ICEBERG TABLE SCHEDULE
  TARGET_LAG = '1 minute'
  WAREHOUSE = hol_user_{{usernum}}_wh
  EXTERNAL_VOLUME= 'iceberg_hol_gold_vol'
  CATALOG = 'SNOWFLAKE'
  BASE_LOCATION = 'hol_user_{{usernum}}/schedule/'
  CATALOG_SYNC = iceberg_hol_oc_int
  AS
SELECT *
FROM silver.schedule_00
QUALIFY ROW_NUMBER() OVER (PARTITION BY schedule_key ORDER BY ORIGIN_LAT_LON) = 1;

# Creating the TRAIN_ACTIVATIONS Table

This cell creates a dynamic Iceberg table named `TRAIN_ACTIVATIONS` by selecting the most recent activation event for each train from the SILVER layer. This ensures that only the latest activation record per train is available for downstream analytics.

In [None]:
CREATE OR REPLACE DYNAMIC ICEBERG TABLE TRAIN_ACTIVATIONS
  TARGET_LAG = '1 minute'
  WAREHOUSE = hol_user_{{usernum}}_wh
  EXTERNAL_VOLUME= 'iceberg_hol_gold_vol'
  CATALOG = 'SNOWFLAKE'
  BASE_LOCATION = 'hol_user_{{usernum}}/train_activations/'
  CATALOG_SYNC = iceberg_hol_oc_int
AS
select * from silver.train_activations_01
qualify ROW_NUMBER() OVER (PARTITION BY TRAIN_ID ORDER BY CREATION_TIMESTAMP DESC) = 1;
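
The `QUALIFY ROW_NUMBER() ... = 1` pattern used above keeps one row per partition key. As a rough illustration of that logic, the plain-Python sketch below keeps the row with the largest ordering value per key. The helper `latest_per_key` and the sample rows are hypothetical; on ties this sketch keeps the first row seen, whereas SQL makes no such guarantee.

```python
def latest_per_key(rows, key, order_by):
    """Keep, for each key value, the row with the largest order_by value."""
    best = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only when a strictly newer one appears.
        if k not in best or row[order_by] > best[k][order_by]:
            best[k] = row
    return list(best.values())


# Hypothetical activation rows; train A1 was activated twice.
activations = [
    {"TRAIN_ID": "A1", "CREATION_TIMESTAMP": 100, "SCHEDULE_KEY": "s1"},
    {"TRAIN_ID": "A1", "CREATION_TIMESTAMP": 250, "SCHEDULE_KEY": "s2"},
    {"TRAIN_ID": "B7", "CREATION_TIMESTAMP": 180, "SCHEDULE_KEY": "s3"},
]
latest = latest_per_key(activations, "TRAIN_ID", "CREATION_TIMESTAMP")
```

Here `latest` holds one row per train, and for train `A1` only the later activation (schedule key `s2`) survives.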

# Creating the TRAIN_MOVEMENTS Table

This step creates a dynamic Iceberg table named `TRAIN_MOVEMENTS` by joining movement events from the SILVER layer with activation, schedule, and location data. The resulting table provides a comprehensive, enriched view of each train movement, including schedule, route, and geospatial information.

In [None]:
CREATE OR REPLACE DYNAMIC ICEBERG TABLE TRAIN_MOVEMENTS
  TARGET_LAG = '1 minute'
  WAREHOUSE = hol_user_{{usernum}}_wh
  EXTERNAL_VOLUME= 'iceberg_hol_gold_vol'
  CATALOG = 'SNOWFLAKE'
  BASE_LOCATION = 'hol_user_{{usernum}}/train_movements/'
  CATALOG_SYNC = iceberg_hol_oc_int
AS
SELECT  TM.MSG_KEY AS MSG_KEY,
        TM.MSG_HEADER AS MSG_HEADER,
        TM.EVENT_TYPE AS EVENT_TYPE,
        TM.GBTT_TIMESTAMP AS GBTT_TIMESTAMP,
        TM.ORIGINAL_LOC_STANOX AS ORIGINAL_LOC_STANOX,
        TM.PLANNED_TIMESTAMP AS PLANNED_TIMESTAMP,
        TM.TIMETABLE_VARIATION AS TIMETABLE_VARIATION,
        TM.ORIGINAL_LOC_TIMESTAMP AS ORIGINAL_LOC_TIMESTAMP,
        TM.CURRENT_TRAIN_ID AS CURRENT_TRAIN_ID,
        TM.DELAY_MONITORING_POINT AS DELAY_MONITORING_POINT,
        TM.NEXT_REPORT_RUN_TIME AS NEXT_REPORT_RUN_TIME,
        TM.REPORTING_STANOX AS REPORTING_STANOX,
        TM.ACTUAL_TIMESTAMP AS ACTUAL_TIMESTAMP,
        TM.CORRECTION_IND AS CORRECTION_IND,
        TM.EVENT_SOURCE AS EVENT_SOURCE,
        TM.TRAIN_FILE_ADDRESS AS TRAIN_FILE_ADDRESS,
        TM.PLATFORM AS PLATFORM,
        TM.DIVISION_CODE AS DIVISION_CODE,
        TM.TRAIN_TERMINATED AS TRAIN_TERMINATED,
        TM.TRAIN_ID AS TRAIN_ID,
        TM.OFFROUTE_IND AS OFFROUTE_IND,
        TM.VARIATION AS VARIATION,
        TM.LATE_IND AS LATE_IND,
        TM.VARIATION_STATUS AS VARIATION_STATUS,
        TM.TRAIN_SERVICE_CODE AS TRAIN_SERVICE_CODE,
        TM.TOC_ID AS TOC_ID,
        TM.TOC AS TOC,
        TM.LOC_STANOX AS LOC_STANOX,
        L.DESCRIPTION AS MVT_DESCRIPTION,
        object_construct_keep_null('long',L.LONGITUDE, 'lat',L.LATITUDE)::OBJECT(long float, lat float) AS MVT_LAT_LON,
        TM.AUTO_EXPECTED AS AUTO_EXPECTED,
        TM.DIRECTION_IND AS DIRECTION_IND,
        TM.ROUTE AS ROUTE,
        TM.PLANNED_EVENT_TYPE AS PLANNED_EVENT_TYPE,
        TM.NEXT_REPORT_STANOX AS NEXT_REPORT_STANOX,
        TM.LINE_IND AS LINE_IND,
        TA.SCHEDULE_SOURCE AS SCHEDULE_SOURCE,
        TA.TP_ORIGIN_TIMESTAMP AS TP_ORIGIN_TIMESTAMP,
        TA.SCHEDULE_TYPE AS SCHEDULE_TYPE,
        TA.CREATION_TIMESTAMP AS CREATION_TIMESTAMP,
        TA.ORIGIN_DEP_TIMESTAMP AS ORIGIN_DEP_TIMESTAMP,
        TA.D1266_RECORD_NUMBER AS D1266_RECORD_NUMBER,
        TA.TRAIN_SERVICE_CODE AS TRAIN_SERVICE_CODE_02,
        TA.SCHED_ORIGIN_STANOX AS SCHED_ORIGIN_STANOX,
        TA.TRAIN_UID AS TRAIN_UID,
        TA.TRAIN_CALL_MODE AS TRAIN_CALL_MODE,
        TA.TP_ORIGIN_STANOX AS TP_ORIGIN_STANOX,
        TA.SCHEDULE_WTT_ID AS SCHEDULE_WTT_ID,
        TA.TRAIN_CALL_TYPE AS TRAIN_CALL_TYPE,
        TA.SCHEDULE_END_DATE AS SCHEDULE_END_DATE,
        COALESCE(TA.SCHEDULE_KEY,'no_schedule_activation_found') AS SCHEDULE_KEY,
        TA.SCHED_ORIGIN_DESC AS SCHED_ORIGIN_DESC,
        SCH.CIF_TRAIN_UID AS CIF_TRAIN_UID,
        SCH.NUM_STOPS AS SCHEDULE_NUM_STOPS,
        SCH.SCHEDULE_START_DATE AS SCHEDULE_START_DATE,
        SCH.CIF_STP_INDICATOR AS CIF_STP_INDICATOR,
        SCH.ATOC_CODE AS ATOC_CODE,
        SCH.TRAIN_STATUS AS TRAIN_STATUS,
        SCH.POWER_TYPE AS POWER_TYPE,
        SCH.SEATING_CLASSES AS SEATING_CLASSES,
        SCH.RESERVATIONS AS RESERVATIONS,
        SCH.SLEEPING_ACCOMODATION AS SLEEPING_ACCOMODATION,
        SCH.TRAIN_CATEGORY AS TRAIN_CATEGORY,
        SCH.ORIGIN_TIPLOC_CODE AS ORIGIN_TIPLOC_CODE,
        SCH.ORIGIN_DESCRIPTION AS ORIGIN_DESCRIPTION,
        SCH.ORIGIN_LAT_LON AS ORIGIN_LAT_LON,
        SCH.ORIGIN_PUBLIC_DEPARTURE_TIME AS ORIGIN_PUBLIC_DEPARTURE_TIME,
        SCH.ORIGIN_PLATFORM AS ORIGIN_PLATFORM,
        SCH.DESTINATION_TIPLOC_CODE AS DESTINATION_TIPLOC_CODE,
        SCH.DESTINATION_DESCRIPTION AS DESTINATION_DESCRIPTION,
        SCH.DESTINATION_LAT_LON AS DESTINATION_LAT_LON,
        SCH.DESTINATION_PUBLIC_ARRIVAL_TIME AS DESTINATION_PUBLIC_ARRIVAL_TIME,
        SCH.DESTINATION_PLATFORM AS DESTINATION_PLATFORM  
  FROM SILVER.TRAIN_MOVEMENTS_00 TM
      LEFT JOIN TRAIN_ACTIVATIONS TA
        ON TM.TRAIN_ID = TA.TRAIN_ID
      LEFT JOIN BRONZE.LOCATIONS_RAW L
         ON TM.loc_stanox = L.STANOX
      LEFT JOIN 
        SCHEDULE SCH 
        ON COALESCE(TA.SCHEDULE_KEY,'no_schedule_activation_found') = SCH.SCHEDULE_KEY;
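
The join logic above can be hard to read among the long column list, so here is a small Python sketch of what happens to a single movement row. It assumes one activation per train and one schedule per key, which the deduplicated `TRAIN_ACTIVATIONS` and `SCHEDULE` tables provide. The function `enrich_movement`, the reduced column set, and the sample values are hypothetical.

```python
NO_ACTIVATION = "no_schedule_activation_found"  # sentinel used by the query above


def enrich_movement(movement, activations_by_train, schedules_by_key):
    """Emulate the activation and schedule LEFT JOINs for a single movement row."""
    # LEFT JOIN TRAIN_ACTIVATIONS: an unknown train yields no activation (NULLs in SQL).
    activation = activations_by_train.get(movement["TRAIN_ID"]) or {}
    # COALESCE(TA.SCHEDULE_KEY, sentinel)
    schedule_key = activation.get("SCHEDULE_KEY")
    if schedule_key is None:
        schedule_key = NO_ACTIVATION
    # LEFT JOIN SCHEDULE: the sentinel matches no schedule, so schedule columns stay None.
    schedule = schedules_by_key.get(schedule_key) or {}
    return {
        **movement,
        "SCHEDULE_KEY": schedule_key,
        "TRAIN_UID": activation.get("TRAIN_UID"),
        "ATOC_CODE": schedule.get("ATOC_CODE"),
    }


# Hypothetical lookup tables keyed by TRAIN_ID and SCHEDULE_KEY.
activations_by_train = {"A1": {"TRAIN_UID": "U100", "SCHEDULE_KEY": "s2"}}
schedules_by_key = {"s2": {"ATOC_CODE": "GW"}}

matched = enrich_movement({"TRAIN_ID": "A1", "TOC": "TOC_A"}, activations_by_train, schedules_by_key)
unmatched = enrich_movement({"TRAIN_ID": "Z9", "TOC": "TOC_B"}, activations_by_train, schedules_by_key)
```

Because these are left joins, the movement for train `Z9` is still kept: it carries the sentinel in `SCHEDULE_KEY` and empty activation and schedule columns, which makes rows without an activation easy to count later.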

# Creating the TRAIN_CANCELLATIONS Table

This cell creates a dynamic Iceberg table named `TRAIN_CANCELLATIONS` by joining cancellation events from the SILVER layer with cancellation reason codes, activation and schedule data, and location reference data from the BRONZE layer. The resulting table provides enriched cancellation event details for analytics and reporting.

In [None]:
CREATE OR REPLACE DYNAMIC ICEBERG TABLE TRAIN_CANCELLATIONS
  TARGET_LAG = '1 minute'
  WAREHOUSE = hol_user_{{usernum}}_wh
  EXTERNAL_VOLUME= 'iceberg_hol_gold_vol'
  CATALOG = 'SNOWFLAKE'
  BASE_LOCATION = 'hol_user_{{usernum}}/train_cancellations_01/'
  CATALOG_SYNC = iceberg_hol_oc_int
AS
SELECT    TC.MSG_HEADER AS MSG_HEADER, 
          TC.TRAIN_FILE_ADDRESS AS TRAIN_FILE_ADDRESS, 
          TC.TRAIN_SERVICE_CODE AS TRAIN_SERVICE_CODE, 
          TC.ORIG_LOC_STANOX AS ORIG_LOC_STANOX, 
          TC.TOC_ID AS TOC_ID, 
          TC.TOC AS TOC, 
          TC.DEP_TIMESTAMP AS DEP_TIMESTAMP, 
          TC.DIVISION_CODE AS DIVISION_CODE, 
          TC.LOC_STANOX AS LOC_STANOX, 
          TC.CANX_TIMESTAMP AS CANX_TIMESTAMP, 
          TC.CANX_REASON_CODE AS CANX_REASON_CODE, 
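          -- NOTE: C.CODE equals TC.CANX_REASON_CODE through the join below; substitute the
          -- reference table's descriptive column if a human-readable reason is wanted.
          C.CODE                                                 AS CANX_REASON, 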
          L.DESCRIPTION                                             AS CANCELLATION_LOCATION, 
          object_construct_keep_null('long',L.LONGITUDE,'lat',L.LATITUDE)::OBJECT(long FLOAT, lat FLOAT)                    AS CANCELLATION_LAT_LON, 
          TC.TRAIN_ID AS TRAIN_ID, 
          TC.ORIG_LOC_TIMESTAMP AS ORIG_LOC_TIMESTAMP, 
          TC.CANX_TYPE AS CANX_TYPE,
          TA.SCHEDULE_SOURCE                                            AS SCHEDULE_SOURCE, 
          TA.TP_ORIGIN_TIMESTAMP                                        AS TP_ORIGIN_TIMESTAMP, 
          TA.SCHEDULE_TYPE                                              AS SCHEDULE_TYPE, 
          TA.CREATION_TIMESTAMP                                         AS CREATION_TIMESTAMP, 
          TA.ORIGIN_DEP_TIMESTAMP                                       AS ORIGIN_DEP_TIMESTAMP, 
          TA.D1266_RECORD_NUMBER                                        AS D1266_RECORD_NUMBER, 
          TA.TRAIN_SERVICE_CODE                                         AS TRAIN_SERVICE_CODE_02, 
          TA.SCHED_ORIGIN_STANOX                                        AS SCHED_ORIGIN_STANOX, 
          TA.TRAIN_UID                                                  AS TRAIN_UID, 
          TA.TRAIN_CALL_MODE                                            AS TRAIN_CALL_MODE, 
          TA.TP_ORIGIN_STANOX                                           AS TP_ORIGIN_STANOX, 
          TA.SCHEDULE_WTT_ID                                            AS SCHEDULE_WTT_ID, 
          TA.TRAIN_CALL_TYPE                                            AS TRAIN_CALL_TYPE, 
          TA.SCHEDULE_END_DATE                                          AS SCHEDULE_END_DATE, 
          COALESCE(TA.SCHEDULE_KEY,'no_schedule_found')                 AS SCHEDULE_KEY, 
          TA.SCHED_ORIGIN_DESC                                          AS SCHED_ORIGIN_DESC,
          SCH.CIF_TRAIN_UID                                             AS CIF_TRAIN_UID, 
          SCH.SCHEDULE_START_DATE                                       AS SCHEDULE_START_DATE, 
          SCH.CIF_STP_INDICATOR                                         AS CIF_STP_INDICATOR, 
          SCH.ATOC_CODE                                                 AS ATOC_CODE, 
          SCH.TRAIN_STATUS                                              AS TRAIN_STATUS, 
          SCH.POWER_TYPE                                                AS POWER_TYPE, 
          SCH.SEATING_CLASSES                                           AS SEATING_CLASSES, 
          SCH.RESERVATIONS                                              AS RESERVATIONS, 
          SCH.SLEEPING_ACCOMODATION                                     AS SLEEPING_ACCOMODATION, 
          SCH.TRAIN_CATEGORY                                            AS TRAIN_CATEGORY, 
          SCH.ORIGIN_TIPLOC_CODE                                        AS ORIGIN_TIPLOC_CODE, 
          SCH.ORIGIN_DESCRIPTION                                    AS ORIGIN_DESCRIPTION,
          SCH.ORIGIN_LAT_LON                                    AS ORIGIN_LAT_LON,
          SCH.ORIGIN_PUBLIC_DEPARTURE_TIME                              AS ORIGIN_PUBLIC_DEPARTURE_TIME,
          SCH.ORIGIN_PLATFORM                                           AS ORIGIN_PLATFORM, 
          SCH.DESTINATION_TIPLOC_CODE                                   AS DESTINATION_TIPLOC_CODE,
          SCH.DESTINATION_DESCRIPTION                               AS DESTINATION_DESCRIPTION,
          SCH.DESTINATION_LAT_LON                               AS DESTINATION_LAT_LON,
          SCH.DESTINATION_PUBLIC_ARRIVAL_TIME                           AS DESTINATION_PUBLIC_ARRIVAL_TIME,
          SCH.DESTINATION_PLATFORM                                      AS DESTINATION_PLATFORM
    FROM SILVER.TRAIN_CANCELLATIONS_00 TC
          LEFT JOIN BRONZE.LOCATIONS_RAW L 
                    ON        TC.LOC_STANOX = L.STANOX 
          LEFT JOIN SILVER.CANCEL_CODE_REFERENCES_00 C 
                    ON        TC.CANX_REASON_CODE = C.CODE
          LEFT JOIN TRAIN_ACTIVATIONS TA 
                    ON        TC.TRAIN_ID = TA.TRAIN_ID
          LEFT JOIN SCHEDULE SCH 
                    ON        COALESCE(TA.SCHEDULE_KEY,'no_schedule_found') = SCH.SCHEDULE_KEY;

# Creating the TOC_MVT_STATS View

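This step creates a summary view named `TOC_MVT_STATS` that aggregates movement statistics by Train Operating Company (TOC). For each TOC, the view reports the number of movements with and without a matching activation, the total movement count, the number of unique trains, and the first and last movement timestamps in UK local time, supporting dashboarding and business intelligence use cases.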

In [None]:
CREATE VIEW IF NOT EXISTS TOC_MVT_STATS AS
SELECT TOC,
  sum( case when SCHEDULE_KEY = 'no_schedule_activation_found' then 1 else 0 end ) as no_activation_found,
  sum( case when SCHEDULE_KEY = 'no_schedule_activation_found' then 0 else 1 end ) as activation_found,
  COUNT(*) as MOVEMENT_CT,
  COUNT(DISTINCT(train_id)) as unique_trains,
  -- Snowflake format elements: HH24 = 24-hour clock, MI = minutes (MM is the month).
  TO_CHAR(
    CONVERT_TIMEZONE('UTC', 'Europe/London', TO_TIMESTAMP(min(ACTUAL_TIMESTAMP)/1000)),
    'YYYY-MM-DD HH24:MI:SS'
  ) AS first_ts,
  TO_CHAR(
    CONVERT_TIMEZONE('UTC', 'Europe/London', TO_TIMESTAMP(max(ACTUAL_TIMESTAMP)/1000)),
    'YYYY-MM-DD HH24:MI:SS'
  ) AS last_ts
FROM TRAIN_MOVEMENTS
GROUP BY TOC;
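
As a cross-check of what the view computes, the sketch below reproduces its aggregates in plain Python. It assumes `ACTUAL_TIMESTAMP` holds epoch milliseconds, which the division by 1000 in the view implies. The function names and sample rows are hypothetical, and the timestamps are formatted in UTC only, whereas the view additionally shifts them to `Europe/London`.

```python
from datetime import datetime, timezone

NO_ACTIVATION = "no_schedule_activation_found"  # sentinel written by TRAIN_MOVEMENTS


def format_ms(epoch_ms):
    """Format epoch milliseconds as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def toc_stats(movements):
    """Group movement rows by TOC and compute the view's aggregates."""
    groups = {}
    for row in movements:
        groups.setdefault(row["TOC"], []).append(row)
    result = {}
    for toc, rows in groups.items():
        missing = sum(1 for r in rows if r["SCHEDULE_KEY"] == NO_ACTIVATION)
        stamps = [r["ACTUAL_TIMESTAMP"] for r in rows]
        result[toc] = {
            "no_activation_found": missing,
            "activation_found": len(rows) - missing,
            "MOVEMENT_CT": len(rows),
            "unique_trains": len({r["TRAIN_ID"] for r in rows}),
            "first_ts": format_ms(min(stamps)),
            "last_ts": format_ms(max(stamps)),
        }
    return result


# Hypothetical movement rows for a single operator.
movements = [
    {"TOC": "TOC_A", "TRAIN_ID": "A1", "SCHEDULE_KEY": "s2", "ACTUAL_TIMESTAMP": 1700000000000},
    {"TOC": "TOC_A", "TRAIN_ID": "A1", "SCHEDULE_KEY": "s2", "ACTUAL_TIMESTAMP": 1700000060000},
    {"TOC": "TOC_A", "TRAIN_ID": "B7", "SCHEDULE_KEY": NO_ACTIVATION, "ACTUAL_TIMESTAMP": 1700000030000},
]
stats = toc_stats(movements)
```

In this sample, `TOC_A` has three movements from two unique trains, one of which has no matching activation.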