# Data Engineering Pipelines with Snowpark Python - Part 2!


Welcome to Part 2 of the Hands-on Lab! If you have not completed Part 1, please do so first, as it sets up the database objects required for this part of the lab.

As a reminder, run the next cell to see a visual overview of what we're building:

In [None]:
import streamlit as st
st.image("https://raw.githubusercontent.com/Snowflake-Labs/sfguide-data-engineering-with-snowpark-python/main/images/demo_overview.png",width=800)

# Orchestrate Jobs

During this step we will be orchestrating our new Snowpark pipelines with Snowflake's native orchestration feature named Tasks. We will create two tasks, one for each stored procedure, and chain them together. We will then run the tasks. To put this in context, we are on step **#8** in our data flow overview.


### Creating and Running the Tasks
We do not define a schedule for our task DAG in this step, so it will not run on its own at this point. Instead, you will notice that the script below executes the DAG manually, using the `EXECUTE TASK` command.

In [None]:
-- ----------------------------------------------------------------------------
-- Step #1: Create the tasks to call our Python stored procedures
-- ----------------------------------------------------------------------------

USE NB_HOL_DB.HARMONIZED;
USE WAREHOUSE HOL_WH;

CREATE OR REPLACE TASK ORDERS_UPDATE_TASK
WAREHOUSE = HOL_WH
WHEN
  SYSTEM$STREAM_HAS_DATA('POS_FLATTENED_V_STREAM')
AS
CALL HARMONIZED.ORDERS_UPDATE_SP();

CREATE OR REPLACE TASK DAILY_CITY_METRICS_UPDATE_TASK
WAREHOUSE = HOL_WH
AFTER ORDERS_UPDATE_TASK
WHEN
  SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
CALL HARMONIZED.DAILY_CITY_METRICS_UPDATE_SP();

-- ----------------------------------------------------------------------------
-- Step #2: Execute the tasks
-- ----------------------------------------------------------------------------

ALTER TASK DAILY_CITY_METRICS_UPDATE_TASK RESUME;

EXECUTE TASK ORDERS_UPDATE_TASK;


To see what happened when you ran this task just now, run the following query:


In [None]:
-- Task execution history in the past day
SELECT SCHEDULED_TIME,NAME,STATE,DATABASE_NAME,SCHEMA_NAME,QUERY_TEXT,CONDITION_TEXT,QUERY_START_TIME,COMPLETED_TIME,QUERY_ID
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
    SCHEDULED_TIME_RANGE_START=>DATEADD('DAY',-1,CURRENT_TIMESTAMP()),
    SCHEDULED_TIME_RANGE_END=>DATEADD('MINUTE',1,CURRENT_TIMESTAMP()),
    RESULT_LIMIT => 100))
ORDER BY SCHEDULED_TIME DESC
;


You will notice in the task history output that our task `ORDERS_UPDATE_TASK` was skipped. This is correct, because our `HARMONIZED.POS_FLATTENED_V_STREAM` stream doesn't have any data. We'll add some new data and run the tasks again in the next step.

### More on Tasks
Tasks are Snowflake's native scheduling/orchestration feature. With a task you can execute any one of the following types of SQL code:

* Single SQL statement
* Call to a stored procedure
* Procedural logic using Snowflake Scripting

For this Quickstart we'll `CALL` our Snowpark stored procedures. 

A few things to point out. First, you specify which Snowflake virtual warehouse to use when running the task with the `WAREHOUSE` clause. The `AFTER` clause lets you define the relationship between tasks, and the structure of this relationship is a Directed Acyclic Graph (or DAG), like most orchestration tools provide. The `AS` clause lets you define what the task should do when it runs, in this case to call our stored procedure.
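
To make the `AFTER` relationship concrete, here is a small pure-Python sketch (not Snowflake code) that models the two tasks above as a graph and derives a valid run order with the standard-library `graphlib` module. The `predecessors` dictionary is only an illustrative stand-in for what the `AFTER` clauses declare; Snowflake resolves this ordering for you.

```python
from graphlib import TopologicalSorter

# Illustrative model only: each task maps to the set of tasks named in its AFTER clause.
predecessors = {
    "ORDERS_UPDATE_TASK": set(),  # root task, no AFTER clause
    "DAILY_CITY_METRICS_UPDATE_TASK": {"ORDERS_UPDATE_TASK"},
}

# static_order() yields each task only after all of its predecessors,
# and raises graphlib.CycleError if the graph is not acyclic.
run_order = list(TopologicalSorter(predecessors).static_order())
print(run_order)
```

Because a cycle would make such an ordering impossible, the "acyclic" part of DAG is what guarantees that every task has a well-defined place in the run.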

The `WHEN` clause is really cool. We've already seen how streams work in Snowflake by allowing you to incrementally process data. We've even seen how you can create a view (which joins many tables together) and create a stream on that view to process its data incrementally! Here in the `WHEN` clause we're calling a system function, `SYSTEM$STREAM_HAS_DATA()`, which returns true if the specified stream has new data. With the `WHEN` clause in place, the virtual warehouse will only be started up when the stream has new data. So if there's no new data when the task runs, your warehouse won't be started up and you won't be charged. You will only be charged when there's new data to process. Pretty cool, huh?

As mentioned above we did not define a `SCHEDULE` for the root task, so this DAG will not run on its own. That's fine for this Quickstart, but in a real situation you would define a schedule. See [CREATE TASK](https://docs.snowflake.com/en/sql-reference/sql/create-task.html) for the details.
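
As a rough sketch of what adding a schedule could look like, the Python helper below renders two SQL statements: one that sets a `SCHEDULE` on the existing root task with `ALTER TASK`, and one that resumes it. Using `ALTER TASK` instead of recreating the task is a choice made here to leave the existing DAG untouched. The helper name `render_schedule_sql` and the cron expression are assumptions for illustration only; check the task documentation before running the generated SQL in your own account.

```python
def render_schedule_sql(task_name: str, schedule: str) -> list:
    """Return SQL statements that schedule and then resume a root task.

    Hypothetical helper for illustration; `schedule` is passed through as-is,
    for example '60 MINUTE' or 'USING CRON 0 2 * * * UTC'.
    """
    return [
        f"ALTER TASK {task_name} SET SCHEDULE = '{schedule}';",
        f"ALTER TASK {task_name} RESUME;",
    ]


# Example: run the root task every day at 02:00 UTC (an assumed schedule).
statements = render_schedule_sql("ORDERS_UPDATE_TASK", "USING CRON 0 2 * * * UTC")
print("\n".join(statements))
```

Tasks are created in a suspended state, so a scheduled root task only starts running on its own after it has been resumed. The root task should also be suspended while its schedule is changed, which it still is at this point in the lab because we only ran it with `EXECUTE TASK`.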

And for more details on Tasks see [Introduction to Tasks](https://docs.snowflake.com/en/user-guide/tasks-intro.html).

### Task Metadata
Snowflake keeps metadata for almost everything you do, and makes that metadata available for you to query (and to create any type of process around). Tasks are no different: Snowflake maintains rich metadata to help you monitor your task runs. Here is a sample SQL command you can use to list your tasks:

In [None]:
-- Get a list of tasks
SHOW TASKS;

### Monitoring Tasks
While you're free to create any operational or monitoring process you wish, Snowflake also provides some rich task observability features in the Snowsight UI. Try it out for yourself by following these steps (in a new browser tab):

1. In the Snowsight navigation menu, click **Data** » **Databases**.
1. In the right pane, using the object explorer, navigate to a database and schema.
1. For the selected schema, select and expand **Tasks**.
1. Select a task. Task information is displayed, including **Task Details**, **Graph**, and **Run History** sub-tabs.
1. Select the **Graph** tab. The task graph appears, displaying a hierarchy of child tasks.
1. Select a task to view its details.


# Process Incrementally

During this step we will be adding new data to our POS order tables and then running our entire end-to-end pipeline to process the new data. 

This entire pipeline will be processing data incrementally thanks to Snowflake's advanced stream/CDC capabilities. To put this in context, we are on step **#9** in our data flow overview.

First, we'll dynamically scale up our warehouse to `XLARGE` and run `LS` on the stage directory to preview the files that will be loaded.

In [None]:
-- ----------------------------------------------------------------------------
-- Add new/remaining order data
-- ----------------------------------------------------------------------------

USE SCHEMA NB_HOL_DB.RAW_POS;

ALTER WAREHOUSE HOL_WH SET WAREHOUSE_SIZE = XLARGE WAIT_FOR_COMPLETION = TRUE;

LS @external.frostbyte_raw_stage/pos/order_header/year=2022;

Everything looks good, so let's go ahead and run the `COPY INTO` commands to load the `ORDER_HEADER` and `ORDER_DETAIL` records from 2022.

In [None]:
COPY INTO ORDER_HEADER FROM @external.frostbyte_raw_stage/pos/order_header/year=2022
FILE_FORMAT = (FORMAT_NAME = EXTERNAL.PARQUET_FORMAT)
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;

In [None]:
COPY INTO ORDER_DETAIL FROM @external.frostbyte_raw_stage/pos/order_detail/year=2022
FILE_FORMAT = (FORMAT_NAME = EXTERNAL.PARQUET_FORMAT)
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;

Before we kick off our pipeline, take a look at the stream we created on top of the POS data. It should now contain all the new records we just copied into the ORDER_DETAIL table.

In [None]:
SELECT * FROM NB_HOL_DB.HARMONIZED.POS_FLATTENED_V_STREAM limit 100;

In [None]:
-- Alias the count column (not the stream) so the result is labeled clearly
SELECT COUNT(1) AS NEW_ROWS_TO_PROCESS FROM NB_HOL_DB.HARMONIZED.POS_FLATTENED_V_STREAM;

☝️ Remember: These queries don't advance the stream's offset (i.e. the high water mark) because they are simply SELECT statements.

Only a DML statement that consumes the stream, for example by inserting or merging its records into a downstream table, advances the offset and thereby clears the stream.
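
The difference between reading and consuming a stream can be illustrated with a toy Python model. This is a deliberately simplified sketch, not how Snowflake implements streams: the `ToyStream` class and its method names are invented for illustration, with a list standing in for the source table's change log and an integer standing in for the stream offset.

```python
class ToyStream:
    """Simplified stand-in for a stream: a change log plus an offset."""

    def __init__(self):
        self.changes = []  # every change ever recorded on the source
        self.offset = 0    # position of the first unconsumed change

    def record(self, row):
        """Simulate new data arriving in the source table."""
        self.changes.append(row)

    def select(self):
        """Like SELECT: return pending changes without moving the offset."""
        return self.changes[self.offset:]

    def consume_into(self, target):
        """Like a committed DML statement: copy pending changes, then advance."""
        pending = self.select()
        target.extend(pending)
        self.offset = len(self.changes)
        return len(pending)


stream, orders = ToyStream(), []
stream.record({"order_id": 1})
stream.record({"order_id": 2})

print(len(stream.select()))         # querying the stream does not change it
print(len(stream.select()))         # so a second query sees the same rows
print(stream.consume_into(orders))  # consuming moves the offset forward
print(len(stream.select()))         # pending changes after consumption
```

In the lab, the `MERGE` inside our stored procedures plays the role of `consume_into`, which is why the stream is empty again once the pipeline has run.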

Alright, now let's kick off our pipeline:

In [None]:
-- First, let's grab a count of the number of records in our orders table and city metrics table before the update
SELECT * FROM
    (SELECT COUNT(1) AS ORDER_COUNT_BEFORE FROM NB_HOL_DB.HARMONIZED.ORDERS)a,
    (SELECT COUNT(1) AS DAILY_METRICS_COUNT_BEFORE FROM NB_HOL_DB.ANALYTICS.DAILY_CITY_METRICS)b
;

In [None]:
-- Kick off the pipeline
EXECUTE TASK NB_HOL_DB.HARMONIZED.ORDERS_UPDATE_TASK;

ALTER WAREHOUSE HOL_WH SET WAREHOUSE_SIZE = XSMALL;

### Viewing the Task History
As in the previous step, to see what happened when you ran this task DAG, run this query:

In [None]:
SELECT SCHEDULED_TIME,NAME,STATE,DATABASE_NAME,SCHEMA_NAME,QUERY_TEXT,CONDITION_TEXT,QUERY_START_TIME,COMPLETED_TIME,QUERY_ID
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
    SCHEDULED_TIME_RANGE_START=>DATEADD('DAY',-1,CURRENT_TIMESTAMP()),
    SCHEDULED_TIME_RANGE_END=>DATEADD('MINUTE',1,CURRENT_TIMESTAMP()),
    RESULT_LIMIT => 100))
ORDER BY SCHEDULED_TIME DESC
;

This time you will notice that the `ORDERS_UPDATE_TASK` task will not be skipped, since the `HARMONIZED.POS_FLATTENED_V_STREAM` stream has new data. In a few minutes you should see that both the `ORDERS_UPDATE_TASK` task and the `DAILY_CITY_METRICS_UPDATE_TASK` task completed successfully.
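
If you would rather check the outcome from a Python cell, the sketch below reduces task history rows to the latest state per task. It is written against plain dictionaries so that it runs anywhere; the function name `latest_task_states` and the sample rows are hypothetical and not output from this lab. In a notebook you could build such rows from the query above, for example with a Snowpark session's `sql(...).collect()`, which is an assumption about your setup.

```python
from datetime import datetime


def latest_task_states(rows):
    """Map each task NAME to the STATE of its most recent SCHEDULED_TIME."""
    latest = {}
    for row in rows:
        seen = latest.get(row["NAME"])
        if seen is None or row["SCHEDULED_TIME"] > seen["SCHEDULED_TIME"]:
            latest[row["NAME"]] = row
    return {name: row["STATE"] for name, row in latest.items()}


# Made-up sample rows shaped like the TASK_HISTORY columns selected above.
sample_rows = [
    {"NAME": "ORDERS_UPDATE_TASK", "STATE": "SKIPPED",
     "SCHEDULED_TIME": datetime(2024, 1, 1, 9, 0)},
    {"NAME": "ORDERS_UPDATE_TASK", "STATE": "SUCCEEDED",
     "SCHEDULED_TIME": datetime(2024, 1, 1, 10, 0)},
    {"NAME": "DAILY_CITY_METRICS_UPDATE_TASK", "STATE": "SUCCEEDED",
     "SCHEDULED_TIME": datetime(2024, 1, 1, 10, 5)},
]

states = latest_task_states(sample_rows)
print(states)
```

Keeping only the most recent row per task hides the earlier skipped run from the previous step, which makes it easier to tell at a glance whether the current run has finished.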

### Query History for Tasks
One important thing to understand about tasks is that the queries they execute won't show up with the default Query History UI settings. To see the queries that just ran, do the following:
* Remove the filters at the top of the Query History table, including your username, as later scheduled tasks will run as "System"
* Click "Filter", add the filter option "Queries executed by user tasks", and click "Apply Filters"

You should now see all the queries run by your tasks! Take a look at each of the MERGE commands in the Query History to see how many records were processed by each task. And don't forget to notice that we processed the whole pipeline just now, and did so incrementally!

## Teardown
Once you're finished with the Quickstart and want to clean things up, you can simply run the following commands.


In [None]:
DROP DATABASE NB_HOL_DB;
DROP WAREHOUSE HOL_WH;

### What we've covered
We've covered a ton in this Quickstart, and here are the highlights:

* Snowflake's Table Format
* Data ingestion with COPY
* Schema inference
* Data sharing/marketplace (instead of ETL)
* Streams for incremental processing (CDC)
* Streams on views
* Python UDFs (with third-party packages)
* Python Stored Procedures
* Snowpark DataFrame API
* Snowpark Python programmability
* Warehouse elasticity (dynamic scaling)
* Visual Studio Code Snowflake native extension (PuPr, Git integration)
* SnowCLI (PuPr)
* Tasks (with Stream triggers)
* Task Observability
* GitHub Actions (CI/CD) integration

### Related Resources
And finally, here's a quick recap of related resources:

* [Full Demo on Snowflake Demo Hub](https://developers.snowflake.com/demos/data-engineering-pipelines/)
* [Source Code on GitHub](https://github.com/Snowflake-Labs/sfguide-data-engineering-with-snowpark-python)
* [Snowpark Developer Guide for Python](https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html)
    * [Writing Python UDFs](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python.html)
    * [Writing Stored Procedures in Snowpark (Python)](https://docs.snowflake.com/en/sql-reference/stored-procedures-python.html)
    * [Working with DataFrames in Snowpark Python](https://docs.snowflake.com/en/developer-guide/snowpark/python/working-with-dataframes.html)
* Related Tools
    * [Snowflake Visual Studio Code Extension](https://marketplace.visualstudio.com/items?itemName=snowflake.snowflake-vsc)
    * [SnowCLI Tool](https://github.com/Snowflake-Labs/snowcli)