# 1 - REQUIRED - Course Setup and Creating a Pipeline

In this demo, we'll set up the course environment, explore its components, build a traditional ETL pipeline using JSON files as the data source, and then learn how to create a sample Lakeflow Spark Declarative Pipeline (SDP).

### Learning Objectives

By the end of this lesson, you will be able to:
- Efficiently navigate the Workspace to locate course catalogs, schemas, and source files.
- Create a Lakeflow Spark Declarative Pipeline using the Workspace and the Pipeline UI.


### IMPORTANT - PLEASE READ!
- **REQUIRED** - Run the **00_Setup_Environment.ipynb** notebook first to create the necessary catalog, schema, volume, and sample data files.
- All tables will be created in the **ldp_demo.ldp_schema** catalog and schema.
- All source data files are located in **/Volumes/ldp_demo/ldp_schema/raw/**.


## A. Environment Configuration

Set up the environment variables for this course.


In [0]:
# Define catalog, schema, and volume paths
CATALOG_NAME = 'cetpa_external_demo'
SCHEMA_NAME = 'ldp_schema'
VOLUME_NAME = 'raw'
WORKING_DIR = f'/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}'

print(f'Catalog: {CATALOG_NAME}')
print(f'Schema: {SCHEMA_NAME}')
print(f'Volume Path: {WORKING_DIR}')
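
The path built above follows the `/Volumes/<catalog>/<schema>/<volume>` layout. As a minimal sketch, the same layout can be wrapped in a small helper that also reaches the subfolders of the volume. The function name `volume_path` is hypothetical (it is not part of the course setup), and the example values come from the course notes, so substitute your own catalog if it differs.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a path of the form /Volumes/<catalog>/<schema>/<volume>/<parts...>."""
    for label, value in (("catalog", catalog), ("schema", schema), ("volume", volume)):
        # Reject empty names and names that would silently add extra path levels.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


# Example values taken from the course notes (assumed, adjust as needed).
orders_dir = volume_path("ldp_demo", "ldp_schema", "raw", "orders")
print(orders_dir)
```

Building the path in one place keeps later cells from repeating hard-coded strings that can drift apart.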


## B. Explore the Lab Environment

Explore the raw data source files, catalogs, and schema in the course lab environment.


1. Complete these steps to explore the catalog and schema you will use in this course:

   a. Select the **Catalog** icon in the left navigation bar.

   b. You should see the **ldp_demo** catalog. You will use this catalog throughout the course.

   c. Expand your **ldp_demo** catalog. It should contain the **ldp_schema** schema where all tables will be created.


2. Complete the following steps to view where our streaming raw source files are coming from:

   a. Select the **Catalog** icon in the left navigation bar.

   b. Expand the **ldp_demo** catalog.

   c. Expand the **ldp_schema** schema and then **Volumes**.

   d. Expand the **raw** volume. You should notice that your volume contains three folders:
   - **customers**
   - **orders**
   - **status**

   e. Expand each folder and notice that each cloud storage location contains a single JSON file to start with.


3. The volume path is `/Volumes/ldp_demo/ldp_schema/raw`. You can reference this path throughout the course using the `WORKING_DIR` variable defined above.


In [0]:
print(f'Working directory: {WORKING_DIR}')
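
The folder check described in the steps above can also be done in code. The following is a rough sketch that assumes the volume is reachable as an ordinary file system path from the notebook; the helper name `count_json_files` is hypothetical. To keep the example runnable anywhere, it builds a temporary directory that mimics the three folders instead of reading the real volume.

```python
import tempfile
from pathlib import Path


def count_json_files(volume_root: str) -> dict[str, int]:
    """Return {folder_name: number of .json files} for each subfolder of volume_root."""
    root = Path(volume_root)
    return {
        folder.name: sum(1 for f in folder.iterdir() if f.is_file() and f.suffix == ".json")
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


# Stand-in for the raw volume: three folders, each holding one JSON file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("customers", "orders", "status"):
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / "00.json").write_text("{}\n", encoding="utf-8")
    demo_counts = count_json_files(tmp)

print(demo_counts)
```

In the course environment you would pass `WORKING_DIR` instead of the temporary directory, provided the path assumption holds for your compute.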


## C. Build a Traditional ETL Pipeline


1. Query the raw JSON file(s) in your `/Volumes/ldp_demo/ldp_schema/raw/orders` volume to preview the data.

      Notice that the JSON file is displayed in tabular form by querying the folder path directly with the `json.` prefix. Take note of the following:

    a. The **orders** JSON file contains order data for a company.

    b. The single JSON file in your **orders** folder (**00.json**) contains 174 rows. Remember that number for later.


In [0]:
spark.sql(f'''
          SELECT *
          FROM json.`/Volumes/cetpa_external_catalog/ldp_schema/raw/orders/`
          ''').display()
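
If you want to double-check a row count outside of SQL, a plain-Python sketch like the one below can help. It assumes the source files are newline-delimited JSON (one record per line), which the course text does not state explicitly, and the helper name `count_json_records` is hypothetical. The demo writes its own small file rather than reading the real volume.

```python
import json
import tempfile
from pathlib import Path


def count_json_records(folder: str) -> int:
    """Count non-empty lines that parse as JSON across all *.json files in folder."""
    total = 0
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises ValueError if a record is malformed
                    total += 1
    return total


# Demo with made-up records; real data lives in the orders folder of the volume.
with tempfile.TemporaryDirectory() as tmp:
    sample = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
    Path(tmp, "00.json").write_text(
        "\n".join(json.dumps(record) for record in sample) + "\n", encoding="utf-8"
    )
    demo_total = count_json_records(tmp)

print(demo_total)
```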


2. Traditionally, you would build an ETL pipeline by reading all of the files within the cloud storage location each time the pipeline runs. As data scales, this method becomes inefficient, more expensive, and time-consuming.

   For example, you would write code like the following.

   **NOTES:**
   - The tables and views will be written to your **ldp_demo.ldp_schema** schema.
   - Knowledge of the Databricks `read_files` function is a prerequisite for this course.


In [0]:
%sql
-- JSON -> Bronze
-- Read ALL files from your working directory each time the query is executed
CREATE OR REPLACE TABLE cetpa_external_catalog.ldp_schema.orders_bronze
AS
SELECT
  *,
  current_timestamp() AS processing_time,
  _metadata.file_name AS source_file
FROM read_files(
    '/Volumes/cetpa_external_catalog/ldp_schema/raw/orders',
    format => "json");


-- Bronze -> Silver
-- Read the entire bronze table each time the query is executed
CREATE OR REPLACE TABLE cetpa_external_catalog.ldp_schema.orders_silver
AS
SELECT
  order_id,
  timestamp(order_timestamp) AS order_timestamp,
  customer_id,
  notifications
FROM cetpa_external_catalog.ldp_schema.orders_bronze;


-- Silver -> Gold
-- Aggregate the silver table each time the view is queried.
CREATE OR REPLACE VIEW cetpa_external_catalog.ldp_schema.orders_by_date_vw
AS
SELECT
  date(order_timestamp) AS order_date,
  count(*) AS total_daily_orders
FROM cetpa_external_catalog.ldp_schema.orders_silver
GROUP BY date(order_timestamp);


3. Run the code in the cells to view the **orders_bronze** and **orders_silver** tables, and the **orders_by_date_vw** view. Explore the results.


In [0]:
%sql
SELECT *
FROM cetpa_external_catalog.ldp_schema.orders_bronze
LIMIT 5;


In [0]:
%sql
SELECT *
FROM cetpa_external_catalog.ldp_schema.orders_silver
LIMIT 5;


In [0]:
%sql
SELECT *
FROM cetpa_external_catalog.ldp_schema.orders_by_date_vw
LIMIT 5;
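
To make the behavior of **orders_by_date_vw** concrete, here is a small plain-Python analogy. It is only a sketch: the rows are hypothetical stand-ins for **orders_silver**, and the function name `orders_by_date` is invented for illustration. Like a standard view, the function recomputes the counts from every row each time it is called.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows standing in for orders_silver (values are illustrative only).
silver_rows = [
    {"order_id": 1, "order_timestamp": datetime(2024, 1, 1, 9, 15)},
    {"order_id": 2, "order_timestamp": datetime(2024, 1, 1, 17, 40)},
    {"order_id": 3, "order_timestamp": datetime(2024, 1, 2, 8, 5)},
]


def orders_by_date(rows):
    """Recompute daily order counts from every row, as a standard view does per query."""
    counts = Counter(row["order_timestamp"].date() for row in rows)
    return dict(sorted(counts.items()))


daily_counts = orders_by_date(silver_rows)
for order_date, total in daily_counts.items():
    print(order_date, total)
```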


### Considerations

- As JSON files are added to the volume in cloud storage, your **bronze table** code will read **all** of the files each time it executes, rather than reading only new rows of raw data. As the data grows, this can become inefficient and costly.

- The **silver table** code will always read all the rows from the bronze table to prepare the silver table. As the data grows, this can also become inefficient and costly.

- The traditional view, **orders_by_date_vw**, executes each time it is called. As the data grows, this can become inefficient.

- To check data quality as new rows are added, additional code is needed to identify any values that do not meet the required conditions.

- Monitoring the pipeline for each run is a challenge.

- There is no simple user interface to explore, monitor, or fix issues every time the code runs.
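
The first two considerations come down to full reprocessing versus incremental processing. The toy model below contrasts the two in plain Python. It is a sketch for intuition only, not a description of how Spark Declarative Pipelines track state internally; the class `IncrementalReader` and the file name `01.json` are hypothetical.

```python
def full_refresh(available_files):
    """Traditional approach: every run reprocesses every file that exists."""
    return sorted(available_files)


class IncrementalReader:
    """Toy incremental reader that remembers which files it has already processed."""

    def __init__(self):
        self._seen = set()

    def next_batch(self, available_files):
        # Only files not seen in earlier runs are returned for processing.
        new_files = sorted(set(available_files) - self._seen)
        self._seen.update(new_files)
        return new_files


files_run_1 = ["00.json"]
files_run_2 = ["00.json", "01.json"]  # a second file arrives before the next run

reader = IncrementalReader()
first_batch = reader.next_batch(files_run_1)
second_batch = reader.next_batch(files_run_2)
full_second = full_refresh(files_run_2)

print("incremental run 1:", first_batch)
print("incremental run 2:", second_batch)
print("full refresh run 2:", full_second)
```

On the second run, the incremental reader touches only the new file while the full refresh rereads both, and that gap grows with every file added.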

### We can automatically process data incrementally, manage infrastructure, monitor, observe, optimize, and view this ETL pipeline by converting this to use **Spark Declarative Pipelines**!


## D. Get Started Creating a Lakeflow Spark Declarative Pipeline Using the New Lakeflow Pipelines Editor

In this section, we'll show you how to start creating a Spark Declarative Pipeline using the new Lakeflow Pipelines Editor. We won't run or modify the pipeline just yet!

There are a few different ways to create your pipeline. Let's explore these methods.


1. First, complete the following steps to enable the new **Lakeflow Pipelines Editor**:

   **NOTE:** The editor is still being updated, so the steps to enable it might change slightly.

   a. In the top-right corner, select your user icon.

   b. Right-click on **Settings** and select **Open in New Tab**.

   c. Select **Developer**.

   d. Scroll to the bottom, enable **Lakeflow Pipelines Editor** if it is not already enabled, and click **Enable tabs for notebooks and files**.

   e. Refresh your browser page to enable the option you turned on.


### D1. Create a Spark Declarative Pipeline Using the File Explorer
1. Complete the following steps to create a Spark Declarative Pipeline using the left navigation pane:

   a. In the left navigation bar, select the **Folder** icon to open the Workspace navigation.

   b. Navigate to the **13 lakeflow declarative pipelines** folder.

   c. (**PLEASE READ**) To complete this demonstration, it is easier to open this same notebook in another tab so you can follow along with these instructions. Right-click the notebook **01_Course_Setup_and_Creating_a_Pipeline** and select **Open in a New Tab**.

   d. In the other tab, select the three-dot (ellipsis) icon in the folder navigation bar.

   e. Select **Create** -> **ETL Pipeline**:
      - If you have not enabled the new **Lakeflow Pipelines Editor**, a pop-up might appear asking you to enable it. Select **Enable** here or complete the previous step.

      - Then use the following information:

         - **Name**: `yourfirstname-my-pipeline-project`

         - **Default catalog**: Select **ldp_demo** catalog

         - **Default schema**: Select **ldp_schema** schema

         - Select **Start with sample code in SQL**

         The project will open up in the pipeline editor.

   f. This will open your Spark Declarative Pipeline within the **Lakeflow Pipelines Editor**. By default, the project creates multiple folders and sample files for you as a starter. You can use this sample folder structure or create your own.

   g. Close the tab with the sample pipeline.


### D2. Create a Spark Declarative Pipeline Using the Pipeline UI
1. You can also create a Spark Declarative Pipeline using the far-left main navigation bar by completing the following steps:

   a. On the far-left navigation bar, right-click **Jobs and Pipelines** and select **Open Link in New Tab**.

   b. Find the blue **Create** button and select it.

   c. Select **ETL pipeline**.

   d. The same **Create pipeline** pop-up appears as before.

   e. Here select **Add existing assets**.

   f. The **Add existing assets** button enables you to select a folder with pipeline assets. This option will enable you to associate this new pipeline with code files already available in your Workspace, including Git folders.

   g. You can close the pop-up window and then close the pipeline tab. You do not need to select a folder yet.
