# 2 - Developing a Simple Pipeline

In this demonstration, we will create a simple Lakeflow Spark Declarative Pipeline project using the new **Lakeflow Pipeline Editor** with declarative SQL.


### Learning Objectives

By the end of this lesson, you will be able to:
- Describe the SQL syntax used to create a Lakeflow Spark Declarative Pipeline.
- Navigate the Lakeflow Pipeline Editor to modify pipeline settings and ingest the raw data source file(s).
- Create, execute and monitor a Spark Declarative Pipeline.


## A. Environment Configuration

Define the catalog, schema, and volume variables used throughout this course.


In [0]:
# Define catalog, schema, and volume paths
CATALOG_NAME = 'cetpa_external_catalog'
SCHEMA_NAME = 'ldp_schema'
VOLUME_NAME = 'raw'
WORKING_DIR = f'/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}'

print(f'Catalog: {CATALOG_NAME}')
print(f'Schema: {SCHEMA_NAME}')
print(f'Volume Path: {WORKING_DIR}')
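
The path built above follows the Unity Catalog volume convention `/Volumes/<catalog>/<schema>/<volume>`. As a minimal sketch, a helper such as the hypothetical `volume_path` below (not part of the course files) builds the same string and can append subdirectories such as `orders`:

```python
import posixpath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a Unity Catalog volume path. `volume_path` is a hypothetical helper."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)


# Values copied from the configuration cell above.
working_dir = volume_path("cetpa_external_catalog", "ldp_schema", "raw")
orders_dir = volume_path("cetpa_external_catalog", "ldp_schema", "raw", "orders")

print(working_dir)
print(orders_dir)
```

Keeping the path logic in one place avoids hard-coding the catalog name in later cells.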


## B. Developing and Running a Spark Declarative Pipeline with the Lakeflow Pipeline Editor

This course includes a simple, pre-configured Spark Declarative Pipeline to explore and modify.

In this section, we will:

- Explore the Lakeflow Pipeline Editor and the declarative SQL syntax  
- Modify pipeline settings  
- Run the Spark Declarative Pipeline and explore the streaming tables and materialized view.
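
For orientation before opening the editor, the declarative SQL in a pipeline like this one generally takes the shape sketched below. This is not the contents of **orders_pipeline.sql**. The statements are held in a Python string only so they can be printed from this notebook, the `order_timestamp` column and the aggregation are assumptions made for illustration, and the silver table is omitted for brevity. The table names and the volume path are the ones used in this guide.

```python
# Illustrative sketch of the declarative SQL shape, NOT the course's orders_pipeline.sql.
# Assumptions: the `order_timestamp` column and the gold aggregation are hypothetical.
ORDERS_DIR = "/Volumes/ldp_demo/ldp_schema/raw/orders"

PIPELINE_SQL_SKETCH = f"""
-- Bronze: incrementally ingest raw JSON files as they arrive
CREATE OR REFRESH STREAMING TABLE orders_bronze_demo2 AS
SELECT
  *,
  current_timestamp() AS processing_time,
  _metadata.file_name AS source_file
FROM STREAM read_files('{ORDERS_DIR}', format => 'json');

-- Gold: materialized view recomputed from the upstream table
CREATE OR REFRESH MATERIALIZED VIEW orders_by_date_gold_demo2 AS
SELECT
  date(order_timestamp) AS order_date,
  count(*) AS total_daily_orders
FROM orders_bronze_demo2
GROUP BY date(order_timestamp);
"""

print(PIPELINE_SQL_SKETCH)
```

The two statement types mirror what this lesson produces: streaming tables that append newly arrived data, and a materialized view that is recomputed from them.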


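1. Run the cell below to print your volume path, which is built from the catalog, schema, and volume names set in section A (for example, `/Volumes/ldp_demo/ldp_schema/raw`). You will need this path when modifying your pipeline settings.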

   This volume path contains the **orders**, **status** and **customer** directories, which contain the raw JSON files.


In [0]:
print(f'Working directory: {WORKING_DIR}')


2. This course provides starter files for your pipeline. This demonstration uses the folder **2 - Developing a Simple Pipeline Project**. To create a pipeline and associate it with code files that already exist in your Workspace (including Git folders), complete the following:

   a. For ease of use, open **Jobs & Pipelines** in a separate tab:

    - On the main navigation bar, right-click on **Jobs & Pipelines** and select **Open in a New Tab**.

   b. In **Jobs & Pipelines** select **Create** → **ETL Pipeline**.

   c. Complete the pipeline creation page with the following:

    - **Name**: `Name-your-pipeline-using-this-notebook-name-add-your-first-name`
    - **Default catalog**: Select **ldp_demo** catalog  
    - **Default schema**: Select **ldp_schema** schema
    - Notice there are a variety of options to start your pipeline.

   d. In the options, select **Add existing assets**. In the popup, complete the following:

    - **Pipeline root folder**: Select the **2 - Developing a Simple Pipeline Project** folder

    - **Source code paths**: Within the same root folder as above, select the **orders** folder

    **NOTE:** You can select folders containing SQL and Python files to be executed as part of the pipeline, or you can provide individual file paths. The specified files will be processed when the pipeline runs.

   e. Click **Add**. This creates the pipeline and associates the correct files for this demonstration.


3. In the new window, select the **orders_pipeline.sql** file and follow the instructions in the SQL file within the **Lakeflow Pipelines Editor**.

    Leave this notebook open as you will use it later.
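
The choices made in step 2 correspond roughly to a pipeline specification like the sketch below. Treat it as an approximation under stated assumptions: the field names (`catalog`, `schema`, `root_path`, `libraries` with a `glob` entry) reflect my understanding of the pipeline settings format and should be checked against the settings shown in your workspace, and the workspace path and pipeline name are hypothetical placeholders.

```python
import json

# Hypothetical workspace location of the course folder; replace with your own path.
ROOT = "/Workspace/Users/<your-user>/2 - Developing a Simple Pipeline Project"

# Assumed field names; verify against the settings your workspace displays.
pipeline_settings = {
    "name": "2 - Developing a Simple Pipeline - <first name>",  # placeholder name
    "catalog": "ldp_demo",      # Default catalog chosen in step 2c
    "schema": "ldp_schema",     # Default schema chosen in step 2c
    "root_path": ROOT,          # Pipeline root folder from step 2d
    "libraries": [
        # Source code path: every file under the orders folder
        {"glob": {"include": f"{ROOT}/orders/**"}}
    ],
}

print(json.dumps(pipeline_settings, indent=2))
```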


## C. Add a New File to Cloud Storage

1. After exploring and executing the pipeline by following the instructions in the **`orders_pipeline.sql`** file, run the cell below to add a new JSON file (**01.json**) to your volume at: `/Volumes/ldp_demo/ldp_schema/raw/orders`.


In [0]:
import os
import time


def copy_files(copy_from: str, copy_to: str, n: int, sleep: float = 2) -> None:
    """Copy the first `n` files (sorted by name) from `copy_from` to `copy_to`.

    Files already present in the destination are skipped but still count toward `n`.
    """
    print(f"\n----------------Loading files to volume: '{copy_to}'----------------")

    # Stop early if the source directory is missing.
    if not os.path.exists(copy_from):
        print(f'Source directory {copy_from} does not exist.')
        return

    list_of_files_to_copy = sorted(os.listdir(copy_from))
    total_files = len(list_of_files_to_copy)
    assert total_files >= n, f"Source location contains only {total_files} files, but you specified {n} files to copy."

    # Files already in the destination are not copied again.
    list_of_files_in_dest = os.listdir(copy_to) if os.path.exists(copy_to) else []

    # Only the first n files are considered; enumerate supplies the file number.
    for counter, file in enumerate(list_of_files_to_copy[:n], start=1):
        if file in list_of_files_in_dest:
            print(f'File number {counter} - {file} already exists. Skipping.')
        else:
            file_to_copy = f'{copy_from}/{file}'
            copy_file_to = f'{copy_to}/{file}'
            print(f'File number {counter} - Copying {file_to_copy} --> {copy_file_to}')
            dbutils.fs.cp(file_to_copy, copy_file_to, recurse=True)
            time.sleep(sleep)  # Pause between copies

# Copy additional orders file
try:
    copy_files(
        copy_from='/Volumes/retail/v01/retail-pipeline/orders/stream_json',
        copy_to=f'{WORKING_DIR}/orders',
        n=2
    )
except Exception as e:
    print(f'Note: Could not copy from retail. Error: {e}')
    print(f'Please manually add JSON files to: {WORKING_DIR}/orders/')


2. Complete the following steps to view the new file in your volume:

   a. Select the **Catalog** icon from the left navigation pane.  

   b. Expand your **ldp_demo.ldp_schema.raw** volume.  

   c. Expand the **orders** directory. You should see two files in your volume: **00.json** and **01.json**.


3. Run the cell below to view the data in the new **/orders/01.json** file. Notice the following:

   - The **01.json** file contains new orders.  
   - The **01.json** file has 25 rows.


In [0]:
spark.sql(f'''
  SELECT *
  FROM json.`{WORKING_DIR}/orders/01.json`
''').display()


4. Go back to the **orders_pipeline.sql** file and select **Run Pipeline** to execute your ETL pipeline again with the new file.  

   Watch the pipeline run and notice only 25 rows are added to the bronze and silver tables.

   This happens because the pipeline has already processed the first **00.json** file (174 rows), and it is now only reading the new **01.json** file (25 rows), appending the rows to the streaming tables, and recomputing the materialized view with the latest data.
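
The incremental behavior described above can be pictured with a toy model. This sketch is not how the pipeline engine is implemented: the `processed_files` set merely stands in for the state a streaming table keeps about files it has already ingested, and `run_pipeline` is a hypothetical function. The row counts are the ones stated in this lesson.

```python
# Toy model of incremental ingestion; NOT the actual pipeline implementation.
files_in_volume = {"00.json": 174, "01.json": 25}  # rows per file, from this lesson

processed_files: set[str] = set()  # stands in for the stream's record of ingested files
bronze_row_count = 0               # stands in for the append-only streaming table


def run_pipeline(available: dict[str, int]) -> int:
    """Append rows from files not seen before and return how many rows were added."""
    global bronze_row_count
    added = 0
    for name in sorted(available):
        if name in processed_files:
            continue  # already ingested in an earlier run
        added += available[name]
        processed_files.add(name)
    bronze_row_count += added
    return added


first_run = run_pipeline({"00.json": files_in_volume["00.json"]})  # only 00.json exists
second_run = run_pipeline(files_in_volume)                         # 01.json has arrived

print(first_run, second_run, bronze_row_count)  # 174 25 199
```

The second run adds only the 25 new rows because the first file is already recorded as processed, which matches what you see in the pipeline run.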


## D. Exploring Your Streaming Tables


1. View the new streaming tables and materialized view in your catalog. Complete the following:

   a. Select the catalog icon in the left navigation pane.

   b. Expand your **ldp_demo** catalog.

   c. Expand the **ldp_schema** schema. Notice that the streaming tables and materialized view are correctly placed in your schema.

      - **ldp_demo.ldp_schema.orders_bronze_demo2**

      - **ldp_demo.ldp_schema.orders_silver_demo2**

      - **ldp_demo.ldp_schema.orders_by_date_gold_demo2**


2. Run the cell below to view the data in the **ldp_demo.ldp_schema.orders_bronze_demo2** table. Before you run the cell, how many rows should this streaming table have?

   Notice the following:
      - The table contains 199 rows (**00.json** had 174 rows, and **01.json** had 25 rows).
      - In the **source_file** column you can see the exact file the rows were ingested from.
      - In the **processing_time** column you can see the exact time the rows were ingested.


In [0]:
%sql
SELECT *
FROM cetpa_external_catalog.ldp_schema.orders_bronze_demo2;
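
To confirm where the rows came from, one option is to group by the **source_file** column. The sketch below builds such a query with a hypothetical helper, `rows_per_file_query`, and runs it only when a `spark` session is available (as it is in a Databricks notebook). The table name is the one queried in the cell above.

```python
def rows_per_file_query(table: str) -> str:
    """Return SQL that counts rows per ingested file. Hypothetical helper."""
    return (
        "SELECT source_file, count(*) AS row_count\n"
        f"FROM {table}\n"
        "GROUP BY source_file\n"
        "ORDER BY source_file"
    )


query = rows_per_file_query("cetpa_external_catalog.ldp_schema.orders_bronze_demo2")
print(query)

# In a Databricks notebook `spark` is predefined; elsewhere this branch is skipped.
if "spark" in globals():
    spark.sql(query).show(truncate=False)
```

If the counts in this lesson hold, the result should contain one row per ingested file, with 174 rows for **00.json** and 25 rows for **01.json**.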


## E. Viewing Spark Declarative Pipelines with the Pipelines UI

After creating and exploring your pipeline using the **orders_pipeline.sql** file in the steps above, you can view the pipelines you created in your workspace via the **Jobs & Pipelines** UI.


1. Complete the following steps to view the pipeline you created:

   a. In the main applications navigation pane on the far left, right-click on **Jobs & Pipelines** and select **Open Link in a New Tab**.

   b. This should take you to the pipelines you have created. You should see your pipeline.

   c. Select your pipeline. Here, you can use the UI to modify the pipeline.

   d. Select the **Settings** button at the top. This will take you to the settings within the UI.

   e. Select **Schedule** to see the scheduling options, then select **Cancel**. We will learn how to schedule the pipeline later.

   f. Under your pipeline name, select the drop-down showing the date and time stamp. Here you can view the **Pipeline graph** and other metrics for each run of the pipeline.

   g. Close the pipeline UI tab you opened.


## Additional Resources

- [Lakeflow Spark Declarative Pipelines](https://docs.databricks.com/aws/en/dlt/) documentation.
