# 4 Lab - Create a Pipeline  

In this lab, you'll migrate a traditional ETL workflow to a pipeline for incremental data processing. You'll practice building streaming tables and materialized views using Lakeflow Spark Declarative Pipelines syntax.

#### Your Tasks:
- Create a new Pipeline  
- Convert traditional SQL ETL to declarative syntax for incremental processing
- Configure pipeline settings  
- Define data quality expectations  
- Validate and run the pipeline

### Learning Objectives

By the end of this lab, you will be able to:
- Create a pipeline and execute it successfully using the new Lakeflow Pipeline Editor.
- Modify and configure pipeline settings to align with specific data processing requirements.
- Integrate data quality expectations into a pipeline and evaluate their effectiveness.


## A. Environment Configuration

Define the catalog, schema, and volume variables used in this lab.


In [0]:
# Define catalog, schema, and volume paths
CATALOG_NAME = 'ldp_demo'
SCHEMA_NAME = 'ldp_schema'
VOLUME_NAME = 'raw'
WORKING_DIR = f'/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}'

print(f'Catalog: {CATALOG_NAME}')
print(f'Schema: {SCHEMA_NAME}')
print(f'Volume Path: {WORKING_DIR}')


## B. SCENARIO

Your data engineering team has identified an opportunity to modernize an existing ETL pipeline that was originally developed in a Databricks notebook. While the current pipeline gets the job done, it lacks the scalability, observability, efficiency and automated data quality features required as your data volume and complexity grow.

To address this, you've been asked to migrate the existing pipeline to a Lakeflow Spark Declarative Pipeline. Spark Declarative Pipelines will enable your team to define data transformations more declaratively, apply data quality rules, and benefit from built-in optimization, lineage tracking and monitoring.

Your goal is to refactor the original notebook-based logic (shown in the cells below) into a Spark Declarative Pipeline.

### REQUIREMENTS:
  - Migrate the ETL code below to a Spark Declarative Pipeline.
  - Add the required data quality expectations to the bronze table and silver table.
  - Create materialized views for the most up-to-date aggregated information.

Follow the steps below to complete your task.


### B1. Explore the Raw Data

1. Complete the following steps to see where this lab's raw source files are stored:

   a. Select the **Catalog** icon in the left navigation bar.  

   b. Expand your **ldp_demo** catalog.  

   c. Expand the **ldp_schema** schema.  

   d. Expand **Volumes**.  

   e. Expand the **raw** volume.  

   f. You should see folders: **customers**, **orders**, and **status**.  

   g. The files in the **raw** volume will be the data source files you will be ingesting.
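
The same check can also be done from a notebook cell. The sketch below is a minimal example, assuming the volume is reachable as a regular directory path on your compute (the helper name `list_source_folders` is hypothetical, not part of the lab). It returns an empty list if the path does not exist.

```python
from pathlib import Path


def list_source_folders(volume_path: str) -> list[str]:
    """Return the sorted names of the immediate subfolders of volume_path.

    Hypothetical helper for illustration; returns [] if the path is missing.
    """
    root = Path(volume_path)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Volume path from section A (assumes the lab's catalog, schema, and volume exist)
print(list_source_folders('/Volumes/ldp_demo/ldp_schema/raw'))
```

If the lab environment is set up as described, the printed list should contain the three folders named in step f.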


### B2. Current ETL Code

Run each cell below to view the results of the current ETL pipeline. This will give you an idea of the expected output. Don't worry too much about the data transformations within the SQL queries.

The focus of this lab is on using **declarative SQL** and creating a **Spark Declarative Pipeline**. You will not need to modify the transformation logic, only the `CREATE` statements and `FROM` clauses to ensure data is read and processed incrementally in your pipeline.


#### B2.1 - JSON to Bronze

Explore the code and run the cell. Observe the results. Notice that:

- The JSON file in the volume is read in as a table named **orders_bronze_lab4** in the **ldp_demo.ldp_schema** schema.  

Think about what you will need to change when migrating this to a Spark Declarative Pipeline. Hints are added as comments in the code below.

**NOTE:** In your Spark Declarative Pipeline, you will want to add data quality expectations to document any bad data coming into the pipeline.
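
To make the idea of an expectation concrete, here is a small plain-Python simulation of the three actions an expectation can take on records that violate it: keep them and count them (warn), drop them, or fail the update. In pipeline SQL these correspond to a `CONSTRAINT <name> EXPECT (<condition>)` clause, optionally followed by `ON VIOLATION DROP ROW` or `ON VIOLATION FAIL UPDATE`. This is only an illustration of the behavior, not the pipeline API. The function name, sample rows, and column names are hypothetical.

```python
from typing import Callable


def apply_expectation(
    records: list[dict],
    predicate: Callable[[dict], bool],
    on_violation: str = "warn",
) -> tuple[list[dict], int]:
    """Mimic an expectation on an in-memory batch.

    Returns (kept_records, violation_count).
    """
    if on_violation not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown action: {on_violation}")
    failed = [r for r in records if not predicate(r)]
    if on_violation == "fail" and failed:
        # Comparable to ON VIOLATION FAIL UPDATE: stop instead of writing bad data
        raise ValueError(f"{len(failed)} record(s) violated the expectation")
    if on_violation == "drop":
        # Comparable to ON VIOLATION DROP ROW: only valid records are kept
        kept = [r for r in records if predicate(r)]
    else:
        # Default (warn): all records are kept, violations are only counted
        kept = list(records)
    return kept, len(failed)


# Hypothetical sample rows; the column names are illustrative only
sample = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": None, "customer_id": "c2"},
]


def has_order_id(record: dict) -> bool:
    return record["order_id"] is not None


print(apply_expectation(sample, has_order_id, "warn"))
print(apply_expectation(sample, has_order_id, "drop"))
```

The violation count is what lets you document bad data without losing it: with the warn action nothing is removed, but the number of failing records is still reported.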


In [0]:
%sql
-- You will have to modify this to create a streaming table in the pipeline
CREATE OR REPLACE TABLE ldp_demo.ldp_schema.orders_bronze_lab4
AS
SELECT
  *,
  current_timestamp() AS ingestion_time,
  _metadata.file_name AS raw_file_name
FROM read_files(  -- You will have to modify the FROM clause to read in data incrementally
  '/Volumes/ldp_demo/ldp_schema/raw/orders',  -- You will have to modify this path in the pipeline to point to your specific raw data source
  format => 'json'
);

-- Display table
SELECT *
FROM ldp_demo.ldp_schema.orders_bronze_lab4;


## C. TO DO: Create the Lakeflow Spark Declarative Pipeline (Steps)

Now that you have explored the traditional ETL code that creates the tables and views, it's time to convert that syntax to declarative SQL for your new pipeline.

You will have to complete the following:

1. Create a Spark Declarative Pipeline and name it **Lab4 - firstname pipeline project**.

    - Select your **ldp_demo** catalog  
    - Select the **ldp_schema** schema  
    - Select the **Start with sample code in SQL** option  

2. Migrate the ETL code (shown above) into one or more files and folders to organize your pipeline.

3. Modify the code to create streaming tables and materialized views with data quality expectations.

4. Configure the pipeline settings:
   - Set the **source** configuration parameter to `/Volumes/ldp_demo/ldp_schema/raw`

5. Run the pipeline and validate the results.
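
To see how the **source** configuration parameter from step 4 fits together with the migrated code, the sketch below previews one possible shape of the bronze statement. It assumes the pipeline SQL refers to the parameter as `${source}`. The pipeline performs this substitution itself at run time. Python's `string.Template` is used here only to preview the result in a notebook cell, and the names `BRONZE_SQL` and `pipeline_config` are hypothetical. No expectations are included, since the lab leaves those for you to define.

```python
from string import Template

# One possible shape for the migrated bronze statement (a sketch, not the lab solution).
# CREATE OR REPLACE TABLE becomes CREATE OR REFRESH STREAMING TABLE, and
# read_files(...) is wrapped in STREAM so new files are read incrementally.
BRONZE_SQL = Template("""\
CREATE OR REFRESH STREAMING TABLE orders_bronze_lab4
AS
SELECT
  *,
  current_timestamp() AS ingestion_time,
  _metadata.file_name AS raw_file_name
FROM STREAM read_files(
  '${source}/orders',
  format => 'json'
);
""")

# Mirrors the key/value pair set in the pipeline settings (step 4)
pipeline_config = {"source": "/Volumes/ldp_demo/ldp_schema/raw"}

# Preview what the statement looks like once ${source} is resolved
rendered = BRONZE_SQL.substitute(pipeline_config)
print(rendered)
```

Keeping the volume path in a configuration parameter rather than hard-coding it means the same pipeline code can be pointed at a different raw data location by changing only the setting.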
