
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# 1 - REQUIRED - Course Setup and Creating a Pipeline

In this demo, we'll set up the course environment, explore its components, build a traditional ETL pipeline using JSON files as the data source, and then learn how to create a sample Lakeflow Spark Declarative Pipeline (SDP).

### Learning Objectives

By the end of this lesson, you will be able to:
- Efficiently navigate the Workspace to locate course catalogs, schemas, and source files.
- Create a Lakeflow Spark Declarative Pipeline using the Workspace and the Pipeline UI.


### IMPORTANT - PLEASE READ!
- **REQUIRED** - This notebook is required for all users to run. If you do not run this notebook, you will be missing the necessary files and schemas required for the rest of the course.

## REQUIRED - SELECT CLASSIC COMPUTE (your cluster starts with **labuser**)

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course.

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically create and reference the information needed to run the course.

In [0]:
%run ./Includes/Classroom-Setup-1-setup-REQUIRED

## B. Explore the Lab Environment

Explore the raw data source files, catalogs, and schema in the course lab environment.

1. Complete these steps to explore your user catalog and schemas you will be using in this course:

   a. Select the **Catalog** icon ![Catalog Icon](./Includes/images/catalog_icon.png) in the left navigation bar.

   b. You should see your unique catalog, named something like **labuser1234_56789**. You will use this catalog throughout the course.

   c. Expand your **labuser** catalog. It should contain the following schemas:
     - **1_bronze_db**
     - **2_silver_db**
     - **3_gold_db**
     - **default**

2. Complete the following steps to view where our streaming raw source files are coming from:

   a. Select the **Catalog** icon ![Catalog Icon](./Includes/images/catalog_icon.png) in the left navigation bar.

   b. Expand the **dbacademy** catalog.

   c. Expand the **ops** schema and then **Volumes**.

   d. Expand your **labuser@vocareum** volume. You should notice that your volume contains three folders:
   - **customers**
   - **orders**
   - **status**

   e. Expand each folder and notice that each cloud storage location contains a single JSON file to start with.

3. To easily reference this volume path (`/Volumes/dbacademy/ops/your-labuser-name`) throughout the course, you can use either of the following:
   - The Python `DA.paths.working_dir` variable
   - The SQL `DA.paths_working_dir` variable

   Run the cells below and confirm that the path points to your volume.

   **Example:** `/Volumes/dbacademy/ops/labuser1234_5678@vocareum`

In [0]:
## With Python
print(DA.paths.working_dir)

In [0]:
%sql
-- With SQL
values(DA.paths_working_dir)
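
As a rough illustration of how the working directory relates to the three source folders described above, here is a minimal plain-Python sketch. The `working_dir` value is a hypothetical stand-in copied from the example path, because the real `DA.paths.working_dir` is only available inside the course environment.

```python
import posixpath

# Hypothetical stand-in for DA.paths.working_dir (only defined inside the course).
working_dir = "/Volumes/dbacademy/ops/labuser1234_5678@vocareum"

# The three source folders described in the volume above.
source_folders = ["customers", "orders", "status"]

# Build one path per folder; posixpath keeps forward slashes on any OS.
source_paths = {name: posixpath.join(working_dir, name) for name in source_folders}

for name, path in source_paths.items():
    print(f"{name}: {path}")
```

In the notebook itself you would substitute `DA.paths.working_dir` for the hard-coded string.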

## C. Build a Traditional ETL Pipeline

1. Query the raw JSON file(s) in the `/Volumes/dbacademy/ops/your-labuser-name/orders` folder of your volume to preview the data.

      Notice that the JSON data is displayed in tabular form. The cell below queries the folder directly using the `json.` path prefix. Take note of the following:

    a. The **orders** JSON file contains order data for a company.

    b. The one JSON file in your **orders** folder (**00.json**) contains 174 rows. Remember that number for later.

In [0]:
spark.sql(f'''
          SELECT * 
          FROM json.`{DA.paths.working_dir}/orders`
          ''').display()
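
If you want to keep track of row counts per file as more files arrive, a small helper can do the counting. The sketch below is plain Python, not Spark, and it assumes newline-delimited JSON (one object per line), which may not match the course files exactly. The sample rows and temporary folder are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def count_json_rows(folder: Path) -> dict:
    """Return {file name: number of JSON rows} for every *.json file in folder."""
    counts = {}
    for file in sorted(folder.glob("*.json")):
        with file.open(encoding="utf-8") as handle:
            # Assumption: one JSON object per non-empty line.
            rows = [json.loads(line) for line in handle if line.strip()]
        counts[file.name] = len(rows)
    return counts


# Demonstration with made-up data in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    orders_dir = Path(tmp) / "orders"
    orders_dir.mkdir()
    sample_rows = [{"order_id": i, "customer_id": 100 + i} for i in range(3)]
    (orders_dir / "00.json").write_text(
        "\n".join(json.dumps(row) for row in sample_rows), encoding="utf-8"
    )
    row_counts = count_json_rows(orders_dir)

print(row_counts)
print("total rows:", sum(row_counts.values()))
```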

2. Traditionally, you would build an ETL pipeline by reading all of the files within the cloud storage location each time the pipeline runs. As data scales, this method becomes inefficient, more expensive, and time-consuming.

   For example, you would write code like the following.

   **NOTES:**
   - The tables and views will be written to your **labuser.default** schema (database).
   - Knowledge of the Databricks `read_files` function is a prerequisite for this course.

In [0]:
%sql
-- JSON -> Bronze
-- Read ALL files from your working directory each time the query is executed
CREATE OR REPLACE TABLE default.orders_bronze
AS 
SELECT 
  *,
  current_timestamp() AS processing_time,
  _metadata.file_name AS source_file
FROM read_files(
    DA.paths_working_dir || '/orders',
    format => 'json');


-- Bronze -> Silver
-- Read the entire bronze table each time the query is executed
CREATE OR REPLACE TABLE default.orders_silver
AS 
SELECT 
  order_id,
  timestamp(order_timestamp) AS order_timestamp, 
  customer_id,
  notifications
FROM default.orders_bronze;   


-- Silver -> Gold
-- Aggregate the silver table each time the view is queried.
CREATE OR REPLACE VIEW default.orders_by_date_vw     
AS 
SELECT 
  date(order_timestamp) AS order_date, 
  count(*) AS total_daily_orders
FROM default.orders_silver                               
GROUP BY date(order_timestamp);

3. Run the code in the cells to view the **orders_bronze** and **orders_silver** tables, and the **orders_by_date_vw** view. Explore the results.

In [0]:
%sql
SELECT *
FROM default.orders_bronze
LIMIT 5;

In [0]:
%sql
SELECT *
FROM default.orders_silver
LIMIT 5;

In [0]:
%sql
SELECT *
FROM default.orders_by_date_vw
LIMIT 5;
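
Nothing in the traditional code above validates incoming rows. As a hedged illustration of the extra code a hand-written check requires, the plain-Python sketch below splits rows into valid and rejected sets. The rule names, the rules themselves, and the sample rows are hypothetical. Only the column names come from the silver table.

```python
def split_by_quality(rows, rules):
    """Split rows into (valid, rejected), where rejected holds (row, failed_rule_names)."""
    valid, rejected = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            rejected.append((row, failed))
        else:
            valid.append(row)
    return valid, rejected


# Hypothetical rules: each maps a rule name to a function returning True when the row passes.
rules = {
    "order_id_present": lambda row: row.get("order_id") is not None,
    "order_timestamp_present": lambda row: row.get("order_timestamp") is not None,
}

# Made-up sample rows using the silver table's column names.
rows = [
    {"order_id": 1, "order_timestamp": "2024-01-01 10:00:00", "customer_id": 10},
    {"order_id": None, "order_timestamp": "2024-01-01 11:00:00", "customer_id": 11},
    {"order_id": 3, "order_timestamp": None, "customer_id": 12},
]

valid, rejected = split_by_quality(rows, rules)
print("valid rows:", len(valid))
for row, failed in rejected:
    print("rejected:", row, "failed:", failed)
```

Even this small check has to be written, scheduled, and monitored separately from the table definitions.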

### Considerations

- As JSON files are added to the volume in cloud storage, your **bronze table** code will read **all** of the files each time it executes, rather than reading only new rows of raw data. As the data grows, this can become inefficient and costly.

- The **silver table** code will always read all the rows from the bronze table to prepare the silver table. As the data grows, this can also become inefficient and costly.

- The traditional view, **orders_by_date_vw**, executes each time it is called. As the data grows, this can become inefficient.

- To check data quality as new rows are added, additional code is needed to identify any values that do not meet the required conditions.

- Monitoring the pipeline for each run is a challenge.

- There is no simple user interface to explore, monitor, or fix issues every time the code runs.
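
To make the cost difference in the first two considerations concrete, here is a toy plain-Python comparison of a full refresh against an incremental refresh that remembers which files it has already processed. It is only a sketch of the idea, not how any Databricks feature is implemented. The file **01.json** and all row values are hypothetical.

```python
def full_refresh(files):
    """Rebuild the table from every file; return (table, number_of_files_read)."""
    table = [row for name in sorted(files) for row in files[name]]
    return table, len(files)


def incremental_refresh(files, table, processed):
    """Append rows only from files not seen before; return number_of_files_read."""
    new_files = sorted(set(files) - processed)
    for name in new_files:
        table.extend(files[name])
        processed.add(name)
    return len(new_files)


# Run 1: a single source file, as in the lab's starting state.
files = {"00.json": [{"order_id": 1}, {"order_id": 2}]}
incremental_table, processed = [], set()

full_table, full_reads_run1 = full_refresh(files)
incr_reads_run1 = incremental_refresh(files, incremental_table, processed)

# Run 2: a hypothetical second file lands in the folder.
files["01.json"] = [{"order_id": 3}]
full_table, full_reads_run2 = full_refresh(files)
incr_reads_run2 = incremental_refresh(files, incremental_table, processed)

print("full refresh files read per run:", full_reads_run1, full_reads_run2)
print("incremental files read per run:", incr_reads_run1, incr_reads_run2)
print("same result:", full_table == incremental_table)
```

Both approaches end with the same rows, but the full refresh rereads every file on each run while the incremental version reads only what is new.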

### We can automatically process data incrementally, manage infrastructure, monitor, observe, optimize, and view this ETL pipeline by converting this to use **Spark Declarative Pipelines**!

## D. Get Started Creating a Lakeflow Spark Declarative Pipeline Using the New Lakeflow Pipelines Editor

In this section, we'll show you how to start creating a Spark Declarative Pipeline using the new Lakeflow Pipelines Editor. We won't run or modify the pipeline just yet!

There are a few different ways to create your pipeline. Let's explore these methods.

1. First, complete the following steps to enable the new **Lakeflow Pipelines Editor**:

   **NOTE:** This feature is still being updated, so the steps to enable it might change slightly.

   a. In the top-right corner, select your user icon ![User Lab Icon](./Includes/images/user_lab_circle_icon.png).

   b. Right-click on **Settings** and select **Open in New Tab**.

   c. Select **Developer**.

   d. Scroll to the bottom, enable **Lakeflow Pipelines Editor** if it's not already enabled, and click **Enable tabs for notebooks and files**.

   ![Lakeflow Pipeline Editor](./Includes/images/lakeflow-pipeline-editor.png)

   e. Refresh your browser page to enable the option you turned on.

### D1. Create a Spark Declarative Pipeline Using the File Explorer
1. Complete the following steps to create a Spark Declarative Pipeline using the left navigation pane:

   a. In the left navigation bar, select the **Folder** ![Folder Icon](./Includes/images/folder_icon.png) icon to open the Workspace navigation.

   b. Navigate to the **Build Data Pipelines with Lakeflow Spark Declarative Pipelines** folder (you are most likely already there).

   c. (**PLEASE READ**) To complete this demonstration, it'll be easier to open this same notebook in another tab to follow along with these instructions. Right-click on the notebook **1 - REQUIRED - Course Setup and Creating a Pipeline** and select **Open in a New Tab**.

   d. In the other tab, select the ellipsis (three-dot) icon ![Ellipsis Icon](./Includes/images/ellipsis_icon.png) in the folder navigation bar.

   e. Select **Create** -> **ETL Pipeline**:
      - If you have not enabled the new **Lakeflow Pipelines Editor**, a pop-up might appear asking you to enable the new editor. Select **Enable** here or complete the previous step.

      <br/>

      - Then use the following information:

         - **Name**: `yourfirstname-my-pipeline-project`

         - **Default catalog**: Select your **labuser** catalog

         - **Default schema**: Select your **default** schema (database)

         - Select **Start with sample code in SQL**

         The project will open up in the pipeline editor and look like the following:

      ![Pipeline Editor](./Includes/images/new_pipeline_editor_sample.png)

   f. This will open your Spark Declarative Pipeline within the **Lakeflow Pipelines Editor**. By default, the project creates multiple folders and sample files for you as a starter. You can use this sample folder structure or create your own. Notice the following in the pipeline editor:

      - The Spark Declarative Pipeline is located within the **Pipeline** tab.

      - Here, you start with a sample project and folder structure.

      - To navigate back to all your files and folders, select **All Files**.

      - We will explore the pipeline editor and running a pipeline in the next demonstration.

   g. Close the tab with the sample pipeline.

### D2. Create a Spark Declarative Pipeline Using the Pipeline UI
1. You can also create a Spark Declarative Pipeline using the far-left main navigation bar by completing the following steps:

   a. On the far-left navigation bar, right-click **Jobs and Pipelines** and select **Open Link in New Tab**.

   b. Find the blue **Create** button and select it.

   c. Select **ETL pipeline**.

   d. The same **Create pipeline** pop-up appears as before.

   e. Here, select **Add existing assets**.

   f. The **Add existing assets** button enables you to select a folder with pipeline assets. This option will enable you to associate this new pipeline with code files already available in your Workspace, including Git folders.

   <img src="./Includes/images/existing_assets.png" alt="Existing Assets" width="400">


   g. You can close the pop-up window and then close the pipeline tab. You do not need to select a folder yet.

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>