
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# 5 - Deploying a Pipeline to Production

In this demonstration, we will begin by adding another data source to our pipeline and joining it with our streaming tables. Then, we will prepare the pipeline for production by adding comments and table properties to the objects we create, scheduling the pipeline, and publishing an event log to monitor the pipeline.

### Learning Objectives

By the end of this lesson, you will be able to:
- Apply the appropriate comment syntax and table properties to pipeline objects to enhance readability.
- Demonstrate how to perform a join between two streaming tables using a materialized view to optimize data processing.
- Schedule a pipeline using triggered or continuous mode to ensure timely processing.
- Explore the event log to monitor a production Lakeflow Spark Declarative Pipeline.

## REQUIRED - SELECT CLASSIC COMPUTE (your cluster starts with **labuser**)

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course.

This cell will also reset the JSON files in your `/Volumes/dbacademy/ops/labuser/` volume to the starting point, with one JSON file in each volume.

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically create and reference the information needed to run the course.

In [0]:
%run ./Includes/Classroom-Setup-5

## B. Explore the Orders and Status JSON Files

1. Explore the **orders** raw data located in the `/Volumes/dbacademy/ops/your-lab-user/orders/` volume. This is the data we have been working with throughout the course demonstrations.

   Run the cell below to view the results. Notice that the orders JSON file(s) contain information about when each order was placed.

In [0]:
SELECT *
FROM read_files(
  DA.paths_working_dir || '/orders/',
  format => 'JSON'
)
LIMIT 10;

2. Explore the **status** raw data located in the `/Volumes/dbacademy/ops/your-lab-user/status/` volume and filter for the specific **order_id** *75123*.

   Run the cell below to view the results. Notice that the status JSON file(s) contain **order_status** information for each order.  

   **NOTE:** The **order_status** can include multiple rows per order and may be any of the following:

   - on the way  
   - canceled  
   - return canceled  
   - reported shipping error  
   - delivered  
   - return processed  
   - return picked up  
   - placed  
   - preparing  
   - return requested


In [0]:
SELECT *
FROM read_files(
  DA.paths_working_dir || '/status/',
  format => 'JSON'
)
WHERE order_id = 75123;

3. One of our objectives is to join the **orders** data with the order **status** data.  

    The query below demonstrates what the result of the final join in the Spark Declarative Pipeline will look like after the data has been incrementally ingested and cleaned when we create the pipeline. Run the cell and review the output.

    Notice that after joining the tables, we can see each **order_id** along with its original **order_timestamp** and the **order_status** at specific points in time.

**NOTE:** The data used in this demo is artificially generated, so the **order_status_timestamp** values may not reflect realistic timing.

In [0]:
WITH orders AS (
  SELECT *
  FROM read_files(
        DA.paths_working_dir || '/orders/',
        format => 'JSON'
  )
),
status AS (
  SELECT *
  FROM read_files(
        DA.paths_working_dir || '/status/',
        format => 'JSON'
  )
)
-- Join the views to get the order history with status
SELECT
  orders.order_id,
  timestamp(orders.order_timestamp) AS order_timestamp,
  status.order_status,
  timestamp(status.status_timestamp) AS order_status_timestamp
FROM orders
  INNER JOIN status 
  ON orders.order_id = status.order_id
ORDER BY order_id, order_status_timestamp;
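
To make the join logic concrete outside of SQL, here is a minimal Python sketch of the same inner join and ordering. The sample records, the timestamp string format, and the helper names `to_ts` and `join_orders_status` are assumptions made for illustration; they are not taken from the course data or the pipeline code.

```python
from datetime import datetime

# Hypothetical sample records; the real data lives in the orders/ and status/ JSON files.
orders = [
    {"order_id": 75123, "order_timestamp": "2024-01-01 10:00:00"},
    {"order_id": 75124, "order_timestamp": "2024-01-01 11:30:00"},
]
status = [
    {"order_id": 75123, "order_status": "preparing", "status_timestamp": "2024-01-01 10:20:00"},
    {"order_id": 75123, "order_status": "placed", "status_timestamp": "2024-01-01 10:05:00"},
    {"order_id": 75124, "order_status": "placed", "status_timestamp": "2024-01-01 11:35:00"},
    {"order_id": 99999, "order_status": "placed", "status_timestamp": "2024-01-01 12:00:00"},
]


def to_ts(value):
    """Parse the assumed 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def join_orders_status(orders, status):
    """Inner join on order_id, sorted by order_id and then status timestamp."""
    # Assumes order_id is unique within orders.
    orders_by_id = {o["order_id"]: o for o in orders}
    joined = [
        {
            "order_id": s["order_id"],
            "order_timestamp": to_ts(orders_by_id[s["order_id"]]["order_timestamp"]),
            "order_status": s["order_status"],
            "order_status_timestamp": to_ts(s["status_timestamp"]),
        }
        for s in status
        if s["order_id"] in orders_by_id  # inner join: drop status rows with no matching order
    ]
    return sorted(joined, key=lambda r: (r["order_id"], r["order_status_timestamp"]))


history = join_orders_status(orders, status)
for row in history:
    print(row["order_id"], row["order_status"], row["order_status_timestamp"])
```

Each order can appear several times in the result, once per status change, which is the shape the SQL query above returns.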

## C. Putting a Pipeline in Production

This course includes a complete Lakeflow Spark Declarative Pipeline project that has already been created.  In this section, you'll explore the Spark Declarative Pipeline and modify its settings for production use.


1. The screenshot below shows what the final Spark Declarative Pipeline will look like when ingesting a single JSON file from the data sources:  
![Final Demo 5 Pipeline](./Includes/images/demo5_pipeline_image_run1.png)

    **Note:** Depending on the number of files you've ingested, the row count may vary.

2. Run the cell below to create your starter Spark Declarative Pipeline for this demonstration. The pipeline will set the following for you:
    - Your default catalog: `labuser`
    - Your configuration parameter: `source` = `/Volumes/dbacademy/ops/your-labuser-name`

    **NOTE:** If the pipeline already exists, an error will be returned. In that case, you'll need to delete the existing pipeline and rerun this cell.

    To delete the pipeline:

    a. Select **Jobs & Pipelines** from the far-left navigation bar.  

    b. Find the pipeline you want to delete.  

    c. Click the three-dot menu ![ellipsis icon](./Includes/images/ellipsis_icon.png).  

    d. Select **Delete**.

**NOTE:**  The `create_declarative_pipeline` function is a custom function built for this course to create the sample pipeline using the Databricks REST API. This avoids manually creating the pipeline and referencing the pipeline assets.

In [0]:
%python
create_declarative_pipeline(pipeline_name=f'5 - Deploying a Pipeline to Production Project - {DA.catalog_name}', 
                            root_path_folder_name='5 - Deploying a Pipeline to Production Project',
                            catalog_name = DA.catalog_name,
                            schema_name = 'default',
                            source_folder_names=['orders', 'status'],
                            configuration = {'source':DA.paths.working_dir})

3. Complete the following steps to open the starter Spark Declarative Pipeline project for this demonstration:

   a. In the main navigation bar, right-click on **Jobs & Pipelines** and select **Open Link in New Tab**.

   b. In **Jobs & Pipelines** select your **5 - Deploying a Pipeline to Production Project - labuser** pipeline.

   c. **REQUIRED:** At the top near your pipeline name, turn on **New pipeline monitoring**.

   d. In the **Pipeline details** pane on the far right, select **Open in Editor** (field to the right of **Source code**) to open the pipeline in the **Lakeflow Pipeline Editor**.

   e. In the new tab you should see three folders: **explorations**, **orders**, and **status** (plus the extra **python_excluded** folder that contains the Python version). 

   f. Continue to steps 4 and 5 below.

##### Explore the code in the `orders/orders_pipeline.sql` file

4. In the new tab select the **orders** folder. It contains the same **orders_pipeline.sql** pipeline you've been working with.  **Follow the instructional comments in the file to proceed.**



##### Explore the code in the `status/status_pipeline.sql` notebook

5. After reviewing the **orders_pipeline.sql** file, you'll be directed to explore the **status/status_pipeline.sql** notebook. This notebook processes new data and adds it to the pipeline. **Follow the instructions provided in the notebook's markdown cells.**

    **NOTE:** The **status/status_pipeline.sql** notebook walks through configuring the pipeline settings, then scheduling and running the production pipeline.

## D. Land More Data to Your Data Source Volume

1. Run the cell below to add **4** more JSON files to your volumes:
    - `/Volumes/dbacademy/ops/your-labuser-volume/orders`
    - `/Volumes/dbacademy/ops/your-labuser-volume/status`

In [0]:
%python
copy_files(copy_from = '/Volumes/dbacademy_retail/v01/retail-pipeline/orders/stream_json', 
           copy_to = f'{DA.paths.working_dir}/orders', 
           n = 5)

copy_files(copy_from = '/Volumes/dbacademy_retail/v01/retail-pipeline/status/stream_json', 
           copy_to = f'{DA.paths.working_dir}/status', 
           n = 5)

2. Navigate back to your pipeline and select **Run pipeline** to process the newly landed files.

## E. Introduction to the Pipeline Event Log (Advanced Topic)

After running your pipeline and successfully publishing the event log as a table named **event_log_demo_5** in your **labuser.default** schema (database), begin exploring the event log. 

Here we will quickly introduce the event log. **To process the event log you will need knowledge of parsing JSON formatted strings.**

  - [Monitor Lakeflow Spark Declarative Pipelines](https://docs.databricks.com/aws/en/dlt/observability) documentation
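
As a rough illustration of what publishing the event log involves, the sketch below builds the relevant fragment of a pipeline's JSON settings as a Python dictionary. The key names (`continuous`, `event_log`, `catalog`, `schema`, `name`) are assumed to follow the pipeline settings format and should be verified against the JSON view of your own pipeline settings; `labuser` stands in for your lab catalog.

```python
import json

# Assumed fragment of a pipeline's JSON settings; verify the key names in your workspace.
pipeline_settings_fragment = {
    "continuous": False,  # assumption: False corresponds to triggered mode
    "event_log": {
        "catalog": "labuser",        # placeholder for your lab catalog
        "schema": "default",
        "name": "event_log_demo_5",
    },
}

# Build the three-level table name that the queries in this section refer to.
event_log = pipeline_settings_fragment["event_log"]
event_log_table = ".".join([event_log["catalog"], event_log["schema"], event_log["name"]])

print(json.dumps(pipeline_settings_fragment, indent=2))
print(event_log_table)
```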

**TROUBLESHOOT:** 
- **REQUIRED:** If you did not run the pipeline and publish the event log, the code below will not run. Please make sure to complete all steps before starting this section.

- **HIDDEN EVENT LOG:** By default, Spark Declarative Pipelines writes the event log to a hidden Delta table in the default catalog and schema configured for the pipeline. While hidden, the table can still be queried by sufficiently privileged users, although by default only the owner of the pipeline can query it. The default name of the hidden event log is formatted as:  
  - `catalog.schema.event_log_{pipeline_id}` - where the pipeline ID is the system-assigned UUID with dashes replaced by underscores.  
  - [Query the Event Log](https://docs.databricks.com/aws/en/dlt/observability#query-the-event-log)
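
The naming rule above can be sketched in a few lines of Python. The helper name `hidden_event_log_name` and the example pipeline ID are hypothetical; substitute your own catalog, schema, and pipeline ID.

```python
def hidden_event_log_name(catalog, schema, pipeline_id):
    """Build catalog.schema.event_log_{pipeline_id} with dashes replaced by underscores."""
    return f"{catalog}.{schema}.event_log_{pipeline_id.replace('-', '_')}"


# Hypothetical pipeline ID, used only to show the transformation.
example_pipeline_id = "01234567-89ab-cdef-0123-456789abcdef"
hidden_name = hidden_event_log_name("labuser", "default", example_pipeline_id)
print(hidden_name)
```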

1. Complete the following steps to view the **labuser.default.event_log_demo_5** event log in your catalog:

   a. Select the catalog icon ![Catalog Icon](./Includes/images/catalog_icon.png) from the left navigation pane.

   b. Expand your **labuser** catalog.

   c. Expand the following schemas (databases):
      - **1_bronze_db**
      - **2_silver_db**
      - **3_gold_db**
      - **default**

   d. Notice the following:
      - In the **1_bronze_db**, **2_silver_db**, and **3_gold_db** schemas, the pipeline streaming tables and materialized views were created (they end with **demo5**).
      - In the **default** schema, the pipeline has published the event log as a table named **event_log_demo_5**.

**NOTE:** You might need to refresh the catalogs to view the streaming tables, materialized views, and event log.


2. Query your **labuser.default.event_log_demo_5** table to see what the event log looks like.

   Notice that it contains all events within the pipeline as **STRING** columns (typically JSON-formatted strings) or **STRUCT** columns. Databricks supports the `:` (colon) operator to parse JSON fields. See the [`:` operator documentation](https://docs.databricks.com/) for more details.

   The following table describes the event log schema. Some fields, such as **details**, contain JSON data that must be parsed to perform certain queries.

In [0]:
SELECT *
FROM default.event_log_demo_5;

| Field          | Description |
|----------------|-------------|
| `id`           | A unique identifier for the event log record. |
| `sequence`     | A JSON document containing metadata to identify and order events. |
| `origin`       | A JSON document containing metadata for the origin of the event, for example, the cloud provider, the cloud provider region, user_id, pipeline_id, or pipeline_type to show where the pipeline was created, either DBSQL or WORKSPACE. |
| `timestamp`    | The time the event was recorded. |
| `message`      | A human-readable message describing the event. |
| `level`        | The warning level of the event, for example, INFO, WARN, ERROR, or METRICS. |
| `maturity_level` | The stability of the event schema. The possible values are:<br><br>- **STABLE**: The schema is stable and will not change.<br>- **NULL**: The schema is stable and will not change. The value may be NULL if the record was created before the maturity_level field was added (release 2022.37).<br>- **EVOLVING**: The schema is not stable and may change.<br>- **DEPRECATED**: The schema is deprecated and the pipeline runtime may stop producing this event at any time. |
| `error`        | If an error occurred, details describing the error. |
| `details`      | A JSON document containing structured details of the event. This is the primary field used for analyzing events. |
| `event_type`   | The event type. |

**[Event Log Schema](https://docs.databricks.com/aws/en/ldp/monitor-event-log-schema)**

3. The majority of the detailed information you will want from the event log is located in the **details** column, which is a JSON-formatted string. You will need to parse this column.

   You can find more information in the Databricks documentation on how to [query JSON strings](https://docs.databricks.com/aws/en/semi-structured/json).

   The code below will:

   - Return the **event_type** column.

   - Return the entire **details** JSON-formatted string.

   - Parse out the **flow_progress** values from the **details** JSON-formatted string, if they exist.

   - Parse out the **user_action** values from the **details** JSON-formatted string, if they exist.


In [0]:
SELECT
  id,
  event_type,
  details,
  details:flow_progress,
  details:user_action
FROM default.event_log_demo_5
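
For readers less familiar with JSON-formatted strings, the following Python sketch roughly mirrors what `details:flow_progress` and `details:user_action` do: parse the string and return the value under a top-level key, or nothing when the key is absent. The sample rows are hypothetical and heavily simplified, and `extract_key` is an illustrative helper, not part of any Databricks API.

```python
import json

# Hypothetical, simplified event log rows; real details strings contain many more fields.
events = [
    {"event_type": "flow_progress", "details": '{"flow_progress": {"status": "COMPLETED"}}'},
    {"event_type": "user_action", "details": '{"user_action": {"action": "START"}}'},
]


def extract_key(details_json, key):
    """Return the value stored under a top-level key, or None if the key is missing."""
    return json.loads(details_json).get(key)


parsed = [
    {
        "event_type": e["event_type"],
        "flow_progress": extract_key(e["details"], "flow_progress"),
        "user_action": extract_key(e["details"], "user_action"),
    }
    for e in events
]

for row in parsed:
    print(row)
```

As in the SQL result, each row fills in only the column that matches its event type and leaves the other empty.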

4. One use case for the event log is to examine data quality metrics for all runs of your pipeline. These metrics provide valuable insights into your pipeline, both in the short term and long term. Metrics are captured for each constraint throughout the entire lifetime of the table.

   Below is an example query to obtain those metrics. We won’t dive into the JSON parsing code here. This example simply demonstrates what’s possible with the **event_log**.

   Run the cell and observe the results. Notice the following:
   - The **passing_records** for each constraint are displayed.
   - The **warnings_records** (failed records, WARN) for each constraint are displayed.

**NOTE:** If you have selected **Run pipeline with full table refresh** at any point, your results will include metrics from the runs before the full refresh as well as from the full refresh itself. Additional logic is required to isolate the results after the full table refresh, which is outside the scope of this course.


In [0]:
CREATE OR REPLACE TEMPORARY VIEW dq_source_vw AS
SELECT explode(
            from_json(details:flow_progress:data_quality:expectations,
                      "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>")
          ) AS row_expectations
   FROM default.event_log_demo_5
   WHERE event_type = 'flow_progress';


-- View the data
SELECT 
  row_expectations.dataset as dataset,
  row_expectations.name as expectation,
  SUM(row_expectations.passed_records) as passing_records,
  SUM(row_expectations.failed_records) as warnings_records
FROM dq_source_vw
GROUP BY row_expectations.dataset, row_expectations.name
ORDER BY dataset;
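
The same aggregation can be sketched in plain Python to show what the query is doing step by step: keep only `flow_progress` events, pull the expectations array out of the **details** JSON string, and sum the counts per dataset and expectation. The sample events, dataset name, and expectation name are made up for illustration, and `summarize_expectations` is a hypothetical helper.

```python
import json
from collections import defaultdict


def make_event(event_type, details):
    """Build a simplified event row with details stored as a JSON string."""
    return {"event_type": event_type, "details": json.dumps(details)}


# Hypothetical events: two runs reporting the same expectation, plus rows without metrics.
events = [
    make_event("flow_progress", {"flow_progress": {"data_quality": {"expectations": [
        {"name": "valid_id", "dataset": "orders_silver", "passed_records": 100, "failed_records": 2},
    ]}}}),
    make_event("flow_progress", {"flow_progress": {"data_quality": {"expectations": [
        {"name": "valid_id", "dataset": "orders_silver", "passed_records": 90, "failed_records": 3},
    ]}}}),
    make_event("flow_progress", {"flow_progress": {"status": "RUNNING"}}),
    make_event("user_action", {"user_action": {"action": "START"}}),
]


def summarize_expectations(events):
    """Sum passed and failed record counts per (dataset, expectation name)."""
    totals = defaultdict(lambda: {"passing_records": 0, "warnings_records": 0})
    for event in events:
        if event["event_type"] != "flow_progress":
            continue
        details = json.loads(event["details"])
        flow = details.get("flow_progress") or {}
        data_quality = flow.get("data_quality") or {}
        for exp in data_quality.get("expectations") or []:
            key = (exp["dataset"], exp["name"])
            totals[key]["passing_records"] += exp["passed_records"]
            totals[key]["warnings_records"] += exp["failed_records"]
    return dict(sorted(totals.items()))


summary = summarize_expectations(events)
for (dataset, expectation), counts in summary.items():
    print(dataset, expectation, counts)
```

Because the counts are summed across every event in the log, results accumulate over all runs, which is why a full table refresh requires extra filtering to isolate.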

### Summary

This was a quick introduction to the pipeline **event_log**. With the **event_log**, you can investigate individual pipeline runs in detail as well as build reports across runs. Feel free to investigate the **event_log** further on your own.

## Additional Resources

- [Lakeflow Spark Declarative Pipelines properties reference](https://docs.databricks.com/aws/en/dlt/properties#dlt-table-properties)

- [Table properties and table options](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-tblproperties)

- [Triggered vs. continuous pipeline mode](https://docs.databricks.com/aws/en/dlt/pipeline-mode)

- [Development and production modes](https://docs.databricks.com/aws/en/dlt/updates#development-and-production-modes)

- [Monitor Lakeflow Spark Declarative Pipelines](https://docs.databricks.com/aws/en/dlt/observability)

- **Materialized views include built-in optimizations where applicable:**
  - [Incremental refresh for materialized views](https://docs.databricks.com/aws/en/optimizations/incremental-refresh)
  - [Delta Live Tables Announces New Capabilities and Performance Optimizations](https://www.databricks.com/blog/2022/06/29/delta-live-tables-announces-new-capabilities-and-performance-optimizations.html)
  - [Cost-effective, incremental ETL with serverless compute for Delta Live Tables pipelines](https://www.databricks.com/blog/cost-effective-incremental-etl-serverless-compute-delta-live-tables-pipelines)

- **Stateful joins:** For stateful joins in pipelines (i.e., joining incrementally as data is ingested), refer to the [Optimize stateful processing in Lakeflow Spark Declarative Pipelines with watermarks](https://docs.databricks.com/aws/en/dlt/stateful-processing) documentation. **Stateful joins are an advanced topic and outside the scope of this course.**

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>