# 3 - Adding Data Quality Expectations

In this demonstration, we will add data quality expectations, which apply quality constraints that validate data as it flows through Lakeflow Spark Declarative Pipelines. Expectations provide greater insight into data quality metrics and let you drop invalid records or fail the update when invalid records are detected.
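
Expectations support three actions on violation: warn, drop, or fail. To make the difference concrete, here is a minimal plain-Python sketch of how the three actions treat invalid records. It is a simulation for illustration only, not the Lakeflow API: `apply_expectation`, `ExpectationFailed`, and the sample `customer_id` rule are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


class ExpectationFailed(Exception):
    """Raised by the 'fail' action when any record violates the constraint."""


def apply_expectation(
    records: List[Record],
    name: str,
    predicate: Callable[[Record], bool],
    action: str = "warn",
) -> Tuple[List[Record], Dict[str, int]]:
    """Apply one constraint and return (records written, pass/fail counts)."""
    if action not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {action}")

    passed = [r for r in records if predicate(r)]
    failed_count = len(records) - len(passed)
    metrics = {"passed_records": len(passed), "failed_records": failed_count}

    if action == "fail" and failed_count > 0:
        # Fail: stop processing instead of writing anything.
        raise ExpectationFailed(
            f"{name}: {failed_count} record(s) violated the constraint"
        )
    if action == "drop":
        # Drop: only valid records are written; violations are still counted.
        return passed, metrics
    # Warn: every record is written; violations only show up in the metrics.
    return list(records), metrics


# Hypothetical sample data: the second record has no customer_id.
sample = [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 3}]
has_customer = lambda r: r["customer_id"] is not None

kept_warn, metrics_warn = apply_expectation(sample, "valid_customer", has_customer, "warn")
kept_drop, metrics_drop = apply_expectation(sample, "valid_customer", has_customer, "drop")
```

With `warn`, all three sample records are kept and one failure is recorded; with `drop`, only the two valid records are kept; with `fail`, the same input raises `ExpectationFailed`.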


### Learning Objectives

By the end of this lesson, you will be able to:
- Add quality constraints within a Lakeflow Spark Declarative Pipeline to trigger appropriate actions (warn, drop, or fail) based on data expectations.
- Analyze pipeline metrics to identify and interpret data quality issues across different data flows.


## A. Environment Configuration

Set up the environment variables for this course.


In [0]:
# Define catalog, schema, and volume paths
CATALOG_NAME = 'cetpa_external_catalog'
SCHEMA_NAME = 'ldp_schema'
VOLUME_NAME = 'raw'
WORKING_DIR = f'/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}'

print(f'Catalog: {CATALOG_NAME}')
print(f'Schema: {SCHEMA_NAME}')
print(f'Volume Path: {WORKING_DIR}')


Run the cell below to programmatically list the files in the **orders** folder of your volume (the `{WORKING_DIR}/orders` path built in the previous cell). Confirm that you see only the original **00.json** file in the **orders** folder.


In [0]:
spark.sql(f'LIST "{WORKING_DIR}/orders"').display()
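
The visual check above can also be expressed as an assertion. The sketch below assumes the `LIST` result exposes a `name` column, so that in the notebook the file names could be collected from the query result; that column name should be verified in your workspace. The helper `only_original_file` is a hypothetical name, and the hard-coded list stands in for the query result so the logic can be read on its own.

```python
from typing import Iterable


def only_original_file(names: Iterable[str], expected: str = "00.json") -> bool:
    """Return True when the folder listing contains exactly the expected file."""
    # Strip a trailing slash in case the listing reports directory-style names.
    cleaned = sorted(n.rstrip("/") for n in names)
    return cleaned == [expected]


# Stand-in for the names returned by the LIST query (assumed, not real output).
listed_names = ["00.json"]
folder_is_clean = only_original_file(listed_names)
```

If the check returns `False`, extra files have landed in the **orders** folder and the starting state differs from what this demonstration expects.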


## B. Adding Data Quality Expectations

This demonstration uses the simple starter Spark Declarative Pipeline created in the previous demonstration. We will continue to build on it to explore its capabilities.


1. Create your starter pipeline for this demonstration. The pipeline should be configured with:

- Your default catalog: `ldp_demo`
- Your default schema: `ldp_schema`
- Your configuration parameter: `source` = `/Volumes/ldp_demo/ldp_schema/raw`

  **NOTE:** If the pipeline already exists, you'll need to delete the existing pipeline first.

  To delete the pipeline:

  - Select **Jobs & Pipelines** from the far-left navigation bar.  

  - Find the pipeline you want to delete.  

  - Click the three-dot menu.  

  - Select **Delete**.
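
For reference, the three settings listed in step 1 can be pictured as a fragment of the pipeline's settings. This is a sketch only: the key names `catalog`, `schema`, and `configuration` are assumed to match the pipeline settings JSON and should be checked against your workspace, and the pipeline name is a placeholder.

```python
import json

# Values come from step 1. The key names are assumptions about the pipeline
# settings JSON, and the name is a placeholder rather than a real pipeline.
pipeline_settings = {
    "name": "<your-pipeline-name>",
    "catalog": "ldp_demo",
    "schema": "ldp_schema",
    "configuration": {"source": "/Volumes/ldp_demo/ldp_schema/raw"},
}

settings_json = json.dumps(pipeline_settings, indent=2)
print(settings_json)
```

The `source` entry is the configuration parameter from step 1; it lets pipeline code refer to the raw data location without hard-coding the volume path.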


2. Complete the following steps to open the starter Spark Declarative Pipeline project for this demonstration:

   a. In the main navigation bar, right-click **Jobs & Pipelines** and select **Open Link in New Tab**.

   b. In **Jobs & Pipelines**, select your pipeline (or create a new one).

   c. **REQUIRED:** At the top near your pipeline name, turn on **New pipeline monitoring**.

   d. In the **Pipeline details** pane on the far right, select **Open in Editor** (field to the right of **Source code**) to open the pipeline in the **Lakeflow Pipeline Editor**.

   e. In the new tab:
      - Select the **orders** folder

      - Click on **orders_pipeline.sql**.

   f. In the navigation pane of the new tab, you should see **Pipeline** and **All Files**. Ensure you are in the **Pipeline** tab. This will list all files in your pipeline.


3. In the new tab, follow the instructions provided in the comments within the **orders_pipeline.sql** file to add data quality expectations.


## Additional Resources

- [Manage data quality with pipeline expectations](https://docs.databricks.com/aws/en/dlt/expectations)

- [Expectation recommendations and advanced patterns](https://docs.databricks.com/aws/en/dlt/expectation-patterns)

- [Data Quality Management With Databricks](https://www.databricks.com/discover/pages/data-quality-management#expectations-with-delta-live-tables)
