# 5 - Deploying a Pipeline to Production

In this demonstration, we add a second data source to our pipeline and join it with our streaming tables. We then prepare the pipeline for production by adding comments and table properties to the objects we create, scheduling the pipeline, and publishing an event log to monitor it.

### Learning Objectives

By the end of this lesson, you will be able to:
- Apply the appropriate comment syntax and table properties to pipeline objects to enhance readability.
- Join two streaming tables in a materialized view to optimize data processing.
- Schedule a pipeline in triggered or continuous mode to ensure timely processing.
- Explore the event log to monitor a production Lakeflow Spark Declarative Pipeline.


## A. Environment Configuration

Define the catalog, schema, and volume path variables used in this demonstration.


In [0]:
# Define catalog, schema, and volume paths
CATALOG_NAME = 'ldp_demo'
SCHEMA_NAME = 'ldp_schema'
VOLUME_NAME = 'raw'
WORKING_DIR = f'/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}'

print(f'Catalog: {CATALOG_NAME}')
print(f'Schema: {SCHEMA_NAME}')
print(f'Volume Path: {WORKING_DIR}')


## B. Explore the Orders and Status JSON Files

1. Explore the raw **orders** data located in the `/Volumes/ldp_demo/ldp_schema/raw/orders/` directory of the volume. This is the data we have been working with throughout the course demonstrations.

   Run the cell below to view the results. Notice that the orders JSON file(s) contain information about when each order was placed.


In [0]:
%sql
SELECT *
FROM read_files(
  '/Volumes/ldp_demo/ldp_schema/raw/orders/',
  format => 'JSON'
)
LIMIT 10;


2. Explore the raw **status** data located in the `/Volumes/ldp_demo/ldp_schema/raw/status/` directory of the volume and filter for the specific **order_id** *75123*.

   Run the cell below to view the results. Notice that the status JSON file(s) contain **order_status** information for each order.


In [0]:
%sql
SELECT *
FROM read_files(
  '/Volumes/ldp_demo/ldp_schema/raw/status/',
  format => 'JSON'
)
WHERE order_id = 75123;
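
Both sources share an **order_id** column, which is what the pipeline later joins on. The following is a minimal plain-Python sketch of that relationship, not the pipeline's actual code. Only `order_id` and `order_status` come from this lesson; the `order_timestamp` field, the status values, and the idea that one order can have several status rows are assumptions made for illustration.

```python
# Hypothetical records: only order_id and order_status are named in this lesson.
orders = [
    {"order_id": 75123, "order_timestamp": "2024-01-01T10:00:00"},
    {"order_id": 75124, "order_timestamp": "2024-01-01T10:05:00"},
]
statuses = [
    {"order_id": 75123, "order_status": "placed"},
    {"order_id": 75123, "order_status": "delivered"},
    {"order_id": 75124, "order_status": "placed"},
]


def join_orders_with_status(orders, statuses):
    """Inner-join each status row to its order on order_id."""
    orders_by_id = {order["order_id"]: order for order in orders}
    joined = []
    for status in statuses:
        order = orders_by_id.get(status["order_id"])
        if order is not None:
            # Combine the order columns with the status column.
            joined.append({**order, "order_status": status["order_status"]})
    return joined


joined_rows = join_orders_with_status(orders, statuses)

# Same filter as the SQL cell above, applied to the joined rows.
rows_for_75123 = [row for row in joined_rows if row["order_id"] == 75123]
for row in rows_for_75123:
    print(row)
```

In this sketch each status row produces one joined row, so an order with several status updates appears several times in the result.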


## C. Create Production Pipeline

1. Create your starter pipeline for this demonstration. Configure the pipeline with:

   - Your default catalog: `ldp_demo`
   - Your default schema: `ldp_schema`
   - Your configuration parameter: `source` = `/Volumes/ldp_demo/ldp_schema/raw`

2. Complete the following steps to open the starter Spark Declarative Pipeline project:

   a. In the main navigation bar, right-click **Jobs & Pipelines** and select **Open Link in New Tab**.

   b. In **Jobs & Pipelines**, select your pipeline (or create a new one).

   c. **REQUIRED:** At the top near your pipeline name, turn on **New pipeline monitoring**.

   d. In the **Pipeline details** pane on the far right, select **Open in Editor** to open the pipeline in the **Lakeflow Pipeline Editor**.

   e. In the new tab, you should see two folders: **orders** and **status**.

3. Explore the code in the `orders/orders_pipeline.sql` file and `status/status_pipeline.sql` file. Follow the instructional comments in the files to proceed.
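
The configuration from step 1 can also be written down as a settings dictionary, which is handy for reviewing or versioning it. This is a rough sketch: the key names (`name`, `catalog`, `schema`, `continuous`, `configuration`) are my assumption of how they appear in the pipeline's JSON settings view, and the pipeline name is hypothetical. Check both against your own pipeline before relying on them.

```python
import json

# Values taken from step 1 of this section.
CATALOG_NAME = "ldp_demo"
SCHEMA_NAME = "ldp_schema"
SOURCE_PATH = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/raw"


def build_pipeline_settings(name: str, continuous: bool = False) -> dict:
    """Assemble a settings dictionary resembling the pipeline's JSON view."""
    return {
        "name": name,                # hypothetical pipeline name
        "catalog": CATALOG_NAME,     # default catalog
        "schema": SCHEMA_NAME,       # default schema
        "continuous": continuous,    # False is assumed to mean triggered mode
        "configuration": {"source": SOURCE_PATH},  # the `source` parameter
    }


settings = build_pipeline_settings("demo_5_production_pipeline")
print(json.dumps(settings, indent=2))
```

Keeping `continuous` as an explicit argument makes the choice between triggered and continuous mode visible in one place.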


## D. Land More Data to Your Data Source Volume

1. Run the cell below to add more JSON files to the `orders` and `status` directories in your volume:


In [0]:
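def copy_files(copy_from: str, copy_to: str, n: int, sleep=2):
    """Ensure the first n files (sorted by name) in copy_from exist in copy_to.

    Files already present in copy_to are skipped. Pauses `sleep` seconds after each copy.
    """
    import os
    import time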

    print(f"\n----------------Loading files to volume: '{copy_to}'----------------")

    if os.path.exists(copy_from):
        list_of_files_to_copy = sorted(os.listdir(copy_from))
        total_files = len(list_of_files_to_copy)
    else:
        print(f'Source directory {copy_from} does not exist.')
        return

    if os.path.exists(copy_to):
        list_of_files_in_dest = os.listdir(copy_to)
    else:
        list_of_files_in_dest = []

    assert total_files >= n, f"Source location contains only {total_files} files, but you specified {n} files to copy."

    counter = 1
    for file in list_of_files_to_copy:
        if file in list_of_files_in_dest:
            print(f'File number {counter} - {file} already exists. Skipping.')
        else:
            file_to_copy = f'{copy_from}/{file}'
            copy_file_to = f'{copy_to}/{file}'
            print(f'File number {counter} - Copying {file_to_copy} --> {copy_file_to}')
            dbutils.fs.cp(file_to_copy, copy_file_to, recurse=True)
            time.sleep(sleep)

        if counter == n:
            break
        counter += 1

# Copy additional files
try:
    copy_files(
        copy_from='/Volumes/dbacademy_retail/v01/retail-pipeline/orders/stream_json',
        copy_to=f'{WORKING_DIR}/orders',
        n=5
    )

    copy_files(
        copy_from='/Volumes/dbacademy_retail/v01/retail-pipeline/status/stream_json',
        copy_to=f'{WORKING_DIR}/status',
        n=5
    )
except Exception as e:
    print(f'Note: Could not copy from dbacademy_retail. Error: {e}')
    print(f'Please manually add JSON files to: {WORKING_DIR}/orders/ and {WORKING_DIR}/status/')


2. Navigate back to your pipeline and select **Run pipeline** to process the newly landed files.
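
The selection logic of `copy_files` from step 1 can be tried outside Databricks. The sketch below is a local analogue under stated assumptions: it swaps `dbutils.fs.cp` for `shutil.copy`, drops the sleep and the file-count assertion, and uses hypothetical file names in temporary directories. The helper `copy_first_n` is not part of the course code.

```python
import os
import shutil
import tempfile


def copy_first_n(copy_from: str, copy_to: str, n: int) -> list:
    """Copy the first n files (sorted by name) that are missing from copy_to.

    Returns the names of the files that were actually copied.
    """
    os.makedirs(copy_to, exist_ok=True)
    already_there = set(os.listdir(copy_to))
    copied = []
    for name in sorted(os.listdir(copy_from))[:n]:
        if name in already_there:
            continue  # landed by an earlier call, so skip it
        shutil.copy(os.path.join(copy_from, name), os.path.join(copy_to, name))
        copied.append(name)
    return copied


# Hypothetical file names in throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "stream_json")
    dst = os.path.join(tmp, "orders")
    os.makedirs(src)
    for i in range(5):
        with open(os.path.join(src, f"{i:02d}.json"), "w") as fh:
            fh.write("{}\n")

    first_call = copy_first_n(src, dst, n=2)   # lands 00.json and 01.json
    second_call = copy_first_n(src, dst, n=4)  # lands only 02.json and 03.json
    print(first_call)
    print(second_call)
```

Because files that already exist are skipped, calling the function again with a larger `n` lands only the additional files.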


## E. Monitor Your Pipeline with the Event Log

After you run your pipeline and the event log has been published successfully, you can query it. The event log provides detailed information about pipeline runs, data quality metrics, and other pipeline events.

1. Query your event log table (if configured) to see pipeline events and metrics.

2. Explore data quality metrics and pipeline performance through the event log.
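
As a rough illustration of step 2, the sketch below totals data quality results from event log rows in plain Python. It assumes that each row has an `event_type` and a JSON `details` payload, and that `flow_progress` events carry expectation results under `flow_progress.data_quality.expectations`. That layout, the dataset name, and the expectation name are assumptions for illustration; verify them against your own event log before reusing the logic.

```python
import json


def make_event(event_type: str, details: dict) -> dict:
    """Build a hypothetical event log row with `details` stored as a JSON string."""
    return {"event_type": event_type, "details": json.dumps(details)}


def flow_details(passed: int, failed: int) -> dict:
    """Hypothetical `details` payload for one flow_progress event."""
    return {
        "flow_progress": {
            "data_quality": {
                "expectations": [
                    {
                        "name": "valid_order_id",    # hypothetical expectation
                        "dataset": "orders_silver",  # hypothetical dataset
                        "passed_records": passed,
                        "failed_records": failed,
                    }
                ]
            }
        }
    }


sample_events = [
    make_event("flow_progress", flow_details(95, 5)),
    make_event("update_progress", {"update_progress": {"state": "COMPLETED"}}),
    make_event("flow_progress", flow_details(50, 0)),
]


def expectation_summary(events: list) -> dict:
    """Total (passed, failed) record counts per (dataset, expectation)."""
    totals = {}
    for event in events:
        if event["event_type"] != "flow_progress":
            continue  # only flow_progress events are assumed to carry quality data
        details = json.loads(event["details"])
        quality = details.get("flow_progress", {}).get("data_quality", {})
        for exp in quality.get("expectations", []):
            key = (exp["dataset"], exp["name"])
            passed, failed = totals.get(key, (0, 0))
            totals[key] = (passed + exp["passed_records"], failed + exp["failed_records"])
    return totals


summary = expectation_summary(sample_events)
for (dataset, name), (passed, failed) in summary.items():
    print(f"{dataset}.{name}: passed={passed}, failed={failed}")
```

In a notebook, the same aggregation would normally be run as a query against the published event log table rather than on in-memory rows.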

For more information, see the [Monitor Lakeflow Spark Declarative Pipelines](https://docs.databricks.com/aws/en/dlt/observability) documentation.
