
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# Exploring the Pipeline Event Logs

DLT uses the event log to store much of the information needed to manage, report on, and understand what happens during pipeline execution.

DLT stores log information in a special table called the `event_log`. To access log data from this table, you must use the **`event_log`** table-valued function (TVF). We will use this function in the cells that follow, passing the pipeline ID as a parameter.

Below, we provide a number of useful queries to explore the event log and gain greater insight into your DLT pipelines.

Run the setup script below to get started.

In [0]:
%run ./Includes/Classroom-Setup-04.4

## Generate Required Query
Run the next cell to generate a query that we will need in a few minutes.

In [0]:
%python
DA.print_catalog_and_pipeline_name()

# Important -- Please Read!
The rest of the cells in this notebook must be run with either a cluster that is in **shared** access mode or with a SQL warehouse.

**We will be using a SQL warehouse.**

Please click the cluster name at the top of the page and switch to a SQL warehouse.

SQL warehouses provide elastic SQL compute that is decoupled from storage and scales automatically, without disruption, to serve high-concurrency workloads.

## Query Event Log
The event log is managed as a Delta Lake table with some of the more important fields stored as nested JSON data.

Copy the query from the previous code cell's output into the next cell, and run the code. This query shows how simple it is to read the event log table.

In [0]:
-- Copy the query from the previous code cell's output into this cell.
-- This cell must be run on a shared cluster or SQL warehouse.

<REPLACE WITH THE OUTPUT FROM THE PREVIOUS CODE CELL>

The query in the previous cell uses the [**`event_log`** table-valued function](https://docs.databricks.com/en/sql/language-manual/functions/event_log.html). This is a built-in function that allows you to query the event log for materialized views, streaming tables, and DLT pipelines.

## Perform Audit Logging

Events related to running pipelines and editing configurations are captured with the event type **`user_action`**.

Yours should be the only **`user_name`** for the pipeline you configured during this lesson.

In [0]:
SELECT timestamp, details:user_action:action, details:user_action:user_name
  FROM pipeline_event_log
  WHERE event_type = 'user_action'
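
To make the JSON path extraction in the query above more concrete, here is a minimal sketch in standalone Python (meant to run outside this notebook, not on the SQL warehouse). It assumes each event row carries a `details` column holding a JSON string with a `user_action` object, as the query implies. The sample rows, user name, and action values are hypothetical, and `audit_rows` is an illustrative helper, not part of any Databricks API.

```python
import json

# Hypothetical event-log rows; the timestamps, actions, and user name are
# made-up sample values, not output from a real pipeline.
sample_events = [
    {
        "timestamp": "2024-01-01T10:00:00Z",
        "event_type": "user_action",
        "details": json.dumps(
            {"user_action": {"action": "CREATE", "user_name": "student@example.com"}}
        ),
    },
    {
        "timestamp": "2024-01-01T10:05:00Z",
        "event_type": "flow_progress",
        "details": json.dumps({"flow_progress": {"status": "RUNNING"}}),
    },
    {
        "timestamp": "2024-01-01T10:06:00Z",
        "event_type": "user_action",
        "details": json.dumps(
            {"user_action": {"action": "START", "user_name": "student@example.com"}}
        ),
    },
]


def audit_rows(events):
    """Return (timestamp, action, user_name) for every user_action event."""
    rows = []
    for event in events:
        # Mirrors: WHERE event_type = 'user_action'
        if event["event_type"] != "user_action":
            continue
        # Mirrors: details:user_action:action and details:user_action:user_name
        user_action = json.loads(event["details"]).get("user_action", {})
        rows.append(
            (event["timestamp"], user_action.get("action"), user_action.get("user_name"))
        )
    return rows


audit = audit_rows(sample_events)
distinct_users = {user_name for _, _, user_name in audit}
print(audit)
print(distinct_users)
```

Collecting the distinct user names, as in the last lines, is one way to confirm that only a single user has acted on the pipeline.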

## Get Latest Update ID

In many cases, you may wish to get information about the latest update to your pipeline.

We can easily capture the most recent update ID with a SQL query.

In [0]:
DECLARE OR REPLACE VARIABLE latest_update_id STRING;
SET VARIABLE latest_update_id =
(SELECT origin.update_id
    FROM pipeline_event_log
    WHERE event_type = 'create_update'
    ORDER BY timestamp DESC LIMIT 1);
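
The selection logic of the query above (keep only `create_update` events, then take the newest) can be sketched in standalone Python. This is an illustration under stated assumptions: the rows and update IDs below are hypothetical, timestamps are assumed to be ISO-8601 strings in the same time zone so that string order matches chronological order, and `find_latest_update_id` is an illustrative helper name.

```python
# Hypothetical event-log rows; update IDs and timestamps are sample values.
sample_events = [
    {
        "timestamp": "2024-01-01T09:00:00Z",
        "event_type": "create_update",
        "origin": {"update_id": "update-001"},
    },
    {
        "timestamp": "2024-01-01T11:00:00Z",
        "event_type": "create_update",
        "origin": {"update_id": "update-002"},
    },
    {
        "timestamp": "2024-01-01T11:30:00Z",
        "event_type": "flow_progress",
        "origin": {"update_id": "update-002"},
    },
]


def find_latest_update_id(events):
    """Return the update_id of the newest create_update event, or None."""
    # Mirrors: WHERE event_type = 'create_update'
    create_updates = [e for e in events if e["event_type"] == "create_update"]
    if not create_updates:
        return None
    # Mirrors: ORDER BY timestamp DESC LIMIT 1
    newest = max(create_updates, key=lambda e: e["timestamp"])
    return newest["origin"]["update_id"]


latest_update_id = find_latest_update_id(sample_events)
print(latest_update_id)
```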

## Examine Lineage

DLT provides built-in lineage information that shows how data flows through the tables in your pipeline.

While the query below only indicates the direct predecessors for each table, this information can easily be combined to trace data in any table back to the point it entered the lakehouse.

In [0]:
SELECT details:flow_definition.output_dataset, details:flow_definition.input_datasets 
  FROM pipeline_event_log
WHERE event_type = 'flow_definition' AND 
      origin.update_id = latest_update_id
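
As a rough sketch of how direct predecessors can be combined into a full upstream trace, the standalone Python below walks a mapping from each output dataset to its direct inputs. The mapping is assumed to have already been built from the query results. The dataset names are hypothetical, and `upstream_datasets` and `source_datasets` are illustrative helpers rather than DLT functions.

```python
# Hypothetical lineage: each output dataset mapped to its direct inputs.
# In practice this would be assembled from the flow_definition query results.
direct_inputs = {
    "orders_bronze": [],
    "customers_bronze": [],
    "orders_silver": ["orders_bronze"],
    "orders_by_customer": ["orders_silver", "customers_bronze"],
}


def upstream_datasets(dataset, direct_inputs):
    """Return every dataset that feeds `dataset`, directly or indirectly."""
    seen = set()
    stack = list(direct_inputs.get(dataset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(direct_inputs.get(current, []))
    return seen


def source_datasets(dataset, direct_inputs):
    """Return the upstream datasets that have no recorded inputs of their own."""
    return {
        name
        for name in upstream_datasets(dataset, direct_inputs)
        if not direct_inputs.get(name)
    }


print(upstream_datasets("orders_by_customer", direct_inputs))
print(source_datasets("orders_by_customer", direct_inputs))
```

The iterative walk with a `seen` set visits each dataset once, so shared ancestors are not traversed repeatedly.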

## Examine Data Quality Metrics

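Finally, data quality metrics can be extremely useful for both long-term and short-term insights into your data.

Below, we capture the metrics for each constraint during the latest update of our pipeline. Removing the filter on **`latest_update_id`** would aggregate the metrics across every update recorded in the event log.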

In [0]:
SELECT row_expectations.dataset as dataset,
       row_expectations.name as expectation,
       SUM(row_expectations.passed_records) as passing_records,
       SUM(row_expectations.failed_records) as failing_records
FROM
  (SELECT explode(
            from_json(details :flow_progress :data_quality :expectations,
                      "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>")
          ) row_expectations
   FROM pipeline_event_log
   WHERE event_type = 'flow_progress' AND 
         origin.update_id = latest_update_id
  )
GROUP BY row_expectations.dataset, row_expectations.name
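
The explode-and-aggregate step in the query above can be sketched in standalone Python. The field names (`name`, `dataset`, `passed_records`, `failed_records`) follow the schema passed to `from_json` in the query. The sample payloads, the record counts, and the helper names `summarize_expectations` and `failure_rate` are hypothetical and serve only as an illustration.

```python
import json
from collections import defaultdict

# Hypothetical `details` payloads from flow_progress events (sample values).
sample_details = [
    json.dumps({"flow_progress": {"data_quality": {"expectations": [
        {"name": "valid_id", "dataset": "orders_silver",
         "passed_records": 90, "failed_records": 10},
    ]}}}),
    json.dumps({"flow_progress": {"data_quality": {"expectations": [
        {"name": "valid_id", "dataset": "orders_silver",
         "passed_records": 45, "failed_records": 5},
        {"name": "valid_date", "dataset": "orders_silver",
         "passed_records": 50, "failed_records": 0},
    ]}}}),
    # An event without data quality information contributes nothing.
    json.dumps({"flow_progress": {"status": "COMPLETED"}}),
]


def summarize_expectations(details_rows):
    """Sum passed and failed records per (dataset, expectation) pair."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for raw in details_rows:
        details = json.loads(raw)
        expectations = (
            details.get("flow_progress", {})
            .get("data_quality", {})
            .get("expectations", [])
        )
        # Mirrors explode(): one entry per expectation in the array.
        for expectation in expectations:
            key = (expectation["dataset"], expectation["name"])
            totals[key]["passed"] += expectation["passed_records"]
            totals[key]["failed"] += expectation["failed_records"]
    return dict(totals)


def failure_rate(counts):
    """Return failed / (passed + failed), or 0.0 when no records were seen."""
    total = counts["passed"] + counts["failed"]
    return counts["failed"] / total if total else 0.0


summary = summarize_expectations(sample_details)
for (dataset, name), counts in sorted(summary.items()):
    print(dataset, name, counts, failure_rate(counts))
```

A failure rate per expectation is not part of the query above; it is included here as one possible follow-up metric.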


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>