# Delta Live Tables - Monitoring  
  

<img style="float:right" width="500" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/retail-dlt-data-quality-dashboard.png">

Each DLT pipeline saves events and expectation metrics in the Storage Location defined on the pipeline. From this event log we can see what is happening in the pipeline and the quality of the data passing through it.

You can query the expectations directly as a SQL table with Databricks SQL to track your expectation metrics and send alerts as required.

This notebook extracts and analyzes expectation metrics to build such KPIs.

## Accessing the Delta Live Table pipeline events with Unity Catalog

Databricks provides an `event_log` function that automatically looks up the event log table. You can pass any table of the pipeline to access its logs:

`SELECT * FROM event_log(TABLE(catalog.schema.my_table))`
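
When several tables need inspecting, the query text can be built programmatically. The sketch below is only an illustration: `event_log_query` is a hypothetical helper, not part of any Databricks API, and it only builds the SQL string (in a notebook you would pass the result to `spark.sql`).

```python
import re

# Plain SQL identifier: letters, digits and underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def event_log_query(catalog: str, schema: str, table: str) -> str:
    """Hypothetical helper: build the event_log query for a three-level table name."""
    for part in (catalog, schema, table):
        # Reject anything that is not a plain identifier to avoid SQL injection.
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"not a plain identifier: {part!r}")
    return f"SELECT * FROM event_log(TABLE({catalog}.{schema}.{table}))"

query = event_log_query("catalog", "schema", "my_table")
print(query)
```

Validating each name part keeps the helper safe to call with user-supplied values such as widget inputs.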

#### Using Legacy hive_metastore
*Note: If you are not using Unity Catalog (legacy hive_metastore), you can find your event log location by opening the Settings of your DLT pipeline, under `storage`:*

```
{
    ...
    "name": "lakehouse_churn_dlt",
    "storage": "/demos/dlt/loans",
    "target": "your schema"
}
```
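
As a rough sketch of how the `storage` setting maps to a queryable location: for legacy hive_metastore pipelines the event log is assumed here to be a Delta table under `system/events` inside the storage location. Verify this against your own workspace. The helper name is hypothetical and the settings are limited to the fields shown above.

```python
import json

# Sample settings mirroring the snippet above (only the fields shown there).
pipeline_settings = json.loads("""
{
    "name": "lakehouse_churn_dlt",
    "storage": "/demos/dlt/loans",
    "target": "your schema"
}
""")

def legacy_event_log_path(settings: dict) -> str:
    """Hypothetical helper: derive the event log location from `storage`.

    Assumption: the event log lives under `<storage>/system/events`.
    """
    return settings["storage"].rstrip("/") + "/system/events"

event_log_path = legacy_event_log_path(pipeline_settings)
# Delta tables can be queried by path with the delta.`<path>` syntax.
legacy_query = f"SELECT * FROM delta.`{event_log_path}`"
print(legacy_query)
```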


In [0]:
%sql
SELECT * FROM event_log(TABLE(pds.dbdemos_sharing_airlinedata.customers)) 

## System table setup
We'll create a temporary view based on the event log saved by DLT. The event log is stored under the storage path defined in your DLT settings (the one defined in the widget):

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW demo_cdc_dlt_system_event_log_raw 
  as SELECT * FROM event_log(TABLE(pds.dbdemos_sharing_airlinedata.customers));
SELECT * FROM demo_cdc_dlt_system_event_log_raw order by timestamp desc;

# Delta Live Tables expectation analysis
Delta Live Tables tracks our data quality through expectations. These expectations are stored as technical records within the DLT event log. We can create a view to analyze this information easily.

**Make sure you set your DLT storage path in the widget!**

<!-- [metadata={"description":"Notebook extracting DLT expectations as delta tables used to build DBSQL data quality Dashboard.",
 "authors":["quentin.ambard@databricks.com"],
 "db_resources":{"Dashboards": ["DLT Data Quality Stats"]},
 "search_tags":{"vertical": "retail", "step": "Data Engineering", "components": ["autoloader", "copy into"]},
 "canonicalUrl": {"AWS": "", "Azure": "", "GCP": ""}}] -->

## Analyzing the `demo_cdc_dlt_system_event_log_raw` view structure
The `details` column contains metadata about each Event sent to the Event Log. There are different fields depending on what type of Event it is. Some examples include:
* `user_action` Events occur when taking actions like creating the pipeline
* `flow_definition` Events occur when a pipeline is deployed or updated and have lineage, schema, and execution plan information
  * `output_dataset` and `input_datasets` - output table/view and its upstream table(s)/view(s)
  * `flow_type` - whether this is a complete or append flow
  * `explain_text` - the Spark explain plan
* `flow_progress` Events occur when a data flow starts running or finishes processing a batch of data
  * `metrics` - currently contains `num_output_rows`
  * `data_quality` - contains an array of the results of the data quality rules for this particular dataset
    * `dropped_records`
    * `expectations`
      * `name`, `dataset`, `passed_records`, `failed_records`
      
We can leverage this information to track our table quality using SQL.
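
To make the nesting concrete, here is a small sketch that walks a `details` payload in plain Python and flattens its expectations into one row per rule, the same shape the SQL view built later in this notebook produces. The payload is a hand-written sample with invented values, not output from a real pipeline, and the expectation names are hypothetical.

```python
import json

# Hand-written sample of a `flow_progress` event's details (values are invented).
sample_details = json.dumps({
    "flow_progress": {
        "status": "COMPLETED",
        "metrics": {"num_output_rows": 95},
        "data_quality": {
            "dropped_records": 5,
            "expectations": [
                {"name": "valid_id", "dataset": "customers",
                 "passed_records": 95, "failed_records": 5},
                {"name": "valid_address", "dataset": "customers",
                 "passed_records": 100, "failed_records": 0},
            ],
        },
    }
})

def flatten_expectations(details_json: str) -> list[dict]:
    """Return one dict per expectation, or an empty list if the event has none."""
    progress = json.loads(details_json).get("flow_progress", {})
    data_quality = progress.get("data_quality") or {}
    rows = []
    for expectation in data_quality.get("expectations") or []:
        rows.append({
            "output_records": progress.get("metrics", {}).get("num_output_rows"),
            "dropped_records": data_quality.get("dropped_records"),
            "status_update": progress.get("status"),
            **expectation,  # name, dataset, passed_records, failed_records
        })
    return rows

rows = flatten_expectations(sample_details)
for row in rows:
    print(row["name"], row["passed_records"], row["failed_records"])
```

Events of other types (for example `user_action`) carry no `flow_progress` key, so the function returns an empty list for them, mirroring the `is not null` filter used in SQL.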

In [0]:
%sql
SELECT 
       id,
       timestamp,
       sequence,
       event_type,
       message,
       level, 
       details
  FROM demo_cdc_dlt_system_event_log_raw
 ORDER BY timestamp ASC;  

In [0]:
%sql 
create or replace temp view cdc_dlt_expectations as (
  SELECT 
    id,
    timestamp,
    details:flow_progress.metrics.num_output_rows as output_records,
    details:flow_progress.data_quality.dropped_records,
    details:flow_progress.status as status_update,
    explode(from_json(details:flow_progress.data_quality.expectations
             ,'array<struct<dataset: string, failed_records: bigint, name: string, passed_records: bigint>>')) expectations
  FROM demo_cdc_dlt_system_event_log_raw 
  where details:flow_progress.data_quality.expectations is not null
  ORDER BY timestamp);
select * from cdc_dlt_expectations

## Visualizing the Quality Metrics

Let's run a few queries to show the metrics we can display. Ideally, we would use Databricks SQL to build a SQL dashboard and track all the data, but for this example we'll run a quick query in the notebook directly:

In [0]:
%sql 
select
  sum(expectations.failed_records) as failed_records,
  sum(expectations.passed_records) as passed_records,
  expectations.name
from cdc_dlt_expectations
group by expectations.name
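
The raw counts become easier to act on once turned into a ratio. The sketch below computes a failure rate per expectation and flags the ones above a threshold, which is the kind of KPI an alert could be built on. The input rows, the expectation names, and the 5% threshold are all made-up values for illustration.

```python
# Illustrative aggregated rows, shaped like the query result above (values invented).
aggregated = [
    {"name": "valid_id", "passed_records": 980, "failed_records": 20},
    {"name": "valid_address", "passed_records": 900, "failed_records": 100},
    {"name": "valid_operation", "passed_records": 0, "failed_records": 0},
]

def failure_rate(passed: int, failed: int) -> float:
    """Fraction of evaluated records that failed; 0.0 when nothing was evaluated."""
    total = passed + failed
    return failed / total if total else 0.0

def expectations_over_threshold(rows, threshold: float = 0.05):
    """Return (name, rate) pairs whose failure rate is strictly above the threshold."""
    flagged = []
    for row in rows:
        rate = failure_rate(row["passed_records"], row["failed_records"])
        if rate > threshold:
            flagged.append((row["name"], rate))
    return flagged

alerts = expectations_over_threshold(aggregated)
for name, rate in alerts:
    print(f"{name}: {rate:.1%} of records failed")
```

Guarding against a zero total matters because an expectation that has not evaluated any record yet would otherwise raise a division error.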

### Plotting failed records per expectation

In [0]:
import plotly.express as px
expectations_metrics = spark.sql("select sum(expectations.failed_records) as failed_records, sum(expectations.passed_records) as passed_records, expectations.name from cdc_dlt_expectations group by expectations.name").toPandas()
px.bar(expectations_metrics, x="name", y=["passed_records", "failed_records"], title="DLT expectations metrics")

### What's next?

We now have our data ready to be used for more advanced analysis.

We can start creating our first <a dbdemos-dashboard-id="dlt-expectations" href='/sql/dashboardsv3/01eff4833fe2191d9831b6043b589d66'  target="_blank">DBSQL Dashboard</a> monitoring our data quality & DLT pipeline health.