
# SDP pipeline log analysis

<img style="float:right" width="500" src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/retail/lakehouse-churn/lakehouse-retail-c360-dashboard-dlt-stat.png?raw=true">


Each SDP pipeline saves events and expectation metrics to an event log table in the storage location defined on the pipeline. From this table we can see what is happening in the pipeline and the quality of the data passing through it.

You can query the expectations directly as a SQL table with Databricks SQL to track your expectation metrics and send alerts as required.

This notebook extracts and analyses expectation metrics to build such KPIs.


## Accessing Spark Declarative Pipelines events with Unity Catalog


In [0]:
SELECT * FROM demos.dbdemos_retail_c360.dbdemos_retail_c360_event_logs


## Analyzing event log table structure

The `details` column contains metadata about each Event sent to the Event Log. Its fields depend on the Event type. Some examples include:
* `user_action` Events occur when taking actions like creating the pipeline
* `flow_definition` Events occur when a pipeline is deployed or updated and have lineage, schema, and execution plan information
  * `output_dataset` and `input_datasets` - output table/view and its upstream table(s)/view(s)
  * `flow_type` - whether this is a complete or append flow
  * `explain_text` - the Spark explain plan
* `flow_progress` Events occur when a data flow starts running or finishes processing a batch of data
  * `metrics` - currently contains `num_output_rows`
  * `data_quality` - contains an array of the results of the data quality rules for this particular dataset
    * `dropped_records`
    * `expectations`
      * `name`, `dataset`, `passed_records`, `failed_records`
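
To make the nesting above concrete, here is a minimal Python sketch that parses a hand-written `details` payload and flattens its expectations into one record each. Only the field names come from the list above; the sample values (`churn_users`, `valid_id`, the counts) and the `flatten_expectations` helper are invented for illustration and are not taken from a real event log.

```python
import json

# Hypothetical `details` value for one flow_progress event.
# Field names follow the structure described above; the values are made up.
sample_details = json.dumps({
    "flow_progress": {
        "metrics": {"num_output_rows": 100},
        "data_quality": {
            "dropped_records": 5,
            "expectations": [
                {"name": "valid_id", "dataset": "churn_users",
                 "passed_records": 95, "failed_records": 5},
            ],
        },
    }
})


def flatten_expectations(details_json):
    """Return one dict per expectation, or an empty list for other event types."""
    details = json.loads(details_json)
    flow_progress = details.get("flow_progress") or {}
    data_quality = flow_progress.get("data_quality") or {}
    return [
        {
            "dataset": e["dataset"],
            "name": e["name"],
            "passed_records": e["passed_records"],
            "failed_records": e["failed_records"],
        }
        for e in data_quality.get("expectations", [])
    ]


rows = flatten_expectations(sample_details)
```

Events of other types, such as `user_action`, have no `flow_progress` key, so the helper returns an empty list for them.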
  

In [0]:
SELECT
  details:flow_definition.output_dataset,
  details:flow_definition.input_datasets,
  details:flow_definition.flow_type,
  details:flow_definition.schema,
  details:flow_definition
FROM demos.dbdemos_retail_c360.dbdemos_retail_c360_event_logs
WHERE details:flow_definition IS NOT NULL
ORDER BY timestamp

In [0]:
SELECT
  id,
  expectations.dataset,
  expectations.name,
  expectations.failed_records,
  expectations.passed_records
FROM (
  SELECT
    id,
    timestamp,
    details:flow_progress.metrics,
    details:flow_progress.data_quality.dropped_records,
    -- Parse the expectations JSON array and emit one row per expectation
    explode(from_json(details:flow_progress.data_quality.expectations,
             schema_of_json("[{'name':'str', 'dataset':'str', 'passed_records':42, 'failed_records':42}]"))) expectations
  FROM demos.dbdemos_retail_c360.dbdemos_retail_c360_event_logs
  -- Keep only flow_progress events that report metrics
  WHERE details:flow_progress.metrics IS NOT NULL) data_quality
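
Once expectations are flattened into rows shaped like the output of the query above, a KPI such as the failure rate per expectation is a simple aggregation. The sketch below shows one possible way to compute it in plain Python. The rows are invented, and the `failure_rates` helper and the 10% alert threshold are hypothetical choices, not part of the demo.

```python
from collections import defaultdict

# Invented rows in the shape returned by the query above.
rows = [
    {"dataset": "churn_users", "name": "valid_id",
     "passed_records": 95, "failed_records": 5},
    {"dataset": "churn_users", "name": "valid_id",
     "passed_records": 80, "failed_records": 20},
    {"dataset": "churn_orders", "name": "valid_amount",
     "passed_records": 50, "failed_records": 0},
]


def failure_rates(rows):
    """Aggregate passed/failed counts per (dataset, expectation) and return the failure rate."""
    totals = defaultdict(lambda: [0, 0])  # (dataset, name) -> [passed, failed]
    for r in rows:
        key = (r["dataset"], r["name"])
        totals[key][0] += r["passed_records"]
        totals[key][1] += r["failed_records"]
    return {
        key: (failed / (passed + failed) if passed + failed else 0.0)
        for key, (passed, failed) in totals.items()
    }


rates = failure_rates(rows)

# Hypothetical threshold: flag expectations failing on more than 10% of records.
ALERT_THRESHOLD = 0.10
to_alert = sorted(key for key, rate in rates.items() if rate > ALERT_THRESHOLD)
```

In a Databricks SQL alert the same logic would typically be a `GROUP BY` over the flattened table with a threshold condition on the ratio.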


## That's it! Our data quality metrics are ready! 

Our table is now ready to be queried using DBSQL. Open the <a dbdemos-dashboard-id="sdp-quality-stat" href='/sql/dashboardsv3/01f0f16ad89a1a239643fd28b74e8521' target="_blank">Data Quality Dashboard</a>