<div style="display: flex; justify-content: space-between; align-items: center; padding: 8px 16px; background: #F8F9FA; border-bottom: 2px solid #E0E0E0; margin: 0; line-height: 1;">
    <div style="font-size: 14px; color: #666;">
        <span style="font-weight: bold; color: #333;">{SOURCE_PLATFORM} → Databricks Migration</span>
        <span style="margin-left: 8px; color: #999;">|</span>
        <span style="margin-left: 8px;">04 - Activate</span>
    </div>
    <div style="display: flex; align-items: center; gap: 8px;">
        <img src="https://cdn.simpleicons.org/snowflake/29B5E8" width="24" height="24"/>
        <span style="color: #999; font-size: 16px;">→</span>
        <img src="https://cdn.simpleicons.org/databricks/FF3621" width="24" height="24"/>
    </div>
</div>


<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>

# Observability and Monitoring

## Overview

Observability is essential for maintaining confidence in your migrated workloads. This lesson covers Databricks' native monitoring capabilities, with **Lakehouse Monitoring** as the primary solution for data quality and drift detection. You'll learn to build operational dashboards, configure alerts, and integrate with external monitoring systems.

## Learning Objectives

By the end of this lesson, you will be able to:
- Configure Lakehouse Monitoring for data quality tracking
- Use System Tables for usage, billing, and audit insights
- Build operational dashboards in Databricks SQL
- Set up alerts and notifications for pipeline issues
- Integrate with external observability tools (CloudWatch, Azure Monitor, Splunk)

## Databricks Observability Stack

Databricks provides a comprehensive observability stack built into the platform. **Lakehouse Monitoring** is the go-to solution for monitoring data quality across your Lakehouse.

<br />
<div class="mermaid">
flowchart TB
    subgraph PLATFORM["<b>Databricks Observability Stack</b>"]
        LM["<b>Lakehouse Monitoring</b><br/><i>Data quality, drift,<br/>statistical profiles</i>"]
        ST["<b>System Tables</b><br/><i>Billing, audit logs,<br/>query history</i>"]
        DLT["<b>DLT Event Logs</b><br/><i>Pipeline runs,<br/>expectations, metrics</i>"]
        DASH["<b>SQL Dashboards</b><br/><i>Operational views,<br/>real-time metrics</i>"]
    end
    subgraph EXTERNAL["<b>External Integrations</b>"]
        CW["CloudWatch"]
        AZ["Azure Monitor"]
        SP["Splunk"]
        DD["Datadog"]
    end
    LM --> DASH
    ST --> DASH
    DLT --> DASH
    DASH --> EXTERNAL
    style PLATFORM fill:#fff,stroke:#FF3621,stroke-width:2px
    style LM fill:#e8f5e9,stroke:#4caf50
    style ST fill:#e3f2fd,stroke:#1976d2
    style DLT fill:#fff3e0,stroke:#ff9800
    style DASH fill:#fce4ec,stroke:#e91e63
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### Observability Components

| Component | Purpose | Key Features |
|-----------|---------|-------------|
| **Lakehouse Monitoring** | Data quality & drift | Statistical profiles, anomaly detection, automated alerts |
| **System Tables** | Platform telemetry | Billing usage, audit logs, query history, lineage |
| **DLT Event Logs** | Pipeline observability | Run metrics, expectation results, data flow events |
| **SQL Dashboards** | Visualization | Real-time dashboards, scheduled reports, embedded analytics |

## Lakehouse Monitoring

**Lakehouse Monitoring** is Databricks' built-in solution for monitoring data quality. It profiles your tables, tracks drift between time windows, and writes the results to Delta tables that you can query and alert on, which makes it well suited to post-migration monitoring.

<div style="border-left: 4px solid #4caf50; background: #e8f5e9; padding: 16px 20px; border-radius: 4px; margin: 16px 0;">
    <div style="display: flex; align-items: flex-start; gap: 12px;">
        <span style="font-size: 24px;">✓</span>
        <div>
            <strong style="color: #2e7d32; font-size: 1.1em;">Recommended: Lakehouse Monitoring</strong>
            <p style="margin: 8px 0 0 0; color: #333;">
                For migration validation and ongoing data quality, Lakehouse Monitoring provides automated profiling, drift detection, and anomaly alerts with minimal configuration. Enable it on your Silver and Gold tables immediately after migration.
            </p>
        </div>
    </div>
</div>

### Enable Lakehouse Monitoring

<div class="code-block" data-language="python">
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorTimeSeries, MonitorSnapshot

w = WorkspaceClient()

# Create a snapshot monitor for a static table
monitor = w.quality_monitors.create(
    table_name="catalog.silver.customers",
    assets_dir="/Workspace/Users/me/monitors",
    output_schema_name="catalog.monitoring",
    snapshot=MonitorSnapshot()
)

# Create a time-series monitor for incremental data
ts_monitor = w.quality_monitors.create(
    table_name="catalog.silver.orders",
    assets_dir="/Workspace/Users/me/monitors",
    output_schema_name="catalog.monitoring",
    time_series=MonitorTimeSeries(
        timestamp_col="order_date",
        granularities=["1 day"]
    )
)

print(f"Monitor created: {monitor.table_name}")
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-python.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'python';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

### Monitor Types

| Monitor Type | Use Case | Timestamp Required |
|--------------|----------|--------------------|
| **Snapshot** | Static/dimension tables | No |
| **Time Series** | Incremental/fact tables | Yes |
| **Inference** | ML model predictions | Yes (+ prediction col) |
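
The rules in the table above can be expressed as a small selection helper when you have many migrated tables to cover. This is a minimal sketch, not part of the Databricks SDK: `TableInfo`, `plan_monitor`, and the example table names are hypothetical, and the metric table names assume the `<table>_profile_metrics` / `<table>_drift_metrics` convention used by the queries in this lesson.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    """Minimal description of a table to monitor (hypothetical helper type)."""
    full_name: str                        # catalog.schema.table
    timestamp_col: Optional[str] = None   # set for incremental/fact tables
    prediction_col: Optional[str] = None  # set for ML inference tables


def plan_monitor(table: TableInfo, output_schema: str) -> dict:
    """Pick a monitor type using the rules from the Monitor Types table."""
    if table.prediction_col and table.timestamp_col:
        monitor_type = "inference"
    elif table.timestamp_col:
        monitor_type = "time_series"
    else:
        monitor_type = "snapshot"

    base = table.full_name.split(".")[-1]
    return {
        "table_name": table.full_name,
        "monitor_type": monitor_type,
        "timestamp_col": table.timestamp_col,
        "profile_table": f"{output_schema}.{base}_profile_metrics",
        "drift_table": f"{output_schema}.{base}_drift_metrics",
    }


plans = [
    plan_monitor(TableInfo("catalog.silver.customers"), "catalog.monitoring"),
    plan_monitor(TableInfo("catalog.silver.orders", timestamp_col="order_date"), "catalog.monitoring"),
]
for plan in plans:
    print(plan["table_name"], "->", plan["monitor_type"])
```

The resulting plan can then drive the actual monitor creation calls, so the choice of monitor type stays in one reviewable place.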

### What Lakehouse Monitoring Tracks

- **Profile metrics**: Row counts, null percentages, distinct counts, min/max values
- **Statistical metrics**: Mean, stddev, quantiles for numeric columns
- **Drift detection**: Changes in distributions between time windows
- **Anomaly detection**: Automatic identification of unusual patterns
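
To make these metric families concrete, the sketch below computes a few profile metrics for one numeric column and then compares two consecutive windows. It is a plain-Python illustration of the idea, not how Lakehouse Monitoring is implemented; the function names and sample values are made up for the example.

```python
import statistics
from typing import Optional, Sequence


def profile_column(values: Sequence[Optional[float]]) -> dict:
    """Compute basic profile metrics for one numeric column in one time window."""
    non_null = [v for v in values if v is not None]
    num_nulls = len(values) - len(non_null)
    return {
        "count": len(values),
        "num_nulls": num_nulls,
        "percent_null": 100.0 * num_nulls / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "avg": statistics.fmean(non_null) if non_null else None,
        # Sample standard deviation needs at least two values
        "stddev": statistics.stdev(non_null) if len(non_null) > 1 else None,
    }


def drift_between(previous: dict, current: dict) -> dict:
    """Compare two consecutive windows; each delta is current minus previous."""
    both_have_avg = previous["avg"] is not None and current["avg"] is not None
    return {
        "count_delta": current["count"] - previous["count"],
        "avg_delta": current["avg"] - previous["avg"] if both_have_avg else None,
        "percent_null_delta": current["percent_null"] - previous["percent_null"],
    }


yesterday = profile_column([10.0, 12.0, 11.0, 13.0])
today = profile_column([10.0, None, 30.0, None])
print(drift_between(yesterday, today))
```

In this sample the row count is unchanged, yet the null percentage and the average both shift sharply, which is the kind of change a row-count check alone would miss.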

### Query Monitor Results

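Lakehouse Monitoring writes its results to Delta tables in the monitor's output schema: a profile metrics table and a drift metrics table for each monitored table. Query these tables to build custom dashboards or feed alerting systems.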

<div class="code-block" data-language="sql">
-- View profile metrics for a monitored table
SELECT 
    window.start AS window_start,
    column_name,
    count,
    num_nulls,
    distinct_count,
    avg,
    stddev
FROM catalog.monitoring.customers_profile_metrics
WHERE window.start >= CURRENT_DATE - INTERVAL 7 DAYS
ORDER BY window.start DESC, column_name;

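-- Check for drift between consecutive windows (largest shifts first)
-- The drift table stores deltas and distance metrics; apply your own thresholds
SELECT 
    window.start AS window_start,
    column_name,
    drift_type,
    count_delta,
    avg_delta,
    percent_null_delta,
    js_distance
FROM catalog.monitoring.customers_drift_metrics
WHERE drift_type = 'CONSECUTIVE'
ORDER BY window.start DESC, js_distance DESC;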

-- Anomaly detection results
SELECT 
    window.start,
    column_name,
    metric_name,
    observed_value,
    expected_value,
    anomaly_score
FROM catalog.monitoring.customers_anomaly_metrics
WHERE anomaly_score > 0.9
ORDER BY anomaly_score DESC;
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## System Tables

Unity Catalog system tables provide platform-wide telemetry for billing, auditing, and usage analysis. They live in the `system` catalog and can be queried with standard SQL, which makes them a convenient source for operational dashboards and alerts.

| System Table | Content | Retention |
|--------------|---------|----------|
| `system.billing.usage` | DBU consumption by SKU | 365 days |
| `system.access.audit` | API calls, data access events | 365 days |
| `system.compute.clusters` | Cluster configurations, events | 365 days |
| `system.query.history` | SQL query execution details | 30 days |
| `system.lakeflow.job_run_timeline` | Job run details and metrics | 365 days |

### Query System Tables

<div class="code-block" data-language="sql">
-- Daily DBU usage by SKU
SELECT 
    DATE(usage_date) AS date,
    sku_name,
    SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY DATE(usage_date), sku_name
ORDER BY date DESC, total_dbus DESC;

-- Query performance analysis
SELECT 
    DATE(start_time) AS query_date,
    COUNT(*) AS query_count,
    AVG(total_duration_ms) / 1000 AS avg_duration_sec,
    MAX(total_duration_ms) / 1000 AS max_duration_sec,
    SUM(produced_rows) AS total_rows
FROM system.query.history
WHERE start_time >= CURRENT_DATE - INTERVAL 7 DAYS
GROUP BY DATE(start_time)
ORDER BY query_date DESC;

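-- Failed job runs (job names come from system.lakeflow.jobs)
WITH latest_jobs AS (
    SELECT workspace_id, job_id, name AS job_name
    FROM system.lakeflow.jobs
    QUALIFY ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) = 1
)
SELECT 
    j.job_name,
    r.run_id,
    r.period_start_time,
    r.period_end_time,
    r.result_state,
    r.termination_code
FROM system.lakeflow.job_run_timeline r
LEFT JOIN latest_jobs j
  ON r.workspace_id = j.workspace_id
 AND r.job_id = j.job_id
WHERE r.result_state = 'FAILED'
  AND r.period_start_time >= CURRENT_DATE - INTERVAL 7 DAYS
ORDER BY r.period_start_time DESC;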
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>
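
Once the daily DBU query is in place, a simple baseline comparison can flag unusual cost days after migration. The sketch below assumes rows shaped like that query's output (`usage_date`, `sku_name`, `total_dbus`); the helper names, the 1.5x factor, and the sample rows (including the SKU labels) are illustrative placeholders rather than real billing data.

```python
from collections import defaultdict
from datetime import date


def daily_totals(rows):
    """Sum DBUs per day across SKUs. Each row is (usage_date, sku_name, total_dbus)."""
    totals = defaultdict(float)
    for usage_date, _sku, dbus in rows:
        totals[usage_date] += dbus
    return dict(totals)


def find_spikes(totals, factor=1.5):
    """Return days whose total exceeds `factor` times the mean of all earlier days."""
    spikes = []
    days = sorted(totals)
    for i, day in enumerate(days[1:], start=1):
        baseline = sum(totals[d] for d in days[:i]) / i
        if baseline > 0 and totals[day] > factor * baseline:
            spikes.append(day)
    return spikes


# Made-up sample rows for illustration only
rows = [
    (date(2025, 1, 1), "JOBS_COMPUTE", 100.0),
    (date(2025, 1, 1), "SQL_COMPUTE", 50.0),
    (date(2025, 1, 2), "JOBS_COMPUTE", 110.0),
    (date(2025, 1, 2), "SQL_COMPUTE", 40.0),
    (date(2025, 1, 3), "JOBS_COMPUTE", 300.0),
    (date(2025, 1, 3), "SQL_COMPUTE", 60.0),
]
print(find_spikes(daily_totals(rows)))
```

A running mean is a deliberately simple baseline; weekly seasonality or planned backfills would need a more careful comparison window.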

## Alerting and Notifications

Configure alerts to proactively detect issues. Databricks supports alerts on SQL query results with multiple notification destinations.

<br />
<div class="mermaid">
flowchart LR
    QUERY["SQL Query"] --> ALERT["Alert Condition"]
    ALERT --> |Threshold Met| NOTIFY{"Notification"}
    NOTIFY --> EMAIL["Email"]
    NOTIFY --> SLACK["Slack"]
    NOTIFY --> PAGER["PagerDuty"]
    NOTIFY --> WEBHOOK["Webhook"]
    style QUERY fill:#e3f2fd,stroke:#1976d2
    style ALERT fill:#fff3e0,stroke:#ff9800
    style NOTIFY fill:#fce4ec,stroke:#e91e63
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>
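
The flow in the diagram can be sketched in a few lines: evaluate a condition over query result rows, then fan the message out to every configured destination. This is a conceptual illustration only; Databricks SQL alerts handle this for you, and the `evaluate_alert` and `notify_all` helpers, the in-memory "destinations", and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Notifier = Callable[[str], None]


def evaluate_alert(rows: List[Row], condition: Callable[[Row], bool]) -> List[Row]:
    """Return the rows that satisfy the alert condition."""
    return [row for row in rows if condition(row)]


def notify_all(message: str, destinations: Dict[str, Notifier]) -> List[str]:
    """Send the message to every destination; return the names that failed."""
    failed = []
    for name, send in destinations.items():
        try:
            send(message)
        except Exception:
            # One broken destination should not block the others
            failed.append(name)
    return failed


sent = []
destinations = {
    "email": lambda msg: sent.append(("email", msg)),
    "slack": lambda msg: sent.append(("slack", msg)),
}
rows = [
    {"job_name": "daily_etl_pipeline", "minutes_late": 42},
    {"job_name": "hourly_sync", "minutes_late": 0},
]
breaches = evaluate_alert(rows, lambda r: r["minutes_late"] > 0)
if breaches:
    failed = notify_all(f"{len(breaches)} SLA breach(es) detected", destinations)
    print("failed destinations:", failed)
```

Collecting failed destinations instead of raising on the first error keeps a misconfigured channel from silencing the rest.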

### Alert Configuration Examples

<div class="code-block" data-language="sql">
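-- Alert: Pipeline SLA breach (query returns rows when SLA violated)
-- job_run_timeline carries no job name and splits long runs into several rows,
-- so resolve the name from system.lakeflow.jobs and collapse rows per run.
WITH latest_jobs AS (
    SELECT workspace_id, job_id, name AS job_name
    FROM system.lakeflow.jobs
    QUALIFY ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) = 1
),
runs AS (
    SELECT 
        workspace_id,
        job_id,
        run_id,
        MIN(period_start_time) AS start_time,
        MAX(period_end_time) AS actual_end_time,
        DATE_TRUNC('HOUR', MIN(period_start_time)) + INTERVAL 2 HOURS AS expected_end_time
    FROM system.lakeflow.job_run_timeline
    GROUP BY workspace_id, job_id, run_id
)
SELECT 
    j.job_name,
    r.run_id,
    DATEDIFF(MINUTE, r.expected_end_time, r.actual_end_time) AS minutes_late
FROM runs r
JOIN latest_jobs j
  ON r.workspace_id = j.workspace_id
 AND r.job_id = j.job_id
WHERE j.job_name = 'daily_etl_pipeline'
  AND DATE(r.start_time) = CURRENT_DATE
  AND r.actual_end_time > r.expected_end_time;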

-- Alert: Data quality anomaly detected
SELECT 
    table_name,
    column_name,
    anomaly_score,
    observed_value,
    expected_value
FROM catalog.monitoring.all_tables_anomaly_metrics
WHERE window.start >= CURRENT_TIMESTAMP - INTERVAL 1 HOUR
  AND anomaly_score > 0.95;

-- Alert: Row count dropped significantly
SELECT 
    today.table_name,
    today.row_count AS current_count,
    yesterday.row_count AS previous_count,
    (yesterday.row_count - today.row_count) * 100.0 / NULLIF(yesterday.row_count, 0) AS pct_decrease
FROM daily_row_counts today
JOIN daily_row_counts yesterday 
  ON today.table_name = yesterday.table_name
 AND today.count_date = yesterday.count_date + INTERVAL 1 DAY
WHERE (yesterday.row_count - today.row_count) * 100.0 / NULLIF(yesterday.row_count, 0) > 10;
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>
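
The arithmetic behind the row-count alert is worth checking by hand before trusting the threshold. The sketch below mirrors the percentage calculation and the 10% threshold from the query above; the function names are hypothetical, and returning `None` for an empty previous day is one possible choice for handling a zero denominator.

```python
from typing import Optional


def pct_decrease(previous_count: int, current_count: int) -> Optional[float]:
    """Percentage drop from previous to current; None when previous is zero."""
    if previous_count == 0:
        return None
    return (previous_count - current_count) * 100.0 / previous_count


def should_alert(previous_count: int, current_count: int, threshold_pct: float = 10.0) -> bool:
    """True when the drop is defined and strictly greater than the threshold."""
    drop = pct_decrease(previous_count, current_count)
    return drop is not None and drop > threshold_pct


print(pct_decrease(150000, 120000))  # 20.0
print(should_alert(150000, 120000))  # True
print(should_alert(150000, 140000))  # False
```

A growing table yields a negative percentage, so increases never trigger this alert; a separate check would be needed to catch unexpected duplication.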

## External Observability Integration

Integrate Databricks with your existing monitoring infrastructure for unified observability.

| Platform | Integration Method | Use Cases |
|----------|-------------------|----------|
| **AWS CloudWatch** | Cluster log delivery, custom metrics | AWS-native monitoring, log aggregation |
| **Azure Monitor** | Diagnostic settings, Log Analytics | Azure-native monitoring, Sentinel integration |
| **Splunk** | HEC (HTTP Event Collector) | Enterprise log management, SIEM |
| **Datadog** | Agent integration, custom metrics | APM, infrastructure monitoring |
| **PagerDuty** | Alert destination | Incident management |
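
As an illustration of the Splunk row, the sketch below builds an HTTP Event Collector request using only the Python standard library. It constructs the request without sending it. The host, token, `sourcetype` value, and event fields are placeholders, and `build_hec_request` is a hypothetical helper; in practice the token would come from a secret scope rather than source code.

```python
import json
import time
import urllib.request


def build_hec_request(base_url: str, token: str, event: dict,
                      sourcetype: str = "databricks:pipeline") -> urllib.request.Request:
    """Build (but do not send) an HTTP Event Collector request."""
    payload = {
        "time": time.time(),       # event timestamp in epoch seconds
        "sourcetype": sourcetype,  # placeholder sourcetype for this example
        "event": event,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_hec_request(
    base_url="https://splunk.example.com:8088",
    token="00000000-0000-0000-0000-000000000000",
    event={"pipeline": "daily_customer_sync", "status": "FAILED", "run_duration_sec": 245},
)
print(request.full_url)
# To send: urllib.request.urlopen(request, timeout=10)
```

Separating request construction from sending makes the payload easy to unit test without network access.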

### CloudWatch Integration Example

<div class="code-block" data-language="python">
import boto3

# Send custom metrics to CloudWatch
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

def publish_pipeline_metrics(pipeline_name, run_duration_sec, records_processed):
    cloudwatch.put_metric_data(
        Namespace='Databricks/Migration',
        MetricData=[
            {
                'MetricName': 'PipelineRunDuration',
                'Dimensions': [{'Name': 'Pipeline', 'Value': pipeline_name}],
                'Value': run_duration_sec,
                'Unit': 'Seconds'
            },
            {
                'MetricName': 'RecordsProcessed',
                'Dimensions': [{'Name': 'Pipeline', 'Value': pipeline_name}],
                'Value': records_processed,
                'Unit': 'Count'
            }
        ]
    )

# Usage in pipeline
publish_pipeline_metrics(
    pipeline_name='daily_customer_sync',
    run_duration_sec=245,
    records_processed=150000
)
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-python.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'python';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>
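
When a pipeline emits many metrics, it can help to build the `MetricData` entries from a simple mapping and send them in batches. The sketch below only prepares the payloads, so it runs without AWS credentials; `build_metric_data`, `chunked`, and the batch size of 20 are assumptions for illustration, and the actual per-request limit should be checked against current CloudWatch quotas.

```python
from typing import Dict, Iterator, List, Tuple


def build_metric_data(pipeline_name: str, metrics: Dict[str, Tuple[float, str]]) -> List[dict]:
    """Turn {metric_name: (value, unit)} into entries shaped like MetricData items."""
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "Pipeline", "Value": pipeline_name}],
            "Value": value,
            "Unit": unit,
        }
        for name, (value, unit) in metrics.items()
    ]


def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


metric_data = build_metric_data(
    "daily_customer_sync",
    {"PipelineRunDuration": (245, "Seconds"), "RecordsProcessed": (150000, "Count")},
)
batches = list(chunked(metric_data, 20))  # batch size is an assumed value
print(len(metric_data), "metrics in", len(batches), "batch(es)")
# Each batch could then be passed to:
#   cloudwatch.put_metric_data(Namespace='Databricks/Migration', MetricData=batch)
```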

## Summary

### Observability Checklist

- [ ] Enable Lakehouse Monitoring on Silver and Gold tables
- [ ] Configure time-series monitors for incremental tables
- [ ] Set up System Tables queries for operational insights
- [ ] Create SQL dashboards for key metrics
- [ ] Configure alerts for SLA breaches and data anomalies
- [ ] Integrate with external monitoring (CloudWatch, Azure Monitor, etc.)

### Key Takeaways

| Component | Action |
|-----------|--------|
| **Lakehouse Monitoring** | Enable on all production tables; configure drift detection |
| **System Tables** | Query for billing, audit, and performance insights |
| **Alerts** | Set up for pipeline failures, data quality issues, SLA breaches |
| **Dashboards** | Build operational views for ongoing visibility |

### Next Steps

With monitoring in place, plan the production cutover:

- [**4.3 - Cutover Execution**]($./4.3 Cutover Execution) - Blue-green, canary, and other cutover strategies

<div style="color: #FF3621; font-weight: bold; font-size: 2em; margin-bottom: 12px;">COURSE DEVELOPER (remove before publishing)</div>

### Template Customization

**Placeholders to replace:**
- `{SOURCE_PLATFORM}` - Source platform name

**Platform-specific additions:**
- Add equivalent monitoring features from source platform
- Document migration of existing alerts/dashboards
- Include cloud-specific integration details

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>
