<div style="display: flex; justify-content: space-between; align-items: center; padding: 8px 16px; background: #F8F9FA; border-bottom: 2px solid #E0E0E0; margin: 0; line-height: 1;">
    <div style="font-size: 14px; color: #666;">
        <span style="font-weight: bold; color: #333;">{SOURCE_PLATFORM} → Databricks Migration</span>
        <span style="margin-left: 8px; color: #999;">|</span>
        <span style="margin-left: 8px;">05 - Enable</span>
    </div>
    <div style="display: flex; align-items: center; gap: 8px;">
        <img src="https://cdn.simpleicons.org/snowflake/29B5E8" width="24" height="24"/>
        <span style="color: #999; font-size: 16px;">→</span>
        <img src="https://cdn.simpleicons.org/databricks/FF3621" width="24" height="24"/>
    </div>
</div>

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>

# Platform Operations and Cost Management

## Overview

**Critical Context:** Earlier modules focused on *building* your Databricks platform - migrating data, converting code, and establishing governance. This module shifts focus to *operationalizing* and *scaling* the platform for ongoing production use.

Post-migration success requires robust operational practices: optimizing compute for production concurrency, implementing cost controls and attribution, integrating observability tooling, establishing incident response workflows, and managing infrastructure as code. These capabilities transform a successful migration into a sustainable, production-grade data platform.

## Learning Objectives

By the end of this lesson, you will be able to:
- Configure and tune Serverless SQL Warehouses for production concurrency
- Implement billing tags and DBU cost attribution using `system.billing.usage`
- Integrate Databricks with enterprise observability platforms (CloudWatch, Azure Monitor, Splunk)
- Design alert policies and incident response workflows
- Manage platform infrastructure as code using Terraform and Databricks Asset Bundles

## Production Compute Optimization

### Serverless SQL Warehouse Tuning

Serverless SQL Warehouses provide instant startup, automatic scaling, and minimal operational overhead - ideal for production analytics workloads. However, achieving optimal cost and performance requires understanding configuration options and scaling behavior.

<br />
<div class="mermaid">
graph LR
    subgraph "User Request Lifecycle"
        A[Query Submitted] --> B{Warehouse<br/>Running?}
        B -->|No| C[Instant Startup<br/>&lt;10s]
        B -->|Yes| D[Route to Cluster]
        C --> D
        D --> E{Capacity<br/>Available?}
        E -->|Yes| F[Execute Query]
        E -->|No| G[Auto-Scale<br/>Add Cluster]
        G --> F
        F --> H[Return Results]
    end
    style C fill:#E3F2FD
    style G fill:#FFF3E0
    style F fill:#E8F5E9
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### Key Configuration Parameters

| Parameter | Purpose | Production Considerations |
|-----------|---------|---------------------------|
| **Warehouse Size** | Base compute capacity (2X-Small to 4X-Large) | Start with Medium; monitor queue time and scaling events; upsize if the warehouse is consistently saturated |
| **Max Clusters** | Auto-scaling limit (1-10 for Serverless) | Set to 3-5x base capacity for BI workloads; monitor peak concurrency patterns |
| **Auto-stop** | Idle timeout before shutdown | 10-15 minutes for production workloads; balance cost vs. startup delay |
| **Scaling Policy** | How aggressively to scale (Economy, Standard, Performance) | Use Standard for predictable workloads; Performance for low-latency SLAs |
| **Catalog** | Default Unity Catalog for queries | Align with governance model; users can override with USE CATALOG |
| **Query History** | Retention for query logs | Maximum retention for compliance and troubleshooting |
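| **Spot Instance Policy** | Use spot/preemptible instances (classic and pro warehouses only; serverless compute is managed by Databricks) | Enable for non-critical classic or pro workloads to reduce cost; actual savings vary with spot pricing and interruption rates |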

<div style="background-color: #E3F2FD; border-left: 4px solid #2196F3; padding: 12px 16px; margin: 16px 0;">
<strong>üí° Recommendation: Right-Sizing Strategy</strong><br/><br/>
Start conservative (Small/Medium, max clusters = 2-3) and scale based on observed metrics. Over-provisioning wastes budget; under-provisioning creates user frustration. Monitor these signals:
<ul>
<li><strong>Query Queue Time</strong>: Sustained queuing indicates insufficient capacity</li>
<li><strong>Cluster Utilization</strong>: Target 60-80% average utilization during business hours</li>
<li><strong>Auto-scaling Events</strong>: Frequent scaling suggests base size too small or max clusters too low</li>
<li><strong>Query Duration P95/P99</strong>: Percentile latencies reveal user experience degradation</li>
</ul>
</div>
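
The signals above can be folded into a simple review heuristic. The sketch below is illustrative only: `WarehouseSignals`, `suggest_action`, and the queue and scaling cut-offs are assumptions made for this example (only the 60-80% utilization target comes from the text), and none of them are Databricks APIs.

```python
from dataclasses import dataclass


@dataclass
class WarehouseSignals:
    """Hypothetical summary of one warehouse over a review window."""
    avg_queue_seconds: float      # average time queries spent queued
    avg_utilization: float        # 0.0-1.0 during business hours
    scale_events_per_hour: float  # how often clusters were added


def suggest_action(s, queue_limit_s=5.0, scale_limit=6.0):
    """Map observed signals to a coarse sizing suggestion."""
    if s.avg_queue_seconds > queue_limit_s:
        # Sustained queuing: capacity is insufficient
        return "increase max clusters or warehouse size"
    if s.scale_events_per_hour > scale_limit:
        # Constant scale-out suggests the base size is too small
        return "increase base warehouse size"
    if s.avg_utilization < 0.60:
        return "consider downsizing"
    if s.avg_utilization > 0.80:
        return "watch closely; nearing saturation"
    return "no change"


print(suggest_action(WarehouseSignals(0.4, 0.45, 1.0)))  # consider downsizing
```

Checking queue time first is a deliberate choice: queuing is the signal users feel directly, so it should take priority over utilization-based savings.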

### Concurrency Patterns and Anti-Patterns

**High-Concurrency BI Workloads**

| Pattern | Description | Configuration |
|---------|-------------|---------------|
| **Dashboard Refresh** | 50-200 concurrent queries during business hours; predictable schedule | Medium/Large warehouse, max clusters = 5-8, Standard scaling policy |
| **Ad-Hoc Exploration** | 10-30 concurrent users; unpredictable query patterns | Medium warehouse, max clusters = 3-5, Economy scaling for cost control |
| **Executive Reporting** | Low concurrency (&lt;10), low-latency SLA (&lt;5s) | Small/Medium warehouse, max clusters = 2, Performance scaling, dedicated warehouse |
| **Embedded Analytics** | Customer-facing dashboards; variable concurrency; uptime SLA | Large warehouse, max clusters = 8-10, Performance scaling, always-on (no auto-stop) |

**Anti-Patterns to Avoid**

<div style="background-color: #FFEBEE; border-left: 4px solid #F44336; padding: 12px 16px; margin: 16px 0;">
<strong>⚠️ Common Mistakes</strong><br/><br/>
<ul>
<li><strong>Single Warehouse for All Workloads</strong>: Mixing ETL, BI, and ad-hoc creates resource contention and unpredictable costs. Use dedicated warehouses per workload class with appropriate tagging.</li>
<li><strong>Aggressive Auto-Stop (&lt;5 min)</strong>: Causes frequent startup delays. For production BI, 10-15 min idle timeout balances cost and UX.</li>
<li><strong>Max Clusters = 1</strong>: Eliminates concurrency benefits. Serverless is designed to scale; allow at least 2-3 clusters.</li>
<li><strong>Ignoring Query Patterns</strong>: Not all queries are equal. Segment users: power users (complex, long-running) vs. dashboards (simple, frequent).</li>
<li><strong>No Tagging Strategy</strong>: Without billing tags, cost attribution is impossible. Tag every warehouse with cost center, team, environment.</li>
</ul>
</div>
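
Several of these mistakes can be caught mechanically before a warehouse is deployed. The following is a minimal sketch, assuming warehouse settings are available as a plain dictionary; the `lint_warehouse` helper and the dictionary shape are hypothetical, with key names chosen to resemble common warehouse settings.

```python
REQUIRED_TAGS = ("cost_center", "team", "environment")


def lint_warehouse(config):
    """Return anti-pattern warnings for a warehouse config dict (hypothetical shape)."""
    warnings = []

    # An auto-stop of 0 is treated here as "always on", which is a deliberate choice
    auto_stop = config.get("auto_stop_mins")
    if auto_stop is not None and 0 < auto_stop < 5:
        warnings.append("auto-stop under 5 minutes causes frequent startup delays")

    if config.get("max_num_clusters", 1) <= 1:
        warnings.append("max clusters of 1 prevents scale-out")

    tags = config.get("tags") or {}
    missing = [t for t in REQUIRED_TAGS if t not in tags]
    if missing:
        warnings.append("missing tags: " + ", ".join(missing))

    return warnings


for warning in lint_warehouse({"auto_stop_mins": 2, "max_num_clusters": 1, "tags": {"team": "bi"}}):
    print(warning)
```

A check like this fits naturally in a CI step that runs before infrastructure changes are applied.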

### Monitoring Serverless Performance

Use Databricks system tables to monitor warehouse performance and identify optimization opportunities:

<div class="code-block" data-language="sql">-- Query performance metrics by warehouse
-- system.query.history identifies warehouses by compute.warehouse_id and
-- reports queue time as waiting_at_capacity_duration_ms
SELECT
  compute.warehouse_id AS warehouse_id,
  date_trunc('hour', start_time) AS hour,
  COUNT(*) AS query_count,
  ROUND(AVG(total_duration_ms) / 1000, 2) AS avg_duration_seconds,
  ROUND(PERCENTILE(total_duration_ms, 0.95) / 1000, 2) AS p95_duration_seconds,
  ROUND(PERCENTILE(total_duration_ms, 0.99) / 1000, 2) AS p99_duration_seconds,
  SUM(CASE WHEN error_message IS NOT NULL THEN 1 ELSE 0 END) AS error_count,
  ROUND(AVG(waiting_at_capacity_duration_ms) / 1000, 2) AS avg_queue_time_seconds
FROM
  system.query.history
WHERE
  start_time >= CURRENT_DATE - INTERVAL 7 DAYS
  AND compute.warehouse_id IS NOT NULL
GROUP BY
  compute.warehouse_id, date_trunc('hour', start_time)
ORDER BY
  hour DESC, query_count DESC;</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

<div class="code-block" data-language="sql">-- Identify queries with high queue times (capacity issues)
SELECT
  statement_id,
  compute.warehouse_id AS warehouse_id,
  executed_by,
  start_time,
  ROUND(waiting_at_capacity_duration_ms / 1000, 2) AS queue_time_seconds,
  ROUND(total_duration_ms / 1000, 2) AS total_duration_seconds,
  statement_type,
  LEFT(statement_text, 100) AS query_preview
FROM
  system.query.history
WHERE
  start_time >= CURRENT_DATE - INTERVAL 3 DAYS
  AND waiting_at_capacity_duration_ms > 5000  -- Queries queued >5 seconds
ORDER BY
  waiting_at_capacity_duration_ms DESC
LIMIT 50;</div>

### Classic vs. Serverless SQL Warehouses

| Consideration | Classic SQL Warehouse | Serverless SQL Warehouse |
|---------------|------------------------|--------------------------|
| **Startup Time** | 3-5 minutes (cold start) | &lt;10 seconds |
| **Scaling Speed** | 2-3 minutes to add cluster | 10-20 seconds to add capacity |
| **Infrastructure** | Runs in customer's cloud account (VPC/VNet) | Runs in Databricks-managed infrastructure |
| **Private Connectivity** | Supports Private Link, VPC peering | Limited private connectivity options ¹ |
| **Customization** | Cluster policies, init scripts, custom libraries | Limited customization; managed environment |
| **Cost Model** | Pay for uptime (idle time = wasted cost) | Billed while the warehouse is running; fast startup makes short auto-stop windows practical, which keeps idle cost low |
| **Data Access** | Direct access to customer storage via IAM/SPN | Accesses storage via secure proxy |
| **Compliance** | Data plane in customer account (regulatory advantage) | Data plane in Databricks account (simplified ops) |
| **Best For** | Strict data residency, custom networking, regulated industries | General-purpose analytics, cost optimization, developer productivity |

<div style="font-size: 0.85em; color: #555; margin-top: 8px;">
¹ Serverless SQL Warehouses support Private Link on AWS and Azure with additional configuration. Consult the Databricks documentation for the latest capabilities.
</div>

## Cost Management and Attribution

### The Challenge: Multi-Tenant Cost Allocation

Post-migration, Databricks typically serves multiple teams, cost centers, and projects on a shared infrastructure. Without proper tagging and attribution, answering "What did Team X spend last month?" becomes impossible.

<br />
<div class="mermaid">
graph TB
    subgraph "Cost Attribution Flow"
        A[Databricks Usage<br/>DBUs Consumed] --> B[system.billing.usage<br/>Raw Usage Data]
        B --> C[Tag Enrichment<br/>cost_center, team, project]
        C --> D[Cost Allocation Logic<br/>DBUs × Rate]
        D --> E[Chargeback Reports<br/>FinOps Dashboards]
        E --> F[Budget Alerts<br/>Anomaly Detection]
    end
    style B fill:#FFF3E0
    style C fill:#E3F2FD
    style D fill:#F3E5F5
    style E fill:#E8F5E9
</div>
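
The cost allocation step in the flow above (DBUs multiplied by a rate, grouped by tag) can be sketched in a few lines. This is a minimal illustration, assuming usage rows have already been read into dictionaries with `sku_name`, `usage_quantity`, and `custom_tags` keys; the per-DBU rates and the SKU names in the sample data are placeholders, not actual prices or an exhaustive SKU list.

```python
from collections import defaultdict

# Placeholder USD rates per DBU; substitute your negotiated rates.
# Checked in order, so the first matching marker wins.
RATES = {"SERVERLESS": 0.70, "ALL_PURPOSE": 0.55, "JOBS": 0.15}
DEFAULT_RATE = 0.40


def rate_for(sku_name):
    """Pick a rate by substring match on the SKU name."""
    for marker, rate in RATES.items():
        if marker in sku_name:
            return rate
    return DEFAULT_RATE


def chargeback(rows):
    """Sum estimated cost per cost center; rows without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for row in rows:
        tags = row.get("custom_tags") or {}
        cost_center = tags.get("cost_center", "untagged")
        totals[cost_center] += row["usage_quantity"] * rate_for(row["sku_name"])
    return {center: round(cost, 2) for center, cost in totals.items()}


sample = [
    {"sku_name": "ENTERPRISE_SERVERLESS_SQL_COMPUTE", "usage_quantity": 10.0,
     "custom_tags": {"cost_center": "marketing"}},
    {"sku_name": "ENTERPRISE_JOBS_COMPUTE", "usage_quantity": 100.0,
     "custom_tags": {"cost_center": "engineering"}},
    {"sku_name": "ENTERPRISE_ALL_PURPOSE_COMPUTE", "usage_quantity": 4.0,
     "custom_tags": None},
]
print(chargeback(sample))
```

Routing untagged rows to an explicit `untagged` bucket, rather than dropping them, keeps the attribution gap visible in every report.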

### Billing Tags: The Foundation of Cost Attribution

**Tag Strategy**

Every compute resource (warehouse, cluster, DLT pipeline, model endpoint) should have consistent tags:

| Tag Key | Purpose | Example Values |
|---------|---------|----------------|
| `cost_center` | Finance chargeback | `"marketing"`, `"engineering"`, `"data_science"` |
| `team` | Team-level attribution | `"growth_analytics"`, `"ml_platform"`, `"bi_reporting"` |
| `environment` | Separate dev/staging/prod costs | `"dev"`, `"staging"`, `"prod"` |
| `project` | Project-specific tracking | `"customer_360"`, `"fraud_detection"`, `"recommendation_engine"` |
| `owner` | Responsible individual/group | `"jane.doe@company.com"`, `"data-platform-team"` |
| `budget_code` | Finance system integration | `"FY26-Q1-ANALYTICS"`, `"CAPEX-2026-ML"` |

<div style="background-color: #E3F2FD; border-left: 4px solid #2196F3; padding: 12px 16px; margin: 16px 0;">
<strong>üí° Recommendation: Enforce Tags with Cluster Policies</strong><br/><br/>
Use Databricks Cluster Policies to require specific tags before users can create compute resources. This ensures complete coverage and prevents untagged spend from becoming "dark matter" in your billing data.
</div>
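
A policy that enforces tags is ultimately a JSON definition keyed by attribute path. The sketch below generates one from a tag specification. It assumes the `custom_tags.<name>` attribute path and the `allowlist` and `regex` rule types of the cluster policy definition format; the `tag_policy` helper is hypothetical, so verify the output against the current policy reference before applying it.

```python
import json


def tag_policy(required_tags):
    """Build a policy definition (JSON string) that constrains custom tags.

    A list of allowed values becomes an allowlist rule; None becomes a
    regex rule that accepts any non-empty value.
    """
    definition = {}
    for tag, allowed in required_tags.items():
        path = f"custom_tags.{tag}"
        if allowed:
            definition[path] = {"type": "allowlist", "values": list(allowed)}
        else:
            definition[path] = {"type": "regex", "pattern": ".+"}
    return json.dumps(definition, indent=2)


print(tag_policy({
    "cost_center": ["marketing", "engineering", "data_science"],
    "team": None,
    "environment": ["dev", "staging", "prod"],
}))
```

Allowlists suit tags with a small, finance-controlled vocabulary such as `cost_center`; free-form tags such as `team` are easier to maintain with a pattern rule.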

### Applying Tags to Compute Resources

**SQL Warehouse Tags** (via UI or API):

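<div class="code-block" data-language="python"># Using Databricks SDK for Python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import EndpointTagPair, EndpointTags

w = WorkspaceClient()

# The SDK expects tags as EndpointTags (a list of key/value pairs), not a plain dict
billing_tags = {
    "cost_center": "analytics",
    "team": "bi_reporting",
    "environment": "prod",
    "owner": "bi-team@company.com",
}

# Create a SQL Warehouse with tags; .result() waits until it is running
warehouse = w.warehouses.create(
    name="analytics_warehouse_prod",
    cluster_size="Medium",
    max_num_clusters=5,
    auto_stop_mins=15,
    tags=EndpointTags(
        custom_tags=[EndpointTagPair(key=k, value=v) for k, v in billing_tags.items()]
    ),
).result()</div>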
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-python.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**Cluster Tags** (via cluster configuration):

<div class="code-block" data-language="python"># Using Databricks SDK for Python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Cluster tags are a plain dict; .result() waits until the cluster is running
cluster = w.clusters.create(
    cluster_name="etl_job_cluster",
    spark_version=w.clusters.select_spark_version(latest=True),
    node_type_id=w.clusters.select_node_type(local_disk=True),
    num_workers=4,
    autotermination_minutes=30,
    custom_tags={
        "cost_center": "data_engineering",
        "team": "etl_platform",
        "environment": "prod",
        "project": "daily_aggregation"
    }
).result()</div>

### Using `system.billing.usage` for Cost Attribution

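The `system.billing.usage` table is the authoritative source for DBU consumption across the account, and it's updated daily. It records usage quantities rather than dollar amounts, so cost estimates require a rate mapping: either your negotiated rates or the list prices in `system.billing.list_prices`.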

**Table Schema Overview:**

| Column | Type | Description |
|--------|------|-------------|
| `account_id` | STRING | Databricks account identifier |
| `workspace_id` | STRING | Workspace where usage occurred |
| `sku_name` | STRING | Product SKU (e.g., `STANDARD_ALL_PURPOSE_COMPUTE`, `SQL_COMPUTE`) |
| `cloud` | STRING | Cloud provider (`AWS`, `AZURE`, `GCP`) |
| `usage_start_time` | TIMESTAMP | Start of usage period |
| `usage_end_time` | TIMESTAMP | End of usage period |
| `usage_date` | DATE | Date of usage (for partitioning) |
| `custom_tags` | MAP&lt;STRING, STRING&gt; | User-defined tags from compute resources |
| `usage_unit` | STRING | Unit of measurement (typically `DBU`) |
| `usage_quantity` | DECIMAL | Amount of DBUs consumed |
| `usage_metadata` | STRUCT | Additional context (cluster_id, warehouse_id, job_id, etc.) |
| `billing_origin_product` | STRING | Origin product (e.g., `JOBS`, `SQL`, `NOTEBOOKS`) |

<div class="code-block" data-language="sql">-- Daily cost by cost center (requires DBU rate mapping)
SELECT
  usage_date,
  custom_tags['cost_center'] AS cost_center,
  sku_name,
  SUM(usage_quantity) AS total_dbus,
  -- Apply your negotiated DBU rates here
  ROUND(SUM(usage_quantity) *
    CASE
      WHEN sku_name LIKE '%SERVERLESS%' THEN 0.70  -- Example rate
      WHEN sku_name LIKE '%ALL_PURPOSE%' THEN 0.55
      WHEN sku_name LIKE '%JOBS%' THEN 0.15
      ELSE 0.40
    END, 2) AS estimated_cost_usd
FROM
  system.billing.usage
WHERE
  usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
  AND custom_tags['cost_center'] IS NOT NULL
GROUP BY
  usage_date, cost_center, sku_name
ORDER BY
  usage_date DESC, total_dbus DESC;</div>

<div class="code-block" data-language="sql">-- Identify untagged usage (cost attribution gaps)
SELECT
  usage_date,
  sku_name,
  billing_origin_product,
  usage_metadata.cluster_id,
  usage_metadata.warehouse_id,
  SUM(usage_quantity) AS untagged_dbus
FROM
  system.billing.usage
WHERE
  usage_date >= CURRENT_DATE - INTERVAL 7 DAYS
  AND (
    custom_tags IS NULL
    OR custom_tags['cost_center'] IS NULL
    OR custom_tags['team'] IS NULL
  )
GROUP BY
  usage_date, sku_name, billing_origin_product,
  usage_metadata.cluster_id, usage_metadata.warehouse_id
ORDER BY
  untagged_dbus DESC;</div>

<div class="code-block" data-language="sql">-- Top 10 most expensive workloads by team (past 30 days)
SELECT
  custom_tags['team'] AS team,
  custom_tags['project'] AS project,
  sku_name,
  COUNT(DISTINCT usage_metadata.cluster_id) AS unique_clusters,
  SUM(usage_quantity) AS total_dbus,
  ROUND(SUM(usage_quantity) * 0.40, 2) AS estimated_cost_usd  -- Adjust rate
FROM
  system.billing.usage
WHERE
  usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
  AND custom_tags['team'] IS NOT NULL
GROUP BY
  team, project, sku_name
ORDER BY
  total_dbus DESC
LIMIT 10;</div>

### Budget Monitoring and Alerts

Build dashboards and alerts on `system.billing.usage` to proactively manage costs:

| Alert Type | Logic | Action |
|------------|-------|--------|
| **Budget Threshold** | Team DBU spend > 80% of monthly budget | Email team lead; require approval for new clusters |
| **Anomaly Detection** | Daily spend > 2σ from 30-day average | Investigate for runaway jobs or misconfiguration |
| **Untagged Usage** | Untagged DBUs > 5% of total usage | Notify platform team; enforce tagging policies |
| **Idle Compute** | Clusters running >24 hours with zero queries | Auto-terminate; notify owner |
| **Cost Per Query** | Warehouse cost/query > threshold | Investigate inefficient queries; optimize warehouse size |
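
The anomaly rule in the table can be expressed directly with the standard library. This is a minimal sketch, assuming daily spend totals have already been pulled into a list; the `is_spend_anomaly` name is hypothetical, and the check is one-sided (it flags only spend above the average), which is an interpretation of the rule rather than something the table specifies.

```python
from statistics import mean, stdev


def is_spend_anomaly(history, today, sigmas=2.0):
    """True when today's spend exceeds the historical mean by more than `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    threshold = mean(history) + sigmas * stdev(history)
    return today > threshold


# 30 days of illustrative spend: half at 100, half at 110
history = [100.0] * 15 + [110.0] * 15
print(is_spend_anomaly(history, 120.0))
print(is_spend_anomaly(history, 112.0))
```

With perfectly flat history the standard deviation is zero, so any increase is flagged; a minimum absolute delta is a reasonable addition if that proves noisy.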

<div style="background-color: #FFF3E0; border-left: 4px solid #FF9800; padding: 12px 16px; margin: 16px 0;">
<strong>⚙️ Best Practice: Automated Cost Governance</strong><br/><br/>
Combine <code>system.billing.usage</code> alerts with automated responses:
<ul>
<li><strong>Soft Limits</strong>: Email notifications at 75%, 90%, 100% of budget</li>
<li><strong>Hard Limits</strong>: Use Cluster Policies to restrict instance types or max clusters when budget exceeded</li>
<li><strong>Weekly Reports</strong>: Scheduled SQL queries that email cost summaries to team leads every Monday</li>
<li><strong>Chargeback Integration</strong>: Export tagged usage to finance systems (SAP, Oracle Financials) for departmental billing</li>
</ul>
</div>
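
Soft limits work best when each threshold notifies once, at the moment it is crossed, rather than on every check afterward. A minimal sketch of that logic follows; the `crossed_thresholds` helper is hypothetical and assumes the caller stores the spend figure from the previous check.

```python
SOFT_LIMITS = (0.75, 0.90, 1.00)  # fractions of the monthly budget


def crossed_thresholds(previous_spend, current_spend, budget):
    """Return the soft limits crossed between two checks, so each fires only once."""
    return [limit for limit in SOFT_LIMITS
            if previous_spend < limit * budget <= current_spend]


# Spend moved from 7,000 to 9,200 against a 10,000 budget
print(crossed_thresholds(7000.0, 9200.0, 10000.0))  # [0.75, 0.9]
```

Comparing against the previous check avoids repeated emails for a threshold that was already reported.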

## Observability and Monitoring Integrations

### The Observability Stack

Production data platforms require integration with enterprise observability tools for centralized monitoring, alerting, and incident response.

<br />
<div class="mermaid">
graph TB
    subgraph "Databricks Platform"
        A[Clusters & Warehouses]
        B[Jobs & Pipelines]
        C[System Tables]
        D[Audit Logs]
    end

    subgraph "Log Delivery"
        E[Diagnostic Logs]
        F[Audit Log Delivery]
        G[System Tables Export]
    end

    subgraph "Observability Platforms"
        H[CloudWatch<br/>AWS]
        I[Azure Monitor<br/>Log Analytics]
        J[Splunk<br/>Enterprise]
        K[Datadog]
        L[Prometheus/Grafana]
    end

    A --> E
    B --> E
    D --> F
    C --> G

    E --> H
    E --> I
    E --> J
    F --> H
    F --> I
    F --> J
    G --> K
    G --> L

    style A fill:#E3F2FD
    style B fill:#E3F2FD
    style C fill:#FFF3E0
    style D fill:#FFF3E0
    style H fill:#FFE0B2
    style I fill:#B3E5FC
    style J fill:#C5E1A5
</div>

### Log Types and Use Cases

| Log Type | Content | Primary Use Cases |
|----------|---------|-------------------|
| **Cluster Event Logs** | Cluster lifecycle (start, resize, terminate), driver/executor logs | Debugging job failures, performance tuning, capacity planning |
| **Audit Logs** | All user actions (logins, data access, permission changes) | Security compliance, access reviews, incident forensics |
| **Job Run Logs** | Job execution history, task outcomes, error messages | Pipeline monitoring, SLA tracking, failure analysis |
| **Query History** | SQL query text, execution plans, performance metrics | Query optimization, cost attribution, user behavior analysis |
| **System Metrics** | CPU, memory, disk I/O, network throughput | Capacity planning, anomaly detection, resource optimization |

### Integration Patterns

**Pattern 1: Log Delivery to Cloud Storage**

Databricks delivers logs to your cloud storage (S3, ADLS Gen2, GCS), which you then ingest into your observability platform.

<div class="code-block" data-language="python"># Configure audit log delivery using Databricks SDK
from databricks.sdk import AccountClient
from databricks.sdk.service.billing import (
    CreateLogDeliveryConfigurationParams,
    LogType,
    OutputFormat,
)

a = AccountClient()

# Create log delivery configuration for audit logs
# (replace the placeholder IDs with your credential and storage configuration IDs)
log_delivery = a.log_delivery.create(
    log_delivery_configuration=CreateLogDeliveryConfigurationParams(
        config_name="audit_logs_to_s3",
        log_type=LogType.AUDIT_LOGS,
        output_format=OutputFormat.JSON,
        credentials_id="YOUR-STORAGE-CREDENTIAL-ID",
        storage_configuration_id="YOUR-STORAGE-CONFIG-ID",
        workspace_ids_filter=[1234567890123456]  # Optional: specific workspaces
    )
)

print(f"Log delivery configured: {log_delivery.log_delivery_configuration.config_id}")</div>

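**Pattern 2: CloudWatch Integration (AWS)**

For AWS deployments, note that `cluster_log_conf` accepts DBFS, S3, and Unity Catalog volume destinations; it has no native CloudWatch target. Deliver cluster logs to S3 and forward them to CloudWatch Logs from there, or install the CloudWatch agent through an init script:

<div class="code-block" data-language="python"># Deliver cluster logs to S3 so they can be forwarded to CloudWatch
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterLogConf, S3StorageInfo

w = WorkspaceClient()

# Create cluster with S3 log delivery enabled.
# The bucket name and region are placeholders; the cluster's instance
# profile needs write access to the destination.
w.clusters.create(
    cluster_name="analytics_cluster",
    spark_version="14.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=4,
    cluster_log_conf=ClusterLogConf(
        s3=S3StorageInfo(
            destination="s3://your-log-bucket/databricks/clusters/analytics",
            region="us-west-2",
        )
    ),
    custom_tags={
        "cost_center": "analytics",
        "environment": "prod",
    },
)</div>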

**Pattern 3: Azure Monitor Integration (Azure)**

For Azure deployments, integrate with Log Analytics workspace:

<div class="code-block" data-language="python"># Configure cluster to send logs to Azure Log Analytics
cluster_config = {
    "cluster_name": "etl_cluster_prod",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 8,
    "cluster_log_conf": {
        "dbfs": {
            "destination": "dbfs:/cluster-logs/etl-prod"
        }
    },
    "init_scripts": [
        {
            "dbfs": {
                "destination": "dbfs:/init-scripts/azure-monitor-agent.sh"
            }
        }
    ]
}

# Then use Azure Monitor agent to forward logs from DBFS/storage</div>

**Pattern 4: Splunk HTTP Event Collector (HEC)**

For Splunk environments, use HTTP Event Collector to stream logs:

<div class="code-block" data-language="python"># Python script to forward Databricks logs to Splunk HEC
import json
from datetime import datetime

import requests

SPLUNK_HEC_URL = "https://splunk.company.com:8088/services/collector/event"
SPLUNK_HEC_TOKEN = "YOUR-HEC-TOKEN"  # In production, load this from a secret scope

def send_to_splunk(event_data):
    """Send one JSON-serializable event to Splunk HEC"""
    payload = {
        "time": int(datetime.now().timestamp()),
        "sourcetype": "databricks:audit",
        "source": "databricks_api",
        "event": event_data
    }

    headers = {
        "Authorization": f"Splunk {SPLUNK_HEC_TOKEN}",
        "Content-Type": "application/json"
    }

    response = requests.post(SPLUNK_HEC_URL, headers=headers, json=payload, verify=True, timeout=10)
    return response.status_code == 200

# Query audit logs from system tables and forward to Splunk.
# to_json(struct(*)) serializes timestamps and nested fields that
# Row.asDict() would leave as non-JSON Python objects.
audit_logs = spark.sql("""
    SELECT to_json(struct(*)) AS event_json
    FROM system.access.audit
    WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL 1 HOUR
""").collect()

for log in audit_logs:
    send_to_splunk(json.loads(log.event_json))</div>
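
Sending one HTTP request per audit event becomes slow at volume. HEC accepts several event objects stacked in a single request body, so a forwarder can batch them. The sketch below builds such bodies with the standard library only; `build_hec_batch` and `chunked` are hypothetical helper names, and the batch size is an assumption to tune for your Splunk deployment.

```python
import json


def build_hec_batch(events, sourcetype="databricks:audit", source="databricks_api"):
    """Serialize events into one HEC request body, one JSON object per line."""
    lines = []
    for event in events:
        envelope = {"sourcetype": sourcetype, "source": source, "event": event}
        # default=str converts values json cannot encode natively (e.g. datetimes)
        lines.append(json.dumps(envelope, default=str))
    return "\n".join(lines)


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


events = [{"action": "login", "user": "a"}, {"action": "getTable", "user": "b"},
          {"action": "login", "user": "c"}]
for batch in chunked(events, 2):
    body = build_hec_batch(batch)
    print(len(body.splitlines()), "events in batch")
```

Each body can then be posted with `requests.post(SPLUNK_HEC_URL, headers=headers, data=body)`, replacing the per-event `json=payload` call.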

### Key Metrics to Monitor

Build dashboards in your observability platform for these critical metrics:

| Metric Category | Specific Metrics | Alert Thresholds |
|-----------------|------------------|------------------|
| **Cluster Health** | Cluster start failures, node evictions, OOM errors | >5% failure rate |
| **Job Reliability** | Job success rate, retry counts, execution time P95/P99 | <95% success rate |
| **Query Performance** | Query duration P95/P99, queue time, error rate | >10s P95 latency |
| **Cost** | Daily DBU spend, cost per query, cost per job | >20% day-over-day increase |
| **Concurrency** | Active queries, queued queries, warehouse utilization | Queue depth >10 |
| **Data Quality** | DLT expectations pass rate, row counts, schema changes | <99% pass rate |
| **Security** | Failed login attempts, privilege escalations, data exfiltration patterns | >10 failures/hour |
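
Keeping thresholds as data, rather than scattering them across individual alert queries, makes them easier to review and change. The sketch below encodes the thresholds from the table as rules; the metric names and the `breached` helper are hypothetical, and rates are assumed to be expressed as fractions between 0 and 1.

```python
import operator

# (metric name, comparison, threshold); names are illustrative, thresholds mirror the table
RULES = [
    ("cluster_failure_rate", operator.gt, 0.05),
    ("job_success_rate", operator.lt, 0.95),
    ("query_p95_seconds", operator.gt, 10.0),
    ("dbu_day_over_day_increase", operator.gt, 0.20),
    ("queue_depth", operator.gt, 10),
    ("dlt_expectation_pass_rate", operator.lt, 0.99),
    ("failed_logins_per_hour", operator.gt, 10),
]


def breached(metrics):
    """Return the names of metrics that violate their threshold; absent metrics are skipped."""
    return [name for name, compare, limit in RULES
            if name in metrics and compare(metrics[name], limit)]


snapshot = {"job_success_rate": 0.93, "queue_depth": 4, "query_p95_seconds": 12.5}
print(breached(snapshot))  # ['job_success_rate', 'query_p95_seconds']
```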

<div style="background-color: #E3F2FD; border-left: 4px solid #2196F3; padding: 12px 16px; margin: 16px 0;">
<strong>üí° Recommendation: Use System Tables for Metrics</strong><br/><br/>
Databricks System Tables (<code>system.query.history</code>, <code>system.access.audit</code>, <code>system.billing.usage</code>) provide SQL-queryable access to all operational metrics. This enables:
<ul>
<li>Custom dashboards in Databricks SQL or your BI tool</li>
<li>Scheduled queries that export metrics to external systems</li>
<li>Alert queries that trigger notifications via webhooks</li>
<li>Historical trend analysis without log retention limits</li>
</ul>
</div>

## Alert Policies and Incident Response

### Databricks Alert Framework

Alerts transform monitoring data into actionable notifications. Databricks provides native alerting capabilities plus integration with external systems.

<br />
<div class="mermaid">
flowchart TB
    subgraph "Alert Sources"
        A[SQL Query Alert<br/>Scheduled Query]
        B[Job Failure Alert<br/>Workflow Notification]
        C[DLT Expectations<br/>Data Quality]
        D[Custom Metrics<br/>System Tables]
    end

    subgraph "Alert Routing"
        E[Alert Destination]
    end

    subgraph "Notification Channels"
        F[Email]
        G[Slack/Teams]
        H[PagerDuty]
        I[Webhook<br/>Custom Integration]
        J[ITSM<br/>ServiceNow/Jira]
    end

    A --> E
    B --> E
    C --> E
    D --> E

    E --> F
    E --> G
    E --> H
    E --> I
    E --> J

    style A fill:#E3F2FD
    style B fill:#E3F2FD
    style C fill:#FFF3E0
    style D fill:#FFF3E0
    style H fill:#FFEBEE
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### Alert Types and Configuration

**1. SQL Query Alerts** (for operational metrics)

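<div class="code-block" data-language="sql">-- Alert query: Detect failed job runs in the past hour
-- job_run_timeline carries no job name, so join the latest record from system.lakeflow.jobs.
-- Verify column names against the system table schema in your workspace.
WITH latest_jobs AS (
  SELECT workspace_id, job_id, name AS job_name
  FROM system.lakeflow.jobs
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY workspace_id, job_id ORDER BY change_time DESC
  ) = 1
)
SELECT
  r.job_id,
  j.job_name,
  r.run_id,
  r.period_start_time,
  r.period_end_time,
  r.result_state,
  r.termination_code
FROM
  system.lakeflow.job_run_timeline AS r
  LEFT JOIN latest_jobs AS j
    ON r.workspace_id = j.workspace_id AND r.job_id = j.job_id
WHERE
  r.period_end_time >= CURRENT_TIMESTAMP - INTERVAL 1 HOUR  -- result_state is set when the run ends
  AND r.result_state = 'FAILED'
  AND COALESCE(j.job_name, '') NOT LIKE '%test%'  -- Exclude test jobs
ORDER BY
  r.period_end_time DESC;

-- Configure this query to run every 15 minutes
-- Alert condition: Row count > 0
-- Notification: Email + Slack</div>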
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

<div class="code-block" data-language="sql">-- Alert query: Detect cost anomalies
-- Window functions cannot be filtered in HAVING, so aggregate first and filter in an outer query.
WITH daily AS (
  SELECT
    usage_date,
    SUM(usage_quantity) AS total_dbus
  FROM
    system.billing.usage
  WHERE
    usage_date >= CURRENT_DATE - INTERVAL 60 DAYS
    AND usage_unit = 'DBU'  -- Do not mix DBUs with other usage units
  GROUP BY
    usage_date
),
scored AS (
  SELECT
    usage_date,
    total_dbus,
    AVG(total_dbus) OVER (
      ORDER BY usage_date
      ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
    ) AS avg_30day_dbus,
    STDDEV(total_dbus) OVER (
      ORDER BY usage_date
      ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
    ) AS stddev_30day_dbus
  FROM
    daily
)
SELECT
  *
FROM
  scored
WHERE
  usage_date = CURRENT_DATE - INTERVAL 1 DAY  -- Most recent complete day, so old anomalies do not re-fire
  AND total_dbus > avg_30day_dbus + (2 * stddev_30day_dbus);  -- 2 sigma threshold

-- Alert condition: Row count > 0 (anomaly detected)
-- Notification: Email finance team + platform team</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>
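
The 2-sigma rule in the cost anomaly query can be checked outside SQL as well, which is handy for unit-testing the threshold logic before wiring it to an alert. This is a minimal sketch: the function name `is_cost_anomaly` and the sample DBU values are illustrative, not taken from real billing data. It uses the sample standard deviation, which is also what `STDDEV` computes in Databricks SQL.

```python
from statistics import mean, stdev
from typing import Sequence


def is_cost_anomaly(history: Sequence[float], today: float, sigmas: float = 2.0) -> bool:
    """True when today's DBUs exceed mean + sigmas * stddev of the prior days."""
    if len(history) < 2:
        return False  # a sample standard deviation needs at least two points
    threshold = mean(history) + sigmas * stdev(history)
    return today > threshold


# Illustrative daily DBU totals for the trailing window (hypothetical values).
baseline = [100.0, 104.0, 98.0, 101.0, 97.0, 103.0, 99.0]

print(is_cost_anomaly(baseline, 102.0))  # within normal variation
print(is_cost_anomaly(baseline, 140.0))  # well above the 2-sigma threshold
```

A short history makes the standard deviation unstable, so the sketch declines to flag anything until at least two prior days exist; in practice a longer minimum window is a safer choice.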

**2. Job Failure Notifications**

Configure notifications directly in Lakeflow Jobs (Workflows):

<div class="code-block" data-language="python"># Using the Databricks SDK to create a job with notifications
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    JobEmailNotifications, JobNotificationSettings, NotebookTask, Source,
    Task, TaskEmailNotifications, Webhook, WebhookNotifications
)

w = WorkspaceClient()

job = w.jobs.create(
    name="critical_etl_pipeline",
    tasks=[
        Task(
            task_key="ingest_data",
            notebook_task=NotebookTask(
                notebook_path="/Pipelines/Ingest",
                source=Source.WORKSPACE
            ),
            # Task-level notifications use TaskEmailNotifications
            email_notifications=TaskEmailNotifications(
                on_failure=["data-eng-oncall@company.com"],
                on_success=["data-eng-team@company.com"]
            )
        )
    ],
    # Job-level notifications fire once for the run as a whole
    email_notifications=JobEmailNotifications(
        on_failure=["data-platform-oncall@company.com", "manager@company.com"]
    ),
    # The ID must refer to a notification destination already configured in the workspace
    webhook_notifications=WebhookNotifications(
        on_failure=[Webhook(id="pagerduty_webhook_id")]
    ),
    notification_settings=JobNotificationSettings(
        no_alert_for_canceled_runs=True,
        no_alert_for_skipped_runs=True
    )
)</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-python.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**3. DLT Data Quality Alerts**

Spark Declarative Pipelines (DLT) surface data quality problems through expectations. An expectation declared with `expect_all_or_fail` stops the update when a record violates it, which in turn triggers the pipeline's failure notifications:

<div class="code-block" data-language="python"># In a DLT pipeline notebook
import dlt

@dlt.table(
    name="clean_customer_data",
    comment="Customer data with quality expectations"
)
@dlt.expect_all_or_fail({
    "valid_email": "email IS NOT NULL AND email LIKE '%@%'",
    "valid_created_date": "created_date >= '2020-01-01'"
})
@dlt.expect_or_drop("positive_customer_id", "customer_id > 0")
def clean_customers():
    return spark.read.table("raw_customers")

# expect_all_or_fail fails the update on a violation, so alerts are routed through
# the pipeline's notification settings: add recipients for the "on-update-failure"
# (and optionally "on-flow-failure") alert types.
# Rows removed by expect_or_drop do not fail the update; track them via the pipeline event log.</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-python.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

### Incident Response Workflow

A mature incident response process ensures alerts lead to resolution, not just notifications.

| Phase | Activities | Responsible Party | Tools/Systems |
|-------|-----------|-------------------|---------------|
| **Detection** | Alert fires based on metric threshold or failure | Automated monitoring | Databricks Alerts, CloudWatch, PagerDuty |
| **Triage** | On-call engineer assesses severity and impact | Platform SRE / Data Engineer | System tables, job logs, query history |
| **Communication** | Notify stakeholders; create incident ticket | On-call engineer | Slack, ServiceNow, Jira, StatusPage |
| **Investigation** | Root cause analysis using logs and lineage | Data Engineering team | Databricks workspace, Unity Catalog lineage |
| **Mitigation** | Apply temporary fix or rollback change | Data Engineering team | Git revert, job disable, cluster restart |
| **Resolution** | Implement permanent fix; verify recovery | Data Engineering team | Code changes, infrastructure updates |
| **Postmortem** | Document RCA, action items, prevention measures | Team lead | Wiki, postmortem doc, retrospective meeting |

<div style="background-color: #FFF3E0; border-left: 4px solid #FF9800; padding: 12px 16px; margin: 16px 0;">
<strong>⚙️ Best Practice: Runbooks for Common Incidents</strong><br/><br/>
Document standard operating procedures for frequent incident types:
<ul>
<li><strong>Job Failure</strong>: Check recent code changes, verify input data availability, review cluster logs for OOM/timeout errors</li>
<li><strong>Query Performance Degradation</strong>: Check for table bloat, analyze query plans, verify warehouse capacity</li>
<li><strong>Cost Spike</strong>: Identify runaway jobs via <code>system.billing.usage</code>, terminate idle clusters, review recent deployments</li>
<li><strong>Data Quality Issue</strong>: Trace lineage to source, check DLT expectations, notify upstream data providers</li>
<li><strong>Access Denied</strong>: Verify Unity Catalog permissions, check recent REVOKE statements, confirm identity provider sync</li>
</ul>
</div>
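
Runbooks are easier to keep current when they live as structured data that tooling (a chatbot, an alert template, a ticket form) can look up. The sketch below encodes the first-response steps from the list above; the `first_response` helper and the fallback escalation step are assumptions for illustration, not part of any Databricks feature.

```python
from typing import Dict, List

# First-response steps taken from the runbook list above.
RUNBOOKS: Dict[str, List[str]] = {
    "job_failure": [
        "Check recent code changes",
        "Verify input data availability",
        "Review cluster logs for OOM/timeout errors",
    ],
    "query_performance_degradation": [
        "Check for table bloat",
        "Analyze query plans",
        "Verify warehouse capacity",
    ],
    "cost_spike": [
        "Identify runaway jobs via system.billing.usage",
        "Terminate idle clusters",
        "Review recent deployments",
    ],
    "data_quality_issue": [
        "Trace lineage to source",
        "Check DLT expectations",
        "Notify upstream data providers",
    ],
    "access_denied": [
        "Verify Unity Catalog permissions",
        "Check recent REVOKE statements",
        "Confirm identity provider sync",
    ],
}

# Hypothetical default for incident types without a documented runbook.
FALLBACK: List[str] = ["No runbook found: escalate to the on-call lead"]


def first_response(incident_type: str) -> List[str]:
    """Return the checklist for an incident type, normalizing spacing and case."""
    key = incident_type.strip().lower().replace(" ", "_")
    return RUNBOOKS.get(key, FALLBACK)


for step in first_response("Cost Spike"):
    print(f"- {step}")
```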

### Alert Destination Configuration

**Slack Integration Example:**

<div class="code-block" data-language="python"># Create a notification destination for Slack
# Requires workspace admin rights and a recent databricks-sdk release
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import Config, SlackConfig

w = WorkspaceClient()

# Create the Slack destination from an incoming webhook URL
slack_dest = w.notification_destinations.create(
    display_name="data_eng_slack_channel",
    config=Config(
        slack=SlackConfig(url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL")
    )
)

print(f"Slack destination created: {slack_dest.id}")</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-python.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**PagerDuty Integration Example:**

<div class="code-block" data-language="python"># Create a PagerDuty notification destination
# Reuses the WorkspaceClient `w` from the Slack example
from databricks.sdk.service.settings import Config, PagerdutyConfig

pagerduty_dest = w.notification_destinations.create(
    display_name="oncall_pagerduty",
    config=Config(
        # Integration (routing) key of the PagerDuty service; load it from a secret in practice
        pagerduty=PagerdutyConfig(integration_key="YOUR_PAGERDUTY_INTEGRATION_KEY")
    )
)

# The destination stores only the integration key. To customize the message text,
# edit the alert's notification template, which supports variables such as
# {{ALERT_NAME}}, {{ALERT_URL}}, and {{QUERY_URL}}.
print(f"PagerDuty destination created: {pagerduty_dest.id}")</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-python.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

## Infrastructure as Code (IaC)

### The Challenge: Platform Drift and Change Management

Post-migration, your Databricks platform will evolve: new workspaces, catalog changes, cluster policy updates, job definitions, and permission grants. Without IaC, these changes become:
- **Undocumented**: No record of who changed what and why
- **Unrepeatable**: Manual UI changes can't be replicated across environments
- **Unauditable**: Compliance teams can't verify configurations
- **Fragile**: No way to rollback breaking changes

IaC solves this by treating infrastructure configuration as versioned code subject to the same rigor as application code: peer review, CI/CD, automated testing, and rollback capabilities.

<br />
<div class="mermaid">
graph LR
    subgraph "Development Flow"
        A[Local Config<br/>Changes] --> B[Git Commit]
        B --> C[Pull Request]
        C --> D[Peer Review]
        D --> E[CI/CD Pipeline]
        E --> F{Validation<br/>Pass?}
        F -->|No| G[Fix Issues]
        F -->|Yes| H[Deploy to Dev]
        G --> A
        H --> I[Promote to Prod]
    end

    subgraph "Databricks Platform"
        J[Workspaces]
        K[Clusters]
        L[Jobs]
        M[Unity Catalog]
        N[Permissions]
    end

    I --> J
    I --> K
    I --> L
    I --> M
    I --> N

    style A fill:#E3F2FD
    style C fill:#FFF3E0
    style E fill:#F3E5F5
    style I fill:#E8F5E9
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### IaC Options for Databricks

| Tool | Strengths | Best For | Learning Curve |
|------|-----------|----------|----------------|
| **Terraform** | Mature, multi-cloud, large ecosystem, declarative | Cross-platform infrastructure; teams already using Terraform | Medium |
| **Databricks Asset Bundles (DABs)** | Native Databricks tooling, Git-integrated, YAML-based | Application deployment (jobs, pipelines, notebooks); developer workflows | Low |
| **Databricks Terraform Provider** | Official Databricks provider for Terraform; comprehensive resource coverage | Account-level config, workspaces, Unity Catalog, clusters, jobs | Medium |
| **Pulumi** | Multi-language (Python, TypeScript, Go), strong typing | Teams preferring code over YAML/HCL; complex logic in IaC | Medium-High |
| **Ansible** | Agentless, procedural, existing playbooks | Organizations standardized on Ansible; configuration management | Medium |

<div style="background-color: #E3F2FD; border-left: 4px solid #2196F3; padding: 12px 16px; margin: 16px 0;">
<strong>💡 Recommendation: Hybrid Approach</strong><br/><br/>
<ul>
<li><strong>Terraform</strong>: Manage account-level resources (workspaces, metastores, storage credentials, network configs)</li>
<li><strong>Databricks Asset Bundles</strong>: Deploy application-layer resources (jobs, DLT pipelines, notebooks, dashboards)</li>
</ul>
This separation aligns with ownership: platform team owns infrastructure (Terraform), data teams own applications (DABs).
</div>

### Terraform for Databricks: Core Patterns

**Provider Configuration:**

<div class="code-block" data-language="hcl"># providers.tf
terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.35"
    }
  }
}

# Workspace-level provider, authenticated with a service principal (recommended for CI/CD)
provider "databricks" {
  host          = var.databricks_host
  client_id     = var.databricks_client_id
  client_secret = var.databricks_client_secret
}

# Additional account-level provider for managing workspaces and metastores.
# Select it on a resource with: provider = databricks.account
provider "databricks" {
  alias         = "account"
  host          = "https://accounts.cloud.databricks.com"
  account_id    = var.databricks_account_id
  client_id     = var.databricks_client_id
  client_secret = var.databricks_client_secret
}</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-hcl.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**Managing SQL Warehouses:**

<div class="code-block" data-language="hcl"># sql_warehouses.tf
# Tags are declared as repeated custom_tags blocks, each with a key and a value
resource "databricks_sql_endpoint" "analytics_prod" {
  name             = "analytics_warehouse_prod"
  cluster_size     = "Medium"
  max_num_clusters = 5
  auto_stop_mins   = 15

  tags {
    custom_tags {
      key   = "cost_center"
      value = "analytics"
    }
    custom_tags {
      key   = "team"
      value = "bi_reporting"
    }
    custom_tags {
      key   = "environment"
      value = "prod"
    }
    custom_tags {
      key   = "managed_by"
      value = "terraform"
    }
  }

  channel {
    name = "CHANNEL_NAME_CURRENT"  # Use current channel for latest features
  }
}

resource "databricks_sql_endpoint" "data_science_dev" {
  name             = "data_science_warehouse_dev"
  cluster_size     = "Small"
  max_num_clusters = 2
  auto_stop_mins   = 10

  # Serverless compute requires the PRO warehouse type
  enable_serverless_compute = true
  warehouse_type            = "PRO"

  tags {
    custom_tags {
      key   = "cost_center"
      value = "data_science"
    }
    custom_tags {
      key   = "team"
      value = "ml_platform"
    }
    custom_tags {
      key   = "environment"
      value = "dev"
    }
    custom_tags {
      key   = "managed_by"
      value = "terraform"
    }
  }
}</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-hcl.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**Managing Unity Catalog Resources:**

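<div class="code-block" data-language="hcl"># unity_catalog.tf
resource "databricks_catalog" "analytics" {
  name           = "analytics_prod"
  comment        = "Production analytics catalog - managed by Terraform"
  owner          = "data_platform_team"
  isolation_mode = "OPEN"

  # Avoid timestamp() here: it changes on every run and produces a perpetual diff
  properties = {
    managed_by  = "terraform"
    cost_center = "analytics"
  }
}

resource "databricks_schema" "sales" {
  catalog_name = databricks_catalog.analytics.name
  name         = "sales"
  comment      = "Sales domain data"
  owner        = "sales_analytics_team"
}

# USE_CATALOG is a catalog-level privilege, so it is granted on the catalog
resource "databricks_grants" "analytics_catalog_access" {
  catalog = databricks_catalog.analytics.name

  grant {
    principal  = "sales_analysts"
    privileges = ["USE_CATALOG"]
  }

  grant {
    principal  = "sales_engineers"
    privileges = ["USE_CATALOG"]
  }
}

# databricks_grants is authoritative: grants on this schema that are not listed here are removed
resource "databricks_grants" "sales_schema_access" {
  schema = "${databricks_catalog.analytics.name}.${databricks_schema.sales.name}"

  grant {
    principal  = "sales_analysts"
    privileges = ["SELECT", "USE_SCHEMA"]
  }

  grant {
    principal  = "sales_engineers"
    privileges = ["SELECT", "MODIFY", "USE_SCHEMA"]
  }
}</div>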
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-hcl.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**Managing Cluster Policies:**

<div class="code-block" data-language="hcl"># cluster_policies.tf
resource "databricks_cluster_policy" "cost_optimized" {
  name = "cost_optimized_policy"

  definition = jsonencode({
    "spark_version": {
      "type": "fixed",
      "value": "14.3.x-scala2.12"
    },
    "node_type_id": {
      "type": "allowlist",
      "values": ["i3.xlarge", "i3.2xlarge", "m5d.large"]
    },
    "autotermination_minutes": {
      "type": "range",
      "minValue": 10,
      "maxValue": 120,
      "defaultValue": 30
    },
    # Policy values are static; they cannot be derived from the user's email.
    # Require users to pick a known cost center instead.
    "custom_tags.cost_center": {
      "type": "allowlist",
      "values": ["analytics", "data_science", "sales"]
    },
    "custom_tags.managed_by": {
      "type": "fixed",
      "value": "terraform"
    },
    # Use spot instances, falling back to on-demand when spot capacity is unavailable
    "aws_attributes.availability": {
      "type": "fixed",
      "value": "SPOT_WITH_FALLBACK"
    },
    "aws_attributes.spot_bid_price_percent": {
      "type": "fixed",
      "value": 100  # Bid up to 100% of the on-demand price
    }
  })
}

resource "databricks_permissions" "cost_optimized_usage" {
  cluster_policy_id = databricks_cluster_policy.cost_optimized.id

  access_control {
    group_name       = "data_engineers"
    permission_level = "CAN_USE"
  }
}</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-hcl.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

### Databricks Asset Bundles: Application Deployment

Databricks Asset Bundles (DABs) provide a Git-native way to deploy jobs, pipelines, and notebooks as versioned applications.

**Bundle Structure:**

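<div class="code-block" data-language="yaml"># databricks.yml (root of your Git repository)
bundle:
  name: sales_analytics_pipeline

include:
  - resources/*.yml

# Variables must be declared before targets can assign values to them
variables:
  catalog:
    description: Unity Catalog catalog used by the pipeline
  warehouse_id:
    description: SQL warehouse used by the pipeline
  shared_cluster_id:
    description: Existing cluster referenced by resources/sales_etl_job.yml (set per target or via BUNDLE_VAR_shared_cluster_id)

targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.cloud.databricks.com
    variables:
      catalog: dev_catalog
      warehouse_id: abc123dev

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
    variables:
      catalog: prod_catalog
      warehouse_id: xyz789prod
    # Run production jobs as a service principal (the value is its application ID).
    # Deployment approval is enforced in CI/CD, not by this setting.
    run_as:
      service_principal_name: "prod-deployment-sp"</div>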
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-yaml.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**Job Definition in Bundle:**

<div class="code-block" data-language="yaml"># resources/sales_etl_job.yml
resources:
  jobs:
    sales_daily_aggregation:
      name: "Sales Daily Aggregation - ${bundle.target}"

      tasks:
        - task_key: ingest_raw_sales
          # Relative paths are resolved against this file and deployed with the bundle.
          # "source: GIT" is omitted because it requires a job-level git_source block.
          notebook_task:
            notebook_path: ../notebooks/ingest_sales.py
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 4
            custom_tags:
              cost_center: "sales"
              team: "analytics"
              environment: "${bundle.target}"

        - task_key: transform_sales
          depends_on:
            - task_key: ingest_raw_sales
          notebook_task:
            notebook_path: ../notebooks/transform_sales.py
          existing_cluster_id: "${var.shared_cluster_id}"

      schedule:
        quartz_cron_expression: "0 0 2 * * ?"  # 2 AM daily
        timezone_id: "America/New_York"
        pause_status: PAUSED  # Start paused; enable after validation

      email_notifications:
        on_failure:
          - "sales-analytics-oncall@company.com"
        on_success:
          - "sales-analytics-team@company.com"

      tags:
        cost_center: "sales"
        managed_by: "databricks_asset_bundle"</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-yaml.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**Deploying with DABs:**

<div class="code-block" data-language="bash"># Install Databricks CLI with Asset Bundles support
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Authenticate to workspace
databricks configure

# Validate bundle configuration
databricks bundle validate -t dev

# Deploy to development
databricks bundle deploy -t dev

# Run the deployed job
databricks bundle run sales_daily_aggregation -t dev

# Promote to production (after validation)
databricks bundle deploy -t prod</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-bash.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

### CI/CD Pipeline Integration

**GitHub Actions Example for Terraform:**

<div class="code-block" data-language="yaml"># .github/workflows/terraform-databricks.yml
name: Terraform - Databricks Infrastructure

on:
  pull_request:
    branches: [main]
    paths:
      - 'terraform/**'
  push:
    branches: [main]
    paths:
      - 'terraform/**'

jobs:
  terraform:
    runs-on: ubuntu-latest

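    env:
      # The TF_VAR_ prefix populates the Terraform variables referenced in providers.tf
      TF_VAR_databricks_host: ${{ secrets.DATABRICKS_HOST }}
      TF_VAR_databricks_client_id: ${{ secrets.DATABRICKS_CLIENT_ID }}
      TF_VAR_databricks_client_secret: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
      # Assumed secret name; needed only if the account-level provider is configured
      TF_VAR_databricks_account_id: ${{ secrets.DATABRICKS_ACCOUNT_ID }}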

    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Terraform Init
        working-directory: ./terraform
        run: terraform init

      - name: Terraform Validate
        working-directory: ./terraform
        run: terraform validate

      - name: Terraform Plan
        working-directory: ./terraform
        run: terraform plan -out=tfplan

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        working-directory: ./terraform
        run: terraform apply -auto-approve tfplan</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-yaml.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

**GitHub Actions Example for Asset Bundles:**

<div class="code-block" data-language="yaml"># .github/workflows/databricks-bundle-deploy.yml
name: Deploy Databricks Asset Bundle

on:
  push:
    branches: [main, develop]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
          echo "$HOME/.databricks/bin" >> $GITHUB_PATH

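      - name: Validate Bundle
        # Validate against the same workspace and credentials that the deploy step uses
        env:
          DATABRICKS_HOST: ${{ github.ref == 'refs/heads/main' && secrets.DATABRICKS_PROD_HOST || secrets.DATABRICKS_DEV_HOST }}
          DATABRICKS_TOKEN: ${{ github.ref == 'refs/heads/main' && secrets.DATABRICKS_PROD_TOKEN || secrets.DATABRICKS_DEV_TOKEN }}
        run: |
          TARGET=${{ github.ref == 'refs/heads/main' && 'prod' || 'dev' }}
          databricks bundle validate -t $TARGET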

      - name: Deploy to Dev
        if: github.ref == 'refs/heads/develop'
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_DEV_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_DEV_TOKEN }}
        run: databricks bundle deploy -t dev

      - name: Deploy to Prod
        if: github.ref == 'refs/heads/main'
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_PROD_TOKEN }}
        run: databricks bundle deploy -t prod</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-yaml.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', (event) => {
  document.querySelectorAll('.code-block').forEach((block) => {
    const code = block.textContent;
    const language = block.getAttribute('data-language') || 'python';
    block.innerHTML = `<pre><code class="language-${language}">${code.trim()}</code></pre>`;
  });
  Prism.highlightAll();
});
</script>

### IaC Best Practices

| Practice | Why It Matters | Implementation |
|----------|----------------|----------------|
| **State Management** | Terraform state must be shared and locked | Use remote backend (S3 + DynamoDB, Azure Storage, Terraform Cloud) |
| **Environment Separation** | Dev changes shouldn't impact prod | Separate state files per environment; use workspaces or directories |
| **Secret Management** | Never commit credentials to Git | Use GitHub Secrets, AWS Secrets Manager, Azure Key Vault, HashiCorp Vault |
| **Drift Detection** | Detect manual changes outside IaC | Run `terraform plan` daily; alert on drift; use Terraform Cloud for automated detection |
| **Modular Design** | Reusable, composable infrastructure | Create Terraform modules for common patterns (warehouse, cluster policy, catalog) |
| **Code Review** | Prevent misconfigurations | Require PR approval for all infrastructure changes; use automated validation tools |
| **Rollback Strategy** | Quick recovery from bad changes | Tag releases; enable versioning on the state backend so a previous state can be restored; test rollback procedures |
| **Documentation** | Knowledge transfer and onboarding | Document module usage, variable meanings, and deployment procedures |
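
The drift detection practice in the table can be automated with a small wrapper around `terraform plan`. The sketch below is one possible approach, not a prescribed implementation: it relies on the documented exit codes of `terraform plan -detailed-exitcode` (0 = no changes, 1 = error, 2 = changes present). The function names and the status labels are illustrative assumptions, and it assumes the `terraform` binary is on the `PATH` and the working directory is already initialized.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
# The status labels on the right are illustrative names, not Terraform terms.
PLAN_STATUS = {
    0: "in_sync",  # plan succeeded and found no changes
    1: "error",    # plan itself failed
    2: "drift",    # plan succeeded and found changes to apply
}


def classify_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a status label; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in `workdir` and classify the result.

    `workdir` is a hypothetical path to an initialized Terraform directory.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled job could call `check_drift("envs/prod")` (hypothetical directory) and page or open a ticket whenever the result is not `"in_sync"`. Note that exit code 2 only means the plan is non-empty: on a branch with unapplied code changes it reflects those changes as well as manual drift, so this check is most meaningful when run against the last deployed commit.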

<div style="background-color: #FFEBEE; border-left: 4px solid #F44336; padding: 12px 16px; margin: 16px 0;">
<strong>⚠️ Common IaC Pitfalls</strong><br/><br/>
<ul>
<li><strong>Destructive Changes</strong>: Some Terraform changes trigger resource replacement (deletion + recreation). Always review plan output for <code>-/+</code> (destroy, then create) and <code>+/-</code> (create, then destroy) replace operations.</li>
<li><strong>State File Loss</strong>: Losing Terraform state means losing track of managed resources. Always use remote state with versioning.</li>
<li><strong>Hardcoded Values</strong>: Hardcoding workspace IDs, cluster IDs, or credentials makes code brittle. Use variables and data sources.</li>
<li><strong>Incomplete Migration</strong>: Mixing manual changes with IaC creates drift and confusion. Commit to full IaC adoption or define boundaries clearly.</li>
</ul>
</div>
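
Reviewing plan output for replacements by eye is easy to miss in a long plan. As a hedged sketch, a CI step could inspect the machine-readable plan instead: `terraform show -json` emits a `resource_changes` list in which a replacement appears as both `delete` and `create` in `change.actions`. The function name and the resource addresses in the sample data are hypothetical.

```python
def find_replacements(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete and recreate."""
    replaced = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement lists both actions, in either order:
        # ["delete", "create"] or ["create", "delete"].
        if "delete" in actions and "create" in actions:
            replaced.append(change["address"])
    return replaced


# Hand-written sample in the shape of `terraform show -json` output.
# The addresses are illustrative, not taken from a real plan.
sample_plan = {
    "resource_changes": [
        {"address": "databricks_sql_endpoint.analytics",
         "change": {"actions": ["update"]}},
        {"address": "databricks_cluster_policy.etl",
         "change": {"actions": ["delete", "create"]}},
        {"address": "databricks_catalog.sales",
         "change": {"actions": ["no-op"]}},
    ]
}

print(find_replacements(sample_plan))  # ['databricks_cluster_policy.etl']
```

In a pipeline, the input would typically come from `terraform plan -out=plan.tfplan` followed by `terraform show -json plan.tfplan`, parsed with `json.load`. Failing the build when the returned list is non-empty forces an explicit human approval for any destructive change.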

## Summary: Operational Maturity Checklist

Assess your post-migration operational readiness:

### Compute Optimization
- [ ] Serverless SQL Warehouses configured with appropriate size and max clusters
- [ ] Warehouse-level tagging for cost attribution
- [ ] Monitoring in place for queue times and concurrency
- [ ] Cluster policies enforce cost controls and tagging requirements

### Cost Management
- [ ] Billing tags defined and enforced across all compute resources
- [ ] `system.billing.usage` dashboards built for cost visibility
- [ ] Budget alerts configured for teams and cost centers
- [ ] Untagged usage < 5% of total spend
- [ ] Monthly cost review process established
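
The "untagged usage < 5%" target above can be checked mechanically. The following is a minimal sketch, assuming the rows have already been fetched from `system.billing.usage` for a single period and usage unit, with `usage_quantity` and `custom_tags` fields. The required tag key `cost_center` and the sample figures are hypothetical. It measures the untagged share of usage quantity, which approximates share of spend only if the SKUs involved are priced similarly.

```python
def untagged_share(rows: list[dict], required_tag: str = "cost_center") -> float:
    """Return the fraction of usage quantity that lacks the required tag.

    Each row is expected to carry `usage_quantity` (number) and
    `custom_tags` (dict or None), mirroring system.billing.usage columns.
    """
    total = sum(row["usage_quantity"] for row in rows)
    if total == 0:
        return 0.0
    untagged = sum(
        row["usage_quantity"]
        for row in rows
        # Treat a missing map, a missing key, or an empty value as untagged.
        if not (row.get("custom_tags") or {}).get(required_tag)
    )
    return untagged / total


# Illustrative rows only; the quantities are made up for the example.
sample_rows = [
    {"usage_quantity": 90, "custom_tags": {"cost_center": "finance"}},
    {"usage_quantity": 6, "custom_tags": {"cost_center": "marketing"}},
    {"usage_quantity": 4, "custom_tags": None},
]

share = untagged_share(sample_rows)
print(f"{share:.1%}")   # 4.0%
print(share < 0.05)     # True
```

Running this as part of the monthly cost review gives a single number to track against the 5% threshold.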

### Observability
- [ ] Log delivery configured (CloudWatch/Azure Monitor/Splunk)
- [ ] System table queries built for operational metrics
- [ ] Dashboards created for job health, query performance, and costs
- [ ] Integration with enterprise monitoring platform complete

### Alerting & Incident Response
- [ ] SQL query alerts configured for critical metrics
- [ ] Job failure notifications routed to appropriate teams
- [ ] PagerDuty/on-call integration for production incidents
- [ ] Incident response runbooks documented
- [ ] Postmortem process defined

### Infrastructure as Code
- [ ] Terraform or DABs adopted for infrastructure management
- [ ] CI/CD pipeline configured for automated deployments
- [ ] Remote state backend configured with locking
- [ ] Drift detection process in place
- [ ] Dev/staging/prod environment separation established

<div style="background-color: #E8F5E9; border-left: 4px solid #4CAF50; padding: 12px 16px; margin: 16px 0;">
<strong>✅ Next Steps</strong><br/><br/>
With operational practices in place, the platform is ready for broader adoption. The next module focuses on consumer integration - enabling downstream users and applications to leverage the migrated platform.
</div>

## Related Resources

- [Databricks SQL Warehouse Documentation](https://docs.databricks.com/sql/admin/sql-endpoints.html)
- [System Tables Reference](https://docs.databricks.com/admin/system-tables/index.html)
- [Databricks Terraform Provider](https://registry.terraform.io/providers/databricks/databricks/latest/docs)
- [Databricks Asset Bundles Guide](https://docs.databricks.com/dev-tools/bundles/index.html)
- [Unity Catalog Best Practices](https://docs.databricks.com/data-governance/unity-catalog/best-practices.html)

---

**Next Lesson:** [5.2 - Consumer Integration]($./5.2 - Consumer Integration)

<div style="color: #FF3621; font-weight: bold; font-size: 2em; margin-bottom: 12px;">COURSE DEVELOPER (remove before publishing)</div>

### Customization Notes

**Platform-Specific Additions:**

When customizing for a specific source platform (Snowflake, Redshift, BigQuery, Synapse, etc.), add:

1. **Compute Mapping**: Document how {SOURCE_PLATFORM} compute resources (warehouses, clusters, serverless) map to Databricks equivalents and typical sizing conversions

2. **Cost Comparison**: Provide specific examples of DBU-based pricing vs. source platform pricing models; include sample cost calculations for common workload types

3. **Observability Tools**: If migrating from a platform with specific monitoring tools (e.g., Snowflake Resource Monitors, Redshift Performance Insights), document equivalent capabilities in Databricks

4. **IaC Patterns**: If the source platform uses specific IaC tools (e.g., SnowSQL scripts, CloudFormation for Redshift), provide migration patterns to Terraform/DABs

5. **Tagging Standards**: If the organization has existing tagging taxonomies from cloud providers or source platforms, map them to Databricks custom tags

6. **Alert Templates**: Provide platform-specific alert query templates based on common operational patterns from the source system

**Integration Examples:**

Add code examples for:
- Source platform-specific log forwarding patterns
- ITSM integrations (ServiceNow, Jira) relevant to the organization
- Cost allocation to existing finance systems (SAP, Oracle Financials)
- Compliance reporting specific to industry (HIPAA, SOX, GDPR)

**Hands-On Labs:**

Consider adding:
- Lab 1: Configure a Serverless SQL Warehouse and monitor performance
- Lab 2: Build a cost attribution dashboard using `system.billing.usage`
- Lab 3: Set up log delivery and create alerts
- Lab 4: Deploy infrastructure using Terraform or Asset Bundles

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>