<div style="display: flex; justify-content: space-between; align-items: center; padding: 8px 16px; background: #F8F9FA; border-bottom: 2px solid #E0E0E0; margin: 0; line-height: 1;">
    <div style="font-size: 14px; color: #666;">
        <span style="font-weight: bold; color: #333;">{SOURCE_PLATFORM} → Databricks Migration</span>
        <span style="margin-left: 8px; color: #999;">|</span>
        <span style="margin-left: 8px;">02 - Design</span>
    </div>
    <div style="display: flex; align-items: center; gap: 8px;">
        <img src="https://cdn.simpleicons.org/snowflake/29B5E8" width="24" height="24"/>
        <span style="color: #999; font-size: 16px;">→</span>
        <img src="https://cdn.simpleicons.org/databricks/FF3621" width="24" height="24"/>
    </div>
</div>


<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>

# Storage and Governance Design

## Overview

This module covers storage architecture design, data organization patterns, governance hierarchies, and managed storage configuration for your Databricks lakehouse. These design choices drive query performance, storage cost, and long-term maintainability.

## Learning Objectives

By the end of this lesson, you will be able to:
- Design storage architecture for Bronze/Silver/Gold layers
- Configure managed and external storage locations
- Establish data organization and partitioning strategies
- Define ownership and governance hierarchies
- Implement data lifecycle and retention policies
- Plan for data quality and validation frameworks

## Storage Architecture Overview

Databricks separates compute from storage, storing data in cloud object storage (S3, ADLS Gen2, GCS) with Delta Lake as the default format.

<br />
<div class="mermaid">
flowchart TB
    subgraph STORAGE["Storage Architecture"]
        direction TB
        A["Unity Catalog<br/>Metastore Storage"]
        B["Managed Storage<br/>(UC-controlled)"]
        C["External Storage<br/>(customer-controlled)"]
        D["Staging/Landing<br/>Zones"]
    end
    A --> B
    A --> C
    B --> D
    style STORAGE fill:#fff,stroke:#FF3621,stroke-width:2px
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

| Storage Type | Purpose | Managed By | Use Case |
|--------------|---------|------------|----------|
| **Metastore Storage** | Unity Catalog metadata | Databricks | System tables, audit logs, lineage |
| **Managed Storage** | Default table storage | Unity Catalog | Production tables with full UC governance |
| **External Storage** | Customer-specified locations | Customer + UC governance | Legacy data, shared storage, specific compliance needs |
| **Landing/Staging** | Temporary ingestion zones | Customer | File ingestion before processing |

## Medallion Architecture Design

The Bronze/Silver/Gold pattern organizes data by quality and transformation level, providing clear separation of concerns.

<br />
<div class="mermaid">
flowchart LR
    Sources["Data<br/>Sources"] --> Bronze["🟤 Bronze<br/><i>Raw</i>"]
    Bronze --> Silver["⚪ Silver<br/><i>Cleansed</i>"]
    Silver --> Gold["🟡 Gold<br/><i>Curated</i>"]
    Gold --> Consumers["BI / ML /<br/>Applications"]
    style Bronze fill:#D2691E,color:#fff
    style Silver fill:#C0C0C0,color:#000
    style Gold fill:#FFD700,color:#000
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### Layer Definitions

| Layer | Purpose | Characteristics | Retention | Access |
|-------|---------|----------------|-----------|--------|
| **Bronze** | Raw data as ingested | Immutable, append-only, full history | Long-term (years) | Data engineers |
| **Silver** | Cleansed and conformed | Validated, deduplicated, type-safe | Medium-term (months-years) | Data engineers, analysts |
| **Gold** | Business-level aggregates | Denormalized, optimized for queries | Short-term (months) | Analysts, BI tools, applications |

### Storage Organization

```
s3://my-org-lakehouse/
├── bronze/
│   └── {source_system}/
│       └── {table_name}/
│           ├── _delta_log/
│           └── data files
├── silver/
│   └── {domain}/
│       └── {entity}/
│           ├── _delta_log/
│           └── data files
└── gold/
    └── {business_area}/
        └── {aggregate}/
            ├── _delta_log/
            └── data files
```
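
The layout above can be expressed as a small path-building helper. This is a minimal sketch: the function name `table_path` and the `LAYERS` tuple are hypothetical and not part of any Databricks API; only the directory convention comes from the tree above.

```python
from posixpath import join

# Hypothetical helper; the layer names mirror the tree above.
LAYERS = ("bronze", "silver", "gold")


def table_path(root: str, layer: str, group: str, table: str) -> str:
    """Build the storage path for one table under the medallion layout.

    `group` is the source system (bronze), the domain (silver),
    or the business area (gold).
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    # Strip a trailing slash from the root so the join does not double it.
    return join(root.rstrip("/"), layer, group, table) + "/"


print(table_path("s3://my-org-lakehouse/", "bronze", "salesforce", "accounts"))
# s3://my-org-lakehouse/bronze/salesforce/accounts/
```

Centralizing path construction in one function keeps every pipeline writing to the same convention, which matters most for external tables where the path is chosen by the customer.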

## Managed vs External Storage

Choose between Unity Catalog-managed storage and external locations based on your requirements.

### Managed Storage

Unity Catalog manages the storage location automatically.

| Aspect | Details |
|--------|---------|
| **Location** | UC automatically assigns within managed storage root |
| **Governance** | Full Unity Catalog governance |
| **Lifecycle** | Tables and data managed together; DROP TABLE deletes data |
| **Use Case** | New tables, production workloads, full governance |

<div class="code-block" data-language="sql">
-- Create managed table (default)
CREATE TABLE prod_catalog.silver.customers (
  customer_id BIGINT,
  customer_name STRING,
  email STRING,
  created_at TIMESTAMP
) USING DELTA;
</div>

### External Storage

Customer specifies and controls the storage location.

| Aspect | Details |
|--------|---------|
| **Location** | Customer-specified cloud storage path |
| **Governance** | UC governs access; customer owns storage |
| **Lifecycle** | DROP TABLE removes metadata only; data remains |
| **Use Case** | Legacy data, shared storage, data lake migration |

<div class="code-block" data-language="sql">
-- Create external table
CREATE EXTERNAL TABLE prod_catalog.bronze.legacy_customers
LOCATION 's3://my-bucket/legacy/customers/'
AS SELECT * FROM ...;
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## External Locations and Storage Credentials

External locations provide governed access to specific cloud storage paths.

### Storage Credential Setup

<div class="code-block" data-language="sql">
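-- Illustrative syntax: storage credentials are commonly created through Catalog Explorer,
-- the Databricks CLI or REST API, or Terraform. Verify the SQL form in your workspace.
-- AWS Example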
CREATE STORAGE CREDENTIAL bronze_storage_credential
WITH (AWS_IAM_ROLE 'arn:aws:iam::123456789012:role/databricks-bronze-access');

-- Azure Example  
CREATE STORAGE CREDENTIAL bronze_storage_credential
WITH (AZURE_MANAGED_IDENTITY '/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity-name}');

-- GCP Example
CREATE STORAGE CREDENTIAL bronze_storage_credential
WITH (GCP_SERVICE_ACCOUNT_EMAIL 'databricks-sa@project-id.iam.gserviceaccount.com');
</div>

### External Location Definition

<div class="code-block" data-language="sql">
-- Create external location for landing zone
CREATE EXTERNAL LOCATION landing_zone
URL 's3://my-org-landing-zone/'
WITH (STORAGE CREDENTIAL bronze_storage_credential);

-- Grant permissions
GRANT READ FILES ON EXTERNAL LOCATION landing_zone TO `data-engineers`;
GRANT WRITE FILES ON EXTERNAL LOCATION landing_zone TO `ingestion-service-principal`;
</div>

### Authorized Paths

When creating foreign catalogs or connections, specify authorized paths to restrict access:

<div class="code-block" data-language="sql">
CREATE FOREIGN CATALOG legacy_data
USING CONNECTION glue_connection
OPTIONS (authorized_paths 's3://legacy-bucket/prod/');
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## Data Organization Patterns

Organize data within catalogs and schemas for discoverability and governance.

### Catalog Organization Strategies

| Strategy | Structure | Pros | Cons |
|----------|-----------|------|------|
| **By Environment** | `dev`, `staging`, `prod` | Clear promotion path | Data duplicated across environments |
| **By Domain** | `sales`, `marketing`, `finance` | Domain ownership | Cross-domain queries more complex |
| **By Layer** | `bronze`, `silver`, `gold` | Clear quality levels | All domains in one catalog |
| **Hybrid** | `prod_sales`, `prod_marketing` | Both separation and clarity | More catalogs to manage |

### Schema Organization

Within catalogs, organize schemas logically:

| Pattern | Example Schemas | Use Case |
|---------|-----------------|----------|
| **By Source** | `salesforce`, `stripe`, `netsuite` | Bronze layer - preserve source identity |
| **By Domain** | `customers`, `orders`, `inventory` | Silver layer - business entities |
| **By Use Case** | `daily_sales`, `customer_360`, `forecasting` | Gold layer - consumption patterns |

<div class="code-block" data-language="sql">
-- Environment-based catalogs
CREATE CATALOG dev;
CREATE CATALOG staging;
CREATE CATALOG prod;

-- Within prod, organize by layer and domain
CREATE SCHEMA prod.bronze_salesforce;
CREATE SCHEMA prod.bronze_stripe;
CREATE SCHEMA prod.silver_customers;
CREATE SCHEMA prod.silver_orders;
CREATE SCHEMA prod.gold_analytics;
</div>
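
The naming convention in the example above (`{layer}_{name}` schemas inside an environment catalog) can be generated rather than typed by hand. The sketch below is an assumption-laden illustration: `schema_ddl` is a hypothetical helper that only builds SQL strings, and `IF NOT EXISTS` is added so the statements can be re-run safely.

```python
def schema_ddl(catalog: str, layer_to_names: dict[str, list[str]]) -> list[str]:
    """Return CREATE SCHEMA statements named `{layer}_{name}` in `catalog`."""
    statements = []
    for layer, names in layer_to_names.items():
        for name in names:
            statements.append(
                f"CREATE SCHEMA IF NOT EXISTS {catalog}.{layer}_{name};"
            )
    return statements


# Hypothetical layout mirroring the SQL example above.
layout = {
    "bronze": ["salesforce", "stripe"],
    "silver": ["customers", "orders"],
    "gold": ["analytics"],
}

for statement in schema_ddl("prod", layout):
    print(statement)
```

Running the same layout against `dev` and `staging` keeps schema names identical across environments, which simplifies promotion.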

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## Partitioning and Clustering

Optimize query performance and cost using partitioning and liquid clustering.

### Partitioning Strategy

Traditional partitioning creates physical directories on disk:

| When to Partition | Partition Key Choices |
|-------------------|----------------------|
| Large tables (> 1TB) | Date columns (`year`, `month`, `day`) |
| Time-series data | Timestamp truncated to day/hour |
| Multi-tenant data | `tenant_id`, `region` |
| Historical archives | `year`, `quarter` |

<div class="code-block" data-language="sql">
-- Traditional partitioning
CREATE TABLE prod.silver.events (
  event_id BIGINT,
  event_type STRING,
  user_id BIGINT,
  event_timestamp TIMESTAMP,
  event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);
</div>

### Liquid Clustering (Recommended)

Liquid clustering automatically optimizes layout without explicit partition columns:

<div class="code-block" data-language="sql">
-- Liquid clustering (auto-optimizes)
CREATE TABLE prod.silver.events (
  event_id BIGINT,
  event_type STRING,
  user_id BIGINT,
  event_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (event_type, event_timestamp);
</div>

| Approach | Pros | Cons |
|----------|------|------|
| **Partitioning** | Proven, well-understood | Requires choosing keys upfront; can cause small files |
| **Liquid Clustering** | Auto-optimizes, no small file issues | Requires OPTIMIZE to realize benefits |

> **Recommendation:** Use liquid clustering for new tables unless you have specific partitioning requirements.
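
A quick size check can help decide whether a candidate partition key is too fine-grained. The sketch below uses the 1 TB table threshold from the table above; the 1 GiB minimum average partition size is an assumed rule of thumb, and both function names are hypothetical.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def avg_partition_bytes(table_bytes: int, partition_count: int) -> float:
    """Average bytes per partition, assuming data is spread evenly."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    return table_bytes / partition_count


def partitioning_looks_reasonable(
    table_bytes: int,
    partition_count: int,
    min_table_bytes: int = TIB,      # threshold from the table above
    min_partition_bytes: int = GIB,  # assumed rule of thumb
) -> bool:
    """True when the table is large and partitions are not tiny."""
    return (
        table_bytes > min_table_bytes
        and avg_partition_bytes(table_bytes, partition_count) >= min_partition_bytes
    )


# 2 TiB of events over three years: daily vs hourly partitions.
print(partitioning_looks_reasonable(2 * TIB, 3 * 365))       # True
print(partitioning_looks_reasonable(2 * TIB, 3 * 365 * 24))  # False
```

In this example daily partitions average about 1.9 GiB, while hourly partitions average about 80 MiB, which is the small-file problem listed as a drawback of partitioning.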

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## Governance Hierarchies

Establish ownership and governance roles for clear accountability.

### Ownership Model

| Level | Owner | Responsibilities |
|-------|-------|------------------|
| **Metastore** | Platform team | Overall governance, metastore admin |
| **Catalog** | Domain or environment team | Catalog-level policies, schema creation |
| **Schema** | Sub-domain or project team | Table creation, access grants |
| **Table** | Data engineer or analyst | Data quality, documentation, evolution |

<div class="code-block" data-language="sql">
-- Assign ownership
ALTER CATALOG prod OWNER TO `platform-team`;
ALTER SCHEMA prod.silver_customers OWNER TO `customer-data-team`;
ALTER TABLE prod.silver_customers.customers OWNER TO `john.doe@company.com`;
</div>

### Privilege Hierarchy

Privileges inherit down the hierarchy:

```
METASTORE
  ├── CREATE CATALOG
  ├── CREATE CONNECTION
  └── CREATE STORAGE CREDENTIAL
       |
       CATALOG
         ├── USE CATALOG (required to access)
         └── CREATE SCHEMA
              |
              SCHEMA
                ├── USE SCHEMA (required to access)
                ├── CREATE TABLE (also covers views)
                └── CREATE VOLUME
                     |
                     TABLE/VIEW
                       ├── SELECT
                       └── MODIFY (tables only)

EXTERNAL LOCATION
  ├── READ FILES
  └── WRITE FILES
```

<div class="code-block" data-language="sql">
-- Grant catalog access to everyone (Unity Catalog uses USE CATALOG / USE SCHEMA, not USAGE)
GRANT USE CATALOG ON CATALOG prod TO `all-users`;

-- Grant schema access to data engineers
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA prod.silver_customers TO `data-engineers`;

-- Grant read access to analysts; SELECT on the schema applies to every table in it
GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold_analytics TO `analysts`;
</div>
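
To make the inheritance rule concrete, the following sketch models how a read check could be resolved. It is a simplified illustration and not a Databricks API: grants are plain `(securable, privilege)` pairs for a single principal, the function names are hypothetical, and the privilege names are the Unity Catalog ones (`USE CATALOG`, `USE SCHEMA`, `SELECT`).

```python
Grants = set[tuple[str, str]]


def has_privilege(grants: Grants, securables: list[str], privilege: str) -> bool:
    """True if `privilege` is granted on any securable in the chain.

    A grant on a parent (catalog or schema) counts for its children.
    """
    return any((securable, privilege) in grants for securable in securables)


def can_select(grants: Grants, table: str) -> bool:
    """Check the three privileges needed to read `catalog.schema.table`."""
    catalog, schema, _ = table.split(".")
    schema_name = f"{catalog}.{schema}"
    return (
        has_privilege(grants, [catalog], "USE CATALOG")
        and has_privilege(grants, [catalog, schema_name], "USE SCHEMA")
        and has_privilege(grants, [catalog, schema_name, table], "SELECT")
    )


analyst_grants: Grants = {
    ("prod", "USE CATALOG"),
    ("prod.gold_analytics", "USE SCHEMA"),
    ("prod.gold_analytics", "SELECT"),  # schema-level grant covers all tables
}

print(can_select(analyst_grants, "prod.gold_analytics.daily_sales"))   # True
print(can_select(analyst_grants, "prod.silver_customers.customers"))  # False
```

The second check fails because nothing grants `USE SCHEMA` or `SELECT` on `prod.silver_customers`, even though the catalog itself is accessible.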

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## Data Lifecycle and Retention

Define retention policies and implement data lifecycle management.

### Retention Strategy

| Layer | Retention Period | Rationale |
|-------|-----------------|-----------|
| **Bronze** | 5-7 years | Compliance, reprocessing, audit trail |
| **Silver** | 2-3 years | Operational analytics, debugging |
| **Gold** | 6-12 months | Active consumption; refresh from Silver as needed |
| **Temporary/Staging** | 7-30 days | Short-term processing only |

### Time Travel and Retention

Delta Lake provides time travel for historical queries:

<div class="code-block" data-language="sql">
-- Extend retention (defaults: 30 days for the log, 7 days for deleted files)
ALTER TABLE prod.silver.customers
SET TBLPROPERTIES (
  'delta.logRetentionDuration' = '90 days',
  'delta.deletedFileRetentionDuration' = '90 days'
);

-- Query historical versions
SELECT * FROM prod.silver.customers VERSION AS OF 42;
SELECT * FROM prod.silver.customers TIMESTAMP AS OF '2025-01-01';
</div>

### Vacuum for Storage Cleanup

Remove old data files to reclaim storage:

<div class="code-block" data-language="sql">
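-- Remove unreferenced files older than 7 days; RETAIN should not be shorter than
-- the table's delta.deletedFileRetentionDuration, or the retention safety check fails
VACUUM prod.silver.customers RETAIN 168 HOURS; -- 7 days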

-- Dry run to see what would be deleted
VACUUM prod.silver.customers DRY RUN;
</div>

> **Caution:** VACUUM permanently deletes files. Ensure retention period is sufficient for time travel needs.
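
The caution above reduces to simple arithmetic: the `RETAIN` window in hours must cover the number of days you need for time travel. The sketch below shows that check; `vacuum_statement` is a hypothetical helper that only builds a SQL string and refuses to do so when the window is too short.

```python
def vacuum_is_safe(retain_hours: int, time_travel_days: int) -> bool:
    """True when the VACUUM window covers the required time travel range."""
    return retain_hours >= time_travel_days * 24


def vacuum_statement(table: str, retain_hours: int, time_travel_days: int) -> str:
    """Build a VACUUM statement, rejecting windows that are too short."""
    if not vacuum_is_safe(retain_hours, time_travel_days):
        raise ValueError(
            f"RETAIN {retain_hours} HOURS is shorter than "
            f"{time_travel_days} days of required time travel"
        )
    return f"VACUUM {table} RETAIN {retain_hours} HOURS;"


print(vacuum_is_safe(168, 7))    # True: 168 hours is exactly 7 days
print(vacuum_is_safe(168, 90))   # False: 90 days needs 2160 hours
print(vacuum_statement("prod.silver.customers", 2160, 90))
```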

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## Volumes for File Storage

Unity Catalog Volumes provide governed access to files (PDFs, images, models, etc.).

### Volume Types

| Type | Use Case | Location Control |
|------|----------|------------------|
| **Managed Volume** | UC-managed file storage | UC assigns location |
| **External Volume** | Existing file storage | Customer specifies path |

<div class="code-block" data-language="sql">
-- Create managed volume for ML models
CREATE VOLUME prod.ml_models.artifacts;

-- Upload files via Databricks UI or dbutils
-- Access: /Volumes/prod/ml_models/artifacts/model_v1.pkl

-- Create external volume for legacy files
CREATE EXTERNAL VOLUME prod.legacy.documents
LOCATION 's3://legacy-files/documents/';
</div>

### File Access Patterns

<div class="code-block" data-language="python">
import json

# List files in the volume via dbutils (available in Databricks notebooks)
dbutils.fs.ls("/Volumes/prod/ml_models/artifacts")

# Direct file path access with standard Python file APIs
with open("/Volumes/prod/ml_models/artifacts/config.json", "r") as f:
    config = json.load(f)
</div>
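
Volume paths follow the `/Volumes/{catalog}/{schema}/{volume}/...` pattern shown in the SQL example. A small helper can keep that pattern in one place; `volume_path` below is a hypothetical function for illustration and only builds strings.

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a Unity Catalog volume path from its three-level name."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


print(volume_path("prod", "ml_models", "artifacts", "model_v1.pkl"))
# /Volumes/prod/ml_models/artifacts/model_v1.pkl
```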

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## Data Quality and Validation

Define data quality standards and implement validation frameworks.

### Quality Dimensions

| Dimension | Description | Validation Approach |
|-----------|-------------|---------------------|
| **Completeness** | No unexpected nulls | NOT NULL constraints, expectations |
| **Uniqueness** | No duplicates where expected | Primary key constraints, deduplication |
| **Validity** | Values within acceptable range | CHECK constraints, regex validation |
| **Consistency** | Related values agree | Foreign key checks, cross-field validation |
| **Timeliness** | Data freshness | Timestamp checks, SLA monitoring |

### Data Quality Patterns

| Pattern | Implementation |
|---------|----------------|
| **Schema Enforcement** | Delta table schema, NOT NULL constraints |
| **Expectations** | Great Expectations, DLT Expectations |
| **Audit Columns** | `created_at`, `updated_at`, `source_system` |
| **Quarantine** | Separate invalid records to `_quarantine` table |

<div class="code-block" data-language="sql">
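-- Table constraints: NOT NULL is declared inline
CREATE TABLE prod.silver.customers (
  customer_id BIGINT NOT NULL,
  email STRING NOT NULL,
  age INT,
  created_at TIMESTAMP NOT NULL
);

-- CHECK constraints on Delta tables are added with ALTER TABLE
ALTER TABLE prod.silver.customers
  ADD CONSTRAINT valid_email CHECK (email LIKE '%@%');
ALTER TABLE prod.silver.customers
  ADD CONSTRAINT valid_age CHECK (age >= 0 AND age <= 120);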

-- DLT expectations
CREATE LIVE TABLE silver_customers (
  CONSTRAINT valid_customer_id EXPECT (customer_id IS NOT NULL),
  CONSTRAINT valid_email EXPECT (email RLIKE '^[^@]+@[^@]+\\.[^@]+$') ON VIOLATION DROP ROW
)
AS SELECT ...
</div>
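
The quarantine pattern from the table above can be sketched in plain Python to show the routing logic independent of any engine. The rule names mirror the constraints in the SQL example; the function names, the `_violations` field, and the choice to allow a missing `age` are assumptions of this sketch.

```python
import re

EMAIL_RE = re.compile(r"^[^@]+@[^@]+\.[^@]+$")


def violations(record: dict) -> list[str]:
    """Return the names of the quality rules that `record` fails."""
    failed = []
    if record.get("customer_id") is None:
        failed.append("valid_customer_id")
    email = record.get("email")
    if email is None or not EMAIL_RE.match(email):
        failed.append("valid_email")
    age = record.get("age")
    # Assumption of this sketch: a missing age is allowed.
    if age is not None and not (0 <= age <= 120):
        failed.append("valid_age")
    return failed


def split_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route clean rows to `valid` and failing rows to `quarantine`."""
    valid, quarantine = [], []
    for record in records:
        failed = violations(record)
        if failed:
            # Keep the failed rule names so the row can be triaged later.
            quarantine.append({**record, "_violations": failed})
        else:
            valid.append(record)
    return valid, quarantine


rows = [
    {"customer_id": 1, "email": "ana@example.com", "age": 34},
    {"customer_id": 2, "email": "not-an-email", "age": 150},
    {"customer_id": None, "email": "li@example.com", "age": None},
]
valid, quarantine = split_records(rows)
print(len(valid), len(quarantine))  # 1 2
```

Recording which rules failed alongside each quarantined row makes it possible to fix the source and replay the rows instead of silently dropping them.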

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>

## Summary

### Storage Design Checklist

| Area | Checklist Item | Status |
|------|---------------|--------|
| **Architecture** | Medallion layers defined (Bronze/Silver/Gold) | ☐ |
| | Storage type chosen (managed vs external) | ☐ |
| | Landing/staging zones configured | ☐ |
| **Organization** | Catalog strategy selected | ☐ |
| | Schema naming conventions defined | ☐ |
| | Table naming standards documented | ☐ |
| **Storage** | Storage credentials created | ☐ |
| | External locations defined | ☐ |
| | Managed storage roots configured | ☐ |
| **Optimization** | Partitioning or clustering strategy chosen | ☐ |
| | Retention policies defined by layer | ☐ |
| | VACUUM schedule established | ☐ |
| **Governance** | Ownership model defined | ☐ |
| | Privilege hierarchy documented | ☐ |
| | Data quality dimensions identified | ☐ |
| **Volumes** | Volume strategy for file storage | ☐ |

### Key Design Decisions

| Decision | Options | Recommendation |
|----------|---------|----------------|
| **Storage Type** | Managed vs External | Managed for new data; external for legacy |
| **Catalog Organization** | Environment, domain, layer, hybrid | Hybrid: `{env}_{domain}` |
| **Optimization** | Partitioning vs Liquid Clustering | Liquid clustering for new tables |
| **Retention** | Bronze: 5-7y, Silver: 2-3y, Gold: 6-12mo | Align with compliance needs |

### Next Steps

- Proceed to [**2.4 - Security and Access Design**]($./2.4 - Security and Access Design) for RBAC and security patterns

<div style="color: #FF3621; font-weight: bold; font-size: 2em; margin-bottom: 12px;">COURSE DEVELOPER (remove before publishing)</div>

### Template Customization

**Placeholders to replace:**
- `{SOURCE_PLATFORM}` - Source platform name
- Cloud-specific storage paths and IAM examples

**Platform-specific additions:**
- Add cloud-specific storage best practices
- Include region-specific considerations
- Reference cloud-specific optimization techniques

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>
