# dbt (Data Build Tool) Deep Dive

A comprehensive guide to modern analytics engineering with dbt.

## Table of Contents

1. [What is dbt and Why It Matters](#1-what-is-dbt-and-why-it-matters)
2. [dbt Project Structure](#2-dbt-project-structure)
3. [Model Types](#3-model-types)
4. [Materializations](#4-materializations)
5. [Testing in dbt](#5-testing-in-dbt)
6. [Documentation and Lineage](#6-documentation-and-lineage)
7. [Jinja Templating and Macros](#7-jinja-templating-and-macros)
8. [Incremental Models Deep Dive](#8-incremental-models-deep-dive)
9. [dbt Best Practices](#9-dbt-best-practices)
10. [dbt Cloud vs dbt Core](#10-dbt-cloud-vs-dbt-core)

---

## 1. What is dbt and Why It Matters

### The Evolution of Data Transformation

**dbt (data build tool)** is an open-source command-line tool that lets data analysts and engineers transform data already loaded into their warehouse by writing SQL `SELECT` statements. It handles the T (Transform) in ELT.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        Traditional ETL vs Modern ELT                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Traditional ETL:                                                          │
│   ┌──────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────┐     │
│   │  Source  │───▶│   Extract    │───▶│  Transform   │───▶│   Load   │     │
│   │  Systems │    │  (External)  │    │  (External)  │    │   (DW)   │     │
│   └──────────┘    └──────────────┘    └──────────────┘    └──────────┘     │
│                                                                             │
│   Modern ELT with dbt:                                                      │
│   ┌──────────┐    ┌──────────┐    ┌──────────────────────────────────┐     │
│   │  Source  │───▶│  Extract │───▶│         Data Warehouse           │     │
│   │  Systems │    │  & Load  │    │  ┌─────────────────────────────┐ │     │
│   └──────────┘    └──────────┘    │  │   Transform with dbt        │ │     │
│                                    │  │   (SQL + Jinja)             │ │     │
│                                    │  └─────────────────────────────┘ │     │
│                                    └──────────────────────────────────┘     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Why dbt Matters

| Aspect | Traditional ETL | dbt (ELT) |
|--------|-----------------|------------|
| **Language** | Python, Java, proprietary | SQL + Jinja |
| **Transformation Location** | External servers | Inside the warehouse |
| **Version Control** | Often difficult | Git-native |
| **Testing** | Custom frameworks | Built-in testing |
| **Documentation** | Separate tools | Auto-generated |
| **Scalability** | Limited by ETL server | Warehouse compute |
| **Learning Curve** | Steep | Gentle (SQL-based) |

### The Analytics Engineering Role

dbt helped popularize the **Analytics Engineer**, a role that bridges data engineering and data analysis:

```
┌─────────────────────────────────────────────────────────────────────┐
│                    The Analytics Engineering Spectrum               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Data Engineer          Analytics Engineer         Data Analyst   │
│   ┌───────────┐          ┌───────────────┐         ┌───────────┐   │
│   │ Pipelines │          │   Modeling    │         │ Reporting │   │
│   │ Ingestion │◀────────▶│   Testing     │◀───────▶│ Analysis  │   │
│   │ Infra     │          │   dbt/SQL     │         │ Insights  │   │
│   └───────────┘          └───────────────┘         └───────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Core dbt Concepts

```python
# Key dbt concepts overview
dbt_concepts = {
    "Models": "SQL SELECT statements that define transformations",
    "Sources": "Raw data tables that dbt reads from",
    "Seeds": "CSV files loaded into the warehouse",
    "Snapshots": "Point-in-time captures for SCD Type 2",
    "Macros": "Reusable Jinja code snippets",
    "Tests": "Data quality assertions",
    "Documentation": "Auto-generated data dictionary",
    "Packages": "Shareable dbt code modules"
}
```

---

## 2. dbt Project Structure

### Standard Project Layout

```
my_dbt_project/
│
├── dbt_project.yml          # Project configuration
├── profiles.yml             # Connection profiles (usually in ~/.dbt/)
├── packages.yml             # External package dependencies
│
├── models/                  # SQL transformation files
│   ├── staging/             # Staging models (1:1 with sources)
│   │   ├── _staging.yml     # Schema definitions
│   │   ├── stg_customers.sql
│   │   └── stg_orders.sql
│   │
│   ├── intermediate/        # Business logic transformations
│   │   ├── _intermediate.yml
│   │   └── int_orders_pivoted.sql
│   │
│   └── marts/               # Final business entities
│       ├── core/
│       │   ├── _core.yml
│       │   ├── dim_customers.sql
│       │   └── fct_orders.sql
│       └── marketing/
│           ├── _marketing.yml
│           └── mkt_customer_ltv.sql
│
├── seeds/                   # CSV files to load
│   ├── country_codes.csv
│   └── product_categories.csv
│
├── snapshots/               # SCD Type 2 snapshots
│   └── snapshot_orders.sql
│
├── macros/                  # Reusable Jinja macros
│   ├── generate_schema_name.sql
│   └── cents_to_dollars.sql
│
├── tests/                   # Custom data tests
│   └── assert_total_payment_positive.sql
│
├── analyses/                # Ad-hoc analysis queries
│   └── monthly_revenue.sql
│
└── target/                  # Compiled SQL (git-ignored)
    └── compiled/
```

### The dbt_project.yml File

This is the main configuration file for your dbt project:

```yaml
# dbt_project.yml
name: 'my_analytics_project'
version: '1.0.0'
config-version: 2

# Profile to use (defined in profiles.yml)
profile: 'analytics'

# Directories
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets: ["target", "dbt_packages"]

# Model configurations
models:
  my_analytics_project:
    # Default materialization
    +materialized: view
    
    staging:
      +materialized: view
      +schema: staging
      +tags: ['staging', 'daily']
    
    intermediate:
      +materialized: ephemeral
      +schema: intermediate
    
    marts:
      core:
        +materialized: table
        +schema: core
      marketing:
        +materialized: table
        +schema: marketing

# Seed configurations
seeds:
  my_analytics_project:
    +schema: seeds
    country_codes:
      +column_types:
        country_code: varchar(3)

# Variables
vars:
  start_date: '2020-01-01'
  default_currency: 'USD'
```
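One detail of the `+schema` settings above is easy to miss: with dbt's default `generate_schema_name` macro, a custom schema is appended to the target schema from `profiles.yml` rather than replacing it. The sketch below mirrors that default rule in Python. The function name `resolve_schema` is illustrative and not part of dbt.

```python
from typing import Optional


def resolve_schema(target_schema: str, custom_schema: Optional[str]) -> str:
    """Mimic the default rule: with no custom schema, use the target schema;
    otherwise build '<target_schema>_<custom_schema>'."""
    if custom_schema is None:
        return target_schema
    return f"{target_schema}_{custom_schema.strip()}"


print(resolve_schema("dbt_alice", None))       # dbt_alice
print(resolve_schema("dbt_alice", "staging"))  # dbt_alice_staging
```

Projects that want bare names such as `staging` in production typically override this macro, which is why `macros/generate_schema_name.sql` appears in the project layout.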

### The profiles.yml File

Connection profiles are typically stored in `~/.dbt/profiles.yml`:

```yaml
# profiles.yml
analytics:
  target: dev  # Default target
  outputs:
    
    dev:
      type: snowflake
      account: xy12345.us-east-1
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: transformer
      database: analytics_dev
      warehouse: transforming
      schema: "dbt_{{ env_var('DBT_USER') }}"
      threads: 4
    
    prod:
      type: snowflake
      account: xy12345.us-east-1
      user: "{{ env_var('DBT_PROD_USER') }}"
      password: "{{ env_var('DBT_PROD_PASSWORD') }}"
      role: transformer_prod
      database: analytics
      warehouse: transforming_xl
      schema: public
      threads: 8

# BigQuery example
analytics_bq:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-analytics-project
      dataset: dbt_dev
      threads: 4
      job_execution_timeout_seconds: 300
      location: US

# Databricks example
analytics_databricks:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main
      schema: dbt_dev
      host: "{{ env_var('DATABRICKS_HOST') }}"
      http_path: /sql/1.0/warehouses/abc123
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
```
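The `env_var()` calls above keep credentials out of version control. As a rough mental model (not dbt's actual implementation), the lookup behaves like the Python sketch below: it returns the environment value, falls back to an optional default, and fails loudly when neither exists. The error message text and the `alice` value are illustrative.

```python
import os
from typing import Optional


def env_var(name: str, default: Optional[str] = None) -> str:
    """Return the environment variable, the default, or raise if both are missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Env var required but not provided: '{name}'")
    return value


os.environ["DBT_USER"] = "alice"
print(f"dbt_{env_var('DBT_USER')}")         # dbt_alice
print(env_var("DBT_THREADS_UNSET_EXAMPLE", "4"))  # 4
```

Failing on a missing variable is the useful part: a misconfigured environment stops before any SQL runs instead of connecting with an empty user or password.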

### Sources Definition

Sources define the raw data tables that dbt reads from:

```yaml
# models/staging/_sources.yml
version: 2

sources:
  - name: ecommerce
    description: "Raw e-commerce data from the production database"
    database: raw_data
    schema: ecommerce_prod
    
    # Source-level freshness
    freshness:
      warn_after:
        count: 12
        period: hour
      error_after:
        count: 24
        period: hour
    
    # Default for all tables
    loaded_at_field: _etl_loaded_at
    
    tables:
      - name: customers
        description: "Customer master data"
        identifier: raw_customers  # Actual table name if different
        columns:
          - name: customer_id
            description: "Primary key"
            data_tests:
              - unique
              - not_null
          - name: email
            description: "Customer email address"
            data_tests:
              - unique
      
      - name: orders
        description: "Order transactions"
        freshness:
          warn_after:
            count: 1
            period: hour
        columns:
          - name: order_id
            data_tests:
              - unique
              - not_null
          - name: customer_id
            data_tests:
              - relationships:
                  to: source('ecommerce', 'customers')
                  field: customer_id
      
      - name: products
        description: "Product catalog"
      
      - name: order_items
        description: "Line items for each order"
```
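The `freshness` block above is evaluated by `dbt source freshness`, which compares the newest `loaded_at_field` value against the current time. The Python sketch below illustrates the threshold logic only. The function name and the exact boundary handling (strictly greater than) are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_status(
    max_loaded_at: datetime,
    now: datetime,
    warn_after: Optional[timedelta],
    error_after: Optional[timedelta],
) -> str:
    """Classify a source as 'pass', 'warn', or 'error' by the age of its newest row."""
    age = now - max_loaded_at
    if error_after is not None and age > error_after:
        return "error"
    if warn_after is not None and age > warn_after:
        return "warn"
    return "pass"


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
loaded = datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc)  # newest row is 16 hours old
print(freshness_status(loaded, now, timedelta(hours=12), timedelta(hours=24)))  # warn
```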

---

## 3. Model Types

### The Three-Layer Model Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         dbt Model Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   RAW LAYER (Sources)                                                       │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐          │
│   │raw_customers│ │ raw_orders  │ │raw_products │ │raw_order_   │          │
│   │             │ │             │ │             │ │   items     │          │
│   └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘          │
│          │               │               │               │                  │
│          ▼               ▼               ▼               ▼                  │
│   STAGING LAYER (1:1 with sources, renaming, type casting)                 │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐          │
│   │stg_customers│ │ stg_orders  │ │stg_products │ │stg_order_   │          │
│   │             │ │             │ │             │ │   items     │          │
│   └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘          │
│          │               │               │               │                  │
│          └───────────────┼───────────────┼───────────────┘                  │
│                          ▼               │                                  │
│   INTERMEDIATE LAYER (Business logic, joins, aggregations)                 │
│                    ┌─────────────┐       │                                  │
│                    │int_orders_  │       │                                  │
│                    │  enriched   │◀──────┘                                  │
│                    └──────┬──────┘                                          │
│                           │                                                 │
│                           ▼                                                 │
│   MARTS LAYER (Final business entities for consumption)                    │
│   ┌─────────────────────────────────────────────────────────┐              │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │              │
│   │  │dim_customers│  │ fct_orders  │  │ dim_products│     │              │
│   │  └─────────────┘  └─────────────┘  └─────────────┘     │              │
│   └─────────────────────────────────────────────────────────┘              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Staging Models

Staging models create a clean, standardized interface to raw source data:

```sql
-- models/staging/stg_customers.sql

with source as (
    select * from {{ source('ecommerce', 'customers') }}
),

renamed as (
    select
        -- Primary key
        id as customer_id,
        
        -- Attributes
        lower(trim(email)) as email,
        first_name,
        last_name,
        concat(first_name, ' ', last_name) as full_name,
        
        -- Type casting
        cast(created_at as timestamp) as created_at,
        cast(updated_at as timestamp) as updated_at,
        
        -- Status normalization
        case
            when status = 'A' then 'active'
            when status = 'I' then 'inactive'
            when status = 'D' then 'deleted'
            else 'unknown'
        end as customer_status
        
    from source
    where _deleted = false  -- Filter soft deletes
)

select * from renamed
```

```sql
-- models/staging/stg_orders.sql

with source as (
    select * from {{ source('ecommerce', 'orders') }}
),

renamed as (
    select
        -- Keys
        order_id,
        customer_id,
        
        -- Dates
        cast(order_date as date) as order_date,
        cast(shipped_date as date) as shipped_date,
        
        -- Amounts (convert cents to dollars)
        cast(order_total_cents as decimal(18, 2)) / 100 as order_total,
        cast(shipping_cost_cents as decimal(18, 2)) / 100 as shipping_cost,
        cast(tax_cents as decimal(18, 2)) / 100 as tax_amount,
        
        -- Status
        upper(order_status) as order_status,
        
        -- Metadata
        _etl_loaded_at as loaded_at
        
    from source
)

select * from renamed
```

### Intermediate Models

Intermediate models contain business logic and complex transformations:

```sql
-- models/intermediate/int_orders_enriched.sql

with orders as (
    select * from {{ ref('stg_orders') }}
),

order_items as (
    select * from {{ ref('stg_order_items') }}
),

products as (
    select * from {{ ref('stg_products') }}
),

order_items_enriched as (
    select
        oi.order_id,
        oi.product_id,
        oi.quantity,
        oi.unit_price,
        oi.quantity * oi.unit_price as line_total,
        p.product_name,
        p.category,
        p.subcategory
    from order_items oi
    left join products p on oi.product_id = p.product_id
),

order_aggregates as (
    select
        order_id,
        count(distinct product_id) as distinct_products,
        sum(quantity) as total_items,
        sum(line_total) as subtotal,
        array_agg(distinct category) as categories
    from order_items_enriched
    group by order_id
),

final as (
    select
        o.*,
        oa.distinct_products,
        oa.total_items,
        oa.subtotal,
        oa.categories,
        
        -- Derived metrics (assumes order_total = subtotal + shipping + tax - discount)
        oa.subtotal + o.shipping_cost + o.tax_amount - o.order_total as discount_amount,
        datediff('day', o.order_date, o.shipped_date) as days_to_ship
        
    from orders o
    left join order_aggregates oa on o.order_id = oa.order_id
)

select * from final
```

### Marts Models (Dimensions and Facts)

```sql
-- models/marts/core/dim_customers.sql

with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('int_orders_enriched') }}
),

customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders,
        sum(order_total) as lifetime_value
    from orders
    group by customer_id
),

final as (
    select
        -- Surrogate key
        {{ dbt_utils.generate_surrogate_key(['c.customer_id']) }} as customer_sk,
        
        -- Natural key
        c.customer_id,
        
        -- Customer attributes
        c.email,
        c.full_name,
        c.first_name,
        c.last_name,
        c.customer_status,
        c.created_at as customer_created_at,
        
        -- Order metrics
        co.first_order_date,
        co.most_recent_order_date,
        coalesce(co.number_of_orders, 0) as number_of_orders,
        coalesce(co.lifetime_value, 0) as lifetime_value,
        
        -- Customer segments
        case
            when co.number_of_orders is null then 'prospect'
            when co.number_of_orders = 1 then 'new'
            when co.number_of_orders between 2 and 5 then 'returning'
            when co.number_of_orders > 5 then 'loyal'
        end as customer_segment,
        
        case
            when co.lifetime_value >= 1000 then 'high_value'
            when co.lifetime_value >= 500 then 'medium_value'
            when co.lifetime_value >= 100 then 'low_value'
            else 'minimal'
        end as value_tier,
        
        -- Metadata
        current_timestamp as dbt_updated_at
        
    from customers c
    left join customer_orders co on c.customer_id = co.customer_id
)

select * from final
```
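The `customer_sk` column comes from `dbt_utils.generate_surrogate_key`, which hashes the listed columns into a single deterministic key. The Python sketch below approximates that idea under stated assumptions: values are cast to text, NULLs are replaced with a placeholder, parts are joined with `-`, and the result is MD5-hashed. The placeholder string is taken from the dbt_utils 1.x macro as best understood here, and Python's `str()` will not always match a warehouse's text cast, so treat the output as illustrative rather than byte-identical.

```python
import hashlib
from typing import Optional, Sequence

# Placeholder substituted for NULL inputs (assumed from the dbt_utils 1.x macro).
NULL_PLACEHOLDER = "_dbt_utils_surrogate_key_null_"


def surrogate_key(values: Sequence[Optional[object]]) -> str:
    """Hash a list of column values into one 32-character hex key."""
    parts = [NULL_PLACEHOLDER if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()


key = surrogate_key([42])
print(len(key))                      # 32
print(key == surrogate_key([42]))    # True
```

Because the key depends only on the input values, rebuilding the table yields the same `customer_sk` for the same `customer_id`, which is what lets `fct_orders` join on it safely.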

```sql
-- models/marts/core/fct_orders.sql

with orders as (
    select * from {{ ref('int_orders_enriched') }}
),

customers as (
    select customer_id, customer_sk from {{ ref('dim_customers') }}
),

dates as (
    select * from {{ ref('dim_dates') }}
),

final as (
    select
        -- Surrogate key
        {{ dbt_utils.generate_surrogate_key(['o.order_id']) }} as order_sk,
        
        -- Natural key
        o.order_id,
        
        -- Foreign keys (surrogate)
        c.customer_sk,
        d.date_sk as order_date_sk,
        
        -- Foreign keys (natural) for flexibility
        o.customer_id,
        o.order_date,
        o.shipped_date,
        
        -- Measures
        o.subtotal,
        o.shipping_cost,
        o.tax_amount,
        o.discount_amount,
        o.order_total,
        o.total_items,
        o.distinct_products,
        o.days_to_ship,
        
        -- Attributes
        o.order_status,
        o.categories,
        
        -- Flags
        case when o.shipped_date is not null then true else false end as is_shipped,
        case when o.discount_amount > 0 then true else false end as has_discount,
        
        -- Metadata
        o.loaded_at,
        current_timestamp as dbt_updated_at
        
    from orders o
    left join customers c on o.customer_id = c.customer_id
    left join dates d on o.order_date = d.date_day
)

select * from final
```

---

## 4. Materializations

### Overview of Materialization Types

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                        dbt Materializations Comparison                        │
├──────────────┬───────────────┬───────────────┬────────────────┬──────────────┤
│    Type      │   Storage     │  Query Speed  │  Build Speed   │   Use Case   │
├──────────────┼───────────────┼───────────────┼────────────────┼──────────────┤
│   VIEW       │     None      │     Slow      │     Fast       │   Staging    │
│              │  (computed)   │  (recomputed) │  (no data)     │   Light use  │
├──────────────┼───────────────┼───────────────┼────────────────┼──────────────┤
│   TABLE      │    Full       │     Fast      │     Slow       │   Marts      │
│              │   (stored)    │  (pre-built)  │  (full scan)   │   Reports    │
├──────────────┼───────────────┼───────────────┼────────────────┼──────────────┤
│ INCREMENTAL  │   Append/     │     Fast      │     Fast       │   Large      │
│              │   Merge       │  (pre-built)  │ (only changes) │   Fact tbls  │
├──────────────┼───────────────┼───────────────┼────────────────┼──────────────┤
│  EPHEMERAL   │     None      │      N/A      │     None       │ Intermediate │
│              │    (CTE)      │  (inlined)    │  (no object)   │  Temp logic  │
└──────────────┴───────────────┴───────────────┴────────────────┴──────────────┘
```

### View Materialization

```sql
-- models/staging/stg_products.sql

{{ config(
    materialized='view',
    schema='staging'
) }}

with source as (
    select * from {{ source('ecommerce', 'products') }}
),

renamed as (
    select
        product_id,
        product_name,
        category,
        subcategory,
        cast(price_cents as decimal(18, 2)) / 100 as price,
        cast(cost_cents as decimal(18, 2)) / 100 as cost,
        is_active,
        created_at
    from source
)

select * from renamed
```

**Generated SQL (simplified):**
```sql
CREATE VIEW staging.stg_products AS (
    SELECT ... FROM raw.products
);
```

### Table Materialization

```sql
-- models/marts/core/dim_products.sql

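{# Adapter-specific options are shown together for reference; keep only the
   ones your warehouse supports. Comments must sit outside the config() call,
   because SQL comments are not valid inside a Jinja expression.
   - sort, dist: Redshift
   - cluster_by: Snowflake / BigQuery
   - partition_by: BigQuery #}
{{ config(
    materialized='table',
    schema='core',
    sort='category',
    dist='product_id',
    cluster_by=['category', 'subcategory'],
    partition_by={
        "field": "created_at",
        "data_type": "timestamp",
        "granularity": "day"
    }
) }}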

with products as (
    select * from {{ ref('stg_products') }}
),

final as (
    select
        {{ dbt_utils.generate_surrogate_key(['product_id']) }} as product_sk,
        product_id,
        product_name,
        category,
        subcategory,
        price,
        cost,
        price - cost as margin,
        (price - cost) / nullif(price, 0) as margin_percent,
        is_active,
        created_at,
        current_timestamp as dbt_updated_at
    from products
)

select * from final
```

**Generated SQL (simplified):**
```sql
CREATE TABLE core.dim_products AS (
    SELECT ... FROM staging.stg_products
);
```

### Ephemeral Materialization

Ephemeral models don't create database objects; dbt inlines them as CTEs in the compiled SQL of each model that references them:

```sql
-- models/intermediate/int_customer_orders_summary.sql

{{ config(materialized='ephemeral') }}

-- This won't create a table/view; it's inlined as a CTE
select
    customer_id,
    count(*) as order_count,
    sum(order_total) as total_spent,
    avg(order_total) as avg_order_value
from {{ ref('stg_orders') }}
group by customer_id
```

When another model references this ephemeral model:

```sql
-- Compiled output (ephemeral is inlined)
WITH __dbt__cte__int_customer_orders_summary AS (
    SELECT
        customer_id,
        count(*) as order_count,
        sum(order_total) as total_spent,
        avg(order_total) as avg_order_value
    FROM staging.stg_orders
    GROUP BY customer_id
)

SELECT
    c.*,
    cos.order_count
FROM staging.stg_customers c
LEFT JOIN __dbt__cte__int_customer_orders_summary cos 
    ON c.customer_id = cos.customer_id
```

### Incremental Materialization

```sql
-- models/marts/core/fct_page_views.sql

{# Other strategies include 'append', 'delete+insert', and 'insert_overwrite';
   availability depends on the adapter. #}
{{ config(
    materialized='incremental',
    unique_key='page_view_id',
    incremental_strategy='merge',
    on_schema_change='sync_all_columns',
    cluster_by=['event_date']
) }}

with page_views as (
    select
        page_view_id,
        user_id,
        session_id,
        page_url,
        referrer_url,
        event_timestamp,
        cast(event_timestamp as date) as event_date,
        device_type,
        browser,
        country_code
    from {{ ref('stg_page_views') }}
    
    {% if is_incremental() %}
    -- Only process records newer than the latest event already loaded
    where event_timestamp > (
        select max(event_timestamp) 
        from {{ this }}
    )
    {% endif %}
)

select * from page_views
```
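To make the incremental pattern concrete, the Python sketch below simulates a single run: filter the source to rows past the target's high-water mark, then upsert by `unique_key`, as a `merge` strategy would. It is a simplified model, not dbt's implementation. Integer timestamps stand in for real ones, and treating an empty target as a first run is an assumption made for brevity.

```python
from typing import Dict, List

Row = Dict[str, object]


def incremental_merge(existing: List[Row], source: List[Row],
                      unique_key: str, ts_col: str) -> List[Row]:
    """Simulate one incremental run against an existing target table."""
    if not existing:
        # Simplification: stands in for the first run, where is_incremental() is false.
        new_rows = source
    else:
        high_water = max(r[ts_col] for r in existing)
        new_rows = [r for r in source if r[ts_col] > high_water]

    merged = {r[unique_key]: r for r in existing}
    for r in new_rows:
        merged[r[unique_key]] = r  # update on key match, insert otherwise
    return list(merged.values())


target = [
    {"page_view_id": 1, "event_timestamp": 10},
    {"page_view_id": 2, "event_timestamp": 20},
]
source = target + [
    {"page_view_id": 3, "event_timestamp": 30},
    {"page_view_id": 4, "event_timestamp": 15},  # late-arriving row
]
result = incremental_merge(target, source, "page_view_id", "event_timestamp")
print(sorted(r["page_view_id"] for r in result))  # [1, 2, 3]
```

Row 4 is dropped because its timestamp is older than the high-water mark. If late-arriving events matter, a common mitigation is to subtract a lookback window from the `max(event_timestamp)` filter and rely on `unique_key` to deduplicate.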

---

## 5. Testing in dbt

### Schema Tests (Generic Tests)

```yaml
# models/staging/_staging.yml
version: 2

models:
  - name: stg_customers
    description: "Staged customer data"
    columns:
      - name: customer_id
        description: "Primary key"
        data_tests:
          - unique
          - not_null
      
      - name: email
        description: "Customer email"
        data_tests:
          - unique
          - not_null
          # Generic test provided by the dbt_utils package
          - dbt_utils.not_empty_string
      
      - name: customer_status
        description: "Customer status"
        data_tests:
          - accepted_values:
              values: ['active', 'inactive', 'deleted', 'unknown']
              quote: true
      
      - name: created_at
        description: "When customer was created"
        data_tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "<= current_timestamp"

  - name: stg_orders
    description: "Staged order data"
    columns:
      - name: order_id
        data_tests:
          - unique
          - not_null
      
      - name: customer_id
        data_tests:
          - not_null
          # Referential integrity test
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      
      - name: order_total
        data_tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true
```
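Each generic test compiles to a query that returns the failing rows, and the test passes when that query returns nothing. The Python sketch below restates the four built-in tests over lists of dictionaries to show what each one checks. The function names mirror the test names, the sample data is invented, and NULL handling follows the usual SQL behavior where NULLs are left to `not_null` rather than flagged by the other tests.

```python
from collections import Counter
from typing import Dict, Iterable, List

Row = Dict[str, object]


def not_null(rows: List[Row], column: str) -> List[Row]:
    """Rows where the column is missing or NULL."""
    return [r for r in rows if r.get(column) is None]


def unique(rows: List[Row], column: str) -> List[object]:
    """Non-NULL values that appear more than once."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]


def accepted_values(rows: List[Row], column: str, values: Iterable[object]) -> List[Row]:
    """Rows whose non-NULL value is outside the allowed set."""
    allowed = set(values)
    return [r for r in rows if r.get(column) is not None and r[column] not in allowed]


def relationships(child: List[Row], column: str, parent: List[Row], field: str) -> List[Row]:
    """Child rows whose non-NULL key has no match in the parent."""
    parent_keys = {r[field] for r in parent}
    return [r for r in child if r.get(column) is not None and r[column] not in parent_keys]


customers = [
    {"customer_id": 1, "customer_status": "active"},
    {"customer_id": 2, "customer_status": "vip"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 11, "customer_id": None},
]

print(unique(orders, "order_id"))  # [11]
print(relationships(orders, "customer_id", customers, "customer_id"))
# [{'order_id': 11, 'customer_id': 3}]
```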

### Singular Data Tests

Custom SQL tests in the `tests/` directory:

```sql
-- tests/assert_order_total_equals_sum_of_items.sql

-- This test fails if any rows are returned
with order_totals as (
    select
        order_id,
        order_total
    from {{ ref('fct_orders') }}
),

item_totals as (
    select
        order_id,
        sum(line_total) as calculated_total
    from {{ ref('fct_order_items') }}
    group by order_id
),

mismatches as (
    select
        ot.order_id,
        ot.order_total,
        it.calculated_total,
        abs(ot.order_total - it.calculated_total) as difference
    from order_totals ot
    inner join item_totals it on ot.order_id = it.order_id
    where abs(ot.order_total - it.calculated_total) > 0.01  -- Allow small rounding
)

select * from mismatches
```
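The reconciliation pattern in this singular test (aggregate one side, inner join, flag differences beyond a tolerance) can be sketched in Python as follows. The function name and sample figures are invented for illustration. Note one consequence of the inner join: an order with no line items never appears in the comparison, so it cannot fail the test.

```python
from decimal import Decimal
from typing import Dict, List, Tuple


def find_mismatches(
    order_totals: Dict[int, Decimal],
    line_items: List[Tuple[int, Decimal]],
    tolerance: Decimal = Decimal("0.01"),
) -> List[Tuple[int, Decimal, Decimal]]:
    """Return (order_id, recorded_total, calculated_total) for orders that disagree."""
    calculated: Dict[int, Decimal] = {}
    for order_id, line_total in line_items:
        calculated[order_id] = calculated.get(order_id, Decimal("0")) + line_total

    return [
        (order_id, total, calculated[order_id])
        for order_id, total in order_totals.items()
        if order_id in calculated  # inner join: orders without items are skipped
        and abs(total - calculated[order_id]) > tolerance
    ]


order_totals = {1: Decimal("30.00"), 2: Decimal("15.00"), 3: Decimal("9.99")}
line_items = [(1, Decimal("10.00")), (1, Decimal("20.00")), (2, Decimal("14.50"))]

print(find_mismatches(order_totals, line_items))
# [(2, Decimal('15.00'), Decimal('14.50'))]
```

An empty list corresponds to a passing test; any returned tuple corresponds to a failing row.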

```sql
-- tests/assert_no_orphan_orders.sql

-- Find orders with no associated customer
select
    o.order_id,
    o.customer_id,
    o.order_date
from {{ ref('fct_orders') }} o
left join {{ ref('dim_customers') }} c on o.customer_id = c.customer_id
where c.customer_id is null
```

### Custom Generic Tests

Create reusable test macros:

```sql
-- macros/tests/test_positive_value.sql

{% test positive_value(model, column_name) %}

select
    {{ column_name }} as invalid_value
from {{ model }}
where {{ column_name }} < 0

{% endtest %}
```

```sql
-- macros/tests/test_valid_date_range.sql

{% test valid_date_range(model, column_name, min_date=none, max_date=none) %}

{# Set defaults if not provided #}
{% set min_dt = min_date or "'1900-01-01'" %}
{% set max_dt = max_date or "current_date" %}

select
    {{ column_name }} as invalid_date
from {{ model }}
where {{ column_name }} < {{ min_dt }}
   or {{ column_name }} > {{ max_dt }}

{% endtest %}
```

**Usage in YAML:**
```yaml
columns:
  - name: order_total
    data_tests:
      - positive_value
  
  - name: order_date
    data_tests:
      - valid_date_range:
          min_date: "'2020-01-01'"
```

### dbt Expectations Package

The `dbt_expectations` package provides Great Expectations-style tests:

```yaml
# packages.yml
packages:
  - package: calogica/dbt_expectations
    version: [">=0.9.0", "<0.10.0"]
```

```yaml
# models/marts/core/_core.yml
models:
  - name: fct_orders
    columns:
      - name: order_total
        data_tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 10000
              strictly: false
    
    # Table-level tests
    data_tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000
          max_value: 10000000
      
      - dbt_expectations.expect_table_row_count_to_equal_other_table:
          compare_model: ref('stg_orders')

  - name: dim_customers
    columns:
      - name: email
        data_tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$"

---

## 6. Documentation and Lineage

### Model Documentation

```yaml
# models/marts/core/_core.yml
version: 2

models:
  - name: dim_customers
    description: |
      Customer dimension table containing all customer attributes
      and derived metrics.
      
      ## Business Logic
      - Customer segments are assigned based on order frequency
      - Lifetime value is calculated from all historical orders
      - Prospects are customers who haven't placed an order yet
      
      ## Refresh Schedule
      - This table is refreshed daily at 6:00 AM UTC
      
      ## Ownership
      - Team: Data Platform
      - Contact: analytics@company.com
    
    meta:
      owner: "analytics-team"
      contains_pii: true
      sla: "daily"
    
    columns:
      - name: customer_sk
        description: "Surrogate key (MD5 hash of customer_id)"
      
      - name: customer_id
        description: "Natural key from source system"
        meta:
          is_primary_key: true
      
      - name: customer_segment
        description: |
          Customer segment based on order frequency:
          - **prospect**: No orders placed
          - **new**: 1 order
          - **returning**: 2-5 orders
          - **loyal**: 6+ orders
      
      - name: lifetime_value
        description: "Total revenue from all customer orders"
        meta:
          unit: "USD"
```

### Doc Blocks

Reusable documentation across models:

```markdown
{# models/_docs.md (must live under a path dbt scans, such as models/) #}

{% docs customer_id %}

The unique identifier for a customer in our system.

This is the primary key from the source CRM system and is used
throughout the data warehouse to identify customers.

**Format**: UUID v4
**Example**: `550e8400-e29b-41d4-a716-446655440000`

{% enddocs %}


{% docs order_status %}

The current status of an order in the fulfillment pipeline.

| Status | Description |
|--------|-------------|
| PENDING | Order received, awaiting payment |
| CONFIRMED | Payment received, preparing for shipment |
| SHIPPED | Order has been shipped |
| DELIVERED | Order delivered to customer |
| CANCELLED | Order was cancelled |
| REFUNDED | Order was refunded |

{% enddocs %}


{% docs __overview__ %}

# Analytics Data Warehouse

Welcome to the analytics data warehouse documentation!

## Navigation

Use the sidebar to explore:
- **Sources**: Raw data from upstream systems
- **Staging**: Cleaned, standardized data
- **Marts**: Business-ready dimensional models

## Getting Help

Contact the Analytics Team at analytics@company.com

{% enddocs %}
```

**Reference in YAML:**
```yaml
columns:
  - name: customer_id
    description: '{{ doc("customer_id") }}'
```

### Generating Documentation

```bash
# Generate documentation
dbt docs generate

# Serve documentation locally
dbt docs serve --port 8080

# Generate for production (with target)
dbt docs generate --target prod
```

### DAG Lineage

dbt automatically tracks lineage through `ref()` and `source()` functions:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            dbt DAG Example                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   source.ecommerce.customers ──▶ stg_customers ──┐                         │
│                                                   │                         │
│   source.ecommerce.orders ────▶ stg_orders ──────┼──▶ int_orders ──┐       │
│                                                   │     _enriched   │       │
│   source.ecommerce.products ──▶ stg_products ────┘                  │       │
│                                                                      │       │
│   source.ecommerce.order_items ─▶ stg_order_items ─────────────────┘       │
│                                                                      │       │
│                                           ┌──────────────────────────┘       │
│                                           │                                  │
│                                           ▼                                  │
│                               ┌─────────────────────┐                       │
│                               │    dim_customers    │                       │
│                               │    fct_orders       │                       │
│                               │    dim_products     │                       │
│                               └─────────────────────┘                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
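
The idea behind lineage tracking can be sketched in a few lines of Python: extract every `ref()` call from each model's SQL, treat the result as a dependency graph, and sort it topologically to get a valid run order. The model names and SQL strings below are hypothetical, and the regex is a simplification; dbt itself renders the Jinja rather than pattern-matching it.

```python
import re
from graphlib import TopologicalSorter

# Captures the model name inside {{ ref('...') }}.
REF = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")

# Hypothetical model SQL, keyed by model name.
models = {
    "stg_customers": "select * from raw.customers",
    "stg_orders": "select * from raw.orders",
    "int_orders_enriched": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c using (customer_id)"
    ),
    "fct_orders": "select * from {{ ref('int_orders_enriched') }}",
}


def build_graph(model_sql: dict[str, str]) -> dict[str, set[str]]:
    """Map each model to the set of models it references via ref()."""
    return {name: set(REF.findall(sql)) for name, sql in model_sql.items()}


graph = build_graph(models)
# static_order() yields every node after all of its dependencies.
run_order = list(TopologicalSorter(graph).static_order())
print(run_order)
```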

---

## 7. Jinja Templating and Macros

### Jinja Basics in dbt

dbt uses Jinja templating to make SQL dynamic:

```sql
-- Basic Jinja syntax in dbt

{# This is a comment #}

{# Variables #}
{% set my_list = ['a', 'b', 'c'] %}
{% set my_dict = {'key': 'value'} %}

{# Expressions (output values) #}
{{ ref('my_model') }}
{{ var('start_date') }}

{# Control structures #}
{% if condition %}
    SELECT ...
{% elif other_condition %}
    SELECT ...
{% else %}
    SELECT ...
{% endif %}

{# Loops #}
{% for item in my_list %}
    {{ item }}
{% endfor %}
```

### Built-in dbt Functions

```sql
-- models/example_jinja.sql

{% set payment_methods = ['credit_card', 'bank_transfer', 'paypal', 'gift_card'] %}

with payments as (
    select * from {{ ref('stg_payments') }}
),

pivoted as (
    select
        order_id,
        
        {% for payment_method in payment_methods %}
        sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) 
            as {{ payment_method }}_amount
        {% if not loop.last %},{% endif %}
        {% endfor %}
        
    from payments
    group by order_id
)

select * from pivoted
```

**Compiled SQL:**
```sql
with payments as (
    select * from analytics.staging.stg_payments
),

pivoted as (
    select
        order_id,
        sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
        sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount,
        sum(case when payment_method = 'paypal' then amount else 0 end) as paypal_amount,
        sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount
    from payments
    group by order_id
)

select * from pivoted
```
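
For readers more comfortable with Python than Jinja, the loop above is roughly equivalent to the following sketch. The `pivot_columns` helper is hypothetical and exists only to show what the template expands to.

```python
payment_methods = ["credit_card", "bank_transfer", "paypal", "gift_card"]


def pivot_columns(values: list[str], column: str = "payment_method") -> str:
    """Build one sum(case ...) expression per value, comma-separated."""
    expressions = [
        f"sum(case when {column} = '{value}' then amount else 0 end) as {value}_amount"
        for value in values
    ]
    # join() places commas only between items, like the loop.last check.
    return ",\n".join(expressions)


print(pivot_columns(payment_methods))
```

Adding a fifth payment method means changing one list rather than writing another `sum(case ...)` expression by hand.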

### Creating Custom Macros

```sql
-- macros/cents_to_dollars.sql

{% macro cents_to_dollars(column_name, precision=2) %}
    {# Divide first, then cast, so the result keeps exactly `precision` decimal places #}
    cast({{ column_name }} / 100.0 as decimal(18, {{ precision }}))
{% endmacro %}
```

**Usage:**
```sql
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from orders
```

```sql
-- macros/generate_date_spine.sql

{% macro generate_date_spine(start_date, end_date) %}

with date_spine as (
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('" ~ start_date ~ "' as date)",
        end_date="cast('" ~ end_date ~ "' as date)"
    ) }}
),

final as (
    select
        cast(date_day as date) as date_day,
        extract(year from date_day) as year,
        extract(month from date_day) as month,
        extract(day from date_day) as day,
        extract(dow from date_day) as day_of_week,  -- Postgres-style: 0 = Sunday, 6 = Saturday
        case 
            when extract(dow from date_day) in (0, 6) then true 
            else false 
        end as is_weekend
    from date_spine
)

select * from final

{% endmacro %}
```

```sql
-- macros/safe_divide.sql

{% macro safe_divide(numerator, denominator, default_value=0) %}
    case 
        when {{ denominator }} = 0 or {{ denominator }} is null then {{ default_value }}
        else {{ numerator }} / {{ denominator }}
    end
{% endmacro %}
```

```sql
-- macros/star_except.sql

{% macro star_except(relation, except_columns) %}
    {%- set except_columns = except_columns | map('lower') | list -%}
    {%- set kept_columns = [] -%}

    {#- Filter first, then join, so an excluded final column cannot leave a trailing comma -#}
    {%- for column in adapter.get_columns_in_relation(relation) -%}
        {%- if column.name | lower not in except_columns -%}
            {%- do kept_columns.append(column.name) -%}
        {%- endif -%}
    {%- endfor -%}

    {{ kept_columns | join(', ') }}
{% endmacro %}
```

### Cross-Database Macros

```sql
-- macros/cross_db/datediff.sql

{% macro datediff(datepart, start_date, end_date) %}
    {{ adapter.dispatch('datediff', 'my_project')(datepart, start_date, end_date) }}
{% endmacro %}

{% macro default__datediff(datepart, start_date, end_date) %}
    datediff({{ datepart }}, {{ start_date }}, {{ end_date }})
{% endmacro %}

{% macro bigquery__datediff(datepart, start_date, end_date) %}
    date_diff({{ end_date }}, {{ start_date }}, {{ datepart }})
{% endmacro %}

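{% macro postgres__datediff(datepart, start_date, end_date) %}
    {% if datepart == 'day' %}
        ({{ end_date }}::date - {{ start_date }}::date)
    {% elif datepart == 'hour' %}
        extract(epoch from ({{ end_date }}::timestamp - {{ start_date }}::timestamp)) / 3600
    {% else %}
        {# Fail at compile time rather than emit empty SQL for unhandled dateparts #}
        {{ exceptions.raise_compiler_error("postgres__datediff: unsupported datepart '" ~ datepart ~ "'") }}
    {% endif %}
{% endmacro %}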
```
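
The lookup that `adapter.dispatch` performs can be approximated as a prefix search: try `<adapter>__<macro>` first, then fall back to `default__<macro>`. The sketch below is a simplified Python model of that behavior. The `MACROS` registry and `dispatch` function are hypothetical, and the real search order also accounts for adapter inheritance and configured search packages, which are ignored here.

```python
from typing import Callable

# Hypothetical registry standing in for the macros dbt finds in a project.
MACROS: dict[str, Callable[[str, str, str], str]] = {
    "default__datediff": lambda part, start, end: f"datediff({part}, {start}, {end})",
    "bigquery__datediff": lambda part, start, end: f"date_diff({end}, {start}, {part})",
}


def dispatch(macro_name: str, adapter_type: str) -> Callable[[str, str, str], str]:
    """Prefer '<adapter>__<name>', then fall back to 'default__<name>'."""
    for prefix in (adapter_type, "default"):
        candidate = MACROS.get(f"{prefix}__{macro_name}")
        if candidate is not None:
            return candidate
    raise LookupError(f"No implementation of {macro_name!r} for {adapter_type!r}")


# BigQuery has its own implementation; Snowflake falls back to the default.
print(dispatch("datediff", "bigquery")("day", "start_at", "end_at"))
print(dispatch("datediff", "snowflake")("day", "start_at", "end_at"))
```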

---

## 8. Incremental Models Deep Dive

### Incremental Strategies

```
┌───────────────────────────────────────────────────────────────────────────────┐
│                      Incremental Strategy Comparison                           │
├─────────────────┬──────────────────────────────────────────────────────────────┤
│   Strategy      │   Description                                               │
├─────────────────┼──────────────────────────────────────────────────────────────┤
│   append        │   INSERT new rows only (no updates)                         │
│                 │   Fastest, but can't handle updates                         │
├─────────────────┼──────────────────────────────────────────────────────────────┤
│   delete+insert │   DELETE matching rows, then INSERT                         │
│                 │   Good for most cases, widely supported                     │
├─────────────────┼──────────────────────────────────────────────────────────────┤
│   merge         │   MERGE statement (upsert)                                  │
│                 │   Best for updates, requires unique_key                     │
├─────────────────┼──────────────────────────────────────────────────────────────┤
│insert_overwrite │   Replace entire partitions                                 │
│                 │   Best for partitioned tables (BigQuery, Spark)             │
└─────────────────┴──────────────────────────────────────────────────────────────┘
```
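
The differences between the strategies are easier to see on a tiny in-memory table. The following Python sketch models each strategy as a function over lists of dicts. It is a conceptual illustration with made-up rows, not a description of the SQL any particular adapter generates.

```python
Row = dict  # each row is a plain dict, e.g. {"id": 1, "day": "2024-01-01", "v": 10}


def append(target: list[Row], new: list[Row]) -> list[Row]:
    """INSERT only: existing rows are never touched, so duplicates can appear."""
    return target + new


def delete_insert(target: list[Row], new: list[Row], key: str) -> list[Row]:
    """DELETE target rows whose key appears in the batch, then INSERT the batch."""
    incoming = {row[key] for row in new}
    return [row for row in target if row[key] not in incoming] + new


def merge(target: list[Row], new: list[Row], key: str) -> list[Row]:
    """Upsert: update matched rows, insert unmatched ones."""
    by_key = {row[key]: dict(row) for row in target}
    for row in new:
        by_key.setdefault(row[key], {}).update(row)
    return list(by_key.values())


def insert_overwrite(target: list[Row], new: list[Row], partition: str) -> list[Row]:
    """Replace every partition that appears in the batch, keep the rest."""
    replaced = {row[partition] for row in new}
    return [row for row in target if row[partition] not in replaced] + new


target = [
    {"id": 1, "day": "2024-01-01", "v": 10},
    {"id": 2, "day": "2024-01-02", "v": 20},
]
batch = [
    {"id": 2, "day": "2024-01-02", "v": 25},  # update to an existing row
    {"id": 3, "day": "2024-01-02", "v": 30},  # brand-new row
]

for name, result in [
    ("append", append(target, batch)),
    ("delete+insert", delete_insert(target, batch, "id")),
    ("merge", merge(target, batch, "id")),
    ("insert_overwrite", insert_overwrite(target, batch, "day")),
]:
    print(name, result)
```

Only `append` leaves two rows for `id = 2`; the other three converge on the same final state here, and differ mainly in how much of the target they rewrite.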

### Append Strategy

```sql
-- models/marts/fct_events.sql

{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

-- Best for immutable event data
select
    event_id,
    user_id,
    event_type,
    event_timestamp,
    event_properties
from {{ ref('stg_events') }}

{% if is_incremental() %}
where event_timestamp > (
    select max(event_timestamp) from {{ this }}
)
{% endif %}
```

### Merge Strategy

```sql
-- models/marts/fct_subscriptions.sql

{{ config(
    materialized='incremental',
    unique_key='subscription_id',
    incremental_strategy='merge',
    merge_update_columns=['status', 'end_date', 'updated_at']
) }}

with subscriptions as (
    select
        subscription_id,
        customer_id,
        plan_id,
        status,
        start_date,
        end_date,
        monthly_amount,
        updated_at
    from {{ ref('stg_subscriptions') }}
    
    {% if is_incremental() %}
    -- Only process records updated since last run
    where updated_at > (
        select max(updated_at) from {{ this }}
    )
    {% endif %}
)

select * from subscriptions
```

**Generated SQL (Snowflake, simplified):**
```sql
MERGE INTO analytics.core.fct_subscriptions AS target
USING (
    SELECT * FROM temp_subscriptions
) AS source
ON target.subscription_id = source.subscription_id
WHEN MATCHED THEN UPDATE SET
    status = source.status,
    end_date = source.end_date,
    updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (
    subscription_id, customer_id, plan_id, status, 
    start_date, end_date, monthly_amount, updated_at
) VALUES (
    source.subscription_id, source.customer_id, source.plan_id, source.status,
    source.start_date, source.end_date, source.monthly_amount, source.updated_at
);
```

### Insert Overwrite Strategy (Partitioned)

```sql
-- models/marts/fct_daily_metrics.sql (BigQuery example)

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'metric_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    cluster_by=['metric_type']
) }}

{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 1 day)',
    'date_sub(current_date, interval 2 day)'
] %}

with daily_metrics as (
    select
        cast(event_timestamp as date) as metric_date,
        metric_type,
        count(*) as event_count,
        sum(metric_value) as total_value,
        avg(metric_value) as avg_value
    from {{ ref('stg_metrics') }}
    
    {% if is_incremental() %}
    where cast(event_timestamp as date) in (
        {{ partitions_to_replace | join(', ') }}
    )
    {% endif %}
    
    group by 1, 2
)

select * from daily_metrics
```

### Handling Late-Arriving Data

```sql
-- models/marts/fct_transactions.sql

{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

{% set lookback_window = 3 %}  -- Days to look back for late data

with transactions as (
    select
        transaction_id,
        customer_id,
        transaction_date,
        amount,
        status,
        _etl_loaded_at
    from {{ ref('stg_transactions') }}
    
    {% if is_incremental() %}
    -- Look back to catch late-arriving data
    where transaction_date >= (
        select dateadd('day', -{{ lookback_window }}, max(transaction_date))
        from {{ this }}
    )
    {% endif %}
)

select * from transactions
```
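
To see why the lookback window matters, consider this small Python simulation. The rows, dates, and helper names are hypothetical. A strict high-water-mark filter misses a transaction that is dated before the latest loaded date but arrives afterward, while a three-day window picks it up and the merge on `transaction_id` keeps the reprocessed rows from duplicating.

```python
from datetime import date, timedelta


def incremental_filter(source: list[dict], target: list[dict], lookback_days: int) -> list[dict]:
    """Rows a lookback-window incremental run would reprocess."""
    if not target:  # first run: no high-water mark yet, so take everything
        return list(source)
    cutoff = max(r["transaction_date"] for r in target) - timedelta(days=lookback_days)
    return [r for r in source if r["transaction_date"] >= cutoff]


def merge_by_key(target: list[dict], batch: list[dict], key: str = "transaction_id") -> list[dict]:
    """Upsert the batch into the target on the unique key."""
    merged = {r[key]: r for r in target}
    merged.update({r[key]: r for r in batch})
    return list(merged.values())


target = [
    {"transaction_id": "t1", "transaction_date": date(2024, 1, 8), "amount": 10},
    {"transaction_id": "t2", "transaction_date": date(2024, 1, 10), "amount": 20},
]
source = target + [
    # Dated Jan 9 but loaded after the Jan 10 run: a late arrival.
    {"transaction_id": "t3", "transaction_date": date(2024, 1, 9), "amount": 30},
    {"transaction_id": "t4", "transaction_date": date(2024, 1, 11), "amount": 40},
]

strict = incremental_filter(source, target, lookback_days=0)
windowed = incremental_filter(source, target, lookback_days=3)

print([r["transaction_id"] for r in strict])    # t3 is missing
print([r["transaction_id"] for r in windowed])  # t3 is included
print(len(merge_by_key(target, windowed)))      # still one row per transaction_id
```

The trade-off is cost: a wider window reprocesses more rows on every run, so the window should match how late data realistically arrives.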

### Full Refresh

```bash
# Force full refresh of incremental model
dbt run --select fct_transactions --full-refresh

# Full refresh entire project
dbt run --full-refresh
```

```sql
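-- Reprocess all rows on demand via a project variable.
-- Note: is_incremental() already returns false when --full-refresh is passed;
-- the `full_refresh` var below is a separate, user-defined switch, e.g.
--   dbt run --select my_model --vars '{full_refresh: true}'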
{{ config(
    materialized='incremental',
    unique_key='id'
) }}

select * from {{ ref('source_table') }}

{% if is_incremental() and not var('full_refresh', false) %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```
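
It helps to remember when `is_incremental()` evaluates to true: the target table must already exist, the model must be materialized as `incremental`, and the `--full-refresh` flag must not be set. A minimal Python restatement of those three conditions, with parameter names chosen for this sketch, looks like this:

```python
def is_incremental(relation_exists: bool, materialized: str, full_refresh: bool) -> bool:
    """Mirror the documented conditions under which is_incremental() is true."""
    return relation_exists and materialized == "incremental" and not full_refresh


# First run: the table does not exist yet, so the filter is skipped.
print(is_incremental(relation_exists=False, materialized="incremental", full_refresh=False))
# Normal scheduled run: only new rows are processed.
print(is_incremental(relation_exists=True, materialized="incremental", full_refresh=False))
# dbt run --full-refresh: the filter is skipped and the table is rebuilt.
print(is_incremental(relation_exists=True, materialized="incremental", full_refresh=True))
```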

---

## 9. dbt Best Practices

### Naming Conventions

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        dbt Naming Conventions                                │
├──────────────────┬──────────────────────────────────────────────────────────┤
│  Layer           │   Convention                                             │
├──────────────────┼──────────────────────────────────────────────────────────┤
│  Sources         │   source('source_name', 'table_name')                    │
├──────────────────┼──────────────────────────────────────────────────────────┤
│  Staging         │   stg_<source>__<entity>                                 │
│                  │   Example: stg_stripe__payments                          │
├──────────────────┼──────────────────────────────────────────────────────────┤
│  Intermediate    │   int_<entity>_<verb>                                    │
│                  │   Example: int_orders_pivoted                            │
├──────────────────┼──────────────────────────────────────────────────────────┤
│  Facts           │   fct_<verb/entity>                                      │
│                  │   Example: fct_orders, fct_page_views                    │
├──────────────────┼──────────────────────────────────────────────────────────┤
│  Dimensions      │   dim_<entity>                                           │
│                  │   Example: dim_customers, dim_products                   │
├──────────────────┼──────────────────────────────────────────────────────────┤
│  Metrics         │   [deprecated in favor of semantic layer]                │
├──────────────────┼──────────────────────────────────────────────────────────┤
│  Columns         │   snake_case, descriptive                                │
│                  │   Keys: <entity>_id (e.g., customer_id)                  │
│                  │   Dates: <event>_at or <event>_date                      │
│                  │   Booleans: is_<condition> or has_<thing>                │
└──────────────────┴──────────────────────────────────────────────────────────┘
```
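
Conventions like these are easier to keep when they are checked automatically, for example in CI. The following Python sketch encodes the model-name rows of the table as regular expressions. The layer labels and the exact patterns are assumptions made for illustration, and a real project would adapt them to its own rules.

```python
import re

WORDS = r"[a-z0-9]+(?:_[a-z0-9]+)*"  # snake_case words joined by single underscores

# Patterns derived from the table above; the layer keys are this sketch's own labels.
PATTERNS = {
    "staging": re.compile(rf"^stg_{WORDS}__{WORDS}$"),        # stg_<source>__<entity>
    "intermediate": re.compile(r"^int_[a-z0-9]+(?:_[a-z0-9]+)+$"),  # int_<entity>_<verb>
    "fact": re.compile(rf"^fct_{WORDS}$"),
    "dimension": re.compile(rf"^dim_{WORDS}$"),
}


def check_name(layer: str, model_name: str) -> bool:
    """Return True when the model name follows the convention for its layer."""
    return bool(PATTERNS[layer].match(model_name))


for layer, name in [
    ("staging", "stg_stripe__payments"),
    ("staging", "stg_payments"),          # missing the <source>__ part
    ("intermediate", "int_orders_pivoted"),
    ("dimension", "DimCustomers"),        # not snake_case
]:
    print(layer, name, check_name(layer, name))
```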

### Folder Structure Best Practices

```
# Recommended folder structure
models/
├── staging/
│   ├── stripe/
│   │   ├── _stripe__sources.yml      # Source definitions
│   │   ├── _stripe__models.yml       # Model tests & docs
│   │   ├── stg_stripe__customers.sql
│   │   └── stg_stripe__payments.sql
│   ├── salesforce/
│   │   ├── _salesforce__sources.yml
│   │   ├── _salesforce__models.yml
│   │   └── stg_salesforce__accounts.sql
│   └── README.md
│
├── intermediate/
│   ├── finance/
│   │   ├── _int_finance__models.yml
│   │   └── int_payments_pivoted.sql
│   └── README.md
│
└── marts/
    ├── core/
    │   ├── _core__models.yml
    │   ├── dim_customers.sql
    │   └── fct_orders.sql
    ├── finance/
    │   ├── _finance__models.yml
    │   └── fct_monthly_revenue.sql
    └── marketing/
        ├── _marketing__models.yml
        └── mkt_customer_attribution.sql
```

### SQL Style Guide

```sql
-- ✅ Good: Clean, readable SQL

with source as (
    select * from {{ source('stripe', 'payments') }}
),

renamed as (
    select
        -- Primary key
        id as payment_id,
        
        -- Foreign keys
        order_id,
        customer_id,
        
        -- Dimensions
        payment_method,
        status as payment_status,
        
        -- Measures
        {{ cents_to_dollars('amount') }} as amount,
        
        -- Timestamps
        created_at as payment_created_at,
        updated_at as payment_updated_at
        
    from source
    where not _fivetran_deleted  -- Exclude soft deletes
)

select * from renamed


-- ❌ Bad: Hard to read, inconsistent

SELECT id as payment_id, order_id, customer_id,payment_method,
status, cast(amount as decimal(18,2))/100 as amount, created_at,updated_at
FROM {{ source('stripe', 'payments') }} WHERE _fivetran_deleted = false
```

### Testing Best Practices

```yaml
# Test coverage recommendations
version: 2

models:
  - name: dim_customers
    description: "Customer dimension"
    
    columns:
      # REQUIRED: Primary key tests on all models
      - name: customer_id
        data_tests:
          - unique
          - not_null

      # RECOMMENDED: Accepted values for categorical columns
      - name: customer_segment
        data_tests:
          - accepted_values:
              values: ['prospect', 'new', 'returning', 'loyal']

  - name: fct_orders
    columns:
      - name: order_id
        data_tests:
          - unique
          - not_null
      
      # RECOMMENDED: Foreign key relationships
      - name: customer_id
        data_tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      
      - name: order_total
        data_tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
```
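
A rule such as "every model has a tested primary key" can be enforced with a small script. The sketch below assumes the YAML has already been loaded into a Python dict of the same shape. The `models_missing_pk_tests` helper and the sample `schema` are hypothetical, and the sample deliberately leaves `fct_orders` without a primary key test so the check has something to report.

```python
def models_missing_pk_tests(schema: dict) -> list[str]:
    """Return models with no column carrying both `unique` and `not_null`."""
    missing = []
    for model in schema.get("models", []):
        has_pk = False
        for column in model.get("columns", []):
            # Tests are either plain strings or single-key dicts with arguments.
            names = {
                t if isinstance(t, str) else next(iter(t))
                for t in column.get("data_tests", [])
            }
            if {"unique", "not_null"} <= names:
                has_pk = True
        if not has_pk:
            missing.append(model["name"])
    return missing


schema = {
    "models": [
        {
            "name": "dim_customers",
            "columns": [{"name": "customer_id", "data_tests": ["unique", "not_null"]}],
        },
        {
            "name": "fct_orders",
            "columns": [
                {
                    "name": "order_total",
                    "data_tests": ["not_null", {"dbt_utils.accepted_range": {"min_value": 0}}],
                }
            ],
        },
    ]
}

print(models_missing_pk_tests(schema))
```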

### Performance Optimization

```sql
-- Use CTEs wisely - avoid over-nesting

-- ✅ Good: Logical CTEs
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

customer_orders as (
    select
        customer_id,
        count(*) as order_count
    from orders
    group by customer_id
)

select
    c.*,
    coalesce(co.order_count, 0) as order_count
from customers c
left join customer_orders co using (customer_id)


-- ❌ Bad: Unnecessary CTEs
with step1 as (
    select * from {{ ref('stg_customers') }}
),

step2 as (
    select * from step1 where status = 'active'
),

step3 as (
    select * from step2
)

select * from step3
```

### dbt Commands Cheat Sheet

```bash
# Build commands
dbt run                           # Run all models
dbt run --select my_model         # Run specific model
dbt run --select my_model+        # Run model and everything downstream
dbt run --select +my_model        # Run model and everything upstream
dbt run --select +my_model+       # Run model plus upstream and downstream
dbt run --select "staging.*"      # Run all staging models (quoted so the shell doesn't expand *)
dbt run --select tag:daily        # Run models with tag
dbt run --exclude my_model        # Run all except specific model
dbt run --full-refresh            # Rebuild incremental models

# Testing
dbt test                          # Run all tests
dbt test --select my_model        # Test specific model
dbt test --select "source:*"      # Test all sources

# Documentation
dbt docs generate                 # Generate docs
dbt docs serve                    # Serve docs locally

# Other commands
dbt compile                       # Compile SQL without running
dbt debug                         # Test connection
dbt deps                          # Install packages
dbt seed                          # Load seed files
dbt snapshot                      # Run snapshots
dbt source freshness              # Check source freshness
dbt ls                            # List resources
dbt clean                         # Clean target directory
```

---

## 10. dbt Cloud vs dbt Core

### Feature Comparison

```
┌───────────────────────────────────────────────────────────────────────────────┐
│                       dbt Core vs dbt Cloud Comparison                         │
├─────────────────────────┬───────────────────────┬─────────────────────────────┤
│   Feature               │     dbt Core          │      dbt Cloud              │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Price                 │   Free (Open Source)  │   Free tier + Paid plans   │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Installation          │   CLI install         │   Web-based (SaaS)          │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   IDE                   │   External (VS Code)  │   Browser-based IDE         │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Scheduling            │   External (Airflow)  │   Built-in scheduler        │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Git Integration       │   Manual              │   Native (GitHub, GitLab)   │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Environment Mgmt      │   Manual              │   Built-in environments     │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Documentation Host    │   Self-hosted         │   Built-in hosting          │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Metadata/Discovery    │   Artifacts only      │   dbt Explorer              │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Semantic Layer        │   Limited             │   Full (dbt Semantic Layer) │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   CI/CD                 │   External setup      │   Slim CI built-in          │
├─────────────────────────┼───────────────────────┼─────────────────────────────┤
│   Alerting              │   External            │   Built-in notifications    │
└─────────────────────────┴───────────────────────┴─────────────────────────────┘
```

### dbt Core Setup

```bash
# Installation
pip install dbt-core
pip install dbt-snowflake  # or dbt-bigquery, dbt-redshift, etc.

# Initialize project
dbt init my_project

# Configure profiles.yml (in ~/.dbt/)
# Run models
dbt run

# Orchestration with Airflow
# (requires external scheduler)
```

```python
# Example: Airflow DAG for dbt Core
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'owner': 'analytics',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
}

with DAG(
    'dbt_daily_run',
    default_args=default_args,
    schedule='0 6 * * *',  # 6 AM daily (on Airflow < 2.4, use schedule_interval instead)
    catchup=False
) as dag:
    
    dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='cd /path/to/dbt/project && dbt run --target prod',
    )
    
    dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='cd /path/to/dbt/project && dbt test --target prod',
    )
    
    dbt_run >> dbt_test
```

### dbt Cloud Setup

dbt Cloud provides a managed experience:

```yaml
# dbt Cloud project configuration (illustrative only)
# Jobs and environments are managed in the dbt Cloud UI or via its API,
# not read from a YAML file like this one

# Environment configuration:
# - Development: Individual developer schemas
# - CI: Temporary schemas for PR testing
# - Production: Main data warehouse schemas

# Job settings sketch (field names are illustrative):
jobs:
  - name: "Daily Production Run"
    environment: production
    schedule:
      cron: "0 6 * * *"
    commands:
      - dbt run
      - dbt test
    notifications:
      slack: "#data-alerts"
      email: "data-team@company.com"
```

### When to Choose Each

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      Decision Framework                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Choose dbt Core when:                                                     │
│   ├── Budget is limited                                                     │
│   ├── You have strong DevOps/platform engineering support                  │
│   ├── You need maximum customization                                        │
│   ├── You're already using Airflow/Dagster/Prefect                         │
│   └── Data stays within strict compliance boundaries                        │
│                                                                             │
│   Choose dbt Cloud when:                                                    │
│   ├── You want faster time-to-value                                         │
│   ├── You have a smaller team without dedicated platform engineers         │
│   ├── You need built-in CI/CD, scheduling, and monitoring                  │
│   ├── You want the Semantic Layer for metrics consistency                  │
│   └── You value the managed IDE and documentation hosting                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Summary

### Key Takeaways

1. **dbt handles the T in ELT** - pushing transformations into the data warehouse where compute is abundant

2. **The three-layer model architecture** (staging → intermediate → marts) provides clear separation of concerns

3. **Materializations** let you optimize for different use cases:
   - Views for light queries
   - Tables for heavy consumption
   - Incremental for large, append-heavy data
   - Ephemeral for DRY intermediate logic

4. **Testing is first-class** with built-in generic tests, custom tests, and packages like dbt_expectations

5. **Documentation and lineage** are automatically generated from your project

6. **Jinja templating** enables powerful, DRY SQL with macros and control structures

7. **Follow naming conventions and folder structure** best practices for maintainable projects

8. **Choose dbt Core or Cloud** based on your team size, budget, and infrastructure capabilities

### Additional Resources

- [dbt Documentation](https://docs.getdbt.com/)
- [dbt Learn (Free Courses)](https://courses.getdbt.com/)
- [dbt Hub (Packages)](https://hub.getdbt.com/)
- [dbt Slack Community](https://community.getdbt.com/)
- [dbt Best Practices Guide](https://docs.getdbt.com/guides/best-practices)
- [GitLab's dbt Guide](https://about.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/)
- [dbt Labs Blog (formerly Fishtown Analytics)](https://www.getdbt.com/blog/)

---

## Appendix: Quick Reference

### Common Packages

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  
  - package: calogica/dbt_expectations
    version: [">=0.9.0", "<0.10.0"]
  
  - package: dbt-labs/codegen
    version: [">=0.9.0", "<0.10.0"]
  
  - package: dbt-labs/audit_helper
    version: [">=0.6.0", "<0.7.0"]
```

### Useful Macros from dbt_utils

```sql
-- Surrogate key generation
{{ dbt_utils.generate_surrogate_key(['field1', 'field2']) }}

-- Pivot
{{ dbt_utils.pivot(
    'payment_method',
    dbt_utils.get_column_values(ref('stg_payments'), 'payment_method'),
    agg='sum',
    then_value='amount'
) }}

-- Unpivot
{{ dbt_utils.unpivot(
    relation=ref('my_model'),
    cast_to='float',
    exclude=['id', 'created_at'],
    field_name='metric_name',
    value_name='metric_value'
) }}

-- Date spine
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2020-01-01' as date)",
    end_date="current_date"
) }}

-- Get column values
{% set payment_methods = dbt_utils.get_column_values(
    table=ref('stg_payments'),
    column='payment_method'
) %}

-- Star (select all columns)
{{ dbt_utils.star(from=ref('my_model'), except=['column_to_exclude']) }}
```

### Snapshots (SCD Type 2)

```sql
-- snapshots/snapshot_customers.sql

{% snapshot snapshot_customers %}

{{
    config(
        target_database='analytics',
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

select * from {{ source('ecommerce', 'customers') }}

{% endsnapshot %}
```

**Result columns added:**
- `dbt_scd_id` - Unique key for snapshot row
- `dbt_updated_at` - The source record's `updated_at` value when this snapshot row was inserted
- `dbt_valid_from` - When this version became valid
- `dbt_valid_to` - When this version was superseded (NULL if current)
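
The way `dbt_valid_from` and `dbt_valid_to` evolve under the `timestamp` strategy can be sketched in Python. This is a simplified model with hypothetical data; it ignores `dbt_scd_id`, hard deletes, and everything else dbt handles in SQL.

```python
from datetime import datetime


def apply_snapshot(history: list[dict], source_rows: list[dict]) -> list[dict]:
    """Timestamp-strategy sketch: close the current version when updated_at advances."""
    current = {r["customer_id"]: r for r in history if r["dbt_valid_to"] is None}
    for row in source_rows:
        live = current.get(row["customer_id"])
        if live is not None and row["updated_at"] <= live["dbt_valid_from"]:
            continue  # nothing newer than the version already stored
        if live is not None:
            live["dbt_valid_to"] = row["updated_at"]  # supersede the old version
        history.append({**row, "dbt_valid_from": row["updated_at"], "dbt_valid_to": None})
    return history


history: list[dict] = []
# Run 1: customer appears on the free tier.
apply_snapshot(history, [{"customer_id": 1, "tier": "free", "updated_at": datetime(2024, 1, 1)}])
# Run 2: customer upgraded, so the first version is closed and a new one opens.
apply_snapshot(history, [{"customer_id": 1, "tier": "pro", "updated_at": datetime(2024, 2, 1)}])
# Run 3: no change in the source, so no new row is written.
apply_snapshot(history, [{"customer_id": 1, "tier": "pro", "updated_at": datetime(2024, 2, 1)}])

for version in history:
    print(version["tier"], version["dbt_valid_from"], version["dbt_valid_to"])
```

Querying the current state of each customer then reduces to filtering on `dbt_valid_to is null`.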

### Environment Variables

```sql
-- Environment-specific logic based on the active target (target.name comes from profiles.yml)
{% if target.name == 'prod' %}
    -- Production logic
    select * from {{ ref('full_dataset') }}
{% else %}
    -- Development: sample data
    select * from {{ ref('full_dataset') }}
    where created_at >= dateadd('month', -3, current_date)
{% endif %}

-- Read OS environment variables (without a default, an unset variable raises a compilation error)
{{ env_var('MY_SECRET_KEY') }}
{{ env_var('OPTIONAL_VAR', 'default_value') }}
```

### Model Selection Syntax

```bash
# Selection syntax examples

# By name
dbt run --select my_model
dbt run --select my_model another_model

# By path
dbt run --select models/staging/stripe/*
dbt run --select path:models/marts/core

# By tag
dbt run --select tag:daily
dbt run --select tag:marketing tag:finance

# By config
dbt run --select config.materialized:incremental

# By package
dbt run --select package:dbt_utils

# Graph operators
dbt run --select my_model+       # Downstream
dbt run --select +my_model       # Upstream
dbt run --select +my_model+      # Both
dbt run --select 1+my_model+2    # 1 level upstream, 2 levels downstream
dbt run --select @my_model       # Model, its descendants, and all ancestors of those descendants

# Intersections and unions
dbt run --select tag:daily,config.materialized:table  # AND
dbt run --select staging+ marketing+                   # OR (space)

# Exclude
dbt run --select staging+ --exclude stg_deprecated

# State-based (slim CI); requires --state <path> pointing at a previous run's artifacts
dbt run --select state:modified+    # Modified and downstream
dbt run --select state:new          # New models only
```
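
The graph operators are easier to reason about on a concrete DAG. The Python sketch below emulates `+model`, `model+`, depth-limited `N+model+M`, and `@model` on a small hypothetical graph. It is a conceptual model of the selection semantics, not dbt's implementation.

```python
# Hypothetical DAG: each model maps to its direct parents.
PARENTS = {
    "stg_orders": set(),
    "stg_customers": set(),
    "int_orders_enriched": {"stg_orders", "stg_customers"},
    "fct_orders": {"int_orders_enriched"},
    "dim_customers": {"stg_customers"},
}


def children_of(node):
    """Direct children of a node, derived from the parent map."""
    return {child for child, parents in PARENTS.items() if node in parents}


def walk(start, step, depth=None):
    """Collect nodes reachable from start via step(), optionally depth-limited."""
    seen, frontier, level = set(), {start}, 0
    while frontier and (depth is None or level < depth):
        frontier = {n for node in frontier for n in step(node)} - seen
        seen |= frontier
        level += 1
    return seen


def select(model, upstream=False, downstream=False, up=None, down=None):
    """Emulate `+model`, `model+`, and `N+model+M`."""
    selected = {model}
    if upstream:
        selected |= walk(model, PARENTS.__getitem__, up)
    if downstream:
        selected |= walk(model, children_of, down)
    return selected


def at_operator(model):
    """Emulate `@model`: the model, its descendants, and their ancestors."""
    selected = select(model, downstream=True)
    for node in list(selected):
        selected |= walk(node, PARENTS.__getitem__)
    return selected


print(sorted(select("fct_orders", upstream=True)))        # +fct_orders
print(sorted(select("fct_orders", upstream=True, up=1)))  # 1+fct_orders
print(sorted(at_operator("stg_orders")))                  # @stg_orders
```

In this graph, `@stg_orders` also selects `stg_customers`, because `int_orders_enriched` cannot be built without it; that is what makes the `@` operator useful in CI environments that start from an empty schema.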