# Data Warehousing — Overview

## Purpose
This notebook introduces **data warehousing** concepts, architectures, and modern platforms. Data warehouses are central to business intelligence: they consolidate data from many source systems so that it can be analyzed consistently and over long time spans.

## Key Questions
1. What is a data warehouse, and how does it differ from operational databases?
2. What are OLAP and OLTP, and when should each be used?
3. How do star and snowflake schemas structure data for analytics?
4. What is dimensional modeling, and why are facts and dimensions important?
5. What are the leading modern data warehouse platforms?

---
## 1. What is a Data Warehouse?

A **data warehouse** is a centralized repository designed to store, integrate, and manage large volumes of structured data from multiple sources. It is optimized for **analytical queries** and **reporting**, rather than transactional processing.

### Characteristics of a Data Warehouse
| Characteristic | Description |
|----------------|-------------|
| **Subject-Oriented** | Organized around key business subjects (e.g., sales, customers, products) |
| **Integrated** | Consolidates data from disparate sources into a consistent format |
| **Time-Variant** | Maintains historical data for trend analysis |
| **Non-Volatile** | Data is stable; once loaded, it is not frequently changed |

### Data Warehouse vs. Data Lake
| Aspect | Data Warehouse | Data Lake |
|--------|----------------|----------|
| **Data Type** | Structured, curated | Raw, unstructured, semi-structured |
| **Schema** | Schema-on-write | Schema-on-read |
| **Use Case** | BI, reporting, dashboards | ML, data science, exploration |
| **Processing** | Traditionally ETL (Extract, Transform, Load); ELT is common on cloud warehouses | ELT (Extract, Load, Transform) |
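
The schema-on-write and schema-on-read rows in the table above can be illustrated with a small sketch. It is only a toy model: `ORDERS_SCHEMA`, `load_schema_on_write`, and `read_schema_on_read` are hypothetical names invented for this example, and plain Python lists stand in for warehouse tables and lake files.

```python
import json

# Hypothetical schema for a curated warehouse table: column name -> required type.
ORDERS_SCHEMA = {"order_id": int, "amount": float}

def load_schema_on_write(rows, schema):
    """Validate every row against the schema before it is stored."""
    table = []
    for row in rows:
        for column, expected_type in schema.items():
            if not isinstance(row.get(column), expected_type):
                raise ValueError(
                    f"row {row!r}: column {column!r} is not {expected_type.__name__}"
                )
        # Keep only the columns the schema declares.
        table.append({column: row[column] for column in schema})
    return table

def read_schema_on_read(raw_lines, column, cast):
    """Keep raw text untouched; apply structure only when a query needs it."""
    values = []
    for line in raw_lines:
        record = json.loads(line)
        if column in record:
            values.append(cast(record[column]))
    return values

# Warehouse-style load: a malformed row would be rejected up front.
warehouse_table = load_schema_on_write(
    [{"order_id": 1, "amount": 19.99}, {"order_id": 2, "amount": 5.00}],
    ORDERS_SCHEMA,
)

# Lake-style storage: heterogeneous raw records are kept as they arrived.
lake_files = [
    '{"order_id": 1, "amount": "19.99", "note": "gift"}',
    '{"event": "page_view", "url": "/home"}',
    '{"order_id": 2, "amount": 5}',
]
lake_amounts = read_schema_on_read(lake_files, "amount", float)

print(warehouse_table)
print(lake_amounts)
```

The trade-off shown here is where the cost of structure is paid: at load time for the warehouse, at query time for the lake.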

---
## 2. OLAP vs. OLTP

Understanding the difference between **OLAP** and **OLTP** is fundamental to data warehousing.

### OLTP (Online Transaction Processing)
- Designed for **transactional workloads** (inserts, updates, deletes)
- Optimized for **fast, short queries** affecting few rows
- Examples: banking systems, e-commerce order processing, CRM
- Normalized schema to minimize redundancy

### OLAP (Online Analytical Processing)
- Designed for **complex analytical queries** across large datasets
- Optimized for **aggregations, joins, and historical analysis**
- Examples: sales trend analysis, financial reporting, dashboards
- Denormalized schema for faster reads

### Comparison Table
| Feature | OLTP | OLAP |
|---------|------|------|
| **Purpose** | Day-to-day operations | Analytical reporting |
| **Query Type** | Simple, short | Complex, long-running |
| **Data Volume per Query** | Small (rows) | Large (millions of rows) |
| **Schema Design** | Normalized (3NF) | Dimensional (star or snowflake) |
| **Concurrency** | High (many users) | Lower (analysts, reports) |
| **Data Freshness** | Real-time | Periodic (batch loads) |

In [None]:
# Conceptual Example: OLTP vs OLAP Query Patterns

# OLTP Query Example (transactional - reads a single row by primary key)
oltp_query = """
SELECT order_id, customer_name, total_amount
FROM orders
WHERE order_id = 12345;
"""

# OLAP Query Example (analytical - aggregates millions of rows)
olap_query = """
SELECT 
    d.year,
    d.quarter,
    p.category,
    SUM(f.sales_amount) AS total_sales,
    AVG(f.sales_amount) AS avg_sale
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, d.quarter, p.category
ORDER BY d.year, d.quarter;
"""

print("OLTP Query (Single Record Lookup):")
print(oltp_query)
print("\nOLAP Query (Aggregation & Analysis):")
print(olap_query)
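
To see both query patterns actually run, the following sketch executes them against an in-memory SQLite database. SQLite is only a stand-in here (it is not a warehouse engine), and the table contents are made-up sample rows chosen so the results are easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_name TEXT, total_amount REAL);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    sales_key    INTEGER PRIMARY KEY,
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL
);
-- Made-up sample data for illustration only.
INSERT INTO orders VALUES (12345, 'Ada', 99.5), (12346, 'Grace', 15.0);
INSERT INTO dim_date VALUES (20240115, 2024, 1), (20240420, 2024, 2);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO fact_sales VALUES
    (1, 20240115, 1, 10.0),
    (2, 20240115, 1, 30.0),
    (3, 20240115, 2, 50.0),
    (4, 20240420, 1, 20.0);
""")

# OLTP-style access: fetch one row by primary key.
point_lookup = conn.execute(
    "SELECT order_id, customer_name, total_amount FROM orders WHERE order_id = ?",
    (12345,),
).fetchone()

# OLAP-style access: join facts to dimensions and aggregate every fact row.
summary = conn.execute("""
    SELECT d.year, d.quarter, p.category,
           SUM(f.sales_amount) AS total_sales,
           AVG(f.sales_amount) AS avg_sale
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, d.quarter, p.category
    ORDER BY d.year, d.quarter, p.category
""").fetchall()

print(point_lookup)
for row in summary:
    print(row)
```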

---
## 3. Star Schema and Snowflake Schema

Data warehouses typically use **dimensional models** organized as star or snowflake schemas.

### Star Schema
The **star schema** is the simplest and most widely used dimensional model.

```
                    ┌─────────────┐
                    │  dim_date   │
                    └──────┬──────┘
                           │
┌─────────────┐     ┌──────┴──────┐     ┌─────────────┐
│ dim_product │─────│  fact_sales │─────│ dim_customer│
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────┴──────┐
                    │  dim_store  │
                    └─────────────┘
```

**Characteristics:**
- Central **fact table** surrounded by **dimension tables**
- Dimension tables are **denormalized** (flat)
- Simple joins, excellent query performance
- Some data redundancy in dimensions

### Snowflake Schema
The **snowflake schema** normalizes dimension tables into sub-dimensions.

```
┌───────────┐     ┌─────────────┐
│  country  │─────│  dim_store  │
└───────────┘     └──────┬──────┘
                         │
                  ┌──────┴──────┐
                  │  fact_sales │
                  └──────┬──────┘
                         │
┌───────────┐     ┌──────┴──────┐
│ category  │─────│ dim_product │
└───────────┘     └─────────────┘
```

**Characteristics:**
- Dimension tables are **normalized** (split into related tables)
- Reduces data redundancy
- More complex joins, slightly slower queries
- Easier to maintain data integrity

### Comparison
| Aspect | Star Schema | Snowflake Schema |
|--------|-------------|------------------|
| **Normalization** | Denormalized | Normalized |
| **Query Complexity** | Simple | More complex |
| **Query Performance** | Faster | Slower |
| **Storage** | More redundancy | Less redundancy |
| **Maintenance** | Fewer tables to manage, but an attribute change touches many rows | More tables to manage, but each attribute is updated in one place |
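
The difference in query complexity can be shown with the same question asked of both layouts. This is a sketch using in-memory SQLite as a stand-in; the `star_` and `sf_` table prefixes and the sample rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: category is stored directly on the product dimension.
CREATE TABLE star_dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT
);
-- Snowflake: category is split out into its own table.
CREATE TABLE sf_dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sf_dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES sf_dim_category(category_key)
);
CREATE TABLE fact_sales (product_key INTEGER, sales_amount REAL);

INSERT INTO star_dim_product VALUES
    (1, 'Novel', 'Books'), (2, 'Atlas', 'Books'), (3, 'Chess', 'Games');
INSERT INTO sf_dim_category VALUES (10, 'Books'), (20, 'Games');
INSERT INTO sf_dim_product VALUES (1, 'Novel', 10), (2, 'Atlas', 10), (3, 'Chess', 20);
INSERT INTO fact_sales VALUES (1, 12.0), (2, 30.0), (3, 25.0), (1, 8.0);
""")

# Star schema: one join reaches the category attribute.
star_totals = conn.execute("""
    SELECT p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN star_dim_product p ON f.product_key = p.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

# Snowflake schema: a second join is needed to reach the same attribute.
snowflake_totals = conn.execute("""
    SELECT c.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN sf_dim_product p ON f.product_key = p.product_key
    JOIN sf_dim_category c ON p.category_key = c.category_key
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()

print(star_totals)
print(snowflake_totals)
```

Both queries return the same totals. The star version stores the string 'Books' once per product, while the snowflake version stores it once overall at the cost of the extra join.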

In [None]:
# Star Schema Example: SQL DDL

star_schema_ddl = """
-- DIMENSION TABLES (Denormalized)

CREATE TABLE dim_date (
    date_key        INT PRIMARY KEY,
    full_date       DATE,
    day_of_week     VARCHAR(10),
    month           INT,
    quarter         INT,
    year            INT,
    is_holiday      BOOLEAN
);

CREATE TABLE dim_product (
    product_key     INT PRIMARY KEY,
    product_id      VARCHAR(50),
    product_name    VARCHAR(200),
    category        VARCHAR(100),      -- Denormalized
    subcategory     VARCHAR(100),      -- Denormalized
    brand           VARCHAR(100),
    unit_price      DECIMAL(10,2)
);

CREATE TABLE dim_customer (
    customer_key    INT PRIMARY KEY,
    customer_id     VARCHAR(50),
    customer_name   VARCHAR(200),
    city            VARCHAR(100),      -- Denormalized
    state           VARCHAR(100),      -- Denormalized
    country         VARCHAR(100),      -- Denormalized
    segment         VARCHAR(50)
);

-- FACT TABLE (Measures + Foreign Keys)

CREATE TABLE fact_sales (
    sales_key       INT PRIMARY KEY,
    date_key        INT REFERENCES dim_date(date_key),
    product_key     INT REFERENCES dim_product(product_key),
    customer_key    INT REFERENCES dim_customer(customer_key),
    quantity        INT,
    sales_amount    DECIMAL(12,2),
    discount        DECIMAL(5,2),
    profit          DECIMAL(12,2)
);
"""

print("Star Schema DDL Example:")
print(star_schema_ddl)

---
## 4. Dimensional Modeling: Facts and Dimensions

**Dimensional modeling** is a design technique optimized for data retrieval and analysis.

### Fact Tables
Fact tables contain **quantitative, measurable data** (metrics) about business events.

| Property | Description |
|----------|-------------|
| **Measures** | Numeric values that can be aggregated (SUM, AVG, COUNT) |
| **Foreign Keys** | References to dimension tables |
| **Grain** | The level of detail (e.g., one row per transaction) |
| **Types** | Transaction facts, periodic snapshots, accumulating snapshots |

**Types of Fact Tables:**
1. **Transaction Fact Table**: One row per event (e.g., each sale)
2. **Periodic Snapshot**: One row per time period (e.g., daily inventory levels)
3. **Accumulating Snapshot**: One row per process instance, updated as it passes each milestone (e.g., an order moving through the fulfillment stages)
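
As a rough illustration of how the first two types relate, the sketch below derives a periodic snapshot (closing inventory per day) from transaction facts (individual stock movements). The function name `daily_snapshot` and the sample movements are hypothetical. Note one simplification: a real periodic snapshot normally has a row for every period, whereas this version only emits rows for days that had movements.

```python
from collections import defaultdict

# Transaction facts: one row per stock movement (date, product, quantity change).
movements = [
    ("2024-01-01", "widget", 100),
    ("2024-01-01", "widget", -30),
    ("2024-01-02", "widget", -20),
    ("2024-01-02", "gadget", 10),
    ("2024-01-03", "widget", 50),
]

def daily_snapshot(movements):
    """Return (date, product, closing_balance) rows, one per product per active day."""
    net_by_day = defaultdict(int)
    for day, product, quantity in movements:
        net_by_day[(product, day)] += quantity

    balance = defaultdict(int)
    snapshot = []
    # ISO date strings sort chronologically, so sorted() walks each product in date order.
    for product, day in sorted(net_by_day):
        balance[product] += net_by_day[(product, day)]
        snapshot.append((day, product, balance[product]))
    return snapshot

inventory_snapshot = daily_snapshot(movements)
for row in inventory_snapshot:
    print(row)
```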

### Dimension Tables
Dimension tables contain **descriptive attributes** that provide context to facts.

| Property | Description |
|----------|-------------|
| **Attributes** | Descriptive fields (name, category, location) |
| **Surrogate Key** | System-generated primary key |
| **Natural Key** | Business identifier (e.g., product_id) |
| **Hierarchies** | Drill-down paths (Year → Quarter → Month → Day) |
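
A date dimension is a convenient example of these properties, since its hierarchy (Year → Quarter → Month → Day) can be computed rather than sourced. The sketch below generates rows shaped like the `dim_date` table used in this notebook. The `YYYYMMDD` integer key is a common convention assumed here, `build_dim_date` is a hypothetical helper, and `is_holiday` is left out because it needs an external calendar.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def build_dim_date(start, end):
    """Generate one dimension row per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            # Key in YYYYMMDD form, e.g. 20240330 (an assumed convention).
            "date_key": current.year * 10000 + current.month * 100 + current.day,
            "full_date": current.isoformat(),
            "day_of_week": DAY_NAMES[current.weekday()],
            "month": current.month,
            "quarter": (current.month - 1) // 3 + 1,
            "year": current.year,
        })
        current += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2024, 3, 30), date(2024, 4, 2))
for row in dim_date:
    print(row)
```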

### Slowly Changing Dimensions (SCD)
Dimensions change over time. **SCD types** define how to handle changes:

| Type | Strategy | Use Case |
|------|----------|----------|
| **SCD Type 1** | Overwrite old value | No history needed |
| **SCD Type 2** | Add new row with version | Full history required |
| **SCD Type 3** | Add column for previous value | Limited history |

In [None]:
# SCD Type 2 Example: Tracking Customer Address Changes

scd_type2_example = """
-- SCD Type 2: Customer dimension with versioning

CREATE TABLE dim_customer_scd2 (
    customer_key        INT PRIMARY KEY,       -- Surrogate key
    customer_id         VARCHAR(50),           -- Natural key (business key)
    customer_name       VARCHAR(200),
    city                VARCHAR(100),
    state               VARCHAR(100),
    country             VARCHAR(100),
    effective_date      DATE,                  -- When this version became active
    expiration_date     DATE,                  -- When this version expired (NULL = current)
    is_current          BOOLEAN                -- Flag for current record
);

-- Example: Customer moved from New York to Los Angeles

-- Original record (now expired)
-- customer_key=1, customer_id='C001', city='New York', 
-- effective_date='2020-01-01', expiration_date='2024-06-15', is_current=FALSE

-- New record (current)
-- customer_key=2, customer_id='C001', city='Los Angeles',
-- effective_date='2024-06-15', expiration_date=NULL, is_current=TRUE

-- Query to get current customer data
SELECT * FROM dim_customer_scd2 WHERE is_current = TRUE;

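-- Query to get customer data as of a specific date
-- (half-open interval: a version is valid from effective_date up to, but not
-- including, expiration_date, so the change date matches only the new row)
SELECT * FROM dim_customer_scd2
WHERE effective_date <= '2023-01-01'
  AND '2023-01-01' < COALESCE(expiration_date, '9999-12-31');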
"""

print("SCD Type 2 Implementation Example:")
print(scd_type2_example)
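
The DDL above describes the table, but not the write path. One possible way to apply a Type 2 change is sketched below with in-memory SQLite: expire the current row, then insert a new current row. The helpers `apply_scd2_change` and `city_as_of` are hypothetical, the table is trimmed to the columns needed, and `expiration_date` is treated as exclusive so that a lookup on the change date returns only the new version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer_scd2 (
    customer_key    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id     TEXT,                               -- natural key
    city            TEXT,
    effective_date  TEXT,                               -- ISO dates stored as text
    expiration_date TEXT,                               -- NULL = current version
    is_current      INTEGER
)
""")

def apply_scd2_change(conn, customer_id, new_city, change_date):
    """Expire the current version (if any) and insert a new current version."""
    current = conn.execute(
        "SELECT customer_key, city FROM dim_customer_scd2 "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None:
        if current[1] == new_city:
            return  # attribute unchanged, keep the existing version
        conn.execute(
            "UPDATE dim_customer_scd2 SET expiration_date = ?, is_current = 0 "
            "WHERE customer_key = ?",
            (change_date, current[0]),
        )
    conn.execute(
        "INSERT INTO dim_customer_scd2 "
        "(customer_id, city, effective_date, expiration_date, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )

def city_as_of(conn, customer_id, as_of_date):
    """Return the city that was valid on as_of_date, or None if no version applies."""
    row = conn.execute(
        "SELECT city FROM dim_customer_scd2 "
        "WHERE customer_id = ? AND effective_date <= ? "
        "AND ? < COALESCE(expiration_date, '9999-12-31')",
        (customer_id, as_of_date, as_of_date),
    ).fetchone()
    return row[0] if row else None

apply_scd2_change(conn, "C001", "New York", "2020-01-01")
apply_scd2_change(conn, "C001", "Los Angeles", "2024-06-15")

print(city_as_of(conn, "C001", "2023-01-01"))
print(city_as_of(conn, "C001", "2024-06-15"))
```

In a production load both statements would normally run inside one transaction so that a customer is never left without a current row.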

---
## 5. Modern Data Warehouse Platforms

Modern cloud data warehouses offer scalability, performance, and ease of use.

### Platform Comparison

| Platform | Provider | Key Features |
|----------|----------|-------------|
| **Snowflake** | Snowflake Inc. | Separation of storage/compute, near-zero maintenance, data sharing |
| **BigQuery** | Google Cloud | Serverless, built-in ML, real-time analytics |
| **Redshift** | AWS | Tight AWS integration, Spectrum for S3 queries, ML integration |
| **Azure Synapse** | Microsoft | Unified analytics, Power BI integration, serverless options |
| **Databricks SQL** | Databricks | Lakehouse architecture, Delta Lake, unified with ML |

### Snowflake
- **Architecture**: Multi-cluster shared data architecture
- **Scaling**: Independent scaling of compute and storage
- **Features**: Time travel, zero-copy cloning, secure data sharing
- **Pricing**: Compute billed per second (with a 60-second minimum each time a warehouse starts or resumes); storage billed separately

### Google BigQuery
- **Architecture**: Serverless, columnar storage (Capacitor format)
- **Scaling**: Automatic, no infrastructure management
- **Features**: Built-in ML (BQML), streaming inserts, geospatial analysis
- **Pricing**: On-demand (per bytes scanned) or capacity-based (slot reservations, which replaced the older flat-rate plans)

### Amazon Redshift
- **Architecture**: Massively parallel processing (MPP), columnar storage
- **Scaling**: RA3 nodes separate compute/storage, Serverless option
- **Features**: Redshift Spectrum, ML integration, federated queries
- **Pricing**: On-demand or reserved nodes for provisioned clusters; Serverless is billed by compute used

### Architecture Pattern: Modern Data Stack
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │────▶│  Ingestion  │────▶│  Warehouse  │
│ (DBs, APIs) │     │  (Fivetran) │     │ (Snowflake) │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                    ┌─────────────┐     ┌──────▼──────┐
                    │     BI      │◀────│  Transform  │
                    │  (Looker)   │     │    (dbt)    │
                    └─────────────┘     └─────────────┘
```
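
The load-then-transform order in the diagram can be sketched in a few lines. Here in-memory SQLite stands in for the warehouse, a Python list stands in for the ingestion tool's output, and a single `CREATE TABLE ... AS SELECT` plays the role a dbt model would play. The table names `raw_orders` and `daily_revenue` and the sample records are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: copy source records into a raw table without reshaping them.
source_records = [
    ("1001", "2024-01-05", "19.99", "completed"),
    ("1002", "2024-01-05", "5.00", "cancelled"),
    ("1003", "2024-01-06", "42.50", "completed"),
]
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT, status TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", source_records)

# Transform: build a cleaned, typed model inside the warehouse using SQL.
conn.execute("""
CREATE TABLE daily_revenue AS
SELECT order_date,
       SUM(CAST(amount AS REAL)) AS revenue,
       COUNT(*) AS orders
FROM raw_orders
WHERE status = 'completed'
GROUP BY order_date
""")

# BI layer: dashboards read the modeled table, not the raw one.
daily_revenue = conn.execute(
    "SELECT order_date, revenue, orders FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

Keeping the raw table intact means the transformation can be rewritten and rerun later without going back to the source systems.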

In [None]:
# Example: Platform-Specific Query Syntax

queries = {
    "Snowflake": """
-- Snowflake: Time Travel (query data from 1 hour ago)
SELECT * FROM sales AT (OFFSET => -3600);

-- Snowflake: Zero-copy clone
CREATE TABLE sales_backup CLONE sales;

-- Snowflake: Clustering for performance
ALTER TABLE fact_sales CLUSTER BY (date_key, product_key);
""",
    
    "BigQuery": """
-- BigQuery: Partitioned table for cost optimization
CREATE TABLE `project.dataset.sales`
PARTITION BY DATE(sale_date)
CLUSTER BY customer_id AS
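SELECT * FROM `project.dataset.raw_sales`;

-- BigQuery: Built-in ML (create a forecasting model)
-- ARIMA_PLUS needs to be told which columns hold the timestamp and the value
CREATE MODEL `project.dataset.sales_forecast`
OPTIONS(
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'sale_date',
  time_series_data_col = 'total_sales'
) AS
SELECT sale_date, SUM(amount) AS total_sales
FROM `project.dataset.sales`
GROUP BY sale_date;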
""",
    
    "Redshift": """
-- Redshift: Query data in S3 with Spectrum
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG DATABASE 'my_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole';  -- placeholder 12-digit account ID

-- Redshift: Distribution style for join optimization
CREATE TABLE fact_sales (
    sale_id INT,
    customer_id INT,
    sale_date DATE,          -- must exist because it is used as the sort key
    amount DECIMAL(10,2)
) DISTKEY(customer_id) SORTKEY(sale_date);
"""
}

for platform, query in queries.items():
    print(f"=== {platform} ===")
    print(query)
    print()

---
## 6. Best Practices for Data Warehousing

### Design Principles
1. **Define clear grain**: Establish the level of detail for each fact table
2. **Use surrogate keys**: Synthetic keys protect against source system changes
3. **Conform dimensions**: Shared dimensions across fact tables enable cross-analysis
4. **Partition large tables**: Improve query performance and reduce costs
5. **Document lineage**: Track data from source to warehouse

### Performance Optimization
| Technique | Description |
|-----------|-------------|
| **Partitioning** | Divide tables by date or key for faster scans |
| **Clustering** | Co-locate related rows for efficient access |
| **Materialized Views** | Pre-compute expensive aggregations |
| **Column Pruning** | Select only needed columns to reduce I/O |
| **Predicate Pushdown** | Filter early in query execution |
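
To make the partitioning row concrete, here is a toy model of partition pruning. It is not how any particular warehouse implements it: a dictionary keyed by month stands in for physical partitions, and `partition_by_month` and `total_sales_for_month` are hypothetical helpers. The point is only that a filter on the partition column lets the engine skip whole partitions.

```python
from collections import defaultdict

sales = [
    {"sale_date": "2024-01-03", "amount": 10.0},
    {"sale_date": "2024-01-17", "amount": 20.0},
    {"sale_date": "2024-02-02", "amount": 5.0},
    {"sale_date": "2024-03-09", "amount": 7.5},
]

def partition_by_month(rows):
    """Group rows into partitions keyed by 'YYYY-MM'."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["sale_date"][:7]].append(row)
    return partitions

def total_sales_for_month(partitions, month):
    """Scan only the matching partition; return (total, rows_scanned)."""
    scanned = partitions.get(month, [])
    return sum(row["amount"] for row in scanned), len(scanned)

partitions = partition_by_month(sales)
total, rows_scanned = total_sales_for_month(partitions, "2024-01")
print(f"total={total}, scanned {rows_scanned} of {len(sales)} rows")
```

On platforms that bill by data scanned, skipping partitions reduces cost as well as latency.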

### Data Quality
- Implement data validation checks during ETL
- Use tools like **dbt tests** or **Great Expectations**
- Monitor for schema drift and data anomalies
- Establish SLAs for data freshness
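
The kinds of checks those tools run can be sketched in plain Python. The three functions below are hypothetical and deliberately minimal; they mirror common not-null, uniqueness, and referential-integrity tests, and are not the API of dbt or Great Expectations.

```python
def check_not_null(rows, column):
    """Return the indexes of rows where the column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]

def check_unique(rows, column):
    """Return the sorted values that appear more than once in the column."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row.get(column)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)

def check_foreign_key(fact_rows, fk_column, dim_rows, pk_column):
    """Return fact rows whose foreign key has no matching dimension row."""
    valid_keys = {row[pk_column] for row in dim_rows}
    return [row for row in fact_rows if row[fk_column] not in valid_keys]

# Made-up sample data containing one violation of each kind.
dim_product = [{"product_key": 1}, {"product_key": 2}]
fact_sales = [
    {"sales_key": 1, "product_key": 1, "sales_amount": 10.0},
    {"sales_key": 2, "product_key": 3, "sales_amount": None},
    {"sales_key": 2, "product_key": 2, "sales_amount": 5.0},
]

null_rows = check_not_null(fact_sales, "sales_amount")
duplicate_keys = check_unique(fact_sales, "sales_key")
orphans = check_foreign_key(fact_sales, "product_key", dim_product, "product_key")

print("rows with NULL sales_amount:", null_rows)
print("duplicate sales_key values:", duplicate_keys)
print("orphaned fact rows:", orphans)
```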

---
## Takeaways

| Concept | Key Points |
|---------|------------|
| **Data Warehouse** | Centralized, subject-oriented repository optimized for analytics |
| **OLAP vs OLTP** | OLAP for complex analytics; OLTP for transactional operations |
| **Star Schema** | Denormalized design with central fact table and dimension tables |
| **Snowflake Schema** | Normalized dimensions for reduced redundancy |
| **Dimensional Modeling** | Facts hold measures; Dimensions provide context |
| **SCD Types** | Strategies for handling dimension changes over time |
| **Modern Platforms** | Snowflake, BigQuery, Redshift offer cloud-scale analytics |

### Next Steps
- Explore **ETL/ELT pipelines** for loading data into warehouses
- Learn **dbt (data build tool)** for transformation and testing
- Practice designing dimensional models for real business scenarios
- Understand cost optimization strategies for cloud warehouses