# Data Lakes — Overview

## Purpose
Understand data lake architecture, design patterns, and how they differ from traditional data warehouses. Learn about zone architectures, file formats, and the evolution toward lakehouse architectures.

## Key Questions
- What is a data lake and how does it differ from a data warehouse?
- How should data be organized within a data lake (zone architecture)?
- Which file formats are best suited for different use cases?
- What is a data lakehouse and why is it gaining adoption?

---
## 1. Data Lake vs Data Warehouse

### What is a Data Lake?
A **data lake** is a centralized repository that stores **raw data in its native format** — structured, semi-structured, and unstructured — at any scale. It uses a **schema-on-read** approach, meaning data structure is applied when the data is accessed, not when it's stored.
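
As a rough illustration of schema-on-read, the sketch below stores raw JSON lines untouched and lets each reader apply its own schema at access time. It uses only the Python standard library; the sample records and the `read_with_schema` helper are hypothetical and not part of any data lake product.

```python
import json

# Raw events are stored exactly as received; nothing is validated at write time.
raw_lines = [
    '{"id": "1", "amount": "100.5", "device": "ios"}',
    '{"id": "2", "amount": "200", "extra": {"campaign": "spring"}}',
]

def read_with_schema(lines, schema):
    """Apply a schema (field name -> type constructor) while reading."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Missing fields become None; present fields are cast to the requested type.
        rows.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return rows

# Two consumers read the same raw data with different schemas.
finance_view = read_with_schema(raw_lines, {"id": int, "amount": float})
device_view = read_with_schema(raw_lines, {"id": int, "device": str})

print(finance_view)
print(device_view)
```

The same stored bytes serve both readers, which is the flexibility a schema-on-write system gives up in exchange for validation at load time.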

### Key Differences

| Aspect | Data Lake | Data Warehouse |
|--------|-----------|----------------|
| **Schema** | Schema-on-read | Schema-on-write |
| **Data Types** | Raw, unstructured, semi-structured, structured | Structured, processed |
| **Storage Cost** | Low (object storage) | Higher (optimized databases) |
| **Processing** | ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load) |
| **Users** | Data scientists, engineers | Business analysts, BI users |
| **Flexibility** | High — store everything | Lower — predefined schema |
| **Query Performance** | Variable (depends on optimization) | Optimized for fast queries |
| **Data Quality** | Can include raw/dirty data | Curated, validated data |

### When to Use Each

**Data Lake:**
- Exploratory analytics and data science
- Machine learning model training
- Storing diverse data formats (logs, IoT, images)
- Future-proofing — store now, analyze later

**Data Warehouse:**
- Business intelligence and reporting
- Consistent, reliable metrics
- Ad-hoc SQL queries by analysts
- Regulatory compliance requiring structured data

---
## 2. Zone Architecture (Medallion Architecture)

Data lakes organize data into **zones** (or layers) to manage data quality and access. The most common pattern is the **medallion architecture**: Bronze → Silver → Gold.

```
┌─────────────────────────────────────────────────────────────────┐
│                            DATA LAKE                            │
├───────────────────┬───────────────────┬─────────────────────────┤
│    RAW/BRONZE     │  STAGING/SILVER   │      CURATED/GOLD       │
│                   │                   │                         │
│  • Raw ingestion  │  • Cleaned        │  • Business-ready       │
│  • No transforms  │  • Validated      │  • Aggregated           │
│  • Full history   │  • Deduplicated   │  • Denormalized         │
│  • Immutable      │  • Standardized   │  • Feature stores       │
└───────────────────┴───────────────────┴─────────────────────────┘
          ↓                   ↓                      ↓
    Landing Zone     Transformation Zone     Consumption Zone
```

### Zone Details

| Zone | Also Called | Purpose | Data Quality | Users |
|------|-------------|---------|--------------|-------|
| **Raw/Bronze** | Landing, Ingestion | Store data exactly as received | Low — raw, unvalidated | Data Engineers |
| **Staging/Silver** | Refined, Cleansed | Clean, validate, standardize | Medium — cleaned | Data Engineers, Scientists |
| **Curated/Gold** | Consumption, Trusted | Business-ready, aggregated | High — production quality | Analysts, Applications |
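
The flow between zones can be sketched in plain Python. This is a minimal illustration, assuming in-memory lists stand in for stored files; the sample rows and the `to_silver` and `to_gold` helpers are hypothetical.

```python
from collections import defaultdict

# Bronze: records exactly as received, including a duplicate and a bad row.
bronze = [
    {"id": "1", "name": " Alice ", "amount": "100.0", "date": "2024-01-01"},
    {"id": "1", "name": " Alice ", "amount": "100.0", "date": "2024-01-01"},
    {"id": "2", "name": "bob", "amount": "not-a-number", "date": "2024-01-02"},
    {"id": "3", "name": "Charlie", "amount": "150.0", "date": "2024-01-02"},
]

def to_silver(rows):
    """Clean, validate, standardize, and deduplicate Bronze rows."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would route rejects to a quarantine location
        key = (row["id"], row["date"])
        if key in seen:
            continue  # drop exact repeats of the same id and date
        seen.add(key)
        silver.append({
            "id": int(row["id"]),
            "name": row["name"].strip().title(),
            "amount": amount,
            "date": row["date"],
        })
    return silver

def to_gold(rows):
    """Aggregate Silver rows into a business-ready daily total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["date"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Note that `bronze` is never mutated: each zone is derived from the one before it, so a bug in the Silver logic can be fixed and the layer rebuilt from the raw data.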

### Best Practices

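1. **Keep raw data immutable**: append to the Bronze layer, never update or delete it in place
2. **Partition data**: by date, region, or a low-cardinality business key that queries commonly filter on
3. **Apply data contracts**: define the expected schema at each zone boundary
4. **Track lineage**: document transformations between layers
5. **Implement access controls**: restrict every zone to its intended users, not only Gold; Bronze often holds unmasked sensitive data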
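
To make the partitioning practice concrete, here is a small sketch of the Hive-style `column=value` directory layout that engines such as Spark produce for partitioned writes. The base path and the `partition_path` helper are hypothetical, and the string match at the end only imitates what a query engine does when it prunes partitions.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def partition_path(base, row, partition_cols):
    """Build a Hive-style partition directory such as base/date=2024-01-01/region=eu."""
    path = PurePosixPath(base)
    for col in partition_cols:
        path = path / f"{col}={row[col]}"
    return str(path)

rows = [
    {"id": 1, "date": "2024-01-01", "region": "eu"},
    {"id": 2, "date": "2024-01-01", "region": "us"},
    {"id": 3, "date": "2024-01-02", "region": "eu"},
]

# Group row ids by the directory they would be written to.
files = defaultdict(list)
for row in rows:
    files[partition_path("/data/silver/transactions", row, ["date", "region"])].append(row["id"])

# A query filtered on date only has to list the matching directories (partition pruning).
pruned = [p for p in files if "/date=2024-01-01/" in p]
print(sorted(pruned))
```

Partitioning on a high-cardinality column would create one directory per value and many tiny files, which is why the column choice should follow common query filters.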

---
## 3. File Formats for Data Lakes

Choosing the right file format significantly impacts query performance, storage costs, and compatibility.

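### Comparison of File and Table Formats

Parquet, ORC, and Avro are file formats. Delta Lake, Iceberg, and Hudi are table formats that add a transaction log and table metadata on top of data files.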

| Format | Type | Compression | Schema Evolution | Best For |
|--------|------|-------------|------------------|----------|
| **Parquet** | Columnar | Excellent (Snappy, GZIP) | Limited | Analytics, Spark, general purpose |
| **ORC** | Columnar | Excellent (ZLIB, Snappy) | Good | Hive, heavy read workloads |
| **Avro** | Row-based | Good | Excellent | Streaming, schema evolution |
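| **Delta Lake** | Table format + ACID (Parquet data files) | Inherited from Parquet | Excellent | Lakehouse, ACID transactions |
| **Iceberg** | Table format + ACID (Parquet, ORC, or Avro data files) | Inherited from data files | Excellent | Multi-engine, large tables |
| **Hudi** | Table format + ACID (Parquet base files) | Inherited from data files | Excellent | Incremental updates, CDC |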

### Format Deep Dive

#### Parquet
- **Columnar storage** — reads only required columns
- **Predicate pushdown** — filters at storage level
- **Widely supported** — Spark, Presto, Athena, etc.
- **Best for**: Analytical queries, data warehousing
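
The two ideas above, column pruning and predicate pushdown, can be illustrated without any Parquet library. The sketch below is a simplified model, assuming each row group keeps min/max statistics for one column; the function names and data are hypothetical and do not reflect Parquet's actual on-disk encoding.

```python
# Row-oriented input: each record is stored together.
rows = [
    {"id": 1, "name": "Alice", "amount": 100.0},
    {"id": 2, "name": "Bob", "amount": 200.0},
    {"id": 3, "name": "Charlie", "amount": 150.0},
    {"id": 4, "name": "Diana", "amount": 900.0},
]

def to_row_groups(records, group_size):
    """Split records into row groups, storing each column separately with min/max stats."""
    groups = []
    for start in range(0, len(records), group_size):
        chunk = records[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {"amount": (min(columns["amount"]), max(columns["amount"]))}
        groups.append({"columns": columns, "stats": stats})
    return groups

def scan_amounts_above(groups, threshold):
    """Read only the 'amount' column, skipping row groups whose max cannot match."""
    matches, groups_read = [], 0
    for group in groups:
        _, max_amount = group["stats"]["amount"]
        if max_amount <= threshold:
            continue  # predicate pushdown: the group is skipped using stats alone
        groups_read += 1
        matches.extend(a for a in group["columns"]["amount"] if a > threshold)
    return matches, groups_read

row_groups = to_row_groups(rows, group_size=2)
matches, groups_read = scan_amounts_above(row_groups, threshold=500.0)
print(matches, groups_read)
```

The scan never touches the `id` or `name` columns and opens only one of the two row groups, which is where the I/O savings of a columnar format come from.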

#### ORC (Optimized Row Columnar)
- Originally developed for Hive
- **Better compression** than Parquet in some cases
- Built-in **indexes** and **bloom filters**
- **Best for**: Hive ecosystems, read-heavy workloads

#### Avro
- **Row-based** — efficient for write-heavy workloads
- **Schema stored with data** — self-describing
- **Excellent schema evolution** — add/remove fields easily
- **Best for**: Kafka, streaming, data exchange
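
Schema evolution in Avro rests on resolving the writer's data against the reader's schema. The sketch below is a simplified model of that idea in plain Python, not the Avro library: it assumes flat records, represents the reader schema as a hypothetical dict of field name to default, and ignores type promotion and aliases.

```python
# Sentinel marking "no default declared" (None is a legitimate default value).
NO_DEFAULT = object()

def resolve(record, reader_schema):
    """Project a record written with another schema version onto the reader's schema."""
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not NO_DEFAULT:
            resolved[field] = default  # field added after the record was written
        else:
            raise ValueError(f"missing field without default: {field}")
    return resolved  # fields unknown to the reader are dropped

# A record written with schema v1 (has 'legacy_code') read with schema v2 (adds 'email').
v1_record = {"id": 1, "name": "Alice", "legacy_code": "A1"}
reader_v2 = {"id": NO_DEFAULT, "name": NO_DEFAULT, "email": None}

print(resolve(v1_record, reader_v2))
```

The practical rule this models: a new field can be added safely only if it declares a default, and a field can be removed safely only if readers that still expect it have a default for it.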

#### Delta Lake
- **ACID transactions** on top of Parquet
- **Time travel** — query historical versions
- **Schema enforcement** — prevent bad data
- **Merge/Upsert** operations (MERGE INTO)
- **Best for**: Lakehouse architecture, reliable pipelines
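
The features listed above all follow from keeping an ordered commit log next to the data files. The toy class below sketches that idea in plain Python under heavy simplification: it is append-only, single-process, and in-memory, and `ToyTable` is a hypothetical name, not Delta Lake's protocol or API.

```python
class ToyTable:
    """Toy versioned table: each commit appends one entry to an ordered log."""

    def __init__(self, columns):
        self.columns = set(columns)
        self.log = []  # log[i] holds the rows added by version i

    def commit(self, rows):
        # Schema enforcement: reject the whole batch if any row has unexpected columns.
        for row in rows:
            if set(row) != self.columns:
                raise ValueError(f"schema mismatch: {sorted(row)}")
        # Atomicity: the batch becomes visible only once the log entry is appended.
        self.log.append(list(rows))
        return len(self.log) - 1  # version number of this commit

    def read(self, version=None):
        """Return the rows visible as of a version (latest by default)."""
        last = len(self.log) - 1 if version is None else version
        return [row for entry in self.log[: last + 1] for row in entry]

table = ToyTable(["id", "amount"])
v0 = table.commit([{"id": 1, "amount": 100.0}])
v1 = table.commit([{"id": 2, "amount": 200.0}])

try:
    # The second row misspells a column name, so the whole batch is rejected.
    table.commit([{"id": 3, "amount": 1.0}, {"id": 4, "amout": 2.0}])
except ValueError as err:
    print("rejected:", err)

print(table.read(version=0))  # time travel to the first commit
print(table.read())           # latest snapshot; the rejected batch left no trace
```

Because readers only ever see whole log entries, a failed write cannot leave a half-written batch behind, and any earlier version can be reconstructed by replaying the log up to that point.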

```python
# Example: Reading/Writing Different Formats with PySpark
# Note: illustrative code that requires a PySpark environment.
# The package versions assume Spark 3.4.x built for Scala 2.12.

from pyspark.sql import SparkSession

# Initialize Spark with Delta Lake and Avro support.
# Avro is an external Spark module, so spark-avro is added alongside delta-core.
spark = (
    SparkSession.builder
    .appName("DataLakeFormats")
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:2.4.0,org.apache.spark:spark-avro_2.12:3.4.1",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Sample DataFrame
df = spark.createDataFrame([
    (1, "Alice", 100.0, "2024-01-01"),
    (2, "Bob", 200.0, "2024-01-02"),
    (3, "Charlie", 150.0, "2024-01-03")
], ["id", "name", "amount", "date"])

# Write as Parquet (partitioned by date)
df.write.partitionBy("date").parquet("/data/bronze/transactions_parquet")

# Write as ORC
df.write.orc("/data/bronze/transactions_orc")

# Write as Avro
df.write.format("avro").save("/data/bronze/transactions_avro")

# Write as Delta Lake (with ACID support)
df.write.format("delta").save("/data/silver/transactions_delta")

# Delta Lake: Time Travel (query previous version)
# df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/silver/transactions_delta")
```

---
## 4. Data Lakehouse Architecture

A **data lakehouse** combines the low-cost, open-format storage of a data lake with the transactions, schema enforcement, and SQL access of a data warehouse:

```
┌──────────────────────────────────────────────────────────────────┐
│                     DATA LAKEHOUSE                               │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│   │      BI      │  │    ML/AI     │  │  Streaming   │          │
│   │  Analytics   │  │  Workloads   │  │  Analytics   │  ← USE   │
│   └──────┬───────┘  └──────┬───────┘  └──────┬───────┘   CASES  │
│          │                 │                 │                   │
│   ┌──────┴─────────────────┴─────────────────┴───────┐          │
│   │              UNIFIED QUERY ENGINE                │          │
│   │         (Spark, Presto, Trino, Dremio)           │          │
│   └──────────────────────┬───────────────────────────┘          │
│                          │                                       │
│   ┌──────────────────────┴───────────────────────────┐          │
│   │           METADATA & GOVERNANCE LAYER            │          │
│   │    (Delta Lake, Iceberg, Hudi, Unity Catalog)    │          │
│   │  • ACID Transactions  • Schema Enforcement       │          │
│   │  • Time Travel        • Data Lineage             │          │
│   └──────────────────────┬───────────────────────────┘          │
│                          │                                       │
│   ┌──────────────────────┴───────────────────────────┐          │
│   │              OPEN FILE FORMATS                   │          │
│   │           (Parquet, ORC on Object Storage)       │          │
│   └──────────────────────┬───────────────────────────┘          │
│                          │                                       │
│   ┌──────────────────────┴───────────────────────────┐          │
│   │            CLOUD OBJECT STORAGE                  │          │
│   │         (S3, ADLS, GCS, MinIO)                   │  ← BASE  │
│   └──────────────────────────────────────────────────┘          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

### Key Benefits of Lakehouse

| Feature | Traditional Lake | Lakehouse |
|---------|------------------|----------|
| ACID Transactions | ❌ | ✅ |
| Schema Enforcement | ❌ | ✅ |
| Time Travel | ❌ | ✅ |
| BI Tool Support | Limited | Full SQL support |
| Data Quality | Manual | Built-in constraints |
| Concurrent Writes | Problematic | Supported |
| Cost | Low | Low (same storage) |

### Popular Lakehouse Technologies

1. **Delta Lake** (originated at Databricks): tight Spark integration
2. **Apache Iceberg** (originated at Netflix): engine-agnostic, suited to multi-engine environments
3. **Apache Hudi** (originated at Uber): designed for incremental and CDC workloads

### Lakehouse vs Two-Tier Architecture

**Traditional (Lake + Warehouse):**
```
Data Sources → Data Lake → ETL → Data Warehouse → BI Tools
                   ↓
               ML/Data Science
```
- Data duplication
- Complex pipelines
- Data staleness

**Lakehouse:**
```
Data Sources → Data Lakehouse → BI Tools + ML + Streaming
```
- Single source of truth
- Reduced complexity
- Fresher data (batch and streaming jobs work on the same tables)

```python
# Example: Delta Lake ACID Operations
# Note: requires a Delta Lake environment, an active `spark` session, and an
# existing Delta table at /data/silver/customers with id, name, and email columns.

from delta.tables import DeltaTable

# Load the existing table
delta_table = DeltaTable.forPath(spark, "/data/silver/customers")

new_data = spark.createDataFrame([
    (1, "Alice Smith", "alice@new.com"),  # Update
    (4, "Diana", "diana@new.com")          # Insert
], ["id", "name", "email"])

# MERGE operation (UPSERT): update matching ids, insert new ones
(
    delta_table.alias("target")
    .merge(new_data.alias("source"), "target.id = source.id")
    .whenMatchedUpdate(set={"name": "source.name", "email": "source.email"})
    .whenNotMatchedInsertAll()
    .execute()
)

# Time Travel: query the table as of a timestamp
# (the timestamp must fall within the table's commit history)
df_as_of = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01")
    .load("/data/silver/customers")
)

# View history
delta_table.history().show()

# Rollback to previous version
# spark.sql("RESTORE TABLE delta.`/data/silver/customers` TO VERSION AS OF 2")
```
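
For readers without a Spark environment, the matched/not-matched logic of an upsert can be sketched in plain Python. This is only a model of the semantics, assuming the source has at most one row per key; `merge_upsert` and the sample rows are hypothetical.

```python
def merge_upsert(target, source, key, update_cols):
    """Return a new list of rows after a MERGE-style upsert keyed on `key`."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            # Matched: overwrite only the listed columns.
            for col in update_cols:
                merged[row[key]][col] = row[col]
        else:
            # Not matched: insert the source row as-is.
            merged[row[key]] = dict(row)
    return list(merged.values())

target = [
    {"id": 1, "name": "Alice", "email": "alice@old.com"},
    {"id": 2, "name": "Bob", "email": "bob@old.com"},
]
source = [
    {"id": 1, "name": "Alice Smith", "email": "alice@new.com"},  # matches id 1
    {"id": 4, "name": "Diana", "email": "diana@new.com"},        # no match
]

result = merge_upsert(target, source, key="id", update_cols=["name", "email"])
print(result)
```

Row 2 is left untouched, row 1 is updated, and row 4 is appended. The function returns a new list instead of mutating `target`, loosely mirroring how a table format writes a new version and keeps the old one available for time travel.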

---
## 5. Data Lake Governance & Catalog

### Essential Governance Components

| Component | Purpose | Tools |
|-----------|---------|-------|
| **Data Catalog** | Discover and understand data | AWS Glue, Unity Catalog, Hive Metastore |
| **Access Control** | Secure data access | IAM, Ranger, Unity Catalog |
| **Data Lineage** | Track data flow | OpenLineage, Marquez, Purview |
| **Data Quality** | Validate data | Great Expectations, Deequ, Soda |
| **Schema Registry** | Manage schemas | Confluent Schema Registry, AWS Glue |
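
As a rough idea of what a data quality check does at a zone boundary, the sketch below validates a batch against a simple contract. It is plain Python and does not use the API of Great Expectations, Deequ, or Soda; the `CONTRACT` structure, the `validate` helper, and the sample batch are all hypothetical.

```python
# Hypothetical contract: column name -> (expected type, nullable)
CONTRACT = {
    "id": (int, False),
    "email": (str, True),
    "amount": (float, False),
}

def validate(rows, contract):
    """Return a list of human-readable violations; an empty list means the batch passes."""
    violations = []
    for index, row in enumerate(rows):
        for column, (expected_type, nullable) in contract.items():
            value = row.get(column)
            if value is None:
                if not nullable:
                    violations.append(f"row {index}: {column} is null")
            elif not isinstance(value, expected_type):
                violations.append(f"row {index}: {column} is not {expected_type.__name__}")
    return violations

batch = [
    {"id": 1, "email": "alice@example.com", "amount": 100.0},
    {"id": 2, "email": None, "amount": None},
    {"id": "3", "email": "carol@example.com", "amount": 150.0},
]

problems = validate(batch, CONTRACT)
print(problems)
```

A pipeline would typically run a check like this before promoting a batch to the next zone and either fail the job or quarantine the offending rows when the list is not empty.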

### Data Lake Anti-Patterns ("Data Swamp")

| Anti-Pattern | Problem | Solution |
|--------------|---------|----------|
| No metadata | Can't find or understand data | Implement data catalog |
| No governance | Data quality degrades | Data contracts, validation |
| Dump everything | Storage bloat, no value | Define retention policies |
| No access controls | Security risks | Role-based access |
| No documentation | Tribal knowledge | Schema docs, lineage tracking |

---
## Takeaways

### Key Concepts

1. **Data Lake** = centralized storage for raw data in native formats (schema-on-read)
2. **Zone Architecture** = Bronze (raw) → Silver (cleaned) → Gold (curated)
3. **File Formats** = Use Parquet/ORC for analytics, Avro for streaming, Delta/Iceberg/Hudi for ACID
4. **Lakehouse** = Lake + Warehouse features (ACID, time travel, schema enforcement)

### Decision Framework

```
Need ACID transactions?          → Use Delta Lake/Iceberg/Hudi
Heavy schema evolution?          → Use Avro (streaming) or Iceberg
Pure analytics workloads?        → Parquet is sufficient
Multi-engine environment?        → Consider Apache Iceberg
Need warehouse + lake?           → Adopt Lakehouse architecture
```
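
The framework above can also be written as a small function. This is one possible encoding, not a definitive rule set: the function name, its parameters, and especially the order in which the questions are checked are assumptions, since the framework itself does not rank them.

```python
def recommend_format(needs_acid=False, heavy_schema_evolution=False,
                     streaming=False, multi_engine=False):
    """Map the questions in the decision framework to a suggested format."""
    # Assumed priority: multi-engine first, then ACID, then schema evolution.
    if multi_engine:
        return "Apache Iceberg"
    if needs_acid:
        return "Delta Lake / Iceberg / Hudi"
    if heavy_schema_evolution:
        return "Avro" if streaming else "Iceberg"
    return "Parquet"

print(recommend_format())
print(recommend_format(needs_acid=True))
print(recommend_format(heavy_schema_evolution=True, streaming=True))
```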

### Best Practices Checklist

- [ ] Implement zone/medallion architecture
- [ ] Keep Bronze layer immutable
- [ ] Partition data by common query patterns
- [ ] Use columnar formats (Parquet/ORC) for analytics
- [ ] Implement data catalog and governance
- [ ] Consider lakehouse formats (Delta/Iceberg) for reliability
- [ ] Define data retention and lifecycle policies
- [ ] Monitor data quality across zones