# Data Engineering — Overview

## Purpose
Understand the fundamentals of data engineering: its role in the data ecosystem, core concepts such as data pipelines and ETL/ELT, and the architecture of the modern data stack.

## Key Questions
- What is data engineering and why is it critical for data-driven organizations?
- What are the differences between ETL and ELT approaches?
- How do batch and streaming processing differ?
- What components make up the modern data stack?
- How do data engineers ensure data quality and reliability?

## Topics
1. What is Data Engineering?
2. Core Concepts: Pipelines, ETL/ELT, Batch vs Streaming
3. The Modern Data Stack
4. Data Engineering Landscape Visualization
5. Key Takeaways

---

## 1. What is Data Engineering?

**Data Engineering** is the discipline of designing, building, and maintaining the infrastructure and systems that enable the collection, storage, transformation, and delivery of data at scale.

### Role in the Data Ecosystem

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA ECOSYSTEM                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Data Sources          Data Engineering           Data Consumers           │
│   ────────────          ────────────────           ──────────────           │
│                                                                              │
│   • Databases     ──►   • Ingestion         ──►   • Data Scientists         │
│   • APIs                • Transformation          • Analysts                │
│   • IoT Devices         • Storage                 • ML Engineers            │
│   • Log Files           • Orchestration           • Business Users          │
│   • SaaS Apps           • Quality                 • Applications            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Responsibilities

| Responsibility | Description |
|----------------|-------------|
| **Data Ingestion** | Collecting data from various sources (APIs, databases, files, streams) |
| **Data Transformation** | Cleaning, validating, and transforming raw data into usable formats |
| **Data Storage** | Designing and managing data warehouses, lakes, and lakehouses |
| **Data Orchestration** | Scheduling and managing data pipeline workflows |
| **Data Quality** | Ensuring accuracy, completeness, and reliability of data |
| **Infrastructure** | Building and maintaining scalable data infrastructure |

---

## 2. Core Concepts

### Data Pipelines

A **data pipeline** is a series of data processing steps that move data from source to destination, applying transformations along the way.

```
┌──────────┐    ┌──────────┐    ┌───────────┐    ┌──────────┐    ┌─────────────┐
│  SOURCE  │───►│ EXTRACT  │───►│ TRANSFORM │───►│   LOAD   │───►│ DESTINATION │
└──────────┘    └──────────┘    └───────────┘    └──────────┘    └─────────────┘
                     │               │               │
                     ▼               ▼               ▼
                Validation      Business        Quality
                & Schema        Logic           Checks
```
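
The stages in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not a production design: the sample rows, the `REQUIRED_FIELDS` set, and the in-memory `warehouse` list are all hypothetical stand-ins for a real source and destination.

```python
from typing import Iterable

# Hypothetical raw records standing in for a real source (API, database, file).
RAW_ROWS = [
    {"order_id": "1", "amount": "19.99", "country": "us"},
    {"order_id": "2", "amount": "not-a-number", "country": "de"},
    {"order_id": "3", "amount": "5.00", "country": "US"},
]

REQUIRED_FIELDS = {"order_id", "amount", "country"}


def extract(rows: Iterable[dict]) -> list[dict]:
    """Extract: keep only rows that carry every required field (schema check)."""
    return [row for row in rows if REQUIRED_FIELDS.issubset(row)]


def transform(rows: Iterable[dict]) -> list[dict]:
    """Transform: cast types and normalize values; skip rows that fail to parse."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would typically quarantine the bad row
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": amount,
            "country": row["country"].upper(),
        })
    return cleaned


def load(rows: list[dict], destination: list[dict]) -> int:
    """Load: run a simple quality check, then append to the destination."""
    if any(row["amount"] < 0 for row in rows):
        raise ValueError("negative amounts are not expected")
    destination.extend(rows)
    return len(rows)


warehouse: list[dict] = []  # in-memory stand-in for a destination table
loaded = load(transform(extract(RAW_ROWS)), warehouse)
print(f"loaded {loaded} rows")  # loaded 2 rows
```

The row with the unparseable amount is dropped during the transform step, so only two of the three rows reach the destination.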

### ETL vs ELT

| Aspect | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
|--------|------------------------------|------------------------------|
| **Order** | Transform before loading | Load first, transform in destination |
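| **Where** | Transformation on a separate processing server | Transformation inside the data warehouse |
| **Best For** | On-premises systems, structured data | Cloud warehouses, large-scale analytics |
| **Speed** | Loading waits on transformation, so large datasets load more slowly | Raw data loads quickly; transforms scale with warehouse compute |
| **Cost** | Dedicated transformation compute, paid before data is loaded | Warehouse compute, paid when transformations run |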
| **Tools** | Informatica, Talend, SSIS | dbt, Snowflake, BigQuery |
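
The difference in ordering can be shown with a small sketch that uses an in-memory SQLite database as a stand-in for a warehouse. The table names and sample rows are hypothetical; the point is only where the transformation runs.

```python
import sqlite3

# Hypothetical source rows; every field arrives as text, as in a typical CSV export.
source_rows = [("1", "19.50", "us"), ("2", "5.00", "US"), ("3", "12.25", "de")]

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# ETL: transform in application code, then load the finished result.
etl_rows = [(int(i), float(a), c.upper()) for i, a, c in source_rows]
conn.execute("CREATE TABLE orders_etl (order_id INTEGER, amount REAL, country TEXT)")
conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", etl_rows)

# ELT: load the raw data untouched, then transform with SQL inside the destination.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)
conn.execute("""
    CREATE TABLE orders_elt AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM raw_orders
""")

etl_result = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_result = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(etl_result == elt_result, raw_count)
```

Both paths end with the same modeled table. In the ELT path the untouched `raw_orders` table also remains in the destination, so a transformation can be revised and re-run without extracting from the source again.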

### Batch vs Streaming Processing

| Characteristic | Batch Processing | Stream Processing |
|----------------|------------------|-------------------|
| **Latency** | Minutes to hours | Milliseconds to seconds |
| **Data Volume** | Large, bounded historical datasets | Continuous, unbounded event data |
| **Use Cases** | Reports, analytics, ML training | Real-time dashboards, alerts, fraud detection |
| **Complexity** | Lower | Higher (ordering, state, late-arriving data) |
| **Tools** | Spark, Hive (commonly scheduled with Airflow) | Kafka, Flink, Spark Structured Streaming |
| **Processing** | A bounded dataset processed in each run | Record by record or in micro-batches |
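
The contrast in the **Processing** row can be sketched with plain Python. The event data and function names below are hypothetical, and the streaming function assumes events arrive in timestamp order, which real stream processors cannot take for granted.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical click events: (timestamp in seconds, user id)
EVENTS = [(0, "a"), (12, "b"), (47, "a"), (61, "c"), (75, "a"), (130, "b")]


def batch_counts(events: list[tuple[int, str]]) -> dict[str, int]:
    """Batch: the whole bounded dataset is available, so aggregate it in one run."""
    counts: dict[str, int] = defaultdict(int)
    for _, user in events:
        counts[user] += 1
    return dict(counts)


def stream_window_counts(
    events: Iterable[tuple[int, str]], window_seconds: int = 60
) -> Iterator[tuple[int, int]]:
    """Stream: consume events one at a time, emitting a count per tumbling window.

    Yields (window start, event count) as soon as a window closes.
    Windows that contain no events are not emitted.
    """
    current_window, count = None, 0
    for ts, _ in events:
        window = ts // window_seconds
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, count
            count = 0
        current_window = window
        count += 1
    if current_window is not None:
        yield current_window * window_seconds, count  # flush the last open window


batch_result = batch_counts(EVENTS)
stream_result = list(stream_window_counts(iter(EVENTS)))
print(batch_result)   # {'a': 3, 'b': 2, 'c': 1}
print(stream_result)  # [(0, 3), (60, 2), (120, 1)]
```

The batch function needs the complete dataset before it returns anything, while the streaming generator produces a result each time a window closes and keeps only the current window's count in memory.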

---

## 3. The Modern Data Stack

The **Modern Data Stack (MDS)** is a collection of cloud-native tools that work together to collect, store, transform, and analyze data.

### Architecture Layers

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        MODERN DATA STACK                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ CONSUMPTION: Tableau, Looker, Power BI, Metabase, Custom Apps       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    ▲                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ TRANSFORMATION: dbt, Dataform, Spark                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    ▲                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ STORAGE: Snowflake, BigQuery, Databricks, Redshift                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    ▲                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ INGESTION: Fivetran, Airbyte, Stitch, Debezium                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    ▲                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ ORCHESTRATION: Airflow, Dagster, Prefect, Mage                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Components

| Layer | Purpose | Popular Tools |
|-------|---------|---------------|
| **Ingestion** | Extract data from sources | Fivetran, Airbyte, Stitch, Debezium |
| **Storage** | Centralized data repository | Snowflake, BigQuery, Databricks, Redshift |
| **Transformation** | Model and transform data | dbt, Dataform, Spark SQL |
| **Orchestration** | Schedule and coordinate pipeline workflows | Airflow, Dagster, Prefect, Mage |
| **Quality** | Data validation & testing | Great Expectations, dbt tests, Monte Carlo |
| **Catalog** | Metadata management | Atlan, DataHub, Alation |
| **Consumption** | Visualization & analysis | Tableau, Looker, Power BI, Metabase |
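
To make the orchestration layer concrete, the sketch below runs placeholder tasks in dependency order using the standard library's `graphlib`. It is not how Airflow, Dagster, or Prefect are implemented; the task names are hypothetical, and real orchestrators add scheduling, retries, and monitoring on top of this core idea.

```python
from graphlib import TopologicalSorter

run_log: list[str] = []  # records the execution order for inspection


def make_task(name: str):
    """Return a placeholder task that only records that it ran."""
    def task() -> None:
        run_log.append(name)
    return task


# Hypothetical pipeline: each key maps to the set of tasks it depends on.
dependencies = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "quality_checks": {"transform_sales"},
    "refresh_dashboard": {"quality_checks"},
}
tasks = {name: make_task(name) for name in dependencies}

# static_order() yields each task only after all of its dependencies.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()

print(run_log)
```

The two ingestion tasks have no dependency on each other, so their relative order is not significant; everything downstream runs strictly after its inputs.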

---

## 4. Data Engineering Landscape Visualization

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Data Engineering Landscape - Sunburst Chart
labels = [
    "Data Engineering",
    # Level 1: Main Categories
    "Ingestion", "Storage", "Processing", "Orchestration", "Quality", "Consumption",
    # Level 2: Ingestion
    "Batch Ingestion", "Stream Ingestion", "CDC",
    # Level 2: Storage
    "Data Warehouse", "Data Lake", "Lakehouse",
    # Level 2: Processing
    "Batch", "Stream", "Transformation",
    # Level 2: Orchestration
    "Workflow", "Scheduling", "Monitoring",
    # Level 2: Quality
    "Validation", "Testing", "Observability",
    # Level 2: Consumption
    "BI Tools", "ML/AI", "APIs"
]

parents = [
    "",
    # Level 1 parents
    "Data Engineering", "Data Engineering", "Data Engineering", 
    "Data Engineering", "Data Engineering", "Data Engineering",
    # Level 2: Ingestion parents
    "Ingestion", "Ingestion", "Ingestion",
    # Level 2: Storage parents
    "Storage", "Storage", "Storage",
    # Level 2: Processing parents
    "Processing", "Processing", "Processing",
    # Level 2: Orchestration parents
    "Orchestration", "Orchestration", "Orchestration",
    # Level 2: Quality parents
    "Quality", "Quality", "Quality",
    # Level 2: Consumption parents
    "Consumption", "Consumption", "Consumption"
]

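# Illustrative relative weights, not measured data.
# With branchvalues="total", each parent's value must be at least the sum of its
# children's values; here every parent equals that sum.
values = [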
    100,
    # Level 1 values
    18, 20, 22, 15, 12, 13,
    # Level 2 values
    6, 7, 5,
    8, 6, 6,
    8, 7, 7,
    6, 5, 4,
    4, 4, 4,
    5, 5, 3
]

colors = [
    "#2E4057",  # Center
    # Level 1 colors
    "#048A81", "#54C6EB", "#8EE3EF", "#F7A072", "#D64550", "#7D5BA6",
    # Level 2 colors (lighter shades)
    "#06B6A8", "#07D4C4", "#05A89A",
    "#6DD4F5", "#85DCFA", "#9EE4FC",
    "#A8EBF3", "#C2F1F7", "#DCF7FB",
    "#F9B894", "#FACAAD", "#FCDCC6",
    "#E06B75", "#EA919A", "#F4B7BE",
    "#9A7FC0", "#B7A3D4", "#D4C7E8"
]

fig = go.Figure(go.Sunburst(
    labels=labels,
    parents=parents,
    values=values,
    marker=dict(colors=colors),
    branchvalues="total",
    hovertemplate='<b>%{label}</b><br>Relative Weight: %{value}<extra></extra>',
    textfont=dict(size=12)
))

fig.update_layout(
    title=dict(
        text="Data Engineering Landscape",
        font=dict(size=20, color="#2E4057"),
        x=0.5
    ),
    width=700,
    height=600,
    margin=dict(t=60, l=20, r=20, b=20)
)

fig

In [None]:
import plotly.graph_objects as go

# Batch vs Streaming Comparison
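# 'Low Latency' is phrased so that a higher score means faster results,
# consistent with the comparison table in Section 2.
categories = ['Low Latency', 'Throughput', 'Complexity', 'Cost Efficiency', 'Real-time Capability', 'Fault Tolerance']

# Illustrative scores on a 1-5 scale (5 = most of the named quality), not benchmark results
batch_scores = [2, 5, 2, 4, 1, 4]
streaming_scores = [5, 3, 4, 3, 5, 3]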

fig = go.Figure()

fig.add_trace(go.Scatterpolar(
    r=batch_scores + [batch_scores[0]],
    theta=categories + [categories[0]],
    fill='toself',
    name='Batch Processing',
    line=dict(color='#048A81', width=2),
    fillcolor='rgba(4, 138, 129, 0.3)'
))

fig.add_trace(go.Scatterpolar(
    r=streaming_scores + [streaming_scores[0]],
    theta=categories + [categories[0]],
    fill='toself',
    name='Stream Processing',
    line=dict(color='#D64550', width=2),
    fillcolor='rgba(214, 69, 80, 0.3)'
))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 5],
            tickvals=[1, 2, 3, 4, 5],
            ticktext=['Low', '', 'Medium', '', 'High']
        )
    ),
    title=dict(
        text="Batch vs Stream Processing Comparison",
        font=dict(size=18, color="#2E4057"),
        x=0.5
    ),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=-0.15,
        xanchor="center",
        x=0.5
    ),
    width=650,
    height=500,
    margin=dict(t=80, b=80)
)

fig

In [None]:
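import plotly.graph_objects as go
from plotly.subplots import make_subplots  # imported here so the cell runs on its own

# Modern Data Stack Tools by Category
# Illustrative placeholder figures for the chart, not survey data
categories = ['Ingestion', 'Storage', 'Transformation', 'Orchestration', 'Quality', 'BI/Analytics']
tools_count = [12, 8, 6, 10, 7, 15]
adoption_rate = [78, 92, 85, 72, 58, 88]  # Percentage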

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Tools Available by Category", "Enterprise Adoption Rate (%)"),
    specs=[[{"type": "bar"}, {"type": "bar"}]]
)

colors = ['#048A81', '#54C6EB', '#8EE3EF', '#F7A072', '#D64550', '#7D5BA6']

fig.add_trace(
    go.Bar(
        x=categories,
        y=tools_count,
        marker_color=colors,
        text=tools_count,
        textposition='outside',
        name='Tools Count',
        showlegend=False
    ),
    row=1, col=1
)

fig.add_trace(
    go.Bar(
        x=categories,
        y=adoption_rate,
        marker_color=colors,
        text=[f"{r}%" for r in adoption_rate],
        textposition='outside',
        name='Adoption Rate',
        showlegend=False
    ),
    row=1, col=2
)

fig.update_layout(
    title=dict(
        text="Modern Data Stack Overview",
        font=dict(size=20, color="#2E4057"),
        x=0.5
    ),
    width=900,
    height=450,
    margin=dict(t=80, b=60)
)

fig.update_yaxes(title_text="Number of Tools", row=1, col=1)
fig.update_yaxes(title_text="Adoption Rate (%)", range=[0, 100], row=1, col=2)
fig.update_xaxes(tickangle=45)

fig

---

## 5. Key Takeaways

### Summary

| Concept | Key Points |
|---------|------------|
| **Data Engineering** | Bridges data sources and consumers; enables data-driven decisions |
| **ETL vs ELT** | ETL transforms before loading; ELT leverages warehouse compute power |
| **Batch vs Stream** | Batch for throughput & analytics; Stream for real-time requirements |
| **Modern Data Stack** | Cloud-native, modular tools working together |
| **Data Quality** | Critical for trust; implement testing and observability |

### Best Practices

1. **Design for Scale** — Build pipelines that can handle 10x growth
2. **Embrace Modularity** — Use composable tools that do one thing well
3. **Prioritize Data Quality** — Test data like you test code
4. **Document Everything** — Maintain data catalogs and lineage
5. **Monitor Proactively** — Set up alerts before issues impact downstream users
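
As one way to read "test data like you test code", the sketch below expresses three common checks as small functions that return a pass/fail result. The check names, the `CheckResult` type, and the sample rows are hypothetical; tools such as Great Expectations or dbt tests provide far richer versions of the same idea.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_null(rows: list[dict], column: str) -> CheckResult:
    """Completeness: every row must have a value in `column`."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return CheckResult(f"not_null({column})", missing == 0, f"{missing} missing")


def check_unique(rows: list[dict], column: str) -> CheckResult:
    """Uniqueness: no value in `column` may repeat."""
    values = [row.get(column) for row in rows]
    duplicates = len(values) - len(set(values))
    return CheckResult(f"unique({column})", duplicates == 0, f"{duplicates} duplicate(s)")


def check_range(rows: list[dict], column: str, low: float, high: float) -> CheckResult:
    """Accuracy: non-null values must fall within [low, high]."""
    out_of_range = sum(
        1 for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    )
    return CheckResult(f"range({column})", out_of_range == 0, f"{out_of_range} out of range")


# Hypothetical sample with deliberate defects: a null, a duplicate id, a negative amount.
orders = [
    {"order_id": 1, "amount": 19.5},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": -3.0},
]

results = [
    check_not_null(orders, "order_id"),
    check_not_null(orders, "amount"),
    check_unique(orders, "order_id"),
    check_range(orders, "amount", 0, 10_000),
]
failed = [result.name for result in results if not result.passed]
print(failed)  # ['not_null(amount)', 'unique(order_id)', 'range(amount)']
```

Running checks like these as a pipeline step, and alerting on any failure, is one way to catch problems before they reach downstream users.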

### Further Reading

- [Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/) by Joe Reis & Matt Housley
- [The Data Warehouse Toolkit](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/) by Ralph Kimball & Margy Ross
- [Designing Data-Intensive Applications](https://dataintensive.net/) by Martin Kleppmann