# ETL/ELT — Overview

## Purpose
ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two fundamental data integration patterns for moving data from source systems into targets such as data warehouses and data lakes. Knowing when and how to apply each pattern is critical for building efficient, scalable data pipelines.

## Key Questions
1. What are the core differences between ETL and ELT?
2. When should you choose ETL over ELT, and vice versa?
3. How do transformation strategies differ between the two approaches?
4. What are the performance and scalability implications of each pattern?
5. How do modern cloud data platforms influence the ETL vs ELT decision?

---
## ETL vs ELT: Core Concepts

### ETL (Extract, Transform, Load)
```
┌──────────┐    ┌───────────────┐    ┌──────────────┐
│  Source  │───▶│  Transform    │───▶│   Target     │
│  Systems │    │  (Staging)    │    │  (Warehouse) │
└──────────┘    └───────────────┘    └──────────────┘
   Extract      Transform first       Load cleaned data
```

- **Extract**: Pull data from source systems (databases, APIs, files)
- **Transform**: Clean, validate, aggregate, and reshape data in a staging area
- **Load**: Insert transformed data into the target system

### ELT (Extract, Load, Transform)
```
┌──────────┐    ┌──────────────┐    ┌───────────────┐
│  Source  │───▶│   Target     │───▶│   Transform   │
│  Systems │    │  (Raw Zone)  │    │  (In-place)   │
└──────────┘    └──────────────┘    └───────────────┘
   Extract      Load raw data       Transform in target
```

- **Extract**: Pull data from source systems
- **Load**: Load raw data directly into target system (data lake/warehouse)
- **Transform**: Use target system's compute power to transform data

---
## When to Use Each Approach

| Criteria | ETL | ELT |
|----------|-----|-----|
| **Data Volume** | Small to medium datasets | Large-scale, big data workloads |
| **Target System** | Traditional data warehouses | Cloud data warehouses (Snowflake, BigQuery, Redshift) |
| **Compute Resources** | External processing server | Leverage target's compute power |
| **Data Quality** | Clean before loading | Store raw, clean as needed |
| **Flexibility** | Fixed transformations | Schema-on-read flexibility |
| **Latency Requirements** | Typically batch-oriented | Micro-batch or near real-time possible |
| **Compliance/Security** | Sensitive data can be masked before loading | Raw sensitive data lands in the target, so access controls or in-target masking are needed |

### Choose ETL When:
- Working with legacy on-premises data warehouses
- Data needs to be cleaned/masked before entering target system
- Complex transformations require specialized tools (Informatica, Talend)
- Strict data governance requires validated data only

### Choose ELT When:
- Using modern cloud data platforms with MPP (Massively Parallel Processing)
- Data volumes are large and growing rapidly
- Historical data may need to be re-transformed as requirements change
- Transformations can be expressed in SQL (dbt, Dataform)

---
## Python ETL Example: Basic Pipeline

In [None]:
import pandas as pd
from datetime import datetime
from typing import Dict, List

# Simulated source data (Extract phase would typically read from DB/API/Files)
raw_sales_data = [
    {"order_id": "ORD001", "product": "Widget A", "quantity": 10, "price": "$25.50", "date": "2025-01-15"},
    {"order_id": "ORD002", "product": "Widget B", "quantity": 5, "price": "$45.00", "date": "2025-01-16"},
    {"order_id": "ORD003", "product": "Widget A", "quantity": 8, "price": "$25.50", "date": "2025-01-16"},
    {"order_id": "ORD004", "product": "Widget C", "quantity": 3, "price": "invalid", "date": "2025-01-17"},
    {"order_id": "ORD005", "product": "Widget B", "quantity": -2, "price": "$45.00", "date": "2025-01-17"},
]

print("Raw Sales Data:")
pd.DataFrame(raw_sales_data)

In [None]:
class ETLPipeline:
    """Simple ETL pipeline demonstrating Extract, Transform, Load phases."""
    
    def __init__(self):
        self.extracted_data = []
        self.transformed_data = []
        self.errors = []
    
    # ===============================
    # EXTRACT PHASE
    # ===============================
    def extract(self, source_data: List[Dict]) -> 'ETLPipeline':
        """Extract data from source (simulated here with in-memory data)."""
        print(f"[EXTRACT] Reading {len(source_data)} records from source...")
        self.extracted_data = source_data.copy()
        return self
    
    # ===============================
    # TRANSFORM PHASE
    # ===============================
    def transform(self) -> 'ETLPipeline':
        """Apply transformations: clean, validate, enrich data."""
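        print("[TRANSFORM] Applying transformations...")
        # Reset state so that re-running transform() does not duplicate records
        self.transformed_data = []
        self.errors = []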
        
        for record in self.extracted_data:
            try:
                transformed = self._transform_record(record)
                if transformed:
                    self.transformed_data.append(transformed)
            except Exception as e:
                self.errors.append({"record": record, "error": str(e)})
        
        print(f"[TRANSFORM] Successfully transformed {len(self.transformed_data)} records")
        print(f"[TRANSFORM] Rejected {len(self.errors)} records with errors")
        return self
    
    def _transform_record(self, record: Dict) -> Dict:
        """Transform individual record."""
        # Clean price: remove $ and convert to float
        price_str = record["price"].replace("$", "").strip()
        price = float(price_str)  # Will raise ValueError for invalid prices
        
        # Validate quantity
        quantity = int(record["quantity"])
        if quantity <= 0:
            raise ValueError(f"Invalid quantity: {quantity}")
        
        # Calculate derived fields
        total = price * quantity
        
        # Parse and standardize date
        parsed_date = datetime.strptime(record["date"], "%Y-%m-%d")
        
        return {
            "order_id": record["order_id"],
            "product": record["product"].upper(),  # Standardize to uppercase
            "quantity": quantity,
            "unit_price": price,
            "total_amount": round(total, 2),
            "order_date": parsed_date.date(),
            "order_month": parsed_date.strftime("%Y-%m"),
            "processed_at": datetime.now().isoformat()
        }
    
    # ===============================
    # LOAD PHASE
    # ===============================
    def load(self) -> pd.DataFrame:
        """Load transformed data to target (returning DataFrame as simulation)."""
        print(f"[LOAD] Loading {len(self.transformed_data)} records to target...")
        df = pd.DataFrame(self.transformed_data)
        print("[LOAD] Complete!")
        return df


# Run the ETL pipeline
pipeline = ETLPipeline()
result_df = (
    pipeline
    .extract(raw_sales_data)
    .transform()
    .load()
)

print("\n=== Transformed Data ===")
result_df

In [None]:
# Show rejected records
print("=== Rejected Records ===")
for error in pipeline.errors:
    print(f"  Order {error['record']['order_id']}: {error['error']}")
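
The `load()` method above only returns a DataFrame as a stand-in for a target system. As a rough sketch of what a real load step could look like, the snippet below writes a few already-transformed records into an in-memory SQLite table. SQLite, the `sales_clean` table name, the `load_to_sqlite` helper, and the sample rows are all assumptions chosen for illustration; a production warehouse would normally use its own bulk-load mechanism.

```python
import sqlite3
from typing import Dict, List

# Hypothetical transformed records, shaped like the ETL pipeline's output
transformed_rows: List[Dict] = [
    {"order_id": "ORD001", "product": "WIDGET A", "quantity": 10,
     "unit_price": 25.5, "total_amount": 255.0, "order_date": "2025-01-15"},
    {"order_id": "ORD002", "product": "WIDGET B", "quantity": 5,
     "unit_price": 45.0, "total_amount": 225.0, "order_date": "2025-01-16"},
]


def load_to_sqlite(conn: sqlite3.Connection, rows: List[Dict]) -> int:
    """Upsert rows into sales_clean and return the table's row count."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sales_clean (
            order_id     TEXT PRIMARY KEY,
            product      TEXT NOT NULL,
            quantity     INTEGER NOT NULL,
            unit_price   REAL NOT NULL,
            total_amount REAL NOT NULL,
            order_date   TEXT NOT NULL
        )
    """)
    # INSERT OR REPLACE keyed on order_id keeps re-runs from duplicating rows
    conn.executemany("""
        INSERT OR REPLACE INTO sales_clean
        VALUES (:order_id, :product, :quantity, :unit_price, :total_amount, :order_date)
    """, rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM sales_clean").fetchone()[0]


conn = sqlite3.connect(":memory:")
load_to_sqlite(conn, transformed_rows)
row_count = load_to_sqlite(conn, transformed_rows)  # second run replaces, not appends
print(f"[LOAD] sales_clean now holds {row_count} rows")
```

Keying the table on `order_id` is one way to make the load safe to re-run after a partial failure; whether that key is appropriate depends on the source system.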

---
## Python ELT Example: Load First, Transform Later

In [None]:
class ELTPipeline:
    """Simple ELT pipeline: Load raw data first, transform in-place."""
    
    def __init__(self):
        self._source_data: List[Dict] = []  # Set by extract()
        self.raw_table = None  # Simulates raw data zone
        self.transformed_view = None  # Simulates transformed view/table
    
    # ===============================
    # EXTRACT PHASE
    # ===============================
    def extract(self, source_data: List[Dict]) -> 'ELTPipeline':
        """Extract from source (minimal processing)."""
        print(f"[EXTRACT] Reading {len(source_data)} records...")
        self._source_data = source_data
        return self
    
    # ===============================
    # LOAD PHASE (happens BEFORE transform in ELT)
    # ===============================
    def load_raw(self) -> 'ELTPipeline':
        """Load raw data directly to target (no transformation)."""
        print("[LOAD] Loading raw data to raw zone...")
        
        # Add metadata columns (common in ELT)
        raw_with_metadata = []
        for record in self._source_data:
            enriched = record.copy()
            enriched["_loaded_at"] = datetime.now().isoformat()
            enriched["_source"] = "sales_api"
            enriched["_raw_record"] = str(record)  # Keep original for debugging
            raw_with_metadata.append(enriched)
        
        self.raw_table = pd.DataFrame(raw_with_metadata)
        print(f"[LOAD] Loaded {len(self.raw_table)} records to raw zone")
        return self
    
    # ===============================
    # TRANSFORM PHASE (happens in target system)
    # ===============================
    def transform_in_place(self) -> pd.DataFrame:
        """Transform data using target system's compute (SQL-like operations)."""
        print("[TRANSFORM] Transforming data in target system...")
        
        # Simulate SQL-based transformations that would run in the data warehouse
        df = self.raw_table.copy()
        
        # Clean price column
        df["unit_price"] = df["price"].str.replace("$", "", regex=False)
        df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
        
        # Convert quantity to numeric (invalid values become NaN; nothing is dropped)
        df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
        
        # Calculate total (NaN if price or quantity could not be parsed)
        df["total_amount"] = df["unit_price"] * df["quantity"]
        
        # Parse date (unparseable values become NaT)
        df["order_date"] = pd.to_datetime(df["date"], errors="coerce")
        
        # Add validation flag instead of rejecting records
        df["is_valid"] = (
            df["unit_price"].notna()
            & (df["quantity"] > 0)
            & df["order_date"].notna()
        )
        
        self.transformed_view = df
        print("[TRANSFORM] Transformation complete (all records preserved with validity flag)")
        return df


# Run the ELT pipeline
elt_pipeline = ELTPipeline()
elt_result = (
    elt_pipeline
    .extract(raw_sales_data)
    .load_raw()  # Load BEFORE transform
    .transform_in_place()
)

print("\n=== ELT Result (all records with validity flag) ===")
elt_result[["order_id", "product", "quantity", "unit_price", "total_amount", "is_valid"]]
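
The `transform_in_place` method above uses pandas to imitate work that an ELT pipeline would normally push down to the target as SQL. The sketch below expresses a similar transformation as a SQL view, with an in-memory SQLite database standing in for the warehouse. The `raw_sales` and `sales_clean` names and the sample rows are assumptions for illustration, and the `GLOB` check is specific to SQLite, which casts non-numeric text to `0.0` instead of raising an error.

```python
import sqlite3

# Raw rows as they might land in the raw zone (every field kept as text)
raw_rows = [
    ("ORD001", "Widget A", "10", "$25.50", "2025-01-15"),
    ("ORD004", "Widget C", "3", "invalid", "2025-01-17"),
    ("ORD005", "Widget B", "-2", "$45.00", "2025-01-17"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_sales "
    "(order_id TEXT, product TEXT, quantity TEXT, price TEXT, date TEXT)"
)
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?, ?, ?)", raw_rows)

# The transformation lives in the target as a view over the untouched raw table
conn.execute("""
    CREATE VIEW sales_clean AS
    WITH typed AS (
        SELECT
            order_id,
            UPPER(product) AS product,
            CAST(quantity AS INTEGER) AS quantity,
            -- SQLite casts non-numeric text to 0.0, so reject it explicitly
            CASE
                WHEN REPLACE(price, '$', '') GLOB '*[^0-9.]*' THEN NULL
                ELSE CAST(REPLACE(price, '$', '') AS REAL)
            END AS unit_price,
            date AS order_date
        FROM raw_sales
    )
    SELECT
        typed.*,
        unit_price * quantity AS total_amount,
        (unit_price IS NOT NULL AND quantity > 0) AS is_valid
    FROM typed
""")

results = conn.execute(
    "SELECT order_id, unit_price, total_amount, is_valid "
    "FROM sales_clean ORDER BY order_id"
).fetchall()
for row in results:
    print(row)
```

Because the view reads from `raw_sales` without modifying it, the cleaning rules can be changed later and re-applied to all historical rows by redefining the view.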

---
## ETL vs ELT: Comparison Visualization

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Comparison data: illustrative, subjective scores on a 1-5 scale (not measured benchmarks)
categories = [
    "Data Volume Handling",
    "Transformation Flexibility",
    "Real-time Capability",
    "Data Lineage Tracking",
    "Cost Efficiency (Cloud)",
    "Legacy System Support"
]

etl_scores = [3, 3, 2, 4, 2, 5]
elt_scores = [5, 5, 4, 3, 4, 2]

# Create radar chart
fig = go.Figure()

fig.add_trace(go.Scatterpolar(
    r=etl_scores + [etl_scores[0]],  # Close the polygon
    theta=categories + [categories[0]],
    fill='toself',
    name='ETL',
    line_color='#636EFA',
    fillcolor='rgba(99, 110, 250, 0.3)'
))

fig.add_trace(go.Scatterpolar(
    r=elt_scores + [elt_scores[0]],
    theta=categories + [categories[0]],
    fill='toself',
    name='ELT',
    line_color='#EF553B',
    fillcolor='rgba(239, 85, 59, 0.3)'
))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 5],
            tickvals=[1, 2, 3, 4, 5],
            ticktext=['1-Poor', '2', '3-Average', '4', '5-Excellent']
        )
    ),
    title=dict(
        text="ETL vs ELT: Capability Comparison",
        x=0.5,
        font=dict(size=18)
    ),
    showlegend=True,
    legend=dict(x=0.85, y=0.95),
    height=500
)

fig.show()

In [None]:
# Use case suitability comparison (illustrative scores, not measured results)
use_cases = [
    "Traditional DW Migration",
    "Cloud Data Lake",
    "Real-time Analytics",
    "Data Quality First",
    "Schema Evolution",
    "Big Data Processing"
]

etl_fit = [90, 40, 30, 85, 35, 30]
elt_fit = [30, 95, 75, 50, 85, 90]

fig2 = go.Figure()

fig2.add_trace(go.Bar(
    name='ETL',
    x=use_cases,
    y=etl_fit,
    marker_color='#636EFA',
    text=etl_fit,
    textposition='outside'
))

fig2.add_trace(go.Bar(
    name='ELT',
    x=use_cases,
    y=elt_fit,
    marker_color='#EF553B',
    text=elt_fit,
    textposition='outside'
))

fig2.update_layout(
    title=dict(
        text="ETL vs ELT: Use Case Suitability (%)",
        x=0.5,
        font=dict(size=18)
    ),
    barmode='group',
    yaxis_title="Suitability Score (%)",
    yaxis=dict(range=[0, 110]),
    xaxis_tickangle=-30,
    height=450,
    legend=dict(x=0.85, y=0.95)
)

fig2.show()

---
## Modern Hybrid Approach: ETLT

Many modern data platforms use a **hybrid ETLT** pattern:

```
┌──────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│  Source  │───▶│  Light ETL  │───▶│   Raw Zone   │───▶│  Transform  │
│  Systems │    │  (Minimal)  │    │  (Data Lake) │    │  (in DW)    │
└──────────┘    └─────────────┘    └──────────────┘    └─────────────┘
   Extract      Basic cleaning      Load raw +         Heavy transforms
                (PII masking)       metadata           (dbt, SQL)
```

### Benefits of ETLT:
- **Pre-load**: Apply essential transformations (PII masking, schema validation)
- **Post-load**: Leverage cloud DW compute for heavy transformations
- **Flexibility**: Keep raw data for re-processing while maintaining data quality
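
The pre-load step described above can be sketched as a small masking function that runs before records reach the raw zone. This is a minimal illustration under stated assumptions: the `customer_email` field, the `MASKING_KEY` value, and the `mask_value` and `mask_pii` helpers are hypothetical and do not appear in the sample data used earlier. Keyed hashing is only one possible masking technique, and it pseudonymizes values without fully anonymizing them.

```python
import hashlib
import hmac
from typing import Dict, Iterable

# Hypothetical secret; in practice this would come from a secrets manager
MASKING_KEY = b"example-key-not-for-production"


def mask_value(value: str, key: bytes = MASKING_KEY) -> str:
    """Return a keyed SHA-256 digest so the raw value never reaches the target."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_pii(record: Dict, pii_fields: Iterable[str]) -> Dict:
    """Copy the record, replacing the listed PII fields with keyed hashes."""
    masked = dict(record)
    for field in pii_fields:
        if masked.get(field) is not None:
            masked[field] = mask_value(str(masked[field]))
    return masked


# Hypothetical source record with an email column
source_record = {
    "order_id": "ORD001",
    "customer_email": "jane@example.com",
    "quantity": 10,
}
masked_record = mask_pii(source_record, pii_fields=["customer_email"])
print(masked_record)
```

The same input always produces the same digest under a given key, so masked values can still be used as join keys in the post-load transformations.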

---
## Key Takeaways

| Aspect | ETL | ELT |
|--------|-----|-----|
| **Order** | Extract → Transform → Load | Extract → Load → Transform |
| **Best For** | On-prem, regulated data, smaller volumes | Cloud DW, big data, flexible schemas |
| **Transform Engine** | External (ETL server) | Target system (DW compute) |
| **Data Quality** | Validated before load | Raw preserved, validated later |
| **Typical Tools** | Informatica, Talend, SSIS | dbt, Dataform, Spark SQL |
| **Scalability** | Limited by ETL server | Scales with cloud resources |

### Decision Framework:

1. **Start with ELT** if using modern cloud data platforms (Snowflake, BigQuery, Databricks)
2. **Use ETL** when data must be cleaned/masked before entering target system
3. **Consider hybrid ETLT** for enterprise environments with mixed requirements
4. **Prioritize data quality** regardless of pattern: bad data in, bad insights out
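
The first three rules of the framework above can be written down as a small helper. This is a simplified sketch: the `recommend_pattern` function and its two boolean inputs are assumptions made for illustration, and a real decision would weigh more factors (volume, cost, team skills, governance).

```python
def recommend_pattern(cloud_mpp_target: bool, must_clean_before_load: bool) -> str:
    """Map two coarse requirements to ETL, ELT, or the hybrid ETLT pattern."""
    if cloud_mpp_target and must_clean_before_load:
        # Mixed requirements: mask or validate first, transform heavily in the DW
        return "ETLT"
    if must_clean_before_load:
        return "ETL"
    if cloud_mpp_target:
        return "ELT"
    # Legacy or on-premises target without MPP compute
    return "ETL"


for cloud, clean_first in [(True, False), (False, True), (True, True), (False, False)]:
    pattern = recommend_pattern(cloud, clean_first)
    print(f"cloud_mpp_target={cloud}, must_clean_before_load={clean_first} -> {pattern}")
```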

### Next Steps:
- Explore **dbt** for SQL-based transformations in ELT pipelines
- Learn **Apache Airflow** for orchestrating complex data pipelines
- Study **streaming ETL** with Kafka, Spark Streaming for real-time needs