# Cloud Data Engineering — Overview

## Purpose

This notebook provides an overview of cloud-based data engineering services across the three major cloud providers: AWS, Azure, and GCP. Understanding these services is essential for designing scalable, cost-effective, and resilient data pipelines in modern enterprise environments.

## Key Questions

1. What are the core data engineering services offered by each major cloud provider?
2. How do equivalent services compare across AWS, Azure, and GCP?
3. What factors should guide cloud service selection for data workloads?
4. How do you design for multi-cloud or hybrid-cloud data architectures?
5. What are the cost, performance, and operational trade-offs between providers?

---

## AWS Data Services

Amazon Web Services offers a comprehensive suite of data engineering services:

### AWS Glue

**Serverless ETL service** for data preparation and transformation.

| Feature | Description |
|---------|-------------|
| **Glue Data Catalog** | Centralized metadata repository (Hive-compatible) |
| **Glue Crawlers** | Automatic schema discovery and cataloging |
| **Glue ETL Jobs** | PySpark/Scala-based serverless transformations |
| **Glue Studio** | Visual ETL job authoring interface |
| **Glue DataBrew** | No-code data preparation for analysts |

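```python
# Example: simple Glue ETL job structure (runs inside the AWS Glue job environment)
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# JOB_NAME is passed in by the Glue job runner
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_data",
)

# Transform: each mapping is (source_column, source_type, target_column, target_type)
transformed = ApplyMapping.apply(frame=datasource, mappings=[...])

# Write: supply connection_type, connection_options, and format for your target
glueContext.write_dynamic_frame.from_options(transformed, ...)

job.commit()
```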

### Amazon EMR (Elastic MapReduce)

**Managed big data platform** for running Apache Spark, Hive, Presto, and other frameworks.

| Deployment Option | Use Case |
|-------------------|----------|
| **EMR on EC2** | Full control, persistent clusters |
| **EMR on EKS** | Kubernetes-native Spark workloads |
| **EMR Serverless** | Auto-scaling, pay-per-use compute |

### Amazon Kinesis

**Real-time streaming data platform** with multiple components:

- **Kinesis Data Streams**: Low-latency data ingestion (sharded)
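- **Kinesis Data Firehose** (now Amazon Data Firehose): Fully managed delivery to S3, Redshift, OpenSearch
- **Kinesis Data Analytics** (now Amazon Managed Service for Apache Flink): Stream processing with Apache Flink; the older SQL-based applications are a legacy option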
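
The "sharded" part of Kinesis Data Streams is driven by the partition key: the service takes an MD5 hash of the key and routes the record to the shard whose hash key range contains that 128-bit value. The sketch below mimics that routing locally. It assumes the hash space is split evenly across shards, which holds for a freshly created stream but not necessarily after resharding. The function name `shard_for_key` is hypothetical and not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128-bit integers


def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard that would own this partition key.

    Assumption: the hash space is divided into equal, contiguous ranges.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, byteorder="big")
    return hash_value * num_shards // HASH_SPACE


# The same key always maps to the same shard, which preserves per-key ordering.
for key in ["device-1", "device-2", "device-1"]:
    print(key, "->", shard_for_key(key, num_shards=4))
```

One practical consequence: a low-cardinality or skewed partition key concentrates traffic on a few shards, so high-cardinality keys tend to spread load more evenly.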

### Amazon S3 (Simple Storage Service)

**Object storage** — the foundation of AWS data lakes.

| Storage Class | Use Case | Retrieval |
|---------------|----------|----------|
| S3 Standard | Frequently accessed data | Immediate |
| S3 Intelligent-Tiering | Unknown access patterns | Immediate |
| S3 Glacier Instant Retrieval | Archive with instant access | Immediate |
| S3 Glacier Deep Archive | Long-term archive | Within 12 hours (standard) or 48 hours (bulk) |
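
Moving objects between these classes is usually automated with a bucket lifecycle configuration rather than done by hand. The sketch below builds such a configuration as a plain dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The helper name `build_lifecycle_rule`, the prefix, the bucket name, and the day thresholds are all illustrative assumptions.

```python
import json


def build_lifecycle_rule(prefix: str, instant_after_days: int, deep_after_days: int) -> dict:
    """Build a lifecycle configuration that tiers objects under `prefix` to colder classes."""
    if not 0 <= instant_after_days < deep_after_days:
        raise ValueError("transitions must move to colder tiers over time")
    return {
        "Rules": [
            {
                "ID": f"tier-down-{prefix.strip('/').replace('/', '-')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": instant_after_days, "StorageClass": "GLACIER_IR"},
                    {"Days": deep_after_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


# Hypothetical thresholds: 90 days to Glacier Instant Retrieval, 365 to Deep Archive
config = build_lifecycle_rule("raw/events/", instant_after_days=90, deep_after_days=365)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, this could be applied as:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=config)
```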

---

## Azure Data Services

Microsoft Azure provides tightly integrated data engineering services:

### Azure Data Factory (ADF)

**Cloud-scale ETL/ELT orchestration service** with 90+ native connectors.

| Component | Description |
|-----------|-------------|
| **Pipelines** | Orchestration workflows with activities |
| **Data Flows** | Spark-based visual transformations |
| **Integration Runtime** | Compute for data movement (Azure, self-hosted, SSIS) |
| **Triggers** | Schedule, tumbling window, or event-based execution |

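Example of an ADF pipeline definition structure. A real Copy activity also needs `typeProperties` describing its source and sink, which are omitted here:

```json
{
  "name": "IngestPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToLake",
        "type": "Copy",
        "inputs": [{"referenceName": "BlobSource", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "LakeSink", "type": "DatasetReference"}]
      }
    ]
  }
}
```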
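
The tumbling window trigger listed in the component table fires once per fixed-size, contiguous, non-overlapping time interval, and each run receives its own window start and end. The sketch below reproduces that slicing logic in plain Python to make the semantics concrete. It is an illustration of the concept, not ADF's implementation, and the function name `tumbling_windows` is hypothetical.

```python
from datetime import datetime, timedelta


def tumbling_windows(start: datetime, end: datetime, size: timedelta):
    """Yield each complete (window_start, window_end) interval between start and end."""
    if size <= timedelta(0):
        raise ValueError("window size must be positive")
    window_start = start
    while window_start + size <= end:
        yield window_start, window_start + size
        window_start += size


# Six hours of data sliced into two-hour windows
windows = list(
    tumbling_windows(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 6), timedelta(hours=2))
)
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Because every window is identified by its boundaries, a failed window can be rerun on its own without touching its neighbours, which is the main reason to prefer this trigger type for backfills.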

### Azure Databricks

**Unified analytics platform** built on Apache Spark with collaborative notebooks.

- **Delta Lake**: ACID transactions on data lakes
- **Unity Catalog**: Unified governance across workspaces
- **Photon Engine**: Vectorized query execution (commonly cited as 3-8x faster; actual gains depend on the workload)
- **MLflow Integration**: End-to-end ML lifecycle management

### Azure Synapse Analytics

**Unified analytics service** combining data warehousing and big data.

| Component | Description |
|-----------|-------------|
| **Dedicated SQL Pool** | MPP data warehouse (formerly SQL DW) |
| **Serverless SQL Pool** | Query data lake files directly |
| **Spark Pools** | Managed Apache Spark clusters |
| **Synapse Pipelines** | ADF-compatible orchestration |
| **Synapse Link** | Real-time analytics on operational data |

---

## GCP Data Services

Google Cloud Platform leverages Google's expertise in large-scale data processing:

### Cloud Dataflow

**Serverless stream and batch processing** based on Apache Beam.

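```python
# Example: simple streaming Dataflow pipeline (Pub/Sub -> BigQuery)
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def process_event(message: bytes) -> dict:
    """Decode a Pub/Sub message into a row dict (assumes a JSON payload)."""
    return json.loads(message.decode("utf-8"))


options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/temp',
    '--streaming',  # required because Pub/Sub is an unbounded source
])

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
     | 'Transform' >> beam.Map(process_event)
     # Assumes the table already exists with a schema matching the row dicts
     | 'Write' >> beam.io.WriteToBigQuery(
         'dataset.table',
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```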

**Key Features:**
- Unified batch and streaming model
- Auto-scaling and dynamic work rebalancing
- Exactly-once processing semantics

### Cloud Dataproc

**Managed Spark/Hadoop service** with fast cluster provisioning (~90 seconds).

| Feature | Benefit |
|---------|--------|
| **Preemptible VMs** | Up to 80% cost savings |
| **Component Gateway** | Easy access to Spark UI, Jupyter |
| **Autoscaling** | Scale workers based on YARN metrics |
| **Dataproc Serverless** | No cluster management required |

### Cloud Pub/Sub

**Serverless messaging service** for event-driven architectures.

- **At-least-once delivery** by default; exactly-once processing is achievable downstream (for example, with Dataflow)
- **Push and pull subscriptions**
- **Message retention**: Up to 31 days
- **Dead-letter topics** for failed message handling
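
At-least-once delivery means a subscriber can see the same message more than once, so handlers are usually written to be idempotent. The sketch below is a toy, in-memory model of that pattern: it deduplicates on message ID and sets aside messages that keep failing. In Pub/Sub itself the service forwards a message to the dead-letter topic after the configured maximum delivery attempts; here both roles are folded into one hypothetical class, `IdempotentConsumer`, purely for illustration.

```python
from collections import defaultdict


class IdempotentConsumer:
    """Toy consumer: deduplicates by message ID and dead-letters poison messages."""

    def __init__(self, handler, max_attempts: int = 5):
        self.handler = handler
        self.max_attempts = max_attempts
        self.processed_ids = set()
        self.attempts = defaultdict(int)
        self.dead_letters = []

    def receive(self, message_id: str, payload: str) -> str:
        """Return 'ack', 'nack', or 'dead-letter' for one delivery attempt."""
        if message_id in self.processed_ids:
            return "ack"  # duplicate delivery: acknowledge without reprocessing
        self.attempts[message_id] += 1
        try:
            self.handler(payload)
        except Exception:
            if self.attempts[message_id] >= self.max_attempts:
                self.dead_letters.append((message_id, payload))
                return "dead-letter"
            return "nack"  # request redelivery
        self.processed_ids.add(message_id)
        return "ack"


seen = []


def handle(payload: str) -> None:
    if payload == "bad":
        raise ValueError("cannot parse payload")
    seen.append(payload)


consumer = IdempotentConsumer(handle, max_attempts=2)
results = [
    consumer.receive("m1", "hello"),  # first delivery
    consumer.receive("m1", "hello"),  # redelivery of the same message
    consumer.receive("m2", "bad"),    # fails, retried
    consumer.receive("m2", "bad"),    # fails again, dead-lettered
]
print(results, seen, consumer.dead_letters)
```

A production version would keep the processed-ID set in a durable store with a retention window, since an in-memory set is lost on restart.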

### Google Cloud Storage (GCS)

**Object storage** with strong consistency and global edge caching.

| Storage Class | Min Duration | Use Case |
|---------------|--------------|----------|
| Standard | None | Frequently accessed |
| Nearline | 30 days | Monthly access |
| Coldline | 90 days | Quarterly access |
| Archive | 365 days | Yearly access |
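
The minimum duration column matters for cost: deleting or rewriting an object before the minimum has elapsed is billed as if it had been stored for the full minimum. A minimal sketch of that rule, using the durations from the table above (the function name `billable_storage_days` is hypothetical, and per-GB prices are deliberately left out):

```python
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def billable_storage_days(storage_class: str, days_stored: int) -> int:
    """Days of storage billed for an object deleted after `days_stored` days."""
    if days_stored < 0:
        raise ValueError("days_stored cannot be negative")
    return max(days_stored, MIN_STORAGE_DAYS[storage_class.upper()])


for storage_class, days in [("standard", 10), ("coldline", 10), ("archive", 400)]:
    print(storage_class, days, "->", billable_storage_days(storage_class, days))
```

This is why short-lived staging or temp data generally belongs in Standard even when it is rarely read.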

---

## Service Comparison Matrix

| Capability | AWS | Azure | GCP |
|------------|-----|-------|-----|
| **ETL/Orchestration** | Glue, Step Functions | Data Factory, Synapse Pipelines | Dataflow, Cloud Composer |
| **Managed Spark** | EMR, Glue | Databricks, Synapse Spark | Dataproc, Dataproc Serverless |
| **Streaming** | Kinesis | Event Hubs, Stream Analytics | Pub/Sub, Dataflow |
| **Object Storage** | S3 | Blob Storage, ADLS Gen2 | Cloud Storage |
| **Data Warehouse** | Redshift | Synapse Dedicated Pool | BigQuery |
| **Data Catalog** | Glue Data Catalog | Microsoft Purview | Data Catalog (being superseded by Dataplex) |
| **Serverless Query** | Athena | Synapse Serverless | BigQuery |

---

## Cost Optimization & FinOps

**Common cost drivers:**
- Storage growth (hot vs cold tiers)
- Data egress between regions/providers
- Over-provisioned clusters or long-running jobs

**Practices:**
- Use **serverless** where possible for spiky workloads.
- Apply **auto-scaling** and **spot/preemptible** instances.
- Optimize file formats (Parquet/ORC), partitioning, and compaction.
- Set **budgets/alerts** and chargeback tags.

**Decision hint:** Start with cost visibility (tags, budgets), then optimize the biggest line items first.
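
As a rough illustration of "visibility first, biggest line items first", the sketch below groups billing records by a dimension and ranks the totals. The records, field names, and dollar amounts are invented for the example; real input would come from each provider's billing export.

```python
from collections import defaultdict

# Hypothetical billing export rows (one row per service per team)
billing_rows = [
    {"service": "storage", "team": "analytics", "usd": 1200.0},
    {"service": "compute", "team": "analytics", "usd": 5400.0},
    {"service": "egress", "team": "ingest", "usd": 900.0},
    {"service": "compute", "team": "ingest", "usd": 2100.0},
    {"service": "compute", "team": None, "usd": 700.0},  # untagged spend
]


def spend_by(rows, key):
    """Total spend per value of `key`, largest first; missing tags count as 'untagged'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row.get(key) or "untagged"] += row["usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_service = spend_by(billing_rows, "service")
by_team = spend_by(billing_rows, "team")
print("By service:", by_service)
print("By team:", by_team)
```

Tracking the "untagged" bucket explicitly is a simple way to measure how complete the tagging policy is before trusting any chargeback numbers.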

---

## Security, Governance, and Observability

**Security & IAM**
- Enforce **least privilege** with scoped roles and service accounts.
- Use **customer-managed keys** (KMS) for encryption at rest.
- Centralize secrets (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager).

**Governance**
- Data catalogs + lineage (Glue Data Catalog, Purview, Data Catalog).
- Policy enforcement (row/column-level security, masking).
- Data classification and retention policies.

**Observability**
- Unified logging/metrics/tracing (CloudWatch, Azure Monitor, Cloud Logging/Cloud Monitoring).
- Data pipeline SLIs: freshness, completeness, failure rate, cost per TB.
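
The four SLIs named above can be derived from basic run metadata. The following sketch shows one possible set of definitions; the `PipelineRun` fields, the formulas, and the sample values are assumptions for illustration, and teams often define freshness and completeness differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PipelineRun:
    finished_at: datetime
    succeeded: bool
    rows_expected: int
    rows_loaded: int
    bytes_processed: int
    cost_usd: float


def compute_slis(runs, now):
    """Summarize freshness, completeness, failure rate, and cost per TB for a set of runs."""
    ok = [r for r in runs if r.succeeded]
    last_success = max((r.finished_at for r in ok), default=None)
    expected = sum(r.rows_expected for r in ok)
    total_tb = sum(r.bytes_processed for r in runs) / 1e12
    return {
        # Minutes since the last successful run finished
        "freshness_minutes": None if last_success is None
        else (now - last_success).total_seconds() / 60,
        # Share of expected rows actually loaded by successful runs
        "completeness": None if expected == 0
        else sum(r.rows_loaded for r in ok) / expected,
        "failure_rate": None if not runs else (len(runs) - len(ok)) / len(runs),
        "cost_per_tb_usd": None if total_tb == 0
        else sum(r.cost_usd for r in runs) / total_tb,
    }


# Hypothetical sample: one successful run and one failed run
utc = timezone.utc
runs = [
    PipelineRun(datetime(2024, 1, 1, 10, tzinfo=utc), True, 1000, 990, 2 * 10**12, 10.0),
    PipelineRun(datetime(2024, 1, 1, 11, tzinfo=utc), False, 1000, 0, 10**12, 5.0),
]
slis = compute_slis(runs, now=datetime(2024, 1, 1, 12, tzinfo=utc))
print(slis)
```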

---

## Multi-Cloud Considerations

### Why Multi-Cloud?

| Driver | Example |
|--------|--------|
| **Best-of-breed** | BigQuery for analytics + AWS for ML |
| **Vendor lock-in avoidance** | Portability requirements |
| **Regulatory/compliance** | Data residency constraints |
| **M&A integration** | Inherited cloud environments |
| **Resilience** | Cross-cloud disaster recovery |

### Multi-Cloud Data Patterns

```
┌─────────────────────────────────────────────────────────────┐
│                    Multi-Cloud Data Mesh                    │
├─────────────────┬─────────────────┬─────────────────────────┤
│      AWS        │     Azure       │         GCP             │
│  ┌───────────┐  │  ┌───────────┐  │  ┌───────────────────┐  │
│  │ S3 Lake   │  │  │ ADLS Gen2 │  │  │ BigQuery + GCS    │  │
│  └─────┬─────┘  │  └─────┬─────┘  │  └─────────┬─────────┘  │
│        │        │        │        │            │            │
│        └────────┴────────┴────────┴────────────┘            │
│                         │                                   │
│              ┌──────────▼───────────┐                       │
│              │  Data Virtualization │                       │
│              │  (Starburst/Dremio)  │                       │
│              └──────────────────────┘                       │
└─────────────────────────────────────────────────────────────┘
```

### Key Challenges

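1. **Data Movement Costs**: Egress fees between clouds (roughly $0.08-0.12/GB at list prices; varies by provider, region, and volume)
2. **Latency**: Cross-cloud network latency (often 50-150 ms, depending on region pairing and routing)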
3. **Consistency**: Eventual consistency across distributed systems
4. **Security**: Unified IAM and encryption key management
5. **Observability**: Centralized monitoring across providers
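
To make the data movement cost concrete, here is a back-of-the-envelope estimate using the per-GB range quoted above. The daily volume, the 30-day month, and the decimal TB-to-GB conversion are assumptions; actual rates depend on provider, region, and volume tier.

```python
def egress_cost_usd(gigabytes: float, rate_per_gb: float) -> float:
    """Egress cost in USD for a given volume at a flat per-GB rate."""
    return round(gigabytes * rate_per_gb, 2)


def monthly_egress_range(tb_per_day: float, days: int = 30,
                         low_rate: float = 0.08, high_rate: float = 0.12):
    """Return (low, high) monthly egress cost for a steady daily transfer volume."""
    gigabytes = tb_per_day * 1000 * days  # decimal units: 1 TB = 1000 GB
    return egress_cost_usd(gigabytes, low_rate), egress_cost_usd(gigabytes, high_rate)


# Hypothetical workload: replicating 2 TB per day to another cloud
low, high = monthly_egress_range(tb_per_day=2)
print(f"Estimated monthly egress: ${low:,.2f} to ${high:,.2f}")
```

Even a modest replication job lands in the thousands of dollars per month at these rates, which is why many designs move queries to the data instead of copying the data between clouds.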

### Multi-Cloud Tools & Strategies

| Approach | Tools |
|----------|-------|
| **Data Virtualization** | Starburst, Dremio, Denodo |
| **Portable Compute** | Apache Spark, Apache Beam, Kubernetes |
| **Open Formats** | Parquet, Avro, Delta Lake, Apache Iceberg |
| **Infrastructure as Code** | Terraform, Pulumi |
| **Unified Orchestration** | Apache Airflow, Dagster, Prefect |

---

## Takeaway

### Key Insights

1. **Each cloud has strengths**: AWS excels in breadth, Azure in enterprise integration, GCP in analytics and ML

2. **Serverless is the trend**: Glue, Dataflow, Synapse Serverless reduce operational overhead

3. **Open formats enable portability**: Parquet, Delta Lake, and Iceberg reduce vendor lock-in

4. **Multi-cloud requires investment**: Added complexity in networking, security, and operations

5. **Cost optimization is critical**: Reserved capacity, spot instances, and right-sizing impact TCO significantly

### Decision Framework

```
┌─────────────────────────────────────────────────────────┐
│              Cloud Selection Criteria                   │
├─────────────────────────────────────────────────────────┤
│  1. Existing ecosystem (enterprise agreements, skills)  │
│  2. Specific service capabilities required              │
│  3. Data residency and compliance requirements          │
│  4. Integration with source/target systems              │
│  5. Total cost of ownership (compute + egress + ops)    │
│  6. Future portability requirements                     │
└─────────────────────────────────────────────────────────┘
```
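
One way to apply the criteria above is a simple weighted scoring matrix. The sketch below is only an illustration: the weights, the 1-5 scores, and the option names are made up, and any real evaluation would need weights agreed with stakeholders.

```python
# Hypothetical weights for the six criteria (they sum to 1.0)
CRITERIA_WEIGHTS = {
    "existing_ecosystem": 0.25,
    "service_capabilities": 0.20,
    "compliance": 0.20,
    "integration": 0.15,
    "total_cost": 0.15,
    "portability": 0.05,
}


def weighted_score(scores: dict, weights: dict = CRITERIA_WEIGHTS) -> float:
    """Weighted sum of per-criterion scores (each on a 1-5 scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank_options(options: dict) -> list:
    """Return (name, score) pairs sorted from best to worst."""
    scored = [(name, round(weighted_score(scores), 3)) for name, scores in options.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented scores for two anonymous candidate platforms
options = {
    "option_a": {"existing_ecosystem": 5, "service_capabilities": 3, "compliance": 4,
                 "integration": 4, "total_cost": 3, "portability": 2},
    "option_b": {"existing_ecosystem": 2, "service_capabilities": 5, "compliance": 4,
                 "integration": 3, "total_cost": 4, "portability": 4},
}
ranking = rank_options(options)
print(ranking)
```

The value of the exercise is less the final number than the forced conversation about weights; a hard constraint such as data residency is better treated as a pass/fail filter before scoring.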

### Further Reading

- AWS Well-Architected Framework — Data Analytics Lens
- Azure Cloud Adoption Framework — Data Management
- Google Cloud Architecture Framework — Data Analytics
- Data Mesh: Delivering Data-Driven Value at Scale (Zhamak Dehghani)