# Azure Data Engineering Services

This notebook provides a comprehensive overview of Azure's data engineering services, covering storage, orchestration, processing, analytics, and streaming capabilities.

## Table of Contents
1. [Azure Data Lake Storage Gen2](#azure-data-lake-storage-gen2)
2. [Azure Data Factory](#azure-data-factory)
3. [Azure Databricks](#azure-databricks)
4. [Azure Synapse Analytics](#azure-synapse-analytics)
5. [Azure Event Hubs](#azure-event-hubs)
6. [Best Practices & Architecture Patterns](#best-practices--architecture-patterns)
7. [Takeaways](#takeaways)

---

## Azure Data Lake Storage Gen2

**Azure Data Lake Storage Gen2 (ADLS Gen2)** is a highly scalable and cost-effective data lake solution for big data analytics. It combines the capabilities of Azure Blob Storage with a hierarchical file system.

### Key Features

| Feature | Description |
|---------|-------------|
| **Hierarchical Namespace** | True directory structure for efficient file operations |
| **POSIX-like ACLs** | Fine-grained access control at file/folder level |
| **Azure Blob API Compatible** | Works with existing blob storage tools and SDKs |
| **Cost-Effective** | Hot, Cool, and Archive tiers for cost optimization |
| **Hadoop Compatible** | ABFS driver for seamless Hadoop ecosystem integration |

### Storage Hierarchy

```
Storage Account
└── Container (File System)
    └── Directory
        └── Sub-Directory
            └── Files (Blobs)
```
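
The hierarchy above maps directly onto the ABFS URI that Spark and Hadoop tools use to address a file. The helper below is a minimal sketch of that mapping; the function name `abfss_uri` is hypothetical, and the account and container names are placeholders.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    # Strip leading slashes so the path joins cleanly after the host name.
    clean_path = path.lstrip("/")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{clean_path}"


# Container -> directory -> sub-directory -> file
print(abfss_uri("raw-data", "your_storage_account", "sales/2024/transactions.parquet"))
```

The container (file system) appears before the `@`, the storage account forms the host name, and directories and files follow as an ordinary path.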

### Access Patterns

- **Storage Account Key**: Full access (avoid in production)
- **Shared Access Signature (SAS)**: Time-limited, scoped access
- **Microsoft Entra ID (formerly Azure AD) + RBAC**: Role-based access control
- **ACLs**: POSIX-style permissions on files/directories

In [None]:
# Example: Connecting to ADLS Gen2 with Azure SDK
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import DefaultAzureCredential

# Using Azure AD authentication (recommended)
account_name = "your_storage_account"
account_url = f"https://{account_name}.dfs.core.windows.net"

# DefaultAzureCredential supports multiple auth methods
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(account_url, credential=credential)

# Create a file system (container)
file_system_client = service_client.create_file_system(file_system="raw-data")

# Create a directory
directory_client = file_system_client.create_directory("sales/2024")

# Upload a file
file_client = directory_client.create_file("transactions.parquet")
with open("local_transactions.parquet", "rb") as data:
    file_client.upload_data(data, overwrite=True)

In [None]:
# Example: Reading from ADLS Gen2 with PySpark
from pyspark.sql import SparkSession

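# Service principal (OAuth client-credentials) settings for the ABFS driver.
# The <...> values are placeholders; load real values from a secret store.
spark = SparkSession.builder \
    .appName("ADLS Gen2 Example") \
    .config("spark.hadoop.fs.azure.account.auth.type", "OAuth") \
    .config("spark.hadoop.fs.azure.account.oauth.provider.type",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") \
    .config("spark.hadoop.fs.azure.account.oauth2.client.id", "<application-client-id>") \
    .config("spark.hadoop.fs.azure.account.oauth2.client.secret", "<client-secret>") \
    .config("spark.hadoop.fs.azure.account.oauth2.client.endpoint",
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token") \
    .getOrCreate()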

# ABFS URI format for ADLS Gen2
storage_account = "your_storage_account"
container = "raw-data"
path = "sales/2024/transactions.parquet"

# Read data using abfss:// protocol (secure)
df = spark.read.parquet(f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path}")
df.show()

---

## Azure Data Factory

**Azure Data Factory (ADF)** is a cloud-based ETL/ELT and data integration service for orchestrating data movement and transformation at scale.

### Core Components

| Component | Description |
|-----------|-------------|
| **Pipelines** | Logical grouping of activities that perform a unit of work |
| **Activities** | Processing steps within a pipeline (Copy, Transform, Control) |
| **Datasets** | Named views of data pointing to data stores |
| **Linked Services** | Connection strings to data stores and compute resources |
| **Triggers** | Units that determine when a pipeline should execute |
| **Integration Runtime** | Compute infrastructure for data movement and dispatch |
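
To show how these components reference one another, the sketch below writes a pipeline in the JSON shape that ADF uses for pipeline definitions, expressed as a Python dictionary. The pipeline, activity, and dataset names are illustrative, and `referenced_datasets` is a hypothetical helper, not part of any SDK.

```python
import json

# A pipeline groups activities; a Copy activity points at datasets by name,
# and each dataset in turn points at a linked service (not shown here).
pipeline_definition = {
    "name": "pl_copy_sales_data",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToADLS",
                "type": "Copy",
                "inputs": [{"referenceName": "ds_source_blob", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "ds_sales_parquet", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ]
    },
}


def referenced_datasets(pipeline: dict) -> list[str]:
    """Collect the dataset names that a pipeline's activities depend on."""
    names = set()
    for activity in pipeline["properties"]["activities"]:
        for ref in activity.get("inputs", []) + activity.get("outputs", []):
            names.add(ref["referenceName"])
    return sorted(names)


print(json.dumps(pipeline_definition, indent=2))
print(referenced_datasets(pipeline_definition))
```

Listing the referenced datasets is a simple way to check that everything a pipeline depends on has been deployed before the pipeline itself.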

### Integration Runtime Types

1. **Azure IR**: Managed compute in Azure regions
2. **Self-Hosted IR**: On-premises or private network connectivity
3. **Azure-SSIS IR**: Run SSIS packages in the cloud

### Activity Types

```
┌─────────────────────────────────────────────────────────┐
│                    ADF Activities                        │
├─────────────────┬─────────────────┬─────────────────────┤
│ Data Movement   │ Data Transform  │ Control Flow        │
├─────────────────┼─────────────────┼─────────────────────┤
│ Copy Activity   │ Data Flow       │ If Condition        │
│                 │ Databricks      │ ForEach / Until     │
│                 │ HDInsight       │ Switch              │
│                 │ Stored Procedure│ Wait / Web          │
│                 │ Azure Functions │ Execute Pipeline    │
└─────────────────┴─────────────────┴─────────────────────┘
```

In [None]:
# Example: Creating ADF Pipeline programmatically
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *

# Initialize client
credential = DefaultAzureCredential()
subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
factory_name = "your-data-factory"

adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create a Linked Service to ADLS Gen2
ls_adls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://yourstorageaccount.dfs.core.windows.net"
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "ls_adls_gen2", ls_adls
)

# Create a Dataset
ds_parquet = DatasetResource(
    properties=ParquetDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference",  # required by recent SDK versions
            reference_name="ls_adls_gen2"
        ),
        location=AzureBlobFSLocation(
            file_system="raw-data",
            folder_path="sales/2024"
        )
    )
)
adf_client.datasets.create_or_update(
    resource_group, factory_name, "ds_sales_parquet", ds_parquet
)

In [None]:
# Example: Create a Copy Pipeline
copy_activity = CopyActivity(
    name="CopyFromBlobToADLS",
    inputs=[DatasetReference(type="DatasetReference", reference_name="ds_source_blob")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="ds_sales_parquet")],
    source=BlobSource(),
    sink=ParquetSink()
)

# Create pipeline with activities
pipeline = PipelineResource(
    activities=[copy_activity],
    parameters={
        "inputPath": ParameterSpecification(type="String"),
        "outputPath": ParameterSpecification(type="String")
    }
)

adf_client.pipelines.create_or_update(
    resource_group, factory_name, "pl_copy_sales_data", pipeline
)

# Create a Schedule Trigger
trigger = TriggerResource(
    properties=ScheduleTrigger(
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="pl_copy_sales_data"
                ),
                parameters={"inputPath": "source/", "outputPath": "dest/"}
            )
        ],
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day",
            interval=1,
            start_time="2024-01-01T00:00:00Z",
            time_zone="UTC"
        )
    )
)

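adf_client.triggers.create_or_update(
    resource_group, factory_name, "tr_daily_sales_copy", trigger
)
# Note: a newly created trigger is in the Stopped state; it must be started
# (from the portal or the SDK's trigger start operation) before it fires.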

---

## Azure Databricks

**Azure Databricks** is a managed, Apache Spark-based analytics platform on Azure that supports collaborative data engineering, data science, and SQL analytics.

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Workspace** | Environment for accessing Databricks assets |
| **Clusters** | Managed Spark compute resources |
| **Notebooks** | Collaborative code documents |
| **Jobs** | Scheduled or triggered notebook/JAR execution |
| **Delta Lake** | ACID transactions on data lakes |
| **Unity Catalog** | Unified governance for data and AI |

### Cluster Types

1. **All-Purpose Clusters**: Interactive analysis, shared among users
2. **Job Clusters**: Ephemeral, created for job runs, terminated after
3. **SQL Warehouses**: Optimized for SQL analytics workloads

### Delta Lake Benefits

```
┌──────────────────────────────────────────────────────────┐
│                      Delta Lake                           │
├──────────────────────────────────────────────────────────┤
│  ✓ ACID Transactions    ✓ Schema Enforcement             │
│  ✓ Time Travel          ✓ Unified Batch + Streaming      │
│  ✓ Scalable Metadata    ✓ Data Versioning                │
│  ✓ Z-Order Indexing     ✓ Auto-Optimization              │
└──────────────────────────────────────────────────────────┘
```
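
Time travel and data versioning follow from Delta Lake keeping an ordered transaction log of file additions and removals. The toy model below illustrates that idea in plain Python. It is a conceptual sketch only, not Delta Lake's implementation, and `ToyTableLog` and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One atomic log entry: data files added and data files removed."""
    add: set = field(default_factory=set)
    remove: set = field(default_factory=set)


class ToyTableLog:
    def __init__(self):
        self.commits = []

    def commit(self, add=(), remove=()) -> int:
        """Append a commit and return its version number (0-based)."""
        self.commits.append(Commit(set(add), set(remove)))
        return len(self.commits) - 1

    def snapshot(self, version=None) -> set:
        """Replay the log up to `version` to find the live data files."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for entry in self.commits[: version + 1]:
            files -= entry.remove
            files |= entry.add
        return files


log = ToyTableLog()
log.commit(add={"part-000.parquet"})                                # version 0
log.commit(add={"part-001.parquet"})                                # version 1: append
log.commit(add={"part-002.parquet"}, remove={"part-000.parquet"})   # version 2: rewrite

print(sorted(log.snapshot(0)))  # files visible at version 0
print(sorted(log.snapshot()))   # files visible at the latest version
```

Because earlier commits are never modified, reading an older version only requires replaying fewer log entries, which is the intuition behind `versionAsOf`.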

In [None]:
# Example: Delta Lake Operations in Azure Databricks
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder \
    .appName("Delta Lake Example") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create a Delta table
delta_path = "abfss://curated@storage.dfs.core.windows.net/sales_delta"

df = spark.read.parquet("abfss://raw@storage.dfs.core.windows.net/sales/")
df.write.format("delta").mode("overwrite").save(delta_path)

# Register as a table in metastore
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS sales_delta
    USING DELTA
    LOCATION '{delta_path}'
""")

In [None]:
# Example: Delta Lake Merge (Upsert) Operation
from delta.tables import DeltaTable

# Load existing Delta table
delta_table = DeltaTable.forPath(spark, delta_path)

# New/updated records
updates_df = spark.read.parquet("abfss://raw@storage.dfs.core.windows.net/sales_updates/")

# Merge operation (upsert)
delta_table.alias("target") \
    .merge(
        updates_df.alias("source"),
        "target.transaction_id = source.transaction_id"
    ) \
    .whenMatchedUpdate(set={
        "amount": col("source.amount"),
        "updated_at": current_timestamp()
    }) \
    .whenNotMatchedInsert(values={
        "transaction_id": col("source.transaction_id"),
        "amount": col("source.amount"),
        "created_at": current_timestamp(),
        "updated_at": current_timestamp()
    }) \
    .execute()

# Time Travel: Query previous version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# View table history
delta_table.history().show()

In [None]:
# Example: Structured Streaming with Delta Lake
# Read streaming data
streaming_df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "abfss://config@storage.dfs.core.windows.net/schema/") \
    .load("abfss://landing@storage.dfs.core.windows.net/events/")

# Write to Delta table with checkpointing
query = streaming_df \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "abfss://config@storage.dfs.core.windows.net/checkpoints/events/") \
    .option("mergeSchema", "true") \
    .trigger(processingTime="10 seconds") \
    .start("abfss://curated@storage.dfs.core.windows.net/events_delta/")

---

## Azure Synapse Analytics

**Azure Synapse Analytics** is an integrated analytics service that combines enterprise data warehousing and big data analytics.

### Core Components

| Component | Description |
|-----------|-------------|
| **Synapse Workspace** | Unified experience for data engineering and analytics |
| **Dedicated SQL Pool** | Enterprise data warehouse (formerly SQL DW) |
| **Serverless SQL Pool** | Query data lake without loading |
| **Apache Spark Pool** | Big data processing and ML |
| **Synapse Pipelines** | Data orchestration (ADF-compatible) |
| **Synapse Link** | Real-time analytics on operational data |

### Architecture Overview

```
                    ┌─────────────────────────────────────┐
                    │        Azure Synapse Studio         │
                    │    (Unified Development Experience) │
                    └─────────────────────────────────────┘
                                      │
        ┌─────────────────────────────┼─────────────────────────────┐
        │                             │                             │
        ▼                             ▼                             ▼
┌───────────────┐           ┌───────────────┐           ┌───────────────┐
│ Dedicated SQL │           │ Serverless    │           │ Apache Spark  │
│ Pool (MPP)    │           │ SQL Pool      │           │ Pool          │
├───────────────┤           ├───────────────┤           ├───────────────┤
│ - EDW         │           │ - Ad-hoc      │           │ - Big Data    │
│ - PolyBase    │           │ - Pay per TB  │           │ - ML/AI       │
│ - Materialized│           │ - No loading  │           │ - Delta Lake  │
│   Views       │           │ - OPENROWSET  │           │ - Notebooks   │
└───────────────┘           └───────────────┘           └───────────────┘
        │                             │                             │
        └─────────────────────────────┴─────────────────────────────┘
                                      │
                                      ▼
                    ┌─────────────────────────────────────┐
                    │     Azure Data Lake Storage Gen2    │
                    │         (Primary Storage)           │
                    └─────────────────────────────────────┘
```

In [None]:
# Example: Serverless SQL Pool - Query Parquet in Data Lake

# SQL script to query data lake files directly
serverless_sql_query = """
-- Query Parquet files in ADLS Gen2 using OPENROWSET
SELECT 
    customer_id,
    product_category,
    SUM(amount) AS total_amount,
    COUNT(*) AS transaction_count
FROM OPENROWSET(
    -- A trailing /** reads all files in the folder and its subfolders
    BULK 'https://storage.dfs.core.windows.net/curated/sales/**',
    FORMAT = 'PARQUET'
) AS sales
GROUP BY customer_id, product_category
ORDER BY total_amount DESC;

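-- Create an external table for reusable access
-- (assumes the external data source adls_curated and the external file format
--  parquet_format have already been created in this database)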
CREATE EXTERNAL TABLE sales_external (
    transaction_id VARCHAR(50),
    customer_id VARCHAR(50),
    product_category VARCHAR(100),
    amount DECIMAL(18,2),
    transaction_date DATE
)
WITH (
    LOCATION = 'curated/sales/',
    DATA_SOURCE = adls_curated,
    FILE_FORMAT = parquet_format
);
"""
print(serverless_sql_query)

In [None]:
# Example: Dedicated SQL Pool - Data Warehouse Patterns

dedicated_sql_patterns = """
-- Create dimension table with proper distribution
CREATE TABLE dim_customer
WITH (
    DISTRIBUTION = REPLICATE,  -- Small dimension tables
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT 
    customer_id,
    customer_name,
    customer_segment,
    region
FROM staging.customers;

-- Create fact table with hash distribution
CREATE TABLE fact_sales
WITH (
    DISTRIBUTION = HASH(customer_id),  -- Large fact tables
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (transaction_date RANGE RIGHT FOR VALUES (
        '2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01'
    ))
)
AS
SELECT 
    transaction_id,
    customer_id,
    product_id,
    amount,
    transaction_date
FROM staging.sales;

-- Create materialized view for common aggregations
CREATE MATERIALIZED VIEW mv_daily_sales
WITH (DISTRIBUTION = HASH(customer_id))
AS
SELECT 
    customer_id,
    CAST(transaction_date AS DATE) AS sale_date,
    SUM(amount) AS daily_total,
    COUNT(*) AS transaction_count
FROM fact_sales
GROUP BY customer_id, CAST(transaction_date AS DATE);
"""
print(dedicated_sql_patterns)

In [None]:
# Example: Synapse Spark Pool - Integration with SQL Pools
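# Registers synapsesql() on DataFrameReader/DataFrameWriter (available in Synapse Spark pools)
import com.microsoft.spark.sqlanalytics
from pyspark.sql import SparkSession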

# In Synapse Spark notebook
spark = SparkSession.builder.getOrCreate()

# Read from Dedicated SQL Pool using Synapse connector
df_from_sql = spark.read \
    .synapsesql("dedicated_pool.dbo.fact_sales") \
    .filter("transaction_date >= '2024-01-01'")

# Process with Spark
df_aggregated = df_from_sql.groupBy("customer_id") \
    .agg({"amount": "sum", "transaction_id": "count"})

# Write results back to the Dedicated SQL Pool.
# The save mode is set on the DataFrameWriter; synapsesql() takes the table name.
df_aggregated.write \
    .mode("overwrite") \
    .synapsesql("dedicated_pool.dbo.customer_summary")

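# The Synapse connector (synapsesql) targets Dedicated SQL Pools only.
# Data behind a Serverless SQL Pool external table lives in the data lake,
# so read the underlying files directly instead.
df_external = spark.read \
    .parquet("abfss://curated@storage.dfs.core.windows.net/sales/")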

---

## Azure Event Hubs

**Azure Event Hubs** is a big data streaming platform and event ingestion service capable of receiving and processing millions of events per second.

### Key Features

| Feature | Description |
|---------|-------------|
| **Partitions** | Ordered sequence of events, enables parallelism |
| **Consumer Groups** | View (state, position, offset) of entire event hub |
| **Capture** | Automatic data capture to ADLS or Blob Storage |
| **Schema Registry** | Centralized schema management with Avro support |
| **Kafka Compatible** | Use Kafka clients/protocols with Event Hubs |

### Pricing Tiers

| Tier | Capacity Unit | Use Case |
|------|---------------|----------|
| Basic | Throughput Units (TUs); each TU allows 1 MB/s in, 2 MB/s out | Development/Testing |
| Standard | TUs, up to 40 | Production workloads |
| Premium | Processing Units (PUs) | High-throughput, isolation |
| Dedicated | Capacity Units (CUs) | Enterprise, guaranteed capacity |
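
As a rough sizing aid for the TU-based tiers, the sketch below estimates how many throughput units a workload needs. It assumes each TU allows 1 MB/s or 1,000 events/s of ingress and 2 MB/s of egress, whichever limit is hit first. The function name is hypothetical, and the result is a starting point rather than a capacity guarantee.

```python
import math


def required_throughput_units(ingress_mb_s: float, ingress_events_s: float, egress_mb_s: float) -> int:
    """Return the smallest TU count that covers every per-TU limit."""
    by_ingress_size = math.ceil(ingress_mb_s / 1.0)        # 1 MB/s in per TU
    by_ingress_count = math.ceil(ingress_events_s / 1000)  # 1,000 events/s in per TU
    by_egress_size = math.ceil(egress_mb_s / 2.0)          # 2 MB/s out per TU
    # A namespace always has at least one TU.
    return max(1, by_ingress_size, by_ingress_count, by_egress_size)


print(required_throughput_units(ingress_mb_s=3.5, ingress_events_s=2500, egress_mb_s=5))
```

In this example the ingress size (3.5 MB/s) is the binding limit, so the estimate rounds up to 4 TUs even though the event rate alone would need only 3.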

### Architecture

```
Producers                   Event Hub                    Consumers
─────────                   ─────────                    ─────────
                     ┌────────────────────┐
 IoT Devices    ────►│   Partition 0      │────►  Spark Streaming
                     ├────────────────────┤
 Applications   ────►│   Partition 1      │────►  Azure Functions
                     ├────────────────────┤
 Services       ────►│   Partition 2      │────►  Stream Analytics
                     ├────────────────────┤
 Kafka Clients  ────►│   Partition N      │────►  Databricks
                     └────────────────────┘
                              │
                              ▼ (Capture)
                     ┌────────────────────┐
                     │   ADLS Gen2        │
                     │   (Avro format)    │
                     └────────────────────┘
```

In [None]:
# Example: Sending Events to Event Hubs
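from azure.eventhub import EventData
from azure.eventhub.aio import EventHubProducerClient  # async client, needed for `async with` / `await`
from azure.identity.aio import DefaultAzureCredential  # async credential to match the async client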
import json

# Using Azure AD authentication
credential = DefaultAzureCredential()
eventhub_namespace = "your-namespace.servicebus.windows.net"
eventhub_name = "sales-events"

producer = EventHubProducerClient(
    fully_qualified_namespace=eventhub_namespace,
    eventhub_name=eventhub_name,
    credential=credential
)

# Send batch of events
async def send_events():
    async with producer:
        # Create a batch
        event_batch = await producer.create_batch()
        
        events = [
            {"event_type": "sale", "customer_id": "C001", "amount": 150.00},
            {"event_type": "sale", "customer_id": "C002", "amount": 299.99},
            {"event_type": "return", "customer_id": "C001", "amount": -50.00}
        ]
        
        for event in events:
            event_batch.add(EventData(json.dumps(event)))
        
        await producer.send_batch(event_batch)
        print(f"Sent {len(events)} events")

# import asyncio
# asyncio.run(send_events())

In [None]:
# Example: Consuming Events from Event Hubs
# Async variants are needed because the handlers below use `async`/`await`
from azure.eventhub.aio import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblobaio import BlobCheckpointStore
from azure.identity.aio import DefaultAzureCredential

credential = DefaultAzureCredential()

# Checkpoint store for tracking progress
checkpoint_store = BlobCheckpointStore(
    blob_account_url="https://storage.blob.core.windows.net",
    container_name="eventhub-checkpoints",
    credential=credential
)

consumer = EventHubConsumerClient(
    fully_qualified_namespace="your-namespace.servicebus.windows.net",
    eventhub_name="sales-events",
    consumer_group="$Default",
    credential=credential,
    checkpoint_store=checkpoint_store
)

async def on_event(partition_context, event):
    """Process incoming events"""
    print(f"Partition: {partition_context.partition_id}")
    print(f"Event: {event.body_as_str()}")
    
    # Checkpoint after processing
    await partition_context.update_checkpoint(event)

async def receive_events():
    async with consumer:
        await consumer.receive(
            on_event=on_event,
            starting_position="-1"  # From beginning
        )

# import asyncio
# asyncio.run(receive_events())

In [None]:
# Example: Spark Structured Streaming from Event Hubs
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder \
    .appName("EventHubs Streaming") \
    .getOrCreate()

# Connection configuration
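connection_string = "Endpoint=sb://namespace.servicebus.windows.net/;..."
eh_conf = {
    # The connector expects an encrypted connection string; reach its JVM helper via the SparkSession
    "eventhubs.connectionString": spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}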

# Define schema for incoming events
event_schema = StructType() \
    .add("event_type", StringType()) \
    .add("customer_id", StringType()) \
    .add("amount", DoubleType())

# Read from Event Hubs
df_stream = spark.readStream \
    .format("eventhubs") \
    .options(**eh_conf) \
    .load()

# Parse event body
df_parsed = df_stream \
    .withColumn("body", col("body").cast("string")) \
    .withColumn("event", from_json(col("body"), event_schema)) \
    .select(
        col("event.event_type"),
        col("event.customer_id"),
        col("event.amount"),
        col("enqueuedTime").alias("event_time")
    )

# Write to Delta Lake
query = df_parsed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "abfss://checkpoints@storage.dfs.core.windows.net/eventhubs/") \
    .start("abfss://curated@storage.dfs.core.windows.net/streaming_events/")

---

## Best Practices & Architecture Patterns

### 1. Medallion Architecture (Bronze-Silver-Gold)

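The medallion architecture is a data design pattern that organizes lakehouse data into layers of increasing quality: Bronze (raw), Silver (cleansed), and Gold (curated).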

```
┌──────────────────────────────────────────────────────────────────────────┐
│                         Medallion Architecture                            │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐              │
│   │   BRONZE    │      │   SILVER    │      │    GOLD     │              │
│   │   (Raw)     │ ───► │  (Cleansed) │ ───► │  (Curated)  │              │
│   └─────────────┘      └─────────────┘      └─────────────┘              │
│                                                                           │
│   • Raw ingestion       • Cleansed        • Business-level               │
│   • Schema-on-read      • Deduplicated    • Aggregated                   │
│   • Append-only         • Validated       • Star schema                  │
│   • Full history        • Conformed       • Consumption-ready            │
│                                                                           │
└──────────────────────────────────────────────────────────────────────────┘
```
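
The flow through the layers can be sketched on a handful of in-memory records. This is a plain-Python illustration of the pattern with hypothetical sample data and function names. In practice each step would typically be a Spark job reading and writing Delta tables.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested (strings, duplicates, bad values).
bronze = [
    {"transaction_id": "T1", "customer_id": "C001", "amount": "150.00"},
    {"transaction_id": "T1", "customer_id": "C001", "amount": "150.00"},        # duplicate
    {"transaction_id": "T2", "customer_id": "C002", "amount": "299.99"},
    {"transaction_id": "T3", "customer_id": "C001", "amount": "not-a-number"},  # invalid
]


def to_silver(rows):
    """Cleanse: drop duplicates and rows whose amount is not numeric."""
    seen, cleaned = set(), []
    for row in rows:
        if row["transaction_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        seen.add(row["transaction_id"])
        cleaned.append({**row, "amount": amount})
    return cleaned


def to_gold(rows):
    """Aggregate: total amount per customer, ready for consumption."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Keeping Bronze untouched means the Silver and Gold steps can be rerun whenever the cleansing or aggregation rules change.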

### 2. Lambda vs Kappa Architecture

| Aspect | Lambda | Kappa |
|--------|--------|-------|
| **Processing** | Batch + Stream (separate) | Stream only |
| **Complexity** | Higher (two systems) | Lower (single system) |
| **Reprocessing** | Batch layer | Replay stream |
| **Use Case** | Legacy + Real-time | Streaming-first |
| **Azure Services** | ADF + Event Hubs + Spark | Event Hubs + Spark Streaming |

### 3. Security Best Practices

```
┌──────────────────────────────────────────────────────────────────────────┐
│                        Security Layers                                    │
├──────────────────────────────────────────────────────────────────────────┤
│  1. Network Security                                                      │
│     • Private Endpoints for all services                                  │
│     • VNet integration                                                    │
│     • Network Security Groups (NSGs)                                      │
│                                                                           │
│  2. Identity & Access                                                     │
│     • Azure AD + Managed Identities                                       │
│     • RBAC for resource access                                            │
│     • ACLs for fine-grained data access                                   │
│                                                                           │
│  3. Data Protection                                                       │
│     • Encryption at rest (Azure-managed or CMK)                           │
│     • Encryption in transit (TLS 1.2+)                                    │
│     • Data masking for sensitive fields                                   │
│                                                                           │
│  4. Governance                                                            │
│     • Unity Catalog / Microsoft Purview                                   │
│     • Data lineage tracking                                               │
│     • Audit logging                                                       │
└──────────────────────────────────────────────────────────────────────────┘
```

In [None]:
# Example: End-to-End Data Pipeline Architecture

architecture_diagram = """
                                AZURE DATA PLATFORM ARCHITECTURE
══════════════════════════════════════════════════════════════════════════════════════

    DATA SOURCES                    INGESTION                      STORAGE
    ───────────                    ─────────                      ───────
                                                                    
    ┌───────────┐                ┌─────────────┐              ┌──────────────┐
    │ Databases │───────────────►│             │              │   ADLS Gen2  │
    └───────────┘                │   Azure     │              │              │
                                 │   Data      │──────────────►  ┌────────┐  │
    ┌───────────┐                │   Factory   │              │  │ Bronze │  │
    │   APIs    │───────────────►│             │              │  └────────┘  │
    └───────────┘                └─────────────┘              │       │      │
                                                              │       ▼      │
    ┌───────────┐                ┌─────────────┐              │  ┌────────┐  │
    │ IoT/Events│───────────────►│ Event Hubs  │──────────────►  │ Silver │  │
    └───────────┘                └─────────────┘              │  └────────┘  │
                                                              │       │      │
    ┌───────────┐                ┌─────────────┐              │       ▼      │
    │   Files   │───────────────►│   Logic     │──────────────►  ┌────────┐  │
    └───────────┘                │   Apps      │              │  │  Gold  │  │
                                 └─────────────┘              │  └────────┘  │
                                                              └──────────────┘
                                                                     │
    ═══════════════════════════════════════════════════════════════════════════
    
    PROCESSING                                                SERVING
    ──────────                                                ───────
                                                                     │
    ┌──────────────────────────────┐                                 ▼
    │       Azure Databricks       │◄────────────────┬───────────────┤
    │  • ETL/ELT Processing        │                 │               │
    │  • ML/AI Workloads           │                 │               │
    │  • Delta Lake                │                 │               ▼
    └──────────────────────────────┘                 │        ┌─────────────┐
                                                     │        │  Synapse    │
    ┌──────────────────────────────┐                 │        │  Analytics  │
    │     Azure Synapse Spark      │◄────────────────┤        │  Dedicated  │
    │  • Big Data Processing       │                 │        │  SQL Pool   │
    │  • Synapse Link              │                 │        └─────────────┘
    └──────────────────────────────┘                 │               │
                                                     │               ▼
    ═══════════════════════════════════════════════════════════════════════════
    
    CONSUMPTION
    ───────────
                                 │                               │
                                 ▼                               ▼
                          ┌─────────────┐                 ┌─────────────┐
                          │  Power BI   │                 │  Custom     │
                          │  Dashboards │                 │  Apps/APIs  │
                          └─────────────┘                 └─────────────┘
"""
print(architecture_diagram)

### 4. Cost Optimization Strategies

| Service | Strategy |
|---------|----------|
| **ADLS Gen2** | Lifecycle policies, tiered storage (Hot/Cool/Archive) |
| **ADF** | Use mapping data flows only when needed, optimize IR |
| **Databricks** | Auto-scaling, spot instances, cluster pools |
| **Synapse** | Pause dedicated pools, use serverless for ad-hoc |
| **Event Hubs** | Right-size throughput units, enable auto-inflate |
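
For the ADLS Gen2 row, a lifecycle policy is a JSON document of rules that move blobs between tiers as they age. The sketch below builds one such rule as a Python dictionary in the shape used by Azure Storage lifecycle management. The rule name, path prefix, and day thresholds are illustrative assumptions, and `lifecycle_rule` is a hypothetical helper.

```python
import json


def lifecycle_rule(name: str, prefix: str, cool_after_days: int, archive_after_days: int) -> dict:
    """Build one rule that tiers block blobs under `prefix` to Cool, then Archive."""
    if not 0 <= cool_after_days < archive_after_days:
        raise ValueError("cool_after_days must be smaller than archive_after_days")
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                    "tierToArchive": {"daysAfterModificationGreaterThan": archive_after_days},
                }
            },
        },
    }


policy = {"rules": [lifecycle_rule("tier-raw-sales", "raw-data/sales", 30, 90)]}
print(json.dumps(policy, indent=2))
```

The resulting JSON can be applied to a storage account through the portal, CLI, or management SDK. The thresholds should follow how often each layer is actually read.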

### 5. Monitoring & Observability

- **Azure Monitor**: Centralized metrics and logs
- **Log Analytics**: Query and analyze telemetry
- **Application Insights**: End-to-end tracing
- **Azure Alerts**: Proactive notifications
- **Databricks Observability**: Spark UI, Query History, and cluster metrics (Ganglia on Databricks Runtime 12.2 LTS and earlier)

---

## Takeaways

### Key Points Summary

| Service | Primary Use | Key Benefit |
|---------|-------------|-------------|
| **ADLS Gen2** | Data Lake Storage | Hierarchical namespace + cost-effective |
| **Azure Data Factory** | Orchestration & ETL | 90+ connectors, visual pipelines |
| **Azure Databricks** | Big Data Processing | Unified analytics, Delta Lake |
| **Azure Synapse** | Analytics Hub | SQL + Spark + Pipelines unified |
| **Azure Event Hubs** | Event Streaming | Millions of events/second |

### When to Use What?

```
┌─────────────────────────────────────────────────────────────────────────┐
│                          Decision Guide                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Need batch orchestration?          → Azure Data Factory                 │
│                                                                          │
│  Need complex transformations?      → Databricks or Synapse Spark        │
│                                                                          │
│  Need real-time streaming?          → Event Hubs + Spark Streaming       │
│                                                                          │
│  Need SQL-based analytics?          → Synapse SQL (Serverless/Dedicated) │
│                                                                          │
│  Need ML/AI workloads?              → Azure Databricks                   │
│                                                                          │
│  Need enterprise data warehouse?    → Synapse Dedicated SQL Pool         │
│                                                                          │
│  Need ad-hoc data exploration?      → Synapse Serverless SQL Pool        │
│                                                                          │
│  Need ACID on data lake?            → Delta Lake (Databricks/Synapse)    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```

### Essential Best Practices

1. **Security First**: Use Managed Identities, Private Endpoints, and proper RBAC
2. **Medallion Architecture**: Organize data in Bronze → Silver → Gold layers
3. **Delta Lake**: Enable ACID transactions and time travel on your data lake
4. **Cost Management**: Use lifecycle policies, auto-pause, and right-sizing
5. **Monitoring**: Implement comprehensive observability from day one
6. **Governance**: Use Unity Catalog or Microsoft Purview for data governance
7. **Idempotency**: Design pipelines to be safely re-runnable
8. **Partition Strategy**: Plan partitioning based on query patterns
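
Points 7 and 8 work well together: if each run owns one partition and fully replaces it, rerunning the pipeline cannot create duplicates. The sketch below shows the idea on a local file system with a hypothetical `write_partition` helper. On a lakehouse, the same effect is usually achieved with a partition overwrite or a Delta `MERGE`.

```python
import json
import tempfile
from pathlib import Path


def write_partition(base_dir: Path, run_date: str, rows: list) -> Path:
    """Replace the partition for `run_date` with exactly `rows`."""
    partition_dir = base_dir / f"transaction_date={run_date}"
    partition_dir.mkdir(parents=True, exist_ok=True)
    target = partition_dir / "part-0000.json"
    # Write to a temporary file first, then rename over the target so readers
    # never see a half-written file.
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(rows, sort_keys=True))
    tmp.replace(target)
    return target


base = Path(tempfile.mkdtemp())
rows = [{"transaction_id": "T1", "amount": 150.0}]

first = write_partition(base, "2024-01-01", rows)
second = write_partition(base, "2024-01-01", rows)  # rerun of the same day

print(first == second)
print(sorted(p.name for p in first.parent.iterdir()))
```

The output file name is deterministic and the partition path is derived from the run date, so a retry overwrites the earlier result instead of appending to it.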