# AWS Data Engineering Services

This notebook provides a comprehensive overview of AWS services commonly used in data engineering workflows. We'll explore storage, ETL, big data processing, streaming, and serverless computing services that form the backbone of modern data platforms on AWS.

---

## Table of Contents

1. [Amazon S3 - Data Lake Storage](#1-amazon-s3---data-lake-storage)
2. [AWS Glue - ETL and Data Catalog](#2-aws-glue---etl-and-data-catalog)
3. [Amazon EMR - Big Data Processing](#3-amazon-emr---big-data-processing)
4. [Amazon Kinesis - Real-Time Streaming](#4-amazon-kinesis---real-time-streaming)
5. [AWS Lambda - Serverless Processing](#5-aws-lambda---serverless-processing)
6. [Best Practices and Architecture Patterns](#6-best-practices-and-architecture-patterns)
7. [Key Takeaways](#7-key-takeaways)

---

## 1. Amazon S3 - Data Lake Storage

**Amazon Simple Storage Service (S3)** is the foundation of most AWS data engineering architectures, serving as a highly scalable, durable, and cost-effective object storage service.

### Key Features

| Feature | Description |
|---------|-------------|
| **Durability** | Designed for 99.999999999% (11 nines) durability |
| **Availability** | Designed for 99.99% availability (S3 Standard); the SLA commits to 99.9% |
| **Scalability** | Virtually unlimited storage |
| **Storage Classes** | Standard, Intelligent-Tiering, Glacier, etc. |
| **Security** | Encryption at rest and in transit, IAM policies, bucket policies |

### S3 Storage Classes for Data Lakes

```
┌─────────────────────────────────────────────────────────────────┐
│                      S3 Storage Classes                         │
├─────────────────────┬───────────────────┬───────────────────────┤
│     Hot Data        │   Warm Data       │      Cold Data        │
├─────────────────────┼───────────────────┼───────────────────────┤
│  S3 Standard        │  S3 Standard-IA   │  S3 Glacier Instant   │
│  (Frequent access)  │  (Infrequent)     │  S3 Glacier Flexible  │
│                     │  S3 One Zone-IA   │  S3 Glacier Deep      │
└─────────────────────┴───────────────────┴───────────────────────┘
```

### Data Lake Organization Pattern

```
s3://my-data-lake/
├── raw/                    # Landing zone (Bronze)
│   ├── source1/
│   │   └── year=2024/month=01/day=15/
│   └── source2/
├── processed/              # Cleaned data (Silver)
│   ├── domain1/
│   └── domain2/
├── curated/                # Business-ready (Gold)
│   ├── analytics/
│   └── ml-features/
└── archive/                # Historical data
```
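
The layout above relies on Hive-style `key=value` partition folders. As a minimal sketch of how such keys could be built consistently, the helper below assembles a key from a zone, a source name, and a date. The function name `build_partitioned_key` and the fixed set of zone names are assumptions made for illustration; they are not part of any AWS API.

```python
from datetime import date
from pathlib import PurePosixPath

# Zone names taken from the directory layout shown above.
ZONES = {"raw", "processed", "curated", "archive"}

def build_partitioned_key(zone: str, source: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key for one file."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    # Keep only the file name so local directory components never leak into the key.
    name = PurePosixPath(filename).name
    return (
        f"{zone}/{source}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{name}"
    )

print(build_partitioned_key("raw", "source1", date(2024, 1, 15), "exports/data.csv"))
```

Zero-padding the month and day keeps keys lexicographically sortable, which makes prefix listings and partition pruning easier to reason about.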

In [None]:
# Example: Working with S3 using boto3
import os
from datetime import datetime, timezone

import boto3

# Initialize S3 client
s3_client = boto3.client('s3')

# Configuration
BUCKET_NAME = 'my-data-lake'
RAW_PREFIX = 'raw/'
PROCESSED_PREFIX = 'processed/'

def upload_to_raw_zone(local_file: str, source_name: str) -> str:
    """Upload file to raw zone with date partitioning."""
    # Partition by UTC date so keys do not depend on the host's time zone
    now = datetime.now(timezone.utc)
    # Use only the file name in the key; local_file may include directories
    file_name = os.path.basename(local_file)
    s3_key = f"{RAW_PREFIX}{source_name}/year={now.year}/month={now.month:02d}/day={now.day:02d}/{file_name}"
    
    s3_client.upload_file(
        Filename=local_file,
        Bucket=BUCKET_NAME,
        Key=s3_key,
        ExtraArgs={'ServerSideEncryption': 'AES256'}
    )
    return f"s3://{BUCKET_NAME}/{s3_key}"

def list_objects_with_prefix(prefix: str, max_keys: int = 100) -> list:
    """List objects in S3 with a given prefix."""
    response = s3_client.list_objects_v2(
        Bucket=BUCKET_NAME,
        Prefix=prefix,
        MaxKeys=max_keys
    )
    return [obj['Key'] for obj in response.get('Contents', [])]

# Example usage
# s3_path = upload_to_raw_zone('data.csv', 'sales_system')
# print(f"Uploaded to: {s3_path}")
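
`list_objects_v2` returns at most 1,000 keys per call, so `list_objects_with_prefix` above only ever sees the first page of results. A possible way to walk a whole prefix is boto3's built-in paginator, sketched below. The helper name `iter_keys` is an assumption, and the client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
from typing import Iterator

def iter_keys(s3_client, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every object key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry at all.
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Usage (requires AWS credentials):
# import boto3
# for key in iter_keys(boto3.client("s3"), "my-data-lake", "raw/"):
#     print(key)
```

Yielding keys lazily avoids holding millions of key names in memory when a prefix is large.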

In [None]:
# S3 Lifecycle Policy Example (Python dict mirroring the JSON configuration)
lifecycle_policy = {
    "Rules": [
        {
            "ID": "MoveToGlacierAfter90Days",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    "Days": 90,
                    "StorageClass": "GLACIER"
                }
            ],
            "Expiration": {
                "Days": 365
            }
        },
        {
            "ID": "DeleteIncompleteMultipartUploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": 7
            }
        }
    ]
}

# Apply lifecycle policy
# s3_client.put_bucket_lifecycle_configuration(
#     Bucket=BUCKET_NAME,
#     LifecycleConfiguration=lifecycle_policy
# )
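
To make the effect of the first rule concrete, the sketch below computes which storage class an object under `raw/` would be in at a given age. It is a simplified model only: `storage_class_at` is a hypothetical helper, the rule is copied from the policy above, and real transitions happen asynchronously, so an object may change class somewhat later than the configured day.

```python
# Transitions and expiration copied from the "raw/" rule in this notebook.
RAW_RULE = {
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
    "Expiration": {"Days": 365},
}

def storage_class_at(age_days: int, rule: dict, initial: str = "STANDARD") -> str:
    """Return the storage class an object would have at a given age in days."""
    expiration = rule.get("Expiration", {}).get("Days")
    if expiration is not None and age_days >= expiration:
        return "EXPIRED"
    current = initial
    # Apply transitions in ascending order of age; the last one reached wins.
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current

for age in (10, 45, 120, 400):
    print(age, storage_class_at(age, RAW_RULE))
```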

---

## 2. AWS Glue - ETL and Data Catalog

**AWS Glue** is a fully managed ETL (Extract, Transform, Load) service that also provides a centralized metadata repository called the **Glue Data Catalog**.

### AWS Glue Components

```
┌─────────────────────────────────────────────────────────────────────┐
│                         AWS Glue Architecture                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │   Crawlers   │───▶│ Data Catalog │◀───│   External Tools     │   │
│  │              │    │  (Databases, │    │  (Athena, Redshift,  │   │
│  │ Auto-discover│    │   Tables,    │    │   EMR, QuickSight)   │   │
│  │   schemas    │    │  Partitions) │    │                      │   │
│  └──────────────┘    └──────────────┘    └──────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                      Glue ETL Jobs                            │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐   │  │
│  │  │ Spark Jobs  │  │ Python Shell│  │ Glue Studio (Visual)│   │  │
│  │  │ (PySpark)   │  │   Jobs      │  │                     │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘   │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### Key Features

| Component | Description | Use Case |
|-----------|-------------|----------|
| **Crawlers** | Automatically discover schema and populate catalog | Schema discovery, partition detection |
| **Data Catalog** | Centralized metadata repository | Schema management, data discovery |
| **ETL Jobs** | Serverless Spark jobs for transformation | Data transformation, cleansing |
| **Glue Studio** | Visual ETL job authoring | Low-code ETL development |
| **DataBrew** | Visual data preparation | Data profiling, cleansing |
| **Workflows** | Orchestrate ETL jobs | Complex pipeline orchestration |

In [None]:
# Example: AWS Glue ETL Job (PySpark)
# This script would run as an AWS Glue job

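import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, month, year

# Resolve job arguments (Glue passes --JOB_NAME when it runs the job)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Initialize Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # Must be called before job.commit() for job bookmarks to work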

# Read from Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="sales_database",
    table_name="raw_transactions",
    transformation_ctx="datasource"
)

# Apply mappings (rename and cast columns)
mapped_df = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("transaction_id", "string", "transaction_id", "string"),
        ("amount", "string", "amount", "decimal(10,2)"),
        ("transaction_date", "string", "transaction_date", "date"),
        ("customer_id", "string", "customer_id", "string")
    ],
    transformation_ctx="mapped_df"
)

# Convert to Spark DataFrame for complex transformations
df = mapped_df.toDF()

# Add partition columns
df_partitioned = df.withColumn("year", year(col("transaction_date"))) \
                   .withColumn("month", month(col("transaction_date")))

# Filter out invalid records
df_clean = df_partitioned.filter(col("amount") > 0)

# Convert back to DynamicFrame
output_dyf = DynamicFrame.fromDF(df_clean, glueContext, "output_dyf")

# Write to processed zone with partitioning
glueContext.write_dynamic_frame.from_options(
    frame=output_dyf,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": "s3://my-data-lake/processed/transactions/",
        "partitionKeys": ["year", "month"]
    },
    transformation_ctx="write_output"
)

job.commit()

In [None]:
# Managing Glue Catalog with boto3
import boto3

glue_client = boto3.client('glue')

def create_glue_database(database_name: str, description: str = "") -> dict:
    """Create a Glue Data Catalog database."""
    return glue_client.create_database(
        DatabaseInput={
            'Name': database_name,
            'Description': description
        }
    )

def create_glue_crawler(crawler_name: str, database_name: str, 
                        s3_path: str, iam_role: str) -> dict:
    """Create a Glue Crawler to discover schema."""
    return glue_client.create_crawler(
        Name=crawler_name,
        Role=iam_role,
        DatabaseName=database_name,
        Targets={
            'S3Targets': [
                {
                    'Path': s3_path,
                    'Exclusions': ['**/_SUCCESS', '**/_temporary/**']
                }
            ]
        },
        # With CRAWL_NEW_FOLDERS_ONLY, both behaviors must be LOG
        SchemaChangePolicy={
            'UpdateBehavior': 'LOG',
            'DeleteBehavior': 'LOG'
        },
        RecrawlPolicy={
            'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'
        },
        Configuration='''{"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}'''
    )

def start_crawler(crawler_name: str) -> dict:
    """Start a Glue Crawler."""
    return glue_client.start_crawler(Name=crawler_name)

def get_table_schema(database_name: str, table_name: str) -> list:
    """Get schema of a table from Glue Catalog."""
    response = glue_client.get_table(
        DatabaseName=database_name,
        Name=table_name
    )
    return response['Table']['StorageDescriptor']['Columns']
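
`start_crawler` returns immediately, so a pipeline that reads the catalog right afterwards may see stale schemas. One possible approach is to poll `get_crawler` until the crawler returns to the `READY` state, as sketched below. The helper name `wait_for_crawler`, the polling interval, and the timeout are assumptions chosen for illustration; the client and the sleep function are passed in so the logic can be tested without AWS access.

```python
import time

def wait_for_crawler(glue_client, crawler_name: str,
                     poll_seconds: float = 15.0,
                     timeout_seconds: float = 1800.0,
                     sleep=time.sleep) -> str:
    """Block until the crawler is READY, then return the status of its last crawl."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl is missing if the crawler has never run.
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawler {crawler_name} still {crawler['State']}")
        sleep(poll_seconds)

# Usage (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# glue.start_crawler(Name="data-lake-crawler")
# print(wait_for_crawler(glue, "data-lake-crawler"))
```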

---

## 3. Amazon EMR - Big Data Processing

**Amazon EMR** (formerly Amazon Elastic MapReduce) is a managed cluster platform for running big data frameworks such as Apache Spark, Hadoop, Hive, Presto, and Flink.

### EMR Deployment Options

```
┌─────────────────────────────────────────────────────────────────────┐
│                     EMR Deployment Options                          │
├─────────────────────┬─────────────────────┬─────────────────────────┤
│    EMR on EC2       │   EMR on EKS        │    EMR Serverless       │
├─────────────────────┼─────────────────────┼─────────────────────────┤
│ • Traditional       │ • Run on existing   │ • No cluster mgmt       │
│   managed clusters  │   Kubernetes        │ • Auto-scaling          │
│ • Full control      │ • Share resources   │ • Pay per use           │
│ • Custom AMIs       │ • Container-based   │ • Quick startup         │
│ • Spot instances    │ • Multi-tenant      │ • Spark & Hive only     │
└─────────────────────┴─────────────────────┴─────────────────────────┘
```

### EMR Cluster Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                      EMR Cluster Architecture                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│    ┌──────────────────┐                                             │
│    │   Master Node    │  HDFS NameNode, YARN ResourceManager        │
│    │                  │  Spark Driver, Hive Metastore               │
│    └────────┬─────────┘                                             │
│             │                                                        │
│    ┌────────┴─────────────────────────────────────────────┐         │
│    │                                                       │         │
│    ▼                                                       ▼         │
│  ┌───────────────────────┐     ┌───────────────────────────┐        │
│  │ Core Nodes (2-100+)   │     │ Task Nodes (Optional)     │        │
│  │ • HDFS DataNodes      │     │ • Compute only            │        │
│  │ • YARN NodeManagers   │     │ • No HDFS storage         │        │
│  │ • Always running      │     │ • Can use Spot instances  │        │
│  └───────────────────────┘     └───────────────────────────┘        │
│                                                                      │
│                         ┌───────────────────┐                        │
│                         │  Amazon S3        │                        │
│                         │  (EMRFS)          │                        │
│                         │  Persistent Store │                        │
│                         └───────────────────┘                        │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
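
The node roles in the diagram map onto the `InstanceGroups` structure accepted by `run_job_flow`, which can be used instead of the simpler `MasterInstanceType` / `SlaveInstanceType` / `InstanceCount` fields. The sketch below builds such a list with On-Demand primary and core nodes and Spot task nodes. The helper name `build_instance_groups`, the group names, and the default instance type are assumptions for illustration.

```python
def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m5.xlarge") -> list:
    """Build an InstanceGroups list: On-Demand primary/core, Spot task nodes."""
    groups = [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            # Core nodes hold HDFS blocks, so they stay On-Demand.
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append({
            # Task nodes are compute-only, so losing a Spot instance costs no data.
            "Name": "Task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups

print(build_instance_groups(core_count=2, task_count=4))
```

The result would be passed as `Instances={'InstanceGroups': groups, ...}` in place of the three simpler fields.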

In [None]:
# Example: Creating an EMR Cluster with boto3
import boto3

emr_client = boto3.client('emr')

def create_emr_cluster(
    cluster_name: str,
    log_uri: str,
    ec2_key_name: str,
    subnet_id: str
) -> str:
    """Create an EMR cluster for Spark processing."""
    
    response = emr_client.run_job_flow(
        Name=cluster_name,
        LogUri=log_uri,
        ReleaseLabel='emr-7.0.0',
        Applications=[
            {'Name': 'Spark'},
            {'Name': 'Hive'},
            {'Name': 'JupyterEnterpriseGateway'}
        ],
        Instances={
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            'Ec2KeyName': ec2_key_name,
            'Ec2SubnetId': subnet_id,
            'KeepJobFlowAliveWhenNoSteps': True,
            'TerminationProtected': False
        },
        Configurations=[
            {
                'Classification': 'spark-defaults',
                'Properties': {
                    'spark.dynamicAllocation.enabled': 'true',
                    'spark.sql.adaptive.enabled': 'true',
                    'spark.serializer': 'org.apache.spark.serializer.KryoSerializer'
                }
            },
            {
                'Classification': 'spark-hive-site',
                'Properties': {
                    'hive.metastore.client.factory.class': 
                        'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
                }
            }
        ],
        ServiceRole='EMR_DefaultRole',
        JobFlowRole='EMR_EC2_DefaultRole',
        VisibleToAllUsers=True,
        Tags=[
            {'Key': 'Environment', 'Value': 'production'},
            {'Key': 'Project', 'Value': 'data-pipeline'}
        ]
    )
    
    return response['JobFlowId']

def add_spark_step(cluster_id: str, step_name: str, 
                   script_path: str, args: list = None) -> str:
    """Add a Spark step to running EMR cluster."""
    
    step_args = [
        'spark-submit',
        '--deploy-mode', 'cluster',
        '--master', 'yarn',
        script_path
    ]
    if args:
        step_args.extend(args)
    
    response = emr_client.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[
            {
                'Name': step_name,
                'ActionOnFailure': 'CONTINUE',
                'HadoopJarStep': {
                    'Jar': 'command-runner.jar',
                    'Args': step_args
                }
            }
        ]
    )
    
    return response['StepIds'][0]

In [None]:
# EMR Serverless Example
import boto3

emr_serverless = boto3.client('emr-serverless')

def create_emr_serverless_application(app_name: str) -> str:
    """Create an EMR Serverless application."""
    response = emr_serverless.create_application(
        name=app_name,
        releaseLabel='emr-7.0.0',
        type='SPARK',
        autoStartConfiguration={
            'enabled': True
        },
        autoStopConfiguration={
            'enabled': True,
            'idleTimeoutMinutes': 15
        },
        maximumCapacity={
            'cpu': '200 vCPU',
            'memory': '400 GB'
        }
    )
    return response['applicationId']

def submit_serverless_job(
    application_id: str,
    execution_role: str,
    script_path: str,
    spark_config: dict = None
) -> str:
    """Submit a Spark job to EMR Serverless."""
    
    # Merge caller-supplied Spark settings over the defaults
    conf = {'spark.executor.cores': '4', 'spark.executor.memory': '8g'}
    conf.update(spark_config or {})
    
    spark_submit_params = {
        'entryPoint': script_path,
        'sparkSubmitParameters': ' '.join(f'--conf {k}={v}' for k, v in conf.items())
    }
    
    response = emr_serverless.start_job_run(
        applicationId=application_id,
        executionRoleArn=execution_role,
        jobDriver={
            'sparkSubmit': spark_submit_params
        },
        configurationOverrides={
            'monitoringConfiguration': {
                's3MonitoringConfiguration': {
                    'logUri': 's3://my-emr-logs/serverless/'
                }
            }
        }
    )
    
    return response['jobRunId']

---

## 4. Amazon Kinesis - Real-Time Streaming

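**Amazon Kinesis** is a platform for collecting, processing, and analyzing real-time streaming data. It consists of multiple services designed for different streaming use cases. AWS has since renamed two of them: Kinesis Data Analytics is now Amazon Managed Service for Apache Flink, and Kinesis Data Firehose is now Amazon Data Firehose. This notebook keeps the older names.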

### Kinesis Services Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                       Amazon Kinesis Family                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │
│  │ Kinesis Data    │    │ Kinesis Data    │    │ Kinesis Data    │  │
│  │ Streams         │───▶│ Analytics       │───▶│ Firehose        │  │
│  │                 │    │                 │    │                 │  │
│  │ • Custom apps   │    │ • SQL on streams│    │ • Auto delivery │  │
│  │ • Full control  │    │ • Flink apps    │    │ • S3, Redshift  │  │
│  │ • Sub-second    │    │ • Aggregations  │    │ • Elasticsearch │  │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘  │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Kinesis Video Streams                      │    │
│  │     Capture, process, and store video streams               │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### Kinesis Data Streams Architecture

```
Producers                    Kinesis Stream                    Consumers
                            ┌──────────────┐
┌─────────┐                 │  Shard 1     │                 ┌─────────────┐
│ App 1   │──┐              │  ────────    │              ┌──│ Lambda      │
└─────────┘  │              │  Partition   │              │  └─────────────┘
             │              │  Key Hash    │              │
┌─────────┐  │  PutRecord   ├──────────────┤  GetRecords  │  ┌─────────────┐
│ App 2   │──┼─────────────▶│  Shard 2     │──────────────┼──│ EMR         │
└─────────┘  │              │  ────────    │              │  └─────────────┘
             │              │              │              │
┌─────────┐  │              ├──────────────┤              │  ┌─────────────┐
│ IoT     │──┘              │  Shard N     │              └──│ Custom App  │
└─────────┘                 │  ────────    │                 └─────────────┘
                            └──────────────┘
                            
Data Retention: 24 hours (default) to 365 days
```
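
The "Partition Key Hash" step in the diagram works as follows: Kinesis takes the MD5 hash of the partition key, reads it as a 128-bit integer, and routes the record to the shard whose hash key range contains that value. The sketch below reproduces this locally. It assumes the hash space is split evenly across shards; in a real stream the ranges come from each shard's `HashKeyRange` and may be uneven after resharding. The function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value

def hash_key(partition_key: str) -> int:
    """Map a partition key to its 128-bit hash value."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

def shard_ranges(shard_count: int) -> list:
    """Split the hash space into contiguous, inclusive (start, end) ranges."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the ranges cover the full space.
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges

def shard_for(partition_key: str, ranges: list) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("ranges do not cover the hash space")

ranges = shard_ranges(4)
for key in ("user-123", "user-456", "user-789"):
    print(key, "-> shard index", shard_for(key, ranges))
```

Because the mapping is deterministic, all records with the same partition key land on the same shard, which preserves per-key ordering. A few very hot keys can still overload a single shard.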

In [None]:
# Kinesis Data Streams - Producer Example
import boto3
import json
from datetime import datetime
import uuid

kinesis_client = boto3.client('kinesis')

STREAM_NAME = 'user-events-stream'

def put_record(stream_name: str, data: dict, partition_key: str) -> dict:
    """Put a single record to Kinesis stream."""
    response = kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(data).encode('utf-8'),
        PartitionKey=partition_key
    )
    return response

def put_records_batch(stream_name: str, records: list) -> dict:
    """Put multiple records to Kinesis stream (batch)."""
    kinesis_records = [
        {
            'Data': json.dumps(record['data']).encode('utf-8'),
            'PartitionKey': record['partition_key']
        }
        for record in records
    ]
    
    response = kinesis_client.put_records(
        StreamName=stream_name,
        Records=kinesis_records
    )
    return response

# Example: Send user click events
def send_click_event(user_id: str, page: str, action: str):
    """Send a user click event to Kinesis."""
    event = {
        'event_id': str(uuid.uuid4()),
        'user_id': user_id,
        'page': page,
        'action': action,
        'timestamp': datetime.utcnow().isoformat()
    }
    
    return put_record(
        stream_name=STREAM_NAME,
        data=event,
        partition_key=user_id  # Ensures same user's events go to same shard
    )

# Example usage
# send_click_event('user-123', '/products', 'view')
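
`put_records` accepts at most 500 records per call and can succeed partially: the response carries a `FailedRecordCount` and one positional result per record, where failed entries include an `ErrorCode`. The `put_records_batch` function above ignores both points. The sketch below shows one way to chunk the input and retry only the failed entries. The helper name, attempt count, and backoff values are assumptions; the client is injected so the logic can be tested offline.

```python
import json
import time

MAX_RECORDS_PER_CALL = 500  # PutRecords limit on records per request

def put_records_with_retry(kinesis_client, stream_name: str, records: list,
                           max_attempts: int = 3, backoff_seconds: float = 0.5,
                           sleep=time.sleep) -> list:
    """Send records in chunks; return the entries that still failed after retries."""
    entries = [
        {
            "Data": json.dumps(record["data"]).encode("utf-8"),
            "PartitionKey": record["partition_key"],
        }
        for record in records
    ]
    unsent = []
    for start in range(0, len(entries), MAX_RECORDS_PER_CALL):
        pending = entries[start:start + MAX_RECORDS_PER_CALL]
        for attempt in range(max_attempts):
            response = kinesis_client.put_records(StreamName=stream_name, Records=pending)
            if response.get("FailedRecordCount", 0) == 0:
                pending = []
                break
            # Results are positional: keep only entries whose result has an ErrorCode.
            pending = [
                entry for entry, result in zip(pending, response["Records"])
                if "ErrorCode" in result
            ]
            if attempt < max_attempts - 1:
                sleep(backoff_seconds * 2 ** attempt)  # exponential backoff
        unsent.extend(pending)
    return unsent
```

Retrying failed records can reorder them relative to records that succeeded on the first attempt, so consumers that need strict ordering should not rely on arrival order alone.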

In [None]:
# Kinesis Data Streams - Consumer Example
import boto3
import json
import time

kinesis_client = boto3.client('kinesis')

def get_shard_iterator(stream_name: str, shard_id: str, 
                       iterator_type: str = 'LATEST') -> str:
    """Get shard iterator for reading from stream."""
    response = kinesis_client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType=iterator_type  # LATEST, TRIM_HORIZON, AT_TIMESTAMP
    )
    return response['ShardIterator']

def consume_stream(stream_name: str, process_func: callable, 
                   max_iterations: int = 100):
    """Consume records from all shards in a Kinesis stream."""
    
    # Describe stream to get shards
    response = kinesis_client.describe_stream(StreamName=stream_name)
    shards = response['StreamDescription']['Shards']
    
    shard_iterators = {}
    for shard in shards:
        shard_id = shard['ShardId']
        shard_iterators[shard_id] = get_shard_iterator(
            stream_name, shard_id, 'LATEST'
        )
    
    iteration = 0
    while iteration < max_iterations:
        for shard_id, iterator in list(shard_iterators.items()):
            if iterator is None:
                continue
                
            response = kinesis_client.get_records(
                ShardIterator=iterator,
                Limit=100
            )
            
            for record in response['Records']:
                data = json.loads(record['Data'].decode('utf-8'))
                process_func(data)
            
            # NextShardIterator is missing or None once a closed shard is fully read
            shard_iterators[shard_id] = response.get('NextShardIterator')
        
        iteration += 1
        time.sleep(0.2)  # Avoid throttling

# Example processor
def process_event(event: dict):
    """Process a single event from the stream."""
    print(f"Processing: {event['event_id']} - User: {event['user_id']}")

# consume_stream('user-events-stream', process_event)

In [None]:
# Kinesis Data Firehose Example
import json

import boto3

firehose_client = boto3.client('firehose')

def create_s3_delivery_stream(
    stream_name: str,
    bucket_arn: str,
    role_arn: str,
    prefix: str = 'data/'
) -> dict:
    """Create a Kinesis Firehose delivery stream to S3."""
    
    return firehose_client.create_delivery_stream(
        DeliveryStreamName=stream_name,
        DeliveryStreamType='DirectPut',
        ExtendedS3DestinationConfiguration={
            'RoleARN': role_arn,
            'BucketARN': bucket_arn,
            'Prefix': f'{prefix}year=!{{timestamp:yyyy}}/month=!{{timestamp:MM}}/day=!{{timestamp:dd}}/',
            'ErrorOutputPrefix': 'errors/',
            'BufferingHints': {
                'SizeInMBs': 128,
                'IntervalInSeconds': 300
            },
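            # With format conversion enabled, compression is set in the Parquet
            # serializer below, so the S3-level setting stays UNCOMPRESSED
            'CompressionFormat': 'UNCOMPRESSED',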
            'DataFormatConversionConfiguration': {
                'Enabled': True,
                'SchemaConfiguration': {
                    'DatabaseName': 'analytics_db',
                    'TableName': 'events',
                    'RoleARN': role_arn
                },
                'InputFormatConfiguration': {
                    'Deserializer': {
                        'OpenXJsonSerDe': {}
                    }
                },
                'OutputFormatConfiguration': {
                    'Serializer': {
                        'ParquetSerDe': {
                            'Compression': 'SNAPPY'
                        }
                    }
                }
            }
        }
    )

def put_firehose_record(stream_name: str, data: dict) -> dict:
    """Put a record directly to Firehose."""
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={
            'Data': json.dumps(data).encode('utf-8')
        }
    )

---

## 5. AWS Lambda - Serverless Processing

**AWS Lambda** is a serverless compute service that runs code in response to events without provisioning or managing servers. It's ideal for event-driven data processing, real-time file processing, and lightweight ETL tasks.

### Lambda Use Cases in Data Engineering

```
┌─────────────────────────────────────────────────────────────────────┐
│               Lambda in Data Engineering Workflows                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. S3 Event Processing          2. Stream Processing               │
│  ┌─────┐    ┌────────┐          ┌─────────┐    ┌────────┐          │
│  │ S3  │───▶│ Lambda │          │ Kinesis │───▶│ Lambda │          │
│  └─────┘    └────┬───┘          └─────────┘    └────┬───┘          │
│                  │                                   │              │
│                  ▼                                   ▼              │
│           ┌────────────┐                      ┌────────────┐        │
│           │ Transform  │                      │ DynamoDB   │        │
│           │ & Load     │                      │ Aggregation│        │
│           └────────────┘                      └────────────┘        │
│                                                                      │
│  3. API Data Ingestion           4. Scheduled ETL                   │
│  ┌───────────┐    ┌────────┐    ┌───────────┐    ┌────────┐        │
│  │ API GW    │───▶│ Lambda │    │ EventBrg  │───▶│ Lambda │        │
│  └───────────┘    └────┬───┘    │ Schedule  │    └────┬───┘        │
│                        │        └───────────┘         │             │
│                        ▼                              ▼             │
│                 ┌────────────┐                 ┌────────────┐       │
│                 │ Kinesis    │                 │ Glue Job   │       │
│                 │ Firehose   │                 │ Trigger    │       │
│                 └────────────┘                 └────────────┘       │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### Lambda Limits and Considerations

| Resource | Limit |
|----------|-------|
| Memory | 128 MB - 10,240 MB |
| Timeout | 15 minutes max |
| Payload | 6 MB (sync), 256 KB (async) |
| Ephemeral Storage | 512 MB - 10,240 MB |
| Concurrent Executions | 1,000 per Region (default, adjustable) |
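
The payload limits in the table are easy to hit when a producer forwards whole files or large batches. As a rough sketch, the check below measures the serialized size of a payload against the synchronous and asynchronous limits listed above before invoking a function. The helper name `payload_fits` is an assumption, and the limits are hard-coded from the table rather than read from any API.

```python
import json

SYNC_LIMIT_BYTES = 6 * 1024 * 1024   # 6 MB, InvocationType="RequestResponse"
ASYNC_LIMIT_BYTES = 256 * 1024       # 256 KB, InvocationType="Event"

def payload_fits(payload: dict, invocation_type: str = "RequestResponse") -> bool:
    """Return True if the JSON-serialized payload is within the invocation limit."""
    size = len(json.dumps(payload).encode("utf-8"))
    limit = ASYNC_LIMIT_BYTES if invocation_type == "Event" else SYNC_LIMIT_BYTES
    return size <= limit

small = {"bucket": "my-data-lake", "key": "raw/source1/data.csv"}
large = {"blob": "x" * 300_000}
print(payload_fits(small, "Event"), payload_fits(large, "Event"), payload_fits(large))
```

When a payload does not fit, a common pattern is to write the data to S3 and pass only the bucket and key in the event, as the `small` example does.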

In [None]:
# Lambda Function: S3 Event Processor
# This would be deployed as a Lambda function

import json
import boto3
import urllib.parse
from datetime import datetime

s3_client = boto3.client('s3')
glue_client = boto3.client('glue')

def lambda_handler(event, context):
    """
    Lambda function triggered by S3 PutObject events.
    Processes incoming files and triggers downstream workflows.
    """
    processed_files = []
    
    for record in event['Records']:
        # Extract bucket and key from event
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        
        print(f"Processing: s3://{bucket}/{key}")
        
        # Skip marker and hidden files such as _SUCCESS; test the file name, not the full key
        file_name = key.rsplit('/', 1)[-1]
        if file_name.startswith('_'):
            continue
        
        # Get file metadata
        response = s3_client.head_object(Bucket=bucket, Key=key)
        file_size = response['ContentLength']
        
        # Validate file
        if file_size == 0:
            print(f"Skipping empty file: {key}")
            continue
        
        # Determine processing based on file type
        if key.endswith('.csv'):
            process_csv_file(bucket, key)
        elif key.endswith('.json'):
            process_json_file(bucket, key)
        elif key.endswith('.parquet'):
            trigger_glue_crawler(bucket, key)
        
        processed_files.append({
            'bucket': bucket,
            'key': key,
            'size': file_size,
            'processed_at': datetime.utcnow().isoformat()
        })
    
    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': f'Processed {len(processed_files)} files',
            'files': processed_files
        })
    }

def process_csv_file(bucket: str, key: str):
    """Process CSV file and convert to Parquet."""
    # Trigger Glue job for CSV to Parquet conversion
    glue_client.start_job_run(
        JobName='csv-to-parquet-converter',
        Arguments={
            '--source_bucket': bucket,
            '--source_key': key
        }
    )

def process_json_file(bucket: str, key: str):
    """Process JSON file."""
    pass  # Implementation depends on use case

def trigger_glue_crawler(bucket: str, key: str):
    """Trigger Glue crawler to update catalog."""
    crawler_name = 'data-lake-crawler'
    try:
        glue_client.start_crawler(Name=crawler_name)
    except glue_client.exceptions.CrawlerRunningException:
        print(f"Crawler {crawler_name} is already running")

In [None]:
# Lambda Function: Kinesis Stream Processor
import json
import base64
import boto3
from datetime import datetime
from collections import defaultdict

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('event_aggregations')

def lambda_handler(event, context):
    """
    Lambda function triggered by Kinesis Data Streams.
    Aggregates events and writes to DynamoDB.
    """
    aggregations = defaultdict(lambda: {'count': 0, 'events': []})
    
    for record in event['Records']:
        # Decode Kinesis record
        payload = base64.b64decode(record['kinesis']['data'])
        data = json.loads(payload)
        
        # Aggregate by user_id
        user_id = data.get('user_id', 'unknown')
        aggregations[user_id]['count'] += 1
        aggregations[user_id]['events'].append({
            'event_id': data.get('event_id'),
            'action': data.get('action'),
            'timestamp': data.get('timestamp')
        })
    
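    # Write aggregations to DynamoDB, keyed by user and UTC hour.
    # Note: put_item replaces any existing item with the same pk/sk, so counts
    # from earlier batches in the same hour are overwritten. Use update_item
    # with an ADD expression if totals must accumulate across invocations.
    now = datetime.utcnow()
    hour_bucket = now.strftime('%Y-%m-%d-%H')
    
    with table.batch_writer() as batch:
        for user_id, agg_data in aggregations.items():
            batch.put_item(Item={
                'pk': f'USER#{user_id}',
                'sk': f'HOUR#{hour_bucket}',
                'event_count': agg_data['count'],
                'events': agg_data['events'][-10:],  # Keep last 10 events
                'updated_at': now.isoformat()
            })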
    
    return {
        'statusCode': 200,
        'processedRecords': len(event['Records']),
        'uniqueUsers': len(aggregations)
    }

In [None]:
# Creating Lambda Functions with boto3
import boto3
import zipfile
import io

lambda_client = boto3.client('lambda')

def create_lambda_function(
    function_name: str,
    handler: str,
    role_arn: str,
    code_path: str,
    runtime: str = 'python3.11',
    memory: int = 256,
    timeout: int = 60,
    environment: dict = None
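) -> dict:
    """Create a Lambda function from a single Python source file.

    The file at `code_path` is stored in the archive as `lambda_function.py`,
    so `handler` must have the form `lambda_function.<function_name>`.
    """
    
    # Build the deployment package in memory
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.write(code_path, 'lambda_function.py')
    zip_buffer.seek(0)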
    
    return lambda_client.create_function(
        FunctionName=function_name,
        Runtime=runtime,
        Role=role_arn,
        Handler=handler,
        Code={'ZipFile': zip_buffer.read()},
        MemorySize=memory,
        Timeout=timeout,
        Environment={'Variables': environment or {}},
        TracingConfig={'Mode': 'Active'}  # Enable X-Ray
    )

def add_s3_trigger(
    function_name: str,
    bucket_name: str,
    prefix: str = '',
    suffix: str = ''
) -> dict:
    """Add S3 trigger to Lambda function."""
    s3_client = boto3.client('s3')
    
    # Add permission for S3 to invoke Lambda
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f's3-trigger-{bucket_name}',
        Action='lambda:InvokeFunction',
        Principal='s3.amazonaws.com',
        SourceArn=f'arn:aws:s3:::{bucket_name}'
    )
    
    # Get Lambda ARN
    lambda_arn = lambda_client.get_function(
        FunctionName=function_name
    )['Configuration']['FunctionArn']
    
    # Build key filter rules only for the values that were supplied
    filter_rules = []
    if prefix:
        filter_rules.append({'Name': 'prefix', 'Value': prefix})
    if suffix:
        filter_rules.append({'Name': 'suffix', 'Value': suffix})
    
    lambda_config = {
        'LambdaFunctionArn': lambda_arn,
        'Events': ['s3:ObjectCreated:*']
    }
    if filter_rules:
        lambda_config['Filter'] = {'Key': {'FilterRules': filter_rules}}
    
    # Configure S3 bucket notification.
    # Note: this call replaces the bucket's entire notification configuration,
    # including any triggers that were set up previously.
    notification_config = {'LambdaFunctionConfigurations': [lambda_config]}
    
    return s3_client.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=notification_config
    )

---

## 6. Best Practices and Architecture Patterns

### Modern Data Lake Architecture on AWS

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                      Modern Data Lake Architecture                               │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                           DATA SOURCES                                    │   │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────────┐ │   │
│  │  │Databases│  │   APIs  │  │   IoT   │  │  Logs   │  │ SaaS Apps       │ │   │
│  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  └────────┬────────┘ │   │
│  └───────┼────────────┼────────────┼────────────┼────────────────┼──────────┘   │
│          │            │            │            │                │              │
│          ▼            ▼            ▼            ▼                ▼              │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                         INGESTION LAYER                                   │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │   │
│  │  │    Kinesis   │  │   Lambda     │  │    DMS       │  │   AppFlow    │  │   │
│  │  │   Firehose   │  │  (API GW)    │  │ (Databases)  │  │   (SaaS)     │  │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │   │
│  └───────────────────────────────────────────────────────────────────────────┘   │
│                                      │                                           │
│                                      ▼                                           │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                         STORAGE LAYER (S3)                                │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                    │   │
│  │  │    Bronze    │  │    Silver    │  │     Gold     │                    │   │
│  │  │  (Raw Data)  │──│  (Cleaned)   │──│  (Curated)   │                    │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘                    │   │
│  └───────────────────────────────────────────────────────────────────────────┘   │
│                                      │                                           │
│                                      ▼                                           │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                       PROCESSING LAYER                                    │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │   │
│  │  │  AWS Glue    │  │     EMR      │  │   Lambda     │  │    Step      │  │   │
│  │  │  (ETL)       │  │  (Spark)     │  │ (Serverless) │  │  Functions   │  │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │   │
│  └───────────────────────────────────────────────────────────────────────────┘   │
│                                      │                                           │
│                                      ▼                                           │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                       CONSUMPTION LAYER                                   │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │   │
│  │  │   Athena     │  │   Redshift   │  │  QuickSight  │  │  SageMaker   │  │   │
│  │  │  (Ad-hoc)    │  │  (DW)        │  │   (BI)       │  │    (ML)      │  │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │   │
│  └───────────────────────────────────────────────────────────────────────────┘   │
│                                                                                  │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                    GOVERNANCE & SECURITY                                  │   │
│  │  Lake Formation │ IAM │ KMS │ CloudTrail │ Glue Catalog │ DataZone       │   │
│  └──────────────────────────────────────────────────────────────────────────┘   │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
```

### Best Practices by Service

#### Amazon S3 Best Practices

| Category | Best Practice |
|----------|---------------|
| **Partitioning** | Use date-based partitioning (year/month/day) for time-series data |
| **File Format** | Use columnar formats (Parquet, ORC) for analytics workloads |
| **File Size** | Target file sizes of 128 MB to 1 GB; many small files slow down listing and query planning |
| **Naming** | Use consistent, meaningful prefixes for efficient listing |
| **Lifecycle** | Implement lifecycle policies to manage storage costs |
| **Security** | Enable default encryption, block public access |
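
The date-based partitioning row can be illustrated with a small key builder. This is a minimal sketch, assuming Hive-style `year=/month=/day=` prefixes (a common convention that Glue and Athena recognise as partitions); the function name, dataset prefix, and file name are hypothetical.

```python
from datetime import datetime, timezone


def partitioned_key(dataset: str, event_time: datetime, filename: str) -> str:
    """Build a Hive-style, date-partitioned S3 object key."""
    # Normalise to UTC so a given event always maps to the same partition.
    ts = event_time.astimezone(timezone.utc)
    return (
        f"{dataset}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{filename}"
    )


key = partitioned_key(
    "bronze/events",  # hypothetical dataset prefix
    datetime(2024, 3, 7, 23, 15, tzinfo=timezone.utc),
    "part-0000.parquet",
)
print(key)  # bronze/events/year=2024/month=03/day=07/part-0000.parquet
```

Zero-padded month and day values keep prefixes in lexicographic date order, which makes range listing simpler.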

#### AWS Glue Best Practices

| Category | Best Practice |
|----------|---------------|
| **Job Bookmarks** | Enable to process only new data incrementally |
| **Partitioning** | Push down predicates to reduce data scanned |
| **Worker Type** | Choose appropriate worker type (G.1X, G.2X) based on workload |
| **Error Handling** | Configure job retries and timeouts; alert on failed runs via EventBridge and CloudWatch |
| **Catalog** | Keep catalog organized with meaningful database/table names |
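
As a rough illustration of the pushdown-predicate row, the helper below builds a partition filter string for a date range. It is a sketch under the assumption that the table is partitioned by string columns named `year`, `month`, and `day`; the helper name and the database and table names in the trailing comment are hypothetical.

```python
from datetime import date, timedelta


def date_range_predicate(start: date, end: date) -> str:
    """Return a partition predicate covering start..end (inclusive)."""
    if end < start:
        raise ValueError("end must not be before start")
    clauses = []
    day = start
    while day <= end:
        clauses.append(
            f"(year='{day:%Y}' AND month='{day:%m}' AND day='{day:%d}')"
        )
        day += timedelta(days=1)
    return " OR ".join(clauses)


predicate = date_range_predicate(date(2024, 1, 31), date(2024, 2, 1))
print(predicate)

# Inside a Glue job, the string could be passed to the catalog reader so that
# only matching partitions are listed and read, for example:
# glue_context.create_dynamic_frame.from_catalog(
#     database="my_db",          # hypothetical
#     table_name="events",       # hypothetical
#     push_down_predicate=predicate,
# )
```

Enumerating days explicitly avoids string comparisons across month boundaries, at the cost of a longer predicate for wide ranges.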

#### Amazon EMR Best Practices

| Category | Best Practice |
|----------|---------------|
| **Instance Types** | Use Spot instances for task nodes (up to 90% savings) |
| **Storage** | Use EMRFS (S3) for persistent data and local HDFS only for temporary, intermediate data |
| **Scaling** | Configure managed scaling for dynamic workloads |
| **Security** | Enable encryption at rest and in transit |
| **Monitoring** | Use CloudWatch and Spark UI for monitoring |
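
The Spot recommendation can be sketched as an instance-group layout in the shape accepted by the `InstanceGroups` key of boto3's `run_job_flow`. The helper name, group names, instance type, and node counts are illustrative assumptions, not a sizing recommendation.

```python
def build_instance_groups(
    core_count: int,
    task_count: int,
    instance_type: str = "m5.xlarge",  # illustrative choice
) -> list[dict]:
    """Primary and core nodes On-Demand, task nodes on Spot."""
    groups = [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            # Core nodes host HDFS, so losing them to a Spot
            # interruption risks data loss.
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                # Task nodes only run compute, so interruptions are tolerable.
                "Name": "Task",
                "InstanceRole": "TASK",
                "Market": "SPOT",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


groups = build_instance_groups(core_count=2, task_count=4)
print([(g["InstanceRole"], g["Market"]) for g in groups])
# [('MASTER', 'ON_DEMAND'), ('CORE', 'ON_DEMAND'), ('TASK', 'SPOT')]
```

The resulting list would go under `Instances={'InstanceGroups': groups, ...}` when creating the cluster.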

#### Amazon Kinesis Best Practices

| Category | Best Practice |
|----------|---------------|
| **Sharding** | Choose partition key that distributes data evenly |
| **Batching** | Use PutRecords for batch ingestion (up to 500 records) |
| **Enhanced Fan-Out** | Use for low-latency consumption with multiple consumers |
| **Error Handling** | Implement retry logic with exponential backoff |
| **Monitoring** | Monitor `GetRecords.IteratorAgeMilliseconds` (or the Lambda `IteratorAge` metric) to detect consumer lag |
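
The batching and retry rows can be combined in one small sketch: records are split into chunks of at most 500, and only the records that fail are resent, with exponential backoff and jitter. The `send` callable is a stand-in assumption for a wrapper around `put_records` that returns the rejected records (in the real response, those are the entries carrying an `ErrorCode`); all function names here are hypothetical.

```python
import random
import time
from typing import Callable, Iterator

MAX_BATCH = 500  # PutRecords accepts at most 500 records per request


def chunk(records: list[dict], size: int = MAX_BATCH) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def put_with_retry(
    send: Callable[[list[dict]], list[dict]],
    batch: list[dict],
    max_attempts: int = 5,
    base_delay: float = 0.1,
) -> list[dict]:
    """Send a batch, resending only rejected records.

    Returns the records that still failed after the last attempt.
    """
    pending = batch
    for attempt in range(max_attempts):
        pending = send(pending)
        if not pending:
            return []
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return pending


# Demo with a fake sender that rejects one record on the first call only.
calls: list[int] = []


def fake_send(records: list[dict]) -> list[dict]:
    calls.append(len(records))
    return records[:1] if len(calls) == 1 else []


records = [{"Data": b"x", "PartitionKey": str(i)} for i in range(1200)]
batches = list(chunk(records))
print([len(b) for b in batches])  # [500, 500, 200]

leftover = put_with_retry(fake_send, batches[0], base_delay=0.001)
print(calls, leftover)  # [500, 1] []
```

Resending only the rejected records matters because `PutRecords` can partially succeed; resending the whole batch would duplicate the records that were already accepted.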

#### AWS Lambda Best Practices

| Category | Best Practice |
|----------|---------------|
| **Memory** | Tune memory based on workload (CPU scales with memory) |
| **Cold Starts** | Use Provisioned Concurrency for latency-sensitive workloads |
| **Connections** | Reuse connections (DB, HTTP) across invocations |
| **Packaging** | Minimize deployment package size for faster cold starts |
| **Idempotency** | Design functions to be idempotent for retry safety |
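
To make the idempotency row concrete, here is a minimal sketch in which each event is claimed by its ID before processing, so a retried delivery is skipped. The in-memory store is only a stand-in: in a real function the claim would need a durable store, for example a DynamoDB `put_item` with the condition `attribute_not_exists(pk)`. The class, function, and field names are hypothetical.

```python
class InMemoryIdempotencyStore:
    """Stand-in for a durable store such as a DynamoDB table."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True the first time a key is seen, False afterwards."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


store = InMemoryIdempotencyStore()
processed: list[str] = []


def handle_event(event: dict) -> str:
    """Process an event at most once, keyed by its event_id."""
    event_id = event["event_id"]
    if not store.claim(event_id):
        return "skipped"  # duplicate delivery, side effects already applied
    processed.append(event_id)  # placeholder for the real side effect
    return "processed"


results = [
    handle_event({"event_id": "evt-1"}),
    handle_event({"event_id": "evt-1"}),  # simulated retry
    handle_event({"event_id": "evt-2"}),
]
print(results)    # ['processed', 'skipped', 'processed']
print(processed)  # ['evt-1', 'evt-2']
```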

In [None]:
# Example: Complete Data Pipeline Orchestration with Step Functions
import boto3
import json

sfn_client = boto3.client('stepfunctions')

# Step Functions state machine definition for a data pipeline.
# Account IDs and ARNs below are placeholders; substitute your own.
state_machine_definition = {
    "Comment": "Data Pipeline: Ingest, Transform, and Load",
    "StartAt": "CheckNewData",
    "States": {
        "CheckNewData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-new-data",
            "Next": "HasNewData"
        },
        "HasNewData": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.hasNewData",
                    "BooleanEquals": True,
                    "Next": "StartGlueCrawler"
                }
            ],
            "Default": "NoNewData"
        },
        "NoNewData": {
            "Type": "Succeed"
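        },
        "StartGlueCrawler": {
            "Type": "Task",
            # Crawlers have no optimized (.sync) integration, so this uses the
            # AWS SDK integration. The call returns as soon as the crawler
            # starts; poll glue:getCrawler in a Wait/Choice loop if the ETL job
            # must see the refreshed catalog.
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {
                "Name": "data-lake-crawler"
            },
            # Discard the empty API response so that $.database and $.table
            # remain available to RunGlueETLJob.
            "ResultPath": None,
            "Next": "RunGlueETLJob"
        },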
        "RunGlueETLJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "transform-to-curated",
                "Arguments": {
                    "--source_database.$": "$.database",
                    "--source_table.$": "$.table"
                }
            },
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "Next": "HandleETLError"
                }
            ],
            "Next": "NotifySuccess"
        },
        "HandleETLError": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:handle-etl-error",
            "Next": "NotifyFailure"
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-notifications",
                "Message": "Data pipeline completed successfully"
            },
            "End": True
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-notifications",
                # Assumes handle-etl-error returns an object with an "error" field
                "Message.$": "States.Format('Pipeline failed: {}', $.error)"
            },
            # Ending here marks the execution as succeeded; route to a Fail
            # state instead if the execution status should reflect the error.
            "End": True
        }
    }
}

# Create state machine
# sfn_client.create_state_machine(
#     name='data-pipeline-orchestrator',
#     definition=json.dumps(state_machine_definition),
#     roleArn='arn:aws:iam::123456789012:role/StepFunctionsExecutionRole'
# )
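
Before calling `create_state_machine`, it can help to check locally that every transition points at a state that exists. The helper below is a minimal sketch of such a check; its name and the tiny sample definition are hypothetical, and it inspects only top-level `Next`, `Default`, `Choices`, and `Catch` targets (nested Parallel or Map branches are not covered).

```python
def find_dangling_targets(definition: dict) -> set[str]:
    """Return transition targets that are not defined in "States"."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        # Choice rules and Catch handlers carry their own "Next" targets.
        for rule in state.get("Choices", []) + state.get("Catch", []):
            if "Next" in rule:
                targets.add(rule["Next"])
    return targets - set(states)


# Small sample with a deliberate typo in the Catch target.
sample = {
    "StartAt": "RunJob",
    "States": {
        "RunJob": {
            "Type": "Task",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailre"}],
            "Next": "Done",
        },
        "NotifyFailure": {"Type": "Succeed"},
        "Done": {"Type": "Succeed"},
    },
}

dangling = find_dangling_targets(sample)
print(dangling)  # {'NotifyFailre'}
```

An empty result means every referenced state name resolves; it does not validate the rest of the Amazon States Language syntax.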

---

## 7. Key Takeaways

### Summary of AWS Data Engineering Services

| Service | Primary Use Case | When to Use |
|---------|-----------------|-------------|
| **S3** | Data Lake Storage | Always - foundation of AWS data architecture |
| **Glue** | ETL & Catalog | Serverless ETL, schema management, data discovery |
| **EMR** | Big Data Processing | Large-scale Spark/Hadoop, complex analytics |
| **Kinesis** | Real-time Streaming | Event-driven, real-time analytics, IoT |
| **Lambda** | Serverless Compute | Event-driven processing, lightweight ETL |

### Decision Framework

```
┌─────────────────────────────────────────────────────────────────────┐
│                    Choosing the Right Service                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Data Volume?                                                        │
│  ├── Small (< 1GB)        → Lambda + S3                             │
│  ├── Medium (1-100GB)     → Glue ETL                                │
│  └── Large (> 100GB)      → EMR or Glue with Spark                  │
│                                                                      │
│  Processing Type?                                                    │
│  ├── Batch                → Glue or EMR                             │
│  ├── Streaming            → Kinesis + Lambda/Flink                  │
│  └── Micro-batch          → Kinesis Firehose                        │
│                                                                      │
│  Latency Requirements?                                               │
│  ├── Real-time (< 1s)     → Kinesis Data Streams + Lambda           │
│  ├── Near real-time       → Kinesis Firehose                        │
│  └── Batch (> 1 min)      → Glue or EMR                             │
│                                                                      │
│  Operational Overhead?                                               │
│  ├── Minimal              → Glue, Lambda, Kinesis Firehose          │
│  ├── Moderate             → EMR Serverless                          │
│  └── Full Control         → EMR on EC2                              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
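
The volume and latency branches of the framework can also be written as simple lookups. This is only a sketch of the thresholds shown in the diagram, with "near real-time" assumed to mean anything between one second and one minute; the function names are hypothetical, and real choices usually weigh several factors at once.

```python
def suggest_by_volume(size_gb: float) -> str:
    """Map a data volume to the service suggested in the framework."""
    if size_gb < 1:
        return "Lambda + S3"
    if size_gb <= 100:
        return "Glue ETL"
    return "EMR or Glue with Spark"


def suggest_by_latency(max_latency_seconds: float) -> str:
    """Map a latency requirement to the service suggested in the framework."""
    if max_latency_seconds < 1:
        return "Kinesis Data Streams + Lambda"
    if max_latency_seconds <= 60:  # assumed bound for "near real-time"
        return "Kinesis Firehose"
    return "Glue or EMR"


print(suggest_by_volume(50))      # Glue ETL
print(suggest_by_latency(0.2))    # Kinesis Data Streams + Lambda
print(suggest_by_latency(3600))   # Glue or EMR
```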

### Key Architecture Principles

1. **Decouple Compute and Storage**: Use S3 as the central data lake, separate from processing engines
2. **Use the Right Tool**: Match service to workload (streaming vs batch, simple vs complex)
3. **Design for Failure**: Implement retry logic, dead letter queues, and monitoring
4. **Optimize Costs**: Use lifecycle policies, Spot instances, and right-size resources
5. **Security First**: Encrypt data at rest and in transit, use least privilege IAM policies
6. **Automate Everything**: Use IaC (CloudFormation, CDK, Terraform) for reproducibility
7. **Monitor and Alert**: Set up CloudWatch dashboards, alarms, and log analysis

### Additional Resources

- [AWS Data Analytics Lens](https://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/welcome.html)
- [AWS Big Data Blog](https://aws.amazon.com/blogs/big-data/)
- [AWS Data Lake Whitepaper](https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html)
- [Modern Data Architecture on AWS](https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/)