
AWS Data Pipeline - Ikerian Cloud Engineer Assessment

Overview

This project implements an end-to-end AWS data pipeline that processes retina scan patient data using S3, Lambda, and CloudWatch, all managed through Terraform infrastructure as code. The pipeline extracts patient identification information from detailed retina scan records for downstream processing.

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────────┐
│   Raw Data      │    │   Lambda         │    │   Processed Data    │
│   S3 Bucket     │───▶│   Function       │───▶│   S3 Bucket         │
│                 │    │                  │    │                     │
└─────────────────┘    └──────────────────┘    └─────────────────────┘
         │                       │                        │
         │                       ▼                        │
         │              ┌──────────────────┐              │
         │              │   CloudWatch     │              │
         └──────────────│   Logs           │◀─────────────┘
                        └──────────────────┘

Components

1. S3 Buckets

  • Raw Data Bucket: Stores original JSON files
  • Processed Data Bucket: Stores processed JSON with extracted fields

2. Lambda Function

  • Runtime: Python 3.12
  • Trigger: S3 ObjectCreated events on .json files
  • Processing: Extracts patient_id and patient_name from retina scan data
  • Output: Saves simplified patient records to the processed bucket (function resource sketched below)
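
A minimal Terraform sketch of this configuration; the handler name, archive filename, and role/layer references are assumptions based on this README, not the repo's exact module code:

resource "aws_lambda_function" "data_processor" {
  function_name = "ikerian-data-pipeline-data-processor"
  runtime       = "python3.12"
  handler       = "lambda_function.lambda_handler"            # assumed handler name
  filename      = "lambda_function.zip"                       # zipped lambda_function.py
  role          = aws_iam_role.lambda_exec.arn
  layers        = [aws_lambda_layer_version.dependencies.arn] # layer built from layers/
}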

3. IAM Role & Policies

  • Lambda Execution Role: Trust policy that allows the Lambda service to assume the role
  • S3 Permissions: Read from the raw bucket, write to the processed bucket
  • CloudWatch Permissions: Create log groups and log streams, and put log events (policy sketched below)
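
A least-privilege policy sketch matching the permissions above; the role and bucket resource names are assumptions:

resource "aws_iam_role_policy" "lambda_pipeline" {
  name = "ikerian-pipeline-lambda"
  role = aws_iam_role.lambda_exec.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # read raw JSON files
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = "${aws_s3_bucket.raw_data.arn}/*"
      },
      {
        # write processed output
        Effect   = "Allow"
        Action   = ["s3:PutObject"]
        Resource = "${aws_s3_bucket.processed_data.arn}/*"
      },
      {
        # CloudWatch logging
        Effect   = "Allow"
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}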

4. CloudWatch Logs

  • Log Group: /aws/lambda/ikerian-data-pipeline-data-processor
  • Retention: 14 days
  • Purpose: Debug and monitor Lambda execution (log group resource sketched below)
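
The log group and its 14-day retention can be pinned in Terraform roughly as follows; the resource label is an assumption:

resource "aws_cloudwatch_log_group" "lambda_logs" {
  name              = "/aws/lambda/ikerian-data-pipeline-data-processor"
  retention_in_days = 14   # matches the retention policy above
}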

Prerequisites

  • AWS CLI configured with appropriate credentials
  • Terraform >= 1.0 installed
  • Bash shell (for deployment script)

Deployment Instructions

GitHub Actions (Recommended)

  1. Fork this repository
  2. Set AWS credentials in GitHub Secrets:
    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
  3. Push to the main branch; deployment runs automatically

Local Deployment

# Run deployment script
chmod +x deploy.sh
./deploy.sh

Manual Deployment

# Build layer
mkdir -p layers/python/lib/python3.12/site-packages
pip install -r layers/requirements.txt -t layers/python/lib/python3.12/site-packages

# Deploy infrastructure
terraform init   # use -backend-config=env/backend.hcl if deploying with the provided backend configs
terraform plan
terraform apply

# Upload test data
aws s3 cp ikerian_sample.json s3://$(terraform output -raw raw_data_bucket_name)/

File Structure

├── main.tf              # Main Terraform configuration
├── providers.tf         # AWS provider configuration
├── variables.tf         # Terraform variables
├── locals.tf            # Local values and naming
├── outputs.tf           # Terraform outputs
├── backend.tf           # Backend resources
├── lambda_function.py   # Lambda function code
├── ikerian_sample.json  # Sample retina scan data
├── deploy.sh            # Deployment script
├── env/                 # Environment configurations
│   ├── backend.hcl      # Dev backend config
│   └── prod-backend.hcl # Prod backend config
├── .github/workflows/   # GitHub Actions
│   └── deploy.yml       # CI/CD pipeline
├── layers/              # Lambda layers
│   ├── requirements.txt # Python dependencies
│   └── python/          # Layer packages
├── modules/             # Terraform modules
│   ├── s3/              # S3 buckets module
│   ├── lambda/          # Lambda function module
│   ├── iam/             # IAM roles module
│   ├── layers/          # Lambda layers module
│   └── README.md        # Module documentation
└── README.md            # This documentation

Sample Data Format

Input (Raw Data)

[
  {
    "patient_id": "A12345",
    "patient_name": "Ikerian A",
    "scan_date": "2025-01-01",
    "retina_thickness_microns": 275,
    "optic_disc_cup_ratio": 0.4,
    "diagnosis": "normal"
  },
  {
    "patient_id": "B67890",
    "patient_name": "Ikerian B",
    "scan_date": "2025-02-15",
    "retina_thickness_microns": 305,
    "optic_disc_cup_ratio": 0.6,
    "diagnosis": "suspected glaucoma"
  }
]

Output (Processed Data)

[
  {
    "patient_id": "A12345",
    "patient_name": "Ikerian A"
  },
  {
    "patient_id": "B67890",
    "patient_name": "Ikerian B"
  }
]

Monitoring and Debugging

  1. CloudWatch Logs: Check /aws/lambda/ikerian-data-pipeline-data-processor
  2. S3 Events: Monitor bucket notifications in S3 console
  3. Lambda Metrics: View execution metrics in Lambda console

Approach and Design Decisions

1. Modular Architecture with for_each

  • Choice: Terraform modules with for_each for multi-environment deployment
  • Rationale: Single configuration manages multiple environments (dev, prod)
  • Structure: A pipeline module orchestrates the S3, Lambda, and IAM modules per environment (sketched below)
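
A hedged sketch of that pattern; the module source, variable name, and environment keys are assumptions, not the repo's exact layout:

module "pipeline" {
  source   = "./modules/pipeline"     # hypothetical orchestrating module
  for_each = toset(["dev", "prod"])   # one instance per environment

  environment = each.key              # each instance gets its own S3, Lambda, and IAM resources
}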

2. Infrastructure as Code

  • Choice: Terraform for infrastructure management
  • Rationale: Version control, reproducibility, and declarative configuration

3. Event-Driven Architecture

  • Choice: S3 event triggers for Lambda
  • Rationale: Automatic processing when new files are uploaded, serverless and cost-effective

4. Security Best Practices

  • Principle of Least Privilege: IAM policies grant only necessary permissions
  • Resource Isolation: Separate buckets for raw and processed data
  • Logging: Comprehensive CloudWatch logging for audit and debugging

5. Error Handling

  • Lambda: try/except blocks with detailed error logging
  • S3: Versioning enabled for data recovery (sketched below)
  • CloudWatch: Structured logging for troubleshooting
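
For illustration, bucket versioning in Terraform looks roughly like this; the bucket reference is an assumption:

resource "aws_s3_bucket_versioning" "raw_data" {
  bucket = aws_s3_bucket.raw_data.id

  versioning_configuration {
    status = "Enabled"   # keeps prior object versions for recovery
  }
}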

Assumptions Made

  1. Data Format: Input files are valid JSON arrays
  2. Required Fields: All records contain patient_id and patient_name
  3. File Size: Files are small enough for Lambda memory limits
  4. Region: Default deployment to us-east-1
  5. Naming: Unique bucket names using random suffixes

Challenges Faced and Solutions

1. S3 Bucket Naming Conflicts

  • Challenge: S3 bucket names must be globally unique
  • Solution: Added a random suffix to bucket names (sketched below)
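
Sketch of the fix, assuming the hashicorp/random provider and a hypothetical name prefix:

resource "random_id" "bucket_suffix" {
  byte_length = 4   # 8 hex characters
}

resource "aws_s3_bucket" "raw_data" {
  bucket = "ikerian-data-pipeline-raw-${random_id.bucket_suffix.hex}"
}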

2. Lambda Permissions

  • Challenge: Complex IAM permissions for S3 and CloudWatch
  • Solution: Separate IAM policy with specific resource ARNs

3. S3 Event Configuration

  • Challenge: Circular dependency between S3 notification and Lambda
  • Solution: Used depends_on in Terraform to enforce the resource creation order (sketched below)
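
Sketch of the ordering fix, with assumed resource names: S3 must be granted permission to invoke the function before the bucket notification that references it is created.

resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.data_processor.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw_data.arn
}

resource "aws_s3_bucket_notification" "raw_events" {
  bucket = aws_s3_bucket.raw_data.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.data_processor.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".json"   # only trigger on JSON uploads
  }

  depends_on = [aws_lambda_permission.allow_s3]   # breaks the circular ordering
}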

Cleanup

To destroy all resources (S3 buckets must be emptied first unless force_destroy is enabled):

terraform destroy

Cost Optimization

  • Lambda: Pay-per-execution model
  • S3: Standard storage class; lifecycle transitions to Standard-IA could further reduce costs for infrequently accessed data
  • CloudWatch: 14-day log retention to minimize costs

Future Enhancements

  1. Data Validation: Add schema validation for input JSON
  2. Error Handling: Dead letter queue for failed processing
  3. Monitoring: CloudWatch alarms for failures
  4. Scaling: SQS for high-volume processing
  5. Security: S3 bucket encryption and VPC endpoints (encryption sketched below)
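
As a hedged starting point for item 5, default bucket encryption could look like the following; the resource label and the SSE-S3 choice are assumptions:

resource "aws_s3_bucket_server_side_encryption_configuration" "raw_data" {
  bucket = aws_s3_bucket.raw_data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"   # SSE-S3; use "aws:kms" with a CMK for stricter control
    }
  }
}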
