diff --git a/_posts/2025-09-22-building-scalable-data-warehouse-backup-system.md b/_posts/2025-09-22-building-scalable-data-warehouse-backup-system.md
new file mode 100644
index 0000000..3a9f68d
--- /dev/null
+++ b/_posts/2025-09-22-building-scalable-data-warehouse-backup-system.md
@@ -0,0 +1,136 @@
+---
+layout: post
+title: "Building a Scalable Data Warehouse Backup System with AWS"
+tags:
+  - Data-warehouse
+  - Terraform
+  - AWS
+  - Deltalake
+  - Backup
+team: Core Infrastructure
+author: Oleh Motrunych
+---
+
+We designed and implemented a scalable, cost-optimized backup system for our S3 data warehouses that runs automatically on a monthly schedule. The system handles petabytes of data across multiple databases and uses a hybrid approach: AWS Lambda for small workloads and ECS Fargate for larger ones.
+
+At its core, the pipeline performs incremental backups, copying only new or changed parquet files while always preserving delta logs, which dramatically reduces cost and runtime compared to full backups. Data is validated against S3 Inventory manifests, processed in parallel, and stored in Glacier for long-term retention.
+
+To avoid data loss while still reducing storage costs, we also implemented a safe deletion workflow. Files that were successfully backed up, are no longer present in the source, and are older than 90 days are tagged for lifecycle-based cleanup instead of being deleted immediately.
+
+This approach gives us reliability, efficiency, and safety: backups scale seamlessly from small to massive datasets, compute resources are right-sized, and storage is continuously optimized.
+
+![Open Data Warehouse Backup System diagram](../files/backup_system_diagram.png)
+
+---
+
+### Our old approach had problems:
+
+- The same files were copied on every run, which was wasteful from a cost perspective
+- Lambda timed out when manifests were too large
+- Redundant backups inflated storage costs
+- Orphaned files piled up with no clean deletion process
+
+---
+
+### We needed a systematic, automated, and cost-effective way to:
+
+- Run monthly backups across all databases
+- Scale from small jobs to massive datasets
+- Handle incremental changes instead of full copies
+- Safely clean up old data without risking data loss
+
+---
+
+### The Design at a Glance
+
+We built a hybrid backup architecture on AWS primitives:
+
+- Step Functions – orchestrates the workflow
+- Lambda – lightweight jobs for small manifests
+- ECS Fargate – heavy jobs with no timeout constraints
+- S3 + S3 Batch Operations – storage and bulk copy/delete operations
+- EventBridge – monthly scheduler
+- Glue, CloudWatch, Secrets Manager – reporting, monitoring, secrets
+- IAM – roles and access control
+
+The core idea: never re-copy files that are already in the backup, always copy the delta logs, and route work by size, so small manifests run in Lambda and large ones in ECS.
+
+---
+
+### How It Works
+
+1. **Database Discovery**
+
+   - Parse S3 Inventory manifests
+   - Identify database prefixes
+   - Queue them for processing (up to 40 databases in parallel)
+
+2. **Manifest Validation**
+
+   Before we touch any data, we validate:
+
+   - JSON structure
+   - All CSV parts are present
+   - File counts and checksums match
+
+   If the inventory is incomplete → wait up to 30 minutes before retrying
+
+3. **Routing by Size**
+
+   - ≤ 25 files → Lambda (15-minute timeout, 5 GB memory)
+   - More than 25 files → ECS Fargate (16 GB RAM, 4 vCPUs, no runtime limit)
+
+4. **Incremental Backup Logic** (a Python sketch of this step follows the list)
+
+   - Load the exclusion set from the last backup
+   - Always include delta logs
+   - Back up only parquet files that are not yet in the backup
+   - Ignore non-STANDARD storage classes (we use Intelligent-Tiering; over time files can move to Glacier tiers and we don’t want to touch them)
+   - Process the inventory CSVs in parallel (20 workers)
+   - Emit a new manifest plus a checksum for integrity
+
+5. **Copying Files**
+
+   - Feed the manifests into S3 Batch Operations
+   - Copy objects into Glacier storage
+
+6. **Safe Deletion** (a sketch of the tagging and lifecycle rule also follows the list)
+
+   - Compare the current inventory against the incremental manifests
+   - Identify parquet files that:
+     - were backed up successfully
+     - no longer exist in the source
+     - are older than 90 days
+   - Tag them for deletion instead of deleting immediately
+   - The actual deletion is performed through an S3 lifecycle configuration, which is the cost-optimized way to expire objects
+   - Tags include timestamps for rollback and audit
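+
+To make the filtering and routing rules in steps 3 and 4 concrete, here is a minimal Python sketch of how that logic can be expressed. It is illustrative only: the column indices, the `LAMBDA_FILE_LIMIT` constant, and the helper names are assumptions rather than our production code, which takes its configuration from the Step Functions input.
+
+```python
+import csv
+import gzip
+import io
+from concurrent.futures import ThreadPoolExecutor
+
+import boto3  # assumed to be available in the Lambda/ECS runtime
+
+s3 = boto3.client("s3")
+
+# Illustrative values; the real system reads these from configuration.
+LAMBDA_FILE_LIMIT = 25  # <= 25 files -> Lambda, otherwise ECS Fargate
+CSV_WORKERS = 20        # parallel inventory-CSV readers per manifest
+
+
+def filter_inventory_csv(bucket, key, already_backed_up):
+    """Return the object keys from one inventory CSV part that still need a backup."""
+    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
+    text = gzip.decompress(raw).decode("utf-8")  # inventory CSV parts are gzip-compressed
+    to_copy = []
+    for row in csv.reader(io.StringIO(text)):
+        obj_key, storage_class = row[1], row[5]  # column order depends on the inventory configuration
+        if "_delta_log/" in obj_key:
+            to_copy.append(obj_key)              # delta logs are always copied
+        elif storage_class != "STANDARD":
+            continue                             # Intelligent-Tiering may already have archived this object
+        elif obj_key.endswith(".parquet") and obj_key not in already_backed_up:
+            to_copy.append(obj_key)              # new or changed parquet file
+    return to_copy
+
+
+def build_incremental_manifest(bucket, csv_keys, already_backed_up):
+    """Process the inventory CSV parts in parallel and collect every key that needs a copy."""
+    with ThreadPoolExecutor(max_workers=CSV_WORKERS) as pool:
+        parts = pool.map(lambda k: filter_inventory_csv(bucket, k, already_backed_up), csv_keys)
+    return [key for part in parts for key in part]
+
+
+def choose_compute(file_count):
+    """Size-based routing from step 3: small manifests run in Lambda, large ones in ECS."""
+    return "lambda" if file_count <= LAMBDA_FILE_LIMIT else "ecs-fargate"
+```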
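+
+The tagging half of step 6 can be sketched the same way. Again, this is a hedged illustration, not our exact implementation: the `backup-pending-deletion` tag key, the 30-day grace period, and the function names are assumptions, and in practice the lifecycle rule is more naturally declared once in Terraform than set from code. The pattern is what matters: tag first, and let an S3 lifecycle rule perform the deletion.
+
+```python
+from datetime import datetime, timezone
+
+import boto3  # assumed to be available in the Lambda/ECS runtime
+
+s3 = boto3.client("s3")
+
+# Hypothetical tag; the real key/value and retention window live in configuration.
+DELETION_TAG = {"Key": "backup-pending-deletion", "Value": "true"}
+
+
+def tag_for_deletion(bucket, key):
+    """Mark a stale, already-backed-up object for lifecycle cleanup instead of deleting it."""
+    s3.put_object_tagging(
+        Bucket=bucket,
+        Key=key,
+        Tagging={
+            "TagSet": [
+                DELETION_TAG,
+                # Timestamp tag so the decision can be audited or rolled back.
+                {"Key": "tagged-at", "Value": datetime.now(timezone.utc).isoformat()},
+            ]
+        },
+    )
+
+
+def ensure_cleanup_lifecycle_rule(bucket):
+    """Expire tagged objects via a lifecycle rule, so S3 itself performs the deletion."""
+    # NOTE: this call replaces the bucket's entire lifecycle configuration, so in a real
+    # setup the rule would be merged with existing rules or managed declaratively.
+    s3.put_bucket_lifecycle_configuration(
+        Bucket=bucket,
+        LifecycleConfiguration={
+            "Rules": [
+                {
+                    "ID": "expire-tagged-backup-orphans",
+                    "Status": "Enabled",
+                    "Filter": {"Tag": DELETION_TAG},
+                    "Expiration": {"Days": 30},  # grace period before S3 removes the object
+                }
+            ]
+        },
+    )
+```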
+
+---
+
+### Error Handling & Resilience
+
+- Retries with exponential backoff and jitter
+- Strict validation before any delete
+- Exclusion lists ensure delta logs are never deleted
+- ECS tasks run in private subnets with VPC endpoints
+
+---
+
+### Cost & Performance Gains
+
+- Incremental logic = no redundant transfers
+- Lifecycle rules = new backups move to Glacier, old ones are cleaned up
+- Size-based routing = Lambda for cheap jobs, ECS for heavy jobs
+- Parallelism = 20 CSV workers per manifest, 40 databases at once
+
+---
+
+### Lessons Learned
+
+- Always validate manifests before processing
+- Never delete immediately → tagging first saved us money
+- Thresholds matter: 25 files was our sweet spot
+- CloudWatch + Slack reports gave us visibility we didn’t have before
+
+---
+
+### Conclusion
+
+By combining Lambda, ECS Fargate, and S3 Batch Operations, we’ve built a resilient backup system that scales from small to massive datasets. Instead of repeatedly copying the same files, the system now performs truly incremental backups, capturing only new or changed parquet files while always preserving delta logs. This not only minimizes costs but also dramatically reduces runtime.
+
+Our safe deletion workflow ensures that stale data is removed without risk, using lifecycle-based cleanup rather than immediate deletion. Together, these design choices give us reliable backups, efficient scaling, and continuous optimization of storage. What used to be expensive, error-prone, and manual is now automated, predictable, and cost-effective.
diff --git a/files/backup_system_diagram.png b/files/backup_system_diagram.png
new file mode 100644
index 0000000..8410a5c
Binary files /dev/null and b/files/backup_system_diagram.png differ