A complete solution for predictive auto-scaling of EC2 instances, driven by ML-predicted traffic patterns, for the Saleor e-commerce platform.
This project implements an intelligent auto-scaling system that uses Machine Learning to predict traffic surges on an e-commerce website and proactively scales EC2 instances using AWS Auto Scaling. The system learns from historical CloudWatch metrics to predict future load and adjusts infrastructure before demand spikes occur.
- Predictive Scaling: ML model predicts traffic patterns and scales infrastructure proactively
- Saleor E-Commerce: Production-ready Saleor platform for testing
- Load Testing: Comprehensive Locust scripts to simulate realistic traffic patterns
- Infrastructure as Code: Complete Terraform configuration for AWS resources
- Real-time Monitoring: CloudWatch dashboards and metrics collection
- Automated Deployment: Lambda function for continuous prediction and scaling
┌─────────────────────────────────────────────────────────────────┐
│                      Users / Load Testing                       │
│                            (Locust)                             │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Application Load Balancer                    │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│               Auto Scaling Group (EC2 Instances)                │
│                        Saleor E-Commerce                        │
└─────────────┬──────────────────────────────┬────────────────────┘
              │                              │
              ▼                              ▼
┌──────────────────────┐        ┌─────────────────────────┐
│    RDS PostgreSQL    │        │    ElastiCache Redis    │
└──────────────────────┘        └─────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                      ML & Monitoring Layer                      │
├─────────────────────────────────────────────────────────────────┤
│  CloudWatch → Lambda (ML Prediction) → Auto Scaling API         │
│                          │                                      │
│                 S3 (Model Storage)                              │
└─────────────────────────────────────────────────────────────────┘
predict/
├── terraform/                    # Infrastructure as Code
│   ├── main.tf                   # Provider and backend configuration
│   ├── variables.tf              # Input variables
│   ├── outputs.tf                # Output values
│   ├── vpc.tf                    # VPC, subnets, networking
│   ├── security_groups.tf        # Security groups
│   ├── ec2.tf                    # EC2, ASG, ALB configuration
│   ├── rds.tf                    # Database and cache
│   ├── monitoring.tf             # CloudWatch, Lambda, EventBridge
│   ├── user_data.sh              # EC2 initialization script
│   └── terraform.tfvars.example
│
├── ml-model/                     # Machine Learning components
│   ├── predictive_scaler.py      # ML model class
│   ├── train_model.py            # Training script
│   └── requirements.txt
│
├── lambda/                       # AWS Lambda function
│   ├── lambda_function.py        # Lambda handler
│   ├── build.sh                  # Build script (Linux/Mac)
│   ├── build.ps1                 # Build script (Windows)
│   └── requirements.txt
│
├── locust/                       # Load testing
│   ├── locustfile.py             # Main load test scenarios
│   ├── traffic_patterns.py       # Traffic surge patterns
│   ├── run_test.sh               # Test runner (Linux/Mac)
│   ├── run_test.ps1              # Test runner (Windows)
│   └── requirements.txt
│
├── scripts/                      # Utility scripts
└── README.md
- AWS Account with appropriate permissions
- Terraform >= 1.0
- Python >= 3.9
- AWS CLI configured with credentials
- SSH Key Pair created in AWS
# Configure AWS CLI
aws configure
# Verify configuration
aws sts get-caller-identity
cd terraform
# Copy and edit variables
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings
# Initialize Terraform
terraform init
# Review the plan
terraform plan
# Deploy infrastructure
terraform apply
Required variables in terraform.tfvars:
aws_region = "us-east-1"
project_name = "saleor-predictive-scaling"
key_pair_name = "your-key-pair-name"
db_password = "YourSecurePassword123!"
On Windows (PowerShell):
cd lambda
.\build.ps1
# Upload to Lambda (will be done automatically by Terraform)
On Linux/Mac:
cd lambda
chmod +x build.sh
./build.sh
cd ml-model
# Install dependencies
pip install -r requirements.txt
# Configure AWS credentials
export AWS_PROFILE=your-profile
export ASG_NAME=your-asg-name
export S3_BUCKET=your-s3-bucket
# Wait for some data to be collected (run for a few hours first)
# Then train the model
python train_model.py
On Windows (PowerShell):
cd locust
# Install dependencies
pip install -r requirements.txt
# Get the ALB URL from the Terraform output (the state lives in ../terraform, not in locust/)
$ALB_URL = terraform "-chdir=../terraform" output -raw alb_url
# Run different test scenarios
.\run_test.ps1 -TargetHost $ALB_URL -Scenario baseline
.\run_test.ps1 -TargetHost $ALB_URL -Scenario surge
.\run_test.ps1 -TargetHost $ALB_URL -Scenario flash-sale
# Or use web UI
.\run_test.ps1 -TargetHost $ALB_URL -Scenario web
On Linux/Mac:
cd locust
# Install dependencies
pip install -r requirements.txt
# Get the ALB URL from the Terraform output (the state lives in ../terraform, not in locust/)
ALB_URL=$(terraform -chdir=../terraform output -raw alb_url)
# Run different test scenarios
./run_test.sh $ALB_URL baseline
./run_test.sh $ALB_URL surge
./run_test.sh $ALB_URL flash-sale
# Or use web UI
./run_test.sh $ALB_URL web
- baseline - Steady baseline traffic (20 users)
- surge - Gradual traffic surge simulation
- sinusoidal - Cyclical traffic pattern (15-min cycles)
- step - Step-wise load increases
- flash-sale - Sudden traffic spike simulation
- web - Interactive web UI (http://localhost:8089)
- Baseline: Simulates normal e-commerce browsing
- Surge: Gradual ramp from 10 to 200 users over 12 minutes
- Flash Sale: Sudden spike to 300 users during 3-minute window
- Sinusoidal: Continuous wave pattern (10-150 users)
- Step Load: Incremental increases every 3 minutes
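The surge and sinusoidal patterns above are specified closely enough to sketch as plain functions of elapsed time. This is a minimal illustration, not the contents of traffic_patterns.py: the function names, the hold at 200 users after the ramp, and the phase of the sine wave are assumptions. In Locust, values like these would typically be returned from a `LoadTestShape.tick()` method together with a spawn rate.

```python
import math

SURGE_START_USERS = 10
SURGE_END_USERS = 200
SURGE_DURATION_S = 12 * 60      # ramp over 12 minutes

SINE_MIN_USERS = 10
SINE_MAX_USERS = 150
SINE_PERIOD_S = 15 * 60         # one full cycle every 15 minutes


def surge_users(elapsed_s: float) -> int:
    """Linear ramp from 10 to 200 users; holds at 200 afterwards (assumption)."""
    progress = min(max(elapsed_s / SURGE_DURATION_S, 0.0), 1.0)
    return round(SURGE_START_USERS + (SURGE_END_USERS - SURGE_START_USERS) * progress)


def sinusoidal_users(elapsed_s: float) -> int:
    """Sine wave between 10 and 150 users, starting at the midpoint (assumption)."""
    midpoint = (SINE_MAX_USERS + SINE_MIN_USERS) / 2
    amplitude = (SINE_MAX_USERS - SINE_MIN_USERS) / 2
    return round(midpoint + amplitude * math.sin(2 * math.pi * elapsed_s / SINE_PERIOD_S))
```

For example, `surge_users(360)` gives the halfway point of the ramp, and `sinusoidal_users(225)` gives the peak of the first cycle.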
- Request count (ALB metrics)
- Average response time
- CPU utilization
- Hour of day
- Day of week
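As a rough sketch of how these five inputs could be combined into one feature row, the helper below takes a timestamp plus the three CloudWatch readings. The function name, argument names, and column order are hypothetical and may not match predictive_scaler.py; the day of week follows Python's convention (Monday = 0).

```python
from datetime import datetime, timezone
from typing import List


def build_feature_row(
    timestamp: datetime,
    request_count: float,
    avg_response_time_s: float,
    cpu_utilization_pct: float,
) -> List[float]:
    """Return [requests, response time, CPU, hour of day, day of week]."""
    return [
        float(request_count),
        float(avg_response_time_s),
        float(cpu_utilization_pct),
        float(timestamp.hour),        # 0-23
        float(timestamp.weekday()),   # Monday = 0 ... Sunday = 6
    ]


# Example reading taken on a Monday afternoon (UTC); values are made up.
row = build_feature_row(
    datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc),
    request_count=1200,
    avg_response_time_s=0.25,
    cpu_utilization_pct=55.0,
)
```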
Random Forest Regressor with the following characteristics:
- 100 estimators
- Max depth: 10
- Feature scaling with StandardScaler
- Collect historical CloudWatch metrics
- Extract features and align timestamps
- Normalize features using StandardScaler
- Train Random Forest model
- Save model to S3 for Lambda function
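The scaling and training steps can be sketched with a scikit-learn pipeline that uses the hyperparameters listed above (100 estimators, max depth 10, StandardScaler). This is an illustration under assumptions, not the project's train_model.py: the data here is synthetic, the target (instances needed) is a made-up function of request count, and collecting metrics from CloudWatch and uploading the model to S3 are left out.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_capacity_model(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Fit StandardScaler + RandomForestRegressor as one pipeline."""
    model = make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=100, max_depth=10, random_state=seed),
    )
    model.fit(X, y)
    return model


# Synthetic stand-in for historical CloudWatch data (assumption).
rng = np.random.default_rng(0)
n_samples = 200
requests = rng.uniform(100, 5000, n_samples)
response_time = rng.uniform(0.05, 1.0, n_samples)
cpu = rng.uniform(5, 95, n_samples)
hour = rng.integers(0, 24, n_samples)
day = rng.integers(0, 7, n_samples)
X = np.column_stack([requests, response_time, cpu, hour, day])

# Hypothetical target: one instance per 500 requests, bounded to 1-10.
y = np.clip(np.ceil(requests / 500), 1, 10)

model = train_capacity_model(X, y)
predictions = model.predict(X[:3])
```

Keeping the scaler and the regressor in one pipeline means the Lambda function only has to load a single object and cannot accidentally skip the normalization step at prediction time.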
- Lambda triggered every 5 minutes (EventBridge)
- Collect current metrics
- Load ML model from S3
- Predict required capacity
- Scale ASG if needed (threshold: ±1 instance)
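The final step of this flow, deciding whether to act on a prediction, might look like the sketch below. It assumes the threshold means "act only when the target differs from the current capacity by at least one instance", rounds fractional predictions up, and clamps to the ASG bounds. The function name and defaults are hypothetical, and the actual call to the Auto Scaling API (for example boto3's `set_desired_capacity`) is deliberately omitted.

```python
import math
from typing import Optional


def decide_capacity(
    predicted_instances: float,
    current_capacity: int,
    min_instances: int = 1,
    max_instances: int = 10,
    threshold: int = 1,
) -> Optional[int]:
    """Return the new desired capacity, or None if no scaling is needed."""
    # Round up so a fractional prediction never under-provisions (assumption).
    target = math.ceil(predicted_instances)
    # Never leave the configured ASG bounds.
    target = max(min_instances, min(max_instances, target))
    if abs(target - current_capacity) >= threshold:
        return target
    return None
```

Returning `None` for "no change" keeps the caller simple: it only touches the Auto Scaling Group when the function returns a number.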
Access the dashboard to view:
- Request count and response times
- Auto Scaling metrics (desired/actual capacity)
- CPU utilization
- Database metrics
Application Metrics:
- ALB Request Count
- Target Response Time
- Healthy/Unhealthy Host Count
Infrastructure Metrics:
- EC2 CPU Utilization
- Memory Usage
- Network I/O
- Disk Usage
Auto Scaling Metrics:
- Group Desired Capacity
- Group In-Service Instances
- Scaling Activities
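For the ALB metrics in particular, a query against CloudWatch could be assembled as in the sketch below. It only builds the keyword arguments for `get_metric_statistics` and does not call AWS. The helper name, the one-hour lookback, the 5-minute period, and the load balancer dimension value are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def alb_metric_query(
    load_balancer_dim: str,
    metric_name: str,
    statistic: str,
    end: datetime,
    lookback: timedelta = timedelta(hours=1),
    period_s: int = 300,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer_dim}],
        "StartTime": end - lookback,
        "EndTime": end,
        "Period": period_s,
        "Statistics": [statistic],
    }


# Hypothetical load balancer dimension value.
ALB_DIM = "app/example-alb/0123456789abcdef"
end_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

request_count_query = alb_metric_query(ALB_DIM, "RequestCount", "Sum", end_time)
response_time_query = alb_metric_query(ALB_DIM, "TargetResponseTime", "Average", end_time)
```

With boto3 installed and credentials configured, each dictionary could be passed as `boto3.client("cloudwatch").get_metric_statistics(**request_count_query)`.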
| Variable | Description | Default |
|---|---|---|
| aws_region | AWS region | us-east-1 |
| instance_type | EC2 instance type | t3.medium |
| min_size | Min ASG instances | 1 |
| max_size | Max ASG instances | 10 |
| desired_capacity | Initial capacity | 2 |
| db_instance_class | RDS instance class | db.t3.medium |
| Variable | Description |
|---|---|
| ASG_NAME | Auto Scaling Group name |
| S3_BUCKET | S3 bucket for model storage |
| SNS_TOPIC_ARN | SNS topic for notifications |
| MIN_INSTANCES | Minimum instances |
| MAX_INSTANCES | Maximum instances |
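One way the Lambda handler might read these variables is sketched below. The `ScalerConfig` class and `load_config` function are hypothetical, and two things are assumptions rather than documented behavior: that SNS_TOPIC_ARN is optional, and that MIN_INSTANCES and MAX_INSTANCES fall back to the Terraform defaults of 1 and 10.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScalerConfig:
    asg_name: str
    s3_bucket: str
    sns_topic_arn: Optional[str]
    min_instances: int
    max_instances: int


def load_config(env: Optional[Mapping[str, str]] = None) -> ScalerConfig:
    """Read scaler settings from environment variables (or a given mapping)."""
    env = os.environ if env is None else env
    min_instances = int(env.get("MIN_INSTANCES", "1"))    # assumed default
    max_instances = int(env.get("MAX_INSTANCES", "10"))   # assumed default
    if min_instances > max_instances:
        raise ValueError("MIN_INSTANCES must not exceed MAX_INSTANCES")
    return ScalerConfig(
        asg_name=env["ASG_NAME"],              # required: KeyError if missing
        s3_bucket=env["S3_BUCKET"],            # required: KeyError if missing
        sns_topic_arn=env.get("SNS_TOPIC_ARN"),
        min_instances=min_instances,
        max_instances=max_instances,
    )


example_config = load_config({"ASG_NAME": "example-asg", "S3_BUCKET": "example-bucket"})
```

Accepting a mapping argument makes the function easy to exercise locally without touching the real process environment.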
- Database Credentials: Store in AWS Secrets Manager
- SSL/TLS: Enable HTTPS on ALB (update Terraform)
- Security Groups: Restrict SSH access to specific IPs
- IAM Roles: Follow least privilege principle
- Encryption: Enable at rest and in transit
Estimated Monthly Costs (us-east-1):
- EC2 (2x t3.medium): ~$60
- RDS (db.t3.medium Multi-AZ): ~$120
- ElastiCache (cache.t3.medium): ~$50
- ALB: ~$20
- Data Transfer: ~$10
- Total: ~$260/month
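The total can be checked, and extended to other instance counts, with a few lines of arithmetic. The sketch below assumes EC2 cost scales linearly at about $30 per t3.medium (half of the ~$60 quoted for two instances) and that the other line items stay fixed. Both are simplifications; actual AWS bills depend on usage and current pricing.

```python
# Approximate monthly figures (USD) taken from the estimate above.
FIXED_MONTHLY_USD = {
    "rds_multi_az": 120,
    "elasticache": 50,
    "alb": 20,
    "data_transfer": 10,
}
EC2_MONTHLY_USD_PER_INSTANCE = 60 / 2   # ~$60 for 2x t3.medium


def estimate_monthly_cost(instance_count: int) -> float:
    """Rough monthly cost for a given number of EC2 instances."""
    return EC2_MONTHLY_USD_PER_INSTANCE * instance_count + sum(FIXED_MONTHLY_USD.values())


baseline_cost = estimate_monthly_cost(2)    # the documented 2-instance setup
peak_cost = estimate_monthly_cost(10)       # if the ASG ran at max_size all month
```

Under these assumptions, running permanently at the default max_size of 10 would roughly double the baseline estimate, which is why scaling back down after a surge matters for cost.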
Cost Reduction Tips:
- Use Spot Instances for non-production
- Enable Auto Scaling to scale down during low traffic
- Use smaller instance types for testing
- Clean up resources when not in use
- Check CloudWatch Logs for Lambda function
- Verify model exists in S3
- Check IAM permissions
- Ensure EventBridge rule is enabled
- Verify ALB DNS is correct
- Check security groups allow traffic
- Ensure Saleor is running on EC2 instances
- Check EC2 instance health in target group
- Ensure sufficient historical data (>10 data points)
- Check AWS credentials and permissions
- Verify S3 bucket exists and is accessible
- Check CloudWatch metrics are being collected
- Saleor Documentation
- Locust Documentation
- Terraform AWS Provider
- AWS Auto Scaling
- scikit-learn Documentation
Contributions are welcome! Areas for improvement:
- Add more sophisticated ML models (LSTM, Prophet)
- Implement multi-region deployment
- Add cost optimization algorithms
- Enhance monitoring dashboards
- Add automated testing
This project is for educational and demonstration purposes.
- Initial Data Collection: The system needs to run for several hours/days to collect enough data for meaningful predictions
- Model Retraining: Retrain the model periodically to adapt to changing traffic patterns
- Testing: Always test in a non-production environment first
- Cleanup: Run terraform destroy to remove all resources and avoid charges
- Deploy infrastructure with Terraform
- Let the system run and collect metrics for 24-48 hours
- Train the initial ML model
- Run load tests to validate auto-scaling
- Monitor and adjust thresholds as needed
- Retrain model weekly/monthly for best results
Built with: Terraform • AWS • Python • Saleor • Locust • scikit-learn