Predictive Auto-Scaling for E-Commerce Platform

A complete solution for predictive auto-scaling of EC2 instances, driven by ML-predicted traffic patterns, for the Saleor e-commerce platform.

🎯 Project Overview

This project implements an intelligent auto-scaling system that uses Machine Learning to predict traffic surges on an e-commerce website and proactively scales EC2 instances using AWS Auto Scaling. The system learns from historical CloudWatch metrics to predict future load and adjusts infrastructure before demand spikes occur.

Key Features

Predictive Scaling: ML model predicts traffic patterns and scales infrastructure proactively
Saleor E-Commerce: Production-ready Saleor platform for testing
Load Testing: Comprehensive Locust scripts to simulate realistic traffic patterns
Infrastructure as Code: Complete Terraform configuration for AWS resources
Real-time Monitoring: CloudWatch dashboards and metrics collection
Automated Deployment: Lambda function for continuous prediction and scaling

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Users / Load Testing                     │
│                           (Locust)                               │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Application Load Balancer                      │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│              Auto Scaling Group (EC2 Instances)                  │
│                     Saleor E-Commerce                            │
└─────────────┬──────────────────────────────┬────────────────────┘
              │                              │
              ▼                              ▼
┌──────────────────────┐        ┌─────────────────────────┐
│   RDS PostgreSQL     │        │  ElastiCache Redis      │
└──────────────────────┘        └─────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                      ML & Monitoring Layer                       │
├─────────────────────────────────────────────────────────────────┤
│  CloudWatch → Lambda (ML Prediction) → Auto Scaling API         │
│                         ↓                                        │
│                   S3 (Model Storage)                             │
└─────────────────────────────────────────────────────────────────┘

📁 Project Structure

predict/
├── terraform/              # Infrastructure as Code
│   ├── main.tf            # Provider and backend configuration
│   ├── variables.tf       # Input variables
│   ├── outputs.tf         # Output values
│   ├── vpc.tf             # VPC, subnets, networking
│   ├── security_groups.tf # Security groups
│   ├── ec2.tf             # EC2, ASG, ALB configuration
│   ├── rds.tf             # Database and cache
│   ├── monitoring.tf      # CloudWatch, Lambda, EventBridge
│   ├── user_data.sh       # EC2 initialization script
│   └── terraform.tfvars.example
│
├── ml-model/              # Machine Learning components
│   ├── predictive_scaler.py  # ML model class
│   ├── train_model.py        # Training script
│   └── requirements.txt
│
├── lambda/                # AWS Lambda function
│   ├── lambda_function.py    # Lambda handler
│   ├── build.sh              # Build script (Linux/Mac)
│   ├── build.ps1             # Build script (Windows)
│   └── requirements.txt
│
├── locust/                # Load testing
│   ├── locustfile.py         # Main load test scenarios
│   ├── traffic_patterns.py   # Traffic surge patterns
│   ├── run_test.sh           # Test runner (Linux/Mac)
│   ├── run_test.ps1          # Test runner (Windows)
│   └── requirements.txt
│
├── scripts/               # Utility scripts
└── README.md

🚀 Quick Start

Prerequisites

AWS Account with appropriate permissions
Terraform >= 1.0
Python >= 3.9
AWS CLI configured with credentials
SSH Key Pair created in AWS

Step 1: Configure AWS Credentials

# Configure AWS CLI
aws configure

# Verify configuration
aws sts get-caller-identity

Step 2: Deploy Infrastructure

cd terraform

# Copy and edit variables
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings

# Initialize Terraform
terraform init

# Review the plan
terraform plan

# Deploy infrastructure
terraform apply

Required variables in terraform.tfvars:

aws_region      = "us-east-1"
project_name    = "saleor-predictive-scaling"
key_pair_name   = "your-key-pair-name"
db_password     = "YourSecurePassword123!"

Step 3: Build and Deploy Lambda Function

On Windows (PowerShell):

cd lambda
.\build.ps1

# Upload to Lambda (will be done automatically by Terraform)

On Linux/Mac:

cd lambda
chmod +x build.sh
./build.sh

Step 4: Train the ML Model

cd ml-model

# Install dependencies
pip install -r requirements.txt

# Select the AWS profile, then point the script at your ASG and model bucket
export AWS_PROFILE=your-profile
export ASG_NAME=your-asg-name
export S3_BUCKET=your-s3-bucket

# Wait for some data to be collected (run for a few hours first)
# Then train the model
python train_model.py

Step 5: Run Load Tests

On Windows (PowerShell):

cd locust

# Install dependencies
pip install -r requirements.txt

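# Get the ALB URL from the Terraform output (run from the terraform directory, where the state is initialized)
Push-Location ..\terraform
$ALB_URL = terraform output -raw alb_url
Pop-Location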

# Run different test scenarios
.\run_test.ps1 -TargetHost $ALB_URL -Scenario baseline
.\run_test.ps1 -TargetHost $ALB_URL -Scenario surge
.\run_test.ps1 -TargetHost $ALB_URL -Scenario flash-sale

# Or use web UI
.\run_test.ps1 -TargetHost $ALB_URL -Scenario web

On Linux/Mac:

cd locust

# Install dependencies
pip install -r requirements.txt

# Get the ALB URL from the Terraform output (the Terraform working directory is ../terraform)
ALB_URL=$(terraform -chdir=../terraform output -raw alb_url)

# Run different test scenarios
./run_test.sh $ALB_URL baseline
./run_test.sh $ALB_URL surge
./run_test.sh $ALB_URL flash-sale

# Or use web UI
./run_test.sh $ALB_URL web

🧪 Load Test Scenarios

Available Scenarios

baseline - Steady baseline traffic (20 users)
surge - Gradual traffic surge simulation
sinusoidal - Cyclical traffic pattern (15-min cycles)
step - Step-wise load increases
flash-sale - Sudden traffic spike simulation
web - Interactive web UI (http://localhost:8089)

Traffic Patterns

Baseline: Simulates normal e-commerce browsing
Surge: Gradual ramp from 10 to 200 users over 12 minutes
Flash Sale: Sudden spike to 300 users during a 3-minute window
Sinusoidal: Continuous wave pattern (10-150 users)
Step Load: Incremental increases every 3 minutes
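
The patterns above can be read as functions from elapsed time to a target user count. The sketch below is illustrative and is not taken from traffic_patterns.py: the linear ramp shape, the flash-sale start offset, and the reuse of the 20-user baseline are assumptions. Values like these could be returned from the tick() method of a Locust LoadTestShape subclass.

```python
import math


def surge_users(t_seconds: float) -> int:
    """Ramp from 10 to 200 users over 12 minutes, then hold (linear ramp is an assumption)."""
    ramp_seconds = 12 * 60
    fraction = min(max(t_seconds / ramp_seconds, 0.0), 1.0)
    return round(10 + (200 - 10) * fraction)


def sinusoidal_users(t_seconds: float) -> int:
    """Wave between 10 and 150 users with a 15-minute cycle."""
    period_seconds = 15 * 60
    midpoint = (10 + 150) / 2
    amplitude = (150 - 10) / 2
    return round(midpoint + amplitude * math.sin(2 * math.pi * t_seconds / period_seconds))


def flash_sale_users(t_seconds: float, start_seconds: float = 300, baseline: int = 20) -> int:
    """Hold a baseline, then jump to 300 users for a 3-minute window.

    start_seconds and baseline are hypothetical parameters chosen for illustration.
    """
    if start_seconds <= t_seconds < start_seconds + 3 * 60:
        return 300
    return baseline
```

Keeping the shapes as plain functions makes them easy to unit test before wiring them into a load test.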

🤖 ML Model Details

Features Used

Request count (ALB metrics)
Average response time
CPU utilization
Hour of day
Day of week
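
As a rough illustration, the five features could be assembled into one row per timestamp as shown below. The function name, feature names, and feature order are assumptions and may differ from predictive_scaler.py.

```python
from datetime import datetime, timezone

# Hypothetical feature order; the project's own model may use a different one.
FEATURE_NAMES = [
    "request_count",
    "avg_response_time",
    "cpu_utilization",
    "hour_of_day",
    "day_of_week",
]


def build_features(request_count, avg_response_time, cpu_utilization, timestamp):
    """Return one feature row; the time features are derived from the timestamp."""
    return [
        float(request_count),
        float(avg_response_time),
        float(cpu_utilization),
        float(timestamp.hour),       # 0-23
        float(timestamp.weekday()),  # Monday = 0, Sunday = 6
    ]


# Example: a Friday afternoon sample (values are made up)
sample_time = datetime(2024, 1, 5, 14, 30, tzinfo=timezone.utc)
features = build_features(1200, 0.25, 63.5, sample_time)
```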

Model Type

Random Forest Regressor with the following characteristics:

100 estimators
Max depth: 10
Feature scaling with StandardScaler
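
A minimal sketch of a model with these characteristics, using a scikit-learn pipeline. The training data here is synthetic and the target formula is made up purely so the example runs; the real model is trained on CloudWatch history.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_model(random_state=0):
    """StandardScaler followed by a Random Forest with 100 trees and max depth 10."""
    return make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=100, max_depth=10, random_state=random_state),
    )


# Synthetic rows: request count, response time, CPU %, hour of day, day of week
rng = np.random.default_rng(0)
X = rng.uniform(low=[0, 0.05, 5, 0, 0], high=[5000, 2.0, 95, 23, 6], size=(200, 5))
y = 1 + X[:, 0] / 1000 + X[:, 2] / 50  # made-up "required instances" target

model = make_model().fit(X, y)
prediction = model.predict(X[:1])  # array with one predicted capacity value
```

Bundling the scaler and the forest in one pipeline means the same preprocessing is applied at training and at prediction time.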

Training Process

Collect historical CloudWatch metrics
Extract features and align timestamps
Normalize features using StandardScaler
Train Random Forest model
Save model to S3 for Lambda function
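
The timestamp alignment step can be sketched as an inner join across metric series: only timestamps present in every series are kept. This is an assumed approach for illustration; train_model.py may align data differently (for example by resampling or interpolating).

```python
def align_series(series_by_metric):
    """Keep only timestamps that every metric has, returning (timestamp, values) rows.

    series_by_metric maps a metric name to a {timestamp: value} dict.
    Values in each row follow the insertion order of the metric names.
    """
    if not series_by_metric:
        return []
    names = list(series_by_metric)
    common = set.intersection(*(set(points) for points in series_by_metric.values()))
    return [(ts, [series_by_metric[name][ts] for name in names]) for ts in sorted(common)]


# Made-up sample data: the CPU series is missing the 14:00 data point
metrics = {
    "request_count": {
        "2024-01-05T14:00": 1200,
        "2024-01-05T14:05": 1350,
        "2024-01-05T14:10": 1500,
    },
    "cpu_utilization": {
        "2024-01-05T14:05": 61.0,
        "2024-01-05T14:10": 70.5,
    },
}
rows = align_series(metrics)
```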

Prediction Flow

Lambda triggered every 5 minutes (EventBridge)
Collect current metrics
Load ML model from S3
Predict required capacity
Scale ASG if needed (threshold: ±1 instance)
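
The final decision step might look like the sketch below. It assumes the ±1 threshold means "act only when the target differs from the current desired capacity by at least one instance", and that fractional predictions are rounded up; neither detail is confirmed by lambda_function.py. The function name and parameters are hypothetical.

```python
import math


def target_capacity(predicted, current, min_instances, max_instances, threshold=1):
    """Return the desired capacity to set, or the current value if no change is needed."""
    # Round up so a fractional prediction never under-provisions, then clamp to the limits.
    wanted = max(min_instances, min(max_instances, math.ceil(predicted)))
    if abs(wanted - current) >= threshold:
        return wanted
    return current
```

In a Lambda handler, a changed value would then be applied to the group, for example through the Auto Scaling API's SetDesiredCapacity action.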

📊 Monitoring

CloudWatch Dashboard

Access the dashboard to view:

Request count and response times
Auto Scaling metrics (desired/actual capacity)
CPU utilization
Database metrics

Metrics Collected

Application Metrics:

ALB Request Count
Target Response Time
Healthy/Unhealthy Host Count

Infrastructure Metrics:

EC2 CPU Utilization
Memory Usage
Network I/O
Disk Usage

Auto Scaling Metrics:

Group Desired Capacity
Group In-Service Instances
Scaling Activities
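
For reference, several of these metrics can be requested from CloudWatch in one GetMetricData call. The sketch below only builds the query structures (it makes no AWS calls); the query IDs, the 300-second period, and the chosen statistics are assumptions, and the dimension values are placeholders.

```python
def metric_query(query_id, namespace, metric_name, dimension_name, dimension_value, stat, period=300):
    """Build one entry for the MetricDataQueries parameter of GetMetricData."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric_name,
                "Dimensions": [{"Name": dimension_name, "Value": dimension_value}],
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": True,
    }


def build_queries(alb_dimension, asg_name):
    """Queries for ALB traffic, EC2 CPU, and ASG capacity."""
    return [
        metric_query("requests", "AWS/ApplicationELB", "RequestCount", "LoadBalancer", alb_dimension, "Sum"),
        metric_query("latency", "AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", alb_dimension, "Average"),
        metric_query("cpu", "AWS/EC2", "CPUUtilization", "AutoScalingGroupName", asg_name, "Average"),
        metric_query("desired", "AWS/AutoScaling", "GroupDesiredCapacity", "AutoScalingGroupName", asg_name, "Average"),
    ]


# Placeholder identifiers; substitute the values from your own deployment
queries = build_queries("app/my-alb/0123456789abcdef", "my-asg")
```

The resulting list can be passed as MetricDataQueries to the boto3 CloudWatch client's get_metric_data, together with StartTime and EndTime.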

🔧 Configuration

Terraform Variables

| Variable | Description | Default |
|---|---|---|
| `aws_region` | AWS region | us-east-1 |
| `instance_type` | EC2 instance type | t3.medium |
| `min_size` | Min ASG instances | 1 |
| `max_size` | Max ASG instances | 10 |
| `desired_capacity` | Initial capacity | 2 |
| `db_instance_class` | RDS instance class | db.t3.medium |

Lambda Environment Variables

| Variable | Description |
|---|---|
| `ASG_NAME` | Auto Scaling Group name |
| `S3_BUCKET` | S3 bucket for model storage |
| `SNS_TOPIC_ARN` | SNS topic for notifications |
| `MIN_INSTANCES` | Minimum instances |
| `MAX_INSTANCES` | Maximum instances |
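
A sketch of how a handler could read these variables. The ScalerConfig class, the choice to treat SNS_TOPIC_ARN as optional, and the fallback limits of 1 and 10 (mirroring the Terraform defaults for min_size and max_size) are assumptions, not the project's actual code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScalerConfig:
    asg_name: str
    s3_bucket: str
    sns_topic_arn: Optional[str]
    min_instances: int
    max_instances: int


def load_config(env: Mapping[str, str] = os.environ) -> ScalerConfig:
    """Read settings from environment variables, failing fast on missing required ones."""
    config = ScalerConfig(
        asg_name=env["ASG_NAME"],    # required: KeyError if missing
        s3_bucket=env["S3_BUCKET"],  # required: KeyError if missing
        sns_topic_arn=env.get("SNS_TOPIC_ARN"),
        min_instances=int(env.get("MIN_INSTANCES", "1")),    # assumed default
        max_instances=int(env.get("MAX_INSTANCES", "10")),   # assumed default
    )
    if config.min_instances > config.max_instances:
        raise ValueError("MIN_INSTANCES must not exceed MAX_INSTANCES")
    return config
```

Passing the environment as a mapping keeps the function testable without touching the real process environment.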

🔒 Security Best Practices

Database Credentials: Store in AWS Secrets Manager
SSL/TLS: Enable HTTPS on ALB (update Terraform)
Security Groups: Restrict SSH access to specific IPs
IAM Roles: Follow least privilege principle
Encryption: Enable at rest and in transit

💰 Cost Optimization

Estimated Monthly Costs (us-east-1):

EC2 (2x t3.medium): ~$60
RDS (db.t3.medium Multi-AZ): ~$120
ElastiCache (cache.t3.medium): ~$50
ALB: ~$20
Data Transfer: ~$10
Total: ~$260/month
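
The total can be checked, and roughly extrapolated to other fleet sizes, with a few lines of arithmetic. The figures are the estimates listed above; the extrapolation assumes EC2 cost grows linearly with instance count while the other items stay fixed, which is a simplification (data transfer, for instance, would likely grow with traffic).

```python
# Monthly estimates from the list above, in USD
monthly_costs_usd = {
    "ec2_2x_t3_medium": 60,
    "rds_multi_az": 120,
    "elasticache": 50,
    "alb": 20,
    "data_transfer": 10,
}
total = sum(monthly_costs_usd.values())


def estimate_monthly_cost(instance_count):
    """Rough estimate: per-instance EC2 cost times count, plus the fixed items."""
    per_instance = monthly_costs_usd["ec2_2x_t3_medium"] / 2
    fixed = total - monthly_costs_usd["ec2_2x_t3_medium"]
    return per_instance * instance_count + fixed
```

Under these assumptions, running continuously at the default max_size of 10 instances would come to roughly $500/month.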

Cost Reduction Tips:

Use Spot Instances for non-production
Enable Auto Scaling to scale down during low traffic
Use smaller instance types for testing
Clean up resources when not in use

🐛 Troubleshooting

Lambda Function Not Scaling

Check CloudWatch Logs for Lambda function
Verify model exists in S3
Check IAM permissions
Ensure EventBridge rule is enabled

Load Tests Failing

Verify ALB DNS is correct
Check security groups allow traffic
Ensure Saleor is running on EC2 instances
Check EC2 instance health in target group

Model Training Errors

Ensure sufficient historical data (>10 data points)
Check AWS credentials and permissions
Verify S3 bucket exists and is accessible
Check CloudWatch metrics are being collected

🤝 Contributing

Contributions are welcome! Areas for improvement:

Add more sophisticated ML models (LSTM, Prophet)
Implement multi-region deployment
Add cost optimization algorithms
Enhance monitoring dashboards
Add automated testing

📝 License

This project is for educational and demonstration purposes.

⚠️ Important Notes

Initial Data Collection: The system needs at least a few hours of collected metrics before a model can be trained, and ideally 24-48 hours before predictions become meaningful
Model Retraining: Retrain the model periodically to adapt to changing traffic patterns
Testing: Always test in a non-production environment first
Cleanup: Run terraform destroy to remove all resources and avoid charges

🔄 Next Steps

Deploy infrastructure with Terraform
Let the system run and collect metrics for 24-48 hours
Train the initial ML model
Run load tests to validate auto-scaling
Monitor and adjust thresholds as needed
Retrain model weekly/monthly for best results

Built with: Terraform • AWS • Python • Saleor • Locust • scikit-learn
