Predictive Auto-Scaling for E-Commerce Platform


A complete solution for predictive auto-scaling of EC2 instances, driven by ML-predicted traffic patterns, for the Saleor e-commerce platform.


🎯 Project Overview

This project implements an intelligent auto-scaling system that uses Machine Learning to predict traffic surges on an e-commerce website and proactively scales EC2 instances using AWS Auto Scaling. The system learns from historical CloudWatch metrics to predict future load and adjusts infrastructure before demand spikes occur.

Key Features

  • Predictive Scaling: ML model predicts traffic patterns and scales infrastructure proactively
  • Saleor E-Commerce: Production-ready Saleor platform for testing
  • Load Testing: Comprehensive Locust scripts to simulate realistic traffic patterns
  • Infrastructure as Code: Complete Terraform configuration for AWS resources
  • Real-time Monitoring: CloudWatch dashboards and metrics collection
  • Automated Deployment: Lambda function for continuous prediction and scaling

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         Users / Load Testing                     │
│                           (Locust)                               │
└────────────────────────┬─────────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────────┐
│                   Application Load Balancer                      │
└────────────────────────┬─────────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────────┐
│              Auto Scaling Group (EC2 Instances)                  │
│                     Saleor E-Commerce                            │
└─────────────┬──────────────────────────────┬─────────────────────┘
              │                              │
              ▼                              ▼
┌──────────────────────┐        ┌─────────────────────────┐
│   RDS PostgreSQL     │        │  ElastiCache Redis      │
└──────────────────────┘        └─────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                      ML & Monitoring Layer                       │
├──────────────────────────────────────────────────────────────────┤
│  CloudWatch → Lambda (ML Prediction) → Auto Scaling API          │
│                         ↓                                        │
│                   S3 (Model Storage)                             │
└──────────────────────────────────────────────────────────────────┘

📁 Project Structure

predict/
├── terraform/              # Infrastructure as Code
│   ├── main.tf            # Provider and backend configuration
│   ├── variables.tf       # Input variables
│   ├── outputs.tf         # Output values
│   ├── vpc.tf             # VPC, subnets, networking
│   ├── security_groups.tf # Security groups
│   ├── ec2.tf             # EC2, ASG, ALB configuration
│   ├── rds.tf             # Database and cache
│   ├── monitoring.tf      # CloudWatch, Lambda, EventBridge
│   ├── user_data.sh       # EC2 initialization script
│   └── terraform.tfvars.example
│
├── ml-model/              # Machine Learning components
│   ├── predictive_scaler.py  # ML model class
│   ├── train_model.py        # Training script
│   └── requirements.txt
│
├── lambda/                # AWS Lambda function
│   ├── lambda_function.py    # Lambda handler
│   ├── build.sh              # Build script (Linux/Mac)
│   ├── build.ps1             # Build script (Windows)
│   └── requirements.txt
│
├── locust/                # Load testing
│   ├── locustfile.py         # Main load test scenarios
│   ├── traffic_patterns.py   # Traffic surge patterns
│   ├── run_test.sh           # Test runner (Linux/Mac)
│   ├── run_test.ps1          # Test runner (Windows)
│   └── requirements.txt
│
├── scripts/               # Utility scripts
└── README.md

🚀 Quick Start

Prerequisites

  1. AWS Account with appropriate permissions
  2. Terraform >= 1.0
  3. Python >= 3.9
  4. AWS CLI configured with credentials
  5. SSH Key Pair created in AWS

Step 1: Configure AWS Credentials

# Configure AWS CLI
aws configure

# Verify configuration
aws sts get-caller-identity

Step 2: Deploy Infrastructure

cd terraform

# Copy and edit variables
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings

# Initialize Terraform
terraform init

# Review the plan
terraform plan

# Deploy infrastructure
terraform apply

Required variables in terraform.tfvars:

aws_region      = "us-east-1"
project_name    = "saleor-predictive-scaling"
key_pair_name   = "your-key-pair-name"
db_password     = "YourSecurePassword123!"

Step 3: Build and Deploy Lambda Function

On Windows (PowerShell):

cd lambda
.\build.ps1

# Upload to Lambda (will be done automatically by Terraform)

On Linux/Mac:

cd lambda
chmod +x build.sh
./build.sh

Step 4: Train the ML Model

cd ml-model

# Install dependencies
pip install -r requirements.txt

# Select the AWS profile and the resources to train against
export AWS_PROFILE=your-profile
export ASG_NAME=your-asg-name
export S3_BUCKET=your-s3-bucket

# Wait for some data to be collected (run for a few hours first)
# Then train the model
python train_model.py

Step 5: Run Load Tests

On Windows (PowerShell):

cd locust

# Install dependencies
pip install -r requirements.txt

# Get the ALB URL from Terraform output (the state lives in ../terraform)
$ALB_URL = terraform "-chdir=../terraform" output -raw alb_url

# Run different test scenarios
.\run_test.ps1 -TargetHost $ALB_URL -Scenario baseline
.\run_test.ps1 -TargetHost $ALB_URL -Scenario surge
.\run_test.ps1 -TargetHost $ALB_URL -Scenario flash-sale

# Or use web UI
.\run_test.ps1 -TargetHost $ALB_URL -Scenario web

On Linux/Mac:

cd locust

# Install dependencies
pip install -r requirements.txt

# Get the ALB URL from Terraform output (the state lives in ../terraform)
ALB_URL=$(terraform -chdir=../terraform output -raw alb_url)

# Run different test scenarios
./run_test.sh $ALB_URL baseline
./run_test.sh $ALB_URL surge
./run_test.sh $ALB_URL flash-sale

# Or use web UI
./run_test.sh $ALB_URL web

🧪 Load Test Scenarios

Available Scenarios

  1. baseline - Steady baseline traffic (20 users)
  2. surge - Gradual traffic surge simulation
  3. sinusoidal - Cyclical traffic pattern (15-min cycles)
  4. step - Step-wise load increases
  5. flash-sale - Sudden traffic spike simulation
  6. web - Interactive web UI (http://localhost:8089)

Traffic Patterns

  • Baseline: Simulates normal e-commerce browsing
  • Surge: Gradual ramp from 10 to 200 users over 12 minutes
  • Flash Sale: Sudden spike to 300 users within a 3-minute window
  • Sinusoidal: Continuous wave pattern (10-150 users)
  • Step Load: Incremental increases every 3 minutes
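
The patterns above can be sketched as pure functions of elapsed time. This is a minimal illustration, not the contents of traffic_patterns.py: the function names, the linear ramp shape, and the flash-sale start time (assumed to be 5 minutes in) are assumptions. Only the user counts and durations come from the lists above. In Locust, functions like these would typically back the tick() method of a LoadTestShape subclass.

```python
import math


def surge_users(elapsed_s: float) -> int:
    """Linear ramp from 10 to 200 users over 12 minutes, then hold at 200."""
    ramp_s = 12 * 60
    fraction = min(max(elapsed_s / ramp_s, 0.0), 1.0)
    return round(10 + (200 - 10) * fraction)


def sinusoidal_users(elapsed_s: float) -> int:
    """Wave between 10 and 150 users with a 15-minute period."""
    low, high, period_s = 10, 150, 15 * 60
    midpoint = (low + high) / 2    # 80 users
    amplitude = (high - low) / 2   # 70 users
    phase = 2 * math.pi * elapsed_s / period_s
    return round(midpoint + amplitude * math.sin(phase))


def flash_sale_users(elapsed_s: float, baseline: int = 20, start_s: float = 300) -> int:
    """Baseline traffic with a jump to 300 users for a 3-minute window.

    start_s is an illustrative assumption; baseline matches the
    20-user baseline scenario.
    """
    in_window = start_s <= elapsed_s < start_s + 3 * 60
    return 300 if in_window else baseline
```

For example, surge_users(360) gives the halfway point of the ramp, and flash_sale_users(400) falls inside the spike window.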

🤖 ML Model Details

Features Used

  • Request count (ALB metrics)
  • Average response time
  • CPU utilization
  • Hour of day
  • Day of week
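
As a rough sketch, one feature row could be assembled like this. The function name, the feature order, and the choice to derive hour and weekday in UTC are assumptions for illustration; the actual layout lives in predictive_scaler.py.

```python
from datetime import datetime, timezone

# Hypothetical column order for one training or prediction row.
FEATURE_NAMES = [
    "request_count",
    "response_time",
    "cpu_utilization",
    "hour_of_day",
    "day_of_week",
]


def build_features(timestamp, request_count, response_time, cpu_utilization):
    """Return one feature row in FEATURE_NAMES order.

    timestamp must be timezone-aware; it is converted to UTC before the
    calendar features are extracted (an assumption of this sketch).
    """
    ts = timestamp.astimezone(timezone.utc)
    return [
        float(request_count),
        float(response_time),
        float(cpu_utilization),
        float(ts.hour),
        float(ts.weekday()),  # Monday = 0 ... Sunday = 6
    ]


row = build_features(
    datetime(2024, 1, 1, 13, 5, tzinfo=timezone.utc), 1200, 0.25, 41.5
)
```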

Model Type

Random Forest Regressor with the following characteristics:

  • 100 estimators
  • Max depth: 10
  • Feature scaling with StandardScaler
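
A minimal scikit-learn sketch of this configuration follows. The pipeline layout, the build_model name, the random seed, and the synthetic data are assumptions; only the estimator count, the depth limit, and the use of StandardScaler come from the list above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_model(random_state: int = 42):
    """StandardScaler followed by a 100-tree, depth-10 random forest."""
    return make_pipeline(
        StandardScaler(),
        RandomForestRegressor(
            n_estimators=100, max_depth=10, random_state=random_state
        ),
    )


# Synthetic stand-in data: 200 rows with five feature columns.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 5))
y = 1 + 9 * X[:, 0]  # made-up target: capacity driven by the first column

model = build_model()
model.fit(X, y)
predictions = model.predict(X[:3])
```

Scaling does not change the splits a random forest learns, so the StandardScaler step mainly keeps the pipeline interchangeable with scale-sensitive models.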

Training Process

  1. Collect historical CloudWatch metrics
  2. Extract features and align timestamps
  3. Normalize features using StandardScaler
  4. Train Random Forest model
  5. Save model to S3 for Lambda function
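
Step 2 is the least obvious part: each CloudWatch metric arrives as its own series, and rows are only usable when every metric has a value for the same timestamp. One simple way to do this, assuming each series is a dict keyed by timestamp (the align_series helper is hypothetical, not taken from train_model.py):

```python
def align_series(*series):
    """Keep only the timestamps that appear in every series.

    Each series maps a timestamp to a value. The result is a list of
    (timestamp, [value_from_series_0, value_from_series_1, ...]) rows,
    sorted by timestamp.
    """
    if not series:
        return []
    common = set(series[0])
    for s in series[1:]:
        common &= set(s)
    return [(ts, [s[ts] for s in series]) for ts in sorted(common)]


# Timestamps are seconds since an arbitrary start, in 5-minute steps.
requests = {0: 100, 300: 120, 600: 90}
cpu = {300: 40.0, 600: 35.5, 900: 50.0}
rows = align_series(requests, cpu)
```

Dropping unmatched timestamps is the simplest policy; interpolating missing values is an alternative when data is sparse.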

Prediction Flow

  1. Lambda triggered every 5 minutes (EventBridge)
  2. Collect current metrics
  3. Load ML model from S3
  4. Predict required capacity
  5. Scale ASG if needed (threshold: ±1 instance)
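
The decision in steps 4 and 5 might look like the sketch below. The decide_capacity name, rounding the prediction up, and clamping to the configured bounds are assumptions about how lambda_function.py behaves, not a copy of it.

```python
import math


def decide_capacity(predicted, current, min_instances, max_instances, threshold=1):
    """Return the new desired capacity, or None when no change is needed.

    The prediction is rounded up (to avoid under-provisioning), clamped to
    [min_instances, max_instances], and acted on only when it differs from
    the current capacity by at least `threshold` instances.
    """
    target = max(min_instances, min(max_instances, math.ceil(predicted)))
    if abs(target - current) < threshold:
        return None
    return target
```

A handler would pass any non-None result to the Auto Scaling API, for example through boto3's set_desired_capacity call with the group name from ASG_NAME.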

📊 Monitoring

CloudWatch Dashboard

Access the dashboard to view:

  • Request count and response times
  • Auto Scaling metrics (desired/actual capacity)
  • CPU utilization
  • Database metrics

Metrics Collected

Application Metrics:

  • ALB Request Count
  • Target Response Time
  • Healthy/Unhealthy Host Count

Infrastructure Metrics:

  • EC2 CPU Utilization
  • Memory Usage
  • Network I/O
  • Disk Usage

Auto Scaling Metrics:

  • Group Desired Capacity
  • Group In-Service Instances
  • Scaling Activities
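
To pull one of these metrics programmatically, the query can be built as plain keyword arguments for CloudWatch's get_metric_statistics operation. This is a sketch: the helper name, the 24-hour window, and the example load balancer identifier are assumptions, while the namespace, metric name, and dimension name are the standard ones for ALB request counts.

```python
from datetime import datetime, timedelta, timezone


def request_count_query(load_balancer, end_time, hours=24, period_s=300):
    """Build keyword arguments for a CloudWatch get_metric_statistics call."""
    return {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period_s,       # 5-minute buckets
        "Statistics": ["Sum"],    # total requests per bucket
    }


# Hypothetical load balancer identifier, for illustration only.
kwargs = request_count_query(
    "app/example-alb/0123456789abcdef",
    datetime(2024, 1, 2, tzinfo=timezone.utc),
)
```

With boto3 installed and credentials configured, the result would be passed as boto3.client("cloudwatch").get_metric_statistics(**kwargs).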

🔧 Configuration

Terraform Variables

| Variable          | Description        | Default      |
|-------------------|--------------------|--------------|
| aws_region        | AWS region         | us-east-1    |
| instance_type     | EC2 instance type  | t3.medium    |
| min_size          | Min ASG instances  | 1            |
| max_size          | Max ASG instances  | 10           |
| desired_capacity  | Initial capacity   | 2            |
| db_instance_class | RDS instance class | db.t3.medium |

Lambda Environment Variables

| Variable      | Description                 |
|---------------|-----------------------------|
| ASG_NAME      | Auto Scaling Group name     |
| S3_BUCKET     | S3 bucket for model storage |
| SNS_TOPIC_ARN | SNS topic for notifications |
| MIN_INSTANCES | Minimum instances           |
| MAX_INSTANCES | Maximum instances           |
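
One way a handler could read these variables is sketched below. The ScalerConfig class, the load_config helper, and the fallback values (borrowed from the min_size and max_size Terraform defaults) are assumptions; which variables are actually optional depends on lambda_function.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalerConfig:
    asg_name: str
    s3_bucket: str
    sns_topic_arn: str
    min_instances: int
    max_instances: int


def load_config(env=os.environ):
    """Read scaler settings from a mapping of environment variables."""
    config = ScalerConfig(
        asg_name=env["ASG_NAME"],    # required: KeyError if missing
        s3_bucket=env["S3_BUCKET"],  # required: KeyError if missing
        sns_topic_arn=env.get("SNS_TOPIC_ARN", ""),
        min_instances=int(env.get("MIN_INSTANCES", "1")),   # assumed default
        max_instances=int(env.get("MAX_INSTANCES", "10")),  # assumed default
    )
    if config.min_instances > config.max_instances:
        raise ValueError("MIN_INSTANCES must not exceed MAX_INSTANCES")
    return config


config = load_config({"ASG_NAME": "example-asg", "S3_BUCKET": "example-bucket"})
```

Accepting the mapping as a parameter keeps the function easy to test without touching the real environment.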

🔒 Security Best Practices

  1. Database Credentials: Store in AWS Secrets Manager
  2. SSL/TLS: Enable HTTPS on ALB (update Terraform)
  3. Security Groups: Restrict SSH access to specific IPs
  4. IAM Roles: Follow least privilege principle
  5. Encryption: Enable at rest and in transit

💰 Cost Optimization

Estimated Monthly Costs (us-east-1):

  • EC2 (2x t3.medium): ~$60
  • RDS (db.t3.medium Multi-AZ): ~$120
  • ElastiCache (cache.t3.medium): ~$50
  • ALB: ~$20
  • Data Transfer: ~$10
  • Total: ~$260/month
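
The total is the sum of the line items. The short sketch below reproduces that arithmetic and shows how the estimate moves with the average number of running instances, assuming only the EC2 line scales and that each t3.medium costs half of the two-instance figure (about $30). The helper name is hypothetical and the figures are the rough estimates listed above, not current AWS prices.

```python
MONTHLY_COSTS_USD = {
    "EC2 (2x t3.medium)": 60,
    "RDS (db.t3.medium Multi-AZ)": 120,
    "ElastiCache (cache.t3.medium)": 50,
    "ALB": 20,
    "Data Transfer": 10,
}


def estimated_total(average_instances: float = 2) -> float:
    """Monthly estimate, scaling only the EC2 line with instance count."""
    per_instance = MONTHLY_COSTS_USD["EC2 (2x t3.medium)"] / 2  # ~$30 each
    fixed = sum(
        cost for name, cost in MONTHLY_COSTS_USD.items()
        if not name.startswith("EC2")
    )
    return fixed + per_instance * average_instances
```

With the default of two instances this returns the listed total of 260; an average of five instances would raise the estimate to 350.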

Cost Reduction Tips:

  • Use Spot Instances for non-production
  • Enable Auto Scaling to scale down during low traffic
  • Use smaller instance types for testing
  • Clean up resources when not in use

🐛 Troubleshooting

Lambda Function Not Scaling

  1. Check CloudWatch Logs for Lambda function
  2. Verify model exists in S3
  3. Check IAM permissions
  4. Ensure EventBridge rule is enabled

Load Tests Failing

  1. Verify ALB DNS is correct
  2. Check security groups allow traffic
  3. Ensure Saleor is running on EC2 instances
  4. Check EC2 instance health in target group

Model Training Errors

  1. Ensure sufficient historical data (>10 data points)
  2. Check AWS credentials and permissions
  3. Verify S3 bucket exists and is accessible
  4. Check CloudWatch metrics are being collected

🀝 Contributing

Contributions are welcome! Areas for improvement:

  • Add more sophisticated ML models (LSTM, Prophet)
  • Implement multi-region deployment
  • Add cost optimization algorithms
  • Enhance monitoring dashboards
  • Add automated testing

📝 License

This project is for educational and demonstration purposes.

⚠️ Important Notes

  1. Initial Data Collection: The system needs to run for several hours/days to collect enough data for meaningful predictions
  2. Model Retraining: Retrain the model periodically to adapt to changing traffic patterns
  3. Testing: Always test in a non-production environment first
  4. Cleanup: Run terraform destroy to remove all resources and avoid charges

🔄 Next Steps

  1. Deploy infrastructure with Terraform
  2. Let the system run and collect metrics for 24-48 hours
  3. Train the initial ML model
  4. Run load tests to validate auto-scaling
  5. Monitor and adjust thresholds as needed
  6. Retrain model weekly/monthly for best results

Built with: Terraform • AWS • Python • Saleor • Locust • scikit-learn
