CloneSquad, an AWS EC2 Pet Autoscaler

An Autoscaler for mutable architectures on AWS EC2

Mutable architectures remain very common and are encountered in most Cloud migrations. CloneSquad is a serverless autoscaler whose main goal is to get the most out of the Cloud's benefits under one constraint: it never creates or terminates EC2 instances and only starts and stops existing ones (aka Pet machines).

CloneSquad is designed to be used when AWS Auto Scaling cannot be. It also manages EC2 ALBs/NLBs, target groups and health check mechanisms.

Features and Benefits (Please also read the FAQ)

  • Scaling (see Documentation details)
    • Automatic autoscaling based on internal and/or user-defined alarms & metrics,
    • Desired instance count mode (e.g., temporarily force 100% of instances to run and allow a mutable update),
    • Always-on Availability Zone instance balancing algorithm,
    • Support for multiple target groups at the same time (associated with one or more ALBs or NLBs), with smart instance draining before shutdown,
      • Note: CloneSquad can also work without any managed target group if this does not apply to the user's use case.
    • Automatic replacement of unhealthy/unavailable/impaired instances,
    • (Optional) Vertical scaling (by leveraging instance type distribution in the fleet),
  • Cost optimization
    • Support for 'persistent' Spot instances alongside On-Demand ones in the same fleet, with configurable priorities and handling of Spot Rebalance recommendations and interruptions,
    • Smart management of t[3|4].xxx burstable instances (aka 'CPU Crediting mode', to avoid the extra cost linked to unlimited bursting),
    • (Optional) 'LightHouse' mode that automatically runs cheap instance types during low activity periods,
    • (Optional) Extra cost optimization options for non-autoscaled resources: static subfleet support for EC2 instances, RDS databases and TransferFamily servers. This allows simple on/off use cases in combination with the scheduler (see demonstration).
  • Resilience
  • Agility
    • Support for mixed instance type fleet,
    • Integrated event scheduler ('cron' or 'rate' based) for complex scaling scenarios,
    • Configuration hierarchy for complex dynamic settings,
    • API Gateway to monitor the Squad and perform some basic operations.
  • Observability
    • (Optional) CloudWatch dashboard (Note: activated by default),
    • Events & Notifications framework (Lambda/SQS/SNS targets) to react to Squad events (e.g., register a just-started instance with an external monitoring solution and/or DNS),
    • Extensive debuggability: encountered scaling issues and exceptions are exported to S3 (with contextual CloudWatch dashboard PNG snapshots).

Installing / Getting started

Pre-requisites:

  • An S3 bucket to upload the CloneSquad artifacts
  • An EC2 instance with 'aws-cli' and Docker installed, and an attached role allowing uploads to the previously defined S3 bucket

Step 1) Extract and Upload the latest CloneSquad CloudFormation template and associated artifacts

CLONESQUAD_VERSION=latest
CLONESQUAD_S3_BUCKETNAME="<your_S3_bucket_name_where_to_publish_clonesquad_artifacts>"
CLONESQUAD_S3_PREFIX="clonesquad-artifacts" # Note: Prefix MUST be non-empty
CLONESQUAD_GROUPNAME="test"
docker pull clonesquad/devkit:${CLONESQUAD_VERSION}
export AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION:-$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')}
# Upload template.yaml and CloneSquad Lambda functions to specified S3 bucket and prefix
docker run --rm clonesquad/devkit:${CLONESQUAD_VERSION} extract-version "${CLONESQUAD_S3_BUCKETNAME}" "${CLONESQUAD_S3_PREFIX}" latest s3
# Deploy a CloneSquad setup to manage GroupName=test (see Documentation for Group name concept)
aws cloudformation create-stack --template-url https://s3.amazonaws.com/${CLONESQUAD_S3_BUCKETNAME}/${CLONESQUAD_S3_PREFIX}/template.yaml \
    --stack-name MyFirstCloneSquad-${CLONESQUAD_GROUPNAME} \
    --capabilities '["CAPABILITY_NAMED_IAM","CAPABILITY_IAM","CAPABILITY_AUTO_EXPAND"]' \
    --parameters "[{\"ParameterKey\":\"GroupName\",\"ParameterValue\":\"${CLONESQUAD_GROUPNAME}\"}]"
aws cloudformation wait stack-create-complete --stack-name MyFirstCloneSquad-${CLONESQUAD_GROUPNAME}

Note: If you get the error 'fatal error: Unable to locate credentials', you may have forgotten to attach a valid IAM role to the EC2 deployment instance.

This CloneSquad deployment is now ready to manage all EC2 instances and EC2 target groups tagged with key 'clonesquad:group-name' and value 'test'.
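
The credentials issue in the note above can be caught before running Step 1. The following is a minimal sketch, not part of CloneSquad: the function name `check_aws_credentials` is hypothetical, and it only relies on the standard `aws sts get-caller-identity` call to confirm that the CLI can find usable credentials.

```shell
# Hypothetical pre-flight helper (not shipped with CloneSquad).
# 'aws sts get-caller-identity' succeeds only when the CLI can locate
# and use credentials, e.g. from the IAM role attached to the instance.
check_aws_credentials() {
    if aws sts get-caller-identity --query Arn --output text >/dev/null 2>&1; then
        echo "credentials OK"
    else
        echo "No usable AWS credentials: attach an IAM role to the deployment instance" >&2
        return 1
    fi
}
```

A possible usage is `check_aws_credentials || exit 1` at the top of a deployment script, so the failure shows up before any artifact is uploaded.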

You should see a CloneSquad-test dashboard in the CloudWatch console looking like this (but blank, without any graphs):

[Image: CloudWatch dashboard]

Step 2) Give your CloneSquad deployment some EC2 instances and target groups to manage

The next step is to create instances with the appropriate 'clonesquad:group-name' tag defined. For a quick demonstration using a fleet of 20 instances mixing Spot and On-Demand instances, go to examples/environments/demo-instance-fleet.

To deploy this demonstration, you MUST configure the CloneSquad DevKit once and run the deploy script from within this container: See instructions!

An optional next step is to also define the tag 'clonesquad:group-name' with value 'test' on one or more EC2 target groups: CloneSquad will automatically manage the membership of the previously created instances in these target groups. The demo-loadbalancers demonstration shows this.
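
For instances and target groups that already exist, the tagging described in this step can be done with the AWS CLI. The sketch below is illustrative only: the three function names are hypothetical helpers, and the resource IDs and ARNs are yours to supply. It uses the standard `aws ec2 create-tags`, `aws elbv2 add-tags` and `aws ec2 describe-instances` commands with the tag key and value given above.

```shell
# Hypothetical helpers (not part of CloneSquad) to hand existing
# resources to the deployment created in Step 1.
# The group name defaults to 'test', as used throughout this guide.

# Tag one or more EC2 instances (arguments: instance IDs).
tag_instances() {
    aws ec2 create-tags --resources "$@" \
        --tags "Key=clonesquad:group-name,Value=${CLONESQUAD_GROUPNAME:-test}"
}

# Tag one or more target groups (arguments: target group ARNs).
tag_targetgroups() {
    aws elbv2 add-tags --resource-arns "$@" \
        --tags "Key=clonesquad:group-name,Value=${CLONESQUAD_GROUPNAME:-test}"
}

# List the IDs of all instances currently carrying the group tag.
list_group_instances() {
    aws ec2 describe-instances \
        --filters "Name=tag:clonesquad:group-name,Values=${CLONESQUAD_GROUPNAME:-test}" \
        --query 'Reservations[].Instances[].InstanceId' \
        --output text
}
```

Calling `list_group_instances` after tagging is a quick way to confirm which instances the 'test' group should pick up.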

Initial Configuration

The default configuration has scale-in/scale-out active and a directive to keep at least 2 serving/healthy instances in the fleet. Vertical scaling is disabled, as is 'LightHouse' mode. In this default configuration, CloneSquad makes no distinction between Spot and On-Demand instances and manages them as a homogeneous fleet.

Better benefits can be obtained by using vertical scaling and instance type priorities.

As a general concept, CloneSquad can be configured dynamically through a DynamoDB table, or through a cascading set of YAML files located on external Web servers, in one or more S3 buckets requiring SigV4 authentication, or directly integrated within the CloneSquad deployment for maximum resiliency toward external runtime dependencies. See Configuration reference for more information.

Costs

CloneSquad uses some AWS resources that are billed at the end of the month.

Below are some rough key figures to build an estimate:

  • CloudWatch
    • Alarms: ((five_permanent_alarms) + (nb_of_serving_instances_at_a_given_time)) * 0.10$ per month
    • Dashboard: 1 x 3$ per month (can be disabled)
    • Metrics:
      • Up to 25 x CloudWatch metrics (0.10$ each) ~2.5$ per month (Metrics can be disabled individually with cloudwatch.metrics.excluded to save costs. Metrics are also disabled if not applicable).
      • GetMetricData API call cost depends heavily on the number of running EC2 instances at a given time. As a rule of thumb, assume 1k requests per hour (=~7$ per month) when the Squad is small/medium; assume more for a large Squad and/or with intense and frequent scale-out activities.
  • Lambda
    • The 'Main' Lambda function runs every 20 seconds by default for <4 seconds (~5$ per month)
  • DynamoDB
    • Should be a few $ per month depending on scaling activities. See DynamodDBConfiguration parameter to configure DynamoDB tables in PROVISIONED capacity billing model instead of default PAY_PER_REQUEST to significantly reduce these costs if needed.
  • X-Ray
    • Few cents per month (can be disabled)
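
The figures above can be combined into a quick back-of-the-envelope estimate. The sketch below is only that: the function name `estimate_monthly_cost` is hypothetical, it assumes all 25 metrics and the dashboard are enabled and a small/medium Squad for GetMetricData, and it leaves out DynamoDB and X-Ray because the list gives no precise figure for them.

```shell
# Hypothetical helper: rough monthly CloneSquad cost in USD, built only
# from the rule-of-thumb figures listed above.
# Argument: number of serving instances at a given time.
# Excluded: DynamoDB ("a few $") and X-Ray ("few cents").
estimate_monthly_cost() {
    LC_ALL=C awk -v n="$1" 'BEGIN {
        alarms    = (5 + n) * 0.10  # five permanent alarms + one per serving instance
        dashboard = 3.00            # one dashboard (can be disabled)
        metrics   = 25 * 0.10       # up to 25 CloudWatch metrics
        getmetric = 7.00            # ~1k GetMetricData requests per hour
        main_fn   = 5.00            # Main Lambda, every 20 seconds
        printf "%.2f\n", alarms + dashboard + metrics + getmetric + main_fn
    }'
}
```

For example, `estimate_monthly_cost 10` adds 1.50$ of alarms to the 17.50$ of fixed items listed above.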

WARNING: The provided demonstrations deploy EC2 instances with AWS CloudWatch Logs agents enabled, which create tens of custom metrics (RAM...). These custom metrics generate a significant part of the demonstration bill and should not be considered part of the CloneSquad cost.

Roadmap

  • Improve documentation,
  • Refine the IAM roles used by the CloneSquad Lambda functions, which are far too permissive,
  • Collect feedback from users about what they like and do not like about CloneSquad. This early release is meant to understand and validate the original concept of CloneSquad (please send feedback to jeancharlesjorel@gmail.com),
  • Think about an automatic testing capability (currently, tests are manual),
  • Implement a CI/CD pipeline for releases (beyond the existing release-everything script...),
  • There may be a cost benefit in moving the 'Main' Lambda to an ECS Fargate Spot task. With limited effort, we could make this move while keeping the 'Main' Lambda as an automatic fallback in case of Spot interruption. To be investigated.

Contributing

If you'd like to contribute, please fork the repository and use a feature branch. Pull requests are warmly welcomed.

Licensing

The code in this project is licensed under the MIT license.

Developing / Building / Releasing

See dedicated documentation.