Terraform module that creates a microservice running on Fargate or any EC2 instance type. The applications can range from general-purpose services to machine learning inference, high-performance computing, and more.
There are already some Terraform microservice modules available; however, they offer little variety in configuration and usually only support Fargate. Here you have access to all EC2 instance types with easy configuration.
```hcl
# main.tf
module "microservice" {
  source = "vistimi/microservice/aws"

  name = "microservice-example"

  bucket_env = {
    force_destroy = false
    versioning    = true
    file_key      = "file_local_name.env"
    file_path     = "file_in_bucket_name.env"
  }

  vpc = {
    id               = "my_vpc_id"
    subnet_tier_ids  = ["id_subnet_tier_1", "id_subnet_tier_2"]
    subnet_intra_ids = ["id_subnet_intra_1", "id_subnet_intra_2"]
  }

  orchestrator = {
    group = {
      name = "first"
      deployment = {
        min_size     = 1
        max_size     = 2
        desired_size = 1

        containers = [
          {
            name = "first"
            docker = {
              repository = {
                name = "ubuntu"
              }
              image = {
                tag = "latest"
              }
            }
            traffics = [
              {
                # this will redirect http:80 to http:80
                listener = {
                  # port is by default 80 with http
                  protocol = "http"
                }
                target = {
                  port              = 80
                  protocol          = "http" # if not specified, defaults to the listener protocol
                  health_check_path = "/"    # if not specified, defaults to "/"
                }
              }
            ]
            entrypoint = [
              "/bin/bash",
              "-c",
            ]
            command = [
              <<-EOT
              # ...
              EOT
            ]
            readonly_root_filesystem = false
          }
        ]
      }
      ec2 = {
        key_name       = "name_of_key_to_ssh_with"
        instance_types = ["t2.micro"]
        os             = "linux"
        os_version     = "2023"
        capacities = [{
          type = "ON_DEMAND"
        }]
      }
    }
    ecs = {
      # override default ecs behaviour
    }
  }

  tags = {}
}
```
```hcl
# provider.tf

#-------------------------------------------
#   AWS
#-------------------------------------------
provider "aws" {
  region = "us-east-1"
}

#-------------------------------------------
#   Docker
#-------------------------------------------
locals {
  ecr_address = format("%v.dkr.ecr.%v.amazonaws.com", data.aws_caller_identity.current.account_id, data.aws_region.current.name)
  # ecr_address = "public.ecr.aws"
}

data "aws_region" "current" {}
data "aws_caller_identity" "current" {}
data "aws_ecr_authorization_token" "token" {}

provider "docker" {
  registry_auth {
    address  = local.ecr_address
    username = data.aws_ecr_authorization_token.token.user_name
    password = data.aws_ecr_authorization_token.token.password
  }
}

#-------------------------------------------
#   Kubernetes
#-------------------------------------------
# provider "kubernetes" {
#   host                   = one(values(module.eks)).cluster_endpoint
#   cluster_ca_certificate = base64decode(one(values(module.eks)).cluster.certificate_authority_data)
#   exec {
#     api_version = "client.authentication.k8s.io/v1beta1"
#     command     = "aws"
#     # This requires the awscli to be installed locally where Terraform is executed
#     args = ["eks", "get-token", "--cluster-name", one(values(module.eks)).cluster.name]
#   }
# }
# provider "kubectl" {
#   host                   = one(values(module.eks)).cluster_endpoint
#   cluster_ca_certificate = base64decode(one(values(module.eks)).cluster.certificate_authority_data)
#   exec {
#     api_version = "client.authentication.k8s.io/v1beta1"
#     command     = "aws"
#     # This requires the awscli to be installed locally where Terraform is executed
#     args = ["eks", "get-token", "--cluster-name", one(values(module.eks)).cluster.name]
#   }
# }
```
🙋‍♀️ If you want to unify your infrastructure with Terraform, use this module. Terraform covers a wide range of cloud providers, hence reducing dependence on any single provider/platform.

🙋‍♀️ If you want to use other serving systems such as TorchServe or TensorFlow Serving, use this module.

Data platforms are a great way to simply and efficiently manage your AI lifecycle from training to deployment. However, they are quite pricey and only work for data applications. Some frameworks like ray.io or mlflow.org offer easy lifecycle management from a local machine to complex cloud deployments for ML projects, typically providing:

- heterogeneous clusters, consisting of different instance types
- parallelized data processing with autoscaling

🚫 If you want to do training, you are better off using a data platform or a framework. This module is not designed for training, but for inference.

The microservice offers the following features:
- Load balancer
  - HTTP(S)
  - REST/gRPC
  - Redirects
- Auto scaling
- Route53 records
- ACM
- Environment file
- CloudWatch logs
- Docker build
- ECR
- Container orchestrators
  - ECS
    - Fargate
    - EC2
      - General Purpose
      - Compute Optimized
      - Memory Optimized
      - Accelerated Computing (GPU, Inferentia, Trainium)
        - Accelerated Computing (Gaudi): not supported
      - Storage Optimized: supported, not tested
      - HPC Optimized: supported, not tested
  - EKS
    - Fargate
    - EC2
Go check the examples and the tests.
ECS | EKS
---|---
cluster | cluster
service | node-group
task | node
task-definition | deployment
container-definition | pod
```
Error: The closest matching container-instance <id> has insufficient memory available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide
```
It means that the memory given to the container, the service, or both exceeds what the instance allows. ECS itself requires a certain amount of memory to run, and that amount differs per instance. Currently only 90% of the instance memory is allocated to the containers, leaving enough overhead to avoid this problem. You can also override the memory and CPU allocation by specifying them in the containers.
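As a sketch, such an override could look like the fragment below; the `cpu` and `memory` attribute names are assumptions to verify against the module's documented inputs:

```hcl
containers = [
  {
    name = "first"
    # assumed attribute names: check the module's variables before using
    cpu    = 256 # CPU units
    memory = 512 # MiB; keep below what the instance actually has available
  }
]
```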
```
Error: Scaling activity <id>: Failed: We currently do not have sufficient <instance_type> capacity in the Availability Zone you requested (us-east-1c). Our system will be working on provisioning additional capacity. You can currently get <instance_type> capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f. Launching EC2 instance failed
```
It means that there are no instances of that type available in the requested Availability Zones. Unfortunately AWS does not always have enough capacity in some regions. A possible solution is to retry deploying the microservice until it succeeds.
```
Error: creating ELBv2 Listener <arn>: operation error Elastic Load Balancing v2: CreateListener, https response error StatusCode: 400, RequestID: <id>, DuplicateListener: A listener already exists on this port for this load balancer <arn>
```
```
Error: [WARN] A duplicate Security Group rule was found on <sg>. This may be a side effect of a now-fixed Terraform issue causing two security groups with identical attributes but different source_security_group_ids to overwrite each other in the state. See hashicorp/terraform#2376 for more information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 0.0.0.0/0, TCP, from port: <port>, to port: <port>, ALLOW" already exists
```
```
Error: creating ECS Service <name>: InvalidParameterException: Unable to Start a service that is still Draining.
```
This happens if you try to reapply after a failed deployment. Terraform will delete the service, but the task whose deployment failed will not be deleted. The solution is to stop the task manually.
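One way to do that with the AWS CLI; the cluster, service, and task identifiers are placeholders:

```shell
# list the tasks still attached to the service
aws ecs list-tasks --cluster <cluster_name> --service-name <service_name>

# stop the task that is stuck draining
aws ecs stop-task --cluster <cluster_name> --task <task_arn>
```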
```
Error: waiting for ECS Service <arn> create: timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s)
```
This means that either the service takes more than 20 minutes to become stable or the configuration is wrong. Make sure you give it enough time to deploy.
With EC2, one way to check whether it deployed correctly is to verify in the target group that it uses an ephemeral port. If you see something like this, wait:
Instance ID | Name | Port | Zone | Health status | Health status details
---|---|---|---|---|---
`<id>` | `<name>` | 8080 | us-east-1a | Unhealthy | Request timed out
You should see something like this:
Instance ID | Name | Port | Zone | Health status | Health status details
---|---|---|---|---|---
`<id>` | `<name>` | 8080 | us-east-1a | Unhealthy | Request timed out
`<id>` | `<name>` | 32771 | us-east-1a | Healthy | -
This indicates that the service did use an ephemeral port and is deployed correctly. If you see an error on the ephemeral port, it means that the service or something else is misconfigured. Check the logs for more information.
```
Error: waiting for ECS Capacity Provider <arn> to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)
```

On the AWS console:

```
Delete failed
The capacity provider cannot be deleted because it is associated with cluster: <name>. Remove the capacity provider from the cluster and try again.
```

It means that the capacity provider has changed and thus needs to be deleted. However, it is still associated with the cluster and therefore cannot be removed. You can delete it manually every time the instances change. On the AWS console, go to EC2 -> Auto Scaling Groups -> select and delete the old one (it will wait for the deletion lifecycle policy if there is one). Also go to ECS -> cluster -> <your_cluster> -> Infrastructure -> Capacity Providers -> select and delete the old one.
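The same manual cleanup can be done with the AWS CLI; all resource names below are placeholders:

```shell
# detach the old capacity provider by re-declaring only the ones to keep
aws ecs put-cluster-capacity-providers \
  --cluster <your_cluster> \
  --capacity-providers <new_capacity_provider> \
  --default-capacity-provider-strategy capacityProvider=<new_capacity_provider>

# delete the now-detached capacity provider
aws ecs delete-capacity-provider --capacity-provider <old_capacity_provider>

# delete the old auto scaling group without waiting for lifecycle hooks
aws autoscaling delete-auto-scaling-group \
  --auto-scaling-group-name <old_asg_name> --force-delete
```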
```
Error: deleting ECS Cluster <cluster_arn>: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.
```

or

```
Error: waiting for Auto Scaling Group <capacity_provider_name> delete: found resource
```
Since January 2024, ECS automatically adds a lifecycle hook:

```
name                 = "ecs-managed-draining-termination-hook"
lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
default_result       = "CONTINUE"
heartbeat_timeout    = 3600
```
This means that by default an instance will wait 1h (3600 seconds) before being terminated. Its state changes from InService, to Terminating:Wait, to Terminating:Proceed. This allows a graceful shutdown of the instance. If the instance has not terminated after 1h, it is terminated forcefully.
This lifecycle hook is overridden by this deployment to a duration of 60 seconds, which prevents the error above from occurring.
This lifecycle hook cannot be removed unless done manually with the console or the CLI. The lifecycle hook that generates this behaviour can be seen in the console under EC2 > Auto Scaling groups > <asg_name> > Instance management > Lifecycle hooks. If you remove the lifecycle hook, the instances will be terminated immediately, skipping the Terminating:Wait state.
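With the AWS CLI, removing the hook could look like this; the group name is a placeholder:

```shell
# remove the ECS-managed draining hook so instances skip Terminating:Wait
aws autoscaling delete-lifecycle-hook \
  --auto-scaling-group-name <asg_name> \
  --lifecycle-hook-name ecs-managed-draining-termination-hook
```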
You can check the state of the instances in the console under EC2 > Auto Scaling groups > <asg_name> > Instance management > Instances. There you can see the instances still attached to the auto scaling group. If you want to terminate them, you can also do it manually.
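Manual termination is also possible from the CLI; the instance id is a placeholder:

```shell
# terminate one instance and shrink the group's desired capacity with it
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id <instance_id> \
  --should-decrement-desired-capacity
```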
```
Error: updating Auto Scaling Group <name>: operation error Auto Scaling: UpdateAutoScalingGroup, https response error StatusCode: 400, RequestID: <requestId>, api error ValidationError: An active instance refresh with a desired configuration exists. All configuration options derived from the desired configuration are not available for update while the instance refresh is active.
```

It means that a refresh is already in progress. An instance refresh updates the instances in the auto scaling group and can take a while. It can be seen in the console under EC2 > Auto Scaling groups > <asg_name> > Instance refresh. It is also impacted by the ecs-managed-draining-termination-hook lifecycle hook.
See LICENSE for full details.