
AWS microservice terraform module

Terraform module which creates a microservice that runs on Fargate or on any EC2 instance type. Applications range from general-purpose services to machine learning inference, high-performance computing, and more.

There are already some Terraform microservice modules available; however, they offer little variety in configuration and usually only support Fargate. Here you have access to all EC2 instance types with easy configuration.

Usage

# main.tf

module "microservice" {
  source = "vistimi/microservice/aws"

  name = "microservice-example"

  bucket_env = {
    force_destroy = false
    versioning    = true
    file_key      = "file_local_name.env"
    file_path     = "file_in_bucket_name.env"
  }

  vpc = {
    id               = "my_vpc_id"
    subnet_tier_ids  = ["id_subnet_tier_1", "id_subnet_tier_2"]
    subnet_intra_ids = ["id_subnet_intra_1", "id_subnet_intra_2"]
  }

  orchestrator = {
    group = {
      name = "first"
      deployment = {
        min_size     = 1
        max_size     = 2
        desired_size = 1

        containers = [
          {
            name = "first"
            docker = {
              repository = {
                name = "ubuntu"
              }
              image = {
                tag = "latest"
              }
            }

            traffics = [
              {
                # this will redirect http:80 to http:80
                listener = {
                  # port is by default 80 with http
                  protocol = "http"
                }
                target = {
                  port              = 80
                  protocol          = "http" # if not specified, the protocol will be the same as the listener
                  health_check_path = "/"    # if not specified, the health_check_path will be "/"
                }
              }
            ]

            entrypoint = [
              "/bin/bash",
              "-c",
            ]
            command = [
              <<EOT
              # ...
              EOT
            ]
            readonly_root_filesystem = false
          }
        ]
      }

      ec2 = {
        key_name       = "name_of_key_to_ssh_with"
        instance_types = ["t2.micro"]
        os             = "linux"
        os_version     = "2023"
        capacities = [{
          type = "ON_DEMAND"
        }]
      }
    }
    ecs = {
      # override default ecs behaviour
    }
  }

  tags = {}
}
# provider.tf

#-------------------------------------------
#                   AWS
#-------------------------------------------

provider "aws" {
  region = "us-east-1"
}

#-------------------------------------------
#                   Docker
#-------------------------------------------

locals {
  ecr_address = format("%v.dkr.ecr.%v.amazonaws.com", data.aws_caller_identity.current.account_id, data.aws_region.current.name)
  # ecr_address = "public.ecr.aws"
}

data "aws_region" "current" {}
data "aws_caller_identity" "current" {}
data "aws_ecr_authorization_token" "token" {}

provider "docker" {
  registry_auth {
    address  = local.ecr_address
    username = data.aws_ecr_authorization_token.token.user_name
    password = data.aws_ecr_authorization_token.token.password
  }
}

#-------------------------------------------
#               Kubernetes
#-------------------------------------------

# provider "kubernetes" {
#     host                   = one(values(module.eks)).cluster_endpoint
#     cluster_ca_certificate = base64decode(one(values(module.eks)).cluster.certificate_authority_data)

#     exec {
#       api_version = "client.authentication.k8s.io/v1beta1"
#       command     = "aws"
#       # This requires the awscli to be installed locally where Terraform is executed
#       args = ["eks", "get-token", "--cluster-name", one(values(module.eks)).cluster.name]
#     }
# }
# provider "kubectl" {
#     host                   = one(values(module.eks)).cluster_endpoint
#     cluster_ca_certificate = base64decode(one(values(module.eks)).cluster.certificate_authority_data)

#     exec {
#       api_version = "client.authentication.k8s.io/v1beta1"
#       command     = "aws"
#       # This requires the awscli to be installed locally where Terraform is executed
#       args = ["eks", "get-token", "--cluster-name", one(values(module.eks)).cluster.name]
#     }
# }

Data platforms or frameworks

🙆‍♀️ If you want to unify your infrastructure with Terraform, use this module. Terraform covers a wide range of cloud providers, hence reducing dependence on any single provider/platform.

🙆‍♂️ If you want to use other serving systems such as TorchServe or TensorFlow Serving, then use this module.

Data platforms are a great way to simply and efficiently manage your AI lifecycle from training to deployment. However, they are quite pricey and only work for data applications. Some frameworks like ray.io clusters or mlflow.org offer easy lifecycle management for ML projects, from a local machine to complex cloud deployments.

🙅 If you want to do training, then you are better off using a data platform or a framework. This module is not designed for training, but for inference.

Specificities

  • Heterogeneous clusters consisting of different instance types
  • Parallelized data processing with autoscaling

The microservice offers the following features:

  • Load balancer
    • HTTP(S)
    • Rest/gRPC
    • Redirects
  • Auto scaling
  • Route53 records
  • ACM
  • Environment file
  • CloudWatch logs
  • Docker build
  • ECR
  • Container orchestrators
    • ECS
      • Fargate
      • EC2
        • General Purpose
        • Compute Optimized
        • Memory Optimized
        • Accelerated Computing (GPU, Inferentia, Trainium)
        • Accelerated Computing (Gaudi): not supported
        • Storage Optimized: supported but not tested
        • HPC Optimized: supported but not tested
    • EKS
      • Fargate
      • EC2

Architecture

[Architecture diagram]

Examples

Go check the examples and the tests.

ECS vs EKS equivalent

ECS                    EKS
cluster                cluster
service                node-group
task                   node
task-definition        deployment
container-definition   pod

Errors

ECS

insufficient memory

│ Error: The closest matching container-instance <id> has insufficient memory available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide

It means that the memory requested by the container, the service, or both exceeds what the instance can provide. ECS itself reserves some memory on each instance, and the reserved amount differs per instance type. By default only 90% of the instance memory is made available to the containers, which leaves enough overhead to avoid this problem. To work around it, you can override the memory and CPU allocation by specifying them in the containers.
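A minimal sketch of such an override, assuming the container objects accept cpu and memory attributes (verify the exact names against the module's documented inputs), extends the container block from the usage example above:

containers = [
  {
    name = "first"

    # assumed attribute names: check the module's inputs before relying on them
    cpu    = 256 # CPU units reserved for this container
    memory = 512 # hard memory limit in MiB, kept below the instance's usable memory

    # ... docker, traffics, entrypoint as in the usage example above
  }
]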

insufficient instances

│ Error: Scaling activity <id>: Failed: We currently do not have sufficient <instance_type> capacity in the Availability Zone you requested (us-east-1c). Our system will be working on provisioning additional capacity. You can currently get <instance_type> capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f. Launching EC2 instance failed

It means that no instances of that type are available in the requested Availability Zones. Unfortunately, AWS does not always have enough capacity in some regions. A possible solution is to retry deploying the microservice until it succeeds.
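Another way to reduce the chance of hitting this error is to give the Auto Scaling group more options. Assuming the module accepts several values in instance_types, as its list form in the usage example suggests, listing comparable types (the values below are arbitrary examples) lets AWS pick whichever one has capacity:

ec2 = {
  key_name       = "name_of_key_to_ssh_with"
  instance_types = ["t3.medium", "t3a.medium", "t2.medium"] # several comparable types as fallbacks
  os             = "linux"
  os_version     = "2023"
  capacities = [{
    type = "ON_DEMAND"
  }]
}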

task not deleted but service is

│ Error: creating ELBv2 Listener <arn>: operation error Elastic Load Balancing v2: CreateListener, https response error StatusCode: 400, RequestID: <id>, DuplicateListener: A listener already exists on this port for this load balancer <arn>

│ Error: [WARN] A duplicate Security Group rule was found on <sg>. This may be a side effect of a now-fixed Terraform issue causing two security groups with identical attributes but different source_security_group_ids to overwrite each other in the state. See hashicorp/terraform#2376 for more information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 0.0.0.0/0, TCP, from port: <port>, to port: <port>, ALLOW" already exists

│ Error: creating ECS Service <name>: InvalidParameterException: Unable to Start a service that is still Draining.

These errors happen if you reapply after a failed deployment. Terraform deletes the service, but the task whose deployment failed is not deleted. The solution is to delete the task manually.

Timeout

Service

│ Error: waiting for ECS Service <arn> create: timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s)

This means that either the service takes more than 20 minutes to become stable or the configuration is wrong. Make sure you give it enough time to deploy.

With EC2, one way to check whether the service is deployed correctly is to verify in the target group that it uses an ephemeral port. If you only see something like this, wait:

Instance ID   Name     Port   Zone         Health status   Health status details
<id>          <name>   8080   us-east-1a   Unhealthy       Request timed out

You should see something like this:

Instance ID   Name     Port    Zone         Health status   Health status details
<id>          <name>   8080    us-east-1a   Unhealthy       Request timed out
<id>          <name>   32771   us-east-1a   Healthy         -

This indicates that the service does indeed use an ephemeral port and is deployed correctly. If you see an error on the ephemeral port, it means that the service or something else is misconfigured. Check the logs for more information.

Capacity Provider

This happens when the capacity provider changes and the old one therefore needs to be deleted. However, it is still associated with the cluster and thus cannot be removed.

│ Error: waiting for ECS Capacity Provider <arn> to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)

On AWS console:

Delete failed
The capacity provider cannot be deleted because it is associated with cluster: <name>. Remove the capacity provider from the cluster and try again.

You can delete it manually every time the instances change. On the AWS console, go to EC2 -> Auto Scaling Groups -> select and delete the old one (it will wait for the deletion lifecycle policy if there is one). Also go to ECS -> cluster -> <your_cluster> -> Infrastructure -> Capacity Providers -> select and delete the old one.

Lifecycle Hook

│ Error: deleting ECS Cluster <cluster_arn>: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

or

│ Error: waiting for Auto Scaling Group <capacity_provider_name> delete: found resource

Since January 2024, ECS automatically adds a lifecycle hook:

name                  = "ecs-managed-draining-termination-hook"
lifecycle_transition  = "autoscaling:EC2_INSTANCE_TERMINATING"
default_result        = "CONTINUE"
heartbeat_timeout     = 3600

This means that by default an instance will wait 1 hour (3600 seconds) before being terminated. Its state will change from InService to Terminating:Wait to Terminating:Proceed. This allows for a graceful shutdown of the instance. If the instance has not terminated after 1 hour, it is terminated forcefully.

This lifecycle hook is overridden by this deployment to a duration of 60 seconds, which removes the errors above.
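For reference, the hook that gets overridden corresponds roughly to the following standalone resource. This is a hedged sketch of equivalent Terraform, not the module's actual code, and the Auto Scaling group name is hypothetical:

resource "aws_autoscaling_lifecycle_hook" "ecs_managed_draining" {
  name                   = "ecs-managed-draining-termination-hook"
  autoscaling_group_name = "my-capacity-provider-asg" # hypothetical ASG name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result         = "CONTINUE"
  heartbeat_timeout      = 60 # lowered from the 3600-second default so instances terminate quickly
}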

This lifecycle hook cannot be removed unless done manually with the console or the CLI. The lifecycle hook that generates this behaviour can be seen in the console under EC2 > Auto Scaling groups > asg_name > Instance management > Lifecycle hooks. If you remove the lifecycle hook, the instances will be terminated immediately, skipping the Terminating:Wait state.

You can check the state of the instances in the console under EC2 > Auto Scaling groups > asg_name > Instance management > Instances. There you can see the instances still attached to the Auto Scaling group. If you want to terminate them, you can also do so manually.

instance refresh

│ Error: updating Auto Scaling Group <name>: operation error Auto Scaling: UpdateAutoScalingGroup, https response error StatusCode: 400, RequestID: <requestId>, api error ValidationError: An active instance refresh with a desired configuration exists. All configuration options derived from the desired configuration are not available for update while the instance refresh is active.

It means that a refresh is already in progress. A refresh is the process that replaces the instances in the Auto Scaling group, and it can take a while. It can be seen in the console under EC2 > Auto Scaling groups > asg_name > Instance management > Refresh. It is also impacted by the lifecycle hook ecs-managed-draining-termination-hook.
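For context, an instance refresh is usually configured on the Auto Scaling group itself. A minimal sketch of such a configuration (not this module's actual code, with hypothetical names) looks like this:

resource "aws_autoscaling_group" "example" {
  name                = "example-asg" # hypothetical name
  min_size            = 1
  max_size            = 2
  vpc_zone_identifier = ["id_subnet_tier_1", "id_subnet_tier_2"]

  launch_template {
    id      = aws_launch_template.example.id # hypothetical launch template
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 66 # keep roughly two thirds of the instances in service during the refresh
    }
  }
}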

License

See LICENSE for full details.