
AWS microservice terraform module

Terraform module which creates a microservice that runs on Fargate or on any EC2 instance type. The applications can range from general purpose services to machine learning training, machine learning inference, high performance computing and more.

There are already some Terraform microservice modules available; however, they offer little variety in configuration and usually only support Fargate. Here you have access to all EC2 instance types with easy configuration.

Usage

module "microservice" {
  source = "vistimi/microservice/aws"

  name = "microservice-example"

  bucket_env = {
    force_destroy = false
    versioning    = true
    file_key      = "file_local_name.env"
    file_path     = "file_in_bucket_name.env"
  }

  vpc = {
    id               = "my_vpc_id"
    subnet_tier_ids  = ["id_subnet_tier_1", "id_subnet_tier_2"]
    subnet_intra_ids = ["id_subnet_intra_1", "id_subnet_intra_2"]
  }

  orchestrator = {
    group = {
      name = "first"
      deployment = {
        min_size     = 1
        max_size     = 2
        desired_size = 1

        containers = [
          {
            name = "first"
            docker = {
              repository = {
                name = "ubuntu"
              }
              image = {
                tag = "latest"
              }
            }

            traffics = [
              {
                # this will redirect http:80 to http:80
                listener = {
                  # port is by default 80 with http
                  protocol = "http"
                }
                target = {
                  port              = 80
                  protocol          = "http" # if not specified, the protocol will be the same as the listener
                  health_check_path = "/"    # if not specified, the health_check_path will be "/"
                }
              }
            ]

            entrypoint = [
              "/bin/bash",
              "-c",
            ]
            command = [
              <<EOT
              # ...
              EOT
            ]
            readonly_root_filesystem = false
          }
        ]
      }

      ec2 = {
        key_name       = "name_of_key_to_ssh_with"
        instance_types = ["t2.micro"]
        os             = "linux"
        os_version     = "2023"
        capacities = [{
          type = "ON_DEMAND"
        }]
      }
    }
    ecs = {
      # override default ecs behaviour
    }
  }

  tags = {}
}

Data platforms or frameworks

🙆‍♀️ If you want to unify your infrastructure with Terraform, use this module. Terraform covers a wide range of cloud providers, hence reducing dependency on a single provider/platform.

🙆‍♂️ If you want to use other serving systems such as TorchServe or TensorFlow Serving, then use this module.

Data platforms are a great way to simply and efficiently manage your AI lifecycle from training to deployment. However, they are quite pricey and only work for data applications. Some frameworks like a ray.io cluster or mlflow.org offer easy lifecycle management from a local machine to complex cloud deployments for ML projects.

🙅 If you want to do training, then you are better off using a data platform or a framework. This module is not designed for training, but for inference.

Specificities

  • Heterogeneous clusters, consisting of different instance types
  • Parallelized data processing with autoscaling

The microservice has the following specifications:

  • Load balancer
    • HTTP(S)
    • REST/gRPC
  • Auto scaling
  • DNS with Route53
  • Environment file
  • CloudWatch logs
  • Container orchestrators
    • ECS
      • Fargate
      • EC2
        • General Purpose
        • Compute Optimized
        • Memory Optimized
        • Accelerated Computing (GPU, Inferentia, Trainium)
        • Accelerated Computing (Gaudi): not supported
        • Storage Optimized: supported/not tested
        • HPC Optimized: supported/not tested
    • EKS
      • Fargate
      • EC2

Architecture

(architecture diagram)

Examples

Go check the examples and the tests.

ECS vs EKS equivalent

ECS                    EKS
cluster                cluster
service                node-group
task                   node
task-definition        deployment
container-definition   pod

Errors

insufficient memory
The closest matching container-instance `<id>` has insufficient memory available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide

It means that the memory given to the container, the service, or both exceeds what is available on the instance. ECS itself requires a certain amount of memory to run, and that amount differs per instance type. Currently only 90% of the instance memory is used for the containers, leaving enough overhead to avoid this problem. To work around it, you can override the memory and CPU allocation by specifying them in the containers.
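
As a minimal sketch, assuming the container block accepts explicit cpu and memory fields (these field names are an assumption and do not appear in the usage example above), the override could look like this:

containers = [
  {
    name = "first"

    # assumed fields: CPU units and memory (MiB) reserved for this container,
    # kept below what the instance can actually hand over to ECS
    cpu    = 256
    memory = 512

    # ... rest of the container definition as in the usage example above
  }
]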

insufficient instances
Scaling activity `<id>`: Failed: We currently do not have sufficient `<instance_type>` capacity in the Availability Zone you requested (us-east-1c). Our system will be working on provisioning additional capacity. You can currently get `<instance_type>` capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f. Launching EC2 instance failed

It means that no instances of the requested type are available in the selected availability zones. Unfortunately, AWS does not always have enough capacity in some zones or regions. A possible solution is to retry deploying the microservice until it succeeds.
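
Another option is to give the Auto Scaling group several instance types to pick from, using the instance_types field shown in the usage example; the specific types listed below are only illustrative:

ec2 = {
  # several similar instance types so the Auto Scaling group can fall back to
  # whichever type still has capacity in the selected availability zones
  instance_types = ["t3.micro", "t3a.micro", "t2.micro"]
  os             = "linux"
  os_version     = "2023"
  capacities = [{
    type = "ON_DEMAND"
  }]
}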

timeout
Error: waiting for ECS Capacity Provider `<capacity_provider_arn>` to delete: timeout while waiting for resource to be gone (last state: `<state>`, timeout: 20m0s)

or

Error: deleting ECS Cluster `<cluster_arn>`: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

or

Error: waiting for Auto Scaling Group `<capacity_provider_name>` delete: found resource

Since January 2024, ECS automatically adds a lifecycle hook with name = "ecs-managed-draining-termination-hook", lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING", default_result = "CONTINUE", heartbeat_timeout = 3600. This means that by default an instance waits 1h (3600 seconds) before being terminated, its state changing from InService to Terminating:Wait to Terminating:Proceed. This allows for a graceful shutdown of the instance; if the instance has not terminated after 1h, it is terminated forcefully.

This lifecycle hook cannot be removed unless done manually with the console or with the CLI. The lifecycle hook that generates this behaviour can be seen in the console under EC2 > Auto Scaling groups > asg_name > Instance management > Lifecycle hooks. If you remove the lifecycle hook, the instances are terminated immediately, skipping the Terminating:Wait state.

However, this leads to some problems during deployment. Rest assured, the deployment worked as you wanted; it just means that the instances take an hour to be terminated. You can check the state of the instances in the console under EC2 > Auto Scaling groups > asg_name > Instance management > Instances, where you will see the instances still attached to the auto scaling group. If you want to terminate them sooner, you can also do it manually.

Another solution could be to allow the deployment to last longer than an hour, for example 90 minutes. This way the instances have time to be terminated within the deployment window and the deployment can finish cleanly.
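
If you do want to remove the hook or release a waiting instance without going through the console, the same can be done with the AWS CLI; the Auto Scaling group name and instance id below are placeholders:

# remove the ECS-managed lifecycle hook from the auto scaling group
aws autoscaling delete-lifecycle-hook \
  --auto-scaling-group-name <asg_name> \
  --lifecycle-hook-name ecs-managed-draining-termination-hook

# or let one waiting instance proceed to termination immediately
aws autoscaling complete-lifecycle-action \
  --auto-scaling-group-name <asg_name> \
  --lifecycle-hook-name ecs-managed-draining-termination-hook \
  --instance-id <instance_id> \
  --lifecycle-action-result CONTINUE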

instance refresh
updating Auto Scaling Group `<name>`: operation error Auto Scaling: UpdateAutoScalingGroup, https response error StatusCode: 400, RequestID: `<requestId>`, api error ValidationError: An active instance refresh with a desired configuration exists. All configuration options derived from the desired configuration are not available for update while the instance refresh is active.

It means that an instance refresh is already in progress. An instance refresh is a process that updates the instances in the auto scaling group, and it can take a while. The refresh can be seen in the console under EC2 > Auto Scaling groups > asg_name > Instance management > Refresh. It is also impacted by the lifecycle hook ecs-managed-draining-termination-hook described above.
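
If you prefer the CLI, the refresh can also be inspected there, and cancelled if you really want to abandon it; the Auto Scaling group name is a placeholder:

# inspect the state of the current instance refresh
aws autoscaling describe-instance-refreshes --auto-scaling-group-name <asg_name>

# cancel it (only if you actually want to abandon the running refresh)
aws autoscaling cancel-instance-refresh --auto-scaling-group-name <asg_name>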

License

See LICENSE for full details.