Setting up Amazon ECS using CloudFormation

Please Note:
These instructions are now obsolete, replaced by aws-explorer provisioning.

In the following instructions we set up an ECS environment using AWS CloudFormation. This tool uses template scripts to create servers, security groups, databases, load balancers, etc. in a fast and repeatable way.

We'll create a cluster named myCluster and a task named myTask. As you follow these instructions, replace these names with a real cluster name and the project name used in the configs repo.

There are three main phases in these instructions:

  • In steps 1 to 4 we create a cluster and provision it with one or more EC2 instances.
  • In steps 5 to 9 we add a task (application) to the cluster. These steps can be repeated to add multiple tasks to the cluster.
  • Step 10 is very important, as this is where we close off security risks to the cluster.

Step 1 - Create the ECS Cluster and its S3 Bucket

Important: Keep the cluster name short - three or four characters is good.

  1. Go to the CloudFormation page and create a new stack using the template

https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-1-create-cluster

For the stack name, use the name of your cluster. Skip over the 2nd page (Options).

  2. Once the stack has been created, you can use the links in the Outputs section to check that the cluster and S3 bucket have been created.

  3. Copy the text from the Outputs tab, which you will save in the next step.
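
If you prefer the command line, the same stack can be created with the AWS CLI. This is a minimal sketch, assuming the template URL above; the cluster name is a placeholder:

$ CLUSTER=myc
$ aws cloudformation create-stack \
    --stack-name "$CLUSTER" \
    --template-url https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-1-create-cluster \
    --capabilities CAPABILITY_IAM
$ aws cloudformation wait stack-create-complete --stack-name "$CLUSTER"
$ aws cloudformation describe-stacks --stack-name "$CLUSTER" \
    --query 'Stacks[0].Outputs' --output table

The --capabilities flag is only needed if the template creates IAM resources; the last command prints the same Outputs shown on the console.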

Step 2 - Add a config for the cluster to the S3 Bucket

  1. Create /Development/Cluster/<clustername> using a skeleton config.

     $ mkdir -p /Development/Cluster/<clustername>
     $ cd /Development/Cluster/<clustername>
     $ curl -# https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/skeleton-cluster.tar.gz | tar xvz
    
  2. Edit SETENV and set the parameters carefully. Make sure you update the S3 bucket's name. You can get the DOCKER_AUTH and DOCKER_EMAIL values by using docker login to set your local credentials, then running cat ~/.docker/config.json (see the sketch at the end of this step). More details can be found here, but don't use the format with username and password in the file.

  3. Load the ecs.config file into the S3 bucket:

     ./sync-ecs.config
    

Sometimes this fails with a long error message ("An unexpected error has occurred..."). This is usually a network-related error and resolves if you try a few times.

  4. Save the Outputs copied in the previous section into a file named stack-outputs.tab.

     $ cat >> stack-outputs.tab
     PAGES		1.0.
     S3Bucket	ttcf-xytz-configs	2.3 - S3 bucket
     ...
     (press <Ctrl-D> to finish)
    
  5. Go to https://drive.google.com/drive/u/0/folders/0ByzEB7u5S7PbNVBjODh6MkFMdDg?ths=true and create a new Google Sheet with the name Cluster <your-cluster-name>.

Select File->Import and upload stack-outputs.tab with the "Replace current sheet" option, and "Convert text to numbers and dates" set to No.

Sort the spreadsheet by column C.

Share this spreadsheet as required so developers can use the ECS Configuration for their task.
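
For reference, this is where the DOCKER_AUTH and DOCKER_EMAIL values in SETENV come from. The username, auth string and email below are hypothetical:

$ docker login
Username: phil
Password:
Login Succeeded

$ cat ~/.docker/config.json
{
	"auths": {
		"https://index.docker.io/v1/": {
			"auth": "cGhpbDpzM2NyZXQ=",
			"email": "phil@example.com"
		}
	}
}

DOCKER_AUTH is the "auth" value and DOCKER_EMAIL the "email" value. Note that newer Docker clients may delegate to a credential helper (a "credsStore" entry in config.json), in which case the auth value won't appear in the file.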

Step 3 - Add an EC2 instance to the Cluster

  1. Run CloudFormation script

https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-3-add-instance-to-cluster

    Stack name: <clustername>-instance-1
    EcsSecurityGroup: <clustername>-ecsSecurityGroup
    InstanceType: usually t2.small (unless you have plans to run multiple projects on the cluster)
    KeyName: choose an SSH key pair you have on your local machine (e.g. phil-singapore)
  2. Save the outputs for this stack to stack-outputs-instance-1.tab.

     $ cat >> stack-outputs-instance-1.tab 
     
     INSTANCE
     healthCheck2	http://52.221.251.14:PORT/api/healthcheck	Healthcheck example 2
     temporaryLogin	ssh -i ~/.ssh/phil-singapore.pem ec2-user@...
     ...
     (press <Ctrl-D> to finish)
    

Load this file into the previous Google sheet, on a new tab named Instance 1.
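
Before moving on, you can confirm that the new instance has registered with the cluster. A quick check with the AWS CLI (the account ID and ARN below are placeholders):

$ aws ecs list-container-instances --cluster <clustername>
{
    "containerInstanceArns": [
        "arn:aws:ecs:ap-southeast-1:123456789012:container-instance/..."
    ]
}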

Step 4 - Temporary login to the instance

IMPORTANT
The ability to log directly into the server is only a temporary measure, while you confirm the setup. Despite the security provided by RSA encryption and authentication, we must consider SSH to be a weak point for hackers. Please observe the following rules, under punishment of death!

  • Only ever have one IP address whitelisted.

  • If an IP address is already whitelisted, overwrite it.

  • Remove this entry when you are finished, within the same day.

To add temporary access:

  1. Click on the ecsSecurityGroupPage link in your spreadsheet. On the Inbound tab, press edit. Add a temporary rule for SSH with the source as "My IP".

  2. You should now be able to log into the EC2 instance from your command line, using the temporaryLogin command.

  3. Check that Docker and the ECS agent are running using the dockerPs command in the spreadsheet:

     $ ssh -i ~/.ssh/phil-singapore.pem ec2-user@1.2.3.4 docker ps
     CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS              PORTS               NAMES
     b733865597db        amazon/amazon-ecs-agent:latest   "/agent"            22 minutes ago      Up 22 minutes                           ecs-agent
    
  4. Verify that the S3 bucket has been mounted correctly by using the viewConfigs command:

     $ ssh -i ~/.ssh/phil-singapore.pem ec2-user@1.2.3.4 ls -l /CONFIGS_S3_BUCKET /Scripts /Volumes
     /CONFIGS_S3_BUCKET:
     total 1
     ---------- 1 root root 182 Nov  4 03:33 ecs.config
     
     /Scripts/:
     total 0
     
     /Volumes/:
     total 0
    
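While you still have SSH access, the ECS agent's local introspection API is another quick way to confirm the instance has registered with the right cluster (output abbreviated; the ARN and agent version will differ):

$ ssh -i ~/.ssh/phil-singapore.pem ec2-user@1.2.3.4 curl -s http://localhost:51678/v1/metadata
{"Cluster":"<clustername>","ContainerInstanceArn":"arn:aws:ecs:ap-southeast-1:...","Version":"Amazon ECS Agent - v1.x.x ..."}
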

Step 5 - Add a task to the ECS cluster

Important: the length of the cluster name and the task name should not be more than 5 characters.

  1. Run CloudFormation script

https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-5-add-task-to-cluster

    Stack name: <clustername>-<taskname>
    dbAccessLocation: ignore this
    ecsSecurityGroup: <clustername>-ecsSecurityGroup-xxxx
    NeedCache, NeedDB: up to you...
    TaskName: the name of the project, as stored in configs.tooltwist.com (eg. drinkcircle)
    Subnets and Vpc: I use the entries with an IP address 172.*
  2. This will create the database and REDIS cache as required. It will also create the Application Load Balancer (ALB), and log files from the task's Docker containers will be forwarded to CloudWatch.

  3. Save the Outputs section for this stack to stack-outputs-<taskname>.tab, typing in a newline and "TASK" by hand to separate the sections.

     $ cat >> stack-outputs-<taskname>.tab 
     
     TASK
     dbPort	3306	Database host
     appDomain	ttcf-xxx-alb-crowdhound-1063185692.ap-southeast-1.elb.amazonaws.com	Application endpoint
     cacheHost	ttcf-xxx-crowdhound.i07nfr.0001.apse1.cache.amazonaws.com	Cache host
     logGroup	ttcf-xxx-ecs-crowdhound	Log group
     dbHost	ttcf-xxx-crowdhound.clcuugpfmr3s.ap-southeast-1.rds.amazonaws.com	Database host
     cachePort	6379	Cache host
     (press <Ctrl-D> to finish)
    

Add a new tab to your Google sheet and load these details.
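
As in step 1, this stack can also be created from the command line by passing the same parameters. A sketch with placeholder values, using only the parameter names listed above:

$ aws cloudformation create-stack \
    --stack-name <clustername>-<taskname> \
    --template-url https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-5-add-task-to-cluster \
    --parameters ParameterKey=TaskName,ParameterValue=<taskname> \
                 ParameterKey=NeedDB,ParameterValue=true \
                 ParameterKey=NeedCache,ParameterValue=true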

Step 6 - CloudSearch

If your application uses SOLR for searching, you will need to use CloudSearch in the ECS environment. Using a Docker container for SOLR won't work, because when the application starts the SOLR core will not be initialised, the healthcheck will fail, and the container will be killed and restarted, fail again, restart again, etc.

CloudSearch provides a fully managed and backed-up service. It can be set up manually from the CloudSearch Dashboard. Use the domain name ecs-<clustername>-<taskname>. Creating the domain takes a while.

The best way to define the fields to be indexed is to create a small JSON file containing an example document record. Copy example-search-document.json to a new file and modify it to represent the data of your application.

Note that the source code required to use CloudSearch is similar to, but different from, the SOLR API. The crowdhound project can provide an example of how documents can be loaded, updated and searched. For details of the API see here.

After creating the CloudSearch domain, manually update the task tab on your spreadsheet with the search and doc endpoints.
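
For reference, documents are loaded into CloudSearch as a JSON batch, which the AWS CLI can post straight to the doc endpoint. A sketch - the endpoint and field names below are placeholders:

$ cat batch.json
[
  {
    "type": "add",
    "id": "doc-1",
    "fields": { "title": "Example record", "description": "Hypothetical field values" }
  }
]

$ aws cloudsearchdomain upload-documents \
    --endpoint-url https://doc-ecs-clustername-taskname.ap-southeast-1.cloudsearch.amazonaws.com \
    --content-type application/json \
    --documents batch.json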

Step 7 - Load the task definitions and configs

  1. Create a directory to contain the task's definition, config files and scripts. The config files will be loaded up to the S3 bucket, from where they will be installed onto the EC2 instance(s) into /Volumes so they can be accessed by your application's Docker containers. The scripts will similarly get installed to /Scripts on the EC2 instance(s). The task definition will be uploaded to ECS, allowing your application to be run as an ECS task or an ECS service.

We create the initial directory from a skeleton:

    $ mkdir -p /Development/Cluster/<clustername>/<taskname>
    $ cd /Development/Cluster/<clustername>/<taskname>
    $ curl -# https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/skeleton-task-crowdhound.tar.gz | tar xvz

(I'll add a skeleton for ToolTwist applications soon)

  2. Follow the numbered scripts to set the configuration and upload the files to the S3 bucket and ECS. Check the values in SETENV carefully, setting them from your spreadsheet.

  3. The config and script changes take a few minutes to propagate through to the EC2 instance(s). The viewConfigs command from your spreadsheet can be used to check.
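
If you want to inspect or hand-edit the task definition before uploading it, it is plain JSON and can be registered with the AWS CLI. A heavily trimmed sketch - the family, image, ports and volume paths below are placeholders:

$ cat taskdef.json
{
  "family": "myTask",
  "containerDefinitions": [
    {
      "name": "myTask",
      "image": "myrepo/mytask:latest",
      "memory": 512,
      "portMappings": [ { "containerPort": 8080, "hostPort": 0 } ],
      "mountPoints": [ { "sourceVolume": "configs", "containerPath": "/etc/myTask" } ]
    }
  ],
  "volumes": [ { "name": "configs", "host": { "sourcePath": "/Volumes/myTask" } } ]
}

$ aws ecs register-task-definition --cli-input-json file://taskdef.json

A hostPort of 0 asks ECS for dynamic port mapping, which is what lets the ALB route to whichever port the container is given.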

Step 8 - Initialize the database, etc

Database initialisation is performed from the EC2 instance, using scripts you edit in the cluster/task config directory (i.e. in /Development/Cluster/<clustername>/<taskname>).

This directory can contain various useful scripts, for example:

  • to run a healthcheck from the command line
  • to initialise the database schema
  • to load the database
  • to access the REDIS cache
  • to load the search engine

The exact scripts required will depend upon your application - you will need to create and update them to suit your needs. Each time you make changes, sync them to the S3 bucket and wait a few minutes for them to propagate through to the EC2 instance(s).

Bear in mind that the EC2 instance is unlikely to have client software installed to communicate with the database or REDIS. The easiest approach is to use a temporary Docker container to configure and load these back ends - the official mysql and redis images from Docker Hub work just fine. See the default scripts for an example.

The following are example commands used to initialise the database and search engine.

$ ssh -i ~/.ssh/phil-singapore.pem ec2-user@1.2.3.4  
Last login: Mon Nov  7 23:32:26 2016 from ppp121-44-67-215.xxxxx.xxxx.internode.on.net  
...  

[ec2-user@ip-172-31-1-118 ~]$ sudo su  
[root@ip-172-31-1-118 ec2-user]# cd /Scripts/crowdhound/  
[root@ip-172-31-1-118 crowdhound]# bash db-init
CMD=docker run -i --rm mysql mysql -h ttcf-xxx-crowdhound.clcuugpfmr3s.ap-southeast-1.rds.amazonaws.com -u root -pM0use123  
...  
  
[root@ip-172-31-1-118 crowdhound]# bash db-load  
CMD=docker run -i --rm mysql mysql -h ttcf-xxx-crowdhound.clcuugpfmr3s.ap-southeast-1.rds.amazonaws.com -u < Dump20161102.sql
...  
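
The REDIS cache can be checked the same way, using the official redis image as a temporary client (substitute the cacheHost value from your spreadsheet):

[root@ip-172-31-1-118 crowdhound]# docker run -i --rm redis redis-cli -h ttcf-xxx-crowdhound.i07nfr.0001.apse1.cache.amazonaws.com ping
PONG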

Step 9 - Starting the application

The first time, it is probably best to start the application progressively, as there are a lot of moving parts. If you try to start everything straight away you will probably get a load balancer health check that fails, with little information about what is actually wrong. Using a series of health checks, you can check that each piece is working correctly before you add more complexity to the picture.

[Diagram: the four healthchecks hc1 to hc4, from inside the EC2 instance out through the load balancer]

Initially start the application as an ECS Task using the ECS Dashboard, and run health checks hc1 and hc2. Once these pass, shut down the task and start the application as an ECS service.

hc1 - health check from within the EC2 instance

This checks that the application is working correctly within its Docker container. Here are the typical steps:

$ ssh -i ~/.ssh/phil-singapore.pem ec2-user@1.2.3.4
$ sudo su
# cd /Scripts/crowdhound
# bash hc1

If the output contains an error, go to the CloudWatch application logs and solve the problem before proceeding.

hc2 - healthcheck from outside (optional)

This check can be run from your local machine's command line, or from your browser. First, however, you will need to modify the ecsSecurityGroup for your cluster to allow access from the outside to your Docker container's port.

[Screenshot: inbound rule on the ecsSecurityGroup opening the container's port]

You should then be able to access the healthcheck from your browser using the internal Docker port number and the EC2 instance's IP address, with a URL similar to http://1.2.3.4:12345/api/healthcheck. You can similarly call it from the command line:

$ curl -v http://1.2.3.4:12345/api/healthcheck

Important: remove the open port from the security group as soon as you are finished.

hc3 - the load balancer healthcheck

This is the healthcheck performed by the Application Load Balancer. For this health check and hc4 the application must be running as a service.

  1. Press Add Service on the Services tab of the Cluster's dashboard page.
  2. As you start the service, press the Configure ELB button.
  3. Set ELB Name to ttcf-<clustername>-alb-<taskname>.
  4. Select the container for your application (i.e. not REDIS), then press Save to start the service.
  5. Go to your ECS security group, and add an inbound rule with Port Range as 0 - 65000 and Source as the ALB for your task. It's not obvious, but if you type the name of your task into the Source field it will provide a list of security groups.

Go to the Target Group page for your cluster/task and check the status.

[Screenshot: Target Group page showing the status of the registered instances]

If the hc1 health check passed but the target group cannot call the same health check, then either it is checking the wrong endpoint (check the Healthcheck tab), or else the security group configuration is preventing the ALB from communicating with the EC2 instance. Check that the ecsSecurityGroup for your cluster allows access from the albSecurityGroup for the cluster/task. For example:

[Screenshot: ecsSecurityGroup inbound rule with the albSecurityGroup as its source]
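
For reference, the equivalent service can also be started from the CLI once the target group exists. A sketch - the names and ARN are placeholders, and --role is only needed on accounts without the ECS service-linked role:

$ aws ecs create-service \
    --cluster <clustername> \
    --service-name <taskname> \
    --task-definition <taskname> \
    --desired-count 1 \
    --load-balancers targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=<taskname>,containerPort=8080 \
    --role ecsServiceRole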

hc4 - external application healthcheck

This involves running the healthcheck through the standard application endpoint, through the load balancer, on its normal port (i.e. 80 or 443). For example, ttcf-xxx-alb-crowdhound-1063185692.ap-southeast-1.elb.amazonaws.com/api/healthcheck. If this healthcheck passes, then your application is available to the outside world.
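
Using the appDomain value recorded in your spreadsheet, for example:

$ curl -v http://ttcf-xxx-alb-crowdhound-1063185692.ap-southeast-1.elb.amazonaws.com/api/healthcheck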

Step 10 - Clean Up

*** This is Super Important ***

  1. Remove all inbound rules from the ECS Security Group that allow access from the outside.

  2. Check that the database does not have unrestricted access.

Gotchas

  1. If the healthcheck requires a container to be initialized before it will return 200, the service can run into a restart loop. When the healthcheck fails a few times, the service is shut down and restarted; if the container cannot be initialized fast enough, it will be repeatedly shut down.

For example, when SOLR is run as a Docker container a "core" usually needs to be defined or the healthcheck will fail.

If this is happening, you will see the status of the instances on the Target Groups page toggling between initial and draining. The short-term solution is to increase the interval and number of retries for the healthcheck.

In general, the solution must be to use containers that do not need to be loaded after restart, as Amazon may restart a container or service at any time.

  2. When debugging, work from the inside out, using the healthcheck steps described above.

Deleting a cluster

A cluster and its tasks are deleted by deleting the CloudFormation stacks in the reverse order they were created. Before doing this, however, several manual steps are required.

  1. Stop any services running on the cluster.
  2. Go to the ecsSecurityGroup and remove any Inbound rule from an albSecurityGroup.
  3. Empty the S3 bucket for the cluster.
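
The bucket and the stacks can be cleaned up from the CLI as well. A sketch, assuming the bucket and stack names used in the earlier steps:

# Empty the cluster's S3 bucket (deleting its stack fails while the bucket is non-empty).
$ aws s3 rm s3://ttcf-xytz-configs --recursive

# Delete the stacks in the reverse order of creation: tasks, then instances, then the cluster.
$ aws cloudformation delete-stack --stack-name <clustername>-<taskname>
$ aws cloudformation delete-stack --stack-name <clustername>-instance-1
$ aws cloudformation delete-stack --stack-name <clustername>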

Other Notes

Amazon CLI for ECS Notes
Centralising logs with Cloudwatch
