How does RT work under the hood, and what was the original motivation behind it?
RT consists of the following parts, each of which is described below.
To bring two or more different versions of an app into production, we need a way to identify those versions.
We decided to use Git for this purpose.
At a very early stage we experimented with libgit2, but it turned out to have some bugs that would significantly affect us. libgit2 also offers a very low-level API, which isn't easy to work with for basic Git commands.
We wanted this part of RT to be rock-solid, which is why we decided instead to use the `git` binary installed on the system.
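To illustrate the approach, here is a minimal sketch of shelling out to the system `git` binary from Go instead of linking a Git library. It is not RT's actual code: the helper names `currentRevision` and `parseRevision`, and the use of `git rev-parse HEAD` as the version identifier, are assumptions made for this example.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strings"
)

// shaPattern matches a full 40-character SHA-1 commit ID as printed by git.
var shaPattern = regexp.MustCompile(`^[0-9a-f]{40}$`)

// parseRevision (hypothetical helper) validates the raw output of
// `git rev-parse HEAD` and returns the trimmed commit ID.
func parseRevision(out string) (string, error) {
	rev := strings.TrimSpace(out)
	if !shaPattern.MatchString(rev) {
		return "", fmt.Errorf("unexpected git output: %q", rev)
	}
	return rev, nil
}

// currentRevision (hypothetical helper) asks the git binary found on PATH
// for the commit ID checked out in repoDir.
func currentRevision(repoDir string) (string, error) {
	cmd := exec.Command("git", "rev-parse", "HEAD")
	cmd.Dir = repoDir
	out, err := cmd.Output()
	if err != nil {
		return "", fmt.Errorf("git rev-parse failed: %w", err)
	}
	return parseRevision(string(out))
}
```

Calling `currentRevision(".")` from inside a checked-out repository should return the full commit ID. Keeping the output validation in a separate function means it can be tested without a repository or a `git` installation.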
Using Git this way has some drawbacks:
- everyone with write access to the `ape-dev-rt-apps` repo could potentially mess things up for all users
- releasing a new version implies a new commit which just bumps the revision/version ID, which is bothersome
- it is hard to build automation around it
As part of RT's functionality, Git integration is deprecated for applications that have been migrated out of the central repository. All Git functionality will eventually be removed.
RT processes any files named `*.tf.tpl` and `deployment-state.hcl.tpl` as Go templates before handing them over to Terraform.
A helper function, `mkSlice`, was added so that slices can be created and used in `*.tf.tpl` templates.
- Example usage

```
{{ $slice := mkSlice "bucket1" "bucket2" "bucket3" }}
{{ range $index, $path := $slice }}
resource "aws_s3_bucket" "{{$index}}_pixel_bucket" {
  bucket = "{{$path}}-pixel-bucket"
  acl    = "private"
}
{{ end }}
```
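For readers unfamiliar with Go templates, the sketch below shows one way a helper such as `mkSlice` could be exposed to templates through the standard `text/template` package. This is an assumption about the mechanism, not RT's actual source: the `mkSlice` signature and the `renderTemplate` helper are hypothetical.

```go
package main

import (
	"bytes"
	"text/template"
)

// mkSlice returns its arguments as a slice so that a template can range
// over them. The signature is a guess at the helper's shape.
func mkSlice(args ...interface{}) []interface{} {
	return args
}

// renderTemplate (hypothetical helper) parses src as a Go template with
// mkSlice registered and returns the rendered text, i.e. what would be
// handed over to Terraform.
func renderTemplate(name, src string, data interface{}) (string, error) {
	funcs := template.FuncMap{"mkSlice": mkSlice}
	tpl, err := template.New(name).Funcs(funcs).Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

Functions must be registered with `Funcs` before `Parse` is called, otherwise parsing fails on the unknown `mkSlice` identifier. Passing the contents of a `*.tf.tpl` file as `src` would then yield plain HCL.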
The main reason for this extra templating layer is that deploymentstate exists outside of Terraform and, as a result, cannot use any of Terraform's variable logic.
- This makes running Terraform commands against raw directories trickier, because Terraform ignores `*.tf.tpl` files and they do not contain valid HCL until the templates have been processed.
  - Both `terraform validate` & `terraform plan` would be helpful to run ahead of `git push`
- `*.tf.tpl` templates will eventually be deprecated.
We rely heavily on Terraform because it takes care of most of the implementation details we would otherwise have to deal with, namely:
- AWS API/SDK handling and how different resources work together generally
- DSL and/or config parsing and how to effectively translate that into API calls
Even though our stack is, and for the near future will remain, mainly AWS, we did not want to lock the solution into AWS unnecessarily.
More importantly, there will always be services that AWS doesn't provide (e.g. PagerDuty, DynDNS, Pingdom), and we will need to automate/orchestrate those too.
Even if Terraform doesn't support a specific provider, it is OSS and it has a powerful plugin architecture.
- Each application and each of its slots needs to be described in the form of an HCL template
- Resources can also be described via CloudFormation, since Terraform supports it
- One of the main differences from other tools that do infra provisioning (Puppet, CloudFormation, Ansible, ...) is that Terraform does not do resource discovery based on tags and does not expect the provider to persist the state.
This, however, means that it needs to manage the state of each resource in the form of a `tfstate` file (JSON). This in turn means that a team needs to share the `tfstate` file somehow. See the "Remote State Storage" section below.
Unlike many other solutions, Terraform 0.7+ will allow building applications/infrastructure on top of existing infrastructure (e.g. VPCs that were created entirely outside of Terraform, or even before Terraform existed).
Terraform as of today (April 2016) supports 7 backends (Artifactory, Atlas, Consul, etcd, HTTP, S3, Swift).
At the time of releasing the initial version of RT (May 2015) Terraform only supported 4 (Atlas, Consul, HTTP and S3).
A decision was made back in May 2015 to use S3 mainly for these reasons:
- fine-grained permissions management via S3 bucket policies, which allows sharing tfstate between teams
- availability of S3 (`99.99%` backed by SLA)
- backup strategy
  - region replication
  - versioning, allowing us to revert files to past versions if an accident happens
  - a number of working and supported AWS SDKs, making it easy to create custom backup scripts
- it does not involve extra operational overhead (i.e. the service availability is achieved by AWS engineers)
- logging which allows auditing (i.e. who changed what and when)
Both solutions would introduce extra operational overhead and both would require effort to be invested into making a good backup strategy and HA deployments. This would significantly increase the scope of this project.
Atlas is a commercial offering which would introduce some periodic fees and wouldn't bring much value in this specific case (app deployments) since Atlas (as of April 2016) does not have a notion of application/slots/deployments. It is mostly aimed at ops doing purely infrastructure deployments.
When revisiting this option we should compare:
- long-term costs
- backup strategy
- permissions management
- availability & SLA
There was no Artifactory backend supported in May 2015.
When revisiting this option we should compare:
- long-term costs
- backup strategy
- permissions management
- availability & SLA
RT provides some extra functionality beyond Terraform. This currently includes commands for attaching/detaching ELBs to/from Auto Scaling groups.
We distinguish between two contexts/layers (application and slot). These are typically enough to build the whole infrastructure for each application and its slots.
Sometimes changes need to be made outside of these contexts, though. For example, the state of a slot may change because the slot may or may not be accepting traffic (i.e. the ELB may or may not be attached to the ASG).
- RT allows users to perform certain actions on top of the infrastructure that was built via Terraform
- Such actions typically don't conflict with Terraform's state thanks to the `ignore_changes` flag of `lifecycle`.
- If a user wants to be able to perform changes beyond what Terraform has built, the HCL templates need to follow certain conventions and may need to define `lifecycle.ignore_changes` for certain resources.
- Since most of the AWS functionality is handled by Terraform and treated as an implementation detail, these custom functions re-introduce some problems:
  - AWS region handling
  - AWS API credentials handling
  - identification/discovery of resources built by Terraform
  - AWS API throttling
to be moved under "RT expectations" per https://trello.com/c/KCM4pgza/85-allow-infra-code-to-be-in-any-repository
Applications in the Release Tool are bicameral as we distinguish between the "infra" and "slot" configuration details.
Infra configuration describes resources whose lifecycle matches that of the whole application. There is a single "infra configuration" for each app, and changes are incremental: it can be modified and updated like a normal Terraform configuration.
Slot configuration describes resources whose lifecycle matches that of a single application slot. There will be multiple slot configurations for a single app. Each slot represents a repository snapshot, so slots are immutable: they may only be created or destroyed.