data/aws: 20-minute create timeouts for routes and security groups
Using operation timeouts [1,2,3,4,5].  Both the route and the
security-group timeouts were added in v1.11, so we have them in our
v2.2 AWS provider.  This should mitigate some of the issues we've been
having in our busy CI account, where out of ~1150 jobs in the last 24
hours, we've had the following failures [6]:

  $ curl -s 'http://localhost:8000/search?name=-e2e-aws&.&q=level%3Derror.*timeout+while+waiting+for+state' | jq -r '. | to_entries[].value[] | to_entries[].value[]' | sed 's/(i-[^)]*/(i-.../;s/(igw-[^)]*/(igw-.../;s/\(master\|nat_gw\|private_routing\|route_net\)\.[0-9]/\1.../' | sort | uniq -c | sort -n
       2 level=error msg="\t* aws_instance.master...: Error waiting for instance (i-...) to become ready: timeout while waiting for state
      10 level=error msg="\t* aws_security_group.bootstrap: timeout while waiting for state
      38 level=error msg="\t* aws_route.igw_route: Error creating route: timeout while waiting for state
      58 level=error msg="\t* aws_internet_gateway.igw: error attaching EC2 Internet Gateway (igw-...): timeout while waiting for state
      76 level=error msg="\t* aws_route_table_association.private_routing...: timeout while waiting for state
      90 level=error msg="\t* aws_route_table_association.route_net...: timeout while waiting for state
     164 level=error msg="\t* aws_route.to_nat_gw...: Error creating route: timeout while waiting for state
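
As a rough cross-check on which resource types dominate, the counts above can be grouped by Terraform resource type. This is only a sketch: the numbers are copied by hand from the output above, the `share_by_type` helper is a hypothetical name, and the counts are matching log lines rather than distinct jobs, so they should not be read as a per-job failure rate.

```python
# Counts copied from the `uniq -c` output above, keyed by resource address
# (the trailing "..." index placeholders are dropped).
timeout_errors = {
    "aws_instance.master": 2,
    "aws_security_group.bootstrap": 10,
    "aws_route.igw_route": 38,
    "aws_internet_gateway.igw": 58,
    "aws_route_table_association.private_routing": 76,
    "aws_route_table_association.route_net": 90,
    "aws_route.to_nat_gw": 164,
}


def share_by_type(counts):
    """Group counts by resource type; return (type, count, fraction) rows, largest first."""
    total = sum(counts.values())
    by_type = {}
    for address, n in counts.items():
        resource_type = address.split(".", 1)[0]
        by_type[resource_type] = by_type.get(resource_type, 0) + n
    ordered = sorted(by_type.items(), key=lambda kv: -kv[1])
    return [(t, n, n / total) for t, n in ordered]


rows = share_by_type(timeout_errors)
for resource_type, n, fraction in rows:
    print(f"{n:4d}  {fraction:6.1%}  {resource_type}")
```

Grouped this way, aws_route accounts for 202 of the 438 lines and aws_security_group for 10.  Those are the only two types this commit touches; the route-table-association and internet-gateway errors are not addressed here.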

The 20-minute timeout is much higher than the two-minute route default
[2], so that should help a lot with our leading error.  The security
group default is 10 minutes [4], so this is less of a change there, and
we only see that error rarely anyway.  I went with 20 minutes (instead
of a higher number) because a single resource (or parallel resources)
coming in just under that limit will keep the full Terraform step
under the 30 minutes that we've chosen as a timeout for our other
steps (waiting for the Kubernetes API, bootstrap completion, and
install completion).  But obviously we can tune more later if
necessary.
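
The budget reasoning above can be sketched as follows.  This is a simplified model, not how Terraform schedules work: it assumes resources in the same stage wait concurrently and stages run one after another, and `worst_case_wait` is a hypothetical helper introduced only for illustration.

```python
from datetime import timedelta

STEP_BUDGET = timedelta(minutes=30)     # timeout chosen for the other install steps
CREATE_TIMEOUT = timedelta(minutes=20)  # create timeout set by this commit


def worst_case_wait(stages):
    """Upper bound on waiting time.

    `stages` is a list of non-empty lists of timeouts.  Resources inside a
    stage are assumed to wait in parallel (the slowest one dominates), and
    stages are assumed to run serially (their waits add up).
    """
    return sum((max(stage) for stage in stages), timedelta())


# Several resources timing out side by side still cost only one timeout.
parallel = worst_case_wait([[CREATE_TIMEOUT, CREATE_TIMEOUT, CREATE_TIMEOUT]])

# Two dependent resources that each run close to the limit do not fit.
serial = worst_case_wait([[CREATE_TIMEOUT], [CREATE_TIMEOUT]])

print("parallel fits in budget:", parallel <= STEP_BUDGET)
print("serial fits in budget:", serial <= STEP_BUDGET)
```

Under these assumptions the 20-minute value only protects the 30-minute target when slow resources overlap; a chain of slow, dependent resources could still run longer, which is part of why the value may need tuning later.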

[1]: https://www.terraform.io/docs/configuration/resources.html#operation-timeouts
[2]: https://www.terraform.io/docs/providers/aws/r/route.html#timeouts
[3]: hashicorp/terraform-provider-aws#3639 (v1.11.0)
[4]: https://www.terraform.io/docs/providers/aws/r/security_group.html#timeouts
[5]: hashicorp/terraform-provider-aws#3599 (v1.11.0)
[6]: https://github.com/wking/openshift-release/tree/debug-scripts/d3
wking committed Apr 26, 2019
1 parent c87b389 commit 246f4a1
Showing 5 changed files with 20 additions and 0 deletions.
4 changes: 4 additions & 0 deletions data/data/aws/bootstrap/main.tf
@@ -140,6 +140,10 @@ resource "aws_lb_target_group_attachment" "bootstrap" {
 resource "aws_security_group" "bootstrap" {
   vpc_id = "${var.vpc_id}"

+  timeouts {
+    create = "20m"
+  }
+
   tags = "${merge(map(
     "Name", "${var.cluster_id}-bootstrap-sg",
   ), var.tags)}"
4 changes: 4 additions & 0 deletions data/data/aws/vpc/sg-master.tf
@@ -1,6 +1,10 @@
 resource "aws_security_group" "master" {
   vpc_id = "${data.aws_vpc.cluster_vpc.id}"

+  timeouts {
+    create = "20m"
+  }
+
   tags = "${merge(map(
     "Name", "${var.cluster_id}-master-sg",
   ), var.tags)}"
4 changes: 4 additions & 0 deletions data/data/aws/vpc/sg-worker.tf
@@ -1,6 +1,10 @@
 resource "aws_security_group" "worker" {
   vpc_id = "${data.aws_vpc.cluster_vpc.id}"

+  timeouts {
+    create = "20m"
+  }
+
   tags = "${merge(map(
     "Name", "${var.cluster_id}-worker-sg",
   ), var.tags)}"
4 changes: 4 additions & 0 deletions data/data/aws/vpc/vpc-private.tf
@@ -13,6 +13,10 @@ resource "aws_route" "to_nat_gw" {
   destination_cidr_block = "0.0.0.0/0"
   nat_gateway_id         = "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
   depends_on             = ["aws_route_table.private_routes"]
+
+  timeouts {
+    create = "20m"
+  }
 }

 resource "aws_subnet" "private_subnet" {
4 changes: 4 additions & 0 deletions data/data/aws/vpc/vpc-public.tf
@@ -23,6 +23,10 @@ resource "aws_route" "igw_route" {
   destination_cidr_block = "0.0.0.0/0"
   route_table_id         = "${aws_route_table.default.id}"
   gateway_id             = "${aws_internet_gateway.igw.id}"
+
+  timeouts {
+    create = "20m"
+  }
 }

 resource "aws_subnet" "public_subnet" {
