ci-operator/templates/openshift: Explicitly set AWS availability zones
This is very similar to the earlier e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204).  This time, however, I'm not changing the zones
where the machines will run.  By default, the installer will
provision zone infrastructure in all available zones, but since
openshift/installer@644f705286 (data/aws/vpc: Only create subnet
infrastucture for zones with Machine(Set)s, 2019-03-27,
openshift/installer#1481) users who explicitly set zones in their
install-config will no longer have unused zones provisioned with
subnets, NAT gateways, EIPs, and other related infrastructure.  This
infrastructure reduction has two benefits in CI:

1. We don't have to pay for resources that we won't use, and we will
   have more room under our EIP limits (although we haven't bumped
   into that one in a while, because we're VPC-constrained).

2. We should see reduced rates in clusters failing install because of
   AWS rate limiting, with results like [1]:

     aws_route.to_nat_gw.3: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)

   The reduction is because:

   i. We'll be making fewer requests for these resources, because we
      won't need to create (and subsequently tear down) as many of
      them.  This will reduce our overall AWS-API load somewhat,
      although the reduction will be incremental because we have so
      many other resources which are not associated with zones.

   ii. Throttling on these per-zone resources is what tends to
       break Terraform [2].  So even if the rate of timeouts
       per-API request remains unchanged, a given cluster will only
       have half as many (three vs. the old six) per-zone chances of
       hitting one of the timeouts.  This should give us something
       close to a 50% reduction in clusters hitting throttling
       timeouts.
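
The "close to a 50% reduction" estimate in (ii) can be checked with a quick calculation. This is a minimal sketch, assuming each per-zone setup hits a throttling timeout independently and with the same probability; the independence assumption and the sample rates are illustrative, not measured values.

```python
def cluster_timeout_probability(p_per_zone: float, zones: int) -> float:
    """Chance that at least one of `zones` independent per-zone setups times out."""
    return 1.0 - (1.0 - p_per_zone) ** zones


for p in (0.01, 0.05):  # hypothetical per-zone timeout rates
    six = cluster_timeout_probability(p, 6)
    three = cluster_timeout_probability(p, 3)
    print(f"p={p}: six zones {six:.4f}, three zones {three:.4f}, ratio {three / six:.3f}")
```

Under these assumptions the ratio works out to 1 / (1 + (1 - p)^3), which sits slightly above one half and approaches it as p shrinks, so the expected reduction is close to, but a little under, 50%.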

The drawback is that we're diverging further from the stock "I just
called 'openshift-install create cluster' without providing an
install-config.yaml" experience.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/187/pull-ci-openshift-console-operator-master-e2e-aws-operator/575/artifacts/e2e-aws-operator/installer/.openshift_install.log
[2]: With a cache of build-log.txt from the past ~48 hours:

     $ grep -hr 'timeout while waiting for state' ~/.cache/openshift-deck-build-logs >timeouts
     $ wc -l timeouts
     362 timeouts
     $ grep aws_route_table_association timeouts | wc -l
     214
     $ grep 'aws_route\.to_nat_gw' timeouts | wc -l
     102

     So (102+214)/362 is 87% of our timeouts, with the remainder being
     almost entirely related to the internet gateway (which is not
     per-zone).
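
     The same tally can be sketched in Python for anyone who prefers a script to the grep pipeline. This is an illustration only: the function names are hypothetical, and the substrings are the ones used in the grep patterns above.

```python
from collections import Counter

# Resource name fragments taken from the grep patterns above.
PER_ZONE_RESOURCES = ("aws_route_table_association", "aws_route.to_nat_gw")


def classify_timeouts(lines):
    """Count timeout lines per per-zone resource; everything else is 'other'."""
    counts = Counter()
    for line in lines:
        if "timeout while waiting for state" not in line:
            continue  # not a timeout line, ignore it
        for resource in PER_ZONE_RESOURCES:
            if resource in line:
                counts[resource] += 1
                break
        else:
            counts["other"] += 1
    return counts


def per_zone_share(counts):
    """Fraction of all counted timeouts that belong to per-zone resources."""
    total = sum(counts.values())
    per_zone = sum(counts[r] for r in PER_ZONE_RESOURCES)
    return per_zone / total if total else 0.0


# Reproduce the arithmetic from the counts reported above.
reported = Counter({
    "aws_route_table_association": 214,
    "aws_route.to_nat_gw": 102,
    "other": 362 - 214 - 102,
})
print(f"{per_zone_share(reported):.0%}")  # prints 87%
```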
wking committed Mar 28, 2019
1 parent 3e1b090 commit 51c4a37
Showing 4 changed files with 69 additions and 2 deletions.
@@ -237,6 +237,24 @@ objects:
 clusterID: ${CLUSTER_ID}
 metadata:
   name: ${CLUSTER_NAME}
+controlPlane:
+  name: master
+  replicas: 3
+  platform:
+    aws:
+      zones:
+      - us-east-1a
+      - us-east-1b
+      - us-east-1c
+compute:
+- name: worker
+  replicas: 3
+  platform:
+    aws:
+      zones:
+      - us-east-1a
+      - us-east-1b
+      - us-east-1c
 networking:
   clusterNetwork:
   - cidr: 10.128.0.0/14
@@ -231,6 +231,24 @@ objects:
 clusterID: ${CLUSTER_ID}
 metadata:
   name: ${CLUSTER_NAME}
+controlPlane:
+  name: master
+  replicas: 3
+  platform:
+    aws:
+      zones:
+      - us-east-1a
+      - us-east-1b
+      - us-east-1c
+compute:
+- name: worker
+  replicas: 3
+  platform:
+    aws:
+      zones:
+      - us-east-1a
+      - us-east-1b
+      - us-east-1c
 networking:
   clusterNetwork:
   - cidr: 10.128.0.0/14
@@ -257,11 +257,24 @@ objects:
 apiVersion: v1beta4
 baseDomain: test.ose
 clusterID: ${CLUSTER_ID}
+controlPlane:
+  name: master
+  replicas: ${MASTERS}
+  platform:
+    aws:
+      zones:
+      - us-east-1a
+      - us-east-1b
+      - us-east-1c
 compute:
 - name: worker
   replicas: ${WORKERS}
-controlPlane:
-- replicas: ${MASTERS}
+  platform:
+    aws:
+      zones:
+      - us-east-1a
+      - us-east-1b
+      - us-east-1c
 metadata:
   name: ${CLUSTER_NAME}
 networking:
@@ -322,6 +322,24 @@ objects:
 clusterID: ${CLUSTER_ID}
 metadata:
   name: ${CLUSTER_NAME}
+controlPlane:
+  name: master
+  replicas: 3
+  platform:
+    aws:
+      zones:
+      - us-east-1a
+      - us-east-1b
+      - us-east-1c
+compute:
+- name: worker
+  replicas: 3
+  platform:
+    aws:
+      zones:
+      - us-east-1a
+      - us-east-1b
+      - us-east-1c
 networking:
   clusterNetwork:
   - cidr: 10.128.0.0/14
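
The install-config fragments in this diff pin both machine pools to the same three zones, so only those zones should get per-zone infrastructure. As a rough illustration of that idea (not the installer's actual implementation), the set of zones in use can be derived from an install-config-shaped dictionary. The function name `zones_in_use` and the sample dictionary are hypothetical.

```python
def zones_in_use(install_config: dict) -> list:
    """Return the sorted union of AWS zones named by any machine pool."""
    # controlPlane is a single pool; compute is a list of pools.
    pools = [install_config.get("controlPlane") or {}]
    pools.extend(install_config.get("compute") or [])
    zones = set()
    for pool in pools:
        aws = (pool.get("platform") or {}).get("aws") or {}
        zones.update(aws.get("zones") or [])
    return sorted(zones)


# Hypothetical in-memory equivalent of the fragment added by this commit.
three_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
config = {
    "controlPlane": {
        "name": "master",
        "replicas": 3,
        "platform": {"aws": {"zones": three_zones}},
    },
    "compute": [
        {
            "name": "worker",
            "replicas": 3,
            "platform": {"aws": {"zones": three_zones}},
        }
    ],
}
print(zones_in_use(config))
```

A pool with no `zones` entry contributes nothing here, which corresponds to the default case where the choice of zones is left to the installer.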