Currently it is difficult to merge even small PRs because of flaky CI tests. The CI pipeline builds and destroys the entire environment on every pull request or merge, runs twice in parallel (once for each network pattern), takes up to 3 hours, and has a high rate of flaky failures.
There are various contributing factors, but it appears that race conditions randomly cause the creation of some projects and service accounts to fail. Some of these errors have been mitigated with retry logic, but others, particularly 409 errors, need cleverer logic to unpick the state before retrying. While I'm investigating the root cause of those errors, I'm also exploring tactical fixes to improve the success rate of the CI pipeline.
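To make the retry distinction concrete, here is a minimal sketch of a retry wrapper. The `retry_create` function, its five-attempt limit, and the convention that exit code 9 stands in for an HTTP 409 conflict are all illustrative assumptions, not the pipeline's actual implementation:

```shell
# Sketch only: retry a flaky creation command with linear backoff,
# but bail out immediately on a 409-style conflict, because the
# resource then half-exists and a blind retry will just fail again.
retry_create() {
  local attempt rc
  for attempt in 1 2 3 4 5; do
    "$@" && return 0
    rc=$?
    if [ "$rc" -eq 9 ]; then   # assumed convention: exit 9 == HTTP 409
      echo "409 conflict: reconcile state before retrying" >&2
      return "$rc"
    fi
    sleep "$attempt"           # simple linear backoff between attempts
  done
  return 1
}
```

Transient errors (quota blips, eventual-consistency races) recover under this pattern; 409s need the extra reconciliation step described above.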
One potential improvement is reducing the number of projects. Every project creation carries a small chance of flaky failure, so for the CI pipeline to pass, every single creation must succeed. The blueprint creates a large number of projects to illustrate what a real, complex customer environment might look like, but many of them are placeholders or duplicates. By removing projects that neither add substantive value to the architecture nor illustrate how to use the Terraform modules, we reduce the chance that a flaky error blocks the CI test.
Terraform Resources
n/a
Detailed design
A few approaches that I'm considering...
### Approach 1: remove "bu2".
Each environment (d-n-p) has two subfolders (bu1, bu2) that duplicate the same projects. By removing bu2, we could eliminate 15 duplicate projects. For users who do want multiple bu subfolders, we can give guidance in the README on how to recreate this (e.g. "copy these folders, then run `sed 's/bu1/bu2/' filepath`", etc.).
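The README guidance could look roughly like the following. The `envs/production/bu1` layout and the file contents here are stand-ins, not the blueprint's real directory tree:

```shell
# Illustrative scratch setup standing in for the blueprint's layout.
work="$(mktemp -d)"
mkdir -p "$work/envs/production/bu1"
echo 'project = "prj-p-bu1-kms"' > "$work/envs/production/bu1/main.tf"

# Copy the bu1 folder, then rewrite every bu1 reference to bu2.
cp -r "$work/envs/production/bu1" "$work/envs/production/bu2"
grep -rl 'bu1' "$work/envs/production/bu2" | xargs sed -i 's/bu1/bu2/g'
```

(The `-i` flag here is GNU sed; BSD/macOS sed would need `-i ''`.)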
Currently, each combination of bu x environment creates 5 projects (30 total), so we could save 15 projects per run by removing bu2:
prj-p-bu2-kms
prj-p-bu2-sample-base
prj-p-bu2-sample-floating
prj-p-bu2-sample-peering
prj-p-bu2-sample-restrict
### Approach 2: run CI against fewer environments.
In the unmodified blueprint, each of the three environments (d-n-p) is identical, and the CI test creates all of them. We could consider running CI tests against only the production environment.
However, we don't want untested environments to silently accumulate new bugs.
This might be addressed by adding logic that tests the other environments only when a change touches their code, or that tests them on a random subset of CI runs, etc.
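The change-detection variant could be gated with something like the sketch below. The `envs/<environment>/` path convention is an assumption about the repo layout, and in CI the changed-file list would come from something like `git diff --name-only origin/main...HEAD`:

```shell
# Hedged sketch: succeed if any changed file (read from stdin)
# touches the given environment's directory.
should_test_env() {
  grep -q "^envs/$1/"
}

# Example invocation with a hard-coded changed-file list:
if printf 'envs/development/bu1/main.tf\n' | should_test_env development; then
  echo "run development suite"
fi
```

Production would always run unconditionally; only the non-prod suites would sit behind this gate.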
Additional information
No response
I like either option. Currently we stand up just over 50 projects. For clean orgs not already under quota management, there is a minor issue: not with project quota but with billing quota, where the default is 5. At least 2 quota requests are required to keep the additions under 50 and kick in the 3-minute response per request. I think one BU with the 3 example floating/peering/shared options is fine as an example. I vote for option 1 to reduce the project count.
However, for shared east-west traffic where one BU is an API provider workload for the other BU, having 2 of them works well and exercises canary flows. Option 2 is good as well; still, the client can duplicate bu1 at design time.
Partially addressed by #1241, which has led to a noticeable increase in the CI pass rate (though it is still low). Will continue evaluating whether there's a good mechanism to enforce approach 2.