Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI improvement: Remove explicit steps that are already addressed by org policies #1243

Open
eeaton opened this issue May 17, 2024 · 3 comments
Assignees
Labels

Comments

@eeaton
Copy link
Collaborator

eeaton commented May 17, 2024

TL;DR

Investigating the cause of flaky CI errors, I'm seeing a high rate of the following issues that are set in the project factory module but can be better addressed through organization policies:

  1. 403 error when attempting to delete the default VPC, with an error that the Compute Engine API has not yet been enabled. At that point the enablement request has started, but may a long time to complete. This then triggers retry logic that leads to a different error, 409 project already exists when attempting to create the project again. (There's some inconsistent state between GCP and terraform).

However, it is not necessary to delete a default VPC if it is blocked by org policy. Provider docs state it is recommended to use the organisational policy constraint instead of setting auto_create_network to false, as is done in the project factory.

The default behavior of the project factory is a bit nonintuitive. Because the GCP platform creates a default network by default, the project factory module overrides this with auto_create_network = false. This behavior enables the Compute API, queries it for the auto-created network, and then attempts to delete the default VPC. However, it can introduce issues with eventual consistency. Conversely, when auto_create_network = true, the project factory does not attempt to query the Compute API. If the org policy to prevent the default network is enforced, and auto_created_network = true, we get the desired (if non-intuitive) behavior to not create a default VPC and not try to immediately query Compute API at project creation.

  1. 409 error about default service account does not exist occasionally appears on brand new projects. I suspect this is a similar issue, where the ability to reference the service account is eventually consistent, and there is a gap between GCP state and terraform state.

Note that the provider docs also state this tf resource is a best-effort basis, as no API formally describes the default service account resource and it is only intended for use cases that can't use the org policy.

The foundation blueprint already sets these org policies, so I expect we can remove some of these flaky errors about eventual consistency by setting the org policies first and avoiding these steps.

Terraform Resources

Projects that explicitly try to deprivilege the service account. After the org policy is enforced, this is no longer necessary. However, the org policy is created in stage 1-org and is eventually consistent, and some projects are also created in 1-org, so it's tricky to guarantee that the policy is actually enforced before projects are created.

terraform-google-project-factory module by default has auto_create_network set to false. In comparison, the google_project resource from Google provider defaults this to true. This means the project factory always attempts to enable the Compute Engine API, create the default network, then immediately delete it. This step is not necessary if the org policy is already in place.

Detailed design

The goal of removing the default VPC and deprivileging the default service account is already addressed by Org policies compute.skipDefaultNetworkCreation" and iam.automaticIamGrantsForDefaultServiceAccounts in 1-org step. After these policies are enforced, there is no need to explicitly delete the default VPC or disable the default service account; conversely, attempting to do these actions contributes to flaky failures when trying to reference APIs or resources whose state is eventually consistent.

Fixes:

Additional information

Sample error logs for #1

Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.interconnect.module.project-factory.google_project_default_service_accounts.default_service_accounts[0]: Creating...
Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.base_network_hub[0].module.project-factory.google_project_default_service_accounts.default_service_accounts[0]: Creation complete after 0s [id=projects/tyj-net-hub-base-0pux]
Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.org_billing_logs.module.budget.data.google_project.project[0]: Reading...
Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.org_billing_logs.module.budget.data.google_project.project[0]: Read complete after 1s [id=projects/tyj-c-billing-logs-uglg]
Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: google_project_iam_member.billing_bq_viewer: Creating...
Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.interconnect.module.project-factory.google_project_default_service_accounts.default_service_accounts[0]: Creation complete after 1s [id=projects/tyj-net-interconnect-xy8h]
[...]

[...]
Step #7 - "converge-org": Error: Received unexpected error:
Step #7 - "converge-org": FatalError{Underlying: error while running command: exit status 1;
Step #7 - "converge-org": Error: error creating project tyj-net-dns-oo3v (tyj-net-dns): googleapi: Error 409: Requested entity already exists, alreadyExists. If you received a 403 error, make sure you have the roles/resourcemanager.projectCreator permission
Step #7 - "converge-org":
Step #7 - "converge-org": with module.dns_hub.module.project-factory.google_project.main,
Step #7 - "converge-org": on .terraform/modules/dns_hub/modules/core_project_factory/main.tf line 73, in resource "google_project" "main":
Step #7 - "converge-org": 73: resource "google_project" "main" {
Step #7 - "converge-org":
Step #7 - "converge-org":
Step #7 - "converge-org": Error: Error creating service account: googleapi: Error 409: Service account project-service-account already exists within project projects/tyj-

Step #7 - "converge-org": Error: Error deleting default network in project eyx-net-dns-rtqu: googleapi: Error 403: Compute Engine API has not been used in project eyx-net-dns-rtqu before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=eyx-net-dns-rtqu then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
[...]
Step #7 - "converge-org": Error: error creating project eyx-net-dns-rtqu (eyx-net-dns): googleapi: Error 409: Requested entity already exists, alreadyExists. If you received a 403 error, make sure you have the roles/resourcemanager.projectCreator permission

Sample error logs for 2:

Step #7 - "converge-org": TestOrg 2024-05-20T06:08:01Z command.go:185: on .terraform/modules/base_restricted_environment_network.base_shared_vpc_host_project/modules/core_project_factory/main.tf line 145, in resource "google_service_account" "default_service_account":
Step #7 - "converge-org": TestOrg 2024-05-20T06:08:01Z command.go:185: 145: resource "google_service_account" "default_service_account" {
Step #7 - "converge-org": TestOrg 2024-05-20T06:08:01Z command.go:185:
Step #7 - "converge-org": TestOrg 2024-05-20T06:08:02Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1;
Step #7 - "converge-org": Error: Error creating service account: googleapi: Error 409: Service account project-service-account already exists within project projects/mkz-n-shared-base-xx88.
Step #7 - "converge-org": Details:
Step #7 - "converge-org": [
Step #7 - "converge-org": {
Step #7 - "converge-org": "@type": "type.googleapis.com/google.rpc.ResourceInfo",
Step #7 - "converge-org": "resourceName": "projects/mkz-n-shared-base-xx88/serviceAccounts/project-service-account@mkz-n-shared-base-xx88.iam.gserviceaccount.com"
Step #7 - "converge-org": }
Step #7 - "converge-org": ]
Step #7 - "converge-org": , alreadyExists
Step #7 - "converge-org":
Step #7 - "converge-org": with module.base_restricted_environment_network["nonproduction"].module.base_shared_vpc_host_project.module.project-factory.google_service_account.default_service_account[0],
Step #7 - "converge-org": on .terraform/modules/base_restricted_environment_network.base_shared_vpc_host_project/modules/core_project_factory/main.tf line 145, in resource "google_service_account" "default_service_account":
Step #7 - "converge-org": 145: resource "google_service_account" "default_service_account" {
Step #7 - "converge-org": }
Step #7 - "converge-org": Test: TestOrg
Step #7 - "converge-org": 2024/05/20 06:08:02 RUN_STAGE env var set to apply
Step #7 - "converge-org": 2024/05/20 06:08:02 Skipping stage teardown
Step #7 - "converge-org": --- FAIL: TestOrg (1015.64s)
Step #7 - "converge-org": FAIL
Step #7 - "converge-org": FAIL github.com/terraform-google-modules/terraform-example-foundation/test/integration/org 1015.695s
Step #7 - "converge-org": FAIL

@eeaton eeaton added the enhancement New feature or request label May 17, 2024
@eeaton
Copy link
Collaborator Author

eeaton commented May 20, 2024

Hi @apeabody , looks like you made a recent change on the core-project-factory module terraform-google-modules/terraform-google-project-factory@cfd7f3f that might obviate the fix I'm working on.

I started working on a new PR to override the default behavior of the core-project-factory (replace default_service_account = "disable" with default_service_account = "keep"), so that it would not attempt to create the unsupported resource, but is the upstream change on the provider another way to fix this? Or is it an unrelated issue?

@apeabody
Copy link
Contributor

apeabody commented May 20, 2024

Hi @eeaton! - Likely. We often see 409 error about default service account does exist as there was an earlier Error: Provider produced inconsistent result after apply during the creation of the project factory service account, which then fails during the subsequent terraform retry as it does exist. If your example is this situation (and likely it is), the upstream change should resolve.

Note: Here is the PR for the updated version: #1221

@eeaton
Copy link
Collaborator Author

eeaton commented May 21, 2024

Good news, thanks. I'm seeing quite a few of those 409 errors on terraform retry, so I'll prioritize getting 1221 merged and see if that helps reduce the errors.

@eeaton eeaton self-assigned this May 23, 2024
@eeaton eeaton added backlog and removed enhancement New feature or request labels May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants