Investigating the cause of flaky CI errors, I'm seeing a high rate of the following issues that are set in the project factory module but can be better addressed through organization policies:
403 error when attempting to delete the default VPC, with an error that the Compute Engine API has not yet been enabled. At that point the enablement request has started, but may take a long time to complete. This then triggers retry logic that leads to a different error, 409 project already exists, when attempting to create the project again. (There is some inconsistent state between GCP and Terraform.)
However, it is not necessary to delete a default VPC if it is blocked by org policy. Provider docs state it is recommended to use the organisational policy constraint instead of setting auto_create_network to false, as is done in the project factory.
The default behavior of the project factory is a bit nonintuitive. Because the GCP platform creates a default network by default, the project factory module overrides this with auto_create_network = false. With this setting, the module enables the Compute API, queries it for the auto-created network, and then attempts to delete the default VPC, which can introduce issues with eventual consistency. Conversely, when auto_create_network = true, the project factory does not attempt to query the Compute API. If the org policy to prevent the default network is enforced and auto_create_network = true, we get the desired (if non-intuitive) behavior: no default VPC is created, and nothing tries to query the Compute API immediately at project creation.
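As a minimal sketch of the combination described above, the following pairs the org policy with a plain google_project that leaves auto_create_network at true. The variable names, resource names, and project ID are hypothetical placeholders, and this is not the blueprint's actual code.

```hcl
# Hypothetical inputs for illustration only.
variable "org_id" {
  type = string
}

variable "billing_account" {
  type = string
}

# Org policy that stops GCP from creating the default network in new projects.
resource "google_organization_policy" "skip_default_network" {
  org_id     = var.org_id
  constraint = "compute.skipDefaultNetworkCreation"

  boolean_policy {
    enforced = true
  }
}

resource "google_project" "example" {
  name            = "example-project"    # placeholder
  project_id      = "example-project-id" # placeholder
  org_id          = var.org_id
  billing_account = var.billing_account

  # Left at true so Terraform does not try to delete a default network
  # right after creation; the org policy prevents the network instead.
  auto_create_network = true

  # Orders the apply only; it does not guarantee that the policy has
  # finished propagating on the GCP side.
  depends_on = [google_organization_policy.skip_default_network]
}
```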
409 error about default service account does not exist occasionally appears on brand new projects. I suspect this is a similar issue, where the ability to reference the service account is eventually consistent, and there is a gap between GCP state and terraform state.
Note that the provider docs also state that this Terraform resource works on a best-effort basis, as no API formally describes the default service account resource, and that it is only intended for use cases that can't use the org policy.
The foundation blueprint already sets these org policies, so I expect we can remove some of these flaky errors about eventual consistency by setting the org policies first and avoiding these steps.
Terraform Resources
Projects that explicitly try to deprivilege the service account. After the org policy is enforced, this is no longer necessary. However, the org policy is created in stage 1-org and is eventually consistent, and some projects are also created in 1-org, so it's tricky to guarantee that the policy is actually enforced before projects are created.
terraform-google-project-factory module by default has auto_create_network set to false. In comparison, the google_project resource from Google provider defaults this to true. This means the project factory always attempts to enable the Compute Engine API, create the default network, then immediately delete it. This step is not necessary if the org policy is already in place.
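For illustration, a project factory call that opts out of both steps might look like the sketch below. It assumes the org policies are already enforced. The module label, project name, and variables are hypothetical, and the version pin is deliberately left out rather than guessed.

```hcl
# Hypothetical inputs for illustration only.
variable "org_id" {
  type = string
}

variable "billing_account" {
  type = string
}

module "example_project" {
  source = "terraform-google-modules/project-factory/google"
  # Pin a version you have tested; none is assumed here.

  name            = "example-project" # placeholder
  org_id          = var.org_id
  billing_account = var.billing_account

  # Override the module default (false) so it does not try to delete the
  # default network immediately after project creation.
  auto_create_network = true

  # Leave the default service account alone instead of managing it with
  # the best-effort resource; the org policy covers the deprivileging goal.
  default_service_account = "keep"
}
```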
Detailed design
The goal of removing the default VPC and deprivileging the default service account is already addressed by the org policies compute.skipDefaultNetworkCreation and iam.automaticIamGrantsForDefaultServiceAccounts in the 1-org step. After these policies are enforced, there is no need to explicitly delete the default VPC or disable the default service account; on the contrary, attempting these actions contributes to flaky failures when referencing APIs or resources whose state is eventually consistent.
Fixes:
Ensure that these org policies are set as early as possible, with sufficient time for propagation, before creating any additional projects. Potentially these could even be manually set in 0-bootstrap, which could give sufficient time to propagate before any additional projects are created.
After org policies are set, remove the step to delete the default VPC from project creation.
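One way to approximate "sufficient time for propagation" within a single stage is an explicit delay between the policy and anything that creates projects. The sketch below uses the time_sleep resource from the hashicorp/time provider. The 90s duration, resource names, and project values are assumptions for illustration, not measured or recommended figures, and a fixed sleep reduces the race without eliminating it.

```hcl
terraform {
  required_providers {
    google = { source = "hashicorp/google" }
    time   = { source = "hashicorp/time" }
  }
}

# Hypothetical inputs for illustration only.
variable "org_id" {
  type = string
}

variable "billing_account" {
  type = string
}

resource "google_organization_policy" "no_default_sa_grants" {
  org_id     = var.org_id
  constraint = "iam.automaticIamGrantsForDefaultServiceAccounts"

  boolean_policy {
    enforced = true
  }
}

# Arbitrary pause so the policy has a chance to propagate; tune as needed.
resource "time_sleep" "wait_for_org_policy" {
  create_duration = "90s"

  depends_on = [google_organization_policy.no_default_sa_grants]
}

resource "google_project" "after_policy" {
  name            = "example-project"    # placeholder
  project_id      = "example-project-id" # placeholder
  org_id          = var.org_id
  billing_account = var.billing_account

  # Project creation starts only after the delay has elapsed.
  depends_on = [time_sleep.wait_for_org_policy]
}
```

Setting the policies in an earlier stage such as 0-bootstrap, as suggested above, would avoid the fixed delay entirely, since the time between stages serves the same purpose.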
[...]
Step #7 - "converge-org": Error: Received unexpected error:
Step #7 - "converge-org": FatalError{Underlying: error while running command: exit status 1;
Step #7 - "converge-org": Error: error creating project tyj-net-dns-oo3v (tyj-net-dns): googleapi: Error 409: Requested entity already exists, alreadyExists. If you received a 403 error, make sure you have the roles/resourcemanager.projectCreator permission
Step #7 - "converge-org":
Step #7 - "converge-org": with module.dns_hub.module.project-factory.google_project.main,
Step #7 - "converge-org": on .terraform/modules/dns_hub/modules/core_project_factory/main.tf line 73, in resource "google_project" "main":
Step #7 - "converge-org": 73: resource "google_project" "main" {
Step #7 - "converge-org":
Step #7 - "converge-org":
Step #7 - "converge-org": Error: Error creating service account: googleapi: Error 409: Service account project-service-account already exists within project projects/tyj-
Step #7 - "converge-org": Error: Error deleting default network in project eyx-net-dns-rtqu: googleapi: Error 403: Compute Engine API has not been used in project eyx-net-dns-rtqu before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=eyx-net-dns-rtqu then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
[...]
Step #7 - "converge-org": Error: error creating project eyx-net-dns-rtqu (eyx-net-dns): googleapi: Error 409: Requested entity already exists, alreadyExists. If you received a 403 error, make sure you have the roles/resourcemanager.projectCreator permission
I started working on a new PR to override the default behavior of the core-project-factory (replace default_service_account = "disable" with default_service_account = "keep"), so that it would not attempt to create the unsupported resource, but is the upstream change on the provider another way to fix this? Or is it an unrelated issue?
Hi @eeaton! - Likely. We often see a 409 error saying the service account already exists because an earlier Error: Provider produced inconsistent result after apply occurred during the creation of the project factory service account; the subsequent terraform retry then fails because the account does exist. If your example is this situation (and likely it is), the upstream change should resolve it.
Note: Here is the PR for the updated version: #1221
Good news, thanks. I'm seeing quite a few of those 409 errors on terraform retry, so I'll prioritize getting 1221 merged and see if that helps reduce the errors.