Recurring error "400 route operation in progress" when applying 3-network-hub-and-spoke #1228

Open
mromascanu123 opened this issue May 11, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@mromascanu123

TL;DR

This happens almost every time I deploy dev, nprod, or prod. I have to run plan and apply again, and then everything is fine. But this kind of error will break any pipeline that deploys the spokes automatically.

. . .
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter.regular_service_perimeter: Creating...
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter.regular_service_perimeter: Creation complete after 3s [id=accessPolicies/6329355927/servicePerimeters/sp_n_shared_restricted_default_perimeter_e480]
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter_resource.service_perimeter_resource["115822756025"]: Creating...
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter_resource.service_perimeter_resource["115822756025"]: Creation complete after 2s [id=accessPolicies/6329355927/servicePerimeters/sp_n_shared_restricted_default_perimeter_e480/projects/115822756025]
module.base_env.module.restricted_shared_vpc[0].google_access_context_manager_service_perimeter.bridge_to_network_hub_perimeter[0]: Creating...
module.base_env.module.restricted_shared_vpc[0].google_access_context_manager_service_perimeter.bridge_to_network_hub_perimeter[0]: Creation complete after 0s [id=accessPolicies/6329355927/servicePerimeters/spb_c_to_n_shared_restricted_bridge_e480]

Error: Error adding network peering: googleapi: Error 400: There is a route operation in progress on the local or peer network. Try again later., badRequest

with module.base_env.module.base_shared_vpc[0].module.peering[0].google_compute_network_peering.peer_network_peering,
on .terraform/modules/base_env.base_shared_vpc.peering/modules/network-peering/main.tf line 50, in resource "google_compute_network_peering" "peer_network_peering":
50: resource "google_compute_network_peering" "peer_network_peering" {

I did not investigate in detail what is going on; it might be a race condition or an unaccounted-for dependency.
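For an automated pipeline, one possible stopgap is to retry the apply only when this specific message appears. The following is a minimal sketch, not part of the foundation code: the helper names, the retry count, and the delay are all assumptions, and the marker string is copied from the error above. It does not establish whether a retried apply leaves the state fully consistent.

```python
import subprocess
import time

# Substring copied from the error message quoted in this issue.
TRANSIENT_MARKER = "There is a route operation in progress"


def is_transient_peering_error(output: str) -> bool:
    """Return True if the Terraform output contains the transient peering error."""
    return TRANSIENT_MARKER in output


def apply_with_retry(cmd, max_attempts=3, delay_seconds=60,
                     run=subprocess.run, sleep=time.sleep):
    """Run `cmd`, retrying only when the transient peering error appears.

    Returns the number of attempts used. Raises RuntimeError on any other
    failure, or when all attempts are exhausted. `run` and `sleep` are
    injectable so the logic can be exercised without Terraform installed.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        combined = (result.stdout or "") + (result.stderr or "")
        if not is_transient_peering_error(combined):
            raise RuntimeError(f"non-transient failure:\n{combined}")
        if attempt < max_attempts:
            sleep(delay_seconds)  # hypothetical delay; tune for your environment
    raise RuntimeError(f"still failing after {max_attempts} attempts")


# Example call (assumes `terraform` is on PATH and the working directory
# is the environment being deployed):
# apply_with_retry(["terraform", "apply", "-auto-approve"])
```

Any failure that does not contain the marker is raised immediately, so unrelated errors are not hidden by the retry loop.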

Expected behavior

It should deploy smoothly. Why does the second attempt succeed?

Observed behavior

See the TL;DR above.
Error: Error adding network peering: googleapi: Error 400: There is a route operation in progress on the local or peer network. Try again later., badRequest

Terraform Configuration

N/A - default, nothing special

Terraform Version

terraform version
Terraform v1.6.0
on linux_amd64

Additional information

No response

@mromascanu123 mromascanu123 added the bug Something isn't working label May 11, 2024
@obriensystems
Contributor

Reference: my last full 3-networks-hub-and-spoke apply, up to 5-app-infra, used TF 1.3.10 to avoid the issue of running Cloud Build with 1.3 if we use the default 1.7.5 (since downgraded to 1.5.7 in Cloud Shell).
Env: cloud shell and CB/CSR

I will also retest 3-nhas as soon as I finish the TEF upstream sync for 20240511 main in GoogleCloudPlatform/pbmm-on-gcp-onboarding#387. There are 2 symlinks in nonproduction that need to be reverted in #1107, but they work with a double symlink for now.

@sleighton2022
Collaborator

This occurs if there are too many simultaneous operations on peering. It is not occurring in our integration tests. Are the environments being deployed in parallel? One workaround is to run Terraform with -parallelism=1, but the build will take a long time because resource operations are then serialized.

@mromascanu123
Author

@sleighton2022: there is no parallelism here, and the deployment is done manually. It does not happen every time, not even often. I got one of these on 05/30 and another one today. However, the problem is deeper and nastier: in both cases the error was associated with tfstate corruption. On 05/30 it occurred during the apply for 3-nhas production, and today during the apply for 3-nhas development. On the surface, a retry (tf plan, then apply) seemed to fix the issue both times, and 3-nhas appeared to deploy without error. In reality, the tfstate for the stage where the error occurred (prod on 05/30, dev today) was corrupted and was missing variables that should have been generated by outputs.tf. As a result, when 4-projects is deployed these variables are not found and the deployment fails for good.

Example: after today's failed deployment I compared the tfstate files under the key "networks". Prod and nprod contained the same output variables (with different values), but quite a few were missing for dev.
(screenshot of the tfstate comparison)

More precisely, the outputs below were missing, and possibly others:
output "base_network_self_link" {
  value       = module.base_env.base_network_self_link
  description = "The URI of the VPC being created"
}

output "base_subnets_secondary_ranges" {
  value       = module.base_env.base_subnets_secondary_ranges
  description = "The secondary ranges associated with these subnets"
}

output "base_subnets_self_links" {
  value       = module.base_env.base_subnets_self_links
  description = "The self-links of subnets being created"
}
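
The manual comparison described above could be scripted before moving on to 4-projects. The following is a rough sketch under stated assumptions: each environment's state has been saved locally as JSON (for example with `terraform state pull`), the state uses the current format with a top-level `outputs` map, and the function names and file names are hypothetical.

```python
import json


def load_state(path: str) -> dict:
    """Parse a locally saved Terraform state file (JSON)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def output_names(state: dict) -> set:
    """Return the names of the root-module outputs recorded in a parsed state."""
    return set(state.get("outputs", {}))


def missing_outputs(states: dict) -> dict:
    """Map each environment to the outputs other environments have but it lacks.

    `states` maps an environment name to its parsed state dict.
    """
    all_names = set()
    for state in states.values():
        all_names |= output_names(state)
    return {
        env: sorted(all_names - output_names(state))
        for env, state in states.items()
    }


# Example call (hypothetical file names):
# states = {env: load_state(f"{env}.tfstate")
#           for env in ("development", "nonproduction", "production")}
# for env, names in missing_outputs(states).items():
#     print(env, names)
```

An empty list for every environment means the output names match across the environments; it says nothing about whether the values are correct.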
