Provide mechanism for cleanup after failed deployment to enable re-deployment #1240

mromascanu123 · 2024-05-16T13:46:36Z

TL;DR

We need a way to selectively remove already-created resources, both from the environment and from tfstate, after a failed deployment. This has two sides:

Clean up the created resources by disabling billing on the already-created projects, then deleting the projects.
Remove from tfstate the projects and the corresponding random_string resources used for the project name suffixes, to avoid name collisions ("resource already exists") on redeployment. This is needed because projects (and other resources) continue to exist in a zombie state even after deletion, and the names are unique at org level.

For the first item, this would be a script replicating the manual steps below:

In asset manager, navigate to the folder to clean up and list the cloudresourcemanager.Project resources.
Extract the project_id for each of the projects to clean up.
For each project_id, run: gcloud billing projects unlink <project_id>
For each project_id, identify and extract the "liens", if any: gcloud alpha resource-manager liens list --project <project_id>
Delete the liens: gcloud alpha resource-manager liens delete <lien_id> --project <project_id>
Delete the projects: gcloud projects delete --quiet <project_id>
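
The manual steps above could be scripted roughly as follows. This is a minimal sketch, not a tested tool: the function names (run, list_lien_ids, cleanup_project) and the DRY_RUN variable are hypothetical, project IDs are passed as arguments (the asset listing and ID extraction stay manual), and the --format='value(name)' projection plus the stripping of a "liens/" prefix are assumptions about the lien listing output that should be verified before use.

```shell
#!/usr/bin/env bash
# Sketch only: automates the cleanup steps for the project IDs given as arguments.
# DRY_RUN defaults to 1, so gcloud commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# list_lien_ids PROJECT_ID: print the lien IDs of a project, one per line.
# In dry-run mode nothing is queried, so no lien deletions are shown.
# Assumption: the "name" field holds the lien ID, possibly prefixed by "liens/".
list_lien_ids() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ gcloud alpha resource-manager liens list --project $1" >&2
    return 0
  fi
  gcloud alpha resource-manager liens list --project "$1" --format='value(name)' \
    | sed 's#^liens/##'
}

# cleanup_project PROJECT_ID: unlink billing, delete liens, delete the project.
cleanup_project() {
  local project_id="$1" lien_id
  run gcloud billing projects unlink "$project_id"
  list_lien_ids "$project_id" | while IFS= read -r lien_id; do
    if [ -n "$lien_id" ]; then
      run gcloud alpha resource-manager liens delete "$lien_id" --project "$project_id"
    fi
  done
  run gcloud projects delete --quiet "$project_id"
}

# Process every project ID passed on the command line (no-op without arguments).
for project_id in "$@"; do
  cleanup_project "$project_id"
done
```

Running the script with one or more project IDs prints the commands it would execute; setting DRY_RUN=0 executes them. Keeping dry-run as the default is a deliberate choice here, since project deletion is hard to undo.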

For the second item, add two options to tf-wrapper.sh:

list: list the resources in tfstate, e.g. ./tf-wrapper.sh list development
(the resource IDs to clean up will have to be extracted from the resulting list)
remove: remove from tfstate the resources whose resource IDs are provided in a file, e.g. ./tf-wrapper.sh remove development ./resources_to_cleanup_from_tfstate.list (or, if no file is provided, simply remove from tfstate all resources under the specified folder)
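
As a rough illustration of what the two options could wrap, the sketch below uses the standard terraform state list and terraform state rm commands. The function names (tf_state_list, tf_state_remove) and the TERRAFORM variable are hypothetical and not part of the existing tf-wrapper.sh; the sketch also assumes the caller passes the Terraform working directory directly, since how tf-wrapper.sh maps an environment name such as development to a directory is not covered here. Only the file-based form of remove is shown.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the existing tf-wrapper.sh.
# TERRAFORM can be overridden to point at a specific binary.
TERRAFORM="${TERRAFORM:-terraform}"

# tf_state_list DIR: print the resource addresses in the state of DIR.
tf_state_list() {
  "$TERRAFORM" -chdir="$1" state list
}

# tf_state_remove DIR FILE: remove from the state of DIR every address in FILE.
# FILE holds one resource address per line; blank lines and lines starting
# with '#' are skipped.
tf_state_remove() {
  local dir="$1" file="$2" address
  while IFS= read -r address || [ -n "$address" ]; do
    case "$address" in
      ''|'#'*) continue ;;
    esac
    # stdin is redirected so the command cannot consume the rest of FILE.
    "$TERRAFORM" -chdir="$dir" state rm "$address" < /dev/null
  done < "$file"
}
```

A cautious workflow would be to save the output of tf_state_list to a file, trim it by hand to the addresses that really need to go, and only then pass that file to tf_state_remove.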

Terraform Resources

N/A

Detailed design

See TL;DR above

Additional information

Related #1238

eeaton · 2024-05-20T17:08:11Z

Hi @mromascanu123, can you help me understand more about your desired outcome, and in what scenario you want to use this script? Is it something that isn't already addressed by using the helper script to automate the manual steps of deploying with Cloud Build, then destroying?

While there are some flaky errors that require unpicking state, like #1187, they are specific enough that I don't recommend creating a script to directly modify terraform state. (Usually, modifying terraform state files by any method other than apply should be done only as a last resort.) Many other errors that might occur when a deployment fails require some other fix outside of the terraform state (modify the IAM policy of the principal doing the deployment, remove a pre-existing org policy that blocks the deployment, modify the tf files), followed by triggering the terraform apply again.

mromascanu123 · 2024-06-12T18:15:33Z

Hi @eeaton. One of the issues I've seen is the persistence of the resource "random_string" in the tfstate when the "plan" decides to delete and recreate a project following a failed deployment. The project-factory module will attempt to recreate the project, but the resulting ID will be the same as that of the just-deleted project, so the creation will obviously fail.

I was able to reproduce the issue at least once by aborting a tf-wrapper apply with a Ctrl-C then re-planning and re-applying. But a failed deployment may occur for many other reasons.

The resource "random_string" is used in many places to generate a suffix, and it persists in tfstate after the actual resource using it has been deleted. However, afaik the projects and KMS keystores persist after being deleted, and their names / IDs can't be reused.

In tf-wrapper.sh, "list" and "remove" operations could be added to list the resource IDs in tfstate and, for example, selectively delete as necessary the random_string resources that served to generate IDs for resources that were deleted but are still zombified.
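
To make the "selectively delete the random_string resources" idea concrete, here is a small sketch of a filter that picks random_string addresses out of a state listing. The function name select_random_strings is hypothetical, and the sketch assumes the input is one Terraform resource address per line, as printed by terraform state list.

```shell
#!/usr/bin/env bash
# select_random_strings: read resource addresses on stdin and print only those
# whose resource type is random_string, with or without an index such as [0].
# Hypothetical helper; "|| true" keeps the exit status 0 when nothing matches.
select_random_strings() {
  grep -E '(^|\.)random_string\.[A-Za-z0-9_-]+(\[.*\])?$' || true
}
```

For example, piping terraform state list through select_random_strings gives a candidate list that can be reviewed by hand before any of the entries are removed from tfstate. Matching on the resource type rather than on a plain substring avoids picking up unrelated addresses that merely contain "random_string" in a module name.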

mromascanu123 · 2024-06-14T12:35:03Z

Another issue seen repeatedly is described in #1228: even when retrying with tf-wrapper plan and then apply, and apparently succeeding, this in reality ends up in tfstate corruption for that particular job (e.g. development under 3-nhas). In this case, for a safer recovery, one must delete all resources created by that job, as well as the corresponding resource IDs in tfstate, and then redo the plan and apply, obviously crossing fingers not to hit the snag again.

mromascanu123 added the enhancement New feature or request label May 16, 2024

eeaton added waiting-response Waiting for issue author to respond. and removed enhancement New feature or request labels May 23, 2024

mromascanu123 mentioned this issue Jun 25, 2024

doc improvement: clarify intended usage and level of support #1239

Closed
