Provide mechanism for cleanup after failed deployment to enable re-deployment #1240

Open
mromascanu123 opened this issue May 16, 2024 · 3 comments
Labels
waiting-response Waiting for issue author to respond.

Comments

@mromascanu123

mromascanu123 commented May 16, 2024

TL;DR

After a failed deployment, the already created resources need to be selectively removed from both the environment and tfstate. This has two sides:

  • clean up the created resources by disabling billing on the already created projects, then deleting the projects
  • remove from tfstate the projects and the corresponding random_string resources used for the project name suffixes, to avoid name collisions ("resource already exists") on redeployment. Projects (and other resources) continue to exist in a zombie state even after deletion, and their names are unique at org level

For the first item, this would be a script replicating the manual steps below:

  1. In the asset manager, go to the folder to clean up and list the cloudresourcemanager.Project resources
  2. Extract the project_id of each project to clean up
  3. For each project_id, run gcloud billing projects unlink <project_id>
  4. For each project_id, identify and extract the liens, if any: gcloud alpha resource-manager liens list --project <project_id>
  5. Delete the liens: gcloud alpha resource-manager liens delete <lien_id> --project <project_id>
  6. Delete the projects: gcloud projects delete --quiet <project_id>
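
A minimal sketch of what such a script could look like, to make the steps above concrete. The function names and the DRY_RUN variable are made up for illustration, and steps 1-2 are replaced here by gcloud projects list with a parent filter instead of the asset manager, which only finds projects directly under the folder (not in nested folders). It defaults to dry-run, so nothing is changed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are assumptions, not part of the repo.

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 1-2 (assumed substitute for the asset manager):
# print the IDs of the projects directly under a folder.
list_folder_projects() {
  local folder_id="$1"
  gcloud projects list \
    --filter="parent.id=${folder_id} AND parent.type=folder" \
    --format="value(projectId)"
}

# Steps 3-6 for one project: unlink billing, delete liens, delete the project.
cleanup_project() {
  local project_id="$1"
  local lien
  run gcloud billing projects unlink "$project_id"
  # Lien names may come back as "liens/<lien_id>"; strip the prefix if present.
  for lien in $(gcloud alpha resource-manager liens list \
      --project "$project_id" --format="value(name)"); do
    run gcloud alpha resource-manager liens delete "${lien#liens/}" \
      --project "$project_id"
  done
  run gcloud projects delete --quiet "$project_id"
}

# Example (folder ID is a placeholder):
#   for p in $(list_folder_projects 123456789012); do cleanup_project "$p"; done
```

Reviewing the dry-run output first and only then re-running with DRY_RUN=0 seems the safer way to use something like this, since project deletion is hard to undo.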

For the second item, add two options to tf-wrapper.sh:

  • list: list the resources in tfstate, e.g. ./tf-wrapper.sh list development
    (the IDs of the resources to clean up will have to be extracted from the resulting list)
  • remove: remove from tfstate the resources whose resource IDs are provided in a file, e.g. ./tf-wrapper.sh remove development ./resources_to_cleanup_from_tfstate.list (or, if no file is provided, simply remove from tfstate all resources under the specified folder)
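
To illustrate the two options, here is a rough sketch built on terraform state list and terraform state rm. The function names, the directory argument (in place of however tf-wrapper.sh would resolve "development" to a folder), and the DRY_RUN variable are all assumptions made for this example; tf-wrapper.sh does not provide them today.

```shell
#!/usr/bin/env bash
# Sketch only: tf_state_list / tf_state_remove are hypothetical helpers.

# list: print every resource address in the state of the given directory.
tf_state_list() {
  local dir="$1"
  terraform -chdir="$dir" state list
}

# remove: drop addresses from the state of the given directory.
# With a file: one address per line, blank lines and "#" comments ignored.
# Without a file: every address currently in the state is removed.
# If DRY_RUN is non-empty, terraform's own -dry-run flag is passed through.
tf_state_remove() {
  local dir="$1"
  local file="${2:-}"
  local addr
  if [ -n "$file" ]; then
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$file"
  else
    tf_state_list "$dir"
  fi | while IFS= read -r addr; do
    # stdin is redirected so terraform cannot consume the address list.
    terraform -chdir="$dir" state rm ${DRY_RUN:+-dry-run} "$addr" </dev/null
  done
}

# Example (paths are placeholders):
#   tf_state_list envs/development > ./resources_to_cleanup_from_tfstate.list
#   DRY_RUN=1 tf_state_remove envs/development ./resources_to_cleanup_from_tfstate.list
```

Once an address is removed this way, Terraform no longer tracks that object and will plan to create it again on the next run, which is the behavior wanted for the random_string suffixes.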

Terraform Resources

N/A

Detailed design

See TL;DR above

Additional information

Related #1238

@mromascanu123 mromascanu123 added the enhancement New feature or request label May 16, 2024
@eeaton
Collaborator

eeaton commented May 20, 2024

Hi @mromascanu123, can you help me understand more about your desired outcome, and in what scenario you want to use this script? Is it something that isn't already addressed by using the helper script to automate the manual steps of deploying with Cloud Build, then destroying?

While there are some flaky errors that require unpicking state like #1187, they are specific enough that I don't recommend creating a script to directly modify terraform state. (Usually modifying terraform state files by any method other than apply should be done only as a last resort). Many other errors that might occur when a deployment fails require some other fix outside of the terraform state (modify IAM policy of the principal doing the deployment, remove a pre-existing org policy that blocks the deployment, modify the tf files) then triggering the terraform apply again.

@eeaton eeaton added waiting-response Waiting for issue author to respond. and removed enhancement New feature or request labels May 23, 2024
@mromascanu123
Author

Hi @eeaton. One of the issues I've seen is the persistence of the resource "random_string" in the tfstate when the plan decides to delete and recreate a project following a failed deployment. The project-factory module will attempt to recreate the project, but the resulting ID will be the same as that of the just-deleted project, so the creation will obviously fail.

I was able to reproduce the issue at least once by aborting a tf-wrapper apply with a Ctrl-C then re-planning and re-applying. But a failed deployment may occur for many other reasons.

The resource "random_string" is used in many places to generate a suffix, and it persists after deletion of the actual resource using it. But afaik projects and KMS keystores persist after being deleted, and their name / ID can't be reused.

In tf-wrapper.sh, "list" and "remove" operations could be added to list the resource IDs in tfstate and, e.g., selectively delete as necessary the random_string resources that served to generate IDs for resources that are deleted but still zombified.
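
As a rough illustration of the selective part, the sketch below only prints the random_string addresses found in a state, optionally limited to a module path prefix, so they can be reviewed before anything is removed. The function name and the directory argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: random_string_addresses is a hypothetical helper.

# Print the state addresses of random_string resources in the given directory.
# The optional second argument restricts the output to addresses that start
# with that module path prefix.
random_string_addresses() {
  local dir="$1"
  local prefix="${2:-}"
  local addr
  terraform -chdir="$dir" state list | while IFS= read -r addr; do
    case "$addr" in
      # Root-level resource, or a resource nested under one or more modules.
      "$prefix"random_string.*|"$prefix"*.random_string.*)
        printf '%s\n' "$addr"
        ;;
    esac
  done
}

# Example (path and module name are placeholders):
#   random_string_addresses envs/development module.base_shared_vpc_project
```

Each reviewed address could then be passed to terraform state rm, after which the next apply would generate a fresh suffix. Given the caution above about editing state, this is probably best kept as a manual, last-resort step.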

@mromascanu123
Author

Another issue seen repeatedly is described in #1228: even when retrying with tf-wrapper plan and then apply, and apparently succeeding, this in reality ends up in tfstate corruption for that particular job (e.g. development under 3-nhas). In this case, safer recovery requires deleting all resources created by that job as well as the corresponding resource IDs in tfstate, then redoing the plan and apply, obviously crossing fingers not to hit the snag again.
