
When removing a host from cluster and decommissioning it in the same change, the decommission is attempted first, and fails #126

Closed

simeon-aladjem opened this issue Feb 16, 2024 · 5 comments

Labels: bug, stale

@simeon-aladjem


Terraform

v1.7.3

Terraform Provider

v0.8.1

VMware Cloud Foundation

4.5.2

Description

I have a TF file with 1 domain, 2 clusters, 4 hosts in cluster #1, and 3 hosts in cluster #2.
I remove a host from cluster #1 and also remove the host resource itself, as sketched below.
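
Schematically, the change looks like this (a minimal sketch; most attributes are elided and the names follow the plan output below):

# Before the change: vcf_host.host5 is commissioned and referenced
# from the domain's cluster.
resource "vcf_host" "host5" {
  fqdn = "esxi-5.vrack.vsphere.local"
  # ...
}

resource "vcf_domain" "wld1" {
  # ...
  cluster {
    # ...
    host {
      id = vcf_host.host5.id
      # ...
    }
  }
}

# The change deletes both the host block inside the cluster and the
# vcf_host.host5 resource in one configuration change.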

The plan looks like this:

  # vcf_domain.wld1 will be updated in-place
  ~ resource "vcf_domain" "wld1" {
        id                       = "dc34fa4e-0fc3-49a7-83bf-b95876528936"
        name                     = "sfo-w01-vc01"
        # (5 unchanged attributes hidden)

      ~ cluster {
            id                        = "a9263830-fd6d-41e0-b2a3-add395f39c68"
            name                      = "sfo-w01-cl01"
            # (6 unchanged attributes hidden)

          ~ host {
              ~ id          = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> "f926f406-2d1d-4ddd-baef-d5e39866376f"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "f926f406-2d1d-4ddd-baef-d5e39866376f" -> "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87" -> "2da8538e-9434-438d-94bd-aabe4ef1fbdb"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          - host {
              - id          = "2da8538e-9434-438d-94bd-aabe4ef1fbdb" -> null
              - license_key = (sensitive value) -> null

              - vmnic {
                  - id       = "vmnic0" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
              - vmnic {
                  - id       = "vmnic1" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
            }

            # (2 unchanged blocks hidden)
        }

        # (2 unchanged blocks hidden)
    }

  # vcf_host.host5 will be destroyed
  # (because vcf_host.host5 is not in configuration)
  - resource "vcf_host" "host5" {
      - fqdn            = "esxi-5.vrack.vsphere.local" -> null
      - id              = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> null
      - network_pool_id = "b9dc86fd-7074-4adf-b23a-02be8a7c8962" -> null
      - password        = (sensitive value) -> null
      - status          = "ASSIGNED" -> null
      - storage_type    = "VSAN" -> null
      - username        = "root" -> null
    }

The host resource removal (decommissioning) is attempted first and fails because the host has not yet been removed from the cluster.

vcf_host.host5: Destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd]
vcf_host.host5: Still destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd, 10s elapsed]
vcf_host.host5: Still destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd, 20s elapsed]
│
│ Error: Task with ID = dfc83a78-69c5-4788-b941-ca11b8b42c5e , Name: "Decommissioning host(s) esxi-5.vrack.vsphere.local from VMware Cloud Foundation" Type: "HOST_DECOMMISSION" is in state Failed
│
│
│

Affected Resources or Data Sources

resources/vcf_domain
resources/vcf_host
resources/vcf_cluster

Terraform Configuration

N/A

Debug Output


│ Error: Task with ID = dfc83a78-69c5-4788-b941-ca11b8b42c5e , Name: "Decommissioning host(s) esxi-5.vrack.vsphere.local from VMware Cloud Foundation" Type: "HOST_DECOMMISSION" is in state Failed


Panic Output

No response

Expected Behavior

The host removal from the cluster should happen first, and only then should the resource be destroyed.

Actual Behavior

The host resource removal (decommissioning) is attempted first and fails because the host has not yet been removed from the cluster.

Steps to Reproduce

  1. Create a TF plan with 4 vcf_host resources and 1 vcf_domain whose cluster includes those 4 hosts.
  2. Apply the plan and wait for the VCF WLD to be created.
  3. Remove one of the vcf_host resources from the plan and also remove the reference to it from the domain's cluster.
  4. Apply the plan. It will attempt the following change:
  # vcf_domain.wld1 will be updated in-place
  ~ resource "vcf_domain" "wld1" {
        id                       = "dc34fa4e-0fc3-49a7-83bf-b95876528936"
        name                     = "sfo-w01-vc01"
        # (5 unchanged attributes hidden)

      ~ cluster {
            id                        = "a9263830-fd6d-41e0-b2a3-add395f39c68"
            name                      = "sfo-w01-cl01"
            # (6 unchanged attributes hidden)

          ~ host {
              ~ id          = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> "f926f406-2d1d-4ddd-baef-d5e39866376f"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "f926f406-2d1d-4ddd-baef-d5e39866376f" -> "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87" -> "2da8538e-9434-438d-94bd-aabe4ef1fbdb"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          - host {
              - id          = "2da8538e-9434-438d-94bd-aabe4ef1fbdb" -> null
              - license_key = (sensitive value) -> null

              - vmnic {
                  - id       = "vmnic0" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
              - vmnic {
                  - id       = "vmnic1" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
            }

            # (2 unchanged blocks hidden)
        }

        # (2 unchanged blocks hidden)
    }

  # vcf_host.host5 will be destroyed
  # (because vcf_host.host5 is not in configuration)
  - resource "vcf_host" "host5" {
      - fqdn            = "esxi-5.vrack.vsphere.local" -> null
      - id              = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> null
      - network_pool_id = "b9dc86fd-7074-4adf-b23a-02be8a7c8962" -> null
      - password        = (sensitive value) -> null
      - status          = "ASSIGNED" -> null
      - storage_type    = "VSAN" -> null
      - username        = "root" -> null
    }
  5. The first operation attempted will be the destruction of the vcf_host, and it will fail.

Environment Details

No response

Screenshots

No response

References

No response

@simeon-aladjem added the bug and needs-triage labels on Feb 16, 2024
@github-actions bot added the pending-review label on Feb 16, 2024
@spacegospod self-assigned this on Feb 23, 2024
@spacegospod removed the needs-triage and pending-review labels on Feb 23, 2024
@spacegospod
Contributor

@simeon-aladjem

I'm not sure if it is possible to force Terraform to first update the cluster resource and only afterwards attempt to destroy the host resource.
While we investigate, please apply the operations separately as a workaround, for example as sketched below.
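
A minimal sketch of that two-step workaround, assuming the configuration from the issue description (resource names are illustrative):

# Step 1: delete only the host block from the cluster, keep the
# vcf_host resource, and run `terraform apply`. The domain is updated
# in place and the host is removed from the cluster.
resource "vcf_domain" "wld1" {
  # ...
  cluster {
    # ...
    # host block referencing vcf_host.host5 deleted here
  }
}

resource "vcf_host" "host5" {
  fqdn = "esxi-5.vrack.vsphere.local"
  # ... unchanged in step 1
}

# Step 2: delete the vcf_host "host5" resource as well and run
# `terraform apply` again. Decommissioning now succeeds because the
# host is no longer assigned to the cluster.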

@simeon-aladjem
Author

> I'm not sure if it is possible to force Terraform to first update the cluster resource and only afterwards attempt to destroy the host resource. While we investigate, please apply the operations separately as a workaround.

Hi @stoyanzhelyazkov,
When we commission hosts and add them to a cluster in the same plan, it is done in the correct order, i.e. (1) commission the hosts and (2) add the commissioned hosts to the cluster.
Also, when we create a workload domain with an additional cluster in the same plan, it is done in the correct order: (1) create the domain and then (2) create the additional cluster.
How is the order enforced in those cases?

Moreover, the cluster appears as "depending on" the hosts in the .tfstate file:

{
      "mode": "managed",
      "type": "vcf_domain",
      "name": "wld1",
      "provider": "provider[\"registry.terraform.io/vmware/vcf\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "cluster": [
                . . .
            ]
          },
          "sensitive_attributes": [],
          "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjoxNDQwMDAwMDAwMDAwMCwiZGVsZXRlIjozNjAwMDAwMDAwMDAwLCJyZWFkIjoxMjAwMDAwMDAwMDAwLCJ1cGRhdGUiOjE0NDAwMDAwMDAwMDAwfX0=",
          "dependencies": [
            "vcf_host.host11",
            "vcf_host.host6",
            "vcf_host.host7"
          ]
        }
      ]
    },

In other words, Terraform should be smart enough to first remove the host from the cluster and only then delete the host resource.
But even if Terraform isn't that smart, shouldn't the provider be able to perform the operations in the right order?

@tenthirtyam
Contributor

If you use a depends_on in the configuration, it should remove the host from the cluster prior to destroying the host resource in the same run (see the sketch below).
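
A minimal sketch of that suggestion, making the implicit dependency explicit (whether this actually changes the destroy ordering here is what the follow-up questions):

resource "vcf_host" "host5" {
  fqdn = "esxi-5.vrack.vsphere.local"
  # ...
}

resource "vcf_domain" "wld1" {
  # ...

  # Explicitly declare that the domain (and its cluster membership)
  # depends on the host resources it consumes.
  depends_on = [vcf_host.host5]
}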

@simeon-aladjem
Copy link
Author

simeon-aladjem commented Feb 27, 2024

According to what I read in the documentation and in several blogs, depends_on is not recommended because of side effects:
https://itnext.io/beware-of-depends-on-for-modules-it-might-bite-you-da4741caac70
https://developer.hashicorp.com/terraform/language/meta-arguments/depends_on#processing-and-planning-consequences

To summarise what I have learned:
When creating resources, Terraform manages to do it in the right order because of implicit dependencies like these:

resource "vcf_cluster" "wld1-cluster2" {
  name      = "sfo-w01-cl02"
  domain_id = vcf_domain.wld1.id # (1)
  host {
    id = vcf_host.host8.id # (2)
    . . .
  }
  . . .
}

Because of (1) above, TF knows to create the domain first, before creating the cluster.
Because of (2) above, TF knows to create (commission) the host first and only then create/update the cluster.

When destroying resources, though, it seems like TF doesn't consider the dependency.
Is it a Terraform issue, then?

@github-actions bot

Marking this issue as stale due to inactivity. This helps us focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context.
Thank you!

@github-actions bot added the stale label on Apr 28, 2024
@github-actions bot closed this as not planned on May 29, 2024