
When removing a host from cluster and decommissioning it in the same change, the decommission is attempted first, and fails #126

Closed

simeon-aladjem opened this issue Feb 16, 2024 · 5 comments

Labels: bug, stale

@simeon-aladjem


Terraform

v1.7.3

Terraform Provider

v0.8.1

VMware Cloud Foundation

4.5.2

Description

I have a TF file with 1 domain, 2 clusters, 4 hosts in cluster #1, and 3 hosts in cluster #2.
I remove a host from cluster #1 and also remove the host resource itself, as sketched below.
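
Schematically, the change looks like this (a minimal sketch; most attributes are elided and the names follow the plan output below):

# Before the change: vcf_host.host5 is commissioned and referenced
# from the domain's cluster.
resource "vcf_host" "host5" {
  fqdn = "esxi-5.vrack.vsphere.local"
  # ...
}

resource "vcf_domain" "wld1" {
  # ...
  cluster {
    # ...
    host {
      id = vcf_host.host5.id
      # ...
    }
  }
}

# The change deletes both the host block inside the cluster and the
# vcf_host.host5 resource in one configuration change.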

The plan looks like this:

  # vcf_domain.wld1 will be updated in-place
  ~ resource "vcf_domain" "wld1" {
        id                       = "dc34fa4e-0fc3-49a7-83bf-b95876528936"
        name                     = "sfo-w01-vc01"
        # (5 unchanged attributes hidden)

      ~ cluster {
            id                        = "a9263830-fd6d-41e0-b2a3-add395f39c68"
            name                      = "sfo-w01-cl01"
            # (6 unchanged attributes hidden)

          ~ host {
              ~ id          = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> "f926f406-2d1d-4ddd-baef-d5e39866376f"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "f926f406-2d1d-4ddd-baef-d5e39866376f" -> "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87" -> "2da8538e-9434-438d-94bd-aabe4ef1fbdb"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          - host {
              - id          = "2da8538e-9434-438d-94bd-aabe4ef1fbdb" -> null
              - license_key = (sensitive value) -> null

              - vmnic {
                  - id       = "vmnic0" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
              - vmnic {
                  - id       = "vmnic1" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
            }

            # (2 unchanged blocks hidden)
        }

        # (2 unchanged blocks hidden)
    }

  # vcf_host.host5 will be destroyed
  # (because vcf_host.host5 is not in configuration)
  - resource "vcf_host" "host5" {
      - fqdn            = "esxi-5.vrack.vsphere.local" -> null
      - id              = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> null
      - network_pool_id = "b9dc86fd-7074-4adf-b23a-02be8a7c8962" -> null
      - password        = (sensitive value) -> null
      - status          = "ASSIGNED" -> null
      - storage_type    = "VSAN" -> null
      - username        = "root" -> null
    }

The host resource removal (decommissioning) is attempted first and fails because the host has not yet been removed from the cluster.

vcf_host.host5: Destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd]
vcf_host.host5: Still destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd, 10s elapsed]
vcf_host.host5: Still destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd, 20s elapsed]
│
│ Error: Task with ID = dfc83a78-69c5-4788-b941-ca11b8b42c5e , Name: "Decommissioning host(s) esxi-5.vrack.vsphere.local from VMware Cloud Foundation" Type: "HOST_DECOMMISSION" is in state Failed
│
│
│

Affected Resources or Data Sources

resources/vcf_domain
resources/vcf_host
resources/vcf_cluster

Terraform Configuration

N/A

Debug Output


│ Error: Task with ID = dfc83a78-69c5-4788-b941-ca11b8b42c5e , Name: "Decommissioning host(s) esxi-5.vrack.vsphere.local from VMware Cloud Foundation" Type: "HOST_DECOMMISSION" is in state Failed


Panic Output

No response

Expected Behavior

The host removal from the cluster should happen first, and only then should the resource be destroyed.

Actual Behavior

The host resource removal (decommissioning) is attempted first and fails because the host has not yet been removed from the cluster.

Steps to Reproduce

  1. Create a TF plan with 4 vcf_host resources and 1 vcf_domain whose cluster includes those 4 hosts.
  2. Apply the plan and wait for the VCF WLD to be created.
  3. Remove one of the vcf_host resources from the plan and also remove the reference to it from the domain's cluster.
  4. Apply the plan. It will attempt the following change:
  # vcf_domain.wld1 will be updated in-place
  ~ resource "vcf_domain" "wld1" {
        id                       = "dc34fa4e-0fc3-49a7-83bf-b95876528936"
        name                     = "sfo-w01-vc01"
        # (5 unchanged attributes hidden)

      ~ cluster {
            id                        = "a9263830-fd6d-41e0-b2a3-add395f39c68"
            name                      = "sfo-w01-cl01"
            # (6 unchanged attributes hidden)

          ~ host {
              ~ id          = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> "f926f406-2d1d-4ddd-baef-d5e39866376f"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "f926f406-2d1d-4ddd-baef-d5e39866376f" -> "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87" -> "2da8538e-9434-438d-94bd-aabe4ef1fbdb"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          - host {
              - id          = "2da8538e-9434-438d-94bd-aabe4ef1fbdb" -> null
              - license_key = (sensitive value) -> null

              - vmnic {
                  - id       = "vmnic0" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
              - vmnic {
                  - id       = "vmnic1" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
            }

            # (2 unchanged blocks hidden)
        }

        # (2 unchanged blocks hidden)
    }

  # vcf_host.host5 will be destroyed
  # (because vcf_host.host5 is not in configuration)
  - resource "vcf_host" "host5" {
      - fqdn            = "esxi-5.vrack.vsphere.local" -> null
      - id              = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> null
      - network_pool_id = "b9dc86fd-7074-4adf-b23a-02be8a7c8962" -> null
      - password        = (sensitive value) -> null
      - status          = "ASSIGNED" -> null
      - storage_type    = "VSAN" -> null
      - username        = "root" -> null
    }
  5. The first operation attempted will be the destruction of the vcf_host, and it will fail.

Environment Details

No response

Screenshots

No response

References

No response

@simeon-aladjem added the bug and needs-triage labels on Feb 16, 2024
@github-actions bot added the pending-review label on Feb 16, 2024
@spacegospod self-assigned this on Feb 23, 2024
@spacegospod removed the needs-triage and pending-review labels on Feb 23, 2024
@spacegospod
Contributor

@simeon-aladjem

I'm not sure if it is possible to force Terraform to first update the cluster resource and only afterwards attempt to destroy the host resource.
While we investigate, please apply the operations separately as a workaround, for example as sketched below.
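
A minimal sketch of that two-step workaround, assuming the configuration from the issue description (resource names are illustrative):

# Step 1: delete only the host block from the cluster, keep the
# vcf_host resource, and run `terraform apply`. The domain is updated
# in place and the host is removed from the cluster.
resource "vcf_domain" "wld1" {
  # ...
  cluster {
    # ...
    # host block referencing vcf_host.host5 deleted here
  }
}

resource "vcf_host" "host5" {
  fqdn = "esxi-5.vrack.vsphere.local"
  # ... unchanged in step 1
}

# Step 2: delete the vcf_host "host5" resource as well and run
# `terraform apply` again. Decommissioning now succeeds because the
# host is no longer assigned to the cluster.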

@simeon-aladjem
Author

> I'm not sure if it is possible to force Terraform to first update the cluster resource and only afterwards attempt to destroy the host resource. While we investigate, please apply the operations separately as a workaround.

Hi @stoyanzhelyazkov,
When we commission hosts and add them to a cluster in the same plan, it is done in the correct order, i.e. (1) commission the hosts and (2) add the commissioned hosts to the cluster.
Also, when we create a workload domain with an additional cluster in the same plan, it is done in the correct order: (1) create the domain and then (2) create the additional cluster.
How is the order enforced in those cases?

Moreover, the cluster appears as "depending on" the hosts in the .tfstate file:

{
      "mode": "managed",
      "type": "vcf_domain",
      "name": "wld1",
      "provider": "provider[\"registry.terraform.io/vmware/vcf\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "cluster": [
                . . .
            ]
          },
          "sensitive_attributes": [],
          "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjoxNDQwMDAwMDAwMDAwMCwiZGVsZXRlIjozNjAwMDAwMDAwMDAwLCJyZWFkIjoxMjAwMDAwMDAwMDAwLCJ1cGRhdGUiOjE0NDAwMDAwMDAwMDAwfX0=",
          "dependencies": [
            "vcf_host.host11",
            "vcf_host.host6",
            "vcf_host.host7"
          ]
        }
      ]
    },

In other words, Terraform should be smart enough to first remove the host from the cluster and only then delete the host resource.
But even if Terraform isn't that smart, shouldn't the provider be able to perform the operations in the right order?

@tenthirtyam
Contributor

If you use a depends_on in the configuration, it should remove the host from the cluster prior to destroying the host resource in the same run (see the sketch below).
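
A minimal sketch of that suggestion, making the implicit dependency explicit (whether this actually changes the destroy ordering here is what the follow-up questions):

resource "vcf_host" "host5" {
  fqdn = "esxi-5.vrack.vsphere.local"
  # ...
}

resource "vcf_domain" "wld1" {
  # ...

  # Explicitly declare that the domain (and its cluster membership)
  # depends on the host resources it consumes.
  depends_on = [vcf_host.host5]
}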

@simeon-aladjem
Copy link
Author

simeon-aladjem commented Feb 27, 2024

According to what I read in the documentation and in several blogs, depends_on is not recommended because of side effects:
https://itnext.io/beware-of-depends-on-for-modules-it-might-bite-you-da4741caac70
https://developer.hashicorp.com/terraform/language/meta-arguments/depends_on#processing-and-planning-consequences

To summarise what I have learned:
When creating resources, Terraform manages to do it in the right order because of implicit dependencies like these:

resource "vcf_cluster" "wld1-cluster2" {
  name      = "sfo-w01-cl02"
  domain_id = vcf_domain.wld1.id # (1)
  host {
    id = vcf_host.host8.id # (2)
    . . .
  }
  . . .
}

Because of (1) above, TF knows to create the domain first, before creating the cluster.
Because of (2) above, TF knows to create (commission) the host first and only then create/update the cluster.

When destroying resources, though, it seems like TF doesn't consider the dependency.
Is it a Terraform issue, then?

@github-actions bot

Marking this issue as stale due to inactivity. This helps us focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context.
Thank you!

@github-actions bot added the stale label on Apr 28, 2024
@github-actions bot closed this as not planned on May 29, 2024