TKG Cluster upgrade from v1.26.8 to v1.27.5 fails #1248

Closed
returntrip opened this issue Apr 8, 2024 · 9 comments · Fixed by #1247
@returntrip

Terraform Version

Terraform v1.7.5
on linux_amd64
provider registry.terraform.io/vmware/vcd v3.12.0

Affected Resource(s)

  • vcd_cse_kubernetes_cluster

Expected Behavior

Upgrade from 'Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1' to 'Ubuntu 20.04 and Kubernetes v1.27.5+vmware.1' succeeds
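For context, the upgrade is expressed in Terraform by pointing the cluster's kubernetes_template_id at the newer OVA. A minimal sketch, assuming illustrative org/catalog/template names; every other cluster attribute is omitted:

data "vcd_catalog" "cse" {
  org  = "my-org"       # illustrative org name
  name = "cse-catalog"  # illustrative catalog holding the TKG OVAs
}

data "vcd_catalog_vapp_template" "new_ova" {
  org        = "my-org"
  catalog_id = data.vcd_catalog.cse.id
  # Switching this name from the v1.26.8 OVA to the v1.27.5 OVA triggers the upgrade
  name = "Ubuntu 20.04 and Kubernetes v1.27.5+vmware.1"
}

resource "vcd_cse_kubernetes_cluster" "my_cluster" {
  # ... all other (unchanged) cluster attributes omitted ...
  kubernetes_template_id = data.vcd_catalog_vapp_template.new_ova.id
}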

Actual Behavior

Upgrade fails with error:

Error: Kubernetes cluster update failed: cannot perform an OVA change as the new one 'Ubuntu 20.04 and Kubernetes v1.27.5+vmware.1' has an older TKG/Kubernetes version (v2.4.0/v1.27.5+vmware.1)

│ with vcd_cse_kubernetes_cluster.my_cluster,
│ on main.tf line 73, in resource "vcd_cse_kubernetes_cluster" "my_cluster":
│ 73: resource "vcd_cse_kubernetes_cluster" "my_cluster" {

Steps to Reproduce

  1. terraform apply


@returntrip returntrip changed the title TKG Cluster upgrade from v1.26.8 to v1.27.5 failes TKG Cluster upgrade from v1.26.8 to v1.27.5 fails Apr 8, 2024
@adambarreiro adambarreiro self-assigned this Apr 8, 2024
@adambarreiro
Collaborator

Hi @returntrip,

I implemented a fix in this PR: #1247
Would you be able to test it? It worked for me, but your feedback would also be valuable.

You can clone my fork (https://github.com/adambarreiro/terraform-provider-vcd.git), check out the fix-cse-upgrade branch, and run make install to install this unreleased provider build with the patch.
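In concrete commands (assuming Git and the Go toolchain needed to build the provider are installed):

$ git clone https://github.com/adambarreiro/terraform-provider-vcd.git
$ cd terraform-provider-vcd
$ git checkout fix-cse-upgrade
$ make install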

Then you could try creating a cluster with a specific TKG version; it should now display the OVAs that weren't displayed before.

Let me know, thanks in advance!

@returntrip
Author

Hi @adambarreiro!

Sure, I have tried this:

  1. Created a cluster with TKG 2.2.0 v1.25.7+vmware.2 and got the following supported upgrade template versions:
$ terraform output supported_upgrades
toset([
  "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
  "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1",
])

However, I would expect the same list as in the GUI:
[screenshot: GUI list of supported upgrade versions]

So it seems TKG 2.4.0 v1.25.13 is missing.

I also guess that the (GUI) upgrade logic is as follows (sketched in code after the template list below):

Supported Upgrade:

  • A newer TKG version, with the K8s minor version at most current +1, or any K8s patch version for the current minor version (1.25.x)
    • TKG 2.4.0 v1.25.13
    • TKG 2.4.0 v1.26.8
    • TKG 2.5.0 v1.26.11

Unsupported Upgrade:

  • K8s minor version >= current +2
    • TKG 2.4.0 v1.27.5
    • TKG 2.5.0 v1.27.8
    • TKG 2.5.0 v1.28.4

The above assumes I have these templates available in my CSE catalog:

TKG 2.2.0 Ubuntu 20.04 and Kubernetes v1.25.7+vmware.2
TKG 2.4.0 Ubuntu 20.04 and Kubernetes v1.25.13+vmware.1
TKG 2.4.0 Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1
TKG 2.5.0 Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1
TKG 2.4.0 Ubuntu 20.04 and Kubernetes v1.27.5+vmware.1
TKG 2.5.0 Ubuntu 20.04 and Kubernetes v1.27.8+vmware.1
TKG 2.5.0 Ubuntu 22.04 and Kubernetes v1.28.4+vmware.1
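Here is a sketch in Go of the rule I am inferring (purely illustrative; not the provider's or CSE's actual code):

package main

import "fmt"

// upgradeAllowed encodes my guess at the GUI's rule: a target OVA is a valid
// upgrade when its TKG version is not older than the current one and its
// Kubernetes minor version is at most one above the current minor.
// Versions are reduced to minor numbers (TKG 2.4.0 -> 4, K8s v1.26.8 -> 26).
func upgradeAllowed(curTKGMinor, curK8sMinor, newTKGMinor, newK8sMinor int) bool {
	if newTKGMinor < curTKGMinor {
		return false // TKG downgrades are rejected
	}
	return newK8sMinor-curK8sMinor <= 1 // a K8s minor jump of 2+ is unsupported
}

func main() {
	// From TKG 2.2.0 / K8s v1.25.7:
	fmt.Println(upgradeAllowed(2, 25, 4, 26)) // true:  TKG 2.4.0 v1.26.8 is offered
	fmt.Println(upgradeAllowed(2, 25, 4, 27)) // false: TKG 2.4.0 v1.27.5 is not
	// From TKG 2.5.0 / K8s v1.26.11:
	fmt.Println(upgradeAllowed(5, 26, 4, 27)) // false: older TKG version
}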
  2. Upgraded the (same) k8s cluster successfully to TKG 2.5.0 v1.26.11

However, terraform output supported_upgrades still reports the values from the previous apply:

$ terraform output supported_upgrades
toset([
  "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
  "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1",
])

Meanwhile, both the GUI and terraform plan say v1.27.8 is available. This is a bit confusing, but I guess that is just how Terraform works: outputs are read from the state, which only gets refreshed on apply.

GUI:
[screenshot: GUI showing Ubuntu 20.04 and Kubernetes v1.27.8+vmware.1 as an available upgrade]
terraform plan snippet:

  ~ supported_upgrades = [
      - "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
      - "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1",
      + "Ubuntu 20.04 and Kubernetes v1.27.8+vmware.1",

One note on the above: TKG 2.4.0 v1.27.5 is not offered, perhaps because its TKG version (2.4.0) is below the cluster's current one (TKG 2.5.0 v1.26.11).

@adambarreiro
Collaborator

Thanks @returntrip for the feedback, really appreciated.
I made some adjustments and it should work now.

@returntrip
Author

It started well:

  1. Cluster created as TKG 2.2.0 v1.25.7 (looks fine):
$ terraform output supported_upgrades
toset([
  "Ubuntu 20.04 and Kubernetes v1.25.13+vmware.1",
  "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
  "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1",
])
  2. TKG Upgrade to TKG 2.4.0 v1.25.13 (looks fine):
"before": [
        "Ubuntu 20.04 and Kubernetes v1.25.13+vmware.1",
        "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
        "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1"
      ],
      "after": [
        "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
        "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1"
      ],
  3. TKG Upgrade to TKG 2.5.0 v1.26.11:
$ kubectl get nodes
NAME                                                STATUS   ROLES           AGE     VERSION
testtf4-control-plane-node-pool-kth7l               Ready    control-plane   5h32m   v1.25.13+vmware.1
testtf4-worker-node-pool-1-6f7466bbd9xb9f4f-bwdbf   Ready    <none>          3h52m   v1.26.11+vmware.1

a) The worker node (WN) was upgraded OK.
b) The control plane node (CPN) was not upgraded. For some reason, the CoreDNS version in the CAPI YAML is 1.9.3 while the CPN is running 1.10.1, which is newer than what "TF" wants to install. This is the error in the events:

| [admission  webhook "validation.kubeadmcontrolplane.controlplane.cluster.x-k8s.io"  denied the request: KubeadmControlPlane.controlplane.cluster.x-k8s.io  "testtf4-control-plane-node-pool" is invalid:  spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag: Forbidden:  cannot migrate CoreDNS up to '1.9.3' from '1.10.1': cannot migrate up to  '1.9.3' from '1.10.1'] during patching objects with name  [KubeadmControlPlane/testtf4-control-plane-node-pool]
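As a side note, the imageTag that the webhook rejects can be read directly from the KubeadmControlPlane object (object and namespace names are from my cluster; the jsonpath follows the CAPI YAML below):

$ kubectl get kubeadmcontrolplane testtf4-control-plane-node-pool -n testtf4-ns \
    -o jsonpath='{.spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag}'
v1.9.3_vmware.19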

This is a snippet from the CAPI YAML; check the dns (and etcd) sections below:

     ---
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlane
      metadata:
        name: testtf4-control-plane-node-pool
        namespace: testtf4-ns
      spec:
        kubeadmConfigSpec:
          clusterConfiguration:
            apiServer:
              certSANs:
                - localhost
                - 127.0.0.1
            controllerManager:
              extraArgs:
                enable-hostpath-provisioner: "true"
            dns:
              imageRepository: projects.registry.vmware.com/tkg
              imageTag: v1.9.3_vmware.19
            etcd:
              local:
                imageRepository: projects.registry.vmware.com/tkg
                imageTag: v3.5.10_vmware.1
            imageRepository: projects.registry.vmware.com/tkg
          initConfiguration:
            nodeRegistration:
              criSocket: /run/containerd/containerd.sock
              kubeletExtraArgs:
                cloud-provider: external
                eviction-hard: nodefs.available<0%%,nodefs.inodesFree<0%%,imagefs.available<0%%
          joinConfiguration:
            nodeRegistration:
              criSocket: /run/containerd/containerd.sock
              kubeletExtraArgs:
                cloud-provider: external
                eviction-hard: nodefs.available<0%%,nodefs.inodesFree<0%%,imagefs.available<0%%
          preKubeadmCommands:
            - mv /etc/ssl/certs/custom_certificate_*.crt
              /usr/local/share/ca-certificates && update-ca-certificates
          users:
            - name: root
              sshAuthorizedKeys:
                - ""
        machineTemplate:
          infrastructureRef:
            apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
            kind: VCDMachineTemplate
            name: testtf4-control-plane-node-pool
            namespace: testtf4-ns
        replicas: 3
        version: v1.26.11+vmware.1
      ---

Below is the version currently running on the CPN. And here is the documentation about this being the correct version for TKG 2.4.0,

while here is the documentation for TKG 2.5.0:

$ kubectl describe deployment coredns -n kube-system | grep -i image
    Image:       projects.registry.vmware.com/tkg/coredns:v1.10.1_vmware.7

Note that the etcd version also seems incorrect.

BTW, how are the k8s component versions (DNS, etcd) fetched and populated? I am just curious.

@adambarreiro
Collaborator

Hi @returntrip,
Thanks again for your feedback.

It seems to be another bug: the CoreDNS version for TKG 2.4.0 with K8s v1.25.13 is wrong; it should be v1.9.3_vmware.16 instead of v1.10.1_vmware.7. (Answering your question: it is obtained from here; the fix PR is this one.)

I have reported the same to the Container Service Extension team, as I believe it should happen in the UI as well. Thanks a lot for that.

@returntrip
Author

returntrip commented Apr 11, 2024

I am a bit confused though, because the docs for TKG 2.4.0 (https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.4/tkg-deploy-mc/mgmt-release-notes.html) state:

coredns v1.10.1_vmware.7

etcd v3.5.7_vmware.6

So on paper the version is correct.

@adambarreiro
Collaborator

Hi @returntrip, you're correct indeed.

I have now changed v2.5.0 to use CoreDNS v1.10.1_vmware.12 instead. I checked the public Docker registry and it doesn't have v1.10.1_vmware.11, so that's something I'd need to clarify with the CSE team so we're aligned on providing the right version.
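For the record, tag availability can be double-checked against the registry without pulling anything (a hedged example, assuming Docker is installed; the command fails when the tag does not exist):

$ docker manifest inspect projects.registry.vmware.com/tkg/coredns:v1.10.1_vmware.12   # published
$ docker manifest inspect projects.registry.vmware.com/tkg/coredns:v1.10.1_vmware.11   # not published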

@returntrip
Author

returntrip commented Apr 12, 2024

I think the problem is here, in the sense that v1.26.11 has incorrect coredns and etcd versions there.

They should (according to this) be:
dns (coredns) v1.10.1_vmware.11* (which you cannot find anyway)
etcd v3.5.9_vmware.6*

Instead of:
"coreDns": "v1.9.3_vmware.19"
"etcd": "v3.5.10_vmware.1",

There is something odd here as well.

So either the documentation is incorrect, or both https://github.com/vmware/go-vcloud-director/blob/main/govcd/cse/tkg_versions.json and https://github.com/vmware/cluster-api-provider-cloud-director/blob/main/templates/cluster-template-v1.26.11-tkgv2.5.0-crs.yaml are.

P.S.: if the documentation is incorrect, then the upgrade path is wrong in some way...

adambarreiro added a commit that referenced this issue Apr 18, 2024
…cluster (#1247)

* Fix Issue #1248 that prevents CSE Kubernetes clusters from being upgraded to an OVA with higher Kubernetes version but same TKG version, and to an OVA with a higher patch version of Kubernetes.
* Fix Issue #1248 that prevents CSE Kubernetes clusters from being upgraded to TKG v2.5.0 with Kubernetes v1.26.11 as it performed an invalid upgrade of CoreDNS.
* Fix Issue #1252 that prevents reading the SSH Public Key from provisioned CSE Kubernetes clusters.

Signed-off-by: abarreiro <abarreiro@vmware.com>
@adambarreiro
Collaborator

adambarreiro commented Apr 18, 2024

Merged #1248, will release a patch soon.

PS: A bit more detail about the fix: it seems this issue uncovered more wrong things than expected. Both the documentation and the CSE UI extension appear to have some mistakes. As a workaround, TKG v2.4.0 versions will use a lower CoreDNS version until things get clarified.

The CSE team will update the known issues in their docs at some point, with some recommended upgrade paths.

Thanks for reporting!
