TKG Cluster upgrade from v1.26.8 to v1.27.5 fails #1248

Closed
returntrip opened this issue Apr 8, 2024 · 9 comments · Fixed by #1247
@returntrip

Terraform Version

Terraform v1.7.5
on linux_amd64
provider registry.terraform.io/vmware/vcd v3.12.0

Affected Resource(s)

  • vcd_cse_kubernetes_cluster

Expected Behavior

Upgrade from 'Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1' to 'Ubuntu 20.04 and Kubernetes v1.27.5+vmware.1' succeeds
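For context, the upgrade is expressed in Terraform by pointing the cluster's kubernetes_template_id at the newer OVA. A minimal sketch, assuming illustrative org/catalog/template names; every other cluster attribute is omitted:

data "vcd_catalog" "cse" {
  org  = "my-org"       # illustrative org name
  name = "cse-catalog"  # illustrative catalog holding the TKG OVAs
}

data "vcd_catalog_vapp_template" "new_ova" {
  org        = "my-org"
  catalog_id = data.vcd_catalog.cse.id
  # Switching this name from the v1.26.8 OVA to the v1.27.5 OVA triggers the upgrade
  name = "Ubuntu 20.04 and Kubernetes v1.27.5+vmware.1"
}

resource "vcd_cse_kubernetes_cluster" "my_cluster" {
  # ... all other (unchanged) cluster attributes omitted ...
  kubernetes_template_id = data.vcd_catalog_vapp_template.new_ova.id
}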

Actual Behavior

Upgrade fails with error:

Error: Kubernetes cluster update failed: cannot perform an OVA change as the new one 'Ubuntu 20.04 and Kubernetes v1.27.5+vmware.1' has an older TKG/Kubernetes version (v2.4.0/v1.27.5+vmware.1)

│ with vcd_cse_kubernetes_cluster.my_cluster,
│ on main.tf line 73, in resource "vcd_cse_kubernetes_cluster" "my_cluster":
│ 73: resource "vcd_cse_kubernetes_cluster" "my_cluster" {

Steps to Reproduce

  1. terraform apply


@returntrip returntrip changed the title TKG Cluster upgrade from v1.26.8 to v1.27.5 failes TKG Cluster upgrade from v1.26.8 to v1.27.5 fails Apr 8, 2024
@adambarreiro adambarreiro self-assigned this Apr 8, 2024
@adambarreiro
Collaborator

Hi @returntrip,

I implemented a fix in this PR: #1247
Would you be able to test it? It worked for me, but your feedback would also be valuable.

You can clone my fork (https://github.com/adambarreiro/terraform-provider-vcd.git), check out the fix-cse-upgrade branch, and run make install to install this unreleased provider build with the patch.
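In concrete commands (assuming Git and the Go toolchain needed to build the provider are installed):

$ git clone https://github.com/adambarreiro/terraform-provider-vcd.git
$ cd terraform-provider-vcd
$ git checkout fix-cse-upgrade
$ make install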

Then you could try creating a cluster with a specific TKG version; it should now display the OVAs that weren't displayed before.

Let me know, thanks in advance!

@returntrip
Author

Hi @adambarreiro!

Sure, I have tried this:

  1. Created a cluster with TKG 2.2.0 v1.25.7+vmware.2 and got the following supported upgrade template versions:
$ terraform output supported_upgrades
toset([
  "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
  "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1",
])

However, I would expect the same list as in the GUI:
[screenshot: GUI list of supported upgrade versions]

So it seems TKG 2.4.0 v1.25.13 is missing.

I also guess that the (GUI) upgrade logic is as follows (sketched in code after the template list below):

Supported Upgrade:

  • A newer TKG version, with the K8s minor version at most current +1, or any K8s patch version for the current minor version (1.25.x)
    • TKG 2.4.0 v1.25.13
    • TKG 2.4.0 v1.26.8
    • TKG 2.5.0 v1.26.11

Unsupported Upgrade:

  • K8s minor version >= current +2
    • TKG 2.4.0 v1.27.5
    • TKG 2.5.0 v1.27.8
    • TKG 2.5.0 v1.28.4

The above assumes I have these templates available in my CSE catalog:

TKG 2.2.0 Ubuntu 20.04 and Kubernetes v1.25.7+vmware.2
TKG 2.4.0 Ubuntu 20.04 and Kubernetes v1.25.13+vmware.1
TKG 2.4.0 Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1
TKG 2.5.0 Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1
TKG 2.4.0 Ubuntu 20.04 and Kubernetes v1.27.5+vmware.1
TKG 2.5.0 Ubuntu 20.04 and Kubernetes v1.27.8+vmware.1
TKG 2.5.0 Ubuntu 22.04 and Kubernetes v1.28.4+vmware.1
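Here is a sketch in Go of the rule I am inferring (purely illustrative; not the provider's or CSE's actual code):

package main

import "fmt"

// upgradeAllowed encodes my guess at the GUI's rule: a target OVA is a valid
// upgrade when its TKG version is not older than the current one and its
// Kubernetes minor version is at most one above the current minor.
// Versions are reduced to minor numbers (TKG 2.4.0 -> 4, K8s v1.26.8 -> 26).
func upgradeAllowed(curTKGMinor, curK8sMinor, newTKGMinor, newK8sMinor int) bool {
	if newTKGMinor < curTKGMinor {
		return false // TKG downgrades are rejected
	}
	return newK8sMinor-curK8sMinor <= 1 // a K8s minor jump of 2+ is unsupported
}

func main() {
	// From TKG 2.2.0 / K8s v1.25.7:
	fmt.Println(upgradeAllowed(2, 25, 4, 26)) // true:  TKG 2.4.0 v1.26.8 is offered
	fmt.Println(upgradeAllowed(2, 25, 4, 27)) // false: TKG 2.4.0 v1.27.5 is not
	// From TKG 2.5.0 / K8s v1.26.11:
	fmt.Println(upgradeAllowed(5, 26, 4, 27)) // false: older TKG version
}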
  2. Upgraded the (same) k8s cluster successfully to TKG 2.5.0 v1.26.11

However, terraform output supported_upgrades still reports the values from the previous apply:

$ terraform output supported_upgrades
toset([
  "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
  "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1",
])

Meanwhile, both the GUI and terraform plan say v1.27.8 is available. This is a bit confusing, but I guess that is just how Terraform works: outputs are read from the state, which only gets refreshed on apply.

GUI:
[screenshot: GUI showing Ubuntu 20.04 and Kubernetes v1.27.8+vmware.1 as an available upgrade]
terraform plan snippet:

  ~ supported_upgrades = [
      - "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
      - "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1",
      + "Ubuntu 20.04 and Kubernetes v1.27.8+vmware.1",

One note on the above: TKG 2.4.0 v1.27.5 is not offered, perhaps because its TKG version (2.4.0) is below the cluster's current one (TKG 2.5.0 v1.26.11).

@adambarreiro
Collaborator

Thanks @returntrip for the feedback, really appreciated.
I made some adjustments and it should work now.

@returntrip
Author

It started well:

  1. Cluster created as TKG 2.2.0 v1.25.7 (looks fine):
$ terraform output supported_upgrades
toset([
  "Ubuntu 20.04 and Kubernetes v1.25.13+vmware.1",
  "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
  "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1",
])
  2. TKG Upgrade to TKG 2.4.0 v1.25.13 (looks fine):
"before": [
        "Ubuntu 20.04 and Kubernetes v1.25.13+vmware.1",
        "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
        "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1"
      ],
      "after": [
        "Ubuntu 20.04 and Kubernetes v1.26.11+vmware.1",
        "Ubuntu 20.04 and Kubernetes v1.26.8+vmware.1"
      ],
  3. TKG Upgrade to TKG 2.5.0 v1.26.11:
$ kubectl get nodes
NAME                                                STATUS   ROLES           AGE     VERSION
testtf4-control-plane-node-pool-kth7l               Ready    control-plane   5h32m   v1.25.13+vmware.1
testtf4-worker-node-pool-1-6f7466bbd9xb9f4f-bwdbf   Ready    <none>          3h52m   v1.26.11+vmware.1

a) The worker node (WN) was upgraded OK.
b) The control plane node (CPN) was not upgraded. For some reason, the CoreDNS version in the CAPI YAML is 1.9.3 while the CPN is running 1.10.1, which is newer than what "TF" wants to install. This is the error in the events:

| [admission  webhook "validation.kubeadmcontrolplane.controlplane.cluster.x-k8s.io"  denied the request: KubeadmControlPlane.controlplane.cluster.x-k8s.io  "testtf4-control-plane-node-pool" is invalid:  spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag: Forbidden:  cannot migrate CoreDNS up to '1.9.3' from '1.10.1': cannot migrate up to  '1.9.3' from '1.10.1'] during patching objects with name  [KubeadmControlPlane/testtf4-control-plane-node-pool]
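As a side note, the imageTag that the webhook rejects can be read directly from the KubeadmControlPlane object (object and namespace names are from my cluster; the jsonpath follows the CAPI YAML below):

$ kubectl get kubeadmcontrolplane testtf4-control-plane-node-pool -n testtf4-ns \
    -o jsonpath='{.spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag}'
v1.9.3_vmware.19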

This is a snippet from the CAPI YAML; check the dns (and etcd) sections below:

     ---
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlane
      metadata:
        name: testtf4-control-plane-node-pool
        namespace: testtf4-ns
      spec:
        kubeadmConfigSpec:
          clusterConfiguration:
            apiServer:
              certSANs:
                - localhost
                - 127.0.0.1
            controllerManager:
              extraArgs:
                enable-hostpath-provisioner: "true"
            dns:
              imageRepository: projects.registry.vmware.com/tkg
              imageTag: v1.9.3_vmware.19
            etcd:
              local:
                imageRepository: projects.registry.vmware.com/tkg
                imageTag: v3.5.10_vmware.1
            imageRepository: projects.registry.vmware.com/tkg
          initConfiguration:
            nodeRegistration:
              criSocket: /run/containerd/containerd.sock
              kubeletExtraArgs:
                cloud-provider: external
                eviction-hard: nodefs.available<0%%,nodefs.inodesFree<0%%,imagefs.available<0%%
          joinConfiguration:
            nodeRegistration:
              criSocket: /run/containerd/containerd.sock
              kubeletExtraArgs:
                cloud-provider: external
                eviction-hard: nodefs.available<0%%,nodefs.inodesFree<0%%,imagefs.available<0%%
          preKubeadmCommands:
            - mv /etc/ssl/certs/custom_certificate_*.crt
              /usr/local/share/ca-certificates && update-ca-certificates
          users:
            - name: root
              sshAuthorizedKeys:
                - ""
        machineTemplate:
          infrastructureRef:
            apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
            kind: VCDMachineTemplate
            name: testtf4-control-plane-node-pool
            namespace: testtf4-ns
        replicas: 3
        version: v1.26.11+vmware.1
      ---

Below is the version currently running on the CPN. And here is the documentation about this being the correct version for TKG 2.4.0,

while here is the documentation for TKG 2.5.0:

$ kubectl describe deployment coredns -n kube-system | grep -i image
    Image:       projects.registry.vmware.com/tkg/coredns:v1.10.1_vmware.7

Note that the etcd version also seems incorrect.

BTW, how are the k8s component versions (DNS, etcd) fetched and populated? I am just curious.

@adambarreiro
Collaborator

Hi @returntrip,
Thanks again for your feedback.

It seems to be another bug: the CoreDNS version for TKG 2.4.0 with K8s v1.25.13 is wrong; it should be v1.9.3_vmware.16 instead of v1.10.1_vmware.7. (Answering your question: it is obtained from here; the fix PR is this one.)

I have reported the same to the Container Service Extension team, as I believe it should happen in the UI as well. Thanks a lot for that.

@returntrip
Author

returntrip commented Apr 11, 2024

I am a bit confused though, because the docs for TKG 2.4.0 (https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.4/tkg-deploy-mc/mgmt-release-notes.html) state:

coredns v1.10.1_vmware.7

etcd v3.5.7_vmware.6

So on paper the version is correct.

@adambarreiro
Collaborator

Hi @returntrip, you're correct indeed.

I have now changed v2.5.0 to use CoreDNS v1.10.1_vmware.12 instead. I checked the public Docker registry and it doesn't have v1.10.1_vmware.11, so that's something I'd need to clarify with the CSE team so we're aligned on providing the right version.
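For the record, tag availability can be double-checked against the registry without pulling anything (a hedged example, assuming Docker is installed; the command fails when the tag does not exist):

$ docker manifest inspect projects.registry.vmware.com/tkg/coredns:v1.10.1_vmware.12   # published
$ docker manifest inspect projects.registry.vmware.com/tkg/coredns:v1.10.1_vmware.11   # not published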

@returntrip
Author

returntrip commented Apr 12, 2024

I think the problem is here, in the sense that v1.26.11 has incorrect coredns and etcd versions there.

They should (according to this) be:
dns (coredns) v1.10.1_vmware.11* (which you cannot find anyway)
etcd v3.5.9_vmware.6*

Instead of:
"coreDns": "v1.9.3_vmware.19"
"etcd": "v3.5.10_vmware.1",

There is something odd here as well.

So either the documentation is incorrect, or both https://github.com/vmware/go-vcloud-director/blob/main/govcd/cse/tkg_versions.json and https://github.com/vmware/cluster-api-provider-cloud-director/blob/main/templates/cluster-template-v1.26.11-tkgv2.5.0-crs.yaml are.

P.S.: if the documentation is incorrect, then the upgrade path is wrong in some way...

adambarreiro added a commit that referenced this issue Apr 18, 2024
…cluster (#1247)

* Fix Issue #1248 that prevents CSE Kubernetes clusters from being upgraded to an OVA with higher Kubernetes version but same TKG version, and to an OVA with a higher patch version of Kubernetes.
* Fix Issue #1248 that prevents CSE Kubernetes clusters from being upgraded to TKG v2.5.0 with Kubernetes v1.26.11 as it performed an invalid upgrade of CoreDNS.
* Fix Issue #1252 that prevents reading the SSH Public Key from provisioned CSE Kubernetes clusters.

Signed-off-by: abarreiro <abarreiro@vmware.com>
@adambarreiro
Collaborator

adambarreiro commented Apr 18, 2024

Merged #1248, will release a patch soon.

PS: A bit more detail about the fix: it seems this issue uncovered more wrong things than expected. Both the documentation and the CSE UI extension appear to have some mistakes. As a workaround, TKG v2.4.0 versions will use a lower CoreDNS version until things get clarified.

The CSE team will update the known issues in their docs at some point, with some recommended upgrade paths.

Thanks for reporting!
