
Shared cluster deployment fails while trying to find the vCenter address #53

Closed
1 of 4 tasks
syangsao opened this issue Jul 28, 2022 · 3 comments

@syangsao

Bug description

Shared cluster deployment fails while trying to resolve the vCenter address, with the following message:

ERROR :Failed to deploy cluster Failed Error: unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be provisioned (this may take a few minutes): cluster creation failed, reason:'VCenterUnreachable', message:'Post "https://vcenter01.syangsao.lab/sdk": dial tcp: lookup vcenter01.syangsao.lab on 100.64.0.10:53: no such host'
Error: exit status 1

Affected product modules (please put an X in all that apply)

  • SIVT APIs
  • SIVT UI
  • SIVT CLI
  • Docs

Expected behavior

The shared cluster installation should complete, using the DNS server that was configured via the SIVT UI. I confirmed that the DNS server on the SIVT host resolves the vCenter address correctly; the same DNS server is used to resolve the AVI hostname during the initial cluster setup.

Steps to reproduce the bug

This occurs repeatedly during the shared cluster installation. Sometimes the shared cluster installation makes it through, but then the same error occurs during the workload cluster installation that follows. I am unsure how to debug this and check why it is stating that the vCenter address is unreachable.
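A possible way to narrow this down (a sketch only; the node IP and cluster context are assumptions based on TKG defaults): 100.64.0.10 is normally the in-cluster DNS service in TKG's default service CIDR, and CoreDNS forwards non-cluster names to the upstream resolvers in each node's /etc/resolv.conf, so the failing lookup points at the resolvers the cluster nodes received rather than at the SIVT host itself. Something like the following could confirm what the nodes are actually using (most likely on the management cluster, where the Cluster API vSphere controllers that report VCenterUnreachable run):

# List the nodes and their IPs (run with the management cluster's kubeconfig)
kubectl get nodes -o wide

# Inspect the resolvers a node received; capv is the default SSH user on TKG vSphere nodes, <node-ip> is hypothetical
ssh capv@<node-ip> cat /etc/resolv.conf

# Check which upstream resolvers CoreDNS forwards to inside the cluster
kubectl -n kube-system get configmap coredns -o yaml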

Version (include the SHA if the version is not obvious)

Environment where the bug was observed (vSphere+VMC, vSphere+DVS, vSphere+NSXt, etc)

vSphere+DVS+AVI

  • SIVT version: 1.3
  • vSphere version: 7.0.3 Update 3g
  • vCenter version: 7.0.3 Update 3g
  • Kubernetes version: (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9+vmware.1", GitCommit:"21eeb4527eefb360eb251addc358cea6997e8335", GitTreeState:"clean", BuildDate:"2022-05-04T00:18:36Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes installer & version:

  • Cloud provider or hardware configuration: Dell

  • OS (e.g. from /etc/os-release):

NAME="VMware Photon OS"
VERSION="3.0"
ID=photon
VERSION_ID=3.0
PRETTY_NAME="VMware Photon OS/Linux"
ANSI_COLOR="1;34"
HOME_URL="https://vmware.github.io/photon/"
BUG_REPORT_URL="https://github.com/vmware/photon/issues"

  • Sonobuoy tarball (which contains * below)

Relevant Debug Output (Logs, manifests, etc)

The SIVT host confirms that the DNS entry for the vCenter address is valid. I am not sure why the installation fails, or where to look to troubleshoot how the lookups are being performed.

root@service13 [ ~ ]# ping vcenter01.syangsao.lab
PING vcenter01.syangsao.lab (192.168.40.14) 56(84) bytes of data.
64 bytes from vcenter01.syangsao.lab (192.168.40.14): icmp_seq=1 ttl=64 time=0.139 ms
64 bytes from vcenter01.syangsao.lab (192.168.40.14): icmp_seq=2 ttl=64 time=0.110 ms
^C
--- vcenter01.syangsao.lab ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 3ms
rtt min/avg/max/mdev = 0.110/0.124/0.139/0.018 ms

DNS lookup seems to be valid from the SIVT host.

root@service13 [ ~ ]# dig vcenter01.syangsao.lab

; <<>> DiG 9.16.27 <<>> vcenter01.syangsao.lab
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37332
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: a30569a4c5712f680100000062e300ee1f992de4a8ee1b4b (good)
;; QUESTION SECTION:
;vcenter01.syangsao.lab.		IN	A

;; ANSWER SECTION:
vcenter01.syangsao.lab.	604800	IN	A	192.168.40.14

;; Query time: 4 msec
;; SERVER: 192.168.40.2#53(192.168.40.2)
;; WHEN: Thu Jul 28 16:34:38 CDT 2022
;; MSG SIZE  rcvd: 95

Reverse lookup is valid.

root@service13 [ ~ ]# dig -x 192.168.40.14

; <<>> DiG 9.16.27 <<>> -x 192.168.40.14
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30348
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: a143c14aac8c21d50100000062e300fa89b24f03d95a48f9 (good)
;; QUESTION SECTION:
;14.40.168.192.in-addr.arpa.	IN	PTR

;; ANSWER SECTION:
14.40.168.192.in-addr.arpa. 604800 IN	PTR	vcenter01.syangsao.lab.

;; Query time: 4 msec
;; SERVER: 192.168.40.2#53(192.168.40.2)
;; WHEN: Thu Jul 28 16:34:50 CDT 2022
;; MSG SIZE  rcvd: 119

The local resolver is configured properly on the SIVT host.

root@service13 [ ~ ]# resolvectl |more
Global
       LLMNR setting: no
MulticastDNS setting: yes
  DNSOverTLS setting: opportunistic
      DNSSEC setting: no
    DNSSEC supported: no
  Current DNS Server: 192.168.40.2
         DNS Servers: 192.168.40.2
Fallback DNS Servers: 8.8.8.8
                      8.8.4.4
[...]
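One more check that may be useful here (eth0 is an assumption; substitute the actual uplink): the Global section above does not list per-link resolvers, and DHCP-assigned DNS servers are attached to the link, so listing them per interface shows the order that was actually received:

resolvectl status eth0
resolvectl dns eth0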
@syangsao
Author

Installation output

install.log

@syangsao
Author

JSON configuration is located here

@syangsao
Author

OK, so long story short, DHCP was doling out 2 DNS addresses:

  1. 192.168.40.2 (can resolve vcenter01.syangsao.lab)
  2. 192.168.1.1 (cannot resolve vcenter01.syangsao.lab; only meant as a backup in case the first one fails)

When the Linux hosts boot up on the TKG management and workload subnets, they would sometimes flip the order of the two DNS addresses above, even though DHCP lists them with 1 preferred over 2. That is why the shared cluster nodes sometimes came up (they got the right order) while the workload clusters failed: the order was reversed, so the nodes used the DNS server that could not resolve the vCenter address.

I was able to confirm this behaviour by booting up a separate Linux host on both subnets, which is how I found the DNS addresses flipping to the wrong order. The fix is simply to remove the second DNS entry completely, since the Linux hosts appear to swap these addresses from time to time rather than follow the order configured on the DHCP side.
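For anyone verifying the same behaviour on a test host, a quick check (assuming systemd-networkd and systemd-resolved manage the interface, as on stock Photon OS; eth0 is an assumption) is to read the order the DHCP lease actually delivered:

# DNS servers taken from the DHCP lease on the link
networkctl status eth0

# The upstream resolver list systemd-resolved is currently using, in order
cat /run/systemd/resolve/resolv.conf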
