
Shared cluster deployment fails while trying to find the vCenter address #53

Closed
1 of 4 tasks
syangsao opened this issue Jul 28, 2022 · 3 comments

@syangsao

Bug description

Shared cluster deployment fails while trying to resolve the vCenter address, with the following message:

ERROR :Failed to deploy cluster Failed Error: unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be provisioned (this may take a few minutes): cluster creation failed, reason:'VCenterUnreachable', message:'Post "https://vcenter01.syangsao.lab/sdk": dial tcp: lookup vcenter01.syangsao.lab on 100.64.0.10:53: no such host'
Error: exit status 1

Affected product modules (please put an X in all that apply)

  • SIVT APIs
  • SIVT UI
  • SIVT CLI
  • Docs

Expected behavior

The shared cluster installation should complete, using the DNS server that was configured via the SIVT UI. I confirmed that the DNS server on the SIVT host resolves the vCenter address correctly; the same DNS server is used to resolve the AVI hostname during the initial cluster setup.

Steps to reproduce the bug

This occurs repeatedly during the shared cluster installation. Sometimes the shared cluster installation makes it through, but then the same error occurs during the workload cluster installation that follows. I am unsure how to debug this and check why it is stating that the vCenter address is unreachable.
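A possible way to narrow this down (a sketch only; the node IP and cluster context are assumptions based on TKG defaults): 100.64.0.10 is normally the in-cluster DNS service in TKG's default service CIDR, and CoreDNS forwards non-cluster names to the upstream resolvers in each node's /etc/resolv.conf, so the failing lookup points at the resolvers the cluster nodes received rather than at the SIVT host itself. Something like the following could confirm what the nodes are actually using (most likely on the management cluster, where the Cluster API vSphere controllers that report VCenterUnreachable run):

# List the nodes and their IPs (run with the management cluster's kubeconfig)
kubectl get nodes -o wide

# Inspect the resolvers a node received; capv is the default SSH user on TKG vSphere nodes, <node-ip> is hypothetical
ssh capv@<node-ip> cat /etc/resolv.conf

# Check which upstream resolvers CoreDNS forwards to inside the cluster
kubectl -n kube-system get configmap coredns -o yaml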

Version (include the SHA if the version is not obvious)

Environment where the bug was observed (vSphere+VMC, vSphere+DVS, vSphere+NSXt, etc)

vSphere+DVS+AVI

  • SIVT version: 1.3
  • vSphere version: 7.0.3 Update 3g
  • vCenter version: 7.0.3 Update 3g
  • Kubernetes version: (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9+vmware.1", GitCommit:"21eeb4527eefb360eb251addc358cea6997e8335", GitTreeState:"clean", BuildDate:"2022-05-04T00:18:36Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes installer & version:

  • Cloud provider or hardware configuration: Dell

  • OS (e.g. from /etc/os-release):

NAME="VMware Photon OS"
VERSION="3.0"
ID=photon
VERSION_ID=3.0
PRETTY_NAME="VMware Photon OS/Linux"
ANSI_COLOR="1;34"
HOME_URL="https://vmware.github.io/photon/"
BUG_REPORT_URL="https://github.com/vmware/photon/issues"

  • Sonobuoy tarball (which contains * below)

Relevant Debug Output (Logs, manifests, etc)

The SIVT host confirms that the DNS entry for the vCenter address is valid. I am not sure why the installation fails, or where to look to troubleshoot how the lookups are being performed.

root@service13 [ ~ ]# ping vcenter01.syangsao.lab
PING vcenter01.syangsao.lab (192.168.40.14) 56(84) bytes of data.
64 bytes from vcenter01.syangsao.lab (192.168.40.14): icmp_seq=1 ttl=64 time=0.139 ms
64 bytes from vcenter01.syangsao.lab (192.168.40.14): icmp_seq=2 ttl=64 time=0.110 ms
^C
--- vcenter01.syangsao.lab ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 3ms
rtt min/avg/max/mdev = 0.110/0.124/0.139/0.018 ms

DNS lookup seems to be valid from the SIVT host.

root@service13 [ ~ ]# dig vcenter01.syangsao.lab

; <<>> DiG 9.16.27 <<>> vcenter01.syangsao.lab
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37332
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: a30569a4c5712f680100000062e300ee1f992de4a8ee1b4b (good)
;; QUESTION SECTION:
;vcenter01.syangsao.lab.		IN	A

;; ANSWER SECTION:
vcenter01.syangsao.lab.	604800	IN	A	192.168.40.14

;; Query time: 4 msec
;; SERVER: 192.168.40.2#53(192.168.40.2)
;; WHEN: Thu Jul 28 16:34:38 CDT 2022
;; MSG SIZE  rcvd: 95

Reverse lookup is valid.

root@service13 [ ~ ]# dig -x 192.168.40.14

; <<>> DiG 9.16.27 <<>> -x 192.168.40.14
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30348
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: a143c14aac8c21d50100000062e300fa89b24f03d95a48f9 (good)
;; QUESTION SECTION:
;14.40.168.192.in-addr.arpa.	IN	PTR

;; ANSWER SECTION:
14.40.168.192.in-addr.arpa. 604800 IN	PTR	vcenter01.syangsao.lab.

;; Query time: 4 msec
;; SERVER: 192.168.40.2#53(192.168.40.2)
;; WHEN: Thu Jul 28 16:34:50 CDT 2022
;; MSG SIZE  rcvd: 119

The local resolver is configured properly on the SIVT host.

root@service13 [ ~ ]# resolvectl |more
Global
       LLMNR setting: no
MulticastDNS setting: yes
  DNSOverTLS setting: opportunistic
      DNSSEC setting: no
    DNSSEC supported: no
  Current DNS Server: 192.168.40.2
         DNS Servers: 192.168.40.2
Fallback DNS Servers: 8.8.8.8
                      8.8.4.4
[...]
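One more check that may be useful here (eth0 is an assumption; substitute the actual uplink): the Global section above does not list per-link resolvers, and DHCP-assigned DNS servers are attached to the link, so listing them per interface shows the order that was actually received:

resolvectl status eth0
resolvectl dns eth0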
@syangsao
Author

Installation output

install.log

@syangsao
Author

JSON configuration is located here

@syangsao
Author

OK, so long story short, DHCP was doling out 2 DNS addresses:

  1. 192.168.40.2 (can resolve vcenter01.syangsao.lab)
  2. 192.168.1.1 (cannot resolve vcenter01.syangsao.lab; only meant as a backup in case the first one fails)

When the Linux hosts boot up on the TKG management and workload subnets, they would sometimes flip the order of the two DNS addresses above, even though DHCP lists them with 1 preferred over 2. That is why the shared cluster nodes sometimes came up (they got the right order) while the workload clusters failed: the order was reversed, so the nodes used the DNS server that could not resolve the vCenter address.

I was able to confirm this behaviour by booting up a separate Linux host on both subnets, which is how I found the DNS addresses flipping to the wrong order. The fix is simply to remove the second DNS entry completely, since the Linux hosts appear to swap these addresses from time to time rather than follow the order configured on the DHCP side.
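For anyone verifying the same behaviour on a test host, a quick check (assuming systemd-networkd and systemd-resolved manage the interface, as on stock Photon OS; eth0 is an assumption) is to read the order the DHCP lease actually delivered:

# DNS servers taken from the DHCP lease on the link
networkctl status eth0

# The upstream resolver list systemd-resolved is currently using, in order
cat /run/systemd/resolve/resolv.conf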
