node-setup-daemon fails on karpenter nodes #1839

Open
marcustut opened this issue Mar 16, 2024 · 9 comments
Labels
kind/support Categorizes issue or PR as a support question. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@marcustut commented Mar 16, 2024

What happened?

I followed the guide to set up EKS, but since I already have an existing cluster with Karpenter, I didn't create a new cluster with the provided eks-cluster.yaml. After deploying the operator and the local-csi-driver, I deployed a ScyllaCluster manifest, but the cluster-node-setup pod kept failing with this error:

```console
++ mktemp -d
+ cd /tmp/tmp.eE7xsS9yC8
++ find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n'
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./snap
+ mount --rbind /host/snap ./snap
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./home
+ mount --rbind /host/home ./home
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./sys
+ mount --rbind /host/sys ./sys
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./boot
+ mount --rbind /host/boot ./boot
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./root
+ mount --rbind /host/root ./root
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./lost+found
+ mount --rbind /host/lost+found ./lost+found
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./opt
+ mount --rbind /host/opt ./opt
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./media
+ mount --rbind /host/media ./media
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./mnt
+ mount --rbind /host/mnt ./mnt
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./usr
+ mount --rbind /host/usr ./usr
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./etc
+ mount --rbind /host/etc ./etc
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./dev
+ mount --rbind /host/dev ./dev
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./host
+ mount --rbind /host/host ./host
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./proc
+ mount --rbind /host/proc ./proc
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./srv
+ mount --rbind /host/srv ./srv
++ find /host -mindepth 1 -maxdepth 1 -type f -printf '%f\n'
+ find /host -mindepth 1 -maxdepth 1 -type l -exec cp -P '{}' ./ ';'
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/udev ']'
+ mkdir -p ./run/udev
+ mount --rbind /host/run/udev ./run/udev
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/mdadm ']'
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/dbus ']'
+ mkdir -p ./run/dbus
+ mount --rbind /host/run/dbus ./run/dbus
+ for dir in "run/crio" "run/containerd"
+ '[' -d /host/run/crio ']'
+ for dir in "run/crio" "run/containerd"
+ '[' -d /host/run/containerd ']'
+ mkdir -p ./run/containerd
+ mount --rbind /host/run/containerd ./run/containerd
+ '[' -f /host/run/dockershim.sock ']'
+ '[' -d /host/var/lib/kubelet ']'
+ mkdir -p ./var/lib/kubelet
+ mount --rbind /host/var/lib/kubelet ./var/lib/kubelet
+ mkdir -p ./scylla-operator
+ touch ./scylla-operator/scylla-operator
+ mount --bind /usr/bin/scylla-operator ./scylla-operator/scylla-operator
+ mkdir -p ./run/secrets/kubernetes.io/serviceaccount
+ for f in ca.crt token
+ touch ./run/secrets/kubernetes.io/serviceaccount/ca.crt
+ mount --bind /run/secrets/kubernetes.io/serviceaccount/ca.crt ./run/secrets/kubernetes.io/serviceaccount/ca.crt
+ for f in ca.crt token
+ touch ./run/secrets/kubernetes.io/serviceaccount/token
+ mount --bind /run/secrets/kubernetes.io/serviceaccount/token ./run/secrets/kubernetes.io/serviceaccount/token
+ '[' -L /host/var/run ']'
+ mkdir -p ./var
+ ln -s ../run ./var/run
+ exec chroot ./ /scylla-operator/scylla-operator node-setup-daemon --namespace=scylla-operator-node-tuning --pod-name=cluster-node-setup-jhjpj --node-name=ip-192-168-14-249.ap-south-1.compute.internal --node-config-name=cluster --node-config-uid=a7e98cb5-acc5-41a2-af1f-c44b48ca9f03 --scylla-image=docker.io/scylladb/scylla:5.4.0 --disable-optimizations=false --loglevel=4
2024/03/16 14:02:13 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
I0316 14:02:13.164326       1 operator/nodesetupdaemon.go:172] node-setup-daemon version "v1.12.0-beta.0-26-gc181cf2"
I0316 14:02:13.164345       1 flag/flags.go:64] FLAG: --burst="75"
I0316 14:02:13.164350       1 flag/flags.go:64] FLAG: --cri-endpoint="[unix:///var/run/dockershim.sock,unix:///run/containerd/containerd.sock,unix:///run/crio/crio.sock]"
I0316 14:02:13.164359       1 flag/flags.go:64] FLAG: --disable-optimizations="false"
I0316 14:02:13.164363       1 flag/flags.go:64] FLAG: --feature-gates=""
I0316 14:02:13.164368       1 flag/flags.go:64] FLAG: --help="false"
I0316 14:02:13.164371       1 flag/flags.go:64] FLAG: --kubeconfig=""
I0316 14:02:13.164375       1 flag/flags.go:64] FLAG: --kubelet-pod-resources-endpoint="unix:///var/lib/kubelet/pod-resources/kubelet.sock"
I0316 14:02:13.164380       1 flag/flags.go:64] FLAG: --loglevel="4"
I0316 14:02:13.164387       1 flag/flags.go:64] FLAG: --namespace="scylla-operator-node-tuning"
I0316 14:02:13.164391       1 flag/flags.go:64] FLAG: --node-config-name="cluster"
I0316 14:02:13.164393       1 flag/flags.go:64] FLAG: --node-config-uid="a7e98cb5-acc5-41a2-af1f-c44b48ca9f03"
I0316 14:02:13.164397       1 flag/flags.go:64] FLAG: --node-name="ip-192-168-14-249.ap-south-1.compute.internal"
I0316 14:02:13.164415       1 flag/flags.go:64] FLAG: --pod-name="cluster-node-setup-jhjpj"
I0316 14:02:13.164418       1 flag/flags.go:64] FLAG: --qps="50"
I0316 14:02:13.164424       1 flag/flags.go:64] FLAG: --scylla-image="docker.io/scylladb/scylla:5.4.0"
I0316 14:02:13.164428       1 flag/flags.go:64] FLAG: --v="4"
I0316 14:02:13.164590       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/run/crio/crio.sock"
I0316 14:02:13.164628       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/var/run/dockershim.sock"
I0316 14:02:13.164738       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/run/containerd/containerd.sock"
I0316 14:02:15.165396       1 cri/client.go:114] "Connected to CRI endpoint" Successful=["unix:///run/containerd/containerd.sock"] Other attempts="[unix:///var/run/dockershim.sock: context deadline exceeded, unix:///run/crio/crio.sock: context deadline exceeded]"
I0316 14:02:15.183283       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:54291->[::1]:53: read: connection refused"
I0316 14:02:15.194851       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:58179->[::1]:53: read: connection refused"
I0316 14:02:15.246331       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:51943->[::1]:53: read: connection refused"
I0316 14:02:15.500059       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:45116->[::1]:53: read: connection refused"
Error: can't get node "ip-192-168-14-249.ap-south-1.compute.internal": timed out waiting for the condition
```

So apparently it fails because it can't reach the Kubernetes API to get the Node object, but the error is not about authentication; it's a connection refused on the DNS lookup over UDP to [::1]:53.
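
For context, the failing lookup goes to the loopback resolver ([::1]:53), and since the daemon chroots into /host, name resolution follows the host's /etc/resolv.conf. A rough way to check this from the node over SSH/SSM (a sketch; the hostname below is a placeholder for the cluster's API endpoint):

```bash
# Rough diagnostic sketch, run on the affected node (values are illustrative).
# The node-setup-daemon chroots into /host, so it resolves names with the host's resolv.conf.

cat /etc/resolv.conf          # which nameserver does the host point at? (::1 would explain the error)
sudo ss -ulpn | grep ':53'    # is anything actually listening on port 53 locally?

# Resolve the API endpoint through the configured resolver:
getent hosts <cluster-api-endpoint>.eks.amazonaws.com
```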

What did you expect to happen?

I expected cluster-node-setup to succeed, the XFS filesystem to be created, and the ScyllaCluster to continue being created.

How can we reproduce it (as minimally and precisely as possible)?

Use a Karpenter-managed cluster with a node pool using i4i instances and deploy the operator.

Scylla Operator version

v1.12

Kubernetes platform name and version

```console
$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.16-eks-77b1e4e
WARNING: version difference between client (1.29) and server (1.25) exceeds the supported minor version skew of +/-1
```

Kubernetes platform info:

Please attach the must-gather archive.

scylla-operator-must-gather-tc7c7mp7jnt6.zip

Anything else we need to know?

No response

@marcustut marcustut added the kind/bug Categorizes issue or PR as related to a bug. label Mar 16, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 16, 2024
@tnozicka tnozicka added kind/support Categorizes issue or PR as a support question. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Mar 18, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 18, 2024
@tnozicka (Member)

> Please attach the must-gather archive.
> n/a

We can't invest time in investigating your issue if you don't invest time in providing the required information.

The must-gather archive is a **mandatory** part of every bug report.

https://github.com/scylladb/scylla-operator/blob/f356138/.github/ISSUE_TEMPLATE/bug-report.yaml?plain=1#L57
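
For reference, a rough sketch of producing the archive (assuming the scylla-operator binary is run with a kubeconfig pointing at the cluster; see the documentation linked from the template for the exact invocation and flags):

```bash
# Sketch only: run must-gather against the cluster; it prints the directory it writes into,
# e.g. "Gathering artifacts" DestDir="scylla-operator-must-gather-<suffix>"
scylla-operator must-gather
```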

@zimnx (Collaborator) commented Mar 18, 2024

> dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:45116->[::1]:53: read: connection refused"

Looks like a firewall is blocking DNS traffic from that node, or your DNS service is down.
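
One way to test that without SSH is a throwaway host-network pod pinned to the same node, which resolves names the same way the node-setup daemon does. A sketch, reusing the node name and endpoint from the logs above (the busybox image is an arbitrary choice):

```bash
# Sketch: host-network pod on the failing node, resolving the API server hostname.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{"spec":{"nodeName":"ip-192-168-14-249.ap-south-1.compute.internal","hostNetwork":true}}' \
  -- nslookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com
```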

@tnozicka tnozicka removed the kind/bug Categorizes issue or PR as related to a bug. label Mar 18, 2024
@marcustut (Author)

> Please attach the must-gather archive.
> n/a
>
> We can't invest time in investigating your issue if you don't invest time in providing the required information.
>
> The must-gather archive is a **mandatory** part of every bug report.
>
> https://github.com/scylladb/scylla-operator/blob/f356138/.github/ISSUE_TEMPLATE/bug-report.yaml?plain=1#L57

Sorry, I'll get back to this later with the must-gather archive.

As for @zimnx's suggestion, I manually SSH-ed into the node and ran the API call through curl; it worked fine. I also launched a debug pod with the same image docker.io/scylladb/scylla-operator:1.12 and ran the API call through curl, and it worked fine too.
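
Roughly, the checks looked like this (a sketch; any HTTP response at all, even 401/403, proves DNS resolution and TCP connectivity to the endpoint):

```bash
# Sketch of the manual connectivity checks mentioned above.
# 1) Debug pod with the same operator image, scheduled onto the affected node:
kubectl debug node/ip-192-168-14-249.ap-south-1.compute.internal -it \
  --image=docker.io/scylladb/scylla-operator:1.12 -- bash

# 2) Inside that pod (or over SSH on the node), hit the API server endpoint directly:
curl -vk https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/version
```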

@marcustut (Author)

@tnozicka @zimnx I have updated the issue with the must-gather logs

@tnozicka (Member) commented Mar 26, 2024

> This is the log file I got from the must-gather program

You need to upload the folder it created, not the log from its creation:

"Gathering artifacts" DestDir="scylla-operator-must-gather-tc7c7mp7jnt6"

@marcustut (Author)

> This is the log file I got from the must-gather program
>
> You need to upload the folder it created, not the log from its creation:
>
> `"Gathering artifacts" DestDir="scylla-operator-must-gather-tc7c7mp7jnt6"`

Sorry, I updated the issue with the entire archive

@zimnx (Collaborator) commented Mar 29, 2024

Archive is empty

@marcustut (Author)

> Archive is empty

Sorry, reuploaded it

@zr-mah commented Apr 5, 2024

I'm facing the exact same issue too.
@zimnx May I know if you got a chance to look at it?
