node-setup-daemon fails on karpenter nodes #1839

Open
marcustut opened this issue Mar 16, 2024 · 9 comments
Labels
kind/support Categorizes issue or PR as a support question. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@marcustut commented Mar 16, 2024

What happened?

I followed the guide to set up EKS, but since I already have an existing cluster with Karpenter, I didn't create a new cluster with the provided eks-cluster.yaml. After deploying the operator and the local-csi-driver, I deployed a ScyllaCluster manifest, but the cluster-node-setup pod kept failing with this error:

```console
++ mktemp -d
+ cd /tmp/tmp.eE7xsS9yC8
++ find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n'
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./snap
+ mount --rbind /host/snap ./snap
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./home
+ mount --rbind /host/home ./home
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./sys
+ mount --rbind /host/sys ./sys
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./boot
+ mount --rbind /host/boot ./boot
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./root
+ mount --rbind /host/root ./root
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./lost+found
+ mount --rbind /host/lost+found ./lost+found
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./opt
+ mount --rbind /host/opt ./opt
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./media
+ mount --rbind /host/media ./media
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./mnt
+ mount --rbind /host/mnt ./mnt
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./usr
+ mount --rbind /host/usr ./usr
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./etc
+ mount --rbind /host/etc ./etc
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./dev
+ mount --rbind /host/dev ./dev
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./host
+ mount --rbind /host/host ./host
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./proc
+ mount --rbind /host/proc ./proc
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./srv
+ mount --rbind /host/srv ./srv
++ find /host -mindepth 1 -maxdepth 1 -type f -printf '%f\n'
+ find /host -mindepth 1 -maxdepth 1 -type l -exec cp -P '{}' ./ ';'
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/udev ']'
+ mkdir -p ./run/udev
+ mount --rbind /host/run/udev ./run/udev
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/mdadm ']'
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/dbus ']'
+ mkdir -p ./run/dbus
+ mount --rbind /host/run/dbus ./run/dbus
+ for dir in "run/crio" "run/containerd"
+ '[' -d /host/run/crio ']'
+ for dir in "run/crio" "run/containerd"
+ '[' -d /host/run/containerd ']'
+ mkdir -p ./run/containerd
+ mount --rbind /host/run/containerd ./run/containerd
+ '[' -f /host/run/dockershim.sock ']'
+ '[' -d /host/var/lib/kubelet ']'
+ mkdir -p ./var/lib/kubelet
+ mount --rbind /host/var/lib/kubelet ./var/lib/kubelet
+ mkdir -p ./scylla-operator
+ touch ./scylla-operator/scylla-operator
+ mount --bind /usr/bin/scylla-operator ./scylla-operator/scylla-operator
+ mkdir -p ./run/secrets/kubernetes.io/serviceaccount
+ for f in ca.crt token
+ touch ./run/secrets/kubernetes.io/serviceaccount/ca.crt
+ mount --bind /run/secrets/kubernetes.io/serviceaccount/ca.crt ./run/secrets/kubernetes.io/serviceaccount/ca.crt
+ for f in ca.crt token
+ touch ./run/secrets/kubernetes.io/serviceaccount/token
+ mount --bind /run/secrets/kubernetes.io/serviceaccount/token ./run/secrets/kubernetes.io/serviceaccount/token
+ '[' -L /host/var/run ']'
+ mkdir -p ./var
+ ln -s ../run ./var/run
+ exec chroot ./ /scylla-operator/scylla-operator node-setup-daemon --namespace=scylla-operator-node-tuning --pod-name=cluster-node-setup-jhjpj --node-name=ip-192-168-14-249.ap-south-1.compute.internal --node-config-name=cluster --node-config-uid=a7e98cb5-acc5-41a2-af1f-c44b48ca9f03 --scylla-image=docker.io/scylladb/scylla:5.4.0 --disable-optimizations=false --loglevel=4
2024/03/16 14:02:13 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
I0316 14:02:13.164326       1 operator/nodesetupdaemon.go:172] node-setup-daemon version "v1.12.0-beta.0-26-gc181cf2"
I0316 14:02:13.164345       1 flag/flags.go:64] FLAG: --burst="75"
I0316 14:02:13.164350       1 flag/flags.go:64] FLAG: --cri-endpoint="[unix:///var/run/dockershim.sock,unix:///run/containerd/containerd.sock,unix:///run/crio/crio.sock]"
I0316 14:02:13.164359       1 flag/flags.go:64] FLAG: --disable-optimizations="false"
I0316 14:02:13.164363       1 flag/flags.go:64] FLAG: --feature-gates=""
I0316 14:02:13.164368       1 flag/flags.go:64] FLAG: --help="false"
I0316 14:02:13.164371       1 flag/flags.go:64] FLAG: --kubeconfig=""
I0316 14:02:13.164375       1 flag/flags.go:64] FLAG: --kubelet-pod-resources-endpoint="unix:///var/lib/kubelet/pod-resources/kubelet.sock"
I0316 14:02:13.164380       1 flag/flags.go:64] FLAG: --loglevel="4"
I0316 14:02:13.164387       1 flag/flags.go:64] FLAG: --namespace="scylla-operator-node-tuning"
I0316 14:02:13.164391       1 flag/flags.go:64] FLAG: --node-config-name="cluster"
I0316 14:02:13.164393       1 flag/flags.go:64] FLAG: --node-config-uid="a7e98cb5-acc5-41a2-af1f-c44b48ca9f03"
I0316 14:02:13.164397       1 flag/flags.go:64] FLAG: --node-name="ip-192-168-14-249.ap-south-1.compute.internal"
I0316 14:02:13.164415       1 flag/flags.go:64] FLAG: --pod-name="cluster-node-setup-jhjpj"
I0316 14:02:13.164418       1 flag/flags.go:64] FLAG: --qps="50"
I0316 14:02:13.164424       1 flag/flags.go:64] FLAG: --scylla-image="docker.io/scylladb/scylla:5.4.0"
I0316 14:02:13.164428       1 flag/flags.go:64] FLAG: --v="4"
I0316 14:02:13.164590       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/run/crio/crio.sock"
I0316 14:02:13.164628       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/var/run/dockershim.sock"
I0316 14:02:13.164738       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/run/containerd/containerd.sock"
I0316 14:02:15.165396       1 cri/client.go:114] "Connected to CRI endpoint" Successful=["unix:///run/containerd/containerd.sock"] Other attempts="[unix:///var/run/dockershim.sock: context deadline exceeded, unix:///run/crio/crio.sock: context deadline exceeded]"
I0316 14:02:15.183283       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:54291->[::1]:53: read: connection refused"
I0316 14:02:15.194851       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:58179->[::1]:53: read: connection refused"
I0316 14:02:15.246331       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:51943->[::1]:53: read: connection refused"
I0316 14:02:15.500059       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:45116->[::1]:53: read: connection refused"
Error: can't get node "ip-192-168-14-249.ap-south-1.compute.internal": timed out waiting for the condition
```

So apparently it fails because it can't reach the Kubernetes API to get the Node object, but the error is not about authentication; it's a connection refused on the DNS lookup over UDP to [::1]:53.
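
For context, the failing lookup goes to the loopback resolver ([::1]:53), and since the daemon chroots into /host, name resolution follows the host's /etc/resolv.conf. A rough way to check this from the node over SSH/SSM (a sketch; the hostname below is a placeholder for the cluster's API endpoint):

```bash
# Rough diagnostic sketch, run on the affected node (values are illustrative).
# The node-setup-daemon chroots into /host, so it resolves names with the host's resolv.conf.

cat /etc/resolv.conf          # which nameserver does the host point at? (::1 would explain the error)
sudo ss -ulpn | grep ':53'    # is anything actually listening on port 53 locally?

# Resolve the API endpoint through the configured resolver:
getent hosts <cluster-api-endpoint>.eks.amazonaws.com
```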

What did you expect to happen?

I expected cluster-node-setup to succeed, the XFS filesystem to be created, and the ScyllaCluster to continue being created.

How can we reproduce it (as minimally and precisely as possible)?

Use a Karpenter-managed cluster with a node pool using i4i instances and deploy the operator.

Scylla Operator version

v1.12

Kubernetes platform name and version

```console
$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.16-eks-77b1e4e
WARNING: version difference between client (1.29) and server (1.25) exceeds the supported minor version skew of +/-1
```

Kubernetes platform info:

Please attach the must-gather archive.

scylla-operator-must-gather-tc7c7mp7jnt6.zip

Anything else we need to know?

No response

@marcustut marcustut added the kind/bug Categorizes issue or PR as related to a bug. label Mar 16, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 16, 2024
@tnozicka tnozicka added kind/support Categorizes issue or PR as a support question. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Mar 18, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 18, 2024
@tnozicka (Member)

> Please attach the must-gather archive.
> n/a

We can't invest time in investigating your issue if you don't invest time in providing the required information.

The must-gather archive is a **mandatory** part of every bug report.

https://github.com/scylladb/scylla-operator/blob/f356138/.github/ISSUE_TEMPLATE/bug-report.yaml?plain=1#L57
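
For reference, a rough sketch of producing the archive (assuming the scylla-operator binary is run with a kubeconfig pointing at the cluster; see the documentation linked from the template for the exact invocation and flags):

```bash
# Sketch only: run must-gather against the cluster; it prints the directory it writes into,
# e.g. "Gathering artifacts" DestDir="scylla-operator-must-gather-<suffix>"
scylla-operator must-gather
```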

@zimnx (Collaborator) commented Mar 18, 2024

> dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:45116->[::1]:53: read: connection refused"

Looks like a firewall is blocking DNS traffic from that node, or your DNS service is down.
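
One way to test that without SSH is a throwaway host-network pod pinned to the same node, which resolves names the same way the node-setup daemon does. A sketch, reusing the node name and endpoint from the logs above (the busybox image is an arbitrary choice):

```bash
# Sketch: host-network pod on the failing node, resolving the API server hostname.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{"spec":{"nodeName":"ip-192-168-14-249.ap-south-1.compute.internal","hostNetwork":true}}' \
  -- nslookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com
```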

@tnozicka tnozicka removed the kind/bug Categorizes issue or PR as related to a bug. label Mar 18, 2024
@marcustut (Author)

> Please attach the must-gather archive.
> n/a
>
> We can't invest time in investigating your issue if you don't invest time in providing the required information.
>
> The must-gather archive is a **mandatory** part of every bug report.
>
> https://github.com/scylladb/scylla-operator/blob/f356138/.github/ISSUE_TEMPLATE/bug-report.yaml?plain=1#L57

Sorry, I'll get back to this later with the must-gather archive.

As for @zimnx's suggestion, I manually SSH-ed into the node and ran the API call through curl; it worked fine. I also launched a debug pod with the same image docker.io/scylladb/scylla-operator:1.12 and ran the API call through curl, and it worked fine too.
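
Roughly, the checks looked like this (a sketch; any HTTP response at all, even 401/403, proves DNS resolution and TCP connectivity to the endpoint):

```bash
# Sketch of the manual connectivity checks mentioned above.
# 1) Debug pod with the same operator image, scheduled onto the affected node:
kubectl debug node/ip-192-168-14-249.ap-south-1.compute.internal -it \
  --image=docker.io/scylladb/scylla-operator:1.12 -- bash

# 2) Inside that pod (or over SSH on the node), hit the API server endpoint directly:
curl -vk https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/version
```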

@marcustut (Author)

@tnozicka @zimnx I have updated the issue with the must-gather logs

@tnozicka (Member) commented Mar 26, 2024

> This is the log file I got from the must-gather program

You need to upload the folder it created, not the log from its creation:

"Gathering artifacts" DestDir="scylla-operator-must-gather-tc7c7mp7jnt6"

@marcustut (Author)

> This is the log file I got from the must-gather program
>
> You need to upload the folder it created, not the log from its creation:
>
> `"Gathering artifacts" DestDir="scylla-operator-must-gather-tc7c7mp7jnt6"`

Sorry, I updated the issue with the entire archive

@zimnx (Collaborator) commented Mar 29, 2024

Archive is empty

@marcustut (Author)

> Archive is empty

Sorry, reuploaded it

@zr-mah commented Apr 5, 2024

I'm facing the exact same issue too.
@zimnx May I know if you got a chance to look at it?
