
[CI] Verify kubectl in kind-in-docker step #1305

Merged: 2 commits into ray-project:master on Aug 9, 2023

Conversation

@architkulkarni (Contributor)

Why are these changes needed?

This PR configures kind to allow kubectl to connect to it. This is a prerequisite for testing sample YAML files end-to-end in Buildkite. The configuration is copied from @simon-mo's work in ray-project/ray#22035.
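For illustration, a minimal sketch of the kind of connectivity check this CI step performs (the commands and flags below are assumptions for illustration, not copied from this PR's Buildkite config):

# Create a kind cluster inside the CI container and wait for the control plane.
kind create cluster --wait 120s
# kind writes a kubeconfig context named kind-kind; listing nodes verifies that
# kubectl can actually reach the API server from inside the container.
kubectl config use-context kind-kind
kubectl get nodes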

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni (Contributor, Author)

Next steps for follow-up PRs:

  • Start the KubeRay operator and run a minimal e2e test (rough sketch after this list)
  • Run all existing e2e tests (test_sample_rayjobs.py, etc.)
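For the first item, one possible shape of the operator startup step (a sketch assuming the Helm chart install path; the deployment name and timeout are assumptions, not the final CI script):

# Install the KubeRay operator from its Helm chart.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator
# Block until the operator deployment is available before starting e2e tests.
kubectl wait --for=condition=Available deployment/kuberay-operator --timeout=120s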

The second step above is currently blocked by the following issue: every time we delete the kind cluster and create a new one on the same machine, cluster creation fails with

root@7bdb47dd2a29:/go/kuberay# kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✗ Starting control-plane 🕹️
Deleted nodes: ["kind-control-plane"]
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0808 21:06:14.009749     197 initconfiguration.go:255] loading configuration from "/kind/kubeadm.conf"
W0808 21:06:14.010279     197 initconfiguration.go:332] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.27.3
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0808 21:06:14.014713     197 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0808 21:06:14.415319     197 certs.go:519] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kind-control-plane kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.19.0.2 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0808 21:06:14.792602     197 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0808 21:06:14.940827     197 certs.go:519] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0808 21:06:15.045685     197 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0808 21:06:15.110701     197 certs.go:519] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0808 21:06:15.698453     197 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I0808 21:06:15.831358     197 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I0808 21:06:15.879228     197 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0808 21:06:16.035416     197 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0808 21:06:16.145571     197 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0808 21:06:16.248809     197 kubelet.go:67] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0808 21:06:16.314799     197 manifests.go:99] [control-plane] getting StaticPodSpecs
I0808 21:06:16.314990     197 certs.go:519] validating certificate period for CA certificate
I0808 21:06:16.315058     197 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0808 21:06:16.315069     197 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0808 21:06:16.315074     197 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0808 21:06:16.315079     197 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0808 21:06:16.315082     197 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0808 21:06:16.316741     197 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0808 21:06:16.316757     197 manifests.go:99] [control-plane] getting StaticPodSpecs
I0808 21:06:16.316901     197 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0808 21:06:16.316910     197 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0808 21:06:16.316913     197 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0808 21:06:16.316916     197 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0808 21:06:16.316922     197 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0808 21:06:16.316929     197 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0808 21:06:16.316937     197 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0808 21:06:16.317429     197 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0808 21:06:16.317441     197 manifests.go:99] [control-plane] getting StaticPodSpecs
I0808 21:06:16.317578     197 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0808 21:06:16.317887     197 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0808 21:06:16.319115     197 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0808 21:06:16.319131     197 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I0808 21:06:16.319568     197 loader.go:373] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0808 21:06:16.320631     197 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
[...]
I0808 21:08:11.321190     197 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.

Unfortunately, an error has occurred:
	timed out waiting for the condition

This error is likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
	- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
	Once you have found the failing container, you can inspect its logs with:
	- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
	cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
	cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
	cmd/kubeadm/app/cmd/init.go:111
github.com/spf13/cobra.(*Command).execute
	vendor/github.com/spf13/cobra/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
	vendor/github.com/spf13/cobra/command.go:1040
github.com/spf13/cobra.(*Command).Execute
	vendor/github.com/spf13/cobra/command.go:968
k8s.io/kubernetes/cmd/kubeadm/app.Run
	cmd/kubeadm/app/kubeadm.go:50
main.main
	cmd/kubeadm/kubeadm.go:25
runtime.main
	/usr/local/go/src/runtime/proc.go:250
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1598
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	cmd/kubeadm/app/cmd/phases/workflow/runner.go:260
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
	cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
	cmd/kubeadm/app/cmd/init.go:111
github.com/spf13/cobra.(*Command).execute
	vendor/github.com/spf13/cobra/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
	vendor/github.com/spf13/cobra/command.go:1040
github.com/spf13/cobra.(*Command).Execute
	vendor/github.com/spf13/cobra/command.go:968
k8s.io/kubernetes/cmd/kubeadm/app.Run
	cmd/kubeadm/app/kubeadm.go:50
main.main
	cmd/kubeadm/kubeadm.go:25
runtime.main
	/usr/local/go/src/runtime/proc.go:250
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1598
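For anyone debugging this, one way to capture more detail from a failed create (hypothetical commands we have not run here; --retain and export logs are standard kind options):

# Keep the control-plane container around instead of deleting it on failure.
kind create cluster --retain
# Export kubelet, containerd, and pod logs from the retained node.
kind export logs ./kind-logs
# Or inspect the kubelet journal directly inside the node container.
docker exec kind-control-plane journalctl -xeu kubelet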

@kevin85421 (Member) left a comment

LGTM. Would you mind opening an issue to track #1305 (comment)? Thanks!

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni (Contributor, Author)

The newly added test passed: https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/1205#0189db69-4db3-4bbe-b56d-77cae3af83a3

The other failing checks are unrelated to this change. Merging.

@architkulkarni architkulkarni merged commit 9e37e19 into ray-project:master Aug 9, 2023
12 of 13 checks passed
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 15, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023