
[BUG] Cluster creation fails when specifying k3s image v1.19.16-k3s1 #890

Closed
jimthompson5802 opened this issue Dec 13, 2021 · 10 comments
Labels: bug

@jimthompson5802

What did you do

When I attempt to create a cluster with Kubernetes 1.19.16, cluster creation fails with the following series of messages.

$ k3d cluster create --config k3d/kubeflow-cluster.yaml

INFO[0000] Using config file k3d/kubeflow-cluster.yaml (k3d.io/v1alpha3#simple)
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-kubeflow'
INFO[0000] Created volume 'k3d-kubeflow-images'
INFO[0000] Starting new tools node...
INFO[0000] Starting Node 'k3d-kubeflow-tools'
INFO[0001] Creating node 'k3d-kubeflow-server-0'
INFO[0002] Pulling image 'rancher/k3s:v1.19.16-k3s1'
INFO[0016] Creating node 'k3d-kubeflow-agent-0'
INFO[0016] Creating LoadBalancer 'k3d-kubeflow-serverlb'
INFO[0016] Using the k3d-tools node to gather environment information
INFO[0018] Starting cluster 'kubeflow'
INFO[0018] Starting servers...
INFO[0018] Starting Node 'k3d-kubeflow-server-0'
WARN[0020] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 0/10): time="2021-12-13T12:22:10.841643887Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0021] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 1/10): time="2021-12-13T12:22:12.127758098Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0023] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 2/10): time="2021-12-13T12:22:13.790232406Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0024] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 3/10): time="2021-12-13T12:22:15.347073402Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0025] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 4/10): time="2021-12-13T12:22:15.347073402Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0026] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 5/10): time="2021-12-13T12:22:17.199141348Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0026] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 6/10): time="2021-12-13T12:22:17.199141348Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0027] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 7/10): time="2021-12-13T12:22:17.199141348Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0027] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 8/10): time="2021-12-13T12:22:17.199141348Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0029] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 9/10): time="2021-12-13T12:22:19.891819095Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
ERRO[0029] Failed Cluster Start: Failed to start server k3d-kubeflow-server-0: Node k3d-kubeflow-server-0 failed to get ready: error waiting for log line `k3s is up and running` from node 'k3d-kubeflow-server-0': stopped returning log lines
ERRO[0029] Failed to create cluster >>> Rolling Back
INFO[0029] Deleting cluster 'kubeflow'
INFO[0029] Deleting cluster network 'k3d-kubeflow'
INFO[0029] Deleting image volume 'k3d-kubeflow-images'
FATA[0029] Cluster creation FAILED, all changes have been rolled back!

The purpose of using Kubernetes 1.19.16 is to test a kubeflow setup.

  • How was the cluster created?

    • k3d cluster create --config k3d/kubeflow-cluster.yaml
  • What did you do afterwards?

    • Unable to do anything

What did you expect to happen

Creation of a cluster with the Kubernetes 1.19.16 distribution.

Screenshots or terminal output

Configuration file kubeflow-cluster.yaml

Notes:

  • The same error occurs using image: rancher/k3s:v1.19.16-k3s1-amd64.
  • No error occurs if the image: parameter is removed and the default k3s image is used; the cluster starts successfully (see the sketch after the config below).
apiVersion: k3d.io/v1alpha3
kind: Simple
name: kubeflow
image: rancher/k3s:v1.19.16-k3s1
servers: 1
agents: 1
options:
  k3s:
    extraArgs:
      - arg: --disable=traefik
        nodeFilters:
          - server:*
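
For comparison, here is a sketch of the working variant with the image: line removed (per the note above, this variant falls back to the default k3s image and starts successfully):

apiVersion: k3d.io/v1alpha3
kind: Simple
name: kubeflow
servers: 1
agents: 1
options:
  k3s:
    extraArgs:
      - arg: --disable=traefik
        nodeFilters:
          - server:*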

k3d cluster list after failure

$ k3d cluster list

NAME   SERVERS   AGENTS   LOADBALANCER

output from docker images | grep rancher

$ docker images | grep rancher

rancher/k3d-tools                                      5.2.1                df011b762013   5 days ago      18.7MB
rancher/k3d-proxy                                      5.2.1                52ec7dd5ec41   5 days ago      42.4MB
rancher/k3s                                            v1.22.4-k3s1         a920d1b20ab3   13 days ago     170MB
rancher/k3s                                            v1.22.4-k3s1-amd64   a920d1b20ab3   13 days ago     170MB
rancher/k3s                                            v1.21.7-k3s1         4cbf38ec7da6   13 days ago     174MB
rancher/k3s                                            v1.20.13-k3s1        b6a662486d29   13 days ago     162MB
rancher/k3d-tools                                      5.1.0                3f7735ad4206   4 weeks ago     18.1MB
rancher/k3d-proxy                                      5.1.0                9b1de62b76f0   4 weeks ago     42.4MB
rancher/k3s                                            v1.19.16-k3s1        7d5a371ee4c7   5 weeks ago     160MB
rancher/k3s                                            v1.20.12-k3s1        5cc459ac9af0   5 weeks ago     162MB
rancher/k3s                                            v1.21.2-k3s1         b41b52c9bb59   5 months ago    172MB
rancher/k3s                                            v1.20.4-k3s1         2f3ffb0c5e3e   9 months ago    157MB
rancher/k3s                                            v1.20.2-k3s1         1b02adf07426   11 months ago   154MB
rancher/k3s                                            v1.20.0-k3s2         e5cf68bf74a7   11 months ago   154MB

Which OS & Architecture

MacBook Pro 2019 (Intel), MacOS 11.6.1

$ uname -mprsv

Darwin 20.6.0 Darwin Kernel Version 20.6.0: Tue Oct 12 18:33:42 PDT 2021; root:xnu-7195.141.8~1/RELEASE_X86_64 x86_64 i38

Which version of k3d

$ k3d version

k3d version v5.2.1
k3s version v1.21.7-k3s1 (default)

Which version of docker

$ docker version

Client:
Cloud integration: v1.0.22
Version:           20.10.11
API version:       1.41
Go version:        go1.16.10
Git commit:        dea9396
Built:             Thu Nov 18 00:36:09 2021
OS/Arch:           darwin/amd64
Context:           default
Experimental:      true

Server: Docker Engine - Community
Engine:
 Version:          20.10.11
 API version:      1.41 (minimum version 1.12)
 Go version:       go1.16.9
 Git commit:       847da18
 Built:            Thu Nov 18 00:35:39 2021
 OS/Arch:          linux/amd64
 Experimental:     false
containerd:
 Version:          1.4.12
 GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc:
 Version:          1.0.2
 GitCommit:        v1.0.2-0-g52b36a2
docker-init:
 Version:          0.19.0
 GitCommit:        de40ad0


$ docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.7.1)
  compose: Docker Compose (Docker Inc., v2.2.1)
  scan: Docker Scan (Docker Inc., 0.9.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 25
 Server Version: 20.10.11
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version: v1.0.2-0-g52b36a2
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.76-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 12.68GiB
 Name: docker-desktop
 ID: KTBO:NV7V:XXVG:CO4R:4FNU:QOBO:ZYT2:CHEA:OZ6V:RQFJ:B7N7:2UI5
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 44
  Goroutines: 46
  System Time: 2021-12-13T12:04:51.630102766Z
  EventsListeners: 3
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Other Software version

If needed, here is the kubectl version:

$ kubectl version --client

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.13", GitCommit:"2444b3347a2c45eb965b182fb836e1f51dc61b70", GitTreeState:"archive", BuildDate:"2021-11-18T06:58:15Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"darwin/amd64"}
@jimthompson5802 jimthompson5802 added the bug Something isn't working label Dec 13, 2021
@jimthompson5802
Author

If it helps, I re-ran k3d cluster create with the --verbose flag:

$ k3d cluster create --verbose --config k3d/kubeflow-cluster.yaml

DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] Runtime Info:
&{Name:docker Endpoint:/var/run/docker.sock Version:20.10.11 OSType:linux OS:Docker Desktop Arch:x86_64 CgroupVersion:2 CgroupDriver:cgroupfs Filesystem:extfs}
DEBU[0000] Additional CLI Configuration:
cli:
  api-port: ""
  env: []
  k3s-node-labels: []
  k3sargs: []
  ports: []
  registries:
    create: ""
  runtime-labels: []
  volumes: []
DEBU[0000] Validating file /var/folders/8_/h2tjc94d2y5fwx8dhbcrnbfc0000gn/T/k3d-config-tmp-kubeflow-cluster.yaml391576977 against default JSONSchema...
DEBU[0000] JSON Schema Validation Result: &{errors:[] score:58}
INFO[0000] Using config file k3d/kubeflow-cluster.yaml (k3d.io/v1alpha3#simple)
DEBU[0000] Configuration:
agents: 1
apiversion: k3d.io/v1alpha3
image: rancher/k3s:v1.19.16-k3s1
kind: Simple
name: kubeflow
network: ""
options:
  k3d:
    disableimagevolume: false
    disableloadbalancer: false
    disablerollback: false
    loadbalancer:
      configoverrides: []
    timeout: 0s
    wait: true
  k3s:
    extraargs:
    - arg: --disable=traefik
      nodeFilters:
      - server:*
  kubeconfig:
    switchcurrentcontext: true
    updatedefaultkubeconfig: true
  runtime:
    agentsmemory: ""
    gpurequest: ""
    serversmemory: ""
registries:
  config: ""
  use: []
servers: 1
subnet: ""
token: ""
DEBU[0000] ========== Simple Config ==========
{TypeMeta:{Kind:Simple APIVersion:k3d.io/v1alpha3} Name:kubeflow Servers:1 Agents:1 ExposeAPI:{Host: HostIP: HostPort:} Image:rancher/k3s:v1.19.16-k3s1 Network: Subnet: ClusterToken: Volumes:[] Ports:[] Options:{K3dOptions:{Wait:true Timeout:0s DisableLoadbalancer:false DisableImageVolume:false NoRollback:false NodeHookActions:[] Loadbalancer:{ConfigOverrides:[]}} K3sOptions:{ExtraArgs:[{Arg:--disable=traefik NodeFilters:[server:*]}] NodeLabels:[]} KubeconfigOptions:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true} Runtime:{GPURequest: ServersMemory: AgentsMemory: Labels:[]}} Env:[] Registries:{Use:[] Create:<nil> Config:}}
==========================
DEBU[0000] ========== Merged Simple Config ==========
{TypeMeta:{Kind:Simple APIVersion:k3d.io/v1alpha3} Name:kubeflow Servers:1 Agents:1 ExposeAPI:{Host: HostIP: HostPort:51713} Image:rancher/k3s:v1.19.16-k3s1 Network: Subnet: ClusterToken: Volumes:[] Ports:[] Options:{K3dOptions:{Wait:true Timeout:0s DisableLoadbalancer:false DisableImageVolume:false NoRollback:false NodeHookActions:[] Loadbalancer:{ConfigOverrides:[]}} K3sOptions:{ExtraArgs:[{Arg:--disable=traefik NodeFilters:[server:*]}] NodeLabels:[]} KubeconfigOptions:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true} Runtime:{GPURequest: ServersMemory: AgentsMemory: Labels:[]}} Env:[] Registries:{Use:[] Create:<nil> Config:}}
==========================
DEBU[0000] generated loadbalancer config:
ports:
  6443.tcp:
  - k3d-kubeflow-server-0
settings:
  workerConnections: 1024
DEBU[0000] ===== Merged Cluster Config =====
&{TypeMeta:{Kind: APIVersion:} Cluster:{Name:kubeflow Network:{Name:k3d-kubeflow ID: External:false IPAM:{IPPrefix:zero IPPrefix IPsUsed:[] Managed:false} Members:[]} Token: Nodes:[0xc00018ca80 0xc00018cc00 0xc00018cd80] InitNode:<nil> ExternalDatastore:<nil> KubeAPI:0xc0004ab8c0 ServerLoadBalancer:0xc0001ddb70 ImageVolume:} ClusterCreateOpts:{DisableImageVolume:false WaitForServer:true Timeout:0s DisableLoadBalancer:false GPURequest: ServersMemory: AgentsMemory: NodeHooks:[] GlobalLabels:map[app:k3d] GlobalEnv:[] Registries:{Create:<nil> Use:[] Config:<nil>}} KubeconfigOpts:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true}}
===== ===== =====
DEBU[0000] '--kubeconfig-update-default set: enabling wait-for-server
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-kubeflow'
INFO[0000] Created volume 'k3d-kubeflow-images'
INFO[0000] Starting new tools node...
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] [Docker] Local DfD: using 'host.docker.internal'
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] Detected CgroupV2, enabling custom entrypoint (disable by setting K3D_FIX_CGROUPV2=false)
DEBU[0000] Created container k3d-kubeflow-tools (ID: 628f5d4736aa92a0c87da81b8d3d8c2be20d223d580ad4cab72e56580b9e5e82)
DEBU[0000] Node k3d-kubeflow-tools Start Time: 2021-12-13 07:53:51.387471 -0500 EST m=+0.227691493
INFO[0000] Starting Node 'k3d-kubeflow-tools'
DEBU[0000] Truncated 2021-12-13 12:53:51.863100628 +0000 UTC to 2021-12-13 12:53:51 +0000 UTC
DEBU[0002] [Docker] wanted to use 'host.docker.internal' as docker host, but it's not resolvable locally: lookup host.docker.internal on 192.168.1.1:53: no such host
INFO[0003] Creating node 'k3d-kubeflow-server-0'
DEBU[0003] Created container k3d-kubeflow-server-0 (ID: 0d50799591e51426b76693774af70950d5e10fcf1332ad97066d52af8eca1e64)
DEBU[0003] Created node 'k3d-kubeflow-server-0'
INFO[0003] Creating node 'k3d-kubeflow-agent-0'
DEBU[0003] Created container k3d-kubeflow-agent-0 (ID: 2faf6866042468589a9a2a2049415c055822d577367d661b953f8eae510a66a5)
DEBU[0003] Created node 'k3d-kubeflow-agent-0'
INFO[0003] Creating LoadBalancer 'k3d-kubeflow-serverlb'
DEBU[0003] Created container k3d-kubeflow-serverlb (ID: c0fab471b5b926251850b393611d144d260da7862f95025e148b590283a12432)
DEBU[0003] Created loadbalancer 'k3d-kubeflow-serverlb'
DEBU[0003] DOCKER_SOCK=/var/run/docker.sock
INFO[0003] Using the k3d-tools node to gather environment information
DEBU[0003] no netlabel present on container /k3d-kubeflow-tools
DEBU[0003] failed to get IP for container /k3d-kubeflow-tools as we couldn't find the cluster network
DEBU[0003] DOCKER_SOCK=/var/run/docker.sock
DEBU[0003] no netlabel present on container /k3d-kubeflow-tools
DEBU[0003] failed to get IP for container /k3d-kubeflow-tools as we couldn't find the cluster network
DEBU[0003] Executing command '[sh -c getent ahostsv4 'host.k3d.internal']' in node 'k3d-kubeflow-tools'
DEBU[0004] Exec process in node 'k3d-kubeflow-tools' exited with '0'
DEBU[0004] Hostname 'host.k3d.internal' resolved to address '192.168.65.2' inside node k3d-kubeflow-tools
INFO[0004] Starting cluster 'kubeflow'
INFO[0004] Starting servers...
DEBU[0004] >>> enabling cgroupsv2 magic
DEBU[0004] Node k3d-kubeflow-server-0 Start Time: 2021-12-13 07:53:55.655211 -0500 EST m=+4.495245164
DEBU[0004] Deleting node k3d-kubeflow-tools ...
INFO[0004] Starting Node 'k3d-kubeflow-server-0'
DEBU[0004] Truncated 2021-12-13 12:53:56.127952552 +0000 UTC to 2021-12-13 12:53:56 +0000 UTC
DEBU[0004] Waiting for node k3d-kubeflow-server-0 to get ready (Log: 'k3s is up and running')
WARN[0007] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 0/10): time="2021-12-13T12:53:58.567759995Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0008] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 1/10): time="2021-12-13T12:53:59.838081664Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0010] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 2/10): time="2021-12-13T12:54:01.350562469Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0011] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 3/10): time="2021-12-13T12:54:02.903565577Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0012] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 4/10): time="2021-12-13T12:54:02.903565577Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0013] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 5/10): time="2021-12-13T12:54:04.878912304Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0014] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 6/10): time="2021-12-13T12:54:04.878912304Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0014] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 7/10): time="2021-12-13T12:54:04.878912304Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0015] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 8/10): time="2021-12-13T12:54:04.878912304Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
WARN[0016] warning: encountered fatal log from node k3d-kubeflow-server-0 (retrying 9/10): time="2021-12-13T12:54:07.608890706Z" level=fatal msg="failed to find memory cgroup, you may need to add \"cgroup_memory=1 cgroup_enable=memory\" to your linux cmdline (/boot/cmdline.txt on a Raspberry Pi)"
ERRO[0016] Failed Cluster Start: Failed to start server k3d-kubeflow-server-0: Node k3d-kubeflow-server-0 failed to get ready: error waiting for log line `k3s is up and running` from node 'k3d-kubeflow-server-0': stopped returning log lines
ERRO[0016] Failed to create cluster >>> Rolling Back
INFO[0016] Deleting cluster 'kubeflow'
DEBU[0016] Cluster Details: &{Name:kubeflow Network:{Name:k3d-kubeflow ID:15fec57df039fcceed07040ba2d04193f24fb4ce9f933b49e703890e1c5c8eb0 External:false IPAM:{IPPrefix:172.24.0.0/16 IPsUsed:[] Managed:false} Members:[]} Token:szRQZMyPfZkudYRMlhKj Nodes:[0xc00018ca80 0xc00018cc00 0xc00018cd80] InitNode:<nil> ExternalDatastore:<nil> KubeAPI:0xc0004ab8c0 ServerLoadBalancer:0xc0001ddb70 ImageVolume:k3d-kubeflow-images}
DEBU[0016] Deleting node k3d-kubeflow-serverlb ...
DEBU[0017] Deleting node k3d-kubeflow-server-0 ...
DEBU[0017] Deleting node k3d-kubeflow-agent-0 ...
INFO[0017] Deleting cluster network 'k3d-kubeflow'
INFO[0017] Deleting image volume 'k3d-kubeflow-images'
FATA[0017] Cluster creation FAILED, all changes have been rolled back!

@iwilltry42 iwilltry42 self-assigned this Dec 13, 2021
@iwilltry42 iwilltry42 added this to the v5.2.2 milestone Dec 13, 2021
@iwilltry42
Member

Hi @jimthompson5802 , thanks for opening this issue!
I'm pretty sure that your issue is related to CgroupsV2 on your system.
From your docker info:

 Cgroup Driver: cgroupfs
 Cgroup Version: 2

And from the k3d logs: DEBU[0000] Detected CgroupV2, enabling custom entrypoint (disable by setting K3D_FIX_CGROUPV2=false)

If I'm not wrong, cgroup v2 compatibility was just introduced in K3s v1.20.11 (or rather backported there) and is not available in K3s v1.19.x, which explains why the cgroup stuff fails with it 🤔
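
If you want to double-check the cgroup setup on your side, docker can report it directly; a quick sketch using the standard docker CLI:

$ docker info --format '{{.CgroupVersion}}'
2

And given the backport mentioned above, any K3s image at v1.20.11 or newer should work under cgroup v2, e.g. pointing the image: line of your config at one of the tags you already have pulled:

image: rancher/k3s:v1.20.13-k3s1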

@jimthompson5802
Author

@iwilltry42 thank you for the quick response. I appreciate your insight into the problem.

Since you pointed out the environment variable K3D_FIX_CGROUPV2, should I try setting K3D_FIX_CGROUPV2=false as a workaround for this issue?

@jimthompson5802
Author

@iwilltry42 I hope you don't mind me running this idea by you.

I've been thinking about your write-up. It appears that the cause of the problem is that Docker for Desktop now supports cgroupv2. Looking at the Docker for Desktop release notes, cgroupv2 support was introduced in Docker for Desktop 4.3.0, which is what I'm currently running.

If I fall back to Docker for Desktop 4.2.0, which I assume does not have cgroupv2, would this be a way for me to use the k3s:v1.19.x image? According to the release notes, Docker Engine 20.10.10 is in this release. I believe this meets the minimum requirement for k3d v5.x.

Does this make sense?

@jimthompson5802
Author

@iwilltry42 I wanted to provide an update. I uninstalled Docker for Desktop (Mac) 4.3.0 and installed 4.2.0.

With this setup, I was able to use rancher/k3s:v1.19.16-k3s1 and install kubeflow w/o problems. This may also explain the errors I saw installing kubeflow with later k3s images, e.g., k3s:v1.21.7-k3s1. By using Docker Desktop 4.2.0, I am able to use the more recent k3s images.

Thank you very much for your assistance. Without your help, I would still be floundering.

The one impact of this particular solution is that I am frozen at Docker Desktop 4.2.0. In thinking about how to resolve this, I wanted to ask about the environment variable K3D_FIX_CGROUPV2. Is there documentation on the variable that I can review?

I'm wondering: if I go back to Docker Desktop 4.3.0 and set K3D_FIX_CGROUPV2=false, would this allow me to install and use kubeflow with current k3s images? If that is a possibility, I'm not sure where to set the environment variable. Do I set it when I execute k3d cluster create, or do I set it in the config file to be used in each of the nodes, something like

env:
  - envVar: K3D_FIX_CGROUPV2=false 
    nodeFilters:
      - server:*
      - agent:*
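
or alternatively on the host shell when invoking k3d, something like this (assuming the variable is read by the k3d CLI process itself rather than inside the node containers):

$ K3D_FIX_CGROUPV2=false k3d cluster create --config k3d/kubeflow-cluster.yaml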

Any guidance will be appreciated.

In any event, I have a solution for the problem reported above. This issue can be closed.

Thank you again for your help.

@jimthompson5802
Author

jimthompson5802 commented Dec 14, 2021

For completeness ... I upgraded to Docker Desktop 4.3.1 and tried setting export K3D_FIX_CGROUPV2=false before running k3d cluster create. Cluster creation failed on the agent node.

So I'm falling back to Docker Desktop 4.2.0 to use k3d as a local cluster for running kubeflow.

As noted above, I have a work-around for the original issue, so I'll close this. I'm good right now.

@iwilltry42
Member

Hi again 👋
export K3D_FIX_CGROUPV2=false won't change anything in this case, I'm afraid: the fix just sets up cgroupv2 in a way that K3s can work with (no longer required in newer versions), but it doesn't make old versions of K3s, which are not compatible with cgroupv2, work with it.
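If you want to verify what the toggle actually changes, a rough sketch is to re-run with --verbose and watch for the entrypoint debug line from your earlier output:

$ K3D_FIX_CGROUPV2=false k3d cluster create --verbose --config k3d/kubeflow-cluster.yaml 2>&1 | grep -i cgroup

With the variable set to false, k3d should no longer report enabling the custom entrypoint ("Detected CgroupV2, enabling custom entrypoint").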

Glad that you found a workaround for your use case!

However, I'm interested in what your problems were with deploying kubeflow on newer versions of K3s.
It seems concerning, given that Kubernetes v1.19 is already EOL 🤔

@jimthompson5802
Author

@iwilltry42 thanks for your interest re: "...problems with deploying kubeflow in newer versions of k3s?".

Right now the issues seem to center around the particular version of Docker Desktop.

If I'm using Docker Desktop 4.2.0, I don't have a problem using k3d 5.2.1 and the default k3s image to test kubeflow. This is what I'm running right now, based on your earlier guidance.

OTOH, if I use Docker Desktop 4.3.1, I encounter problems just in defining the cluster. Let me gather some diagnostic output and I'll open a new issue for that.

@iwilltry42
Member

Ah sorry, I misunderstood you there and thought kubeflow didn't work with K3s >1.19.
If kubeflow works with K3s > v1.19, wouldn't that work for you? Or is K8s v1.19 a hard requirement (which would probably not be good as it's EOL)?

@jimthompson5802
Author

@iwilltry42 There are a couple of nuances that I failed to make clear. Sorry about that.

Let me summarize:

  • I started with k3d 5.x using the default k3s image, which is at 1.21.x (or 1.20.x). I was able to create the cluster successfully, but when I tried deploying kubeflow, I encountered the error described here.

  • Since the README.md for kubeflow/manifests indicated k8s 1.19 was a prerequisite, I decided to create a k3d cluster with the k3s 1.19.x image.

  • This surfaced the problem where I could not create the k3d cluster at all, which prompted me to open this particular issue.

  • Based on your guidance, I changed Docker for Desktop from 4.3.0 to 4.2.0, which was the last release with support for cgroupv1.

  • Using Docker for Desktop 4.2.0, I was able to create a k3d 5.x cluster with the k3s 1.19.x image and was able to successfully install kubeflow.

  • Using Docker for Desktop 4.2.0, as another test, I created a k3d 5.x cluster with the default k3s 1.21.x image. I was able to successfully install kubeflow.

  • For completeness, I went back to the latest release of Docker for Desktop, 4.3.1. With this setup, I was able to create a k3d 5.x cluster with the default k3s 1.21.x image. However, when I installed kubeflow, I encountered the original problem, described in the first bullet item, where kubeflow does not install correctly.

At this point I can see three possible sources for the situation described in the last bullet item:

  • There is a problem with how Docker for Desktop implemented cgroupv2 support.
  • k3d may need more refinement in how it works in a cgroupv2 environment.
  • The kubeflow components that fail to install need to be updated for a cgroupv2 environment.

At this point, I'm not sure where to address the problem. 😕

Right now I'm thinking of opening a new issue with k3d on the problem described in the summary. Best case, k3d finds the root cause and fixes it. 😄 At minimum, it eliminates k3d as a possibility and I can pursue the issue through the other avenues.
