rke 1.3.14 dies with SIGSEGV on creating an etcd snapshot #3028

Closed
harridu opened this issue Aug 31, 2022 · 12 comments
Comments

@harridu

harridu commented Aug 31, 2022

RKE version:
1.3.14

Docker version: (docker version, docker info preferred)

root@rr01:~# docker version
Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:03:11 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:01:17 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

root@rr01:~# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
  compose: Docker Compose (Docker Inc., v2.2.3)

Server:
 Containers: 38
  Running: 21
  Paused: 0
  Stopped: 17
 Images: 108
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.10.0-17-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.771GiB
 Name: rr01.ac.aixigo.de
 ID: KFEF:IEUC:ODFJ:CZXC:E35F:AQA6:D3HS:BDU6:BTFH:CIXC:LNWZ:EHNL
 Docker Root Dir: /export/docker-data
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  http://artifacts.ac.aixigo.de:1081/
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.24.0.0/14, Size: 24

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Linux dpcl082.ac.aixigo.de 5.10.0-17-amd64 #1 SMP Debian 5.10.136-1 (2022-08-13) x86_64 GNU/Linux

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

cluster.yml file:

nodes:
  - address: rr01.ac.aixigo.de
    user: debian
    role:
      - controlplane
      - worker
      - etcd
  - address: rr02.ac.aixigo.de
    user: debian
    role:
      - controlplane
      - worker
      - etcd
  - address: rr03.ac.aixigo.de
    user: debian
    role:
      - controlplane
      - worker
      - etcd

ignore_docker_version: false

#
# this is a *local* directory, not ~/.ssh
#
ssh_key_path: .ssh/id_ecdsa

kubernetes_version: v1.23.10-rancher1-1

services:
  etcd:
    snapshot: true
    creation: 3h
    retention: 72h

network:
  plugin: canal
  # plugin: flannel

Steps to Reproduce:

Results:

{hdunkel@dpcl082:rke.rancher 07:00:34 (master) 540} rke etcd snapshot-save --config ${name}.yml --name snapshot-${name}
INFO[0000] Running RKE version: v1.3.14                 
INFO[0000] Starting saving snapshot on etcd hosts       
INFO[0000] [dialer] Setup tunnel for host [rr02.ac.aixigo.de] 
INFO[0000] [dialer] Setup tunnel for host [rr03.ac.aixigo.de] 
INFO[0000] [dialer] Setup tunnel for host [rr01.ac.aixigo.de] 
INFO[0000] [state] Deploying state file to [/etc/kubernetes/snapshot-rancher.rkestate] on host [rr03.ac.aixigo.de] 
INFO[0000] [state] Deploying state file to [/etc/kubernetes/snapshot-rancher.rkestate] on host [rr01.ac.aixigo.de] 
INFO[0000] [state] Deploying state file to [/etc/kubernetes/snapshot-rancher.rkestate] on host [rr02.ac.aixigo.de] 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.87] on host [rr03.ac.aixigo.de], try #1 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.87] on host [rr01.ac.aixigo.de], try #1 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.87] on host [rr02.ac.aixigo.de], try #1 
INFO[0012] Image [rancher/rke-tools:v0.1.87] exists on host [rr02.ac.aixigo.de] 
INFO[0013] Starting container [cluster-state-deployer] on host [rr02.ac.aixigo.de], try #1 
INFO[0013] [state] Successfully started [cluster-state-deployer] container on host [rr02.ac.aixigo.de] 
INFO[0013] Waiting for [cluster-state-deployer] container to exit on host [rr02.ac.aixigo.de] 
INFO[0013] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0014] Image [rancher/rke-tools:v0.1.87] exists on host [rr03.ac.aixigo.de] 
INFO[0014] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0014] Starting container [cluster-state-deployer] on host [rr03.ac.aixigo.de], try #1 
INFO[0014] [state] Successfully started [cluster-state-deployer] container on host [rr03.ac.aixigo.de] 
INFO[0015] Waiting for [cluster-state-deployer] container to exit on host [rr03.ac.aixigo.de] 
INFO[0015] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0015] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0016] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0016] Image [rancher/rke-tools:v0.1.87] exists on host [rr01.ac.aixigo.de] 
INFO[0016] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0017] Starting container [cluster-state-deployer] on host [rr01.ac.aixigo.de], try #1 
INFO[0017] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0017] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0017] [state] Successfully started [cluster-state-deployer] container on host [rr01.ac.aixigo.de] 
INFO[0017] Waiting for [cluster-state-deployer] container to exit on host [rr01.ac.aixigo.de] 
INFO[0017] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0018] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0018] Removing container [cluster-state-deployer] on host [rr02.ac.aixigo.de], try #1 
INFO[0018] [remove/cluster-state-deployer] Successfully removed container on host [rr02.ac.aixigo.de] 
INFO[0018] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0019] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0019] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0020] Removing container [cluster-state-deployer] on host [rr03.ac.aixigo.de], try #1 
INFO[0020] [remove/cluster-state-deployer] Successfully removed container on host [rr03.ac.aixigo.de] 
INFO[0020] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0021] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0023] Removing container [cluster-state-deployer] on host [rr01.ac.aixigo.de], try #1 
INFO[0023] [remove/cluster-state-deployer] Successfully removed container on host [rr01.ac.aixigo.de] 
INFO[0023] [etcd] Running snapshot save once on host [rr01.ac.aixigo.de] 
INFO[0023] Finding container [etcd] on host [rr01.ac.aixigo.de], try #1 
INFO[0023] Image [rancher/rke-tools:v0.1.87] exists on host [rr01.ac.aixigo.de] 
INFO[0023] Starting container [etcd-snapshot-once] on host [rr01.ac.aixigo.de], try #1 
INFO[0023] [etcd] Successfully started [etcd-snapshot-once] container on host [rr01.ac.aixigo.de] 
INFO[0023] Waiting for [etcd-snapshot-once] container to exit on host [rr01.ac.aixigo.de] 
INFO[0024] Container [etcd-snapshot-once] is still running on host [rr01.ac.aixigo.de]: stderr: [time="2022-08-31T05:01:00Z" level=info msg="Initializing Onetime Backup" name=snapshot-rancher
], stdout: [] 
INFO[0026] Removing container [etcd-snapshot-once] on host [rr01.ac.aixigo.de], try #1 
INFO[0027] [etcd] Running snapshot save once on host [rr02.ac.aixigo.de] 
INFO[0027] Finding container [etcd] on host [rr02.ac.aixigo.de], try #1 
INFO[0027] Image [rancher/rke-tools:v0.1.87] exists on host [rr02.ac.aixigo.de] 
INFO[0027] Starting container [etcd-snapshot-once] on host [rr02.ac.aixigo.de], try #1 
INFO[0027] [etcd] Successfully started [etcd-snapshot-once] container on host [rr02.ac.aixigo.de] 
INFO[0027] Waiting for [etcd-snapshot-once] container to exit on host [rr02.ac.aixigo.de] 
INFO[0028] Container [etcd-snapshot-once] is still running on host [rr02.ac.aixigo.de]: stderr: [time="2022-08-31T05:01:03Z" level=info msg="Initializing Onetime Backup" name=snapshot-rancher
], stdout: [] 
INFO[0030] Removing container [etcd-snapshot-once] on host [rr02.ac.aixigo.de], try #1 
INFO[0030] [etcd] Running snapshot save once on host [rr03.ac.aixigo.de] 
INFO[0030] Finding container [etcd] on host [rr03.ac.aixigo.de], try #1 
INFO[0030] Image [rancher/rke-tools:v0.1.87] exists on host [rr03.ac.aixigo.de] 
INFO[0030] Starting container [etcd-snapshot-once] on host [rr03.ac.aixigo.de], try #1 
INFO[0030] [etcd] Successfully started [etcd-snapshot-once] container on host [rr03.ac.aixigo.de] 
INFO[0030] Waiting for [etcd-snapshot-once] container to exit on host [rr03.ac.aixigo.de] 
INFO[0031] Container [etcd-snapshot-once] is still running on host [rr03.ac.aixigo.de]: stderr: [time="2022-08-31T05:01:07Z" level=info msg="Initializing Onetime Backup" name=snapshot-rancher
], stdout: [] 
INFO[0034] Removing container [etcd-snapshot-once] on host [rr03.ac.aixigo.de], try #1 
INFO[0034] [etcd] Finished saving snapshot [snapshot-rancher] on all etcd hosts 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1984075]

goroutine 1 [running]:
github.com/rancher/rke/cluster.(*Cluster).SnapshotEtcd(0xc000872000, {0x2218cd8, 0xc0001c0000}, {0x7fffb3490631, 0x10})
        /go/src/github.com/rancher/rke/cluster/etcd.go:54 +0x415
github.com/rancher/rke/cmd.SnapshotSaveEtcdHosts({0x2218cd8, 0xc0001c0000}, 0x0, {0x0, 0x0, 0x0}, {{0x0, 0x0}, {0x7fffb349061e, 0xb}, ...}, ...)
        /go/src/github.com/rancher/rke/cmd/etcd.go:133 +0x226
github.com/rancher/rke/cmd.SnapshotSaveEtcdHostsFromCli(0xc00081db80)
        /go/src/github.com/rancher/rke/cmd/etcd.go:321 +0x333
github.com/urfave/cli.HandleAction({0x1b34dc0, 0x1f90098}, 0xd)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/app.go:523 +0xa8
github.com/urfave/cli.Command.Run({{0x1e9fd49, 0xd}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x1ec1894, 0x1f}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/command.go:174 +0x63a
github.com/urfave/cli.(*App).RunAsSubcommand(0xc0006da8c0, 0xc00081d8c0)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/app.go:404 +0x9ec
github.com/urfave/cli.Command.startApp({{0x1e8ebb0, 0x4}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x1efba81, 0x34}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/command.go:373 +0x6e9
github.com/urfave/cli.Command.Run({{0x1e8ebb0, 0x4}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x1efba81, 0x34}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/command.go:102 +0x825
github.com/urfave/cli.(*App).Run(0xc0006da700, {0xc0001ae000, 0x7, 0x7})
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/app.go:276 +0x80c
main.mainErr()
        /go/src/github.com/rancher/rke/main.go:81 +0x1825
main.main()
        /go/src/github.com/rancher/rke/main.go:25 +0x33
harridu changed the title from "rke 1.4.13 dies with SIGSEGV" to "rke 1.4.13 dies with SIGSEGV on creating an etcd snapshot" on Aug 31, 2022
harridu changed the title from "rke 1.4.13 dies with SIGSEGV on creating an etcd snapshot" to "rke 1.3.14 dies with SIGSEGV on creating an etcd snapshot" on Aug 31, 2022
@konstantin-921

konstantin-921 commented Sep 2, 2022

RKE version: 1.3.14
Docker version: 20.10.17

+1, same problem here.

kinarashah added this to the v1.3.15 milestone on Sep 2, 2022
@kinarashah
Member

Able to reproduce with RKE v1.3.13 and v1.3.14.
It panics because backupConfig is nil here: https://github.com/rancher/rke/blob/release/v1.3/cluster/etcd.go#L54

As a workaround, please add an empty backup_config to cluster.yml:

services:
  etcd:
    backup_config: {}
    snapshot: true
    creation: 3h
    retention: 72h
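
For context, the panic is the standard Go nil-dereference pattern: backup_config is optional in cluster.yml, so any code path that reads fields from the parsed backup config has to guard against nil first. Below is a minimal, self-contained sketch of that kind of guard; the type and function names are made up for illustration and are not RKE's actual code or the actual fix.

package main

import "fmt"

// Illustrative types mirroring the shape of an optional etcd backup config.
// They are not RKE's real structs.
type S3BackupConfig struct {
	BucketName string
}

type BackupConfig struct {
	S3BackupConfig *S3BackupConfig
}

// describeBackupTarget shows the guard that avoids the panic: both the config
// itself and its nested S3 section may be nil when backup_config is absent
// from cluster.yml, so check before dereferencing.
func describeBackupTarget(cfg *BackupConfig) string {
	if cfg == nil || cfg.S3BackupConfig == nil {
		return "local snapshot only"
	}
	return "s3 bucket: " + cfg.S3BackupConfig.BucketName
}

func main() {
	var cfg *BackupConfig // nil, as when cluster.yml has no backup_config
	// Prints "local snapshot only" instead of panicking on a nil dereference.
	fmt.Println(describeBackupTarget(cfg))
}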

@vincebrannon

Has an internal ticket been opened for this yet?
Also, the customer wants to know whether the snapshot that was taken was successful or not.

@harridu
Author

harridu commented Sep 5, 2022

The workaround seems to be OK, AFAICT.

@vincebrannon, the "regular" etcd backups that run every couple of hours were sufficient for me.

@vincebrannon

@harridu
What I was asking is this: is the etcd snapshot that triggers this error successful or not? Do you first have to apply the workaround and then re-take the snapshot?

@harridu
Author

harridu commented Sep 15, 2022

It created a snapshot file as expected. I didn't rely upon it, though. I waited for the next scheduled backup.
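
For anyone checking the result: with the default local backup settings, the snapshot files land under /opt/rke/etcd-snapshots/ on each etcd node (this is the RKE default path; adjust if your setup overrides it), e.g.:

ls -lh /opt/rke/etcd-snapshots/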

@kinarashah
Member

Available to test with v1.3.15-rc2 https://github.com/rancher/rke/releases/tag/v1.3.15-rc2

@vivek-shilimkar
Member

Validated with RKE version v1.3.15-rc2.

Validation steps followed:

  1. Provisioned a k8s cluster v1.24.4 with standalone RKE v1.3.15-rc2.

  2. Waited for the cluster to become active.

  3. Once the cluster was active, took a snapshot of the cluster with the following command:
    rke etcd snapshot-save --name snapshot-${name}

  4. Made sure the snapshot was saved and that the cluster could be restored from it (see the example restore command below).
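
For reference, the restore from a named snapshot uses a command of the following form (assuming the standard RKE CLI flags; cluster.yml and the snapshot name are placeholders):

rke etcd snapshot-restore --config cluster.yml --name snapshot-${name}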

Based on the above validation, the issue is not present on RKE v1.3.15-rc2, hence closing the issue.

@sowmyav27

sowmyav27 commented Sep 19, 2022

@vivek-infracloud @mitulshah-suse Reopening to validate: reproduce the issue with RKE 1.3.14 using these steps, and ensure the same steps work with the latest 1.3.15-rc2.

@vivek-shilimkar
Member

vivek-shilimkar commented Sep 20, 2022

Validated whether the workaround mentioned here works.

Validation steps on RKE v1.3.14:

  1. Provisioned a standalone RKE1 cluster with one controlplane, one etcd, and one worker node, without the empty backup_config in cluster.yml.
  2. Once the cluster was active, tried to take an etcd snapshot with the following command:
    rke etcd snapshot-save --name snapshot-${name}

As expected, the snapshot was not created and the log shows a SIGSEGV error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1984075]

  3. Added the empty backup_config to the cluster.yml file.
  4. Ran rke up and waited for the cluster to come to an active state.
  5. Ran the command to take a snapshot of the cluster. However, still saw the same error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1984075]

The workaround does not work to create a snapshot.

@vivek-shilimkar
Member

While validating snapshot creation with RKE v1.3.15-rc2, the snapshot creation was successful and the panic error was not encountered, even with a blank backup_config.

@kinarashah
Member

Closing this issue, fixed in v1.3.15. Please reopen if you run into issues.
