rke 1.3.14 dies with SIGSEGV on creating an etcd snapshot #3028

Closed
harridu opened this issue Aug 31, 2022 · 12 comments
Comments

@harridu

harridu commented Aug 31, 2022

RKE version:
1.3.14

Docker version: (docker version, docker info preferred)

root@rr01:~# docker version
Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:03:11 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:01:17 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

root@rr01:~# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
  compose: Docker Compose (Docker Inc., v2.2.3)

Server:
 Containers: 38
  Running: 21
  Paused: 0
  Stopped: 17
 Images: 108
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.10.0-17-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.771GiB
 Name: rr01.ac.aixigo.de
 ID: KFEF:IEUC:ODFJ:CZXC:E35F:AQA6:D3HS:BDU6:BTFH:CIXC:LNWZ:EHNL
 Docker Root Dir: /export/docker-data
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  http://artifacts.ac.aixigo.de:1081/
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.24.0.0/14, Size: 24

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Linux dpcl082.ac.aixigo.de 5.10.0-17-amd64 #1 SMP Debian 5.10.136-1 (2022-08-13) x86_64 GNU/Linux

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

cluster.yml file:

nodes:
  - address: rr01.ac.aixigo.de
    user: debian
    role:
      - controlplane
      - worker
      - etcd
  - address: rr02.ac.aixigo.de
    user: debian
    role:
      - controlplane
      - worker
      - etcd
  - address: rr03.ac.aixigo.de
    user: debian
    role:
      - controlplane
      - worker
      - etcd

ignore_docker_version: false

#
# this is a *local* directory, not ~/.ssh
#
ssh_key_path: .ssh/id_ecdsa

kubernetes_version: v1.23.10-rancher1-1

services:
  etcd:
    snapshot: true
    creation: 3h
    retention: 72h

network:
  plugin: canal
  # plugin: flannel

Steps to Reproduce:

Results:

{hdunkel@dpcl082:rke.rancher 07:00:34 (master) 540} rke etcd snapshot-save --config ${name}.yml --name snapshot-${name}
INFO[0000] Running RKE version: v1.3.14                 
INFO[0000] Starting saving snapshot on etcd hosts       
INFO[0000] [dialer] Setup tunnel for host [rr02.ac.aixigo.de] 
INFO[0000] [dialer] Setup tunnel for host [rr03.ac.aixigo.de] 
INFO[0000] [dialer] Setup tunnel for host [rr01.ac.aixigo.de] 
INFO[0000] [state] Deploying state file to [/etc/kubernetes/snapshot-rancher.rkestate] on host [rr03.ac.aixigo.de] 
INFO[0000] [state] Deploying state file to [/etc/kubernetes/snapshot-rancher.rkestate] on host [rr01.ac.aixigo.de] 
INFO[0000] [state] Deploying state file to [/etc/kubernetes/snapshot-rancher.rkestate] on host [rr02.ac.aixigo.de] 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.87] on host [rr03.ac.aixigo.de], try #1 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.87] on host [rr01.ac.aixigo.de], try #1 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.87] on host [rr02.ac.aixigo.de], try #1 
INFO[0012] Image [rancher/rke-tools:v0.1.87] exists on host [rr02.ac.aixigo.de] 
INFO[0013] Starting container [cluster-state-deployer] on host [rr02.ac.aixigo.de], try #1 
INFO[0013] [state] Successfully started [cluster-state-deployer] container on host [rr02.ac.aixigo.de] 
INFO[0013] Waiting for [cluster-state-deployer] container to exit on host [rr02.ac.aixigo.de] 
INFO[0013] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0014] Image [rancher/rke-tools:v0.1.87] exists on host [rr03.ac.aixigo.de] 
INFO[0014] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0014] Starting container [cluster-state-deployer] on host [rr03.ac.aixigo.de], try #1 
INFO[0014] [state] Successfully started [cluster-state-deployer] container on host [rr03.ac.aixigo.de] 
INFO[0015] Waiting for [cluster-state-deployer] container to exit on host [rr03.ac.aixigo.de] 
INFO[0015] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0015] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0016] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0016] Image [rancher/rke-tools:v0.1.87] exists on host [rr01.ac.aixigo.de] 
INFO[0016] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0017] Starting container [cluster-state-deployer] on host [rr01.ac.aixigo.de], try #1 
INFO[0017] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0017] Container [cluster-state-deployer] is still running on host [rr02.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0017] [state] Successfully started [cluster-state-deployer] container on host [rr01.ac.aixigo.de] 
INFO[0017] Waiting for [cluster-state-deployer] container to exit on host [rr01.ac.aixigo.de] 
INFO[0017] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0018] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0018] Removing container [cluster-state-deployer] on host [rr02.ac.aixigo.de], try #1 
INFO[0018] [remove/cluster-state-deployer] Successfully removed container on host [rr02.ac.aixigo.de] 
INFO[0018] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0019] Container [cluster-state-deployer] is still running on host [rr03.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0019] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0020] Removing container [cluster-state-deployer] on host [rr03.ac.aixigo.de], try #1 
INFO[0020] [remove/cluster-state-deployer] Successfully removed container on host [rr03.ac.aixigo.de] 
INFO[0020] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0021] Container [cluster-state-deployer] is still running on host [rr01.ac.aixigo.de]: stderr: [], stdout: [Waiting for file [/etc/kubernetes/rancher.rkestate] to be successfully copied to this container, retry count 1
] 
INFO[0023] Removing container [cluster-state-deployer] on host [rr01.ac.aixigo.de], try #1 
INFO[0023] [remove/cluster-state-deployer] Successfully removed container on host [rr01.ac.aixigo.de] 
INFO[0023] [etcd] Running snapshot save once on host [rr01.ac.aixigo.de] 
INFO[0023] Finding container [etcd] on host [rr01.ac.aixigo.de], try #1 
INFO[0023] Image [rancher/rke-tools:v0.1.87] exists on host [rr01.ac.aixigo.de] 
INFO[0023] Starting container [etcd-snapshot-once] on host [rr01.ac.aixigo.de], try #1 
INFO[0023] [etcd] Successfully started [etcd-snapshot-once] container on host [rr01.ac.aixigo.de] 
INFO[0023] Waiting for [etcd-snapshot-once] container to exit on host [rr01.ac.aixigo.de] 
INFO[0024] Container [etcd-snapshot-once] is still running on host [rr01.ac.aixigo.de]: stderr: [time="2022-08-31T05:01:00Z" level=info msg="Initializing Onetime Backup" name=snapshot-rancher
], stdout: [] 
INFO[0026] Removing container [etcd-snapshot-once] on host [rr01.ac.aixigo.de], try #1 
INFO[0027] [etcd] Running snapshot save once on host [rr02.ac.aixigo.de] 
INFO[0027] Finding container [etcd] on host [rr02.ac.aixigo.de], try #1 
INFO[0027] Image [rancher/rke-tools:v0.1.87] exists on host [rr02.ac.aixigo.de] 
INFO[0027] Starting container [etcd-snapshot-once] on host [rr02.ac.aixigo.de], try #1 
INFO[0027] [etcd] Successfully started [etcd-snapshot-once] container on host [rr02.ac.aixigo.de] 
INFO[0027] Waiting for [etcd-snapshot-once] container to exit on host [rr02.ac.aixigo.de] 
INFO[0028] Container [etcd-snapshot-once] is still running on host [rr02.ac.aixigo.de]: stderr: [time="2022-08-31T05:01:03Z" level=info msg="Initializing Onetime Backup" name=snapshot-rancher
], stdout: [] 
INFO[0030] Removing container [etcd-snapshot-once] on host [rr02.ac.aixigo.de], try #1 
INFO[0030] [etcd] Running snapshot save once on host [rr03.ac.aixigo.de] 
INFO[0030] Finding container [etcd] on host [rr03.ac.aixigo.de], try #1 
INFO[0030] Image [rancher/rke-tools:v0.1.87] exists on host [rr03.ac.aixigo.de] 
INFO[0030] Starting container [etcd-snapshot-once] on host [rr03.ac.aixigo.de], try #1 
INFO[0030] [etcd] Successfully started [etcd-snapshot-once] container on host [rr03.ac.aixigo.de] 
INFO[0030] Waiting for [etcd-snapshot-once] container to exit on host [rr03.ac.aixigo.de] 
INFO[0031] Container [etcd-snapshot-once] is still running on host [rr03.ac.aixigo.de]: stderr: [time="2022-08-31T05:01:07Z" level=info msg="Initializing Onetime Backup" name=snapshot-rancher
], stdout: [] 
INFO[0034] Removing container [etcd-snapshot-once] on host [rr03.ac.aixigo.de], try #1 
INFO[0034] [etcd] Finished saving snapshot [snapshot-rancher] on all etcd hosts 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1984075]

goroutine 1 [running]:
github.com/rancher/rke/cluster.(*Cluster).SnapshotEtcd(0xc000872000, {0x2218cd8, 0xc0001c0000}, {0x7fffb3490631, 0x10})
        /go/src/github.com/rancher/rke/cluster/etcd.go:54 +0x415
github.com/rancher/rke/cmd.SnapshotSaveEtcdHosts({0x2218cd8, 0xc0001c0000}, 0x0, {0x0, 0x0, 0x0}, {{0x0, 0x0}, {0x7fffb349061e, 0xb}, ...}, ...)
        /go/src/github.com/rancher/rke/cmd/etcd.go:133 +0x226
github.com/rancher/rke/cmd.SnapshotSaveEtcdHostsFromCli(0xc00081db80)
        /go/src/github.com/rancher/rke/cmd/etcd.go:321 +0x333
github.com/urfave/cli.HandleAction({0x1b34dc0, 0x1f90098}, 0xd)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/app.go:523 +0xa8
github.com/urfave/cli.Command.Run({{0x1e9fd49, 0xd}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x1ec1894, 0x1f}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/command.go:174 +0x63a
github.com/urfave/cli.(*App).RunAsSubcommand(0xc0006da8c0, 0xc00081d8c0)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/app.go:404 +0x9ec
github.com/urfave/cli.Command.startApp({{0x1e8ebb0, 0x4}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x1efba81, 0x34}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/command.go:373 +0x6e9
github.com/urfave/cli.Command.Run({{0x1e8ebb0, 0x4}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x1efba81, 0x34}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/command.go:102 +0x825
github.com/urfave/cli.(*App).Run(0xc0006da700, {0xc0001ae000, 0x7, 0x7})
        /go/pkg/mod/github.com/urfave/cli@v1.22.2/app.go:276 +0x80c
main.mainErr()
        /go/src/github.com/rancher/rke/main.go:81 +0x1825
main.main()
        /go/src/github.com/rancher/rke/main.go:25 +0x33
harridu changed the title from "rke 1.4.13 dies with SIGSEGV" to "rke 1.4.13 dies with SIGSEGV on creating an etcd snapshot" on Aug 31, 2022
harridu changed the title from "rke 1.4.13 dies with SIGSEGV on creating an etcd snapshot" to "rke 1.3.14 dies with SIGSEGV on creating an etcd snapshot" on Aug 31, 2022
@konstantin-921

konstantin-921 commented Sep 2, 2022

RKE version: 1.3.14
Docker version: 20.10.17

+1, same problem here.

kinarashah added this to the v1.3.15 milestone on Sep 2, 2022
@kinarashah
Member

Able to reproduce with RKE v1.3.13 and v1.3.14.
It panics because backupConfig is nil here: https://github.com/rancher/rke/blob/release/v1.3/cluster/etcd.go#L54

As a workaround, please add an empty backup_config to cluster.yml:

services:
  etcd:
    backup_config: {}
    snapshot: true
    creation: 3h
    retention: 72h
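
For context, the panic is the standard Go nil-dereference pattern: backup_config is optional in cluster.yml, so any code path that reads fields from the parsed backup config has to guard against nil first. Below is a minimal, self-contained sketch of that kind of guard; the type and function names are made up for illustration and are not RKE's actual code or the actual fix.

package main

import "fmt"

// Illustrative types mirroring the shape of an optional etcd backup config.
// They are not RKE's real structs.
type S3BackupConfig struct {
	BucketName string
}

type BackupConfig struct {
	S3BackupConfig *S3BackupConfig
}

// describeBackupTarget shows the guard that avoids the panic: both the config
// itself and its nested S3 section may be nil when backup_config is absent
// from cluster.yml, so check before dereferencing.
func describeBackupTarget(cfg *BackupConfig) string {
	if cfg == nil || cfg.S3BackupConfig == nil {
		return "local snapshot only"
	}
	return "s3 bucket: " + cfg.S3BackupConfig.BucketName
}

func main() {
	var cfg *BackupConfig // nil, as when cluster.yml has no backup_config
	// Prints "local snapshot only" instead of panicking on a nil dereference.
	fmt.Println(describeBackupTarget(cfg))
}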

@vincebrannon

Has an internal ticket been opened for this yet?
Also, the customer wants to know whether the snapshot that was taken was successful or not.

@harridu
Author

harridu commented Sep 5, 2022

The workaround seems to be OK, AFAICT.

@vincebrannon, the "regular" etcd backups that run every couple of hours were sufficient for me.

@vincebrannon

@harridu
What I was asking is this: is the etcd snapshot that triggers this error successful or not? Do you first have to apply the workaround and then re-take the snapshot?

@harridu
Author

harridu commented Sep 15, 2022

It created a snapshot file as expected. I didn't rely upon it, though. I waited for the next scheduled backup.
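
For anyone checking the result: with the default local backup settings, the snapshot files land under /opt/rke/etcd-snapshots/ on each etcd node (this is the RKE default path; adjust if your setup overrides it), e.g.:

ls -lh /opt/rke/etcd-snapshots/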

@kinarashah
Member

Available to test with v1.3.15-rc2 https://github.com/rancher/rke/releases/tag/v1.3.15-rc2

@vivek-shilimkar
Member

Validated with RKE version v1.3.15-rc2.

Validation steps followed:

  1. Provisioned a k8s cluster v1.24.4 with standalone RKE v1.3.15-rc2.

  2. Waited for the cluster to become active.

  3. Once the cluster was active, took a snapshot of the cluster with the following command:
    rke etcd snapshot-save --name snapshot-${name}

  4. Made sure the snapshot was saved and that the cluster could be restored from it (see the example restore command below).
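
For reference, the restore from a named snapshot uses a command of the following form (assuming the standard RKE CLI flags; cluster.yml and the snapshot name are placeholders):

rke etcd snapshot-restore --config cluster.yml --name snapshot-${name}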

Based on the above validation, the issue is not present on RKE v1.3.15-rc2, hence closing the issue.

@sowmyav27

sowmyav27 commented Sep 19, 2022

@vivek-infracloud @mitulshah-suse Reopening to validate: reproduce the issue with RKE 1.3.14 using these steps, and ensure the same steps work with the latest 1.3.15-rc2.

@vivek-shilimkar
Member

vivek-shilimkar commented Sep 20, 2022

Validated whether the workaround mentioned here works.

Validation steps on RKE v1.3.14:

  1. Provisioned a standalone RKE1 cluster with one controlplane, one etcd, and one worker node, without the empty backup_config in cluster.yml.
  2. Once the cluster was active, tried to take an etcd snapshot with the following command:
    rke etcd snapshot-save --name snapshot-${name}

As expected, the snapshot was not created and the log shows a SIGSEGV error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1984075]

  3. Added the empty backup_config to the cluster.yml file.
  4. Ran rke up and waited for the cluster to come to an active state.
  5. Ran the command to take a snapshot of the cluster. However, still saw the same error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1984075]

The workaround does not work to create a snapshot.

@vivek-shilimkar
Member

While validating snapshot creation with RKE v1.3.15-rc2, the snapshot creation was successful and the panic error was not encountered, even with a blank backup_config.

@kinarashah
Member

Closing this issue, fixed in v1.3.15. Please reopen if you run into issues.
