
Unable to restore from snapshot - ETCD permission denied #6466

Closed
riuvshyn opened this issue Aug 2, 2024 · 12 comments

Comments


riuvshyn commented Aug 2, 2024

Environmental Info:
RKE2 Version:
1.27.15-rke2-r1

Node(s) CPU architecture, OS, and Version:

Linux ip-172-23-99-35 5.15.0-1066-aws #72~20.04.1-Ubuntu SMP Thu Jul 18 10:41:27 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

System umask is set to 0027

Cluster Configuration:
3 servers 6 workers

Describe the bug:
https://docs.rke2.io/backup_restore#restoring-a-snapshot-to-existing-nodes fails with profile: cis

I am trying to follow the restore-to-existing-nodes procedure and can't get it working by following the steps in the doc.

Steps To Reproduce:

  • create a snapshot: rke2 etcd-snapshot save --name test-1
  • stop rke2-server on all server nodes: service rke2-server stop
  • execute the restore command: rke2 server --cluster-reset --cluster-reset-restore-path=test-1-ip-172-23-99-35.eu-central-1.compute.internal-1722630988.zip
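
For reference, a minimal sketch of that sequence as a single script, using only the snapshot name and commands from this report (the cluster-reset step is run on one server node only):

# Take the snapshot while the cluster is still running (one server node).
rke2 etcd-snapshot save --name test-1

# Stop the server service on all server nodes.
service rke2-server stop

# On the node that holds the snapshot, restore from it.
rke2 server --cluster-reset \
  --cluster-reset-restore-path=test-1-ip-172-23-99-35.eu-central-1.compute.internal-1722630988.zip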

Expected behavior:

  • restore command successfully restores state from snapshot

Actual behavior:

  • the restore command gets stuck, failing to connect to the etcd container.
  • logs in the etcd container:
{"level":"info","ts":"2024-08-01T12:57:11.52932Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--config-file=/var/lib/rancher/rke2/server/db/etcd/config"]}
{"level":"warn","ts":"2024-08-01T12:57:11.529372Z","caller":"etcdmain/etcd.go:75","msg":"failed to verify flags","error":"open /var/lib/rancher/rke2/server/db/etcd/config: permission denied"}

Additional context / logs:
I have tried setting umask 0022 as mentioned in issue #4313.
That actually lets the cluster reset command complete, but when I start the rke2-server service back up, etcd fails again with:

{"level":"panic","ts":"2024-08-02T20:40:30.329167Z","caller":"backend/backend.go:189","msg":"failed to open database","path":"/opt/kubernetes/etcd/data/member/snap/db","error":"open /opt/kubernetes/etcd/data/member/snap/db: permission denied","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.newBackend\n\t/go/src/go.etcd.io/etcd/server/mvcc/backend/backend.go:189\ngo.etcd.io/etcd/server/v3/mvcc/backend.New\n\t/go/src/go.etcd.io/etcd/server/mvcc/backend/backend.go:163\ngo.etcd.io/etcd/server/v3/etcdserver.newBackend\n\t/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:55\ngo.etcd.io/etcd/server/v3/etcdserver.openBackend.func1\n\t/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:76"}
panic: failed to open database

goroutine 178 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000000180, {0xc000122f80, 0x2, 0x2})
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x49b
go.uber.org/zap.(*Logger).Panic(0xc000157fb0?, {0x1311c6e?, 0x451ab2?}, {0xc000122f80, 0x2, 0x2})
	/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x59
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc000157fb0, 0x2d}, 0x5f5e100, 0x2710, {0x12ff7fe, 0x7}, 0x1a6666666, 0xc00012e190, 0x0, 0x0, ...})
	/go/src/go.etcd.io/etcd/server/mvcc/backend/backend.go:189 +0x429
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
	/go/src/go.etcd.io/etcd/server/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend({{0xc00004b680, 0x36}, {0x0, 0x0}, {0x0, 0x0}, {0xc000490a20, 0x1, 0x1}, {0xc00016dc20, ...}, ...}, ...)
	/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:55 +0x399
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1()
	/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:76 +0x78
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
	/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:75 +0x18a

Note: /opt/kubernetes/etcd/ is my custom etcd data-path

After the cluster reset command with umask 0022, ownership of /opt/kubernetes/etcd/data/member/snap/db is changed to root, which causes the error above.

root@ip-172-23-99-35:~# ls -la /opt/kubernetes/etcd/data/member/snap/
total 115168
drwx------ 2 etcd etcd      4096 Aug  2 20:56 .
drwx------ 3 etcd etcd      4096 Aug  2 20:55 ..
-rw-r--r-- 1 etcd etcd     10082 Aug  2 20:23 0000000000000002-0000000000002711.snap
-rw-r--r-- 1 etcd etcd     10084 Aug  2 20:25 0000000000000002-0000000000004e22.snap
-rw-r--r-- 1 etcd etcd     10084 Aug  2 20:27 0000000000000002-0000000000007533.snap
-rw-r--r-- 1 etcd etcd     10084 Aug  2 20:30 0000000000000002-0000000000009c44.snap
-rw-r--r-- 1 etcd etcd     10084 Aug  2 20:35 0000000000000002-000000000000c355.snap
-rw------- 1 root root 134643712 Aug  2 20:55 db

I can manually fix the ownership of /opt/kubernetes/etcd/data/member/snap/db back to the etcd user, and with the next etcd container restart it will work, but if I restart rke2-server the issue happens again...

I have also tried setting UMask=0022 on the rke2-server systemd unit, but it didn't change anything.


brandond commented Aug 2, 2024

I believe you'll need to set the umask to 0020 in both your shell, and in the rke2 systemd unit. Remember that you need to do a systemctl daemon-reload in order to effect changes to the systemd unit.
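
For illustration, a sketch of applying that in both places, assuming a systemd drop-in override is used (the drop-in path is a common convention, not something confirmed by this thread):

# In the shell that will run the cluster-reset command:
umask 0020

# For the service, add a drop-in override instead of editing the unit file directly:
mkdir -p /etc/systemd/system/rke2-server.service.d
cat <<'EOF' > /etc/systemd/system/rke2-server.service.d/umask.conf
[Service]
UMask=0020
EOF

# Reload systemd so the override takes effect, then restart the service:
systemctl daemon-reload
systemctl restart rke2-server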

Closing this as a duplicate of #4313

brandond closed this as completed Aug 2, 2024

riuvshyn commented Aug 2, 2024

@brandond I have just tried using umask 0020 and am still having issues with etcd permissions.
The cluster reset rke2 server --cluster-reset --cluster-reset-restore-path=on-demand-ip-172-23-99-45.eu-central-1.compute.internal-1722636597.zip was able to complete:

...
INFO[0049] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

Then I updated /usr/local/lib/systemd/system/rke2-server.service by adding:

...
[Service]
UMask=0020
...

and started the rke2-server service, and etcd still fails in the same way:

{"level":"panic","ts":"2024-08-02T22:14:18.11581Z","caller":"backend/backend.go:189","msg":"failed to open database","path":"/opt/kubernetes/etcd/data/member/snap/db","error":"open /opt/kubernetes/etcd/data/member/snap/db: permission denied","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.newBackend\n\t/go/src/go.etcd.io/etcd/server/mvcc/backend/backend.go:189\ngo.etcd.io/etcd/server/v3/mvcc/backend.New\n\t/go/src/go.etcd.io/etcd/server/mvcc/backend/backend.go:163\ngo.etcd.io/etcd/server/v3/etcdserver.newBackend\n\t/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:55\ngo.etcd.io/etcd/server/v3/etcdserver.openBackend.func1\n\t/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:76"}
panic: failed to open database

goroutine 166 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000000180, {0xc000123000, 0x2, 0x2})
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x49b
go.uber.org/zap.(*Logger).Panic(0xc000145ce0?, {0x1311c6e?, 0x451ab2?}, {0xc000123000, 0x2, 0x2})
	/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x59
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc000145ce0, 0x2d}, 0x5f5e100, 0x2710, {0x12ff7fe, 0x7}, 0x1a6666666, 0xc00038a280, 0x0, 0x0, ...})
	/go/src/go.etcd.io/etcd/server/mvcc/backend/backend.go:189 +0x429
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
	/go/src/go.etcd.io/etcd/server/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend({{0xc00004bc40, 0x36}, {0x0, 0x0}, {0x0, 0x0}, {0xc000484990, 0x1, 0x1}, {0xc000147b90, ...}, ...}, ...)
	/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:55 +0x399
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1()
	/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:76 +0x78
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
	/go/src/go.etcd.io/etcd/server/etcdserver/backend.go:75 +0x18a

This happened because /opt/kubernetes/etcd/data/member/snap/db becomes owned by root:

root@ip-172-23-99-45:~# ls -la /opt/kubernetes/etcd/data/member/snap/
total 87512
drwx------ 2 etcd etcd      4096 Aug  2 22:17 .
drwx------ 3 etcd etcd      4096 Aug  2 22:14 ..
-rw-r--r-- 1 etcd etcd     10085 Aug  2 21:52 0000000000000002-0000000000004e22.snap
-rw-r--r-- 1 etcd etcd     10085 Aug  2 21:53 0000000000000002-0000000000007533.snap
-rw-r--r-- 1 etcd etcd     10085 Aug  2 21:56 0000000000000002-0000000000009c44.snap
-rw-r--r-- 1 etcd etcd     10085 Aug  2 22:01 0000000000000002-000000000000c355.snap
-rw-r--r-- 1 etcd etcd     10085 Aug  2 22:07 0000000000000002-000000000000ea66.snap
-rw------- 1 root root 102924288 Aug  2 22:14 db

Originally, before doing the cluster reset, it was owned by the etcd user. So I am a bit lost and not sure how to successfully restore the snapshot.


brandond commented Aug 2, 2024

And just chown -R etcd:etcd /opt/kubernetes/etcd doesn't fix the issue?


riuvshyn commented Aug 2, 2024

It does, but when I restart rke2-server, /opt/kubernetes/etcd/data/member/snap/db becomes owned by root:root again...

So the only way to get it working at all is to fix the ownership manually and then wait for the currently running rke2-server to restart the failing etcd container.


brandond commented Aug 2, 2024

I am very confused how you ended up with /opt/kubernetes/etcd. If you'd just set the RKE2 data-dir to /opt/kubernetes, the etcd folder would be under /opt/kubernetes/server/db/etcd - so I suspect you've got some symlinks or bind mounts in place here? Or are you overriding our datastore paths via etcd-arg? Something is definitely screwy with your environment beyond just the umask.


brandond commented Aug 2, 2024

I really suspect that you've set etcd-arg: data-dir=/opt/kubernetes/etcd. Don't do that. If you change the etcd database file paths we can't ensure that they are set up correctly for access within the pod.


riuvshyn commented Aug 2, 2024

I had performance-related issues, so I had to move the etcd data to an SSD.
/opt/kubernetes/etcd is only used for etcd data:

etcd-arg:
  - data-dir=/opt/kubernetes/etcd/data
  - wal-dir=/opt/kubernetes/etcd/wal
etcd-extra-mount:
  - "/opt/kubernetes-wise/etcd:/opt/kubernetes/etcd:rw"

rke2 data-dir is still /var/lib/rancher/


brandond commented Aug 2, 2024

Yeah, don't do that. We don't expect the etcd files to be anywhere except DATADIR/server/db/etcd. Move the whole thing with the rke2 data-dir option, or mount your SSD under /var/lib/rancher/rke2/server/db
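
Sketches of the two suggested alternatives, assuming the SSD is a dedicated block device (the device name /dev/nvme1n1 and the config.yaml location are illustrative, not from this thread):

# Option 1: move the whole RKE2 data dir via /etc/rancher/rke2/config.yaml
#   data-dir: /opt/kubernetes
# etcd then ends up under /opt/kubernetes/server/db/etcd automatically.

# Option 2: keep the default data dir and mount the SSD underneath it
mkdir -p /var/lib/rancher/rke2/server/db
mount /dev/nvme1n1 /var/lib/rancher/rke2/server/db   # plus a matching /etc/fstab entry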


riuvshyn commented Aug 2, 2024

Oh, ok... my motivation was to reduce unnecessary disk activity on the etcd data store.


brandond commented Aug 2, 2024

Yeah that's totally valid, we just don't handle users overriding the etcd paths using etcd-arg. We expect it to be where we want to put it.


riuvshyn commented Aug 2, 2024

Yeah, don't do that. We don't expect the etcd files to be anywhere except DATADIR/server/db/etcd. Move the whole thing with the rke2 data-dir option, or mount your SSD under /var/lib/rancher/rke2/server/db

Cool, I'll try that. Thank you!


riuvshyn commented Aug 5, 2024

@brandond I have tried moving the etcd data path back to /var/lib/rancher/rke2/server/db and that seems to be working as expected now. I still have to do umask 0022 before executing rke2 server --cluster-reset ..., but once the cluster restore / reset is done I revert it to umask 0027 and rke2-server is able to start up normally without any issues, so there is no need to modify the systemd unit.
Thanks for clarifying the etcd data path; I did not see anything in the docs saying that a custom etcd data path can not be used, so that might be worth adding.
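
For anyone hitting the same thing, a condensed sketch of the sequence that worked here (the snapshot name is a placeholder; commands and umask values are from this thread):

umask 0022                                    # relax the CIS umask for the reset only
service rke2-server stop                      # on all server nodes
rke2 server --cluster-reset \
  --cluster-reset-restore-path=<snapshot>.zip # placeholder snapshot name
umask 0027                                    # revert to the CIS umask afterwards
service rke2-server start
# Per the reset log above, back up and delete ${datadir}/server/db on the other
# server nodes before rejoining them.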
