Unmanaged datasets destroyed at boot time #196

azazar opened this issue Mar 14, 2021 · 15 comments

azazar commented Mar 14, 2021

Describe the bug

When custom datasets are present in the system, they are not recognised by zsys and are scheduled for destruction. This is probably a new incarnation of bug #103, reported earlier.

Before the filesystem was actually deleted, there were two failed attempts:

...
Mar 12 14:19:45 hp zed: eid=132 class=history_event pool_guid=0x974173CCDE607995
Mar 12 14:19:48 hp zed: eid=133 class=history_event pool_guid=0x974173CCDE607995
Mar 12 14:19:50 hp zed: eid=134 class=history_event pool_guid=0x974173CCDE607995
Mar 12 14:20:18 hp systemd[1]: zsysd.service: Succeeded.
Mar 12 14:21:01 hp systemd[1]: Starting Clean up old snapshots to free space...
Mar 12 14:21:01 hp systemd[1]: Starting ZSYS daemon service...
Mar 12 14:21:02 hp systemd[1]: Started ZSYS daemon service.
Mar 12 14:21:03 hp zsysd[14234]: level=warning msg="[[f0d315a4:620bdb6a]] Couldn't destroy user dataset rpool/USERDATA/m_enc (due to rpool/USERDATA/m_enc): couldn't destroy \"rpool/USERDATA/m_enc\" and its children: cannot destroy dataset \"rpool/USERDATA/m_enc\": dataset is busy"
Mar 12 14:21:03 hp zsysctl[14228]: #033[33mWARNING#033[0m Couldn't destroy user dataset rpool/USERDATA/m_enc (due to rpool/USERDATA/m_enc): couldn't destroy "rpool/USERDATA/m_enc" and its children: cannot destroy dataset "rpool/USERDATA/m_enc": dataset is busy
Mar 12 14:21:05 hp systemd[1]: zsys-gc.service: Succeeded.
Mar 12 14:21:05 hp systemd[1]: Finished Clean up old snapshots to free space.
Mar 12 14:22:05 hp systemd[1]: zsysd.service: Succeeded.
...
Mar 12 21:28:30 hp systemd[1]: Starting Clean up old snapshots to free space...
Mar 12 21:28:31 hp zed: eid=3635 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:35 hp systemd-resolved[1254]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Mar 12 21:28:35 hp zed: eid=3636 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:38 hp zed: eid=3637 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:38 hp zsysd[476502]: level=warning msg="[[489b0020:c2f4566c]] Couldn't destroy user dataset rpool/USERDATA/m_enc (due to rpool/USERDATA/m_enc): couldn't destroy \"rpool/USERDATA/m_enc\" and its children: cannot destroy dataset \"rpool/USERDATA/m_enc\": dataset is busy"
Mar 12 21:28:38 hp zsysctl[485834]: #033[33mWARNING#033[0m Couldn't destroy user dataset rpool/USERDATA/m_enc (due to rpool/USERDATA/m_enc): couldn't destroy "rpool/USERDATA/m_enc" and its children: cannot destroy dataset "rpool/USERDATA/m_enc": dataset is busy
Mar 12 21:28:41 hp zed: eid=3638 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:43 hp zed: eid=3639 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:45 hp systemd[1]: zsys-gc.service: Succeeded.
Mar 12 21:28:45 hp systemd[1]: Finished Clean up old snapshots to free space.
Mar 12 21:28:46 hp zed: eid=3640 class=history_event pool_guid=0x974173CCDE607995

From zpool history -il rpool output:

2021-03-13.10:38:39 [txg:73086] destroy rpool/USERDATA/root_3vm1uh@autozsys_rb901c (481)  [on hp]
2021-03-13.10:38:44 ioctl destroy_snaps
    input:
        snaps:
            rpool/USERDATA/root_3vm1uh@autozsys_rb901c
 [user 0 (root) on hp:linux]
2021-03-13.10:39:00 [txg:73115] destroy rpool/USERDATA/m_enc (1331)  [on hp]
2021-03-13.10:50:39 [txg:73753] open pool version 5000; software version unknown; uts hp 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 [on hp]
2021-03-13.10:50:39 [txg:73755] import pool version 5000; software version unknown; uts hp 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 [on hp]
2021-03-13.10:50:39 zpool import -N rpool [user 0 (root) on hp:linux]
2021-03-13.10:51:31 [txg:73799] set rpool/ROOT/ubuntu_7k8at6 (90) com.ubuntu.zsys:last-used=1615621890 [on hp]
2021-03-13.10:51:31 [txg:73800] set rpool/USERDATA/o_envgoi (5181) com.ubuntu.zsys:last-used=1615621890 [on hp]
2021-03-13.10:51:31 [txg:73802] set rpool/USERDATA/root_3vm1uh (288) com.ubuntu.zsys:last-used=1615621890 [on hp]
2021-03-13.10:51:31 [txg:73804] set rpool/ROOT/ubuntu_7k8at6 (90) com.ubuntu.zsys:last-booted-kernel=vmlinuz-5.8.0-44-generic [on hp]

The filesystem then got deleted silently, without any notice:

Mar 13 10:38:10 hp zed: eid=77 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:13 hp zed: eid=78 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:16 hp zed: eid=79 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:18 hp zed: eid=80 class=history_event pool_guid=0xC787AB1273593DF8
Mar 13 10:38:22 hp zed: eid=81 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:25 hp zed: eid=82 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:29 hp zed: eid=83 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:33 hp zed: eid=84 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:36 hp zed: eid=85 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:39 hp zed: eid=86 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:42 hp zed: eid=87 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:45 hp zed: eid=88 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:47 hp dbus-daemon[2744]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.109' (uid=1001 pid=22803 comm="exo-desktop-item-edit -t Link -c --xid=0x18d /home" label="unconfined")
Mar 13 10:38:48 hp systemd[1]: Starting Hostname Service...
Mar 13 10:38:48 hp dbus-daemon[2744]: [system] Successfully activated service 'org.freedesktop.hostname1'
Mar 13 10:38:48 hp systemd[1]: Started Hostname Service.
Mar 13 10:38:49 hp zed: eid=89 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:52 hp zed: eid=90 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:55 hp zed: eid=91 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:58 hp zed: eid=92 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:01 hp zed: eid=93 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:01 hp CRON[23349]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Mar 13 10:39:05 hp zed: eid=94 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:07 hp zed: eid=95 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:10 hp zed: eid=96 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:12 hp systemd[1]: zsys-gc.service: Succeeded.
Mar 13 10:39:12 hp systemd[1]: Finished Clean up old snapshots to free space.

To Reproduce

  1. Create an encrypted home filesystem as described here: https://talldanestale.dk/2020/04/06/zfs-and-homedir-encryption/
  2. Populate it with data to trigger GC
  3. Create zsys.conf with a nonzero general.minfreepoolspace
  4. Reboot
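
For reference, step 3 amounts to a zsys.conf fragment along these lines. This is a sketch, not the exact file from the affected machine: the path /etc/zsys.conf and the value 20 are assumptions, so check them against your own setup and the attached zsys.conf.gz.

```yaml
# /etc/zsys.conf (fragment; the value 20 is only an example)
general:
  # Nonzero minfreepoolspace makes the GC try to free pool space,
  # which is what triggers the destruction described above.
  minfreepoolspace: 20
```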

Installed versions:

  • OS: Ubuntu 20.04.2 LTS
  • Zsysd running version: zsysctl 0.4.8 zsysd 0.4.8

azazar commented Mar 14, 2021

Not sure if it's needed, but here it is: zsys.conf.gz


azazar commented Mar 14, 2021

zsys shouldn't implicitly manage filesystems and snapshots that it didn't create.


jvcdk commented Mar 14, 2021

I agree with @azazar and would go further and say it shouldn't destroy filesystems at all (only auto-created snapshots).


lathiat commented Mar 15, 2021

I think I hit this today too: all 3 home directories, including all their snapshots, are gone on my system (Ubuntu Hirsute). The logs look similar to the reporter's. It happened on Friday for me; I will try to gather more evidence.

@didrocks

First, sorry for your destroyed datasets. We initially treated USERDATA (which was never used in any ZFS systems we monitor) as a reserved ZSys namespace and took ownership of the datasets in it.
Note that any dataset you create there without the appropriate zsys metadata won't be handled, and you lose benefits such as being able to revert user datasets automatically. I think USERDATA generally shouldn't be used manually.

We need to delete datasets there because, after a revert, a dataset without any zsys tag (left over from an unsuccessful revert, or expired due to garbage collection) would otherwise remain forever; such filesystem datasets act as a kind of snapshot, and we need to delete them so they don't clutter the system.

However, we added some mitigation, as you noted on bug #103, and there has been no change since, which is why I am a little puzzled about why this only triggers now (could it be related to your particular encryption setup?). Thanks for the reproducer; I'll start my investigation from there and keep you posted.

@didrocks

We unfortunately couldn’t reproduce it with the steps described. I really wonder what differed in your case.
(Note: I saw you mentioned that it will only GC once the disk is 80% full; this is not the case, GC is time-based.)

For your information, here are the datasets that are up for deletion in /USERDATA (once it reached the GC limit):

  • any snapshots that were created by zsys (@autozsys_xxxx) -> manual snapshots are kept
  • any untagged clone of a filesystem dataset that is associated with a dataset managed by zsys (com.ubuntu.zsys:bootfs-datasets). Untagged means that no system dataset is associated with this user filesystem dataset; we limit this to clones to ensure it was related to a user dataset managed by zsys (the only ways to end up in this state are a failed revert or a user dataset tagged for deletion). Any other filesystem datasets are untouched.
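
The two rules above can be sketched as a small decision helper. This is hypothetical illustration, not zsys's actual code; in practice the tag and the clone origin would come from `zfs get -H -o value com.ubuntu.zsys:bootfs-datasets,origin <dataset>`:

```shell
# Hypothetical sketch of the GC rules above -- NOT zsys's real implementation.
#   name:   dataset or snapshot name
#   tag:    value of com.ubuntu.zsys:bootfs-datasets ("-" when unset)
#   origin: value of the origin property ("-" when the dataset is not a clone)
gc_decision() {
  name=$1; tag=$2; origin=$3
  case "$name" in
    *@autozsys_*) echo deleted; return ;;   # snapshot created by zsys
    *@*)          echo kept;    return ;;   # manual snapshot
  esac
  if [ "$tag" != "-" ]; then
    echo kept                               # tagged filesystem dataset
  elif [ "$origin" != "-" ]; then
    echo deleted                            # untagged clone: failed revert or expired
  else
    echo kept                               # plain manual dataset, untouched
  fi
}
```

Under these rules, `gc_decision rpool/USERDATA/m_enc - -` (an untagged, non-clone dataset) prints kept, while the same dataset with a non-"-" origin prints deleted, which is why the clone theory below matters.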

In addition to the reproducer, we tried the following in the USERDATA namespace (each time advancing the date and forcing GC to keep 0 datasets):

  • creating a filesystem dataset associated with zsys -> kept
  • creating a filesystem dataset not associated with zsys (manual zfs create) with a <user>_ or <foo> prefix -> kept
  • creating a zsys snapshot @autozsys_ on a filesystem dataset tagged via zsys -> deleted
  • creating a manual snapshot @manual (manual zfs snapshot) on a filesystem dataset tagged via zsys -> kept
  • creating a clone (of a manual or automated snapshot) of a filesystem dataset tagged with zsys, itself associated with a zsys system dataset (active dataset clone, which can be reverted) -> kept
  • creating a clone (of a manual or automated snapshot) of a filesystem dataset tagged with zsys, not associated with a zsys system dataset (expired dataset clone or failed revert) -> deleted

All those cases pass on encrypted and unencrypted datasets as we expect (we have found an issue on Hirsute due to ZFS packaging, not related to ZSys itself, which makes some datasets not mount at boot; we are fixing it). Any idea what's different in your configuration (if you can come up with a full reproducer, that would be awesome)? The only explanation I can see is that rpool/USERDATA/m_enc was a clone of the unencrypted dataset, never tagged with ZSys, which isn't what the how-to does (it creates its own dataset and tags it with ZSys).


azazar commented Mar 15, 2021

If by tagging you mean setting the com.ubuntu.zsys:bootfs-datasets filesystem property, then maybe that was the cause of the problem? When I followed the guide on home filesystem encryption, I set it to -.

@didrocks

Yeah, tagging is about adding that property, but the manual doesn’t tell you to set it to -; it tells you to set it to the system dataset it’s associated with:

VAL=$(zfs get com.ubuntu.zsys:bootfs-datasets rpool/USERDATA/jvc_tdssc -H -ovalue)
sudo zfs set com.ubuntu.zsys:bootfs-datasets=$VAL rpool/USERDATA/jvc_enc

(of course, you need to change the dataset names there)

@jdavidberger

I think this bug hit me today. Normally I wouldn't bother reporting behaviour with information this sparse, but the bug deleted my home directory, and after a few hours attempting recovery I'm pretty sure it's gone.

This was on ubuntu 21.04, zfs 2.02, Linux 5.11.0-7620 with an encrypted home directory.

Admittedly vague notes:

  • earlier in the day the system seemed unstable: nothing could delete files in my home directory; it always returned an input/output error
  • I tried reverting my system to a previous state. It didn't help.
  • I tried a couple of different kernels. It didn't help.
  • I booted into recovery mode via GRUB to try to fix it. Eventually I unmounted my home directory to see if zfs send would work. It didn't (EIO again). Then I rebooted. I think my critical error here may have been not mounting the user directory again before rebooting.

From that point forward the home directory dataset was gone. zpool history did not show the deletion, but it was there with the -i flag. I have a log file with that in it that I'll post when I have a new system up and running.

I think there is a chance you can reproduce this bug by logging in to a fresh install via recovery mode, unmounting the home filesystem, and rebooting, but I can't be sure other interactions don't play a part.

I can't help but think there should be a tag/flag you can affix to certain datasets that marks them ineligible for destruction except via a very manual CLI command like "zfs destroy -f NAME".
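
Such an opt-out could be built on a custom ZFS user property. The property name local:nodestroy below is invented for illustration; zsys has no such check today, and a real tool would fetch the value with `zfs get -H -o value local:nodestroy <dataset>`. The decision itself is pure string logic:

```shell
# Hypothetical sketch of the proposed protection flag (invented property
# name "local:nodestroy"; not implemented by zsys or ZFS tooling today).
may_destroy() {
  flag=$1            # property value: "on" to protect, "-" when unset
  if [ "$flag" = "on" ]; then
    echo refused     # an automated cleaner must never destroy this dataset
  else
    echo allowed
  fi
}
```

A dataset would then be protected with something like `sudo zfs set local:nodestroy=on rpool/USERDATA/m_enc`, leaving only an explicit manual destroy available. (ZFS holds, via zfs hold, provide something similar today, but only for snapshots, not filesystem datasets.)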

@almereyda

While you propose that involuntary destruction of datasets should be opt-out, I would rather vote for making voluntary destruction of datasets opt-in. This means we would only allow destroying datasets that have a certain property set, and not the other way round.

But as of #213, I fear we can rather consider ZSys abandoned, even as ZFS support leaves experimental status on Ubuntu these days. I am unable to figure out how those two things can go together, but the world is contradictory at times.

@darkbasic

While I understand zsys is no longer maintained, this issue is scaring the crap out of a lot of people and puts the project in a bad light. I myself have moved to zrepl, despite always tagging every dataset. If Ubuntu is not funding this, maybe you could try a crowdfunding platform instead (GitHub Sponsors, Patreon, GoFundMe, whatever)? Development would be slow (I don't see too many people interested, albeit there are surely some), but at least it wouldn't be completely unmaintained, with huge bugs eating your data.

@runejuhl

I have to chime in here, even if it's a bit +1-ish.

Though there hasn't been a formal announcement from Canonical about the status of zsys it seems that the project is dead. That's life for software -- sometimes it lives on, sometimes it dies, and sometimes it gets resurrected by someone with an itch and continues in a new incarnation.

What's problematic here is that this is (was?) a supported installation method, and because of this bug there's a very real risk of data loss -- just ask @jdavidberger. Because zsys is included in official releases it'll continue to affect users until this is properly fixed. I just had a look at the installer for Ubuntu 22.04, and there's no mention of an installation with ZFS being any less supported than a regular installation:

[screenshots of the Ubuntu 22.04 installer showing the ZFS installation option]

Even if Canonical sees no future in zsys, having such a bug around in Ubuntu reflects extremely poorly on Canonical and Ubuntu, and I hope that you can be convinced to fix this issue before putting zsys in the grave for good.

@almereyda

AFAIK Ubuntu 22.04 will install ZFS and set up the datasets through their Ubiquity installer, but it won't install zsys anymore, which is probably good. 🤣

Please see this line, where the installation of zsys is commented out (permalink to latest current LTS ref):

@darkbasic

but it won't install zsys anymore, which is probably good. rofl

I wouldn't rejoice over zsys being dropped. It's buggy, of course, but it's a really nice and convenient piece of software.
In its current state I wouldn't risk using it without replication to another machine, but I'm currently evaluating letting zsys handle snapshotting of BOOT, ROOT, and USERDATA while letting zrepl handle both replication and the snapshotting of everything else. Basically, zrepl creates bookmarks on top of zsys snapshots and replicates them to another machine, and it also manages snapshotting itself for all the other datasets. That way I would still be able to conveniently revert from GRUB using zsys. The only problem is that zsys is more broken than I suspected, and I cannot even revert without breaking the system: #236
I would love to write a detailed guide on how to use zsys in conjunction with zrepl replication, but I'm already in the process of upgrading several servers, so either I find a solution to the broken reverts in the short term or I abandon zsys forever :(
If you have any idea why reverting breaks the system, I'm all ears.
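
For what it's worth, the split described above might look roughly like this in zrepl's YAML config. This is an illustrative sketch only: the job name, the address, and the dataset patterns are placeholders, and the exact keys and policy types should be verified against the zrepl documentation rather than taken from here.

```yaml
# /etc/zrepl/zrepl.yml (sketch; verify keys against the zrepl docs)
jobs:
  - name: zsys_push          # replicate only the zsys-managed datasets
    type: push
    connect:
      type: tcp
      address: "backup.example.org:8888"
    filesystems:
      "rpool/ROOT<": true
      "rpool/USERDATA<": true
    snapshotting:
      type: manual           # zsys creates the autozsys_* snapshots
    pruning:
      keep_sender:
        - type: regex
          regex: ".*"        # leave sender-side pruning to zsys's GC
      keep_receiver:
        - type: last_n
          count: 60
```

A second job with periodic snapshotting would then cover the datasets zsys doesn't manage.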


a0c commented Apr 10, 2023

This might be helpful in case of issues with backups.
