
Feature request: Support BTRFS and XFS Reflink source volumes #75

Closed
Tracked by #117
tlaurion opened this issue Apr 11, 2021 · 20 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Milestone: v0.4

Comments

@tlaurion (Contributor) commented Apr 11, 2021

Thoughts?
QubesOS/qubes-issues#6476

https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

@tasket (Owner) commented Apr 12, 2021

The two interesting parts of this are the suggestion that Thin LVM is less reliable than Btrfs (which might be accurate), and the point about providing authentication (which might not be).

I could make a point about perceived efficiency and speed for Thin LVM vs Btrfs, the main one being that no one ever seems to actually compare them with benchmarks, not even @michaellarabel. My experience says that Btrfs would lag behind Thin LVM in overall use, but that is just my impression. I also saw a tendency for Btrfs to "blow up", where metadata use would suddenly skyrocket when reflinking large image files in combination with snapshotting the parent (sub)volume; that was with the late 3.x kernels, so YMMV.

It's worth noting, WRT the future of Linux storage, that Red Hat appears to actively dislike both Thin LVM and Btrfs, and is reported to be building a flexible successor storage system called Stratis.

Since interest in backups on Qubes (at least incremental backups) is not high, a change to using Btrfs as the Qubes default would not impact Wyng greatly. But also, adding Btrfs support to Wyng should not be a huge undertaking if people want it.

@tasket (Owner) commented Apr 12, 2021

A quick note about Stratis...

It appears to be a configuration management system for "storage pools", where a pool is an XFS filesystem spanning one or more block devices. XFS is used in reflink mode to manage disk image files and "snapshots" containing online shrink-capable filesystems. Red Hat claims to be doing this because the Btrfs code tree was supposedly not maintainable for enterprise environments. The only tangible benefit I'd expect is a performance advantage over Btrfs (it would be interesting to compare XFS and Btrfs for hosting large reflinked disk image files).

@DemiMarie

@tasket @tlaurion Would you be willing to comment on QubesOS/qubes-issues#6476? That is a mere proposal, not a final decision, and commentary (including by those who are not QubesOS users!) would be greatly appreciated. I am no expert whatsoever on the Linux storage stack.

@tasket changed the title from "QubesOS to switch to BRTFS-Reflink" to "Feature request: Support BTRFS-Reflink source volumes" May 20, 2021
@tasket (Owner) commented May 20, 2021

I am still going to wait for detailed benchmark comparisons before supporting this. As it stands now, the general wisdom and experience is that Btrfs can be slow, and large disk image files with snapshots are exactly its worst performance case.

Even ZFS created a special mode (ZVOLs) to handle disk images efficiently.

I would wager that the best way to wring performance from Btrfs with disk image snapshots is to flag them nodatacow and add them to separate subvolumes, instead of using reflinks. If that's the case, it would mean a) Qubes getting a refactored Btrfs driver, b) quite different coding details when adding Btrfs to Wyng.

@DemiMarie

> I would wager that the best way to wring performance from Btrfs with disk image snapshots is to flag them nodatacow and add them to separate subvolumes, instead of using reflinks. If that's the case, it would mean a) Qubes getting a refactored Btrfs driver, b) quite different coding details when adding Btrfs to Wyng.

Snapshots automatically turn CoW back on, so nodatacow will not help.

@tasket (Owner) commented May 20, 2021

IIRC nodatacow can be set for individual disk image files that are sitting in a subvolume. So the files only experience a data CoW-like event after a subvol snapshot, not on a second-by-second basis whenever any data is written.
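
A minimal sketch of that per-file approach (paths hypothetical; chattr +C is the standard way to set nodatacow on a single file, and it only takes effect while the file is still empty):

truncate -s 0 /pool/appvms/untrusted/private.img     # create (or re-create) the file empty
chattr +C /pool/appvms/untrusted/private.img         # flag NOCOW before any data is written
lsattr /pool/appvms/untrusted/private.img            # the 'C' attribute should be listed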

@DemiMarie

> IIRC nodatacow can be set for individual disk image files that are sitting in a subvolume. So the files only experience a data CoW-like event after a subvol snapshot, not on a second-by-second basis whenever any data is written.

In Qubes OS, all persistent volumes have at least one snapshot, by default. So the only difference would be second and further writes to the same extent after qube startup.

@DemiMarie

> A quick note about Stratis... It appears to be a configuration management system for "storage pools", where a pool is an XFS filesystem spanning one or more block devices. […]

Stratis uses device-mapper thin volumes (without LVM) to store its XFS filesystems.

@tasket (Owner) commented May 21, 2021

> In Qubes OS, all persistent volumes have at least one snapshot, by default. So the only difference would be second and further writes to the same extent after qube startup.

Yes, so the difference in performance should be somewhere between the cases shown in these benchmarks. We still need benchmarks that are performed in a Qubes environment.


In relation to Wyng, Stratis mapping should be very similar since the current thin-pool method is to ask LVM what the dm-thin device ID is, then use the dm-thin tools on that device.
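
A rough sketch of that lookup (volume and pool names hypothetical; this is not Wyng's exact invocation):

lvs --noheadings -o thin_id vg0/my-volume               # ask LVM for the dm-thin device ID
sudo thin_dump --dev-id 5 /dev/mapper/vg0-pool_tmeta    # dump the mappings for just that device

thin_dump comes from thin-provisioning-tools; in practice the pool metadata must be inactive or read via a metadata snapshot, but the principle is the same: the device ID keys into the pool's extent mappings.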

@tasket added the enhancement and help wanted labels Dec 31, 2021
@tasket added this to the v0.4 milestone Dec 31, 2021
@tasket mentioned this issue Jan 29, 2023
@tasket changed the title from "Feature request: Support BTRFS-Reflink source volumes" to "Feature request: Support BTRFS and XFS Reflink source volumes" Feb 14, 2023
tasket added a commit that referenced this issue Feb 15, 2023
get_reflink_deltas() and update_delta_digest_reflink()
@tasket (Owner) commented Feb 16, 2023

Work has begun on Btrfs reflink volume support. The algorithms needed to obtain metadata and find differences between two snapshots have been added; however, the code that recognizes and snapshots reflink vols still needs to be written before this is usable.

A side-effect of the approach I took (using simple FIEMAP tables obtained via filefrag) is that other filesystems that report this data, such as XFS, will also be supported.
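
For illustration, the raw FIEMAP table for any file can be viewed with filefrag (path hypothetical):

sudo filefrag -v /path/to/disk.img

Each output row describes one extent: its logical offset within the file, its "physical" (filesystem-reported) start address, and its length, which is all that delta-scanning between two snapshots requires.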

tasket added a commit that referenced this issue Feb 16, 2023
Remove or mark unconverted lvm code

issue #75
@tasket (Owner) commented Feb 16, 2023

To continue a line of thought from code comments:

It's worth noting that file extent maps have 4KB blocks, which is an order of magnitude more detail than the most detailed thin LVM map with 64KB chunks. So 'do it in Python' is a big maybe here, as even Python libs tend to fall down on either speed or memory requirements. Using Linux commands to pre-process the maps gives me delta lists (to use in Python) that are much smaller than the input maps, and they're fast and work on data streams instead of in memory. Python's difflib does look interesting, though; I would love to see an alternate implementation using that or something similar, to see how it performs.
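
A rough sketch of that pre-processing idea (not Wyng's actual script; file names are hypothetical, and the awk field-splitting assumes filefrag's long-standing column layout):

filefrag -v snap1/disk.img | awk -F'[:. ]+' '/^ *[0-9]+:/ {print $3, $5, $7}' > map1.txt
filefrag -v snap2/disk.img | awk -F'[:. ]+' '/^ *[0-9]+:/ {print $3, $5, $7}' > map2.txt
# each line is "logical-start physical-start length"; differing lines mark
# extents that changed between the two snapshots
diff map1.txt map2.txt | grep '^[<>]' > delta.txt

The whole pipeline works on streams, so memory use stays flat no matter how large the extent maps are.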

Right now the Wyng alpha work in progress is balancing different qualities like low dependency count, CPU portability (as in: use cp and it's ported!), efficiency and overall speed. Some of the choices I'm making (for now, at least) to move forward and retain those qualities mean code that is less aesthetically pleasing or, in the case of sed, just plain harder to read. (I do respond to requests to add comments to segments of code.)

I'd also like to note that our systems are based on the same Linux commands that I'm invoking from Wyng, and I'm being pretty conservative in my choices. I would consider custom re-implementation of those commands' functions, or replacement with 3rd-party libs, to be as much or more of a security risk.

@tasket (Owner) commented Feb 17, 2023

Major problem:

The Linux FIEMAP ioctl output doesn't carry block device numbers, which are needed when a Btrfs volume spans more than one device. With a multi-device fs, the returned data looks OK but won't be correct. This does not affect XFS because that fs doesn't have multi-device maps.

Edit: On further inspection, Btrfs may be synthesizing its own singular address space to account for multiple devices. So we are seeing the numbers from Btrfs' internal raid. If this is true, then the resulting FIEMAP data may be good enough to reliably show where reflinked files have the same blocks.

Edit 2: The issue/solution is explained in a Linux bugzilla record.

tasket added a commit that referenced this issue Feb 20, 2023
@tasket (Owner) commented Feb 20, 2023

I've added close checking of the column layout to the sed script; any significant change should raise an error.

Also checked the filefrag source code. The basic format hasn't changed in well over a decade, and the last change, roughly 11 years ago, was minor (adding dots and colons after numbers).


The next hurdle will be getting Wyng to recognize & access regular files as logical volumes. At that point, this feature will be ready to test.

@tasket (Owner) commented Feb 22, 2023

OK, so over in filefrag land, a prominent Linux dev doesn't want me to use filefrag with Btrfs because:

> the FIEMAP ioctl wasn't intended for this use

Egads. FIEMAP describes the data composition of a file. But he is implying the ioctl strips something important from the FIEMAP data (it doesn't, because Btrfs virtual addresses encompass multiple devices).

Add to that meaningless hand-waving about Btrfs subvolumes (as if this were the debate about Btrfs inodes) and a total lack of concern about filefrag being used on other raid-like storage, and I get the impression Btrfs is not exactly TT's area. IOW, this looks like get-off-my-lawn BS. Unless a Btrfs dev says an extent address is not unique within a Btrfs filesystem, I consider the question settled.

@tasket (Owner) commented Feb 24, 2023

Update: Since I've been lured into combing Btrfs dev notes and source code to address spurious claims about the supposed deep, dark, messy pit that is Btrfs internals, I keep seeing details that are actually reassuring. Btrfs does indeed use logical extent addresses (claiming it doesn't is weird), they are a crucial part of the disk format itself, and – the really good part – they are one of the higher-level abstractions in the format.

What the Btrfs design is telling me so far is that they wanted to insulate extent address organization and mundane file I/O from the vicissitudes of low-level RAID maintenance. (Edit – addresses can change due to internal maintenance functions, but not without incrementing the fs or subvol generation id.) The chart at the bottom of this page gives a general overview.

I think a more abstract extent concept makes reading them from a source like FIEMAP even less worrisome than usual, if all you want are extent addresses and sizes. We should just accept that what comes out of the "physical" fields in that ioctl is virtual in most cases, regardless of the filesystem used.

TL;DR: all we care about is that two files pointing to the same extent are pointing to the same data; whether it's mdraid/LVM etc. or Btrfs providing the ultimate translation and access to physical data blocks is of no concern.

All this is making me eager to start testing Wyng on multi-device Btrfs setups. And if big issues do arise, there is still XFS as a way to do reflink snapshots.

@tasket (Owner) commented Mar 7, 2023

Local storage abstraction classes including ReflinkVolume have been added. Most required functions are now there, including the ability to make read-only Btrfs subvolume snapshots and monitor fs maintenance incursions via the snapshot's transaction generation property.
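
A minimal sketch of that snapshot-and-monitor pattern (paths hypothetical; these are the stock btrfs commands, not Wyng's exact calls):

sudo btrfs subvolume snapshot -r /mnt/pool/vols /mnt/pool/.wyng-snap1   # read-only snapshot
sudo btrfs subvolume show /mnt/pool/.wyng-snap1 | grep -i generation
# if the generation reported later differs from the recorded one, fs
# maintenance has touched the snapshot and its extent maps must be re-read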

This changes Wyng's model of local storage from collections of Lvm_VolGroups containing tables of Lvm_Volumes and pools to a single LocalStorage class pointed at the archive's local storage location. The resulting 'storage' object's lvols dict is populated with objects based on relevant volume and snapshot names (which may or may not exist).

The next steps will be:

  • Make the snapshot + generation handling transparent, so as not to affect non-Btrfs reflink systems
  • Accommodate subdirs in volume names, making Wyng volume names like tar paths
  • Add these functions to get_reflink_deltas()
  • Convert receive/verify/diff functions to allow data verification tests
  • Convert the monitor_send() chain of functions to use the abstract storage objects
  • Test wyng monitor and wyng send
  • Convert other Wyng commands to use abstract storage
  • Test the rest

Also to do:

  • Check whether local storage is Btrfs/subvolume or XFS (see the sketch after this list)
  • Option to convert the Btrfs dir (referenced by --local) to a subvol
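
A hypothetical sketch of that check (the path is an example only):

FSTYPE=$(stat -f -c %T /var/lib/qubes)    # prints e.g. 'btrfs' or 'xfs'
if [ "$FSTYPE" = "btrfs" ]; then
    # btrfs subvolume show exits non-zero if the dir is not itself a subvolume
    btrfs subvolume show /var/lib/qubes >/dev/null 2>&1 || echo "dir is not a subvolume"
fi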

@DemiMarie

@tasket: what advantages will Wyng have over e.g. btrfs send?

@tasket (Owner) commented Mar 7, 2023

@DemiMarie

  1. A Wyng archive requires only a traditional fs (or semantics that encompass a Unix fs, like sftp and s3) on the backup destination, instead of specifically needing Btrfs on the destination. The only way to get around this with btrfs receive is to stack up the send streams like cordwood, which leaves you with a very inefficient/tedious restore process and no archive maintenance functions (see the sketch at the end of this comment).
  2. The 'cordwood' scenario is probable if encryption functions must remain in the local admin env.
  3. Wyng can work with other snapshot-capable local storage, and the user isn't tied to restoring to the same type of local fs as what they originally had... they can restore directly to non-COW storage if desired.

Edit: One could tongue-in-cheek say that the reasons for using Wyng are the reasons why qvm-backup doesn't use btrfs send. :)

Edit:
4. Wyng's monitor function lowers disk-space consumption for snapshots because snapshots (both reflinked img files and subvol snaps) are deleted after a delta map is made from them. So Wyng enables continuous rotation of snapshots, even when backups aren't being sent. btrfs-send requires that local snapshots stay in place, where the disk space they consume keeps growing until the next backup.
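
To illustrate the 'cordwood' pattern from point 1 (paths hypothetical): without Btrfs on the destination, incremental send streams can only be stored as opaque files, one per backup:

sudo btrfs send -p /pool/.snap1 /pool/.snap2 > /backups/vol.incr2.stream

Restoring then means replaying the entire chain, in order, through btrfs receive into a Btrfs filesystem; no individual stream can be pruned without breaking the chain.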

tasket added a commit that referenced this issue Mar 11, 2023
tasket added a commit that referenced this issue Mar 12, 2023
tasket added a commit that referenced this issue Mar 17, 2023
@tasket (Owner) commented Mar 17, 2023

@tlaurion @DemiMarie Wyng now has basically a full implementation of reflink support and is ready to try out on Btrfs for anyone curious enough at this stage (note: it has not yet returned to alpha status).

The prerequisite for using Wyng with Btrfs is to make the --local directory a subvolume, for example sudo btrfs subvolume create /var/lib/qubes, or use whichever dir_path your Qubes Btrfs pool uses:

$ qvm-pool info btrpool
name                btrpool
dir_path            /mnt/btrpool/libqubes
driver              file-reflink
ephemeral_volatile  False
revisions_to_keep   1

Since we are now accessing local filesystem objects, you must be mindful of directory structure. In fact, the current implementation treats subdirectories as part of the Archive volume's name. To demonstrate, send-ing a Qubes VM's disk image file to the archive looks like this:

sudo wyng --local=/mnt/btrpool/libqubes send appvms/untrusted/private.img

You don't have to specify --local if the archive already has that local setting. But showing it this way demonstrates:

  1. --local can now be specified at any time (not just with arch-init)
  2. reflink mode is automatically detected
  3. reflink mode accesses disk images simply by using --local as a base path and the volume name as the rest. Your system configuration determines how messy or neat the volume naming will be (but, yes, wyng-util-qubes will cope with this automatically).

It also raises the question of whether users might want to set aside a special dir where they create symlinks to the image files they want to back up, and then point Wyng at that special dir. This would be interesting to try.
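
An untested sketch of that symlink idea (all paths hypothetical):

mkdir -p /mnt/btrpool/wyng-vols
ln -s /mnt/btrpool/libqubes/appvms/untrusted/private.img /mnt/btrpool/wyng-vols/untrusted-private.img
sudo wyng --local=/mnt/btrpool/wyng-vols send untrusted-private.img

Whether reflink snapshotting follows symlinks cleanly is exactly the kind of thing that would need testing first.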

tasket added a commit that referenced this issue Mar 18, 2023
Allow multi-vol receive with reflink --local

Update Readme
tasket added a commit that referenced this issue Mar 20, 2023
Optimize: do not init_dedup_index if no vol changes
@tasket (Owner) commented Mar 20, 2023

Btrfs reflink and LVM have now been tested and are working.

@tasket closed this as completed Mar 20, 2023
tasket added a commit that referenced this issue Mar 21, 2023
Convert remove_local_metadata() issue #75