
blkid reports disk as zfs_member if it has a zfs_member partition #918

Closed
sschmitz opened this issue Dec 29, 2019 · 9 comments

sschmitz commented Dec 29, 2019

This bug was previously reported to the mailing list, but the author later concluded it was user error. However, I can reproduce the erroneous behavior consistently, starting from an empty virtual disk. It may also be the cause of the behavior described in this report.

Reproduction

I tested this in a virtual machine running the Xubuntu 19.10 Live ISO with util-linux 2.34 (but I have also tested it with the current master). The VM was configured to have an empty virtual 10737514496-byte disk. This is a little over 10 GiB; a nice round power-of-2 size will not reproduce the issue. The virtual disk was hosted on an ext4 filesystem; I have read that ZFS does not like to be self-hosted, at least not inside a ZVOL.

# fdisk -l /dev/sda
Disk /dev/sda: 10 GiB, 10737514496 bytes, 20971708 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Verify that the disk does not contain any old signatures:

# blkid /dev/sda
# wipefs /dev/sda
# hd /dev/sda # In fact, the disk is completely empty (will take a few seconds)
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
280017800

Then, initialize the disk and create a partition spanning the whole disk (in fact, it only needs to end close to the end of the disk).

# fdisk -l /dev/sda
Disk /dev/sda: 10 GiB, 10737514496 bytes, 20971708 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 90CB2A9F-2026-4622-83EA-318F2BE6C275

Device     Start      End  Sectors Size Type
/dev/sda1   2048 20971674 20969627  10G Linux filesystem

blkid will now correctly identify the disk as GPT-formatted.

# blkid /dev/sda
/dev/sda: PTUUID="90cb2a9f-2026-4622-83ea-318f2be6c275" PTTYPE="gpt"

Now, create a zpool on the partition.

# zpool create tank /dev/sda1

Voila, blkid is now confused:

# blkid /dev/sda
/dev/sda: LABEL="tank" UUID="11388703897118653008" UUID_SUB="2779192954700749898" TYPE="zfs_member" PTUUID="90cb2a9f-2026-4622-83ea-318f2be6c275" PTTYPE="gpt"

Cause

ZFS stores four copies of its uberblock on every backing device, two at the front and two at the back. The ones at the back are aligned to a 256 KiB boundary. Therefore, if the end of the last uberblock in the partition is less than 256 KiB from the end of the disk, it is stored in the same position as the last uberblock of a whole-disk ZFS would be (and likewise for the second-to-last copy).
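
For illustration, here is a small sketch using the numbers from the reproduction above; the 256 KiB label size and the end-of-device alignment come from the description of the format above, while the helper name and the exact calculation are mine:

#include <stdio.h>
#include <stdint.h>

#define VDEV_LABEL_SIZE (256 * 1024ULL)   /* one ZFS vdev label, 256 KiB */

/* Offset of the last trailing label: the device size is aligned down to a
 * 256 KiB boundary and the label occupies the last 256 KiB below it. */
static uint64_t last_label_offset(uint64_t dev_size)
{
    uint64_t aligned_end = dev_size & ~(VDEV_LABEL_SIZE - 1);
    return aligned_end - VDEV_LABEL_SIZE;
}

int main(void)
{
    uint64_t disk_size  = 10737514496ULL;     /* whole /dev/sda  */
    uint64_t part_start = 2048ULL * 512;      /* /dev/sda1 start */
    uint64_t part_size  = 20969627ULL * 512;  /* /dev/sda1 size  */

    /* Both calculations land on the same disk-relative offset, so the
     * partition's last label sits exactly where a whole-disk pool would
     * keep its last label. */
    printf("whole-disk last label at %llu\n",
           (unsigned long long)last_label_offset(disk_size));
    printf("partition  last label at %llu\n",
           (unsigned long long)(part_start + last_label_offset(part_size)));
    return 0;
}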

libblkid will find these two uberblocks at the end of the disk and treat them as evidence of the whole disk being ZFS-formatted. Looking at the code for probe_zfs in superblocks/zfs.c, I'm not entirely clear on the extent to which this is intentional. While a comment does say "Look for at least 4 uberblocks to ensure a positive match", the function description also says "Check only some of [the 4 uberblock copies]". Anyway, what does happen is that the function checks the four locations for uberblocks; because there are several slots at each location that can all contain the magic number, find_uberblocks returns the number of such slots at the given location, and the results are summed.

https://github.com/karelzak/util-linux/blob/9418ba6d05feed6061f5343741b1bc56e7bde663/libblkid/src/superblocks/zfs.c#L288

As soon as this sum reaches 4 (ZFS_WANT) or more, the loop is exited early and it is determined that a ZFS filesystem is present. This number may already be reached from a single uberblock location, as can be the case for an end-of-disk-aligned partition as described above.

As I said above, I'm not sure how the probe_zfs function is supposed to work, but when I edited the line above to found++ instead, blkid no longer reported the whole disk as a zfs_member but still did so for the partition. It will then require the uberblock magic number to be present at all four expected locations, and not four copies of the magic number at only one of them. However, I have only looked at the code for half an hour, so I have no idea if that is actually the solution or if that would break everything ;-)
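
To make the counting behavior concrete, here is a small, self-contained toy model (the names and data are mine, not the zfs.c source): the first function mirrors the sum-of-hits logic described above, the second the one-hit-per-location variant of the found++ experiment.

#include <stdio.h>

#define ZFS_WANT 4   /* threshold used by probe_zfs */

/* Current behavior (as described above): sum the uberblock hits over all
 * four label locations and stop as soon as the sum reaches ZFS_WANT, so a
 * single location with four magic-number slots is already enough. */
static int match_sum_of_hits(const int hits[4])
{
    int found = 0;
    for (int i = 0; i < 4; i++) {
        found += hits[i];
        if (found >= ZFS_WANT)
            return 1;
    }
    return 0;
}

/* "found++" variant: count only locations that contain at least one hit,
 * so all four expected label locations must show evidence. */
static int match_per_location(const int hits[4])
{
    int found = 0;
    for (int i = 0; i < 4; i++)
        if (hits[i] > 0)
            found++;
    return found >= ZFS_WANT;
}

int main(void)
{
    /* Whole disk where only the partition's two trailing labels are
     * visible near the end of the device, each with several slots. */
    int trailing_only[4] = { 0, 0, 4, 4 };
    /* Genuine whole-disk pool: hits at all four locations. */
    int whole_disk[4]    = { 3, 3, 3, 3 };

    printf("trailing only: sum=%d per-location=%d\n",
           match_sum_of_hits(trailing_only), match_per_location(trailing_only));
    printf("whole disk:    sum=%d per-location=%d\n",
           match_sum_of_hits(whole_disk), match_per_location(whole_disk));
    return 0;
}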

Robustness of find_uberblocks

Also, I noticed that the code seems to assume that at every location there are 128 uberblock slots spaced at 1 KiB. I believe that, while this was true several years ago, it is no longer necessarily the case. I could not quickly find an up-to-date specification of the on-disk format, but when I first noticed this bug on my actual system (with a ~500 GB ZFS partition with ashift=12), the blkid debug output included 32 uberblock slots per location spaced at 4 KiB (offsets 128, 132, 136 ...). In the test system described above, the debug output only contains 3 slots, spaced at 1 KiB (offsets 132, 133, 134). When the test zpool is created with ashift=12, I again see 3 slots, this time spaced at 4 KiB (offsets 144, 148, 152). So I guess the format has become more flexible than it used to be. The "fixed" code worked correctly in all three cases, but without an authoritative source for the format it is hard to say whether it will work for every possible ZFS pool.

In any case, the comments also assign some special significance to the uberblock slot "#4 (@ 132kB)". I cannot see that the code actually does anything special with this slot (unless this is where the value of ZFS_WANT comes from?), but seeing as there can be pools that do not have four slots per location, and that the offset might no longer be 132 KiB, at least the comments should be updated.
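
For what it's worth, the two observed layouts are consistent with a fixed 128 KiB uberblock ring per label whose slot size is the larger of 1 KiB and the pool's sector size (2^ashift). A small sketch of that arithmetic (the constants are inferred from the two cases above, not taken from an authoritative specification):

#include <stdio.h>

#define UBERBLOCK_RING   (128 * 1024)   /* uberblock area per label */
#define MIN_SLOT_SHIFT   10             /* 1 KiB minimum slot size  */

static void slots_for_ashift(int ashift)
{
    int slot_shift = ashift > MIN_SLOT_SHIFT ? ashift : MIN_SLOT_SHIFT;
    int slot_size  = 1 << slot_shift;
    printf("ashift=%d: %d slots of %d bytes\n",
           ashift, UBERBLOCK_RING / slot_size, slot_size);
}

int main(void)
{
    slots_for_ashift(9);   /* 512-byte sectors: 128 slots at 1 KiB */
    slots_for_ashift(12);  /* 4 KiB sectors:     32 slots at 4 KiB */
    return 0;
}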

@sschmitz (Author)

Note that I partitioned the virtual disk using GPT, which writes a backup partition table to the end of the disk. That is why I needed to carefully craft the size of the disk. When you use the MBR scheme instead, a partition can go right to the last sector of the disk, causing the bug to surface with disks of any size.
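
A quick back-of-the-envelope check with the numbers from the fdisk output above (the 33-sector figure for the backup GPT, one header sector plus 32 sectors of partition entries, is the usual default; the code below is only my own arithmetic):

#include <stdio.h>

int main(void)
{
    long long total_sectors    = 20971708;   /* whole /dev/sda     */
    long long last_part_sector = 20971674;   /* end of /dev/sda1   */
    long long gap = (total_sectors - (last_part_sector + 1)) * 512;

    printf("gap behind partition: %lld bytes (%lld sectors)\n",
           gap, gap / 512);                  /* 16896 bytes, 33 sectors */
    printf("within 256 KiB of the disk end: %s\n",
           gap < 256 * 1024 ? "yes" : "no");
    return 0;
}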

karelzak commented Jan 2, 2020

I have doubts that any find_uberblocks()-related optimization is the right way to fix it. This all sounds like the issue we had with Linux RAID, where it is/was possible to create a superblock at the end of the whole device. The core of the problem is that probe_zfs() reads within an area covered by a valid partition. That sounds like a mistake.

Maybe we need to add a blkid_probe_is_covered_by_pt() check to skip uberblocks within partitions, something like:

probe_zfs() {
        ...
        for (label_no = 0; label_no < 4; label_no++) {
                ...
                if ((S_ISREG(pr->mode) || blkid_probe_is_wholedisk(pr)) &&
                    blkid_probe_is_covered_by_pt(pr, offset, VDEV_LABEL_SIZE))
                        /* ignore this area; it's inside a partition and
                         * we are working with the whole disk now */
                        continue;

                label = blkid_probe_get_buffer(pr, offset, VDEV_LABEL_SIZE);
        }
        ...
}

karelzak commented Jan 2, 2020

Can you try it?

sschmitz commented Jan 3, 2020

I will try it next week.

karelzak commented Jan 9, 2020

ping ;-) I'd like to get a usable bugfix into the final v2.35.

sschmitz commented Jan 9, 2020

Oh, it's Thursday already? 😅 I finally got around to testing it today, and it seems to be working. The whole find_uberblocks() thing was more of an afterthought anyway, and while it does still smell fishy to me, I'm happy with this fix. (I guess in a perfect world, blkid would use the same logic as zpool import does, but I haven't looked at that.)

karelzak added a commit that referenced this issue Jan 13, 2020
Addresses: #918
Signed-off-by: Karel Zak <kzak@redhat.com>
@karelzak (Collaborator)

Fixed. Thanks for the test and report!

yan12125 commented Feb 9, 2020

Hi, blkid is still confused if I do the following steps:

  1. Create several partitions on a disk. For example:
  • 0-100 GB: /dev/sda1
  • 100-200 GB: /dev/sda2
  • 200-300 GB: /dev/sda3
  • 300-400 GB: /dev/sda4
  2. Create a zpool on the last partition (e.g. /dev/sda4)
  3. Delete the last partition

Now blkid is confused again lol

$ sudo /usr/bin/blkid /dev/sda
/dev/sda: LABEL="zfs_data" UUID="3969014677530013818" UUID_SUB="5766627027699155424" BLOCK_SIZE="4096" TYPE="zfs_member" PTUUID="8325ffb7-6b52-4114-af04-2c55312e79b6" PTTYPE="gpt"

If I understand it correctly, the blkid_probe_is_covered_by_pt() condition from f6e1820 is no longer true after the partition is deleted, so blkid reports an incorrect result.

UPDATE: I think I've got everything OK now. After deleting sda4 (formerly a zfs vdev), I enlarged sda3 to cover the remaining disk space and then ran wipefs:

sudo wipefs --types=zfs_member --all --backup /dev/sda3
sudo wipefs --types=zfs_member --all --backup /dev/sda

Now blkid is working again.

UPDATE 2: wipefs on /dev/sda is also needed.

@karelzak (Collaborator)

@yan12125 this is a generic problem for all on-disk metadata. It's always necessary to really delete the relevant bytes from the device to avoid misinterpretation. And yes, we have wipefs for this purpose, and some mkfs-like tools also delete old signatures before they create new superblocks.
