
ZFS option to add raw disks without creating GPT partitions #94

Open
stevecs opened this issue Feb 8, 2011 · 36 comments

@stevecs commented Feb 8, 2011

I see in the change logs that this was added a while ago (2009-11-02): when adding a raw volume to a pool, ZFS will create a GPT partition on the disk and actually add the partition, not the raw volume the user requested. The partition is also started at a ~1049 kB offset:

Disk /dev/sdq: 251GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 1049kB 251GB 251GB zfs
9 251GB 251GB 8389kB

This poses problems when the volumes presented to ZFS are not physical drives but LUNs, due to alignment issues. It also creates a management burden when dealing with many devices, as seen in large deployments, when trying to remove/replace a device with one of a larger size (having to not only replace all the devices in the vdev but also modify the GPT tables).

If possible, I would like an option to NOT create a partition table but to use the device as presented.

@behlendorf (Member) commented Feb 8, 2011

Certainly not creating the GPT partition tables is possible, but it's still unclear to me exactly why this is a problem.

The idea here is that given a full device (LUNs look like full devices), we create a GPT partition table and attempt to automatically align the first partition 1 MiB into the device. We want to keep it 4 kB aligned. Based on your results, it does look like something went wrong. You're reporting 1049 kB instead of 1024 kB, which seems wrong, so that certainly looks like a bug.

Alternatively, if you know best when setting up the system and want to set your own alignment, you can always manually create the partition tables and use these pre-created partitions for your vdev. The code will detect that this is a partition and simply use it without adjusting the partition tables.

@stevecs (Author) commented Feb 9, 2011

Besides the bug, which is secondary to the discussion (but needs to be fixed: the 1049 KiB offset as opposed to 1024 KiB), there are two issues.

  1. With any GPT offset there is a problem in proper alignment, both to the stripe size of an underlying device and to its stripe width. For example, a 1024 KiB offset will fit stripe sizes of 128, 256, 512, or 1024 KiB (since ZFS itself uses 128 KiB sizes, that would be the smallest). However, you then have the secondary issue of stripe width alignment. Say you have a SAN whose LUNs are comprised of 3D+1P (RAID-5 of 4 disks). Now you effectively have a stripe width of 384 KiB, and 384 KiB will not align with 1024 KiB. To your point, you can manually resize this GPT offset to, say, 768 KiB or 1536 KiB. But then if the back-end SAN changes the array configuration, you have the issue of being misaligned again.

If, on the other hand, you don't have any partition table, your LUN and ZFS will /ALWAYS/ be aligned, with no manual/end-user intervention.

  2. Since ZFS carves up a block device into discrete atomic units (i.e. it allocates sectors as needed), you can create a vdev based on, say, a couple of 1 TB drives, then remove those drives (one at a time) and replace them with 2 TB drives. When all members are replaced, you will have double the capacity available for use. If you also have to play with partition schemes, this gets much more involved and further raises the bar as to the skill level of the admin supporting the subsystem. The KISS principle applies. This is similar to many other large deployments of disks that are aggregated into logical volume groups, RAIDs, or other management structures under the file system layers.
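
The stripe-width arithmetic in item 1 can be sketched as follows (the helper names are invented for illustration and are not part of any ZFS tool):

```python
def aligned(offset_kib: int, stripe_width_kib: int) -> bool:
    # The partition start is aligned only if the offset is a whole
    # number of back-end stripe widths.
    return offset_kib % stripe_width_kib == 0

def first_aligned_offset(stripe_width_kib: int, minimum_kib: int = 1024) -> int:
    # Smallest multiple of the stripe width at or above the minimum.
    return -(-minimum_kib // stripe_width_kib) * stripe_width_kib

# 3D+1P RAID-5 with 128 KiB per-disk segments -> 384 KiB stripe width.
assert not aligned(1024, 384)   # the default 1 MiB offset misaligns
assert aligned(768, 384)        # a manual 768 KiB offset aligns
assert first_aligned_offset(384) == 1152  # smallest aligned offset >= 1 MiB
```

If the back-end array geometry changes, the "correct" offset changes with it, which is the maintenance problem described above; no partition table means no offset to maintain.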

Lastly, I would suggest that creating a new pool should by default NOT create the GPT, so that the commands function the same way as under Solaris (i.e. use a new flag if the user WANTS a GPT; the default is no flag and no GPT). That way it would be easier for people to move from one OS to another for support and expect the commands to work the same way.

In our deployments I have systems that generally have upwards of 100 LUNs/drives per system; having to do the additional steps across those at 2am is not something I would relish.

@behlendorf (Member) commented Feb 15, 2011

Thanks for the detailed reply. I believe the Sun/Oracle ZFS team would suggest not using a RAID array to back individual vdevs, because of the first issue you mention. ZFS is designed to work best when it's managing the individual devices. The partition tables are created with the assumption that you're just using a directly attached SAS/SATA/SSD/etc. device, in which case they just do the right thing, even for your case 2) above.

That said, we're actually considering doing the same thing for some of our systems as a short-term measure. Our concern isn't with ZFS per se, but that we still need to get a better JBOD/device management infrastructure in place for Linux. Solaris has FMA, which is OK but not great; we still need to build up those sorts of tools for Linux.

I think the first order of business here is to make sure the commands work exactly the same as they do under Solaris. I thought that's what I'd done with the partition table creation but if that's not the case it should be fixed. Once we have the basic tool behavior the same we can look at adding additional command line options which add any needed functionality.

So my question is... what exactly does OpenSolaris do with a blank drive when you create a new pool, and then what does the Linux port do? If you could provide the raw data that would help.

@stevecs (Author) commented Feb 15, 2011

Yeah, in large environments where a SAN infrastructure already exists, getting raw disks exported is not something that really happens. In some cases you can export, say, RAID-1 mirrors or, as I do here, the smallest RAID-5s I can, but even then that's pricey from another point of view. If we were all Solaris and had 10 Gbit end-to-end, we'd probably come up with what you're doing (ZFS/Lustre), or even just clustered filers exporting to other clients via iSCSI.

From the Sun boxes I have here, I think it's more complex: of the couple I just logged into, I have 5.10/ZFS systems running both SMI and EFI/GPT disk labels. I was surprised by the EFI ones, but those also happen to be newer media installs, so perhaps Sun/Oracle changed the default at some point? Originally ZFS was SMI only, with EFI/GPT optional.

This raises a DR question: what happens when you need to bring back a pool with EFI labels on a box running an older version of Solaris that may only handle SMI?

Does the Linux port have code to handle both SMI and EFI? From my tests it appears it only creates EFI, so, similar to what I mentioned above, an option to NOT do EFI/GPT on creation would still be useful for porting purposes, but it does not appear to be as bad at first glance.

@behlendorf (Member) commented Feb 15, 2011

Yes, I think the EFI scheme may be a late addition to ZFS, as you were saying. The Linux port currently will always create an EFI label if it detects you're using a whole disk for the vdev. If it determines you're using a partition/slice, it will just use the full partition/slice for ZFS and not do anything. It occurs to me that if you want to trick the current tools into doing nothing, you could probably just create a single partition which spans the entire device, and use that.

As for accessing old ZFS pools which may have been created with SMI partitioning, I'm not exactly sure how Linux will handle that. If the Linux kernel can properly read the partition information and construct the partitions, everything should be fine. If it can't, you won't be able to import those pools. For pools created with EFI labels on Solaris, I have verified that you can access these pools under Linux.

@fajarnugraha (Contributor) commented Feb 18, 2011

(1) How does the code currently detect whether we're using a whole disk (as opposed to, say, an LV) for the vdev?

(2) When CONFIG_SUN_PARTITION is enabled (as in the RHEL kernel), Linux can detect SMI labels/Solaris slices correctly. dmesg will show something like this:

xvda: xvda1
xvda1: <solaris: [s0] xvda5 [s2] xvda6 [s8] xvda7 >

(3) Solaris can detect zfs directly on whole disk (without partition) just fine. It will use "p0" (e.g. c1t0d0p0, instead of the usual c1t0d0s0 when using SMI label).

(4) Solaris by default will create EFI label when presented with whole disk for zfs pool. The exception is when it's going to be used for boot (rpool), on which it will use SMI label.

(5) zfs-fuse doesn't create any partition by default (a whole disk and a regular file are treated the same way), but mostly because of a licensing issue (something about libparted being GPL, while ZFS is CDDL).

@behlendorf (Member) commented Feb 18, 2011

Detecting whether you're using a whole disk in Linux is actually surprisingly tricky. The current logic resides in the in_whole_disk() function; it basically just attempts to create an EFI label for the provided device name. This will only succeed if we were given a device which can be partitioned.

From what you say the ZFS on Linux port should then have no trouble accessing either the Solaris SMI labels, or a pool created without partitions such as zfs-fuse. It also sounds like this port is doing the right thing by creating EFI labels by default.

Yes, libparted is GPL, so we can't link ZFS with its development headers and use it to create the partitions. The ZFS on Linux port gets around this by not using libparted, and instead uses a modified version of libefi from Solaris, which is compatibly licensed.

@stevecs (Author) commented Mar 2, 2011

Just an update: the parted offset is a units issue with that command, it seems:

GNU Parted 2.2
Using /dev/sdo
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Seagate ST2000DL003-9VT1 (scsi)
Disk /dev/sdo: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 1049kB 2000GB 2000GB zfs
9 2000GB 2000GB 8389kB

(parted) unit
Unit? [compact]? B
(parted) print
Model: Seagate ST2000DL003-9VT1 (scsi)
Disk /dev/sdo: 2000398934016B
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 1048576B 2000390528511B 2000389479936B zfs
9 2000390528512B 2000398917119B 8388608B

If you set your units to bytes you have exactly 1048576 (1 MiB), so there must be some rounding or something going on with parted which, though confusing, is not a problem with your offset code.

So that leaves just the original request for a means to avoid creating partitions at all.
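
The "1049kB" display above can be reproduced with a little arithmetic. A minimal sketch, assuming parted's compact view rounds SI kilobytes (1 kB = 1000 B); the helper name is invented:

```python
def parted_compact_kb(nbytes: int) -> str:
    # parted's default "compact" display uses SI units (1 kB = 1000 B)
    # and rounds to the nearest unit, so an exact 1 MiB start prints
    # as the odd-looking "1049kB".
    return f"{round(nbytes / 1000)}kB"

# 1 MiB = 1,048,576 B = 1048.576 kB, which rounds to 1049 kB.
assert parted_compact_kb(1_048_576) == "1049kB"
```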

@baryluk commented Apr 25, 2011

Hi. I also wanted to ask about this a long time ago. I tried Linux ZFS and found it very irritating that zpool create tries to be too clever and creates a GPT automatically. Please just use block devices, and do not make assumptions about what the block device is at all. This would also make it more similar to the way ZFS works under Solaris, FreeBSD, and zfs-fuse. I like raw devices as they are easier to work with in virtualized environments (iSCSI, Xen, KVM), since they are simpler to manage or mount outside the virtualized system as well.

Or is an EFI label or GPT (or whatever) now the recommended way of giving whole disks to ZFS? I'm also concerned about alignment issues here.

Thanks.

@stevecs (Author) commented Apr 25, 2011

The default behavior under Solaris, when presented with a raw disk cXtXdX, is to use SMI labels if the disk is to be booted, or EFI/GPT labels for data partitions, creating the first partition at a 1 MiB offset from the start of the volume. Brian appears to have attempted to copy this behavior, which is correct in that it would be the EXPECTED behavior (it matches what happens by default under Solaris).

To your point, and the reason I created the bug report: this does not work correctly in all cases (mainly ones where you have complex storage). In these scenarios the ability to use a raw device without a volume header/partition table is very useful, if not required, for good performance.

A work-around, until something gets done to the user-space tools, is to create your arrays with zfs-fuse, which, due to its complete lack of such functionality, will allow you to use any raw block device to create your zvols/pools. You can then export them and import them into native ZFS. This is far from ideal, but it would allow you to create some items in a pinch.

@fajarnugraha (Contributor) commented Apr 25, 2011

On newer opensolaris (I tested b148) with "autoexpand=on" zpool property, it currently does the "right thing" when the disk becomes larger (e.g. by resizing LUN from storage side). The GPT partition was automatically adjusted so the new size was recognized correctly.

@stevecs (Author) commented Apr 25, 2011

Yes, that has been true under Solaris for a while, actually. What it does not do, however, is allow you to set the offset from sector 0 for the beginning of the ZFS partition (item #1 above). This is what causes misalignment to underlying stripe widths and/or if you have a stripe size of >1024 KiB.

@stevecs stevecs closed this Apr 25, 2011

@stevecs stevecs reopened this Apr 25, 2011

@behlendorf (Member) commented Apr 25, 2011

I'm not opposed to fixing this, but I think we need a concrete proposal for what to do. First let me suggest that we not change the current default behavior. Not only is it consistent with the Solaris behavior, which is good for interoperability, but it's also what most people probably want.

I suggest we add a new property called 'aoffset' which can only be set once at zpool creation time. By default it will be the current 1MB offset, but if you have a specific offset requirement for your storage you can set it here. This could be further extended if needed so that an 'aoffset=-1' might indicate that no GPT partition tables should be created at all.

Related to this, further work is being done in issue #195 to allow the ashift to be set via a property as well. I'd very much like to see the same mechanism used to set both of these tunings. Is this sufficient to cover everyone's needs?

@stevecs (Author) commented Apr 26, 2011

I also agree that the base utility command set should mimic what's on Solaris. Similar to what we're talking about doing for ashift values, using long options (which do not exist in Oracle ZFS) would keep the commands similar across the board, exposing localizations only when explicitly called for.

I like your idea of setting a specific size for the disk partition alignment offset, as opposed to just having it there or not. That would give much more flexibility.

Something like "--diskpartalignment=xx", where xx would be a value in kibibytes (KiB), since no underlying controllers expose this alignment factor in anything but base-2 numbers.

Checks would be that the offset value has to be:
1) a multiple of the ZFS file system block size
2) a multiple of the 2^ashift sector size
3) at least 33 sectors, to leave room for the partition table. For 2^9 (512-byte) sectors this would be a minimum of 16896 bytes offset from LBA 0 (the legacy MBR plus the 128 partition entries). Note: with ashift values > 9 this would inflate beyond the base minimum of 16896; however, I think this is probably the best method for the offset, as it would not hard-code sector sizes (it follows what is set on the particular drive, so if 4K or larger sector devices come out it should auto-adapt).

comments?

@behlendorf (Member) commented Apr 26, 2011

Part of my motivation for setting this as a new zpool property, rather than simply a long option, is that it provides an easy way to preserve the originally requested value. That means if you ever need to replace a disk in the pool it can automatically be created with a correctly aligned GPT table. Otherwise, you will always need to respecify this offset when replacing a disk, and that's the kind of thing which is easily accidentally forgotten.

Your first two checks for the offset look good; we should absolutely check the offset for sanity when setting it. However, I think you're one sector off on the minimum size. To allow space for the GPT table itself we need at least the following room left at the beginning of the device. The first partition may start at LBA 34, i.e. 17408 bytes for 512-byte sectors, or 8x that for 4k sectors.

LBA 0 - Legacy MBR
LBA 1 - Partition Table Header
LBA 2-33 - Partition Table Entries
LBA 34 - First Partition
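
The minimum usable offset follows directly from that layout; a small sketch (the constant and helper names are invented for illustration):

```python
GPT_FIRST_USABLE_LBA = 34  # LBA 0 PMBR, LBA 1 header, LBAs 2-33 entries

def min_first_partition_offset(sector_bytes: int) -> int:
    # Smallest byte offset that leaves room for the full primary GPT.
    return GPT_FIRST_USABLE_LBA * sector_bytes

assert min_first_partition_offset(512) == 17408     # 512-byte sectors
assert min_first_partition_offset(4096) == 139264   # 8x for 4K sectors
```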

@stevecs (Author) commented Apr 26, 2011

Setting it as a zpool option would also allow for having different values on different vdevs, which would allow mixing of devices (not in the same vdev, obviously, but in the same zpool).

As for the offset, you are correct; I shouldn't hop around when writing a post. ;)

@behlendorf (Member) commented Apr 26, 2011

How exactly would the syntax work for setting it per vdev? I see what you're saying, but without substantially rewriting the parser I don't see how you would specify that level of detail. My suggested syntax would be something like the following; the ashift and aoffset properties would only be settable at pool creation time.

zpool create -o ashift=12 -o aoffset=4MB pool vdev ...

@stevecs (Author) commented May 3, 2011

Sorry, I didn't get an update on this for some reason when you posted. You are correct: you would need to do it at the zpool level. I was thinking it would be something we /could/ do per vdev; however, as you pointed out, there is no parsing for that at creation time or at import/export time (it's only at the zpool level), so the only way to do this at this point would be there. This would mean that you would need to create your zpool with the ashift value that you want, and a full destroy/re-create would be required to change it.

As for your suggested syntax, I think that would work, but I'd suggest using IEC 60027-2 (2^n) units and parsing to avoid the ambiguity of base-10 numbers. Likewise, the ability to parse the KiB abbreviation would be needed at a minimum (MiB or GiB may be useful too); offsets of less than 1 MiB can be required, but they are very unlikely to ever be below 1 KiB (I'm hard pressed to think of anything like that in relation to ZFS, unless ZFS has native internal support for fat sectors (520/528 byte), which I doubt).

@baryluk commented May 14, 2011

If the current ZFS on Linux has the same behaviour as the current default on Solaris, and the default alignment is good (1 MB), then this is of low importance to me personally. I'm also slightly against adding custom fields or changing the CLI interface until a similar thing is introduced in FreeBSD or Solaris.
Thanks.

@rlaager (Contributor) commented Jan 7, 2012

In the example above:
1 1049kB 2000GB 2000GB zfs
9 2000GB 2000GB 8389kB

What is partition 9 for?

I'm using a VM for testing. If I use virtio for the virtual disk, it shows up as /dev/vda, and running zpool create tank vda does not create a GPT partition label. But if I use SCSI for the virtual disk instead, it shows up as /dev/sda, and zpool create tank sda results in it getting a GPT label. Any ideas why (or how I might debug this)?

@dajhorn (Member) commented Jan 7, 2012

@rlaager: Partition 9 is the EFI System Partition, which should be FAT32. The loader is installed there if you are booting from a GPT disk.

@rlaager (Contributor) commented Jan 7, 2012

@dajhorn: Just on Solaris, or is it used on Linux as well? I don't see how that fits in with the bios_grub partition in the Linux scheme of things.

@dajhorn (Member) commented Jan 7, 2012

@rlaager:

What you're seeing as partition 9 is part of the EFI/GPT standard. It is not peculiar to Solaris. It should be created by Linux and Microsoft Windows too.

The bios_grub flag is, however, peculiar to the way that GRUB2 is implemented. If you don't install the GRUB loader exactly according to the documentation, then it won't work and you won't get a sensible error message. GRUB on EFI is brittle compared with GRUB on MBR.

@rlaager (Contributor) commented Jan 8, 2012

@dajhorn: It sounds like you're describing the EFI system partition: http://en.wikipedia.org/wiki/EFI_System_partition However, the GUID for that is supposed to be C12A7328-F81F-11D2-BA4B-00A0C93EC93B (EF00 in gdisk). The partition being created by ZoL (and presumably Solaris, but I haven't verified that) is 6A945A3B-1DD2-11B2-99A6-080020736631 (BF07 in gdisk).

@putnam commented Sep 5, 2012

I came across this confusing problem today while janitoring my server.

I have three raidz2s in my pool, and each of them was created at a different time (the first and second were on different versions of zfs-fuse, the third was with zfsonlinux, and some drive switching occurred after I started using zfsonlinux too).

It was rather confusing to try to decipher all this, because some drives were "true" block devices only, and others had this two-partition scheme going on. There is a lot of assumption in the zpool import/add commands that expects your block device symlinks to also have partition symlinks in the same directory. In some cases, like "zpool add tank spare a1 b1 c1", I can tell there is a path search going on that uses Solaris-style labels like "a1p1" before erroring out.

I suppose that since "whole disk" isn't really "whole disk" after all, and it seems to be this way upstream in Solaris, this behavior is to be expected from here on out. In my case the confusion was compounded by the fact that I created my first two raidz2 vdevs using zfs-fuse, which was using something apparently non-standard.

In my case I was trying to be slick and create my own symlink folder for labeling purposes. I created /zdev/ and put symlinks to each block-level device in my pool (e.g. A1->/dev/disk/by-id/scsi-A123456). I know this might be solved with zdev.conf but that only supports by-path, which I find to be too volatile on my setup (affected by scan order). So after I made all these symlinks nothing really worked right, because the zpool command was unable to properly tack on pathnames to get a handle on the partition. In the end I fixed my symlinks so the drives that used partitions would have their data partition symlinked instead of the block device.

Maybe it is an uncommon case, but I still kinda wish errors (or even the manual) for zpool add/import would include a note that you might need to point it to a partition symlink in the case of recent-Solaris or zfsonlinux "whole drive" initialization. It's very confusing indeed when zpool swears your device isn't available when the block device is right there, and you know you used "whole disk" when creating the vdev. The terminology is what got me.

@cwedgwood (Contributor) commented Sep 5, 2012

It would be reasonable to detect when the partitions are not suitably aligned and issue a warning; this would also prevent people accidentally having 63 sector offsets with AF drives (I've not seen this in the ZFS context but have in other places).
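
Such a warning check could be sketched as follows (a hypothetical helper for illustration, not actual zpool code):

```python
def alignment_warning(start_sector, logical_bytes=512, physical_bytes=4096):
    # Return a warning string when the partition's byte offset is not
    # a multiple of the physical sector size, else None.
    start = start_sector * logical_bytes
    if start % physical_bytes:
        return (f"partition starts at sector {start_sector} "
                f"({start} B), not {physical_bytes}-byte aligned")
    return None

# The classic legacy MBR offset of 63 sectors misaligns on an AF drive,
# while the modern 1 MiB (sector 2048) offset is fine.
assert alignment_warning(63) is not None
assert alignment_warning(2048) is None
```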

@behlendorf behlendorf removed this from the 0.7.0 milestone Oct 3, 2014

FransUrbo added a commit to FransUrbo/zfs that referenced this issue May 29, 2015

Option to disable automatic EFI partitioning.
Sometimes it is desired to not have 'zpool' setup partitioning on
devices it uses for the pool. So add a '-D' option to 'add', 'attach',
'create', 'replace' and 'split' to disable the automatic partitioning.

Signed-off-by: Turbo Fredriksson turbo@bayour.com
Closes zfsonlinux#94
Closes zfsonlinux#719
Closes zfsonlinux#1162
Closes zfsonlinux#3452

FransUrbo added a commit to FransUrbo/zfs that referenced this issue May 30, 2015

Option to disable automatic EFI partitioning.
Sometimes it is desired to not have 'zpool' setup partitioning on
devices it uses for the pool. So allow '-o whole_disk={on,off}'
option to 'add', 'attach', 'create', 'replace' and 'split' to
disable or enable, respectively, the automatic partitioning.

Signed-off-by: Turbo Fredriksson turbo@bayour.com
Closes zfsonlinux#94
Closes zfsonlinux#719
Closes zfsonlinux#1162
Closes zfsonlinux#3452

@rlaager (Contributor) commented Oct 1, 2016

This is essentially the opposite of #1162. I'm mentioning it to link them together.

@Nable80 commented Nov 20, 2017

Dear developers, is there any hope of (finally) getting such an option in the official package? I wanted to use ZoL on a relatively large JBOD, and a dozen completely senseless EFI partitions is not something I want to see in my system. I know there are higher-priority problems... but could you tell me what is currently holding this one up?

@jumbi77 commented Nov 20, 2017

@Nable80 there is currently an open PR for creating pools w/o gpt: #6277

@koitsu commented Mar 23, 2018
commented Mar 23, 2018

This whole matter was solved properly in Illumos recently. GPT primary/backup headers and EFI partitions are only created if the -B flag is given to zpool create: https://www.illumos.org/issues/7446

Not auto-GPT'ing and auto-partitioning by default is fully justified in the original ZFS Best Practices Guide, section "Storage Pools": https://web.archive.org/web/20150207105847/http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools

This same mentality and thought process is applied in ZFS on FreeBSD, as denoted here (don't miss the last 2 lines of the commit message): https://svnweb.freebsd.org/base?view=revision&revision=331395

Bootable devices/pools obviously need GPT and/or EFI (or alternately MBR, though achieving 4K alignment may be complicated/a problem with MBR). However, this behaviour being default for zpool create (i.e. for all pools, bootable or not) on Linux is a bummer. But I'm biased coming from both a FreeBSD and Solaris background.

@ehem

commented Jul 16, 2018

I'm rather astonished that such basic functionality is still broken after 7 years. mkfs -t <type> /dev/<anything> has worked since the beginning. Insisting upon creating a GPT on the target is astonishingly broken. Even if something appears at first glance to be a whole disk, it may not be, if you can see the bigger picture.

# zpool add test /dev/xvdk
invalid vdev specification
use '-f' to override the following errors:
/dev/xvdk does not contain an EFI label but it may contain partition
information in the MBR.
# dd if=/dev/zero count=1024 of=/dev/xvdk
1024+0 records in
1024+0 records out
524288 bytes (524 kB, 512 KiB) copied, 0.0359066 s, 14.6 MB/s
# zpool create test /dev/xvdk
invalid vdev specification
use '-f' to override the following errors:
/dev/xvdk does not contain an EFI label but it may contain partition
information in the MBR.
# zpool create test -f /dev/xvdk
[100.000000]  xvdk: xvdk1 xvdk9

No, there is certainly not any potential for there to be partition information in the MBR there. The code should check for the presence of such a thing before giving such an obviously broken error message. In this case, outside of the test VM /dev/xvdk has already had a GPT stripped off and it was greatly desired for zpool to use the entire block device it was given.

For the sanity of the next person to run into this breakage, I did manage a workaround:

# cp -l /dev/xvdk /dev/xvdk1
# zpool create test -f /dev/xvdk1

Yes, that really was successful at bypassing this insanity.

@koitsu

commented Jul 24, 2018

@ehem Couple things I can think of. This is more of a brain dump than "here's where the problem lies":

  1. With GPT, there is a primary table (LBAs 1 through 33) and backup table (LBAs {last-LBA-of-disk-minus-33} to {last-LBA}). Your dd only clears the primary. I don't know about Linux, but FreeBSD will actually tell you (in console/dmesg) if the primary is corrupt and will fall back to using data from the backup table at the end of the physical disk. As such, when working with GPT, you have to clear out both tables. LBA 0 is usually the PMBR.

  2. On Linux (at least back in the 2.6.x days; not sure about today), the kernel kept an internal cache of what the MBR/GPT contained, all the way down to partitions. If you manipulated the regions directly on the disk, you had to inform the kernel via a special ioctl() to get the kernel to "re-taste" (re-read) MBR/GPT areas. The only way I know how to do this on Linux is to use fdisk followed by the w command, but as said before, there may be a newer command today that can do it.
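The clearing step in point 1 can be sketched against a scratch image file (a stand-in for a real disk; on an actual device, `wipefs -a` is the usual tool, but the arithmetic below makes the on-disk layout explicit — assuming 512-byte LBAs, which a 4Kn device would not have):

```shell
# Zero both GPT copies on a scratch image file. Illustrative only;
# nothing here touches a real disk.
img=gpt-demo.img
lba=512
truncate -s 64M "$img"
total=$(( $(stat -c %s "$img") / lba ))
# PMBR (LBA 0) + primary header (LBA 1) + entry array (LBAs 2-33): 34 sectors
dd if=/dev/zero of="$img" bs="$lba" count=34 conv=notrunc status=none
# Backup entry array + backup header occupy the last 33 LBAs of the disk
dd if=/dev/zero of="$img" bs="$lba" seek=$(( total - 33 )) count=33 conv=notrunc status=none
echo "cleared LBAs 0-33 and $(( total - 33 ))-$(( total - 1 ))"
```

Clearing only the first megabyte, as in the dd above in this thread, leaves the backup copy at the end of the device intact.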

@ehem

commented Jul 25, 2018

> @ehem Couple things I can think of. This is more of a brain dump than "here's where the problem lies":
>
> 1. With GPT, there is a primary table (LBAs 1 through 33) and backup table (LBAs {last-LBA-of-disk-minus-33} to {last-LBA}). Your dd only clears the primary. I don't know about Linux, but FreeBSD will actually tell you (in console/dmesg) if the primary is corrupt and will fall back to using data from the backup table at the end of the physical disk. As such, when working with GPT, you have to clear out both tables. LBA 0 is usually the PMBR.

Indeed. I didn't bring it up, but I'd ended up writing a small program to explicitly whack all traces of both. There was no possibility of a GPT or other such format on the device beforehand. Also note the error message: /dev/xvdk does not contain an EFI label but it may contain partition information in the MBR. I don't know whether zpool create will look for a backup GPT, but zpool was clearly indicating no GPT ("EFI label") was present. At which point the dd would have nuked all traces of any such construct that had been present. That error is garbage.

> 2. On Linux (at least back in the 2.6.x days; not sure about today), the kernel kept an internal cache of what the MBR/GPT contained, all the way down to partitions. If you manipulated the regions directly on the disk, you had to inform the kernel via a special ioctl() to get the kernel to "re-taste" (re-read) MBR/GPT areas. The only way I know how to do this on Linux is to use fdisk followed by the w command, but as said before, there may be a newer command today that can do it.

Indeed, you're referring to ioctl(fd, BLKRRPART). I don't recall the exact sequence of actions I took prior to the above, but the kernel was unaware of any disk slicing before the commands were run (either I'd included the ioctl() during the erase, or perhaps restarted the VM).
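For reference, the modern front-ends for that ioctl can be sketched as follows (the device path is a placeholder, and both commands need root against a real disk, so the script skips anything that is not a block device):

```shell
# Ask the kernel to re-read a device's partition table after editing it
# directly on disk. blockdev(8) issues the BLKRRPART ioctl itself;
# partprobe(8) from parted is the usual fallback when partitions are busy.
dev="${1:-/dev/sdX}"   # /dev/sdX is a placeholder, not a real device here
if [ -b "$dev" ]; then
    blockdev --rereadpt "$dev" || partprobe "$dev"
    msg="re-read partition table on $dev"
else
    msg="skipping: $dev is not a block device"
fi
echo "$msg"
```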

As near as I can tell zpool create's logic is roughly:

if(gpt_is_present(passed_in_dev)) {
  /* some action or error message I didn't reproduce */
  return EDIE_FULLDEVICE;
} else if(!isdigit(passed_in_devname[strlen(passed_in_devname)-1])) {
  errormsg("%s does not contain an EFI label but it may contain partition\ninformation in the MBR.\n", passed_in_devname);
  return EDIE_FULLDEVICE;
}

This isn't 100% accurate, of course. Notably, the above wouldn't produce the error message for /dev/xvdk0, whereas zpool create gave the error in that situation (no error for /dev/xvdk1, though). I hope the code checks for GPTs inside devices named "/dev/xvdk1" (yes, on many setups GPTs inside slices will be found and sub-devices created), but I haven't tried this since I was trying to eliminate all traces of any such constructs.
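That trailing-digit check is also why the `cp -l /dev/xvdk /dev/xvdk1` trick above works. The heuristic can be re-stated in shell (hypothetical, not the actual libzfs code, and — as noted — not a complete model of what zpool does):

```shell
# Hypothetical re-statement of the heuristic: a device name ending in a
# digit is assumed to be a partition, so the EFI-label warning is skipped.
looks_like_partition() {
    case "$1" in
        *[0-9]) return 0 ;;   # e.g. /dev/xvdk1 -> treated as a partition
        *)      return 1 ;;   # e.g. /dev/xvdk  -> treated as a whole disk
    esac
}
looks_like_partition /dev/xvdk1 && echo "/dev/xvdk1: partition, no warning"
looks_like_partition /dev/xvdk  || echo "/dev/xvdk: whole disk, gets relabeled"
```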

There is a really awful misfeature here. One should include slack for slightly differing device sizes, but this is ridiculous. Using whole media devices as filesystems has been a long-standing tradition; I highly dislike having to spend a fair bit of time getting a tool to do the basic operation I want without it adding useless garbage overhead on top.

@ttyS4

commented Oct 5, 2018

I am experimenting with using loopback devices to hide the device from ZoL's zpool create command.
With a single-device pool it is:

losetup /dev/loop2 /dev/vdb
zpool create mypool /dev/loop2
zpool export mypool
losetup -d /dev/loop2
zpool import
zpool import mypool

As an end result I have my pool directly on the unpartitioned device.

richardelling pushed a commit to richardelling/zfs that referenced this issue Oct 15, 2018
