RAIDZ1: unable to replace a drive with itself #2076

Open
mcrbids opened this Issue Jan 23, 2014 · 12 comments

@mcrbids

mcrbids commented Jan 23, 2014

Trying to simulate failure scenarios with a 3+1 RAIDZ1 array in order to prepare for eventualities.

# zpool create -o ashift=12 spfstank raidz1 sda sdb sdc sdd
# zfs create spfstank/part
# dd if=/dev/random of=/spfstank/part/output.txt bs=1024 count=10000

I then manually pull out /dev/sdc without shutting anything down. As expected, zpool status shows the drive in a bad state:

# zpool status 
-- SNIP -- 
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637  UNAVAIL     16   122     0  corrupted data
-- SNIP -- 

This status doesn't change when I re-insert the drive. I want to simulate re-introducing a drive that's extremely incoherent relative to the state of the ZFS pool, so, after making sure the drive is "offline", I introduce a raft of changes:

# zpool offline spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# dd if=/dev/zero of=/dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 bs=1024 count=100000

102 MB of changes, to be exact. Now, I want to re-introduce the drive to the pool and get ZFS to work it out. At this point, the status of the drive is:

# zpool status 
-- SNIP -- 
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637  OFFLINE     16   122     0
-- SNIP -- 

I try to replace the drive with itself:

# zpool replace spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 -f 
cannot replace ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 with ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637: ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 is busy

# zpool replace spfstank /dev/sdc /dev/sdc -f 
invalid vdev specification
the following errors must be manually repaired:
/dev/sdc1 is part of active pool 'spfstank'

# zpool replace spfstank /dev/sdc  -f 
invalid vdev specification
the following errors must be manually repaired:
/dev/sdc1 is part of active pool 'spfstank'

I was able to "fix" this with:

# zpool online spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# zpool clear spfstank
# /sbin/zpool scrub spfstank

During the scrub, the status of the drive changes:

# zpool status 
-- SNIP -- 
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637  ONLINE       0     0     9  (repairing)
-- SNIP -- 

There doesn't seem to be a way to "replace" a known incoherent drive with itself.

@dweeezil

Member

dweeezil commented Jan 23, 2014

You didn't corrupt the disk enough. The dd left the 3rd and 4th copies of the labels intact, so the disk is still being recognized as part of the pool. All you need to do in this case is zpool online it. The only parts of a vdev stored at fixed locations are the labels: two at the beginning and two near the end. As long as any one of them is intact the disk is still identifiable, so you'd need extremely severe damage to prevent a simple "online" from working (thanks to the multiple copies of all metadata).
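Roughly, for anyone who wants to reproduce a genuinely destroyed disk, an untested sketch (the labels are 256 KiB each, two at the front and two in the last 512 KiB of the vdev, which for a whole-disk pool member is the -part1 partition; oflag=seek_bytes assumes GNU dd):

# PART=/dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637-part1
# zdb -l $PART                      # shows which of the four labels are still readable
# SIZE=$(blockdev --getsize64 $PART)
# dd if=/dev/zero of=$PART bs=512K count=1                                          # L0 + L1
# dd if=/dev/zero of=$PART bs=512K count=1 seek=$((SIZE - 524288)) oflag=seek_bytes # L2 + L3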

@nedbass

Member

nedbass commented Jan 23, 2014

Or, as was mentioned on the mailing list, zpool labelclear -f /dev/sdc should let zpool replace work to simulate a drive swap.
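In other words, something like this (sketch; as later comments in this thread note, the label may need to be cleared from the partition, e.g. /dev/sdc1, rather than from the whole disk):

# zpool offline spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# zpool labelclear -f /dev/sdc
# zpool replace spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 /dev/sdc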

@joshenders


joshenders commented Jul 29, 2014

@dweeezil is it safe to assume that after the disk is online'd and the scrub finishes, the disk in the FAULTED state will return to the ONLINE state?
UPDATE: After the scrub completed, the disk is still in the FAULTED state.

This may be better mailing list fodder but I'm noticing similar behavior as @mcrbids and I believe this is on topic. I hope you don't mind.

Here is the zpool configuration:

config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          raidz2-0  ONLINE       0     0     0
            A0      ONLINE       0     0     0
            B0      ONLINE       0     0     0
            C0      ONLINE       0     0     0
            D0      ONLINE       0     0     0
            E0      ONLINE       0     0     0
            F0      ONLINE       0     0     0
          raidz2-1  DEGRADED     0     0     0
            A1      OFFLINE      0     0     0
            B1      ONLINE       0     0     0
            C1      ONLINE       0     0     0
            D1      ONLINE       0     0     0
            E1      ONLINE       0     0     0
            F1      ONLINE       0     0     0
          raidz2-2  DEGRADED     0     0     0
            A2      ONLINE       0     0     0
            B2      ONLINE       0     0     0
            C2      OFFLINE      0     0     0
            D2      ONLINE       0     0     0
            E2      ONLINE       0     0     0
            F2      OFFLINE      0     0     0
          raidz2-3  ONLINE       0     0     0
            A3      ONLINE       0     0     0
            B3      ONLINE       0     0     0
            C3      ONLINE       0     0     0
            D3      ONLINE       0     0     0
            E3      ONLINE       0     0     0
            F3      ONLINE       0     0     0

I have attempted to "borrow" a disk from one of the N+2 vdevs (raidz2-1) for the vdev at N (raidz2-2) by offline'ing A1 and zeroing the first few hundred megs:

# zpool offline data A1
# dd if=/dev/zero of=/dev/disk/by-vdev/A1 bs=64M count=10

I then edited my /etc/zfs/vdev_id.conf so that udev will give A1 the label of C2, and commented out the existing line that defines C2.
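For illustration, the edit looked roughly like this hypothetical /etc/zfs/vdev_id.conf fragment (the wwn targets are made-up placeholders):

# old C2 mapping, commented out
#alias C2  /dev/disk/by-id/wwn-0x5000c5006bd00001
# the disk formerly aliased as A1, now presented as C2
alias  C2  /dev/disk/by-id/wwn-0x5000c5006bd00002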

I then removed A1 and C2 and placed A1 in C2's drive tray. I reconnected the new C2. udev triggers and /dev/disk/by-vdev/C2 now exists.

# ls -l /dev/disk/by-vdev/C2*
lrwxrwxrwx 1 root root 9 Jul 29 16:15 /dev/disk/by-vdev/C2 -> ../../sdu

When I attempt to replace the offline'd C2 with the new C2, however, I get a message that C2 is busy, and the disk is automatically partitioned (by ZFS, I assume).

# zpool replace data C2 /dev/disk/by-vdev/C2
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-vdev/C2 contains a corrupt primary EFI label.
# zpool replace -f data C2 /dev/disk/by-vdev/C2
cannot replace C2 with /dev/disk/by-vdev/C2: /dev/disk/by-vdev/C2 is busy
# ls -l /dev/disk/by-vdev/C2*
lrwxrwxrwx 1 root root  9 Jul 29 16:16 /dev/disk/by-vdev/C2 -> ../../sdu
lrwxrwxrwx 1 root root 10 Jul 29 16:16 /dev/disk/by-vdev/C2-part1 -> ../../sdu1
lrwxrwxrwx 1 root root 10 Jul 29 16:16 /dev/disk/by-vdev/C2-part9 -> ../../sdu9

Note, the "corrupt primary EFI label" message is always present, even with brand new disks that have never touched the system; I'm not sure what that is about. I always have to use -f when replacing.

If I had to take a guess, this has something to do with the fact that I created the pool with the /dev/disk/by-vdev/ labels and not /dev/disk/by-id/. ZFS sees the path "/dev/disk/by-vdev/C2" and assumes it is just badly damaged (and, as I've learned from this thread, a label still exists at a location beyond the first several hundred megs I overwrote). Am I close here?
UPDATE: Doesn't appear to be related to which symlink was used when referencing the disk.
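One way to check which path strings ZFS actually recorded for the vdevs is to dump the cached pool configuration (sketch):

# zdb -C data | grep path: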

Would the correct course of action in replacing a disk this way be to just zpool online the "borrowed" disk, if I need to borrow disks from other vdevs in the future?
UPDATE: No. zpool online will not resilver the faulted disk, and zpool replace will not allow disk reuse within the pool, which I believe to be a bug.

@joshenders


joshenders commented Jul 30, 2014

I think there might actually be a bug here as of 0.6.3. Even if I zpool labelclear the disk I still cannot use it as a replacement in this pool.

# zpool replace -f data C2 /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX
invalid vdev specification
the following errors must be manually repaired:
/dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX is part of active pool '

As seen in the post above, the system automatically partitions the drive without my intervention. There must be some signaling beyond the ZFS label on the drive that informs ZFS that this disk is/was a member of this pool.
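One way to see what ZFS-identifiable state is still on the disk is to query it directly (sketch; wipefs only lists signatures when run without -a):

# zdb -l /dev/disk/by-vdev/C2-part1
# wipefs /dev/disk/by-vdev/C2-part1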

After I zero'd the drive fully with dd, I was able to use it as a replacement disk.

...
          raidz2-2       DEGRADED     0     0     0
            A2           ONLINE       0     0     0
            B2           ONLINE       0     0     0
            replacing-2  OFFLINE      0     0     0
              old        OFFLINE      0     0     0
              C2         ONLINE       0     0     0  (resilvering)
            D2           ONLINE       0     0     0
            E2           ONLINE       0     0     0
            F2           OFFLINE      0     0     0
...
@DeHackEd

Contributor

DeHackEd commented Aug 1, 2014

You have to zpool labelclear the partition on the disk, not just the whole disk. Even if you give ZFS a whole disk, it creates partitions on it, and you have to clear those.
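Concretely, something along these lines (sketch; the -part1 suffix is where ZFS places the data on a whole-disk vdev):

# zpool labelclear -f /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX-part1
# zpool replace -f data C2 /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX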

@joshenders


joshenders commented Aug 1, 2014

Noted. That's a lot less time consuming than wiping the disk. Thanks!

@gitbisector


gitbisector commented Oct 12, 2014

zpool labelclear scsi-SATA_ST3000DM001-1CH_XXXXXXX-part1 complains about the disk being part of an active pool too. I tried that after a zpool offline /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX.

To work around this, I moved the disk to another system and did the zpool labelclear there.

After that, 'zpool replace -f tank scsi-SATA_ST3000DM001-1CH_XXXXXXX /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX' got me to resilvering.

@behlendorf behlendorf removed this from the 0.6.4 milestone Oct 30, 2014

@gordan-bobic


gordan-bobic commented Mar 31, 2016

It would be really handy to be able to do this without physically removing the disk. A prime example of the use case is when changing partitions around, e.g. dropping a partition to make more space for a zfs one.

@Spongman


Spongman commented Sep 27, 2017

I'm running into this as well. I don't understand how this is no longer considered a bug.

labelclear is clearly broken: it's impossible to clear a partition that was created as part of a whole-disk pool.

Also, labelclear -f'ing the drive doesn't do enough to prevent the error 'does not contain an EFI label but it may contain information in the MBR'.

Why is it even necessary for the user to reason about partitions that they didn't create?

@behlendorf behlendorf added this to the 0.8.0 milestone Feb 10, 2018

@rueberger


rueberger commented Jun 19, 2018

I believe I'm running into this problem.

sudo zpool status

 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 29h17m with 0 errors on Mon Jun 11 05:41:41 2018
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            sda-enc  ONLINE       0     0     0
            sdb-enc  ONLINE       0     0     0
            sdc-enc  ONLINE       0     0     0
        logs
          log        ONLINE       0     0     0
        cache
          cache      FAULTED      0     0     0  corrupted data
          cache      ONLINE       0     0     0

errors: No known data errors

The cache is a logical volume on a LUKS device. I must have done something wrong with the setup, and it is not properly recognized on reboot.

sudo zpool replace -f tank cache /dev/disk/by-id/dm-name-ws1--vg-cache

cannot open '/dev/disk/by-id/dm-name-ws1--vg-cache': Device or resource busy
cannot replace cache with /dev/disk/by-id/dm-name-ws1--vg-cache: no such device in pool

sudo zpool labelclear /dev/disk/by-id/dm-name-ws1--vg-cache

labelclear operation failed.
        Vdev /dev/disk/by-id/dm-name-ws1--vg-cache is a member (L2CACHE), of pool "tank".
        To remove label information from this device, export or destroy
        the pool, or remove /dev/disk/by-id/dm-name-ws1--vg-cache from the configuration of this pool
        and retry the labelclear operation.

Any insights greatly appreciated.

EDIT: I should clarify that the cache seems to be in use, which explains why the device is busy. So it's perhaps just a minor annoyance that the old cache entry can't be removed?

EDIT: Sorry, I must have just been confused about the paths. I was able to remove the degraded device with sudo zpool remove tank /dev/ws1-vg/cache.

@shevek


shevek commented Aug 13, 2018

I have this issue too. I can't labelclear an offline disk to reinsert it in the pool.

@shevek


shevek commented Aug 13, 2018

Workaround: strace -e pread64 zdb -l $DEV >/dev/null

Gives a bunch of offsets:

pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 0) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 262144) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127614976) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127877120) = 262144

Clout these offsets with dd and charlie's your uncle.
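Spelled out, that's four 256 KiB writes at the byte offsets strace reported (untested sketch; oflag=seek_bytes assumes GNU dd):

# dd if=/dev/zero of=$DEV bs=256K count=1 seek=0              oflag=seek_bytes
# dd if=/dev/zero of=$DEV bs=256K count=1 seek=262144         oflag=seek_bytes
# dd if=/dev/zero of=$DEV bs=256K count=1 seek=12000127614976 oflag=seek_bytes
# dd if=/dev/zero of=$DEV bs=256K count=1 seek=12000127877120 oflag=seek_bytes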

Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.
