Provide a way to recover from "permanent errors" #5528

Open
devsk opened this Issue Dec 25, 2016 · 12 comments

@devsk
devsk commented Dec 25, 2016 edited

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version
Linux Kernel 4.9.0
Architecture x86_64
ZFS Version 0.7.0-rc2_83_gf2d8bdc6
SPL Version 0.7.0-rc2_4_gf200b83

Describe the problem you're observing

One file in my backup pool is showing permanent errors in a snapshot. The snapshot was made two years ago. Every snapshot from that point onward also shows that file as corrupt (I/O error) when accessed through the .zfs folder.

So, out of the blue, this file is now corrupt in every snapshot I have.

Describe how to reproduce the problem

It's not reproducible, but this is the second time it has hit me. The last time I hit this was in 2013, and I ended up recreating the pool.

My big worry is that if I need to replace a disk, it's going to fail during resilvering and force me to either delete all my snapshots or delete the whole pool and recreate it.

If we could have an option of correcting a specific file in all snapshots, that would be great and would avoid a whole lot of pain: buying new drives, creating temporary pools, copying data back and forth. The suggestion came from another user on the list:

A function to rewrite defective, non-reconstructable (file) data could be helpful:

ZFS knows the file in question, it knows how the data was written, and it knows the
offset and length of the bad data and the checksum that data would need to have.

So it should be possible to feed in that file (from a backup) for ZFS to rewrite
the defective part in place (as the data there is currently known-bad), healing the defect.

Gregor
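
As a purely hypothetical sketch of what such an interface might look like (no such subcommand exists in the zfs(8) of the release discussed here; the command name and options below are invented for illustration only):

    # HYPOTHETICAL: heal the bad blocks of one file in-place from a known-good copy
    zfs heal backup/zfs-backup@move_backup_to_4tb_external_sep25_2014 \
        /Installs/clonezilla/live/filesystem.squashfs \
        --from /mnt/restore/filesystem.squashfs

The idea, as in the quoted suggestion, is that ZFS would verify the supplied copy against the checksums it already stores and rewrite only the blocks that currently fail them, so every snapshot ends up pointing at healed data.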

--------------------------------------- Original email from zfs-discuss------------------
One fine day after the update to v0.6.5.8-r0-gentoo, a scrub found a file
with a permanent error (I guess that means it can't correct the blocks in
error) in an old snapshot. The file is unreadable (Input/Output error) in
all snapshots taken after that point, even though they were taken over the
years since.

# zpool status -v
   pool: backup
  state: DEGRADED
status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://zfsonlinux.org/msg/ZFS-8000-8A
   scan: scrub repaired 0 in 9h16m with 1 errors on Sat Dec 10 23:19:10 2016
config:

         NAME                                            STATE     READ WRITE CKSUM
         backup                                          DEGRADED     0     0     1
           raidz2-0                                      DEGRADED     0     0     2
             ata-WDC_WD4001FAEX-00MJRA0_WD-WCCxxxxxxxxx  ONLINE       0     0     0
             ata-ST4000VN000-1H4168_xxxx                 ONLINE       0     0     0
             ata-WDC_WD4001FAEX-00MJRA0_WD-WCCxxxxxxxxx  ONLINE       0     0     0
             /mnt/serviio/4tbFile                        OFFLINE      0     0     0

errors: Permanent errors have been detected in the following files:

backup/zfs-backup@move_backup_to_4tb_external_sep25_2014:/Installs/clonezilla/live/filesystem.squashfs

I tried to read the file in all subsequent snapshots (using the
.zfs/snapshot folder) since Sept 2014, and it's unreadable in all of them.
I can copy the correct file over and take snapshots, and they are all
fine. I can keep deleting snapshots, and the scrub keeps pointing to the
next snapshot.
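
(A minimal sketch of how I check that, assuming the dataset is mounted at /backup/zfs-backup; the file path comes from the zpool status output above:)

    # print each snapshot in which the file cannot be read end-to-end
    for snap in /backup/zfs-backup/.zfs/snapshot/*/; do
        if ! cat "${snap}Installs/clonezilla/live/filesystem.squashfs" > /dev/null 2>&1; then
            echo "unreadable in ${snap}"
        fi
    done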

None of the 3 disks in there show any pending, uncorrectable or
reallocated sectors. Overall disk health is fine, and a scrub has never
failed on them for as many months back as I can remember (and as far back
as zpool history shows).
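
(That statement is based on the usual smartctl attribute dump, run per member disk; the device id below is the masked one from the zpool status output, substitute the real one:)

    # pending, reallocated and offline-uncorrectable sector counters for one member disk
    smartctl -A /dev/disk/by-id/ata-ST4000VN000-1H4168_xxxx | grep -E 'Reallocated|Pending|Offline_Uncorrectable'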

Any ideas? I remember I had to restore from backup (of the backup, in this
case) the last time I ran into this. Is there any other way? It's a pain
to start over.

Also, I want to add a 4TB disk to replace 4tbFile, but I am wondering
whether the resilver will even succeed in this state. I am afraid it will
fail at this snapshot and be a waste of time.

Thanks
-devsk

@ronnyegner
ronnyegner commented Dec 26, 2016 edited

Hi,

you can and should go ahead with the replacement. It will run and succeed if there are no further problems.

In its current state the pool is missing a disk.

There is a chance that the problem will go away after the disk is replaced AND a scrub has been run. I had the same problem some time ago, and the pool healed itself after all missing disks were replaced and a scrub had been run. But it's not guaranteed.

Since the corruption is in a snapshot, you should be able to get rid of it by deleting the snapshot (or all of them). That's all. So: go ahead and replace the disk, then we'll see.
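
A minimal sketch of that replacement step, assuming the new disk appears as ata-ST4000VN000-1H4168_xyz under /dev/disk/by-id (substitute the real id of your drive):

    # replace the offlined file vdev with the new physical disk, then re-verify
    zpool replace backup /mnt/serviio/4tbFile /dev/disk/by-id/ata-ST4000VN000-1H4168_xyz
    zpool scrub backup
    zpool status -v backup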

@devsk
devsk commented Dec 29, 2016

Resilver has completed without additional errors 3 times so far. It keeps resilvering from scratch every time I import the pool.

The resilvering does not make the original single error go away.

I need a way out of this. It's causing a lot of disk churn for a 24-hour period every time.

@devsk
devsk commented Dec 29, 2016

One more thing: deleting all my snapshots (the only way to get rid of this error) is not a good solution for me.
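
(For reference, the drastic option being rejected here would look roughly like this; the second snapshot name in the range is a placeholder:)

    # destroy the snapshot holding the bad blocks (irreversible)
    zfs destroy backup/zfs-backup@move_backup_to_4tb_external_sep25_2014
    # or destroy a whole range of snapshots at once
    zfs destroy backup/zfs-backup@move_backup_to_4tb_external_sep25_2014%some_later_snapshot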

@gmelikov
Member

IIRC a permanent resilver is usually caused by hardware problems (see #4873 for example).

Which disk resilvers? I think there may be something wrong with it.

@devsk
devsk commented Dec 30, 2016

I did mention that the resilver completes successfully, except for the known error with the single file in the older snapshots.

   pool: backup
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://zfsonlinux.org/msg/ZFS-8000-8A
   scan: resilvered 2.25T in 26h24m with 1 errors on Fri Dec 30 01:02:01 2016
 config:

         NAME                                            STATE     READ WRITE CKSUM
         backup                                          ONLINE       0     0     1
           raidz2-0                                      ONLINE       0     0     2
             ata-WDC_WD4001FAEX-00MJRA0_WD-              ONLINE       0     0     0
             ata-ST4000VN000-1H4168_abc                  ONLINE       0     0     0
             ata-WDC_WD4001FAEX-00MJRA0_WD-              ONLINE       0     0     0
             ata-ST4000VN000-1H4168_xyz                  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

   backup/zfs-backup@move_backup_to_4tb_external_sep25_2014:/Installs/clonezilla/live/filesystem.squashfs

The drive that resilvers is the new one I added. It is a pretty new Seagate 4TB NAS drive, model ST4000VN000-1H4168 (shown as xyz above). Nothing wrong shows up in the SMART data. I have used this drive extensively, and it has never failed me so far:

ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 118 100 006 - 184350472
3 Spin_Up_Time PO---- 094 094 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 6
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 072 060 030 - 17840492
9 Power_On_Hours -O--CK 100 100 000 - 144
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 6
184 End-to-End_Error -O--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 100 000 - 0
189 High_Fly_Writes -O-RCK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 064 057 045 - 36 (Min/Max 33/40)
191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
192 Power-Off_Retract_Count -O--CK 100 100 000 - 5
193 Load_Cycle_Count -O--CK 100 100 000 - 6
194 Temperature_Celsius -O---K 036 043 000 - 36 (0 17 0 0 0)
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0

@devsk
devsk commented Dec 30, 2016

Nothing reported in 'dmesg'. It's clean.

@gmelikov
Member

I did mention that the resilver completes successfully

My bad, you're right.

Did you see #591?

The resilvering is triggered when a vdev is offline for some length of time and then the same original vdev is brought back online. ZFS will basically resync the array, updating the missing vdev to reflect any changes which were made while it was offline. It would be interesting to see the contents of zpool events.
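
(Those can be dumped with the plain zpool subcommand, e.g.:)

    # show the detailed event log, then clear it before the next import/resilver
    zpool events -v backup
    zpool events -c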

@kernelOfTruth
Contributor
kernelOfTruth commented Dec 30, 2016 edited

I wonder how stable and production-ready #5316 is, which could prevent lots of spurious I/O in these kinds of situations.

The number one issue is of course to track down the source of the problem and prevent it, or to determine whether there is a hardware issue at the source.

Referencing #5018 ...

@devsk
devsk commented Dec 30, 2016

I think #5018 comes close to what I am seeing, except that the error is in a file in older snapshots.

@kernelOfTruth
Contributor

Additionally, the following comes to mind (hole_birth); not sure, however, whether there could be a connection:

#5239 Rename hole_birth tunable to match OpenZFS
#5004 fix hole_birth issue #4996 and add hole_birth Test Suite
#4754 6513 partially filled holes lose birth time
#4369 6370 ZFS send fails to transmit some holes

I can't totally remember, but I'm sure it was something to do with snapshots, sending of snapshots, and holes ...

@mailinglists35
mailinglists35 commented Dec 30, 2016 edited

Have you tried scrubbing twice? I remember once I had to scrub twice to get an error cleared (scrub, export, import, scrub, export, import). But in my case it was after already deleting the affected file (it gave an error out of the blue, then I recovered the file from backup, but the error persisted after deleting the referenced file; that double scrub was before restoring the file from backup).
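
Spelled out, the sequence I mean is simply (assuming the pool name backup from this issue):

    zpool scrub backup      # first scrub; wait for it to finish
    zpool export backup
    zpool import backup
    zpool scrub backup      # second scrub; wait again
    zpool export backup
    zpool import backup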

@devsk
devsk commented Dec 31, 2016

My pools get scrubbed every Saturday. So, it has been scrubbed many, many times since I hit the issue (I waited before reporting it here, and discussed it on zfs-discuss).
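
(The weekly scrub is just a scheduled job; a sketch of such a cron entry, where the exact schedule and path are illustrative:)

    # /etc/cron.d/zfs-scrub -- scrub the backup pool every Saturday at 02:00
    0 2 * * 6 root /sbin/zpool scrub backup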
