SPLError: 864:0:(zfs_vfsops.c:347:zfs_space_delta_cb()) SPL PANIC #1303
The offending VERIFY was recently introduced by a94addd, which was a backported fix from Illumos; see https://www.illumos.org/issues/3208. @ahrens any thoughts on this? Was there a follow-up fix we perhaps missed?
@behlendorf look at those numbers in binary:
They are suspiciously similar.
@behlendorf I haven't seen this before, and there was no follow-on fix. Can you get a crash dump and examine the rest of the sa structure to see if it seems to be correct aside from the sa_magic? Or use zdb to examine the on-disk state and see if it still has an invalid sa_magic?
@maxximino How exactly are those numbers (0x2f505a and 0x4ea0cca2) similar? They have no (nonzero) bytes in common.
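For anyone who wants to check the byte comparison above, here is a minimal sketch in Python. It assumes both values are treated as 32-bit big-endian quantities for display purposes; the helper names are made up for this illustration and are not part of ZFS.

```python
# Compare the expected SA magic with the value quoted in this thread.
# Assumption: both are viewed as 32-bit values; helper names are hypothetical.
EXPECTED_SA_MAGIC = 0x2F505A
OBSERVED_SA_MAGIC = 0x4EA0CCA2


def nonzero_bytes(value, width=4):
    """Return the set of nonzero byte values that make up `value`."""
    return {b for b in value.to_bytes(width, "big") if b != 0}


def shared_nonzero_bytes(a, b):
    """Return the nonzero byte values that appear in both numbers."""
    return nonzero_bytes(a) & nonzero_bytes(b)


def differing_bits(a, b):
    """Count the bit positions in which the two numbers differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    for name, value in (("expected", EXPECTED_SA_MAGIC),
                        ("observed", OBSERVED_SA_MAGIC)):
        print(f"{name}: 0x{value:08x}  {value:032b}")
    print("shared nonzero bytes:",
          shared_nonzero_bytes(EXPECTED_SA_MAGIC, OBSERVED_SA_MAGIC))
    print("differing bits:",
          differing_bits(EXPECTED_SA_MAGIC, OBSERVED_SA_MAGIC))
```

Printing both values side by side in binary makes it easier to judge whether any apparent similarity is more than coincidence.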
I'm sorry, behlendorf, maxximo, ahrens.
@bziller Thanks for following up and letting us know.
@bziller I highly doubt that this was caused by L2ARC compression. Buffer corruption in L2ARC traversal would have been caught by the L2ARC checksum, because we compress after checksumming and verify after decompression; see lines 4811 and 4853 in jimmyH/zfs@c8d5f5a for the compression side and lines 4442 and 4448 for the decompression side.
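The ordering described above (checksum first, then compress; decompress first, then verify) can be illustrated with a small Python sketch. This is not the L2ARC code: `hashlib.sha256` and `zlib` are stand-ins chosen only because they ship with Python, and the function names are hypothetical.

```python
import hashlib
import zlib


def write_buffer(data):
    """Checksum the uncompressed data, then compress it for storage."""
    checksum = hashlib.sha256(data).digest()  # taken BEFORE compression
    stored = zlib.compress(data)
    return stored, checksum


def read_buffer(stored, checksum):
    """Decompress, then verify against the checksum of the original data."""
    data = zlib.decompress(stored)
    if hashlib.sha256(data).digest() != checksum:  # verified AFTER decompression
        raise ValueError("checksum mismatch after decompression")
    return data


if __name__ == "__main__":
    payload = b"sa header and attributes " * 8
    stored, checksum = write_buffer(payload)
    assert read_buffer(stored, checksum) == payload

    # Data that was already wrong before the checksum was computed still
    # round-trips cleanly, because the checksum covers the wrong bytes.
    garbage = b"\x00" + payload[1:]
    stored, checksum = write_buffer(garbage)
    assert read_buffer(stored, checksum) == garbage
```

With this ordering, damage introduced during compression, storage, or decompression shows up as a mismatch on read. Data that was corrupted before it was checksummed is not detected, since the checksum faithfully describes the bad bytes.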
@skiselkov You're right, because I just got an SPL panic again with a plain rc14 install. Sorry again :(
The spl-log and more messages ('task blocked') are at http://wwwlehre.dhbw-stuttgart.de/~bziller/spl-log.1361863592.29875.gz and http://wwwlehre.dhbw-stuttgart.de/~bziller/messages.1361863592.29875.gz
I wrote a few posts over the course of thinking about this, so I decided to consolidate my thoughts into a single post and delete any ideas that further thinking revealed to be obsolete.
@bziller This looks like a duplicate of openzfs/spl#157, which was determined to have been caused by memory corruption. Does your system use ECC memory? If not, my guess is that a corrupt pointer caused nonsense to be copied into the sa_magic field, which was then written to disk with a good checksum. You might be able to identify the bad file by watching rsync's behavior: it likely freezes on a given file, which would be the bad one.
@ahrens @behlendorf It might be best to handle this differently than we do now. If the magic number is invalid, we could invoke the failmode behavior. We could also implement a flag to
@ryao I think you are proposing two changes to the scrub code:
@ryao The system has ECC. A 24h memtest86+ run didn't find any bad memory. But basically this error means that I have bad data with a good checksum on disk?
It's interesting that in each reported instance of this bug, sa.sa_magic seems to contain a timestamp in seconds since the epoch. See also https://groups.google.com/a/zfsonlinux.org/forum/?fromgroups=#!topic/zfs-discuss/Vp02RDQ2GjI
I wonder if this is related to #1240. In that bug, we found that sa_find_sizes() might incorrectly indicate that a spill block won't be needed. Accordingly, the variable
On second thought, I guess that theory doesn't hold up, since we panicked in that case. Also, there's an SA_SET_HDR further down that would rewrite the header.
I am hitting this too, also while doing an rsync. Apr 5 21:33:54 csi kernel: [ 301.262394] VERIFY3(sa.sa_magic ==
I just received this error during a mass delete: Oct 6 02:42:02 master kernel: VERIFY3(sa.sa_magic == 0x2F505A) failed (1379523424 == 3100762) Any idea why sa.sa_magic is equal to a Unix timestamp?
Found these messages a couple of days ago on my machine, a stable Gentoo amd64 system using zfs 0.6.2-1: Nov 30 19:02:58 sher kernel: [85427.669302] VERIFY3(sa.sa_magic == 0x2F505A) failed (1381750087 == 3100762) That seems related to this issue.
The system has "normal", non-ECC memory and didn't panic after the messages. I can't reproduce the problem: I scrubbed the pool after the message without rebooting, and scrubbed again after a reboot. The result in both cases was "No known data errors", with no new errors in the system logs.
The messages appeared just once, and the number "1381750087" is again a plausible timestamp, as the command "date --date='@1381750087'" reveals: Mon Oct 14 13:28:07 CEST 2013. Note that the timestamps of the VERIFY messages correspond to one of my system logins, while the asserted value (1381750087, Mon Oct 14 13:28:07 CEST 2013) coincides with the usual time of my system logouts. Could the bad blocks, if any, have been in one of my home profile files? Any ideas?
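To make the timestamp observation easy to reproduce, the following Python sketch decodes the sa_magic values quoted in this thread as seconds since the epoch. The list of values is copied from the comments above; interpreting them as timestamps is the hypothesis under discussion, not an established fact about the on-disk format.

```python
from datetime import datetime, timezone

# The magic value that the VERIFY3 expects (0x2F505A == 3100762).
SA_MAGIC = 0x2F505A

# Values that showed up in place of sa_magic, as quoted in this thread.
REPORTED_VALUES = [0x4EA0CCA2, 1379523424, 1381750087]


def as_utc(seconds):
    """Interpret an integer as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(f"expected sa_magic: 0x{SA_MAGIC:X} == {SA_MAGIC}")
    for value in REPORTED_VALUES:
        print(f"{value:>10} (0x{value:08X}) -> {as_utc(value).isoformat()}")
```

The output is in UTC, so the last value appears two hours earlier than the CEST time printed by `date` above.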
Possibly related to #1890.
This is believed to be resolved. If anyone observes an instance of a 'zfs_space_delta_cb()) SPL PANIC' in 0.6.3 or newer, please open a new issue.
This system is running rc14 with only 3.5 GB of RAM. This happened during an rsync.
The log is at http://wwwlehre.dhbw-stuttgart.de/~bziller/spl-log.1361200575.864.gz