VSR/Grid: Remove invalid zero-padding assert #2681
Conversation
Inject more localized storage corruption to ensure we test things like: - corruption in sector (e.g. grid block) padding, even if the real data is valid, - valid `header.checksum` but invalid `header.checksum_body`, - invalid `header.size`
When `replica_test.zig` injects a corruption (by setting a `storage.faults` bit) it expects that the sector is _definitely_ corrupt, not _maybe_ corrupt. And we may as well test both behaviors in the VOPR!
In the grid, after we read a block, we: 1. verify that the block has a valid header, then 2. verify that the block has a valid body, then 3. verify that it is the block we were expecting to find, then 4. assert that the block's sector padding is zeroed. The last step can fail due to disk corruption -- sector padding isn't covered by any checksum. Despite injecting many storage faults the VOPR couldn't detect this (until the grandparent commit). This is because when the VOPR injects a storage fault, it corrupts the whole sector -- if the padding is corrupt, then the header or body (or both) is also corrupt. We missed this bug because we injected "too many" faults! Seed: ./zig/zig build -Drelease vopr -- --lite 12441136989613272800
| if (constants.verify) { | ||
| assert(stdx.zeroed(block[header.size..vsr.sector_ceil(header.size)])); | ||
| } |
There was a problem hiding this comment.
Should we just forcefully zero-this out then, to reduce dimensionality?
There was a problem hiding this comment.
🤔 I think that relying even on zeroed padding is a bug, so I'd rather not zero the padding, to avoid potentially concealing bugs. I guess we could zero only when !constants.verify... but I'd rather avoid that kind of divergence between test/non-test.
There was a problem hiding this comment.
Ok, makes sense to go with offensive programming here! Worth capturing the reasoning in a comment here though!
|
Disabling auto-merge -- the new corruption injection found another bug! |
Even though `checksum_padding` is reserved for future use, consider it part of the superblock checksum for the purpose of checksum validation. Otherwise a bit error within the checksum will be ignored until we eventually assert that two copies of the same header are equal, and find that one of them has an invalid padding value. Seed: ./zig/zig build -Drelease vopr -- --lite 505264887382715275
|
And another bug: 010a30b For this one I did |
Unlike the grid, there are already many other assertions around zeroing message padding, and both message buses explicitly zero the padding. (Also log invalid padding for both grid + journal.) Seed: ./zig/zig build -Drelease vopr -- --lite 505264887382715275
010a30b to
858881b
Compare
858881b to
00aa759
Compare
Seed: ./zig/zig build -Drelease vopr -- 12191166023730368892
00aa759 to
f38f292
Compare
Hm, yeah, this makes me think that maybe grid approach is not good? If we can't tell right off the bat whether padding to sector-size matters or not, we will write some code where it matters in the future. Both ignoring and zeroing don't sit right with me.o But what about this alternative: If there's an error in the padding we've read from disk, we consider the entire block/prepare to be invalid, even if checksum matches. Rationale:
Still not sure what's the best approach here, but it definitely must be consistent between the grid and the journal. |
To be consistent with `Journal`'s behavior.
🤔 This would simplify the Though, I don't love the fact that this approach doesn't maximize our ability to repair.
Doesn't it just replace it with a new special/rare code path? i.e. instead of |
This feels a bit less special, than having the padding most of the time being zero, unless there's a corruption.
Yeah, my gut feeling is that:
|
Remove whole-sector corruption in favor of just bit corruption. Even if the bit error lands in the prepare/block padding, the prepare/block counts as corrupt, so bit errors now work for the purpose of `replica_test.zig`.
In the grid, after we read a block, we:
The last step can fail due to disk corruption -- sector padding isn't covered by any checksum.
Despite injecting many storage faults the VOPR (previously) couldn't detect this.
This is because when the VOPR injects a storage fault, it corrupts the whole sector -- if the padding is corrupt, then the header or body (or both) is also corrupt.
We missed this bug because we injected "too many" faults!
Accordingly, this PR adds single-bit fault storage injection.
Edit: VOPR found a similar bug in the superblock header padding and journal prepare padding.
And a bug in the superblock
copyindex handling: f38f292Each copy of the superblock includes a copy index (0...3) so that we can detect misdirects.
We want the superblock header checksums to match even though the copies are different, so the copy index is not covered by the checksum. We read in a superblock quorum, pick arbitrarily which of the identical headers to load into memory (well, near-identical -- they have distinct copy indexes). Then later when we are about to write our superblock, when we go to compute the superblock's checksum, we assert that (
superblock.copy < constants.superblock_copies).But if the copy of the superblock we loaded was corrupt, this may not be true!