Skip to content

VSR/Grid: Remove invalid zero-padding assert #2681

Merged
sentientwaffle merged 12 commits into
mainfrom
dj-grid-sector-padding-corruption
Jan 29, 2025
Merged

VSR/Grid: Remove invalid zero-padding assert #2681
sentientwaffle merged 12 commits into
mainfrom
dj-grid-sector-padding-corruption

Conversation

@sentientwaffle
Copy link
Copy Markdown
Member

@sentientwaffle sentientwaffle commented Jan 28, 2025

In the grid, after we read a block, we:

  1. verify that the block has a valid header, then
  2. verify that the block has a valid body, then
  3. verify that it is the block we were expecting to find, then
  4. assert that the block's sector padding is zeroed.

The last step can fail due to disk corruption -- sector padding isn't covered by any checksum.

Despite injecting many storage faults the VOPR (previously) couldn't detect this.
This is because when the VOPR injects a storage fault, it corrupts the whole sector -- if the padding is corrupt, then the header or body (or both) is also corrupt.

We missed this bug because we injected "too many" faults!

Accordingly, this PR adds single-bit fault storage injection.


Edit: VOPR found a similar bug in the superblock header padding and journal prepare padding.


And a bug in the superblock copy index handling: f38f292

Each copy of the superblock includes a copy index (0...3) so that we can detect misdirects.

We want the superblock header checksums to match even though the copies are different, so the copy index is not covered by the checksum. We read in a superblock quorum, pick arbitrarily which of the identical headers to load into memory (well, near-identical -- they have distinct copy indexes). Then later when we are about to write our superblock, when we go to compute the superblock's checksum, we assert that (superblock.copy < constants.superblock_copies).

But if the copy of the superblock we loaded was corrupt, this may not be true!

Inject more localized storage corruption to ensure we test things like:

- corruption in sector (e.g. grid block) padding, even if the real data is valid,
- valid `header.checksum` but invalid `header.checksum_body`,
- invalid `header.size`
When `replica_test.zig` injects a corruption (by setting a `storage.faults` bit) it expects that the sector is _definitely_ corrupt, not _maybe_ corrupt.
And we may as well test both behaviors in the VOPR!
In the grid, after we read a block, we:
1. verify that the block has a valid header, then
2. verify that the block has a valid body, then
3. verify that it is the block we were expecting to find, then
4. assert that the block's sector padding is zeroed.

The last step can fail due to disk corruption -- sector padding isn't covered by any checksum.

Despite injecting many storage faults the VOPR couldn't detect this (until the grandparent commit).
This is because when the VOPR injects a storage fault, it corrupts the whole sector -- if the padding is corrupt, then the header or body (or both) is also corrupt.

We missed this bug because we injected "too many" faults!

Seed: ./zig/zig build -Drelease vopr -- --lite 12441136989613272800
Comment thread src/vsr/grid.zig
Comment on lines -1182 to -1184
if (constants.verify) {
assert(stdx.zeroed(block[header.size..vsr.sector_ceil(header.size)]));
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just forcefully zero-this out then, to reduce dimensionality?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I think that relying even on zeroed padding is a bug, so I'd rather not zero the padding, to avoid potentially concealing bugs. I guess we could zero only when !constants.verify... but I'd rather avoid that kind of divergence between test/non-test.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense to go with offensive programming here! Worth capturing the reasoning in a comment here though!

@sentientwaffle
Copy link
Copy Markdown
Member Author

Disabling auto-merge -- the new corruption injection found another bug!

Even though `checksum_padding` is reserved for future use, consider it part of the superblock checksum for the purpose of checksum validation.
Otherwise a bit error within the checksum will be ignored until we eventually assert that two copies of the same header are equal, and find that one of them has an invalid padding value.

Seed: ./zig/zig build -Drelease vopr -- --lite 505264887382715275
@sentientwaffle
Copy link
Copy Markdown
Member Author

And another bug: 010a30b

For this one I did @memset the padding because we assert zeroed padding already in a bunch of places, and both message buses already had @memsets. I'm open to copying the grid's approach of just ignoring padding instead, though.

Unlike the grid, there are already many other assertions around zeroing message padding, and both message buses explicitly zero the padding.

(Also log invalid padding for both grid + journal.)

Seed: ./zig/zig build -Drelease vopr -- --lite 505264887382715275
@sentientwaffle sentientwaffle force-pushed the dj-grid-sector-padding-corruption branch from 010a30b to 858881b Compare January 28, 2025 22:02
@sentientwaffle sentientwaffle force-pushed the dj-grid-sector-padding-corruption branch from 858881b to 00aa759 Compare January 28, 2025 22:10
Seed: ./zig/zig build -Drelease vopr -- 12191166023730368892
@matklad
Copy link
Copy Markdown
Member

matklad commented Jan 29, 2025

For this one I did @Memset the padding because we assert zeroed padding already in a bunch of places, and both message buses already had @memsets.

Hm, yeah, this makes me think that maybe grid approach is not good? If we can't tell right off the bat whether padding to sector-size matters or not, we will write some code where it matters in the future.

Both ignoring and zeroing don't sit right with me.o

But what about this alternative:

If there's an error in the padding we've read from disk, we consider the entire block/prepare to be invalid, even if checksum matches.

Rationale:

  • Simple mental model
  • Removes rare, special code path for rare condition
  • If we did get an error somewhere in sector, it is probably worth it to write again, so flagging block/prepare is corrupted feels desirable here.

Still not sure what's the best approach here, but it definitely must be consistent between the grid and the journal.

To be consistent with `Journal`'s behavior.
@sentientwaffle
Copy link
Copy Markdown
Member Author

If there's an error in the padding we've read from disk, we consider the entire block/prepare to be invalid, even if checksum matches.

🤔 This would simplify the TestStorage changes... we could remove the sector/byte fault distinction and just always do byte faults.

Though, I don't love the fact that this approach doesn't maximize our ability to repair.

Removes rare, special code path for rare condition

Doesn't it just replace it with a new special/rare code path? i.e. instead of if (nonzero padding) { memset(padding, 0) }, we have if (nonzero padding) { message is corrupt }.

@matklad
Copy link
Copy Markdown
Member

matklad commented Jan 29, 2025

Doesn't it just replace it with a new special/rare code path? i.e. instead of if (nonzero padding) { memset(padding, 0) }, we have if (nonzero padding) { message is corrupt }.

This feels a bit less special, than having the padding most of the time being zero, unless there's a corruption.

Though, I don't love the fact that this approach doesn't maximize our ability to repair.

Yeah, my gut feeling is that:

  • byte error in the padding seems a relatively unlikely
  • and if it occurs, it probably is better to repair the entire sector, than to carry on.

Remove whole-sector corruption in favor of just bit corruption. Even if the bit error lands in the prepare/block padding, the prepare/block counts as corrupt, so bit errors now work for the purpose of `replica_test.zig`.
@sentientwaffle sentientwaffle added this pull request to the merge queue Jan 29, 2025
Merged via the queue into main with commit 8ab9934 Jan 29, 2025
@sentientwaffle sentientwaffle deleted the dj-grid-sector-padding-corruption branch January 29, 2025 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants