-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vsr: store free set in the grid #1309
Conversation
e148fd3
to
88c9160
Compare
@@ -334,6 +335,65 @@ pub const TableData = struct { | |||
} | |||
}; | |||
|
|||
pub const FreeSetNode = struct { | |||
pub const Word = u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seemingly duplicates FreeSet.Word
, but I think that's false sharing --- there's no fudnamental reason when in-memory and on-disk words must match. We require it in the current implementation, and indeed, in encoded free set, we assert that the two are the same.
src/lsm/schema.zig
Outdated
_ = metadata(free_set_block); | ||
} | ||
|
||
pub fn next(free_set_block: BlockPtrConst) ?struct { checksum: u128, address: u64 } { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add vsr.BlockReference
as it is used in a couple of different places. But that's for a follow up.
@@ -85,29 +78,28 @@ pub const SuperBlockHeader = extern struct { | |||
/// The checksum of the previous superblock to hash chain across sequence numbers. | |||
parent: u128, | |||
|
|||
/// The checksum over the actual encoded block free set in the superblock trailer. | |||
free_set_checksum: u128, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is not strictly necessary, I think I'll add back the checksum covering the entire free set, as it can protect from encoding/decoding bugs. I think I'll leave it to a follow up though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic idea to verify the roundtrip.
|
||
/// The last operation committed to the state machine. At startup, replay the log hereafter. | ||
commit_min: u64, | ||
|
||
// Storage size in bytes. | ||
storage_size: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storage size is now part of checkpoint state, as it is no longer a function of the rest of the superblock, and we want to know storage size even without having access to the grid.
client_sessions: ClientSessions, | ||
|
||
/// Updated when: | ||
/// - the trailer is serialized immediately before checkpoint or sync, and | ||
/// - used to construct the trailer during state sync. | ||
free_set_buffer: []align(constants.sector_size) u8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This buffer is effectively moved inside FreeSetEncoded.
src/vsr/superblock.zig
Outdated
} else if (context.caller == .checkpoint) { | ||
superblock.free_set.checkpoint(); | ||
} | ||
} else if (context.caller == .checkpoint) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The free set is now delicately check-pointed in the middle of free-set-encoded checkpointing process.
src/vsr/superblock_free_set.zig
Outdated
@@ -578,12 +651,15 @@ test "FreeSet checkpoint" { | |||
const expectEqual = std.testing.expectEqual; | |||
const blocks_count = FreeSet.shard_bits; | |||
var set = try FreeSet.init(std.testing.allocator, blocks_count); | |||
set.opened = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is tempting to say that the free set is opened after init (as it indeed is in a valid state where no blocks are allocated), but I'd rather not do it -- while it is a valid state, if we actually do have a non-empty free set on disk, using that valid state can lose data!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 critical.
2841ad1
to
9391f03
Compare
9391f03
to
d15d22f
Compare
pub fn assert_valid_header(free_set_block: BlockPtrConst) void { | ||
_ = metadata(free_set_block); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this pattern!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main idea here is calling assert_valid_header
when we write a data block, not only during reading.
@@ -63,6 +63,7 @@ const CommitStage = enum { | |||
checkpoint_state_machine, | |||
checkpoint_client_replies, | |||
checkpoint_grid_repair, | |||
checkpoint_free_set, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could checkpoint_free_set
precede checkpoint_grid_repair
? That way, if a block happens to be repairing during checkpoint_free_set
, we might avoid a pause for checkpoint_grid_repair
by repairing concurrent to the freeset checkpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, I tried that, an then got a weird error in vopr, and then realized that this doesn't work, at least without some extra legwork.
checkpoint_free_set
is what "buffer-flips" the free set right now. So, if we run grid-repair after checkpointing the free set, it'll see the new checkpoint and won't know which blocks are released.
What we can do is change the logic of repair checkpoint to assume that it happens after buffer flip and use is_free rather than is_released:
In some sense, this even feels clearer, as we are waiting the writes to the free blocks, which is easier to see how it could lead to troubles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll do it, but in a follow up. One thing I don't like is how subtle it is when the data structures "flip" to the next checkpoint, maybe I find some better way to assert that...
self.superblock.free_set_encoded.open( | ||
&self.grid, | ||
self.superblock.working.free_set_reference(), | ||
free_set_open_callback, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat off-topic, but I think we are inconsistent about whether a callback is the first or last parameter. (This is probably my fault 😅 -- I reflexively put callbacks last, but I think most uses in the project put it first. e.g. Grid.write_block()
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, yeah I went back and forth here as well. I also have “context first, lambdas last” rule hard-wired in my brain, but it isn’t really applicable to zig, since there are no lambdas…
@jorangreef any tiger style insights about the order of the callback argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm so glad you both raise this... because... I also reflexively put callbacks last in my mind. 😁
"context first, lambdas last" makes sense also because we tend to "order according to logical use" and callbacks are the last thing that happens.
Beyond this, if you have a situation where you also have interim event callbacks that get emitted multiple times (e.g. as part of a streaming API) in addition to "done" callbacks, then putting the callback last follows the progression better.
So... let's do it, and start to move things over as we go.
3a73d95
to
a2ea2d6
Compare
assert(@sizeOf(CheckpointState) == 224); | ||
assert(@sizeOf(CheckpointState) % @sizeOf(u256) == 0); | ||
assert(@sizeOf(CheckpointState) == 256); | ||
assert(@sizeOf(CheckpointState) % @sizeOf(u128) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 Stepping back, what is the purpose for this assertion? I changed it without thinking as part of the pseudo-256-bit-checksum change -- i.e., we are pretending that manifest_oldest_checksum + manifest_oldest_checksum_padding
is a u256
. (In the future it will be, but for the time being we don't need it and Zig doesn't support it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is was unintentional change, but I think this is correct: I think we want to align to u128, as many things are u128, and we want to align to 16 bytes, as that might help with SIMD?
Basically, we want most things to be aligned "nicely", and 16 seems like a reasonable bound for niceness
src/vsr/superblock.zig
Outdated
snapshots_block_checksum: u128, | ||
snapshots_block_checksum_padding: u128 = 0, | ||
free_set_last_block_checksum: u128, | ||
free_set_last_block_checksum_padding: u128 = 0, | ||
|
||
manifest_oldest_address: u64, | ||
manifest_newest_address: u64, | ||
snapshots_block_address: u64, | ||
free_set_last_block_address: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very much a nit, but maybe "free set" could precede "manifest" in this order, since it is accessed sooner, and is "lower level" (grid rather than LSM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 that's actually exactly the reason why I put FreeSetNode before ManifestNode in the scheme!
6177e4f
to
f8eae38
Compare
None of our fuzzers actually checks that. In theory, the simulator can discover inconsistencies when a replica restarts (decoding free set from disk resets the index), but, given that the shard size is 4096, this is quite unlikely. With this change, injecting a bug in index computation gets discovered by free set fuzzer at least. See #1309 (comment) for an example of a bug this assert guards against.
None of our fuzzers actually checks that. In theory, the simulator can discover inconsistencies when a replica restarts (decoding free set from disk resets the index), but, given that the shard size is 4096, this is quite unlikely. With this change, injecting a bug in index computation gets discovered by free set fuzzer at least. See #1309 (comment) for an example of a bug this assert guards against.
5fc91b3
to
a796347
Compare
Goal: get rid of storage_size_max Currently, free set is stored as a part of superblock in a trailer. As a result, the worst-case size of the free set affects the layout of the data file. As the worst-case size is proprtional to the size of the grid, this requires limiting the size of the data file up-front, using `storage_size_max` parameter. This is bad operator experience -- it's hard to predict which size would be enough, and failure to allocate enough would be very costly to fix down the line. The solution to make free set elastic and store it in the grid. ## Overview FreeSetEncoded is the persistent component of the free set. It defines defines the layout of the free set as stored in the grid between checkpoints. Free set is stored as a linked list of blocks containing EWAH-encoding of a bitset of allocated blocks. The length of the linked list is proportional to the degree of fragmentation, rather that to size of the data file. The common case is a single block. The blocks holding free set itself are marked as free in the on-disk encoding, because the number of blocks required to store the compressed bitset becomes known only after encoding. Linked list is a FIFO. While the blocks are written in the direct order, they have to be read in the reverse order. ## Implementation notes - Separate structs to hold persistent and in-memory components of the free set, as the shape of data is quite different between the two. - FreeSet Trailer and associated messages are gone. - `storage_size_max` is _not_ gone. While it's the entire purpose of the work, the actual removal of the parameter can happen separately. - FreeSetEncoded field is stored as a field in the superblock alongside the free set. This isn't great, as it needs access to the grid, but the superblock is above the grid. As a result, the implementation has a bunch of fairly awkward paths like `set.grid.superblock.free_set`. I want to move both free set and free set encoded to be just a part of the grid, but that's for a follow up to not mix code motion with logic changes. - I noticed that we intCast the encoded bit set size to `u32`. I wonder if we should maybe just store bitset lenght as u64? With u32 we can support a merely 16PiB data file. - Testing this is awkward. In normal use, the free set size just doesn't go beyond a single block, so the entire linked list code is effectively unused. To cover this, `fragment_free_set` is added to forest fuzz which manually injects fragmentation into the free set to make sure it occupies more than one block
a796347
to
c64b54c
Compare
set.size_transferred += chunk.size; | ||
|
||
if (schema.FreeSetNode.previous(block)) |previous| { | ||
assert(set.block_index > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert that previous.{address,checksum}
is not equal to the last_block_address
/last_block_checksum
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have last_block_address
, but I've added an assert that all read elements are distinct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh,
λ git show origin/matklad/free-set-in-the-grid
commit 80bcce2fd1a06fddd2008649b1441566f4c133ed (HEAD -> matklad/free-set-in-the-grid, origin/matklad/free-set-in-the-grid)
Author: Alex Kladov <aleksey.kladov@gmail.com>
Date: Mon Nov 27 19:51:02 2023 +0000
freeset: assert no duplicate blocks in open
did I broke GitHub by racing with the merge queue? If this doesn't get in here, will sneak this into some of the follow ups!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about that, it seems like it queued despite a new commit after approval.
closing in favor of #1328 |
Goal: get rid of storage_size_max
Currently, free set is stored as a part of superblock in a trailer. As a result, the worst-case size of the free set affects the layout of the data file. As the worst-case size is proprtional to the size of the grid, this requires limiting the size of the data file up-front, using
storage_size_max
parameter. This is bad operator experience -- it's hard to predict which size would be enough, and failure to allocate enough would be very costly to fix down the line.The solution to make free set elastic and store it in the grid.
Overview
FreeSetEncoded is the persistent component of the free set. It defines defines the layout of the free set as stored in the grid between checkpoints.
Free set is stored as a linked list of blocks containing EWAH-encoding of a bitset of allocated blocks. The length of the linked list is proportional to the degree of fragmentation, rather that to size of the data file. The common case is a single block.
The blocks holding free set itself are marked as free in the on-disk encoding, because the number of blocks required to store the compressed bitset becomes known only after encoding.
Linked list is a FIFO. While the blocks are written in the direct order, they have to be read in the reverse order.
Implementation notes
Separate structs to hold persistent and in-memory components of the free set, as the shape of data is quite different between the two.
FreeSet Trailer and associated messages are gone.
storage_size_max
is not gone. While it's the entire purpose of the work, the actual removal of the parameter can happen separately.FreeSetEncoded field is stored as a field in the superblock alongside the free set. This isn't great, as it needs access to the grid, but the superblock is above the grid. As a result, the implementation has a bunch of fairly awkward paths like
set.grid.superblock.free_set
.I want to move both free set and free set encoded to be just a part of the grid, but that's for a follow up to not mix code motion with logic changes.
I noticed that we
@intCast
the encoded bit set size tou32
. I wonder if we should maybe just store bitset lenght as u64? With u32 we can support a merely 16PiB data file.Testing this is awkward. In normal use, the free set size just doesn't
go beyond a single block, so the entire linked list code is
effectively unused. To cover this,
fragment_free_set
is added toforest fuzz which manually injects fragmentation into the free set to
make sure it occupies more than one block