Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vsr: store free set in the grid #1309

Closed
wants to merge 2 commits into from
Closed

Conversation

matklad
Copy link
Member

@matklad matklad commented Nov 20, 2023

Goal: get rid of storage_size_max

Currently, free set is stored as a part of superblock in a trailer. As a result, the worst-case size of the free set affects the layout of the data file. As the worst-case size is proprtional to the size of the grid, this requires limiting the size of the data file up-front, using storage_size_max parameter. This is bad operator experience -- it's hard to predict which size would be enough, and failure to allocate enough would be very costly to fix down the line.

The solution to make free set elastic and store it in the grid.

Overview

FreeSetEncoded is the persistent component of the free set. It defines defines the layout of the free set as stored in the grid between checkpoints.

Free set is stored as a linked list of blocks containing EWAH-encoding of a bitset of allocated blocks. The length of the linked list is proportional to the degree of fragmentation, rather that to size of the data file. The common case is a single block.

The blocks holding free set itself are marked as free in the on-disk encoding, because the number of blocks required to store the compressed bitset becomes known only after encoding.

Linked list is a FIFO. While the blocks are written in the direct order, they have to be read in the reverse order.

Implementation notes

  • Separate structs to hold persistent and in-memory components of the free set, as the shape of data is quite different between the two.

  • FreeSet Trailer and associated messages are gone.

  • storage_size_max is not gone. While it's the entire purpose of the work, the actual removal of the parameter can happen separately.

  • FreeSetEncoded field is stored as a field in the superblock alongside the free set. This isn't great, as it needs access to the grid, but the superblock is above the grid. As a result, the implementation has a bunch of fairly awkward paths like set.grid.superblock.free_set.

    I want to move both free set and free set encoded to be just a part of the grid, but that's for a follow up to not mix code motion with logic changes.

  • I noticed that we @intCast the encoded bit set size to u32. I wonder if we should maybe just store bitset lenght as u64? With u32 we can support a merely 16PiB data file.

  • Testing this is awkward. In normal use, the free set size just doesn't
    go beyond a single block, so the entire linked list code is
    effectively unused. To cover this, fragment_free_set is added to
    forest fuzz which manually injects fragmentation into the free set to
    make sure it occupies more than one block

@matklad matklad force-pushed the matklad/free-set-in-the-grid branch 7 times, most recently from e148fd3 to 88c9160 Compare November 20, 2023 20:44
@@ -334,6 +335,65 @@ pub const TableData = struct {
}
};

pub const FreeSetNode = struct {
pub const Word = u64;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemingly duplicates FreeSet.Word, but I think that's false sharing --- there's no fudnamental reason when in-memory and on-disk words must match. We require it in the current implementation, and indeed, in encoded free set, we assert that the two are the same.

_ = metadata(free_set_block);
}

pub fn next(free_set_block: BlockPtrConst) ?struct { checksum: u128, address: u64 } {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add vsr.BlockReference as it is used in a couple of different places. But that's for a follow up.

@@ -85,29 +78,28 @@ pub const SuperBlockHeader = extern struct {
/// The checksum of the previous superblock to hash chain across sequence numbers.
parent: u128,

/// The checksum over the actual encoded block free set in the superblock trailer.
free_set_checksum: u128,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is not strictly necessary, I think I'll add back the checksum covering the entire free set, as it can protect from encoding/decoding bugs. I think I'll leave it to a follow up though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic idea to verify the roundtrip.


/// The last operation committed to the state machine. At startup, replay the log hereafter.
commit_min: u64,

// Storage size in bytes.
storage_size: u64,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storage size is now part of checkpoint state, as it is no longer a function of the rest of the superblock, and we want to know storage size even without having access to the grid.

client_sessions: ClientSessions,

/// Updated when:
/// - the trailer is serialized immediately before checkpoint or sync, and
/// - used to construct the trailer during state sync.
free_set_buffer: []align(constants.sector_size) u8,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This buffer is effectively moved inside FreeSetEncoded.

} else if (context.caller == .checkpoint) {
superblock.free_set.checkpoint();
}
} else if (context.caller == .checkpoint) {}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The free set is now delicately check-pointed in the middle of free-set-encoded checkpointing process.

@@ -578,12 +651,15 @@ test "FreeSet checkpoint" {
const expectEqual = std.testing.expectEqual;
const blocks_count = FreeSet.shard_bits;
var set = try FreeSet.init(std.testing.allocator, blocks_count);
set.opened = true;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is tempting to say that the free set is opened after init (as it indeed is in a valid state where no blocks are allocated), but I'd rather not do it -- while it is a valid state, if we actually do have a non-empty free set on disk, using that valid state can lose data!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 critical.

@matklad matklad force-pushed the matklad/free-set-in-the-grid branch 11 times, most recently from 2841ad1 to 9391f03 Compare November 21, 2023 16:26
@matklad matklad marked this pull request as ready for review November 21, 2023 16:26
src/lsm/schema.zig Show resolved Hide resolved
Comment on lines +371 to +382
pub fn assert_valid_header(free_set_block: BlockPtrConst) void {
_ = metadata(free_set_block);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this pattern!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main idea here is calling assert_valid_header when we write a data block, not only during reading.

src/lsm/schema.zig Outdated Show resolved Hide resolved
@@ -63,6 +63,7 @@ const CommitStage = enum {
checkpoint_state_machine,
checkpoint_client_replies,
checkpoint_grid_repair,
checkpoint_free_set,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could checkpoint_free_set precede checkpoint_grid_repair? That way, if a block happens to be repairing during checkpoint_free_set, we might avoid a pause for checkpoint_grid_repair by repairing concurrent to the freeset checkpoint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, I tried that, an then got a weird error in vopr, and then realized that this doesn't work, at least without some extra legwork.

checkpoint_free_set is what "buffer-flips" the free set right now. So, if we run grid-repair after checkpointing the free set, it'll see the new checkpoint and won't know which blocks are released.

What we can do is change the logic of repair checkpoint to assume that it happens after buffer flip and use is_free rather than is_released:

b06bf4a

In some sense, this even feels clearer, as we are waiting the writes to the free blocks, which is easier to see how it could lead to troubles.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll do it, but in a follow up. One thing I don't like is how subtle it is when the data structures "flip" to the next checkpoint, maybe I find some better way to assert that...

Comment on lines +665 to +669
self.superblock.free_set_encoded.open(
&self.grid,
self.superblock.working.free_set_reference(),
free_set_open_callback,
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat off-topic, but I think we are inconsistent about whether a callback is the first or last parameter. (This is probably my fault 😅 -- I reflexively put callbacks last, but I think most uses in the project put it first. e.g. Grid.write_block()).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, yeah I went back and forth here as well. I also have “context first, lambdas last” rule hard-wired in my brain, but it isn’t really applicable to zig, since there are no lambdas…

@jorangreef any tiger style insights about the order of the callback argument?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so glad you both raise this... because... I also reflexively put callbacks last in my mind. 😁

"context first, lambdas last" makes sense also because we tend to "order according to logical use" and callbacks are the last thing that happens.

Beyond this, if you have a situation where you also have interim event callbacks that get emitted multiple times (e.g. as part of a streaming API) in addition to "done" callbacks, then putting the callback last follows the progression better.

So... let's do it, and start to move things over as we go.

src/vsr/superblock_free_set_encoded.zig Outdated Show resolved Hide resolved
src/vsr/superblock_free_set_encoded.zig Outdated Show resolved Hide resolved
src/vsr/replica.zig Show resolved Hide resolved
src/vsr/superblock_free_set_encoded.zig Outdated Show resolved Hide resolved
src/vsr/superblock.zig Outdated Show resolved Hide resolved
@matklad matklad force-pushed the matklad/free-set-in-the-grid branch 4 times, most recently from 3a73d95 to a2ea2d6 Compare November 22, 2023 19:26
matklad added a commit that referenced this pull request Nov 22, 2023
assert(@sizeOf(CheckpointState) == 224);
assert(@sizeOf(CheckpointState) % @sizeOf(u256) == 0);
assert(@sizeOf(CheckpointState) == 256);
assert(@sizeOf(CheckpointState) % @sizeOf(u128) == 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Stepping back, what is the purpose for this assertion? I changed it without thinking as part of the pseudo-256-bit-checksum change -- i.e., we are pretending that manifest_oldest_checksum + manifest_oldest_checksum_padding is a u256. (In the future it will be, but for the time being we don't need it and Zig doesn't support it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is was unintentional change, but I think this is correct: I think we want to align to u128, as many things are u128, and we want to align to 16 bytes, as that might help with SIMD?

Basically, we want most things to be aligned "nicely", and 16 seems like a reasonable bound for niceness

Comment on lines 317 to 325
snapshots_block_checksum: u128,
snapshots_block_checksum_padding: u128 = 0,
free_set_last_block_checksum: u128,
free_set_last_block_checksum_padding: u128 = 0,

manifest_oldest_address: u64,
manifest_newest_address: u64,
snapshots_block_address: u64,
free_set_last_block_address: u64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much a nit, but maybe "free set" could precede "manifest" in this order, since it is accessed sooner, and is "lower level" (grid rather than LSM).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 that's actually exactly the reason why I put FreeSetNode before ManifestNode in the scheme!

matklad added a commit that referenced this pull request Nov 23, 2023
None of our fuzzers actually checks that. In theory, the simulator can
discover inconsistencies when a replica restarts (decoding free set from
disk resets the index), but, given that the shard size is 4096, this is
quite unlikely.

With this change, injecting a bug in index computation gets discovered
by free set fuzzer at least.

See
#1309 (comment)
for an example of a bug this assert guards against.
matklad added a commit that referenced this pull request Nov 23, 2023
None of our fuzzers actually checks that. In theory, the simulator can
discover inconsistencies when a replica restarts (decoding free set from
disk resets the index), but, given that the shard size is 4096, this is
quite unlikely.

With this change, injecting a bug in index computation gets discovered
by free set fuzzer at least.

See
#1309 (comment)
for an example of a bug this assert guards against.
@matklad matklad force-pushed the matklad/free-set-in-the-grid branch 4 times, most recently from 5fc91b3 to a796347 Compare November 23, 2023 14:39
Goal: get rid of storage_size_max

Currently, free set is stored as a part of superblock in a trailer. As a
result, the worst-case size of the free set affects the layout of the
data file. As the worst-case size is proprtional to the size of the
grid, this requires limiting the size of the data file up-front, using
`storage_size_max` parameter. This is bad operator experience -- it's
hard to predict which size would be enough, and failure to allocate
enough would be very costly to fix down the line.

The solution to make free set elastic and store it in the grid.

 ## Overview

FreeSetEncoded is the persistent component of the free set. It defines
defines the layout of the free set as stored in the grid between
checkpoints.

Free set is stored as a linked list of blocks containing EWAH-encoding
of a bitset of allocated blocks. The length of the linked list is
proportional to the degree of fragmentation, rather that to  size of the
data file. The common case is a single block.

The blocks holding free set itself are marked as free in the on-disk
encoding, because the number of blocks required to store the compressed
bitset becomes known only after encoding.

Linked list is a FIFO. While the blocks are written in the direct order,
they have to be read in the reverse order.

 ## Implementation notes

- Separate structs to hold persistent and in-memory components of the
  free set, as the shape of data is quite different between the two.

- FreeSet Trailer and associated messages are gone.

- `storage_size_max` is _not_ gone. While it's the entire purpose of the
  work, the actual removal of the parameter can happen separately.

- FreeSetEncoded field is stored as a field in the superblock alongside
  the free set. This isn't great, as it needs access to the grid, but
  the superblock is above the grid. As a result, the implementation has
  a bunch of fairly awkward paths like `set.grid.superblock.free_set`.

  I want to move both free set and free set encoded to be just a part of
  the grid, but that's for a follow up to not mix code motion with logic
  changes.

- I noticed that we intCast the encoded bit set size to `u32`. I wonder
  if we should maybe just store bitset lenght as u64? With u32 we can
  support a merely 16PiB data file.

- Testing this is awkward. In normal use, the free set size just doesn't
  go beyond a single block, so the entire linked list code is
  effectively unused. To cover this, `fragment_free_set` is added to
  forest fuzz which manually injects fragmentation into the free set to
  make sure it occupies more than one block
set.size_transferred += chunk.size;

if (schema.FreeSetNode.previous(block)) |previous| {
assert(set.block_index > 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert that previous.{address,checksum} is not equal to the last_block_address/last_block_checksum.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have last_block_address, but I've added an assert that all read elements are distinct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh,

image

λ git show origin/matklad/free-set-in-the-grid 
commit 80bcce2fd1a06fddd2008649b1441566f4c133ed (HEAD -> matklad/free-set-in-the-grid, origin/matklad/free-set-in-the-grid)
Author: Alex Kladov <aleksey.kladov@gmail.com>
Date:   Mon Nov 27 19:51:02 2023 +0000

    freeset: assert no duplicate blocks in open

did I broke GitHub by racing with the merge queue? If this doesn't get in here, will sneak this into some of the follow ups!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about that, it seems like it queued despite a new commit after approval.

@matklad matklad added this pull request to the merge queue Nov 27, 2023
@matklad matklad removed this pull request from the merge queue due to a manual request Nov 27, 2023
@matklad
Copy link
Member Author

matklad commented Nov 27, 2023

closing in favor of #1328

@matklad matklad closed this Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants