
VSR: Ping releases #1670

Merged: 6 commits merged into main on Mar 11, 2024

Conversation

sentientwaffle
Member

VSR: Add Replica.releases_bundled

This is presently stubbed out.

  • For real (multiversion) binaries, this will be set to a list of releases compiled into the binary.
  • For testing, we will manually set it to a list of releases (which may change between restarts).

VSR: Ping bundled releases

To determine when the cluster can upgrade, each replica advertises which releases it is able/willing to upgrade to.

Note that pings will include all bundled releases -- newer, older, and currently-running. For the purposes of upgrades, only the newer versions actually matter, but the list isn't that large, so there isn't really a benefit to discarding/filtering the information out.

VSR: Replica.upgrade_targets

Each replica tracks the releases that every other replica is advertising.

Note that only release versions are tracked -- if the protocol version changes, that will necessarily change the release version too.
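
As a rough orientation, here is a minimal Zig sketch of the two pieces of per-replica state described above; all names, types, and the replica count are assumptions for illustration, not the PR's actual definitions.

```zig
const std = @import("std");

const vsr_releases_max = 64;
const ReleaseList = std.BoundedArray(u16, vsr_releases_max);

// Hypothetical shape of one tracked upgrade target.
const UpgradeTarget = struct {
    checkpoint: u64,
    view: u32,
    releases: ReleaseList = .{},
};

const replica_count = 6;

const Replica = struct {
    release: u16,
    // Stubbed for now; a real multiversion binary would list the releases compiled into it.
    releases_bundled: ReleaseList = .{},
    // One slot per replica, filled in from the releases advertised in their pings.
    upgrade_targets: [replica_count]?UpgradeTarget = [_]?UpgradeTarget{null} ** replica_count,
};

test "a replica starts with no known upgrade targets" {
    const replica = Replica{ .release = 1 };
    for (replica.upgrade_targets) |target| try std.testing.expect(target == null);
}
```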

@@ -142,6 +142,7 @@ const ConfigCluster = struct {
lsm_batch_multiple: comptime_int = 32,
lsm_snapshots_max: usize = 32,
lsm_manifest_compact_extra_blocks: comptime_int = 1,
vsr_releases_max: usize = 64,
Member Author

`vsr_releases_max` is a new config constant which defines the maximum number of releases that can be advertised by a replica in a `ping` message. (So this is effectively the maximum number of releases that can be compiled into a multiversion binary.)

I set it to `64` for now, which I chose somewhat arbitrarily: we release on a weekly cadence, there are 52 weeks per year, and rounded up to the nearest power of two that is 64.

Depending on whether (and how well) we can compress multiple releases into a single binary, that might turn out to be too low or too high.
matklad (Member) left a comment

The thing about time feels important, as does the validation of the releases bundled with pings. Everything else is just nits!

src/vsr.zig Outdated
releases[0 .. releases.len - 1],
releases[1..],
) |release_a, release_b| {
assert(release_a > release_b);
Member

This checks for a descending list, right? That is, the opposite of the natural sort order? What's the reason for the deviation?

Member Author

It was to emphasize that the highest versions are higher priorities... in retrospect, probably an unnecessary complication.
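
For comparison, a minimal sketch of the same check written for the natural ascending order, assuming a non-empty, strictly sorted list; the real `vsr.verify_release_list` may differ.

```zig
const std = @import("std");
const assert = std.debug.assert;

// Sketch: releases must be non-empty and strictly ascending (so no duplicates either).
fn verify_release_list_ascending(releases: []const u16) void {
    assert(releases.len > 0);
    for (releases[0 .. releases.len - 1], releases[1..]) |release_a, release_b| {
        assert(release_a < release_b);
    }
}

test "an ascending release list passes verification" {
    verify_release_list_ascending(&.{ 1, 2, 5, 8 });
}
```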

@@ -487,6 +495,7 @@ pub fn ReplicaType(
grid_cache_blocks_count: u32 = Grid.Cache.value_count_max_multiple,
release: u16,
release_client_min: u16,
releases_bundled: vsr.ReleaseList,
Member

Nit, but I've found bounded arrays to be clunky to work with (especially for vsr headers). Given that vsr.ReleaseList doesn't actually enforce validity in its type (we need to call verify_release_list), maybe just pass a slice of u16s here?

That is, I think the two stable design points are:

  • fully enforce semantic validity in types, such that, if the caller has a vsr.ReleaseList, that's already a proof that the list is valid
  • use types to specify structural representation, and rely on assertions for invariant checking. That is, use simple types like u16, []u128, etc., instead of vsr.Release, vsr.Checksum, etc.

Here we are sort of in the middle, where the interface requires a vsr.ReleaseList, but that doesn't actually check most of the invariants itself.
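
As a rough illustration of the two design points, with assumed names and a `std.BoundedArray` standing in for the repository's own bounded array:

```zig
const std = @import("std");

const vsr_releases_max = 64;
const ReleaseList = std.BoundedArray(u16, vsr_releases_max);

// Option 1: the semantic type is required at the interface, even though it only
// partially enforces the invariants (verify_release_list must still be called).
fn count_with_list(releases_bundled: ReleaseList) usize {
    return releases_bundled.len;
}

// Option 2: a plain structural type, with the invariants checked by assertions.
fn count_with_slice(releases_bundled: []const u16) usize {
    std.debug.assert(releases_bundled.len <= vsr_releases_max);
    return releases_bundled.len;
}

test "both interface styles see the same releases" {
    var releases = ReleaseList{};
    try releases.append(1);
    try std.testing.expectEqual(count_with_list(releases), count_with_slice(releases.constSlice()));
}
```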

return "size > limit";
}
if (self.size != @sizeOf(Header) + self.release_count * @sizeOf(u16)) {
return "size != @sizeOf(Header) + release_count * @sizeOf(u16)";
Member

Maybe add a separate check that release_count <= vsr_releases_max, so that we don't have to worry about overflow?

I think the code is correct, but if, e.g., release_count were a u64, then it would be possible to get an overflow here. It seems more robust to check count limits individually before multiplying to check size limits.
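
A self-contained sketch of the suggested ordering; the header size and names here are assumptions for illustration, not the real `Header` layout.

```zig
const std = @import("std");

// Assumed values for illustration; the real header layout differs.
const vsr_releases_max = 64;
const header_size = 128;

// Sketch: bound the count first, so the multiplication in the size check cannot
// overflow even if release_count were widened to a larger integer type later.
fn ping_invalid_reason(size: u32, release_count: u16) ?[]const u8 {
    if (release_count > vsr_releases_max) return "release_count > vsr_releases_max";
    if (size != header_size + @as(usize, release_count) * @sizeOf(u16)) {
        return "size != @sizeOf(Header) + release_count * @sizeOf(u16)";
    }
    return null;
}

test "the count is bounded before it is used in size arithmetic" {
    try std.testing.expect(ping_invalid_reason(header_size + 3 * @sizeOf(u16), 3) == null);
    try std.testing.expect(ping_invalid_reason(header_size, vsr_releases_max + 1) != null);
}
```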

.cluster = self.cluster,
.replica = self.replica,
.view = self.view_durable(), // Don't drop pings while the view is being updated.
.release = self.release,
.checkpoint_id = self.superblock.working.checkpoint_id(),
.checkpoint_op = self.op_checkpoint(),
.ping_timestamp_monotonic = self.clock.monotonic(),
.release_count = @intCast(self.releases_bundled.count()),
Member

Use `self.releases_bundled.count_as(u16)` to move the range check to compile time.

.command = .ping,
.size = @intCast(@sizeOf(Header) + @sizeOf(u16) * self.releases_bundled.count()),
Member

I think we could also use `.count_as(u32)` here to avoid the `@intCast`?
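
For reference, a sketch of the shape such a `count_as` helper could take; this is an assumption about the idea, not the actual `stdx` implementation.

```zig
const std = @import("std");

// Sketch: because the capacity is comptime-known, proving capacity <= maxInt(Int)
// at compile time guarantees that the runtime cast below can never truncate.
fn BoundedCount(comptime capacity: usize) type {
    return struct {
        len: usize = 0,

        fn count_as(self: @This(), comptime Int: type) Int {
            comptime std.debug.assert(capacity <= std.math.maxInt(Int));
            std.debug.assert(self.len <= capacity);
            return @intCast(self.len);
        }
    };
}

test "count_as range-checks via the comptime capacity" {
    const releases = BoundedCount(64){ .len = 3 };
    try std.testing.expectEqual(@as(u16, 3), releases.count_as(u16));
    try std.testing.expectEqual(@as(u32, 3), releases.count_as(u32));
}
```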

@@ -6614,6 +6625,15 @@ pub fn ReplicaType(
self.send_message_to_other_replicas_base(message.base());
}

fn send_message_to_other_replicas_and_standbys(self: *Self, message: *Message) void {
Member

I wonder why send_header functions don't delegate to send_message ones?

Member Author

I think it is due to Message pointers vs MessageType pointers, to keep the caller from needing to do any conversion.


fn invalid_header(self: *const @This()) ?[]const u8 {
assert(self.command == .ping);
if (self.size != @sizeOf(Header)) return "size != @sizeOf(Header)";
if (self.checksum_body != checksum_body_empty) return "checksum_body != expected";
if (self.size <= @sizeOf(Header)) return "size <= @sizeOf(Header)";
Member

I think the code to validate the body of a ping is missing? I.e., checking that the releases are strictly sorted.

Member Author

This function doesn't have access to the message body. We do assert the order in `on_ping`, though.

if (message.header.replica < self.replica_count) {
const upgrade_targets = &self.upgrade_targets[message.header.replica];
if (upgrade_targets.* == null or
upgrade_targets.*.?.timestamp > message.header.ping_timestamp_monotonic)
Member

Huh, I think this is incorrect --- monotonic timestamps are not guaranteed to be monotonic across restarts. The doom scenario here would be a buggy replica advertising a very high monotonic timestamp, then crashing, then restarting with a much smaller monotonic timestamp and not being able to sync versions.

Not sure what the best solution is here; options I have in mind:

  • use checkpoint_op/view for monotonicity (doesn't work for an idle cluster)
  • use op for monotonicity (better than checkpoint, but still doesn't work for an idle cluster)
  • use real time --- that should be stable across restarts, but is not monotonic otherwise
  • just always update the upgrade_target, assuming that all old messages in the network eventually die?

I'd probably go for this combination:

  • if the current target has a higher view or checkpoint, skip the new target
  • otherwise, just replace the target, ignoring the timestamp

I do not feel great about timestamps; a far-future timestamp locking us out of an upgrade feels like a nasty failure mode.
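
A minimal sketch of the suggested replacement rule, with assumed field names and no timestamp involved:

```zig
const std = @import("std");

// Illustrative shape of one tracked upgrade target (the real struct differs).
const Target = struct {
    checkpoint: u64,
    view: u32,
};

// Sketch of the suggested rule: keep the current target only if it is from a later
// view or checkpoint; otherwise replace it outright, ignoring timestamps entirely.
fn should_replace(current: ?Target, incoming: Target) bool {
    const target = current orelse return true;
    if (target.view > incoming.view) return false;
    if (target.checkpoint > incoming.checkpoint) return false;
    return true;
}

test "targets from older views/checkpoints are replaced" {
    const old = Target{ .checkpoint = 10, .view = 3 };
    const new = Target{ .checkpoint = 11, .view = 3 };
    try std.testing.expect(should_replace(null, old));
    try std.testing.expect(should_replace(old, new));
    try std.testing.expect(!should_replace(new, old));
}
```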

.releases = .{},
};

const releases = std.mem.bytesAsSlice(u16, message.body());
Member

Assert that the releases are sorted.

vsr.verify_release_list(releases);
for (releases) |release| {
if (release > self.release) {
upgrade_targets.*.?.releases.append_assume_capacity(release);
Member

append_assume_capacity

I think this can fire? We have a bound for a single message, but different ping messages might have originated from different binaries, so their union might actually exceed what we have here. This is of course an edge case, but there's a related realistic problem: imagine if you run a binary with a new release on one replica, but then retire it in favor of the old binary. The knowledge that that replica has a new release persists until the view-change/checkpoint boundary, although the actual binary does not support that version...

It feels like just replacing the whole set of targets might be safer? But then again, that assumes that the network drops all old messages eventually.

Hm, I think I really don't like the union behavior because it is guaranteed to be wrong: if we have two pings with different release sets, then what we support is one release set or the other release set. When we union the two, we are guaranteed to have an unsupported release in there?

Member Author

We replace the entire list every time; there is no union:

+                    upgrade_targets.* = .{
+                        .checkpoint = message.header.checkpoint_op,
+                        .view = message.header.view,
+                        .releases = .{},
+                    };

The snippet above resets the list to empty before we append anything.
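
A tiny illustrative sketch of that reset-then-append behavior, using `std.BoundedArray` (whose `appendAssumeCapacity` stands in for the repo's `append_assume_capacity`):

```zig
const std = @import("std");

// Illustrative types only; the real upgrade_targets entry differs.
const Targets = struct {
    checkpoint: u64,
    view: u32,
    releases: std.BoundedArray(u16, 64) = .{},
};

test "each ping replaces the release list, so no union accumulates" {
    // First ping advertised release 2.
    var target: Targets = .{ .checkpoint = 1, .view = 1 };
    target.releases.appendAssumeCapacity(2);

    // A later ping resets the whole entry before appending its own releases.
    target = .{ .checkpoint = 2, .view = 1 };
    target.releases.appendAssumeCapacity(3);

    try std.testing.expectEqualSlices(u16, &.{3}, target.releases.constSlice());
}
```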

Member

🤦 gotta grab one more pu-erh cup as I don't seem to be sufficiently woken up, sorry!

matklad previously approved these changes Mar 11, 2024
Message bodies must align to `@sizeOf(u256)`. See: #1518.

For the moment, always set the ping body's size for `vsr_releases_max` releases.
If #1518 is resolved by relaxing this alignment requirement, we can revert this commit (which would allow us to avoid padding ping messages).
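
A small sketch of that sizing rule, using the `vsr_releases_max` value from the diff above (the alignment constant is taken from the commit message):

```zig
const std = @import("std");

const vsr_releases_max = 64;

// Sketch: size the ping body for vsr_releases_max releases regardless of how many
// are actually bundled, so the body stays a multiple of @sizeOf(u256); unused
// trailing slots are zero padding.
fn ping_body_size() usize {
    const body_size = vsr_releases_max * @sizeOf(u16);
    std.debug.assert(body_size % @sizeOf(u256) == 0);
    return body_size;
}

test "the padded ping body is a multiple of @sizeOf(u256)" {
    try std.testing.expectEqual(@as(usize, 128), ping_body_size());
}
```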
@sentientwaffle sentientwaffle added this pull request to the merge queue Mar 11, 2024
Merged via the queue into main with commit c6fe5ef Mar 11, 2024
27 checks passed
@sentientwaffle sentientwaffle deleted the dj-vsr-ping-releases branch March 11, 2024 18:37