
VSR: Ping releases #1670

Merged: 6 commits merged into main on Mar 11, 2024

Conversation

sentientwaffle
Member

VSR: Add Replica.releases_bundled

This is presently stubbed out.

  • For real (multiversion) binaries, this will be set to a list of releases compiled into the binary.
  • For testing, we will manually set it to a list of releases (which may change between restarts).

VSR: Ping bundled releases

To determine when the cluster can upgrade, each replica advertises which releases it is able/willing to upgrade to.

Note that pings will include all bundled releases -- newer, older, and currently-running. For the purposes of upgrades, only the newer versions actually matter, but the list isn't that large, so there isn't really a benefit to discarding/filtering the information out.

VSR: Replica.upgrade_targets

Each replica tracks the releases that every other replica is advertising.

Note that only release versions are tracked -- if the protocol version changes, that will necessarily change the release version too.
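
As a rough orientation, here is a minimal Zig sketch of the two pieces of per-replica state described above; all names, types, and the replica count are assumptions for illustration, not the PR's actual definitions.

```zig
const std = @import("std");

const vsr_releases_max = 64;
const ReleaseList = std.BoundedArray(u16, vsr_releases_max);

// Hypothetical shape of one tracked upgrade target.
const UpgradeTarget = struct {
    checkpoint: u64,
    view: u32,
    releases: ReleaseList = .{},
};

const replica_count = 6;

const Replica = struct {
    release: u16,
    // Stubbed for now; a real multiversion binary would list the releases compiled into it.
    releases_bundled: ReleaseList = .{},
    // One slot per replica, filled in from the releases advertised in their pings.
    upgrade_targets: [replica_count]?UpgradeTarget = [_]?UpgradeTarget{null} ** replica_count,
};

test "a replica starts with no known upgrade targets" {
    const replica = Replica{ .release = 1 };
    for (replica.upgrade_targets) |target| try std.testing.expect(target == null);
}
```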

@@ -142,6 +142,7 @@ const ConfigCluster = struct {
lsm_batch_multiple: comptime_int = 32,
lsm_snapshots_max: usize = 32,
lsm_manifest_compact_extra_blocks: comptime_int = 1,
vsr_releases_max: usize = 64,
Member Author

`vsr_releases_max` is a new config constant which defines the maximum number of releases that can be advertised by a replica in a `ping` message. (So this is effectively the maximum number of releases that can be compiled into a multiversion binary.)

I set it to `64` for now, which I chose somewhat arbitrarily: we release on a weekly cadence, there are 52 weeks per year, and rounded up to the nearest power of two that is 64.

Depending on whether (and how well) we can compress multiple releases into a single binary, that might turn out to be too low or too high.
matklad (Member) left a comment

The thing about time feels important, as does the validation of the releases bundled with pings. Everything else is just nits!

src/vsr.zig Outdated
releases[0 .. releases.len - 1],
releases[1..],
) |release_a, release_b| {
assert(release_a > release_b);
Member

This checks for a descending list, right? That is, the opposite of the natural sort order? What's the reason for the deviation?

Member Author

It was to emphasize that the highest versions are higher priorities... in retrospect, probably an unnecessary complication.
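
For comparison, a minimal sketch of the same check written for the natural ascending order, assuming a non-empty, strictly sorted list; the real `vsr.verify_release_list` may differ.

```zig
const std = @import("std");
const assert = std.debug.assert;

// Sketch: releases must be non-empty and strictly ascending (so no duplicates either).
fn verify_release_list_ascending(releases: []const u16) void {
    assert(releases.len > 0);
    for (releases[0 .. releases.len - 1], releases[1..]) |release_a, release_b| {
        assert(release_a < release_b);
    }
}

test "an ascending release list passes verification" {
    verify_release_list_ascending(&.{ 1, 2, 5, 8 });
}
```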

@@ -487,6 +495,7 @@ pub fn ReplicaType(
grid_cache_blocks_count: u32 = Grid.Cache.value_count_max_multiple,
release: u16,
release_client_min: u16,
releases_bundled: vsr.ReleaseList,
Member

Nit, but I've found bounded arrays to be clunky to work with (especially for vsr headers). Given that vsr.ReleaseList doesn't actually enforce validity in its type (we need to call verify_release_list), maybe just pass a slice of u16s here?

That is, I think the two stable design points are:

  • fully enforce semantic validity in types, such that, if the caller has a vsr.ReleaseList, that's already a proof that the list is valid
  • use types to specify structural representation, and rely on assertions for invariant checking. That is, use simple types like u16, []u128, etc., instead of vsr.Release, vsr.Checksum, etc.

Here we are sort of in the middle, where the interface requires a vsr.ReleaseList, but that doesn't actually check most of the invariants itself.
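
As a rough illustration of the two design points, with assumed names and a `std.BoundedArray` standing in for the repository's own bounded array:

```zig
const std = @import("std");

const vsr_releases_max = 64;
const ReleaseList = std.BoundedArray(u16, vsr_releases_max);

// Option 1: the semantic type is required at the interface, even though it only
// partially enforces the invariants (verify_release_list must still be called).
fn count_with_list(releases_bundled: ReleaseList) usize {
    return releases_bundled.len;
}

// Option 2: a plain structural type, with the invariants checked by assertions.
fn count_with_slice(releases_bundled: []const u16) usize {
    std.debug.assert(releases_bundled.len <= vsr_releases_max);
    return releases_bundled.len;
}

test "both interface styles see the same releases" {
    var releases = ReleaseList{};
    try releases.append(1);
    try std.testing.expectEqual(count_with_list(releases), count_with_slice(releases.constSlice()));
}
```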

return "size > limit";
}
if (self.size != @sizeOf(Header) + self.release_count * @sizeOf(u16)) {
return "size != @sizeOf(Header) + release_count * @sizeOf(u16)";
Member

Maybe add a separate check that release_count <= vsr_releases_max, so that we don't have to worry about overflow?

I think the code is correct, but if, e.g., release_count were a u64, then it would be possible to get an overflow here. It seems more robust to check count limits individually before multiplying to check size limits.
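
A self-contained sketch of the suggested ordering; the header size and names here are assumptions for illustration, not the real `Header` layout.

```zig
const std = @import("std");

// Assumed values for illustration; the real header layout differs.
const vsr_releases_max = 64;
const header_size = 128;

// Sketch: bound the count first, so the multiplication in the size check cannot
// overflow even if release_count were widened to a larger integer type later.
fn ping_invalid_reason(size: u32, release_count: u16) ?[]const u8 {
    if (release_count > vsr_releases_max) return "release_count > vsr_releases_max";
    if (size != header_size + @as(usize, release_count) * @sizeOf(u16)) {
        return "size != @sizeOf(Header) + release_count * @sizeOf(u16)";
    }
    return null;
}

test "the count is bounded before it is used in size arithmetic" {
    try std.testing.expect(ping_invalid_reason(header_size + 3 * @sizeOf(u16), 3) == null);
    try std.testing.expect(ping_invalid_reason(header_size, vsr_releases_max + 1) != null);
}
```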

.cluster = self.cluster,
.replica = self.replica,
.view = self.view_durable(), // Don't drop pings while the view is being updated.
.release = self.release,
.checkpoint_id = self.superblock.working.checkpoint_id(),
.checkpoint_op = self.op_checkpoint(),
.ping_timestamp_monotonic = self.clock.monotonic(),
.release_count = @intCast(self.releases_bundled.count()),
Member

Use `self.releases_bundled.count_as(u16)` to move the range check to compile time.

.command = .ping,
.size = @intCast(@sizeOf(Header) + @sizeOf(u16) * self.releases_bundled.count()),
Member

I think we could also use `.count_as(u32)` here to avoid the `@intCast`?
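
For reference, a sketch of the shape such a `count_as` helper could take; this is an assumption about the idea, not the actual `stdx` implementation.

```zig
const std = @import("std");

// Sketch: because the capacity is comptime-known, proving capacity <= maxInt(Int)
// at compile time guarantees that the runtime cast below can never truncate.
fn BoundedCount(comptime capacity: usize) type {
    return struct {
        len: usize = 0,

        fn count_as(self: @This(), comptime Int: type) Int {
            comptime std.debug.assert(capacity <= std.math.maxInt(Int));
            std.debug.assert(self.len <= capacity);
            return @intCast(self.len);
        }
    };
}

test "count_as range-checks via the comptime capacity" {
    const releases = BoundedCount(64){ .len = 3 };
    try std.testing.expectEqual(@as(u16, 3), releases.count_as(u16));
    try std.testing.expectEqual(@as(u32, 3), releases.count_as(u32));
}
```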

@@ -6614,6 +6625,15 @@ pub fn ReplicaType(
self.send_message_to_other_replicas_base(message.base());
}

fn send_message_to_other_replicas_and_standbys(self: *Self, message: *Message) void {
Member

I wonder why send_header functions don't delegate to send_message ones?

Member Author

I think it is due to Message pointers vs MessageType pointers, to keep the caller from needing to do any conversion.


fn invalid_header(self: *const @This()) ?[]const u8 {
assert(self.command == .ping);
if (self.size != @sizeOf(Header)) return "size != @sizeOf(Header)";
if (self.checksum_body != checksum_body_empty) return "checksum_body != expected";
if (self.size <= @sizeOf(Header)) return "size <= @sizeOf(Header)";
Member

I think the code to validate the body of a ping is missing? I.e., checking that the releases are strictly sorted.

Member Author

This function doesn't have access to the message body. We do assert the order in `on_ping`, though.

if (message.header.replica < self.replica_count) {
const upgrade_targets = &self.upgrade_targets[message.header.replica];
if (upgrade_targets.* == null or
upgrade_targets.*.?.timestamp > message.header.ping_timestamp_monotonic)
Member

Huh, I think this is incorrect --- monotonic timestamps are not guaranteed to be monotonic across restarts. The doom scenario here would be a buggy replica advertising a very high monotonic timestamp, then crashing, then restarting with a much smaller monotonic timestamp and not being able to sync versions.

Not sure what the best solution is here; options I have in mind:

  • use checkpoint_op/view for monotonicity (doesn't work for an idle cluster)
  • use op for monotonicity (better than checkpoint, but still doesn't work for an idle cluster)
  • use real time --- that should be stable across restarts, but is not monotonic otherwise
  • just always update the upgrade_target, assuming that all old messages in the network eventually die?

I'd probably go for this combination:

  • if the current target has a higher view or checkpoint, skip the new target
  • otherwise, just replace the target, ignoring the timestamp

I do not feel great about timestamps; a far-future timestamp locking us out of an upgrade feels like a nasty failure mode.
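
A minimal sketch of the suggested replacement rule, with assumed field names and no timestamp involved:

```zig
const std = @import("std");

// Illustrative shape of one tracked upgrade target (the real struct differs).
const Target = struct {
    checkpoint: u64,
    view: u32,
};

// Sketch of the suggested rule: keep the current target only if it is from a later
// view or checkpoint; otherwise replace it outright, ignoring timestamps entirely.
fn should_replace(current: ?Target, incoming: Target) bool {
    const target = current orelse return true;
    if (target.view > incoming.view) return false;
    if (target.checkpoint > incoming.checkpoint) return false;
    return true;
}

test "targets from older views/checkpoints are replaced" {
    const old = Target{ .checkpoint = 10, .view = 3 };
    const new = Target{ .checkpoint = 11, .view = 3 };
    try std.testing.expect(should_replace(null, old));
    try std.testing.expect(should_replace(old, new));
    try std.testing.expect(!should_replace(new, old));
}
```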

.releases = .{},
};

const releases = std.mem.bytesAsSlice(u16, message.body());
Member

Assert that the releases are sorted.

vsr.verify_release_list(releases);
for (releases) |release| {
if (release > self.release) {
upgrade_targets.*.?.releases.append_assume_capacity(release);
Member

append_assume_capacity

I think this can fire? We have a bound for a single message, but different ping messages might have originated from different binaries, so their union might actually exceed what we have here. This is of course an edge case, but there's a related realistic problem: imagine if you run a binary with a new release on one replica, but then retire it in favor of the old binary. The knowledge that that replica has a new release persists until the view-change/checkpoint boundary, although the actual binary does not support that version...

It feels like just replacing the whole set of targets might be safer? But then again, that assumes that the network drops all old messages eventually.

Hm, I think I really don't like the union behavior because it is guaranteed to be wrong: if we have two pings with different release sets, then what we support is one release set or the other release set. When we union the two, we are guaranteed to have an unsupported release in there?

Member Author

We replace the entire list every time; there is no union:

+                    upgrade_targets.* = .{
+                        .checkpoint = message.header.checkpoint_op,
+                        .view = message.header.view,
+                        .releases = .{},
+                    };

The snippet above resets the list to empty before we append anything.
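
A tiny illustrative sketch of that reset-then-append behavior, using `std.BoundedArray` (whose `appendAssumeCapacity` stands in for the repo's `append_assume_capacity`):

```zig
const std = @import("std");

// Illustrative types only; the real upgrade_targets entry differs.
const Targets = struct {
    checkpoint: u64,
    view: u32,
    releases: std.BoundedArray(u16, 64) = .{},
};

test "each ping replaces the release list, so no union accumulates" {
    // First ping advertised release 2.
    var target: Targets = .{ .checkpoint = 1, .view = 1 };
    target.releases.appendAssumeCapacity(2);

    // A later ping resets the whole entry before appending its own releases.
    target = .{ .checkpoint = 2, .view = 1 };
    target.releases.appendAssumeCapacity(3);

    try std.testing.expectEqualSlices(u16, &.{3}, target.releases.constSlice());
}
```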

Member

🤦 gotta grab one more pu-erh cup as I don't seem to be sufficiently woken up, sorry!

matklad previously approved these changes Mar 11, 2024
Message bodies must align to `@sizeOf(u256)`. See: #1518.

For the moment, always set the ping body's size for `vsr_releases_max` releases.
If #1518 is resolved by relaxing this alignment requirement, we can revert this commit (which would allow us to avoid padding ping messages).
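
A small sketch of that sizing rule, using the `vsr_releases_max` value from the diff above (the alignment constant is taken from the commit message):

```zig
const std = @import("std");

const vsr_releases_max = 64;

// Sketch: size the ping body for vsr_releases_max releases regardless of how many
// are actually bundled, so the body stays a multiple of @sizeOf(u256); unused
// trailing slots are zero padding.
fn ping_body_size() usize {
    const body_size = vsr_releases_max * @sizeOf(u16);
    std.debug.assert(body_size % @sizeOf(u256) == 0);
    return body_size;
}

test "the padded ping body is a multiple of @sizeOf(u256)" {
    try std.testing.expectEqual(@as(usize, 128), ping_body_size());
}
```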
@sentientwaffle sentientwaffle added this pull request to the merge queue Mar 11, 2024
Merged via the queue into main with commit c6fe5ef Mar 11, 2024
27 checks passed
@sentientwaffle sentientwaffle deleted the dj-vsr-ping-releases branch March 11, 2024 18:37