VSR: Ping releases #1670
Conversation
@@ -142,6 +142,7 @@ const ConfigCluster = struct {
    lsm_batch_multiple: comptime_int = 32,
    lsm_snapshots_max: usize = 32,
    lsm_manifest_compact_extra_blocks: comptime_int = 1,
    vsr_releases_max: usize = 64,
`vsr_releases_max` is a new config/constant which defines the maximum number of releases that can be advertised by a replica in a `ping` message. (So this is effectively the maximum number of releases that can be compiled into a multiversion binary.) I set it to `64` right now, which I chose somewhat arbitrarily: we release on a weekly cadence, there are 52 weeks/year, and rounding up to the nearest power of two gives 64. Depending on whether (and how well) we can compress multiple releases into a single binary, it might be too low or too high.
To determine when the cluster can upgrade, each replica advertises which releases it is able/willing to upgrade to. Note that pings will include all bundled releases -- newer, older, and currently-running. For the purposes of upgrades, only the newer versions actually matter, but the list isn't that large, so there isn't really a benefit to discarding/filtering the information out.
Each replica tracks the releases that every other replica is advertising. Note that only release versions are tracked -- if the protocol version changes, that will necessarily change the release version too.
Force-pushed f99631a to 5add306
The thing about time feels important, as well as the validation of releases bundled with pings. Everything else is just nits!
src/vsr.zig
Outdated
releases[0 .. releases.len - 1],
releases[1..],
) |release_a, release_b| {
    assert(release_a > release_b);
This checks for a descending list, right? That is, the opposite of natural sort order? What's the reason for the deviation?
It was to emphasize that the highest versions are higher priorities... in retrospect, probably an unnecessary complication.
src/vsr/replica.zig
Outdated
@@ -487,6 +495,7 @@ pub fn ReplicaType(
    grid_cache_blocks_count: u32 = Grid.Cache.value_count_max_multiple,
    release: u16,
    release_client_min: u16,
    releases_bundled: vsr.ReleaseList,
Nit, but I've found bounded arrays to be clunky to work with (especially for vsr headers). Given that `vsr.ReleaseList` doesn't actually enforce validity in types (we need to call `verify_release_list`), maybe just pass a slice of u16s here?

That is, I think the two stable design points are:
- Fully enforce semantic validity in types, such that, if the caller has a `vsr.ReleaseList`, that's already a proof that the list is valid.
- Use types to specify structural representation, and rely on assertions for invariant checking. That is, use simple types like `u16`, `[]u128` etc., instead of `vsr.Release`, `vsr.Checksum` etc.

Here we are sort of in the middle, where the interface requires a `vsr.ReleaseList`, but that doesn't actually check most of the invariants itself.
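To make the contrast concrete, the two field declarations might look like this (a sketch; the slice-based variant is hypothetical, not code from this PR):

```zig
// (a) Semantic type: ideally, holding a vsr.ReleaseList is itself
//     proof that the list is valid (bounded, sorted, deduplicated).
releases_bundled: vsr.ReleaseList,

// (b) Structural type (hypothetical): a plain slice, with validity
//     enforced by assertions at the boundary rather than by the type.
releases_bundled: []const u16, // Caller must run vsr.verify_release_list().
```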
src/vsr/message_header.zig
Outdated
    return "size > limit";
}
if (self.size != @sizeOf(Header) + self.release_count * @sizeOf(u16)) {
    return "size != @sizeOf(Header) + release_count * @sizeOf(u16)";
Maybe add a separate check that `release_count <= vsr_releases_max`, so that we don't have to worry about overflow?

I think the code is correct, but if, e.g., `release_count` were a `u64`, then it would be possible to get an overflow here. It seems more robust to check count limits individually before multiplying to check size limits.
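A sketch of the suggested ordering (the exact constant path is assumed):

```zig
// Check the count bound first, so that the multiplication in the size
// check below cannot overflow, regardless of release_count's width.
if (self.release_count > constants.vsr_releases_max) {
    return "release_count > vsr_releases_max";
}
if (self.size != @sizeOf(Header) + self.release_count * @sizeOf(u16)) {
    return "size != @sizeOf(Header) + release_count * @sizeOf(u16)";
}
```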
src/vsr/replica.zig
Outdated
.cluster = self.cluster,
.replica = self.replica,
.view = self.view_durable(), // Don't drop pings while the view is being updated.
.release = self.release,
.checkpoint_id = self.superblock.working.checkpoint_id(),
.checkpoint_op = self.op_checkpoint(),
.ping_timestamp_monotonic = self.clock.monotonic(),
.release_count = @intCast(self.releases_bundled.count()),
Use `self.releases_bundled.count_as(u16)` to move the range check to compile time.
src/vsr/replica.zig
Outdated
.command = .ping,
.size = @intCast(@sizeOf(Header) + @sizeOf(u16) * self.releases_bundled.count()),
I think we could also use `.count_as(u32)` here to avoid the `@intCast`?
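For instance (a sketch; the `count_as` signature is assumed from the earlier comment):

```zig
// Hypothetical: count_as returns the count as the requested integer type,
// with the range check against the array's capacity done at compile time,
// so no runtime @intCast is needed.
.size = @sizeOf(Header) + @sizeOf(u16) * self.releases_bundled.count_as(u32),
```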
@@ -6614,6 +6625,15 @@ pub fn ReplicaType(
    self.send_message_to_other_replicas_base(message.base());
}

fn send_message_to_other_replicas_and_standbys(self: *Self, message: *Message) void {
I wonder why the `send_header` functions don't delegate to the `send_message` ones?
I think it is due to `Message` pointers vs `MessageType` pointers, to keep the caller from needing to do any conversion.
src/vsr/message_header.zig
Outdated
fn invalid_header(self: *const @This()) ?[]const u8 {
    assert(self.command == .ping);
    if (self.size != @sizeOf(Header)) return "size != @sizeOf(Header)";
    if (self.checksum_body != checksum_body_empty) return "checksum_body != expected";
    if (self.size <= @sizeOf(Header)) return "size <= @sizeOf(Header)";
I think the code to validate the body of a ping is missing? I.e., checking that the releases are strictly sorted.
This function doesn't have access to the message body. We do assert the order in `on_ping`, though.
src/vsr/replica.zig
Outdated
if (message.header.replica < self.replica_count) {
    const upgrade_targets = &self.upgrade_targets[message.header.replica];
    if (upgrade_targets.* == null or
        upgrade_targets.*.?.timestamp > message.header.ping_timestamp_monotonic)
Huh, I think this is incorrect --- monotonic timestamps are not guaranteed to be monotonic across restarts. The doom scenario here would be a buggy replica advertising a very high monotonic timestamp, then crashing, then restarting with a much smaller monotonic timestamp and not being able to sync versions.

Not sure what the best solution here is; options I have in mind:
- use checkpoint_op/view for monotonicity (doesn't work for an idle cluster)
- use op for monotonicity (better than checkpoint, but still doesn't work for an idle cluster)
- use real time --- that should be stable across restarts, but is not monotonic otherwise
- just always update the upgrade_target, making the assumption that all old messages in the network eventually die?

I'd probably go for this combination:
- if the current target has a higher view or checkpoint, skip the new target
- otherwise, just replace the target, ignoring the timestamp

I do not feel great about timestamps; a far-future timestamp locking us out of an upgrade feels like a nasty failure mode.
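The suggested combination might look roughly like this (a sketch only; field names are assumed from the surrounding diff):

```zig
const target = &self.upgrade_targets[message.header.replica];
// Skip the new target if the current one is strictly newer by view or
// checkpoint; otherwise replace it outright, ignoring timestamps.
if (target.*) |current| {
    if (current.view > message.header.view or
        current.checkpoint > message.header.checkpoint_op) return;
}
target.* = .{
    .checkpoint = message.header.checkpoint_op,
    .view = message.header.view,
    .releases = .{},
};
```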
src/vsr/replica.zig
Outdated
    .releases = .{},
};

const releases = std.mem.bytesAsSlice(u16, message.body());
Assert that the releases are sorted.
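A sketch of such an assertion, mirroring the pairwise check from `src/vsr.zig` shown earlier in this thread (the descending order there is assumed to be current):

```zig
// Assert that the advertised releases arrive in the expected strict
// order, so a malformed ping cannot corrupt the upgrade targets.
for (releases[0 .. releases.len - 1], releases[1..]) |release_a, release_b| {
    assert(release_a > release_b); // Strictly descending: sorted, no duplicates.
}
```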
Force-pushed d14fb69 to 135ba32
vsr.verify_release_list(releases);
for (releases) |release| {
    if (release > self.release) {
        upgrade_targets.*.?.releases.append_assume_capacity(release);
I think this `append_assume_capacity` can fire? We have a bound for a single message, but different ping messages might have originated from different binaries, so their union might actually exceed what we have here. This is of course an edge case, but there's a related realistic problem: imagine you run a binary with a new release on one replica, but then retire it in favor of the old binary. The knowledge that that replica has a new release persists until the view-change/checkpoint boundary, although the actual binary does not support that version...

It feels like just replacing the whole set of targets might be safer? But then again, that assumes that the network drops all old messages eventually.

Hm, I think I really don't like the union behavior because it is guaranteed to be wrong: if we have two pings with different release sets, then what we support is one release set or the other. When we union the two, we are guaranteed to have an unsupported release in there.
We replace the entire list every time, there is no union:
+ upgrade_targets.* = .{
+ .checkpoint = message.header.checkpoint_op,
+ .view = message.header.view,
+ .releases = .{},
+ };
above resets the list to empty before we append anything.
🤦 gotta grab one more pu-erh cup as I don't seem to be sufficiently woken up, sorry!
VSR: Add Replica.releases_bundled
This is presently stubbed out.
- For real (multiversion) binaries, this will be set to a list of releases compiled into the binary.
- For testing, we will manually set it to a list of releases (which may change between restarts).
VSR: Ping bundled releases
To determine when the cluster can upgrade, each replica advertises which releases it is able/willing to upgrade to.
Note that pings will include all bundled releases -- newer, older, and currently-running. For the purposes of upgrades, only the newer versions actually matter, but the list isn't that large, so there isn't really a benefit to discarding/filtering the information out.
VSR: Replica.upgrade_targets
Each replica tracks the releases that every other replica is advertising.
Note that only release versions are tracked -- if the protocol version changes, that will necessarily change the release version too.