multiversioning: bring up multiversion binaries and build process for linux #2013
base: main
Conversation
Incomplete review, but posting what I have so far!
The PR description (Overview/Building/Monitoring/etc) is a great candidate for a document in e.g. docs/internals/upgrades.md!
```zig
// When slicing into the binary:
// checksum(section[past_offset..past_offset+past_size]) == past_checksum.
// This is then validated when the binary is written to a memfd or similar.
// TODO: Might be nicer as an AoS? It's control plane state.
```
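The slicing invariant in the comment above can be sketched as follows. This is a hedged illustration in C rather than the project's Zig: the `past_version` struct and the `checksum` function (a stand-in FNV-1a, not TigerBeetle's actual checksum) are assumptions made only to show the shape of the validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in checksum (FNV-1a). TigerBeetle uses its own checksum function;
 * this placeholder only illustrates the validation shape. */
static uint64_t checksum(const uint8_t *data, size_t size) {
    uint64_t hash = 14695981039346656037ULL;
    for (size_t i = 0; i < size; i++) {
        hash ^= data[i];
        hash *= 1099511628211ULL;
    }
    return hash;
}

/* Hypothetical per-version record mirroring the fields named in the comment:
 * offset and size into the pack section, plus the recorded checksum. */
struct past_version {
    uint64_t past_offset;
    uint64_t past_size;
    uint64_t past_checksum;
};

/* checksum(section[past_offset..past_offset+past_size]) == past_checksum,
 * checked before the slice would be written to a memfd or similar. */
static int validate_slice(const uint8_t *section, size_t section_size,
                          const struct past_version *v) {
    if (v->past_offset > section_size ||
        v->past_size > section_size - v->past_offset) return 0;
    return checksum(section + v->past_offset, v->past_size) == v->past_checksum;
}
```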
Checking my understanding: We could change this MultiversionMetadata schema later, and to transition, the multi-version build step would need to decode the old version and encode the new version -- at which point we would not need the old version again.
e.g. I imagine that we will soon want to compress the binaries.
Maybe there should be a schema version number?
> Checking my understanding: We could change this MultiversionMetadata schema later, and to transition, the multi-version build step would need to decode the old version and encode the new version -- at which point we would not need the old version again.
Yes, that's correct. Much in the same way the epoch is currently a special case.
The other consideration is that this struct is read by past versions of TigerBeetle. We have two escape hatches there:

- If we really need to break compatibility, the upgrade process becomes "upgrade the binary, restart each replica".
- It's always the latest version that does the reading and executing; otherwise the information is only used for advertising. So, if we had to add compression, we wouldn't need to change anything in a breaking way here.
As long as the top level checksums (`checksum_header`, `checksum_binary_without_header`), and `past.count`, `past.versions` and `past.visits` are readable, the old version will correctly advertise the new version, which can then do anything with the rest of the fields (or additional ones) as it needs. This makes it quite flexible!

I'm thinking of dropping the explicit `reserved` and checking for `elf_section_header.sh_size != @sizeOf(MultiversionHeader)` instead (the checksum would differ, so we can't quite do that), because we can add fields freely if we need with this approach.
re: upgrade instructions -- we should recommend that the source and the destination binaries already be on the same filesystem before the
(Out of scope for this PR, but...)
Alternatively, tigerbeetle could handle that step itself... e.g. It could also handle:
...But I don't know if it is worth it.
src/multiversioning.zig (Outdated)
```zig
    err: anyerror,
} = .init,

callback: ?Callback = null,
```
`callback` doesn't seem to be used for much... just:

```zig
fn on_read_from_binary_statx(self: *Multiversion, result: anyerror!void) void {
    self.start_timeout();
    _ = result catch return;
}
```

Imo it would be simpler to remove `callback` and just hardcode `self.start_timeout()`.
Also, I think that would mean you could remove the `self.start_timeout` from `fn start()`, though maybe that messes up the timing.
Good idea. With `timeout_start_enabled` I've gone the route of disabling timeouts at the beginning of `start()`, and explicitly starting them when it's finished, to ensure the timing is correct.
Yep, we should - we should also recommend moving the old binary aside rather than straight rm'ing it, just in case. Although that's only an extra layer of protection - the code fully expects and handles the binary being written, truncated, etc. from under it.
```zig
self.start_timeout();

    return self.handle_error(e);
};
```
Should we also verify that the executable is in fact executable (according to the file mode bits)?
We already handle that correctly via `handle_error` after opening the file. And checking after stat isn't ideal anyway, due to the TOCTOU.
> We already handle that correctly via handle_error after opening the file
We do? 😄
Currently, we will exec into a new binary even if it's not +x, due to the memfd copying dance. But if TigerBeetle crashes or the operator wants to restart it manually, this is a problem indeed.

I'm going to add a mode check to the stat; there is technically a TOCTOU, yes, but I think it's better than not erroring out if the mode isn't set. In regular operation there's no TOCTOU (from the logic described above); it would only arise if the following were to happen:
- Operator drops in new binary, with +x
- TB reads it in, updates everything
- Operator runs chmod -x tigerbeetle
- TB crashes / is manually stopped
The case I'm more concerned about is them dropping in a binary, by accident, that's not marked executable.
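The proposed mode check could look roughly like this. A hedged C sketch only: the real check lives in the project's Zig code, and which execute bits to require (owner only vs. any) is an assumption of this illustration.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Sketch: after stat'ing the candidate binary, reject it unless it is a
 * regular file with at least one execute bit set. The exact set of bits
 * required is an assumption made for illustration. */
static bool mode_is_executable(const struct stat *st) {
    return S_ISREG(st->st_mode) &&
           (st->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) != 0;
}
```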
Overview
This PR adds the platform specific support for multiversion binaries on Linux!
The idea behind multiversion binaries is to give operators a great experience when upgrading TigerBeetle clusters. Upgrades should be simple, involve minimal downtime and be robust, while not requiring external coordination. Multiple versions in a single binary are required for two reasons:
The upgrade instructions look something like:
When the primary determines enough replicas have the new binary, it'll coordinate the upgrade (#1670). There are three main parts to multiversion binaries: building, monitoring and executing, with platform specific parts in each.
Building
Physically, multiversion binaries are regular TigerBeetle ELF files that have two extra sections embedded into them - marked as noload so that they're not memory mapped:

- `.tbmvm` or TigerBeetleMultiVersionMetadata - a metadata struct containing information on the past versions embedded, as well as offsets, sizes, checksums and the like.
- `.tbmvp` or TigerBeetleMultiVersionPack - a concatenated pack of binaries. The offsets in `.tbmvm` refer into here.

(The short names are for compatibility with Windows / PE when that gets added.)
These are added by an explicit objcopy step in the release process, after the regular build is done. After the epoch, the build process only needs to pull the last TigerBeetle release from GitHub, to access its embedded pack to build its own.
Monitoring
On a 1 second timer, TigerBeetle `stat`s its binary file, looking for changes. Should anything differ (besides `atime`), it'll re-read the binary into memory, verify checksums and metadata, and start advertising new versions without requiring a restart.

This optimization allows skipping a potentially expensive WAL replay when upgrading: the previous version is what will checkpoint to the new version, at which point the exec happens.
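The "anything besides atime" comparison can be sketched as below. This is a hedged C illustration; which `stat` fields the Zig implementation actually compares is not specified here, so the field set is an assumption.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/stat.h>

/* Sketch: two stat results count as "changed" if the inode, size or
 * modification time differ; atime is deliberately excluded so that mere
 * reads of the binary don't trigger a re-parse. */
static bool binary_changed(const struct stat *prev, const struct stat *cur) {
    return prev->st_ino != cur->st_ino ||
           prev->st_size != cur->st_size ||
           prev->st_mtim.tv_sec != cur->st_mtim.tv_sec ||
           prev->st_mtim.tv_nsec != cur->st_mtim.tv_nsec;
}
```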
Executing
The final step is executing into the new version of TigerBeetle. On Linux, this is handled by `execveat`, which allows executing from a `memfd`. If executing the latest release, `exec_latest` re-execs the `memfd` as-is. If executing an older release, `exec_release` copies it out of the pack, verifies its checksum, and then executes it.

Bootstrapping
0.15.3 is considered the epoch release, but it doesn't know about any future versions of TigerBeetle or how to read the metadata yet. This means that if the build process pulled in that exact release, then when running on a 0.15.3 datafile, 0.15.3 would be executed and nothing further would happen. To solve this, we have a special backport release (#1935), published as a draft release on GitHub, that embeds the fact that 0.15.4 is available.
Once 0.15.4 is running, no more special cases are needed.
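The memfd + execveat dance from the Executing section above can be sketched in C as follows. This is a hedged illustration, not the project's code: `exec_from_memory` is a hypothetical name (the PR's functions are `exec_latest`/`exec_release`), checksum verification is elided, and error handling is minimal.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_EMPTY_PATH */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>   /* memfd_create */
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: copy a (previously checksum-verified) binary image into an
 * anonymous memfd and execute it in place via execveat(2) with
 * AT_EMPTY_PATH, so no on-disk file is needed. Returns only on failure. */
static int exec_from_memory(const uint8_t *image, size_t size,
                            char *const argv[], char *const envp[]) {
    int fd = memfd_create("tigerbeetle-multiversion", MFD_CLOEXEC);
    if (fd < 0) return -1;
    if (write(fd, image, size) != (ssize_t)size) {
        close(fd);
        return -1;
    }
    /* Older glibc lacks an execveat wrapper, so go via syscall(2). */
    syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
    close(fd);
    return -1;
}
```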
Caveats