Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiversioning: bring up multiversion binaries and build process for linux #2013

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

cb22
Copy link
Contributor

@cb22 cb22 commented Jun 14, 2024

Overview

This PR adds the platform specific support for multiversion binaries on Linux!

The idea behind multiversion binaries is to give operators a great experience when upgrading TigerBeetle clusters. Upgrades should be simple, involve minimal downtime and be robust, while not requiring external coordination. Multiple versions in a single binary are required for two reasons:

  • It allows a replica to crash after the binary has been upgraded, and still come back online.
    • It also allows for deployments, like Docker, where the binary is immutable and the process has to be terminated to learn about new versions from itself.
  • It allows for migrations over a range to happen easily without having to manually jump from version to version.

The upgrade instructions look something like:

# SSH to each replica, in no particular order:
cd /tmp
wget https://github.com/tigerbeetle/tigerbeetle/releases/download/0.15.4/tigerbeetle-x86_64-linux.zip
unzip tigerbeetle-x86_64-linux.zip
rm /usr/bin/tigerbeetle
mv tigerbeetle /usr/bin/tigerbeetle

When the primary determines enough replicas have the new binary, it'll coordinate the upgrade (#1670). There are three main parts to multiversion binaries: building, monitoring and executing, with platform specific parts in each.

Building

Physically, multiversion binaries are regular TigerBeetle ELF files that have two extra sections embedded into them - marked as noload so that they're not memory mapped:

  • .tbmvm or TigerBeetleMultiVersionMetadata - a metadata struct containing information on past versions embedded as well as offsets, sizes, checksums and the like.
  • .tbmvp or TigerBeetleMultiVersionPack - a concatenated pack of binaries. The offsets in .tvmvm refer into here.

(the short names are for compatibility with Windows / PE when that gets added.)

These are added by an explicit objcopy step in the release process, after the regular build is done. After the epoch, the build process only needs to pull the last TigerBeetle release from GitHub, to access its embedded pack to build its own.

Monitoring

On a 1 second timer, TigerBeetle stats its binary file, looking for changes. Should anything differ (besides atime) it'll re-read the binary into memory, verify checksums and metadata, and start advertising new versions without requiring a restart.

This optimization allows skipping a potentially expensive WAL replay when upgrading: the previous version is what will checkpoint to the new version, at which point the exec happens.

Executing

The final step is executing into the new version of TigerBeetle. On Linux, this is handled by execveat which allows executing from a memfd. If executing the latest release, exec_latest re-execs the memfd as-is. If executing an older release,
exec_release copies it out of the pack, verifies its checksum, and then executes it.

Bootstrapping

0.15.3 is considered the epoch release, but it doesn't know about any future versions of TigerBeetle or how to read the metadata yet. This means that if the build process pulled in that exact release, when running on a 0.15.3 datafile, 0.15.3 would be executed and nothing further would happen. We have a special backport release (#1935), that'll be published to a draft release on GitHub, that embeds the fact that 0.15.4 is available to solve this problem.

Once 0.15.4 is running, no more special cases are needed.

Caveats

  • There are still a couple of FIXME's - but I wanted to get something up before the weekend :)
  • Only Linux is supported in this PR. The code for macOS / Windows will be added shortly in a followup PR.
  • Only building from the epoch (0.15.3) is supported.

Copy link
Member

@sentientwaffle sentientwaffle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete review, but posting what I have so far!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description (Overview/Building/Monitoring/etc) is a great candidate for a document in e.g. docs/internals/upgrades.md!

src/multiversioning.zig Outdated Show resolved Hide resolved
// When slicing into the binary:
// checksum(section[past_offset..past_offset+past_size]) == past_checksum.
// This is then validated when the binary is written to a memfd or similar.
// TODO: Might be nicer as an AoS? It's control plane state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking my understanding: We could change this MultiversionMetadata schema later, and to transition the multi-version build step would need to decode the old version and encode the new version -- at which point we would not need to old version again.

e.g. I imagine that we will soon want to compress the binaries.

Maybe there should be a schema version number?

Copy link
Contributor Author

@cb22 cb22 Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking my understanding: We could change this MultiversionMetadata schema later, and to transition the multi-version build step would need to decode the old version and encode the new version -- at which point we would not need to old version again.

Yes, that's correct. Much in the same way the epoch is currently a special case.

The other consideration is that this struct is read by past versions of TigerBeetle, we have 2 escape hatches there:

  • If we really need to break compatibility, the upgrade process becomes "upgrade the binary, restart each replica"

  • It's always the latest version that does the reading and executing, so otherwise the information is used for advertising. So, if we had to add compression, we wouldn't need to change anything in a breaking way here.

    As long as the top level checksums (checksum_header, checksum_binary_without_header), and past.count, past.versions and past.visits are readable, the old version will correctly advertise the new version, which can then do anything with the rest of the fields (or additional ones) as it needs.

    This makes it quite flexible! I'm thinking of dropping the explicit reserved and check for elf_section_header.sh_size != @sizeOf(MultiversionHeader) because we can add fields freely if we need with this approach. (the checksum would differ, so we can't quite do that)

src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/constants.zig Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
@sentientwaffle
Copy link
Member

re: upgrade instructions -- we should recommend that the source and the destination binaries already be on the same filesystem before the mv tigerbeetle /usr/bin/tigerbeetle, to ensure that it is atomic.

@sentientwaffle
Copy link
Member

sentientwaffle commented Jun 17, 2024

(Out of scope for this PR, but...)

re: upgrade instructions -- we should recommend that the source and the destination binaries already be on the same filesystem before the mv tigerbeetle /usr/bin/tigerbeetle, to ensure that it is atomic.

Alternatively, tigerbeetle could handle that step itself... e.g. /path/to/new/tigerbeetle install /path/to/old/tigerbeetle

It could also handle:

  • verifying that the new tigerbeetle binary is valid (checksum). (I'm more concerned here with truncated downloads than bitrot.)
  • verifying that the new binary has intersecting versions with the old binary.
  • verify that file permissions match?

...But idk if it is worth it. mv tigerbeetle-new tigerbeetle-old is slick.

src/multiversioning.zig Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
err: anyerror,
} = .init,

callback: ?Callback = null,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

callback doesn't seem to be used for much... just

    fn on_read_from_binary_statx(self: *Multiversion, result: anyerror!void) void {
        self.start_timeout();

        _ = result catch return;
    }

Imo it would be simpler to remove callback and just hardcode self.start_timeout().

Also, I think that would mean you could remove the self.start_timeout from fn start(), though maybe that messes up the timing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. With timeout_start_enabled I've gone the route of disabling timeouts at the beginning of start(), and explicitly starting them when its finished to ensure the timing is correct.

src/multiversioning.zig Outdated Show resolved Hide resolved
@cb22
Copy link
Contributor Author

cb22 commented Jun 18, 2024

re: upgrade instructions -- we should recommend that the source and the destination binaries already be on the same filesystem before the mv tigerbeetle /usr/bin/tigerbeetle, to ensure that it is atomic.

Yep we should - we should also recommend moving the old binary out, rather than straight rm'ing it in case. Although, that's only for an extra layer of protection - the code fully expects and handles the binary being written, truncated, etc from under it.

src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
src/multiversioning.zig Outdated Show resolved Hide resolved
self.start_timeout();

return self.handle_error(e);
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also verify that the executable is in fact executable (according to the file mode bits)?

We already handle that correctly via handle_error after opening the file. And checking after stat isn't ideal anyway, due to the TOCTOU.

Copy link
Contributor Author

@cb22 cb22 Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already handle that correctly via handle_error after opening the file

We do? 😄

Currently, we will exec into a new binary even if it's not +x due to the memfd copying dance. But, if TigerBeetle crashes or the operator wants to restart it manually, this is a problem indeed.

I'm going to add a mode check to the stat; there is technically a TOCTOU, yes, but I think it's better than not erroring out if the mode isn't set. In regular operation there's no TOCTOU (from the logic described above), only if the following were to happen:

  1. Operator drops in new binary, with +x
  2. TB reads it in, updates everything
  3. Operator runs chmod -x tigerbeetle
  4. TB crashes / is manually stopped

The case I'm more concerned about is them dropping in a binary, by accident, that's not marked executable.

src/scripts/release.zig Outdated Show resolved Hide resolved
src/scripts/release.zig Outdated Show resolved Hide resolved
src/scripts/release.zig Outdated Show resolved Hide resolved
src/scripts/release.zig Outdated Show resolved Hide resolved
src/scripts/release.zig Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants