multiversioning: bring up multiversion binaries and build process for linux #2013
base: main
Conversation
Incomplete review, but posting what I have so far!
The PR description (Overview/Building/Monitoring/etc) is a great candidate for a document in e.g. docs/internals/upgrades.md!
```zig
// When slicing into the binary:
// checksum(section[past_offset..past_offset+past_size]) == past_checksum.
// This is then validated when the binary is written to a memfd or similar.
// TODO: Might be nicer as an AoS? It's control plane state.
```
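The slicing invariant in the comment above can be sketched as follows. This is a hedged illustration in C rather than the project's Zig: the `past_version` struct and the `checksum` function (a stand-in FNV-1a, not TigerBeetle's actual checksum) are assumptions made only to show the shape of the validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in checksum (FNV-1a). TigerBeetle uses its own checksum function;
 * this placeholder only illustrates the validation shape. */
static uint64_t checksum(const uint8_t *data, size_t size) {
    uint64_t hash = 14695981039346656037ULL;
    for (size_t i = 0; i < size; i++) {
        hash ^= data[i];
        hash *= 1099511628211ULL;
    }
    return hash;
}

/* Hypothetical per-version record mirroring the fields named in the comment:
 * offset and size into the pack section, plus the recorded checksum. */
struct past_version {
    uint64_t past_offset;
    uint64_t past_size;
    uint64_t past_checksum;
};

/* checksum(section[past_offset..past_offset+past_size]) == past_checksum,
 * checked before the slice would be written to a memfd or similar. */
static int validate_slice(const uint8_t *section, size_t section_size,
                          const struct past_version *v) {
    if (v->past_offset > section_size ||
        v->past_size > section_size - v->past_offset) return 0;
    return checksum(section + v->past_offset, v->past_size) == v->past_checksum;
}
```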
Checking my understanding: We could change this MultiversionMetadata schema later, and to transition, the multi-version build step would need to decode the old version and encode the new version -- at which point we would not need the old version again.
e.g. I imagine that we will soon want to compress the binaries.
Maybe there should be a schema version number?
> Checking my understanding: We could change this MultiversionMetadata schema later, and to transition, the multi-version build step would need to decode the old version and encode the new version -- at which point we would not need the old version again.
Yes, that's correct. Much in the same way the epoch is currently a special case.
The other consideration is that this struct is read by past versions of TigerBeetle. We have two escape hatches there:

- If we really need to break compatibility, the upgrade process becomes "upgrade the binary, restart each replica".
- It's always the latest version that does the reading and executing; otherwise the information is only used for advertising. So, if we had to add compression, we wouldn't need to change anything in a breaking way here.
As long as the top level checksums (`checksum_header`, `checksum_binary_without_header`), and `past.count`, `past.versions` and `past.visits` are readable, the old version will correctly advertise the new version, which can then do anything with the rest of the fields (or additional ones) as it needs. This makes it quite flexible!

I'm thinking of dropping the explicit `reserved` and checking for `elf_section_header.sh_size != @sizeOf(MultiversionHeader)` instead (the checksum would differ, so we can't quite do that), because we can add fields freely if we need with this approach.
re: upgrade instructions -- we should recommend that the source and the destination binaries already be on the same filesystem before the
(Out of scope for this PR, but...)
Alternatively, tigerbeetle could handle that step itself... e.g. It could also handle:
...But I don't know if it is worth it.
src/multiversioning.zig (Outdated)
```zig
    err: anyerror,
} = .init,

callback: ?Callback = null,
```
`callback` doesn't seem to be used for much... just:

```zig
fn on_read_from_binary_statx(self: *Multiversion, result: anyerror!void) void {
    self.start_timeout();
    _ = result catch return;
}
```

Imo it would be simpler to remove `callback` and just hardcode `self.start_timeout()`.
Also, I think that would mean you could remove the `self.start_timeout` from `fn start()`, though maybe that messes up the timing.
Good idea. With `timeout_start_enabled` I've gone the route of disabling timeouts at the beginning of `start()`, and explicitly starting them when it's finished, to ensure the timing is correct.
Yep, we should - we should also recommend moving the old binary aside rather than straight rm'ing it, just in case. Although that's only an extra layer of protection - the code fully expects and handles the binary being written, truncated, etc. from under it.
```zig
self.start_timeout();

    return self.handle_error(e);
};
```
Should we also verify that the executable is in fact executable (according to the file mode bits)?
We already handle that correctly via `handle_error` after opening the file. And checking after stat isn't ideal anyway, due to the TOCTOU.
> We already handle that correctly via handle_error after opening the file
We do? 😄
Currently, we will exec into a new binary even if it's not +x, due to the memfd copying dance. But if TigerBeetle crashes or the operator wants to restart it manually, this is a problem indeed.

I'm going to add a mode check to the stat; there is technically a TOCTOU, yes, but I think it's better than not erroring out if the mode isn't set. In regular operation there's no TOCTOU (from the logic described above); it would only arise if the following were to happen:
- Operator drops in new binary, with +x
- TB reads it in, updates everything
- Operator runs chmod -x tigerbeetle
- TB crashes / is manually stopped
The case I'm more concerned about is them dropping in a binary, by accident, that's not marked executable.
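The proposed mode check could look roughly like this. A hedged C sketch only: the real check lives in the project's Zig code, and which execute bits to require (owner only vs. any) is an assumption of this illustration.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Sketch: after stat'ing the candidate binary, reject it unless it is a
 * regular file with at least one execute bit set. The exact set of bits
 * required is an assumption made for illustration. */
static bool mode_is_executable(const struct stat *st) {
    return S_ISREG(st->st_mode) &&
           (st->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) != 0;
}
```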
Overview
This PR adds the platform specific support for multiversion binaries on Linux!
The idea behind multiversion binaries is to give operators a great experience when upgrading TigerBeetle clusters. Upgrades should be simple, involve minimal downtime and be robust, while not requiring external coordination. Multiple versions in a single binary are required for two reasons:
The upgrade instructions look something like:
When the primary determines enough replicas have the new binary, it'll coordinate the upgrade (#1670). There are three main parts to multiversion binaries: building, monitoring and executing, with platform specific parts in each.
Building
Physically, multiversion binaries are regular TigerBeetle ELF files that have two extra sections embedded into them - marked as noload so that they're not memory mapped:

- `.tbmvm` or TigerBeetleMultiVersionMetadata - a metadata struct containing information on the past versions embedded, as well as offsets, sizes, checksums and the like.
- `.tbmvp` or TigerBeetleMultiVersionPack - a concatenated pack of binaries. The offsets in `.tbmvm` refer into here.

(The short names are for compatibility with Windows / PE when that gets added.)
These are added by an explicit objcopy step in the release process, after the regular build is done. After the epoch, the build process only needs to pull the last TigerBeetle release from GitHub, to access its embedded pack to build its own.
Monitoring
On a 1 second timer, TigerBeetle `stat`s its binary file, looking for changes. Should anything differ (besides `atime`), it'll re-read the binary into memory, verify checksums and metadata, and start advertising new versions without requiring a restart.

This optimization allows skipping a potentially expensive WAL replay when upgrading: the previous version is what will checkpoint to the new version, at which point the exec happens.
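The "anything besides atime" comparison can be sketched as below. This is a hedged C illustration; which `stat` fields the Zig implementation actually compares is not specified here, so the field set is an assumption.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/stat.h>

/* Sketch: two stat results count as "changed" if the inode, size or
 * modification time differ; atime is deliberately excluded so that mere
 * reads of the binary don't trigger a re-parse. */
static bool binary_changed(const struct stat *prev, const struct stat *cur) {
    return prev->st_ino != cur->st_ino ||
           prev->st_size != cur->st_size ||
           prev->st_mtim.tv_sec != cur->st_mtim.tv_sec ||
           prev->st_mtim.tv_nsec != cur->st_mtim.tv_nsec;
}
```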
Executing
The final step is executing into the new version of TigerBeetle. On Linux, this is handled by `execveat`, which allows executing from a `memfd`. If executing the latest release, `exec_latest` re-execs the `memfd` as-is. If executing an older release, `exec_release` copies it out of the pack, verifies its checksum, and then executes it.

Bootstrapping
0.15.3 is considered the epoch release, but it doesn't know about any future versions of TigerBeetle or how to read the metadata yet. This means that if the build process pulled in that exact release, then when running on a 0.15.3 datafile, 0.15.3 would be executed and nothing further would happen. To solve this, we have a special backport release (#1935), published as a draft release on GitHub, that embeds the fact that 0.15.4 is available.
Once 0.15.4 is running, no more special cases are needed.
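The memfd + execveat dance from the Executing section above can be sketched in C as follows. This is a hedged illustration, not the project's code: `exec_from_memory` is a hypothetical name (the PR's functions are `exec_latest`/`exec_release`), checksum verification is elided, and error handling is minimal.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_EMPTY_PATH */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>   /* memfd_create */
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: copy a (previously checksum-verified) binary image into an
 * anonymous memfd and execute it in place via execveat(2) with
 * AT_EMPTY_PATH, so no on-disk file is needed. Returns only on failure. */
static int exec_from_memory(const uint8_t *image, size_t size,
                            char *const argv[], char *const envp[]) {
    int fd = memfd_create("tigerbeetle-multiversion", MFD_CLOEXEC);
    if (fd < 0) return -1;
    if (write(fd, image, size) != (ssize_t)size) {
        close(fd);
        return -1;
    }
    /* Older glibc lacks an execveat wrapper, so go via syscall(2). */
    syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
    close(fd);
    return -1;
}
```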
Caveats