specs: add VMClock specification #199

Merged: brauner merged 1 commit into uapi-group:main from brauner:main on Dec 16, 2025

@brauner (Member) commented Dec 15, 2025

This was written by David Woodhouse <dwmw@amazon.co.uk>. As discussed, this will become part of the UAPI specs.

VMClock: Efficient time synchronisation for virtual machines

The requirements for accurate synchronisation of application clocks against
real wallclock time are becoming ever more demanding. Increasingly cloud
providers are exposing precision clock devices to virtual machines to allow the
guest operating systems to synchronise their clocks.

Time on modern systems is typically derived from a CPU-internal counter (TSC,
timebase, arch counter) which runs at a nominally constant frequency, typically
between 1GHz and 4GHz. In practice, the frequency of the underlying hardware
counter will vary with environmental conditions, with a tolerance of the order
of ±50PPM. It is this variance which must constantly be corrected by
synchronising against an external clock.
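
To give a sense of scale for the ±50PPM figure, here is a minimal sketch in C of the arithmetic. The function name `drift_ns` is hypothetical and not part of any specification. An uncorrected 50PPM frequency error accumulates 50µs of clock error per second, or about 4.32 seconds per day.

```c
#include <stdint.h>

/*
 * Accumulated clock error, in nanoseconds, for a counter whose frequency
 * is off by `ppm` parts per million over `elapsed_ns` nanoseconds.
 * Hypothetical helper for illustration; the multiplication is assumed
 * not to overflow 64 bits (true for intervals of days at tens of PPM).
 */
uint64_t drift_ns(uint64_t ppm, uint64_t elapsed_ns)
{
    return elapsed_ns * ppm / 1000000u;
}
```

For example, `drift_ns(50, 1000000000)` gives the error after one second at 50PPM.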

Synchronisation against an external clock typically works by reading the CPU
counter, then reading the external clock, and finally reading the CPU counter
again — then assuming that the external clock reading was concurrent with a
point in time between the two CPU counter readings to give a pair of { CPU
counter, real time } values. Successive such readings are used to calibrate the
precise rate at which the CPU counter is running, in order to use it for
precision timekeeping.
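
The pairing and calibration steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the method of any particular implementation: the three raw readings are passed in by the caller (how the counter and the external clock are actually read is left out), the external reading is assumed to fall at the midpoint of the two counter readings, and the names `clock_sample`, `make_sample` and `estimate_hz` are hypothetical.

```c
#include <stdint.h>

/* One { CPU counter, real time } pair. */
struct clock_sample {
    uint64_t counter;  /* estimated counter value at the clock reading */
    uint64_t time_ns;  /* external clock reading, in nanoseconds */
    uint64_t latency;  /* counter ticks between the two counter reads */
};

/*
 * Build a pair from three raw readings taken in order: the counter, the
 * external clock, then the counter again. The clock reading is assumed
 * to be concurrent with the midpoint of the two counter readings.
 */
struct clock_sample make_sample(uint64_t counter_before, uint64_t clock_ns,
                                uint64_t counter_after)
{
    struct clock_sample s;

    s.latency = counter_after - counter_before;
    s.counter = counter_before + s.latency / 2;
    s.time_ns = clock_ns;
    return s;
}

/*
 * Estimate the counter frequency in Hz from two pairs. The caller must
 * ensure that `b` was taken strictly later than `a`.
 */
double estimate_hz(const struct clock_sample *a, const struct clock_sample *b)
{
    double ticks = (double)(b->counter - a->counter);
    double seconds = (double)(b->time_ns - a->time_ns) / 1e9;

    return ticks / seconds;
}
```

The `latency` field bounds the uncertainty of each pair: the external clock reading could have happened anywhere within that window.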

When applied at scale to virtual machines, there are a number of problems with
this approach. Firstly, where virtual CPUs are overcommitted across a smaller
number of physical CPUs in a host, guests experience "steal time" — time when
their vCPU is not actually running. That steal time is unpredictable and can
occur in the critical period between one read of the CPU counter and the next,
affecting the precision of the estimated reading.

A remedy for this issue is to repeat the reading a number of times, and to use
the result where the latency between first and last CPU counter reading is the
lowest. This exacerbates the second problem: a large number of separate guest
operating systems on the same host are now repeating the same work of
calibrating the same underlying hardware oscillator.
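
Selecting the lowest-latency attempt might look like the sketch below. The `reading` structure and the function `lowest_latency_index` are hypothetical names used only for illustration; gathering the attempts themselves is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Raw readings from one attempt: counter, external clock, counter again. */
struct reading {
    uint64_t counter_before;
    uint64_t clock_ns;
    uint64_t counter_after;
};

/*
 * Return the index of the attempt with the smallest gap between its two
 * counter reads, or n if the array is empty. An attempt interrupted by
 * steal time shows up as a large gap and is therefore not selected.
 */
size_t lowest_latency_index(const struct reading *r, size_t n)
{
    size_t best = n;
    uint64_t best_gap = UINT64_MAX;

    for (size_t i = 0; i < n; i++) {
        uint64_t gap = r[i].counter_after - r[i].counter_before;

        if (best == n || gap < best_gap) {
            best = i;
            best_gap = gap;
        }
    }
    return best;
}
```

Every extra attempt is another round trip to the external clock, which is why this remedy multiplies the work done by each guest.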

The third major problem of guest-calibrated time is Live Migration, in which a
guest is transparently moved from one host to another for maintenance reasons.
When this happens, the guest can experience a step change in both the frequency
and the value of the CPU counter. The frequency changes because the migrated
guest is now using a different underlying counter; the value changes because
correctly setting the counter value seen by the guest depends on the time
synchronisation of each hypervisor host. After a Live Migration, a guest's
clock should be considered inaccurate until it has been resynchronised from
scratch. Failure to do so can lead to data corruption, in cases where database
coherency depends on accurately timestamped transactions.

@bluca (Member) commented Dec 15, 2025

| Version | Changes |
|---------|---------|
| 1.0     | Initial Release |

@dwmw2 (Contributor) commented Dec 15, 2025

Is the update to the top-level README.md automatic?

@bluca (Member) commented Dec 15, 2025

No, that needs to be updated too. Good catch.

@dwmw2 (Contributor) commented Dec 15, 2025

Also happy to have review on the actual content too. This is an almost-final draft of the doc we'd been preparing internally, to go along with the existing implementations in QEMU and the Linux kernel, with the recent updates to add the generation support:

@brauner (Member, Author) commented Dec 15, 2025

> Also happy to have review on the actual content too. This is an almost-final draft of the doc we'd been preparing internally, to go along with the existing implementations in QEMU and the Linux kernel, with the recent updates to add the generation support:

Very nice. We should add these in a follow-up PR. I don't mind merging that before it's in. But let me know what you think.

@dwmw2 (Contributor) commented Dec 15, 2025

Those changes to the spec are already included.

specs/vmclock.md Outdated
@@ -0,0 +1,316 @@
---
title: VMClock
Collaborator

Please assign UAPI.13 here.

See how this is done elsewhere.

Member Author

Sorry, I had a very outdated version of the repo, so I didn't see the new bits in the existing specs and that's why they were missing.

@brauner brauner requested a review from poettering December 15, 2025 21:44
@poettering (Collaborator)

LGTM. Just the link to the virtio-rtc spec should be fixed.

| 0x50 | `uint64_t time_frac_sec` | Fractional part of reference time, in units of second / 2⁶⁴. |
| 0x58 | `uint64_t time_esterror_nanosec` | Estimated ± error of the time given in `time_sec` + `time_frac_sec`, in nanoseconds |
| 0x60 | `uint64_t time_maxerror_nanosec` | Maximum ± error of the time given in `time_sec` + `time_frac_sec`, in nanoseconds |
| 0x68 | `uint64_t vm_generation_count` | A change in this field indicates that the guest has been loaded from a snapshot. In addition to handling a disruption in time (which will also be signalled through the `disruption_marker` field), a guest may wish to discard UUIDs, reset network connections or reseed entropy, etc. |
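
As an aside on the `time_frac_sec` units (second / 2⁶⁴), converting such a fraction to nanoseconds could be sketched as below. The field names `time_sec` and `time_frac_sec` come from the table; the function names are hypothetical, and how the fields are read from the device (including any locking) is not shown. The conversion computes `(frac * 10^9) >> 64` by splitting the fraction into 32-bit halves, so no 128-bit type is needed.

```c
#include <stdint.h>

/*
 * Convert a fraction in units of second / 2^64 to whole nanoseconds,
 * rounding down. Hypothetical helper for illustration only.
 */
uint64_t frac_sec_to_ns(uint64_t frac)
{
    const uint64_t nsec_per_sec = 1000000000u;
    uint64_t hi = frac >> 32;
    uint64_t lo = frac & 0xffffffffu;

    /* Each product is below 2^62, so neither can overflow 64 bits. */
    uint64_t hi_prod = hi * nsec_per_sec;
    uint64_t lo_prod = lo * nsec_per_sec;

    /* Top 64 bits of the 128-bit value (hi_prod << 32) + lo_prod. */
    return (hi_prod + (lo_prod >> 32)) >> 32;
}

/*
 * Combine time_sec and time_frac_sec into a nanosecond count. Assumes
 * time_sec is small enough that the result fits in 64 bits.
 */
uint64_t time_to_ns(uint64_t time_sec, uint64_t time_frac_sec)
{
    return time_sec * 1000000000u + frac_sec_to_ns(time_frac_sec);
}
```

A fraction of 2⁶³ represents exactly half a second, so `frac_sec_to_ns(1ULL << 63)` yields 500000000.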
Member

> which will also be signalled through the `disruption_marker` field

Is this a must or how is this ensured?

This was written by David Woodhouse <dwmw@amazon.co.uk>.
As discussed this will become part of the uapi specs.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>.
Signed-off-by: Christian Brauner <brauner@kernel.org>
@brauner brauner merged commit 4ecabc5 into uapi-group:main Dec 16, 2025
1 check passed