specs: add VMClock specification #199
Is the update to the top-level README.md automatic?

No, that needs to be updated too. Good catch.

Also happy to have review on the actual content too. This is an almost-final draft of the doc we'd been preparing internally, to go along with the existing implementations in QEMU and the Linux kernel, with the recent updates to add the generation support:

Very nice. We should add these in a follow-up PR. I don't mind merging that before it's in. But let me know what you think.

Those changes to the spec are already included.
specs/vmclock.md
@@ -0,0 +1,316 @@
---
title: VMClock
Please assign UAPI.13 here; see how this is done elsewhere.
Sorry, I had a very outdated version of the repo. That's why all the new bits were missing: I didn't see them in the existing versions of the specs.

LGTM. Just the link to the virtio rtc spec should be fixed.
| 0x50 | `uint64_t time_frac_sec` | Fractional part of reference time, in units of second / 2⁶⁴. |
| 0x58 | `uint64_t time_esterror_nanosec` | Estimated ± error of the time given in `time_sec` + `time_frac_sec`, in nanoseconds |
| 0x60 | `uint64_t time_maxerror_nanosec` | Maximum ± error of the time given in `time_sec` + `time_frac_sec`, in nanoseconds |
| 0x64 | `uint64_t vm_generation_count` | A change in this field indicates that the guest has been loaded from a snapshot. In addition to handling a disruption in time (which will also be signalled through the `disruption_marker` field), a guest may wish to discard UUIDs, reset network connections or reseed entropy, etc. |
"which will also be signalled through the `disruption_marker` field"
Is this a must or how is this ensured?
This was written by David Woodhouse <dwmw@amazon.co.uk>. As discussed, this will become part of the uapi specs.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
VMClock: Efficient time synchronisation for virtual machines
The requirements for accurate synchronisation of application clocks against
real wallclock time are becoming ever more demanding. Increasingly, cloud
providers are exposing precision clock devices to virtual machines to allow the
guest operating systems to synchronise their clocks.
Time on modern systems is typically derived from a CPU-internal counter (TSC,
timebase, arch counter) which runs at a nominally constant frequency,
typically between 1 GHz and 4 GHz. In practice, the frequency of the underlying
hardware counter will vary with environmental conditions, with a tolerance of
the order of ±50 PPM. It is this variance which must constantly be corrected by
synchronising against an external clock.
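
To give a sense of scale for the ±50 PPM figure, the following is a small worked calculation. The function name `drift_seconds` and the one-day interval are illustrative choices, not part of the specification.

```python
def drift_seconds(ppm: float, interval_seconds: float) -> float:
    """Error accumulated over an interval by a counter that is off by `ppm`
    parts per million, if nothing corrects it."""
    return ppm * 1e-6 * interval_seconds


# A counter running 50 PPM fast or slow gains or loses about 50 microseconds
# every second, which adds up to roughly 4.32 seconds over one day.
per_second_us = drift_seconds(50, 1) * 1e6
per_day_s = drift_seconds(50, 86_400)
print(f"{per_second_us:.1f} us per second, {per_day_s:.2f} s per day")
```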
Synchronisation against an external clock typically works by reading the CPU
counter, then reading the external clock, and finally reading the CPU counter
again. The external clock reading is then assumed to be concurrent with a
point in time between the two CPU counter readings, giving a pair of { CPU
counter, real time } values. Successive such readings are used to calibrate the
precise rate at which the CPU counter is running, in order to use it for
precision timekeeping.
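
The bracketed read described above can be sketched as follows. This is a minimal illustration under stated assumptions, not how any particular guest implements it: `Sample`, `take_sample` and `estimate_frequency` are hypothetical names, the midpoint is used as the assumed point of concurrency, and the counter and clock readers are synthetic stand-ins fed with made-up values.

```python
from typing import Callable, NamedTuple


class Sample(NamedTuple):
    counter: int      # midpoint of the two CPU counter reads, in ticks
    real_time: float  # external clock reading, in seconds
    latency: int      # ticks elapsed between the two counter reads


def take_sample(read_counter: Callable[[], int],
                read_external_clock: Callable[[], float]) -> Sample:
    """Read counter, external clock, counter; pair the clock with the midpoint."""
    before = read_counter()
    real_time = read_external_clock()
    after = read_counter()
    return Sample((before + after) // 2, real_time, after - before)


def estimate_frequency(first: Sample, second: Sample) -> float:
    """Counter rate in Hz implied by two { CPU counter, real time } pairs."""
    return (second.counter - first.counter) / (second.real_time - first.real_time)


# Synthetic readers standing in for real hardware (all values are made up).
counter_reads = iter([100, 120, 3_000_000_100, 3_000_000_120])
clock_reads = iter([10.0, 11.0])

first = take_sample(lambda: next(counter_reads), lambda: next(clock_reads))
second = take_sample(lambda: next(counter_reads), lambda: next(clock_reads))
frequency_hz = estimate_frequency(first, second)
```

The `latency` field records how wide the bracket was, which bounds how far the midpoint assumption can be from the truth for that sample.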
When applied at scale to virtual machines, there are a number of problems with
this approach. Firstly, where virtual CPUs are overcommitted across a smaller
number of physical CPUs in a host, guests experience "steal time": time when
their vCPU is not actually running. That steal time is unpredictable and can
occur in the critical period between one read of the CPU counter and the next,
affecting the precision of the estimated reading.
A remedy for this issue is to repeat the reading a number of times, and to use
the result where the latency between first and last CPU counter reading is the
lowest. This exacerbates the second problem: a large number of separate
guest operating systems on the same host are now repeating the same work of
calibrating the same underlying hardware oscillator.
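
The repeat-and-keep-the-tightest-bracket remedy might look like the sketch below. The name `best_of`, the tuple layout and the synthetic readings are assumptions made for illustration; the second attempt is given the smallest gap between counter reads, as if the other two had been hit by steal time.

```python
from typing import Callable, Tuple

# (midpoint counter in ticks, real time in seconds, latency in ticks)
Reading = Tuple[int, float, int]


def best_of(attempts: int,
            read_counter: Callable[[], int],
            read_external_clock: Callable[[], float]) -> Reading:
    """Repeat the bracketed read and keep the attempt with the lowest latency."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    best = None
    for _ in range(attempts):
        before = read_counter()
        real_time = read_external_clock()
        after = read_counter()
        latency = after - before
        if best is None or latency < best[2]:
            best = ((before + after) // 2, real_time, latency)
    return best


# Synthetic readings (made up): the latencies are 500, 40 and 300 ticks.
counter_reads = iter([0, 500, 1000, 1040, 2000, 2300])
clock_reads = iter([1.0, 2.0, 3.0])

best = best_of(3, lambda: next(counter_reads), lambda: next(clock_reads))
```

Every additional attempt is another read of the external clock, which is the repeated work that the paragraph above points out is multiplied across all guests on the host.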
The third major problem of guest-calibrated time is Live Migration, in which a
guest is transparently moved from one host to another for maintenance reasons.
When this happens, the guest can experience a step change in both the frequency
and the value of the CPU counter: the frequency because the migrated guest is
now using a different underlying counter, and the value because correctly
setting the counter value seen by the guest depends on the time
synchronisation of each hypervisor host. After a Live Migration, a guest's
clock should be considered inaccurate until it has been resynchronised from
scratch. Failure to do so can lead to data corruption, in cases where database
coherency depends on accurately timestamped transactions.
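
As a rough illustration of why a stale calibration cannot be trusted after migration, the sketch below applies the pre-migration frequency to a counter that has stepped in both value and rate. All numbers are hypothetical: a 3 GHz calibration, a destination counter running 50 PPM faster, and a 1 ms step in the counter value.

```python
def guest_time(counter: int, base_counter: int, base_time: float,
               freq_hz: float) -> float:
    """Time the guest derives from a counter value using its calibration."""
    return base_time + (counter - base_counter) / freq_hz


# Hypothetical values, chosen only to make the arithmetic easy to follow.
OLD_FREQ_HZ = 3_000_000_000   # rate the guest calibrated on the source host
NEW_FREQ_HZ = 3_000_150_000   # destination counter, 50 PPM faster
COUNTER_STEP = 3_000_000      # value step at migration: 1 ms worth of ticks

migration_time = 1000.0                    # true time of migration, seconds
counter_at_migration = OLD_FREQ_HZ * 1000  # counter started at 0 at time 0
elapsed = 100                              # true seconds since migration

true_time = migration_time + elapsed
counter_now = counter_at_migration + COUNTER_STEP + NEW_FREQ_HZ * elapsed

# The guest still applies its old calibration (base pair 0 ticks at 0.0 s).
stale_time = guest_time(counter_now, 0, 0.0, OLD_FREQ_HZ)
error_ms = (stale_time - true_time) * 1000
print(f"clock error 100 s after migration: {error_ms:.3f} ms")
```

With these made-up numbers, 1 ms of the error comes from the value step and a further 5 ms has accumulated from the frequency change, and the second part keeps growing until the guest resynchronises.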