New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Instant/Duration math precision & associativity on Windows #57380

Open
wants to merge 3 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@bearcage
Copy link

bearcage commented Jan 6, 2019

tl;dr Addition and subtraction on Duration/Instant are not associative on windows because we use the perfcounter frequency in every calculation instead of just when we measure time.

This is my first contrib (PR or Issue) to Rust, so please lmk if I've done this wrong. I followed CONTRIBUTING to the extent I could given my system doesn't seem to be able to build the compiler with changes in the source tree. I also asked about this issue in #rust-beginners a week or so ago, before digging through libstd -- I'm unsure if there's a good way to follow up on that, but I'd be happy to update the docs on the timing structs if this fixes the problem.

Issue

The Duration type keeps seconds in the upper-64 and nanoseconds in the lower-32 bits. In theory doing math on these ought to be basically the same as doing math on any other 64 or 32 bit integral number.

On windows (and I think macos too), however, our math gets messy because the Instant type stores the current point in time in units of HPET Performance Counter counts, not nanoseconds, and does unit conversions on every math operation, rather than just when we measure the time from the system clock.

I tried this code:

use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();
    let offset = Duration::from_millis(5);
    assert_eq!((now + offset) - now, (now - now) + offset);
}

On UNIX machines (linux and macos) it behaves as you'd expect -- no crash.

On Windows hosts, however, it blows up because of a precision problem in the Instant +/- Duration math:

C:\Users\aberg\work\timetest (master -> origin)
λ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target\debug\timetest.exe`
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `4.999914ms`,
 right: `5ms`', src\main.rs:6:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
error: process didn't exit successfully: `target\debug\timetest.exe` (exit code: 101)

C:\Users\aberg\work\timetest (master -> origin)
λ cat src\main.rs
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();
    let offset = Duration::from_millis(5);
    assert_eq!((now + offset) - now, (now - now) + offset);
}

On windows I think this is a consequence of doing the HPET-PerfCounter-Unit conversion on each math operation. I suspect the reason it works on macs is that (from what I could find online) most apple machines report timing in nanoseconds anyway. For anyone interested, the equivalent functions on macos, with a little work to fish out the numerator/denominator from a timebase struct:

  • QueryPerformanceCounter() -> mach_absolute_time()
  • QueryPerformanceFrequency() -> mach_timebase_info()

If this PR ends up working as I expect it to when CI runs the tests, I can make the same changes to the macos implementation.

Potential Fix

We ought to be able to sort this out by storing nanoseconds, rather than PerfCounter units, that way intermediate math is done in the most precise units we support and we're only doing unit conversions when we actually measure the system clock (and it might even translate to a small perf gain for people doing tons of Instant/Duration math).

I believe this will address the underlying cause of #56034, and make the guessed epsilon constant from #56059 unnecessary. If it's of interest, I can write up how these timing types work on the tier 1 platforms to address #32626 as well, since I'm already in here figuring it out.

This Patch

To that end, I've got this patch, which I think should fix it on windows, but I'm having trouble testing it -- any time I change anything in libstd I start getting this error, which no amount of clean building seems to resolve:

C:\Users\aberg\work\rust (master -> origin)
λ python x.py test --stage 0 --no-doc src/libstd
Updating only changed submodules
Submodules updated in 0.27 seconds
    Finished dev [unoptimized] target(s) in 2.41s
Building stage0 std artifacts (x86_64-pc-windows-msvc -> x86_64-pc-windows-msvc)
    Finished release [optimized] target(s) in 6.78s
Copying stage0 std from stage0 (x86_64-pc-windows-msvc -> x86_64-pc-windows-msvc / x86_64-pc-windows-msvc)
Building stage0 test artifacts (x86_64-pc-windows-msvc -> x86_64-pc-windows-msvc)
   Compiling test v0.0.0 (C:\Users\aberg\work\rust\src\libtest)
error[E0460]: found possibly newer version of crate `std` which `getopts` depends on
  --> src\libtest\lib.rs:36:1
   |
36 | extern crate getopts;
   | ^^^^^^^^^^^^^^^^^^^^^
   |
   = note: perhaps that crate needs to be recompiled?
   = note: the following crate versions were found:
           crate `std`: \\?\C:\Users\aberg\work\rust\build\x86_64-pc-windows-msvc\stage0-sysroot\lib\rustlib\x86_64-pc-windows-msvc\lib\libstd-d7a80ca2ae113c97.rlib
           crate `std`: \\?\C:\Users\aberg\work\rust\build\x86_64-pc-windows-msvc\stage0-sysroot\lib\rustlib\x86_64-pc-windows-msvc\lib\std-d7a80ca2ae113c97.dll
           crate `getopts`: \\?\C:\Users\aberg\work\rust\build\x86_64-pc-windows-msvc\stage0-test\x86_64-pc-windows-msvc\release\deps\libgetopts-ae40a96de5f5d144.rlib

error: aborting due to previous error

For more information about this error, try `rustc --explain E0460`.
error: Could not compile `test`.

To learn more, run the command again with --verbose.
command did not execute successfully: "C:\\Users\\aberg\\work\\rust\\build\\x86_64-pc-windows-msvc\\stage0\\bin\\cargo.exe" "build" "--target" "x86_64-pc-windows-msvc" "-j" "12" "--release" "--manifest-path" "C:\\Users\\aberg\\work\\rust\\src/libtest/Cargo.toml" "--message-format" "json"
expected success, got: exit code: 101
failed to run: C:\Users\aberg\work\rust\build\bootstrap\debug\bootstrap test --stage 0 --no-doc src/libstd
Build completed unsuccessfully in 0:00:20

Since you wrote the linked PRs and might remember looking at related problems:

r? @alexcrichton

@rust-highfive

This comment has been minimized.

Copy link
Collaborator

rust-highfive commented Jan 6, 2019

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@alexcrichton
Copy link
Member

alexcrichton left a comment

Thanks so much for the thorough PR and description! I've been feeling for some time now that we probably want to just switch to a bland Duration as the internal storage for all Instant and possibly SystemTime. That I think would also solve this and likely improve our portability of these types across platforms without actually sacrificing much functionality.

One comment I'd have is that we may want to consider the esoteric use case of moving very far back in time. If you take a huge duration and subtract it from the current Instant it may work today and may have overflow with this PR. I... think though that this may affect SystemTime more than Instant, so it may not be a worry for this PR at all! Mind testing this out locally though to see how it goes?

If it's of interest, I can write up how these timing types work on the tier 1 platforms to address #32626 as well, since I'm already in here figuring it out.

It would definitely be of interest! We'd be more than happy to help take a look at some docs here.


cc @sfackler, I think you've had thoughts on the implementations of these APIs on the past, just wanted to make sure you didn't have any objections to this either!

Show resolved Hide resolved src/libstd/sys/windows/time.rs Outdated
Show resolved Hide resolved src/libstd/sys/windows/time.rs Outdated
@bearcage

This comment has been minimized.

Copy link

bearcage commented Jan 8, 2019

Thanks for the quick turnaround!

I got a fresh shiny Windows VM going after work today, and I can finally get the test suite to run, albeit a bit slowly.

I believe you're correct about the potential for a loss of range with this change as it is today, at least on some machines. Given a pair of windows hosts, if one is giving us PerfCounter increments in nanoseconds and another is giving us microseconds, the precision of our math, and extent of time we can represent, has to vary accordingly.

If you'd like to see one, I can come up with a concrete example of a Duration that'll overflow with this patch and wouldn't without it. I'm not sure it'd be a useful test to add to the suite, though, since whether it passes or fails depends on what computer you run it on.


Personally I'm leaning toward moving the underlying type to a Duration, assuming y'all feel similarly about the extra memory spend being worth the model simplicity we get for it.

I think this probably isn't much of a breaking change since we already don't promise anything about the Instant type other than that it represents a point in time (the docs even call it opaque). If this sounds good to y'all, I'll make the change for sys/windows/time.rs first, we can land that, and then I'll do the other platforms in libstd, and finally we can see about changing the interface, if it makes sense?

@bearcage

This comment has been minimized.

Copy link

bearcage commented Jan 8, 2019

In either case, I'll stack on another commit to update the docs with whatever we decide on :-D

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Jan 8, 2019

I think I personally feel that this is the best way to go now. Using a second/nanosecond counter (even if it's not precisely Duration) across all platforms I think will make it easier to work with the code internally and externally (less duplication and more consistent internally while also being more portable by having nanosecond resolution externally).

To be clear I would be worried about something like representing the date 1500 in SystemTime. That's pre-epoch on both Unix and Windows, and so representing such a time would require some sort of sign (negative) in Instant and/or SystemTime. The Duration type, however, is unsigned and always represents a positive duration.

I'm not really sure how well this is supported today, however, nor whether it's really a case that comes up that often. I believe for Instant it's not really an issue as it's roughly relative to process start. For SystemTime it may be a bit worrisome, but anyone storing timestamps so old is likely using a more full-fledged datetime library anyway.

In any case starting with Windows sounds like a good idea, and we can probably just copy all the code into src/libstd/time.rs in a future PR! Before going too far down that route though I think it's best to hear from @sfackler first (or at least give them a bit of time to respond)

@retep998

This comment has been minimized.

Copy link
Member

retep998 commented Jan 8, 2019

I wonder if there's ever a situation where QueryPerformanceFrequency will report a frequency greater than 1,000,000,000...

@bearcage

This comment has been minimized.

Copy link

bearcage commented Jan 11, 2019

Checking in -- @sfackler did you get a chance to peek at this PR / do you have an opinion about using a Duration as the backing type for Instant?

@retep998 - it's certainly possible, although since 1/ns = 1 GHz, so it seems unlikely to me that we'd have a clock much more precise, given the speeds CPUs operate at... food for thought.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Jan 11, 2019

Ah ok, it's been long enough that @bearcage if you want to go ahead and make the changes I think that should be good to do!

@bearcage

This comment has been minimized.

Copy link

bearcage commented Jan 21, 2019

Sorry that took me a little to get to — my usual windows machine is still failing to build the test suite (I suspect I've misconfigured visual studio at some point along the way).

Regardless, the last commit on here reworks this PR to use a Duration as the backing type, which simplifies the code a fair bit. If this looks good overall, I'll stack on another commit to reduce the perf_counter module interface to a couple querying functions, and cut another PR for applying the same change to macos.

@rust-highfive

This comment has been minimized.

Copy link
Collaborator

rust-highfive commented Jan 21, 2019

The job mingw-check of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:1df7e3fd:start=1548032876937771204,finish=1548032951321227676,duration=74383456472
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
Setting environment variables from .travis.yml
$ export IMAGE=mingw-check
---
[00:06:28] configure: build.locked-deps    := True
[00:06:28] configure: llvm.ccache          := sccache
[00:06:28] configure: build.cargo-native-static := True
[00:06:28] configure: dist.missing-tools   := True
[00:06:28] configure: build.configure-args := ['--enable-sccache', '--disable-manage-submodu ...
[00:06:28] configure: writing `config.toml` in current directory
[00:06:28] configure: 
[00:06:28] configure: run `python /checkout/x.py --help`
[00:06:28] configure: 
---
[00:08:22] For more information about this error, try `rustc --explain E0308`.
[00:08:22] error: Could not compile `std`.
[00:08:22] 
[00:08:22] To learn more, run the command again with --verbose.
[00:08:22] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "check" "--target" "i686-pc-windows-gnu" "-j" "4" "--release" "--locked" "--color" "always" "--features" "panic-unwind backtrace" "--manifest-path" "/checkout/src/libstd/Cargo.toml" "--message-format" "json"
[00:08:22] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap check --target=i686-pc-windows-gnu --host=i686-pc-windows-gnu
[00:08:22] Build completed unsuccessfully in 0:00:39
travis_time:end:22f12edc:start=1548032961733702640,finish=1548033464844018427,duration=503110315787
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 1.
---
travis_time:end:003e05dc:start=1548033465663676873,finish=1548033465671055043,duration=7378170
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:322070ce
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:08f93f9a
travis_time:start:08f93f9a
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:059e3818
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@bearcage

This comment has been minimized.

Copy link

bearcage commented Jan 21, 2019

I'm not sure what's up with that travis failure — looking at the logs, it's showing a failure referring to code that doesn't exist in the current HEAD:

perf_counter::PerformanceCounterInstant::epsilon();

travis snippet:

[00:08:22] error[E0308]: mismatched types
[00:08:22]   --> src/libstd/sys/windows/time.rs:46:22
[00:08:22]    |
[00:08:22] 46 |         Instant { t: 0 }
[00:08:22]    |                      ^ expected struct `core::time::Duration`, found integral variable
[00:08:22]    |
[00:08:22]    = note: expected type `core::time::Duration`
[00:08:22]               found type `{integer}`
[00:08:22] 
[00:08:22] error: aborting due to previous error
[00:08:22] 
[00:08:22] For more information about this error, try `rustc --explain E0308`.
[00:08:22] error: Could not compile `std`.

Any chance it's got some wonky caching going on?

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Jan 22, 2019

I think that may perhaps be a "merge conflict" of sorts? If you rebase I think that code should show up

bearcage added some commits Jan 6, 2019

Simplify units in Duration/Instant math on Windows
Right now we do unit conversions between PerfCounter measurements
and nanoseconds for every add/sub we do between Durations and Instants
on Windows machines. This leads to goofy behavior, like this snippet
failing:

```
let now = Instant::now();
let offset = Duration::from_millis(5);
assert_eq!((now + offset) - now, (now - now) + offset);
```

with precision problems like this:

```
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `4.999914ms`,
 right: `5ms`', src\main.rs:6:5
```

To fix it, this changeset does the unit conversion once, when we
measure the clock, and all the subsequent math in u64 nanoseconds.

It also adds an exact associativity test to the `sys/time.rs`
test suite to make sure we don't regress on this in the future.
Move Instant backing type to Duration
Per review comments, this commit switches out the backing
type for Instant on windows to a Duration. Tests all pass,
and the code's a lot simpler (plus it should be portable now,
with the exception of the QueryPerformanceWhatever functions).

@bearcage bearcage force-pushed the bearcage:master branch from a31ec05 to 0f566ec Jan 23, 2019

@bearcage

This comment has been minimized.

Copy link

bearcage commented Jan 23, 2019

Ahh right, I forgot y'all have a fancy post-merge CI system — thanks for the pointer!

Rebased and fixed.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Jan 23, 2019

Looks great to me!

It's probably worthwhile having a small comment in struct Instant explaining what the contained Duration is relative to, but other than that I think this is good to go. Do you want to expand the scope though before landing? (aka include other platforms)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment