[hist] Improve precision of TAxis::FindFixBin / FindBin. #19033
base: master
Conversation
Force-pushed 3eb7398 → 50e7024
Thanks a lot Stephan!
Good questions, I'm checking! 🙂
Test Results: 19 files, 19 suites, 4d 3h 1m 0s ⏱️. For more details on these failures, see this check. Results for commit 901d041. ♻️ This comment has been updated with latest results.
It can be closed, indeed.
It fixes the problem described there. I linked the issue.
It does not affect run time on a modern CPU, even though more instructions are issued. When you run the bare code to find the bin (not as part of a virtual function), you have the following behaviour:
It doesn't matter how many instructions are issued; what's interesting is how many cycles they need to complete. Although that depends on the CPU in use, you see that the two versions are very close. The killer argument, however, is that all of this happens inside a virtual function, which of course has its own latencies. When plugging both versions into a virtual function and benchmarking 10M calls, we arrive at:
So you see that more instructions are issued, but because the CPU is superscalar, these additional instructions have no noteworthy effect on the run time: the correction terms can be evaluated in parallel to the original computation. This means that in TAxis, where the computation is part of a virtual function, we should not see a run-time difference.
I did not manage to provoke a difference of more than one bin. I didn't set out to prove it mathematically, but I had lots of difficulty crafting a case that produced an "off by -1" error.
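To make the failure mode concrete, here is a small standalone sketch (my addition, not code from the PR) that scans each bin's lower edge, computed with the usual fixed-bin edge formula, and reports when the naive bin formula maps it into a neighbouring bin. Whether any case fires depends on the axis parameters and platform:

```cpp
// Standalone demo (not ROOT code): check whether each bin's lower edge,
// computed the way the fixed-bin edge formula computes it, is mapped back
// to its own bin by the naive bin-index formula.
#include <cstdio>

int main() {
    const int nbins = 100;
    const double xmin = 0.1, xmax = 1.1;   // arbitrary axis parameters
    const double width = xmax - xmin;
    for (int i = 1; i <= nbins; ++i) {
        const double edge = xmin + (i - 1) * width / nbins; // lower edge of bin i
        const int bin = 1 + static_cast<int>(nbins * (edge - xmin) / width);
        if (bin != i)
            std::printf("lower edge %.17g of bin %d lands in bin %d\n", edge, i, bin);
    }
    return 0;
}
```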
Force-pushed eab0ca7 → 543f148
Force-pushed 543f148 → b9849ac
Btw, this PR reminds me of this issue: https://its.cern.ch/jira/browse/ROOT-10179
Force-pushed b9849ac → 8e6acc0
@ferdymercury I think for this PR, the change would be:

```diff
- max = currentMax;
+ max = std::nextafter(currentMax, inf);
```

but that's to be tested.
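For reference (my addition, not part of the suggestion above): `std::nextafter(currentMax, inf)` returns the next representable double above `currentMax`, i.e. it widens the stored maximum by one ULP:

```cpp
// Quick illustration of what the suggested change does: nudge a value
// up by one ULP using std::nextafter.
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    const double inf = std::numeric_limits<double>::infinity();
    const double currentMax = 1.1;
    const double widened = std::nextafter(currentMax, inf);
    std::printf("%.17g\n%.17g\n", currentMax, widened); // widened is one ULP larger
    return 0;
}
```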
Was this the correct link?
It was about the question of whether performance is impacted. Maybe more relevant is the subissue https://its.cern.ch/jira/browse/ROOT-10185, which would have made this easier for you if rootbench could be triggered from a PR.

I had tried that and it didn't work; I really had to add …

For most cases, the 1e-15*max_width term is what really counts. Related: #17896
Due to the floating-point subtraction x - xMin, the bin index occasionally flows over into the next bin. This is particularly annoying when it goes into the overflow bin although the coordinate is strictly smaller than the axis maximum. By adding a correction term that detects this case, overflows such as the one discussed in https://root-forum.cern.ch/t/bug-or-feature-in-ttree-draw/62862 can be avoided. An error in the other direction is possible, too, and is fixed by this commit as well: https://root-forum.cern.ch/t/floating-point-rounding-error-when-filling-the-histogram/35189

Microbenchmarking the changed lines showed them to be the same speed with gcc and 40% slower with clang. Both changes are by far outweighed by the virtual-call overhead, though. This allows removing the cautionary note on rounding errors added in 1703c54, since the issue is fixed now. Fix root-project#14091.
The tests are based on two issues brought up in the forum, a few cases designed by hand to break the old algorithm, GitHub issue root-project#14091, the tests from PR root-project#14105, and a test sampling 100 random points.
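As an illustration of what such a sampling test can look like (a sketch, not the PR's actual test code; the axis parameters and seed are made up):

```cpp
// Sketch of a random-sampling test: draw points uniformly inside the axis
// range and check that the returned bin's edges bracket each point.
#include <cassert>
#include <random>
#include "TAxis.h"

void testRandomPoints() {
    TAxis axis(100, 0.1, 1.1);                 // illustrative axis
    std::mt19937_64 rng(42);                   // fixed seed for reproducibility
    std::uniform_real_distribution<double> dist(axis.GetXmin(), axis.GetXmax());
    for (int i = 0; i < 100; ++i) {
        const double x = dist(rng);
        const int bin = axis.FindFixBin(x);
        assert(axis.GetBinLowEdge(bin) <= x && x < axis.GetBinUpEdge(bin));
    }
}
```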
Force-pushed 8e6acc0 → 901d041
The bin computation in TAxis can suffer from floating-point uncertainties, returning bins such that this assertion breaks:
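The assertion itself is not reproduced in this excerpt; given the description, it is presumably an invariant of this form (a reconstruction using TAxis's edge accessors, not a quote from the PR):

```cpp
// Reconstructed invariant: the coordinate must fall inside the bin
// that FindFixBin returns.
#include <cassert>
#include "TAxis.h"

void checkInvariant(const TAxis &axis, double x) {
    const int bin = axis.FindFixBin(x);
    assert(axis.GetBinLowEdge(bin) <= x && x < axis.GetBinUpEdge(bin));
}
```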
This is of course surprising for users, and it sometimes sends data into the overflow bin although it falls into the valid range of the axis. Here, two correction terms are added, which fix these floating-point errors.
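The exact correction terms are in the diff and are not reproduced here. As a sketch of the general idea, a post-hoc comparison of x against the recomputed edges of the naive bin can detect and undo both error directions:

```cpp
// Illustrative sketch only -- names and the PR's actual correction terms
// may differ. Compute the naive bin, then shift by one if x falls outside
// the recomputed edges of that bin.
int FindFixBinSketch(double x, int nbins, double xmin, double xmax) {
    if (x < xmin) return 0;            // underflow
    if (x >= xmax) return nbins + 1;   // overflow
    const double width = xmax - xmin;
    int bin = 1 + static_cast<int>(nbins * (x - xmin) / width);
    if (bin > nbins) {
        bin = nbins;                                   // rounded past xmax although x < xmax
    } else if (x < xmin + (bin - 1) * width / nbins) {
        --bin;                                         // rounded up into the next bin
    } else if (bin < nbins && x >= xmin + bin * width / nbins) {
        ++bin;                                         // rounded down into the previous bin
    }
    return bin;
}
```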
On a recent AMD CPU, there is no runtime difference, since the latency of the virtual call overshadows the additional computations for the correction. For 10M calls, running 100 times with perf, we arrive at:
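The perf numbers themselves are not reproduced in this excerpt. For context, a minimal harness of the kind described (names illustrative, not the actual benchmark) could look like:

```cpp
// Illustrative harness: issue 10M virtual calls to a FindFixBin-like
// function and time the binary, e.g. with: perf stat -r 100 ./a.out
#include <cstdint>

struct AxisBase {
    virtual ~AxisBase() = default;
    virtual int FindFixBin(double x) const = 0;
};

struct Axis : AxisBase {
    int nbins = 100;
    double xmin = 0.0, xmax = 1.0;
    int FindFixBin(double x) const override {
        return 1 + static_cast<int>(nbins * (x - xmin) / (xmax - xmin));
    }
};

// Keep the call virtual; a real harness should also hide the dynamic type
// from the optimizer (e.g. behind a separate translation unit).
int callThrough(const AxisBase &a, double x) { return a.FindFixBin(x); }

int main() {
    Axis axis;
    std::int64_t sum = 0;
    for (int i = 0; i < 10'000'000; ++i)
        sum += callThrough(axis, (i % 1000) * 1e-3);
    return sum != 0 ? 0 : 1; // use the result so the loop isn't optimized away
}
```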
Note:
Several attempts at reordering the equations for better precision failed: there were always a few of the included test cases that didn't pass, so applying the correction terms looks to be the best solution.
This PR fixes the issues brought up here: