- 
                Notifications
    You must be signed in to change notification settings 
- Fork 10.6k
[benchmark] Robust Measurements #26462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @swift-ci please benchmark | 
| @swift-ci please smoke test | 
| @eeckstein Hmm… I should have started full test, to also exercise the python unit tests. But if you trust me it all passes locally, we could check this (assuming smoke test passes) based on my honor word, to help Joe get less erratic results ASAP. | 
| Performance: -O
 Code size: -OPerformance: -Osize
 Code size: -Osize | 
| Hmm… I must have messed something up. Will check after returning from dog walk. | 
| Interesting… failure in 2nd iteration of Onone in the  | 
| @Catfish-Man, disregard. The error is in python code for merging results and has nothing to do with the particular benchmark. (Note to self: do not comment from phone while on dog walk) | 
| If we get more stable benchmark results can we decrease the 7% reporting threshold? | 
| I can't figure out how am I sporadically reporting removed benchmarks whose runtime is 0 and how can they be missing in results dictionary during merge. Here it's too late now, I'll try again tomorrow. Sorry! | 
| The error that occurred while running the benchmark is some kind of a heisenbug… (I reproduced the issue locally, but any attempt at observation around potential error sources made it not occur. 💁♂️) I'm going for vacation next week, so I will not be able to work on this. Maybe I'll push the little cleanup work I did in the meantime and see if that fixes the issue. @eeckstein if this modified process works, passes tests and your review, could you please check this in? I'm not sure if I'll be online next week. | 
| @swift-ci benchmark | 
| @swift-ci please test | 
| Performance: -O
 Code size: -OPerformance: -Osize
 Code size: -OsizePerformance: -Onone
 Code size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 5.0% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview | 
| This isn’t really better… 🙁 | 
| @swift-ci benchmark | 
| Performance: -O
 Code size: -OPerformance: -Osize
 Code size: -OsizePerformance: -Onone
 Code size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 5.0% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview | 
| @swift-ci benchmark | 
    
      
        1 similar comment
      
    
  
    | @swift-ci benchmark | 
| Performance: -OCode size: -OPerformance: -Osize
 Code size: -OsizePerformance: -OnoneCode size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 5.0% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview | 
| I think this looks OK, but let me run one more benchmark. | 
| @swift-ci benchmark | 
| @swift-ci test | 
| Performance: -O
 Code size: -OPerformance: -Osize
 Code size: -OsizePerformance: -OnoneCode size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 5.0% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview | 
| Well, benchmarks with runtimes under 20 μs need to be adjusted anyway… and  | 
| Since my comments were alluded to above, I want to avoid confusion about the different kinds of variation we see in performance results. Performance is unpredictable in response to either changes to the compiler or to the inputs. I've made comments about these "optimizer instability" problems, or more generally performance unpredictability of the language. It's a problem because fundamentally good changes often produce bad results. The tradeoffs involved have to do with language features, peak benchmark performance, and time spent developing robust optimizations. Run-to-run variation is different. It has to do with the design of the benchmarks and measurement methodology. I have not made any comments about this recently. The tradeoffs here are with turn around time, machine resources, precision and sensitivity, applicability to larger benchmarks, etc. To the extent that anyone pays attention to the benchmark scores, I do think it's important to fix "measurement instability". We just need to acknowledge that even with a perfectly deterministic benchmarking methodology, changes in benchmark scores across compiler changes can still be very misleading. | 
| @swift-ci please benchmark I’m gathering more data… not all previously problematic benchmarks were present in last report. As I said above: 
 So far, it looks like the anomalous results are “lucky” runs. The whole execution of that independent run is an outlier. I’ll post some charts to illustrate the issue tomorrow. | 
| @palimondo The goal is to add another Python validation-test to run  | 
      
        
              This comment has been minimized.
        
        
      
    
  This comment has been minimized.
| @swift-ci please benchmark | 
      
        
              This comment has been minimized.
        
        
      
    
  This comment has been minimized.
| @swift-ci please benchmark | 
      
        
              This comment has been minimized.
        
        
      
    
  This comment has been minimized.
Replaced list comprehension that computes the minimum from runtimes corrected for setup overhead with a procedural style that is easier to understand.
| I don't understand this new output and it definitely needs to be behind a detail drop down. | 
| @gottesmm I’m sorry for spamming you all with these verbose reports. It was a quick hack to dump samples from all independent runs that go into the result comparison. I needed these just temporarily to gather more data and devise better strategy for coping with the issue of false changes. Now it’s clear the extreme outliers are coming from “lucky runs”, which then skew the aggregated sample distribution, used for the final comparison of OLD and NEW. The selection of most stable location from f888aefe9d989fee26fac6a2f8ef3d6deeb6c15c does not help with this problem, because the individual runs have the same shape of distribution, the lucky runs are just shifted in location and we keep comparing the MINs. I think I’ll need to detect and throw out such outlier runs. This explains why  | 
Use the most stable location estimate (MIN, P05, P10, Q1, MED) for `ResultComparison` based on the standard deviation for given quantiles across the independent runs and the aggregate sample (hardening against outlier runs). We multiply the standard deviations for each quantile between the OLD and NEW results to get “cross-sample” variance. We pick the quantile with lowest variance as most stable and use it for the comparisons.
With the faster benchmarks in the Swift Benchmark Suite, we now don’t need to spend a whole 1 second measuring each one of them. So I’m adjusting the Benchmark_Driver to sample each one for 50 ms and gather up to 102 actual runtime values (percentiles + MIN + MAX). In my tests, for most optmized benchmarks, the sample distribution was still roughly comparable with full second mesurements. It is more important to gather samples from multiple independent runs to cover the possible variations in distribution.
With the faster benchmarks in Swift Benchmark Suite (now that the Legacy Factor refactoring is finished), we don’t need to be so extremly frugal with the sample count. Gathering just the first 3 to 10 samples per benchmark was not very representative from the statistical point of view. I suspect it hides Type II errors — unreported changes. Adjusting the measurement method to sample each benchmark for 50 ms and gather at minimum 10 samples. For the suspected changes, gather up to 10 independent samples. Also thorougly measure the newly added test in re-runs.
Removed unused `num_samples` argument and redundant `run_count` variable. Also gather metadata (currently unused). Print the actual threshold in the “How to read the data“ decription.
| @swift-ci python lint | 
| @swift-ci please benchmark | 
| @swift-ci please smoke test | 
| @swift-ci benchmark | 
| Performance: -O
 Code size: -OPerformance: -Osize
 Code size: -OsizePerformance: -Onone
 Code size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 5.0% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Wow, now this is upsetting! Switching away from MIN as location estimate on these benchmarks did backfire tremendously. I have seen nothing like this locally. I need to re-enable that verbose dump to understand the distribution of individual runs for these cases. | 
| @swift-ci please benchmark | 
| Performance: -O
 Code size: -OPerformance: -Osize
 Code size: -OsizePerformance: -Onone
 Code size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 5.0% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| @shahmishal @eeckstein Did @swift-ci start to invoke  | 
| @palimondo This PR is enormous! The python formatting changes need to be broken out separately from the benchmark formatting changes. | 
| 
@CodaFi Python formatting was broken out already in a separate PR this one
depends on, but that's neither here nor there since #30085 has landed :-/
resolving the merge conflicts will be a slog. I'll get back to this,
eventually.… On Wed, Apr 15, 2020 at 2:17 AM Robert Widmann ***@***.***> wrote:
 @palimondo <https://github.com/palimondo> This PR is enormous! The python
 formatting changes need to be broken out separately from the benchmark
 formatting changes.
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#26462 (comment)>, or
 unsubscribe
 <https://github.com/notifications/unsubscribe-auth/AAH7S4YDI4HRXNKCVDDZBTDRMT4KDANCNFSM4II66NHQ>
 .
 | 
With the faster benchmarks in Swift Benchmark Suite (now that the Legacy Factor refactoring is finished), we don’t need to be so extremely frugal with the sample count. Gathering just the first 3 to 10 samples per benchmark was not very representative from the statistical point of view. There was a not insignificant chance that the benchmark did not yet reach a steady state. I suspect it was the reason for sporadic false changes reported from CI. (For most recent example see #26455.)
This PR adjusts the measurement method (in both the
Benchmark_Driveras well asrun_smoke_bench) to sample each benchmark for 250 ms and gather at minimum 10 samples. Therun_smoke_benchnow gathers up to 10 independent runs for the suspected changes. Ten independent runs with larger sample counts are enough to get statistically sound results, so it no longer spins in a measurement loop waiting for 10 no-change runs in a row.