[WIP] Compare performance #8793
Conversation
@lplarson Please review. cc @eeckstein @shahmishal @gottesmm @atrick
@swift-ci please benchmark
Build comment file: Optimized (O)
Regression (4)
Improvement (5)
No Changes (357)

Regression (5)
Improvement (2)
No Changes (359)
Some quick comments. I don't completely understand what you are trying to do with the 2nd patch and I have not read the first. I am going to look at this again later.
benchmark/scripts/Benchmark_Driver (Outdated)
So just to make sure I am understanding this correctly, you are instrumenting the max resident size of the process tree rooted at the Driver given a specific test.
I think that will give you the max, but it won't allow you to calculate other statistics over the samples. I haven't thought about the implications of this... @lplarson your thoughts on this?
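(An aside to make the mechanism concrete: a minimal Python sketch, not the PR's actual code, of one way to capture the peak RSS of a benchmark child process on macOS by wrapping it in /usr/bin/time -l; the command and arguments below are illustrative.)

```python
import subprocess

def measure_max_rss(command):
    """Run `command` under /usr/bin/time -l (macOS) and return the peak
    resident set size in bytes, parsed from the rusage report on stderr."""
    process = subprocess.Popen(['/usr/bin/time', '-l'] + command,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    _, stderr = process.communicate()
    for line in stderr.decode('utf-8').splitlines():
        # /usr/bin/time -l prints e.g. "  10485760  maximum resident set size"
        if 'maximum resident set size' in line:
            return int(line.split()[0])
    return None

# Hypothetical usage (test name and flag for illustration only):
# peak_bytes = measure_max_rss(['./Benchmark_O', 'Array2D', '--num-samples=1'])
```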
Yes, as I understand this, this just gets the max memory used by the given test. This will not be a statistical sample, just a single value. All the other stats (MIN, MAX, MEAN, SD, MEDIAN) are correctly reported by Benchmark_O & co. and were, before this patch, discarded or corrupted by this script.
We want to be able to take an average of the peak memory usages.
We first need to get the basics working at all. The script as is, does not report average memory use at all, but a sum of memory used in all runs, so the results depend on num-iterations. The MAX_RSS is not being reported by compare_perf_tests.py either.
I have compared two runs (num-iterations 3 and 20) with the fixed script on my machine, and the difference in MAX_RSS between the runs is in the 0-1% range for all tests, with the exception of the following:
| Test name | Mem Diff |
|---|---|
| Array2D | 2 % |
| ArrayAppendSequence | 2 % |
| ArrayAppend | 3 % |
| ArrayAppendStrings | 4 % |
| ArrayPlusEqualSingleElementCollection | 2 % |
| ArraySubscript | 9 % |
| DictionaryBridge | 90 % |
| MonteCarloE | 3 % |
| ObjectiveCBridgeStubFromNSStringRef | 9 % |
| ObjectiveCBridgeStubToNSStringRef | 7 % |
| ObjectiveCBridgeToNSDictionary | 84 % |
| ObjectiveCBridgeToNSSet | 84 % |
| ObserverClosure | 2 % |
| ObserverPartiallyAppliedMethod | 3 % |
It looks like there are some issues (leaks?) with ObjC Bridging (DictionaryBridge, ObjectiveCBridgeToNSDictionary, ObjectiveCBridgeToNSSet). Let me emphasize that I was comparing runs with 3 vs 20 samples, done by Benchmark_O. I'll bet that comparing the same number of iterations will yield a 0-1% difference for all tests.
I don't see what could be gained by computing the average memory used over multiple runs executed by Benchmark_Driver. Are we really expecting Swift to have varying memory consumption between runs?!? That approach resulted in the need to redo, in Benchmark_Driver, all the statistics Benchmark_O already does, none of which was implemented correctly.
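(To illustrate the reported behaviour with made-up numbers, not the script's actual code: summing per-run MAX_RSS scales with the number of runs, while the mean stays comparable across configurations.)

```python
# Hypothetical per-run peak memory samples, in bytes.
three_runs = [10485760, 10485760, 10489856]
twenty_runs = three_runs * 6 + three_runs[:2]      # 20 samples of similar size

print(sum(three_runs), sum(twenty_runs))           # ~31 MB vs ~210 MB: depends on run count
print(sum(three_runs) / len(three_runs),
      sum(twenty_runs) / len(twenty_runs))         # both ~10 MB: comparable
```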
Averaging is useful here for the same reason it's useful in measuring benchmark runtime: these systems always have noise, and averaging helps deal with it. We moved to using Benchmark_Driver specifically so that we could average over memory usage.
"The script as is, does not report average memory use at all, but a sum of memory used in all runs" Where do you see this?
Re: "script doesn’t report average but sum” - I cannot reproduce what I saw. It definitely wasn’t caused by this script, but I have a few logs where the MAX_RSS dropped from about 10 MB (for small tests) to 3,3 MB. This lead me to believe the change was caused by my modification of the script (I was testing with num-iterations=3). Now the runs are back to 10 MB. I have no idea what is happening here, will investigate further. (One more reason to keep a better history - SR-4591).
I'm taking this conversation to the top; being hidden under comments on an outdated file is not helping.
benchmark/scripts/Benchmark_Driver (Outdated)
Can you use a string format rather than str()? i.e. '--num-samples={}'.format(num_samples), or using '%s'.
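(A small sketch of the suggestion, assuming the original builds the flag by concatenating with str():)

```python
num_samples = 20

flag = '--num-samples=' + str(num_samples)       # current style being replaced (assumed)
flag = '--num-samples={}'.format(num_samples)    # suggested: str.format
flag = '--num-samples=%s' % num_samples          # or: % formatting
```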
benchmark/scripts/Benchmark_Driver (Outdated)
Can you use early returns here to eliminate the elif? You do not have any code of consequence after the condition. This is communicated to the reader of the code in a more succinct way by using early returns, i.e.:
    if args.filter:
        prefix = args.filter
        return filter(lambda name: name.startswith(prefix), tests)
    if not args.benchmark:
        return tests
    benchmarks = set(args.benchmark)
    return sorted(list(set(tests).intersection(benchmarks)))
Notice how, just by reading the code, the reader is able to know how much work is left before returning, without having to read the whole function.
benchmark/scripts/Benchmark_Driver (Outdated)
Using --num-samples and getting the peak memory usage of a single invocation doesn't allow us to average the memory usage for each sample.
I'd like to be able to average over peak memory usages.
I have a question about output from I propose that it always writes only in the specified format and that it writes either to
Can I please get some guidance about the script outputs I suggest altering in the above comment?
OMG, I need help from a git expert - I messed something up here. I wasn't able to push my recent commits. It said I needed to git pull first; I played with git rebase, reset, clean and co… and now I've somehow pulled in a commit from @atrick about "CopyForwarding" and have one of mine duplicated. Agh. Help (e-mail, twitter)?
@palimondo Can you rebuild your branch and force push?
I'll try to fix my branch and force push. Here I'd like to continue our discussion from the hidden comments on the outdated file.

@lplarson: Maybe you could explain more about the reasoning that led to iterating in Benchmark_Driver?

When measuring runtime performance on a multitasking system, we have to take into account the system's weather: other processes competing for the CPU. We are taking multiple samples and trying to statistically approximate a "true" runtime. Instead of measuring how our benchmark would perform if it were the only thing running (which is never the case), we settle for a consistently reproducible result. We should be discarding extreme values and working with the mean.

In the case of measuring memory usage, the system's weather is much less severe. There certainly is variation due to the interaction between the allocator and the OS that depends on the available memory, but this is small compared to measuring runtime. Other processes do not directly impact our own memory consumption.

We are measuring the memory usage by instrumenting the benchmark run; from this we extract the maximum resident set size. The granularity of this value is always a multiple of the page size, 4096 bytes on macOS. That is the smallest amount of memory the operating system deals with; the smaller allocation amounts used internally by the program are handled by the allocator (malloc), which is part of the Swift runtime. In theory it is possible to measure memory usage at a finer level by using a custom allocator to track all the details, but that is not what we are discussing here.

When we look at the vast number of tests in the Swift benchmark suite, as long as they are pure Swift, they are very stable memory-wise. Running
Each MAX_RSS value is the mean of 3 samples taken by the current version of Benchmark_Driver. Raising the sample size to 20 buys us very little:
Notice how the variation between MAX_RSS values reported from different runs did not decrease even when we increased the sample size. This is because we are now doing statistical analysis of noise that is below the precision of our measurement. The point to take home is that the CV is hugging 0%. If we ditched the averaging of RSS and left the computation of other statistics to Benchmark_O:
At first glance the SD looks worse, but to clarify I've added 2 columns: the number of memory pages allocated (MAX_RSS / 4096) and the delta between the minimal number of pages (#Pages, last row) and the current one. Notice the different sizes reported between the two versions.

My argument is that while the CV stays low, we are good to take just one measurement and report memory regressions (not done currently at all) based on the % change between them, say above 5%.

This takes me to a further analysis of the tests that had a high CV, which I've referred to as Mem Diff in the table of suspicious tests in my previous comment. I believe they are all a symptom of a bug: memory leaks in the bridging implementation. This is actually hidden by the current, MAX_RSS-averaging implementation of Benchmark_Driver.
Going from 3 to 20 samples increases memory consumption 5 times! Based on that I'm lumping ObjectiveCBridgeToNSDictionary and ObjectiveCBridgeToNSSet together as the same bug. The other volatile tests are much less severe (CV < 10%) and have in common intense use of Array. I've looked in detail at MonteCarloE. It performs a lot of random number computations, similar to MonteCarloPi, but in contrast to that it builds a huge [Bool].
This looks much less severe, but still like some kind of memory leak in the implementation of Array. Given that ObserverPartiallyAppliedMethod had up to 5% CV with 3 iterations but stabilized to 0-1% with 20 iterations, the memory increase may be worth investigating?
All in all, I believe that unstable memory usage in benchmarks is a sign of bugs in the underlying implementation, not a fact of nature that we need to deal with by averaging. This approach allowed the above bugs to hide in plain sight for who knows how long. Or am I really misunderstanding some fundamental issue that led you to compute average MAX_RSS in Benchmark_Driver?
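(To make the proposed check concrete, a rough sketch with hypothetical helper names: flag a memory regression from single MAX_RSS measurements using a 5% threshold, and compute the coefficient of variation used above.)

```python
PAGE_SIZE = 4096  # bytes, macOS page size

def coefficient_of_variation(samples):
    """CV = standard deviation / mean; a CV hugging 0% means the measurement is stable."""
    mean = sum(samples) / float(len(samples))
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return (variance ** 0.5) / mean

def is_memory_regression(old_max_rss, new_max_rss, threshold=0.05):
    """Compare two single MAX_RSS measurements (bytes); report a regression
    when memory use grows by more than `threshold` (5% by default)."""
    change = (new_max_rss - old_max_rss) / float(old_max_rss)
    pages_delta = (new_max_rss - old_max_rss) // PAGE_SIZE
    return change > threshold, change, pages_delta
```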
Force-pushed from 01e433a to f594745
I have fixed my branch and force pushed. @lplarson @shahmishal @gottesmm Can I get some answers here? I'd like to continue the work on this, but I'd hate to throw it away if you don't like where I'd like to take it. See the 4-5 comments above.
Note about the current
No idea what you're actually proposing. The original motivation was to paste the output into a git message or copy-paste some lines into emacs org mode. I'd be fine with just a "markdown" format by default, as it is today, and a "text" format that has the markup and table characters stripped away.
BTW: what we really need is a Benchmark_Driver --rerun=N feature: https://bugs.swift.org/browse/SR-4669
@atrick I'm proposing that the script always outputs in the format specified by
@palimondo That sounds completely sane. Thanks. As long as it's ok with @shahmishal.
@palimondo Can you group the commits into different PRs? I have time to review today in a little bit.
… and list computations
Force-pushed from 242d614 to 3c59bdf
@gottesmm I can, if you tell me what you want… Wrote you an e-mail.
@palimondo I responded via email. Did I answer your question?
Also: use single quote strings
@palimondo Thinking about this more, I'm fine if we don't take the average MAX_RSS from multiple runs as long as we maintain the same number of columns in the output. We can then monitor the results to see if we notice problems in action.
@lplarson In the discussion on SR-4667, @aschwaighofer explained where the whole variance in memory usage between runs comes from. So I'm thinking we need to monitor memory usage separately, in a different run with a fixed num-iterations?
If we can use a static number of iterations and get reliable results, then I'm all for it. We can potentially speed up benchmarking by not running each test for a second as well. @aschwaighofer, @gottesmm Am I missing a profound reason to adaptively find iteration counts that run for 1 second apart from convenience to the person adding a test? If it's only convenience, it seems easy enough to create a utility that will find the proper iteration count and set it statically in the test before committing. This would cut down on some noise in the benchmarks.
@lplarson I am not the author. I see no profound reason. We will need different counts for O and Onone. Also, it will vary across platforms, but presumably not by a significant factor, so if you aim for 0.1 s you will probably be fine on slower platforms.
@lplarson @palimondo @gottesmm The driver should first determine how many iterations are necessary for a sufficient sample period. 1s is seen as long enough to factor out the timing and harness overhead. The number of iterations varies wildly between revisions of the compiler. It's critically important that a badly behaving benchmark doesn't sabotage a set of runs by taking minutes to complete, and it's important that benchmarks remain measurable after they are dramatically improved. I've never heard of this technique as introducing noise. It's supposed to reduce noise. At any rate, the driver became much more useful after Michael Z (Mr. zperf) implemented this.
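(For reference, a sketch of the adaptive technique described above, with illustrative names: grow the iteration count until a single sample runs for about one second, so timer and harness overhead is amortized.)

```python
import time

def calibrate_num_iters(run_benchmark, target_seconds=1.0):
    """`run_benchmark(n)` runs the benchmark body n times; double n until
    one sample takes at least `target_seconds` and return that count."""
    num_iters = 1
    while True:
        start = time.time()
        run_benchmark(num_iters)
        elapsed = time.time() - start
        if elapsed >= target_seconds:
            return num_iters
        num_iters *= 2
```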
@lplarson @palimondo Oh, I see where this fixed iteration question came from. Measuring memory utilization and CPU cycles is a totally different thing. I think it should be a separate run with a separate report.
One more thing :) I really don't want my benchmark runs to be slowed down by measuring memory usage. It's a great thing to track and helpful to reveal leaks, but not something I want to look at 99% of the time.
That was my point. We should kick the MAX_RSS measurement out of the timing measurement and do a separate pass for memory only. Then we don't need to re-implement the whole averaging in Benchmark_Driver. I think this would even be better done in a separate log file. I suggest the following:

This frees … Do I have a green light to implement this?
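(A rough sketch of the separate memory-only pass being proposed, with hypothetical function, flag, and file names; it reuses a measure_max_rss() helper like the one sketched earlier and leaves the timing pass untouched.)

```python
def run_memory_pass(tests, num_iters, log_path='memory.log'):
    """Run each test once with a fixed iteration count and log only MAX_RSS."""
    with open(log_path, 'w') as log:
        log.write('#,TEST,MAX_RSS(B)\n')
        for index, test in enumerate(tests, 1):
            # Benchmark binary and flag names assumed for illustration.
            peak = measure_max_rss(
                ['./Benchmark_O', test, '--num-iters={}'.format(num_iters)])
            log.write('{},{},{}\n'.format(index, test, peak))
```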
@palimondo That sounds great. The technique of parsing the
Sounds good to me.
Please work with @shahmishal to make sure benchmark pull request testing isn't disrupted.
@lplarson Re: Work with @shahmishal. What do you mean, exactly?
Wow… 😲 Scale (num-iters) varies between samples? How are these then comparable?!?
I'd argue that the
Re: How are these comparable…
@palimondo Let @shahmishal know if you need to change something that will break "@SwiftCI please benchmark"
@shahmishal @lplarson Where can I see how the "@SwiftCI please benchmark" works? Is the GitHub integration visible somewhere? I'd love to know how the CI works in more detail, in order to fix SR-4675 (Report detailed build version from Benchmark drivers). In the final stage, I want to see in the report the unique identification of the branch and commit of the two compared versions. These also need to be in the logs from
This PR fixes reporting issues when comparing benchmark results.
Resolves:
- compare_perf_tests.py fails when new benchmarks are added
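(A sketch of the general approach to the "new benchmarks" failure, with hypothetical names, not the script's actual code: compare only the tests present in both runs and list added or removed benchmarks separately instead of failing on a missing entry.)

```python
def split_tests(old_results, new_results):
    """`old_results` and `new_results` are dicts keyed by test name.
    Returns (comparable, added, removed) so newly added benchmarks no
    longer break the comparison."""
    old_names, new_names = set(old_results), set(new_results)
    comparable = sorted(old_names & new_names)
    added = sorted(new_names - old_names)
    removed = sorted(old_names - new_names)
    return comparable, added, removed
```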