Skip to content

Use the "nearly divisionless" algorithm on all targets. #37920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

stephentyrone
Copy link
Contributor

@stephentyrone stephentyrone commented Jun 15, 2021

We have multipliedFullWidth available everywhere now, so we don't need to carry around the old implementation on 32b targets.

Also adds a few more benchmarks for random number generation.

Resolves #53302

We have multipliedFullWidth available everywhere now, so we don't need to carry around the old implementation on 32b targets.

Also adds a few more benchmarks for random number generation.
@stephentyrone
Copy link
Contributor Author

@swift-ci please benchmark

@stephentyrone
Copy link
Contributor Author

@swift-ci please test

@stephentyrone stephentyrone requested a review from Azoy June 15, 2021 14:56
…rks.

When these are visible compile-time constants, the compiler is smart enough to evaluate the division in the "nearly divisionless" algorithm, which makes it completely divisionless. That's good, but it obscures what the runtime performance of the algorithm will be when the bounds are _not_ available as compile-time constants. Thus, for some of the newly-added benchmarks, we pass the upper bound through `identity` to hide it from the optimizer (this is imperfect, but it's the simplest tool we have).

We don't want to do this for all the tests for two reasons:
- compile-time constant bounds are a common case that should still be reflected in our testing
- we don't want to perturb existing benchmark results more than we have to.
@stephentyrone
Copy link
Contributor Author

@swift-ci please benchmark

@swift-ci
Copy link
Contributor

Performance (x86_64): -O

Regression OLD NEW DELTA RATIO
DictionaryOfAnyHashableStrings_insert 3262 5558 +70.4% 0.59x
FlattenListLoop 1630 2495 +53.1% 0.65x (?)
Set.isDisjoint.Box25 362 514 +42.0% 0.70x (?)
DictionaryKeysContainsNative 23 27 +17.4% 0.85x (?)
StringRemoveDupes 263 293 +11.4% 0.90x (?)
UTF8Decode_InitFromBytes_ascii_as_ascii 512 556 +8.6% 0.92x (?)
NSStringConversion.UTF8 927 1000 +7.9% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
RandomDoubleLCG 216 200 -7.4% 1.08x (?)
NSStringConversion.Rebridge.LongUTF8 54 50 -7.4% 1.08x (?)
 
Added MIN MAX MEAN MAX_RSS
RandomDouble01Def 36200 36200 36200
RandomDouble01LCG 142 147 144
RandomDoubleOpaqueDef 36100 36400 36200
RandomDoubleOpaqueLCG 543 543 543
RandomInt64Def 65100 67400 66100
RandomInt64LCG 1691 1743 1709
RandomInt8Def 70300 71700 71200
RandomInt8LCG 638 638 638

Code size: -O

Regression OLD NEW DELTA RATIO
RandomValues.o 2763 8003 +189.6% 0.35x

Performance (x86_64): -Osize

Improvement OLD NEW DELTA RATIO
FlattenListLoop 2495 1645 -34.1% 1.52x (?)
FlattenListFlatMap 6003 4375 -27.1% 1.37x (?)
DataToStringLargeUnicode 7250 6500 -10.3% 1.12x (?)
 
Added MIN MAX MEAN MAX_RSS
RandomDouble01Def 36100 36100 36100
RandomDouble01LCG 514 514 514
RandomDoubleOpaqueDef 35400 35400 35400
RandomDoubleOpaqueLCG 524 524 524
RandomInt64Def 64400 67200 66033
RandomInt64LCG 1727 1788 1747
RandomInt8Def 71200 72800 71967
RandomInt8LCG 653 666 657

Code size: -Osize

Regression OLD NEW DELTA RATIO
RandomValues.o 2419 6885 +184.6% 0.35x

Performance (x86_64): -Onone

Improvement OLD NEW DELTA RATIO
StringInterpolationSmall 4390 3960 -9.8% 1.11x (?)
StringToDataMedium 5300 4900 -7.5% 1.08x (?)
String.data.LargeUnicode 137 127 -7.3% 1.08x (?)
DataCreateMediumArray 6380 5920 -7.2% 1.08x (?)
 
Added MIN MAX MEAN MAX_RSS
RandomDouble01Def 68500 68600 68567
RandomDouble01LCG 30286 30434 30358
RandomDoubleOpaqueDef 83900 84000 83967
RandomDoubleOpaqueLCG 44450 44510 44472
RandomInt64Def 110300 113400 112133
RandomInt64LCG 45408 45714 45522
RandomInt8Def 108000 111600 109267
RandomInt8LCG 44031 44432 44212

Code size: -swiftlibs

Benchmark Check Report
⚠️ RandomInt64LCG execution took at least 1690 μs.
Decrease the workload of RandomInt64LCG by a factor of 2 (10), to be less than 1000 μs.
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@stephentyrone
Copy link
Contributor Author

@swift-ci test

Copy link
Contributor

@Azoy Azoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stdlib changes look good to me. I wonder if for the benchmarks it makes sense to add separate IntegerOpaque benchmarks like the ones you did for Double so we can see clearly which integer benchmarks are constant ranges vs. going through the algorithm. I was going to ask to land these benchmarks separately, but I don't think CI runs 32b benchmarks for this to matter for this PR.

@eeckstein
Copy link
Contributor

Is it possible to reduce the number of benchmarks to a representative subset?
We cannot add benchmarks for every permutation of every language feature. That would explode the benchmark runtime.

Alternatively, you can keep all the benchmarks, but add a .skip attribute, which excludes the from the regular run.

@stephentyrone
Copy link
Contributor Author

This is the reduced set of benchmarks! I'm not sure why we would reduce it further if we care about performance at all. I'm pretty sure that report and track fewer benchmark results for the entirety of the Swift compiler and standard library than we did for a single API in Accelerate when I was working on it(!) A reasonable set of "all" the benchmarks, just for random integers, would include the full matrix of (every builtin integer type + representative custom types) x (fast and slow generators) x (various ranges--tiny, about half, nearly all) x (compile time constant vs unknown bounds). I'm probably leaving a few dimensions out. The benchmarks are fast--much, much faster to run than the test suite--so I don't see what we gain by stripping it down below the minimum representative sample, and there's a lot to lose in doing so.

@eeckstein
Copy link
Contributor

ok, fine with me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SR-10912] Use Lemire's Nearly Divisionless Random Integer Generation on 32-bit Systems
4 participants