
Investigate replacements for SmallRng algorithm #910

Closed
dhardy opened this issue Nov 19, 2019 · 23 comments · Fixed by #1038


dhardy commented Nov 19, 2019

Due to the close correlations of PCG streams and the lack of right-state propagation, we should consider replacing PCG with another algorithm (or algorithms) for the next Rand version (0.8).

From the docs, the purpose of SmallRng is:

SmallRng may be a good choice when a PRNG with small state, cheap initialization, good statistical quality and good performance are required. It is not a good choice when security against prediction or reproducibility are important. ... The algorithm is deterministic but should not be considered reproducible due to dependence on platform and possible replacement in future library versions.

Ideally (in my opinion), SmallRng should be small but not too small; preferably 128-bit, or 256-bit if we must. @vigna, have you thoughts on this (given that you recommend a 256-bit variant of your generator for general usage, but in this case we already have a ChaCha-based generator for general usage)?

There are other generators besides PCG and Xo(ro)shiro, e.g. GJrand, JSF and SFC, though I've seen less analysis of these. Previous decisions on this topic have been somewhat influenced by this thread, though it only considers benchmarks and some very basic analysis.


vigna commented Nov 19, 2019

Do you need multiple streams? Constant-time jumps for parallel execution? Are 64-bit operations fine or would you prefer 32-bit operations?


dhardy commented Nov 19, 2019

We expect the key size (seed + stream) to be at least 100 bits. We require an algorithm for 32-bit CPUs and for 64-bit CPUs (currently we have a different RNG for each; they could also be the same).

For parallel usage we generally recommend seeding each PRNG independently from a strong master (CSPRNG or system source) instead of using jumps.
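
As an illustration, a minimal sketch of this seeding pattern with the existing SeedableRng API (assuming rand's `std` and `small_rng` features are enabled; this is just one way to do it, not a prescribed pattern):

```rust
use rand::rngs::{SmallRng, StdRng};
use rand::{Rng, SeedableRng};

fn main() {
    // One strong master source (OS-seeded, ChaCha-based StdRng)...
    let mut master = StdRng::from_entropy();

    // ...independently seeds each worker PRNG; no jump functions involved.
    let mut workers: Vec<SmallRng> = (0..8)
        .map(|_| SmallRng::from_rng(&mut master).expect("seeding from a PRNG cannot fail"))
        .collect();

    let sample: u64 = workers[0].gen();
    println!("{}", sample);
}
```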


vks commented Nov 19, 2019

Does this affect the current implementation of SeedableRng::seed_from_u64?


vigna commented Nov 19, 2019

For these two cases, on my page I suggest xoroshiro128++ or xoshiro128++. The second generator is 32-bit though, so you'll need to call it twice for a 64-bit integer or a double.

Note that in general generators writing to multiple locations (like those above) might perform better in artificial benchmarks (e.g., loops) than generators with a stream parameter, but in real life the cost of storing two words instead of one might be detectable even at the application level in some circumstances (e.g., a lot of generation but in a situation in which the compiler cannot keep the state of the PRNG in registers).

SplitMix fills many of your requirements. The problem with it is that the parameter (the additive constant of the underlying Weyl generator) must have a balance of zeros and ones, so it might be tricky to have the user specify an arbitrary parameter and avoid collisions when you "fix" it.
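
For reference, a sketch of SplitMix64 (shifts and multipliers as in Vigna's public-domain splitmix64.c); the `gamma` field is the Weyl increment whose bit balance is the concern described above:

```rust
struct SplitMix64 {
    state: u64,
    gamma: u64, // Weyl increment: must be odd, and its bit pattern matters
}

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        // Weyl step; the reference code fixes gamma = 0x9E3779B97F4A7C15.
        self.state = self.state.wrapping_add(self.gamma);
        // MurmurHash3-style finalizer used by SplitMix64.
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}
```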

An alternative you might explore is any kind of 64-bit engine (e.g., a xorshift generator or an LCG with modulus 2^64) with a strong parameterized scrambling function. Something like a MurmurHash mix, but inserting the constant defining the stream in the middle (e.g., when you multiply). Of course it is not guaranteed that all constants will work equally well, etc. Streams are objectively difficult unless you are using a stream cipher.
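
One possible (purely hypothetical) reading of this suggestion: the multiplier below is the common 64-bit LCG constant and the second mix constant comes from MurmurHash3's finalizer, but nothing here is a vetted design:

```rust
struct StreamedLcg {
    state: u64,
    stream: u64, // stream-defining constant, forced odd so the multiply is a bijection
}

impl StreamedLcg {
    fn next_u64(&mut self) -> u64 {
        // Plain LCG engine modulo 2^64.
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Scrambler: MurmurHash3-like mix with the stream constant
        // inserted "in the middle", as one of the multipliers.
        let mut z = self.state;
        z = (z ^ (z >> 33)).wrapping_mul(self.stream | 1);
        z = (z ^ (z >> 29)).wrapping_mul(0xFF51_AFD7_ED55_8CCD);
        z ^ (z >> 32)
    }
}
```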


dhardy commented Nov 19, 2019

Does this affect the current implementation of SeedableRng::seed_from_u64?

I don't believe so. In this case we care more about reproducibility and less about supporting many independent instances (the function only accepts a 64-bit key, and uses a fixed stream).

I did some quick benchmarks of our existing Rust PRNGs below. These are not fully representative (only one CPU arch, may be a little imprecise), but show roughly how our current code performs.

x86_64 (Haswell):

test gen_bytes_chacha12      ... bench:     351,126 ns/iter (+/- 22,460) = 2916 MB/s
test gen_bytes_chacha20      ... bench:     535,578 ns/iter (+/- 41,333) = 1911 MB/s
test gen_bytes_chacha8       ... bench:     256,668 ns/iter (+/- 10,779) = 3989 MB/s
test gen_bytes_hc128         ... bench:     461,475 ns/iter (+/- 42,922) = 2218 MB/s
test gen_bytes_os            ... bench:   4,672,746 ns/iter (+/- 105,596) = 219 MB/s
test gen_bytes_pcg32         ... bench:     480,698 ns/iter (+/- 8,081) = 2130 MB/s
test gen_bytes_pcg64         ... bench:     410,520 ns/iter (+/- 10,004) = 2494 MB/s
test gen_bytes_pcg64mcg      ... bench:     337,623 ns/iter (+/- 8,450) = 3032 MB/s
test gen_bytes_std           ... bench:     535,034 ns/iter (+/- 9,829) = 1913 MB/s
test gen_bytes_step          ... bench:     272,422 ns/iter (+/- 4,405) = 3758 MB/s
test gen_bytes_isaac                ... bench:     767,380 ns/iter (+/- 14,082) = 1334 MB/s
test gen_bytes_isaac64              ... bench:     410,192 ns/iter (+/- 11,555) = 2496 MB/s
test gen_bytes_splitmix64           ... bench:     504,261 ns/iter (+/- 19,653) = 2030 MB/s
test gen_bytes_xoroshiro128plus     ... bench:     471,811 ns/iter (+/- 19,862) = 2170 MB/s
test gen_bytes_xoroshiro128plusplus ... bench:     465,413 ns/iter (+/- 25,334) = 2200 MB/s
test gen_bytes_xoroshiro128starstar ... bench:     465,180 ns/iter (+/- 20,501) = 2201 MB/s
test gen_bytes_xoroshiro64star      ... bench:     560,486 ns/iter (+/- 12,691) = 1826 MB/s
test gen_bytes_xoroshiro64starstar  ... bench:     590,918 ns/iter (+/- 14,597) = 1732 MB/s
test gen_bytes_xorshift             ... bench:     385,338 ns/iter (+/- 15,345) = 2657 MB/s
test gen_bytes_xoshiro128plus       ... bench:     480,411 ns/iter (+/- 21,039) = 2131 MB/s
test gen_bytes_xoshiro128plusplus   ... bench:     506,841 ns/iter (+/- 35,335) = 2020 MB/s
test gen_bytes_xoshiro128starstar   ... bench:     485,184 ns/iter (+/- 34,264) = 2110 MB/s
test gen_bytes_xoshiro256plus       ... bench:     373,064 ns/iter (+/- 29,397) = 2744 MB/s
test gen_bytes_xoshiro256plusplus   ... bench:     382,244 ns/iter (+/- 15,187) = 2678 MB/s
test gen_bytes_xoshiro256starstar   ... bench:     365,659 ns/iter (+/- 27,107) = 2800 MB/s
test gen_bytes_xoshiro512plus       ... bench:     395,434 ns/iter (+/- 46,206) = 2589 MB/s
test gen_bytes_xoshiro512plusplus   ... bench:     410,415 ns/iter (+/- 32,176) = 2495 MB/s
test gen_bytes_xoshiro512starstar   ... bench:     400,167 ns/iter (+/- 11,301) = 2558 MB/s
test gen_bytes_siprng ... bench:     662,201 ns/iter (+/- 41,361) = 1546 MB/s

test gen_u32_chacha12        ... bench:       1,778 ns/iter (+/- 77) = 2249 MB/s
test gen_u32_chacha20        ... bench:       2,516 ns/iter (+/- 73) = 1589 MB/s
test gen_u32_chacha8         ... bench:       1,304 ns/iter (+/- 46) = 3067 MB/s
test gen_u32_hc128           ... bench:       1,836 ns/iter (+/- 46) = 2178 MB/s
test gen_u32_os              ... bench:     623,795 ns/iter (+/- 28,568) = 6 MB/s
test gen_u32_pcg32           ... bench:       1,218 ns/iter (+/- 38) = 3284 MB/s
test gen_u32_pcg64           ... bench:       1,867 ns/iter (+/- 40) = 2142 MB/s
test gen_u32_pcg64mcg        ... bench:       1,222 ns/iter (+/- 50) = 3273 MB/s
test gen_u32_std             ... bench:       2,506 ns/iter (+/- 86) = 1596 MB/s
test gen_u32_step            ... bench:          96 ns/iter (+/- 3) = 41666 MB/s
test gen_u32_isaac                  ... bench:       3,242 ns/iter (+/- 228) = 1233 MB/s
test gen_u32_isaac64                ... bench:       3,251 ns/iter (+/- 292) = 1230 MB/s
test gen_u32_splitmix64             ... bench:         820 ns/iter (+/- 34) = 4878 MB/s
test gen_u32_xoroshiro128plus       ... bench:       1,071 ns/iter (+/- 79) = 3734 MB/s
test gen_u32_xoroshiro128plusplus   ... bench:       1,524 ns/iter (+/- 115) = 2624 MB/s
test gen_u32_xoroshiro128starstar   ... bench:       1,358 ns/iter (+/- 302) = 2945 MB/s
test gen_u32_xoroshiro64star        ... bench:       1,582 ns/iter (+/- 108) = 2528 MB/s
test gen_u32_xoroshiro64starstar    ... bench:       1,720 ns/iter (+/- 206) = 2325 MB/s
test gen_u32_xorshift               ... bench:       1,376 ns/iter (+/- 676) = 2906 MB/s
test gen_u32_xoshiro128plus         ... bench:         983 ns/iter (+/- 32) = 4069 MB/s
test gen_u32_xoshiro128plusplus     ... bench:       1,036 ns/iter (+/- 23) = 3861 MB/s
test gen_u32_xoshiro128starstar     ... bench:       1,150 ns/iter (+/- 4) = 3478 MB/s
test gen_u32_xoshiro256plus         ... bench:         981 ns/iter (+/- 6) = 4077 MB/s
test gen_u32_xoshiro256plusplus     ... bench:       1,120 ns/iter (+/- 17) = 3571 MB/s
test gen_u32_xoshiro256starstar     ... bench:       1,065 ns/iter (+/- 7) = 3755 MB/s
test gen_u32_xoshiro512plus         ... bench:       3,362 ns/iter (+/- 34) = 1189 MB/s
test gen_u32_xoshiro512plusplus     ... bench:       3,467 ns/iter (+/- 23) = 1153 MB/s
test gen_u32_xoshiro512starstar     ... bench:       3,418 ns/iter (+/- 67) = 1170 MB/s
test gen_u32_siprng   ... bench:       5,052 ns/iter (+/- 417) = 791 MB/s

test gen_u64_chacha12        ... bench:       3,993 ns/iter (+/- 315) = 2003 MB/s
test gen_u64_chacha20        ... bench:       4,463 ns/iter (+/- 164) = 1792 MB/s
test gen_u64_chacha8         ... bench:       3,268 ns/iter (+/- 181) = 2447 MB/s
test gen_u64_hc128           ... bench:       3,930 ns/iter (+/- 387) = 2035 MB/s
test gen_u64_os              ... bench:     625,831 ns/iter (+/- 14,467) = 12 MB/s
test gen_u64_pcg32           ... bench:       2,664 ns/iter (+/- 228) = 3003 MB/s
test gen_u64_pcg64           ... bench:       1,868 ns/iter (+/- 37) = 4282 MB/s
test gen_u64_pcg64mcg        ... bench:       1,215 ns/iter (+/- 25) = 6584 MB/s
test gen_u64_std             ... bench:       5,380 ns/iter (+/- 211) = 1486 MB/s
test gen_u64_step            ... bench:         152 ns/iter (+/- 9) = 52631 MB/s
test gen_u64_isaac                  ... bench:       7,551 ns/iter (+/- 26) = 1059 MB/s
test gen_u64_isaac64                ... bench:       3,621 ns/iter (+/- 16) = 2209 MB/s
test gen_u64_splitmix64             ... bench:         994 ns/iter (+/- 14) = 8048 MB/s
test gen_u64_xoroshiro128plus       ... bench:       1,201 ns/iter (+/- 3) = 6661 MB/s
test gen_u64_xoroshiro128plusplus   ... bench:       1,408 ns/iter (+/- 4) = 5681 MB/s
test gen_u64_xoroshiro128starstar   ... bench:       1,342 ns/iter (+/- 26) = 5961 MB/s
test gen_u64_xoroshiro64star        ... bench:       3,064 ns/iter (+/- 111) = 2610 MB/s
test gen_u64_xoroshiro64starstar    ... bench:       3,337 ns/iter (+/- 142) = 2397 MB/s
test gen_u64_xorshift               ... bench:       1,849 ns/iter (+/- 218) = 4326 MB/s
test gen_u64_xoshiro128plus         ... bench:       1,834 ns/iter (+/- 193) = 4362 MB/s
test gen_u64_xoshiro128plusplus     ... bench:       2,190 ns/iter (+/- 187) = 3652 MB/s
test gen_u64_xoshiro128starstar     ... bench:       2,070 ns/iter (+/- 189) = 3864 MB/s
test gen_u64_xoshiro256plus         ... bench:         987 ns/iter (+/- 148) = 8105 MB/s
test gen_u64_xoshiro256plusplus     ... bench:       1,038 ns/iter (+/- 107) = 7707 MB/s
test gen_u64_xoshiro256starstar     ... bench:       1,129 ns/iter (+/- 98) = 7085 MB/s
test gen_u64_xoshiro512plus         ... bench:       3,337 ns/iter (+/- 123) = 2397 MB/s
test gen_u64_xoshiro512plusplus     ... bench:       3,377 ns/iter (+/- 239) = 2368 MB/s
test gen_u64_xoshiro512starstar     ... bench:       3,446 ns/iter (+/- 203) = 2321 MB/s
test gen_u64_siprng   ... bench:       5,037 ns/iter (+/- 128) = 1588 MB/s


i686 (emulated on Haswell):

test gen_bytes_chacha12      ... bench:   2,959,160 ns/iter (+/- 69,935) = 346 MB/s
test gen_bytes_chacha20      ... bench:   4,676,660 ns/iter (+/- 176,771) = 218 MB/s
test gen_bytes_chacha8       ... bench:   2,129,900 ns/iter (+/- 105,724) = 480 MB/s
test gen_bytes_hc128         ... bench:     779,442 ns/iter (+/- 23,737) = 1313 MB/s
test gen_bytes_os            ... bench:   4,763,796 ns/iter (+/- 68,559) = 214 MB/s
test gen_bytes_pcg32         ... bench:     841,579 ns/iter (+/- 20,688) = 1216 MB/s
test gen_bytes_pcg64         ... bench:   1,499,088 ns/iter (+/- 90,571) = 683 MB/s
test gen_bytes_pcg64mcg      ... bench:   1,291,361 ns/iter (+/- 24,630) = 792 MB/s
test gen_bytes_std           ... bench:   4,732,747 ns/iter (+/- 124,216) = 216 MB/s
test gen_bytes_step          ... bench:     319,588 ns/iter (+/- 4,945) = 3204 MB/s
test gen_bytes_isaac                ... bench:   1,028,576 ns/iter (+/- 195,479) = 995 MB/s
test gen_bytes_isaac64              ... bench:     564,817 ns/iter (+/- 6,652) = 1812 MB/s
test gen_bytes_splitmix64           ... bench:   2,781,309 ns/iter (+/- 67,637) = 368 MB/s
test gen_bytes_xoroshiro128plus     ... bench:   4,087,332 ns/iter (+/- 1,702,551) = 250 MB/s
test gen_bytes_xoroshiro128plusplus ... bench:   4,295,951 ns/iter (+/- 468,283) = 238 MB/s
test gen_bytes_xoroshiro128starstar ... bench:   2,778,644 ns/iter (+/- 119,197) = 368 MB/s
test gen_bytes_xoroshiro64star      ... bench:     598,607 ns/iter (+/- 46,824) = 1710 MB/s
test gen_bytes_xoroshiro64starstar  ... bench:     621,693 ns/iter (+/- 34,658) = 1647 MB/s
test gen_bytes_xorshift             ... bench:     441,959 ns/iter (+/- 37,409) = 2316 MB/s
test gen_bytes_xoshiro128plus       ... bench:     704,238 ns/iter (+/- 56,033) = 1454 MB/s
test gen_bytes_xoshiro128plusplus   ... bench:     557,983 ns/iter (+/- 53,534) = 1835 MB/s
test gen_bytes_xoshiro128starstar   ... bench:     568,295 ns/iter (+/- 67,564) = 1801 MB/s
test gen_bytes_xoshiro256plus       ... bench:     728,636 ns/iter (+/- 42,730) = 1405 MB/s
test gen_bytes_xoshiro256plusplus   ... bench:     758,109 ns/iter (+/- 11,789) = 1350 MB/s
test gen_bytes_xoshiro256starstar   ... bench:     769,551 ns/iter (+/- 4,485) = 1330 MB/s
test gen_bytes_xoshiro512plus       ... bench:     907,000 ns/iter (+/- 3,928) = 1128 MB/s
test gen_bytes_xoshiro512plusplus   ... bench:   1,025,192 ns/iter (+/- 15,996) = 998 MB/s
test gen_bytes_xoshiro512starstar   ... bench:     914,588 ns/iter (+/- 28,553) = 1119 MB/s
test gen_bytes_siprng ... bench:   1,699,922 ns/iter (+/- 29,795) = 602 MB/s

test gen_u32_chacha12        ... bench:      12,904 ns/iter (+/- 308) = 309 MB/s
test gen_u32_chacha20        ... bench:      19,474 ns/iter (+/- 255) = 205 MB/s
test gen_u32_chacha8         ... bench:       9,149 ns/iter (+/- 240) = 437 MB/s
test gen_u32_hc128           ... bench:       2,453 ns/iter (+/- 151) = 1630 MB/s
test gen_u32_os              ... bench:     642,248 ns/iter (+/- 23,286) = 6 MB/s
test gen_u32_pcg32           ... bench:       2,831 ns/iter (+/- 293) = 1412 MB/s
test gen_u32_pcg64           ... bench:       9,930 ns/iter (+/- 691) = 402 MB/s
test gen_u32_pcg64mcg        ... bench:       8,675 ns/iter (+/- 751) = 461 MB/s
test gen_u32_std             ... bench:      18,959 ns/iter (+/- 995) = 210 MB/s
test gen_u32_step            ... bench:           2 ns/iter (+/- 0) = 2000000 MB/s
test gen_u32_isaac                  ... bench:       3,742 ns/iter (+/- 74) = 1068 MB/s
test gen_u32_isaac64                ... bench:       4,328 ns/iter (+/- 124) = 924 MB/s
test gen_u32_splitmix64             ... bench:       2,001 ns/iter (+/- 93) = 1999 MB/s
test gen_u32_xoroshiro128plus       ... bench:       2,499 ns/iter (+/- 336) = 1600 MB/s
test gen_u32_xoroshiro128plusplus   ... bench:       2,608 ns/iter (+/- 37) = 1533 MB/s
test gen_u32_xoroshiro128starstar   ... bench:       2,816 ns/iter (+/- 309) = 1420 MB/s
test gen_u32_xoroshiro64star        ... bench:       1,307 ns/iter (+/- 106) = 3060 MB/s
test gen_u32_xoroshiro64starstar    ... bench:       1,490 ns/iter (+/- 133) = 2684 MB/s
test gen_u32_xorshift               ... bench:       1,547 ns/iter (+/- 164) = 2585 MB/s
test gen_u32_xoshiro128plus         ... bench:       2,372 ns/iter (+/- 177) = 1686 MB/s
test gen_u32_xoshiro128plusplus     ... bench:       2,408 ns/iter (+/- 232) = 1661 MB/s
test gen_u32_xoshiro128starstar     ... bench:       1,702 ns/iter (+/- 68) = 2350 MB/s
test gen_u32_xoshiro256plus         ... bench:       4,157 ns/iter (+/- 284) = 962 MB/s
test gen_u32_xoshiro256plusplus     ... bench:       4,162 ns/iter (+/- 593) = 961 MB/s
test gen_u32_xoshiro256starstar     ... bench:       4,091 ns/iter (+/- 295) = 977 MB/s
test gen_u32_xoshiro512plus         ... bench:       6,711 ns/iter (+/- 341) = 596 MB/s
test gen_u32_xoshiro512plusplus     ... bench:       7,369 ns/iter (+/- 438) = 542 MB/s
test gen_u32_xoshiro512starstar     ... bench:       6,171 ns/iter (+/- 83) = 648 MB/s
test gen_u32_siprng   ... bench:      15,812 ns/iter (+/- 965) = 252 MB/s

test gen_u64_chacha12        ... bench:      23,474 ns/iter (+/- 2,590) = 340 MB/s
test gen_u64_chacha20        ... bench:      35,887 ns/iter (+/- 4,109) = 222 MB/s
test gen_u64_chacha8         ... bench:      16,714 ns/iter (+/- 1,957) = 478 MB/s
test gen_u64_hc128           ... bench:       5,033 ns/iter (+/- 681) = 1589 MB/s
test gen_u64_os              ... bench:     721,669 ns/iter (+/- 75,554) = 11 MB/s
test gen_u64_pcg32           ... bench:       6,309 ns/iter (+/- 251) = 1268 MB/s
test gen_u64_pcg64           ... bench:      11,835 ns/iter (+/- 1,544) = 675 MB/s
test gen_u64_pcg64mcg        ... bench:      11,505 ns/iter (+/- 545) = 695 MB/s
test gen_u64_std             ... bench:      36,579 ns/iter (+/- 634) = 218 MB/s
test gen_u64_step            ... bench:           4 ns/iter (+/- 0) = 2000000 MB/s
test gen_u64_isaac                  ... bench:       7,792 ns/iter (+/- 257) = 1026 MB/s
test gen_u64_isaac64                ... bench:       5,642 ns/iter (+/- 407) = 1417 MB/s
test gen_u64_splitmix64             ... bench:       2,236 ns/iter (+/- 96) = 3577 MB/s
test gen_u64_xoroshiro128plus       ... bench:       3,360 ns/iter (+/- 223) = 2380 MB/s
test gen_u64_xoroshiro128plusplus   ... bench:       3,833 ns/iter (+/- 159) = 2087 MB/s
test gen_u64_xoroshiro128starstar   ... bench:       3,787 ns/iter (+/- 164) = 2112 MB/s
test gen_u64_xoroshiro64star        ... bench:       2,982 ns/iter (+/- 51) = 2682 MB/s
test gen_u64_xoroshiro64starstar    ... bench:       3,119 ns/iter (+/- 72) = 2564 MB/s
test gen_u64_xorshift               ... bench:       2,558 ns/iter (+/- 204) = 3127 MB/s
test gen_u64_xoshiro128plus         ... bench:       3,017 ns/iter (+/- 315) = 2651 MB/s
test gen_u64_xoshiro128plusplus     ... bench:       2,772 ns/iter (+/- 330) = 2886 MB/s
test gen_u64_xoshiro128starstar     ... bench:       2,815 ns/iter (+/- 321) = 2841 MB/s
test gen_u64_xoshiro256plus         ... bench:       4,115 ns/iter (+/- 127) = 1944 MB/s
test gen_u64_xoshiro256plusplus     ... bench:       5,121 ns/iter (+/- 155) = 1562 MB/s
test gen_u64_xoshiro256starstar     ... bench:       4,620 ns/iter (+/- 154) = 1731 MB/s
test gen_u64_xoshiro512plus         ... bench:       7,430 ns/iter (+/- 858) = 1076 MB/s
test gen_u64_xoshiro512plusplus     ... bench:       7,301 ns/iter (+/- 756) = 1095 MB/s
test gen_u64_xoshiro512starstar     ... bench:       7,020 ns/iter (+/- 695) = 1139 MB/s
test gen_u64_siprng   ... bench:      16,127 ns/iter (+/- 592) = 496 MB/s


vks commented Nov 19, 2019

It's nice to see that chacha8 is faster than steprng thanks to vectorization!

This looks like xoshiro128++ and xoshiro256++ are good choices for i686 and x86_64, respectively.


dhardy commented Nov 19, 2019

test gen_bytes_chacha8 ... bench: 2,129,900 ns/iter (+/- 105,724) = 480 MB/s

Not on i686. I guess this has more to do with which CPU features can be assumed than the word size.


dhardy commented Nov 19, 2019

Some results are improved a little when using RUSTFLAGS="-C target-cpu=native". Whether we should care about this is questionable. I posted some results using this (x86_64) here.


vks commented Nov 19, 2019

Not on i686. I guess this has more to do with which CPU features can be assumed than the word size.

Yes, this is expected, because x86_64 implies SSE2, but i686 doesn't.

vigna commented Nov 19, 2019

Are you keeping the compiler from unrolling loops? My experience, in particular with clang, is that the compiler will unroll different code a different number of times, strongly biasing the results. That might not be the case here, but it doesn't hurt to check.

The other parameter that might heavily influence the result is whether you let the compiler extract loop invariants. In that case, anything using large constants might appear much faster than in reality, because due to the small loop size all constants are loaded into registers before entering the loop. When you're mixing generation with other activities this might or might not happen (but this depends a lot on the CPU architecture).
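
For what it's worth, one way to guard a Rust micro-benchmark loop against some of these effects, assuming `std::hint::black_box` (the nightly `test::black_box` used by the benches above plays the same role); this is a sketch, not the harness actually used:

```rust
use std::hint::black_box;

// `next_u64` stands in for one PRNG step.
fn time_prng(mut next_u64: impl FnMut() -> u64, iters: u64) -> u64 {
    let mut acc = 0u64;
    for _ in 0..iters {
        // Forcing each output through black_box keeps the optimizer from
        // fusing or vectorizing iterations into something that no longer
        // resembles per-call cost; it does not prevent unrolling, so the
        // iteration count should still be varied as a sanity check.
        acc = acc.wrapping_add(black_box(next_u64()));
    }
    // Returning through black_box prevents dead-code elimination of acc.
    black_box(acc)
}
```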


dhardy commented Nov 20, 2019

Most of the results (the MB/s value) are robust against doubling each of RAND_BENCH_N (from 1000 to 2000) and BYTES_LEN (from 1024 to 2048), testing gen_bytes_* benches from the rngs repo and gen_u* from the rand repo. There are a few differences, e.g. SplitMix is approx 8% slower when the byte output length is doubled, and the 128-bit PCG generators appear about 4% faster when the number of loop iterations is doubled. Most benchmarks are basically unaffected.

This doesn't rule out loop unrolling, but I would expect to see some larger changes if it were a significant factor.

When you're mixing generation with other activities this might or might not happen

I don't really know what we can do about this, except to recommend that users benchmark their particular applications (which we already do). SmallRng is supposed to be a "reasonable default choice", not necessarily the best.

vigna commented Nov 20, 2019

I didn't say you had to do anything—that was just a general comment. My benchmarks use 10^10 iterations.

In C, -march=native can have disruptive effects—for example, on my benchmark hardware SplitMix gets unrolled/vectorized by GCC and takes 0.50ns/word.


dhardy commented Dec 9, 2019

Based on re-reading this and this, sfc64 and sfc32 also appear to be good candidates.

I have updated @pitdicker's small-rngs repo here and run some benchmarks; they perform well:

test gen_u32_ci                   ... bench:       5,662 ns/iter (+/- 308) = 706 MB/s
test gen_u32_gj                   ... bench:       2,701 ns/iter (+/- 105) = 1480 MB/s
test gen_u32_jsf32                ... bench:         963 ns/iter (+/- 28) = 4153 MB/s
test gen_u32_jsf64                ... bench:       1,048 ns/iter (+/- 134) = 3816 MB/s
test gen_u32_kiss32               ... bench:       3,572 ns/iter (+/- 700) = 1119 MB/s
test gen_u32_kiss64               ... bench:       2,997 ns/iter (+/- 117) = 1334 MB/s
test gen_u32_msws                 ... bench:       1,119 ns/iter (+/- 28) = 3574 MB/s
test gen_u32_mwp                  ... bench:         938 ns/iter (+/- 31) = 4264 MB/s
test gen_u32_pcg_xsh_64_lcg       ... bench:       1,162 ns/iter (+/- 76) = 3442 MB/s
test gen_u32_pcg_xsl_128_mcg      ... bench:       1,202 ns/iter (+/- 54) = 3327 MB/s
test gen_u32_pcg_xsl_64_lcg       ... bench:       1,123 ns/iter (+/- 48) = 3561 MB/s
test gen_u32_sapparoth_32         ... bench:       1,355 ns/iter (+/- 53) = 2952 MB/s
test gen_u32_sapparoth_64         ... bench:       1,339 ns/iter (+/- 93) = 2987 MB/s
test gen_u32_sfc_32               ... bench:       1,203 ns/iter (+/- 66) = 3325 MB/s
test gen_u32_sfc_64               ... bench:       1,210 ns/iter (+/- 20) = 3305 MB/s
test gen_u32_velox                ... bench:       2,117 ns/iter (+/- 113) = 1889 MB/s
test gen_u32_xoroshiro_128_plus   ... bench:       1,284 ns/iter (+/- 63) = 3115 MB/s
test gen_u32_xoroshiro_64_plus    ... bench:       1,336 ns/iter (+/- 58) = 2994 MB/s
test gen_u32_xoroshiro_mt_32of128 ... bench:       1,376 ns/iter (+/- 79) = 2906 MB/s
test gen_u32_xoroshiro_mt_64of128 ... bench:       1,413 ns/iter (+/- 57) = 2830 MB/s
test gen_u32_xorshift_128_32      ... bench:         808 ns/iter (+/- 32) = 4950 MB/s
test gen_u32_xorshift_128_64      ... bench:       1,053 ns/iter (+/- 70) = 3798 MB/s
test gen_u32_xorshift_128_plus    ... bench:       1,180 ns/iter (+/- 132) = 3389 MB/s
test gen_u32_xorshift_mt_32       ... bench:       1,202 ns/iter (+/- 50) = 3327 MB/s
test gen_u32_xorshift_mt_64       ... bench:       1,262 ns/iter (+/- 37) = 3169 MB/s
test gen_u32_xsm32                ... bench:       2,654 ns/iter (+/- 355) = 1507 MB/s
test gen_u32_xsm64                ... bench:       1,185 ns/iter (+/- 26) = 3375 MB/s
test gen_u64_ci                   ... bench:      10,732 ns/iter (+/- 1,305) = 745 MB/s
test gen_u64_gj                   ... bench:       2,943 ns/iter (+/- 327) = 2718 MB/s
test gen_u64_jsf32                ... bench:       2,121 ns/iter (+/- 219) = 3771 MB/s
test gen_u64_jsf64                ... bench:       1,038 ns/iter (+/- 103) = 7707 MB/s
test gen_u64_kiss32               ... bench:       5,281 ns/iter (+/- 110) = 1514 MB/s
test gen_u64_kiss64               ... bench:       3,022 ns/iter (+/- 202) = 2647 MB/s
test gen_u64_msws                 ... bench:       1,154 ns/iter (+/- 25) = 6932 MB/s
test gen_u64_mwp                  ... bench:       1,219 ns/iter (+/- 108) = 6562 MB/s
test gen_u64_pcg_xsh_64_lcg       ... bench:       2,318 ns/iter (+/- 164) = 3451 MB/s
test gen_u64_pcg_xsl_128_mcg      ... bench:       1,206 ns/iter (+/- 136) = 6633 MB/s
test gen_u64_pcg_xsl_64_lcg       ... bench:       2,322 ns/iter (+/- 186) = 3445 MB/s
test gen_u64_sapparoth_32         ... bench:       2,734 ns/iter (+/- 263) = 2926 MB/s
test gen_u64_sapparoth_64         ... bench:       1,336 ns/iter (+/- 103) = 5988 MB/s
test gen_u64_sfc_32               ... bench:       2,829 ns/iter (+/- 152) = 2827 MB/s
test gen_u64_sfc_64               ... bench:       1,078 ns/iter (+/- 62) = 7421 MB/s
test gen_u64_velox                ... bench:       3,995 ns/iter (+/- 219) = 2002 MB/s
test gen_u64_xoroshiro_128_plus   ... bench:       1,346 ns/iter (+/- 111) = 5943 MB/s
test gen_u64_xoroshiro_64_plus    ... bench:       2,609 ns/iter (+/- 193) = 3066 MB/s
test gen_u64_xoroshiro_mt_32of128 ... bench:       2,873 ns/iter (+/- 202) = 2784 MB/s
test gen_u64_xoroshiro_mt_64of128 ... bench:       1,510 ns/iter (+/- 109) = 5298 MB/s
test gen_u64_xorshift_128_32      ... bench:       2,902 ns/iter (+/- 195) = 2756 MB/s
test gen_u64_xorshift_128_64      ... bench:       1,048 ns/iter (+/- 84) = 7633 MB/s
test gen_u64_xorshift_128_plus    ... bench:       1,003 ns/iter (+/- 55) = 7976 MB/s
test gen_u64_xorshift_mt_32       ... bench:       2,647 ns/iter (+/- 162) = 3022 MB/s
test gen_u64_xorshift_mt_64       ... bench:       1,349 ns/iter (+/- 126) = 5930 MB/s
test gen_u64_xsm32                ... bench:       3,534 ns/iter (+/- 302) = 2263 MB/s
test gen_u64_xsm64                ... bench:       1,158 ns/iter (+/- 70) = 6908 MB/s


dhardy commented Dec 9, 2019

Simulated i686 benchmarks imply we should prefer SFC32 over SFC64 for 32-bit platforms:

test gen_u32_sfc_32               ... bench:       2,495 ns/iter (+/- 193) = 1603 MB/s
test gen_u32_sfc_64               ... bench:       4,046 ns/iter (+/- 172) = 988 MB/s
test gen_u64_sfc_32               ... bench:       3,731 ns/iter (+/- 192) = 2144 MB/s
test gen_u64_sfc_64               ... bench:       3,839 ns/iter (+/- 398) = 2083 MB/s


dhardy commented Dec 30, 2019

@vigna can you please comment on the repeats that O'Neill found in Xoshiro? I realise that there is a certain amount of hyperbole involved since this apparently affects only (approx) 2^64 of the 2^256 outputs of Xoshiro256. I understand that David Blackman did some further analysis?

The SFC generator by Doty-Humphrey also looks interesting, but it appears to lack third-party review: a recurring problem in the space of fast non-crypto PRNGs!


dhardy commented Dec 30, 2019

I guess the alternative is that we just retire SmallRng in Rand 0.8. It already requires explicit activation via a feature flag. Since StdRng (ChaCha20) is already fast and small enough for many uses and SmallRng does not provide stability, it is not an appropriate choice in that many cases anyway (it might be useful for some forms of testing, but it is not used by quickcheck, for example).

vigna commented Dec 30, 2019

SFC is very fast and, empirically, very good (it might be different in applications because writing four words of state takes time; xoshiro256 has the same problem). I don't think anybody has seriously tried to find flaws in it, though.

I have made a point of not considering in my work generators for which you cannot prove that the period is close to the size of the state space (the "bang for the buck" criterion), and SFC has a guaranteed period of at least 2^64 against 2^256 points of the state space. This is also a problem if you wanna do random seeding for multiple streams, because the assumption of random seeding is that it is like jumping at random in an orbit, and you estimate overlap based on the size of the orbit. The size is not known for SFC—we have just a lower bound that is too small for any practical random-seeding practice. One can, of course, engage in imaginative, non-mathematical considerations about that size, but I was trained otherwise, so not my cup of tea.

The occurrence of a too-frequent pattern every 2^192 elements: yes, it is true. Basically every PRNG with structure, if you work it enough, will show you something like that. Harase, for example, shows that if you pick the output of the Mersenne Twister at certain specific lags (like, take one output, skip 15, take one, skip 37, etc., periodically), the result fails the Birthday Spacings test (!). So subsequences of the Mersenne Twister are not random. Analogously, Matsumoto shows that if you take triples of outputs from xorshift128+, multiply them by 2^22 and use the result to generate points in the unit cube, the points you obtain are not very random (multiplying by 2^22 moves the middle bits towards the top, and so it carefully "magnifies" a slight bias in the middle bits that affects triples of consecutive outputs and that would be very difficult to detect directly).

All these results (there are many more—I just wanted to share some examples) are very common: if your generator has structure and you bang the structure enough, you'll find something. I consider more relevant, for example, the fact that linear generators map states with few ones in states with few ones (the "escape from zeroland" problem), and my solution is to make good generators with a small state space (the Mersenne Twister needs millions of iterations to get back to normality starting from a state with all bits zero except one).

The point is that there is statistically no way your computation can be influenced by a slight bias in probability happening after 2^192 values. If you believe that, all generators with less than 192 bits of state would be problematic, because if you use 2^192 values out of them you'll repeat the period several times, and every imaginable statistical feature you expect from a random stream will be wrong.

It is a very different situation when structure kills the alleged properties of the generator. For example, all sequences produced by an LCG with power-of-two modulus (and multiplier with maximum potency, as usually happens) with varying constant are the same, modulo a sign change and an additive constant (in a couple of days there will be an arXiv paper explaining this in detail). It took some time to realize this: there are papers in the late '80s suggesting the use of different constants to get different, independent streams. People even proved complex properties of these "independent streams", properties that were hinting at, but not proving, independence. Then Durst showed that all such sequences are basically the same, and everything fell like a house of cards.
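
A sketch of the algebra behind that claim (my reconstruction, not taken from the forthcoming paper):

```latex
For $x_{n+1} = a\,x_n + c \pmod{2^k}$, substituting $y_n = x_n + t$ gives
\[ y_{n+1} = a\,y_n + \bigl(c + (1-a)\,t\bigr) \pmod{2^k}, \]
while substituting $y_n = -x_n + t$ gives
\[ y_{n+1} = a\,y_n + \bigl(-c + (1-a)\,t\bigr) \pmod{2^k}. \]
Full period requires $c$ odd and $a \equiv 1 \pmod{4}$; with maximum potency
($a \equiv 5 \pmod{8}$) the term $(1-a)\,t$ ranges over all multiples of $4$,
and for any two odd constants $c, c'$ exactly one of $c'-c$, $c'+c$ is a
multiple of $4$. Hence every such LCG is a translated (and possibly
sign-flipped) copy of any other LCG with the same multiplier.
```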


dhardy commented Dec 30, 2019

This is also a problem if you wanna do random seeding for multiple streams, because the assumption of random seeding is that it is like jumping at random in an orbit, and you estimate overlap based on the size of the orbit.

Assuming all orbits have size 2^64 within a 256-bit state, the chance of two random seeds having the same orbit is 2^-192. Of course this is not the case since orbits may be (and usually are) larger, but if you consider the RNG as having 2^192 slices of size 2^64 (each of which may jump to any slice at its end), the chance of two random seeds being within 2^64 of each other is still 2^-191, right? Thus I fail to see how the size of the slice is important to this argument. (Assuming that the state mapping function is bijective — perhaps this is an assumption too far?)

(the "escape from zeroland" problem)

Yes, I read your recent paper attacking the Mersenne Twister. Let's hope it helps do what many past papers have not: convince an audience who are not experts in this subject that they should consider alternatives to MT19937. I would like to link it from our book.

The point is that there is statistically no way your computation can be influenced by a slight bias in probability happening after 2^192 values.

Perhaps it's more that 5-in-7 repeats are a very obvious problem, no matter which scrambler/distribution you apply to the results. Also, 2^64 such patterns make this more prevalent than zero-land issues, right? (There are far fewer than 2^64 ways to choose 6 bits from 256.)

all sequences produced by an LCG with power-of-two modulus (and multiplier with maximum potency, as usually happens) with varying constant are the same

I look forward to the read!

vigna commented Dec 30, 2019

Usually you assume P processors each using L outputs and you want to compute the probability of overlap. I don't know if the next-state function of SFC is bijective, but let's assume it is. But yes: either the orbits are very long, and then there may be collisions on the orbits but not on the sequences, or they are short, and then there could be an overlap within a sequence, but since everyone is in a different orbit everything is fine. I have no idea if this is more "efficient" than a single orbit; one would need to compute the probability of overlap in both cases.
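
For context, the usual back-of-the-envelope estimate (a rough sketch, not specific to SFC): with P streams of L values each, seeded at random positions on a single cycle of length N, the overlap probability is roughly bounded as below; the disagreement above is about what to plug in for N when only a lower bound on the cycle length is known.

```latex
\[ \Pr[\text{some two streams overlap}] \;\lesssim\; \frac{P\,(P-1)\,L}{N}, \]
e.g.\ $P = 2^{20}$, $L = 2^{40}$, $N = 2^{256}$ gives roughly $2^{-176}$,
whereas plugging in only the guaranteed $N = 2^{64}$ gives a bound above $1$,
i.e.\ no guarantee at all.
```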

But you're right, my considerations about random seeding were mostly FUD (given that the next-state function is bijective).

Note that "Escape from Zeroland" is not my invention—it is a term used in the paper about WELL.

As for the "no matter which scrambler" point: no, if you use a scrambler like ++ that combines several parts of the state, the observation about 5-in-7 is no longer valid: there might be this bias in one of the combined words, but it will be hidden by the others. ** is different because it bijectively remaps a single word of state, and that word is specifically the one with the reported bias (in fact, we could probably change the word we use if this makes someone happier).

For the zeroland part, no, it's not less frequent; it is just a matter of how close you wanna look. Since there are 256 bits of state, you can consider the more than 2^64 states containing at most 11 ones. You will definitely see a bias in the number of ones in the next state (which will be partially hidden by the scrambler, but it's there).


dhardy commented Dec 31, 2019

I don't know if the next-state function of SFC is bijective

Here's the SFC-64 transition function. I think x ^ (x >> 11) and x + (x << 3) are both bijective? If so, then the transition is bijective. (Though there's a small bug in this implementation: Rust traps on integer overflow in debug builds unless wrapping arithmetic is used explicitly, so that counter will eventually cause a panic.)
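
For concreteness, a sketch of that transition in Rust using wrapping arithmetic (the shifts 11 and 3 are the ones quoted above; the rotate of 24 and the counter are from Doty-Humphrey's published SFC64); this is illustrative, not the linked implementation:

```rust
struct Sfc64 {
    a: u64,
    b: u64,
    c: u64,
    counter: u64,
}

impl Sfc64 {
    fn next_u64(&mut self) -> u64 {
        let out = self.a.wrapping_add(self.b).wrapping_add(self.counter);
        // Wrapping increment avoids the debug-mode overflow panic noted above.
        self.counter = self.counter.wrapping_add(1);
        self.a = self.b ^ (self.b >> 11);          // bijective xorshift
        self.b = self.c.wrapping_add(self.c << 3); // bijective "x + (x << 3)"
        self.c = self.c.rotate_left(24).wrapping_add(out);
        out
    }
}
```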

vigna commented Jan 20, 2020

BTW, if you're interested our paper is out: https://arxiv.org/abs/2001.05304

peteroupc commented Jan 25, 2020

For your information I have written a short document on testing PRNGs for high-quality randomness. There is more to testing PRNGs than testing a single random number sequence. For instance:

  • Testing two "nearby sequences" of the same PRNG can reveal correlations between them. For example, PCG can produce correlated sequences from two internal states that differ only in their high bits. However, this might depend on how these nearby sequences are formed. For instance, testing could reveal that one PRNG and the same PRNG formed by efficiently discarding period / golden-ratio worth of outputs have practically uncorrelated nearby sequences. These tests can reveal the best way to initialize "nearby sequences" of a given PRNG to reduce correlation risks.
  • Hash functions have their own suggestion for a random number sequence to test, especially since they can form the basis of counter-based PRNGs.
  • Splittable PRNGs have four suggested random number sequences, taken from Schaathun's work from 2015.
  • Sometimes a PRNG can gain useful properties when combined with another. For example, an XOR combination of a full-period LCG and a nonlinear PRNG (such as JSF) will be a PRNG with a guaranteed minimum cycle length, namely that of the LCG (a minimal sketch follows at the end of this comment).

However, if we limit the scope to PRNGs with 128 or maybe 256 bits of state, maybe a PractRand threshold of 1 TiB, as I suggest in that document, is too much for SmallRng's purposes. Note also that hash functions are essentially stateless.
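
A minimal sketch of the XOR-combination idea from the list above; the LCG constants are the Numerical Recipes pair and the JSF32 rotation constants (27, 17) are Bob Jenkins's published values, so treat this as illustrative rather than a vetted design:

```rust
struct LcgJsf32 {
    lcg: u32,
    a: u32,
    b: u32,
    c: u32,
    d: u32,
}

impl LcgJsf32 {
    fn next_u32(&mut self) -> u32 {
        // Full-period LCG modulo 2^32: guarantees a minimum cycle of 2^32.
        self.lcg = self.lcg.wrapping_mul(1664525).wrapping_add(1013904223);

        // One JSF32 step (nonlinear, empirically strong, period unknown).
        let e = self.a.wrapping_sub(self.b.rotate_left(27));
        self.a = self.b ^ self.c.rotate_left(17);
        self.b = self.c.wrapping_add(self.d);
        self.c = self.d.wrapping_add(e);
        self.d = e.wrapping_add(self.a);

        // The XOR combination inherits the LCG's cycle-length guarantee.
        self.d ^ self.lcg
    }
}
```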

vks mentioned this issue Aug 5, 2020
vks added this to the 0.8 release milestone Aug 31, 2020

dhardy commented Sep 6, 2020

It's time to draw a conclusion here. I'm inclined to take @vks's suggestion:

This looks like xoshiro128++ and xoshiro256++ are good choices for i686 and x86_64, respectively.

Vigna noted that microbenchmarks may be misleading; still, without more to go on (and knowing our target platforms), I believe the above is a good choice.

Note that in general generators writing to multiple locations (like those above) might perform better in artificial benchmarks (e.g., loops) than generators with a stream parameter, but in real life the cost of storing two words instead of one might be detectable even at the application level in some circumstances (e.g., a lot of generation but in a situation in which the compiler cannot keep the state of the PRNG in registers).
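
For reference, a sketch of xoshiro256++'s output step in Rust, transcribed from Vigna's reference C code (rotation/shift constants 23, 17 and 45); per the commits below, this is the family #1038 adopts for SmallRng:

```rust
struct Xoshiro256PlusPlus {
    s: [u64; 4], // must not be seeded to all zeros
}

impl Xoshiro256PlusPlus {
    fn next_u64(&mut self) -> u64 {
        // The "++" scrambler combines two words of state, which is the
        // property discussed above for the 5-in-7 repeat question.
        let result = self.s[0]
            .wrapping_add(self.s[3])
            .rotate_left(23)
            .wrapping_add(self.s[0]);

        // Linear engine update.
        let t = self.s[1] << 17;
        self.s[2] ^= self.s[0];
        self.s[3] ^= self.s[1];
        self.s[1] ^= self.s[2];
        self.s[0] ^= self.s[3];
        self.s[2] ^= t;
        self.s[3] = self.s[3].rotate_left(45);

        result
    }
}
```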

vks added commits to vks/rand that referenced this issue on Sep 6 and Sep 15, 2020
@vks vks closed this as completed in #1038 Sep 15, 2020
vks added a commit that referenced this issue Sep 15, 2020
Due to close correlations of PCG streams (#907) and lack of right-state
propagation (#905), the `SmallRng` algorithm is switched to
xoshiro{128,256}++. The implementation is taken from the `rand_xoshiro`
crate and slightly simplified.

Fixes #910.