-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perfomance improvement for the realtek. #88
Comments
Thank you again for your contribution. I will test as soon as I have some spare time :) The modulo operator is quite slow: seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647); can be substituted with this (should be ~30% faster): const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m); |
I did not noticed then you replaced srandom_r () |
Added the new code with (7738fda). I simplified the code a bit, like removing the On my x86 machine it's about 60% faster, but my commit aimed for a quick and simple integration with the existing code. Some further testing and tweaking could be useful. |
In the cycle |
Yes. I think the code could be re-written this way: static inline uint32_t glibc_fast_seed(uint32_t seed)
{
uint32_t word0 = 0;
for (int j = 3; j < 31 + 3 - 1; j++) {
word0 += seed * glibc_seed_tbl[j];
/* This does: seed = (16807LL * seed) % 0x7fffffff
using the sum of digits method which works for mod N, base N+1 */
const uint64_t p = 16807ULL * seed; // Seed is always positive (31 bits)
seed = (p >> 31) + (p & 0x7fffffff);
}
return (word0 + seed * glibc_seed_tbl[33]) >> 1;
} The variable seed is always positive thus the multiplication /* The result might still not fit in 31 bits, if not, repeat
(conditional seems to make it slightly faster) */
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m; is useless. Maybe you can confirm me that. BTW if you want come back to the IRC channel we can discuss these things there. I know that @rofl0r can be a bit annoying but bear with him ;) |
Pushed the new changes (7acd739). On my laptop now the difference is much more noticeable compared to my old code: 4 times faster. |
Not confirm. For example with seed 5951 glibc_fast_seed() gets wrong result. I don know why. With |
A constantly working messenger distracts me. But when there is something to discuss - I will join. |
I found something interesting:
produces wrong nonce. |
Upd. |
Will do soon. At the time I added the conditional because it seemed to be faster that way. In this function however it seems to make it noticeably slower so I guess we should replace: const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m; with this: uint64_t p = 16807ULL * seed;
p = (p >> 31) + (p & 0x7fffffff);
seed = (p >> 31) + (p & 0x7fffffff); In the current code negative values are never tested since the seed corresponds always to the current time, it's always positive. So I wouldn't add the check with: seed = (seed == 0x7fffffff || seed == 0xfffffffe) ? 0 : seed; which only slows down the cracking process. If the current time is not retrieved from the router, the value of 0 (Unix Epoch) is returned. But seed = 0 returns a value of 0. So internally the PRNG checks if seed = 0, if it is it changes it to 1, see glibc_random_old.c#L93: /* We must make sure the seed is not 0. Take arbitrarily 1 in this case. */
if (seed == 0)
seed = 1; The value of |
What's the purpose of doing: REALTEK_FAST_SEED_STEP over and over again, if the seed doesn't change? |
It updates the seed :) |
Old:
New:
It doesn't really change anything for me. Even if you unroll or remove the for loop, the instructions are still executed sequentially since one depends on the other. For example: REALTEK_FAST_SEED_STEP
b += realtek_fast_lookup_4d[(uint8_t)seed]; /* This must wait operation above to finish (since seed is shared) */
REALTEK_FAST_SEED_STEP /* This must wait operation above to finish (since seed is shared) */
b += realtek_fast_lookup_5[(uint8_t)seed]; /* This must wait operation above to finish (since seed is shared) */ If instead we'd have something like: for (...) {
// ...
independent_operation_A();
independent_operation_B();
// ...
} those 2 would execute in parallel if they wouldn't share the same variables ( Also the most expensive operation is the modulo (%) or its variant in |
I pushed the changes to fix my previous commit so we have the same starting point to judge optimizations and modifications. Maybe another consideration that could be useful is the fact that we test the seeds in sequence:
or, backwards
Not sure if this helps, but if it would we could setup some intermediate table(s), eg: I'm just throwing out ideas. I'm still not sure I understand what you did with your changes :) |
I have no more ideas. |
We've come a log way. A simple benchmark over a 3 years period (
Obviously the difference is less noticeable on a faster machine (e.g. my Desktop). There will be a new release soon. Thank you :) Data used:
|
I found the way to improvement perfomance of the realtek prng.
and the example - how to use these functions:
It works 5 more times faster than the chain srandom_r() + random_r(). I do not tested it on big endian CPU.
Edit: added syntax coloring (wiire-a)
The text was updated successfully, but these errors were encountered: