
Consider replacing MD5 with CRC32 for quick check #2396

Closed
animetosho opened this issue Jan 2, 2023 · 34 comments

Comments

@animetosho

To not stray too far off topic in a previous conversation, I've decided to move the discussion here.

Summary so far:

  • assemble computes the MD5 of the file it's assembling. MD5 computation may become a performance limiter in the future (and already consumes a fair bit of CPU)
  • @puzzledsab pointed out that MD5 could probably be replaced with CRC32
  • CRC32 is already being computed during yEnc decoding, so adopting CRC32 would effectively eliminate the need to separately compute a hash
  • only checking yEnc CRC32 against the decoded articles (without any PAR2 check) has been tried in the past (Make Quick-Check less strict (second try) #2160), but showed issues
  • PAR2 does also contain CRC32s that could be checked against; however, they are block hashes, not file hashes
  • a file's CRC32 can be obtained by joining the decoded articles' CRC32s (without the need to hash all the data again)
  • similarly, the file's CRC32 can be obtained from the PAR2 by joining the block hashes. The last block's hash may need to have its zero padding trimmed
  • sample Python code for CRC manipulation operations (a minimal sketch of these helpers follows below)

Essentially this could eliminate the need to compute the MD5 hash of each file, whilst still verifying the data against what's specified in the PAR2. The downside is that CRC32 is a less reliable hash than MD5.
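
(Since the sample-code link above didn't survive, here's a minimal sketch of the helpers the examples below rely on, assuming the standard reflected CRC-32 polynomial shared by zlib, yEnc and PAR2. crc_power and the x**-1 constant are a reconstruction for illustration, not necessarily how the original sample implemented these.)

CRC32_POLYNOMIAL = 0xEDB88320  # reflected CRC-32 polynomial (zlib/yEnc/PAR2)
CRC32_XPOW_NEG1 = 0xDB710641   # x**-1 mod P, obtained by inverting the multiply-by-x step

# multiply two CRC values: GF(2) polynomial multiplication modulo the polynomial
# (0x80000000 is the multiplicative identity in this bit-reflected representation)
def crc_multiply(a: int, b: int) -> int:
	prod = 0
	while b:
		if b & 0x80000000:
			prod ^= a
		b = (b << 1) & 0xFFFFFFFF
		a = (a >> 1) ^ (CRC32_POLYNOMIAL & -(a & 1))
	return prod

# raise base to the n-th power by square-and-multiply
def crc_power(base: int, n: int) -> int:
	result = 0x80000000
	while n:
		if n & 1:
			result = crc_multiply(result, base)
		base = crc_multiply(base, base)
		n >>= 1
	return result

# "2**n": the operator that advances a CRC over n zero bits (0x40000000 represents x)
def crc_2pow(n: int) -> int:
	return crc_power(0x40000000, n)

# CRC of the concatenation of two buffers, given their CRCs and the second buffer's length
def crc_concat(crc1: int, crc2: int, len2: int) -> int:
	return crc_multiply(crc1, crc_2pow(len2 * 8)) ^ crc2

# remove `length` trailing zero bytes from the data covered by crc
def crc_zero_unpad(crc: int, length: int) -> int:
	return crc_multiply(crc ^ 0xFFFFFFFF, crc_power(CRC32_XPOW_NEG1, length * 8)) ^ 0xFFFFFFFF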


If we're going to experiment, I thought I'd point the following out, in case it's missed:

When assembling the CRC from decoded articles, they don't need to be assembled in order (unlike MD5). This is handy since articles don't necessarily arrive in order, and there's no need to keep the article CRCs around.

Example:

file_crc = 0
def once_an_article_is_decoded(article, file):
	global file_crc
	# where this article's data ends within the file
	article_end = article.offset + article.length
	# how many bytes sit between this article's end and the end of the file
	bytes_til_file_end = file.size - article_end
	# shift the article's CRC into place and fold it into the running value;
	# order of arrival doesn't matter
	file_crc = crc_concat(article.crc, file_crc, bytes_til_file_end)

# once all articles have been decoded, file_crc will be the file's CRC32

When assembling CRC from PAR2's IFSC, the block size is constant across the entire PAR2. If you look at the definition of crc_concat:

return crc_multiply(crc1, crc_2pow(len2*8)) ^ crc2

Since len2 is constant, it follows that crc_2pow(len2*8) is also constant. So as a simple optimisation, it only needs to be computed once, and can be reused in the loop.

Example:

file_crc = 0
crc_block_coeff = crc_2pow(par2_block_size * 8)
for block_crc in file_ifsc_crcs:
	file_crc = crc_multiply(file_crc, crc_block_coeff) ^ block_crc

total_crc_len = len(file_ifsc_crcs) * par2_block_size
zeroes_appended = total_crc_len - file_size
file_crc = crc_zero_unpad(file_crc, zeroes_appended)

(in theory, crc_multiply could be made faster if one of the coefficients is constant, but that optimisation is probably not necessary here)

@puzzledsab
Contributor

I have started looking into the par2 code in SAB and how to get the slices. I haven't looked at it much before so it may take some time. Just wanted to let you all know so we don't do the same work.

@mnightingale
Contributor

I had a look through NZBGet and it appears to implement what's discussed here. Aside from repairs, the MD5 calculation must be the most demanding job SAB does, so I'm optimistic this could be a large improvement, especially for slower devices.

Some of the comments on the function below may contain useful information or things to watch out for.

https://github.com/nzbget/nzbget/blob/5e26d52d706f129769e1d620a595c78498ca8cff/daemon/postprocess/ParChecker.cpp#L1288-L1302

@Safihre
Member

Safihre commented Jan 6, 2023

I guess because I never understood that PAR2 contains CRCs, I always misunderstood when hugbug said that they use CRC to do quick checking. I thought he was solely talking about per-article CRCs instead of combining them!

@puzzledsab
Contributor

First seemingly working version: https://github.com/puzzledsab/sabnzbd/tree/feature/block_hash
It calculates CRC32 in the decoder thread because sabyenc does not currently return it. Maybe it could return the value instead of crc_correct when the CRC is correct, and 0 when it's not?

I did not try to calculate the file CRC out of order in the decoder because doing it in the assembler seems simpler. I would have had to use index to get the article's position in the array, and there were more calculations involved. Maybe if we start using sparse files.
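
(For illustration, the in-order aggregation in the assembler boils down to something like this sketch, using the helpers from the first post; the iteration and attribute names are hypothetical, not the actual branch code:)

# extend the file CRC with each article's CRC, in file order
file_crc = 0
for article in articles_in_file_order:
	file_crc = crc_concat(file_crc, article.crc32, article.length)
# file_crc now equals the CRC32 of the assembled file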

@Safihre : The new article.crc32 value could probably replace article.decoded to keep article variables to a minimum. None = not decoded, 0 = decoded with wrong CRC and >0 = decoded with correct CRC.

crc32.py has the standard SABnzbd GPL header. Maybe I should remove it considering it's @animetosho's code?

@Safihre
Member

Safihre commented Jan 7, 2023

@puzzledsab nice, does it work? 🤩
Indeed the CRC should be returned from sabyenc3; it used to do that, but I stripped it at some point since we didn't really use the exact CRC value.
The CRC Python code can go in the utils directory, and indeed without the SABnzbd header, but referring to @animetosho and maybe a link to this issue.

@puzzledsab
Contributor

puzzledsab commented Jan 7, 2023

It generates the correct CRC32 file values from the PAR2 file and the quick-check says the files are all ok. Unless I'm missing something, it should be working. I don't have any nzbs with broken files for testing. I kept article.md5sum working like before, but using the CRC32 file value. The PAR2 code is a bit complex and used in multiple contexts, so I didn't want to make more changes than I had to. From what I understand you intend to change it before the 4.0 release.

Edit: It detected the bad file when I deleted the last part of a file in an nzb.

@Safihre
Member

Safihre commented Jan 7, 2023

Can you extend the tests in test_par2file and make a PR?
Then I can give proper feedback.

@animetosho
Author

animetosho commented Jan 8, 2023

Nice work @puzzledsab!

I generally write code as public domain, so feel free to include/use/modify/license it in any way you see fit.

Sabyenc could probably return both the CRC and file begin or end offsets, which would allow the CRC to be merged in the download code. There'd be a bit of an assumption that the begin/end value in the yEnc header is accurate, but if not, the computed hash would be wrong.

@Safihre do you want to keep the CRC_CHECK check? If the CRC value is going to be depended on, it probably makes no sense to allow disabling it.
Perhaps it could make sense to keep, if you intend to provide the user with the option of CRC or MD5 quick verification?

@Safihre
Member

Safihre commented Jan 8, 2023

@animetosho why do we need the file offsets? I don't understand the CRC fully, but in @puzzledsab's Python version we don't seem to need them either?

@animetosho
Author

animetosho commented Jan 8, 2023

It's not strictly necessary, as you point out.

The first example in the first post shows how the CRC can be accumulated out of order. To do the accumulation part, you'll need to know the offset of the data the CRC refers to.
You can, of course, just remember the CRC for each article, and accumulate it at the end, which I think is what @puzzledsab's code is doing.
But since Sabyenc already parses the begin/end values, it seems fairly straightforward to just pass those values back to Python, even if you don't ever use them. It's up to you though - I don't think it matters much either way.

Edit: thinking about it, if true threading becomes a thing in Python, aggregating at the end might be easier anyway. I see puzzledsab's code also avoids computing crc_2pow for every article, which is a neat trick, but this wouldn't work for an out-of-order aggregation. So that's a minor plus with doing in-order aggregation at the end.

@Safihre
Member

Safihre commented Jan 8, 2023

Thanks!
I also noticed that crc_2pow is cached, but it seems to rely on each article having the same size, right @puzzledsab?
That's not guaranteed, for example when people manually combine files on search websites like Binsearch, or when an indexer wrongly combines multiple posts.
Is that maybe something we could also do in the C code (outside the GIL), if it's per-article and expensive to calculate?
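
(For reference, the caching pattern being discussed boils down to memoising crc_2pow per article size, something like this sketch:)

from functools import lru_cache

# memoise the per-size coefficient; repeated article sizes hit the cache
@lru_cache(maxsize=None)
def crc_2pow_cached(bit_length):
	return crc_2pow(bit_length)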

@animetosho
Author

animetosho commented Jan 8, 2023

From what I can tell, there's a check on the size. Although it's not guaranteed to be the same, I'd expect it to be most of the time, so the check should usually hit the cached value.

crc_2pow should usually be fast. The main loop just cycles through the bits of the length, so for a 750KB=6Mbit article size (0b10111011100000000000000 in binary), that's 23 loop cycles. For each set bit (7 for 750KB), it calls crc_multiply, whose loop cycles at most 32 times (so 7*32=224 cycles for 750KB).
So a total of a few hundred loop cycles in the typical case, and the loop is mostly bitwise operations, which should be fast.

@Safihre
Member

Safihre commented Jan 8, 2023

Ah I missed that. Thanks.
All the CRC functions look very ripe for tools like mypyc or Cython!
Although I think the part that's actually in Python is probably not the most CPU intensive (right @animetosho?).

@animetosho
Author

Bitwise logic, like that for CRC, generally codes nicely in lower level languages, though it's often not too different in higher level languages. Note that I don't code much in Python, so everyone here would definitely know more than I do.

Performance wise, a few hundred cycles of bit manipulation, per article, doesn't really concern me if done in Python, but profiling it will give you a more definitive answer.

@puzzledsab
Contributor

puzzledsab commented Jan 8, 2023

I already did some testing to find out if the caching was worth it. Seconds for 100k rounds, crcs calculated from random 750KB blocks:

Standard Python
Precalculated: 1.079958499991335
Full:         13.607817300013267

mypyc crc32calc.py
Precalculated: 0.09876009996514767
Full:          0.9986711000092328
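
(For anyone who wants to reproduce the shape of that test, a hypothetical harness along these lines; it assumes the helpers from the first post and is not the exact script I used:)

import random
import timeit

BLOCK_LEN = 750 * 1024
COEFF = crc_2pow(BLOCK_LEN * 8)  # the precalculated coefficient
crcs = [random.getrandbits(32) for _ in range(1000)]

def precalculated():
	acc = 0
	for c in crcs:
		acc = crc_multiply(acc, COEFF) ^ c
	return acc

def full():
	acc = 0
	for c in crcs:
		acc = crc_concat(acc, c, BLOCK_LEN)  # recomputes crc_2pow every time
	return acc

# 100 repetitions x 1000 CRCs = 100k rounds
print("Precalculated:", timeit.timeit(precalculated, number=100))
print("Full:         ", timeit.timeit(full, number=100))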

@thezoggy
Contributor

thezoggy commented Jan 8, 2023

found this earlier; never knew there are two crc32 variants (crc32a/b):
https://github.com/Michaelangel007/crc32#checksum

not that it should matter for our needs (as we're staying somewhat within a small ecosystem bubble and not using external tools).

@animetosho
Author

animetosho commented Jan 9, 2023

That's a bit slower than I was expecting (though that is effectively over 72GB of data) - thanks for testing!

Since there's some interest in speed, here's a faster crc_2pow which processes 4 bits per iteration instead of 1:

# compute 2**n (without LUT)
def crc_2pow_slow(n: int):
	k = 0x80000000
	for bit in range(0, 32):
		k = crc_multiply(k, k)
		if n & (0x80000000 >> bit):
			k = (k>>1) ^ (CRC32_POLYNOMIAL & -(k&1))
	return k

CRC32_POWER_TABLE = [
	[
		crc_2pow_slow(v << (tbl*4)) for v in range(0, 16)
	] for tbl in range(0, 8)
]

# compute 2**n
def crc_2pow(n: int):
	result = CRC32_POWER_TABLE[0][n & 15]
	n >>= 4
	tbl = 1
	while n:
		if n & 15:
			result = crc_multiply(result, CRC32_POWER_TABLE[tbl & 7][n & 15])
		n >>= 4
		tbl += 1
	return result

(not sure whether you prefer pre-computed tables, or runtime computed)

From my quick test, it's around 2.5x the speed of the original.
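
(If you want a quick sanity check that the table-driven version matches, an assumed spot-test like this works:)

# spot-check the 4-bits-per-iteration version against the reference
for n in (0, 1, 8, 12345, 750 * 1024 * 8):
	assert crc_2pow(n) == crc_2pow_slow(n)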

@animetosho
Author

not that it should matter for our needs (as we're staying somewhat within a small ecosystem bubble and not using external tools).

If you're checking against PAR2's CRC32, you have to use the same polynomial as that. So we don't actually have a choice in the matter.
(it just so happens that yEnc's CRC32 also uses the same polynomial, so it matches up nicely 😊)

@puzzledsab
Contributor

(not sure whether you prefer pre-computed tables, or runtime computed)

I don't think it matters too much considering it normally runs as a server. I assume it will only be calculated at startup.

From my quick test, it's around 2.5x the speed of the original.

It was just under 2x faster in my tests but that's pretty good too.

@mnightingale
Contributor

mnightingale commented Jan 9, 2023

Would it be more efficient to do the crc_multiply in sabyenc?

I don't know if crcutil provides any of the required functions.

@animetosho I did notice cloudera/crcutil#3; should that be fixed? IsSSE42Available() returns false for me when it shouldn't, but I've no idea if the suggested fix is correct.
edit: seems it doesn't use it anyway since CRCUTIL_USE_MM_CRC32 is defined to 0, any reason for that?

@puzzledsab
Contributor

@Safihre : Do you know if the distributions can use mypyc? I think the bpsmeter could benefit from it too. I have modified it locally so that the only thing stopping it from compiling is that mysterious T() function for translations.

@Safihre
Member

Safihre commented Jan 9, 2023

@animetosho maybe.. I looked into it but found no straightforward way.
We could always split the part that requires T().
However, that's for another issue 🙂

@puzzledsab
Contributor

I was mostly thinking about how crc32calc.py would benefit from it. What about including the size comparison? It would be very easy to add a counter to the assembler but it would require another nzf variable. Do you prefer reading it from the filesystem?

@animetosho
Author

Would it be more efficient to do the crc_multiply in sabyenc?

What's the Python <-> C transition overhead like?
Given that CPython doesn't have JIT, perhaps it isn't so bad.

With crc_2pow cached, it was benchmarked at ~1 second for 72GB of data, which seems negligible to me, but if it's still a concern, we could explore implementing it in C.
Do you think it's worth it?

I don't know if crcutil provides any of the required functions.

I think they've got most things implemented.

@animetosho I did notice cloudera/crcutil#3; should that be fixed? IsSSE42Available() returns false for me when it shouldn't, but I've no idea if the suggested fix is correct. edit: seems it doesn't use it anyway since CRCUTIL_USE_MM_CRC32 is defined to 0, any reason for that?

It's pointless for us regardless, as the CRC32 instruction in SSE4.2 uses the Castagnoli polynomial (otherwise known as CRC32C). Even if that wasn't an issue, I can't see how it can be used for the calculations here (but maybe there's some way).

The CLMUL instruction could be useful, but was introduced after crcutil was written.

@animetosho
Author

animetosho commented Jan 10, 2023

None = not decoded, 0 = decoded with wrong CRC and >0 = decoded with correct CRC.

I suspect 0 is actually a valid CRC, so if possible, it's probably best to avoid using that as a special marker.

Edit: example: python3 -c "import zlib; print(zlib.crc32(b'\x9d\n\xd9m'))"

@puzzledsab
Contributor

puzzledsab commented Jan 10, 2023

I suspect 0 is actually a valid CRC, so if possible, it's probably best to avoid using that as a special marker.

I suspected that too, but it's just so much easier, and all that happens on a false positive is that the par2 check is run. For instance, concatenating 0 and a new crc32 value gives the new value, so I can initialize them to 0 and start concatenating without checking every time whether they need to be initialized.

It's not a big deal changing it, though. We'll need some other way of signaling bad crc from sabyenc.
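
(In terms of the CRC helpers, the identities being relied on are, for example:)

# 0 acts as a convenient neutral starting value
assert crc_concat(0, 0x12345678, 1000) == 0x12345678   # first concat yields the new CRC
assert crc_concat(0x12345678, 0, 0) == 0x12345678      # appending nothing changes nothing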

@animetosho
Author

animetosho commented Jan 11, 2023

Sorry, I was more referring to this code.

Perhaps it could return False or similar to the Python side, where it could detect a bad CRC and deal with it there.
Handling it as a false-negative sounds fine, just thought I should point out the issue in case it was missed.

@Safihre
Member

Safihre commented Jan 14, 2023

Considering this implemented! Will fix the last comment of @animetosho in the next sabyenc update :)

@Safihre Safihre closed this as completed Jan 14, 2023
@animetosho
Author

Thanks for everyone's contributions!

@puzzledsab
Contributor

Particularly you, @animetosho :) I tried importing zlib's crc32_combine as described here: https://stackoverflow.com/questions/35259527/using-zlib-crc32-combine-in-python (roughly as sketched below the numbers).

It was only marginally faster than your optimized Python code for the full calculation, and much slower than the precalculated one. It's a bit strange considering the mypyc version of your code is 10x faster. Maybe it's the way it is imported, but still...

precalculated: 1.0768595999688841
full:          6.895056699984707
zlib:          5.742461500049103
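
(The import, roughly, per that Stack Overflow answer; this is a sketch, and the z_off_t argument width depends on how the local zlib was built:)

import ctypes
import ctypes.util

# load the system zlib and bind its crc32_combine(crc1, crc2, len2)
_zlib = ctypes.CDLL(ctypes.util.find_library("z"))
_zlib.crc32_combine.argtypes = [ctypes.c_ulong, ctypes.c_ulong, ctypes.c_long]
_zlib.crc32_combine.restype = ctypes.c_ulong

def zlib_crc32_combine(crc1, crc2, len2):
	return _zlib.crc32_combine(crc1, crc2, len2)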

@mnightingale
Contributor

@puzzledsab if you're curious, try https://github.com/mnightingale/sabyenc/tree/feature/crc32_combine; it's barely any effort to add it to sabyenc. I even used the faster method definition to have as little overhead as possible.

https://gist.github.com/mnightingale/9f03901063dcf66510a20f0acb8e8d6c

500K rounds:

python:  17.913410699999076
crcutil:  0.7693139000002702

@puzzledsab
Contributor

@Safihre : We should probably use that. The same test as before took 0.0684 seconds.

@Safihre
Member

Safihre commented Jan 16, 2023

Nice. Indeed, let's add it there, but with a tiny bit of input verification, just to prevent a segfault if, due to some bug, one of the 2 values is None or something like that.
Just out of curiosity, I would be very interested to see how much slower the ParseTuple version would be 😊

@animetosho
Author

It was only marginally faster than your optimized Python code for the full calculation, and much slower than the precalculated one

That's surprising. zlib, crcutil and my original Python version look to be largely the same algorithm. Maybe there's a fair bit of FFI overhead with the zlib test.
I haven't seen anyone implement the '4 bits at a time' strategy; probably no-one really bothered to optimise it that much.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 19, 2023