
Consider replacing MD5 with CRC32 for quick check #2396

Closed
animetosho opened this issue Jan 2, 2023 · 34 comments

Comments

@animetosho

To not stray too far off topic in a previous conversation, I've decided to move the discussion here.

Summary so far:

  • assemble computes the MD5 of the file it's assembling. MD5 computation may become a performance limiter in the future (and already consumes a fair bit of CPU)
  • @puzzledsab pointed out that MD5 could probably be replaced with CRC32
  • CRC32 is already being computed during yEnc decoding, so adopting CRC32 would effectively eliminate the need to separately compute a hash
  • only checking yEnc CRC32 against the decoded articles (without any PAR2 check) has been tried in the past (Make Quick-Check less strict (second try) #2160), but showed issues
  • PAR2 does also contain CRC32s that could be checked against; however, they are block hashes, not file hashes
  • a file's CRC32 can be obtained by joining the decoded articles' CRC32s (without the need to hash all the data again)
  • similarly, the file's CRC32 can be obtained from the PAR2 by joining the block hashes. The last block's hash may need to have its zero padding trimmed
  • sample Python code for CRC manipulation operations (a minimal sketch of these helpers follows below)

Essentially this could eliminate the need to compute the MD5 hash of each file, whilst still verifying the data against what's specified in the PAR2. The downside is that CRC32 is a less reliable hash than MD5.
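
(Since the sample-code link above didn't survive, here's a minimal sketch of the helpers the examples below rely on, assuming the standard reflected CRC-32 polynomial shared by zlib, yEnc and PAR2. crc_power and the x**-1 constant are a reconstruction for illustration, not necessarily how the original sample implemented these.)

CRC32_POLYNOMIAL = 0xEDB88320  # reflected CRC-32 polynomial (zlib/yEnc/PAR2)
CRC32_XPOW_NEG1 = 0xDB710641   # x**-1 mod P, obtained by inverting the multiply-by-x step

# multiply two CRC values: GF(2) polynomial multiplication modulo the polynomial
# (0x80000000 is the multiplicative identity in this bit-reflected representation)
def crc_multiply(a: int, b: int) -> int:
	prod = 0
	while b:
		if b & 0x80000000:
			prod ^= a
		b = (b << 1) & 0xFFFFFFFF
		a = (a >> 1) ^ (CRC32_POLYNOMIAL & -(a & 1))
	return prod

# raise base to the n-th power by square-and-multiply
def crc_power(base: int, n: int) -> int:
	result = 0x80000000
	while n:
		if n & 1:
			result = crc_multiply(result, base)
		base = crc_multiply(base, base)
		n >>= 1
	return result

# "2**n": the operator that advances a CRC over n zero bits (0x40000000 represents x)
def crc_2pow(n: int) -> int:
	return crc_power(0x40000000, n)

# CRC of the concatenation of two buffers, given their CRCs and the second buffer's length
def crc_concat(crc1: int, crc2: int, len2: int) -> int:
	return crc_multiply(crc1, crc_2pow(len2 * 8)) ^ crc2

# remove `length` trailing zero bytes from the data covered by crc
def crc_zero_unpad(crc: int, length: int) -> int:
	return crc_multiply(crc ^ 0xFFFFFFFF, crc_power(CRC32_XPOW_NEG1, length * 8)) ^ 0xFFFFFFFF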


If we're going to experiment, I thought I'd point the following out, in case it's missed:

When assembling the CRC from decoded articles, they don't need to be assembled in order (unlike MD5). This is handy since articles don't necessarily arrive in order, and there's no need to keep the article CRCs around.

Example:

file_crc = 0
def once_an_article_is_decoded(article, file):
	global file_crc
	# where this article's data ends within the file
	article_end = article.offset + article.length
	# how many bytes sit between this article's end and the end of the file
	bytes_til_file_end = file.size - article_end
	# shift the article's CRC into place and fold it into the running value;
	# order of arrival doesn't matter
	file_crc = crc_concat(article.crc, file_crc, bytes_til_file_end)

# once all articles have been decoded, file_crc will be the file's CRC32

When assembling CRC from PAR2's IFSC, the block size is constant across the entire PAR2. If you look at the definition of crc_concat:

return crc_multiply(crc1, crc_2pow(len2*8)) ^ crc2

Since len2 is constant, it follows that crc_2pow(len2*8) is also constant. So as a simple optimisation, it only needs to be computed once, and can be reused in the loop.

Example:

file_crc = 0
crc_block_coeff = crc_2pow(par2_block_size * 8)
for block_crc in file_ifsc_crcs:
	file_crc = crc_multiply(file_crc, crc_block_coeff) ^ block_crc

total_crc_len = len(file_ifsc_crcs) * par2_block_size
zeroes_appended = total_crc_len - file_size
file_crc = crc_zero_unpad(file_crc, zeroes_appended)

(in theory, crc_multiply could be made faster if one of the coefficients is constant, but that optimisation is probably not necessary here)

@puzzledsab
Contributor

I have started looking into the par2 code in SAB and how to get the slices. I haven't looked at it much before so it may take some time. Just wanted to let you all know so we don't do the same work.

@mnightingale
Contributor

I had a look through NZBGet and it appears to implement what's discussed here. Aside from repairs, the MD5 calculation must be the most demanding job SAB does, so I'm optimistic this could be a large improvement, especially for slower devices.

Some of the comments on the function below may contain useful information or things to watch out for.

https://github.com/nzbget/nzbget/blob/5e26d52d706f129769e1d620a595c78498ca8cff/daemon/postprocess/ParChecker.cpp#L1288-L1302

@Safihre
Member

Safihre commented Jan 6, 2023

I guess because I never understood that PAR2 contains CRCs, I always misunderstood when hugbug said that they use CRC to do quick checking. I thought he was solely talking about per-article CRCs instead of combining them!

@puzzledsab
Contributor

First seemingly working version: https://github.com/puzzledsab/sabnzbd/tree/feature/block_hash
It calculates CRC32 in the decoder thread because sabyenc does not currently return it. Maybe it could return the value instead of crc_correct when the CRC is correct, and 0 when it's not?

I did not try to calculate the file CRC out of order in the decoder because doing it in the assembler seems simpler. I would have had to use index to get the article's position in the array, and there were more calculations involved. Maybe if we start using sparse files.
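
(For illustration, the in-order aggregation in the assembler boils down to something like this sketch, using the helpers from the first post; the iteration and attribute names are hypothetical, not the actual branch code:)

# extend the file CRC with each article's CRC, in file order
file_crc = 0
for article in articles_in_file_order:
	file_crc = crc_concat(file_crc, article.crc32, article.length)
# file_crc now equals the CRC32 of the assembled file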

@Safihre : The new article.crc32 value could probably replace article.decoded to keep article variables to a minimum. None = not decoded, 0 = decoded with wrong CRC and >0 = decoded with correct CRC.

crc32.py has the standard SABnzbd GPL header. Maybe I should remove it considering it's @animetosho's code?

@Safihre
Member

Safihre commented Jan 7, 2023

@puzzledsab nice, does it work? 🤩
Indeed the CRC should be returned from sabyenc3; it used to do that, but I stripped it at some point since we didn't really use the exact CRC value.
The CRC Python code can go in the utils directory, and indeed without the SABnzbd header, but referring to @animetosho and maybe a link to this issue.

@puzzledsab
Contributor

puzzledsab commented Jan 7, 2023

It generates the correct CRC32 file values from the PAR2 file and the quick-check says the files are all ok. Unless I'm missing something, it should be working. I don't have any nzbs with broken files for testing. I kept article.md5sum working like before, but using the CRC32 file value. The PAR2 code is a bit complex and used in multiple contexts, so I didn't want to make more changes than I had to. From what I understand you intend to change it before the 4.0 release.

Edit: It detected the bad file when I deleted the last part of a file in an nzb.

@Safihre
Member

Safihre commented Jan 7, 2023

Can you extend the tests in test_par2file and make a PR?
Then I can give proper feedback.

@animetosho
Author

animetosho commented Jan 8, 2023

Nice work @puzzledsab!

I generally write code as public domain, so feel free to include/use/modify/license it in any way you see fit.

Sabyenc could probably return both the CRC and file begin or end offsets, which would allow the CRC to be merged in the download code. There'd be a bit of an assumption that the begin/end value in the yEnc header is accurate, but if not, the computed hash would be wrong.

@Safihre do you want to keep the CRC_CHECK check? If the CRC value is going to be depended on, it probably makes no sense to allow disabling it.
Perhaps it could make sense to keep, if you intend to provide the user with the option of CRC or MD5 quick verification?

@Safihre
Member

Safihre commented Jan 8, 2023

@animetosho why do we need the file offsets? I don't understand the CRC fully, but in @puzzledsab's Python version we don't seem to need them either?

@animetosho
Author

animetosho commented Jan 8, 2023

It's not strictly necessary, as you point out.

The first example in the first post shows how the CRC can be accumulated out of order. To do the accumulation part, you'll need to know the offset of the data the CRC refers to.
You can, of course, just remember the CRC for each article, and accumulate it at the end, which I think is what @puzzledsab's code is doing.
But since Sabyenc already parses the begin/end values, it seems fairly straightforward to just pass those values back to Python, even if you don't ever use them. It's up to you though - I don't think it matters much either way.

Edit: thinking about it, if true threading becomes a thing in Python, aggregating at the end might be easier anyway. I see puzzledsab's code also avoids computing crc_2pow for every article, which is a neat trick, but this wouldn't work for an out-of-order aggregation. So that's a minor plus with doing in-order aggregation at the end.

@Safihre
Member

Safihre commented Jan 8, 2023

Thanks!
I also noticed that crc_2pow is cached, but it seems to rely on each article having the same size, right @puzzledsab?
That's not guaranteed, for example when people manually combine files on search websites like Binsearch, or when an indexer wrongly combines multiple posts.
Is that maybe something we could also do in the C code (outside the GIL), if it's per-article and expensive to calculate?
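
(For reference, the caching pattern being discussed boils down to memoising crc_2pow per article size, something like this sketch:)

from functools import lru_cache

# memoise the per-size coefficient; repeated article sizes hit the cache
@lru_cache(maxsize=None)
def crc_2pow_cached(bit_length):
	return crc_2pow(bit_length)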

@animetosho
Author

animetosho commented Jan 8, 2023

From what I can tell, there's a check on the size. Although it's not guaranteed to be the same, I'd expect it to be most of the time, so the check should usually hit the cached value.

crc_2pow should usually be fast. The main loop just cycles through the bits of the length, so for a 750KB=6Mbit article size (0b10111011100000000000000 in binary), that's 23 loop cycles. For each set bit (7 for 750KB), it calls crc_multiply, whose loop cycles at most 32 times (so 7*32=224 cycles for 750KB).
So a total of a few hundred loop cycles in the typical case, and the loop is mostly bitwise operations, which should be fast.

@Safihre
Member

Safihre commented Jan 8, 2023

Ah I missed that. Thanks.
All the CRC functions look very ripe for tools like mypyc or Cython!
Although I think the part that's actually in Python is probably not the most CPU intensive (right @animetosho?).

@animetosho
Author

Bitwise logic, like that for CRC, generally codes nicely in lower level languages, though it's often not too different in higher level languages. Note that I don't code much in Python, so everyone here would definitely know more than I do.

Performance wise, a few hundred cycles of bit manipulation, per article, doesn't really concern me if done in Python, but profiling it will give you a more definitive answer.

@puzzledsab
Contributor

puzzledsab commented Jan 8, 2023

I already did some testing to find out if the caching was worth it. Seconds for 100k rounds, crcs calculated from random 750KB blocks:

Standard Python
Precalculated: 1.079958499991335
Full:         13.607817300013267

mypyc crc32calc.py
Precalculated: 0.09876009996514767
Full:          0.9986711000092328
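
(For anyone who wants to reproduce the shape of that test, a hypothetical harness along these lines; it assumes the helpers from the first post and is not the exact script I used:)

import random
import timeit

BLOCK_LEN = 750 * 1024
COEFF = crc_2pow(BLOCK_LEN * 8)  # the precalculated coefficient
crcs = [random.getrandbits(32) for _ in range(1000)]

def precalculated():
	acc = 0
	for c in crcs:
		acc = crc_multiply(acc, COEFF) ^ c
	return acc

def full():
	acc = 0
	for c in crcs:
		acc = crc_concat(acc, c, BLOCK_LEN)  # recomputes crc_2pow every time
	return acc

# 100 repetitions x 1000 CRCs = 100k rounds
print("Precalculated:", timeit.timeit(precalculated, number=100))
print("Full:         ", timeit.timeit(full, number=100))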

@thezoggy
Contributor

thezoggy commented Jan 8, 2023

found this earlier; never knew there are two crc32 variants (crc32a/b):
https://github.com/Michaelangel007/crc32#checksum

not that it should matter for our needs (as we're staying somewhat within a small ecosystem bubble and not using external tools).

@animetosho
Author

animetosho commented Jan 9, 2023

That's a bit slower than I was expecting (though that is effectively over 72GB of data) - thanks for testing!

Since there's some interest in speed, here's a faster crc_2pow which processes 4 bits per iteration instead of 1:

# compute 2**n (without LUT)
def crc_2pow_slow(n: int):
	k = 0x80000000
	for bit in range(0, 32):
		k = crc_multiply(k, k)
		if n & (0x80000000 >> bit):
			k = (k>>1) ^ (CRC32_POLYNOMIAL & -(k&1))
	return k

CRC32_POWER_TABLE = [
	[
		crc_2pow_slow(v << (tbl*4)) for v in range(0, 16)
	] for tbl in range(0, 8)
]

# compute 2**n
def crc_2pow(n: int):
	result = CRC32_POWER_TABLE[0][n & 15]
	n >>= 4
	tbl = 1
	while n:
		if n & 15:
			result = crc_multiply(result, CRC32_POWER_TABLE[tbl & 7][n & 15])
		n >>= 4
		tbl += 1
	return result

(not sure whether you prefer pre-computed tables, or runtime computed)

From my quick test, it's around 2.5x the speed of the original.
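
(If you want a quick sanity check that the table-driven version matches, an assumed spot-test like this works:)

# spot-check the 4-bits-per-iteration version against the reference
for n in (0, 1, 8, 12345, 750 * 1024 * 8):
	assert crc_2pow(n) == crc_2pow_slow(n)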

@animetosho
Author

not that it should matter for our needs (as we're staying somewhat within a small ecosystem bubble and not using external tools).

If you're checking against PAR2's CRC32, you have to use the same polynomial as that. So we don't actually have a choice in the matter.
(it just so happens that yEnc's CRC32 also uses the same polynomial, so it matches up nicely 😊)

@puzzledsab
Contributor

(not sure whether you prefer pre-computed tables, or runtime computed)

I don't think it matters too much considering it normally runs as a server. I assume it will only be calculated at startup.

From my quick test, it's around 2.5x the speed of the original.

It was just under 2x faster in my tests but that's pretty good too.

@mnightingale
Contributor

mnightingale commented Jan 9, 2023

Would it be more efficient to do the crc_multiply in sabyenc?

I don't know if crcutil provides any of the required functions.

@animetosho I did notice cloudera/crcutil#3; should that be fixed? IsSSE42Available() returns false for me when it shouldn't, but I've no idea if the suggested fix is correct.
edit: seems it doesn't use it anyway since CRCUTIL_USE_MM_CRC32 is defined to 0, any reason for that?

@puzzledsab
Contributor

@Safihre : Do you know if the distributions can use mypyc? I think the bpsmeter could benefit from it too. I have modified it locally so that the only thing stopping it from compiling is that mysterious T() function for translations.

@Safihre
Member

Safihre commented Jan 9, 2023

@animetosho maybe.. I looked into it but found no straightforward way.
We could always split the part that requires T().
However, that's for another issue 🙂

@puzzledsab
Contributor

I was mostly thinking about how crc32calc.py would benefit from it. What about including the size comparison? It would be very easy to add a counter to the assembler but it would require another nzf variable. Do you prefer reading it from the filesystem?

@animetosho
Author

Would it be more efficient to do the crc_multiply in sabyenc?

What's the Python <-> C transition overhead like?
Given that CPython doesn't have JIT, perhaps it isn't so bad.

With crc_2pow cached, it was benchmarked at ~1 second for 72GB of data, which seems negligible to me, but if it's still a concern, we could explore implementing it in C.
Do you think it's worth it?

I don't know if crcutil provides any of the required functions.

I think they've got most things implemented.

@animetosho I did notice cloudera/crcutil#3; should that be fixed? IsSSE42Available() returns false for me when it shouldn't, but I've no idea if the suggested fix is correct. edit: seems it doesn't use it anyway since CRCUTIL_USE_MM_CRC32 is defined to 0, any reason for that?

It's pointless for us regardless, as the CRC32 instruction in SSE4.2 uses the Castagnoli polynomial (otherwise known as CRC32C). Even if that wasn't an issue, I can't see how it can be used for the calculations here (but maybe there's some way).

The CLMUL instruction could be useful, but was introduced after crcutil was written.

@animetosho
Author

animetosho commented Jan 10, 2023

None = not decoded, 0 = decoded with wrong CRC and >0 = decoded with correct CRC.

I suspect 0 is actually a valid CRC, so if possible, it's probably best to avoid using that as a special marker.

Edit: example: python3 -c "import zlib; print(zlib.crc32(b'\x9d\n\xd9m'))"

@puzzledsab
Contributor

puzzledsab commented Jan 10, 2023

I suspect 0 is actually a valid CRC, so if possible, it's probably best to avoid using that as a special marker.

I suspected that too, but it's just so much easier, and all that happens on a false positive is that the par2 check is run. For instance, concatenating 0 and a new crc32 value gives the new value, so I can initialize them to 0 and start concatenating without checking every time whether they need to be initialized.

It's not a big deal changing it, though. We'll need some other way of signaling bad crc from sabyenc.
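
(In terms of the CRC helpers, the identities being relied on are, for example:)

# 0 acts as a convenient neutral starting value
assert crc_concat(0, 0x12345678, 1000) == 0x12345678   # first concat yields the new CRC
assert crc_concat(0x12345678, 0, 0) == 0x12345678      # appending nothing changes nothing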

@animetosho
Author

animetosho commented Jan 11, 2023

Sorry, I was more referring to this code.

Perhaps it could return False or similar to the Python side, where it could detect a bad CRC and deal with it there.
Handling it as a false-negative sounds fine, just thought I should point out the issue in case it was missed.

@Safihre
Member

Safihre commented Jan 14, 2023

Considering this implemented! Will fix the last comment of @animetosho in the next sabyenc update :)

@Safihre Safihre closed this as completed Jan 14, 2023
@animetosho
Author

Thanks for everyone's contributions!

@puzzledsab
Contributor

Particularly you, @animetosho :) I tried importing zlib's crc32_combine as described here: https://stackoverflow.com/questions/35259527/using-zlib-crc32-combine-in-python (roughly as sketched below the numbers).

It was only marginally faster than your optimized Python code for the full calculation, and much slower than the precalculated one. It's a bit strange considering the mypyc version of your code is 10x faster. Maybe it's the way it is imported, but still...

precalculated: 1.0768595999688841
full:          6.895056699984707
zlib:          5.742461500049103
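
(The import, roughly, per that Stack Overflow answer; this is a sketch, and the z_off_t argument width depends on how the local zlib was built:)

import ctypes
import ctypes.util

# load the system zlib and bind its crc32_combine(crc1, crc2, len2)
_zlib = ctypes.CDLL(ctypes.util.find_library("z"))
_zlib.crc32_combine.argtypes = [ctypes.c_ulong, ctypes.c_ulong, ctypes.c_long]
_zlib.crc32_combine.restype = ctypes.c_ulong

def zlib_crc32_combine(crc1, crc2, len2):
	return _zlib.crc32_combine(crc1, crc2, len2)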

@mnightingale
Contributor

@puzzledsab if you're curious, try https://github.com/mnightingale/sabyenc/tree/feature/crc32_combine; it's barely any effort to add it to sabyenc. I even used the faster method definition to have as little overhead as possible.

https://gist.github.com/mnightingale/9f03901063dcf66510a20f0acb8e8d6c

500K rounds:

python:  17.913410699999076
crcutil:  0.7693139000002702

@puzzledsab
Contributor

@Safihre : We should probably use that. The same test as before took 0.0684 seconds.

@Safihre
Member

Safihre commented Jan 16, 2023

Nice. Indeed, let's add it there, but with a tiny bit of input verification, just to prevent a segfault if, due to some bug, one of the 2 values is None or something like that.
Just out of curiosity, I would be very interested to see how much slower the ParseTuple version would be 😊

@animetosho
Author

It was only marginally faster than your optimized Python code for the full calculation, and much slower than the precalculated one

That's surprising. zlib, crcutil and my original Python version look to be largely the same algorithm. Maybe there's a fair bit of FFI overhead with the zlib test.
I haven't seen anyone implement the '4 bits at a time' strategy; probably no-one really bothered to optimise it that much.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 19, 2023