
fix off-by-one bug in crc32_acle.c #1274

Merged
merged 3 commits into zlib-ng:develop from landfillbaby:patch-1 on Jun 30, 2022

Conversation

landfillbaby
Contributor

@landfillbaby landfillbaby commented May 13, 2022

The code would be doing unaligned memory reads on very small buffers.

@mtl1979
Collaborator

mtl1979 commented May 13, 2022

I'm not sure the fix is necessary... The first part of the code just tries to align the pointer if the buffer is large enough to need or benefit from aligning... The last part of the code handles small buffers.

@landfillbaby
Contributor Author

If you think about it, though, the last part operates in reverse, so in the case of a small buffer misaligned by 2 bytes, the 4-byte read will be unaligned.

@landfillbaby
Contributor Author

Decided to simplify it so the compiler understands it better. If you really want to, you can just cherry-pick the 1st commit.

@mtl1979
Collaborator

mtl1979 commented May 13, 2022

> if you think about it though, the last part operates in reverse, so in the case of a small buffer misaligned by 2 bytes, the 4 byte read will be unaligned

There wouldn't be a 4-byte read when the remaining buffer length is only 2 bytes (16 bits).
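A minimal sketch of that gating, assuming the tail tests length bits in descending width order (the function name and structure here are illustrative stand-ins, not the actual zlib-ng code; the real CRC math is omitted and only the selected read widths are recorded):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the descending-width tail: instead of issuing
   __crc32w/__crc32h/__crc32b reads, record the width of each read
   that the bits of the remaining length would select. */
static int tail_read_widths(size_t len, int widths[3]) {
    int n = 0;
    if (len & sizeof(uint32_t))  /* bit 2 set: one 4-byte read */
        widths[n++] = 4;
    if (len & sizeof(uint16_t))  /* bit 1 set: one 2-byte read */
        widths[n++] = 2;
    if (len & 1)                 /* bit 0 set: one 1-byte read */
        widths[n++] = 1;
    return n;
}
```

Since the tail only ever sees a small remainder, these bits fully describe it: with len == 2 only a single 2-byte read is selected, and a 4-byte read can only happen when bit 2 of the remaining length is set.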

@landfillbaby
Contributor Author

landfillbaby commented May 13, 2022

Alright, make it consistent in the opposite way then: replace if (len && ((ptrdiff_t)buf & 1)) with if ((len > 1) && ((ptrdiff_t)buf & 1))

@codecov

codecov bot commented May 13, 2022

Codecov Report

Merging #1274 (8ae4740) into develop (41d6739) will decrease coverage by 0.68%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #1274      +/-   ##
===========================================
- Coverage    87.17%   86.49%   -0.69%     
===========================================
  Files          117      125       +8     
  Lines        10249    10664     +415     
  Branches      2587     2630      +43     
===========================================
+ Hits          8935     9224     +289     
- Misses         977     1084     +107     
- Partials       337      356      +19     
Flag Coverage Δ
macos_clang 32.60% <ø> (-0.73%) ⬇️
macos_gcc 73.91% <ø> (-1.04%) ⬇️
ubuntu_clang 85.20% <ø> (-2.22%) ⬇️
ubuntu_clang_debug 84.86% <ø> (-2.10%) ⬇️
ubuntu_clang_inflate_allow_invalid_dist 84.79% <ø> (-2.34%) ⬇️
ubuntu_clang_inflate_strict 85.16% <ø> (-2.22%) ⬇️
ubuntu_clang_mmap 85.21% <ø> (-2.35%) ⬇️
ubuntu_clang_pigz 40.10% <ø> (-0.86%) ⬇️
ubuntu_clang_pigz_no_optim 41.52% <ø> (+0.37%) ⬆️
ubuntu_clang_pigz_no_threads 39.69% <ø> (-0.84%) ⬇️
ubuntu_clang_reduced_mem 85.59% <ø> (-1.38%) ⬇️
ubuntu_gcc 75.74% <ø> (-1.27%) ⬇️
ubuntu_gcc_aarch64 77.46% <100.00%> (+0.15%) ⬆️
ubuntu_gcc_aarch64_compat_no_opt 75.78% <ø> (+0.42%) ⬆️
ubuntu_gcc_aarch64_no_acle 76.31% <ø> (+0.10%) ⬆️
ubuntu_gcc_aarch64_no_neon 76.35% <100.00%> (+0.26%) ⬆️
ubuntu_gcc_armhf 77.43% <100.00%> (+0.10%) ⬆️
ubuntu_gcc_armhf_compat_no_opt 75.67% <ø> (+0.39%) ⬆️
ubuntu_gcc_armhf_no_acle 77.42% <ø> (+0.08%) ⬆️
ubuntu_gcc_armhf_no_neon 77.32% <100.00%> (+0.22%) ⬆️
ubuntu_gcc_armsf 77.18% <100.00%> (-0.05%) ⬇️
ubuntu_gcc_armsf_compat_no_opt 75.33% <ø> (+0.38%) ⬆️
ubuntu_gcc_benchmark 74.16% <ø> (-0.84%) ⬇️
ubuntu_gcc_compat_no_opt 76.99% <ø> (+0.49%) ⬆️
ubuntu_gcc_compat_sprefix 74.23% <ø> (-0.87%) ⬇️
ubuntu_gcc_m32 73.61% <ø> (-1.24%) ⬇️
ubuntu_gcc_mingw_i686 0.00% <ø> (ø)
ubuntu_gcc_mingw_x86_64 0.00% <ø> (ø)
ubuntu_gcc_no_avx2 74.91% <ø> (-1.75%) ⬇️
ubuntu_gcc_no_ctz 74.99% <ø> (+0.22%) ⬆️
ubuntu_gcc_no_ctzll 75.02% <ø> (+0.22%) ⬆️
ubuntu_gcc_no_pclmulqdq 74.19% <ø> (-0.47%) ⬇️
ubuntu_gcc_no_sse2 75.11% <ø> (-0.85%) ⬇️
ubuntu_gcc_no_sse4 74.81% <ø> (-1.45%) ⬇️
ubuntu_gcc_o1 74.44% <ø> (-0.77%) ⬇️
ubuntu_gcc_osb ∅ <ø> (∅)
ubuntu_gcc_pigz 38.08% <ø> (-0.82%) ⬇️
ubuntu_gcc_pigz_aarch64 39.20% <60.00%> (-0.25%) ⬇️
ubuntu_gcc_ppc 73.69% <ø> (+0.11%) ⬆️
ubuntu_gcc_ppc64 74.56% <ø> (+0.22%) ⬆️
ubuntu_gcc_ppc64le 74.53% <ø> (+0.22%) ⬆️
ubuntu_gcc_ppc_no_power8 74.62% <ø> (+0.10%) ⬆️
ubuntu_gcc_s390x 74.95% <ø> (+0.21%) ⬆️
ubuntu_gcc_s390x_dfltcc 72.18% <ø> (+0.08%) ⬆️
ubuntu_gcc_s390x_dfltcc_compat 73.74% <ø> (+0.10%) ⬆️
ubuntu_gcc_s390x_no_crc32 74.73% <ø> (+0.21%) ⬆️
ubuntu_gcc_sparc64 74.84% <ø> (+0.21%) ⬆️
ubuntu_gcc_sprefix 73.75% <ø> (-1.14%) ⬇️
win64_gcc ?
win64_gcc_compat_no_opt ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
arch/arm/crc32_acle.c 100.00% <100.00%> (ø)
arch/x86/chunkset_sse2.c 28.57% <0.00%> (-47.62%) ⬇️
crc32_fold.c 70.00% <0.00%> (-30.00%) ⬇️
zutil_p.h 77.77% <0.00%> (-22.23%) ⬇️
fallback_builtins.h 92.59% <0.00%> (-7.41%) ⬇️
test/test_compare256.cc 55.55% <0.00%> (-3.71%) ⬇️
inflate.c 93.58% <0.00%> (-0.58%) ⬇️
test/infcover.c 94.07% <0.00%> (-0.44%) ⬇️
deflate.c 83.20% <0.00%> (-0.42%) ⬇️
gzlib.c 70.77% <0.00%> (-0.33%) ⬇️
... and 34 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 41d6739...8ae4740.

@KungFuJesus
Contributor

I'd argue that if you're going to modify this function, you may as well remove the "#ifdef UNROLL_MORE" section, as this really didn't have any hope of being faster. The only thing it avoided was potentially some loop-condition checking overhead; there's no opportunity for superscalar parallelism here.

@mtl1979
Collaborator

mtl1979 commented May 13, 2022

> alright, make it consistent in the opposite way then, replace if (len && ((ptrdiff_t)buf & 1)) with if ((len > 1) && ((ptrdiff_t)buf & 1))

That is supposed to test that len is non-zero, not that it's more than 1. It's valid to call crc functions with zero length.

There are a lot of cases where we first need to CRC one byte, then CRC more bytes, then one byte again... It makes sense to make the 1-byte CRC as fast as possible without slowing down the other cases with extra checks.

This is all about optimizing which branch is more likely to be taken. With a one-byte CRC, it is equally likely that the last bit of the pointer is 0 or 1, but testing the last bit first benefits the other lengths more.

@Dead2
Member

Dead2 commented Jun 1, 2022

This PR contains two very different changes, one is the described fix from the PR title, and another is a bigger rewrite. Please split these into two different PRs so they can be discussed, tested and reviewed separately.

Regarding the fix, I am still unclear as to whether this is a bug or not, and the lack of any comment stating the original intentions does not help ascertain what is in fact correct. Since you did a rewrite commit, please include some comments that help establish the intentions of the code.

@landfillbaby
Contributor Author

pushed it back

Member

@nmoinvaz nmoinvaz left a comment


Looks ok to me.

@mtl1979
Collaborator

mtl1979 commented Jun 2, 2022

I still say this is not a bug... There can't be unaligned access as I explained above.

@KungFuJesus
Contributor

The diff that's there now (testing >= sizeof(type)) is fixing a bug (albeit a minor one). It allows the CRC instructions to operate on the largest size possible for the first handful of bytes needed to reach 64-bit alignment, rather than terminating early and moving down to the next size before it has to. In some cases I think this could lead to unaligned 64-bit strides, as you could remove the 1-byte remainder but not the 2- or 4-byte remainders that were needed to reach alignment.
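The boundary case can be modeled without the intrinsics. The sketch below tracks only the pointer/length bookkeeping of the alignment prologue (the function name and the `ge` switch are illustrative, not zlib-ng code), comparing the original `>` length tests against the PR's `>=`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the alignment prologue: peel a 1-, 2-, then 4-byte chunk
   while the pointer is misaligned.  `ge` = 1 uses the PR's `>=`
   length tests, `ge` = 0 the original `>`.  Returns how many bytes
   are left over for the main loop and the tail. */
static size_t prologue_leftover(uintptr_t addr, size_t len, int ge) {
    if (len && (addr & 1)) {                       /* __crc32b */
        addr += 1; len -= 1;
    }
    if ((ge ? len >= sizeof(uint16_t) : len > sizeof(uint16_t))
            && (addr & sizeof(uint16_t))) {        /* __crc32h */
        addr += 2; len -= 2;
    }
    if ((ge ? len >= sizeof(uint32_t) : len > sizeof(uint32_t))
            && (addr & sizeof(uint32_t))) {        /* __crc32w */
        addr += 4; len -= 4;
    }
    return len;
}
```

With a buffer misaligned by 2 and exactly 2 bytes long, the original test leaves both bytes for the later code, while >= consumes them in the prologue; the two variants differ exactly when the remaining length equals the chunk size being tested.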

@mtl1979
Collaborator

mtl1979 commented Jun 7, 2022

@KungFuJesus It will never reach the larger sizes, as len is too small by then... There is minimal speed gain possible, as we don't test if len == 0 after each step...

@KungFuJesus
Contributor

KungFuJesus commented Jun 7, 2022

> @KungFuJesus It will never reach larger sizes as the len is too small by then... There is minimal speed gain possible as we don't test if len == 0 after each step...

https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/crc32_acle.c#L26

I don't think so; this is at the beginning, where the remainder is subtracted off so that things can be aligned to the largest possible crc32 operand. So, say you have an address that is misaligned by 3 bytes (and a buffer that also happens to be 3 bytes long):

This:

    if (len && ((ptrdiff_t)buf & 1)) {
        c = __crc32b(c, *buf++);
        len--;
    }

Subtracts off the first byte. Then this:

    if ((len > sizeof(uint16_t)) && ((ptrdiff_t)buf & sizeof(uint16_t))) {
        buf2 = (const uint16_t *) buf;
        c = __crc32h(c, *buf2++);
        len -= sizeof(uint16_t);
        buf4 = (const uint32_t *) buf2;
    } else {
        buf4 = (const uint32_t *) buf;
    }

Ends up not running even though it could, finishing optimally at 2-byte alignment. Instead, it falls through all the way to the bottom and does the last 2 bytes there.

Edit: eh well, it ends up being done in the loop body instead. But still, the intention of the code reads as though it was supposed to be done earlier. Granted, this is a small corner case, only hit when len is just barely the size of what's being peeled in the unaligned access, and unlikely to be a huge issue.

I don't think >= vs > is a meaningful difference (there's a scalar instruction for each and I think they complete in the same number of cycles).

https://www.cs.princeton.edu/courses/archive/spr19/cos217/reading/ArmInstructionSetOverview.pdf (page 67)

@mtl1979
Collaborator

mtl1979 commented Jun 7, 2022

@KungFuJesus It doesn't need to do further aligning if len == 3 at the top of the function... If it is anything larger, then the test already returns true and everything still works...

@KungFuJesus
Contributor

> @KungFuJesus It doesn't need to do further aligning if len == 3 at the top of function... If it is anything larger, then the test already returns true and everything still works...

It still works, yes, but it does mean it has to be handled later, in the loop that assumes access is aligned. Granted, you weren't going to get aligned access with a 3-byte-misaligned, 3-byte buffer anyway, but the beginning of the code reads like it's trying to peel off the unaligned accesses early and terminate there if possible. Whether it's handled at the top or later at the bottom probably doesn't matter much, other than the fact that control flow has to jump further than possibly necessary.

@mtl1979
Collaborator

mtl1979 commented Jun 7, 2022

@KungFuJesus That terminating part isn't really implemented and I'm not sure if I will implement it... There will always be unnecessary branches, but that's what by definition makes unaligned versions slower than aligned versions...

@KungFuJesus
Contributor

> @KungFuJesus That terminating part isn't really implemented and I'm not sure if I will implement it... There will always be unnecessary branches, but that's what by definition makes unaligned versions slower than aligned versions...

Sure it will, by the very fact that you subtracted from len. It means that by the time you get to the loop portions of the code, len is already zero and it will fall through to the bottom. You could also explicitly return with an extra branch (I'm guessing that's what you meant), preventing further evaluations of len at each loop sequence. Whether or not that makes an observable difference is up in the air (likely not, unless you feed a lot of unaligned small buffers into the CRC function).

The fact that this works regardless of the changes looks incidental rather than intentional, but you did write it so you'd probably know (assuming you remember, anyway).

Two review comments on arch/arm/crc32_acle.c (outdated, resolved)
Member

@Dead2 Dead2 left a comment


LGTM

@mtl1979
Collaborator

mtl1979 commented Jun 21, 2022

I don't think there is anything else to do other than reword the commit message of the first commit and possibly squash the commits... This is more a logic change than a bug fix, by definition... It should slightly improve speed for short lengths, but the gain might be marginal... However, it does improve readability of the code.

@KungFuJesus
Contributor

One would hope the compiler is turning the early returns into a single label and having them jump to a common termination, but one could accomplish such a thing with gotos. Perhaps that's a bit overkill just for the complement + return, though.
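The goto shape being described might look roughly like this (a hypothetical sketch, not the actual zlib-ng code; the "CRC" here is a trivial byte fold just to make the control flow concrete):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a shared exit label: every early-out path
   jumps to one `done:` site, so the final complement + return is
   emitted only once rather than duplicated per early return. */
static uint32_t crc_sketch(uint32_t crc, const unsigned char *buf, size_t len) {
    uint32_t c = ~crc;
    if (len == 0)
        goto done;          /* early out shares the common exit */
    while (len--)
        c ^= *buf++;        /* placeholder fold, NOT real CRC math */
done:
    return ~c;              /* single complement + return site */
}
```

Whether this beats several plain early returns is a codegen question; a compiler is often able to merge identical return sequences on its own.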

@Dead2
Member

Dead2 commented Jun 24, 2022

> I don't think there is anything else to do than to reword the commit message of the first commit and possibly squash the commits... This is more a logic change than a bug fix by definition... It should slightly improve speed for short lengths, but the gain might be marginal... However it does improve readability of the code.

@landfillbaby Would you be willing to reword the commit message for the first commit?

@landfillbaby
Contributor Author

landfillbaby commented Jun 24, 2022

How's that as a commit message?

@Dead2 Dead2 merged commit 9c17bd0 into zlib-ng:develop Jun 30, 2022
@landfillbaby landfillbaby deleted the patch-1 branch June 30, 2022 13:22
@Dead2 Dead2 mentioned this pull request Dec 27, 2022
Dead2 added a commit that referenced this pull request Mar 7, 2023
Changes since 2.0.6:
- Fix CVE-2022-37434 #1328
- Fix chunkmemset #1196
- Fix deflateBound too small #1236
- Fix Z_SOLO #1263
- Fix ACLE variant of crc32 #1274
- Fix inflateBack #1311
- Fix deflate_quick windowsize #1431
- Fix DFLTCC bugs related to adler32 #1349 and #1390
- Fix warnings #1194 #1312 #1362
- MacOS build fix #1198
- Add invalid windowBits handling #1293
- Support for Force TZCNT #1186
- Support for aligned_alloc() #1360
- Minideflate improvements #1175 #1238
- Dont use unaligned access for memcpy #1309
- Build system #1209 #1233 #1267 #1273 #1278 #1292 #1316 #1318 #1365
- Test improvements #1208 #1227 #1241 #1353
- Cleanup #1266
- Documentation #1205 #1359
- Misc improvements #1294 #1297 #1306 #1344 #1348
- Backported zlib fixes
- Backported CI workflows from Develop branch
Dead2 added a commit that referenced this pull request Mar 17, 2023
(Same changelog as above.)