
fix off-by-one bug in crc32_acle.c #1274

Merged
merged 3 commits into zlib-ng:develop from landfillbaby:patch-1 on Jun 30, 2022

Conversation

landfillbaby
Contributor

@landfillbaby landfillbaby commented May 13, 2022

The code would be doing unaligned memory reads on very small buffers.

@mtl1979
Collaborator

mtl1979 commented May 13, 2022

I'm not sure the fix is necessary... The first part of the code just tries to align the pointer if the buffer is large enough to need or benefit from aligning... The last part of the code handles small buffers.

@landfillbaby
Contributor Author

If you think about it, though, the last part operates in reverse, so in the case of a small buffer misaligned by 2 bytes, the 4-byte read will be unaligned.

@landfillbaby
Contributor Author

Decided to simplify it so the compiler understands it better. If you really want to, you can just cherry-pick the 1st commit.

@mtl1979
Collaborator

mtl1979 commented May 13, 2022

> if you think about it though, the last part operates in reverse, so in the case of a small buffer misaligned by 2 bytes, the 4 byte read will be unaligned

There wouldn't be a 4-byte read when the remaining buffer length is only 2 bytes (16 bits).
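A minimal sketch of that gating, assuming the tail tests length bits in descending width order (the function name and structure here are illustrative stand-ins, not the actual zlib-ng code; the real CRC math is omitted and only the selected read widths are recorded):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the descending-width tail: instead of issuing
   __crc32w/__crc32h/__crc32b reads, record the width of each read
   that the bits of the remaining length would select. */
static int tail_read_widths(size_t len, int widths[3]) {
    int n = 0;
    if (len & sizeof(uint32_t))  /* bit 2 set: one 4-byte read */
        widths[n++] = 4;
    if (len & sizeof(uint16_t))  /* bit 1 set: one 2-byte read */
        widths[n++] = 2;
    if (len & 1)                 /* bit 0 set: one 1-byte read */
        widths[n++] = 1;
    return n;
}
```

Since the tail only ever sees a small remainder, these bits fully describe it: with len == 2 only a single 2-byte read is selected, and a 4-byte read can only happen when bit 2 of the remaining length is set.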

@landfillbaby
Contributor Author

landfillbaby commented May 13, 2022

Alright, make it consistent in the opposite way then: replace if (len && ((ptrdiff_t)buf & 1)) with if ((len > 1) && ((ptrdiff_t)buf & 1))

@codecov

codecov bot commented May 13, 2022

Codecov Report

Merging #1274 (8ae4740) into develop (41d6739) will decrease coverage by 0.68%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #1274      +/-   ##
===========================================
- Coverage    87.17%   86.49%   -0.69%     
===========================================
  Files          117      125       +8     
  Lines        10249    10664     +415     
  Branches      2587     2630      +43     
===========================================
+ Hits          8935     9224     +289     
- Misses         977     1084     +107     
- Partials       337      356      +19     
Flag Coverage Δ
macos_clang 32.60% <ø> (-0.73%) ⬇️
macos_gcc 73.91% <ø> (-1.04%) ⬇️
ubuntu_clang 85.20% <ø> (-2.22%) ⬇️
ubuntu_clang_debug 84.86% <ø> (-2.10%) ⬇️
ubuntu_clang_inflate_allow_invalid_dist 84.79% <ø> (-2.34%) ⬇️
ubuntu_clang_inflate_strict 85.16% <ø> (-2.22%) ⬇️
ubuntu_clang_mmap 85.21% <ø> (-2.35%) ⬇️
ubuntu_clang_pigz 40.10% <ø> (-0.86%) ⬇️
ubuntu_clang_pigz_no_optim 41.52% <ø> (+0.37%) ⬆️
ubuntu_clang_pigz_no_threads 39.69% <ø> (-0.84%) ⬇️
ubuntu_clang_reduced_mem 85.59% <ø> (-1.38%) ⬇️
ubuntu_gcc 75.74% <ø> (-1.27%) ⬇️
ubuntu_gcc_aarch64 77.46% <100.00%> (+0.15%) ⬆️
ubuntu_gcc_aarch64_compat_no_opt 75.78% <ø> (+0.42%) ⬆️
ubuntu_gcc_aarch64_no_acle 76.31% <ø> (+0.10%) ⬆️
ubuntu_gcc_aarch64_no_neon 76.35% <100.00%> (+0.26%) ⬆️
ubuntu_gcc_armhf 77.43% <100.00%> (+0.10%) ⬆️
ubuntu_gcc_armhf_compat_no_opt 75.67% <ø> (+0.39%) ⬆️
ubuntu_gcc_armhf_no_acle 77.42% <ø> (+0.08%) ⬆️
ubuntu_gcc_armhf_no_neon 77.32% <100.00%> (+0.22%) ⬆️
ubuntu_gcc_armsf 77.18% <100.00%> (-0.05%) ⬇️
ubuntu_gcc_armsf_compat_no_opt 75.33% <ø> (+0.38%) ⬆️
ubuntu_gcc_benchmark 74.16% <ø> (-0.84%) ⬇️
ubuntu_gcc_compat_no_opt 76.99% <ø> (+0.49%) ⬆️
ubuntu_gcc_compat_sprefix 74.23% <ø> (-0.87%) ⬇️
ubuntu_gcc_m32 73.61% <ø> (-1.24%) ⬇️
ubuntu_gcc_mingw_i686 0.00% <ø> (ø)
ubuntu_gcc_mingw_x86_64 0.00% <ø> (ø)
ubuntu_gcc_no_avx2 74.91% <ø> (-1.75%) ⬇️
ubuntu_gcc_no_ctz 74.99% <ø> (+0.22%) ⬆️
ubuntu_gcc_no_ctzll 75.02% <ø> (+0.22%) ⬆️
ubuntu_gcc_no_pclmulqdq 74.19% <ø> (-0.47%) ⬇️
ubuntu_gcc_no_sse2 75.11% <ø> (-0.85%) ⬇️
ubuntu_gcc_no_sse4 74.81% <ø> (-1.45%) ⬇️
ubuntu_gcc_o1 74.44% <ø> (-0.77%) ⬇️
ubuntu_gcc_osb ∅ <ø> (∅)
ubuntu_gcc_pigz 38.08% <ø> (-0.82%) ⬇️
ubuntu_gcc_pigz_aarch64 39.20% <60.00%> (-0.25%) ⬇️
ubuntu_gcc_ppc 73.69% <ø> (+0.11%) ⬆️
ubuntu_gcc_ppc64 74.56% <ø> (+0.22%) ⬆️
ubuntu_gcc_ppc64le 74.53% <ø> (+0.22%) ⬆️
ubuntu_gcc_ppc_no_power8 74.62% <ø> (+0.10%) ⬆️
ubuntu_gcc_s390x 74.95% <ø> (+0.21%) ⬆️
ubuntu_gcc_s390x_dfltcc 72.18% <ø> (+0.08%) ⬆️
ubuntu_gcc_s390x_dfltcc_compat 73.74% <ø> (+0.10%) ⬆️
ubuntu_gcc_s390x_no_crc32 74.73% <ø> (+0.21%) ⬆️
ubuntu_gcc_sparc64 74.84% <ø> (+0.21%) ⬆️
ubuntu_gcc_sprefix 73.75% <ø> (-1.14%) ⬇️
win64_gcc ?
win64_gcc_compat_no_opt ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
arch/arm/crc32_acle.c 100.00% <100.00%> (ø)
arch/x86/chunkset_sse2.c 28.57% <0.00%> (-47.62%) ⬇️
crc32_fold.c 70.00% <0.00%> (-30.00%) ⬇️
zutil_p.h 77.77% <0.00%> (-22.23%) ⬇️
fallback_builtins.h 92.59% <0.00%> (-7.41%) ⬇️
test/test_compare256.cc 55.55% <0.00%> (-3.71%) ⬇️
inflate.c 93.58% <0.00%> (-0.58%) ⬇️
test/infcover.c 94.07% <0.00%> (-0.44%) ⬇️
deflate.c 83.20% <0.00%> (-0.42%) ⬇️
gzlib.c 70.77% <0.00%> (-0.33%) ⬇️
... and 34 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 41d6739...8ae4740.

@KungFuJesus
Contributor

I'd argue that if you're going to modify this function, you may as well remove the "#ifdef UNROLL_MORE" section, as this really didn't have any hope of being faster. The only thing it avoided was potentially some loop-condition checking overhead; there's no opportunity for superscalar parallelism here.

@mtl1979
Collaborator

mtl1979 commented May 13, 2022

> alright, make it consistent in the opposite way then, replace if (len && ((ptrdiff_t)buf & 1)) with if ((len > 1) && ((ptrdiff_t)buf & 1))

That is supposed to test that len is non-zero, not that it's more than 1. It's valid to call crc functions with zero length.

There are a lot of cases where we first need to CRC one byte, then CRC more bytes, then one byte again... It makes sense to make the 1-byte CRC as fast as possible without slowing down the other cases with extra checks.

This is all about optimizing which branch is more likely to be taken. With a one-byte CRC, it is equally likely that the last bit of the pointer is 0 or 1, but testing the last bit first benefits the other lengths more.

@Dead2
Member

Dead2 commented Jun 1, 2022

This PR contains two very different changes, one is the described fix from the PR title, and another is a bigger rewrite. Please split these into two different PRs so they can be discussed, tested and reviewed separately.

Regarding the fix, I am still unclear as to whether this is a bug or not, and the lack of any comment stating the original intentions does not help ascertain what is in fact correct. Since you did a rewrite commit, please include some comments that help establish the intentions of the code.

@landfillbaby
Contributor Author

pushed it back

Member

@nmoinvaz nmoinvaz left a comment


Looks ok to me.

@mtl1979
Collaborator

mtl1979 commented Jun 2, 2022

I still say this is not a bug... There can't be unaligned access as I explained above.

@KungFuJesus
Contributor

The diff that's there now (testing >= sizeof(type)) is fixing a bug (albeit a minor one). It allows the CRC instructions to operate on the largest size possible for the first handful of bytes needed to reach 64-bit alignment, rather than terminating early and moving down to the next size before it has to. In some cases I think this could lead to unaligned 64-bit strides, as you could remove the 1-byte remainder but not the 2- or 4-byte remainders that were needed to reach alignment.
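The boundary case can be modeled without the intrinsics. The sketch below tracks only the pointer/length bookkeeping of the alignment prologue (the function name and the `ge` switch are illustrative, not zlib-ng code), comparing the original `>` length tests against the PR's `>=`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the alignment prologue: peel a 1-, 2-, then 4-byte chunk
   while the pointer is misaligned.  `ge` = 1 uses the PR's `>=`
   length tests, `ge` = 0 the original `>`.  Returns how many bytes
   are left over for the main loop and the tail. */
static size_t prologue_leftover(uintptr_t addr, size_t len, int ge) {
    if (len && (addr & 1)) {                       /* __crc32b */
        addr += 1; len -= 1;
    }
    if ((ge ? len >= sizeof(uint16_t) : len > sizeof(uint16_t))
            && (addr & sizeof(uint16_t))) {        /* __crc32h */
        addr += 2; len -= 2;
    }
    if ((ge ? len >= sizeof(uint32_t) : len > sizeof(uint32_t))
            && (addr & sizeof(uint32_t))) {        /* __crc32w */
        addr += 4; len -= 4;
    }
    return len;
}
```

With a buffer misaligned by 2 and exactly 2 bytes long, the original test leaves both bytes for the later code, while >= consumes them in the prologue; the two variants differ exactly when the remaining length equals the chunk size being tested.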

@mtl1979
Collaborator

mtl1979 commented Jun 7, 2022

@KungFuJesus It will never reach the larger sizes, as len is too small by then... There is minimal speed gain possible, as we don't test if len == 0 after each step...

@KungFuJesus
Contributor

KungFuJesus commented Jun 7, 2022

> @KungFuJesus It will never reach larger sizes as the len is too small by then... There is minimal speed gain possible as we don't test if len == 0 after each step...

https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/crc32_acle.c#L26

I don't think so; this is at the beginning, where the remainder is subtracted off so that things can be aligned to the largest possible crc32 operand. So, say you have an address that is misaligned by 3 bytes (and a buffer that also happens to be 3 bytes long):

This:

    if (len && ((ptrdiff_t)buf & 1)) {
        c = __crc32b(c, *buf++);
        len--;
    }

Subtracts off the first byte. Then this:

    if ((len > sizeof(uint16_t)) && ((ptrdiff_t)buf & sizeof(uint16_t))) {
        buf2 = (const uint16_t *) buf;
        c = __crc32h(c, *buf2++);
        len -= sizeof(uint16_t);
        buf4 = (const uint32_t *) buf2;
    } else {
        buf4 = (const uint32_t *) buf;
    }

Ends up not running even though it could, finishing optimally at 2-byte alignment. Instead, it falls through all the way to the bottom and does the last 2 bytes there.

Edit: eh well, it ends up being done in the loop body instead. But still, the intention of the code reads as though it was supposed to be done earlier. Granted, this is a small corner case, only hit when len is just barely the size of what's being peeled in the unaligned access, and unlikely to be a huge issue.

I don't think >= vs > is a meaningful difference (there's a scalar instruction for each and I think they complete in the same number of cycles).

https://www.cs.princeton.edu/courses/archive/spr19/cos217/reading/ArmInstructionSetOverview.pdf (page 67)

@mtl1979
Collaborator

mtl1979 commented Jun 7, 2022

@KungFuJesus It doesn't need to do further aligning if len == 3 at the top of the function... If it is anything larger, then the test already returns true and everything still works...

@KungFuJesus
Contributor

> @KungFuJesus It doesn't need to do further aligning if len == 3 at the top of function... If it is anything larger, then the test already returns true and everything still works...

It still works, yes, but it does mean it has to be handled later, in the loop that assumes access is aligned. Granted, you weren't going to get aligned access with a 3-byte-misaligned, 3-byte buffer anyway, but the beginning of the code reads like it's trying to peel off the unaligned accesses early and terminate there if possible. Whether it's handled at the top or later at the bottom probably doesn't matter much, other than the fact that control flow has to jump further than possibly necessary.

@mtl1979
Collaborator

mtl1979 commented Jun 7, 2022

@KungFuJesus That terminating part isn't really implemented and I'm not sure if I will implement it... There will always be unnecessary branches, but that's what by definition makes unaligned versions slower than aligned versions...

@KungFuJesus
Contributor

> @KungFuJesus That terminating part isn't really implemented and I'm not sure if I will implement it... There will always be unnecessary branches, but that's what by definition makes unaligned versions slower than aligned versions...

Sure it will, by the very fact that you subtracted from len. It means that by the time you get to the loop portions of the code, len is already zero and it will fall through to the bottom. You could also explicitly return with an extra branch (I'm guessing that's what you meant), preventing further evaluations of len at each loop sequence. Whether or not that makes an observable difference is up in the air (likely not, unless you feed a lot of unaligned small buffers into the CRC function).

The fact that this works regardless of the changes looks incidental rather than intentional, but you did write it so you'd probably know (assuming you remember, anyway).

Two review comments on arch/arm/crc32_acle.c (outdated, resolved)
Member

@Dead2 Dead2 left a comment


LGTM

@mtl1979
Collaborator

mtl1979 commented Jun 21, 2022

I don't think there is anything else to do other than reword the commit message of the first commit and possibly squash the commits... This is more a logic change than a bug fix, by definition... It should slightly improve speed for short lengths, but the gain might be marginal... However, it does improve readability of the code.

@KungFuJesus
Contributor

One would hope the compiler is turning the early returns into a single label and having them jump to a common termination, but one could accomplish such a thing with gotos. Perhaps that's a bit overkill just for the complement + return, though.
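The goto shape being described might look roughly like this (a hypothetical sketch, not the actual zlib-ng code; the "CRC" here is a trivial byte fold just to make the control flow concrete):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a shared exit label: every early-out path
   jumps to one `done:` site, so the final complement + return is
   emitted only once rather than duplicated per early return. */
static uint32_t crc_sketch(uint32_t crc, const unsigned char *buf, size_t len) {
    uint32_t c = ~crc;
    if (len == 0)
        goto done;          /* early out shares the common exit */
    while (len--)
        c ^= *buf++;        /* placeholder fold, NOT real CRC math */
done:
    return ~c;              /* single complement + return site */
}
```

Whether this beats several plain early returns is a codegen question; a compiler is often able to merge identical return sequences on its own.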

@Dead2
Member

Dead2 commented Jun 24, 2022

> I don't think there is anything else to do than to reword the commit message of the first commit and possibly squash the commits... This is more a logic change than a bug fix by definition... It should slightly improve speed for short lengths, but the gain might be marginal... However it does improve readability of the code.

@landfillbaby Would you be willing to reword the commit message for the first commit?

@landfillbaby
Contributor Author

landfillbaby commented Jun 24, 2022

How's that as a commit message?

@Dead2 Dead2 merged commit 9c17bd0 into zlib-ng:develop Jun 30, 2022
@landfillbaby landfillbaby deleted the patch-1 branch June 30, 2022 13:22
@Dead2 Dead2 mentioned this pull request Dec 27, 2022
Dead2 added a commit that referenced this pull request Mar 7, 2023
Changes since 2.0.6:
- Fix CVE-2022-37434 #1328
- Fix chunkmemset #1196
- Fix deflateBound too small #1236
- Fix Z_SOLO #1263
- Fix ACLE variant of crc32 #1274
- Fix inflateBack #1311
- Fix deflate_quick windowsize #1431
- Fix DFLTCC bugs related to adler32 #1349 and #1390
- Fix warnings #1194 #1312 #1362
- MacOS build fix #1198
- Add invalid windowBits handling #1293
- Support for Force TZCNT #1186
- Support for aligned_alloc() #1360
- Minideflate improvements #1175 #1238
- Dont use unaligned access for memcpy #1309
- Build system #1209 #1233 #1267 #1273 #1278 #1292 #1316 #1318 #1365
- Test improvements #1208 #1227 #1241 #1353
- Cleanup #1266
- Documentation #1205 #1359
- Misc improvements #1294 #1297 #1306 #1344 #1348
- Backported zlib fixes
- Backported CI workflows from Develop branch
Dead2 added a commit that referenced this pull request Mar 17, 2023
(Same changelog as above.)