fix off-by-one bug in crc32_acle.c #1274
Conversation
I'm not sure the fix is necessary... The first part of the code just tries to align the pointer if the buffer is large enough to need or benefit from aligning... The last part of the code handles small buffers.
if you think about it though, the last part operates in reverse, so in the case of a small buffer misaligned by 2 bytes, the 4 byte read will be unaligned
decided to simplify it so the compiler understands it better. if you really want to you can just cherry-pick the 1st commit
There wouldn't be a 4 byte read when the remaining buffer length is only 2 (16 bits).
alright, make it consistent in the opposite way then, replace
Codecov Report

```diff
@@             Coverage Diff             @@
##           develop    #1274      +/-   ##
===========================================
- Coverage    87.17%   86.49%   -0.69%
===========================================
  Files          117      125       +8
  Lines        10249    10664     +415
  Branches      2587     2630      +43
===========================================
+ Hits          8935     9224     +289
- Misses         977     1084     +107
- Partials       337      356      +19
```
I'd argue if you're going to modify this function, you may as well remove the "#ifdef UNROLL_MORE" section, as this really didn't have any hope of being faster. The only thing it avoided was potentially checking the loop condition more often, but there's no opportunity for superscalar parallelism here.
That is supposed to test that... We can have a lot of cases where we first need to crc one byte, then crc more bytes, and then 1 byte again... It makes sense to make a 1-byte crc as fast as possible without slowing down the other cases with extra checks. This is all about optimizing which branch is more likely to be taken. With a one-byte crc, it is equally likely that the last bit of the pointer is 0 or 1. But in this case, testing for a last bit of 1 first benefits the other lengths more.
This PR contains two very different changes, one is the described fix from the PR title, and another is a bigger rewrite. Please split these into two different PRs so they can be discussed, tested and reviewed separately. Regarding the fix, I am still unclear as to whether this is a bug or not, and the lack of any comment stating the original intentions does not help ascertain what is in fact correct. Since you did a rewrite commit, please include some comments that help establish the intentions of the code.
pushed it back |
Looks ok to me.
I still say this is not a bug... There can't be unaligned access as I explained above.
The diff that's there now (>= len(type)) is fixing a bug (albeit a minor one). It allows the CRC instructions to operate on the largest size possible for the first handful of bytes needed to reach 64 bit alignment, rather than terminating early and moving down to the next size before it has to. In some cases I think this could lead to unaligned 64 bit strides, as you could remove the modulo 1 byte but not the modulo 2 bytes or 4 bytes that were needed prior to reach alignment.
@KungFuJesus It will never reach larger sizes as the len is too small by then... There is minimal speed gain possible as we don't test if
https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/crc32_acle.c#L26 I don't think so, this is at the beginning, when subtracting off the modulus so that things can be aligned to the largest possible crc32 operand. So, say that you have an address that happens to be a multiple of 3 bytes (and also happens to be that length). This:

```c
if (len && ((ptrdiff_t)buf & 1)) {
    c = __crc32b(c, *buf++);
    len--;
}
```

Subtracts off the first byte. Then this:

```c
if ((len > sizeof(uint16_t)) && ((ptrdiff_t)buf & sizeof(uint16_t))) {
    buf2 = (const uint16_t *) buf;
    c = __crc32h(c, *buf2++);
    len -= sizeof(uint16_t);
    buf4 = (const uint32_t *) buf2;
} else {
    buf4 = (const uint32_t *) buf;
}
```

Ends up not running even though it could, and it could do optimally at 2 byte alignment and be finished. Instead, it falls through all the way to the bottom and does the last 2 bytes.

Edit: eh well, it ends up being done in the loop body, instead. But still, the intention of the code seems to read that it was supposed to be done earlier. Granted, this is a small corner case if len is just barely the size of what's being peeled in the unaligned access, and unlikely to be a huge issue. I don't think >= vs > is a meaningful difference (there's a scalar instruction for each and I think they complete in the same number of cycles). https://www.cs.princeton.edu/courses/archive/spr19/cos217/reading/ArmInstructionSetOverview.pdf (page 67)
@KungFuJesus It doesn't need to do further aligning if
It still works, yes, but it does mean it has to be handled later in the loop that assumes access is aligned. Granted, you weren't going to get aligned access with a 3 byte aligned 3 byte width buffer, anyway, but the intention of the beginning of the code reads like it's trying to rip off the unaligned access early and terminate there if possible. Whether it's handled at the top or later at the bottom probably doesn't matter much other than the fact that control flow has to jump further than possibly necessary.
@KungFuJesus That
Sure it will, by the very fact that you subtracted from len. It means that by the time you get to the loop portions of the code, len is already zero and it will fall through to the bottom. You could also explicitly return with an extra branch (I'm guessing that's what you meant), I suppose, preventing further evaluations of len at each loop sequence. Whether or not that makes an observable difference is up in the air (likely not, unless you feed a lot of unaligned small buffers into the CRC function). The fact that this works regardless of the changes looks incidental rather than intentional, but you did write it so you'd probably know (assuming you remember, anyway).
LGTM
I don't think there is anything else to do other than to reword the commit message of the first commit and possibly squash the commits... This is more a logic change than a bug fix, by definition... It should slightly improve speed for short lengths, but the gain might be marginal... However, it does improve readability of the code.
One would hope the compiler is making the early returns a single label and having them jump to a common termination, but one could accomplish such a thing with gotos. Perhaps for the complement + return, though, that's a bit overkill.
@landfillbaby Would you be willing to reword the commit message for the first commit? |
how's that as a commit message |
Changes since 2.0.6:
- Fix CVE-2022-37434 #1328
- Fix chunkmemset #1196
- Fix deflateBound too small #1236
- Fix Z_SOLO #1263
- Fix ACLE variant of crc32 #1274
- Fix inflateBack #1311
- Fix deflate_quick windowsize #1431
- Fix DFLTCC bugs related to adler32 #1349 and #1390
- Fix warnings #1194 #1312 #1362
- MacOS build fix #1198
- Add invalid windowBits handling #1293
- Support for Force TZCNT #1186
- Support for aligned_alloc() #1360
- Minideflate improvements #1175 #1238
- Dont use unaligned access for memcpy #1309
- Build system #1209 #1233 #1267 #1273 #1278 #1292 #1316 #1318 #1365
- Test improvements #1208 #1227 #1241 #1353
- Cleanup #1266
- Documentation #1205 #1359
- Misc improvements #1294 #1297 #1306 #1344 #1348
- Backported zlib fixes
- Backported CI workflows from Develop branch
would be doing unaligned memory reads on very small buffers