Speed up CRC32 calculation on ARM64 #77
Conversation
Thanks! A few quick early comments (I or Jia will comment more later): ARM64 processors tend to support unaligned memory access, but would it still be worth it (better speed) to align the input buffer? I'm not sure.
Are ARM64 processors without CRC32 common enough that runtime detection is worth it? Even if they are, the crc_edits branch is still under consideration, so that may change where the code will go, possibly making things simpler.
I will update this part in the next push.
I don't have a big-endian test environment, so I cannot predict the behavior or run the unit tests on big-endian.
I will try to fix this in the next push.
"crc" is a part of ARMv8.1 feature. To make all armv8 processors happy, it need to detect the processor feature on runtime.
Hello! Thanks for the PR. The point about not needing runtime detection likely needs more research. From a quick search, my understanding is that the CRC32 instruction is optional in ARMv8.0 and required in ARMv8.1. So if all processors in practice actually supported CRC32 in ARMv8.0, that would really simplify this feature, since the runtime detection adds significant complexity. The runtime detection likely requires avoiding ifunc and having different versions based on the platform. A note on compile-time detection: __ARM_FEATURE_CRC32 isn't supported by MSVC, so we will need another way to detect CRC32 instruction support there.
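To illustrate the compile-time side, here is a rough sketch only: the CRC32_ARM64_AVAILABLE macro name is made up, and treating MSVC's _M_ARM64 as "intrinsics are usable" is an assumption; on Windows a runtime check (IsProcessorFeaturePresent with PF_ARM_V8_CRC32_INSTRUCTIONS_AVAILABLE, if I remember the constant correctly) would presumably still be wanted before executing the instructions.

// Sketch only: compile-time detection of the ARM64 CRC32 intrinsics.
// GCC and Clang define __ARM_FEATURE_CRC32 when building with +crc.
// MSVC never defines it, so there we can only know that the target is
// ARM64 (_M_ARM64) and must pair this with a runtime check.
#if defined(__ARM_FEATURE_CRC32) \
		|| (defined(_MSC_VER) && defined(_M_ARM64))
#	define CRC32_ARM64_AVAILABLE 1
#endif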
Unfortunately, I have an arm64 machine without crc32, but it is an odd one (X-Gene Mudan).
Over the holidays I got an ODROID which has the crc instruction and pmull, so I have been working on an arm64 clmul implementation for crc32 and crc64. I have compared this to my clmul implementation and found that the performance is similar for smaller inputs, but it becomes faster the larger the input gets. From what I have found, the arm64 crc instruction also seems to be more widely supported than the pmull instruction that the clmul implementation depends on. I will look into how commonly supported the crc instruction and pmull are, and whether runtime checks would be necessary. I have already run into cases where pmull has not been supported, so it is likely we would need a runtime check if we wanted to include crc64 clmul. The speed increases from crc clmul and the crc instruction were both very significant for larger inputs; the crc64 clmul reached up to 3 times as fast for these inputs. I still need to clean up my code and remove the crc32 clmul so I can make a PR for it after this is resolved.
Would you please share your implementation here? I would like to double-check the CRC32 speed on my side.
@hansjans162 Thanks for continuing your work on ARM64 CLMUL! Your benchmarks so far sound promising and will likely be of great value to liblzma. We look forward to seeing the code and further benchmarks when they are ready :) It sounds like runtime checks will be needed, but please let us know if anyone finds reasons to contradict this. We did a refactor of the existing CRC related files for code organization and small optimization reasons. This change means that architecture specific CRC optimizations should go into header files that are included from the corresponding crc32/64_fast.c file. The reasons for this are best explained in the commit message here. Based on the refactor, @parheliamm, please put your changes into such a header included from crc32_fast.c. Similarly, Hans, please refactor what you have into a corresponding header for the CLMUL code. We apologize for the refactor after this was already submitted, but this PR helped inform that refactor (and gave us the needed motivation to close out the branch). Thank you everyone for your patience and your contributions so far!
I would expect aligned access to still be faster. In general, when unaligned access is fast on some hardware, it's fast relative to other methods of handling unaligned data; that is, it's much faster than doing byte-by-byte access. When the access crosses a cache line or page boundary, it may have penalties that don't occur with aligned access. You could test with something like the sketch below.
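To give a concrete idea of the kind of test meant here, a minimal sketch follows. The buffer size, iteration count, and the crc32_words helper name are my own choices, not from the original comment; it assumes GCC or Clang with -march=armv8-a+crc so that <arm_acle.h> provides __crc32b/__crc32d.

// Sketch: time CRC32 over a 1 MiB buffer starting at an aligned
// address (offset 0) and at a misaligned one (offset 1).
// Build: gcc -O2 -march=armv8-a+crc bench.c -o bench
#include <arm_acle.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static uint32_t
crc32_words(const uint8_t *buf, size_t size, uint32_t crc)
{
	crc = ~crc;
	// memcpy keeps the 64-bit loads legal even when buf is misaligned;
	// on ARM64 the compiler turns it into a plain load.
	while (size >= 8) {
		uint64_t word;
		memcpy(&word, buf, sizeof(word));
		crc = __crc32d(crc, word);
		buf += 8;
		size -= 8;
	}
	while (size-- > 0)
		crc = __crc32b(crc, *buf++);
	return ~crc;
}

int
main(void)
{
	const size_t size = 1U << 20;
	uint8_t *base = malloc(size + 1);
	if (base == NULL)
		return 1;
	for (size_t i = 0; i < size + 1; ++i)
		base[i] = (uint8_t)i;

	for (int offset = 0; offset <= 1; ++offset) {
		struct timespec t0, t1;
		uint32_t crc = 0;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int i = 0; i < 1000; ++i)
			crc = crc32_words(base + offset, size, crc);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		double secs = (double)(t1.tv_sec - t0.tv_sec)
				+ (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("offset %d: %.3f s (crc %08x)\n",
				offset, secs, (unsigned)crc);
	}

	free(base);
	return 0;
}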
Linux has CRC32 assembly for both endiannesses. Maybe it can help in learning what extra steps are needed for big-endian support. Having said that, a little-endian-only version is fine for now, especially since it sounds like no one can test the big-endian code anyway. (Edited to add missing casts.)
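To make the endianness point concrete: the __crc32d intrinsic consumes its 64-bit argument as if the bytes had been loaded little-endian, so a big-endian build would need a byte-swapping load. Below is a minimal sketch of such a helper; the name read64le and the memcpy-based load are mine, and it only illustrates what an aligned_read64le-style helper has to guarantee.

#include <stdint.h>
#include <string.h>

// Sketch: load 8 bytes so that buf[0] ends up in the least significant
// byte of the result, regardless of host byte order.
static inline uint64_t
read64le(const uint8_t *buf)
{
	uint64_t v;
	memcpy(&v, buf, sizeof(v));
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	v = __builtin_bswap64(v);
#endif
	return v;
}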
Based on my local test, unaligned access is 25% slower than aligned access.
I've updated the aarch64 CRC code to use aligned memory access.
@parheliamm Here is the arm64 clmul implementation. Sorry for the delay; I was updating my code to use the new organization. I would also like to note that my code is not finished yet: is_arch_extension_supported currently always returns true, and the platform checks in configure.ac and CMakeLists.txt still need to be added.
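For reference, a minimal sketch of what is_arch_extension_supported could look like on Linux, assuming the getauxval route (HWCAP_CRC32 comes from the kernel headers; other operating systems would need their own checks):

#include <stdbool.h>
#if defined(__linux__)
#	include <sys/auxv.h>
#	include <asm/hwcap.h>
#endif

// Sketch: runtime check for the ARM64 CRC32 extension.
static bool
is_arch_extension_supported(void)
{
#if defined(__linux__) && defined(HWCAP_CRC32)
	return (getauxval(AT_HWCAP) & HWCAP_CRC32) != 0;
#else
	// No detection method for this platform in this sketch.
	return false;
#endif
}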
@hansjans162: The crc32_arch_optimized function seems to take more time in my local test.
The CRC32 instructions in ARM64 can calculate the CRC32 result for 8 bytes in a single operation, making the use of ARM64 instructions much faster compared to the general CRC32 algorithm. Optimized CRC32 will be enabled if ARM64 has CRC extension running on Linux. Signed-off-by: Chenxi Mao <chenxi.mao2013@gmail.com>
@parheliamm I'm glad your test results match up with mine. I still found that the arm64 clmul implementation for crc64 is faster than the generic version and worth keeping. Maybe you can test this to confirm? I will remove crc32 clmul from my code but keep the crc64 clmul.
@hansjans162 By the way, your CRC64 code produces compiler warnings, and the unit tests do not pass.
@parheliamm @hansjans162 Thank you for your patience on this review. We had a few other things we were focusing on (new website, release, etc.). I promise I did not forget about this :) I created a branch with some additions to this PR that will also be helpful for the CRC64 CLMUL. The extent of the edits should be clear from the commit messages and code changes, but let me know if you have questions about the changes. Can you both test this on your hardware to be sure it works correctly (and is still fast)? I was able to cross-compile it with GCC, Clang, and MSVC but I do not have an ARM64 device so I have not run the code. Thanks!
I tested your code and found a problem with the crc32_arch_optimized function. The updated function below should fix it. I made two changes to this function. The first was making align_amount the difference from 8 instead of just the remainder; I also XOR this with 8 so buf_end does not change if the buffer is already properly aligned. The second was changing __crc32w to __crc32d.

crc_attr_target
static uint32_t
crc32_arch_optimized(const uint8_t *buf, size_t size, uint32_t crc)
{
	crc = ~crc;

	// Align the input buffer because this was shown to be
	// significantly faster than unaligned accesses.
	const size_t align_amount = my_min(size, ((8 - ((uintptr_t)(buf) & 7)) ^ 8));
	const uint8_t *buf_end = buf + align_amount;

	for (; buf < buf_end; ++buf)
		crc = __crc32b(crc, *buf);

	for (buf_end = ((size - align_amount) - 8) + buf; buf < buf_end;
			buf += 8)
		crc = __crc32d(crc, aligned_read64le(buf));

	for (buf_end += 8; buf < buf_end; ++buf)
		crc = __crc32b(crc, *buf);

	return ~crc;
}
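(A quick worked check of that align_amount expression, for what it's worth: if buf is already 8-byte aligned, (8 - 0) ^ 8 == 0 and the first loop is skipped; if buf has remainder 3, (8 - 3) ^ 8 == 13 and buf + 13 lands back on an 8-byte boundary, so the __crc32d loop always starts on an aligned address.)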
Thanks for the fixes! I updated the branch. I also slightly optimized the alignment calculation.
I suspect that the buf_end calculation can produce a pointer before the beginning of the buffer, which is technically undefined behavior. Although it should work in practice on ARM64 (unless the buffer is at a weird address where the address would overflow, but that's unlikely), I think it should be possible to avoid the problem without a performance penalty.
Thanks for pointing this out! I pushed a new version that avoids pointer arithmetic beyond the beginning of the buffer. I also added macOS support, but I couldn't test it since I do not have an Apple device. The Apple-specific code is small, so it should work, but I hope someone can test it.
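For anyone who wants to test the Apple-specific path, this is roughly the kind of check involved. A sketch only: the function name is mine, and I am assuming hw.optional.armv8_crc32 is the relevant sysctl key on Apple Silicon.

#include <stdbool.h>
#ifdef __APPLE__
#	include <sys/sysctl.h>
#endif

// Sketch: query the hw.optional.armv8_crc32 sysctl on macOS.
// Returns false if the sysctl is missing or reports 0.
static bool
macos_has_crc32(void)
{
#ifdef __APPLE__
	int has_crc32 = 0;
	size_t size = sizeof(has_crc32);
	if (sysctlbyname("hw.optional.armv8_crc32", &has_crc32, &size,
			NULL, 0) != 0)
		return false;
	return has_crc32 != 0;
#else
	return false;
#endif
}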
The CRC32 instructions in ARM64 can calculate the CRC32 result for 8 bytes in a single operation, making the use of ARM64 instructions much faster compared to the general CRC32 algorithm.
Optimized CRC32 will be enabled if the ARM64 processor has the CRC extension and is running on Linux.
Pull request checklist
Please check if your PR fulfills the following requirements:
Pull request type
Please check the type of change your PR introduces:
What is the current behavior?
Related Issue URL:
What is the new behavior?
Enable the optimized CRC32 algorithm if ARM64 supports the CRC extension.
Does this introduce a breaking change?
Other information
Benchmark data will be updated soon