Speed up CRC32 calculation on LoongArch #86
d2da853 to a5f69f9
Wait a minute... There are some warnings I'd not spotted.
a5f69f9 to ced63ba
Fixed. And added cmake support.
ced63ba to 7cb19f9
Removed tabs from CMakeLists.txt.
7cb19f9 to 4023849
Hello! Thanks for the PR. Overall it looks like you did a great job with this. Can you provide benchmarks to show the speed increase from this? Specifically, can you show one version with the alignment adjustment in Also, how necessary are the runtime detection checks? Are there LoongArch chips that do not have the CRC32 instruction?
9282d42 to dab983c
I'll do it tomorrow.
The specification says 64-bit LoongArch chips shall implement CRC32 instructions, but 32-bit LoongArch chips may lack them (though no 32-bit LoongArch chips have been launched as of now).
dab983c to f2c2510
Thanks!
Ok that is great to know. I had not found any references to 32-bit LoongArch chips, so that makes sense. Is it likely that 32-bit chips will be made? Otherwise it will simplify things to just design the code for 64-bit LoongArch and not bother with the runtime checks at all. Future 32-bit LoongArch may need extra compiler flags or a function
f2c2510 to e079b80
It's likely to be made, but we are still unsure about some of its details (and whether we need some GCC flags or attributes for it). So I've modified the code to be 64-bit-only and removed runtime detection for now.
10M buffer, repeat 100 times: 0.7116s to 0.1015s
Some low-end 64-bit LoongArch CPUs (2K1000, for example) do not support unaligned access; on these CPUs an unaligned access traps and is emulated by the kernel, which is very slow. So we have to adjust the alignment anyway. I don't have a 2K1000 board for testing, though; on my board (3A6000) the alignment adjustment only produces a ~1% improvement.
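As a rough illustration of the alignment adjustment described above, the sketch below computes how a buffer could be split into an unaligned head (handled with narrower steps), a run of 8-byte-aligned words, and a tail. The names `head_bytes_to_align8`, `crc_split`, and `split_for_aligned_words` are hypothetical and are not taken from the PR; this only shows the arithmetic, not the actual xz implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the PR): number of leading bytes that
 * must be consumed before buf is 8-byte aligned, capped at size. */
static size_t head_bytes_to_align8(const uint8_t *buf, size_t size)
{
    /* (0 - address) mod 8 is the distance to the next 8-byte boundary. */
    const size_t misalign = (size_t)(((uintptr_t)0 - (uintptr_t)buf) & 7);
    return misalign < size ? misalign : size;
}

/* Hypothetical description of how a buffer is divided for processing. */
struct crc_split {
    size_t head;  /* bytes before the first aligned 8-byte word */
    size_t words; /* number of aligned 8-byte words */
    size_t tail;  /* bytes left over after the last full word */
};

static struct crc_split split_for_aligned_words(const uint8_t *buf, size_t size)
{
    struct crc_split s;
    s.head = head_bytes_to_align8(buf, size);
    s.words = (size - s.head) / 8;
    s.tail = (size - s.head) % 8;
    return s;
}
```

With such a split, only the head and tail need the 1/2/4-byte steps, and every 8-byte load in the main loop is aligned, so it should not trap on CPUs without unaligned-access support.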
Thanks for the benchmarking numbers, those easily justify including this feature :)
If there are LoongArch CPUs that do not support unaligned access, that is plenty of reason to have the code to align the buffer. Thanks for the info!
 #if !defined(HAVE_ENCODERS) && (defined(X86_CLMUL_NO_TABLE) \
-		|| defined(ARM64_CRC32_NO_TABLE_))
+		|| defined(ARM64_CRC32_NO_TABLE_) \
+		|| defined(LOONGARCH_CRC32_NO_TABLE))
Thanks for remembering this! I had forgotten this was needed in the first round of review
The crc.w.{b/h/w/d}.w instructions in LoongArch can calculate the CRC32 result for 1/2/4/8 bytes in a single operation, making the use of LoongArch CRC32 instructions much faster compared to the general CRC32 algorithm.
Optimized CRC32 will be enabled unconditionally on 64-bit LoongArch (the LoongArch specification says CRC32 instructions shall be implemented for 64-bit processors). For 32-bit processors optimized CRC32 is not enabled, as few details of these processors are known as of now.
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
e079b80 to bf0934c
I presume "speed up" in actuality means something different here? "Speeding up" remote code execution on "special" xz archives? This project needs to be quarantined, maybe forked. Scrubbed through every piece of code.
I don't introduce such a thing myself. If you don't agree you can hire some security expert to analyze my change, and I can sign a legal file with you saying "if I introduced an RCE I'll pay you ten times the costs for the analysis, otherwise you'll pay me a beer" :).
I don't know about the code not written by me though.
A small digression here: nice organization. Did you attack a person just because of their place of birth?
That seems to be an assumption, or I overlooked something. Besides that, https://www.openwall.com/lists/oss-security/2024/03/29/4 contains some very interesting technical details.
No discrimination here, and that org of mine you're likely referring to is a wink to the 1984-style thought police in China. I'm stating that people packaging xz should be very wary of any code touched or given feedback on by JiaT75 given today's security disclosure; no personal gripes with anyone.
Currently I'm unsure how this PR is connected to the backdoor, if at all. Is this really related or just unrelated? Because in my opinion the architecture is not relevant (since the requirements for the backdoor are different and unrelated to the CPU architecture).
Makes sense. A very interesting case which seems to have been planned over a long timespan. At least that's the assumption from the mentioned commits from https://www.openwall.com/lists/oss-security/2024/03/29/4
Unrelated
GitHub had closed all pull requests and marked them as if I had closed them on 2024-04-05. I didn't close any PRs on that day; GitHub just misattributed it to me. Now that I try to re-open these, GitHub immediately closes them again, once again claiming that I did that.
On IRC it was suggested that this likely happens because the forks associated with the PRs are disabled. So this PR would require that @xry111 re-enables or re-creates the fork.
I'll recreate it after the 5.8.0 release, considering you are focusing on more important issues now.
BTW, strangely my previous fork is now marked as "forked from xxxxxx/xz" (xxxxxx is a different user instead of tukaani-project) and it's still suspended. I'm contacting GH support to see if they can just delete the fork so that I can fork again.
@xry111 It is strange that GitHub chose to suspend your fork, because during the time that GitHub had suspended everything, there were many other xz forks that were not suspended. So, perhaps they thought you were a Jia sockpuppet due to being Chinese.
Not sure that's right. At the very least, the fork count went down to 0, as did the star count on the main repo. Plus, the repo that his fork now shows as its parent is also disabled. We also can't reopen any other PRs, so...
I found https://github.com/ryandesign/xz is also still suspended, and it seems ryandesign (I'd not use a @ so I won't disturb him) worked around it by creating another fork named "xz-1": https://github.com/ryandesign/xz-1 While I can do the same, I don't really like it. Let's wait several days for a response from GH support...
I think what happened then is: the forks got suspended, but there were a few mirrors of the xz repo, which didn't get suspended.
What is the current behavior?
On LoongArch the generic table-based CRC32 implementation is used.
Related Issue URL: None
What is the new behavior?
The crc.w.{b/h/w/d}.w instructions in LoongArch can calculate the CRC32 result for 1/2/4/8 bytes in a single operation, making the use of LoongArch CRC32 instructions much faster compared to the general CRC32 algorithm.
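A minimal sketch of how the 8/4/2/1-byte steps described above might be combined is shown below. It assumes GCC's `<larchintrin.h>` intrinsics (`__crc_w_d_w`, `__crc_w_w_w`, `__crc_w_h_w`, `__crc_w_b_w`) on 64-bit LoongArch, and falls back to a plain bitwise CRC32 elsewhere so the example stays runnable on other machines. The function names `crc32_sketch` and `crc32_soft_byte` and the macro `SKETCH_HAVE_LOONGARCH_CRC32` are hypothetical; this is not the code from the PR, and it omits the alignment adjustment for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__loongarch__) && defined(__loongarch_grlen) \
        && (__loongarch_grlen == 64)
#include <larchintrin.h>
#define SKETCH_HAVE_LOONGARCH_CRC32 1
#else
#define SKETCH_HAVE_LOONGARCH_CRC32 0

/* Portable stand-in: one byte of the reflected IEEE 802.3 CRC32. */
static uint32_t crc32_soft_byte(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}
#endif

/* Hypothetical name; continues a CRC32 over buf[0..size). Pass 0 to start. */
static uint32_t crc32_sketch(uint32_t crc, const uint8_t *buf, size_t size)
{
    crc = ~crc;
#if SKETCH_HAVE_LOONGARCH_CRC32
    int c = (int)crc;
    /* 8 bytes per instruction; memcpy avoids aliasing problems. */
    for (; size >= 8; buf += 8, size -= 8) {
        uint64_t v;
        memcpy(&v, buf, 8);
        c = __crc_w_d_w((long)v, c);
    }
    /* At most 7 bytes remain: use the 4-, 2- and 1-byte forms. */
    if (size & 4) {
        uint32_t v;
        memcpy(&v, buf, 4);
        c = __crc_w_w_w((int)v, c);
        buf += 4;
    }
    if (size & 2) {
        uint16_t v;
        memcpy(&v, buf, 2);
        c = __crc_w_h_w((short)v, c);
        buf += 2;
    }
    if (size & 1)
        c = __crc_w_b_w((char)*buf, c);
    crc = (uint32_t)c;
#else
    for (; size > 0; ++buf, --size)
        crc = crc32_soft_byte(crc, *buf);
#endif
    return ~crc;
}
```

For example, `crc32_sketch(0, (const uint8_t *)"123456789", 9)` should yield the standard CRC32 check value `0xCBF43926` on either path.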
Optimized CRC32 will be enabled if the kernel declares the LoongArch CRC32 instructions supported via AT_HWCAP.
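The AT_HWCAP-based check mentioned above could look roughly like the following sketch. It assumes Linux's `getauxval()` from `<sys/auxv.h>` and the CRC32 capability bit `1 << 6`, which mirrors `HWCAP_LOONGARCH_CRC32` in the kernel's LoongArch `asm/hwcap.h`; the macro is redefined here under a hypothetical name so the example compiles on other platforms. The function names are hypothetical as well, and this is not the detection code from the PR.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__loongarch__)
#include <sys/auxv.h> /* getauxval(), AT_HWCAP */
#endif

/* Assumed to match HWCAP_LOONGARCH_CRC32 from the kernel's asm/hwcap.h. */
#define SKETCH_HWCAP_LOONGARCH_CRC32 (1UL << 6)

/* Hypothetical helper: test the CRC32 bit in an AT_HWCAP value. */
static bool hwcap_has_crc32(unsigned long hwcap)
{
    return (hwcap & SKETCH_HWCAP_LOONGARCH_CRC32) != 0;
}

/* Hypothetical helper: ask the kernel whether CRC32 instructions exist. */
static bool loongarch_crc32_supported(void)
{
#if defined(__linux__) && defined(__loongarch__)
    return hwcap_has_crc32(getauxval(AT_HWCAP));
#else
    return false; /* Not LoongArch Linux: use the generic implementation. */
#endif
}
```

A caller would typically evaluate `loongarch_crc32_supported()` once and then select either the instruction-based or the table-based routine.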