refactor: optimize `number/uint32/base/muldw` implementation for better performance #11702
impawstarlight wants to merge 4 commits into stdlib-js:develop
Conversation
…`umul`
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: passed
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: passed
- task: lint_javascript_benchmarks
status: na
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
Coverage Report
The above coverage report was generated for the changes in this PR.
```javascript
* // returns [ 954437176, 1908874354 ]
*/
function umuldw( a, b ) {
	if ( isnan( a ) || isnan( b ) ) {
```
@impawstarlight What is the rationale for keeping these checks here but not in assign.js?
Nothing in particular; I was just trying to stay as close to the original as possible, since the main tests also check for NaN handling. But I see how this might be inconsistent in relation to `mul` or `imul`, since we don't do any input validation there. Should it be removed?
Yeah, I'd say let's go ahead and remove it. If we are not going to include it in assign, we shouldn't deviate in the main export.
Co-authored-by: Athan <kgryte@gmail.com> Signed-off-by: Athan <kgryte@gmail.com>
```javascript
out[ offset ] = ( ( ha*hb ) + w1 + k ) >>> 0; // compute the higher 32 bits and cast to an unsigned 32-bit integer
out[ offset + stride ] = ( ( t << 16 ) + w3 ) >>> 0; // compute the lower 32 bits and cast to an unsigned 32-bit integer
out[ offset + stride ] = umul( a, b ) >>> 0; // compute the lower 32 bits and cast to an unsigned 32-bit integer
```
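For context, here is a minimal sketch of double-word multiplication via 16-bit splitting, in the spirit of the diff above (variable names mirror the diff, but this is an illustrative reconstruction, not the exact stdlib source):

```javascript
// Illustrative sketch of unsigned double-word multiplication via 16-bit
// splitting (NOT the exact stdlib source). Splitting into 16-bit halves keeps
// every intermediate product below 2^32, so no precision is lost in doubles.
function umuldwSketch( a, b ) {
	var ha;
	var la;
	var hb;
	var lb;
	var w1;
	var w2;
	var w3;
	var t;
	var k;
	a >>>= 0;
	b >>>= 0;
	ha = a >>> 16;    // high 16 bits of `a`
	la = a & 0xffff;  // low 16 bits of `a`
	hb = b >>> 16;    // high 16 bits of `b`
	lb = b & 0xffff;  // low 16 bits of `b`
	t = la * lb;
	w3 = t & 0xffff;  // lowest 16 bits of the full product
	k = t >>> 16;     // carry out of the low product
	t = ( ha*lb ) + k;
	w2 = t & 0xffff;
	w1 = t >>> 16;    // contributes to the high word
	t = ( la*hb ) + w2;
	k = t >>> 16;     // contributes to the high word
	return [
		( ( ha*hb ) + w1 + k ) >>> 0, // higher 32 bits
		( ( t << 16 ) + w3 ) >>> 0    // lower 32 bits
	];
}
```

For example, `umuldwSketch( 0xffffffff, 0xffffffff )` yields `[ 4294967294, 1 ]`, i.e., the full 64-bit product `0xfffffffe00000001`.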
@impawstarlight I am a bit dense, but how does this manage to produce the same result? Previously, the logic for computing the lower 32 bits doesn't exceed the max uint32, but, here, `a*b` could, resulting in wraparound, so it is a bit counterintuitive to me that it achieves the same result.
Ultimately, this boils down to a call to `imul`, but it is not obvious to me why `imul` is faster than a bit shift plus addition.
> Previously, the logic for computing the lower 32 bits doesn't exceed the max uint32, but, here, `a*b` could, resulting in wraparound, which is a bit counterintuitive to me that it achieves the same result.
Actually, the way it avoids overflow is a clever engineering trick for extracting those bits that would normally overflow past the lower 32 bits. This is done because these overflow bits contribute to the higher 32 bits and hence are necessary for that calculation.
But for the lower 32 bits, we could very well make do with allowing overflow if we didn't have to calculate the higher 32 bits, like here in our `imul` polyfill.
Ultimately, it is fully equivalent to `imul` because of what its purpose is: calculating the low 32 bits of a 32×32 multiplication, which is basically the definition of `imul`.
So the wraparound behaviour of `imul` is also happening in the shift-add approach; it is just not very obvious, because the overflowing bits are handled through the 16-bit splitting logic, which eliminates any intermediate overflow.
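The equivalence can be checked directly (a quick sketch; `Math.imul` is the standard built-in, and the shift-add variant here is a simplified stand-in for the 16-bit splitting logic):

```javascript
// Low 32 bits of a*b, computed two ways; both are congruent mod 2^32.
var a = 0xffffffff >>> 0;
var b = 0x12345678 >>> 0;

// (1) Split `a` into 16-bit halves; each partial product is exact in a
// double, and `<<` plus `>>> 0` perform the mod-2^32 reduction (wraparound):
var ha = a >>> 16;
var la = a & 0xffff;
var lo1 = ( ( ( ha*b ) << 16 ) + ( la*b ) ) >>> 0;

// (2) Built-in 32-bit multiply, which wraps around by definition:
var lo2 = Math.imul( a, b ) >>> 0;

console.log( lo1 === lo2 ); // => true
```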
> Ultimately, this boils down to a call to `imul`, but not obvious to me why `imul` is faster than a bit shift plus addition.
`imul` is probably faster here because we were previously doing three operations:

```javascript
w3 = ( t & LOW_WORD_MASK ) >>> 0;
// ...
out[ offset + stride ] = ( ( t << 16 ) + w3 ) >>> 0;
```

So we're comparing AND + SHIFT + ADD vs IMUL. Although individual add and bitwise instructions are very fast, the combination is probably slower than a single IMUL instruction because of various other factors, like intermediate values moving around between registers. Just my guess, but the benchmarks agree.
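A rough way to sanity-check that intuition is a crude `console.time` comparison (not the stdlib benchmark harness; absolute timings depend heavily on the engine and JIT warmup):

```javascript
// Crude timing comparison of shift-add vs Math.imul for the low 32 bits.
// Accumulating into checksums keeps the JIT from eliminating the loops and
// lets us verify that both approaches agree.
var N = 1e6;
var a = 2654435761 >>> 0; // arbitrary test operand
var b = 374761393 >>> 0;  // arbitrary test operand
var ha = a >>> 16;
var la = a & 0xffff;
var r1 = 0;
var r2 = 0;
var i;

console.time( 'shift-add' );
for ( i = 0; i < N; i++ ) {
	r1 = ( r1 + ( ( ( ( ha*b ) << 16 ) + ( la*b ) ) >>> 0 ) ) >>> 0;
}
console.timeEnd( 'shift-add' );

console.time( 'imul' );
for ( i = 0; i < N; i++ ) {
	r2 = ( r2 + ( Math.imul( a, b ) >>> 0 ) ) >>> 0;
}
console.timeEnd( 'imul' );

console.log( r1 === r2 ); // both checksums should match
```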
Co-authored-by: Athan <kgryte@gmail.com> Signed-off-by: Athan <kgryte@gmail.com>
kgryte left a comment
@impawstarlight With the removal of the isnan checks in the main export, do the corresponding tests also need to be removed?
Resolves none.
Description
This pull request:
- refactors the `number/uint32/base/muldw` implementation to improve performance by moving the `isnan` check out of the lower level implementation.
- uses `number/uint32/base/mul` for computing the lower half of the double word product to reduce unnecessary calculation.

Related Issues
No.
Questions
No.
Other
No.
Checklist
AI Assistance
If you answered "yes" above, how did you use AI assistance?
Disclosure
{{TODO: add disclosure if applicable}}
@stdlib-js/reviewers