Vectorize rotate better #5502
Merged
StephanTLavavej requested changes on May 15, 2025
For `rotate_copy` too: it uses `memcpy`, hence it is considered vectorized.
StephanTLavavej approved these changes on May 16, 2025
5950X speedups:
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
I resolved a trivial adjacent-add conflict with #5493.
StephanTLavavej approved these changes on May 17, 2025
When things spin, science happens! 🔁 🧑‍🔬 🧪
⚙️ The optimization
Before this PR

`std::rotate` was already vectorized, but only indirectly. `rotate` calls `reverse` first on each of the two portions, then on the whole range, which yields the rotated result. `reverse` is vectorized by walking inward from both ends, loading vectors, reversing them with shuffles, and storing them at the opposite ends, then finishing the middle part with a scalar loop. So we always make exactly N*2 swaps, and most of them are vectorized for large arrays.

A better vectorization improves on this as follows:

- Use `memcpy`/`memmove` to move blocks, which is potentially faster than a manually vectorized loop.

The first obvious step for improving `rotate` is to handle small rotations. The small part can fit into a small temporary buffer; the remainder is then moved, and the small part is copied from that buffer into its proper place. This moves most of the elements with only one assignment each; only the elements put into the temporary buffer need two assignments.

The next step is to use vectorized `swap_ranges` to implement generic rotation. http://cppreference.com has an example recursive algorithm that uses `swap` in a loop; that loop is actually `swap_ranges`, and that tail recursion is actually a loop. If the rotation point is exactly in the middle, we are done in one iteration, making only N assignments. If it is close to the middle, the next step is a small rotation over slightly more than half of the original elements, which is still efficient. In the worst case, we do a few `swap_ranges` steps, but we still never make as many assignments as the double `reverse` does.

Hypothetical functions like `swap_3_ranges`, `swap_4_ranges`, etc. could reduce the number of assignments in more cases. But going further in this direction yields less and less improvement for more and more added code, and at some point the complex decisions themselves would take a noticeable amount of time, resulting in a net loss, so we need to stop somewhere. Stopping at the small-rotation and two-range-swap strategies seems like a good idea.

A "small" rotation is arbitrarily defined as 512 bytes or less; that is, the algorithm uses at most 512 extra bytes of stack. This is the same amount as in `remove_copy`/`unique_copy` (#5355). Additionally, I think we should still prefer `swap_ranges` when the "small" part is not very small and the larger part is not very large, so that we end up with another significant reduction of the range. The thresholds for "very small" and "not very large" are also arbitrary, justified not by profiling but only by ~~vibe~~ common sense.

As a bonus, since only the `memcpy`/`memmove`/`swap_ranges` functions are used, with no vector intrinsics that work on individual elements, rotation of elements with non-power-of-two sizes is now vectorized as well.
🐛 `memmove` performance bug

After the initial implementation, I observed that some of the benchmarks exhibited an unexpected slowdown. The issue was a surprisingly slow `memmove`. I've created a benchmark repro of this problem.

Benchmark

Results on i5-1235U

Results on i7-8750H

Analysis

All I know or suspect so far:

- `rep movsb`, which is used in `memmove`

I'd appreciate any help in investigating this issue further.
Mitigation
Ideally, we'd report this issue to the VCRuntime maintainers,
but I feel we need to try to gather more information first, to make a better report.
For now, the `_Move_to_lower_address` and `_Move_to_upper_address` functions are introduced as a workaround.

✅ Test coverage
Apparently there's no trivial implementation of in-place rotate. Well, then, we'll use an implementation that uses an additional buffer as the LKG (last known good) implementation.
📄 Headers or separate compilation?
`<xutility>` is a non-core header, and it already drags in the `<cstring>` functions and `swap_ranges`, so the algorithm could reside there. On the other hand, since it is non-template code, it can be kept separately compiled for throughput.

The current decision is a relic of an intention to use AVX2 intrinsics directly. That intention was abandoned because it would introduce far more complexity for relatively little potential gain.
⏱️ Benchmark results
The benchmark matrix is `bm_rotate<T, A>/N/M`, for each element type `T` in {`uint8_t`, `uint16_t`, `uint32_t`, `uint64_t`, `color`} and each algorithm variant `A` in {`AlgType::Std`, `AlgType::Rng`}, over the `N`/`M` parameter pairs: 3333/2242, 3332/1666, 3333/1111, 3333/501, 3333/3300, 3333/12, 3333/5, 3333/1, 333/101, 123/32, 23/7, 12/5, and 3/2.