Vectorize rotate better #5502
Merged
StephanTLavavej requested changes on May 15, 2025
For `rotate_copy` too: it uses `memcpy`, hence it is considered vectorized.
StephanTLavavej approved these changes on May 16, 2025
5950X speedups:
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
I resolved a trivial adjacent-add conflict with #5493.
StephanTLavavej approved these changes on May 17, 2025
When things spin, science happens! 🔁 🧑‍🔬 🧪
⚙️ The optimization
Before this PR

`std::rotate` was already vectorized, but only indirectly. `rotate` calls `reverse` first on each of the two portions, then on the whole range, which yields the rotated result. `reverse` is vectorized by walking inward from both ends, loading vectors, reversing them with shuffles, and storing them at the opposite ends, then finishing the middle part with a scalar loop. So we always make exactly N*2 swaps, and most of them are vectorized for large arrays.

A better vectorization improves on this as follows:

- Use `memcpy`/`memmove` to move blocks, which is potentially faster than a manually vectorized loop.

The first obvious step for improving `rotate` is to handle small rotations. The small part can fit into a small temporary buffer; the remainder is then moved, and the small part is copied from that buffer into its proper place. This moves most of the elements with only one assignment each; only the elements put into the temporary buffer need two assignments.

The next step is to use vectorized `swap_ranges` to implement generic rotation. http://cppreference.com has an example recursive algorithm that uses `swap` in a loop; that loop is actually `swap_ranges`, and that tail recursion is actually a loop. If the rotation point is exactly in the middle, we are done in one iteration, making only N assignments. If it is close to the middle, the next step is a small rotation over slightly more than half of the original elements, which is still efficient. In the worst case, we do a few `swap_ranges` steps, but we still never make as many assignments as the double `reverse` does.

Hypothetical functions like `swap_3_ranges`, `swap_4_ranges`, etc. could reduce the number of assignments in more cases. But going further in this direction yields less and less improvement for more and more added code, and at some point the complex decisions themselves would take a noticeable amount of time, resulting in a net loss, so we need to stop somewhere. Stopping at the small-rotation and two-range-swap strategies seems like a good idea.

A "small" rotation is arbitrarily defined as 512 bytes or less; that is, the algorithm uses at most 512 extra bytes of stack. This is the same amount as in `remove_copy`/`unique_copy` (#5355). Additionally, I think we should still prefer `swap_ranges` when the "small" part is not very small and the larger part is not very large, so that we end up with another significant reduction of the range. The thresholds for "very small" and "not very large" are also arbitrary, justified not by profiling but only by ~~vibe~~ common sense.

As a bonus, since only the `memcpy`/`memmove`/`swap_ranges` functions are used, with no vector intrinsics that work on individual elements, rotation of elements with non-power-of-two sizes is now vectorized as well.
🐛 `memmove` performance bug

After the initial implementation, I observed that some of the benchmarks exhibited an unexpected slowdown. The issue was a surprisingly slow `memmove`. I've created a benchmark repro of this problem.

Benchmark

Results on i5-1235U

Results on i7-8750H

Analysis

All I know or suspect so far:

- `rep movsb`, which is used in `memmove`

I'd appreciate any help in investigating this issue further.
Mitigation
Ideally, we'd report this issue to the VCRuntime maintainers,
but I feel we need to try to gather more information first, to make a better report.
For now, the `_Move_to_lower_address` and `_Move_to_upper_address` functions are introduced as a workaround.

✅ Test coverage
Apparently there's no trivial implementation of in-place rotate. Well, then, we'll use an implementation that uses an additional buffer as the LKG (last known good) implementation.
📄 Headers or separate compilation?
`<xutility>` is a non-core header, and it already drags in the `<cstring>` functions and `swap_ranges`, so the algorithm could reside there. On the other hand, since it is non-template code, it can be kept separately compiled for throughput.

The current decision is a relic of an intention to use AVX2 intrinsics directly. That intention was abandoned because it would introduce far more complexity for relatively little potential gain.
⏱️ Benchmark results
The benchmark matrix is `bm_rotate<T, A>/N/M`, for each element type `T` in {`uint8_t`, `uint16_t`, `uint32_t`, `uint64_t`, `color`} and each algorithm variant `A` in {`AlgType::Std`, `AlgType::Rng`}, over the `N`/`M` parameter pairs: 3333/2242, 3332/1666, 3333/1111, 3333/501, 3333/3300, 3333/12, 3333/5, 3333/1, 333/101, 123/32, 23/7, 12/5, and 3/2.