Vectorize rotate better #5502

Merged
merged 23 commits into from
May 17, 2025
AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented May 13, 2025

⚙️ The optimization

Before this PR, std::rotate was already vectorized, but only indirectly. rotate calls reverse first on the two portions and then on the whole range, which leaves the range rotated. reverse is vectorized by working from both ends: it loads vectors, reverses them using shuffles, stores them to the opposite ends, and then finishes the middle part with a scalar loop. So we always have about N swaps (2*N assignments), and most of them are vectorized for large arrays.
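For reference, the double-reverse scheme described above can be sketched in a few lines. This is a minimal illustration built on std::reverse, not the STL's actual code; the names rotate_by_reversal and rotated_copy_by_reversal are made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rotate [first, last) left so that *mid becomes the first element,
// using three reversals. Every element is moved twice.
template <class BidIt>
void rotate_by_reversal(BidIt first, BidIt mid, BidIt last) {
    std::reverse(first, mid);  // reverse the left portion
    std::reverse(mid, last);   // reverse the right portion
    std::reverse(first, last); // reverse the whole range
}

// Hypothetical helper for trying the sketch out: returns v rotated left by k.
inline std::vector<int> rotated_copy_by_reversal(std::vector<int> v, std::size_t k) {
    assert(k <= v.size());
    rotate_by_reversal(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(k), v.end());
    return v;
}
```

For example, rotated_copy_by_reversal({1, 2, 3, 4, 5}, 2) goes through {2, 1, 3, 4, 5}, then {2, 1, 5, 4, 3}, and ends at {3, 4, 5, 1, 2}.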

A better vectorization improves on this as follows:

  • attempts to do fewer than 2*N assignments (2*N assignments correspond to N swaps)
  • avoids shuffles (cross-lane shuffles may be expensive)
  • uses memcpy/memmove to move data in blocks, which is potentially faster than a manually vectorized loop

The first obvious step for improving rotate is handling small rotations. The small part is copied into a small temporary buffer, the remaining part is moved, and then the small part is copied from the buffer to its final place. This way most elements are moved with only one assignment; only those placed into the temporary buffer need two.
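A minimal sketch of that small-rotation step, assuming trivially copyable data handled as raw bytes. The function name and the constant are invented for this illustration; only the case where the left part is the small one is shown, and the mirrored case would stash the right part instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumed buffer limit for this sketch (the PR description mentions 512 bytes).
constexpr std::size_t small_buffer_bytes = 512;

// Rotate the byte range [first, last) left around mid,
// for the case where the left part [first, mid) fits into the buffer.
inline void rotate_small_left(unsigned char* first, unsigned char* mid, unsigned char* last) {
    const std::size_t left  = static_cast<std::size_t>(mid - first);
    const std::size_t right = static_cast<std::size_t>(last - mid);
    assert(left <= small_buffer_bytes);

    unsigned char buffer[small_buffer_bytes];
    std::memcpy(buffer, first, left);         // stash the small part
    std::memmove(first, mid, right);          // shift the large part down (ranges overlap)
    std::memcpy(first + right, buffer, left); // put the small part at the end
}
```

Only the bytes that pass through the buffer are written twice; the large part is written once by the single memmove.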

The next step is to use vectorized swap_ranges to implement generic rotation. http://cppreference.com has an example recursive algorithm that uses swap in a loop; that loop is effectively swap_ranges, and the tail recursion is effectively a loop. If the rotation point is exactly in the middle, we're done in one iteration, making only N assignments. If it is close to the middle, the next step is a small rotation on slightly more than half of the original elements, which is still efficient. In worse cases, we do a few swap_ranges steps, but still never as many assignments as with the double reverse.
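The swap_ranges idea can be sketched as a loop like the one below. This is an illustrative version for plain pointers, written for this note with std::swap_ranges; rotate_by_swap_ranges is a made-up name, and the real implementation may structure the loop differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Rotate [first, last) left around mid using only block swaps.
// Each iteration puts the shorter part into its final position
// and shrinks the range that still needs rotating.
template <class T>
void rotate_by_swap_ranges(T* first, T* mid, T* last) {
    assert(first <= mid && mid <= last);
    while (first != mid && mid != last) {
        const std::ptrdiff_t left  = mid - first;
        const std::ptrdiff_t right = last - mid;
        if (left <= right) {
            // [A | B1 B2] -> [B1 | A B2]: B1 is now final, keep rotating [A | B2].
            std::swap_ranges(first, mid, mid);
            first = mid;
            mid += left;
        } else {
            // [A1 A2 | B] -> [A1 B | A2]: A2 is now final, keep rotating [A1 | B].
            std::swap_ranges(mid - right, mid, mid);
            last = mid;
            mid -= right;
        }
    }
}
```

When both parts have the same length, the loop finishes after a single swap_ranges call, which matches the "exactly in the middle" case above.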

Hypothetical functions like swap_3_ranges, swap_4_ranges, etc. could reduce the number of assignments in more cases. But going further brings less and less improvement for more and more added code, and at some point the complex decisions themselves take a noticeable amount of time, making the result slower, so we need to stop somewhere. Stopping at just the small rotation and the two-range swap strategy is probably a good idea.

The "small" rotation is arbitrarily defined as 512 bytes or less. That is, the algorithm uses at most 512 extra bytes of stack. This is the same amount as in remove_copy / unique_copy (#5355). Additionally, I think we should still prefer swap_ranges if the "small" part is not very small and the larger part is not very large, so that we end up with another significant reduction of the range. The thresholds for "very small" and "not very large" are also arbitrary, justified not by profiling but only by common sense.

As a bonus, since only the memcpy/memmove/swap_ranges functions are used, and no vector intrinsics that operate on elements of a particular size, rotation of elements whose size is not a power of two is now vectorized as well.
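To illustrate why the element size stops mattering, here is a sketch that rotates a 3-byte element type purely through byte-level operations. Everything in it is an assumption made for illustration: the rgb struct, the function names, and the simplified strategy choice, which uses only the 512-byte limit and omits the extra "very small" / "not very large" preference mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

struct rgb { unsigned char r, g, b; }; // hypothetical 3-byte element type

constexpr std::size_t buffer_bytes = 512; // stack buffer limit assumed for this sketch

// Rotate the byte range [first, last) left around mid.
inline void rotate_bytes(unsigned char* first, unsigned char* mid, unsigned char* last) {
    unsigned char buffer[buffer_bytes];
    for (;;) {
        const std::size_t left  = static_cast<std::size_t>(mid - first);
        const std::size_t right = static_cast<std::size_t>(last - mid);
        if (left == 0 || right == 0) {
            return; // nothing left to rotate
        }
        if (left <= buffer_bytes) { // small left part: one pass through the buffer
            std::memcpy(buffer, first, left);
            std::memmove(first, mid, right);
            std::memcpy(first + right, buffer, left);
            return;
        }
        if (right <= buffer_bytes) { // small right part: mirrored case
            std::memcpy(buffer, mid, right);
            std::memmove(first + right, first, left);
            std::memcpy(first, buffer, right);
            return;
        }
        if (left <= right) { // block swap, then continue on the unfinished tail
            std::swap_ranges(first, mid, mid);
            first = mid;
            mid += left;
        } else { // block swap, then continue on the unfinished head
            std::swap_ranges(mid - right, mid, mid);
            last = mid;
            mid -= right;
        }
    }
}

// Typed wrapper: any trivially copyable element type goes through the same byte path.
template <class T>
void rotate_trivial(T* first, T* mid, T* last) {
    static_assert(std::is_trivially_copyable_v<T>, "byte-wise rotation needs trivially copyable T");
    rotate_bytes(reinterpret_cast<unsigned char*>(first), reinterpret_cast<unsigned char*>(mid),
        reinterpret_cast<unsigned char*>(last));
}
```

Because all boundaries passed to rotate_bytes are multiples of sizeof(T), whole elements end up in the right places even though the code never looks at the element size.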

🐛 memmove performance bug

After the initial implementation, I observed that some of the benchmarks exhibited an unexpected slowdown. The cause was a surprisingly slow memmove. I've created a benchmark that reproduces the problem.

Benchmark

#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstring>

using namespace std;

alignas(4096) unsigned char v[1024 * 1024];

void bm_memmove(benchmark::State& state) {
    const auto size = static_cast<size_t>(state.range(0));
    const auto n    = static_cast<ptrdiff_t>(state.range(1));

    const size_t n1 = n < 0 ? 0 : n;
    const size_t n0 = n < 0 ? -n : 0;

    benchmark::DoNotOptimize(v);

    for (auto _ : state) {
        memmove(v + n0, v + n1, size);
        benchmark::DoNotOptimize(v);
    }
}

BENCHMARK(bm_memmove)->ArgsProduct({{8191, 8193}, {-5, +5}});

BENCHMARK_MAIN();

Results on i5-1235U

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
bm_memmove/8191/-5       71.4 ns         71.5 ns      8960000
bm_memmove/8193/-5       71.1 ns         71.5 ns      8960000
bm_memmove/8191/5        62.6 ns         61.0 ns      8960000
bm_memmove/8193/5        1903 ns         1925 ns       373333

Results on i7-8750H

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
bm_memmove/8191/-5        143 ns          141 ns      4977778
bm_memmove/8193/-5        145 ns          146 ns      4480000
bm_memmove/8191/5        77.2 ns         76.7 ns      8960000
bm_memmove/8193/5        80.9 ns         80.2 ns      8960000

Analysis

All I know or suspect so far:

  • The problem exists on Alder Lake but does not exist on Coffee Lake or Skylake
  • The problematic instruction is rep movsb, which is used in memmove
  • The problematic behavior reproduces for me when the size is greater than 8192 and the pointer difference is smaller than 64
  • Clang on Linux is also affected, as shown by reproducing the issue here: https://quick-bench.com/q/HgY3kPAaUIqkfmzwz_NFeoTcj3U

I'd appreciate any help in investigating the issue further.

Mitigation

Ideally, this issue should be reported to the VCRuntime maintainers.
But I feel we should first try to gather more information, so that the report is more useful.

For now, the _Move_to_lower_address and _Move_to_upper_address functions are introduced as a workaround.
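The description does not show how these two functions are implemented. Purely as an assumption, a direction-aware overlapping move could look like the sketch below, which copies fixed-size blocks through a temporary so that no single overlapping memmove call is issued. The names, the 64-byte block size, and the whole approach are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t block_bytes = 64; // hypothetical block size

// Overlapping move where dest is at or below src: copy forward, block by block.
inline void move_to_lower_address_sketch(void* dest, const void* src, std::size_t size) {
    auto* d       = static_cast<unsigned char*>(dest);
    const auto* s = static_cast<const unsigned char*>(src);
    assert(d <= s); // both pointers are expected to point into the same buffer
    unsigned char tmp[block_bytes];
    while (size >= block_bytes) {
        std::memcpy(tmp, s, block_bytes); // read the block before anything in it is overwritten
        std::memcpy(d, tmp, block_bytes);
        d += block_bytes;
        s += block_bytes;
        size -= block_bytes;
    }
    std::memcpy(tmp, s, size); // remaining tail, shorter than one block
    std::memcpy(d, tmp, size);
}

// Overlapping move where dest is at or above src: copy backward, block by block.
inline void move_to_upper_address_sketch(void* dest, const void* src, std::size_t size) {
    auto* d       = static_cast<unsigned char*>(dest) + size;
    const auto* s = static_cast<const unsigned char*>(src) + size;
    assert(d >= s); // both pointers are expected to point into the same buffer
    unsigned char tmp[block_bytes];
    while (size >= block_bytes) {
        d -= block_bytes;
        s -= block_bytes;
        std::memcpy(tmp, s, block_bytes);
        std::memcpy(d, tmp, block_bytes);
        size -= block_bytes;
    }
    std::memcpy(tmp, s - size, size); // remaining head, shorter than one block
    std::memcpy(d - size, tmp, size);
}
```

Whether a loop like this actually avoids the slow path depends on how the compiler lowers the fixed-size memcpy calls, so it would need to be confirmed with the benchmark above.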

✅ Test coverage

Apparently there's no trivial implementation of in-place rotate, so the tests use an implementation with an additional buffer as the LKG (last known good) implementation.
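A buffer-based reference of that kind might look like the sketch below. It is an illustration of the idea only; the name rotate_reference and the use of std::vector are assumptions, not the code in the PR's test.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Reference rotation that is simple enough to trust by inspection:
// build the rotated sequence in a second buffer, then replace the original.
template <class T>
void rotate_reference(std::vector<T>& v, std::size_t mid) {
    assert(mid <= v.size());
    const auto split = v.begin() + static_cast<std::ptrdiff_t>(mid);
    std::vector<T> tmp;
    tmp.reserve(v.size());
    tmp.insert(tmp.end(), split, v.end());   // elements from the rotation point onward come first
    tmp.insert(tmp.end(), v.begin(), split); // followed by the original head
    v = std::move(tmp);
}
```

A test can then run the optimized rotate and this reference on identical inputs and compare the results for every rotation point.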

📄 Headers or separate compilation?

<xutility> is a non-core header, and it already drags in the <cstring> functions and swap_ranges, so the algorithm could reside there.

On the other hand, since the algorithm is not a template, it can be kept separately compiled for throughput.

The current decision is a relic of an earlier intention to use AVX2 intrinsics directly. That intention was abandoned because it introduced far more complexity for relatively little potential gain.

⏱️ Benchmark results

Benchmark Before After Speedup
bm_rotate<uint8_t, AlgType::Std>/3333/2242 92.2 ns 71.1 ns 1.30
bm_rotate<uint8_t, AlgType::Std>/3332/1666 92.4 ns 42.3 ns 2.18
bm_rotate<uint8_t, AlgType::Std>/3333/1111 88.5 ns 61.7 ns 1.43
bm_rotate<uint8_t, AlgType::Std>/3333/501 88.4 ns 32.6 ns 2.71
bm_rotate<uint8_t, AlgType::Std>/3333/3300 90.1 ns 32.1 ns 2.81
bm_rotate<uint8_t, AlgType::Std>/3333/12 85.7 ns 25.4 ns 3.37
bm_rotate<uint8_t, AlgType::Std>/3333/5 84.3 ns 28.7 ns 2.94
bm_rotate<uint8_t, AlgType::Std>/3333/1 86.0 ns 28.1 ns 3.06
bm_rotate<uint8_t, AlgType::Std>/333/101 19.3 ns 12.1 ns 1.60
bm_rotate<uint8_t, AlgType::Std>/123/32 18.9 ns 6.41 ns 2.95
bm_rotate<uint8_t, AlgType::Std>/23/7 15.2 ns 5.20 ns 2.92
bm_rotate<uint8_t, AlgType::Std>/12/5 12.3 ns 5.19 ns 2.37
bm_rotate<uint8_t, AlgType::Std>/3/2 8.71 ns 4.64 ns 1.88
bm_rotate<uint8_t, AlgType::Rng>/3333/2242 91.4 ns 66.4 ns 1.38
bm_rotate<uint8_t, AlgType::Rng>/3332/1666 92.3 ns 39.1 ns 2.36
bm_rotate<uint8_t, AlgType::Rng>/3333/1111 88.8 ns 58.1 ns 1.53
bm_rotate<uint8_t, AlgType::Rng>/3333/501 89.0 ns 31.0 ns 2.87
bm_rotate<uint8_t, AlgType::Rng>/3333/3300 89.9 ns 31.7 ns 2.84
bm_rotate<uint8_t, AlgType::Rng>/3333/12 84.7 ns 25.4 ns 3.33
bm_rotate<uint8_t, AlgType::Rng>/3333/5 84.0 ns 28.2 ns 2.98
bm_rotate<uint8_t, AlgType::Rng>/3333/1 79.9 ns 28.2 ns 2.83
bm_rotate<uint8_t, AlgType::Rng>/333/101 19.1 ns 12.2 ns 1.57
bm_rotate<uint8_t, AlgType::Rng>/123/32 18.6 ns 6.43 ns 2.89
bm_rotate<uint8_t, AlgType::Rng>/23/7 14.0 ns 5.11 ns 2.74
bm_rotate<uint8_t, AlgType::Rng>/12/5 11.4 ns 5.11 ns 2.23
bm_rotate<uint8_t, AlgType::Rng>/3/2 2.99 ns 4.61 ns 0.65
bm_rotate<uint16_t, AlgType::Std>/3333/2242 173 ns 128 ns 1.35
bm_rotate<uint16_t, AlgType::Std>/3332/1666 179 ns 82.4 ns 2.17
bm_rotate<uint16_t, AlgType::Std>/3333/1111 179 ns 110 ns 1.63
bm_rotate<uint16_t, AlgType::Std>/3333/501 179 ns 164 ns 1.09
bm_rotate<uint16_t, AlgType::Std>/3333/3300 178 ns 59.3 ns 3.00
bm_rotate<uint16_t, AlgType::Std>/3333/12 165 ns 50.5 ns 3.27
bm_rotate<uint16_t, AlgType::Std>/3333/5 175 ns 54.7 ns 3.20
bm_rotate<uint16_t, AlgType::Std>/3333/1 176 ns 52.2 ns 3.37
bm_rotate<uint16_t, AlgType::Std>/333/101 26.0 ns 13.6 ns 1.91
bm_rotate<uint16_t, AlgType::Std>/123/32 16.5 ns 12.0 ns 1.38
bm_rotate<uint16_t, AlgType::Std>/23/7 11.7 ns 4.83 ns 2.42
bm_rotate<uint16_t, AlgType::Std>/12/5 13.4 ns 5.13 ns 2.61
bm_rotate<uint16_t, AlgType::Std>/3/2 8.44 ns 4.76 ns 1.77
bm_rotate<uint16_t, AlgType::Rng>/3333/2242 172 ns 128 ns 1.34
bm_rotate<uint16_t, AlgType::Rng>/3332/1666 180 ns 82.7 ns 2.18
bm_rotate<uint16_t, AlgType::Rng>/3333/1111 179 ns 111 ns 1.61
bm_rotate<uint16_t, AlgType::Rng>/3333/501 177 ns 163 ns 1.09
bm_rotate<uint16_t, AlgType::Rng>/3333/3300 179 ns 58.2 ns 3.08
bm_rotate<uint16_t, AlgType::Rng>/3333/12 172 ns 46.5 ns 3.70
bm_rotate<uint16_t, AlgType::Rng>/3333/5 175 ns 53.7 ns 3.26
bm_rotate<uint16_t, AlgType::Rng>/3333/1 173 ns 54.0 ns 3.20
bm_rotate<uint16_t, AlgType::Rng>/333/101 25.1 ns 13.1 ns 1.92
bm_rotate<uint16_t, AlgType::Rng>/123/32 16.3 ns 12.5 ns 1.30
bm_rotate<uint16_t, AlgType::Rng>/23/7 11.3 ns 4.86 ns 2.33
bm_rotate<uint16_t, AlgType::Rng>/12/5 12.8 ns 5.08 ns 2.52
bm_rotate<uint16_t, AlgType::Rng>/3/2 3.01 ns 4.65 ns 0.65
bm_rotate<uint32_t, AlgType::Std>/3333/2242 322 ns 246 ns 1.31
bm_rotate<uint32_t, AlgType::Std>/3332/1666 331 ns 164 ns 2.02
bm_rotate<uint32_t, AlgType::Std>/3333/1111 323 ns 200 ns 1.62
bm_rotate<uint32_t, AlgType::Std>/3333/501 324 ns 303 ns 1.07
bm_rotate<uint32_t, AlgType::Std>/3333/3300 327 ns 105 ns 3.11
bm_rotate<uint32_t, AlgType::Std>/3333/12 322 ns 89.3 ns 3.61
bm_rotate<uint32_t, AlgType::Std>/3333/5 322 ns 88.7 ns 3.63
bm_rotate<uint32_t, AlgType::Std>/3333/1 320 ns 89.3 ns 3.58
bm_rotate<uint32_t, AlgType::Std>/333/101 34.0 ns 16.1 ns 2.11
bm_rotate<uint32_t, AlgType::Std>/123/32 14.8 ns 12.7 ns 1.17
bm_rotate<uint32_t, AlgType::Std>/23/7 11.5 ns 6.92 ns 1.66
bm_rotate<uint32_t, AlgType::Std>/12/5 9.84 ns 6.66 ns 1.48
bm_rotate<uint32_t, AlgType::Std>/3/2 7.67 ns 4.71 ns 1.63
bm_rotate<uint32_t, AlgType::Rng>/3333/2242 321 ns 245 ns 1.31
bm_rotate<uint32_t, AlgType::Rng>/3332/1666 337 ns 161 ns 2.09
bm_rotate<uint32_t, AlgType::Rng>/3333/1111 324 ns 199 ns 1.63
bm_rotate<uint32_t, AlgType::Rng>/3333/501 320 ns 303 ns 1.06
bm_rotate<uint32_t, AlgType::Rng>/3333/3300 329 ns 104 ns 3.16
bm_rotate<uint32_t, AlgType::Rng>/3333/12 323 ns 88.6 ns 3.65
bm_rotate<uint32_t, AlgType::Rng>/3333/5 323 ns 88.4 ns 3.65
bm_rotate<uint32_t, AlgType::Rng>/3333/1 321 ns 89.3 ns 3.59
bm_rotate<uint32_t, AlgType::Rng>/333/101 34.2 ns 16.1 ns 2.12
bm_rotate<uint32_t, AlgType::Rng>/123/32 14.4 ns 12.6 ns 1.14
bm_rotate<uint32_t, AlgType::Rng>/23/7 10.9 ns 6.91 ns 1.58
bm_rotate<uint32_t, AlgType::Rng>/12/5 8.74 ns 6.91 ns 1.26
bm_rotate<uint32_t, AlgType::Rng>/3/2 3.01 ns 4.63 ns 0.65
bm_rotate<uint64_t, AlgType::Std>/3333/2242 644 ns 429 ns 1.50
bm_rotate<uint64_t, AlgType::Std>/3332/1666 647 ns 319 ns 2.03
bm_rotate<uint64_t, AlgType::Std>/3333/1111 586 ns 387 ns 1.51
bm_rotate<uint64_t, AlgType::Std>/3333/501 647 ns 581 ns 1.11
bm_rotate<uint64_t, AlgType::Std>/3333/3300 674 ns 209 ns 3.22
bm_rotate<uint64_t, AlgType::Std>/3333/12 651 ns 136 ns 4.79
bm_rotate<uint64_t, AlgType::Std>/3333/5 652 ns 173 ns 3.77
bm_rotate<uint64_t, AlgType::Std>/3333/1 645 ns 184 ns 3.51
bm_rotate<uint64_t, AlgType::Std>/333/101 61.6 ns 47.6 ns 1.29
bm_rotate<uint64_t, AlgType::Std>/123/32 20.4 ns 14.1 ns 1.45
bm_rotate<uint64_t, AlgType::Std>/23/7 10.9 ns 11.8 ns 0.92
bm_rotate<uint64_t, AlgType::Std>/12/5 11.6 ns 11.4 ns 1.02
bm_rotate<uint64_t, AlgType::Std>/3/2 8.34 ns 4.60 ns 1.81
bm_rotate<uint64_t, AlgType::Rng>/3333/2242 657 ns 427 ns 1.54
bm_rotate<uint64_t, AlgType::Rng>/3332/1666 653 ns 322 ns 2.03
bm_rotate<uint64_t, AlgType::Rng>/3333/1111 580 ns 387 ns 1.50
bm_rotate<uint64_t, AlgType::Rng>/3333/501 648 ns 576 ns 1.13
bm_rotate<uint64_t, AlgType::Rng>/3333/3300 645 ns 209 ns 3.09
bm_rotate<uint64_t, AlgType::Rng>/3333/12 644 ns 131 ns 4.92
bm_rotate<uint64_t, AlgType::Rng>/3333/5 650 ns 172 ns 3.78
bm_rotate<uint64_t, AlgType::Rng>/3333/1 642 ns 179 ns 3.59
bm_rotate<uint64_t, AlgType::Rng>/333/101 61.3 ns 48.0 ns 1.28
bm_rotate<uint64_t, AlgType::Rng>/123/32 22.5 ns 13.4 ns 1.68
bm_rotate<uint64_t, AlgType::Rng>/23/7 11.1 ns 11.3 ns 0.98
bm_rotate<uint64_t, AlgType::Rng>/12/5 11.6 ns 10.6 ns 1.09
bm_rotate<uint64_t, AlgType::Rng>/3/2 3.00 ns 4.63 ns 0.65
bm_rotate<color, AlgType::Std>/3333/2242 1711 ns 357 ns 4.79
bm_rotate<color, AlgType::Std>/3332/1666 1700 ns 241 ns 7.05
bm_rotate<color, AlgType::Std>/3333/1111 1705 ns 318 ns 5.36
bm_rotate<color, AlgType::Std>/3333/501 1695 ns 468 ns 3.62
bm_rotate<color, AlgType::Std>/3333/3300 1716 ns 160 ns 10.73
bm_rotate<color, AlgType::Std>/3333/12 1706 ns 131 ns 13.02
bm_rotate<color, AlgType::Std>/3333/5 1726 ns 152 ns 11.36
bm_rotate<color, AlgType::Std>/3333/1 1701 ns 151 ns 11.26
bm_rotate<color, AlgType::Std>/333/101 171 ns 46.5 ns 3.68
bm_rotate<color, AlgType::Std>/123/32 61.7 ns 14.5 ns 4.26
bm_rotate<color, AlgType::Std>/23/7 11.6 ns 11.9 ns 0.97
bm_rotate<color, AlgType::Std>/12/5 6.69 ns 7.12 ns 0.94
bm_rotate<color, AlgType::Std>/3/2 2.10 ns 4.96 ns 0.42
bm_rotate<color, AlgType::Rng>/3333/2242 1693 ns 366 ns 4.63
bm_rotate<color, AlgType::Rng>/3332/1666 1707 ns 241 ns 7.08
bm_rotate<color, AlgType::Rng>/3333/1111 1710 ns 321 ns 5.33
bm_rotate<color, AlgType::Rng>/3333/501 1717 ns 472 ns 3.64
bm_rotate<color, AlgType::Rng>/3333/3300 1695 ns 170 ns 9.97
bm_rotate<color, AlgType::Rng>/3333/12 1708 ns 135 ns 12.65
bm_rotate<color, AlgType::Rng>/3333/5 1711 ns 151 ns 11.33
bm_rotate<color, AlgType::Rng>/3333/1 1708 ns 154 ns 11.09
bm_rotate<color, AlgType::Rng>/333/101 170 ns 45.9 ns 3.70
bm_rotate<color, AlgType::Rng>/123/32 61.8 ns 14.3 ns 4.32
bm_rotate<color, AlgType::Rng>/23/7 12.0 ns 11.9 ns 1.01
bm_rotate<color, AlgType::Rng>/12/5 6.56 ns 8.81 ns 0.74
bm_rotate<color, AlgType::Rng>/3/2 2.11 ns 5.22 ns 0.40

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner May 13, 2025 18:59
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 13, 2025
@StephanTLavavej StephanTLavavej added the performance Must go faster label May 13, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 13, 2025
@github-project-automation github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews May 15, 2025
@StephanTLavavej StephanTLavavej removed their assignment May 15, 2025
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Ready To Merge in STL Code Reviews May 16, 2025
@StephanTLavavej
Member

5950X speedups:

Click to embiggen:
Benchmark Before After Speedup
bm_rotate<uint8_t, AlgType::Std>/3333/2242 114 ns 121 ns 0.94
bm_rotate<uint8_t, AlgType::Std>/3332/1666 114 ns 79.2 ns 1.44
bm_rotate<uint8_t, AlgType::Std>/3333/1111 132 ns 116 ns 1.14
bm_rotate<uint8_t, AlgType::Std>/3333/501 144 ns 31.7 ns 4.54
bm_rotate<uint8_t, AlgType::Std>/3333/3300 103 ns 46.6 ns 2.21
bm_rotate<uint8_t, AlgType::Std>/3333/12 112 ns 29.2 ns 3.84
bm_rotate<uint8_t, AlgType::Std>/3333/5 137 ns 28.0 ns 4.89
bm_rotate<uint8_t, AlgType::Std>/3333/1 137 ns 29.0 ns 4.72
bm_rotate<uint8_t, AlgType::Std>/333/101 24.6 ns 13.6 ns 1.81
bm_rotate<uint8_t, AlgType::Std>/123/32 23.5 ns 10.2 ns 2.30
bm_rotate<uint8_t, AlgType::Std>/23/7 11.7 ns 9.54 ns 1.23
bm_rotate<uint8_t, AlgType::Std>/12/5 8.68 ns 9.52 ns 0.91
bm_rotate<uint8_t, AlgType::Std>/3/2 7.38 ns 9.70 ns 0.76
bm_rotate<uint8_t, AlgType::Rng>/3333/2242 114 ns 121 ns 0.94
bm_rotate<uint8_t, AlgType::Rng>/3332/1666 115 ns 79.0 ns 1.46
bm_rotate<uint8_t, AlgType::Rng>/3333/1111 132 ns 116 ns 1.14
bm_rotate<uint8_t, AlgType::Rng>/3333/501 144 ns 31.8 ns 4.53
bm_rotate<uint8_t, AlgType::Rng>/3333/3300 103 ns 46.5 ns 2.22
bm_rotate<uint8_t, AlgType::Rng>/3333/12 112 ns 29.2 ns 3.84
bm_rotate<uint8_t, AlgType::Rng>/3333/5 137 ns 28.0 ns 4.89
bm_rotate<uint8_t, AlgType::Rng>/3333/1 137 ns 28.9 ns 4.74
bm_rotate<uint8_t, AlgType::Rng>/333/101 24.6 ns 13.6 ns 1.81
bm_rotate<uint8_t, AlgType::Rng>/123/32 23.1 ns 10.2 ns 2.26
bm_rotate<uint8_t, AlgType::Rng>/23/7 11.7 ns 9.53 ns 1.23
bm_rotate<uint8_t, AlgType::Rng>/12/5 8.48 ns 9.52 ns 0.89
bm_rotate<uint8_t, AlgType::Rng>/3/2 7.58 ns 9.08 ns 0.83
bm_rotate<uint16_t, AlgType::Std>/3333/2242 206 ns 195 ns 1.06
bm_rotate<uint16_t, AlgType::Std>/3332/1666 192 ns 98.4 ns 1.95
bm_rotate<uint16_t, AlgType::Std>/3333/1111 261 ns 177 ns 1.47
bm_rotate<uint16_t, AlgType::Std>/3333/501 271 ns 268 ns 1.01
bm_rotate<uint16_t, AlgType::Std>/3333/3300 205 ns 90.1 ns 2.28
bm_rotate<uint16_t, AlgType::Std>/3333/12 229 ns 51.5 ns 4.45
bm_rotate<uint16_t, AlgType::Std>/3333/5 278 ns 50.3 ns 5.53
bm_rotate<uint16_t, AlgType::Std>/3333/1 276 ns 51.0 ns 5.41
bm_rotate<uint16_t, AlgType::Std>/333/101 33.0 ns 14.1 ns 2.34
bm_rotate<uint16_t, AlgType::Std>/123/32 22.0 ns 13.4 ns 1.64
bm_rotate<uint16_t, AlgType::Std>/23/7 14.3 ns 9.31 ns 1.54
bm_rotate<uint16_t, AlgType::Std>/12/5 10.2 ns 9.29 ns 1.10
bm_rotate<uint16_t, AlgType::Std>/3/2 8.59 ns 10.1 ns 0.85
bm_rotate<uint16_t, AlgType::Rng>/3333/2242 206 ns 195 ns 1.06
bm_rotate<uint16_t, AlgType::Rng>/3332/1666 193 ns 98.3 ns 1.96
bm_rotate<uint16_t, AlgType::Rng>/3333/1111 264 ns 177 ns 1.49
bm_rotate<uint16_t, AlgType::Rng>/3333/501 273 ns 267 ns 1.02
bm_rotate<uint16_t, AlgType::Rng>/3333/3300 205 ns 90.3 ns 2.27
bm_rotate<uint16_t, AlgType::Rng>/3333/12 231 ns 51.1 ns 4.52
bm_rotate<uint16_t, AlgType::Rng>/3333/5 278 ns 50.2 ns 5.54
bm_rotate<uint16_t, AlgType::Rng>/3333/1 275 ns 50.9 ns 5.40
bm_rotate<uint16_t, AlgType::Rng>/333/101 33.1 ns 14.1 ns 2.35
bm_rotate<uint16_t, AlgType::Rng>/123/32 22.0 ns 13.4 ns 1.64
bm_rotate<uint16_t, AlgType::Rng>/23/7 14.8 ns 9.34 ns 1.58
bm_rotate<uint16_t, AlgType::Rng>/12/5 9.88 ns 9.29 ns 1.06
bm_rotate<uint16_t, AlgType::Rng>/3/2 8.91 ns 8.91 ns 1.00
bm_rotate<uint32_t, AlgType::Std>/3333/2242 381 ns 296 ns 1.29
bm_rotate<uint32_t, AlgType::Std>/3332/1666 387 ns 191 ns 2.03
bm_rotate<uint32_t, AlgType::Std>/3333/1111 383 ns 231 ns 1.66
bm_rotate<uint32_t, AlgType::Std>/3333/501 381 ns 367 ns 1.04
bm_rotate<uint32_t, AlgType::Std>/3333/3300 387 ns 178 ns 2.17
bm_rotate<uint32_t, AlgType::Std>/3333/12 384 ns 98.6 ns 3.89
bm_rotate<uint32_t, AlgType::Std>/3333/5 383 ns 98.6 ns 3.88
bm_rotate<uint32_t, AlgType::Std>/3333/1 383 ns 98.3 ns 3.90
bm_rotate<uint32_t, AlgType::Std>/333/101 39.6 ns 16.7 ns 2.37
bm_rotate<uint32_t, AlgType::Std>/123/32 18.4 ns 13.6 ns 1.35
bm_rotate<uint32_t, AlgType::Std>/23/7 13.4 ns 9.32 ns 1.44
bm_rotate<uint32_t, AlgType::Std>/12/5 7.25 ns 8.26 ns 0.88
bm_rotate<uint32_t, AlgType::Std>/3/2 7.03 ns 9.06 ns 0.78
bm_rotate<uint32_t, AlgType::Rng>/3333/2242 383 ns 296 ns 1.29
bm_rotate<uint32_t, AlgType::Rng>/3332/1666 387 ns 191 ns 2.03
bm_rotate<uint32_t, AlgType::Rng>/3333/1111 383 ns 231 ns 1.66
bm_rotate<uint32_t, AlgType::Rng>/3333/501 381 ns 367 ns 1.04
bm_rotate<uint32_t, AlgType::Rng>/3333/3300 387 ns 178 ns 2.17
bm_rotate<uint32_t, AlgType::Rng>/3333/12 382 ns 98.7 ns 3.87
bm_rotate<uint32_t, AlgType::Rng>/3333/5 383 ns 98.5 ns 3.89
bm_rotate<uint32_t, AlgType::Rng>/3333/1 381 ns 98.7 ns 3.86
bm_rotate<uint32_t, AlgType::Rng>/333/101 39.5 ns 16.7 ns 2.37
bm_rotate<uint32_t, AlgType::Rng>/123/32 18.3 ns 13.6 ns 1.35
bm_rotate<uint32_t, AlgType::Rng>/23/7 13.4 ns 9.33 ns 1.44
bm_rotate<uint32_t, AlgType::Rng>/12/5 7.23 ns 8.26 ns 0.88
bm_rotate<uint32_t, AlgType::Rng>/3/2 7.63 ns 9.63 ns 0.79
bm_rotate<uint64_t, AlgType::Std>/3333/2242 765 ns 505 ns 1.51
bm_rotate<uint64_t, AlgType::Std>/3332/1666 769 ns 380 ns 2.02
bm_rotate<uint64_t, AlgType::Std>/3333/1111 749 ns 442 ns 1.69
bm_rotate<uint64_t, AlgType::Std>/3333/501 765 ns 680 ns 1.13
bm_rotate<uint64_t, AlgType::Std>/3333/3300 768 ns 355 ns 2.16
bm_rotate<uint64_t, AlgType::Std>/3333/12 765 ns 198 ns 3.86
bm_rotate<uint64_t, AlgType::Std>/3333/5 765 ns 198 ns 3.86
bm_rotate<uint64_t, AlgType::Std>/3333/1 766 ns 203 ns 3.77
bm_rotate<uint64_t, AlgType::Std>/333/101 72.7 ns 54.5 ns 1.33
bm_rotate<uint64_t, AlgType::Std>/123/32 28.5 ns 14.6 ns 1.95
bm_rotate<uint64_t, AlgType::Std>/23/7 11.9 ns 12.5 ns 0.95
bm_rotate<uint64_t, AlgType::Std>/12/5 12.8 ns 11.2 ns 1.14
bm_rotate<uint64_t, AlgType::Std>/3/2 7.53 ns 9.43 ns 0.80
bm_rotate<uint64_t, AlgType::Rng>/3333/2242 761 ns 505 ns 1.51
bm_rotate<uint64_t, AlgType::Rng>/3332/1666 768 ns 380 ns 2.02
bm_rotate<uint64_t, AlgType::Rng>/3333/1111 746 ns 442 ns 1.69
bm_rotate<uint64_t, AlgType::Rng>/3333/501 762 ns 681 ns 1.12
bm_rotate<uint64_t, AlgType::Rng>/3333/3300 765 ns 356 ns 2.15
bm_rotate<uint64_t, AlgType::Rng>/3333/12 764 ns 198 ns 3.86
bm_rotate<uint64_t, AlgType::Rng>/3333/5 762 ns 198 ns 3.85
bm_rotate<uint64_t, AlgType::Rng>/3333/1 762 ns 203 ns 3.75
bm_rotate<uint64_t, AlgType::Rng>/333/101 72.6 ns 54.6 ns 1.33
bm_rotate<uint64_t, AlgType::Rng>/123/32 28.6 ns 14.7 ns 1.95
bm_rotate<uint64_t, AlgType::Rng>/23/7 11.9 ns 13.1 ns 0.91
bm_rotate<uint64_t, AlgType::Rng>/12/5 12.8 ns 11.2 ns 1.14
bm_rotate<uint64_t, AlgType::Rng>/3/2 7.36 ns 9.44 ns 0.78
bm_rotate<color, AlgType::Std>/3333/2242 3267 ns 548 ns 5.96
bm_rotate<color, AlgType::Std>/3332/1666 3218 ns 286 ns 11.25
bm_rotate<color, AlgType::Std>/3333/1111 3274 ns 505 ns 6.48
bm_rotate<color, AlgType::Std>/3333/501 3265 ns 767 ns 4.26
bm_rotate<color, AlgType::Std>/3333/3300 3257 ns 267 ns 12.20
bm_rotate<color, AlgType::Std>/3333/12 3304 ns 150 ns 22.03
bm_rotate<color, AlgType::Std>/3333/5 3254 ns 150 ns 21.69
bm_rotate<color, AlgType::Std>/3333/1 3250 ns 150 ns 21.67
bm_rotate<color, AlgType::Std>/333/101 323 ns 70.7 ns 4.57
bm_rotate<color, AlgType::Std>/123/32 120 ns 14.4 ns 8.33
bm_rotate<color, AlgType::Std>/23/7 19.4 ns 13.1 ns 1.48
bm_rotate<color, AlgType::Std>/12/5 9.99 ns 9.45 ns 1.06
bm_rotate<color, AlgType::Std>/3/2 2.75 ns 9.45 ns 0.29
bm_rotate<color, AlgType::Rng>/3333/2242 3267 ns 549 ns 5.95
bm_rotate<color, AlgType::Rng>/3332/1666 3213 ns 286 ns 11.23
bm_rotate<color, AlgType::Rng>/3333/1111 3271 ns 505 ns 6.48
bm_rotate<color, AlgType::Rng>/3333/501 3265 ns 769 ns 4.25
bm_rotate<color, AlgType::Rng>/3333/3300 3251 ns 267 ns 12.18
bm_rotate<color, AlgType::Rng>/3333/12 3292 ns 150 ns 21.95
bm_rotate<color, AlgType::Rng>/3333/5 3252 ns 150 ns 21.68
bm_rotate<color, AlgType::Rng>/3333/1 3256 ns 151 ns 21.56
bm_rotate<color, AlgType::Rng>/333/101 323 ns 70.7 ns 4.57
bm_rotate<color, AlgType::Rng>/123/32 120 ns 14.6 ns 8.22
bm_rotate<color, AlgType::Rng>/23/7 19.6 ns 13.4 ns 1.46
bm_rotate<color, AlgType::Rng>/12/5 9.98 ns 9.39 ns 1.06
bm_rotate<color, AlgType::Rng>/3/2 2.75 ns 9.68 ns 0.28

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 16, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Member

I resolved a trivial adjacent-add conflict with #5493 in benchmarks/CMakeLists.txt.

@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 17, 2025
@github-project-automation github-project-automation bot moved this from Done to Initial Review in STL Code Reviews May 17, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Merging in STL Code Reviews May 17, 2025
@StephanTLavavej StephanTLavavej merged commit cbd091e into microsoft:main May 17, 2025
40 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 17, 2025
@StephanTLavavej
Member

When things spin, science happens! 🔁 🧑‍🔬 🧪

@AlexGuteniev AlexGuteniev deleted the swirl branch May 17, 2025 05:53