bitunpacking cuda kernels store output into shared memory before copying to main memory#6384

Merged
robert3005 merged 3 commits into develop from rk/fasterbitpack
Feb 10, 2026

@robert3005

Signed-off-by: Robert Kruszewski <github@robertk.io>

@robert3005 added the changelog/performance label (A performance improvement) on Feb 10, 2026

0ax1 commented Feb 10, 2026

Got a benchmark which outlines the perf diff? 🙂


0ax1 commented Feb 10, 2026

Keep in mind that the prev version prob already used the implicit L1.


@0ax1 left a comment

Really wanna see that bench first.

…ing to main memory

Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005

This is rebased on develop that has the benchmark. How do I run it?


0ax1 commented Feb 10, 2026

before:

bitunpack_cuda_u8/bitunpack/3bw
                        time:   [308.60 µs 308.77 µs 308.96 µs]
                        thrpt:  [301.44 GiB/s 301.63 GiB/s 301.79 GiB/s]
                 change:
                        time:   [-2.0430% -1.8218% -1.5659%] (p = 0.00 < 0.05)
                        thrpt:  [+1.5908% +1.8556% +2.0856%]
                        Performance has improved.

bitunpack_cuda_u16/bitunpack/5bw
                        time:   [690.15 µs 690.40 µs 690.62 µs]
                        thrpt:  [269.71 GiB/s 269.79 GiB/s 269.89 GiB/s]
                 change:
                        time:   [-0.7030% -0.6233% -0.5406%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5435% +0.6272% +0.7080%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

bitunpack_cuda_u32/bitunpack/6bw
                        time:   [1.0271 ms 1.0278 ms 1.0292 ms]
                        thrpt:  [361.98 GiB/s 362.44 GiB/s 362.70 GiB/s]
                 change:
                        time:   [-0.4044% -0.2540% -0.0700%] (p = 0.01 < 0.05)
                        thrpt:  [+0.0700% +0.2546% +0.4061%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

bitunpack_cuda_u64/bitunpack/8bw
                        time:   [1.9979 ms 1.9991 ms 2.0006 ms]
                        thrpt:  [372.43 GiB/s 372.69 GiB/s 372.91 GiB/s]
                 change:
                        time:   [-0.0673% -0.0059% +0.0617%] (p = 0.87 > 0.05)
                        thrpt:  [-0.0617% +0.0059% +0.0673%]
                        No change in performance detected.
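
As a sanity check on what the throughput figures measure, multiplying each median time by its median throughput gives the number of bytes processed per iteration. The sketch below does that for the "before" numbers. It assumes that throughput is counted in unpacked output bytes and that each element occupies its full type width (1, 2, 4 or 8 bytes). Neither assumption is stated in the thread. Under those assumptions every benchmark works out to roughly 100 million elements.

```python
GIB = 2**30  # Criterion reports throughput in GiB/s

# (median time in seconds, median throughput in GiB/s, assumed bytes per element),
# copied from the "before" output above.
before = {
    "u8/3bw": (308.77e-6, 301.63, 1),
    "u16/5bw": (690.40e-6, 269.79, 2),
    "u32/6bw": (1.0278e-3, 362.44, 4),
    "u64/8bw": (1.9991e-3, 372.69, 8),
}


def implied_elements(time_s, thrpt_gib_s, elem_bytes):
    """Elements per iteration implied by time x throughput (assumes output-byte throughput)."""
    return time_s * thrpt_gib_s * GIB / elem_bytes


elements = {name: implied_elements(*row) for name, row in before.items()}

for name, count in elements.items():
    print(f"{name}: about {count / 1e6:.1f} million elements per iteration")
```

If the assumption holds, the four benchmarks decode the same number of values, so their times are directly comparable per element.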

after:

bitunpack_cuda_u8/bitunpack/3bw
                        time:   [291.98 µs 293.55 µs 296.47 µs]
                        thrpt:  [314.13 GiB/s 317.26 GiB/s 318.97 GiB/s]
                 change:
                        time:   [-5.5754% -4.8651% -4.1077%] (p = 0.00 < 0.05)
                        thrpt:  [+4.2836% +5.1138% +5.9046%]
                        Performance has improved.

bitunpack_cuda_u16/bitunpack/5bw
                        time:   [550.80 µs 551.67 µs 553.01 µs]
                        thrpt:  [336.82 GiB/s 337.63 GiB/s 338.17 GiB/s]
                 change:
                        time:   [-20.135% -19.949% -19.743%] (p = 0.00 < 0.05)
                        thrpt:  [+24.600% +24.921% +25.211%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

bitunpack_cuda_u32/bitunpack/6bw
                        time:   [1.0073 ms 1.0087 ms 1.0101 ms]
                        thrpt:  [368.82 GiB/s 369.32 GiB/s 369.83 GiB/s]
                 change:
                        time:   [-2.2293% -2.0316% -1.8645%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8999% +2.0738% +2.2802%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

bitunpack_cuda_u64/bitunpack/8bw
                        time:   [620.31 µs 620.60 µs 620.95 µs]
                        thrpt:  [1199.9 GiB/s 1200.6 GiB/s 1201.1 GiB/s]
                 change:
                        time:   [-68.988% -68.954% -68.913%] (p = 0.00 < 0.05)
                        thrpt:  [+221.67% +222.10% +222.46%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
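
To compare the two runs directly, the speedup can be computed from the median times reported in the "before" and "after" output. This is a small sketch using only those medians. The dictionary keys are shorthand labels for the benchmark names, and the derived throughput gain is simply the speedup minus one, which may differ slightly from Criterion's own "change" lines because those compare against whatever baseline was saved on the benchmarking machine.

```python
# Median times in microseconds, copied from the "before" and "after" output above.
before_us = {"u8/3bw": 308.77, "u16/5bw": 690.40, "u32/6bw": 1027.8, "u64/8bw": 1999.1}
after_us = {"u8/3bw": 293.55, "u16/5bw": 551.67, "u32/6bw": 1008.7, "u64/8bw": 620.60}


def speedup(before, after):
    """How many times faster the 'after' run is than the 'before' run."""
    return before / after


speedups = {name: speedup(before_us[name], after_us[name]) for name in before_us}

for name, s in speedups.items():
    # For a fixed amount of data, throughput scales inversely with time.
    print(f"{name}: {s:.2f}x faster, throughput {100 * (s - 1):+.1f}%")
```

On these numbers the u64/8bw case gains the most (a bit over 3x), u16/5bw gains about 25%, and u8/3bw and u32/6bw move by a few percent.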

Signed-off-by: Robert Kruszewski <github@robertk.io>

codspeed-hq bot commented Feb 10, 2026

Merging this PR will degrade performance by 12.9%

⚡ 1 improved benchmark
❌ 3 regressed benchmarks
✅ 1134 untouched benchmarks
⏩ 1265 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode        Benchmark                        BASE       HEAD        Efficiency
Simulation  true_count_arrow_buffer[128]     946.9 ns   859.4 ns    +10.18%
Simulation  true_count_vortex_buffer[1024]   1.1 µs     1.2 µs      -11.93%
Simulation  true_count_vortex_buffer[2048]   1.2 µs     1.4 µs      -10.48%
Simulation  true_count_vortex_buffer[128]    984.7 ns   1,130.6 ns  -12.9%
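
The Efficiency column appears to be the relative speed of HEAD versus BASE, that is `BASE / HEAD - 1`. That formula is inferred from the two rows printed with sub-microsecond precision, not taken from CodSpeed's documentation, so treat it as an assumption. The rows shown as 1.1 µs and 1.2 µs are too heavily rounded to check this way.

```python
# (BASE in ns, HEAD in ns, reported efficiency in percent) for the two rows
# in the CodSpeed table that are printed with enough precision to check.
rows = {
    "true_count_arrow_buffer[128]": (946.9, 859.4, 10.18),
    "true_count_vortex_buffer[128]": (984.7, 1130.6, -12.9),
}


def efficiency_percent(base_ns, head_ns):
    """Assumed formula: how much faster (+) or slower (-) HEAD is than BASE."""
    return (base_ns / head_ns - 1.0) * 100.0


computed = {name: efficiency_percent(base, head) for name, (base, head, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(f"{name}: computed {computed[name]:+.2f}%, reported {reported:+.2f}%")
```

Both rows agree with the reported values to within rounding, and the worst row matches the 12.9% in the report headline.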

Comparing rk/fasterbitpack (d196877) with develop (3cb7fab) [2]

Footnotes

  1. 1265 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

  2. No successful run was found on develop (00d71b8) during the generation of this report, so 3cb7fab was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005 merged commit b6e49d4 into develop on Feb 10, 2026
73 of 115 checks passed
@robert3005 deleted the rk/fasterbitpack branch on February 10, 2026 17:36