bitunpacking cuda kernels store output into shared memory before copying to main memory by robert3005 · Pull Request #6384 · vortex-data/vortex

robert3005 · 2026-02-10T12:31:11Z

Signed-off-by: Robert Kruszewski <github@robertk.io>

0ax1 · 2026-02-10T12:32:33Z

Got a benchmark which outlines the perf diff? 🙂

0ax1 · 2026-02-10T12:36:40Z

Keep in mind that the prev version prob already used the implicit L1.

0ax1

Really wanna see that bench first.

robert3005 · 2026-02-10T15:17:18Z

This is rebased on develop that has the benchmark. How do I run it?

0ax1 · 2026-02-10T15:37:38Z

before:

bitunpack_cuda_u8/bitunpack/3bw
                        time:   [308.60 µs 308.77 µs 308.96 µs]
                        thrpt:  [301.44 GiB/s 301.63 GiB/s 301.79 GiB/s]
                 change:
                        time:   [-2.0430% -1.8218% -1.5659%] (p = 0.00 < 0.05)
                        thrpt:  [+1.5908% +1.8556% +2.0856%]
                        Performance has improved.

bitunpack_cuda_u16/bitunpack/5bw
                        time:   [690.15 µs 690.40 µs 690.62 µs]
                        thrpt:  [269.71 GiB/s 269.79 GiB/s 269.89 GiB/s]
                 change:
                        time:   [-0.7030% -0.6233% -0.5406%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5435% +0.6272% +0.7080%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

bitunpack_cuda_u32/bitunpack/6bw
                        time:   [1.0271 ms 1.0278 ms 1.0292 ms]
                        thrpt:  [361.98 GiB/s 362.44 GiB/s 362.70 GiB/s]
                 change:
                        time:   [-0.4044% -0.2540% -0.0700%] (p = 0.01 < 0.05)
                        thrpt:  [+0.0700% +0.2546% +0.4061%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

bitunpack_cuda_u64/bitunpack/8bw
                        time:   [1.9979 ms 1.9991 ms 2.0006 ms]
                        thrpt:  [372.43 GiB/s 372.69 GiB/s 372.91 GiB/s]
                 change:
                        time:   [-0.0673% -0.0059% +0.0617%] (p = 0.87 > 0.05)
                        thrpt:  [-0.0617% +0.0059% +0.0673%]
                        No change in performance detected.

after:

bitunpack_cuda_u8/bitunpack/3bw
                        time:   [291.98 µs 293.55 µs 296.47 µs]
                        thrpt:  [314.13 GiB/s 317.26 GiB/s 318.97 GiB/s]
                 change:
                        time:   [-5.5754% -4.8651% -4.1077%] (p = 0.00 < 0.05)
                        thrpt:  [+4.2836% +5.1138% +5.9046%]
                        Performance has improved.

bitunpack_cuda_u16/bitunpack/5bw
                        time:   [550.80 µs 551.67 µs 553.01 µs]
                        thrpt:  [336.82 GiB/s 337.63 GiB/s 338.17 GiB/s]
                 change:
                        time:   [-20.135% -19.949% -19.743%] (p = 0.00 < 0.05)
                        thrpt:  [+24.600% +24.921% +25.211%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

bitunpack_cuda_u32/bitunpack/6bw
                        time:   [1.0073 ms 1.0087 ms 1.0101 ms]
                        thrpt:  [368.82 GiB/s 369.32 GiB/s 369.83 GiB/s]
                 change:
                        time:   [-2.2293% -2.0316% -1.8645%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8999% +2.0738% +2.2802%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

bitunpack_cuda_u64/bitunpack/8bw
                        time:   [620.31 µs 620.60 µs 620.95 µs]
                        thrpt:  [1199.9 GiB/s 1200.6 GiB/s 1201.1 GiB/s]
                 change:
                        time:   [-68.988% -68.954% -68.913%] (p = 0.00 < 0.05)
                        thrpt:  [+221.67% +222.10% +222.46%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
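To put the two runs side by side, here is a small sketch that derives the speedup from the median times quoted in the "before" and "after" outputs. The dictionary layout and helper names are illustrative assumptions, not part of the benchmark harness; only the numbers come from the runs above.

```python
# Median times in microseconds, copied from the "before" and "after" Criterion runs.
# The dict layout and helper names are illustrative, not part of the benchmark harness.
MEDIAN_US = {
    "u8/3bw": (308.77, 293.55),
    "u16/5bw": (690.40, 551.67),
    "u32/6bw": (1027.8, 1008.7),
    "u64/8bw": (1999.1, 620.60),
}


def speedup(before_us: float, after_us: float) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    return before_us / after_us


def throughput_gain_from_time_change(time_change_pct: float) -> float:
    """Convert a relative time change (%) into the matching throughput change (%).

    Throughput is bytes / time, so it scales with 1 / time.
    """
    return (1.0 / (1.0 + time_change_pct / 100.0) - 1.0) * 100.0


for name, (before_us, after_us) in MEDIAN_US.items():
    print(f"{name}: {speedup(before_us, after_us):.2f}x")
```

The second helper shows why Criterion's time and throughput deltas look so different for the same run: a -68.954% change in time for u64/8bw corresponds to roughly a +222.10% change in throughput, because throughput is inversely proportional to time.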

codspeed-hq · 2026-02-10T15:52:22Z

Merging this PR will degrade performance by 12.9%

⚡ 1 improved benchmark
❌ 3 regressed benchmarks
✅ 1134 untouched benchmarks
⏩ 1265 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | `true_count_arrow_buffer[128]` | 946.9 ns | 859.4 ns | +10.18% |
| ❌ | Simulation | `true_count_vortex_buffer[1024]` | 1.1 µs | 1.2 µs | -11.93% |
| ❌ | Simulation | `true_count_vortex_buffer[2048]` | 1.2 µs | 1.4 µs | -10.48% |
| ❌ | Simulation | `true_count_vortex_buffer[128]` | 984.7 ns | 1,130.6 ns | -12.9% |

Comparing rk/fasterbitpack (d196877) with develop (3cb7fab)²

¹ 1265 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
² No successful run was found on develop (00d71b8) during the generation of this report, so 3cb7fab was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
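The report does not define the Efficiency column. The two rows whose `BASE` and `HEAD` values are shown unrounded are consistent with `BASE / HEAD - 1`, which the sketch below checks. Treat the formula as an inference from the numbers in this report, not as CodSpeed's documented definition; the function name is hypothetical.

```python
def efficiency_pct(base_ns: float, head_ns: float) -> float:
    """Assumed formula: relative speed of HEAD versus BASE, in percent."""
    return (base_ns / head_ns - 1.0) * 100.0


# Rows from the report whose BASE/HEAD values are shown unrounded.
print(f"true_count_arrow_buffer[128]:  {efficiency_pct(946.9, 859.4):+.2f}%")
print(f"true_count_vortex_buffer[128]: {efficiency_pct(984.7, 1130.6):+.1f}%")
```

The other two rows display `BASE` and `HEAD` rounded to one decimal place in µs, so they cannot be checked to the same precision from the table alone.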

robert3005 added the changelog/performance label (A performance improvement) Feb 10, 2026

robert3005 requested review from 0ax1 and joseph-isaacs February 10, 2026 12:31

joseph-isaacs approved these changes Feb 10, 2026

0ax1 requested changes Feb 10, 2026

bitunpacking cuda kernels store output into shared memory before copy…

af6471f

…ing to main memory Signed-off-by: Robert Kruszewski <github@robertk.io>

robert3005 force-pushed the rk/fasterbitpack branch from da06683 to af6471f Compare February 10, 2026 15:15

0ax1 approved these changes Feb 10, 2026

more

172b510

Signed-off-by: Robert Kruszewski <github@robertk.io>

better?

d196877

Signed-off-by: Robert Kruszewski <github@robertk.io>

robert3005 merged commit b6e49d4 into develop Feb 10, 2026
73 of 115 checks passed

robert3005 deleted the rk/fasterbitpack branch February 10, 2026 17:36

robert3005 merged 3 commits into develop from rk/fasterbitpack
