Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
# Conflicts:
#   Cargo.toml
#   vortex-cuda/src/lib.rs
? (block_start + elements_per_block)
: array_len;

// Vectorized loop - process 16 bytes per iteration for better memory throughput.
Did I leave that comment? In any case, the ops here are not vectorized.
Yeah, this isn't true. I thought this would happen on some archs, but CUDA can only do vector ops on loads and stores. It really is only the unrolling doing the trick here.
I'll remove it in the next one.
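For readers following the thread, here is a hypothetical host-side Rust reference for the loop shape under discussion. It is a sketch under stated assumptions, not the PR's kernel. The names `block_start`, `elements_per_block`, and `array_len` come from the diff. The function name, the `UNROLL` constant, the `u8` element type, and the wrapping add of a frame-of-reference value are assumptions made for illustration. The end-of-block clamp is what the ternary in the diff appears to compute. The body works through a fixed-size chunk per iteration, but every element still gets its own scalar add.

```rust
/// Assumed unroll width: 16 one-byte elements per iteration.
const UNROLL: usize = 16;

/// Hypothetical reference for one block of a frame-of-reference (FoR) decode.
/// Adds `reference` to every element in the block owned by `block_idx`.
fn for_decode_block(
    values: &mut [u8],
    reference: u8,
    block_idx: usize,
    elements_per_block: usize,
) {
    let array_len = values.len();
    let block_start = (block_idx * elements_per_block).min(array_len);
    // Clamp the end of the block so a partial last block stays in bounds.
    let block_end = (block_start + elements_per_block).min(array_len);

    let block = &mut values[block_start..block_end];
    let mut chunks = block.chunks_exact_mut(UNROLL);
    for chunk in &mut chunks {
        // Fixed-size body: one scalar add per element, UNROLL adds per iteration.
        for v in chunk.iter_mut() {
            *v = v.wrapping_add(reference);
        }
    }
    // Tail: fewer than UNROLL elements remain in this block.
    for v in chunks.into_remainder() {
        *v = v.wrapping_add(reference);
    }
}
```

Calling `for_decode_block(&mut data, 10, 1, 32)` on a 40-element slice touches only elements 32 through 39, because the second block is clamped to the array length.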
// Launch kernel
let _cuda_events =
    launch_cuda_kernel_impl(&mut launch_builder, CU_EVENT_DISABLE_TIMING, array_len)?;
If we rework the launcher logic, it'd be nice to not record events by default for each launch.
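One possible shape for that rework, as a minimal sketch only: event recording becomes opt-in through a launch options struct. Everything here is hypothetical and is not the vortex-cuda API. `LaunchOptions`, `LaunchEvents`, and `launch_with` are invented names, the kernel is modelled as a plain closure, and a host-side timer stands in for a pair of CUDA events.

```rust
use std::time::{Duration, Instant};

/// Hypothetical launch options. Event recording is off by default.
#[derive(Clone, Copy, Default)]
struct LaunchOptions {
    record_events: bool,
}

/// Stand-in for recorded CUDA events: here just a host-side duration.
struct LaunchEvents {
    elapsed: Duration,
}

/// Runs `kernel` and returns events only when the caller asked for them.
fn launch_with<F: FnOnce()>(opts: LaunchOptions, kernel: F) -> Option<LaunchEvents> {
    if opts.record_events {
        let start = Instant::now();
        kernel();
        Some(LaunchEvents { elapsed: start.elapsed() })
    } else {
        // Default path: no event bookkeeping at all.
        kernel();
        None
    }
}
```

With this shape, a caller that needs timing passes `LaunchOptions { record_events: true }`, and every other call site uses `LaunchOptions::default()` and gets `None` back.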
Merging this PR will degrade performance by 18.15%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | WallTime | u8_FoR[1K] | 14.4 µs | 6.2 µs | ×2.3 |
| ❌ | WallTime | u16_FoR[1M] | 6.1 µs | 7.4 µs | -18.15% |
| ⚡ | Simulation | canonical_into_non_nullable[(10000, 100, 0.01)] | 2.9 ms | 2.1 ms | +37.72% |
| ⚡ | Simulation | canonical_into_non_nullable[(10000, 100, 0.0)] | 2.7 ms | 1.9 ms | +42.32% |
| ⚡ | Simulation | canonical_into_non_nullable[(10000, 100, 0.1)] | 4.5 ms | 3.7 ms | +22.17% |
| ❌ | Simulation | canonical_into_nullable[(10000, 10, 0.0)] | 444.5 µs | 529.1 µs | -15.99% |
| ❌ | Simulation | canonical_into_nullable[(10000, 100, 0.0)] | 4.1 ms | 4.9 ms | -16.51% |
| ⚡ | Simulation | into_canonical_non_nullable[(10000, 100, 0.01)] | 3 ms | 2.2 ms | +36.64% |
| ⚡ | Simulation | into_canonical_non_nullable[(10000, 100, 0.1)] | 4.6 ms | 3.8 ms | +21.44% |
| ⚡ | Simulation | into_canonical_non_nullable[(10000, 100, 0.0)] | 2.7 ms | 1.9 ms | +41.68% |
| ⚡ | Simulation | into_canonical_nullable[(10000, 100, 0.0)] | 5.2 ms | 4.4 ms | +18.47% |
Comparing ji/scalar-gpu (f3a7bf3) with develop (03f0140)
Footnotes
1. 1254 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.