bitunpacking cuda kernels store output into shared memory before copying to main memory #6384
robert3005 merged 3 commits into develop
Got a benchmark which outlines the perf diff? 🙂
Keep in mind that the prev version prob already used the implicit L1 cache.
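For anyone unfamiliar with the pattern named in the PR title, here is a minimal sketch of staging unpacked values in shared memory and then copying them out to global memory. This is not the kernel from this PR: the names `unpack_staged` and `run_unpack_check`, the 1024-value chunk, the 32-thread block, and the LSB-first packing layout are all assumptions made for illustration.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

constexpr int CHUNK = 1024;                  // values per thread block (assumed)
constexpr int THREADS = 32;                  // threads per block (assumed)
constexpr int PER_THREAD = CHUNK / THREADS;  // values unpacked by each thread

// Hypothetical kernel: unpacks `bit_width`-bit values (1..32) that are packed
// LSB-first into 32-bit words, one CHUNK of values per thread block.
__global__ void unpack_staged(const uint32_t* packed, uint32_t* out, int bit_width) {
    __shared__ uint32_t staged[CHUNK];
    const uint32_t* in = packed + (size_t)blockIdx.x * (CHUNK * bit_width / 32);
    const uint32_t mask = bit_width == 32 ? 0xFFFFFFFFu : ((1u << bit_width) - 1u);

    // Step 1: each thread unpacks its own contiguous run into shared memory.
    // Note: this naive indexing (stride of 32 words between threads) causes
    // shared-memory bank conflicts; a real kernel would pick a better layout.
    for (int i = 0; i < PER_THREAD; ++i) {
        int idx = threadIdx.x * PER_THREAD + i;
        size_t bit = (size_t)idx * bit_width;
        size_t word = bit / 32;
        int shift = (int)(bit % 32);
        uint64_t window = in[word];
        if (shift + bit_width > 32) {
            window |= (uint64_t)in[word + 1] << 32;  // value straddles two words
        }
        staged[idx] = (uint32_t)(window >> shift) & mask;
    }
    __syncthreads();  // all values must be staged before the copy-out

    // Step 2: copy to global memory with consecutive threads writing
    // consecutive addresses.
    for (int i = threadIdx.x; i < CHUNK; i += blockDim.x) {
        out[(size_t)blockIdx.x * CHUNK + i] = staged[i];
    }
}

// Host-side check: pack pseudo-random values, unpack on the GPU, compare.
bool run_unpack_check(int bit_width, int num_chunks = 4) {
    const size_t n = (size_t)num_chunks * CHUNK;
    const size_t words = n * bit_width / 32;
    const uint32_t mask = bit_width == 32 ? 0xFFFFFFFFu : ((1u << bit_width) - 1u);
    std::vector<uint32_t> values(n), packed(words, 0), result(n);
    for (size_t i = 0; i < n; ++i) {
        values[i] = (uint32_t)(i * 2654435761u) & mask;
        size_t bit = i * bit_width;
        uint64_t v = (uint64_t)values[i] << (bit % 32);
        packed[bit / 32] |= (uint32_t)v;
        if (bit % 32 + bit_width > 32) packed[bit / 32 + 1] |= (uint32_t)(v >> 32);
    }
    uint32_t* d_in = nullptr;
    uint32_t* d_out = nullptr;
    cudaMalloc(&d_in, words * sizeof(uint32_t));
    cudaMalloc(&d_out, n * sizeof(uint32_t));
    cudaMemcpy(d_in, packed.data(), words * sizeof(uint32_t), cudaMemcpyHostToDevice);
    unpack_staged<<<num_chunks, THREADS>>>(d_in, d_out, bit_width);
    cudaError_t err = cudaMemcpy(result.data(), d_out, n * sizeof(uint32_t),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess && result == values;
}
```

Calling `run_unpack_check(3)` from a `main` compiled with `nvcc` exercises the round trip. Whether the staged copy is actually faster than writing straight to global memory depends on the real kernel's access pattern, which is why the benchmark requested in this thread matters.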
0ax1 left a comment
Really wanna see that bench first.
…ing to main memory Signed-off-by: Robert Kruszewski <github@robertk.io>
da06683 to af6471f
This is rebased on develop that has the benchmark. How do I run it?
before: after:
Merging this PR will degrade performance by 12.9%
Signed-off-by: Robert Kruszewski <github@robertk.io>