Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstracting block reduce and block scan from cuIO kernels with cub apis #7278

Merged
merged 20 commits into from
Feb 12, 2021

Conversation

rgsl888prabhu
Copy link
Contributor

@rgsl888prabhu rgsl888prabhu commented Feb 2, 2021

closes #6238

This PR replaces existing usages of warp_reduce or warp_scans which were used for block reduction/scan with cub::BlockReduce/cub::BlockScan.

The changes has positive effect on mostly on numerical data processing, but seems to be little slower in case of string type.
all files.zip

Update: Graphs have been updated after fixing a bug which also resolved several other performance issues.

Perf plots

Benchmark Performance

y-axis is in ms and there are three sets of plot, one which compares mean performance change next to each other, which also has error bars which is standard deviation calculated using five sets of benchmarks. Next one is difference of performance between cub::block_reduce/cub::block_scan with generic approach and the last one is percentage change in performance compared to branch-0.18. If the value is positive, then test is taking less time, else it is taking more time compared to main branch.

CSV READER
CSV_READER_comp_0_9
CSV_READER_comp_10_19
CSV_READER_diff_0_39
CSV_READER_per_0_39
ORC_WRITER_comp_40_49

ORC READER
ORC_READER_comp_0_9
ORC_READER_comp_10_19
ORC_READER_comp_20_29
ORC_READER_comp_30_39
ORC_READER_comp_40_49
ORC_READER_comp_50_59
ORC_READER_comp_60_69
ORC_READER_comp_70_79
ORC_READER_diff_0_39
ORC_READER_diff_40_79
ORC_READER_per_0_39
ORC_READER_per_40_79

ORC WRITER
ORC_WRITER_comp_0_9
ORC_WRITER_comp_10_19
ORC_WRITER_comp_20_29
ORC_WRITER_comp_30_39
ORC_WRITER_comp_40_49
ORC_WRITER_comp_50_59
ORC_WRITER_comp_60_69
ORC_WRITER_comp_70_79
ORC_WRITER_comp_80_89
ORC_WRITER_comp_90_99
ORC_WRITER_diff_0_39
ORC_WRITER_diff_40_79
ORC_WRITER_diff_80_119
ORC_WRITER_per_0_39
ORC_WRITER_per_40_79
ORC_WRITER_per_80_119

PARQUET CHUNKED WRITER
PQ_CHUNK_WRITER_comp_0_9
PQ_CHUNK_WRITER_diff_0_39
PQ_CHUNK_WRITER_per_0_39

PARQUET READER
PQ_READER_comp_0_9
PQ_READER_comp_10_19
PQ_READER_comp_20_29
PQ_READER_comp_30_39
PQ_READER_comp_40_49
PQ_READER_comp_50_59
PQ_READER_comp_60_69
PQ_READER_comp_70_79
PQ_READER_comp_80_89
PQ_READER_comp_90_99
PQ_READER_diff_0_39
PQ_READER_diff_40_79
PQ_READER_diff_80_119
PQ_READER_per_0_39
PQ_READER_per_40_79
PQ_READER_per_80_119

PARQUET WRITER
PQ_WRITER_comp_0_9
PQ_WRITER_comp_10_19
PQ_WRITER_comp_20_29
PQ_WRITER_comp_30_39
PQ_WRITER_comp_40_49
PQ_WRITER_comp_50_59
PQ_WRITER_comp_60_69
PQ_WRITER_comp_70_79
PQ_WRITER_comp_80_89
PQ_WRITER_comp_90_99
PQ_WRITER_comp_100_109
PQ_WRITER_comp_110_119
PQ_WRITER_comp_120_129
PQ_WRITER_comp_130_139
PQ_WRITER_diff_0_39
PQ_WRITER_diff_40_79
PQ_WRITER_diff_80_119
PQ_WRITER_diff_120_159
PQ_WRITER_per_0_39
PQ_WRITER_per_40_79
PQ_WRITER_per_80_119
PQ_WRITER_per_120_159

@rgsl888prabhu rgsl888prabhu added feature request New feature or request 2 - In Progress Currently a work in progress 0 - Waiting on Author Waiting for author to respond to review cuIO cuIO issue 4 - Needs cuIO Reviewer breaking Breaking change labels Feb 2, 2021
@rgsl888prabhu rgsl888prabhu self-assigned this Feb 2, 2021
@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner February 2, 2021 01:10
@harrism harrism changed the title 6238 block reduce Replace warp reduce/scan with CUB BlockReduce/BlockScan. Feb 2, 2021
@rgsl888prabhu rgsl888prabhu changed the title Replace warp reduce/scan with CUB BlockReduce/BlockScan. Abstracting block reduce and block scan from cuIO kernels with cub apis Feb 2, 2021
@harrism
Copy link
Member

harrism commented Feb 2, 2021

Please always label graphs with units. Is higher better or is lower better?

@rgsl888prabhu
Copy link
Contributor Author

Please always label graphs with units. Is higher better or is lower better?

Lower is better

@rgsl888prabhu
Copy link
Contributor Author

Please always label graphs with units. Is higher better or is lower better?

And if you are not seeing axes. please change your github theme from black to white just for this PR, my apologies.

@vuule
Copy link
Contributor

vuule commented Feb 2, 2021

Are these numbers from the benchmarks?
As a side note - would be better to have the difference graph as percentages, but this is also useful.

@harrism
Copy link
Member

harrism commented Feb 2, 2021

"Change in perf" is percentage? Multiple? Number of seconds?

@rgsl888prabhu
Copy link
Contributor Author

"Change in perf" is percentage? Multiple? Number of seconds?

in ms

@harrism
Copy link
Member

harrism commented Feb 2, 2021

"Change in perf" is percentage? Multiple? Number of seconds?

in ms

OK, an absolute difference is not very helpful. Relative performance (speedup/slowdown) would be helpful.

@rgsl888prabhu
Copy link
Contributor Author

"Change in perf" is percentage? Multiple? Number of seconds?

in ms

OK, an absolute difference is not very helpful. Relative performance (speedup/slowdown) would be helpful.

If the value is negative, then we are having speed-up, and if it is positive then slowdown.

@harrism
Copy link
Member

harrism commented Feb 2, 2021

"Change in perf" is percentage? Multiple? Number of seconds?

in ms

OK, an absolute difference is not very helpful. Relative performance (speedup/slowdown) would be helpful.

If the value is negative, then we are having speed-up, and if it is positive then slowdown.

Yes, but an absolute change in ms is not helpful unless you know the total time. It's easier to make decisions based on old_time / new_time than on old_time - new_time.

@codecov
Copy link

codecov bot commented Feb 8, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.19@eb8dc88). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff               @@
##             branch-0.19    #7278   +/-   ##
==============================================
  Coverage               ?   82.20%           
==============================================
  Files                  ?      100           
  Lines                  ?    16966           
  Branches               ?        0           
==============================================
  Hits                   ?    13947           
  Misses                 ?     3019           
  Partials               ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb8dc88...6e50e2d. Read the comment docs.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job untangling some of the reductions, looks so much better now!
got a bunch of small questions and suggestions.

cpp/src/io/orc/dict_enc.cu Show resolved Hide resolved
cpp/src/io/orc/dict_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/dict_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/dict_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/dict_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 10, 2021
@devavret
Copy link
Contributor

Even though gpu_init_statistics_buffersize shows 15% decrease, it is being called only once and execution time is 2.752us.

As an initialization kernel, it shouldn't've been a big part of the overall execution anyway and you confirmed it. That's great. So no big slowdowns as a result of this PR.

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues that I can think of. @vuule's review is quite thorough.

cpp/src/io/orc/stats_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely in favor of the latest changes. Now the code does exactly what it needs to do 👍

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer feature request New feature or request breaking Breaking change labels Feb 12, 2021
@vuule
Copy link
Contributor

vuule commented Feb 12, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 7c609d2 into rapidsai:branch-0.19 Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Abstract away block reduce and block scan from cuIO kernels
6 participants