Fix contiguous_split performance #13342

Merged: 11 commits, May 12, 2023

ttnghia
Contributor

@ttnghia ttnghia commented May 11, 2023

This fixes a performance issue in contiguous_split caused by an inefficient implementation of pack_metadata. In particular, the output bytes are copied from the internal buffer to the output buffer one byte at a time, through std::back_inserter:

std::copy(metadata_begin,
  metadata_begin + (metadata.size() * sizeof(detail::serialized_column)),
  std::back_inserter(metadata_bytes));

The compiler probably used to optimize this copy somehow, but recent refactors changed the code and probably prevent that optimization now.
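For illustration, the difference between the byte-by-byte copy and a bulk copy can be sketched as below. This is a minimal standalone sketch, not the code from this PR: `serialized_column_sketch`, `pack_bytewise`, and `pack_bulk` are hypothetical names, the struct only stands in for detail::serialized_column, and the range `insert` is one possible bulk alternative rather than necessarily the change made here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for detail::serialized_column (the real struct has other fields).
struct serialized_column_sketch {
  std::int32_t type_id;
  std::int32_t size;
  std::int64_t data_offset;
};

// Byte-by-byte path: std::back_inserter calls push_back once per byte,
// so the vector may reallocate repeatedly while it grows.
std::vector<std::uint8_t> pack_bytewise(std::vector<serialized_column_sketch> const& metadata)
{
  std::vector<std::uint8_t> out;
  auto const* begin = reinterpret_cast<std::uint8_t const*>(metadata.data());
  std::copy(begin,
            begin + metadata.size() * sizeof(serialized_column_sketch),
            std::back_inserter(out));
  return out;
}

// Bulk path (assumed alternative): the total size is known up front,
// so reserve once and append the whole byte range in a single call.
std::vector<std::uint8_t> pack_bulk(std::vector<serialized_column_sketch> const& metadata)
{
  std::vector<std::uint8_t> out;
  auto const* begin          = reinterpret_cast<std::uint8_t const*>(metadata.data());
  std::size_t const num_bytes = metadata.size() * sizeof(serialized_column_sketch);
  out.reserve(num_bytes);
  out.insert(out.end(), begin, begin + num_bytes);
  assert(out.size() == num_bytes);
  return out;
}
```

Both functions produce the same bytes; only the number of container operations differs, which matters when the metadata is packed once per split.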

Benchmark

Latest cudf commit:

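----------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------------------------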
ContiguousSplit/6Gb512ColsNoValidity/6442450944/512/256/0/iterations:8/manual_time              46.1 ms         46.1 ms            8 bytes_per_second=260.086G/s
ContiguousSplit/6Gb512ColsValidity/6442450944/512/256/1/iterations:8/manual_time                48.1 ms         48.0 ms            8 bytes_per_second=257.527G/s
ContiguousSplit/6Gb10ColsNoValidity/6442450944/10/256/0/iterations:8/manual_time                27.4 ms         27.4 ms            8 bytes_per_second=438.188G/s
ContiguousSplit/6Gb10ColsValidity/6442450944/10/256/1/iterations:8/manual_time                  28.5 ms         28.5 ms            8 bytes_per_second=434.381G/s
ContiguousSplit/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time              34.5 ms         34.5 ms            8 bytes_per_second=231.825G/s
ContiguousSplit/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time                37.4 ms         37.4 ms            8 bytes_per_second=220.521G/s
ContiguousSplit/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time                18.9 ms         18.9 ms            8 bytes_per_second=422.259G/s
ContiguousSplit/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time                  19.4 ms         19.4 ms            8 bytes_per_second=424.595G/s
ContiguousSplit/4Gb4ColsNoSplits/1073741824/4/0/1/iterations:8/manual_time                      4.35 ms         4.35 ms            8 bytes_per_second=474.47G/s
ContiguousSplit/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time              4.35 ms         4.36 ms            8 bytes_per_second=473.665G/s
ContiguousSplit/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time              22.2 ms         22.2 ms            8 bytes_per_second=90.1502G/s
ContiguousSplit/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time                25.1 ms         25.1 ms            8 bytes_per_second=82.1379G/s
ContiguousSplit/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time                5.08 ms         5.08 ms            8 bytes_per_second=393.98G/s
ContiguousSplit/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time                  5.28 ms         5.28 ms            8 bytes_per_second=390.85G/s
ContiguousSplit/1Gb1ColNoSplits/1073741824/1/0/1/iterations:8/manual_time                       4.34 ms         4.35 ms            8 bytes_per_second=474.715G/s
ContiguousSplit/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time               4.47 ms         4.47 ms            8 bytes_per_second=461.788G/s
ContiguousSplitStrings/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time       98.1 ms         98.0 ms            8 bytes_per_second=81.6345G/s
ContiguousSplitStrings/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time         89.5 ms         89.5 ms            8 bytes_per_second=90.843G/s
ContiguousSplitStrings/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time         28.9 ms         29.9 ms            8 bytes_per_second=290.261G/s
ContiguousSplitStrings/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time           20.4 ms         20.4 ms            8 bytes_per_second=417.033G/s
ContiguousSplitStrings/4Gb4ColsNoSplits/1073741824/4/0/0/iterations:8/manual_time               6.70 ms         7.32 ms            8 bytes_per_second=335.9G/s
ContiguousSplitStrings/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time       4.35 ms         4.36 ms            8 bytes_per_second=524.386G/s
ContiguousSplitStrings/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time       77.8 ms         77.8 ms            8 bytes_per_second=25.7184G/s
ContiguousSplitStrings/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time         79.2 ms         79.1 ms            8 bytes_per_second=25.6833G/s
ContiguousSplitStrings/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time         8.57 ms         8.81 ms            8 bytes_per_second=245.062G/s
ContiguousSplitStrings/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time           7.83 ms         6.15 ms            8 bytes_per_second=272.089G/s
ContiguousSplitStrings/1Gb1ColNoSplits/1073741824/1/0/0/iterations:8/manual_time                6.66 ms         9.17 ms            8 bytes_per_second=450.551G/s
ContiguousSplitStrings/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time        4.41 ms         4.41 ms            8 bytes_per_second=687.88G/s

With this fix:

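----------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------------------------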
ContiguousSplit/6Gb512ColsNoValidity/6442450944/512/256/0/iterations:8/manual_time              38.5 ms         38.4 ms            8 bytes_per_second=311.981G/s
ContiguousSplit/6Gb512ColsValidity/6442450944/512/256/1/iterations:8/manual_time                42.8 ms         42.7 ms            8 bytes_per_second=289.289G/s
ContiguousSplit/6Gb10ColsNoValidity/6442450944/10/256/0/iterations:8/manual_time                27.6 ms         27.5 ms            8 bytes_per_second=435.365G/s
ContiguousSplit/6Gb10ColsValidity/6442450944/10/256/1/iterations:8/manual_time                  28.4 ms         28.3 ms            8 bytes_per_second=436.145G/s
ContiguousSplit/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time              27.2 ms         27.2 ms            8 bytes_per_second=293.677G/s
ContiguousSplit/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time                29.9 ms         29.9 ms            8 bytes_per_second=276.137G/s
ContiguousSplit/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time                19.0 ms         19.0 ms            8 bytes_per_second=421.185G/s
ContiguousSplit/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time                  19.1 ms         19.1 ms            8 bytes_per_second=431.306G/s
ContiguousSplit/4Gb4ColsNoSplits/1073741824/4/0/1/iterations:8/manual_time                      4.35 ms         4.35 ms            8 bytes_per_second=474.311G/s
ContiguousSplit/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time              4.34 ms         4.35 ms            8 bytes_per_second=475.281G/s
ContiguousSplit/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time              14.6 ms         14.6 ms            8 bytes_per_second=137.131G/s
ContiguousSplit/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time                17.2 ms         17.2 ms            8 bytes_per_second=119.946G/s
ContiguousSplit/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time                4.89 ms         4.89 ms            8 bytes_per_second=409.281G/s
ContiguousSplit/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time                  5.09 ms         5.10 ms            8 bytes_per_second=404.981G/s
ContiguousSplit/1Gb1ColNoSplits/1073741824/1/0/1/iterations:8/manual_time                       4.40 ms         4.41 ms            8 bytes_per_second=469.011G/s
ContiguousSplit/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time               4.40 ms         4.41 ms            8 bytes_per_second=468.577G/s
ContiguousSplitStrings/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time       76.0 ms         75.9 ms            8 bytes_per_second=105.396G/s
ContiguousSplitStrings/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time         70.6 ms         70.5 ms            8 bytes_per_second=115.205G/s
ContiguousSplitStrings/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time         28.6 ms         29.6 ms            8 bytes_per_second=293.253G/s
ContiguousSplitStrings/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time           19.0 ms         19.0 ms            8 bytes_per_second=448.676G/s
ContiguousSplitStrings/4Gb4ColsNoSplits/1073741824/4/0/0/iterations:8/manual_time               6.69 ms         7.32 ms            8 bytes_per_second=336.342G/s
ContiguousSplitStrings/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time       4.40 ms         4.39 ms            8 bytes_per_second=518.755G/s
ContiguousSplitStrings/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time       55.4 ms         55.4 ms            8 bytes_per_second=36.1167G/s
ContiguousSplitStrings/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time         57.0 ms         56.9 ms            8 bytes_per_second=35.6588G/s
ContiguousSplitStrings/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time         8.48 ms         8.73 ms            8 bytes_per_second=247.664G/s
ContiguousSplitStrings/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time           5.99 ms         6.00 ms            8 bytes_per_second=355.742G/s
ContiguousSplitStrings/1Gb1ColNoSplits/1073741824/1/0/0/iterations:8/manual_time                6.69 ms         9.30 ms            8 bytes_per_second=448.359G/s
ContiguousSplitStrings/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time        4.33 ms         4.33 ms            8 bytes_per_second=700.639G/s
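
To compare the two runs, the speedup per benchmark is simply the old time divided by the new time. The sketch below does that for a few rows copied from the tables above; `timing`, `speedup`, `selected_rows`, and `print_speedups` are hypothetical helper names, not part of the benchmark suite. From the manual_time column, the 512-column cases improve by roughly 1.1x to 1.5x, while most 10-column and no-split cases stay within a few percent.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark row: manual_time in ms before and after the fix.
struct timing {
  char const* name;
  double before_ms;
  double after_ms;
};

// Speedup factor; a value above 1 means the run with the fix is faster.
constexpr double speedup(timing const& t) { return t.before_ms / t.after_ms; }

// A few rows copied from the two benchmark tables.
constexpr timing selected_rows[] = {
  {"ContiguousSplit/1Gb512ColsNoValidity", 22.2, 14.6},
  {"ContiguousSplit/4Gb512ColsNoValidity", 34.5, 27.2},
  {"ContiguousSplit/6Gb512ColsNoValidity", 46.1, 38.5},
  {"ContiguousSplitStrings/1Gb512ColsNoValidity", 77.8, 55.4},
  {"ContiguousSplit/1Gb10ColsNoValidity", 5.08, 4.89},
  {"ContiguousSplit/4Gb10ColsNoValidity", 18.9, 19.0},
};

// Print one line per selected row with its speedup factor.
void print_speedups()
{
  for (auto const& row : selected_rows) {
    assert(row.after_ms > 0.0);
    std::printf("%-45s %5.2fx\n", row.name, speedup(row));
  }
}
```

Calling `print_speedups()` from a `main` prints one ratio per row; more rows can be added to `selected_rows` in the same way.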

@ttnghia added the labels 3 - Ready for Review, libcudf, Performance, Spark, improvement, and non-breaking on May 11, 2023
@ttnghia requested a review from abellina on May 11, 2023 21:25
@ttnghia requested a review from a team as a code owner on May 11, 2023 21:25
@ttnghia self-assigned this on May 11, 2023
Contributor

@abellina abellina left a comment

Thanks for looking at this @ttnghia!

I am not sure how the compiler would have optimized out std::copy, but this is better no matter what.

@GregoryKimball
Contributor

Wow! Thank you @ttnghia for picking this up!

@ttnghia
Contributor Author

ttnghia commented May 12, 2023

/merge

@rapids-bot (bot) merged commit 7575e8d into rapidsai:branch-23.06 on May 12, 2023
51 checks passed
@ttnghia deleted the fix_contiguous_split branch on May 16, 2023 21:23