Single-pass multibyte_split #11500

Merged (19 commits) on Aug 30, 2022

Conversation

upsj
Contributor

@upsj upsj commented Aug 9, 2022

Description

This adds a new multibyte_split implementation that needs to scan the input only once, and takes full advantage of byte_range.

To accomplish this, I introduce a new data structure output_builder (naming bikeshedding welcome 😄 ) that pre-allocates storage for outputs of unknown but bounded size.
The structure contains a vector of exponentially growing device_uvectors, such that either the last vector has size 0 and the second-to-last vector has size() < capacity(), or all vectors but the last are full (size() == capacity()). It provides the operations

  • next_output(stream) returns a split_device_span of at least the worst_case_size provided at construction, pointing to the next free entries from the last two vectors.
  • advance_output(actual_size) marks the first actual_size entries of the previously returned split_device_span as filled. split_device_span takes care of writing to the smaller device_uvector first, and the larger device_uvector second, if the first one is full.
  • gather copies all elements that were previously written into a single device_uvector of the correct size.

This data structure should provide a good balance between allocation overheads and memory usage.
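The three operations above can be sketched on the host. This is a minimal illustration and not the PR's implementation: plain heap arrays stand in for device_uvectors, the stream argument is dropped, the element type is fixed to int, and the names chunk, split_span and output_builder_sketch, the initial capacity and the growth factor of 2 are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Host stand-in for one device_uvector: fixed capacity, growing size.
struct chunk {
  std::unique_ptr<int[]> data;
  std::size_t size = 0;
  std::size_t capacity;
  explicit chunk(std::size_t cap) : data(new int[cap]), capacity(cap) {}
};

// Writable view over the free part of the head chunk, followed by the tail chunk.
struct split_span {
  int* head;
  std::size_t head_size;
  int* tail;
  std::size_t tail_size;
  std::size_t size() const { return head_size + tail_size; }
  // Indexing fills the (smaller) head first and spills into the tail.
  int& operator[](std::size_t i) { return i < head_size ? head[i] : tail[i - head_size]; }
};

class output_builder_sketch {
 public:
  explicit output_builder_sketch(std::size_t max_write_size) : max_write_(max_write_size)
  {
    assert(max_write_size > 0);
    chunks_.emplace_back(2 * max_write_size);  // initial capacity: arbitrary choice
  }

  // Returns at least max_write_size writable entries from the last two chunks.
  split_span next_output()
  {
    std::size_t const head_idx  = head_index();
    std::size_t const head_free = chunks_[head_idx].capacity - chunks_[head_idx].size;
    bool has_tail               = head_idx + 1 < chunks_.size();
    if (!has_tail && head_free < max_write_) {
      chunks_.emplace_back(2 * chunks_.back().capacity);  // exponential growth
      has_tail = true;
    }
    chunk& head         = chunks_[head_idx];
    int* const head_ptr = head.data.get() + head.size;
    if (!has_tail) { return {head_ptr, head_free, nullptr, 0}; }
    return {head_ptr, head_free, chunks_.back().data.get(), chunks_.back().capacity};
  }

  // Marks the first actual_size entries of the last returned span as filled.
  void advance_output(std::size_t actual_size)
  {
    std::size_t const head_idx = head_index();
    chunk& head                = chunks_[head_idx];
    std::size_t const to_head  = std::min(actual_size, head.capacity - head.size);
    head.size += to_head;
    if (actual_size > to_head) {
      assert(head_idx + 1 < chunks_.size());
      chunks_[head_idx + 1].size += actual_size - to_head;
    }
  }

  // Copies all written elements into one contiguous vector of the exact size.
  std::vector<int> gather() const
  {
    std::vector<int> out;
    for (auto const& c : chunks_) { out.insert(out.end(), c.data.get(), c.data.get() + c.size); }
    return out;
  }

 private:
  // The head is the second-to-last chunk if the last one is still empty.
  std::size_t head_index() const
  {
    std::size_t const n = chunks_.size();
    return (n > 1 && chunks_[n - 1].size == 0) ? n - 2 : n - 1;
  }

  std::size_t max_write_;
  std::vector<chunk> chunks_;
};
```

A caller would loop over next_output, write at most max_write_size entries through the span, call advance_output with the number actually written, and call gather once at the end.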

I only modified the actual multibyte_split kernel slightly to stop writing offsets once it passes the end of the byte_range. This way, we can determine all required offsets from a single scan, regardless of whether a byte range is provided.
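As a rough host-side illustration of the early stop, the sketch below records the position just past each delimiter and stops after the first one at or beyond the end of the range. The function name delimiter_offsets_until, the choice to start scanning at range_begin, and the non-overlapping matching are assumptions for the example; the kernel's exact boundary handling may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the positions just past each delimiter found
// at or after range_begin. Scanning stops once a recorded position reaches
// range_end, so the rest of the input is never touched.
std::vector<std::size_t> delimiter_offsets_until(std::string const& input,
                                                 std::string const& delim,
                                                 std::size_t range_begin,
                                                 std::size_t range_end)
{
  assert(!delim.empty());
  std::vector<std::size_t> offsets;
  std::size_t pos = input.find(delim, range_begin);
  while (pos != std::string::npos) {
    std::size_t const row_start = pos + delim.size();
    offsets.push_back(row_start);
    if (row_start >= range_end) { break; }  // first offset past the range: stop
    pos = input.find(delim, row_start);     // non-overlapping matches only
  }
  return offsets;
}
```

The last recorded offset is what closes the final row that starts inside the range, which is why one offset beyond the range end is kept.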

Benchmark results:

multibyte_split

[0] Tesla T4

source type | delim size | delim percent | size approx | byte_range percent | Ref Time | Cmp Time | Diff | %Diff
device 1 1 2^15 1 501.373 us 255.010 us -246.363 us -49.14%
file 1 1 2^15 1 10.973 ms 3.740 ms -7233.161 us -65.91%
host paged 1 1 2^15 1 514.387 us 258.478 us -255.909 us -49.75%
host pinned 1 1 2^15 1 523.335 us 273.483 us -249.852 us -47.74%
device 4 1 2^15 1 502.462 us 247.451 us -255.011 us -50.75%
file 4 1 2^15 1 9.269 ms 3.717 ms -5551.573 us -59.90%
host paged 4 1 2^15 1 517.398 us 251.635 us -265.763 us -51.37%
host pinned 4 1 2^15 1 527.321 us 266.926 us -260.395 us -49.38%
device 7 1 2^15 1 511.079 us 269.646 us -241.433 us -47.24%
file 7 1 2^15 1 10.103 ms 3.719 ms -6383.437 us -63.18%
host paged 7 1 2^15 1 530.471 us 275.805 us -254.666 us -48.01%
host pinned 7 1 2^15 1 539.715 us 284.943 us -254.771 us -47.20%
device 1 25 2^15 1 522.656 us 267.467 us -255.189 us -48.83%
file 1 25 2^15 1 10.106 ms 3.715 ms -6390.694 us -63.24%
host paged 1 25 2^15 1 536.277 us 273.775 us -262.502 us -48.95%
host pinned 1 25 2^15 1 549.473 us 281.362 us -268.111 us -48.79%
device 4 25 2^15 1 571.825 us 300.687 us -271.138 us -47.42%
file 4 25 2^15 1 10.103 ms 3.714 ms -6388.482 us -63.23%
host paged 4 25 2^15 1 588.115 us 303.107 us -285.008 us -48.46%
host pinned 4 25 2^15 1 600.452 us 314.675 us -285.777 us -47.59%
device 7 25 2^15 1 576.380 us 301.141 us -275.240 us -47.75%
file 7 25 2^15 1 10.019 ms 3.705 ms -6313.942 us -63.02%
host paged 7 25 2^15 1 590.588 us 307.916 us -282.673 us -47.86%
host pinned 7 25 2^15 1 605.426 us 315.630 us -289.795 us -47.87%
device 1 1 2^30 1 759.165 ms 6.113 ms -753052.285 us -99.19%
file 1 1 2^30 1 900.476 ms 134.544 ms -765932.007 us -85.06%
host paged 1 1 2^30 1 766.142 ms 7.020 ms -759122.773 us -99.08%
host pinned 1 1 2^30 1 813.716 ms 6.483 ms -807232.765 us -99.20%
device 4 1 2^30 1 773.977 ms 6.180 ms -767797.237 us -99.20%
file 4 1 2^30 1 933.311 ms 133.473 ms -799837.803 us -85.70%
host paged 4 1 2^30 1 778.010 ms 7.066 ms -770943.560 us -99.09%
host pinned 4 1 2^30 1 785.550 ms 6.545 ms -779004.832 us -99.17%
device 7 1 2^30 1 776.541 ms 6.212 ms -770328.726 us -99.20%
file 7 1 2^30 1 926.038 ms 130.654 ms -795384.376 us -85.89%
host paged 7 1 2^30 1 929.224 ms 7.113 ms -922110.964 us -99.23%
host pinned 7 1 2^30 1 808.404 ms 6.581 ms -801823.263 us -99.19%
device 1 25 2^30 1 649.553 ms 4.856 ms -644696.331 us -99.25%
file 1 25 2^30 1 754.855 ms 100.332 ms -654522.457 us -86.71%
host paged 1 25 2^30 1 797.427 ms 5.755 ms -791672.232 us -99.28%
host pinned 1 25 2^30 1 694.769 ms 5.235 ms -689534.397 us -99.25%
device 4 25 2^30 1 803.722 ms 6.831 ms -796891.259 us -99.15%
file 4 25 2^30 1 1.089 s 122.617 ms -965958.117 us -88.74%
host paged 4 25 2^30 1 809.025 ms 7.731 ms -801293.659 us -99.04%
host pinned 4 25 2^30 1 863.132 ms 7.206 ms -855926.593 us -99.17%
device 7 25 2^30 1 822.436 ms 6.828 ms -815608.102 us -99.17%
file 7 25 2^30 1 953.459 ms 125.146 ms -828312.756 us -86.87%
host paged 7 25 2^30 1 878.591 ms 7.729 ms -870862.213 us -99.12%
host pinned 7 25 2^30 1 853.347 ms 7.202 ms -846145.222 us -99.16%
device 1 1 2^15 5 494.135 us 267.240 us -226.896 us -45.92%
file 1 1 2^15 5 8.396 ms 3.790 ms -4605.974 us -54.86%
host paged 1 1 2^15 5 511.952 us 267.649 us -244.304 us -47.72%
host pinned 1 1 2^15 5 523.247 us 281.380 us -241.868 us -46.22%
device 4 1 2^15 5 513.061 us 277.739 us -235.322 us -45.87%
file 4 1 2^15 5 8.393 ms 3.745 ms -4647.934 us -55.38%
host paged 4 1 2^15 5 532.465 us 283.033 us -249.433 us -46.84%
host pinned 4 1 2^15 5 541.863 us 292.102 us -249.761 us -46.09%
device 7 1 2^15 5 536.721 us 282.465 us -254.256 us -47.37%
file 7 1 2^15 5 8.374 ms 3.770 ms -4603.511 us -54.97%
host paged 7 1 2^15 5 554.098 us 285.165 us -268.933 us -48.54%
host pinned 7 1 2^15 5 566.133 us 298.437 us -267.696 us -47.29%
device 1 25 2^15 5 523.197 us 274.622 us -248.575 us -47.51%
file 1 25 2^15 5 8.383 ms 3.721 ms -4661.680 us -55.61%
host paged 1 25 2^15 5 538.202 us 274.924 us -263.278 us -48.92%
host pinned 1 25 2^15 5 552.161 us 288.881 us -263.280 us -47.68%
device 4 25 2^15 5 571.249 us 300.540 us -270.709 us -47.39%
file 4 25 2^15 5 8.365 ms 3.729 ms -4636.104 us -55.42%
host paged 4 25 2^15 5 587.178 us 302.430 us -284.748 us -48.49%
host pinned 4 25 2^15 5 600.085 us 314.728 us -285.357 us -47.55%
device 7 25 2^15 5 576.040 us 301.207 us -274.833 us -47.71%
file 7 25 2^15 5 8.557 ms 3.728 ms -4828.929 us -56.43%
host paged 7 25 2^15 5 605.849 us 306.993 us -298.856 us -49.33%
host pinned 7 25 2^15 5 605.730 us 315.480 us -290.250 us -47.92%
device 1 1 2^30 5 760.295 ms 26.647 ms -733647.234 us -96.50%
file 1 1 2^30 5 900.988 ms 144.221 ms -756766.941 us -83.99%
host paged 1 1 2^30 5 861.858 ms 27.640 ms -834218.288 us -96.79%
host pinned 1 1 2^30 5 814.384 ms 27.071 ms -787313.049 us -96.68%
device 4 1 2^30 5 774.991 ms 26.829 ms -748162.355 us -96.54%
file 4 1 2^30 5 939.380 ms 145.972 ms -793408.072 us -84.46%
host paged 4 1 2^30 5 824.653 ms 27.778 ms -796875.210 us -96.63%
host pinned 4 1 2^30 5 780.742 ms 27.220 ms -753521.950 us -96.51%
device 7 1 2^30 5 777.662 ms 26.946 ms -750715.400 us -96.53%
file 7 1 2^30 5 1.062 s 146.987 ms -915037.565 us -86.16%
host paged 7 1 2^30 5 792.441 ms 27.890 ms -764551.527 us -96.48%
host pinned 7 1 2^30 5 844.303 ms 27.352 ms -816951.184 us -96.76%
device 1 25 2^30 5 650.803 ms 24.079 ms -626723.464 us -96.30%
file 1 25 2^30 5 819.158 ms 113.911 ms -705247.805 us -86.09%
host paged 1 25 2^30 5 892.323 ms 25.012 ms -867311.636 us -97.20%
host pinned 1 25 2^30 5 705.445 ms 24.465 ms -680980.519 us -96.53%
device 4 25 2^30 5 804.875 ms 27.849 ms -777026.452 us -96.54%
file 4 25 2^30 5 939.873 ms 139.019 ms -800853.228 us -85.21%
host paged 4 25 2^30 5 818.237 ms 28.763 ms -789474.346 us -96.48%
host pinned 4 25 2^30 5 982.444 ms 28.217 ms -954226.125 us -97.13%
device 7 25 2^30 5 823.389 ms 29.782 ms -793606.771 us -96.38%
file 7 25 2^30 5 1.009 s 146.087 ms -862689.868 us -85.52%
host paged 7 25 2^30 5 1.064 s 30.730 ms -1033256.695 us -97.11%
host pinned 7 25 2^30 5 877.682 ms 30.180 ms -847501.645 us -96.56%
device 1 1 2^15 25 495.226 us 260.457 us -234.769 us -47.41%
file 1 1 2^15 25 8.485 ms 3.781 ms -4703.671 us -55.43%
host paged 1 1 2^15 25 512.784 us 267.446 us -245.337 us -47.84%
host pinned 1 1 2^15 25 524.160 us 273.471 us -250.689 us -47.83%
device 4 1 2^15 25 513.812 us 268.105 us -245.707 us -47.82%
file 4 1 2^15 25 8.494 ms 3.779 ms -4715.096 us -55.51%
host paged 4 1 2^15 25 534.551 us 274.615 us -259.936 us -48.63%
host pinned 4 1 2^15 25 542.672 us 282.072 us -260.600 us -48.02%
device 7 1 2^15 25 535.988 us 277.391 us -258.597 us -48.25%
file 7 1 2^15 25 8.344 ms 3.725 ms -4619.012 us -55.36%
host paged 7 1 2^15 25 554.427 us 284.250 us -270.176 us -48.73%
host pinned 7 1 2^15 25 567.662 us 295.147 us -272.515 us -48.01%
device 1 25 2^15 25 523.935 us 275.170 us -248.765 us -47.48%
file 1 25 2^15 25 8.333 ms 3.734 ms -4599.214 us -55.19%
host paged 1 25 2^15 25 539.151 us 277.941 us -261.210 us -48.45%
host pinned 1 25 2^15 25 553.701 us 289.672 us -264.028 us -47.68%
device 4 25 2^15 25 572.250 us 300.713 us -271.537 us -47.45%
file 4 25 2^15 25 8.467 ms 3.732 ms -4734.345 us -55.92%
host paged 4 25 2^15 25 589.709 us 306.826 us -282.883 us -47.97%
host pinned 4 25 2^15 25 602.942 us 314.485 us -288.457 us -47.84%
device 7 25 2^15 25 577.125 us 301.039 us -276.086 us -47.84%
file 7 25 2^15 25 8.477 ms 3.766 ms -4711.039 us -55.57%
host paged 7 25 2^15 25 593.513 us 307.753 us -285.760 us -48.15%
host pinned 7 25 2^15 25 606.763 us 315.183 us -291.580 us -48.06%
device 1 1 2^30 25 765.578 ms 129.396 ms -636182.115 us -83.10%
file 1 1 2^30 25 929.694 ms 221.490 ms -708204.682 us -76.18%
host paged 1 1 2^30 25 954.516 ms 130.558 ms -823957.460 us -86.32%
host pinned 1 1 2^30 25 966.739 ms 129.951 ms -836787.576 us -86.56%
device 4 1 2^30 25 780.141 ms 132.059 ms -648082.202 us -83.07%
file 4 1 2^30 25 955.247 ms 224.799 ms -730447.507 us -76.47%
host paged 4 1 2^30 25 906.541 ms 133.090 ms -773450.918 us -85.32%
host pinned 4 1 2^30 25 803.184 ms 132.560 ms -670623.617 us -83.50%
device 7 1 2^30 25 782.807 ms 132.482 ms -650324.997 us -83.08%
file 7 1 2^30 25 1.006 s 224.442 ms -781614.108 us -77.69%
host paged 7 1 2^30 25 879.983 ms 133.681 ms -746302.241 us -84.81%
host pinned 7 1 2^30 25 806.090 ms 133.161 ms -672929.604 us -83.48%
device 1 25 2^30 25 657.313 ms 116.237 ms -541076.105 us -82.32%
file 1 25 2^30 25 792.993 ms 187.620 ms -605372.917 us -76.34%
host paged 1 25 2^30 25 707.506 ms 117.467 ms -590039.693 us -83.40%
host pinned 1 25 2^30 25 714.342 ms 116.772 ms -597570.059 us -83.65%
device 4 25 2^30 25 810.447 ms 138.744 ms -671702.648 us -82.88%
file 4 25 2^30 25 984.757 ms 226.860 ms -757897.389 us -76.96%
host paged 4 25 2^30 25 870.188 ms 139.849 ms -730338.049 us -83.93%
host pinned 4 25 2^30 25 1.005 s 139.282 ms -865389.388 us -86.14%
device 7 25 2^30 25 828.847 ms 142.057 ms -686789.436 us -82.86%
file 7 25 2^30 25 986.067 ms 231.655 ms -754412.174 us -76.51%
host paged 7 25 2^30 25 929.677 ms 143.445 ms -786232.463 us -84.57%
host pinned 7 25 2^30 25 913.866 ms 142.981 ms -770884.883 us -84.35%
device 1 1 2^15 50 494.177 us 259.079 us -235.098 us -47.57%
file 1 1 2^15 50 8.305 ms 3.781 ms -4524.620 us -54.48%
host paged 1 1 2^15 50 515.479 us 267.290 us -248.189 us -48.15%
host pinned 1 1 2^15 50 527.832 us 273.050 us -254.781 us -48.27%
device 4 1 2^15 50 513.578 us 267.341 us -246.237 us -47.95%
file 4 1 2^15 50 8.478 ms 3.784 ms -4694.029 us -55.37%
host paged 4 1 2^15 50 536.324 us 274.019 us -262.305 us -48.91%
host pinned 4 1 2^15 50 545.107 us 281.623 us -263.484 us -48.34%
device 7 1 2^15 50 536.213 us 279.726 us -256.487 us -47.83%
file 7 1 2^15 50 8.402 ms 3.790 ms -4611.689 us -54.89%
host paged 7 1 2^15 50 558.127 us 285.683 us -272.444 us -48.81%
host pinned 7 1 2^15 50 569.002 us 298.287 us -270.715 us -47.58%
device 1 25 2^15 50 524.190 us 275.036 us -249.154 us -47.53%
file 1 25 2^15 50 8.467 ms 3.719 ms -4748.710 us -56.08%
host paged 1 25 2^15 50 540.475 us 282.192 us -258.283 us -47.79%
host pinned 1 25 2^15 50 554.028 us 288.838 us -265.189 us -47.87%
device 4 25 2^15 50 571.287 us 300.003 us -271.283 us -47.49%
file 4 25 2^15 50 8.342 ms 3.732 ms -4609.666 us -55.26%
host paged 4 25 2^15 50 591.604 us 308.682 us -282.922 us -47.82%
host pinned 4 25 2^15 50 603.550 us 314.584 us -288.966 us -47.88%
device 7 25 2^15 50 576.753 us 300.907 us -275.846 us -47.83%
file 7 25 2^15 50 8.353 ms 3.725 ms -4628.266 us -55.41%
host paged 7 25 2^15 50 595.478 us 309.244 us -286.235 us -48.07%
host pinned 7 25 2^15 50 609.549 us 315.463 us -294.087 us -48.25%
device 1 1 2^30 50 772.301 ms 259.032 ms -513268.599 us -66.46%
file 1 1 2^30 50 969.116 ms 321.799 ms -647316.758 us -66.79%
host paged 1 1 2^30 50 1.141 s 260.091 ms -881389.084 us -77.21%
host pinned 1 1 2^30 50 949.102 ms 259.552 ms -689549.243 us -72.65%
device 4 1 2^30 50 786.655 ms 262.064 ms -524590.867 us -66.69%
file 4 1 2^30 50 1.098 s 326.710 ms -770958.900 us -70.24%
host paged 4 1 2^30 50 948.089 ms 263.495 ms -684593.933 us -72.21%
host pinned 4 1 2^30 50 907.653 ms 262.806 ms -644846.480 us -71.05%
device 7 1 2^30 50 789.401 ms 263.368 ms -526033.714 us -66.64%
file 7 1 2^30 50 1.083 s 327.480 ms -755255.187 us -69.75%
host paged 7 1 2^30 50 962.268 ms 264.524 ms -697744.071 us -72.51%
host pinned 7 1 2^30 50 971.881 ms 264.065 ms -707815.282 us -72.83%
device 1 25 2^30 50 665.425 ms 232.463 ms -432961.322 us -65.07%
file 1 25 2^30 50 911.572 ms 282.358 ms -629214.772 us -69.03%
host paged 1 25 2^30 50 924.965 ms 234.220 ms -690745.245 us -74.68%
host pinned 1 25 2^30 50 699.093 ms 233.032 ms -466060.898 us -66.67%
device 4 25 2^30 50 817.301 ms 276.820 ms -540481.648 us -66.13%
file 4 25 2^30 50 1.028 s 337.968 ms -690458.768 us -67.14%
host paged 4 25 2^30 50 989.999 ms 278.600 ms -711398.589 us -71.86%
host pinned 4 25 2^30 50 859.267 ms 278.173 ms -581093.622 us -67.63%
device 7 25 2^30 50 835.640 ms 282.740 ms -552899.603 us -66.16%
file 7 25 2^30 50 1.028 s 344.770 ms -682948.230 us -66.45%
host paged 7 25 2^30 50 1.175 s 283.889 ms -890653.851 us -75.83%
host pinned 7 25 2^30 50 878.720 ms 283.505 ms -595215.368 us -67.74%
device 1 1 2^15 100 495.879 us 265.133 us -230.747 us -46.53%
file 1 1 2^15 100 8.474 ms 3.771 ms -4703.913 us -55.51%
host paged 1 1 2^15 100 521.063 us 272.803 us -248.260 us -47.64%
host pinned 1 1 2^15 100 535.848 us 277.043 us -258.805 us -48.30%
device 4 1 2^15 100 513.719 us 271.754 us -241.965 us -47.10%
file 4 1 2^15 100 8.369 ms 3.767 ms -4602.203 us -54.99%
host paged 4 1 2^15 100 542.547 us 281.336 us -261.212 us -48.15%
host pinned 4 1 2^15 100 553.792 us 290.824 us -262.968 us -47.48%
device 7 1 2^15 100 536.841 us 284.016 us -252.826 us -47.10%
file 7 1 2^15 100 8.390 ms 3.764 ms -4626.076 us -55.14%
host paged 7 1 2^15 100 563.217 us 292.971 us -270.246 us -47.98%
host pinned 7 1 2^15 100 578.990 us 302.168 us -276.822 us -47.81%
device 1 25 2^15 100 524.359 us 280.308 us -244.051 us -46.54%
file 1 25 2^15 100 8.374 ms 3.776 ms -4598.153 us -54.91%
host paged 1 25 2^15 100 560.218 us 288.964 us -271.253 us -48.42%
host pinned 1 25 2^15 100 566.101 us 294.120 us -271.980 us -48.04%
device 4 25 2^15 100 582.539 us 305.593 us -276.946 us -47.54%
file 4 25 2^15 100 8.445 ms 3.775 ms -4670.815 us -55.31%
host paged 4 25 2^15 100 596.100 us 314.954 us -281.146 us -47.16%
host pinned 4 25 2^15 100 606.737 us 319.597 us -287.140 us -47.33%
device 7 25 2^15 100 577.114 us 305.427 us -271.687 us -47.08%
file 7 25 2^15 100 8.459 ms 3.775 ms -4683.776 us -55.37%
host paged 7 25 2^15 100 600.814 us 315.401 us -285.413 us -47.50%
host pinned 7 25 2^15 100 617.354 us 323.114 us -294.240 us -47.66%
device 1 1 2^30 100 785.740 ms 515.240 ms -270499.817 us -34.43%
file 1 1 2^30 100 1.043 s 521.471 ms -521600.353 us -50.01%
host paged 1 1 2^30 100 1.030 s 518.769 ms -511254.716 us -49.64%
host pinned 1 1 2^30 100 922.861 ms 517.999 ms -404862.705 us -43.87%
device 4 1 2^30 100 800.010 ms 521.530 ms -278479.331 us -34.81%
file 4 1 2^30 100 1.190 s 527.630 ms -661958.288 us -55.65%
host paged 4 1 2^30 100 1.090 s 524.598 ms -565287.699 us -51.87%
host pinned 4 1 2^30 100 1.059 s 524.101 ms -535396.658 us -50.53%
device 7 1 2^30 100 802.697 ms 524.818 ms -277878.923 us -34.62%
file 7 1 2^30 100 1.122 s 531.504 ms -590129.710 us -52.61%
host paged 7 1 2^30 100 1.365 s 527.805 ms -837014.954 us -61.33%
host pinned 7 1 2^30 100 942.695 ms 527.151 ms -415543.608 us -44.08%
device 1 25 2^30 100 681.754 ms 461.792 ms -219961.806 us -32.26%
file 1 25 2^30 100 888.544 ms 467.517 ms -421027.176 us -47.38%
host paged 1 25 2^30 100 873.000 ms 464.901 ms -408099.261 us -46.75%
host pinned 1 25 2^30 100 798.295 ms 463.761 ms -334534.095 us -41.91%
device 4 25 2^30 100 831.389 ms 550.050 ms -281338.656 us -33.84%
file 4 25 2^30 100 1.079 s 555.891 ms -522743.242 us -48.46%
host paged 4 25 2^30 100 1.060 s 552.930 ms -506979.337 us -47.83%
host pinned 4 25 2^30 100 945.572 ms 552.551 ms -393021.062 us -41.56%
device 7 25 2^30 100 849.229 ms 562.074 ms -287155.401 us -33.81%
file 7 25 2^30 100 1.104 s 567.974 ms -535690.202 us -48.54%
host paged 7 25 2^30 100 1.375 s 565.169 ms -809622.226 us -58.89%
host pinned 7 25 2^30 100 1.099 s 564.743 ms -534429.395 us -48.62%

TODO

  • Handle byte_range edge cases
  • Handle issues with large inputs
  • Extend to overlapping delimiters by providing previous_chunk support for data_chunk_source. (That will be another PR.)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Closes #11197

@upsj upsj added libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Aug 9, 2022
@upsj upsj added this to the Genomics read_text support milestone Aug 9, 2022
@upsj upsj self-assigned this Aug 9, 2022
@upsj upsj added the improvement Improvement / enhancement to an existing function label Aug 9, 2022
@upsj upsj marked this pull request as ready for review August 11, 2022 11:32
@upsj upsj requested a review from a team as a code owner August 11, 2022 11:32
@upsj
Contributor Author

upsj commented Aug 11, 2022

This still needs some work, since it duplicates the existing kernel and could use a few more tests, but I think it's already good for a first review.

@codecov

codecov bot commented Aug 15, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@ccd72f2). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 4a89fe5 differs from pull request most recent head 4488965. Consider uploading reports for the commit 4488965 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11500   +/-   ##
===============================================
  Coverage                ?   86.41%           
===============================================
  Files                   ?      145           
  Lines                   ?    22993           
  Branches                ?        0           
===============================================
  Hits                    ?    19869           
  Misses                  ?     3124           
  Partials                ?        0           


@upsj upsj added the 3 - Ready for Review Ready for review by team label Aug 16, 2022
@ttnghia ttnghia added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Aug 16, 2022
@ttnghia
Contributor

ttnghia commented Aug 16, 2022

According to the recent discussion, if this needs more work, then add the DO NOT MERGE label until it is ready to merge.

@upsj
Contributor Author

upsj commented Aug 16, 2022

I realized today that the behavior I wanted to provide at the high level is not supported by the existing kernel: read_text just can't deal with overlapping delimiters for now (it creates a bunch of empty/delimiter-only rows), so the entire backtracking effort is not necessary.

@vuule
Contributor

vuule commented Aug 16, 2022

> I realized today that the behavior I wanted to provide at the high level is not supported by the existing kernel: read_text just can't deal with overlapping delimiters for now (it creates a bunch of empty/delimiter-only rows), so the entire backtracking effort is not necessary for now.

There's still some clean up pending before review, is that right? Disregard, I see the clean up commits pushed.

@vuule vuule added this to PR-WIP in v22.10 Release via automation Aug 17, 2022
@upsj upsj removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Aug 17, 2022
@upsj upsj force-pushed the feature/multibyte_split_local branch from 5138705 to a96bcd8 Compare August 17, 2022 13:08
Contributor

@cwharris cwharris left a comment

Implementation looks good, and the benchmarks are looking great too. Cherry picking some here...

Though peak memory usage has doubled, which is interesting. Do we know why that is? Maybe there's some tuning we can do with the output_chunks class?

Existing implementation:

MultibyteSplitBenchmark/multibyte_split_simple/0/7/25/1073741824/manual_time        194 ms          194 ms            4 bytes_per_second=4.93718G/s peak_memory_usage=1.48489G
MultibyteSplitBenchmark/multibyte_split_simple/1/7/25/1073741824/manual_time        521 ms          521 ms            1 bytes_per_second=1.83643G/s peak_memory_usage=1.48908G
MultibyteSplitBenchmark/multibyte_split_simple/2/7/25/1073741824/manual_time       1328 ms         1326 ms            1 bytes_per_second=738.056M/s peak_memory_usage=1032.22M

This PR's improvements:

MultibyteSplitBenchmark/multibyte_split_simple/0/7/25/1073741824/manual_time        129 ms          129 ms            5 bytes_per_second=7.44079G/s peak_memory_usage=3.15422G
MultibyteSplitBenchmark/multibyte_split_simple/1/7/25/1073741824/manual_time        153 ms          153 ms            5 bytes_per_second=6.25106G/s peak_memory_usage=3.15422G
MultibyteSplitBenchmark/multibyte_split_simple/2/7/25/1073741824/manual_time        420 ms          420 ms            2 bytes_per_second=2.27932G/s peak_memory_usage=2.10296G

I have some comments, which are optional to address.

Requesting changes because we need some benchmarks that exemplify the byte_range improvements. I never got around to adding benchmarks for that case because the perf would be on par with full file reads. Now that we have byte range optimizations, we have a chance to demonstrate some even bigger improvements than "just" the ~2-3x we see in the existing benchmarks.

@upsj upsj force-pushed the feature/multibyte_split_local branch from a96bcd8 to 221c216 Compare August 18, 2022 21:16
@github-actions github-actions bot added the CMake CMake build issue label Aug 18, 2022
@upsj upsj changed the title from "Single-pass multibyte_split for non-overlapping delimiters" to "Single-pass multibyte_split" Aug 18, 2022
@upsj
Contributor Author

upsj commented Aug 22, 2022

@cwharris on the increased memory usage: With the exponential growth of the chunks, at worst we overestimate the amount of memory by the growth factor (2x in this case). A smaller growth factor might also make sense. What we could do alternatively is set a size limit for the chunks, because at some point the allocation overhead amortized over all kernel launches until the chunk is full is negligible.
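The trade-off can be made concrete with a small calculation. The sketch below is an illustration only: the function name allocated_capacity and all parameter values are hypothetical, and it ignores any extra empty chunk a builder might keep ready in advance. It sums chunk capacities until n elements fit, with the chunk size growing by a fixed factor up to an optional limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical model: total capacity allocated once n elements are stored,
// when chunk sizes start at first_chunk, grow by `growth` each time, and
// are capped at max_chunk.
std::size_t allocated_capacity(std::size_t n,
                               std::size_t first_chunk,
                               std::size_t growth,
                               std::size_t max_chunk)
{
  assert(first_chunk > 0 && growth >= 1 && max_chunk > 0);
  std::size_t total = 0;
  std::size_t chunk = std::min(first_chunk, max_chunk);
  while (total < n) {
    total += chunk;                               // allocate the next chunk
    chunk = std::min(chunk * growth, max_chunk);  // grow, but not past the limit
  }
  return total;
}
```

With first_chunk = 16 and growth = 2, storing 497 elements allocates 1008 entries without a limit (about 2x), but only 560 with max_chunk = 64, at the cost of more individual allocations.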

@upsj upsj force-pushed the feature/multibyte_split_local branch 2 times, most recently from 7ceb82f to 814c28c Compare August 23, 2022 09:40
@github-actions github-actions bot removed the CMake CMake build issue label Aug 23, 2022
@upsj upsj force-pushed the feature/multibyte_split_local branch from 814c28c to 9ac5474 Compare August 24, 2022 11:24
@upsj upsj force-pushed the feature/multibyte_split_local branch from 9ac5474 to 918e45e Compare August 26, 2022 08:24
@github-actions github-actions bot removed the CMake CMake build issue label Aug 26, 2022
@upsj
Contributor Author

upsj commented Aug 26, 2022

rerun tests

@vuule vuule self-requested a review August 28, 2022 06:02
Contributor

@vuule vuule left a comment

Partial review, mostly minor suggestions.
Lots of cool stuff to unpack here :)

v22.10 Release automation moved this from PR-Reviewer approved to PR-Needs review Aug 28, 2022
Contributor

@vuule vuule left a comment

Looks good. Just some very minor suggestions/questions.

v22.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Aug 30, 2022
@upsj upsj added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Aug 30, 2022
@upsj
Contributor Author

upsj commented Aug 30, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b4dd2d5 into rapidsai:branch-22.10 Aug 30, 2022
v22.10 Release automation moved this from PR-Reviewer approved to Done Aug 30, 2022
rapids-bot bot pushed a commit that referenced this pull request Aug 31, 2022
This improves the `multibyte_split` kernel by
* Reducing register pressure: Instead of storing `ITEMS_PER_THREAD` individual states, store only the initial `multistate` for the thread and recompute the individual states on-the-fly
* Eliminating local memory usage: Manipulate the `multistate` via shifts instead of array random access
* Eliminating trie overhead: Since we have only a single delimiter, the trie is a path. We can do the traversal implicitly
* Memoizing which chars were a match: We don't need to recompute this information, but can store it in a bitmask
* Changing the block load algorithm: `BLOCK_LOAD_VECTORIZE` was slightly less efficient than `BLOCK_LOAD_WARP_TRANSPOSE`
* Reducing the allocation overhead by limiting the `output_builder` max allocation size
* Tuning the parameters: `ITEMS_PER_THREAD = 64` works better, and we can improve performance further by operating on larger chunks

Overall, this gives a roughly 2x speedup in my benchmarks.
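The idea of treating a single delimiter as a path and updating the match state with shifts resembles the classic Shift-And string matching technique. The host-side sketch below shows that technique only; it is not the kernel from #11587. The function name shift_and_find, the 64-byte delimiter limit and the reset after each match (to report non-overlapping occurrences) are assumptions for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Shift-And matching for one delimiter. Bit i of `state` is set when the
// first i + 1 delimiter bytes match the input ending at the current byte.
// Returns the positions just past each non-overlapping occurrence.
std::vector<std::size_t> shift_and_find(std::string const& text, std::string const& delim)
{
  std::size_t const m = delim.size();
  assert(m > 0 && m <= 64);

  // Per-byte bitmask: bit i is set if delim[i] equals that byte value.
  std::array<std::uint64_t, 256> mask{};
  for (std::size_t i = 0; i < m; ++i) {
    mask[static_cast<unsigned char>(delim[i])] |= std::uint64_t{1} << i;
  }

  std::uint64_t const accept = std::uint64_t{1} << (m - 1);
  std::uint64_t state        = 0;
  std::vector<std::size_t> ends;
  for (std::size_t pos = 0; pos < text.size(); ++pos) {
    // Advance every partial match by one byte and start a new one at bit 0.
    state = ((state << 1) | 1) & mask[static_cast<unsigned char>(text[pos])];
    if (state & accept) {
      ends.push_back(pos + 1);
      state = 0;  // assumption: drop partial matches so occurrences do not overlap
    }
  }
  return ends;
}
```

All partial matches live in one integer here, so advancing them needs a shift and a mask lookup rather than per-state array accesses.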

Based on #11500

Authors:
  - Tobias Ribizel (https://github.com/upsj)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11587
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[FEA] Support read_text using a byte range without scanning the full source file
5 participants