Single-pass multibyte_split #11500
Conversation
This still needs some work, since it duplicates the existing kernel and could use a few more tests, but I think it's already good for a first review.
Codecov Report

@@           Coverage Diff            @@
##           branch-22.10    #11500   +/- ##
===============================================
  Coverage              ?    86.41%
===============================================
  Files                 ?       145
  Lines                 ?     22993
  Branches              ?         0
===============================================
  Hits                  ?     19869
  Misses                ?      3124
  Partials              ?         0
According to the recent discussion, if this needs more work then add
I realized today that the behavior I wanted to provide at the high level is not supported by the existing kernel, so
Force-pushed from 5138705 to a96bcd8
Implementation looks good, and the benchmarks are looking great too. Cherry picking some here...
Though peak memory usage has doubled, which is interesting. Do we know why that is? Maybe there's some tuning we can do with the `output_chunks` class?
Existing implementation:
MultibyteSplitBenchmark/multibyte_split_simple/0/7/25/1073741824/manual_time 194 ms 194 ms 4 bytes_per_second=4.93718G/s peak_memory_usage=1.48489G
MultibyteSplitBenchmark/multibyte_split_simple/1/7/25/1073741824/manual_time 521 ms 521 ms 1 bytes_per_second=1.83643G/s peak_memory_usage=1.48908G
MultibyteSplitBenchmark/multibyte_split_simple/2/7/25/1073741824/manual_time 1328 ms 1326 ms 1 bytes_per_second=738.056M/s peak_memory_usage=1032.22M
This PR's improvements:
MultibyteSplitBenchmark/multibyte_split_simple/0/7/25/1073741824/manual_time 129 ms 129 ms 5 bytes_per_second=7.44079G/s peak_memory_usage=3.15422G
MultibyteSplitBenchmark/multibyte_split_simple/1/7/25/1073741824/manual_time 153 ms 153 ms 5 bytes_per_second=6.25106G/s peak_memory_usage=3.15422G
MultibyteSplitBenchmark/multibyte_split_simple/2/7/25/1073741824/manual_time 420 ms 420 ms 2 bytes_per_second=2.27932G/s peak_memory_usage=2.10296G
I have some comments, which are optional to address.
Requesting changes because we need some benchmarks that exemplify the `byte_range` improvements. I never got around to adding benchmarks for that case because the perf would be on par with full file reads. Now that we have byte range optimizations, we have a chance to demonstrate some even bigger improvements than "just" the ~2-3x we see in the existing benchmarks.
Force-pushed from a96bcd8 to 221c216
Changed the title: `multibyte_split` for non-overlapping delimiters → `multibyte_split`
@cwharris on the increased memory usage: With the exponential growth of the chunks, at worst we overestimate the amount of memory by the growth factor (2x in this case). A smaller growth factor might also make sense. What we could do alternatively is set a size limit for the chunks, because at some point the allocation overhead amortized over all kernel launches until the chunk is full is negligible.
Force-pushed from 7ceb82f to 814c28c
Force-pushed from 814c28c to 9ac5474
Force-pushed from 9ac5474 to 918e45e
rerun tests
Partial review, mostly minor suggestions.
Lots of cool stuff to unpack here :)
Looks good. Just some very minor suggestions/questions.
@gpucibot merge
This improves the `multibyte_split` kernel by

* Reducing register pressure: instead of storing `ITEMS_PER_THREAD` individual states, store only the initial `multistate` for the thread and recompute the individual states on the fly
* Eliminating local memory usage: manipulate the `multistate` via shifts instead of array random access
* Eliminating trie overhead: since we have only a single delimiter, the trie is a path, so we can do the traversal implicitly
* Memoizing which chars were a match: we don't need to recompute this information, but can store it in a bitmask
* Changing the block load algorithm: `BLOCK_LOAD_VECTORIZE` was slightly less efficient than `BLOCK_LOAD_WARP_TRANSPOSE`
* Reducing the allocation overhead by limiting the `output_builder` max allocation size
* Tuning the parameters: `ITEMS_PER_THREAD = 64` works better, and we can improve performance further by operating on larger chunks

Overall, this gives a roughly 2x speedup in my benchmarks.

Based on #11500

Authors:
- Tobias Ribizel (https://github.com/upsj)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)

URL: #11587
Description

This adds a new `multibyte_split` implementation that needs to scan the input only once, and takes full advantage of `byte_range`.

To accomplish this, I introduce a new data structure `output_builder` (naming bikeshedding welcome 😄) for pre-allocation of unknown, but bounded outputs. The structure contains a vector of exponentially growing `device_uvector`s, such that either the last vector has size 0 and the second-to-last vector has `size() < capacity()`, or all vectors but the last are full (`size() == capacity()`). It provides the following operations:

* `next_output(stream)` returns a `split_device_span` of at least the `worst_case_size` provided at construction, pointing to the next free entries from the last two vectors.
* `advance_output(actual_size)` marks the first `actual_size` entries of the previously returned `split_device_span` as filled.
* `split_device_span` takes care of writing to the smaller `device_uvector` first, and the larger `device_uvector` second, if the first one is full.
* `gather` copies all elements that were previously written into a single `device_uvector` of the correct size.

This data structure should provide a good balance between allocation overheads and memory usage.
I only modified the actual `multibyte_split` kernel slightly, to stop writing offsets once it passes the end of the `byte_range`. This way, we can determine all required offsets from a single scan, regardless of whether we need to provide a range or not.

Benchmark results: `multibyte_split` on [0] Tesla T4 (results table not captured)
TODO

* `byte_range` edge cases
* ~~Extend to overlapping delimiters by providing `previous_chunk` support for `data_chunk_source`~~ That will be another PR
Closes #11197