
Improve multibyte-split byte-range performance #16019

Merged
merged 1 commit into from
Jun 25, 2024

Conversation

@davidwendt davidwendt commented Jun 13, 2024

Description

Changes the cudf::io::text::multibyte_split() function to use std::ifstream::seekg() instead of std::ifstream::ignore() to skip bytes for a file input source. The seekg() function is significantly faster for large files.

Also fixes the multibyte-split benchmark to correctly access the chars buffer after generating an input strings column.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Jun 13, 2024
@davidwendt davidwendt self-assigned this Jun 13, 2024
@davidwendt

The benchmark showed only a minor improvement:

# multibyte_split_source

## [0] NVIDIA RTX A6000

|        T        |  strip_delimiters  |  delim_size  |  delim_percent  |  size_approx  |  byte_range_percent  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-----------------|--------------------|--------------|-----------------|---------------|----------------------|------------|-------------|------------|-------------|---------------|---------|----------|
| file_datasource |         0          |      1       |        1        |     2^15      |          10          | 159.175 us |      10.53% | 159.251 us |       5.97% |      0.076 us |   0.05% |   PASS   |
| file_datasource |         1          |      1       |        1        |     2^15      |          10          | 217.408 us |       1.58% | 218.453 us |       1.59% |      1.045 us |   0.48% |   PASS   |
| file_datasource |         0          |      1       |        1        |     2^30      |          10          | 245.440 ms |       0.50% | 242.862 ms |       0.50% |  -2577.882 us |  -1.05% |   FAIL   |
| file_datasource |         1          |      1       |        1        |     2^30      |          10          | 254.515 ms |       0.50% | 252.031 ms |       0.50% |  -2484.335 us |  -0.98% |   FAIL   |
| file_datasource |         0          |      1       |        1        |     2^15      |         100          | 181.939 us |       2.80% | 184.765 us |       2.54% |      2.826 us |   1.55% |   PASS   |
| file_datasource |         1          |      1       |        1        |     2^15      |         100          | 262.151 us |       1.90% | 265.842 us |       1.78% |      3.691 us |   1.41% |   PASS   |
| file_datasource |         0          |      1       |        1        |     2^30      |         100          |    1.559 s |       0.98% |    1.544 s |       0.99% | -15319.019 us |  -0.98% |   PASS   |
| file_datasource |         1          |      1       |        1        |     2^30      |         100          |    1.644 s |       0.92% |    1.629 s |       0.94% | -15622.205 us |  -0.95% |   FAIL   |

But running this on a 14 GB file in 10 chunks showed a more significant improvement.
Here is a test calling cudf::io::text::multibyte_split() with the std::ifstream::ignore() function used internally:

load (0,1379596188): 1.93292 seconds
load (1379596188,1379596188): 2.14708 seconds
load (2759192376,1379596188): 2.41492 seconds
load (4138788564,1379596188): 2.67618 seconds
load (5518384752,1379596188): 2.94917 seconds
load (6897980940,1379596188): 3.20467 seconds
load (8277577128,1379596188): 3.47808 seconds
load (9657173316,1379596188): 3.73912 seconds
load (11036769504,1379596188): 4.01435 seconds
load (12416365692,1379596187): 4.18789 seconds

The 1st value in the parentheses is the chunk's starting position (byte offset) within the file, and the 2nd value is the chunk size (also in bytes) to read at that position.
Note that the load time increases the further into the file the chunk starts: ignore() must read and discard every skipped byte, so the skip cost grows linearly with the starting offset.

This is the output of the same code using seekg() internally instead of ignore():

load (0,1379596188): 1.95383 seconds
load (1379596188,1379596188): 1.89438 seconds
load (2759192376,1379596188): 1.89466 seconds
load (4138788564,1379596188): 1.89445 seconds
load (5518384752,1379596188): 1.89458 seconds
load (6897980940,1379596188): 1.8943 seconds
load (8277577128,1379596188): 1.89455 seconds
load (9657173316,1379596188): 1.89433 seconds
load (11036769504,1379596188): 1.89456 seconds
load (12416365692,1379596187): 1.81201 seconds

Here, loading each chunk takes roughly the same amount of time, since seekg() repositions the stream directly without reading through the skipped bytes.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 13, 2024
@davidwendt davidwendt marked this pull request as ready for review June 13, 2024 22:29
@davidwendt davidwendt requested a review from a team as a code owner June 13, 2024 22:29
@davidwendt davidwendt requested review from harrism and bdice June 13, 2024 22:29
@davidwendt

/merge

@rapids-bot rapids-bot bot merged commit ac0f79a into rapidsai:branch-24.08 Jun 25, 2024
85 checks passed
@davidwendt davidwendt deleted the fix-mbs-skip-bytes branch June 25, 2024 15:00