
Improve multibyte-split byte-range performance #16019

Merged
merged 1 commit into from
Jun 25, 2024

Conversation

@davidwendt davidwendt commented Jun 13, 2024

Description

Changes the cudf::io::text::multibyte_split() function to use std::ifstream::seekg() instead of std::ifstream::ignore() to skip bytes for a file input source. The seekg() function is significantly faster for large files.

Also fixes the multibyte-split benchmark to correctly access the chars buffer after generating an input strings column.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Jun 13, 2024
@davidwendt davidwendt self-assigned this Jun 13, 2024
@davidwendt

The benchmark showed only a minor improvement:

# multibyte_split_source

## [0] NVIDIA RTX A6000

|        T        |  strip_delimiters  |  delim_size  |  delim_percent  |  size_approx  |  byte_range_percent  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-----------------|--------------------|--------------|-----------------|---------------|----------------------|------------|-------------|------------|-------------|---------------|---------|----------|
| file_datasource |         0          |      1       |        1        |     2^15      |          10          | 159.175 us |      10.53% | 159.251 us |       5.97% |      0.076 us |   0.05% |   PASS   |
| file_datasource |         1          |      1       |        1        |     2^15      |          10          | 217.408 us |       1.58% | 218.453 us |       1.59% |      1.045 us |   0.48% |   PASS   |
| file_datasource |         0          |      1       |        1        |     2^30      |          10          | 245.440 ms |       0.50% | 242.862 ms |       0.50% |  -2577.882 us |  -1.05% |   FAIL   |
| file_datasource |         1          |      1       |        1        |     2^30      |          10          | 254.515 ms |       0.50% | 252.031 ms |       0.50% |  -2484.335 us |  -0.98% |   FAIL   |
| file_datasource |         0          |      1       |        1        |     2^15      |         100          | 181.939 us |       2.80% | 184.765 us |       2.54% |      2.826 us |   1.55% |   PASS   |
| file_datasource |         1          |      1       |        1        |     2^15      |         100          | 262.151 us |       1.90% | 265.842 us |       1.78% |      3.691 us |   1.41% |   PASS   |
| file_datasource |         0          |      1       |        1        |     2^30      |         100          |    1.559 s |       0.98% |    1.544 s |       0.99% | -15319.019 us |  -0.98% |   PASS   |
| file_datasource |         1          |      1       |        1        |     2^30      |         100          |    1.644 s |       0.92% |    1.629 s |       0.94% | -15622.205 us |  -0.95% |   FAIL   |

But running this on a 14 GB file in 10 chunks showed a more significant improvement.
Here is a test calling cudf::io::text::multibyte_split() with the std::ifstream::ignore() function used internally:

load (0,1379596188): 1.93292 seconds
load (1379596188,1379596188): 2.14708 seconds
load (2759192376,1379596188): 2.41492 seconds
load (4138788564,1379596188): 2.67618 seconds
load (5518384752,1379596188): 2.94917 seconds
load (6897980940,1379596188): 3.20467 seconds
load (8277577128,1379596188): 3.47808 seconds
load (9657173316,1379596188): 3.73912 seconds
load (11036769504,1379596188): 4.01435 seconds
load (12416365692,1379596187): 4.18789 seconds

The 1st value in the parentheses is the chunk's starting position (byte offset) within the file, and the 2nd value is the chunk size (also in bytes) to read at that position.
Note that the load time increases the further into the file the chunk starts: ignore() must read and discard every skipped byte, so the skip cost grows linearly with the starting offset.

This is the output of the same code using seekg() internally instead of ignore():

load (0,1379596188): 1.95383 seconds
load (1379596188,1379596188): 1.89438 seconds
load (2759192376,1379596188): 1.89466 seconds
load (4138788564,1379596188): 1.89445 seconds
load (5518384752,1379596188): 1.89458 seconds
load (6897980940,1379596188): 1.8943 seconds
load (8277577128,1379596188): 1.89455 seconds
load (9657173316,1379596188): 1.89433 seconds
load (11036769504,1379596188): 1.89456 seconds
load (12416365692,1379596187): 1.81201 seconds

Here, loading each chunk takes roughly the same amount of time, since seekg() repositions the stream directly without reading through the skipped bytes.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 13, 2024
@davidwendt davidwendt marked this pull request as ready for review June 13, 2024 22:29
@davidwendt davidwendt requested a review from a team as a code owner June 13, 2024 22:29
@davidwendt davidwendt requested review from harrism and bdice June 13, 2024 22:29
@davidwendt

/merge

@rapids-bot rapids-bot bot merged commit ac0f79a into rapidsai:branch-24.08 Jun 25, 2024
85 checks passed
@davidwendt davidwendt deleted the fix-mbs-skip-bytes branch June 25, 2024 15:00