-
Notifications
You must be signed in to change notification settings - Fork 865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix read_text when byte_range is aligned with field #11371
Fix read_text when byte_range is aligned with field #11371
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be moved to 22.10?
auto out = cudf::io::text::multibyte_split( | ||
*source, | ||
delimiter, | ||
cudf::io::text::byte_range_info{0, static_cast<int64_t>(host_input.size())}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this test, it didn't occured to me we weren't testing this case explicitly, since under-the-hood we're using intmax as the byte range end when the byte range isn't specified.
@@ -383,17 +383,31 @@ std::unique_ptr<cudf::column> multibyte_split(cudf::io::text::data_chunk_source | |||
stream, | |||
streams); | |||
|
|||
// String offsets point to the first character of a field | |||
// This finds the first field whose first character starts inside or after the byte range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great comment. It makes it very clear to me where the bug was introduced:
... or after the byte range
@@ -169,4 +198,37 @@ TEST_F(MultibyteSplitTest, LargeInputMultipleRange) | |||
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected->view(), *out, debug_output_level::ALL_ERRORS); | |||
} | |||
|
|||
TEST_F(MultibyteSplitTest, SmallInputAllPossibleRanges) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great test.
This is otherwise broken, since even an empty range coinciding with the beginning of a field would produce this field as output.
Looks like some |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one small nit.
12db7a5
to
fe63370
Compare
Codecov Report
@@ Coverage Diff @@
## branch-22.10 #11371 +/- ##
===============================================
Coverage ? 86.47%
===============================================
Files ? 144
Lines ? 22856
Branches ? 0
===============================================
Hits ? 19765
Misses ? 3091
Partials ? 0 Continue to review full report at Codecov.
|
@gpucibot merge |
Description
Currently, if the beginning of a field coincides with either the beginning (inclusive) or end (exclusive) of a byte range, the field will be part of the output. This PR fixes the resulting field duplication if we concatenate the results from a partition of the input into byte ranges.
The issue stems from the fact that we use lower_bound to determine the beginning of a field, but upper_bound to determine its end, so if the end of the byte range coincides with the beginning of a field, the result from the range [a,b) doesn't fit exactly onto the result from the range [b,c).
To keep the previous behavior of emitting an empty field if the input ends with a delimiter, I needed to add a small fix that differentiates between byte ranges whose size matches the input size exactly, and ones that overrun the input size (which is the default behavior).
Checklist