Fix read_text when byte_range is aligned with field #11371

upsj · 2022-07-27T20:41:08Z

Description

Currently, if the beginning of a field coincides with either the beginning (inclusive) or end (exclusive) of a byte range, the field will be part of the output. A field that starts exactly at a shared boundary is therefore emitted for both adjacent ranges, so concatenating the results from a partition of the input into byte ranges duplicates that field. This PR fixes that duplication.

The issue stems from the fact that we use lower_bound to determine the beginning of a field, but upper_bound to determine its end, so if the end of the byte range coincides with the beginning of a field, the result from the range [a,b) doesn't fit exactly onto the result from the range [b,c).
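The mismatch can be reproduced on the host with plain `std::lower_bound` / `std::upper_bound`. The sketch below is only an illustrative model, not the libcudf implementation: `select_fields`, `field_starts` and the `inclusive_end` flag are hypothetical names introduced here, and the input is assumed to be a sorted list of field start offsets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Illustrative host-side model, not libcudf code.
// field_starts: sorted byte offsets of the first character of each field.
// Returns the half-open index range [first, last) of the fields selected by
// the byte range [begin, end).
std::pair<std::size_t, std::size_t> select_fields(
  std::vector<std::int64_t> const& field_starts,
  std::int64_t begin,
  std::int64_t end,
  bool inclusive_end)
{
  assert(begin <= end);
  // First field whose start is inside or after the byte range.
  auto const lo = std::lower_bound(field_starts.begin(), field_starts.end(), begin);
  // inclusive_end == true models the lower_bound/upper_bound mismatch:
  // a field starting exactly at `end` is still selected.
  // inclusive_end == false uses lower_bound for both ends.
  auto const hi = inclusive_end
                    ? std::upper_bound(field_starts.begin(), field_starts.end(), end)
                    : std::lower_bound(field_starts.begin(), field_starts.end(), end);
  return {static_cast<std::size_t>(std::distance(field_starts.begin(), lo)),
          static_cast<std::size_t>(std::distance(field_starts.begin(), hi))};
}
```

With field starts {0, 4, 8} and the byte ranges [0, 4) and [4, 11), the inclusive variant selects fields {0, 1} and {1, 2}, so field 1 appears twice after concatenation. Using lower_bound for both ends selects {0} and {1, 2}, which tile the input exactly.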

To keep the previous behavior of emitting an empty field if the input ends with a delimiter, I needed to add a small fix that differentiates between byte ranges whose size matches the input size exactly, and ones that overrun the input size (which is the default behavior).

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

cwharris

Should this be moved to 22.10?

cwharris · 2022-07-27T20:50:53Z

cpp/tests/io/text/multibyte_split_test.cpp

+  auto out    = cudf::io::text::multibyte_split(
+    *source,
+    delimiter,
+    cudf::io::text::byte_range_info{0, static_cast<int64_t>(host_input.size())});


Thanks for making this test. It didn't occur to me that we weren't testing this case explicitly, since under the hood we use intmax as the byte range end when the byte range isn't specified.

cwharris · 2022-07-27T20:52:18Z

cpp/src/io/text/multibyte_split.cu

@@ -383,17 +383,31 @@ std::unique_ptr<cudf::column> multibyte_split(cudf::io::text::data_chunk_source
    stream,
    streams);

+  // String offsets point to the first character of a field
+  // This finds the first field whose first character starts inside or after the byte range


This is a great comment. It makes it very clear to me where the bug was introduced:

... or after the byte range

cwharris · 2022-07-27T20:54:58Z

cpp/tests/io/text/multibyte_split_test.cpp

@@ -169,4 +198,37 @@ TEST_F(MultibyteSplitTest, LargeInputMultipleRange)
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected->view(), *out, debug_output_level::ALL_ERRORS);
 }

+TEST_F(MultibyteSplitTest, SmallInputAllPossibleRanges)


Great test.

cpp/src/io/text/multibyte_split.cu

vuule · 2022-07-28T21:08:16Z

Looks like some multibyte_split tests are failing with the latest commit

harrism

Just one small nit.

cpp/src/io/text/multibyte_split.cu

codecov · 2022-08-01T21:38:19Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@e7e5f45).
The diff coverage is n/a.

❗ Current head de87d91 differs from pull request most recent head fe63370. Consider uploading reports for the commit fe63370 to get more accurate results.

@@               Coverage Diff               @@
##             branch-22.10   #11371   +/-   ##
===============================================
  Coverage                ?   86.47%           
===============================================
  Files                   ?      144           
  Lines                   ?    22856           
  Branches                ?        0           
===============================================
  Hits                    ?    19765           
  Misses                  ?     3091           
  Partials                ?        0

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e7e5f45...fe63370.

upsj · 2022-08-02T14:15:45Z

@gpucibot merge

Fix read_text when byterange is aligned with field

e121e03

upsj requested a review from a team as a code owner July 27, 2022 20:41

upsj requested review from harrism, vuule and cwharris July 27, 2022 20:41

github-actions bot added the label "libcudf" (affects libcudf C++/CUDA code) Jul 27, 2022

cwharris added the labels "3 - Ready for Review" (ready for review by team), "non-breaking" (non-breaking change), and "bug" (something isn't working) Jul 27, 2022

upsj self-assigned this Jul 27, 2022

cwharris approved these changes Jul 27, 2022

ttnghia approved these changes Jul 27, 2022

upsj added 2 commits July 27, 2022 21:31

forbid empty read_text ranges

6ba61d7

This is otherwise broken, since even an empty range coinciding with the beginning of a field would produce this field as output.
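A small host-side model shows why an empty range is a problem when the end of the range is treated inclusively. This is a sketch under stated assumptions, not libcudf code: `count_fields_inclusive_end` models the selection with standard-library calls, and `expect_non_empty_range` is a hypothetical stand-in for whatever check the commit actually adds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <vector>

// Illustrative model, not libcudf code: counts the fields whose start offset
// lies in [begin, end], i.e. with an inclusive end (lower_bound/upper_bound).
std::size_t count_fields_inclusive_end(std::vector<std::int64_t> const& field_starts,
                                       std::int64_t begin,
                                       std::int64_t end)
{
  assert(begin <= end);
  auto const lo = std::lower_bound(field_starts.begin(), field_starts.end(), begin);
  auto const hi = std::upper_bound(field_starts.begin(), field_starts.end(), end);
  return static_cast<std::size_t>(std::distance(lo, hi));
}

// Hypothetical guard: rejects byte ranges that contain no bytes.
void expect_non_empty_range(std::int64_t offset, std::int64_t size)
{
  if (offset < 0 || size <= 0) { throw std::invalid_argument("byte range must be non-empty"); }
}
```

With field starts {0, 4, 8}, the empty range [4, 4) still selects one field in this model, which is why rejecting empty ranges up front is a simple way to avoid the case.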

clean up test

7dc6d26

vuule changed the base branch from branch-22.08 to branch-22.10 July 27, 2022 21:34

vuule reviewed Jul 27, 2022

cpp/src/io/text/multibyte_split.cu (outdated, resolved)

make condition for last_field_empty clearer

a9aea48

vuule added the label "0 - Waiting on Author" (waiting for author to respond to review) and removed the label "3 - Ready for Review" (ready for review by team) Jul 28, 2022

vuule added this to PR-WIP in v22.10 Release via automation Jul 28, 2022

harrism approved these changes Jul 28, 2022

cpp/src/io/text/multibyte_split.cu (outdated, resolved)

v22.10 Release automation moved this from PR-WIP to PR-Reviewer approved Jul 28, 2022

fix off-by-one in dask read_text

0a91db8

v22.10 Release automation moved this from PR-Reviewer approved to PR-Needs review Aug 1, 2022

github-actions bot added the label "Python" (affects Python cuDF API) Aug 1, 2022

upsj requested a review from a team August 1, 2022 19:31

improve readability

fe63370

upsj force-pushed the bugfix/multibyte_split_ranges branch from 12db7a5 to fe63370 Compare August 1, 2022 19:43

v22.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Aug 1, 2022

charlesbluca approved these changes Aug 1, 2022

galipremsagar approved these changes Aug 1, 2022

upsj added the label "5 - Ready to Merge" (testing and reviews complete, ready to merge) and removed the label "0 - Waiting on Author" (waiting for author to respond to review) Aug 1, 2022

rapids-bot bot merged commit 2dc5c3f into rapidsai:branch-22.10 Aug 2, 2022

v22.10 Release automation moved this from PR-Reviewer approved to Done Aug 2, 2022

upsj mentioned this pull request Sep 12, 2022

[BUG] read_txt returns overlapping values with non-overlapping ranges #10620

Closed
