Fix read_text when byte_range is aligned with field #11371

Merged

upsj
Contributor

@upsj upsj commented Jul 27, 2022

Description

Currently, if the beginning of a field coincides with either the beginning (inclusive) or the end (exclusive) of a byte range, the field will be part of the output. This PR fixes the resulting field duplication that occurs when we partition the input into consecutive byte ranges and concatenate the results.

The issue stems from the fact that we use lower_bound to determine the beginning of a field, but upper_bound to determine its end. If the end of the byte range coincides with the beginning of a field, that field is emitted for both [a,b) and [b,c), so the result from the range [a,b) doesn't line up exactly with the result from the range [b,c).

To keep the previous behavior of emitting an empty field if the input ends with a delimiter, I needed to add a small fix that differentiates between byte ranges whose size matches the input size exactly, and ones that overrun the input size (which is the default behavior).
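
The lower_bound/upper_bound mismatch described above can be reproduced on the host with a plain sorted list of field start offsets. The following is a minimal sketch, not the libcudf implementation: `field_span`, `fields_mixed_bounds` and `fields_consistent_bounds` are hypothetical names introduced for illustration, and using lower_bound on both ends is an assumption about how an exclusive range end can be modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Half-open index range [first, last) into the sorted list of field start offsets.
struct field_span {
  std::ptrdiff_t first;
  std::ptrdiff_t last;
};

// Hypothetical helper (not libcudf code): lower_bound for the start, upper_bound
// for the end. A field starting exactly at `end` is still counted for this range.
field_span fields_mixed_bounds(std::vector<std::int64_t> const& offsets,
                               std::int64_t begin,
                               std::int64_t end)
{
  assert(begin <= end);
  auto const first = std::lower_bound(offsets.begin(), offsets.end(), begin);
  auto const last  = std::upper_bound(offsets.begin(), offsets.end(), end);
  return {std::distance(offsets.begin(), first), std::distance(offsets.begin(), last)};
}

// Hypothetical helper: lower_bound on both ends, so a field belongs to a range
// exactly when its start offset lies in [begin, end).
field_span fields_consistent_bounds(std::vector<std::int64_t> const& offsets,
                                    std::int64_t begin,
                                    std::int64_t end)
{
  assert(begin <= end);
  auto const first = std::lower_bound(offsets.begin(), offsets.end(), begin);
  auto const last  = std::lower_bound(offsets.begin(), offsets.end(), end);
  return {std::distance(offsets.begin(), first), std::distance(offsets.begin(), last)};
}
```

With field offsets {0, 4, 8}, the mixed version assigns the field starting at 4 to both [0,4) and [4,12), and even the empty range [4,4) yields one field. With consistent bounds, the span for [0,4) ends where the span for [4,12) begins.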

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@upsj upsj requested a review from a team as a code owner July 27, 2022 20:41
@upsj upsj requested review from harrism, vuule and cwharris July 27, 2022 20:41
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 27, 2022
@cwharris cwharris added 3 - Ready for Review Ready for review by team non-breaking Non-breaking change bug Something isn't working labels Jul 27, 2022
@upsj upsj self-assigned this Jul 27, 2022
Contributor

@cwharris cwharris left a comment

Should this be moved to 22.10?

auto out = cudf::io::text::multibyte_split(
  *source,
  delimiter,
  cudf::io::text::byte_range_info{0, static_cast<int64_t>(host_input.size())});
Contributor

Thanks for making this test. It hadn't occurred to me that we weren't testing this case explicitly, since under the hood we use intmax as the byte range end when the byte range isn't specified.

@@ -383,17 +383,31 @@ std::unique_ptr<cudf::column> multibyte_split(cudf::io::text::data_chunk_source
stream,
streams);

// String offsets point to the first character of a field
// This finds the first field whose first character starts inside or after the byte range
Contributor

This is a great comment. It makes it very clear to me where the bug was introduced:

... or after the byte range

@@ -169,4 +198,37 @@ TEST_F(MultibyteSplitTest, LargeInputMultipleRange)
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected->view(), *out, debug_output_level::ALL_ERRORS);
}

TEST_F(MultibyteSplitTest, SmallInputAllPossibleRanges)
Contributor

Great test.
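
As a rough illustration of what an exhaustive range test can check, here is a host-side sketch. It is not the test from this PR: `split_in_range` and `all_partitions_consistent` are hypothetical reference helpers, fields are assumed to keep their trailing delimiter, and the trailing empty field after a final delimiter is deliberately left out to keep the sketch small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side reference (not libcudf code): returns the fields whose
// first character lies in the byte range [begin, end). Each field keeps its
// trailing delimiter. A trailing empty field is not modelled here.
std::vector<std::string> split_in_range(std::string const& input,
                                        std::string const& delimiter,
                                        std::int64_t begin,
                                        std::int64_t end)
{
  assert(!delimiter.empty());
  std::vector<std::string> fields;
  std::size_t field_start = 0;
  while (field_start < input.size()) {
    auto const delim_pos = input.find(delimiter, field_start);
    auto const field_end =
      delim_pos == std::string::npos ? input.size() : delim_pos + delimiter.size();
    auto const start = static_cast<std::int64_t>(field_start);
    if (start >= begin && start < end) {
      fields.push_back(input.substr(field_start, field_end - field_start));
    }
    field_start = field_end;
  }
  return fields;
}

// For every split point, the fields of [0, split) followed by the fields of
// [split, size) must equal the fields of the whole input.
bool all_partitions_consistent(std::string const& input, std::string const& delimiter)
{
  auto const size     = static_cast<std::int64_t>(input.size());
  auto const expected = split_in_range(input, delimiter, 0, size);
  for (std::int64_t split = 0; split <= size; ++split) {
    auto combined   = split_in_range(input, delimiter, 0, split);
    auto const tail = split_in_range(input, delimiter, split, size);
    combined.insert(combined.end(), tail.begin(), tail.end());
    if (combined != expected) { return false; }
  }
  return true;
}
```

For a small input such as "abc::de::f" with delimiter "::", this walks all eleven split points, including the ones that fall exactly on a field start (5 and 9), which are the cases this PR is concerned with.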

upsj added 2 commits July 27, 2022 21:31
This is otherwise broken, since even an empty range coinciding with
the beginning of a field would produce this field as output.
@vuule vuule changed the base branch from branch-22.08 to branch-22.10 July 27, 2022 21:34
@vuule
Contributor

vuule commented Jul 28, 2022

Looks like some multibyte_split tests are failing with the latest commit

@vuule vuule added 0 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Jul 28, 2022
@vuule vuule added this to PR-WIP in v22.10 Release via automation Jul 28, 2022
Member

@harrism harrism left a comment

Just one small nit.

cpp/src/io/text/multibyte_split.cu (outdated, resolved)
v22.10 Release automation moved this from PR-WIP to PR-Reviewer approved Jul 28, 2022
v22.10 Release automation moved this from PR-Reviewer approved to PR-Needs review Aug 1, 2022
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 1, 2022
@upsj upsj requested a review from a team August 1, 2022 19:31
@upsj upsj force-pushed the bugfix/multibyte_split_ranges branch from 12db7a5 to fe63370 Compare August 1, 2022 19:43
v22.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Aug 1, 2022
@upsj upsj added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 0 - Waiting on Author Waiting for author to respond to review labels Aug 1, 2022
@codecov

codecov bot commented Aug 1, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@e7e5f45).
The diff coverage is n/a.

❗ Current head de87d91 differs from pull request most recent head fe63370. Consider uploading reports for the commit fe63370 to get more accurate results

@@               Coverage Diff               @@
##             branch-22.10   #11371   +/-   ##
===============================================
  Coverage                ?   86.47%           
===============================================
  Files                   ?      144           
  Lines                   ?    22856           
  Branches                ?        0           
===============================================
  Hits                    ?    19765           
  Misses                  ?     3091           
  Partials                ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e7e5f45...fe63370.

@upsj
Contributor Author

upsj commented Aug 2, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2dc5c3f into rapidsai:branch-22.10 Aug 2, 2022
v22.10 Release automation moved this from PR-Reviewer approved to Done Aug 2, 2022
Labels
  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • bug: Something isn't working
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.

7 participants