Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix JSON multi-source reading when total source size exceeds INT_MAX bytes #15930

Merged
merged 22 commits into from
Jun 18, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Jun 5, 2024

Description

Fixes #15917.

  • Batched read and parse operations
  • Fail when any single source file exceeds INT_MAX bytes. This case will be handled with a chunked reader later.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jun 5, 2024
@shrshi shrshi added cuIO cuIO issue non-breaking Non-breaking change bug Something isn't working labels Jun 5, 2024
@shrshi shrshi changed the title JSON multi-source reading when total source size exceeds INT_MAX bytes Fix JSON multi-source reading when total source size exceeds INT_MAX bytes Jun 5, 2024
@shrshi shrshi marked this pull request as ready for review June 5, 2024 22:02
@shrshi shrshi requested a review from a team as a code owner June 5, 2024 22:03
@shrshi shrshi requested review from karthikeyann, davidwendt, elstehle and vuule and removed request for karthikeyann and davidwendt June 5, 2024 22:03
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
@github-actions github-actions bot added the CMake CMake build issue label Jun 13, 2024
@shrshi shrshi requested a review from vuule June 14, 2024 17:30
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
@shrshi shrshi requested a review from vuule June 17, 2024 20:45
cpp/include/cudf/io/types.hpp Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/tests/large_strings/json_tests.cpp Outdated Show resolved Hide resolved
Comment on lines 44 to 47
for (std::size_t i = 0; i < log_repetitions; i++) {
json_string = json_string + "\n" + json_string;
numrows <<= 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much time is this loop taking?
nit: newline in json_string + "\n" + json_string; may not be required because the json_string starts with \n.
This may open up optimization to speedup this loop. (reserve and += or append).

The loop took 2.5 seconds in my machine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. After reserving memory for the output json_string the loop runtime has reduced from 3.02 secs to 660 ms on the ipp1-3304 machine

Copy link
Contributor Author

@shrshi shrshi Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more stats on the test runtime after some quick nsys profiling -
read_json takes 20.94s (~96.9% of total runtime) - each read_batch call takes $1.47 \pm 0.321$ s, and concatenate takes 516ms. The json test is the slowest among all large strings tests now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we get the same coverage with smaller input?

Copy link
Contributor Author

@shrshi shrshi Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. For the given single source size, even two sources are sufficient to ensure batching.
EDIT: with two sources, the total runtime is 3.64s.

cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple of small suggestions.

outfile << json_string;

constexpr int num_sources = 10;
std::vector<std::string> filepaths(num_sources, filename);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of speeding up the test, we can save on the IO by creating a vector of source_infos from {json_string.c_str(), json_string.size()}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean constructing the source_info object from vector of json_string host buffers instead? I think the builder takes a single source_info object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that :)
The only tricky part is that you need to explicitly create host_buffers because a vector of strings is treated as a set of filepaths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, shaved off another ~4.5 seconds - current runtime is 16.645 s.

cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for iterating on the tests!

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
Great work!

@shrshi
Copy link
Contributor Author

shrshi commented Jun 18, 2024

/merge

@rapids-bot rapids-bot bot merged commit c83e5b3 into rapidsai:branch-24.08 Jun 18, 2024
73 of 88 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CMake CMake build issue cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Integer overflow errors in JSON reader when total source size exceeds INT_MAX bytes
3 participants