Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed an issue with output chunking computation stemming from input chunking. #14889

Merged

Conversation

nvdbaranec
Copy link
Contributor

Fixes #14883

The core issue was that the output chunking code was expecting all columns to have terminating pages that end in the same row count. Previously this was the case because we always processed entire row groups. But now with the subrowgroup reader, we can split on page boundaries that cause a jagged max row index for different columns. Example:

             0       100             200
Col A     [-----------][--------------]      300
Col B     [-----------][----------------------]

The input chunking would have computed a max row index of 200 for the subpass. But when computing the output chunks, there was code that would have tried finding where row 300 was in column A, resulting in an out-of-bounds read.

The fix is simply to cap the max row seen for column B to be the max expected row for the subpass.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

…olumns resulting from input pass splitting would result in

incorrect indexing when computing output chunking.
@nvdbaranec nvdbaranec added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue 5 - DO NOT MERGE Hold off on merging; see PR for details non-breaking Non-breaking change labels Jan 25, 2024
@nvdbaranec nvdbaranec requested a review from a team as a code owner January 25, 2024 20:21
@nvdbaranec
Copy link
Contributor Author

Leaving the do-not-merge label on so I can soak this through the spark integration tests.

@nvdbaranec nvdbaranec requested review from vuule, ttnghia and davidwendt and removed request for hyperbolic2346 and shrshi January 25, 2024 20:23
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit and a const comment

cpp/src/io/parquet/reader_impl_chunking.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_chunking.cu Outdated Show resolved Hide resolved
@nvdbaranec
Copy link
Contributor Author

Spark integration tests passed.

@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jan 25, 2024
@nvdbaranec
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 821f4de into rapidsai:branch-24.02 Jan 25, 2024
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] Memcheck error found in PARQUET_TEST ParquetChunkedReaderInputLimitTest.Mixed
4 participants