Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix invalid memory access in Parquet reader #14637

Merged
merged 14 commits into from
Dec 19, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Dec 14, 2023

Description

Fixes #14633

When reading files in multiple passes, some pointer fields in ColumnChunkDesc that point to transient memory are not cleared out at the end of each pass. This can lead to trying to dereference deallocated memory during Parquet reader string preprocessing.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Dec 14, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 14, 2023
@etseidl
Copy link
Contributor Author

etseidl commented Dec 14, 2023

@nvdbaranec this PR addresses the issue, but is not the most elegant fix. I tried to find a location to zero out the pointers where I could piggyback on an existing H2D copy of the chunks, but didn't have much luck. Could you please take a look and suggest a better fix?

@davidwendt davidwendt added bug Something isn't working 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Dec 14, 2023
@davidwendt
Copy link
Contributor

/ok to test

@etseidl
Copy link
Contributor Author

etseidl commented Dec 14, 2023

An alternative that doesn't require copying the chunks array to the device is to zero out the pointers at the beginning of gpuComputeStringPageBounds.

Also, the read is somewhat benign since the result of the dereference is never used. This shouldn't impact users if it went out in 23.12.

Ran the full PARQUET_TEST through compute-sanitizer after the change and got no additional errors reported.

@vuule vuule self-requested a review December 15, 2023 00:13
@etseidl
Copy link
Contributor Author

etseidl commented Dec 15, 2023

Yet another option may be to add another boolean argument to setupLocalPageInfo. This would allow passing is_decode_step as false but still get the early return for non-overlapping pages. Then we'd avoid the bad pointer dereference altogether.

@davidwendt
Copy link
Contributor

/ok to test

@etseidl etseidl marked this pull request as ready for review December 18, 2023 17:26
@etseidl etseidl requested a review from a team as a code owner December 18, 2023 17:26
@vuule
Copy link
Contributor

vuule commented Dec 18, 2023

/ok to test

@@ -616,7 +616,9 @@ __global__ void __launch_bounds__(preprocess_block_size) gpuComputeStringPageBou

// setup page info
auto const mask = BitOr(decode_kernel_mask::STRING, decode_kernel_mask::DELTA_BYTE_ARRAY);
if (!setupLocalPageInfo(s, pp, chunks, min_row, num_rows, mask_filter{mask}, true)) { return; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this PR change is_decode_step value in some of these calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string preprocessing was passing that as true, leading the setup call to believe the output buffers were valid and thus accessing invalid memory. With the new flag true and the old flag false, we get the behavior that was originally desired, but can now skip the bad pointer arithmetic.

etseidl and others added 2 commits December 18, 2023 13:16
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
@ttnghia
Copy link
Contributor

ttnghia commented Dec 18, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Dec 18, 2023

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two adjacent bool parameters are a bit sketchy, but clean up can be a separate PR

Comment on lines 1390 to 1391
// if we're in the decoding step, jump directly to the first
// value we care about
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment needs to be updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could change this to move the is_bonds_step test inside the else block...I think all that's necessary is to just not zero out those values.

@etseidl
Copy link
Contributor Author

etseidl commented Dec 18, 2023

Two adjacent bool parameters are a bit sketchy, but clean up can be a separate PR

Well, it's a tri-state: preprocessing (false false), string bounds (false true), decode (true false). Could add an enum...

@ttnghia
Copy link
Contributor

ttnghia commented Dec 18, 2023

Well, it's a tri-state: preprocessing (false false), string bounds (false true), decode (true false). Could add an enum...

Yeah an enum (with comment) should be much clearer, instead of just true/false.

@vuule
Copy link
Contributor

vuule commented Dec 19, 2023

The change to enum is great. No need to decipher the steps.

@vuule
Copy link
Contributor

vuule commented Dec 19, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Dec 19, 2023

/merge

@rapids-bot rapids-bot bot merged commit cf13b86 into rapidsai:branch-24.02 Dec 19, 2023
67 checks passed
@etseidl etseidl deleted the strings_error branch December 19, 2023 21:52
abellina pushed a commit to abellina/cudf that referenced this pull request Dec 28, 2023
Fixes rapidsai#14633

When reading files in multiple passes, some pointer fields in `ColumnChunkDesc` that point to transient memory are not cleared out at the end of each pass. This can lead to trying to dereference deallocated memory during Parquet reader string preprocessing.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#14637
abellina pushed a commit to abellina/cudf that referenced this pull request Jan 17, 2024
Fixes rapidsai#14633

When reading files in multiple passes, some pointer fields in `ColumnChunkDesc` that point to transient memory are not cleared out at the end of each pass. This can lead to trying to dereference deallocated memory during Parquet reader string preprocessing.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#14637
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] Memcheck error cudf::io::parquet::detail::<unnamed>::gpuComputeStringPageBounds<unsigned char>()
4 participants