[BUG] OOM in readParquet due to too large temp memory estimation in cuDF #11878

Closed · abellina opened this issue Oct 7, 2022 · 1 comment · Fixed by #11879
Assignees: abellina
Labels: bug (Something isn't working)

abellina commented Oct 7, 2022

We are seeing an issue where reading a zstd-compressed Parquet file triggers a very large allocation, far larger than GPU memory (on the order of 50 or 60 GB).

We have bisected it to a change in this PR: #11652.

Specifically, we see that value_or eagerly evaluates its argument in batched_decompress_temp_size (changed in the PR above), which is not what the author intended:

size_t batched_decompress_temp_size(compression_type compression,
                                    size_t num_chunks,
                                    size_t max_uncomp_chunk_size,
                                    size_t max_total_uncomp_size)
{
  size_t temp_size = 0;
  auto const nvcomp_status =
    batched_decompress_get_temp_size_ex(
      compression, num_chunks, max_uncomp_chunk_size, &temp_size, max_total_uncomp_size)
      .value_or(batched_decompress_get_temp_size(
        compression, num_chunks, max_uncomp_chunk_size, &temp_size));
...

So temp_size is set twice, the second time to the wrong (much larger) value. This leads to an OOM when nvcomp tries to allocate the temporary space for decompression.
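For illustration, here is a minimal, self-contained sketch of the problem and of one possible shape of a fix (the actual fix is in #11879 and may differ). The wrapper names, return types, and the sizes reported below are made-up stand-ins for the cuDF/nvcomp functions quoted above; the point is only that value_or's argument is evaluated unconditionally, whereas an explicit has_value() check keeps the fallback lazy:

#include <cstddef>
#include <iostream>
#include <optional>

// Hypothetical stand-ins for cuDF's nvcomp wrappers; the bodies are fake and
// exist only so the example runs. A nullopt from the "Ex" query would mean the
// size-aware API is unavailable for this compression type.
std::optional<int> get_temp_size_ex(std::size_t num_chunks,
                                    std::size_t max_uncomp_chunk_size,
                                    std::size_t* temp_size,
                                    std::size_t max_total_uncomp_size)
{
  *temp_size = max_total_uncomp_size;  // pretend the Ex query returns a tight estimate
  return 0;                            // pretend success
}

int get_temp_size(std::size_t num_chunks, std::size_t max_uncomp_chunk_size, std::size_t* temp_size)
{
  *temp_size = 64 * num_chunks * max_uncomp_chunk_size;  // made-up, deliberately huge estimate
  return 0;
}

std::size_t temp_size_buggy(std::size_t num_chunks, std::size_t max_chunk, std::size_t total)
{
  std::size_t temp_size = 0;
  // value_or evaluates its argument unconditionally, so the fallback runs and
  // overwrites temp_size even though the Ex query already succeeded.
  get_temp_size_ex(num_chunks, max_chunk, &temp_size, total)
    .value_or(get_temp_size(num_chunks, max_chunk, &temp_size));
  return temp_size;
}

std::size_t temp_size_fixed(std::size_t num_chunks, std::size_t max_chunk, std::size_t total)
{
  std::size_t temp_size = 0;
  // Only fall back when the Ex query is actually unavailable.
  if (!get_temp_size_ex(num_chunks, max_chunk, &temp_size, total).has_value()) {
    get_temp_size(num_chunks, max_chunk, &temp_size);
  }
  return temp_size;
}

int main()
{
  std::cout << "buggy temp_size: " << temp_size_buggy(8190, 1043752, 229459225) << '\n';
  std::cout << "fixed temp_size: " << temp_size_fixed(8190, 1043752, 229459225) << '\n';
}

On a 64-bit build the fixed version prints a size close to the total uncompressed size, while the buggy version prints a vastly larger number, mirroring the overallocation described above.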

abellina added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Oct 7, 2022
jbrennan333 (Contributor) commented:

I dug up some numbers from when we first encountered the problem with nvcompBatchedZstdDecompressGetTempSize. This was before nvcomp was updated to add nvcompBatchedZstdDecompressGetTempSizeEx. For q23b from TPC-DS running at 3 TB, I saw numbers like this:

app-20220527211217-0175/12/stdout:ZSTD: num_chunks: 8190, max_uncomp_chunk_size: 1043752 total_decomp_size: 229459225 temp_size: 17107787520
app-20220527211217-0175/12/stdout:ZSTD: num_chunks: 8187, max_uncomp_chunk_size: 1043504 total_decomp_size: 229356969 temp_size: 17097463344
app-20220527211217-0175/12/stdout:ZSTD: num_chunks: 8190, max_uncomp_chunk_size: 1043500 total_decomp_size: 229492015 temp_size: 17103659760
app-20220527211217-0175/12/stdout:ZSTD: num_chunks: 8193, max_uncomp_chunk_size: 1048556 total_decomp_size: 229400008 temp_size: 17192769288
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8127, max_uncomp_chunk_size: 1043216 total_decomp_size: 222111256 temp_size: 16967543472
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8169, max_uncomp_chunk_size: 1043620 total_decomp_size: 228883314 temp_size: 17061786936
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8118, max_uncomp_chunk_size: 1043328 total_decomp_size: 221870704 temp_size: 16950581280
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8118, max_uncomp_chunk_size: 1043380 total_decomp_size: 221966105 temp_size: 16951425552
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8172, max_uncomp_chunk_size: 1043304 total_decomp_size: 229049843 temp_size: 17062884864

So we were trying to allocate ~17 GB when all we needed was ~230 MB.

abellina self-assigned this on Oct 7, 2022
bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024