[BUG] OOM in readParquet due to too large temp memory estimation in cuDF #11878

Closed · abellina opened this issue Oct 7, 2022 · 1 comment · Fixed by #11879
Assignees: abellina
Labels: bug (Something isn't working)

abellina commented Oct 7, 2022

We are seeing an issue where reading a zstd-compressed Parquet file triggers a very large allocation, far larger than GPU memory (on the order of 50 or 60 GB).

We have bisected it to a change in this PR: #11652.

Specifically, we see that value_or eagerly evaluates its argument in batched_decompress_temp_size (changed in the PR above), which is not what the author intended:

size_t batched_decompress_temp_size(compression_type compression,
                                    size_t num_chunks,
                                    size_t max_uncomp_chunk_size,
                                    size_t max_total_uncomp_size)
{
  size_t temp_size = 0;
  auto const nvcomp_status =
    batched_decompress_get_temp_size_ex(
      compression, num_chunks, max_uncomp_chunk_size, &temp_size, max_total_uncomp_size)
      .value_or(batched_decompress_get_temp_size(
        compression, num_chunks, max_uncomp_chunk_size, &temp_size));
...

So temp_size is set twice, the second time to the wrong (much larger) value. This leads to an OOM when nvcomp tries to allocate the temporary space for decompression.
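For illustration, here is a minimal, self-contained sketch of the problem and of one possible shape of a fix (the actual fix is in #11879 and may differ). The wrapper names, return types, and the sizes reported below are made-up stand-ins for the cuDF/nvcomp functions quoted above; the point is only that value_or's argument is evaluated unconditionally, whereas an explicit has_value() check keeps the fallback lazy:

#include <cstddef>
#include <iostream>
#include <optional>

// Hypothetical stand-ins for cuDF's nvcomp wrappers; the bodies are fake and
// exist only so the example runs. A nullopt from the "Ex" query would mean the
// size-aware API is unavailable for this compression type.
std::optional<int> get_temp_size_ex(std::size_t num_chunks,
                                    std::size_t max_uncomp_chunk_size,
                                    std::size_t* temp_size,
                                    std::size_t max_total_uncomp_size)
{
  *temp_size = max_total_uncomp_size;  // pretend the Ex query returns a tight estimate
  return 0;                            // pretend success
}

int get_temp_size(std::size_t num_chunks, std::size_t max_uncomp_chunk_size, std::size_t* temp_size)
{
  *temp_size = 64 * num_chunks * max_uncomp_chunk_size;  // made-up, deliberately huge estimate
  return 0;
}

std::size_t temp_size_buggy(std::size_t num_chunks, std::size_t max_chunk, std::size_t total)
{
  std::size_t temp_size = 0;
  // value_or evaluates its argument unconditionally, so the fallback runs and
  // overwrites temp_size even though the Ex query already succeeded.
  get_temp_size_ex(num_chunks, max_chunk, &temp_size, total)
    .value_or(get_temp_size(num_chunks, max_chunk, &temp_size));
  return temp_size;
}

std::size_t temp_size_fixed(std::size_t num_chunks, std::size_t max_chunk, std::size_t total)
{
  std::size_t temp_size = 0;
  // Only fall back when the Ex query is actually unavailable.
  if (!get_temp_size_ex(num_chunks, max_chunk, &temp_size, total).has_value()) {
    get_temp_size(num_chunks, max_chunk, &temp_size);
  }
  return temp_size;
}

int main()
{
  std::cout << "buggy temp_size: " << temp_size_buggy(8190, 1043752, 229459225) << '\n';
  std::cout << "fixed temp_size: " << temp_size_fixed(8190, 1043752, 229459225) << '\n';
}

On a 64-bit build the fixed version prints a size close to the total uncompressed size, while the buggy version prints a vastly larger number, mirroring the overallocation described above.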

abellina added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Oct 7, 2022
jbrennan333 (Contributor) commented:

I dug up some numbers from when we first encountered the problem with nvcompBatchedZstdDecompressGetTempSize. This was before nvcomp was updated to add nvcompBatchedZstdDecompressGetTempSizeEx. For q23b from TPC-DS running at 3 TB, I saw numbers like this:

app-20220527211217-0175/12/stdout:ZSTD: num_chunks: 8190, max_uncomp_chunk_size: 1043752 total_decomp_size: 229459225 temp_size: 17107787520
app-20220527211217-0175/12/stdout:ZSTD: num_chunks: 8187, max_uncomp_chunk_size: 1043504 total_decomp_size: 229356969 temp_size: 17097463344
app-20220527211217-0175/12/stdout:ZSTD: num_chunks: 8190, max_uncomp_chunk_size: 1043500 total_decomp_size: 229492015 temp_size: 17103659760
app-20220527211217-0175/12/stdout:ZSTD: num_chunks: 8193, max_uncomp_chunk_size: 1048556 total_decomp_size: 229400008 temp_size: 17192769288
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8127, max_uncomp_chunk_size: 1043216 total_decomp_size: 222111256 temp_size: 16967543472
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8169, max_uncomp_chunk_size: 1043620 total_decomp_size: 228883314 temp_size: 17061786936
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8118, max_uncomp_chunk_size: 1043328 total_decomp_size: 221870704 temp_size: 16950581280
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8118, max_uncomp_chunk_size: 1043380 total_decomp_size: 221966105 temp_size: 16951425552
app-20220527211217-0175/4/stdout:ZSTD: num_chunks: 8172, max_uncomp_chunk_size: 1043304 total_decomp_size: 229049843 temp_size: 17062884864

So we were trying to allocate ~17 GB when all we needed was ~230 MB.

abellina self-assigned this on Oct 7, 2022
bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024