[BUG] Resolve parquet reader performance regression on V100 from #14167 #14415

GregoryKimball · 2023-11-15T18:01:30Z

Describe the bug
As a side effect of #14167 (see 23.10 release), we observed about 10-15% slower parquet reader benchmarks on DGX V100. This effect was observed to not impact DGX A100. However, Spark-RAPIDS reported a 4-5% slowdown in the NDS benchmarking suite, driven by changes in IO-bound benchmarks.

The changes in #14167 are not expected to impact performance at all. The difference in libcudf nvbenchmarks on V100 could be from a change in the compiler code gen, and the difference in Spark-RAPIDS NDS on A100 could relate to the multi-threaded PTDS (pre thread default stream) workflow in NDS.

This issue documents the performance data and results of investigations into the root cause.

Steps/Code to reproduce bug

On a DGX V100, you can see the difference using the this benchmark command:

./PARQUET_READER_NVBENCH --devices 0 --profile --benchmark parquet_read_io_compression --axis io=DEVICE_BUFFER --axis compression=NONE --axis cardinality=1000 --axis run_length=1

On commit b789d4ce3c090a3f25a8657d9a8582a1edb54f12 we see 1.376s time
On commit 2c19bf328ffefb97d17e5ae600197a4ea9ca4445 we see 1.572s time.

The difference is driven by longer execution time of the gpuDecodePageKernel, as observed in nsys profiling.

Possibly unrelated background includes the V100-only performance issue in #12577

Nsys profiles:
nsys profiles before and after.zip

The text was updated successfully, but these errors were encountered:

GregoryKimball · 2023-11-15T22:35:22Z

I collected some PTX from page_data.cu. The PTX goes from 16199 lines to 16456 lines and the raw diff is very noisy (22K lines).
page_data POST.ptx.txt
page_data PRE.ptx.txt

I wrote a script to reduce nuisance diffs from register count and code blocks id's and this diff was left:
diff_less_noise.txt

Possibly relevant? NVIDIA/cccl#1001

GregoryKimball · 2023-11-20T16:21:02Z

I did some quick testing on a V100 and the performance hotspot appears to be set_error_code:

  inline __device__ void set_error_code(decode_error err) volatile
  {
    cuda::atomic_ref<int32_t, cuda::thread_scope_block> ref{const_cast<int&>(error)};
    ref.fetch_or(static_cast<int32_t>(err), cuda::std::memory_order_relaxed);
  }

commenting out ref.fetch_or(... recovers performance, but disables error reporting and so is not an acceptable solution
dropping the inline hint did not recover performance
using the hint __noinline__ did not recover performance
removing volatile did not recover performance
changing the cuda::thread_scope did not affect performance
changing the cuda::std::memory_order did not affect performance

GregoryKimball · 2023-11-21T05:17:02Z

I continued testing the diff in #14167 and found that commenting out these two calls to set_error_code in gpuDecodeStream recovers the performance. If either one is present, we see the regression.

        if (cur > end) {
          s->set_error_code(decode_error::LEVEL_STREAM_OVERRUN);
          break;
        }
        if (level_run <= 1) {
          s->set_error_code(decode_error::INVALID_LEVEL_RUN);
          break;
        }

Since the effect only appears in the gpuDecodeStream kernel, it makes sense that this impacts list types:

However, this observation also suggests that the libcudf benchmark regressions on V100 may NOT have the same root cause as Spark-RAPIDS NDS regressions on A100. (because NDS does not have list types!!)

@mattahrens the observation of performance issues on V100 only for list types makes getting an A100 libcudf repro even more important!

GregoryKimball · 2023-11-23T04:31:13Z

Expanding on my last comment about set_error_code in gpuDecodeStream, if you comment out these two lines the performance regression is recovered.

        if (cur > end) {
          //s->set_error_code(decode_error::LEVEL_STREAM_OVERRUN);
          break;
        }
        if (level_run <= 1) {
          //s->set_error_code(decode_error::INVALID_LEVEL_RUN);
          break;
        }

Please note that this code path is never reached (verified with printf to be certain). However, the absence of these lines causes the following PTX diff:
EDITB.txt

Potentially fixes #14415 Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - David Wendt (https://github.com/davidwendt) URL: #14706

etseidl mentioned this issue Jan 3, 2024

Potential fix for peformance regression in #14415 #14706

Merged

3 tasks

rapids-bot bot closed this as completed in #14706 Jan 5, 2024

This was referenced Feb 23, 2024

[BUG] parquet reader::impl::decode_page_data error_code checking slowness #15122

Closed

Use hostdevice_vector in kernel_error to avoid the pageable copy #15140

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Resolve parquet reader performance regression on V100 from #14167 #14415

[BUG] Resolve parquet reader performance regression on V100 from #14167 #14415

GregoryKimball commented Nov 15, 2023 •

edited

Loading

GregoryKimball commented Nov 15, 2023 •

edited

Loading

GregoryKimball commented Nov 20, 2023 •

edited

Loading

GregoryKimball commented Nov 21, 2023 •

edited

Loading

GregoryKimball commented Nov 23, 2023 •

edited

Loading

[BUG] Resolve parquet reader performance regression on V100 from #14167 #14415

[BUG] Resolve parquet reader performance regression on V100 from #14167 #14415

Comments

GregoryKimball commented Nov 15, 2023 • edited Loading

GregoryKimball commented Nov 15, 2023 • edited Loading

GregoryKimball commented Nov 20, 2023 • edited Loading

GregoryKimball commented Nov 21, 2023 • edited Loading

GregoryKimball commented Nov 23, 2023 • edited Loading

GregoryKimball commented Nov 15, 2023 •

edited

Loading

GregoryKimball commented Nov 15, 2023 •

edited

Loading

GregoryKimball commented Nov 20, 2023 •

edited

Loading

GregoryKimball commented Nov 21, 2023 •

edited

Loading

GregoryKimball commented Nov 23, 2023 •

edited

Loading