Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Performance regression in cuDF after #14679 #14886

Closed
abellina opened this issue Jan 25, 2024 · 11 comments
Closed

[BUG] Performance regression in cuDF after #14679 #14886

abellina opened this issue Jan 25, 2024 · 11 comments
Labels
bug Something isn't working Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments

@abellina
Copy link
Contributor

We are seeing some really big regressions in our NDS benchmarks after this PR went in: #14679. We verified that reverting this commit fixes the regression.

Platform: 'Spark 2a', Spark version: '3.3.1'
--------------------------------------------------------------------
Name = q14a
Means = 9224.6, 10406.8
Time diff = -1182.199999999999
Speedup = 0.8864011992158974
T-Test (test statistic, p value, df) = -4.761890107264471, 0.0014233443564802676, 8.0
T-Test Confidence Interval = -1754.6949604102729, -609.7050395897251
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = q23a
Means = 17216.8, 18474.4
Time diff = -1257.6000000000022
Speedup = 0.9319274238946866
T-Test (test statistic, p value, df) = -5.767490738726307, 0.00042048826337266765, 8.0
T-Test Confidence Interval = -1760.4236596455162, -754.7763403544881
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = q67
Means = 26479.4, 3621773.8
Time diff = -3595294.4
Speedup = 0.0073111689084503296
T-Test (test statistic, p value, df) = -200.13173437293509, 4.348886771193257e-16, 8.0
T-Test Confidence Interval = -3636720.9322740515, -3553867.8677259483
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = q81
Means = 3188.2, 4205.8
Time diff = -1017.6000000000004
Speedup = 0.7580484093394835
T-Test (test statistic, p value, df) = -6.738351796028844, 0.0001467977875603567, 8.0
T-Test Confidence Interval = -1365.843884979839, -669.3561150201617
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------

When running an nsys trace, and looking at traces, it's all in the groupby, specifically in this for_each call:

void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::for_each_f<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<cudf::groupby::detail::hash::compute_single_pass_aggs_fn<concurrent_unordered_map<int, int, cudf::experimental::row::hash::device_row_hasher<cudf::hashing::detail::default_hash, cudf::nullate::DYNAMIC>, const cudf::experimental::row::equality::device_row_comparator<(bool)0, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator>, default_allocator<thrust::pair<int, int>>>>, void>>, int>, thrust::cuda_cub::for_each_f<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<cudf::groupby::detail::hash::compute_single_pass_aggs_fn<concurrent_unordered_map<int, int, cudf::experimental::row::hash::device_row_hasher<cudf::hashing::detail::default_hash, cudf::nullate::DYNAMIC>, const cudf::experimental::row::equality::device_row_comparator<(bool)0, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator>, default_allocator<thrust::pair<int, int>>>>, void>>, int>(T2, T3)
Begins: 82.6093s
Ends: 93.4495s (+10.840 s)
grid:  <<<17966, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 68
Local Memory Per Thread: 0 bytes
Local Memory Total: 155,713,536 bytes
Shared Memory executed: 32,768 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 37.5 %
Launched from thread: 3473426
Latency: ←7.040 μs
Correlation ID: 760769
Stream: Stream 52
@abellina abellina added bug Something isn't working Needs Triage Need team to review and classify Performance Performance related issue Spark Functionality that helps Spark RAPIDS labels Jan 25, 2024
@GregoryKimball GregoryKimball changed the title [BUG] Performance regression in cuDF after https://github.com/rapidsai/cudf/pull/14679 [BUG] Performance regression in cuDF after #14679 Jan 25, 2024
@GregoryKimball
Copy link
Contributor

@SurajAralihalli would you do a quick check to see if there is a regression in our groupby nvbenchmarks?

@abellina
Copy link
Contributor Author

These look to be simple sum aggregations on either uint64_t columns or int64_t columns. The actual spark type is a decimal, which we chunk into 4 components to perform overflow checking, but cuDF is operating on uint64_t/int64_t only here.

@SurajAralihalli
Copy link
Contributor

@SurajAralihalli would you do a quick check to see if there is a regression in our groupby nvbenchmarks?

I don't observe a significant variance in the groupby nvbenchmarks. I'll investigate further.

# groupby_nunique

## [0] NVIDIA RTX A5000

|  T  |  num_rows  |  null_probability  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-----|------------|--------------------|------------|-------------|------------|-------------|-------------|---------|----------|
| I32 |    2^12    |         0          | 324.202 us |       1.27% | 323.106 us |       1.28% |   -1.097 us |  -0.34% |   PASS   |
| I32 |    2^16    |         0          | 493.699 us |       0.89% | 500.726 us |       0.98% |    7.027 us |   1.42% |   FAIL   |
| I32 |    2^20    |         0          |   2.859 ms |       0.77% |   2.861 ms |       0.80% |    2.629 us |   0.09% |   PASS   |
| I32 |    2^24    |         0          |  54.140 ms |       0.41% |  54.183 ms |       0.46% |   43.101 us |   0.08% |   PASS   |
| I32 |    2^12    |        0.5         | 401.505 us |       1.82% | 402.353 us |       2.01% |    0.848 us |   0.21% |   PASS   |
| I32 |    2^16    |        0.5         | 634.529 us |       0.70% | 642.765 us |       0.79% |    8.236 us |   1.30% |   FAIL   |
| I32 |    2^20    |        0.5         |   2.960 ms |       1.20% |   2.953 ms |       1.39% |   -7.593 us |  -0.26% |   PASS   |
| I32 |    2^24    |        0.5         |  47.267 ms |       0.39% |  47.113 ms |       0.22% | -153.787 us |  -0.33% |   FAIL   |
| I64 |    2^12    |         0          | 396.544 us |       2.11% | 393.066 us |       2.35% |   -3.478 us |  -0.88% |   PASS   |
| I64 |    2^16    |         0          | 574.812 us |       1.03% | 570.253 us |       0.59% |   -4.559 us |  -0.79% |   FAIL   |
| I64 |    2^20    |         0          |   3.691 ms |       0.88% |   3.691 ms |       0.80% |   -0.588 us |  -0.02% |   PASS   |
| I64 |    2^24    |         0          |  69.679 ms |       0.14% |  69.604 ms |       0.16% |  -74.566 us |  -0.11% |   PASS   |
| I64 |    2^12    |        0.5         | 453.885 us |       1.54% | 453.695 us |       1.51% |   -0.190 us |  -0.04% |   PASS   |
| I64 |    2^16    |        0.5         | 695.165 us |       0.67% | 693.459 us |       0.67% |   -1.706 us |  -0.25% |   PASS   |
| I64 |    2^20    |        0.5         |   3.253 ms |       1.06% |   3.259 ms |       1.19% |    6.656 us |   0.20% |   PASS   |
| I64 |    2^24    |        0.5         |  52.952 ms |       0.27% |  52.870 ms |       0.25% |  -82.386 us |  -0.16% |   PASS   |

# groupby_rank

## [0] NVIDIA RTX A5000

|  rank_method  |  data_size  |  is_sorted  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------------|-------------|-------------|------------|-------------|------------|-------------|-------------|---------|----------|
|    AVERAGE    |   1000000   |      0      |   3.407 ms |       0.77% |   3.413 ms |       1.22% |    6.699 us |   0.20% |   PASS   |
|    AVERAGE    |  10000000   |      0      |  30.715 ms |       0.23% |  30.674 ms |       0.33% |  -41.020 us |  -0.13% |   PASS   |
|    AVERAGE    |  100000000  |      0      | 349.385 ms |       0.06% | 348.759 ms |       0.10% | -626.409 us |  -0.18% |   FAIL   |
|    AVERAGE    |   1000000   |      1      | 312.883 us |       1.22% | 311.793 us |       1.30% |   -1.089 us |  -0.35% |   PASS   |
|    AVERAGE    |  10000000   |      1      |   2.151 ms |       0.21% |   2.150 ms |       0.24% |   -0.593 us |  -0.03% |   PASS   |
|    AVERAGE    |  100000000  |      1      |  20.555 ms |       0.05% |  20.559 ms |       0.04% |    3.891 us |   0.02% |   PASS   |
|     DENSE     |   1000000   |      0      |   3.275 ms |       0.77% |   3.289 ms |       1.21% |   13.529 us |   0.41% |   PASS   |
|     DENSE     |  10000000   |      0      |  29.757 ms |       0.28% |  29.817 ms |       0.19% |   60.778 us |   0.20% |   FAIL   |
|     DENSE     |  100000000  |      0      | 338.734 ms |       0.09% | 338.521 ms |       0.10% | -212.527 us |  -0.06% |   PASS   |
|     DENSE     |   1000000   |      1      | 166.441 us |       1.81% | 166.363 us |       1.80% |   -0.078 us |  -0.05% |   PASS   |
|     DENSE     |  10000000   |      1      |   1.166 ms |       0.32% |   1.165 ms |       0.25% |   -1.340 us |  -0.11% |   PASS   |
|     DENSE     |  100000000  |      1      |  11.080 ms |       0.05% |  11.080 ms |       0.06% |   -0.089 us |  -0.00% |   PASS   |
|     FIRST     |   1000000   |      0      |   3.223 ms |       0.73% |   3.229 ms |       1.13% |    6.633 us |   0.21% |   PASS   |
|     FIRST     |  10000000   |      0      |  29.518 ms |       0.34% |  29.534 ms |       0.32% |   16.806 us |   0.06% |   PASS   |
|     FIRST     |  100000000  |      0      | 335.483 ms |       0.12% | 334.901 ms |       0.15% | -581.726 us |  -0.17% |   FAIL   |
|     FIRST     |   1000000   |      1      | 106.754 us |       1.95% | 106.871 us |       2.49% |    0.117 us |   0.11% |   PASS   |
|     FIRST     |  10000000   |      1      | 791.476 us |       0.29% | 791.192 us |       0.31% |   -0.284 us |  -0.04% |   PASS   |
|     FIRST     |  100000000  |      1      |   7.571 ms |       0.07% |   7.571 ms |       0.09% |   -0.015 us |  -0.00% |   PASS   |
|      MAX      |   1000000   |      0      |   3.291 ms |       0.85% |   3.302 ms |       1.27% |   10.985 us |   0.33% |   PASS   |
|      MAX      |  10000000   |      0      |  29.963 ms |       0.26% |  30.076 ms |       0.27% |  112.820 us |   0.38% |   FAIL   |
|      MAX      |  100000000  |      0      | 340.774 ms |       0.18% | 340.540 ms |       0.17% | -233.937 us |  -0.07% |   PASS   |
|      MAX      |   1000000   |      1      | 161.971 us |       2.36% | 161.766 us |       2.29% |   -0.204 us |  -0.13% |   PASS   |
|      MAX      |  10000000   |      1      |   1.168 ms |       0.31% |   1.167 ms |       0.50% |   -0.949 us |  -0.08% |   PASS   |
|      MAX      |  100000000  |      1      |  11.191 ms |       0.07% |  11.192 ms |       0.09% |    0.956 us |   0.01% |   PASS   |
|      MIN      |   1000000   |      0      |   3.300 ms |       0.80% |   3.306 ms |       1.21% |    5.953 us |   0.18% |   PASS   |
|      MIN      |  10000000   |      0      |  30.032 ms |       0.30% |  30.058 ms |       0.22% |   26.444 us |   0.09% |   PASS   |
|      MIN      |  100000000  |      0      | 341.492 ms |       0.15% | 340.956 ms |       0.21% | -535.645 us |  -0.16% |   FAIL   |
|      MIN      |   1000000   |      1      | 167.073 us |       1.89% | 165.808 us |       1.89% |   -1.265 us |  -0.76% |   PASS   |
|      MIN      |  10000000   |      1      |   1.167 ms |       0.25% |   1.167 ms |       0.39% |    0.468 us |   0.04% |   PASS   |
|      MIN      |  100000000  |      1      |  11.087 ms |       0.07% |  11.085 ms |       0.07% |   -2.026 us |  -0.02% |   PASS   |

# Summary

- Total Matches: 100
  - Pass    (diff <= min_noise): 91
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  9

@jlowe
Copy link
Member

jlowe commented Jan 25, 2024

I have a repro case outside of Spark just using the cudf Java APIs where an aggregation call is taking over 15 seconds. I'll see if I can get this boiled down to a pure C++ repro.

@jlowe
Copy link
Member

jlowe commented Jan 25, 2024

For reference, here's the repro steps that I have using just the cudf Java APIs (still working on the C++-only repro). Attached is a Parquet file with the data used in this test which consists of a two columns, a string column and a UINT32 column.

import ai.rapids.cudf.*;

Table t = Table.readParquet(new java.io.File("testdata.parquet"));
Table.GroupByOperation g = t.groupBy(0);
Table result = g.aggregate(GroupByAggregation.sum().onColumn(1));

testdata.parquet.gz

The test simply groups by the string column and does a sum aggregation on the UINT32 column. It takes a few milliseconds to run this before #14679 and over 15 seconds afterwards.

@jlowe
Copy link
Member

jlowe commented Jan 25, 2024

Finally got the C++ repro for this. This program runs sub-second before #14679 and takes tens of seconds afterwards:

#include <cudf/groupby.hpp>
#include <cudf/io/parquet.hpp>

int main(int argc, char** argv) {
  auto read_opts = cudf::io::parquet_reader_options_builder(cudf::io::source_info{"./testdata.parquet"}).build();
  auto read_result = cudf::io::read_parquet(read_opts);
  auto t = read_result.tbl->view();
  cudf::groupby::groupby grouper(cudf::table_view({t.column(0)}), cudf::null_policy::INCLUDE, cudf::sorted::NO);
  std::vector<cudf::groupby::aggregation_request> requests;
  requests.emplace_back(cudf::groupby::aggregation_request());
  requests[0].values = t.column(1);
  requests[0].aggregations.push_back(cudf::make_sum_aggregation<cudf::groupby_aggregation>());
  auto result = grouper.aggregate(requests);
  return 0;
}

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Jan 25, 2024

Thank you @jlowe for sharing this clear repro. Would you please share a bit about the contents of the parquet input file?

I pulled together a snapshot of our recent microbenchmarks and they do not show the sign of a large regression. There must be a case we are missing.

image

@jlowe
Copy link
Member

jlowe commented Jan 25, 2024

I didn't closely examine the contents of the parquet file. This is just data captured in the middle of query execution from an NDS run at scale factor 100g just before a long aggregation call. The original data had 9 grouping columns and 5 aggregation columns, but I was able to still reproduce it with just one grouping column and one aggregation column. I did not try reducing the number of rows. It may not be sensitive to the data at all, but I wanted to post what I had so far.

@jlowe
Copy link
Member

jlowe commented Jan 25, 2024

Also note that the file is attached at this comment: #14886 (comment) so anyone can download the data, compile the test program, and reproduce the results.

@SurajAralihalli
Copy link
Contributor

SurajAralihalli commented Jan 26, 2024

Thanks Json, the performance regression can be seen in bench_groupby_nvsum2 benchmarks for uint32 and uint64 types without parquet read. Once resolved, we can incorporate it into cudf benchmarks.

|  T  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |       %Diff |  Status  |
|-----|------------|-------------|------------|-------------|------------|-------------|----------|
| I32 |   1.640 ms |       2.89% |   1.632 ms |       6.12% |  -7.969 us |      -0.49% |   PASS   |
| U32 |   1.630 ms |       1.82% |   17.867 s |        inf% |   17.866 s | 1096222.72% |   FAIL   |
| I64 |   1.763 ms |      24.22% |   1.678 ms |      14.77% | -84.945 us |      -4.82% |   PASS   |
| U64 |   1.715 ms |      20.62% |   17.942 s |        inf% |   17.941 s | 1046062.96% |   FAIL   |

raydouglass pushed a commit that referenced this issue Jan 29, 2024
This pull request reverses the modifications made to the sum/product aggregation target type, ensuring it always produces int64. The changes implemented by  PR [14679](#14679) which led to degraded performance when the aggregation column had an unsigned type, are reverted. Additional details can be found in the issue [14886](#14886).

Authors:
   - Suraj Aralihalli (https://github.com/SurajAralihalli)

Approvers:
   - David Wendt (https://github.com/davidwendt)
   - Nghia Truong (https://github.com/ttnghia)
   - Karthikeyan (https://github.com/karthikeyann)
@GregoryKimball
Copy link
Contributor

I believe this is safe to close

@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 12, 2024
…operators to detail namespace. (#14962)

This PR does a thorough refactoring of `device_atomics.cuh`.

- I moved all atomic-related functions to `cudf::detail::` (making this an API-breaking change, but most likely a low-impact break)
- I added all missing operators for natively supported types to `atomicAdd`, `atomicMin`, `atomicMax`, etc. as discussed in #10149 and #14907.
  - This should prevent fallback to the `atomicCAS` path for types that are natively supported for those atomic operators, which we suspect as the root cause of the performance regression in #14886.
- I kept `atomicAdd` rather than `cudf::detail::atomic_add` in locations where a native CUDA overload exists, and the same for min/max/CAS operations. Aggregations are the only place where we use the special overloads. We were previously calling the native CUDA function rather than our special overloads in many cases, so I retained the previous behavior. This avoids including the additional headers that implement an unnecessary level of wrapping for natively supported overloads.
- I enabled native 2-byte CAS operations (on `unsigned short int`) that eliminate the do-while loop and extra alignment-checking logic
  - The CUDA docs don't state this, but some forum posts claim this is only supported by compute capability 7.0+. We now have 7.0 as a lower bound for RAPIDS so I'm not concerned by this as long as builds/tests pass.
- I improved/cleaned the documentation and moved around some code so that the operators were in a logical order.
- I assessed the existing tests and it looks like all the types are being covered. I'm not sure if there is a good way to enforce that certain types (like `uint64_t`) are passing through native `atomicAdd` calls.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Suraj Aralihalli (https://github.com/SurajAralihalli)

URL: #14962
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Performance Performance related issue Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Development

No branches or pull requests

5 participants