Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential fix for peformance regression in #14415 #14706

Merged
merged 4 commits into from
Jan 5, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jan 3, 2024

Description

Potentially fixes #14415

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Jan 3, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 3, 2024
@vuule vuule added bug Something isn't working cuIO cuIO issue Performance Performance related issue non-breaking Non-breaking change labels Jan 3, 2024
@vuule
Copy link
Contributor

vuule commented Jan 3, 2024

/ok to test

@etseidl
Copy link
Contributor Author

etseidl commented Jan 3, 2024

I'm hoping some enterprising soul can run the V100 benchmark as @GregoryKimball did here, as I lack access to a machine with that card. The resultant PTX code from this patch appears similar to the code with the lines commented out as was done here.

Edit: I scrounged up an AWS instance with a V100.
Current 24.02

## parquet_read_io_compression

### [0] Tesla V100-SXM2-16GB

|    io_type    | compression_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---------------|------------------|-------------|------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| DEVICE_BUFFER |             NONE |        1000 |          1 |      1x |  1.568 s |  inf% |  1.568 s |  inf% |        342429271 |       673.217 MiB |       162.287 MiB |

This PR:

## parquet_read_io_compression

### [0] Tesla V100-SXM2-16GB

|    io_type    | compression_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---------------|------------------|-------------|------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| DEVICE_BUFFER |             NONE |        1000 |          1 |      1x |  1.391 s |  inf% |  1.391 s |  inf% |        386060375 |       673.217 MiB |       162.287 MiB |

@vuule vuule self-requested a review January 4, 2024 02:50
@etseidl
Copy link
Contributor Author

etseidl commented Jan 4, 2024

Ran a few more benchmarks.

# parquet_read_io_compression

## [0] Tesla V100-SXM2-16GB

|  io_type  |  compression_type  |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |
|-----------|--------------------|---------------|--------------|------------|-------------|------------|-------------|----------------|---------|
| FILEPATH  |        NONE        |       0       |      1       |    1.625 s |       0.66% |    1.470 s |       0.56% | -154711.073 us |  -9.52% |
| FILEPATH  |        NONE        |     1000      |      1       |    1.596 s |       0.68% |    1.431 s |       0.34% | -165286.755 us | -10.36% |
| FILEPATH  |        NONE        |       0       |      32      |    1.335 s |       1.03% |    1.227 s |       0.35% | -108180.764 us |  -8.10% |
| FILEPATH  |        NONE        |     1000      |      32      |    1.325 s |       1.42% |    1.218 s |       0.38% | -106670.858 us |  -8.05% |

# Summary

- Total Matches: 4
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  4

# parquet_read_misc_options

## [0] Tesla V100-SXM2-16GB

|  column_selection  |  row_selection  |  str_to_categories  |  uses_pandas_metadata  |  timestamp_type  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |
|--------------------|-----------------|---------------------|------------------------|------------------|------------|-------------|------------|-------------|----------------|---------|
|        ALL         |       ALL       |         YES         |          YES           |      EMPTY       |    1.851 s |       0.37% |    1.738 s |       0.16% | -113446.582 us |  -6.13% |
|        ALL         |       ALL       |         YES         |           NO           |      EMPTY       |    1.849 s |       0.79% |    1.736 s |       0.27% | -112959.448 us |  -6.11% |
|        ALL         |       ALL       |         NO          |          YES           |      EMPTY       |    1.857 s |       0.64% |    1.744 s |       0.38% | -112959.587 us |  -6.08% |
|        ALL         |       ALL       |         NO          |           NO           |      EMPTY       |    1.857 s |       0.78% |    1.747 s |       0.40% | -109776.286 us |  -5.91% |

# Summary

- Total Matches: 4
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  4

# parquet_read_decode

## [0] Tesla V100-SXM2-16GB

|  data_type  |    io_type    |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |
|-------------|---------------|---------------|--------------|------------|-------------|------------|-------------|---------------|---------|
|    LIST     | DEVICE_BUFFER |       0       |      1       | 110.952 ms |       0.40% | 100.717 ms |       0.36% | -10235.271 us |  -9.22% |
|    LIST     | DEVICE_BUFFER |     1000      |      1       | 100.905 ms |       1.66% |  92.029 ms |       1.43% |  -8875.930 us |  -8.80% |
|    LIST     | DEVICE_BUFFER |       0       |      32      |  85.347 ms |       1.55% |  78.988 ms |       1.64% |  -6359.139 us |  -7.45% |
|    LIST     | DEVICE_BUFFER |     1000      |      32      |  86.960 ms |       1.44% |  81.678 ms |       1.95% |  -5281.288 us |  -6.07% |

# Summary

- Total Matches: 4
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  4

@GregoryKimball
Copy link
Contributor

Wow @etseidl it looks like you solved the mystery!!

@etseidl
Copy link
Contributor Author

etseidl commented Jan 4, 2024

Wow @etseidl it looks like you solved the mystery!!

Nah, you solved the mystery @GregoryKimball, I just submitted the PR 😅

@vuule
Copy link
Contributor

vuule commented Jan 4, 2024

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is the only spot where I got "creative" with the control flow. Looks good now.
Thank you for solving this!

@etseidl etseidl marked this pull request as ready for review January 4, 2024 22:36
@etseidl etseidl requested a review from a team as a code owner January 4, 2024 22:36
@davidwendt
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 9c7b05b into rapidsai:branch-24.02 Jan 5, 2024
77 checks passed
@etseidl etseidl deleted the fix_14415 branch January 5, 2024 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] Resolve parquet reader performance regression on V100 from #14167
5 participants