
Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. #15079

Merged: 13 commits into rapidsai:branch-24.04 on Mar 7, 2024

Conversation

@nvdbaranec (Contributor) commented on Feb 16, 2024:

Addresses #14314

This PR adds a new interface to cuIO that controls where host memory allocations come from. It adds two core functions:

rmm::host_async_resource_ref set_host_memory_resource(rmm::host_async_resource_ref mr);
rmm::host_async_resource_ref get_host_memory_resource();

cudf::io::hostdevice_vector was previously implemented in terms of a thrust::host_vector<> that explicitly uses an allocator called pinned_host_vector. I copied that and made a new class called rmm_host_vector, which takes any host_resource_ref. This probably makes pinned_host_vector obsolete.
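For illustration, here is a minimal usage sketch of the new interface. The header path and the exact namespace placement are assumptions made for the sketch, not details confirmed in this PR description.

#include <cudf/io/memory_resource.hpp>             // assumed header for the new functions
#include <rmm/mr/pinned_host_memory_resource.hpp>

void use_pinned_host_allocations()
{
  // The resource must outlive every cuIO call that allocates from it, hence static here.
  static rmm::mr::pinned_host_memory_resource pinned_mr;

  // set_host_memory_resource returns the previously installed resource ref.
  auto previous = cudf::io::set_host_memory_resource(pinned_mr);

  // ... run cuIO reads; host staging buffers now come from pinned_mr ...

  // Restore the old resource when done.
  cudf::io::set_host_memory_resource(previous);
}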

Parquet benchmarks have a new command-line option, --cuio_host_mem, that lets you toggle between two modes:

--cuio_host_mem pinned              (the default: an unpooled, pinned memory source)
--cuio_host_mem pinned_pool         (the pooled/pinned resource)

The ultimate intent here is to reduce the CPU-side overhead of the setup code that runs before the decode kernels in the Parquet reader. The wins are pretty significant for our faster kernels (that is, where we are less dominated by GPU time).

Edit: Updated to use newly minted resource ref types from rmm itself. I also switched the type to be host_async_resource_ref even though in this case the user (thrust::host_vector) doesn't explicitly go through the async path. In addition, the pageable memory path (an experimental feature) has been removed.

Pinned

| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |     25x | 20.443 ms | 0.45% | 20.438 ms | 0.45% |      26268890178 |         1.072 GiB |       498.123 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     26x | 19.571 ms | 0.42% | 19.565 ms | 0.42% |      27440146729 |       756.210 MiB |       161.438 MiB |
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |     28x | 18.150 ms | 0.18% | 18.145 ms | 0.18% |      29587789525 |       602.424 MiB |        27.720 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |     29x | 17.306 ms | 0.37% | 17.300 ms | 0.37% |      31032523423 |       597.181 MiB |        14.403 MiB |

Pooled/pinned

| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |    117x | 17.258 ms | 0.50% | 17.254 ms | 0.50% |      31115706389 |         1.072 GiB |       498.123 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     31x | 16.413 ms | 0.43% | 16.408 ms | 0.43% |      32719609450 |       756.210 MiB |       161.438 MiB |
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |    576x | 14.885 ms | 0.58% | 14.881 ms | 0.58% |      36077859564 |       602.519 MiB |        27.720 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |     36x | 14.069 ms | 0.48% | 14.065 ms | 0.48% |      38171646940 |       597.243 MiB |        14.403 MiB |

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@nvdbaranec added the libcudf, cuIO, improvement, and non-breaking labels on Feb 16, 2024
@nvdbaranec requested a review from a team as a code owner on February 16, 2024 21:33
@nvdbaranec marked this pull request as draft on February 16, 2024 21:33
@GregoryKimball (Contributor) commented:

Nice work @nvdbaranec!!

@hyperbolic2346 (Contributor) commented:

I like this ability. My only question is whether we should follow the existing pattern of an optional memory resource passed into each function, or add this as a set/get pair.

table_with_metadata read_parquet(
  parquet_reader_options const& options,
  rmm::cuda_stream_view stream        = cudf::get_default_stream(),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

Maybe this becomes:

table_with_metadata read_parquet(
  parquet_reader_options const& options,
  rmm::cuda_stream_view stream        = cudf::get_default_stream(),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(),
  cudf::host_resource_ref* host_mr    = rmm::mr::pinned_memory_resource);

I don't know where all this applies or how much trouble it would be to pass it through.

@sameerz requested a review from harrism on February 21, 2024 00:34
@harrism (Member) left a comment:
I'm really surprised the old host_memory_resource works with the pool. I added rmm::mr::pinned_host_memory_resource (which implements the cuda::mr::async_memory_resource and cuda::mr::memory_resource concepts instead of deriving from host_memory_resource) specifically to enable use with pool_memory_resource. Please use it instead of the old one.
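As a hedged sketch of the combination described above (the pool size, lifetimes, and the cudf header are illustrative assumptions, not code taken from this PR), the pooled-pinned setup might look like:

#include <cudf/io/memory_resource.hpp>             // assumed header for set_host_memory_resource
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

void use_pooled_pinned_host_allocations()
{
  using pooled_pinned = rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource>;

  // Upstream and pool are static so they outlive every cuIO call that allocates from them.
  static rmm::mr::pinned_host_memory_resource upstream;
  // 256 MiB initial pool size, chosen arbitrarily for this sketch.
  static pooled_pinned pool{&upstream, std::size_t{256} * 1024 * 1024};

  cudf::io::set_host_memory_resource(pool);
}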

(Resolved review threads on cpp/benchmarks/fixture/nvbench_fixture.hpp, cpp/include/cudf/detail/utilities/rmm_host_vector.hpp, and cpp/src/io/utilities/hostdevice_vector.hpp)
…ref instead of host_resource_ref. Removed the pageable-memory path entirely.
@nvdbaranec marked this pull request as ready for review on February 28, 2024 17:28
A Contributor left a comment:

comment (non-blocking): In an ideal (and not too distant) world, this entire file will be unnecessary.

One shouldn't need to define their own allocator or vector type. We should have a cuda::mr::allocator that can be constructed from a cuda::mr::resource_ref.

I understand not wanting to wait for that, but I just want to give you a heads up on what is coming.

@nvdbaranec (author) replied:
Sounds good. This is definitely worth replacing.

@vuule (Contributor) left a comment:
flush

(Resolved review threads on cpp/include/cudf/detail/utilities/rmm_host_vector.hpp, cpp/src/io/utilities/hostdevice_vector.hpp, cpp/benchmarks/fixture/nvbench_fixture.hpp, and cpp/src/io/utilities/config_utils.cpp)
…eallocate functions so that we can pass the correct stream.
@github-actions bot added the CMake label on Mar 6, 2024
/**
 * @brief Copy constructor
 */
rmm_host_allocator(rmm_host_allocator const& other) = default;
A Member commented:
In rmm::device_buffer and device_uvector we delete the copy constructor and copy-assignment operator because they don't allow specifying a stream. YMMV, just suggesting it may be good practice.

https://github.com/rapidsai/rmm/blob/f132d4b0daa976e1ec6cbcef24f5454fe510a394/include/rmm/device_buffer.hpp#L85
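To make the suggestion concrete, here is a minimal sketch of the pattern being referenced (an editor's illustration under assumed names, not code from this PR or from rmm): copies are deleted because a plain copy cannot carry a stream, so the only "copy" is a constructor that takes one explicitly.

#include <rmm/cuda_stream_view.hpp>

class stream_aware_buffer {
 public:
  // Plain copies are deleted: they provide no way to say which stream the copy runs on.
  stream_aware_buffer(stream_aware_buffer const&)            = delete;
  stream_aware_buffer& operator=(stream_aware_buffer const&) = delete;

  // The only way to "copy" is to state the stream on which the copy happens.
  stream_aware_buffer(stream_aware_buffer const& other, rmm::cuda_stream_view stream);
};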

@nvdbaranec (author) replied:

I think this would be better done as a followup. There are a number of places in cudf code using the assignment operator, and thrust itself hits the copy constructor for mysterious reasons. For example, just calling reserve on the wrapping thrust::host_vector causes it to happen (h_data.reserve(max_size);). Something is happening internally in thrust::detail::contiguous_storage.

@harrism (Member) left a comment:
Great to see this getting close. All of my remaining comments are non-blocking, so approving.

@nvdbaranec requested a review from vuule on March 7, 2024 20:32
@vuule (Contributor) left a comment:
Thank you for addressing the feedback! Looks very clean now 🔥

@nvdbaranec (author) commented:
/merge

@rapids-bot merged commit bd68b1c into rapidsai:branch-24.04 on Mar 7, 2024
73 checks passed
AyodeAwe pushed a commit that referenced this pull request Mar 8, 2024
## Description
Following #15079, we add a way to share the pinned pool in JNI with cuIO via the new method added by @nvdbaranec, `set_host_memory_resource`.

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

---------

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>