
Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. #15079

Merged: 13 commits into rapidsai:branch-24.04 on Mar 7, 2024

Conversation

@nvdbaranec (Contributor) commented on Feb 16, 2024:

Addresses #14314

This PR adds a new interface to cuIO that controls where host memory allocations come from. It adds two core functions:

rmm::host_async_resource_ref set_host_memory_resource(rmm::host_async_resource_ref mr);
rmm::host_async_resource_ref get_host_memory_resource();

cudf::io::hostdevice_vector was previously implemented in terms of a thrust::host_vector<> that explicitly uses an allocator called pinned_host_vector. I copied that and made a new class called rmm_host_vector, which takes any host_resource_ref. This probably makes pinned_host_vector obsolete.
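For illustration, here is a minimal usage sketch of the new interface. The header path and the exact namespace placement are assumptions made for the sketch, not details confirmed in this PR description.

#include <cudf/io/memory_resource.hpp>             // assumed header for the new functions
#include <rmm/mr/pinned_host_memory_resource.hpp>

void use_pinned_host_allocations()
{
  // The resource must outlive every cuIO call that allocates from it, hence static here.
  static rmm::mr::pinned_host_memory_resource pinned_mr;

  // set_host_memory_resource returns the previously installed resource ref.
  auto previous = cudf::io::set_host_memory_resource(pinned_mr);

  // ... run cuIO reads; host staging buffers now come from pinned_mr ...

  // Restore the old resource when done.
  cudf::io::set_host_memory_resource(previous);
}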

Parquet benchmarks have a new command-line option, --cuio_host_mem, that lets you toggle between two modes:

--cuio_host_mem pinned              (the default: an unpooled, pinned memory source)
--cuio_host_mem pinned_pool         (the pooled/pinned resource)

The ultimate intent here is to reduce the CPU-side overhead of the setup code that runs before the decode kernels in the Parquet reader. The wins are pretty significant for our faster kernels (that is, where we are less dominated by GPU time).

Edit: Updated to use newly minted resource ref types from rmm itself. I also switched the type to be host_async_resource_ref even though in this case the user (thrust::host_vector) doesn't explicitly go through the async path. In addition, the pageable memory path (an experimental feature) has been removed.

Pinned

| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |     25x | 20.443 ms | 0.45% | 20.438 ms | 0.45% |      26268890178 |         1.072 GiB |       498.123 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     26x | 19.571 ms | 0.42% | 19.565 ms | 0.42% |      27440146729 |       756.210 MiB |       161.438 MiB |
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |     28x | 18.150 ms | 0.18% | 18.145 ms | 0.18% |      29587789525 |       602.424 MiB |        27.720 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |     29x | 17.306 ms | 0.37% | 17.300 ms | 0.37% |      31032523423 |       597.181 MiB |        14.403 MiB |

Pooled/pinned

| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |    117x | 17.258 ms | 0.50% | 17.254 ms | 0.50% |      31115706389 |         1.072 GiB |       498.123 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     31x | 16.413 ms | 0.43% | 16.408 ms | 0.43% |      32719609450 |       756.210 MiB |       161.438 MiB |
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |    576x | 14.885 ms | 0.58% | 14.881 ms | 0.58% |      36077859564 |       602.519 MiB |        27.720 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |     36x | 14.069 ms | 0.48% | 14.065 ms | 0.48% |      38171646940 |       597.243 MiB |        14.403 MiB |

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@nvdbaranec added the libcudf, cuIO, improvement, and non-breaking labels on Feb 16, 2024
@nvdbaranec requested a review from a team as a code owner on February 16, 2024 21:33
@nvdbaranec marked this pull request as draft on February 16, 2024 21:33
@GregoryKimball (Contributor) commented:

Nice work @nvdbaranec!!

@hyperbolic2346 (Contributor) commented:

I like this ability. My only question is whether we should follow the existing pattern of an optional memory resource passed into each function, or add this as a set/get pair.

table_with_metadata read_parquet(
  parquet_reader_options const& options,
  rmm::cuda_stream_view stream        = cudf::get_default_stream(),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

Maybe this becomes:

table_with_metadata read_parquet(
  parquet_reader_options const& options,
  rmm::cuda_stream_view stream        = cudf::get_default_stream(),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(),
  cudf::host_resource_ref* host_mr    = rmm::mr::pinned_memory_resource);

I don't know where all this applies or how much trouble it would be to pass it through.

@sameerz requested a review from harrism on February 21, 2024 00:34
@harrism (Member) left a comment:
I'm really surprised the old host_memory_resource works with the pool. I added rmm::mr::pinned_host_memory_resource (which implements the cuda::mr::async_memory_resource and cuda::mr::memory_resource concepts instead of deriving from host_memory_resource) specifically to enable use with pool_memory_resource. Please use it instead of the old one.
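As a hedged sketch of the combination described above (the pool size, lifetimes, and the cudf header are illustrative assumptions, not code taken from this PR), the pooled-pinned setup might look like:

#include <cudf/io/memory_resource.hpp>             // assumed header for set_host_memory_resource
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

void use_pooled_pinned_host_allocations()
{
  using pooled_pinned = rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource>;

  // Upstream and pool are static so they outlive every cuIO call that allocates from them.
  static rmm::mr::pinned_host_memory_resource upstream;
  // 256 MiB initial pool size, chosen arbitrarily for this sketch.
  static pooled_pinned pool{&upstream, std::size_t{256} * 1024 * 1024};

  cudf::io::set_host_memory_resource(pool);
}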

(Resolved review threads on cpp/benchmarks/fixture/nvbench_fixture.hpp, cpp/include/cudf/detail/utilities/rmm_host_vector.hpp, and cpp/src/io/utilities/hostdevice_vector.hpp)
…ref instead of host_resource_ref. Removed the pageable-memory path entirely.
@nvdbaranec marked this pull request as ready for review on February 28, 2024 17:28
A Contributor left a comment:

comment (non-blocking): In an ideal (and not too distant) world, this entire file will be unnecessary.

One shouldn't need to define their own allocator or vector type. We should have a cuda::mr::allocator that can be constructed from a cuda::mr::resource_ref.

I understand not wanting to wait for that, but I just want to give you a heads up on what is coming.

@nvdbaranec (author) replied:
Sounds good. This is definitely worth replacing.

@vuule (Contributor) left a comment:
flush

(Resolved review threads on cpp/include/cudf/detail/utilities/rmm_host_vector.hpp, cpp/src/io/utilities/hostdevice_vector.hpp, cpp/benchmarks/fixture/nvbench_fixture.hpp, and cpp/src/io/utilities/config_utils.cpp)
…eallocate functions so that we can pass the correct stream.
@github-actions bot added the CMake label on Mar 6, 2024
/**
 * @brief Copy constructor
 */
rmm_host_allocator(rmm_host_allocator const& other) = default;
A Member commented:
In rmm::device_buffer and device_uvector we delete the copy constructor and copy-assignment operator because they don't allow specifying a stream. YMMV, just suggesting it may be good practice.

https://github.com/rapidsai/rmm/blob/f132d4b0daa976e1ec6cbcef24f5454fe510a394/include/rmm/device_buffer.hpp#L85
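To make the suggestion concrete, here is a minimal sketch of the pattern being referenced (an editor's illustration under assumed names, not code from this PR or from rmm): copies are deleted because a plain copy cannot carry a stream, so the only "copy" is a constructor that takes one explicitly.

#include <rmm/cuda_stream_view.hpp>

class stream_aware_buffer {
 public:
  // Plain copies are deleted: they provide no way to say which stream the copy runs on.
  stream_aware_buffer(stream_aware_buffer const&)            = delete;
  stream_aware_buffer& operator=(stream_aware_buffer const&) = delete;

  // The only way to "copy" is to state the stream on which the copy happens.
  stream_aware_buffer(stream_aware_buffer const& other, rmm::cuda_stream_view stream);
};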

@nvdbaranec (author) replied:

I think this would be better done as a followup. There are a number of places in cudf code using the assignment operator, and thrust itself hits the copy constructor for mysterious reasons. For example, just calling reserve on the wrapping thrust::host_vector causes it to happen (h_data.reserve(max_size);). Something is happening internally in thrust::detail::contiguous_storage.

@harrism (Member) left a comment:
Great to see this getting close. All of my remaining comments are non-blocking, so approving.

@nvdbaranec requested a review from vuule on March 7, 2024 20:32
@vuule (Contributor) left a comment:
Thank you for addressing the feedback! Looks very clean now 🔥

@nvdbaranec (author) commented:
/merge

@rapids-bot merged commit bd68b1c into rapidsai:branch-24.04 on Mar 7, 2024
73 checks passed
AyodeAwe pushed a commit that referenced this pull request Mar 8, 2024
## Description
Following #15079, we add a way to share the pinned pool in JNI with cuIO via the new method added by @nvdbaranec, `set_host_memory_resource`.

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

---------

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>