
Kernel copy for pinned memory #15934

Merged: 79 commits merged into rapidsai:branch-24.08 on Jun 27, 2024

Conversation

Contributor

@vuule vuule commented Jun 5, 2024

Description

Issue #15620

Adds an API that lets users set the threshold below which pinned memory copies are performed with a kernel instead of cudaMemcpyAsync. The default threshold is zero, so default behavior is unchanged.
The API currently affects only hostdevice_vector host<->device synchronization.

The PR adds wrappers for cudaMemcpyAsync so we can implement configurable behavior for pageable copies as well (e.g. copy to pinned + kernel copy).
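
For illustration, a minimal sketch of how such a wrapper might dispatch on the threshold. The enum name host_memory_kind, the exact signatures, and the include paths are assumptions here (only cuda_memcpy_async and get_kernel_pinned_copy_threshold are named in this PR), so the real cudf code may differ:

#include <cudf/utilities/error.hpp>  // CUDF_CUDA_TRY (assumed header)
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/copy.h>

#include <cuda_runtime.h>

#include <cstddef>
#include <cstdint>

namespace cudf::detail {

// Hypothetical two-value enum; see the copy_kind discussion below.
enum class host_memory_kind : std::uint8_t { PINNED, PAGEABLE };

// Threshold configured through the new API added by this PR.
std::size_t get_kernel_pinned_copy_threshold();

// Sketch only: small pinned copies go through a kernel (thrust::copy_n),
// everything else goes through cudaMemcpyAsync.
void cuda_memcpy_async(
  void* dst, void const* src, std::size_t size, host_memory_kind kind, rmm::cuda_stream_view stream)
{
  if (kind == host_memory_kind::PINNED and size < get_kernel_pinned_copy_threshold()) {
    thrust::copy_n(rmm::exec_policy_nosync(stream),
                   static_cast<char const*>(src),
                   size,
                   static_cast<char*>(dst));
  } else {
    CUDF_CUDA_TRY(cudaMemcpyAsync(dst, src, size, cudaMemcpyDefault, stream.value()));
  }
}

}  // namespace cudf::detail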

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule self-assigned this Jun 5, 2024
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code), CMake (CMake build issue), and Java (Affects Java cuDF API) labels Jun 5, 2024
@vuule
Contributor Author

vuule commented Jun 5, 2024

Impact on the Parquet reader multithreaded benchmark:
[benchmark results image]

vuule and others added 2 commits June 5, 2024 15:14
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
cpp/CMakeLists.txt (outdated review thread, resolved)
* @param kind Direction of the copy and type of host memory
* @param stream CUDA stream used for the copy
*/
void cuda_memcpy_async(
Contributor

@bdice bdice Jun 25, 2024

Do we want another name for this, given that it does not always call cudaMemcpyAsync? Proposing: cudf_memcpy_async.

(Happy to go either way on this, the status quo is fine.)

Contributor Author

I don't like to include cudf in the name when it's already in the cudf namespace. I named it this way to make it obvious that it replaces the use of cudaMemcpyAsync. That said, I could probably be convinced to rename it; I'm not tied to any specific name.

Contributor

@vyasr vyasr Jun 26, 2024

I'm inclined to agree; I don't like duplicating the namespace name in objects that are already within the namespace. That only encourages bad practices like using-declarations to import the namespace members.

Comment on lines +61 to +63
* @param threshold The threshold size in bytes. If the size of the copy is less than this
* threshold, the copy will be done using kernels. If the size is greater than or equal to this
* threshold, the copy will be done using cudaMemcpyAsync.
Contributor

Are there any "magic" sizes where we expect one strategy to outperform the other? (A page size, a multiple of 1 kiB or similar) Or is this purely empirical?

Contributor Author

Fair to say that we don't know what the right value is for this (yet?). It's likely to be empirical, since the only goal is to avoid too many copies going through the copy engine.

Contributor

@bdice bdice Jun 25, 2024

Let’s do a sweep over threshold values as a next step, when we enable this more broadly. I would like something closer to a microbenchmark (copy back and forth for different sizes with different thresholds?) than the multithreaded Parquet benchmark.
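
For reference, a rough standalone sweep could look like the sketch below. This is just plain CUDA-event timing (not cudf's benchmark harness); the buffer sizes and iteration counts are arbitrary choices, and error checking is omitted:

#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>

#include <cuda_runtime.h>

#include <cstddef>
#include <cstdio>

int main()
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  std::size_t constexpr max_size = std::size_t{1} << 22;  // 4 MiB
  int constexpr iters            = 1000;

  char* h_src{};
  char* d_dst{};
  cudaMallocHost(&h_src, max_size);  // pinned host buffer
  cudaMalloc(&d_dst, max_size);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (std::size_t size = 1 << 10; size <= max_size; size <<= 2) {
    float ms_memcpy = 0.f;
    float ms_kernel = 0.f;

    // Strategy 1: copy engine via cudaMemcpyAsync
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i) {
      cudaMemcpyAsync(d_dst, h_src, size, cudaMemcpyHostToDevice, stream);
    }
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_memcpy, start, stop);

    // Strategy 2: kernel copy; pinned memory is device-accessible, so a
    // Thrust copy kernel can read it directly.
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i) {
      thrust::copy_n(thrust::cuda::par.on(stream), h_src, size, d_dst);
    }
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_kernel, start, stop);

    std::printf("%8zu B: cudaMemcpyAsync %8.3f ms, kernel copy %8.3f ms\n",
                size, ms_memcpy, ms_kernel);
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_dst);
  cudaFreeHost(h_src);
  cudaStreamDestroy(stream);
  return 0;
}

A sweep like this would expose the crossover point directly, independent of the Parquet reader's threading behavior.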


namespace cudf::detail {

enum class copy_kind { PINNED_TO_DEVICE, DEVICE_TO_PINNED, PAGEABLE_TO_DEVICE, DEVICE_TO_PAGEABLE };
Contributor

I assume we don't care for anything here since I expect that this will stay internal, but for user-facing enums we usually provide a storage class.
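
For illustration, specifying a storage class would just mean giving the enum a fixed underlying type, e.g.:

#include <cstdint>

// Fixed underlying type keeps the enum's size stable across compilers/ABIs.
enum class copy_kind : std::int32_t {
  PINNED_TO_DEVICE, DEVICE_TO_PINNED, PAGEABLE_TO_DEVICE, DEVICE_TO_PAGEABLE
};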

Contributor

copy_kind seems somewhat generic, like something that could be in cudf/copying.hpp. Should we be more explicit with something like memcopy_kind?

Contributor Author

Sure. It's equivalent to cudaMemcpyKind, so this naming matches better.

Contributor Author

Renamed to reflect that only the host memory type is specified now.

Comment on lines 50 to 64
void copy_pinned_to_device(void* dst,
void const* src,
std::size_t size,
rmm::cuda_stream_view stream)
{
copy_pinned(dst, src, size, stream);
}

void copy_device_to_pinned(void* dst,
void const* src,
std::size_t size,
rmm::cuda_stream_view stream)
{
copy_pinned(dst, src, size, stream);
}
Contributor

Is the purpose of this transparent passthrough just to have a function name that clearly indicates the direction of the transfer? You still have to get the src/dst order correct, though, so does that really help much? It seems duplicative, especially for something in an anonymous namespace inside detail that you're only using internally.

Same for pageable below.

Contributor Author

The reason was that I wanted to allow different behavior for H2D and D2H without changing the header. But now that the entire implementation is in the source file, we can simplify this and separate the implementations only when we actually need to.

Member

Agree. I really think you only need one function, no dispatch.

Member

@harrism harrism left a comment

Can be simplified?

Thinking about this, shouldn't either thrust::copy or cudaMemcpy be responsible for deciding and implementing the fastest copy possible? If not, we should file bugs.


namespace cudf::detail {

enum class copy_kind { PINNED_TO_DEVICE, DEVICE_TO_PINNED, PAGEABLE_TO_DEVICE, DEVICE_TO_PAGEABLE };
Member

Why is copy_kind needed at all? There is exactly one case (pinned, size less than threshold) where you do anything other than pass through to cudaMemcpyAsync. You can detect that case with cudaPointerGetAttributes and call Thrust for that one case, and just call cudaMemcpyAsync(cudaMemcpyDefault) for everything else.

Contributor Author

It's possible that we'll eventually have a separate threshold for pageable copies, where we copy to a pinned buffer and then thrust::copy. @abellina had this in the POC implementation, and IIRC it was helpful even with the extra copy.
I understand that the current implementation is just a wrapper; I just wanted to leave room for more complex behavior without future changes to the API.
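
Roughly, that staged pageable path could look like the sketch below; the helper name is hypothetical, and the staging-buffer management a real implementation needs is omitted:

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/copy.h>

#include <cstddef>
#include <cstring>

// Hypothetical staged copy: memcpy the pageable source into a pinned staging
// buffer on the host, then kernel-copy from the staging buffer to the device.
void staged_pageable_to_device(void* d_dst,
                               void const* pageable_src,
                               std::size_t size,
                               void* pinned_staging,  // assumed to hold at least `size` bytes
                               rmm::cuda_stream_view stream)
{
  std::memcpy(pinned_staging, pageable_src, size);  // host-to-host copy into pinned memory
  thrust::copy_n(rmm::exec_policy_nosync(stream),
                 static_cast<char const*>(pinned_staging),
                 size,
                 static_cast<char*>(d_dst));
}

A real implementation would also have to manage the staging buffer's lifetime, since it cannot be reused until the kernel copy completes.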

Member

OK, I see. Does direction affect the choice at all? Could we reduce the 4 cases to 2?

Contributor Author

Reduced to two cases; only the host memory type is specified now.
I can also add an AUTO/DEFAULT option that would call cudaPointerGetAttributes. Let me know what you think.
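
For example, an AUTO option could classify the host pointer along these lines (hypothetical names; since CUDA 11, cudaPointerGetAttributes reports plain pageable pointers as cudaMemoryTypeUnregistered instead of returning an error):

#include <cuda_runtime.h>

#include <cudf/utilities/error.hpp>  // CUDF_CUDA_TRY (assumed header)

// Hypothetical AUTO handling: classify the host pointer once, then dispatch to
// the explicit PINNED/PAGEABLE code paths.
enum class host_memory_kind { PINNED, PAGEABLE };

host_memory_kind detect_host_memory_kind(void const* host_ptr)
{
  cudaPointerAttributes attrs{};
  CUDF_CUDA_TRY(cudaPointerGetAttributes(&attrs, host_ptr));
  // cudaMemoryTypeHost: registered/pinned host memory; cudaMemoryTypeUnregistered: pageable.
  return attrs.type == cudaMemoryTypeHost ? host_memory_kind::PINNED : host_memory_kind::PAGEABLE;
}

The trade-off, as noted below, is that cudaPointerGetAttributes is not free, so an AUTO path would add a lookup to every copy.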

Comment on lines 87 to 95
if (kind == copy_kind::PINNED_TO_DEVICE) {
copy_pinned_to_device(dst, src, size, stream);
} else if (kind == copy_kind::DEVICE_TO_PINNED) {
copy_device_to_pinned(dst, src, size, stream);
} else if (kind == copy_kind::PAGEABLE_TO_DEVICE) {
copy_pageable_to_device(dst, src, size, stream);
} else if (kind == copy_kind::DEVICE_TO_PAGEABLE) {
copy_device_to_pageable(dst, src, size, stream);
}
Member

Suggested change
if (kind == copy_kind::PINNED_TO_DEVICE) {
copy_pinned_to_device(dst, src, size, stream);
} else if (kind == copy_kind::DEVICE_TO_PINNED) {
copy_device_to_pinned(dst, src, size, stream);
} else if (kind == copy_kind::PAGEABLE_TO_DEVICE) {
copy_pageable_to_device(dst, src, size, stream);
} else if (kind == copy_kind::DEVICE_TO_PAGEABLE) {
copy_device_to_pageable(dst, src, size, stream);
}
switch (kind) {
  case copy_kind::PINNED_TO_DEVICE:
  case copy_kind::DEVICE_TO_PINNED: copy_pinned(dst, src, size, stream); break;
  case copy_kind::PAGEABLE_TO_DEVICE:
  case copy_kind::DEVICE_TO_PAGEABLE:
  default: copy_pageable(dst, src, size, stream); break;
}

but better:

cudaPointerAttributes src_attribs;
CUDF_CUDA_TRY(cudaPointerGetAttributes(&src_attribs, src));
cudaPointerAttributes dst_attribs;
CUDF_CUDA_TRY(cudaPointerGetAttributes(&dst_attribs, dst));
// Pageable (unregistered) memory cannot be read by a kernel, so use the kernel
// copy only when both pointers are device-accessible (pinned or device).
bool const pageable = (src_attribs.type == cudaMemoryTypeUnregistered) or
                      (dst_attribs.type == cudaMemoryTypeUnregistered);
if (not pageable and size < get_kernel_pinned_copy_threshold()) {
  thrust::copy_n(rmm::exec_policy_nosync(stream),
                 static_cast<char const*>(src),
                 size,
                 static_cast<char*>(dst));
} else {
  CUDF_CUDA_TRY(cudaMemcpyAsync(dst, src, size, cudaMemcpyDefault, stream));
}

Contributor Author

I was told that cudaPointerGetAttributes is not trivial, so I'm trying to avoid calling it for every copy. Also, FWIW, tying the strategy to the memory type prevents callers from manually overriding the strategy.
The current API is awkward to use when copying from an existing cudf::host_vector, so I'm not sure what the best option is here.

@vuule
Contributor Author

vuule commented Jun 26, 2024

Can be simplified?

Thinking about this, shouldn't either thrust::copy or cudaMemcpy be responsible for deciding and implementing the fastest copy possible? If not, we should file bugs.

The fastest copy possible depends on the context. The goal here is not to implement a speed-of-light (SOL) copy, but to reduce the copy-engine bottleneck in a multi-threaded environment (e.g. Spark), and thrust::copy and cudaMemcpy don't have this context.

@vuule vuule requested review from vyasr and harrism June 26, 2024 16:35
Member

@harrism harrism left a comment

Thanks for simplifying!

@vuule vuule added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Jun 27, 2024
@vuule
Contributor Author

vuule commented Jun 27, 2024

/merge

@rapids-bot rapids-bot bot merged commit f267b1f into rapidsai:branch-24.08 Jun 27, 2024
74 checks passed
@vuule vuule deleted the fea-smart-copy branch June 27, 2024 02:45
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge), CMake (CMake build issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Performance (Performance related issue)
Projects
Status: Done
Development
Successfully merging this pull request may close these issues: None yet
6 participants