Description
CUDA pinned memory is important for efficient execution because it allows for faster data transfers and non-blocking CUDA copies.
The copy from normal (pageable) memory to pinned memory can take significant time. Copying a 256x3x224x224 FloatTensor batch takes about 110 ms on my computer. Currently we can only do the copy in the main process, because inter-process shared Tensors/Storages are copied to non-page-locked shared memory. For small conv nets on fast GPUs, we probably need to do the copy in the background.
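To make the cost concrete: a 256x3x224x224 float32 batch is 38,535,168 elements, or roughly 154 MB. The sketch below is one way to time the pageable-to-pinned copy and then issue a non-blocking transfer to the GPU. The helper name `time_pin_copy` is hypothetical, and the `non_blocking` keyword assumes a reasonably recent PyTorch release; the timing will of course vary by machine.

```python
import time

import torch


def time_pin_copy(shape=(256, 3, 224, 224)):
    """Time the copy of a pageable float32 batch into pinned memory.

    Returns the elapsed time in seconds, or None when CUDA is unavailable.
    `time_pin_copy` is an illustrative helper, not part of PyTorch.
    """
    if not torch.cuda.is_available():
        return None

    batch = torch.randn(*shape)      # ordinary pageable host memory
    torch.empty(1).pin_memory()      # warm-up so CUDA init is not timed

    start = time.perf_counter()
    pinned = batch.pin_memory()      # allocates pinned memory and copies into it
    elapsed = time.perf_counter() - start

    # With a pinned source, the host-to-device copy can be asynchronous.
    on_gpu = pinned.cuda(non_blocking=True)
    torch.cuda.synchronize()         # wait for the transfer before returning
    assert on_gpu.is_cuda
    return elapsed
```

Calling `time_pin_copy()` in the main process shows how much time each batch would spend on the pinning step alone.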
I believe we can page-lock the shared memory via cudaHostRegister. We would probably need to unregister it via cudaHostUnregister before freeing the memory.
This would require some knowledge of CUDA in the shared memory code, or at least a free hook that calls cudaHostUnregister.
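A minimal sketch of the register/unregister idea from Python, assuming a PyTorch build that exposes the CUDA runtime bindings through `torch.cuda.cudart()` (an assumption here; I am not relying on any particular release for it). The helper name `register_shared_tensor` is hypothetical, and this is not the shared memory code itself, only an illustration of the call order.

```python
import torch


def register_shared_tensor(numel=1024):
    """Page-lock a shared-memory tensor, then release the registration.

    Returns (pinned_after_register, pinned_after_unregister), or None when
    CUDA is unavailable. Illustrative helper, not part of PyTorch.
    """
    if not torch.cuda.is_available():
        return None

    t = torch.zeros(numel).share_memory_()   # storage moved to shared memory
    nbytes = t.numel() * t.element_size()
    cudart = torch.cuda.cudart()             # assumed CUDA runtime bindings

    # Flag 0 corresponds to cudaHostRegisterDefault.
    err = cudart.cudaHostRegister(t.data_ptr(), nbytes, 0)
    if int(err) != 0:
        raise RuntimeError(f"cudaHostRegister failed with error {int(err)}")

    try:
        pinned_after_register = t.is_pinned()
    finally:
        # Must happen before the shared memory is freed.
        err = cudart.cudaHostUnregister(t.data_ptr())
    if int(err) != 0:
        raise RuntimeError(f"cudaHostUnregister failed with error {int(err)}")

    return pinned_after_register, t.is_pinned()
```

The `try`/`finally` plays the role of the free hook: whatever owns the shared memory has to guarantee that the unregister call runs before the memory is released.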