Add CNMeM support #443
Conversation
```diff
@@ -44,6 +44,15 @@ ELSE(MAGMA_FOUND)
   MESSAGE(STATUS "MAGMA not found. Compiling without MAGMA support")
 ENDIF(MAGMA_FOUND)
 
+SET(USE_CNMEM 1)
```
I believe this needs to be auto-detected (you probably SET this for debugging?)
Oh, never mind. I see that you embedded the whole CNMem subproject in here now.
Can you embed CNMem as a subtree instead of a subproject please?
Done in 6f557cd.
Force-pushed from 85c5ed8 to 2d2338e.
Yeah, I don't know how I feel about this as a default option. There are many use cases for (cu)torch that don't involve manipulating large contiguous regions of memory for NNs. This would be subject to serious fragmentation issues, especially with large allocations, since you lose the physical -> virtual memory mapping on the GPU? Under what circumstances does cnmem ever free memory via cudaFree? What might be more appropriate is to make it so that you can plug in a memory manager, instead of hard-wiring one in.
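For concreteness, the pluggable approach described here could be as small as a struct of function pointers that the library calls in place of cudaMalloc/cudaFree. A minimal sketch with hypothetical names (this is not cutorch's actual interface):

```cpp
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical allocator vtable; any real cutorch interface may differ.
typedef struct THDeviceAllocator {
  cudaError_t (*malloc)(void *ctx, void **ptr, size_t size, cudaStream_t stream);
  cudaError_t (*free)(void *ctx, void *ptr, cudaStream_t stream);
  void *ctx;  // allocator-specific state, e.g. a pool handle
} THDeviceAllocator;

// Default backend: fall straight through to the CUDA runtime.
static cudaError_t defaultMalloc(void *ctx, void **ptr, size_t size,
                                 cudaStream_t stream) {
  (void)ctx; (void)stream;
  return cudaMalloc(ptr, size);
}

static cudaError_t defaultFree(void *ctx, void *ptr, cudaStream_t stream) {
  (void)ctx; (void)stream;
  return cudaFree(ptr);
}

static THDeviceAllocator defaultAllocator = {defaultMalloc, defaultFree, NULL};
```

Swapping allocators would then mean swapping one struct, so CNMeM, CUB, or the plain CUDA runtime become interchangeable backends.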
Let's not make it the default solution yet. We went through similar things with NVCaffe at NVIDIA, and currently we are using the caching allocator from CUB (see https://github.com/NVIDIA/caffe/blob/caffe-0.15/3rdparty/getCUB.sh).
Besides fragmentation, a big problem with the CNMEM slab allocator is that in the case of shared memory (system and GPU) it would only use GPU memory. There is actually some potential in using CNMEM with a special out-of-memory handler that would make it allocate another slab, but so far we've been happy with CUB.
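For reference, CUB's caching allocator is a single header-only C++ class. A minimal usage sketch, assuming CUB is on the include path; the bin parameters are CUB's documented tuning knobs, with arbitrary example values:

```cpp
#include <cub/util_allocator.cuh>

int main() {
  // Bins grow by 8x starting at 8^3 = 512 bytes; requests above 8^7 bytes
  // skip the cache and go straight to cudaMalloc/cudaFree.
  cub::CachingDeviceAllocator allocator(/*bin_growth=*/8, /*min_bin=*/3,
                                        /*max_bin=*/7);

  void *d_buf = NULL;
  allocator.DeviceAllocate(&d_buf, 1 << 20);  // 1 MiB on the default stream
  // ... launch kernels that use d_buf ...
  allocator.DeviceFree(d_buf);  // returned to the cache, not cudaFree'd
  return 0;
}
```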
I'm by no means advocating for CNMeM to be the default setting, but support for other allocators could be very useful in some cases (e.g. torch-autograd). I can give @wickedfoo's proposed approach a try, but I'm a pretty mediocre C programmer, so if you guys have any more specific insights on how to approach it, please let me know.
Will converge on something in the next few days. Definitely need a custom memory allocator.
+1 for integrating this as an option; I am seeing nice speedups with autograd too. Thanks @bartvm!
For Caffe, we made use of the CUB caching allocator (https://raw.githubusercontent.com/NVlabs/cub). It's not as greedy as CNMEM and also works with no problems in the case of fragmented memory or split memory (like the TX1). Here's the code: https://github.com/NVIDIA/caffe/blob/caffe-0.16/include/caffe/util/gpu_memory.hpp. It would be a piece of cake to integrate if it weren't C++; some ugly adapters are in order otherwise, but it's still straightforward.
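Those "ugly adapters" might look like the following: extern "C" wrappers around the C++ allocator so that cutorch's C translation units can call it. A sketch only; the wrapper names are made up for illustration:

```cpp
#include <cub/util_allocator.cuh>

// Process-wide allocator instance (parameters as in the sketch above).
static cub::CachingDeviceAllocator g_allocator(8, 3, 7);

// C-linkage entry points that plain C code can declare and call.
extern "C" cudaError_t torchGpuMalloc(void **ptr, size_t size,
                                      cudaStream_t stream) {
  return g_allocator.DeviceAllocate(ptr, size, stream);
}

extern "C" cudaError_t torchGpuFree(void *ptr) {
  return g_allocator.DeviceFree(ptr);
}
```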
After reviewing the CUB vs. CNMEM options, and my recent work on FindEx integration in cuDNN, I would suggest the following:
I have a first version ready here, polishing a few minor details: https://github.com/borisfom/cutorch. You are welcome to peek before I submit the PR.
cc: @colesbury. Having the user specify how large a memory pool to use seems unreasonable, as most of our users are oblivious to these things. @colesbury is working this week on a custom memory allocator that will be integrated into cutorch, so that we can malloc and free without dealing with sync points.
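One standard way to get malloc/free without sync points, sketched here for illustration (not necessarily @colesbury's design, which hadn't landed at this point): record a CUDA event on the stream at free time, and only recycle the block once that event has completed, so the host never has to synchronize.

```cpp
#include <cuda_runtime.h>

// A freed block plus the event marking when the GPU is done with it.
struct PendingBlock {
  void *ptr;
  cudaEvent_t ready;
};

// "Free" without synchronizing: record an event on the stream that last
// used ptr and queue the block for later reuse.
static void deferredFree(PendingBlock *block, void *ptr, cudaStream_t stream) {
  block->ptr = ptr;
  cudaEventCreateWithFlags(&block->ready, cudaEventDisableTiming);
  cudaEventRecord(block->ready, stream);  // asynchronous; no host sync
}

// Non-blocking check the allocator can run before recycling the block.
static bool blockIsReusable(const PendingBlock *block) {
  return cudaEventQuery(block->ready) == cudaSuccess;
}
```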
Hi @soumith,
@colesbury: and yes, I'd be happy to talk. I have already expressed most of my thoughts in code here: https://github.com/borisfom/cutorch, and I would love to learn about yours!
Can one of the admins verify this patch?
@torchbot build this
Fixes #341.

Although Torch's `nn` package re-uses tensors efficiently, `torch-autograd` simply allocates them greedily (in direct mode at least; the optimized mode tries to re-use tensors whenever possible). Because of this, using a memory pool (i.e. CNMeM) can give quite a bit of speedup. For the `torch-autograd` benchmarks, speedups range from a factor of 1 (when using wrapped `nn` modules) to 1.82 (pure autograd code in direct mode).

I tested this PR before rebasing on the new master, and I haven't tested it in a multi-GPU setting. That said, I've been running experiments using these changes and an older `cutorch` branch for a few weeks without any issues.
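For readers unfamiliar with CNMeM, the pool API being wired in here looks roughly like this, going by CNMeM's public header; the 512 MB pool size is an arbitrary example, not a value taken from this PR:

```cpp
#include <cnmem.h>
#include <cuda_runtime.h>
#include <cstring>

int main() {
  // Describe one device pool; zeroed fields keep CNMeM's defaults.
  cnmemDevice_t device;
  std::memset(&device, 0, sizeof(device));
  device.device = 0;
  device.size = 512u * 1024 * 1024;  // initial pool size (example value)

  cnmemInit(1, &device, CNMEM_FLAGS_DEFAULT);

  void *buf = NULL;
  cnmemMalloc(&buf, 1 << 20, /*stream=*/NULL);  // served from the pool
  cnmemFree(buf, /*stream=*/NULL);

  cnmemFinalize();
  return 0;
}
```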