
Add CNMeM support #443

Closed · bartvm wants to merge 4 commits (from the cnmem2 branch)

Conversation

@bartvm commented Jul 10, 2016

Fixes #341.

Although Torch's nn package re-uses tensors efficiently, torch-autograd simply allocates them greedily (in direct mode at least; the optimized mode tries to re-use tensors whenever possible). Because of this, using a memory pool such as CNMeM can give quite a bit of speedup. For the torch-autograd benchmarks, speedups range from a factor of 1 (when using wrapped nn modules) to 1.82 (pure autograd code in direct mode).

I tested this PR before rebasing on the new master, and I haven't tested this in a multi-GPU setting. That said, I've been running experiments using these changes and an older cutorch branch for a few weeks without any issues.

@@ -44,6 +44,15 @@ ELSE(MAGMA_FOUND)
MESSAGE(STATUS "MAGMA not found. Compiling without MAGMA support")
ENDIF(MAGMA_FOUND)

SET(USE_CNMEM 1)
Member: I believe this needs to be auto-detected (you probably SET this for debugging?)

Member: oh, nevermind. I see that you embedded the whole CNMem subproject in here now.

Member: Can you embed CNMem as a subtree instead of a subproject please?

Author: Done in 6f557cd.

@bartvm force-pushed the cnmem2 branch 2 times, most recently from 85c5ed8 to 2d2338e on July 11, 2016
@wickedfoo (Contributor) commented Jul 11, 2016

Yeah, I don't know how I feel about this as a default option. There are many use cases for (cu)torch that don't involve manipulating large contiguous regions of memory for NNs.

Wouldn't this be subject to serious fragmentation issues, especially with large allocations, since you lose the physical-to-virtual memory mapping on the GPU? Under what circumstances does CNMeM ever free memory via cudaFree?

What might be more appropriate is to make it so that you can plug in a memory manager instead of #ifdef-ing it in code, and to have a setting for the current allocator via cutorch.setDefaultMemoryAllocator(...). I'd even go so far as to have each storage record which memory manager its allocation came from, and free it through that, so you can mix and match as appropriate.
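For illustration, here is a minimal sketch of what that pluggable-allocator idea could look like. None of these names (`THCDeviceAllocator`, `THCSetDefaultAllocator`, the simplified `THCStorage`) are cutorch's actual API; they are placeholders showing (a) a table of function pointers, (b) a process-wide default that a Lua-level `cutorch.setDefaultMemoryAllocator(...)` could set, and (c) each storage remembering which allocator it came from:

```cpp
// Hypothetical sketch of a pluggable device allocator (names are illustrative).
#include <cstddef>
#include <cuda_runtime.h>

struct THCDeviceAllocator {            // hypothetical name, not the real cutorch type
  cudaError_t (*malloc)(void* ctx, void** ptr, size_t size, cudaStream_t stream);
  cudaError_t (*free)(void* ctx, void* ptr);
  void* ctx;                           // allocator-specific state (pool, cache, ...)
};

// Plain cudaMalloc/cudaFree backend as the default.
static cudaError_t defaultMalloc(void*, void** ptr, size_t size, cudaStream_t) {
  return cudaMalloc(ptr, size);
}
static cudaError_t defaultFree(void*, void* ptr) { return cudaFree(ptr); }

static THCDeviceAllocator defaultAllocator = {defaultMalloc, defaultFree, nullptr};
static THCDeviceAllocator* currentAllocator = &defaultAllocator;

// What a Lua-exposed cutorch.setDefaultMemoryAllocator(...) might call into.
void THCSetDefaultAllocator(THCDeviceAllocator* a) { currentAllocator = a; }

// Each storage records the allocator that produced its data pointer,
// so it can be freed through that same allocator even if the default changes.
struct THCStorage {                    // simplified for illustration
  void* data;
  size_t size;
  THCDeviceAllocator* allocator;
};

THCStorage storageNew(size_t bytes, cudaStream_t stream) {
  THCStorage s{nullptr, bytes, currentAllocator};
  s.allocator->malloc(s.allocator->ctx, &s.data, bytes, stream);
  return s;
}

void storageFree(THCStorage& s) {
  s.allocator->free(s.allocator->ctx, s.data);
  s.data = nullptr;
}
```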

@borisfom (Contributor)

Let's not have it as a default solution yet. We went through similar things with NVCaffe at NVIDIA, and currently we are using the caching allocator from CUB (see https://github.com/NVIDIA/caffe/blob/caffe-0.15/3rdparty/getCUB.sh).

@borisfom (Contributor)

Besides fragmentation, a big problem with the CNMEM slab allocator is that in the case of shared memory (system and GPU) it would only use GPU memory. There is actually some potential in using CNMEM with a special out-of-memory handler that would make it allocate another slab, but so far we've been happy with CUB.

@bartvm commented Jul 12, 2016

I'm by no means advocating for CNMeM to be the default setting, but support for other allocators could be very useful in some cases (e.g. torch-autograd). I'm agnostic as to whether this should be CUB or CNMeM.

I can give @wickedfoo's proposed approach a try, but I'm a pretty mediocre C programmer, so if you guys have any more specific insights on how to approach it, please let me know.

@soumith (Member) commented Jul 29, 2016

We will converge on something in the next few days. We definitely need a custom memory allocator.

@szagoruyko (Member)
+1 for integrating this as an option; I am seeing nice speedups with autograd too. Thanks @bartvm.

@borisfom (Contributor)
For Caffe, we made use of the CUB caching allocator (https://raw.githubusercontent.com/NVlabs/cub). It's not as greedy as CNMEM and also works with no problems in the case of fragmented or split memory (like TX1). Here's the code:

https://github.com/NVIDIA/caffe/blob/caffe-0.16/include/caffe/util/gpu_memory.hpp
https://github.com/NVIDIA/caffe/blob/caffe-0.16/src/caffe/util/gpu_memory.cpp
https://github.com/NVIDIA/caffe/blob/caffe-0.16/3rdparty/getCUB.sh

It would be a piece of cake to integrate if it weren't C++ (cutorch's core is C). Some ugly adapters are in order otherwise, but it is still straightforward.
@soumith, what do you think?
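A rough sketch of the kind of adapter that would be needed: a thin `extern "C"` shim that the C code can call, with the C++ allocator hidden behind it. The `CachingAllocator` class and the `thc_gpu_malloc`/`thc_gpu_free` names below are stand-ins (CUB's real `cub::CachingDeviceAllocator` has a richer interface, so check its headers before relying on specific signatures):

```cpp
// adapter.cpp -- hypothetical extern "C" shim around a C++ caching allocator,
// so that C code in THC can call it without knowing about C++ at all.
// CachingAllocator below is a stand-in, not CUB's real class.
#include <cstddef>
#include <mutex>
#include <cuda_runtime.h>

class CachingAllocator {
 public:
  cudaError_t allocate(void** ptr, size_t bytes) { return cudaMalloc(ptr, bytes); }
  cudaError_t release(void* ptr) { return cudaFree(ptr); }
};

static CachingAllocator gAllocator;
static std::mutex gMutex;

// The C-visible surface: plain functions with C linkage, declared in a small
// header on the C side, so the C++ dependency stays inside this one file.
extern "C" cudaError_t thc_gpu_malloc(void** ptr, size_t bytes) {
  std::lock_guard<std::mutex> lock(gMutex);
  return gAllocator.allocate(ptr, bytes);
}

extern "C" cudaError_t thc_gpu_free(void* ptr) {
  std::lock_guard<std::mutex> lock(gMutex);
  return gAllocator.release(ptr);
}
```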

@borisfom (Contributor)
After reviewing the CUB vs. CNMEM options, and my recent work on FindEx integration in CUDNN, I would suggest the following:

  1. Let's start by adding the CNMEM allocator option, mostly as it is in this PR.
  2. cutorch.useCNMEM should be a runtime option, with env support (I'd rather eliminate #ifdefs completely).
  3. There should be a cutorch.maxGPUMemPercent setting, defaulting to 90-95 but adjustable for shared-GPU situations (with env support via a CUTORCH_MAX_GPU_MEM var or similar). Also: either this setting should be used by the CUDNN FindEx workspace allocation directly, or CUDNN should query cutorch's memory-info function for the maximum available memory (probably the latter).
  4. There should probably be a separate setting, cutorch.maxGPUMemPoolPercent: how much of the user's share of GPU memory the pool can take. For example, cutorch.maxGPUMemPercent = 50 and cutorch.maxGPUMemPoolPercent = 95 would prevent any user on a shared system from grabbing more than half of the total GPU memory, and CNMEM would take 95% of that 50%.
  5. The CNMEM implementation should be enhanced not to fail if it can't allocate as large a slab as it wants, gradually trying smaller sizes (75% or so) until it succeeds (a sketch of such a retry loop follows below). Also, the option to grow should be set, to accommodate a shared environment or a split-memory environment like TX1. Then, say, I set my limit to 90%, and on TX1 the first slab will only give me 50% (all of GPU memory); if I need more, it would reach for a second slab and allocate the remaining 40% from shared memory.
    @bartvm, @soumith, @wickedfoo, @szagoruyko: let me know if you want me to take a stab at this.
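As a rough sketch of points 2, 3 and 5 (the env-var names and the function below are illustrative, not an actual cutorch API): the pool could read its limits from the environment, then back off geometrically until a slab of acceptable size fits:

```cpp
// Hypothetical initialization for a CNMEM-style pool: read the memory-fraction
// limits from env vars, then try progressively smaller slabs instead of
// failing outright when the first cudaMalloc doesn't fit.
#include <cstdlib>
#include <cuda_runtime.h>

static double envFraction(const char* name, double fallback) {
  const char* v = std::getenv(name);           // e.g. CUTORCH_MAX_GPU_MEM=90
  return v ? std::atof(v) / 100.0 : fallback;
}

// Returns the slab pointer, or nullptr if even the smallest attempt failed.
void* allocInitialSlab(size_t* outBytes) {
  double maxShare  = envFraction("CUTORCH_MAX_GPU_MEM", 0.90);       // user's share of the GPU
  double poolShare = envFraction("CUTORCH_MAX_GPU_MEM_POOL", 0.95);  // pool's share of that

  size_t freeBytes = 0, totalBytes = 0;
  if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) return nullptr;

  size_t want = (size_t)(totalBytes * maxShare * poolShare);
  if (want > freeBytes) want = freeBytes;

  void* slab = nullptr;
  while (want >= (64u << 20)) {                // give up below 64 MB
    if (cudaMalloc(&slab, want) == cudaSuccess) {
      *outBytes = want;
      return slab;
    }
    cudaGetLastError();                        // clear the OOM error before retrying
    want = (size_t)(want * 0.75);              // retry with a 75% slab
  }
  *outBytes = 0;
  return nullptr;
}
```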

@borisfom (Contributor)

I have the first version ready here, polishing a few minor details: https://github.com/borisfom/cutorch. You are welcome to peek before I submit the PR.

@soumith (Member) commented Aug 23, 2016

cc: @colesbury

Having the user specify how much memory the pool should use seems unreasonable, as most of our users are oblivious to these things.

@colesbury is working this week on a custom memory allocator that will be integrated into cutorch, so that we can do malloc and free without dealing with sync points.
We looked at CUB and CNMem, but both fall short on a few aspects that we want, so he's working on something slightly different, starting from CUB / CNMem.
If there's anyone at NVIDIA we should talk to as he works on this (including you, @borisfom), we'd love to get their thoughts.
We want to build a drop-in replacement that doesn't require specifying beforehand how much memory is being managed, so that the change is completely transparent to the user.
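To illustrate why caching helps with sync points (just a sketch; the allocator @colesbury is building is not shown in this thread): cudaFree typically synchronizes the device, so an allocator that parks freed blocks on a size-keyed free list and hands them back out on the next malloc rarely has to call cudaFree at all, and it needs no pre-declared pool size. The class below is an illustrative minimum, not the real implementation:

```cpp
// Minimal sketch of a caching device allocator: freed blocks go into a
// size-keyed cache instead of back to the driver, so most malloc/free pairs
// never touch cudaMalloc/cudaFree (and thus avoid their sync points).
// No up-front pool size is needed; memory is requested from the driver lazily.
#include <cstddef>
#include <map>
#include <mutex>
#include <cuda_runtime.h>

class CachingDeviceAllocator {
 public:
  cudaError_t malloc(void** ptr, size_t bytes) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.lower_bound(bytes);       // smallest cached block that fits
    if (it != cache_.end()) {
      *ptr = it->second;
      sizes_[*ptr] = it->first;
      cache_.erase(it);
      return cudaSuccess;
    }
    cudaError_t err = cudaMalloc(ptr, bytes);  // cache miss: go to the driver
    if (err == cudaSuccess) sizes_[*ptr] = bytes;
    return err;
  }

  cudaError_t free(void* ptr) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = sizes_.find(ptr);
    if (it == sizes_.end()) return cudaErrorInvalidValue;
    cache_.emplace(it->second, ptr);           // keep the block for reuse
    sizes_.erase(it);
    return cudaSuccess;                        // no cudaFree, no device sync
  }

 private:
  std::mutex mutex_;
  std::multimap<size_t, void*> cache_;         // size -> cached block
  std::map<void*, size_t> sizes_;              // live block -> size
};
```

A real allocator also has to track which stream each block was last used on before handing it out again, so that reuse across streams stays safe; that bookkeeping is omitted here.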

@borisfom (Contributor)
Hi @soumith,
Actually, my implementation does not really require the user to specify anything: an initial default size of, say, 10% would work just fine. One thing that would have to be added to improve this non-greedy approach is increasing the minimum slab size used when CNMEM reaches for cudaMalloc to get the next block of GPU memory.
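A tiny sketch of that growth policy (names are illustrative, not from the actual branch): start the pool at roughly 10% of device memory and, whenever another slab has to be requested from cudaMalloc, ask for a size no smaller than a growing minimum, e.g. doubling each time:

```cpp
// Hypothetical slab-growth policy: small initial pool, geometrically growing
// minimum size for each additional slab requested from cudaMalloc.
#include <cstddef>
#include <algorithm>

struct SlabPolicy {
  size_t nextMinSlab;                              // grows after every new slab

  explicit SlabPolicy(size_t totalMem) : nextMinSlab(totalMem / 10) {}  // ~10% default

  size_t nextSlabSize(size_t requestBytes) {
    size_t size = std::max(requestBytes, nextMinSlab);
    nextMinSlab *= 2;                              // each new slab at least doubles the minimum
    return size;
  }
};
```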

@borisfom (Contributor)
@colesbury: and yes, I'd be happy to talk. I have already expressed most of my thoughts in code here: https://github.com/borisfom/cutorch, and I would love to learn about yours!

@borisfom mentioned this pull request on Aug 23, 2016
@torchbot
Can one of the admins verify this patch?

@soumith (Member) commented Aug 30, 2016

@torchbot build this

@soumith (Member) commented Sep 26, 2016

@soumith closed this on Oct 17, 2016