
Add CNMeM support #443

Closed · bartvm wants to merge 4 commits (from the cnmem2 branch)

Conversation

@bartvm commented Jul 10, 2016

Fixes #341.

Although Torch's nn package re-uses tensors efficiently, torch-autograd simply allocates them greedily (in direct mode at least; the optimized mode tries to re-use tensors whenever possible). Because of this, using a memory pool such as CNMeM can give quite a bit of speedup. For the torch-autograd benchmarks, speedups range from a factor of 1 (when using wrapped nn modules) to 1.82 (pure autograd code in direct mode).

I tested this PR before rebasing on the new master, and I haven't tested this in a multi-GPU setting. That said, I've been running experiments using these changes and an older cutorch branch for a few weeks without any issues.

@@ -44,6 +44,15 @@ ELSE(MAGMA_FOUND)
MESSAGE(STATUS "MAGMA not found. Compiling without MAGMA support")
ENDIF(MAGMA_FOUND)

SET(USE_CNMEM 1)
Member: I believe this needs to be auto-detected (you probably SET this for debugging?)

Member: oh, nevermind. I see that you embedded the whole CNMem subproject in here now.

Member: Can you embed CNMem as a subtree instead of a subproject please?

Author: Done in 6f557cd.

@bartvm force-pushed the cnmem2 branch 2 times, most recently from 85c5ed8 to 2d2338e on July 11, 2016
@wickedfoo (Contributor) commented Jul 11, 2016

Yeah, I don't know how I feel about this as a default option. There are many use cases for (cu)torch that don't involve manipulating large contiguous regions of memory for NNs.

Wouldn't this be subject to serious fragmentation issues, especially with large allocations, since you lose the physical-to-virtual memory mapping on the GPU? Under what circumstances does CNMeM ever free memory via cudaFree?

What might be more appropriate is to make it so that you can plug in a memory manager instead of #ifdef-ing it in code, and to have a setting for the current allocator via cutorch.setDefaultMemoryAllocator(...). I'd even go so far as to have each storage record which memory manager its allocation came from, and free it through that, so you can mix and match as appropriate.
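For illustration, here is a minimal sketch of what that pluggable-allocator idea could look like. None of these names (`THCDeviceAllocator`, `THCSetDefaultAllocator`, the simplified `THCStorage`) are cutorch's actual API; they are placeholders showing (a) a table of function pointers, (b) a process-wide default that a Lua-level `cutorch.setDefaultMemoryAllocator(...)` could set, and (c) each storage remembering which allocator it came from:

```cpp
// Hypothetical sketch of a pluggable device allocator (names are illustrative).
#include <cstddef>
#include <cuda_runtime.h>

struct THCDeviceAllocator {            // hypothetical name, not the real cutorch type
  cudaError_t (*malloc)(void* ctx, void** ptr, size_t size, cudaStream_t stream);
  cudaError_t (*free)(void* ctx, void* ptr);
  void* ctx;                           // allocator-specific state (pool, cache, ...)
};

// Plain cudaMalloc/cudaFree backend as the default.
static cudaError_t defaultMalloc(void*, void** ptr, size_t size, cudaStream_t) {
  return cudaMalloc(ptr, size);
}
static cudaError_t defaultFree(void*, void* ptr) { return cudaFree(ptr); }

static THCDeviceAllocator defaultAllocator = {defaultMalloc, defaultFree, nullptr};
static THCDeviceAllocator* currentAllocator = &defaultAllocator;

// What a Lua-exposed cutorch.setDefaultMemoryAllocator(...) might call into.
void THCSetDefaultAllocator(THCDeviceAllocator* a) { currentAllocator = a; }

// Each storage records the allocator that produced its data pointer,
// so it can be freed through that same allocator even if the default changes.
struct THCStorage {                    // simplified for illustration
  void* data;
  size_t size;
  THCDeviceAllocator* allocator;
};

THCStorage storageNew(size_t bytes, cudaStream_t stream) {
  THCStorage s{nullptr, bytes, currentAllocator};
  s.allocator->malloc(s.allocator->ctx, &s.data, bytes, stream);
  return s;
}

void storageFree(THCStorage& s) {
  s.allocator->free(s.allocator->ctx, s.data);
  s.data = nullptr;
}
```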

@borisfom (Contributor)

Let's not have it as a default solution yet. We went through similar things with NVCaffe at NVIDIA, and currently we are using the caching allocator from CUB (see https://github.com/NVIDIA/caffe/blob/caffe-0.15/3rdparty/getCUB.sh).

@borisfom (Contributor)

Besides fragmentation, a big problem with the CNMEM slab allocator is that in the case of shared memory (system and GPU) it would only use GPU memory. There is actually some potential in using CNMEM with a special out-of-memory handler that would make it allocate another slab, but so far we've been happy with CUB.

@bartvm commented Jul 12, 2016

I'm by no means advocating for CNMeM to be the default setting, but support for other allocators could be very useful in some cases (e.g. torch-autograd). I'm agnostic as to whether this should be CUB or CNMeM.

I can give @wickedfoo's proposed approach a try, but I'm a pretty mediocre C programmer, so if you guys have any more specific insights on how to approach it, please let me know.

@soumith (Member) commented Jul 29, 2016

We will converge on something in the next few days. We definitely need a custom memory allocator.

@szagoruyko (Member)
+1 for integrating this as an option; I am seeing nice speedups with autograd too. Thanks @bartvm.

@borisfom (Contributor)
For Caffe, we made use of the CUB caching allocator (https://raw.githubusercontent.com/NVlabs/cub). It's not as greedy as CNMEM and also works with no problems in the case of fragmented or split memory (like TX1). Here's the code:

https://github.com/NVIDIA/caffe/blob/caffe-0.16/include/caffe/util/gpu_memory.hpp
https://github.com/NVIDIA/caffe/blob/caffe-0.16/src/caffe/util/gpu_memory.cpp
https://github.com/NVIDIA/caffe/blob/caffe-0.16/3rdparty/getCUB.sh

It would be a piece of cake to integrate if it weren't C++ (cutorch's core is C). Some ugly adapters are in order otherwise, but it is still straightforward.
@soumith, what do you think?
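A rough sketch of the kind of adapter that would be needed: a thin `extern "C"` shim that the C code can call, with the C++ allocator hidden behind it. The `CachingAllocator` class and the `thc_gpu_malloc`/`thc_gpu_free` names below are stand-ins (CUB's real `cub::CachingDeviceAllocator` has a richer interface, so check its headers before relying on specific signatures):

```cpp
// adapter.cpp -- hypothetical extern "C" shim around a C++ caching allocator,
// so that C code in THC can call it without knowing about C++ at all.
// CachingAllocator below is a stand-in, not CUB's real class.
#include <cstddef>
#include <mutex>
#include <cuda_runtime.h>

class CachingAllocator {
 public:
  cudaError_t allocate(void** ptr, size_t bytes) { return cudaMalloc(ptr, bytes); }
  cudaError_t release(void* ptr) { return cudaFree(ptr); }
};

static CachingAllocator gAllocator;
static std::mutex gMutex;

// The C-visible surface: plain functions with C linkage, declared in a small
// header on the C side, so the C++ dependency stays inside this one file.
extern "C" cudaError_t thc_gpu_malloc(void** ptr, size_t bytes) {
  std::lock_guard<std::mutex> lock(gMutex);
  return gAllocator.allocate(ptr, bytes);
}

extern "C" cudaError_t thc_gpu_free(void* ptr) {
  std::lock_guard<std::mutex> lock(gMutex);
  return gAllocator.release(ptr);
}
```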

@borisfom (Contributor)
After reviewing the CUB vs. CNMEM options, and my recent work on FindEx integration in CUDNN, I would suggest the following:

  1. Let's start by adding the CNMEM allocator option, mostly as it is in this PR.
  2. cutorch.useCNMEM should be a runtime option, with env support (I'd rather eliminate #ifdefs completely).
  3. There should be a cutorch.maxGPUMemPercent setting, defaulting to 90-95 but adjustable for shared-GPU situations (with env support via a CUTORCH_MAX_GPU_MEM var or similar). Also: either this setting should be used by the CUDNN FindEx workspace allocation directly, or CUDNN should query cutorch's memory-info function for the maximum available memory (probably the latter).
  4. There should probably be a separate setting, cutorch.maxGPUMemPoolPercent: how much of the user's share of GPU memory the pool can take. For example, cutorch.maxGPUMemPercent = 50 and cutorch.maxGPUMemPoolPercent = 95 would prevent any user on a shared system from grabbing more than half of the total GPU memory, and CNMEM would take 95% of that 50%.
  5. The CNMEM implementation should be enhanced not to fail if it can't allocate as large a slab as it wants, gradually trying smaller sizes (75% or so) until it succeeds (a sketch of such a retry loop follows below). Also, the option to grow should be set, to accommodate a shared environment or a split-memory environment like TX1. Then, say, I set my limit to 90%, and on TX1 the first slab will only give me 50% (all of GPU memory); if I need more, it would reach for a second slab and allocate the remaining 40% from shared memory.
    @bartvm, @soumith, @wickedfoo, @szagoruyko: let me know if you want me to take a stab at this.
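As a rough sketch of points 2, 3 and 5 (the env-var names and the function below are illustrative, not an actual cutorch API): the pool could read its limits from the environment, then back off geometrically until a slab of acceptable size fits:

```cpp
// Hypothetical initialization for a CNMEM-style pool: read the memory-fraction
// limits from env vars, then try progressively smaller slabs instead of
// failing outright when the first cudaMalloc doesn't fit.
#include <cstdlib>
#include <cuda_runtime.h>

static double envFraction(const char* name, double fallback) {
  const char* v = std::getenv(name);           // e.g. CUTORCH_MAX_GPU_MEM=90
  return v ? std::atof(v) / 100.0 : fallback;
}

// Returns the slab pointer, or nullptr if even the smallest attempt failed.
void* allocInitialSlab(size_t* outBytes) {
  double maxShare  = envFraction("CUTORCH_MAX_GPU_MEM", 0.90);       // user's share of the GPU
  double poolShare = envFraction("CUTORCH_MAX_GPU_MEM_POOL", 0.95);  // pool's share of that

  size_t freeBytes = 0, totalBytes = 0;
  if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) return nullptr;

  size_t want = (size_t)(totalBytes * maxShare * poolShare);
  if (want > freeBytes) want = freeBytes;

  void* slab = nullptr;
  while (want >= (64u << 20)) {                // give up below 64 MB
    if (cudaMalloc(&slab, want) == cudaSuccess) {
      *outBytes = want;
      return slab;
    }
    cudaGetLastError();                        // clear the OOM error before retrying
    want = (size_t)(want * 0.75);              // retry with a 75% slab
  }
  *outBytes = 0;
  return nullptr;
}
```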

@borisfom (Contributor)

I have the first version ready here, polishing a few minor details: https://github.com/borisfom/cutorch. You are welcome to peek before I submit the PR.

@soumith (Member) commented Aug 23, 2016

cc: @colesbury

Having the user specify how much memory the pool should use seems unreasonable, as most of our users are oblivious to these things.

@colesbury is working this week on a custom memory allocator that will be integrated into cutorch, so that we can do malloc and free without dealing with sync points.
We looked at CUB and CNMem, but both fall short on a few aspects that we want, so he's working on something slightly different, starting from CUB / CNMem.
If there's anyone at NVIDIA we should talk to as he works on this (including you, @borisfom), we'd love to get their thoughts.
We want to build a drop-in replacement that doesn't require specifying beforehand how much memory is being managed, so that the change is completely transparent to the user.
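To illustrate why caching helps with sync points (just a sketch; the allocator @colesbury is building is not shown in this thread): cudaFree typically synchronizes the device, so an allocator that parks freed blocks on a size-keyed free list and hands them back out on the next malloc rarely has to call cudaFree at all, and it needs no pre-declared pool size. The class below is an illustrative minimum, not the real implementation:

```cpp
// Minimal sketch of a caching device allocator: freed blocks go into a
// size-keyed cache instead of back to the driver, so most malloc/free pairs
// never touch cudaMalloc/cudaFree (and thus avoid their sync points).
// No up-front pool size is needed; memory is requested from the driver lazily.
#include <cstddef>
#include <map>
#include <mutex>
#include <cuda_runtime.h>

class CachingDeviceAllocator {
 public:
  cudaError_t malloc(void** ptr, size_t bytes) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.lower_bound(bytes);       // smallest cached block that fits
    if (it != cache_.end()) {
      *ptr = it->second;
      sizes_[*ptr] = it->first;
      cache_.erase(it);
      return cudaSuccess;
    }
    cudaError_t err = cudaMalloc(ptr, bytes);  // cache miss: go to the driver
    if (err == cudaSuccess) sizes_[*ptr] = bytes;
    return err;
  }

  cudaError_t free(void* ptr) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = sizes_.find(ptr);
    if (it == sizes_.end()) return cudaErrorInvalidValue;
    cache_.emplace(it->second, ptr);           // keep the block for reuse
    sizes_.erase(it);
    return cudaSuccess;                        // no cudaFree, no device sync
  }

 private:
  std::mutex mutex_;
  std::multimap<size_t, void*> cache_;         // size -> cached block
  std::map<void*, size_t> sizes_;              // live block -> size
};
```

A real allocator also has to track which stream each block was last used on before handing it out again, so that reuse across streams stays safe; that bookkeeping is omitted here.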

@borisfom (Contributor)
Hi @soumith,
Actually, my implementation does not really require the user to specify anything: an initial default size of, say, 10% would work just fine. One thing that would have to be added to improve this non-greedy approach is increasing the minimum slab size used when CNMEM reaches for cudaMalloc to get the next block of GPU memory.
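A tiny sketch of that growth policy (names are illustrative, not from the actual branch): start the pool at roughly 10% of device memory and, whenever another slab has to be requested from cudaMalloc, ask for a size no smaller than a growing minimum, e.g. doubling each time:

```cpp
// Hypothetical slab-growth policy: small initial pool, geometrically growing
// minimum size for each additional slab requested from cudaMalloc.
#include <cstddef>
#include <algorithm>

struct SlabPolicy {
  size_t nextMinSlab;                              // grows after every new slab

  explicit SlabPolicy(size_t totalMem) : nextMinSlab(totalMem / 10) {}  // ~10% default

  size_t nextSlabSize(size_t requestBytes) {
    size_t size = std::max(requestBytes, nextMinSlab);
    nextMinSlab *= 2;                              // each new slab at least doubles the minimum
    return size;
  }
};
```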

@borisfom (Contributor)
@colesbury: and yes, I'd be happy to talk. I have already expressed most of my thoughts in code here: https://github.com/borisfom/cutorch, and I would love to learn about yours!

@borisfom mentioned this pull request on Aug 23, 2016
@torchbot
Can one of the admins verify this patch?

@soumith (Member) commented Aug 30, 2016

@torchbot build this

@soumith (Member) commented Sep 26, 2016

@soumith closed this on Oct 17, 2016