
When memory pressure is high, THCStorage.cu resize() algo use device2device copy , will cause out of memory crash. #72

Closed
smartbitcoin opened this issue Mar 10, 2015 · 5 comments

Comments

@smartbitcoin

The code logic is currently:

float *data;
// Allocate the new buffer while the old buffer is still held on the device:
THCudaCheck(cudaMalloc((void**)(&data), size * sizeof(float)));
// Copy the old contents device-to-device, then free the old buffer:
THCudaCheck(cudaMemcpyAsync(data, self->data, THMin(self->size, size) * sizeof(float), cudaMemcpyDeviceToDevice));
THCudaCheck(cudaFree(self->data));
Consider this scenario: the GPU has 4 GB of RAM, of which 3 GB is used. If Lua calls resize() to grow the storage from 3 GB to 3.5 GB, the code above crashes with out-of-memory, because the old and new buffers must coexist during the device-to-device copy, even though the GPU still has 1 GB of spare RAM relative to the net growth.

The optimized logic should be: if there is not enough device RAM for the malloc, first copy the current data from device to host, then release the device RAM, then malloc the new device buffer, and finally copy the contents back from host to device.
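A minimal sketch of that fallback path, staging through host memory so the old and new device buffers never coexist (the helper name `resizeViaHost` is hypothetical, not cutorch API; error handling is abbreviated):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical fallback for THCStorage resize when a direct
// device-to-device resize would exceed free device RAM.
static cudaError_t resizeViaHost(float **d_data, long oldSize, long newSize)
{
    long keep = (oldSize < newSize) ? oldSize : newSize;

    // 1. Copy the current contents from device to host.
    float *h_buf = (float *)malloc(keep * sizeof(float));
    cudaMemcpy(h_buf, *d_data, keep * sizeof(float), cudaMemcpyDeviceToHost);

    // 2. Release the old device buffer *before* allocating the new one,
    //    so peak device usage is max(oldSize, newSize), not their sum.
    cudaFree(*d_data);

    // 3. Allocate the new device buffer.
    cudaError_t err = cudaMalloc((void **)d_data, newSize * sizeof(float));
    if (err == cudaSuccess) {
        // 4. Copy the saved contents back from host to device.
        cudaMemcpy(*d_data, h_buf, keep * sizeof(float), cudaMemcpyHostToDevice);
    }
    free(h_buf);
    return err;
}
```

The trade-off is two PCIe transfers instead of one on-device copy, but the out-of-memory failure mode for large resizes goes away.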

Please consider this request's importance, because device RAM is always very tight. It would be better to release before the malloc.

@smartbitcoin

[attached screenshot: cudaalloc]

@soumith soumith changed the title THCStorage.cu resize() algo use device2device copy , will cause out of memory crash. When memory pressure is high, THCStorage.cu resize() algo use device2device copy , will cause out of memory crash. Mar 10, 2015
@soumith commented Mar 10, 2015

When memory pressure is high, you should do that explicitly on your own. I don't think it is fair to expect cutorch to do a particular operation on the CPU implicitly in the background, as this can have many performance side effects that people generally would not expect.

@soumith soumith closed this as completed Mar 10, 2015
@smartbitcoin

soumith, device memory needs better management, especially since CUDA itself is still not that smart here. Consider a scenario: you have 4 GB of device RAM and allocate 1.5 GB first; later you want to resize to 2.5 GB. The resize() call can still crash with "out of memory" if the first 1.5 GB allocation is not aligned to a memory boundary: the free space then sits fragmented in the middle of device RAM, and CUDA cannot allocate a contiguous 2.5 GB block, even though enough RAM is available in total.

resize() is only called a few times during the whole training process, but it is the main cause of crashes.
Swap out, free, then swap in would be a good algorithm for the case where there is enough RAM but resize() still fails (allocating small chunks instead of one huge block would be even better, but is hard to implement). It has only a tiny performance impact, but it is a life-saving change.

I did my testing; now the issue is not the performance impact, it's that alloc and free are controlled by the CUDA runtime. So even if you free the "old" content before resize(), that memory space is still not returned to the CUDA runtime immediately. I'll try to figure out how to do a successful "swap", lol.
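Whether freed memory actually shows up as available again can be observed with cudaMemGetInfo (a diagnostic sketch using only standard CUDA runtime calls; the actual numbers depend on the driver and allocator, so no particular output is guaranteed):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t freeB, totalB;
    float *p;

    // Allocate 512 MB, then check how much the reported free memory drops.
    cudaMalloc((void **)&p, (size_t)512 << 20);
    cudaMemGetInfo(&freeB, &totalB);
    printf("after malloc: %zu MB free\n", freeB >> 20);

    // Free it, synchronize to flush pending work, and check again
    // whether the runtime reports the memory as available.
    cudaFree(p);
    cudaDeviceSynchronize();
    cudaMemGetInfo(&freeB, &totalB);
    printf("after free:   %zu MB free\n", freeB >> 20);
    return 0;
}
```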

@deepakjnath

@smartbitcoin were you able to find a solution for this problem? I am encountering the same issue and find it to be a major bottleneck.

@smartbitcoin

Kind of. I switched to Caffe, whose Blob structure lets you control GPU RAM flexibly, but you need to write some C++ code.
