
Memory not being deallocated in backward() #18643

Open
mdlockyer opened this issue Mar 30, 2019 · 17 comments
Labels
module: autograd (Related to torch.autograd, and the autograd engine in general)
module: memory usage (PyTorch is using more memory than it should, or it is leaking memory)
quansight-nack (High-prio issues that have been reviewed by Quansight and are judged to be not actionable)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

mdlockyer commented Mar 30, 2019

🐛 Bug

I've recently discovered an issue with memory not being freed after the first iteration of training. It's not a leak, as memory usage stays consistent after the second pass through the loop. It appears on both CPU and GPU; however, it is much more significant when running on CPU.

The issue seems to come from either backward() or optimizer.step(), as removing those calls gives stable memory usage.

I ran into this while attempting to train a rather large model that uses pretty much all of my available GPU memory. It completes the first iteration successfully, then OOMs during the second.

To Reproduce

Steps to reproduce the behavior:

I have put together minimal CPU and GPU gists that should reproduce this issue:

CPU
GPU

The CPU gist uses the memory-profiler package, so that will need to be installed with pip.
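
For reference, here is a minimal sketch of the kind of loop the CPU gist runs. The model here is a stand-in (the exact architecture is in the gist); the train function mirrors the profiled code shown below.

    import gc
    import torch
    import torch.nn as nn
    from memory_profiler import profile

    # Stand-in model: just needs to produce an output matching y's shape (1, 1, 8, 8).
    model = nn.Sequential(
        nn.Conv2d(3, 512, 3, padding=1),
        nn.ReLU(),
        nn.Conv2d(512, 1, 3, padding=1),
    )
    criterion = nn.MSELoss()
    optim_ = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0)

    @profile
    def train(model, criterion, optim):
        x = torch.rand(1, 3, 8, 8)
        y = torch.ones(1, 1, 8, 8)
        out = model(x)
        loss = criterion(out, y)
        optim.zero_grad()
        loss.backward()
        optim.step()
        optim.zero_grad()
        del x, y, out, loss
        gc.collect()

    if __name__ == "__main__":
        for _ in range(2):
            train(model, criterion, optim_)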

Expected behavior

The memory usage should be relatively the same in the first pass through the training loop, and all following loops.

Environment

PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.13.6
GCC version: Could not collect
CMake version: version 3.9.4

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.16.2
[pip3] torch==1.0.1.post2
[pip3] torchvision==0.2.2.post3
[conda] torch 1.0.1.post2
[conda] torchsummary 1.5.1
[conda] torchvision 0.2.1

Additional context

I ran some profiles on the CPU memory usage that highlight the issue:

With backward pass and update:


Iteration 1

Line #    Mem usage    Increment   Line Contents
================================================
    23    360.0 MiB    360.0 MiB   @profile
    24                             def train(model, criterion, optim):
    25    360.1 MiB      0.0 MiB       x = torch.rand(1, 3, 8, 8)
    26    360.1 MiB      0.0 MiB       y = torch.ones(1, 1, 8, 8)
    27                             
    28    402.7 MiB     42.6 MiB       out = model(x)
    29    402.7 MiB      0.1 MiB       loss = criterion(out, y)
    30                             
    31    402.7 MiB      0.0 MiB       optim.zero_grad()
    32    663.8 MiB    261.1 MiB       loss.backward()
    33    664.0 MiB      0.1 MiB       optim.step()
    34    664.0 MiB      0.0 MiB       optim.zero_grad()
    35    664.0 MiB      0.0 MiB       del x, y, out, loss
    36    664.0 MiB      0.0 MiB       gc.collect()

Iteration 2

Line #    Mem usage    Increment   Line Contents
================================================
    23    664.0 MiB    664.0 MiB   @profile
    24                             def train(model, criterion, optim):
    25    664.0 MiB      0.0 MiB       x = torch.rand(1, 3, 8, 8)
    26    664.0 MiB      0.0 MiB       y = torch.ones(1, 1, 8, 8)
    27                             
    28    701.7 MiB     37.7 MiB       out = model(x)
    29    701.7 MiB      0.0 MiB       loss = criterion(out, y)
    30                             
    31    701.7 MiB      0.0 MiB       optim.zero_grad()
    32    671.7 MiB      0.0 MiB       loss.backward()
    33    671.7 MiB      0.0 MiB       optim.step()
    34    671.7 MiB      0.0 MiB       optim.zero_grad()
    35    671.7 MiB      0.0 MiB       del x, y, out, loss
    36    671.7 MiB      0.0 MiB       gc.collect()

Without backward pass and update:


Iteration 1

Line #    Mem usage    Increment   Line Contents
================================================
    23    351.2 MiB    351.2 MiB   @profile
    24                             def train(model, criterion, optim):
    25    351.2 MiB      0.0 MiB       x = torch.rand(1, 3, 8, 8)
    26    351.3 MiB      0.0 MiB       y = torch.ones(1, 1, 8, 8)
    27                             
    28    392.4 MiB     41.1 MiB       out = model(x)
    29    392.5 MiB      0.1 MiB       loss = criterion(out, y)
    30                             
    31    392.5 MiB      0.0 MiB       optim.zero_grad()
    32                                 #loss.backward()
    33                                 #optim.step()
    34    392.5 MiB      0.0 MiB       optim.zero_grad()
    35    361.7 MiB      0.0 MiB       del x, y, out, loss
    36    361.7 MiB      0.0 MiB       gc.collect()

Iteration 2

Line #    Mem usage    Increment   Line Contents
================================================
    23    361.7 MiB    361.7 MiB   @profile
    24                             def train(model, criterion, optim):
    25    361.7 MiB      0.0 MiB       x = torch.rand(1, 3, 8, 8)
    26    361.7 MiB      0.0 MiB       y = torch.ones(1, 1, 8, 8)
    27                             
    28    392.0 MiB     30.3 MiB       out = model(x)
    29    392.0 MiB      0.0 MiB       loss = criterion(out, y)
    30                             
    31    392.0 MiB      0.0 MiB       optim.zero_grad()
    32                                 #loss.backward()
    33                                 #optim.step()
    34    392.0 MiB      0.0 MiB       optim.zero_grad()
    35    361.7 MiB      0.0 MiB       del x, y, out, loss
    36    361.7 MiB      0.0 MiB       gc.collect()

cc @ezyang @gchanan @zou3519 @ssnl @albanD @gqchen

ssnl (Collaborator) commented Mar 30, 2019

After the first backward, the parameters' grad buffers are created, so the model takes 2x memory, as expected. In the first optim.step, if the optimizer maintains state buffers (e.g., Adam, or SGD with momentum), those buffers will be created, also as expected.
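
A quick way to see that lazy allocation (a minimal sketch with a small stand-in model, not the reproducer's network):

    import torch
    import torch.nn as nn

    model = nn.Linear(1000, 1000)                 # stand-in model
    optim = torch.optim.Adam(model.parameters())  # any stateful optimizer works here

    print(all(p.grad is None for p in model.parameters()))  # True: no grad buffers yet
    print(len(optim.state))                                  # 0: no optimizer state yet

    loss = model(torch.rand(8, 1000)).sum()
    loss.backward()        # grad buffers are allocated here (~2x parameter memory)
    optim.step()           # Adam allocates its exp_avg / exp_avg_sq buffers here

    print(all(p.grad is not None for p in model.parameters()))  # True
    print(len(optim.state))                                      # one state entry per parameter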

mdlockyer (Author) commented Mar 30, 2019

That definitely explains the bulk of the memory usage; however, it doesn't explain the increase from the first iteration to the second. That increase is more at the root of the issue, and I may have chosen a bad title. If you look at the peak usage, it is about 40 MB higher in the second pass. In the model I was training when I discovered this it was more exaggerated, almost 1 GB higher. I've checked and double-checked that no tensors are accidentally staying referenced and escaping garbage collection. In fact, I tried the same pattern as the reproduction gists, where I don't return anything at all and use del on all tensors. It still runs about 1 GB higher from the second iteration onward for that architecture. Nothing is being stored internally within the model's child Modules, so I can't explain it.

Also, I should note that no momentum was being used in any of the models I've tested. All have been SGD with momentum=0, so it shouldn't be storing any momentum data after the first pass.

ezyang added the module: memory usage, triage review, module: autograd, triaged, and high priority labels, and removed the triage review label, Apr 2, 2019
soumith (Member) commented May 27, 2019

@ezyang @gchanan could you make sure someone looks into this

ezyang (Contributor) commented Jun 11, 2019

cc @malvika2147

malvika2147 commented

@mdlockyer Can you please add the gists again? The links are currently broken.

mdlockyer (Author) commented

@malvika2147 They should be back up. Sorry about that. I finally got around to changing my old username, which broke all those links.

KaiQiao1992 commented

I'm encountering the same problem: memory is about 8 GB higher when executing the second loss.backward(). I do not know why.

mdlockyer (Author) commented Jun 26, 2019

@KaiQiao1992 it may be worth noting in this discussion that if you are using adaptive optimizers like Adam, there are a lot of buffers being created under the hood. They are very memory hungry. Adam creates two buffers that are of equal size to the weights being optimized (so in memory, it's model size x3), and if amsgrad=True it will add a third. Not sure if that is relevant to your case, but I thought I'd put it out there. And as @ssnl mentioned, those buffers are created on the first call to step(), so they will only appear after the first iteration.
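
For reference, a minimal sketch of inspecting those buffers (buffer names as they appear in the optimizer state for Adam):

    import torch
    import torch.nn as nn

    model = nn.Linear(100, 100)
    optim = torch.optim.Adam(model.parameters(), amsgrad=True)

    model(torch.rand(1, 100)).sum().backward()
    optim.step()

    # Each parameter gets exp_avg and exp_avg_sq (same shape as the parameter),
    # plus max_exp_avg_sq when amsgrad=True: roughly 3x extra parameter memory.
    for p in model.parameters():
        shapes = {k: tuple(v.shape) for k, v in optim.state[p].items() if torch.is_tensor(v)}
        print(shapes)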

mdlockyer (Author) commented

@KaiQiao1992 8 GB sounds steep for optimizer buffers though. That is a significant amount. Hopefully you're able to figure it out. This may help. It's a memory profiler for PyTorch. I haven't tested it out, but it could be of use.

KaiQiao1992 commented

@mdlockyer I did indeed use the Adam optimizer. Strangely, after rebooting the machine, I do not encounter the "out of memory" error again, even though I'm using the same Adam. Because my fc layer has size 800000*1000, the memory consumption is large.

mdlockyer (Author) commented

@KaiQiao1992 that's huge!! 800M parameters! My biggest model was 45M and I thought that was gigantic. Glad you're not getting the OOM errors now though.

prasunanand (Contributor) commented Jul 30, 2019

When I add time.sleep(20) to the reproducer code, there is no such issue. I believe the gc kicks in during the sleep().

https://gist.github.com/prasunanand/0926fe1ea453a785c967d2c444a22402

ezyang (Contributor) commented Jul 30, 2019

If that's true, swapping time.sleep(20) with gc.collect() ought to work too. Sounds like a reference cycle, in that case?
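
A minimal sketch of that check, reusing the train function and names from the reproducer sketch above:

    import gc

    for i in range(5):
        train(model, criterion, optim_)
        # Swap prasunanand's time.sleep(20) for an explicit collection. gc.collect()
        # also returns how many unreachable objects it found, which hints at whether
        # reference cycles are holding on to the memory.
        found = gc.collect()
        print(f"iteration {i}: gc.collect() found {found} unreachable objects")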

mdlockyer (Author) commented

@prasunanand that's interesting. I'll test your gist on my end when I get a chance. @ezyang In my reproduction, I have a call to collect() after each iteration already. Not sure why the sleep works but not explicit garbage collection.

TaehwanKwon commented

I had the same issue on Python 3.6.9 + torch 1.3.0, but it works fine on Python 3.7.5 + torch 1.3.0.

peterbell10 self-assigned this Nov 11, 2019
peterbell10 (Collaborator) commented

I've been unable to reproduce the sharp memory spikes from the issue, only a very slight rise in memory usage. The profile looks roughly the same for all of the Python and PyTorch versions I tried.

[memory usage plot across iterations, Python 3.6]

@TaehwanKwon would you mind posting the memory profile that you see running the cpu script on python 3.6? Also, are you using macOS like @mdlockyer?

peterbell10 removed their assignment Nov 21, 2019
rgommers added the quansight-nack label Jan 26, 2020
ezyang (Contributor) commented Jan 27, 2020

Given that this reproduces inconsistently / can be fixed by upgrading (either PyTorch or Python), I'm downgrading the priority of this issue. If someone can come up with a clear configuration on the newest Python/PyTorch that exactly causes the problem, please let us know.
