Memory 'leak' issue with MapTable and clearState #1141

Closed
achalddave opened this issue Feb 20, 2017 · 1 comment

Comments

@achalddave (Contributor) commented Feb 20, 2017

Calling clearState() on MapTable seems to lead to a significant increase in memory usage on subsequent iterations. I was unable to track down the source of the leak, but the following script demonstrates the bug. (The exact memory amounts may not match across systems, but the relative amounts should be consistent.)

--[[ Displays memory 'leak' with nn.MapTable after clearState() ]]--

local nn = require 'nn'
local cunn = require 'cunn'

model = nn.MapTable():add(nn.SpatialConvolution(3, 256, 3, 3, 1, 1, 1, 1, 1))
                     :cuda()
i = {torch.rand(30, 3, 224, 224):cuda(), torch.rand(30, 3, 224, 224):cuda()}

function check_mem() os.execute('nvidia-smi | grep luajit') end
-- Train two iterations without clearState:
print('Before training 1'); check_mem() -- 277 MiB
o = model:forward(i)
print('After forward 1'); check_mem() -- 3254 MiB

-- Train another iteration:
print('Before training 2'); check_mem() -- 3254 MiB
o = model:forward(i)
print('After forward 2'); check_mem() -- 3254 MiB

-- Clear state:
model:clearState()
collectgarbage()
collectgarbage()

-- Train a final iteration after clearState. This forward call causes an
-- increase in memory usage for the rest of the program!
print('Before training 3 (after clearState)'); check_mem() -- 3254 MiB
o = model:forward(i)
print('After forward 3 (after clearState)'); check_mem() -- 4724 MiB!

-- Garbage collection doesn't fix it.
collectgarbage()
collectgarbage()
print('After collectgarbage()'); check_mem() -- 4724 MiB!

The script is at: https://gist.github.com/achalddave/6ac8390e06a23ecc6d67e3fa22ef0f04

A few notes:

  • The issue does not seem to occur if I forward an input table with only one element.
  • The issue does not occur if nn.MapTable is removed and replaced with a single SpatialConvolution operating on a single input tensor (a sketch of this control is below).
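
For reference, a rough sketch of that second control (not taken from the original gist; it assumes the same sizes as the script above):

-- Single SpatialConvolution on a single tensor: clearState() does not cause
-- the extra memory retention seen with MapTable.
conv = nn.SpatialConvolution(3, 256, 3, 3, 1, 1, 1, 1):cuda()
x = torch.rand(30, 3, 224, 224):cuda()
o = conv:forward(x)
conv:clearState()
collectgarbage(); collectgarbage()
o = conv:forward(x) -- memory usage stays flat here, unlike the MapTable case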
@achalddave (Contributor Author)

Update: This issue is fixed if clearState is called on the modules inside MapTable before they are removed from it, here:

function MapTable:clearState()
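
A minimal sketch of the change described above (assuming the stock MapTable layout, where self.modules holds the replicas and parent refers to nn.Container as in MapTable.lua; this shows the shape of the fix, not necessarily the exact patch):

function MapTable:clearState()
   for i = 2, #self.modules do
      -- clear each replica's buffers before dropping it, so the underlying
      -- (CUDA) storage can actually be freed
      self.modules[i]:clearState()
      self.modules[i] = nil
   end
   parent.clearState(self)
end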

achalddave added a commit to achalddave/nn that referenced this issue Feb 20, 2017
soumith closed this as completed Feb 21, 2017
achalddave added a commit to achalddave/predictive-corrective that referenced this issue Jul 20, 2017
This should fix the same issue as in
torch/nn#1141
achalddave added a commit to achalddave/predictive-corrective that referenced this issue Jul 20, 2017
Calling clearState() seems to cause issues that, after 4-5 days of
debugging, I haven't been able to fix. See, for example:

torch/nn#1141
torch/cunn#441

Further, it's unclear to me whether `getParameters` and memory management in
general work well when a call to `clearState` can destroy modules (and
therefore weight tensors). The easiest solution to all of this is simply
to never call clearState on the model while it is training.

When saving the model, we create a copy of it on the CPU, and call
clearState on this CPU copy, which we then save to disk.
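
A rough sketch of that save path (the function name is illustrative, not taken from the repository):

-- Save a clearState'd CPU copy of the model without touching the live CUDA model.
local function save_checkpoint(model, path)
   local cpu_copy = model:clone() -- deep copy via serialization
   cpu_copy:float()               -- move parameters and buffers to the CPU
   cpu_copy:clearState()          -- clear intermediate buffers on the copy only
   torch.save(path, cpu_copy)
end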