Commit 5b093d3: fix save/load bug
soumith committed Jan 16, 2016 (1 parent: 76cc974)
Showing 1 changed file with 2 additions and 0 deletions.
main.lua:

@@ -238,6 +238,8 @@ for epoch = 1, opt.niter do
     paths.mkdir('checkpoints')
     util.save('checkpoints/' .. opt.name .. '_' .. epoch .. '_net_G.t7', netG, opt.gpu)
     util.save('checkpoints/' .. opt.name .. '_' .. epoch .. '_net_D.t7', netD, opt.gpu)
+    parametersD, gradParametersD = netD:getParameters() -- reflatten the params and get them
+    parametersG, gradParametersG = netG:getParameters()
     print(('End of epoch %d / %d \t Time Taken: %.3f'):format(
        epoch, opt.niter, epoch_tm:time().real))
 end
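
For context, the two added lines matter because the training loop drives the networks through the flat views returned by getParameters(): optim updates parametersD and parametersG in place, so if the save path leaves those flat tensors pointing at stale storage (as the added comment, "reflatten the params", indicates can happen here), later updates would stop reaching the live modules. Below is a minimal sketch of the checkpoint-then-reflatten pattern using a toy network; plain torch.save stands in for this repo's util.save (which additionally clones/converts the network), and the optimizer call is only referenced in a comment.

-- Minimal sketch (toy net, plain torch.save in place of util.save) of why
-- the flat parameter views are rebuilt right after checkpointing.
require 'torch'
require 'nn'
require 'paths'

local netD = nn.Sequential():add(nn.Linear(10, 1))

-- Flat views over every module's weight/gradient storage; the training loop
-- (e.g. optim.adam(fDx, parametersD, optimStateD) in main.lua) updates these
-- tensors in place.
local parametersD, gradParametersD = netD:getParameters()

-- End-of-epoch checkpoint.
paths.mkdir('checkpoints')
torch.save('checkpoints/toy_net_D.t7', netD)

-- Reflatten so the views alias the live module storages again, mirroring the
-- two lines added in this commit for netD and netG.
parametersD, gradParametersD = netD:getParameters()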

6 comments on commit 5b093d3

@gwern (Contributor) commented on 5b093d3, Jan 18, 2016
Does this fix the RAM consumption bug, where you can only use up to ~40% of GPU RAM because memory consumption more than doubles at the first epoch and then stays around ~90% thereafter? I'd been meaning to ask about it, since it seriously gets in the way of adding more layers or filters, using bigger inputs, or running multiple DCGANs with different settings, but if you've fixed it already, great!

@soumith (Owner, Author)

@gwern it should afaik.

@gwern (Contributor) commented on 5b093d3, Jan 21, 2016

Mm, I guess not. I set up a new run (a 256x256px version, to take a look now that I thought dcgan.torch might run in constant memory), tuned to leave ~100MB free on the GPU (balancing a DCGAN that big isn't too hard; you just lower the learning rate and mini-batch size and keep the usual 2x gen:dis ratio), but no: it errored out at epoch 1 in a way that suggests the RAM consumption isn't fixed:

Epoch: [1][   70261 /    70261]  Time: 1.075  DataTime: 0.000    Err_G: 1.8536  Err_D: 0.3223
/home/gwern/src/torch/install/bin/luajit: /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:162: $ Torch: not enough memory: you tried to reallocate 8GB. Buy new RAM! at /home/gwern/src/torch/pkg/torch/lib/TH/THGeneral.c:251
stack traceback:
        [C]: in function 'write'
        /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:162: in function </home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:90>
        [C]: in function 'write'
        /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:162: in function 'writeObject'
        /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:181: in function 'writeObject'
        /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:172: in function 'writeObject'
        /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:181: in function 'writeObject'
        /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:181: in function 'writeObject'
        /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:172: in function 'writeObject'
        /home/gwern/src/torch/install/share/lua/5.1/nn/Module.lua:106: in function 'clone'
        /home/gwern/src/dcgan.torch-allbrown/util.lua:6: in function 'save'
        main-256.lua:257: in main chunk
        [C]: in function 'dofile'
        .../src/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

I'm not sure what's going on here... I've tried out several Torch-based applications, and dcgan.torch is the only one that has such enormous spikes in RAM usage at checkpoints; e.g., char-rnn and neuraltalk2 can both be pushed to <50MB free without crashing at checkpoints.
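
For reference, the traceback above points at nn.Module:clone() inside util.save: clone() deep-copies the network by serializing it (cached output/gradInput buffers included) through a torch.MemoryFile, which is where the temporary spike comes from. A general Torch-side mitigation, offered here only as a hedged sketch and not necessarily what the later fix does, is to drop those cached buffers with clearState() before saving; save_checkpoint below is a hypothetical helper, not part of this repo.

-- Hypothetical helper (not dcgan.torch's actual fix): shrink what gets
-- serialized by clearing cached activation/gradient buffers first.
require 'torch'
require 'nn'

local function save_checkpoint(path, net)
   -- clearState() empties each module's output/gradInput (and similar)
   -- buffers in place; they are rebuilt on the next forward/backward pass.
   net:clearState()
   torch.save(path, net)
end

-- e.g. save_checkpoint('checkpoints/' .. opt.name .. '_' .. epoch .. '_net_D.t7', netD)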

@soumith (Owner, Author)

@gwern ok, I am simulating a ~100MB-free-memory setting on my side and will fix it now.

@soumith (Owner, Author)

Removed the spiking issue via 29b8dbc.

You need an updated cutorch (luarocks install cutorch) and the latest dcgan.torch. After that you should be all set. Training now takes 400MB of GPU memory (anything less, even 380MB, did not work) and never exceeds 400MB, even after checkpointing.

@gwern (Contributor) commented on 5b093d3, Jan 22, 2016

Great. I will give that a try and let you know if that fixes it.
