Memory leaks during training! #13

Open
Zhang-HM opened this issue Jul 2, 2019 · 2 comments

Comments

@Zhang-HM

Zhang-HM commented Jul 2, 2019

cross_entropy: 0.001210
lr: 0.001000
speed: 2.425s / iter
/gruntdata/disk2/hm/CLWSOD/tools/../lib/nets/network.py:569: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
cross_entropy, total_loss = self._losses['wsddn_loss'].data[0],
/gruntdata/disk2/hm/CLWSOD/tools/../lib/nets/network.py:570: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
self._losses['total_loss'].data[0]
iter: 2 / 200000, total loss: 1.993461
cross_entropy: 0.000035
lr: 0.001000
speed: 1.868s / iter
iter: 3 / 200000, total loss: 0.652638
cross_entropy: 0.016665
lr: 0.001000
speed: 1.693s / iter
iter: 4 / 200000, total loss: 0.393473
cross_entropy: 0.001328
lr: 0.001000
speed: 1.708s / iter
iter: 5 / 200000, total loss: 0.351719
cross_entropy: 0.000444
lr: 0.001000
speed: 1.620s / iter
iter: 6 / 200000, total loss: 0.543326
cross_entropy: 0.050670
lr: 0.001000
speed: 1.600s / iter
iter: 7 / 200000, total loss: 0.261543
cross_entropy: 0.000296
lr: 0.001000
speed: 1.564s / iter
iter: 8 / 200000, total loss: 0.783304
cross_entropy: 0.024824
lr: 0.001000
speed: 1.529s / iter
iter: 9 / 200000, total loss: 0.537496
cross_entropy: 0.011513
lr: 0.001000
speed: 1.510s / iter
iter: 10 / 200000, total loss: 0.964071
cross_entropy: 0.010650
lr: 0.001000
speed: 1.489s / iter
iter: 11 / 200000, total loss: 0.296966
cross_entropy: 0.020692
lr: 0.001000
speed: 1.472s / iter
iter: 12 / 200000, total loss: 0.546587
cross_entropy: 0.044390
lr: 0.001000
speed: 1.480s / iter
iter: 13 / 200000, total loss: 0.693391
cross_entropy: 0.001768
lr: 0.001000
speed: 1.479s / iter
iter: 14 / 200000, total loss: 0.190509
cross_entropy: 0.051802
lr: 0.001000
speed: 1.474s / iter
iter: 15 / 200000, total loss: 0.302866
cross_entropy: 0.053017
lr: 0.001000
speed: 1.476s / iter
iter: 16 / 200000, total loss: 0.468978
cross_entropy: 0.000957
lr: 0.001000
speed: 1.456s / iter
iter: 17 / 200000, total loss: 0.609222
cross_entropy: 0.007434
lr: 0.001000
speed: 1.457s / iter
iter: 18 / 200000, total loss: 0.089435
cross_entropy: 0.003355
lr: 0.001000
speed: 1.458s / iter
iter: 19 / 200000, total loss: 0.506788
cross_entropy: 0.002159
lr: 0.001000
speed: 1.464s / iter
iter: 20 / 200000, total loss: 0.507251
cross_entropy: 0.020046
lr: 0.001000
speed: 1.464s / iter
iter: 21 / 200000, total loss: 0.365586
cross_entropy: 0.113681
lr: 0.001000
speed: 1.455s / iter
iter: 22 / 200000, total loss: 0.184315
cross_entropy: 0.084765
lr: 0.001000
speed: 1.467s / iter
iter: 23 / 200000, total loss: 0.200998
cross_entropy: 0.048887
lr: 0.001000
speed: 1.458s / iter
iter: 24 / 200000, total loss: 0.124370
cross_entropy: 0.003205
lr: 0.001000
speed: 1.461s / iter
iter: 25 / 200000, total loss: 0.102922
cross_entropy: 0.059250
lr: 0.001000
speed: 1.467s / iter
iter: 26 / 200000, total loss: 0.175924
cross_entropy: 0.031119
lr: 0.001000
speed: 1.495s / iter
iter: 27 / 200000, total loss: 0.185290
cross_entropy: 0.002968
lr: 0.001000
speed: 1.493s / iter
iter: 28 / 200000, total loss: 0.163398
cross_entropy: 0.005777
lr: 0.001000
speed: 1.484s / iter
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 149, in <module>
    max_iters=args.max_iters)
  File "/gruntdata/disk2/hm/CLWSOD/tools/../lib/model/train_val.py", line 380, in train_net
    sw.train_model(max_iters)
  File "/gruntdata/disk2/hm/CLWSOD/tools/../lib/model/train_val.py", line 294, in train_model
    self.net.train_step(blobs, self.optimizer)
  File "/gruntdata/disk2/hm/CLWSOD/tools/../lib/nets/network.py", line 573, in train_step
    self._losses['total_loss'].backward()
  File "/gruntdata/disk1/anaconda3/envs/hm3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/gruntdata/disk1/anaconda3/envs/hm3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: out of memory
Command exited with non-zero status 1
339.25user 70.95system 1:04.60elapsed 634%CPU (0avgtext+0avgdata 3661156maxresident)k
0inputs+184outputs (0major+2307976minor)pagefaults 0swaps
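
The two `UserWarning` messages near the top of this log come from indexing a 0-dim tensor with `.data[0]`, and the warning text itself recommends `tensor.item()`. A minimal sketch of a version-tolerant conversion follows. `loss_to_float` is a hypothetical helper, not part of the repository, and it assumes that older PyTorch releases store a scalar loss as a one-element tensor. It only addresses the warning; whether it has any effect on the out-of-memory error is not established by this log.

```python
def loss_to_float(loss):
    """Convert a scalar loss to a plain Python float.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Newer PyTorch: 0-dim tensors expose .item(), as the warning suggests.
    if hasattr(loss, "item"):
        return float(loss.item())
    # Older PyTorch (assumption): the scalar loss is a one-element tensor,
    # so indexing .data[0] yields the number.
    return float(loss.data[0])
```

Under these assumptions, the two expressions flagged in `network.py` would become `loss_to_float(self._losses['wsddn_loss'])` and `loss_to_float(self._losses['total_loss'])`.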

@Sunarker
Owner

Sunarker commented Jul 3, 2019

With so little information, and no details about your GPU (such as its model and memory size), we cannot meaningfully diagnose this error. Besides, we have stated that our code uses PyTorch 0.2. The warnings in the first few lines of your log suggest you are running a different version, so we cannot guarantee how the code behaves there.
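
For anyone reporting a similar error, the GPU details and PyTorch version requested here could be collected with something like the sketch below. It assumes an NVIDIA driver that provides the `nvidia-smi` command; `gpu_report` and `torch_version` are hypothetical helper names, not part of the repository.

```python
import shutil
import subprocess


def gpu_report():
    """Return GPU name and memory usage as CSV text, or a note if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found on PATH"
    cmd = [
        "nvidia-smi",
        "--query-gpu=name,memory.total,memory.used",
        "--format=csv",
    ]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,  # text output; works on Python 3.6
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "nvidia-smi failed: {}".format(exc)
    return result.stdout.strip()


def torch_version():
    """Return the installed PyTorch version string, or a note if it is missing."""
    try:
        import torch
    except ImportError:
        return "not installed"
    return str(torch.__version__)


if __name__ == "__main__":
    print("PyTorch:", torch_version())
    print(gpu_report())
```

Pasting this output alongside the training log gives maintainers the memory size and framework version they need to judge an out-of-memory report.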

@Zhang-HM
Author

Thank you! I solved this problem by following your suggestion.
