
CUDA out of memory (with light flag) #23

Open
artempimushkin opened this issue Aug 15, 2019 · 9 comments

Comments

@artempimushkin

Hi guys!
I'm using an RTX 2080 Ti (11 GB). At first I tried to train on a dataset of 100K images (1000 px) with the --light flag, and after step 1000 I got the "CUDA out of memory" error. Then I tried a smaller dataset of 10K images (256 px) and got the same error after step 1000. Finally I tried 3,400 images (256 px) and nothing changed.

Here is the output:

[  997/1000000] time: 582.6236 d_loss: 0.00474171, g_loss: 1344.68078613
[  998/1000000] time: 583.2094 d_loss: 0.00624988, g_loss: 1328.24572754
[  999/1000000] time: 583.7950 d_loss: 0.00641153, g_loss: 1374.71826172
[ 1000/1000000] time: 584.3810 d_loss: 0.00178387, g_loss: 1280.08032227
/home/p0wx/prj/UGATIT-pytorch/utils.py:46: RuntimeWarning: invalid value encountered in true_divide
  cam_img = x / np.max(x)
Traceback (most recent call last):
  File "main.py", line 83, in <module>
    main()
  File "main.py", line 75, in main
    gan.train()
  File "/home/p0wx/prj/UGATIT-pytorch/UGATIT.py", line 209, in train
    fake_B2B, fake_B2B_cam_logit, _ = self.genA2B(real_B)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/prj/UGATIT-pytorch/networks.py", line 108, in forward
    out = self.UpBlock2(x)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/prj/UGATIT-pytorch/networks.py", line 191, in forward
    out = self.rho.expand(input.shape[0], -1, -1, -1) * out_in + (1-self.rho.expand(input.shape[0], -1, -1, -1)) * out_ln
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.32 GiB already allocated; 5.56 MiB free; 621.27 MiB cached)

@Frizy-up

Frizy-up commented Aug 20, 2019

The same error appeared in my test when training reached step 3000 (1080 Ti). I don't know why.

@Frizy-up

> The same error appeared in my test when training reached step 3000 (1080 Ti). I don't know why.

I know the reason why the error appeared at step 3000: at that point one epoch has finished, but the memory is not released, and about 100+ MB more is needed to start a new data loader. When I changed the input image size to something smaller, it worked.
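For context, here is a minimal self-contained sketch (not the repo's exact code; names and batch handling simplified) of the iterator-recreation pattern UGATIT.py uses to fetch batches. When an iterator runs out at an epoch boundary, a fresh one is created, and until the old iterator is garbage-collected its workers and buffers still hold memory alongside the new one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Minimal sketch of the batch-fetching pattern in UGATIT.py's train loop
# (simplified). The old iterator is only freed once Python collects it, so
# briefly two iterators' buffers can coexist around an epoch boundary.
loader = DataLoader(TensorDataset(torch.randn(8, 3, 4, 4)), batch_size=1)
data_iter = iter(loader)

for step in range(20):
    try:
        (batch,) = next(data_iter)
    except StopIteration:
        data_iter = iter(loader)   # old iterator may not be released yet
        (batch,) = next(data_iter)
```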

@DaddyWesker

> The same error appeared in my test when training reached step 3000 (1080 Ti). I don't know why.
>
> I know the reason why the error appeared at step 3000: at that point one epoch has finished, but the memory is not released, and about 100+ MB more is needed to start a new data loader. When I changed the input image size to something smaller, it worked.

So every 1000th step it requires 100+ MB more and doesn't release it? Asking since I'm facing the same problem, but at step 2000, and my images are 256x256.

@lxy2017

lxy2017 commented Oct 25, 2019

The same error appeared in my test. When I set print_freq = 10000, it works.

@heartInsert

heartInsert commented Nov 20, 2019

Apparently there is a bug in PyTorch: when you open a new dataloader, it seems the older dataloader is not released. I have run into this many times.

@07hyx06

07hyx06 commented Nov 23, 2019

You can open UGATIT.py and add a "with torch.no_grad():" block around the image-generation code inside the "step % print_freq" branch!
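For anyone looking for what that change amounts to, here is a minimal self-contained sketch of the idea (a stand-in model, not UGATIT's actual generator): run the periodic visualization forward passes under torch.no_grad() so they don't build autograd graphs on top of the training step's memory.

```python
import torch
import torch.nn as nn

# Sketch of the suggested fix: forward passes used only for printing/saving
# sample images should not record gradients.
genA2B = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in for the real generator
real_A = torch.randn(1, 3, 256, 256)
print_freq = 1000

for step in range(1, 2001):
    # ... normal training step here ...
    if step % print_freq == 0:
        genA2B.eval()
        with torch.no_grad():          # no graph is built, so no extra memory is held
            fake_A2B = genA2B(real_A)
        genA2B.train()
```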

@scutlrr

scutlrr commented Apr 22, 2020

> You can open UGATIT.py and add a "with torch.no_grad():" block around the image-generation code inside the "step % print_freq" branch!

It already uses self.genA2B.eval() in the code.
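Note that model.eval() only switches layers such as BatchNorm and Dropout to inference behavior; it does not stop autograd from recording the forward pass. A quick generic check (not the repo's code):

```python
import torch
import torch.nn as nn

m = nn.Linear(4, 4)
x = torch.randn(1, 4)

m.eval()
print(m(x).requires_grad)        # True: a graph is still built in eval mode

with torch.no_grad():
    print(m(x).requires_grad)    # False: no graph, no extra memory retained
```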

@shafeeq07

I was getting "CUDA out of memory" error at beginning of training, I solved it by setting a lower base channel number than the default one (via --ch). I used 32 while default is 64

@nzhang258

nzhang258 commented Jul 23, 2020

> @shafeeqbsse I was getting the "CUDA out of memory" error at the beginning of training. I solved it by setting a lower base channel number than the default (via --ch): I used 32, while the default is 64.

Can you reproduce the results in the paper?
My results are bad with this network (ch=32) :(

wilsonjwcsu added a commit to wilsonjwcsu/UGATIT-pytorch that referenced this issue Mar 16, 2022
…ted by 07hyx06. Surrounded print output with torch.no_grad()