
CUDA out of memory (with light flag) #23

Open
artempimushkin opened this issue Aug 15, 2019 · 9 comments

Comments

@artempimushkin

Hi guys!
I'm using an RTX 2080 Ti (11 GB). At first I tried to train on a dataset of 100K images (1000 px) with the --light flag, and after step 1000 I got the "CUDA out of memory" error. Then I tried a smaller dataset of 10K images (256 px) and got the same error after step 1000. Finally I tried 3,400 images (256 px) and nothing changed.

Here is the output:

[  997/1000000] time: 582.6236 d_loss: 0.00474171, g_loss: 1344.68078613
[  998/1000000] time: 583.2094 d_loss: 0.00624988, g_loss: 1328.24572754
[  999/1000000] time: 583.7950 d_loss: 0.00641153, g_loss: 1374.71826172
[ 1000/1000000] time: 584.3810 d_loss: 0.00178387, g_loss: 1280.08032227
/home/p0wx/prj/UGATIT-pytorch/utils.py:46: RuntimeWarning: invalid value encountered in true_divide
  cam_img = x / np.max(x)
Traceback (most recent call last):
  File "main.py", line 83, in <module>
    main()
  File "main.py", line 75, in main
    gan.train()
  File "/home/p0wx/prj/UGATIT-pytorch/UGATIT.py", line 209, in train
    fake_B2B, fake_B2B_cam_logit, _ = self.genA2B(real_B)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/prj/UGATIT-pytorch/networks.py", line 108, in forward
    out = self.UpBlock2(x)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/prj/UGATIT-pytorch/networks.py", line 191, in forward
    out = self.rho.expand(input.shape[0], -1, -1, -1) * out_in + (1-self.rho.expand(input.shape[0], -1, -1, -1)) * out_ln
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.32 GiB already allocated; 5.56 MiB free; 621.27 MiB cached)

@Frizy-up

Frizy-up commented Aug 20, 2019

The same error appeared in my test when training reached step 3000 (1080 Ti). I don't know why.

@Frizy-up

> The same error appeared in my test when training reached step 3000 (1080 Ti). I don't know why.

I know the reason why the error appeared at step 3000: at that point one epoch has finished, but the memory is not released, and about 100+ MB more is needed to start a new data loader. When I changed the input image size to something smaller, it worked.
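For context, here is a minimal self-contained sketch (not the repo's exact code; names and batch handling simplified) of the iterator-recreation pattern UGATIT.py uses to fetch batches. When an iterator runs out at an epoch boundary, a fresh one is created, and until the old iterator is garbage-collected its workers and buffers still hold memory alongside the new one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Minimal sketch of the batch-fetching pattern in UGATIT.py's train loop
# (simplified). The old iterator is only freed once Python collects it, so
# briefly two iterators' buffers can coexist around an epoch boundary.
loader = DataLoader(TensorDataset(torch.randn(8, 3, 4, 4)), batch_size=1)
data_iter = iter(loader)

for step in range(20):
    try:
        (batch,) = next(data_iter)
    except StopIteration:
        data_iter = iter(loader)   # old iterator may not be released yet
        (batch,) = next(data_iter)
```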

@DaddyWesker

> The same error appeared in my test when training reached step 3000 (1080 Ti). I don't know why.
>
> I know the reason why the error appeared at step 3000: at that point one epoch has finished, but the memory is not released, and about 100+ MB more is needed to start a new data loader. When I changed the input image size to something smaller, it worked.

So every 1000th step it requires 100+ MB more and doesn't release it? Asking since I'm facing the same problem, but at step 2000, and my images are 256x256.

@lxy2017

lxy2017 commented Oct 25, 2019

The same error appeared in my test. When I set print_freq = 10000, it works.

@heartInsert

heartInsert commented Nov 20, 2019

Apparently there is a bug in PyTorch: when you open a new dataloader, it seems the older dataloader is not released. I have run into this many times.

@07hyx06

07hyx06 commented Nov 23, 2019

You can open UGATIT.py and add a "with torch.no_grad():" block around the image-generation code inside the "step % print_freq" branch!
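For anyone looking for what that change amounts to, here is a minimal self-contained sketch of the idea (a stand-in model, not UGATIT's actual generator): run the periodic visualization forward passes under torch.no_grad() so they don't build autograd graphs on top of the training step's memory.

```python
import torch
import torch.nn as nn

# Sketch of the suggested fix: forward passes used only for printing/saving
# sample images should not record gradients.
genA2B = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in for the real generator
real_A = torch.randn(1, 3, 256, 256)
print_freq = 1000

for step in range(1, 2001):
    # ... normal training step here ...
    if step % print_freq == 0:
        genA2B.eval()
        with torch.no_grad():          # no graph is built, so no extra memory is held
            fake_A2B = genA2B(real_A)
        genA2B.train()
```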

@scutlrr

scutlrr commented Apr 22, 2020

> You can open UGATIT.py and add a "with torch.no_grad():" block around the image-generation code inside the "step % print_freq" branch!

It already uses self.genA2B.eval() in the code.
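Note that model.eval() only switches layers such as BatchNorm and Dropout to inference behavior; it does not stop autograd from recording the forward pass. A quick generic check (not the repo's code):

```python
import torch
import torch.nn as nn

m = nn.Linear(4, 4)
x = torch.randn(1, 4)

m.eval()
print(m(x).requires_grad)        # True: a graph is still built in eval mode

with torch.no_grad():
    print(m(x).requires_grad)    # False: no graph, no extra memory retained
```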

@shafeeq07

I was getting "CUDA out of memory" error at beginning of training, I solved it by setting a lower base channel number than the default one (via --ch). I used 32 while default is 64

@nzhang258

nzhang258 commented Jul 23, 2020

> @shafeeqbsse I was getting the "CUDA out of memory" error at the beginning of training. I solved it by setting a lower base channel number than the default (via --ch): I used 32, while the default is 64.

Can you reproduce the results in the paper?
My results are bad with this network (ch=32) :(

wilsonjwcsu added a commit to wilsonjwcsu/UGATIT-pytorch that referenced this issue Mar 16, 2022
…ted by 07hyx06. Surrounded print output with torch.no_grad()