fwi throws out-of-memory error on a 32 GB GPU; paper mentions 24 GB GPU #29

kkothari93 opened this issue Nov 24, 2020 · 3 comments

kkothari93 commented Nov 24, 2020

From the experiment_scripts/ folder, I try to run the train_inverse_helmholtz.py experiment as follows:
python3 train_inverse_helmholtz.py --experiment_name fwi --batch_size 1

Section 5.3 of the supplementary material states that a single 24 GB GPU was used to run this experiment, whereas I am using a 32 GB V100, which should be sufficient. However, even with a batch size of 1, I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 31.75 GiB total capacity; 28.32 GiB already allocated; 11.75 MiB free; 30.49 GiB reserved in total by PyTorch)

Here is the full trace:

Traceback (most recent call last):
  File "train_inverse_helmholtz.py", line 78, in <module>
    training.train(model=model, train_dataloader=dataloader, epochs=opt.num_epochs, lr=opt.lr,
  File "../siren/training.py", line 73, in train
    losses = loss_fn(model_output, gt)
  File "../siren/loss_functions.py", line 188, in helmholtz_pml
    b, _ = diff_operators.jacobian(modules.compl_mul(B, dudx2), x)
  File "../siren/diff_operators.py", line 53, in jacobian
    jac[:, :, i, :] = grad(y_flat, x, torch.ones_like(y_flat), create_graph=True)[0]
  File ".../anaconda3/envs/tf-gpu2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 202, in grad
    return Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 31.75 GiB total capacity; 28.32 GiB already allocated; 11.75 MiB free; 30.49 GiB reserved in total by PyTorch)

Can you please help?
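For reference, a quick way to see where the memory actually goes is to print PyTorch's memory summary just before the grad() call that fails (this is the standard torch.cuda API; the exact placement inside diff_operators.py is only a suggestion):

import torch

# Prints a per-device breakdown of allocated, reserved and cached CUDA memory.
# Placing it immediately before the grad() call in diff_operators.jacobian
# shows how much of the card is already held by the autograd graph.
print(torch.cuda.memory_summary(device=0, abbreviated=True))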


gabewb commented Mar 4, 2021

Yeah, I'm likewise not seeing the memory requirements scale down with lower batch sizes on some experiments. I run out of memory with batch size 1 on train_img.py (I have a 6 GB GPU).

Batch-size memory scaling does work for train_sdf.py (point cloud): I'm able to get that under 6 GB with a batch size of 100000.
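For reference, that run looked roughly like the following (flag names other than --batch_size are assumptions based on the train_inverse_helmholtz.py call above; train_sdf.py may need additional arguments, e.g. a path to the point cloud):

python3 train_sdf.py --experiment_name sdf_test --batch_size 100000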


pielbia commented Jan 17, 2022

Same here. I have also tried reducing the batch size in train_inverse_helmholtz.py, to no avail. I am also running a 32 GB GPU and getting a CUDA out-of-memory error.


xefonon commented Jan 17, 2022

Hi, I'm getting exactly the same problem. I tried using the Python garbage collector (i.e. gc.collect()) and torch.cuda.empty_cache(), but it still crashes with OOM. @vsitzmann, any suggestions?
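For completeness, this is roughly what I tried at the end of each training step (standard Python/PyTorch calls); it frees cached blocks but, as said, does not prevent the crash:

import gc
import torch

gc.collect()               # drop unreachable Python objects that may still hold tensors
torch.cuda.empty_cache()   # return cached, unused CUDA blocks to the driver

Neither call releases memory that is still referenced by the live autograd graph, which appears to be what fills the card here.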
