[Bug]: train_num_rays ZeroDivisionError: division by zero #47

Closed
Katehuuh opened this issue Nov 2, 2023 · 12 comments
Katehuuh commented Nov 2, 2023

Hello, I could be wrong about the config, but the last command (4. Mesh Extraction) gives me this error:

...

..Wonder3D\instant-nsr-pl\systems\neus_ortho.py", line 139, in training_step
    train_num_rays = int(self.train_num_rays * (self.train_num_samples / out['num_samples_full'].sum().item()))
ZeroDivisionError: division by zero
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
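
For context, the failing line is instant-nsr-pl's dynamic ray-count update: it divides the target sample count by the total number of samples the renderer returned for this batch, which is zero here. A minimal guard sketch is shown below (an illustration only, not the fix adopted later in this thread, where the root cause turned out to be device handling on Windows):

    def update_train_num_rays(train_num_rays: int, train_num_samples: int, num_samples_full: int) -> int:
        """Guarded version of the dynamic ray-count update (sketch only)."""
        if num_samples_full <= 0:
            # No samples came back from the renderer (the symptom in this issue);
            # keep the previous ray count instead of dividing by zero.
            return train_num_rays
        return int(train_num_rays * (train_num_samples / num_samples_full))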
Katehuuh (Author) commented Nov 2, 2023

My installation (commit df03557, Windows, Python 3.10...):
git clone https://github.com/xxlong0/Wonder3D.git && cd Wonder3D
python -m venv venv && venv\Scripts\activate

mkdir ckpts\unet
curl -L -o ckpts\unet\diffusion_pytorch_model.bin https://huggingface.co/spaces/flamehaze1115/Wonder3D-demo/resolve/main/ckpts/unet/diffusion_pytorch_model.bin
curl -L -o ckpts\unet\config.json https://huggingface.co/spaces/flamehaze1115/Wonder3D-demo/resolve/main/ckpts/unet/config.json
curl -L -o ckpts\random_states_0.pkl https://huggingface.co/spaces/flamehaze1115/Wonder3D-demo/resolve/main/ckpts/random_states_0.pkl
curl -L -o ckpts\scaler.pt https://huggingface.co/spaces/flamehaze1115/Wonder3D-demo/resolve/main/ckpts/scaler.pt
curl -L -o ckpts\scheduler.bin https://huggingface.co/spaces/flamehaze1115/Wonder3D-demo/resolve/main/ckpts/scheduler.bin
mkdir sam_pt
curl -L -o sam_pt\sam_vit_h_4b8939.pth https://huggingface.co/spaces/flamehaze1115/Wonder3D-demo/resolve/main/sam_pt/sam_vit_h_4b8939.pth

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118

pip install einops omegaconf pytorch-lightning==1.9.5 torch_efficient_distloss nerfacc==0.3.3 PyMCubes trimesh fire diffusers==0.19.3 transformers bitsandbytes accelerate gradio rembg segment_anything chardet streamlit tensorboard tensorboardX

"%ProgramFiles%\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars32.bat" x64
pip install ninja git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

# pip install https://huggingface.co/r4ziel/xformers_pre_built/resolve/main/triton-2.0.0-cp310-cp310-win_amd64.whl
# pull/34
# python gradio_app.py

accelerate launch --config_file 1gpu.yaml test_mvdiffusion_seq.py --config configs/mvdiffusion-joint-ortho-6views.yaml
cd instant-nsr-pl
python launch.py --config configs/neuralangelo-ortho-wmask.yaml --gpu 0 --train dataset.root_dir=path\to\Wonder3D\outputs\cropsize-192-cfg3.0 dataset.scene=owl
Full error output:
(venv) C:\Wonder3D\instant-nsr-pl>python launch.py --config configs/neuralangelo-ortho-wmask.yaml --gpu 0 --train dataset.root_dir=C:\Wonder3D\outputs\cropsize-192-cfg3.0 dataset.scene=owl
Global seed set to 42
C:\Wonder3D\venv\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Using finite difference to compute gradients with eps=progressive
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
C:\Wonder3D\outputs\cropsize-192-cfg3.0\owl
(1024, 1024, 3)
the loaded normals are defined in the system of front view
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type             | Params
-------------------------------------------
0 | cos   | CosineSimilarity | 0
1 | model | NeuSModel        | 7.7 M
-------------------------------------------
7.7 M     Trainable params
0         Non-trainable params
7.7 M     Total params
15.371    Total estimated model params size (MB)
Epoch 0: : 0it [00:00, ?it/s]Update finite_difference_eps to 0.027204705103003882
Traceback (most recent call last):
  File "C:\Wonder3D\instant-nsr-pl\launch.py", line 125, in <module>
    main()
  File "C:\Wonder3D\instant-nsr-pl\launch.py", line 114, in main
    trainer.fit(system, datamodule=dm)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\trainer\call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in _run
    results = self._run_stage()
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1191, in _run_stage
    self._run_train()
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 213, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 202, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 249, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 370, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\core\module.py", line 1754, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\core\optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\plugins\precision\native_amp.py", line 75, in optimizer_step
    closure_result = closure()
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 149, in __call__
    self._result = self.closure(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 135, in closure
    step_output = self._step_fn()
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 419, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\strategies\dp.py", line 134, in training_step
    return self.model(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\torch\nn\parallel\data_parallel.py", line 183, in forward
    return self.module(*inputs[0], **module_kwargs[0])
  File "C:\Wonder3D\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\overrides\data_parallel.py", line 77, in forward
    output = super().forward(*inputs, **kwargs)
  File "C:\Wonder3D\venv\lib\site-packages\pytorch_lightning\overrides\base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "C:\Wonder3D\instant-nsr-pl\systems\neus_ortho.py", line 139, in training_step
    train_num_rays = int(self.train_num_rays * (self.train_num_samples / out['num_samples_full'].sum().item()))
ZeroDivisionError: division by zero
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]


kotaxyz commented Nov 2, 2023

I get the same error. May I ask what GPU you use?

Katehuuh (Author) commented Nov 2, 2023

I get the same error. May I ask what GPU you use?

4090

kotaxyz commented Nov 2, 2023

Cool, so it's not related to out-of-memory issues. I was worried about that because I have an RTX 3060 Ti.

kotaxyz commented Nov 2, 2023

Until the dev fixes it, you can use this Colab version:
https://colab.research.google.com/github/camenduru/Wonder3D-colab/blob/main/Wonder3D_mesh_colab.ipynb

xxlong0 (Owner) commented Nov 2, 2023

Hello. You may try the NeuS-based reconstruction if you run into problems with instant-nsr-pl.

xxlong0 (Owner) commented Nov 2, 2023

Hello, I could be wrong about the config, but the last command (4. Mesh Extraction) gives me this error:

...

..Wonder3D\instant-nsr-pl\systems\neus_ortho.py", line 139, in training_step
    train_num_rays = int(self.train_num_rays * (self.train_num_samples / out['num_samples_full'].sum().item()))
ZeroDivisionError: division by zero
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Could you please give more info about any edits you made?

Katehuuh (Author) commented Nov 3, 2023

Hello. You may try the NeuS-based reconstruction if you run into problems with instant-nsr-pl.

cd ./NeuS
python exp_runner.py --mode train --conf ./confs/wmask.conf --case owl --data_dir C:\path\to\Wonder3D\outputs\cropsize-192-cfg3.0

NeuS gives me pyparsing.exceptions.ParseSyntaxException: , found '=' (at char 1049), (line:60, col:14).

I hard-coded the config; it works, but I had to modify:
        self.conf = ConfigFactory.from_dict({
            "general": {
                "base_exp_dir": "./exp/neus/owl/",
                "recording": ["./", "./models"]
            },
            "dataset": {
                "data_dir": "./outputs/",
                "object_name": "owl",
                "object_viewidx": 1,
                "imSize": [256, 256],
                "load_color": True,
                "stage": "coarse",
                "mtype": "mlp",
                "normal_system": "front",
                "num_views": 6
            },
            "train": {
                "learning_rate": 5e-4,
                "learning_rate_alpha": 0.05,
                "end_iter": 1000,
                "batch_size": 512,
                "validate_resolution_level": 1,
                "warm_up_end": 500,
                "anneal_end": 0,
                "use_white_bkgd": True,
                "save_freq": 5000,
                "val_freq": 5000,
                "val_mesh_freq": 5000,
                "report_freq": 100,
                "color_weight": 1.0,
                "igr_weight": 0.1,
                "mask_weight": 1.0,
                "normal_weight": 1.0,
                "sparse_weight": 0.1
            },
            "model": {
                "nerf":{
                    "D" : 8,
                    "d_in" : 4,
                    "d_in_view" : 3,
                    "W" : 256,
                    "multires" : 10,
                    "multires_view" : 4,
                    "output_ch" : 4,
                    "skips":[4],
                    "use_viewdirs" : True
                },
                'sdf_network': {
                    'd_out':257, 
                    'd_in':3, 
                    'd_hidden':256, 
                    'n_layers':8, 
                    'skip_in':[4], 
                    'multires':6, 
                    'bias':0.5, 
                    'scale':1.0, 
                    'geometric_init':True, 
                    'weight_norm':True
                },
                'variance_network': {
                    'init_val':0.3
                },
                'rendering_network': {
                    'd_feature':256, 
                    'mode':'no_view_dir', 
                    'd_in':6, 
                    'd_out':3, 
                    'd_hidden':256, 
                    'n_layers':4, 
                    'weight_norm':True, 
                    'multires_view':0, 
                    'squeeze_out':True
                },
                'neus_renderer': {
                    'n_samples':64, 
                    'n_importance':64, 
                    'n_outside':0, 
                    'up_sample_steps':4, 
                    'perturb':1.0, 
                    'sdf_decay_param':100
                }
            }
        })
num_workers=os.cpu_count() // 2

I did not make any other edits to the config; the commands in my logs above are the only steps I took.
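
Presumably the hard-coded dict above replaces the pyhocon parse of ./confs/wmask.conf inside NeuS's exp_runner.py, which is where the ParseSyntaxException is raised. A minimal sketch of the substitution, assuming the stock NeuS Runner.__init__ layout (Wonder3D's copy may differ):

    from pyhocon import ConfigFactory

    # Abbreviated stand-in for the full dict shown above.
    CONF_DICT = {"general": {"base_exp_dir": "./exp/neus/owl/"}}

    # Original approach (fails while parsing the HOCON file):
    #   conf_text = open(conf_path).read()
    #   conf = ConfigFactory.parse_string(conf_text)
    conf = ConfigFactory.from_dict(CONF_DICT)
    print(conf.get_string('general.base_exp_dir'))  # ./exp/neus/owl/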

xxlong0 (Owner) commented Nov 3, 2023

C:\path\to\Wonder3D\outputs\cropsize-192-cfg3.0

Hello, I don't run into this problem on my side. Maybe you can check whether your Windows path format is correct.

Katehuuh closed this as completed Nov 3, 2023
luopeiyu commented Nov 6, 2023

I have met the same problem and finally fixed it.
The problem is an incompatibility with pytorch_lightning on Windows.
I went to the original instant-nsr-pl GitHub repository and found some issues describing a similar problem; the author has published a branch named "fix-data-win" for them, so I compared the changes from:

bennyguo/instant-nsr-pl@main...fix-data-win

I then removed all ".to(self.rank)" and "device=self.dataset.all_images.device" from the code, and added ".to(self.device)" to the data that needs to be sent to the GPU.

And finally, it works.
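
For reference, a hypothetical helper illustrating the kind of change described above (the real diff is in the compare link and touches the dataset and system files; this is only a sketch):

    import torch

    def to_module_device(batch: dict, device: torch.device) -> dict:
        """Move every tensor in a batch onto the training module's device.

        Illustrative only: the idea is that on Windows the dataset hands out
        CPU tensors (no .to(self.rank) or dataset-device pinning) and the
        LightningModule moves them with .to(self.device).
        """
        return {k: v.to(device) if torch.is_tensor(v) else v for k, v in batch.items()}

    # Usage inside a LightningModule hook (sketch):
    #   batch = to_module_device(batch, self.device)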

xxlong0 (Owner) commented Nov 6, 2023

I have met the same problem and finally fixed it. The problem is an incompatibility with pytorch_lightning on Windows. I went to the original instant-nsr-pl GitHub repository and found some issues describing a similar problem; the author has published a branch named "fix-data-win" for them, so I compared the changes from:

bennyguo/instant-nsr-pl@main...fix-data-win

I then removed all ".to(self.rank)" and "device=self.dataset.all_images.device" from the code, and added ".to(self.device)" to the data that needs to be sent to the GPU.

And finally, it works.

@luopeiyu Thanks very much for your information; we will check and try to update instant-nsr-pl.

472756921 commented

I have met the same problem and finally fixed it. The problem is an incompatibility with pytorch_lightning on Windows. I went to the original instant-nsr-pl GitHub repository and found some issues describing a similar problem; the author has published a branch named "fix-data-win" for them, so I compared the changes from:

bennyguo/instant-nsr-pl@main...fix-data-win

I then removed all ".to(self.rank)" and "device=self.dataset.all_images.device" from the code, and added ".to(self.device)" to the data that needs to be sent to the GPU.

And finally, it works.

Do you have a detailed example?
I tried it, but it didn't work...
