RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. #16
The full error message reads:

> RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

Adding the argument
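The fix the error message itself suggests can be sketched as follows. This is a minimal single-process illustration, not the project's actual training code: the `TwoHead` toy model is an assumption, chosen so that one `forward` output deliberately does not feed into the loss, which is exactly the situation that triggers this `RuntimeError` unless `find_unused_parameters=True` is passed to `DistributedDataParallel`:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoHead(torch.nn.Module):
    """Toy model whose second head does not contribute to the loss."""

    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 2)
        self.unused = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x), self.unused(x)


# Single-process "gloo" group on CPU, just so DDP can be constructed locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Without find_unused_parameters=True, ignoring the second output below
# would raise "Expected to have finished reduction in the prior iteration".
model = DDP(TwoHead(), find_unused_parameters=True)

for _ in range(2):  # two iterations, to exercise DDP's reduction bookkeeping
    out_used, _ = model(torch.randn(8, 4))  # second output intentionally unused
    loss = out_used.sum()
    model.zero_grad()
    loss.backward()

dist.destroy_process_group()
```

Note that `find_unused_parameters=True` adds overhead (DDP traverses the autograd graph every iteration), so the cleaner long-term fix is usually option (2) from the error text: make every `forward` output participate in the loss.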
In your command, '--nproc_per_node=1' indicates you only have 1 GPU, but you set '--gpus "0,1"', which requires 2 GPUs.
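The mismatch described above can be made concrete with a small check. The `check_launch_args` helper below is hypothetical (it is not part of `torch.distributed.launch`); it only illustrates why the process count must match the number of GPU ids in the `--gpus` list:

```python
# Hypothetical sanity check, not part of torch.distributed.launch itself,
# illustrating why --nproc_per_node must match the number of GPU ids.
def check_launch_args(nproc_per_node: int, gpus: str) -> int:
    """Return the GPU count if consistent with nproc_per_node, else raise."""
    gpu_ids = [g.strip() for g in gpus.split(",") if g.strip()]
    if nproc_per_node != len(gpu_ids):
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} spawns {nproc_per_node} "
            f"process(es), but --gpus '{gpus}' names {len(gpu_ids)} GPU(s)"
        )
    return len(gpu_ids)


check_launch_args(2, "0,1")  # consistent: 2 processes, 2 GPUs
```

With the command from this issue, `check_launch_args(1, "0,1")` would raise, which is exactly the inconsistency being pointed out: either set `--nproc_per_node=2`, or restrict `--gpus` to a single id.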
Did you fix your problem? Can I close this issue now?
Hi, I think I ran into the same problem here.
I'm running in a Docker environment with 4 GPUs, but it does not even work in the single-GPU setting. Please help.
Running it gives the following `RuntimeError`.