❓ Questions/Help/Support
Hi @vfdev-5,
I am trying to upgrade ignite to v0.4.2 in MONAI and got an error when I ran this MONAI test program:
https://github.com/Project-MONAI/MONAI/blob/master/tests/test_handler_rocauc_dist.py
I used 2 GPUs on 1 node, and the test passed with ignite v0.3.0 before.
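For context, here is a minimal sketch of the per-process settings a script receives when it is started through `torch.distributed.launch`: the launcher appends `--local_rank=<n>` to each child command line and exports `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT`. The helper name `read_launch_context` and the fallback defaults are hypothetical and used only for illustration; this is not the code of the MONAI test.

```python
import argparse
import os


def read_launch_context(argv=None, env=None):
    """Collect the per-process settings provided by torch.distributed.launch.

    Hypothetical helper for illustration; not part of MONAI or ignite.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # The launcher appends --local_rank=<n> to every child command line.
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores any other arguments the script may receive.
    args, _ = parser.parse_known_args(argv)
    return {
        "local_rank": args.local_rank,
        # Exported by the launcher for env:// initialization.
        # The fallback values below are assumptions for single-process runs.
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": env.get("MASTER_PORT", "29500"),
    }
```

With 2 processes on 1 node, the second process would see `local_rank == 1` and `world_size == 2`. Scripts of this kind commonly pass `local_rank` to `torch.cuda.set_device` so that each process uses its own GPU; whether that is related to the error here is not established.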
Here is the error log:
root@apt-sh-ai:/workspace/data/medical/MONAI# python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="10.23.137.29" --master_port=1234 tests/test_handler_rocauc_dist.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Both processes print the same traceback:
Traceback (most recent call last):
File "tests/test_handler_rocauc_dist.py", line 48, in <module>
main()
File "tests/test_handler_rocauc_dist.py", line 23, in main
auc_metric = ROCAUC(to_onehot_y=True, softmax=True)
File "/workspace/data/medical/MONAI/monai/handlers/roc_auc.py", line 66, in __init__
super().__init__(output_transform, device=device)
File "/opt/conda/lib/python3.6/site-packages/ignite/metrics/metric.py", line 200, in __init__
if idist.get_world_size() > 1:
File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/utils.py", line 133, in get_world_size
sync(temporary=True)
File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/utils.py", line 64, in sync
model = comp_model_cls.create_from_context()
File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 48, in create_from_context
return _NativeDistModel()
File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 64, in __init__
self._init_from_context()
File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 97, in _init_from_context
self._setup_attrs()
File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/base.py", line 26, in _setup_attrs
self._nproc_per_node = self._compute_nproc_per_node() if self.get_world_size() > 1 else 1
File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 101, in _compute_nproc_per_node
dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 938, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:558, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'tests/test_handler_rocauc_dist.py', '--local_rank=1']' returned non-zero exit status 1.
Is something wrong with my NCCL version and ignite v0.4.2?
Thanks.