
Distributed training compatibility issue in ignite 0.4.2 #1307

@Nic-Ma

Description


❓ Questions/Help/Support

Hi @vfdev-5 ,

I am trying to upgrade ignite to v0.4.2 in MONAI, but I get an error when I run this MONAI test program:
https://github.com/Project-MONAI/MONAI/blob/master/tests/test_handler_rocauc_dist.py
I used 2 GPUs on 1 node, and it passed with ignite v0.3.0 before.
Here is the error log:

root@apt-sh-ai:/workspace/data/medical/MONAI# python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="10.23.137.29" --master_port=1234 tests/test_handler_rocauc_dist.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
(Both ranks print the same traceback, interleaved; shown once here.)
Traceback (most recent call last):
  File "tests/test_handler_rocauc_dist.py", line 48, in <module>
    main()
  File "tests/test_handler_rocauc_dist.py", line 23, in main
    auc_metric = ROCAUC(to_onehot_y=True, softmax=True)
  File "/workspace/data/medical/MONAI/monai/handlers/roc_auc.py", line 66, in __init__
    super().__init__(output_transform, device=device)
  File "/opt/conda/lib/python3.6/site-packages/ignite/metrics/metric.py", line 200, in __init__
    if idist.get_world_size() > 1:
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/utils.py", line 133, in get_world_size
    sync(temporary=True)
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/utils.py", line 64, in sync
    model = comp_model_cls.create_from_context()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 48, in create_from_context
    return _NativeDistModel()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 64, in __init__
    self._init_from_context()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 97, in _init_from_context
    self._setup_attrs()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/base.py", line 26, in _setup_attrs
    self._nproc_per_node = self._compute_nproc_per_node() if self.get_world_size() > 1 else 1
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 101, in _compute_nproc_per_node
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 938, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:558, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'tests/test_handler_rocauc_dist.py', '--local_rank=1']' returned non-zero exit status 1.

Is something wrong with my NCCL version, or with ignite v0.4.2?

Thanks.
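In case it helps to reproduce: the traceback shows that constructing the metric calls `idist.get_world_size()`, which runs an NCCL `all_reduce` internally. I wonder whether each process needs to be pinned to its own GPU before the metric is created, since two ranks sharing one CUDA device is a common cause of "invalid usage" NCCL errors. A minimal sketch of the setup order I would try (`device_for_rank` is just a helper I made up for illustration; `--local_rank` is supplied by `torch.distributed.launch`):

```python
import argparse


def device_for_rank(local_rank: int) -> str:
    """Map a local rank to its CUDA device string (hypothetical helper)."""
    return f"cuda:{local_rank}"


def main():
    # torch is imported here so the helper above can be used without it.
    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Pin this process to its own GPU *before* any collective op runs,
    # so the later all_reduce inside ignite does not see two ranks on
    # the same device.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # Only now construct the metric, whose __init__ calls
    # idist.get_world_size() and can trigger an all_reduce:
    # auc_metric = ROCAUC(to_onehot_y=True, softmax=True)


# Launch as before, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=2 this_script.py
```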
