How to run distributed training with ignite in a nvidia-docker container? #593

@songkq

Environment:

  1. nvidia-docker container: Ubuntu 16.04 with 4 2080Ti GPUs
  2. Framework version: PyTorch 1.2
  3. Ignite version: 0.2.0
  4. CUDA version: 10.1
  5. NCCL version: 2.1.4
  6. Python version: 3.6.8
  7. GCC version: 7.4.0

Question:
Hello, I'd like to use ignite for distributed training. According to the mnist_dist.py example, I need to launch two terminals on one machine, with each terminal running one command to start training, as shown in the following screenshot.
[image]

However, the docker container only allows me to launch one terminal. Is there an alternative way to run distributed training in an nvidia-docker container?
For instance, I could run 4 docker containers with one GPU each, i.e. four nodes with one GPU apiece.
Then, on node i, open a terminal and run the example on GPU 0 (process rank i):
python mnist_dist.py --world_size 4 --rank i --gpu 0 --dist_method='tcp://IP_OF_NODE0:FREEPORT'
Can I run distributed training across the four nodes this way? Any advice would be appreciated.
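
Another option that might avoid needing several terminals is to start every rank from one small launcher script inside a single container. The following is only a sketch under assumptions: it reuses the mnist_dist.py flags exactly as quoted above, it assumes rank i should use local GPU i (unlike the one-GPU-per-container setup, where every rank uses GPU 0), and the helper names `build_commands` and `launch` are hypothetical.

```python
import subprocess
import sys


def build_commands(world_size, dist_method, script="mnist_dist.py"):
    """Build one argument list per rank, using the flags quoted in the post."""
    commands = []
    for rank in range(world_size):
        commands.append([
            sys.executable, script,
            "--world_size", str(world_size),
            "--rank", str(rank),
            "--gpu", str(rank),  # assumption: rank i runs on local GPU i
            f"--dist_method={dist_method}",
        ])
    return commands


def launch(world_size, dist_method):
    """Start all ranks as child processes, then wait for each to finish."""
    procs = [subprocess.Popen(cmd) for cmd in build_commands(world_size, dist_method)]
    # Collect exit codes in rank order; 0 means that rank finished cleanly.
    return [proc.wait() for proc in procs]
```

Calling `launch(4, "tcp://127.0.0.1:2222")` from a single terminal would then start four processes, one per GPU. The address and port here are placeholders; any free port on the local machine should work.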
