Description
Environment:
- nvidia-docker container: Ubuntu 16.04 with 4 2080Ti GPUs
- Framework version: PyTorch 1.2
- Ignite version: 0.2.0
- CUDA version: 10.1
- NCCL version: 2.1.4
- Python version: 3.6.8
- GCC version: 7.4.0
Question:
Hello, I'd like to use Ignite for distributed training. According to the mnist_dist.py example, I need to launch two terminals on the same machine and run one training command in each.

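For reference, the two per-terminal commands would look roughly like the following. This is a sketch, not the exact commands from the example: the flag names are assumed to match the ones used later in this question (`--world_size`, `--rank`, `--gpu`, `--dist_method`), and the port is a placeholder.

```shell
# Terminal 1: process of rank 0 on GPU 0 (assumed flags, placeholder port)
python mnist_dist.py --world_size 2 --rank 0 --gpu 0 --dist_method='tcp://127.0.0.1:23456'

# Terminal 2: process of rank 1 on GPU 1, pointing at the same rendezvous address
python mnist_dist.py --world_size 2 --rank 1 --gpu 1 --dist_method='tcp://127.0.0.1:23456'
```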
However, the docker container only allows me to launch one terminal, so I wonder if there is an alternative way to run distributed training in an nvidia-docker container.
For instance, I could run four docker containers with one GPU each, i.e., four nodes with one GPU apiece.
Then, on node i, open a terminal and run the example on GPU 0 (process rank i):
python mnist_dist.py --world_size 4 --rank i --gpu 0 --dist_method='tcp://IP_OF_NODE0:FREEPORT'
Could I run distributed training across the four nodes this way? Any advice would be appreciated.