
Add DistributedDataParallel #1715

Merged
soumith merged 10 commits into master from dist_dp on Jun 13, 2017

Conversation

@apaszke (Contributor) commented on Jun 4, 2017:

import torch
import torch.distributed as dist
import torchvision

dist.init_process_group(backend='gloo')

model = torchvision.models.resnet50().cuda()
model = torch.nn.parallel.DistributedDataParallel(model)  # just prepend "Distributed" to DataParallel

dataset = ...
# so each process sees only a subset of the whole dataset
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
data_loader = torch.utils.data.DataLoader(
        ..., sampler=sampler)

for batch, target in data_loader:
    ...  # optimize model, log, etc.
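A minimal sketch of how each process might initialize the process group before running the snippet above, assuming the environment-variable rendezvous; the address and port values below are placeholders, and none of this is part of the original description:

```python
# Hypothetical per-process setup (not from this PR): launch one copy of the
# script per process and give each copy its own RANK plus a shared WORLD_SIZE.
import os
import torch.distributed as dist

os.environ.setdefault('MASTER_ADDR', '127.0.0.1')  # host running rank 0 (placeholder)
os.environ.setdefault('MASTER_PORT', '29500')      # any free TCP port (placeholder)

dist.init_process_group(
    backend='gloo',
    init_method='env://',
    world_size=int(os.environ['WORLD_SIZE']),  # total number of processes
    rank=int(os.environ['RANK']),              # this process's index in [0, world_size)
)
```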

@apaszke requested a review from @colesbury on June 4, 2017 at 11:04
@@ -476,7 +476,7 @@ THDGroup DataChannelMPI::newGroup(const std::vector<rank_type>& ranks) {
MPI_Group_incl(world_group, int_ranks.size(), int_ranks.data(), &ranks_group);

MPI_Comm new_comm;
MPI_Comm_create_group(MPI_COMM_WORLD, ranks_group, 0, &new_comm);
//MPI_Comm_create_group(MPI_COMM_WORLD, ranks_group, 0, &new_comm);

@apaszke force-pushed the dist_dp branch 2 times, most recently from 4b93573 to cbb3fdd, on June 6, 2017 at 16:16

@colesbury (Member) left a review comment:

The parts I understand look good. A few questions about things that confused me:

@@ -0,0 +1,23 @@
#include "Cuda.hpp"

THCState** _THDCudaState;


int THDGetStreamId(cudaStream_t stream);

#include "Cuda.h"

SYSCHECK(fd = open("/dev/urandom", O_RDONLY));
SYSCHECK(read(fd, &seed, sizeof(seed)));
SYSCHECK(bytes_read = read(fd, &seed, sizeof(seed)));


Example::

>>> net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])

from .sampler import Sampler


class DistributedSampler(Sampler):
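For context on the review thread above, a simplified sketch of the idea behind such a sampler (an illustration only, not the implementation added in this PR): every rank shuffles the same index list with a shared seed and keeps a disjoint, rank-dependent slice of it.

```python
# Illustrative only -- a toy distributed sampler, not the PR's DistributedSampler.
import math
import torch
from torch.utils.data import Sampler

class ToyDistributedSampler(Sampler):
    def __init__(self, dataset, num_replicas, rank):
        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank
        # every rank yields the same number of samples so iteration counts match
        self.num_samples = int(math.ceil(len(dataset) / num_replicas))

    def __iter__(self):
        # same seed on every rank -> identical shuffle everywhere
        g = torch.Generator()
        g.manual_seed(0)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()
        # pad so the list divides evenly, then take every num_replicas-th index
        indices += indices[:self.num_samples * self.num_replicas - len(indices)]
        return iter(indices[self.rank::self.num_replicas])

    def __len__(self):
        return self.num_samples
```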

@apaszke (Contributor, Author) commented on Jun 8, 2017:

Pushed fixes for review comments in a separate commit. I will reset it and squash the changes into appropriate commits when they're accepted.

@apaszke force-pushed the dist_dp branch 4 times, most recently from 51aa47a to b2e3d79, on June 8, 2017 at 18:28
test/run_test.sh Outdated
@@ -20,72 +20,37 @@ fi

pushd "$(dirname "$0")"

echo "Running torch tests"
$PYCMD test_torch.py $@

@soumith merged commit d9d50f8 into master on Jun 13, 2017
@soumith deleted the dist_dp branch on July 20, 2017 at 17:31
@acgtyrant (Contributor) commented:

@apaszke Hi, as far as I understand, DistributedDataParallel uses nccl.reduce to reduce all gradients from all machines onto the machine whose rank is 0 (device to device), and each machine additionally uses comm.reduce to reduce all gradients from its GPUs onto its device 0. After each machine updates the module on its device 0, the module on device 0 will _sync_params the parameters to the other devices in the forward function. Right?

I am surprised that DistributedDataParallel does not broadcast the parameters from one machine to all machines, the way DataParallel broadcasts the parameters from device 0 to all devices. I think this means that the training modules on different machines are not the same.

@soumith (Member) commented on Apr 18, 2018:

@acgtyrant DistributedDataParallel has only one synchronization point: all gradients are all_reduced across all machines, so every machine ends up with the same copy of the gradients. Then every machine does its own optimization step.
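To make that concrete, a hedged sketch of what that single synchronization amounts to (a conceptual stand-in, not the actual DistributedDataParallel internals, which bucket tensors and overlap communication with the backward pass): sum each parameter's gradient across processes and divide by the world size, after which every process runs an identical optimizer step.

```python
import torch.distributed as dist

def average_gradients(model):
    """Conceptually equivalent to DDP's gradient sync: afterwards every
    process holds the same averaged gradients (sketch only)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over all processes
            param.grad /= world_size                           # turn the sum into a mean

# usage sketch: loss.backward(); average_gradients(model); optimizer.step()
```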

jjsjann123 pushed a commit to jjsjann123/pytorch that referenced this pull request May 24, 2022
`found_non_rfactor_reduction` is used to detect the error where all reduction dims are marked as rfactor. However, this code was not finding a non-rfactor reduction, but rather an arbitrary reduction. Fortunately, other parts of our code can detect the same error, so this bug has no real effect. Still, I think we need to fix it.
malfet pushed a commit that referenced this pull request Jun 8, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: csarofeen#1440

Commits that are actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: #78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
facebook-github-bot pushed a commit that referenced this pull request Jun 8, 2022
Pull Request resolved: #78244

Reviewed By: ejguan

Differential Revision: D36678948

Pulled By: davidberard98

fbshipit-source-id: 0ccde965acbd31da67d99c6adb2eaaa888948105