Composable API: replicate and DistributedState
#87649
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/87649.
Note: links to docs will display an error until the docs builds have completed.
⏳ No failures, 2 pending as of commit d10b289. This comment was automatically generated by Dr. CI and updates every 15 minutes.
    from typing import List, Tuple


    class DistributedState:
nit: @mrshenli could we move 'DistributedState' to contract.py?
Sure, I can move it after @mrshenli lands contract.py.
will this be cleaned up?
    def replicate(
        *modules: nn.Module, dist_state: ReplicateState = _default_state
nit: do we need to expose 'dist_state' to users?
It's open for discussion. Maybe we don't need it and can let all modules share the same dist_state.
    def replicate(
        *modules: nn.Module, dist_state: ReplicateState = _default_state
What is this * for? I thought this could just be replicate(module) instead. In other words, do we expect users to pass in a list/tuple of modules? My impression is that a module is a tree-like structure and there is always a root of it.
This allows the user to pass in a single module or multiple modules, for example replicate(m) or replicate(m1, m2). In this case, m1 and m2 can be different modules in the tree.
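A minimal sketch of how such a variadic signature behaves; the function name and body here are illustrative placeholders, not this PR's implementation:

```python
import torch.nn as nn

def replicate_sketch(*modules: nn.Module) -> None:
    # *modules packs the positional arguments into a tuple, so both
    # replicate_sketch(m) and replicate_sketch(m1, m2) are valid calls.
    for module in modules:
        # placeholder: the real API would register each module with a ReplicateState
        print(f"marking {type(module).__name__} for replication")

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
replicate_sketch(model)               # a single module
replicate_sketch(model[0], model[2])  # several modules from the same tree
```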
    class ReplicateState(DistributedState):
        def __init__(self) -> None:
            self.modules: List[nn.Module] = []
            self.parameters: List[nn.Parameter] = []
What's the purpose of this parameters field? Is it all the parameters of the modules, and how do these entries correlate to module.parameters() for the modules in self.modules?
The replicate() API can be called multiple times for different modules in a model. This parameters field keeps a reference to all parameters of those modules so that they can be managed together, for example bucketed for DDP's all-reduce.
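A rough sketch of that idea, with illustrative names only (not the code in this PR):

```python
from typing import List

import torch.nn as nn

class ReplicateStateSketch:
    """Illustrative only: accumulates parameters across multiple replicate() calls."""

    def __init__(self) -> None:
        self.modules: List[nn.Module] = []
        self.parameters: List[nn.Parameter] = []

    def add_modules(self, *modules: nn.Module) -> None:
        for module in modules:
            self.modules.append(module)
            # Keep references to every trainable parameter so they can later be
            # managed together, e.g. bucketed for a DDP-style all-reduce.
            self.parameters.extend(
                p for p in module.parameters() if p.requires_grad
            )

state = ReplicateStateSketch()
state.add_modules(nn.Linear(4, 4))
state.add_modules(nn.Linear(4, 2), nn.Linear(2, 2))
print(len(state.parameters))  # 6: a weight and a bias from each of the three Linears
```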
        def forward_pre_hook(
            self, module: nn.Module, input: Tuple[torch.Tensor]
        ) -> None:
            if not self.has_initialized:
Will this forward_pre_hook be installed on all modules inside self.modules, or does it just look up the module inside self.modules and apply the pre-hook to that module?
Do you mean hooking self.modules only, or recursively hooking all sub-modules of self.modules? My current idea is the former, self.modules only; otherwise it may hurt performance. But this may change as we work out more implementation details.
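A small sketch of the former option, assuming hooks are registered only on the directly marked modules (names are illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn

def install_pre_hooks_sketch(marked_modules):
    # Hooks go only on the directly marked modules, not recursively on every
    # submodule, to keep per-forward hook overhead low.
    def pre_hook(module, inputs):
        # the real hook would lazily initialize the replication state here
        return None

    for module in marked_modules:
        module.register_forward_pre_hook(pre_hook)

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
install_pre_hooks_sketch([model])  # only the root Sequential gets the hook
model(torch.randn(1, 4))           # pre_hook fires once, for the root only
```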
            self.parameters: List[nn.Parameter] = []
            self.has_initialized: bool = False

        def add_modules(self, *modules: nn.Module) -> None:
What's the workflow of add_modules and forward_pre_hook? Should they happen in a specific order? Could you add a test to demonstrate how the workflow should work?
Sure, will do
            self._param_list.extend(
                param for param in module.parameters() if param.requires_grad
            )
Same here: if it is a top-down search, the implementation seems to collect duplicate parameters?
            self._param_list.extend(
                param for param in module.parameters() if param.requires_grad
            )
module.parameters() will include the children's parameters. For a case like the one below, where b and d are marked to be replicated and e is marked to be sharded, what will self._param_list be?

          root
         /  |  \
        a   b   c
           / \
          d   e
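To make the concern concrete, a small self-contained sketch of that tree (the module classes are illustrative), showing that module.parameters() already yields the children's parameters:

```python
import torch.nn as nn

class B(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.d = nn.Linear(2, 2)
        self.e = nn.Linear(2, 2)

class Root(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.a = nn.Linear(2, 2)
        self.b = B()
        self.c = nn.Linear(2, 2)

root = Root()
# b.parameters() includes d's and e's parameters, so extending _param_list from
# both b and d would collect d's parameters twice, and it would also pick up
# e's parameters even though e is meant to be sharded rather than replicated.
print(len(list(root.b.parameters())))    # 4: d.weight, d.bias, e.weight, e.bias
print(len(list(root.b.d.parameters())))  # 2
```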
        def _recursive_collect_params(self, module: nn.Module) -> None:
            if (
                getattr(module, "_distributed_state", None) is not None
It also seems module._distributed_state is not None at this point for modules that are marked as replicated, so this always returns early and never fills in self._param_list?
Updated
            self._param_list.extend(
                param for param in module.parameters() if param.requires_grad
            )
I guess the best way is to write some unit tests to verify the parameters are collected as expected
please add test coverage
        def forward(self, *inputs, **kwargs):
            self.pre_forward(*inputs, **kwargs)
            with torch.autograd.profiler.record_function(
This is not the same as the existing DDP? Previously, record_function wrapped pre_forward and post_forward here as well.
That's true, it's not exactly the same. Since the logic is now split into two other functions, I added record_function("DistributedDataParallel.pre_forward") and record_function("DistributedDataParallel.post_forward") in each of them. Does this make sense?
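Roughly what that split looks like; a sketch only, not the exact code in this PR:

```python
import torch
from torch.autograd.profiler import record_function

def pre_forward() -> None:
    with record_function("DistributedDataParallel.pre_forward"):
        # the real implementation would e.g. rebuild buckets and sync buffers here
        pass

def post_forward(output: torch.Tensor) -> torch.Tensor:
    with record_function("DistributedDataParallel.post_forward"):
        # the real implementation would e.g. prepare gradient hooks for backward here
        return output
```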
It would be great if we could clean up _ddp.py soon to remove the unnecessary features; we usually don't leave duplicated code in PyTorch. I understand the intention is to move fast, but even so, let's only allow that for a short time window.
            )
            return output

        def forward(self, *inputs, **kwargs):
If this _ddp.py file is just for the composable API, do we even need this forward method?
Will clean it up soon
This PR adds the first version of the `replicate()` composable API. For this prototype version, I try to reuse as much code from the existing `DistributedDataParallel` as possible and iterate on it in later changes. The basic idea of this prototype is:
- create a `ReplicateState` object; it internally uses a `ParameterList` module to hold all parameters of the modules marked by the `replicate()` API
- create an internal `_ddp` object, which reuses the existing `DistributedDataParallel` implementation and wraps the `ParameterList` object
- install pre-forward and after-forward hooks on the root module, which call methods of `_ddp` to run initialization and forward
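Based on that description, end-to-end usage of the prototype would look roughly like the sketch below; the import path is an assumption, and a process group is assumed to be initialized already:

```python
import torch
import torch.nn as nn
# assumed import path for the prototype composable APIs
from torch.distributed._composable import mark_root_module, replicate

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
# mark the module for replication, then install the pre-/after-forward hooks
# on the root; the first forward lazily builds the internal _ddp object
replicated = mark_root_module(replicate(model))
out = replicated(torch.randn(4, 8))
out.sum().backward()  # gradients are all-reduced via the reused DDP machinery
```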
    def test_replicate(self):
        dist.init_process_group(
            backend="gloo",
let's test on both gloo and nccl
        local_batch_size = 1
        global_batch_size = self.world_size * local_batch_size
        model, input, target = self._prepare_module(global_batch_size)
        replicate_model = mark_root_module(replicate(deepcopy(model)))
Let's add more test cases:
- replicate one submodule instead of the root module
- replicate more than one submodule
Also add a test case where some submodules of the replicated local root module are annotated with fully_shard().
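A hypothetical sketch of what the submodule test case could look like; the import path, helper name, and assertion are placeholders, not code from this PR:

```python
import torch
import torch.nn as nn
# same assumed import path as in the test file
from torch.distributed._composable import mark_root_module, replicate

def check_replicate_submodules() -> None:
    model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
    # replicate two submodules instead of the root module
    replicate(model[0])
    replicate(model[2])
    replicated = mark_root_module(model)
    out = replicated(torch.randn(2, 4))
    out.sum().backward()
    # the real test would compare the resulting gradients against a DDP baseline
```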
    from typing import List, Tuple


    class DistributedState:
will this be cleaned up?
        for module in modules:
            self.modules.append(module)
            replicate.state(module)._distributed_state = self
            replicate.state(module)._params_collected = False
Wondering how the other states in the DDP constructor are populated?
        for module in self.modules:
            self._recursive_collect_params(module)

        self._ddp = _ddp.DistributedDataParallel(self._param_list)
Oh, all the state in the DDP constructor is owned by self._ddp... I assume we will not keep this for long, since it is still monkey patching; instead, we will make all the state owned by the replicate.state() object?
            self, module: nn.Module, input: Tuple[torch.Tensor]
        ) -> None:
            self.init_helper()
            self._ddp.pre_forward()
Same as above: we need to make pre_forward() accept a replicate.state() object once we get rid of self._ddp?
    @contract
    def replicate(
        module: nn.Module,  # NOTE: contract now supports single module only
        dist_state: ReplicateState = _default_state,
dist_state is used internally; could we remove it from the user-facing API here? Also, replicate needs to take arguments similar to the existing DDP API and work with the important features DDP provides, like static_graph, gradient_as_bucket_view, etc.
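A hypothetical sketch of what such a user-facing signature could look like, mirroring torch.nn.parallel.DistributedDataParallel's constructor arguments; this is a suggestion, not the API in this PR:

```python
import torch.nn as nn

def replicate(
    module: nn.Module,
    *,
    process_group=None,
    broadcast_buffers: bool = True,
    find_unused_parameters: bool = False,
    gradient_as_bucket_view: bool = False,
    static_graph: bool = False,
) -> nn.Module:
    ...  # hypothetical signature only; the body is out of scope here
```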
    def mark_root_module(
        module: nn.Module, dist_state: ReplicateState = _default_state
same as above, not exposing dist_state to users
    >>> module = nn.Linear(3, 3)
    >>> replicate(module)
nit: the example does not reflect mark_root_module
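For instance, the docstring example could be extended along these lines (a suggestion only; it assumes mark_root_module is applied to the same module):

```python
>>> module = nn.Linear(3, 3)
>>> replicate(module)
>>> mark_root_module(module)
>>> # forward/backward on module now synchronize gradients across ranks
```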
Stamping to get this version backed up.
Synced up offline; the following PRs will be sent out:
- remove mark_root_module()
- clean up the replicate() constructor to drop the distState argument and add important arguments such as static_graph, find_unused_parameters, etc.
- clean up self._ddp
- add more tests, e.g., interaction with fully_shard.py
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Pull Request resolved: pytorch#87649. Approved by: https://github.com/zhaojuanmao
Stack from ghstack (oldest at bottom):
- #87649 replicate and DistributedState