
Introduce FSDPv2 #6122

Merged
alanwaketan merged 5 commits into master from alanwaketan/fsdp_v2 on Dec 15, 2023

Conversation

@alanwaketan (Collaborator) commented on Dec 12, 2023:

Summary:
This patch introduces a PoC of FSDPv2. The full design doc is here: go/fsdp_v2. A real-world use case can be found at: https://github.com/pytorch-tpu/transformers/tree/llama2-spmd-fsdp.

Test Plan:
python test/spmd/test_fsdp_v2.py
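
For readers skimming the thread, here is a minimal end-to-end usage sketch assembled from the test snippets quoted in the review below. The import paths for the wrapper and the SPMD helpers are assumptions on my part (based on the files this PR touches), not a documented API:

```python
import numpy as np
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as spmd
# Assumed import path for the wrapper introduced in this PR.
from torch_xla.experimental.spmd_fully_sharded_data_parallel import (
    SpmdFullyShardedDataParallel as FSDPv2)

xr.use_spmd()  # enable SPMD execution mode

# 2D ('fsdp', 'tensor') mesh over all devices, mirroring the test below.
num_devices = xr.global_runtime_device_count()
mesh = spmd.Mesh(np.arange(num_devices), (num_devices, 1), ('fsdp', 'tensor'))

model = nn.Linear(128, 64).to(xm.xla_device())
# Wrapping shards the parameters along the 'fsdp' axis; by default the module
# output is also marked as sharded (see the shard_output discussion below).
model = FSDPv2(model, mesh)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(16, 128, device=xm.xla_device())
loss = model(x).sum()
loss.backward()
optimizer.step()
xm.mark_step()
```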

alanwaketan self-assigned this on Dec 12, 2023
@jonb377 (Collaborator) left a comment:

Awesome stuff Jiewen! 👏

Comment thread: test/spmd/test_fsdp_v2.py (outdated)

        return output

      def __getattr__(self, name: str) -> Union[torch.Tensor, nn.Module]:
        """Forward missing attributes to wrapped module."""
Collaborator:

This actually forwards all attributes defined on the wrapped module, right? Only attributes missing on the wrapped module will be retrieved from the SPMDFullyShardedDataParallel instance.

Collaborator Author:

I just copied & pasted from the existing wrapper, lol. Will need to revisit it.
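
To make the semantics concrete, here is a standalone illustration of the forwarding pattern being discussed (not the code in this PR): Python only calls `__getattr__` when normal lookup on the wrapper fails, so attributes defined on the wrapper itself win and everything else falls through to the wrapped module.

```python
import torch.nn as nn

class Wrapper(nn.Module):

  def __init__(self, module: nn.Module):
    super().__init__()
    self._orig_module = module  # registered as a submodule by nn.Module

  def __getattr__(self, name):
    try:
      # nn.Module's own lookup (parameters, buffers, submodules) first.
      return super().__getattr__(name)
    except AttributeError:
      # Anything not found on the wrapper falls through to the wrapped module.
      return getattr(self._orig_module, name)

w = Wrapper(nn.Linear(4, 2))
print(w.out_features)  # resolved on the wrapped nn.Linear -> 2
```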

Comment thread: test/spmd/test_fsdp_v2.py (outdated)

      mesh = self._get_mesh((self.n_devices, 1), None, ('fsdp', 'tensor'))
      model.fc1 = fsdp.SpmdFullyShardedDataParallel(model.fc1, mesh)
      model.fc2 = fsdp.SpmdFullyShardedDataParallel(model.fc2, mesh)
      model = fsdp.SpmdFullyShardedDataParallel(model, mesh)
Collaborator:

Do you mind adding some assertions that the sharding is correct here?
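
One hedged way such an assertion could look, assuming the private torch_xla._XLAC._get_xla_sharding_spec helper that the SPMD tests use elsewhere; the exact annotation string depends on the runtime and mesh shape:

```python
# Inside the test, after wrapping model.fc1:
import torch_xla

spec = torch_xla._XLAC._get_xla_sharding_spec(model.fc1.weight)
if self.n_devices > 1:
  # e.g. something like '{devices=[N,1]0,...,N-1}' for a weight sharded
  # along the 'fsdp' axis of an (N, 1) mesh.
  self.assertIn('devices=', spec)
```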

Contributor:

Is there a way to wrap recursively into the module? This is maybe more a question about usage: do you need to wrap each layer and then the module?

Contributor:

OK, so first wrap the submodules, and then wrapping the outer module will take care of the rest.

Collaborator Author:

Right, the auto-wrap will come later.
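
Until auto-wrap lands, here is a hypothetical helper (not part of this PR) sketching one way to wrap bottom-up: selected children first, then the root, matching the manual order used in the test above. The wrapper import path is again an assumption.

```python
import torch.nn as nn
from torch_xla.experimental.spmd_fully_sharded_data_parallel import (
    SpmdFullyShardedDataParallel as FSDPv2)  # assumed import path

def _wrap_children(module: nn.Module, mesh, should_wrap) -> None:
  # Depth-first: handle grandchildren before replacing the child itself.
  for name, child in module.named_children():
    _wrap_children(child, mesh, should_wrap)
    if should_wrap(child):
      setattr(module, name, FSDPv2(child, mesh))

def auto_wrap(module: nn.Module, mesh,
              should_wrap=lambda m: isinstance(m, nn.Linear)) -> nn.Module:
  _wrap_children(module, mesh, should_wrap)
  # Always wrap the root so the final output gets sharded as well.
  return FSDPv2(module, mesh)

# model = auto_wrap(model, mesh)
```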

Comment thread: torch_xla/experimental/spmd_fully_sharded_data_parallel.py

        raise RuntimeError(
            f"The output type is not supported: {type(output)}. Please provide your own shard_output callable.")

      spmd.mark_sharding(real_output, mesh, _prepare_spmd_partition_spec(real_output))
Collaborator:

It looks like we can use SPMDFullyShardedDataParallel to express data-parallel multislice by specifying a mesh like ('dcn', 'fsdp'), but we'd need to combine the two axes when sharding the activations' batch axis.

We could achieve this by specifying a shard_output function, but what do you think of allowing users to override which axes to shard activations along in the default shard_output_impl? e.g. a new constructor parameter for activation_sharding='fsdp', and allow it to be overridden to activation_sharding=('dcn', 'fsdp')

Collaborator Author:

Oops, I missed that.

Collaborator Author:

Let's do this as a follow-up.
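
For reference, a sketch of the interim workaround mentioned above: a custom shard_output that shards the activations' batch dimension across both the 'dcn' and 'fsdp' axes. It assumes the callable receives the module output and the mesh, and that a tuple entry in a partition spec maps multiple mesh axes onto one tensor dimension; the follow-up may well pick a different interface.

```python
import torch_xla.distributed.spmd as spmd

def shard_output_over_dcn_and_fsdp(output, mesh):
  real_output = output[0] if isinstance(output, tuple) else output
  # Batch dim sharded over ('dcn', 'fsdp'); remaining dims replicated.
  partition_spec = (('dcn', 'fsdp'),) + (None,) * (real_output.dim() - 1)
  spmd.mark_sharding(real_output, mesh, partition_spec)

# model = SpmdFullyShardedDataParallel(
#     model, mesh, shard_output=shard_output_over_dcn_and_fsdp)
```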

Comment thread: test/spmd/test_fsdp_v2.py (outdated)

      def test_fsdp_v2(self):
        model = self.SimpleLinear().to(xm.xla_device())
        mesh = self._get_mesh((self.n_devices, 1), None, ('fsdp', 'tensor'))
        model.fc1 = fsdp.SpmdFullyShardedDataParallel(model.fc1, mesh)
Collaborator:

Do we need to shard the individual layers, since the full model is wrapped on L27?

Contributor:

I would also go a step further and test two cases (see the sketch after this list):

  • wrap one of the inner modules and then the outer, and check that all are wrapped; say the two wrappings use different shardings, then make sure the original sharding is unchanged.
  • wrap the outer, then try/except when wrapping the inner.
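
A rough sketch of the first suggested case, reusing the private _get_xla_sharding_spec helper mentioned earlier; the expected behavior (inner sharding left untouched) is the reviewer's suggestion, not something this PR asserts yet:

```python
def test_nested_wrap_keeps_inner_sharding(self):
  model = self.SimpleLinear().to(xm.xla_device())
  inner_mesh = self._get_mesh((self.n_devices, 1), None, ('fsdp', 'tensor'))
  model.fc1 = fsdp.SpmdFullyShardedDataParallel(model.fc1, inner_mesh)
  inner_spec = torch_xla._XLAC._get_xla_sharding_spec(model.fc1.weight)

  # Wrap the outer module with a different sharding layout.
  outer_mesh = self._get_mesh((1, self.n_devices), None, ('fsdp', 'tensor'))
  model = fsdp.SpmdFullyShardedDataParallel(model, outer_mesh)

  # The inner module's original sharding should be unchanged.
  self.assertEqual(
      inner_spec, torch_xla._XLAC._get_xla_sharding_spec(model.fc1.weight))
```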

Collaborator Author:

I hope this use case is well explained in the design doc.

Comment thread: torch_xla/experimental/spmd_fully_sharded_data_parallel.py (outdated)
@yeounoh (Contributor) left a comment:

Added some minor comments, thanks!

@alanwaketan (Collaborator Author):

@jonb377 and @yeounoh, thanks for the quick review. A lot of things will become clearer once the design doc is ready. Will keep you posted.


Comment thread: torch_xla/experimental/spmd_fully_sharded_data_parallel.py

    class SpmdFullyShardedDataParallel(nn.Module):

      def __init__(self, module: nn.Module, mesh: spmd.Mesh, shard_output: Optional[Callable] = None):
Collaborator:

Do we need to take the mesh from the caller? If it is FSDP, would it be possible to just assume we shard along the 0th dimension across all devices when the mesh is not provided?

Collaborator Author:

The thing is that the mesh will need to be shared with the dataloader, and a global mesh is a must for SPMD. I was thinking maybe we could have an API to set a global mesh. @jonb377 @yeounoh
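
For what it's worth, a sketch of what such a global-mesh API could look like; set_global_mesh/get_global_mesh under torch_xla.distributed.spmd is my guess at the naming, since no such API exists at the time of this PR:

```python
import numpy as np
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as spmd

num_devices = xr.global_runtime_device_count()
mesh = spmd.Mesh(np.arange(num_devices), (num_devices, 1), ('fsdp', 'tensor'))
spmd.set_global_mesh(mesh)  # hypothetical setter

# The dataloader's input sharding and the FSDP wrapper could then both
# retrieve the same mesh instead of having it threaded through every call.
assert spmd.get_global_mesh() is mesh  # hypothetical getter
```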

alanwaketan changed the title from [WIP] Introduce FSDPv2 to Introduce FSDPv2 on Dec 15, 2023
@alanwaketan (Collaborator Author):

I have addressed some of the comments and polished the test case a bit. Feel free to re-review it. I will add more test cases once I'm back.

alanwaketan merged commit 4fe9fe7 into master on Dec 15, 2023
@alanwaketan (Collaborator Author):

I'm merging it in order to catch the 2.2 backport cutoff.

golechwierowicz pushed a commit that referenced this pull request on Jan 12, 2024.
alanwaketan mentioned this pull request on Jan 25, 2024.
bhavya01 pushed a commit that referenced this pull request on Apr 22, 2024.