
[spmd] complete softmax and _softmax_backward_data to support aggregate on sharding dim #440

Closed
wants to merge 31 commits

Conversation

Contributor

@XilunWu XilunWu commented Sep 7, 2022

  1. Adapt the softmax and _softmax_backward_data ops (added in [Not for landing] Local change to enable TP in ViT model prototyping #382) to the shard propagation rule and move them to tensor_ops.py.
  2. Move the relevant tests to test_tensor_ops.py.
  3. Extend test coverage over the (batch_dim, softmax_dim) combinations, except for the case batch_dim == softmax_dim (see the sketch below).
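
A hedged sketch of what that coverage sweep could look like. The spmd import path, the mesh argument, and the helper name check_softmax_coverage are assumptions; distribute_tensor, Shard, and Replicate are the DTensor primitives used in the tests quoted later in this thread.

```python
import itertools

import torch
import torch.nn.functional as F

# Assumed import path based on the pytorch/tau layout at the time.
from spmd import DeviceMesh, Shard, Replicate, distribute_tensor


def check_softmax_coverage(mesh: DeviceMesh, device_type: str) -> None:
    # Sketch only: sweep every (shard_dim, softmax_dim) pair and skip the
    # unsupported case where the sharding dim equals the softmax dim.
    x = torch.rand(8, 12, 16, device=device_type)
    for shard_dim, softmax_dim in itertools.product(range(x.ndim), [0, 1, 2, -1]):
        if softmax_dim % x.ndim == shard_dim:
            continue  # aggregation on the sharding dim is left to a follow-up
        local_y = F.softmax(x, dim=softmax_dim)
        dist_x = distribute_tensor(x, mesh, [Shard(shard_dim)])
        dist_y = F.softmax(dist_x, dim=softmax_dim)
        full_y = dist_y.redistribute(mesh, [Replicate()]).to_local()
        assert torch.allclose(full_y, local_y)
```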

)
input = torch.rand(8, 12, 16, device=self.device_type)
shard0_spec = Shard(0)
shard1_spec = Shard(1)
Contributor

Curious why we need to test this if we already have it in the list for the dtensor_ops_db test?
nit: shard1_spec and shard2_spec are unused.

Contributor Author

@XilunWu XilunWu Sep 7, 2022

@anj-s Sorry, I'm not familiar with the dtensor_ops_db test, but one example I think is appropriate to refer to is https://github.com/pytorch/tau/blob/6bbe0872aeb8faa8fab73862bae9d3f806ec7836/test/spmd/tensor/test_dtensor_ops.py#L180

bmm is in the list; however, it's still tested in https://github.com/pytorch/tau/blob/6bbe0872aeb8faa8fab73862bae9d3f806ec7836/test/spmd/tensor/test_dtensor_ops.py#L180

Does this resolve your concern?

nit: shard1_spec and shard2_spec are unused.

Thanks for catching it! I plan to add tests for the whole space of (batch dim, softmax dim) combinations, e.g. [0, 1, 2, -1] × [0, 1, 2].

Contributor

Agreed. @XilunWu could you enable this by deleting xfail("softmax") in test_dtensor_ops.py? It will test the forward automatically for you; you can probably change this test to only test the backward.

Contributor

QQ: since we have not supported softmax on the sharding dim yet, will this be a problem for test_dtensor_ops.py?

Contributor Author

I tried running pytest test/spmd/tensor/test_dtensor_ops.py -s -k softmax and the "softmax" test is currently passing. The set of parameters tested is as follows:

Tensor Dim     Softmax Dim      Sharding
0D (scalar)    0                Replicate
1D             0                Replicate
2D             {0, -1}          Replicate
3D             {2}              Replicate

And the current "test_softmax" parameter set is:

Tensor Dim     Softmax Dim      Sharding
3D             {0, 1, 2, -1}    {0, 1, 2, -1}

My question is, since test_softmax in test_tensor_ops.py is not a duplicate of the "softmax" test in test_dtensor_ops.py, should we keep it?


@@ -39,6 +39,7 @@ def no_shard_prop_rule(op_schema: OpSchema) -> OutputSharding:
"aten.is_same_size.default",
"aten.ones_like.default",
"aten.new_empty_strided.default",
"aten._softmax.default",
Contributor

This is essentially a math op, not a tensor op; we should add it to math_ops.py, i.e. just @register_prop_rule in math_ops.py for aten._softmax.default and aten._softmax_backward_data. Note that in the rule you should explicitly check whether the sharding dim is the same as the softmax dim (if it is, we should error out for now).
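
A minimal sketch of such a rule, under the assumption that register_prop_rule, OpSchema, OutputSharding, and DTensorSpec behave as in the snippets quoted later in this thread; the exact OutputSharding constructor arguments are assumptions.

```python
from typing import cast

# Sketch only: register_prop_rule, OpSchema, OutputSharding, and DTensorSpec
# come from the DTensor prop-rule modules in this PR; imports omitted here.
@register_prop_rule("aten._softmax.default")
def softmax_rule(op_schema: OpSchema) -> OutputSharding:
    # ATen schema: _softmax(Tensor self, int dim, bool half_to_float)
    input_spec, softmax_dim, _half_to_float = op_schema.args_schema
    input_spec = cast(DTensorSpec, input_spec)
    softmax_dim = cast(int, softmax_dim)
    dim_map = input_spec.dim_map
    if softmax_dim < len(dim_map) and dim_map[softmax_dim] >= 0:
        # Explicitly reject softmax over the sharding dim for now.
        return OutputSharding(
            output_spec=None,
            failed_reason="softmax over the sharding dim is not supported yet",
        )
    # Softmax is shape-preserving, so the output keeps the input sharding.
    return OutputSharding(output_spec=input_spec)
```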

Contributor

@fduwjj fduwjj left a comment

Thanks for working on this one and sending out the PR so quickly. Left some comments.

@@ -146,7 +146,7 @@ def wrapped(fn):
xfail("_masked.norm"),
xfail("_masked.prod"),
xfail("_masked.softmin"),
xfail("_masked.softmax"),
#xfail("_masked.softmax"),
Contributor

Instead of commenting it out, can we remove it directly?

Contributor Author

Will put this one back since it's the wrong test. skip("softmax") has been removed.

@@ -39,6 +39,7 @@ def no_shard_prop_rule(op_schema: OpSchema) -> OutputSharding:
"aten.is_same_size.default",
"aten.ones_like.default",
"aten.new_empty_strided.default",
"aten._softmax_backward_data.default",
Contributor

Curious, is this enough for it to work? We might want to give it a prop rule, because down the road, if we want to add sharding-dim softmax, we need to call collectives in the backward as well.

Contributor Author

Right. I currently leave it as a default rule and it kind of works for y.backward() but not for y.sum().backward(), as I mentioned earlier. I need to investigate what is missing.


Comment on lines 136 to 141
dist_y_grad = torch.ones_like(dist_y)
# sum().backward() on dist_y has issue:
# dist_y.sum().backward(dist_y_grad)
# RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([8, 12, 16]) and output[0] has a shape of torch.Size([]).
dist_y.backward(dist_y_grad)
self.assertIsNotNone(dist_x.grad)
Contributor

As we discussed, let's just use sum() for now. Thanks!
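
A minimal sketch of the sum()-based backward being suggested, reusing the dist_x/dist_y names from the snippet above:

```python
# Sketch only: sum() yields a scalar loss, so backward() needs no explicit
# grad_output; passing the [8, 12, 16]-shaped grad to a scalar output is what
# raised the "Mismatch in shape" RuntimeError quoted in the snippet.
dist_y.sum().backward()
self.assertIsNotNone(dist_x.grad)
```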

Contributor Author

#440 (comment)
Is there a quick way to check? Running pytest test/spmd/tensor/test_dtensor_ops.py takes a long time. Can I use the -s option with this file, e.g. pytest test/spmd/tensor/test_dtensor_ops.py -s -k test_softmax?

Contributor Author

XilunWu commented Sep 10, 2022

test_softmax_with_bwd in test_tensor_ops.py has a result mismatch in the backward pass when doing softmax on dim=-1 on CPU, but is bug-free on GPU.

How to reproduce: pytest test/spmd/tensor/test_tensor_ops.py -s -k test_softmax_with_bwd

@XilunWu XilunWu marked this pull request as ready for review September 10, 2022 18:53
@XilunWu XilunWu marked this pull request as draft September 12, 2022 16:29
Contributor

fduwjj commented Sep 12, 2022

What kind of difference did we observe between CPU and GPU?

Contributor Author

XilunWu commented Sep 12, 2022

Bug triage: ran _softmax_backward_data on a [4, 4, 4] tensor with (shard_dim = 0, aggregation_dim = 2) and with (shard_dim = 0, aggregation_dim = -1). PiPPy produces the correct result for the first pair of parameters on both CPU and GPU, but a wrong result for the second pair on CPU. Note: this error only happens when aggregation_dim = -1 and the tensor is not replicated (i.e. sharded on dim 0 or 1; otherwise it is auto-replicated by my softmax rule).

Here is the output report:
local_grad=
tensor([[[ 0.0000e+00, -2.5286e-08, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 8.4356e-09, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 1.7677e-08, 0.0000e+00, 0.0000e+00]],

    [[ 0.0000e+00, -3.2625e-08,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  0.0000e+00,  1.0334e-08,  0.0000e+00],
     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  9.0607e-09,  0.0000e+00,  0.0000e+00]],

    [[ 0.0000e+00, -2.3336e-08,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  0.0000e+00,  2.0839e-08,  0.0000e+00],
     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  1.5786e-08,  0.0000e+00,  0.0000e+00]],

    [[ 0.0000e+00, -3.7963e-08,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  0.0000e+00,  1.9997e-08,  0.0000e+00],
     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  1.7081e-08,  0.0000e+00,  0.0000e+00]]])

(Correct gradients from the local computation: small values. Same result on GPU and CPU.)

dist_grad=
tensor([[[ 0.0203, -0.0075, 0.0298, -0.0105],
[-0.0075, 0.0298, -0.0105, 0.0140],
[ 0.0298, -0.0105, 0.0140, -0.0277],
[-0.0105, 0.0140, -0.0277, 0.0214]],

    [[ 0.0203, -0.0101,  0.0208, -0.0153],
     [-0.0101,  0.0208, -0.0153,  0.0259],
     [ 0.0208, -0.0153,  0.0259, -0.0234],
     [-0.0153,  0.0259, -0.0234,  0.0471]],

    [[ 0.0375, -0.0085,  0.0458, -0.0149],
     [-0.0085,  0.0458, -0.0149,  0.0281],
     [ 0.0458, -0.0149,  0.0281, -0.0288],
     [-0.0149,  0.0281, -0.0288,  0.0297]],

    [[ 0.0407, -0.0105,  0.0289, -0.0205],
     [-0.0105,  0.0289, -0.0205,  0.0181],
     [ 0.0289, -0.0205,  0.0181, -0.0265],
     [-0.0205,  0.0181, -0.0265,  0.0301]]])

(Wrong gradients in distributed tensor on CPU)

dist_grad=
tensor([[[ 0.0000e+00, -4.4105e-08, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 1.7012e-08, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 4.1126e-08, 0.0000e+00, 1.4592e-08],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],

    [[ 0.0000e+00, -2.1497e-08,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  1.6550e-08,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  2.6419e-08,  0.0000e+00,  1.8104e-08],
     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]],

    [[ 0.0000e+00, -2.3221e-08,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  8.7823e-09,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  2.7092e-08,  0.0000e+00,  1.4168e-08],
     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]],

    [[ 0.0000e+00, -3.0386e-08,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  1.7260e-08,  0.0000e+00,  0.0000e+00],
     [ 0.0000e+00,  2.4572e-08,  0.0000e+00,  1.2741e-08],
     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]]],
   device='cuda:2')

(Correct gradients in distributed tensor on GPU, from another run)

@register_prop_rule("aten._softmax_backward_data.default")
def softmax_bwd_rule(op_schema: OpSchema) -> OutputSharding:
    input_specs = cast(List[DTensorSpec], op_schema.args_spec)
    ops_dim_map = pytree.tree_map(lambda spec: spec.dim_map, input_specs)
Contributor

I think we'd better not use pytree in sharding rules unless it's absolutely needed. There are only two tensor arguments; we can just access dim_map directly instead of using pytree.
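
A minimal sketch of that direct access, assuming the _softmax_backward_data(grad_output, output, dim, input_dtype) schema quoted below; the variable names are illustrative.

```python
# Sketch only: unwrap the two tensor specs directly instead of mapping over a
# pytree. cast and DTensorSpec come from the same prop-rule module as above.
grad_out_spec, out_spec, softmax_dim, _input_dtype = op_schema.args_schema
grad_out_spec = cast(DTensorSpec, grad_out_spec)
out_spec = cast(DTensorSpec, out_spec)
grad_out_dim_map = grad_out_spec.dim_map
out_dim_map = out_spec.dim_map
```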

def softmax_bwd_rule(op_schema: OpSchema) -> OutputSharding:
    input_specs = cast(List[DTensorSpec], op_schema.args_spec)
    ops_dim_map = pytree.tree_map(lambda spec: spec.dim_map, input_specs)
    softmax_dim = cast(int, op_schema.args_schema[len(op_schema.args_spec)])
Contributor

Could you just unwrap it like

grad_out_spec, out_spec, dim, input_dtype = op_schema.args_schema

since the backward op has this signature:

_softmax_backward_data(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype)

schema_suggestion = None
failed_reason = None
if softmax_dim < len(dim_map) and dim_map[softmax_dim] >= 0:
    # suggest replicating the input tensor
Contributor

nit: let's do this suggestion in a follow-up PR.

dim_map = input_spec.dim_map
softmax_dim = cast(
    int, op_schema.args_schema[len(op_schema.args_spec)]
)  # Is it better to put it into kwargs? e.g. op_schema.kwargs_schema['dim']
Contributor

https://github.com/pytorch/pytorch/blob/1cad744694d7feb7c55e5f4ff4a6ae749686bfb5/aten/src/ATen/native/native_functions.yaml#L4721

softmax only has a positional argument for dim, so let's just keep it as a positional arg.
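
A hedged sketch of reading dim positionally for aten._softmax.default; the variable names are illustrative.

```python
# Sketch only: the ATen schema in native_functions.yaml is
#   _softmax(Tensor self, int dim, bool half_to_float) -> Tensor
# so `dim` is always the second positional argument and can be unpacked
# directly rather than indexed via len(op_schema.args_spec).
input_spec, softmax_dim, _half_to_float = op_schema.args_schema
softmax_dim = cast(int, softmax_dim)
dim_map = cast(DTensorSpec, input_spec).dim_map
```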



@register_prop_rule("aten._softmax_backward_data.default")
def softmax_bwd_rule(op_schema: OpSchema) -> OutputSharding:
Contributor

I know softmax_bwd_rule may be a pointwise rule, but maybe it's better to keep the softmax ops categorized together. Let's put them together in math_ops.py; you can call into pointwise_ops from math_ops.py.

Contributor Author

XilunWu commented Sep 13, 2022

Split #440 into 2 parts:
Part 1: productionization of the original softmax op prototyped in #382 (#455)
Part 2: complete the softmax ops by enabling the case shard_dim == softmax_dim (#440)

@XilunWu XilunWu changed the title [spmd] adapt softmax and _softmax_backward_data to shard prop rule [spmd] complete softmax and _softmax_backward_data to support aggregate on sharding dim Sep 19, 2022
@wanchaol
Contributor

DTensor now lives in pytorch; related PRs need to be submitted to pytorch directly. See #576 for context.

@wanchaol wanchaol closed this Nov 28, 2022