
Conversation

@tianyu-l (Contributor) commented Jan 29, 2024

pytorch-bot (bot) commented Jan 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118513

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 80826b4 with merge base 0f7e636:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions bot added the oncall: distributed and ciflow/inductor labels Jan 29, 2024
@tianyu-l requested a review from wanchaol January 29, 2024 08:27
cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 wconstab yf225

[ghstack-poisoned]
tianyu-l added a commit that referenced this pull request Jan 29, 2024
ghstack-source-id: 83c9536
Pull Request resolved: #118513
@tianyu-l added the ciflow/trunk and release notes: distributed (dtensor) labels Jan 29, 2024
@XilunWu (Contributor) commented Jan 29, 2024

The CI report says "test_dtensor_op_db_take_along_dim_cpu_float32 in test_dtensor_ops.py has unexpected success". This means your change has made it work correctly. You can remove "xfail(take_along_dim)" from the file, which will mark this test as a passing test instead of an expected failure.

Another thing you can do in the future is to run pytest test/distributed/_tensor/test_dtensor_ops.py locally to see whether the test results change with your PR.

@XilunWu (Contributor) left a comment

Left 2 suggestions. I'm not very clear on the main body of gather_strategy and the use of _MaskPartial, though.

```python
input_dt = distribute_tensor(global_input, device_mesh, [Replicate()])
index_dt = distribute_tensor(global_index, device_mesh, [Shard(gather_dim)])
global_output = torch.gather(global_input, gather_dim, global_index)
comm_mode = CommDebugMode()
```

I don't think we need to instantiate another comm_mode; we can reuse the one instantiated above. We can instantiate a single comm_mode at the beginning of the test and reuse it everywhere in the test.
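A minimal sketch of that refactor, reusing the names from the snippet above (the loop structure and the way the counts are inspected are illustrative assumptions, not the PR's actual test):

```python
import torch
from torch.distributed._tensor import Replicate, Shard, distribute_tensor
from torch.distributed._tensor.debug import CommDebugMode

comm_mode = CommDebugMode()  # instantiate once at the beginning of the test
for gather_dim in range(global_index.ndim):
    input_dt = distribute_tensor(global_input, device_mesh, [Replicate()])
    index_dt = distribute_tensor(global_index, device_mesh, [Shard(gather_dim)])
    with comm_mode:  # reuse the same instance for every case
        output_dt = torch.gather(input_dt, gather_dim, index_dt)
    # inspect the collectives triggered by this case
    print(comm_mode.get_comm_counts())
```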

Comment on lines +343 to +344
```python
input_shape = input_strategy.output_shape
index_shape = index_strategy.output_shape
```

We also need to check whether input_shape and index_shape are eligible for torch.gather:
https://pytorch.org/docs/stable/generated/torch.gather.html
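For illustration, a hedged sketch of the constraints from the torch.gather docs, written against the input_shape/index_shape names in the snippet above (the error type and wording are assumptions, not the PR's actual check):

```python
def check_gather_shapes(input_shape, index_shape, dim):
    # torch.gather requires input and index to have the same number of dimensions...
    if len(input_shape) != len(index_shape):
        raise RuntimeError("input and index must have the same number of dimensions")
    # ...and index.size(d) <= input.size(d) for every dimension d != dim
    for d, (in_size, idx_size) in enumerate(zip(input_shape, index_shape)):
        if d != dim and idx_size > in_size:
            raise RuntimeError(
                f"index size {idx_size} exceeds input size {in_size} at dim {d}"
            )
```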

@tianyu-l (Author):

If we don't check in DTensor sharding prop, these errors would pop up from the local tensor ops, which still seems to be the expected behavior?

@wanchaol (Collaborator) left a comment

lgtm, some minor suggestions inlined.

```python
# tensor dim can be equal or larger than the mask dim, respectively.
if tensor.ndim == self.mask_buffer.data.ndim:
    tensor[self.mask_buffer.data] = 0.0
else:
```

I think the main reason here is that the embedding output produces an additional dimension compared to the input, hence the output masking logic becomes different? Maybe add more clarification in the comment.
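For illustration, a small standalone sketch of the dimensionality difference being discussed (the shapes are made up, not taken from the PR):

```python
import torch

index = torch.tensor([[1, 3], [0, 2]])                  # a mask built from this has shape (2, 2)
weight = torch.randn(5, 8)
emb_out = torch.nn.functional.embedding(index, weight)  # shape (2, 2, 8): one extra dim vs. the mask
src = torch.randn(2, 4)
gather_out = torch.gather(src, 1, index)                 # shape (2, 2): same ndim as the mask
```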

```python
    tensor[self.mask_buffer.data, :] = 0.0
# NOTE: Depending on the use case (gather op or embedding op),
# tensor dim can be equal or larger than the mask dim, respectively.
if tensor.ndim == self.mask_buffer.data.ndim:
```

Given that we are reusing the logic, let's factor it out as a common method on MaskBuffer.
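A hedged sketch of what that refactor might look like (the method name apply_mask and the exact class layout are assumptions, not the PR's final code; the masking branches mirror the diff excerpts above):

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class MaskBuffer:
    data: Optional[torch.Tensor] = None

    def apply_mask(self, tensor: torch.Tensor) -> None:
        if self.data is None:
            raise RuntimeError("mask buffer has not been materialized")
        # NOTE: Depending on the use case (gather op or embedding op),
        # tensor dim can be equal or larger than the mask dim, respectively.
        if tensor.ndim == self.data.ndim:
            tensor[self.data] = 0.0
        else:
            tensor[self.data, :] = 0.0
```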

@tianyu-l (Author) commented

> The CI report says "test_dtensor_op_db_take_along_dim_cpu_float32 in test_dtensor_ops.py has unexpected success". This means your change has made it work correctly. You can remove "xfail(take_along_dim)" from the file, which will mark this test as a passing test instead of an expected failure.
>
> Another thing you can do in the future is to run pytest test/distributed/_tensor/test_dtensor_ops.py locally to see whether the test results change with your PR.

This PR doesn't touch take_along_dim. Not sure how it could make the test pass... Should I still remove it?

tianyu-l added a commit that referenced this pull request Feb 1, 2024
ghstack-source-id: f9fc955
Pull Request resolved: #118513
@tianyu-l requested a review from XilunWu February 1, 2024 02:17
@XilunWu (Contributor) commented Feb 1, 2024

> > The CI report says "test_dtensor_op_db_take_along_dim_cpu_float32 in test_dtensor_ops.py has unexpected success". This means your change has made it work correctly. You can remove "xfail(take_along_dim)" from the file, which will mark this test as a passing test instead of an expected failure.
> > Another thing you can do in the future is to run pytest test/distributed/_tensor/test_dtensor_ops.py locally to see whether the test results change with your PR.
>
> This PR doesn't touch take_along_dim. Not sure how it could make the test pass... Should I still remove it?

@tianyu-l Check the reference decomposition of take_along_dim below; it lowers to torch.gather, which is why your gather change makes that test pass:

```python
@out_wrapper()
def take_along_dim(
    a: torch.Tensor, indices: torch.Tensor, dim: Optional[int] = None
) -> torch.Tensor:
    torch._check(
        a.ndim == indices.ndim,
        lambda: (
            "torch.take_along_dim(): input and indices should have the same "
            f"number of dimensions, but got {a.ndim} dimensions for input, and "
            f"{indices.ndim} dimensions for indices"
        ),
    )
    torch._check(
        utils.is_integer_dtype(indices.dtype),
        lambda: (
            "torch.take_along_dim(): dtype of indices should be int but got "
            f"{indices.dtype} instead"
        ),
    )
    if dim is None:
        return torch.gather(a.view(-1), 0, indices.view(-1))
    else:
        self_sizes = list(a.shape)
        self_sizes[dim] = indices.size(dim)
        broadcast_shape = utils.infer_size_shapes(self_sizes, indices.size())
        indices_broadcast = broadcast_to(indices, broadcast_shape)
        indices_sizes = list(indices.shape)
        indices_sizes[dim] = a.size(dim)
        broadcast_shape = utils.infer_size_shapes(indices_sizes, a.size())
        self_broadcast = broadcast_to(a, broadcast_shape)
        return torch.gather(self_broadcast, dim, indices_broadcast)
```
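(Not part of the reference above; just an illustrative check that take_along_dim reduces to gather in a simple case, with made-up tensors:)

```python
import torch

a = torch.arange(12).reshape(3, 4)
idx = torch.tensor([[0], [2], [1]])
# take_along_dim broadcasts its arguments and then dispatches to gather
assert torch.equal(torch.take_along_dim(a, idx, dim=1), torch.gather(a, 1, idx))
```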

@XilunWu (Contributor) left a comment

LGTM! Thx for the work!

@tianyu-l (Author) commented Feb 2, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot deleted the gh/tianyu-l/3/head branch February 5, 2024 15:23
