
Conversation

wconstab
Contributor

@wconstab wconstab commented Apr 23, 2025

Stack from ghstack (oldest at bottom):

Before:
The error does not report DTensor args, so you can't tell which combination of sharding/replication was used for that particular iteration:

```
RuntimeError: failed to run: torch.flatten, with (*[tensor([[[-6.1074e-01,  1.1260e+00,  1.7686e+00, -7.8216e+
         [ 8.8558e-01, -3.0949e+00, -5.4584e+00, -8.5322e+00],
         [-2.9770e-01, -3.2814e+00, -7.5875e+00, -8.1269e+00],
         [-6.0136e+00, -5.1712e+00, -4.2667e+00, -4.2142e+00]],
        [[-7.5171e+00,  5.3900e+00, -7.9208e+00,  6.1000e+00],
         [-1.7350e+00, -3.6188e-03, -7.1592e+00,  9.2951e-02],
         [ 5.7143e+00, -3.0805e+00,  7.6227e+00, -7.4862e+00],
         [ 4.3167e-01, -4.9678e+00, -1.2441e+00, -2.3042e+00]],
        [[-7.4280e+00, -2.7754e+00, -5.2989e+00, -6.1920e+00],
         [-2.5225e+00, -5.2520e+00,  6.5686e+00, -6.0350e+00],
         [-5.1740e+00, -1.6405e+00, -4.4463e+00, -5.1884e+00],
         [ 3.9581e+00, -6.3151e-01, -3.3223e+00,  4.0546e+00]],
        [[-2.8112e+00,  3.8742e+00, -4.4612e+00, -5.0016e+00],
         [ 7.0568e+00, -2.0951e-01, -8.0049e+00, -4.1438e+00],
         [ 3.1207e+00, -7.6518e+00,  7.1084e+00, -1.0500e+00],
         [ 8.8823e+00, -1.1178e+00,  4.8485e+00, -8.8593e+00]]],
       requires_grad=True)], **{})
```

After:
You can see the particular DTensor spec (device_mesh and placements) that failed:

```
RuntimeError: failed to run: torch.flatten, with (*[DTensor(local_tensor=tensor([[[-6.0136, -5.1712, -4.2667,
        [[ 0.4317, -4.9678, -1.2441, -2.3042]],
        [[ 3.9581, -0.6315, -3.3223,  4.0546]],
        [[ 8.8823, -1.1178,  4.8485, -8.8593]]], requires_grad=True),
        device_mesh=DeviceMesh('cpu', [0, 1, 2, 3]), placements=(Shard(dim=1),))], **{})
```
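
For illustration, a minimal sketch of the idea behind the change: re-raise failures from the op test loop with the original args (including any DTensor args) embedded in the message, since repr() of a DTensor already carries the local tensor, device_mesh, and placements that identify the failing sharding combination. The helper below is hypothetical and only illustrates the shape of the fix, not the actual harness code.

```python
# Hypothetical helper, not the actual code in this PR: re-raise with the
# DTensor args in the message so the failing sharding combination is visible.
def run_and_report(op, args, kwargs):
    try:
        return op(*args, **kwargs)
    except Exception as exc:
        # repr(DTensor) includes local_tensor, device_mesh and placements.
        raise RuntimeError(
            f"failed to run: {op}, with (*{list(args)}, **{kwargs})"
        ) from exc
```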

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k


pytorch-bot bot commented Apr 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152045

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c5607f7 with merge base 56e67ba:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and topic: not user facing labels on Apr 23, 2025
@wconstab wconstab requested a review from XilunWu April 24, 2025 04:27
@wconstab
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label on Apr 24, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #149764

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #149764


pytorchmergebot pushed a commit that referenced this pull request Apr 28, 2025
Adds explicit error checking during sharding propagation for view ops
rather than relying on runtime errors during local op execution.

Before:
An error is thrown by aten.view op called by DTensor dispatch, because
the local shard size is incompatible with the (incorrectly calculated)
args to the view op.

`RuntimeError: shape '[384]' is invalid for input of size 512`

After:
We raise more specific errors for cases of incompatible view operations
during sharding propagation, before getting to runtime dispatch.

`RuntimeError: Attempted to flatten an unevenly sharded dimension, which would require resharding the input. Please explicitly redistribute the tensor instead.`
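
As an illustration of the new behavior, here is a hedged repro sketch of the uneven-shard case. It assumes a 4-rank gloo process group (e.g. launched via torchrun) backing a "cpu" device mesh, and the exact error text may differ from the quote above:

```python
# Illustrative sketch only; run with e.g. `torchrun --nproc_per_node=4 repro.py`
# so that a 4-rank gloo process group backs the "cpu" device mesh.
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (4,))

# dim 0 has size 6, which does not divide evenly across the 4 ranks.
x = distribute_tensor(torch.randn(6, 2, 4), mesh, placements=[Shard(0)])

# Flattening dims 0 and 1 merges the unevenly sharded dim, so with strict_view
# this now fails during sharding propagation instead of inside local aten.view:
#   RuntimeError: Attempted to flatten an unevenly sharded dimension, ...
x.view(12, 4)
```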

Change Summary:

- add a 'strict_view' kwarg to the helper methods that implement
  view/reshape op shard-prop rules, so it can be decided op-by-op whether
  to raise these new errors
- enable errors just for the 'view' op in this PR
- add two specific checks/errors that can occur during view ops

Details:

- View ops are never allowed to flatten a dimension that is unevenly
  sharded, since that would likely change the size/content of the
  local_tensor and require redistribute
- View ops are also never allowed to flatten two dims if the rightmost
  dim is a Shard() placement, because it would cause contiguity errors
  without redistribution (see the sketch after this list)
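
A similar hedged sketch for the second rule above, again assuming a 4-rank gloo process group; the commit message does not quote the exact error text for this case, so only the fact that it now fails during sharding propagation is implied:

```python
# Illustrative sketch only; run under a 4-rank gloo process group (torchrun).
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (4,))

# Evenly sharded on dim 1 (8 elements over 4 ranks), so the uneven-shard
# check from the first rule does not apply here.
y = distribute_tensor(torch.randn(4, 8), mesh, placements=[Shard(1)])

# Flattening dims 0 and 1 with the rightmost dim sharded would break
# contiguity without a redistribute, so strict_view rejects it during
# sharding propagation (exact error message not quoted in this PR).
y.view(32)
```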

Notes:

- Disables support for several ops in the test_dtensor_ops.py test, which
  decompose to an illegal view that would only work by performing a
  redistribution: cartesian_prod, flatten, ravel, reshape, reshape_as, view, view_as, take_along_dim, kron

Follow Ups:
- triage other view-like ops (besides aten::view) for using strict_view
- look for other gaps where view-like ops could still perform
  redistribution (ban them all, and document this)

Fixes #143372

Pull Request resolved: #149764
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #152045
@github-actions github-actions bot deleted the gh/wconstab/412/head branch June 12, 2025 02:22