[DTensor] make test_dtensor_ops report dtensor_args #152045
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152045
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit c5607f7 with merge base 56e67ba.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Before: Does not report DTensor args, and you can't tell which combination of sharding/replication is used for that particular iteration

```
RuntimeError: failed to run: torch.flatten, with (*[tensor([[[-6.1074e-01,  1.1260e+00,  1.7686e+00, -7.8216e+
          [ 8.8558e-01, -3.0949e+00, -5.4584e+00, -8.5322e+00],
          [-2.9770e-01, -3.2814e+00, -7.5875e+00, -8.1269e+00],
          [-6.0136e+00, -5.1712e+00, -4.2667e+00, -4.2142e+00]],

         [[-7.5171e+00,  5.3900e+00, -7.9208e+00,  6.1000e+00],
          [-1.7350e+00, -3.6188e-03, -7.1592e+00,  9.2951e-02],
          [ 5.7143e+00, -3.0805e+00,  7.6227e+00, -7.4862e+00],
          [ 4.3167e-01, -4.9678e+00, -1.2441e+00, -2.3042e+00]],

         [[-7.4280e+00, -2.7754e+00, -5.2989e+00, -6.1920e+00],
          [-2.5225e+00, -5.2520e+00,  6.5686e+00, -6.0350e+00],
          [-5.1740e+00, -1.6405e+00, -4.4463e+00, -5.1884e+00],
          [ 3.9581e+00, -6.3151e-01, -3.3223e+00,  4.0546e+00]],

         [[-2.8112e+00,  3.8742e+00, -4.4612e+00, -5.0016e+00],
          [ 7.0568e+00, -2.0951e-01, -8.0049e+00, -4.1438e+00],
          [ 3.1207e+00, -7.6518e+00,  7.1084e+00, -1.0500e+00],
          [ 8.8823e+00, -1.1178e+00,  4.8485e+00, -8.8593e+00]]], requires_grad=True)], **{})
```

After: You can see the particular DTensor spec that failed

```
RuntimeError: failed to run: torch.flatten, with (*[DTensor(local_tensor=tensor([[[-6.0136, -5.1712, -4.2667,
        [[ 0.4317, -4.9678, -1.2441, -2.3042]],

        [[ 3.9581, -0.6315, -3.3223,  4.0546]],

        [[ 8.8823, -1.1178,  4.8485, -8.8593]]], requires_grad=True), device_mesh=DeviceMesh('cpu', [0, 1, 2, 3]), placements=(Shard(dim=1),))], **{})
```

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k

[ghstack-poisoned]
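For context, here is a minimal sketch of the kind of wrapper that produces the error message above, with the DTensor arguments (whose repr includes `device_mesh` and `placements`) surfaced instead of the reconstructed plain-tensor args. The helper name `run_with_dtensor_report` and its signature are illustrative assumptions, not the actual code in `test_dtensor_ops.py`.

```python
# Hypothetical sketch; not the actual test_dtensor_ops.py implementation.
def run_with_dtensor_report(op, dtensor_args, dtensor_kwargs):
    """Run `op` on DTensor inputs; on failure, embed the DTensor args in the
    error so the failing sharding/replication combination is visible."""
    try:
        return op(*dtensor_args, **dtensor_kwargs)
    except Exception as exc:
        # repr(DTensor) includes local_tensor, device_mesh, and placements,
        # which is exactly what identifies the failing iteration.
        raise RuntimeError(
            f"failed to run: {op}, with (*{list(dtensor_args)}, **{dtensor_kwargs})"
        ) from exc
```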
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Starting merge as part of PR stack under #149764
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Starting merge as part of PR stack under #149764
Adds explicit error checking during sharding propagation for view ops, rather than relying on runtime errors during local op execution.

Before: An error is thrown by the aten.view op called by DTensor dispatch, because the local shard size is incompatible with the (incorrectly calculated) args to the view op.

`RuntimeError: shape '[384]' is invalid for input of size 512`

After: We raise more specific errors for incompatible view operations during sharding propagation, before getting to runtime dispatch.

`RuntimeError: Attempted to flatten an unevenly sharded dimension, which would require resharding the input. Please explicitly redistribute the tensor instead.`

Change Summary:
- add a 'strict_view' kwarg to the helper methods that implement view/reshape op shard prop rules, so it can be decided op-by-op whether to raise these new errors
- enable errors just for the 'view' op in this PR
- add two specific checks/errors that can occur during view ops (a rough sketch follows below)

Details:
- View ops are never allowed to flatten a dimension that is unevenly sharded, since that would likely change the size/content of the local_tensor and require a redistribute
- View ops are also never allowed to flatten two dims if the rightmost dim is a Shard() placement, because it would cause contiguity errors without redistribution

Notes:
- Disables support for several ops in the test_dtensor_ops.py test, which decompose to an illegal view that only works by performing a redistribution: cartesian_prod, flatten, ravel, reshape, reshape_as, view, view_as, take_along_dim, kron

Follow Ups:
- triage other view-like ops (besides aten::view) for using strict_view
- look for other gaps where view-like ops could still perform redistribution (ban them all, and document this)

Fixes #143372

Pull Request resolved: #149764
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #152045
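To make the two Details rules above concrete, here is a rough sketch of the kind of check that sharding propagation could run before dispatching a view op. The function name `check_strict_flatten`, its parameters, and the even-divisibility test are illustrative assumptions; only the error wording is taken from the description above, and the real logic lives in DTensor's view-op shard prop helpers.

```python
from torch.distributed.tensor import Shard


def check_strict_flatten(placements, dim_sizes, flat_dims, mesh_sizes, strict_view=True):
    """Illustrative check: reject a flatten that would require a redistribute.

    placements: one placement per mesh dim; dim_sizes: global tensor dim sizes;
    flat_dims: tensor dims being flattened together; mesh_sizes: mesh dim sizes.
    """
    if not strict_view:
        return
    for mesh_dim, placement in enumerate(placements):
        if not isinstance(placement, Shard) or placement.dim not in flat_dims:
            continue
        # Rule 1: flattening an unevenly sharded dim would change the local
        # shard's size/content, i.e. it would need a redistribute first.
        if dim_sizes[placement.dim] % mesh_sizes[mesh_dim] != 0:
            raise RuntimeError(
                "Attempted to flatten an unevenly sharded dimension, which would "
                "require resharding the input. Please explicitly redistribute the "
                "tensor instead."
            )
        # Rule 2: a Shard placement on the rightmost flattened dim would make the
        # flattened local tensor non-contiguous without a redistribute.
        if len(flat_dims) > 1 and placement.dim == flat_dims[-1]:
            raise RuntimeError(
                "Attempted to flatten dims where the rightmost dim is sharded, "
                "which would require resharding the input."
            )
```

In practice, hitting one of these errors means calling `redistribute` explicitly (for example, to `Replicate()` on the affected mesh dim) before the view, rather than having the op reshard implicitly.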
Stack from ghstack (oldest at bottom):
Before:
Does not report DTensor args, and you can't tell which combination of
sharding/replication is used for that particular iteration
After:
You can see the particular DTensor spec that failed
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k