[dtensor] fix dtensor _to_copy op for mix precision #116426

Closed
wants to merge 4 commits

Conversation

pytorch-bot bot commented Dec 26, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116426

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8579064 with merge base ca4df16:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
Contributor

@fduwjj left a comment

LGTM, might want to add a unit test?


register_op_strategy(
    aten._to_copy.default, schema_info=RuntimeSchemaInfo(static_kwargkey=["dtype"])
)(default_strategy)
Contributor

@bdhirsh Jan 2, 2024

Mind if I confirm my understanding of the fix? (just out of interest :) )

It looks like static_kwargkey ends up getting used in the cache-lookup for DTensor's sharding prop: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/op_schema.py#L299

So was the problem that we had a graph with _to_copy() showing up twice, with different dtypes, and we ended up using the cached sharding strategy, when we should have recomputed it for a different dtype?

Contributor Author

Yes! This is correct: we should recompute the sharding for the new dtype in this case, but it was reusing the cached result.
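
To make the failure mode concrete, here is a rough sketch of the scenario (illustrative only; it assumes a process group and device mesh are already set up, and it is not the unit test suggested above):

import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Assumes torch.distributed is already initialized with one rank per GPU.
mesh = DeviceMesh("cuda", list(range(torch.cuda.device_count())))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])

# Two _to_copy calls that differ only in the `dtype` kwarg. Without
# static_kwargkey=["dtype"] in the op's RuntimeSchemaInfo, both calls hash to
# the same sharding-prop cache entry, so the second conversion could reuse
# the output spec computed for bfloat16.
low = dt.to(torch.bfloat16)
high = dt.to(torch.float32)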

pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2024
Context: the existing FSDPExtension has a bug when unflattening a tensor
involves compute/communication on a CUDA stream. The current FSDPExtension
logic unflattens the tensor in the unshard stream, which makes the runtime
lose synchronization with the compute stream; if there are dependencies
between the compute stream and the unflatten-tensor logic, the missing sync
point can possibly lead to NaNs.

This PR makes the FSDPExtension record the compute stream and lets
DTensorExtension use the compute stream directly for unflatten_tensor.

In the long term we might want the FSDP runtime logic to perform only the
unshard in the unshard stream and create the unshard views in the compute
stream. For now we fix this in the Extension directly, as that is the
simplest change that does not affect the FSDP runtime logic.

Pull Request resolved: #116559
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang
ghstack dependencies: #116426
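
A rough sketch of the stream handling described in that commit message (illustrative only; the class and method names below are placeholders, not the actual FSDPExtension/DTensorExtension code):

import torch

class UnflattenExtensionSketch:
    """Illustrative stand-in for the extension behavior described above."""

    def __init__(self) -> None:
        # Record the compute stream once, while running on it, so the
        # unflatten work can later be issued on that same stream.
        self.compute_stream = torch.cuda.current_stream()

    def unflatten_tensor(self, flat: torch.Tensor, shape: torch.Size) -> torch.Tensor:
        # Issue the unflatten compute/communication on the recorded compute
        # stream (rather than the unshard stream), so downstream kernels on
        # the compute stream are ordered after it without an extra sync.
        with torch.cuda.stream(self.compute_stream):
            return flat.view(shape)  # placeholder for the real unflatten logic
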
pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2024
Disable some runtime assertions for now, as they do not work properly with
torch.compile; I'll follow up with a fix in dynamo and re-enable this check.

Pull Request resolved: #116573
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559
pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2024
This PR adds devices to register_backend for the multithreaded process group,
to avoid emitting lots of warnings.

Pull Request resolved: #116678
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559, #116573
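
For context, a minimal sketch of that kind of change (illustrative; the backend name and constructor below are placeholders, not the actual threaded process-group code):

import torch.distributed as dist

def _create_threaded_pg(store, rank, world_size, timeout):
    # Placeholder constructor for a test-only, threads-based process group.
    raise NotImplementedError

# Passing explicit devices tells torch.distributed which device types the
# backend supports, instead of it warning that all device types are assumed.
dist.Backend.register_backend(
    "threaded_sketch",
    _create_threaded_pg,
    devices=["cpu", "cuda"],
)
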
@facebook-github-bot deleted the gh/wanchaol/419/head branch January 6, 2024 15:22
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Feb 12, 2024
atalman pushed a commit that referenced this pull request Feb 14, 2024
fix dtensor _to_copy op for mix precision (#116426)

Co-authored-by: Wanchao Liang <wanchaol@users.noreply.github.com>
resolved: #116426
Labels: ciflow/inductor, Merged, oncall: distributed
Projects: None yet

4 participants