[dtensor] fix dtensor _to_copy op for mix precision #116426

Closed
wants to merge 4 commits

Conversation

pytorch-bot bot commented Dec 26, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116426

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8579064 with merge base ca4df16:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
Contributor

@fduwjj left a comment

LGTM, might want to add a unit test?


register_op_strategy(
    aten._to_copy.default, schema_info=RuntimeSchemaInfo(static_kwargkey=["dtype"])
)(default_strategy)
Contributor

@bdhirsh Jan 2, 2024

Mind if I confirm my understanding of the fix? (just out of interest :) )

It looks like static_kwargkey ends up getting used in the cache-lookup for DTensor's sharding prop: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/op_schema.py#L299

So was the problem that we had a graph with _to_copy() showing up twice, with different dtypes, and we ended up using the cached sharding strategy, when we should have recomputed it for a different dtype?

Contributor Author

Yes! This is correct: we should recompute the sharding for the new dtype in this case, but it was reusing the cached result.
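
To make the failure mode concrete, here is a rough sketch of the scenario (illustrative only; it assumes a process group and device mesh are already set up, and it is not the unit test suggested above):

import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Assumes torch.distributed is already initialized with one rank per GPU.
mesh = DeviceMesh("cuda", list(range(torch.cuda.device_count())))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])

# Two _to_copy calls that differ only in the `dtype` kwarg. Without
# static_kwargkey=["dtype"] in the op's RuntimeSchemaInfo, both calls hash to
# the same sharding-prop cache entry, so the second conversion could reuse
# the output spec computed for bfloat16.
low = dt.to(torch.bfloat16)
high = dt.to(torch.float32)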

pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2024
Context: the existing FSDPExtension has a bug when unflattening a tensor
involves compute/communication on a CUDA stream. The current FSDPExtension
logic unflattens the tensor in the unshard stream, which makes the runtime
lose synchronization with the compute stream; if there are dependencies
between the compute stream and the unflatten-tensor logic, the missing sync
point can possibly lead to NaNs.

This PR makes the FSDPExtension record the compute stream and lets
DTensorExtension use the compute stream directly for unflatten_tensor.

In the long term we might want the FSDP runtime logic to perform only the
unshard in the unshard stream and create the unshard views in the compute
stream. For now we fix this in the Extension directly, as that is the
simplest change that does not affect the FSDP runtime logic.

Pull Request resolved: #116559
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang
ghstack dependencies: #116426
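
A rough sketch of the stream handling described in that commit message (illustrative only; the class and method names below are placeholders, not the actual FSDPExtension/DTensorExtension code):

import torch

class UnflattenExtensionSketch:
    """Illustrative stand-in for the extension behavior described above."""

    def __init__(self) -> None:
        # Record the compute stream once, while running on it, so the
        # unflatten work can later be issued on that same stream.
        self.compute_stream = torch.cuda.current_stream()

    def unflatten_tensor(self, flat: torch.Tensor, shape: torch.Size) -> torch.Tensor:
        # Issue the unflatten compute/communication on the recorded compute
        # stream (rather than the unshard stream), so downstream kernels on
        # the compute stream are ordered after it without an extra sync.
        with torch.cuda.stream(self.compute_stream):
            return flat.view(shape)  # placeholder for the real unflatten logic
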
pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2024
Disable some runtime assertions for now, as they do not work properly with
torch.compile; I'll follow up with a fix in dynamo and re-enable this check.

Pull Request resolved: #116573
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559
pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2024
This PR adds devices to register_backend for the multithreaded process group,
to avoid emitting lots of warnings.

Pull Request resolved: #116678
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559, #116573
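
For context, a minimal sketch of that kind of change (illustrative; the backend name and constructor below are placeholders, not the actual threaded process-group code):

import torch.distributed as dist

def _create_threaded_pg(store, rank, world_size, timeout):
    # Placeholder constructor for a test-only, threads-based process group.
    raise NotImplementedError

# Passing explicit devices tells torch.distributed which device types the
# backend supports, instead of it warning that all device types are assumed.
dist.Backend.register_backend(
    "threaded_sketch",
    _create_threaded_pg,
    devices=["cpu", "cuda"],
)
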
@facebook-github-bot deleted the gh/wanchaol/419/head branch January 6, 2024 15:22
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Feb 12, 2024
atalman pushed a commit that referenced this pull request Feb 14, 2024
fix dtensor _to_copy op for mix precision (#116426)

Co-authored-by: Wanchao Liang <wanchaol@users.noreply.github.com>
resolved: #116426
Labels: ciflow/inductor, Merged, oncall: distributed
Projects: None yet

4 participants