
nn.Module: use swap_tensors for Tensor subclasses (#122755) #123106

Merged
merged 1 commit into release/2.3 on Apr 2, 2024

Conversation

wanchaol
Contributor

@wanchaol wanchaol commented Apr 1, 2024

This fixes a bug when casting a module that has DTensor parameters. The old behavior swapped the `.data` field of the Tensor subclass, which is incorrect for tensor subclasses that may have multiple child tensors.

This uses the `swap_tensors` method to swap all of the tensors, not just the `.data` field.
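
For illustration, a minimal sketch of the difference, with plain tensors standing in for DTensor parameters and assuming a PyTorch build where `torch.utils.swap_tensors` is available:

```python
import torch

# Stand-ins for a module parameter and its converted counterpart
# (e.g. the result of .to(torch.bfloat16)).
old_param = torch.ones(4)
new_param = old_param.to(torch.bfloat16)

# A .data swap would only replace the outer tensor's data field; a wrapper
# subclass with child tensors would be left with its children uncast.
# swap_tensors exchanges the two Python tensor objects wholesale, so a
# wrapper subclass stays internally consistent after the cast.
torch.utils.swap_tensors(old_param, new_param)
print(old_param.dtype)  # torch.bfloat16
```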

Test plan:

pytest test/distributed/_tensor/test_api.py -k 'test_distribute_module_casting'
python test/distributed/fsdp/test_wrap.py -k test_auto_wrap_smoke_test_cuda_init_mode1_cpu_offload0_use_device_id_True

Pull Request resolved: #122755
Approved by: https://github.com/wanchaol, https://github.com/mikaylagawarecki

(cherry picked from commit e6ee832)

Fixes #ISSUE_NUMBER

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

pytorch-bot bot commented Apr 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123106

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a49712f with merge base 86a2d67:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) on Apr 1, 2024
@Skylion007 Skylion007 requested a review from malfet April 1, 2024 18:02
Contributor

@mikaylagawarecki mikaylagawarecki left a comment

Hey, I had initially wanted to gate this behavior behind the flag for 2.3, but if you view this as a critical fix for DTensor/traceable wrapper subclasses, this sounds good to me!

Just want to note again the constraints for this path mentioned here, which might not be present for the regular compute_should_use_set_data path.

Granted, _apply was broken for wrapper subclasses, so this is still an improvement to the state of the world.
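
For concreteness, the 2.3 opt-in flag being referred to would be used roughly like this (a minimal sketch assuming the `torch.__future__` API; the module here is just illustrative):

```python
import torch
import torch.nn as nn

# Opt in to the swap_tensors path for all module conversions
# (assumes torch.__future__.set_swap_module_params_on_conversion exists in 2.3).
torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(4, 4)
m.to(torch.bfloat16)  # parameters are swapped rather than having .data set

print(torch.__future__.get_swap_module_params_on_conversion())  # True
```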

@wanchaol
Contributor Author

wanchaol commented Apr 1, 2024


@mikaylagawarecki Thanks for the context! Yeah, I would indeed say this is a critical fix for subclasses: Module._apply was broken for subclasses like DTensor/Float8Tensor without the swap_tensors feature, so this is at least an improvement for subclasses. Getting this fix into a stable release sooner would be nice since it has been shown to work for traceable subclasses :) Let me know if you have any concerns though!

For the constraints mentioned, these look fine to me! Wondering if you think we should submit a cherry-pick PR for #122800?

@mikaylagawarecki
Contributor

mikaylagawarecki commented Apr 1, 2024

Sounds good! Yeah, we could indeed cherry-pick #122800 so that this feature works with all nn.Modules; I will do this.

@albanD
Collaborator

albanD commented Apr 2, 2024

I'm not sure about this. This is not a bugfix in the sense that this never worked or had any chance to work before. Also, as Mikayla mentioned, there is some risk associated with enabling this; the original plan was to keep it behind a flag for 2.3 to test it out without too much risk.
I would personally prefer to keep the safe approach and not enable this by default.

@wanchaol
Contributor Author

wanchaol commented Apr 2, 2024


I think technically this is a bug for wrapper tensor subclasses. Before this enablement, if a user called module.to to move wrapper-subclass parameters to a different dtype/device, the behavior was always silently wrong. So the swap_tensors change definitely fixes the bug for users of wrapper subclasses (i.e. DTensor, Float8Tensor). Given that the PR only enables the swap_tensors path for wrapper tensor subclasses and not for any other paths, I think the risk is relatively low: it fixes the issue for wrapper tensor subclasses, while the feature can stay behind the flag for all other paths in 2.3.

Wondering if you think this makes sense or not :)
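
To spell out the gating I have in mind, a rough sketch (not the actual diff) of how the per-parameter decision could look, assuming `is_traceable_wrapper_subclass` from `torch.utils._python_dispatch` and the 2.3 `torch.__future__` flag:

```python
import torch
from torch.utils._python_dispatch import is_traceable_wrapper_subclass

def should_use_swap_tensors(param: torch.Tensor) -> bool:
    # Wrapper subclasses (DTensor, Float8Tensor, ...) must be swapped wholesale,
    # since a .data assignment would leave their child tensors uncast.
    if is_traceable_wrapper_subclass(param):
        return True
    # Every other parameter keeps the flag-gated behavior for 2.3.
    return torch.__future__.get_swap_module_params_on_conversion()
```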

@huydhn huydhn merged commit ef38d05 into release/2.3 Apr 2, 2024
97 of 98 checks passed
@github-actions github-actions bot deleted the subclass_swap_tensor_2.3 branch May 3, 2024 01:54