
[dtensor] enable foreach operators for adam optimizer #112108

Closed
wants to merge 4 commits

Conversation

wanchaol
Contributor

@wanchaol wanchaol commented Oct 26, 2023

Stack from ghstack (oldest at bottom):

This PR enables basic foreach ops in DTensor for the Adam optimizer, to improve performance compared to running the optimizer on plain torch.Tensor. Currently the optimizer does not take the foreach path by default for tensor subclasses. We will need to enable it by default for DTensor once all the foreach ops are covered, or enable it early when exploring the new FSDP; in either case we just need to append DTensor to the optimizer's foreach allow list.

Some latency measurements on a 5-layer MLP model:
single-tensor Adam: 17 ms (Screenshot 2023-10-29 at 10 48 22 PM)
foreach multi-tensor Adam: 4 ms (Screenshot 2023-10-29 at 10 50 58 PM)

so around a 4.25x improvement
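
For illustration, here is a minimal sketch of driving the foreach Adam path on DTensor parameters. The model size, mesh setup, and Shard(0) placement are assumptions made for the example, not details taken from this PR:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Assumes a default process group is already initialized (e.g. launched via torchrun).
mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))

# A hypothetical 5-layer MLP standing in for the benchmarked model.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(5)])

# Shard each parameter along dim 0 so the optimizer operates on DTensors.
dtensor_params = [
    distribute_tensor(p.detach(), mesh, [Shard(0)]).requires_grad_()
    for p in model.parameters()
]

# foreach=True requests the multi-tensor (torch._foreach_*) Adam implementation,
# which is the code path this PR makes work for DTensor inputs.
optimizer = torch.optim.Adam(dtensor_params, lr=1e-3, foreach=True)
```

Until DTensor is appended to the optimizer's default foreach allow list as described above, the multi-tensor path has to be requested explicitly with foreach=True as in the sketch.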

cc @awgu

@pytorch-bot

pytorch-bot bot commented Oct 26, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112108

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit e17ac89 with merge base 08dbfec:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wanchaol added a commit that referenced this pull request Oct 26, 2023
ghstack-source-id: bf89d590477fe43dc8f00220ac9713e3c775a442
Pull Request resolved: #112108
@wanchaol wanchaol added the release notes: distributed (dtensor) label Oct 26, 2023
wanchaol added a commit that referenced this pull request Oct 30, 2023
ghstack-source-id: d4e77bc5939eb6955f536c906ef874a51c4f6dc6
Pull Request resolved: #112108
@wanchaol wanchaol added the ciflow/trunk label Oct 30, 2023
wanchaol added a commit that referenced this pull request Oct 30, 2023
ghstack-source-id: 99df084c81043d803a6a18f82dd924ed5dc88096
Pull Request resolved: #112108
@wanchaol wanchaol requested a review from XilunWu October 31, 2023 00:14
Contributor

@wz337 wz337 left a comment

LGTM!

@wanchaol wanchaol added the ciflow/periodic label Oct 31, 2023
@wanchaol
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x b572f73e06bbce755d9e8bfad886adffb23c33ea returned non-zero exit code 1

Auto-merging test/distributed/_tensor/test_tensor_ops.py
CONFLICT (content): Merge conflict in test/distributed/_tensor/test_tensor_ops.py
Auto-merging torch/distributed/_tensor/dispatch.py
error: could not apply b572f73e06b... [dtensor] enable foreach operators for adam optimizer
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
Details for Dev Infra team · Raised by workflow job

wanchaol added a commit that referenced this pull request Oct 31, 2023
As titled.

cc @XilunWu

wanchaol added a commit that referenced this pull request Oct 31, 2023
ghstack-source-id: 875fb91d4e8dbe9e34718f9df86715ec41981923
Pull Request resolved: #112472
wanchaol added a commit that referenced this pull request Oct 31, 2023
ghstack-source-id: b37c497620b9b15895bbe997c4774e601b2bf9fd
Pull Request resolved: #112108
@wanchaol
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@awgu
Contributor

awgu commented Oct 31, 2023

@wanchaol What is the technical complexity to further enable fused Adam?

@wanchaol
Contributor Author

> @wanchaol What is the technical complexity to further enable fused Adam?

@awgu It should be relatively easy. I haven't tried it yet, but if you need this I can take a look soon.
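
For reference, fused Adam is selected through the existing fused argument on torch.optim.Adam; whether it already works with DTensor parameters is exactly what is being asked here, so the snippet below is hypothetical:

```python
import torch

# Hypothetical: once the fused kernels accept DTensor inputs, opting in would
# look the same as for plain tensors. fused=True requires floating-point
# parameters on a supported device (CUDA at the time of this PR).
optimizer = torch.optim.Adam(dtensor_params, lr=1e-3, fused=True)  # dtensor_params as in the earlier sketch
```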

@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/381/head branch November 3, 2023 14:27
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
Pull Request resolved: pytorch#112108
Approved by: https://github.com/wz337
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
Pull Request resolved: pytorch#112108
Approved by: https://github.com/wz337
andreigh pushed a commit to andreigh/pytorch that referenced this pull request Nov 19, 2023
Pull Request resolved: pytorch#112108
Approved by: https://github.com/wz337
Labels
ciflow/periodic · ciflow/trunk · Merged · release notes: distributed (dtensor)