
[dtensor] enable foreach operators for adam optimizer #112108

Closed
wants to merge 4 commits

Conversation

wanchaol
Contributor

@wanchaol wanchaol commented Oct 26, 2023

Stack from ghstack (oldest at bottom):

This PR enables basic foreach ops in DTensor for the Adam optimizer, to improve performance compared to running the optimizer on plain torch.Tensor. Currently the optimizer does not take the foreach path by default for tensor subclasses. We will need to enable it by default for DTensor once all the foreach ops are covered, or enable it early when exploring the new FSDP; in either case we just need to append DTensor to the optimizer's foreach allow list.

Some latency measurements on a 5-layer MLP model:
single-tensor Adam: 17 ms (Screenshot 2023-10-29 at 10 48 22 PM)
foreach multi-tensor Adam: 4 ms (Screenshot 2023-10-29 at 10 50 58 PM)

so around a 4.25x improvement
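
For illustration, here is a minimal sketch of driving the foreach Adam path on DTensor parameters. The model size, mesh setup, and Shard(0) placement are assumptions made for the example, not details taken from this PR:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Assumes a default process group is already initialized (e.g. launched via torchrun).
mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))

# A hypothetical 5-layer MLP standing in for the benchmarked model.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(5)])

# Shard each parameter along dim 0 so the optimizer operates on DTensors.
dtensor_params = [
    distribute_tensor(p.detach(), mesh, [Shard(0)]).requires_grad_()
    for p in model.parameters()
]

# foreach=True requests the multi-tensor (torch._foreach_*) Adam implementation,
# which is the code path this PR makes work for DTensor inputs.
optimizer = torch.optim.Adam(dtensor_params, lr=1e-3, foreach=True)
```

Until DTensor is appended to the optimizer's default foreach allow list as described above, the multi-tensor path has to be requested explicitly with foreach=True as in the sketch.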

cc @awgu

@pytorch-bot

pytorch-bot bot commented Oct 26, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112108

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit e17ac89 with merge base 08dbfec:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wanchaol added a commit that referenced this pull request Oct 26, 2023
ghstack-source-id: bf89d590477fe43dc8f00220ac9713e3c775a442
Pull Request resolved: #112108
@wanchaol wanchaol added the release notes: distributed (dtensor) label Oct 26, 2023
wanchaol added a commit that referenced this pull request Oct 30, 2023
ghstack-source-id: d4e77bc5939eb6955f536c906ef874a51c4f6dc6
Pull Request resolved: #112108
@wanchaol wanchaol added the ciflow/trunk label Oct 30, 2023
wanchaol added a commit that referenced this pull request Oct 30, 2023
ghstack-source-id: 99df084c81043d803a6a18f82dd924ed5dc88096
Pull Request resolved: #112108
@wanchaol wanchaol requested a review from XilunWu October 31, 2023 00:14
Contributor

@wz337 wz337 left a comment

LGTM!

@wanchaol wanchaol added the ciflow/periodic label Oct 31, 2023
@wanchaol
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x b572f73e06bbce755d9e8bfad886adffb23c33ea returned non-zero exit code 1

Auto-merging test/distributed/_tensor/test_tensor_ops.py
CONFLICT (content): Merge conflict in test/distributed/_tensor/test_tensor_ops.py
Auto-merging torch/distributed/_tensor/dispatch.py
error: could not apply b572f73e06b... [dtensor] enable foreach operators for adam optimizer
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
Details for Dev Infra team · Raised by workflow job

wanchaol added a commit that referenced this pull request Oct 31, 2023
As titled.

cc @XilunWu

wanchaol added a commit that referenced this pull request Oct 31, 2023
ghstack-source-id: 875fb91d4e8dbe9e34718f9df86715ec41981923
Pull Request resolved: #112472
wanchaol added a commit that referenced this pull request Oct 31, 2023
ghstack-source-id: b37c497620b9b15895bbe997c4774e601b2bf9fd
Pull Request resolved: #112108
@wanchaol
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@awgu
Contributor

awgu commented Oct 31, 2023

@wanchaol What is the technical complexity to further enable fused Adam?

@wanchaol
Contributor Author

> @wanchaol What is the technical complexity to further enable fused Adam?

@awgu It should be relatively easy. I haven't tried it yet, but if you need this I can take a look soon.
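
For reference, fused Adam is selected through the existing fused argument on torch.optim.Adam; whether it already works with DTensor parameters is exactly what is being asked here, so the snippet below is hypothetical:

```python
import torch

# Hypothetical: once the fused kernels accept DTensor inputs, opting in would
# look the same as for plain tensors. fused=True requires floating-point
# parameters on a supported device (CUDA at the time of this PR).
optimizer = torch.optim.Adam(dtensor_params, lr=1e-3, fused=True)  # dtensor_params as in the earlier sketch
```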

@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/381/head branch November 3, 2023 14:27
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
Pull Request resolved: pytorch#112108
Approved by: https://github.com/wz337
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
Pull Request resolved: pytorch#112108
Approved by: https://github.com/wz337
andreigh pushed a commit to andreigh/pytorch that referenced this pull request Nov 19, 2023
Pull Request resolved: pytorch#112108
Approved by: https://github.com/wz337
Labels
ciflow/periodic · ciflow/trunk · Merged · release notes: distributed (dtensor)