[dcp] fix fsdp state_dict to use run_check=False #114995

Closed
wants to merge 1 commit into from

@wanchaol wanchaol commented Dec 2, 2023

Stack from ghstack (oldest at bottom):

`from_local` with a replicate placement runs `mesh_broadcast` when
`run_check=True`, and `run_check=True` is the default for `from_local`.
In the FSDP state_dict case we already know these tensors are replicas,
so the check (and the broadcast it forces) is unnecessary.
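
To illustrate the cost being avoided, here is a toy sketch in plain Python. It is not PyTorch code: `ToyMesh` and `toy_from_local_replicated` are hypothetical names made up for this illustration, and the "broadcast" is only a counter plus a list copy. It assumes, as described above, that the check path issues one broadcast from the first rank and that skipping it is safe when every rank already holds the same data.

```python
from dataclasses import dataclass


@dataclass
class ToyMesh:
    """Hypothetical stand-in for a device mesh: one local value per rank."""
    local_values: list
    broadcasts: int = 0  # number of simulated collectives issued

    def broadcast_from_rank0(self):
        # Simulated mesh broadcast: every rank ends up with rank 0's value.
        self.broadcasts += 1
        self.local_values = [self.local_values[0]] * len(self.local_values)


def toy_from_local_replicated(mesh, run_check=True):
    """Toy model of wrapping per-rank values as a replicated tensor."""
    if run_check:
        # The check path forces replicas to agree, at the cost of a collective.
        mesh.broadcast_from_rank0()
    return list(mesh.local_values)


# Four ranks that already hold identical data (the FSDP state_dict situation).
replicas = [[1.0, 2.0]] * 4

checked = ToyMesh(list(replicas))
with_check = toy_from_local_replicated(checked, run_check=True)

unchecked = ToyMesh(list(replicas))
without_check = toy_from_local_replicated(unchecked, run_check=False)

# Same result either way, but only the checked path paid for a broadcast.
print(checked.broadcasts, unchecked.broadcasts, with_check == without_check)
```

In PyTorch itself the corresponding change is to pass `run_check=False` explicitly to `DTensor.from_local`, which is only safe when the caller can guarantee that the local tensors are already identical replicas.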

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225 @kiukchung @d4l3k @LucasLLC

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Dec 2, 2023

pytorch-bot bot commented Dec 2, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114995

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7a6325e with merge base 67562c8 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wanchaol wanchaol requested a review from wz337 December 2, 2023 00:16
@wanchaol wanchaol requested a review from fegin December 2, 2023 00:16
wanchaol added a commit that referenced this pull request Dec 2, 2023
`from_local` with a replicate placement runs `mesh_broadcast` when
`run_check=True`, and `run_check=True` is the default for `from_local`.
In the FSDP state_dict case we already know these tensors are replicas,
so the check (and the broadcast it forces) is unnecessary.

ghstack-source-id: 9c6ec637e8fe24ae155a72c483f5b3a8b2007090
Pull Request resolved: #114995
Contributor

@fegin fegin left a comment

Nice catch!

@wanchaol wanchaol added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 2, 2023
@wanchaol
Contributor Author

wanchaol commented Dec 2, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@wanchaol
Contributor Author

wanchaol commented Dec 2, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@wanchaol
Contributor Author

wanchaol commented Dec 2, 2023

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

@albanD albanD added oncall: distributed Add this issue/PR to distributed oncall triage queue and removed module: distributed labels Dec 8, 2023
hyperfraise pushed a commit to hyperfraise/pytorch that referenced this pull request Dec 21, 2023
`from_local` with a replicate placement runs `mesh_broadcast` when
`run_check=True`, and `run_check=True` is the default for `from_local`.
In the FSDP state_dict case we already know these tensors are replicas,
so the check (and the broadcast it forces) is unnecessary.

Pull Request resolved: pytorch#114995
Approved by: https://github.com/fegin, https://github.com/XilunWu, https://github.com/wz337
Labels: ciflow/inductor; ciflow/trunk (Trigger trunk jobs on your pull request); Merged; oncall: distributed (Add this issue/PR to distributed oncall triage queue); release notes: distributed (fsdp) (release notes category)
6 participants