[dcp] fix fsdp state_dict to use run_check=False #114995
from_local with a replicate placement runs mesh_broadcast when run_check=True, and from_local defaults to run_check=True. In the FSDP state_dict case we already know the tensors are replicas, so there is no need to check them or force them into agreement; pass run_check=False instead.
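The effect of the flag can be illustrated with a toy model. This is a sketch in plain Python, not PyTorch code: `ToyMesh` and `toy_from_local` are hypothetical stand-ins invented for illustration, and the only behavior modeled is the one described above (a replicate placement plus run_check=True triggers a broadcast).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyMesh:
    """Hypothetical stand-in for a device mesh: one value per rank."""
    rank_values: List[float]
    broadcasts: int = 0  # counts simulated collectives

    def broadcast_from_rank0(self) -> None:
        # Models the collective: every rank ends up with rank 0's value.
        self.broadcasts += 1
        self.rank_values = [self.rank_values[0]] * len(self.rank_values)


def toy_from_local(mesh: ToyMesh, replicate: bool, run_check: bool) -> List[float]:
    """Hypothetical model of the behavior described in this PR."""
    if replicate and run_check:
        # The check path forces all ranks to agree via a broadcast.
        mesh.broadcast_from_rank0()
    return mesh.rank_values


# FSDP state_dict case: the values are already identical on every rank.
checked = ToyMesh([1.0, 1.0, 1.0, 1.0])
toy_from_local(checked, replicate=True, run_check=True)

unchecked = ToyMesh([1.0, 1.0, 1.0, 1.0])
toy_from_local(unchecked, replicate=True, run_check=False)

print(checked.broadcasts, unchecked.broadcasts)
```

In this model both calls leave the per-rank values identical, but only the run_check=True call pays for a collective, which is the cost the PR removes when the inputs are known to be replicas.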
Dr. CI: ❌ 1 new failure as of commit 7a6325e with merge base 67562c8.
ghstack-source-id: 9c6ec637e8fe24ae155a72c483f5b3a8b2007090
Pull Request resolved: #114995
Nice catch!
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed. Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed. Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
@pytorchbot merge -i
Merge started: Your change will be merged while ignoring the following 1 checks: inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
Merged commit message: from_local with a replicate placement runs mesh_broadcast when run_check=True, and from_local defaults to run_check=True. In the FSDP state_dict case the tensors are already replicas, so the check is unnecessary.
Pull Request resolved: pytorch#114995
Approved by: https://github.com/fegin, https://github.com/XilunWu, https://github.com/wz337
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225 @kiukchung @d4l3k @LucasLLC