-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[DCP] Fix Optimizer Learning Rate not being loaded correctly #129398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129398
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure, 2 Unrelated FailuresAs of commit b94592b with merge base 1c75ddf ( NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
3062c45
to
88d600e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, are we sure that the meta device still works?
88d600e
to
259d7a3
Compare
# So we need to temporariy update the each element in the state dict with meta tensor. | ||
for k in state_dict.keys(): | ||
state_dict[k] = state_dict_assigned_storage[k] | ||
tree_map_only_(torch.Tensor, lambda v: _init_meta_tensor(v), state_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason a lambda function is used here? Would tree_map_only_(torch.Tensor, _init_meta_tensor, state_dict)
not have worked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. Thanks! Good catch. I modified it to remove the lambda.
259d7a3
to
b94592b
Compare
@pytorchmergebot merge -i |
Merge failedReason: This PR needs a If not, please add the To add a label, you can comment to pytorchbot, for example For more information, see Details for Dev Infra teamRaised by workflow job |
@pytorchmergebot merge -i |
Merge startedYour change will be merged while ignoring the following 3 checks: pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge), periodic / win-vs2019-cuda11.8-py3 / test (default, 1, 4, windows.g5.4xlarge.nvidia.gpu), periodic / win-vs2019-cuda11.8-py3 / test (default, 2, 4, windows.g5.4xlarge.nvidia.gpu) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
@pytorchmergebot merge -i |
The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command |
Merge startedYour change will be merged while ignoring the following 3 checks: pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge), periodic / win-vs2019-cuda11.8-py3 / test (default, 1, 4, windows.g5.4xlarge.nvidia.gpu), periodic / win-vs2019-cuda11.8-py3 / test (default, 2, 4, windows.g5.4xlarge.nvidia.gpu) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…#129398) Fixes pytorch#129079 Currently, the tensor object is loading correctly in-place, but the non-tensor object such as learning rate is not load correctly after pytorch@f518cf8, which is a regression introduced in 2.3. This PR replaces tree_map_only and manual replacement of the state dict items with _tree_map_only and fixes the regression of non-tensor loading. Test: ``` python3 test/distributed/checkpoint/e2e/test_e2e_save_and_load.py -k test_init_state_dict python3 test/distributed/checkpoint/test_tp_checkpoint.py -k test_tp_checkpoint_load_on_meta_device ``` Pull Request resolved: pytorch#129398 Approved by: https://github.com/fegin (cherry picked from commit 8b8e2fc)
…tly (#129398) (#129683) [DCP] Fix Optimizer Learning Rate not being loaded correctly (#129398) Fixes #129079 Currently, the tensor object is loading correctly in-place, but the non-tensor object such as learning rate is not load correctly after f518cf8, which is a regression introduced in 2.3. This PR replaces tree_map_only and manual replacement of the state dict items with _tree_map_only and fixes the regression of non-tensor loading. Test: ``` python3 test/distributed/checkpoint/e2e/test_e2e_save_and_load.py -k test_init_state_dict python3 test/distributed/checkpoint/test_tp_checkpoint.py -k test_tp_checkpoint_load_on_meta_device ``` Pull Request resolved: #129398 Approved by: https://github.com/fegin (cherry picked from commit 8b8e2fc)
Fixes #129079
Currently, the tensor object is loading correctly in-place, but the non-tensor object such as learning rate is not load correctly after f518cf8, which is a regression introduced in 2.3.
This PR replaces tree_map_only and manual replacement of the state dict items with _tree_map_only and fixes the regression of non-tensor loading.
Test:
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @fegin @XilunWu @wanchaol @fduwjj @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC @MeetVadakkanchery @mhorowitz