- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25.7k
remove unnecessary sync point in AveragedModel update #158017
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| This appears to be a diff that was exported from phabricator, but the PR author does not have sufficient permissions to run CI. @gl3lan, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the Pytorch Dev Infra team. | 
| 
 
 | 
| 🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158017
 Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit b65811b with merge base f638854 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. | 
| This pull request was exported from Phabricator. Differential Revision: D78074709 | 
| This pull request was exported from Phabricator. Differential Revision: D78074709 | 
| This pull request was exported from Phabricator. Differential Revision: D78074709 | 
| This pull request was exported from Phabricator. Differential Revision: D78074709 | 
Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based boolean variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709
| This pull request was exported from Phabricator. Differential Revision: D78074709 | 
Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based boolean variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709
ec192fb    to
    6a831d2      
    Compare
  
    | This pull request was exported from Phabricator. Differential Revision: D78074709 | 
Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based boolean variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709
6a831d2    to
    ce429ea      
    Compare
  
    Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based boolean variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709
| This pull request was exported from Phabricator. Differential Revision: D78074709 | 
ce429ea    to
    49f50ec      
    Compare
  
    | @janeyx99 took me a while to get back to it. | 
…nter (pytorch#158017) Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based boolean variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709
036bf60    to
    b65811b      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank u!
| @pytorchbot merge | 
| Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team | 
Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709 Pull Request resolved: pytorch#158017 Approved by: https://github.com/janeyx99
| @pytorchmergebot revert -m "discussed with author - expecting this to break checkpointing" -c ghfirst | 
| @pytorchbot successfully started a revert job. Check the current status here. | 
This reverts commit cb7f45f. Reverted #158017 on behalf of https://github.com/wdvr due to discussed with author - expecting this to break checkpointing ([comment](#158017 (comment)))
| @gl3lan your PR has been successfully reverted. | 
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
| @janeyx99 I propose to revert to the original idea of keeping  | 
Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709 Pull Request resolved: pytorch#158017 Approved by: https://github.com/janeyx99
…h#158017)" This reverts commit cb7f45f. Reverted pytorch#158017 on behalf of https://github.com/wdvr due to discussed with author - expecting this to break checkpointing ([comment](pytorch#158017 (comment)))
Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709 Pull Request resolved: pytorch#158017 Approved by: https://github.com/janeyx99
…h#158017)" This reverts commit cb7f45f. Reverted pytorch#158017 on behalf of https://github.com/wdvr due to discussed with author - expecting this to break checkpointing ([comment](pytorch#158017 (comment)))
Summary: The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update. This test is only meant to know whether the AveragedModel copy has been initialized or not. This diff introduces a CPU-based variable for that purpose. When loading from checkpoint we also make sure the parameter is refreshed. After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction). Test Plan: contbuild & OSS CI Test plan from GitHub: CI Rollback Plan: Differential Revision: D78074709 Pull Request resolved: pytorch#158017 Approved by: https://github.com/janeyx99
…h#158017)" This reverts commit cb7f45f. Reverted pytorch#158017 on behalf of https://github.com/wdvr due to discussed with author - expecting this to break checkpointing ([comment](pytorch#158017 (comment)))
Summary:
The test
bool(self.n_averaged == 0)is a CPU/GPU synchronization point that is called for each update.This test is only meant to know whether the AveragedModel copy has been initialized or not.
This diff introduces a CPU-based variable for that purpose.
When loading from checkpoint we also make sure the parameter is refreshed.
After this fix, each
update_parametercall is reduced to 6ms from 333ms (98% reduction).Test Plan:
contbuild & OSS CI
Test plan from GitHub:
CI
Rollback Plan:
Differential Revision: D78074709