gl3lan (Contributor) commented Jul 10, 2025

Summary:
The check `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point, and it runs on every update.
Its only purpose is to tell whether the AveragedModel copy has been initialized yet.
This diff introduces a CPU-based variable for that purpose.
When loading from a checkpoint, we also make sure that variable is refreshed.

After this fix, each `update_parameters` call takes 6 ms instead of 333 ms (a 98% reduction).
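
A minimal sketch of the idea in the summary, not the actual diff: the counter stays a tensor buffer, while a plain Python boolean on the host answers the "initialized yet?" question without reading from the device. The class `RunningAverage`, its `update` method, and the `_initialized` attribute are hypothetical names used only for illustration; the real `AveragedModel` is more involved.

```python
import torch


class RunningAverage(torch.nn.Module):
    """Hypothetical stand-in for an averaged-model wrapper (not the real AveragedModel)."""

    def __init__(self, size: int, device: str = "cpu") -> None:
        super().__init__()
        self.register_buffer("avg", torch.zeros(size, device=device))
        # The counter stays a tensor buffer, so it is checkpointed under the same key.
        self.register_buffer(
            "n_averaged", torch.tensor(0, dtype=torch.long, device=device)
        )
        # Host-side flag (assumed name); reading it never touches the device.
        self._initialized = False

    def update(self, value: torch.Tensor) -> None:
        # The slow variant would be `if bool(self.n_averaged == 0):`, which
        # copies a device scalar to the host and blocks until it arrives.
        if not self._initialized:
            self.avg.copy_(value)
            self._initialized = True
        else:
            # Running mean: avg += (value - avg) / (n + 1)
            self.avg.add_((value - self.avg) / (self.n_averaged + 1))
        self.n_averaged += 1

    def load_state_dict(self, state_dict, *args, **kwargs):
        result = super().load_state_dict(state_dict, *args, **kwargs)
        # A single synchronization at load time refreshes the host-side flag.
        self._initialized = bool(self.n_averaged.item() > 0)
        return result
```

In this sketch the flag is deliberately kept out of the `state_dict`, so the saved keys are unchanged and the flag is rebuilt from `n_averaged` after loading.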

Test Plan:
contbuild & OSS CI
Test plan from GitHub:
CI

Rollback Plan:

Differential Revision: D78074709

@gl3lan gl3lan requested review from albanD and janeyx99 as code owners July 10, 2025 09:30
pytorch-bot bot commented Jul 10, 2025

This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @gl3lan, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.

linux-foundation-easycla bot commented Jul 10, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: gl3lan / name: Gaël Le Lan (b65811b)

pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158017

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b65811b with merge base f638854:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D78074709

@albanD albanD removed their request for review July 10, 2025 13:46
@gl3lan gl3lan force-pushed the export-D78074709 branch from 6463fb5 to 5c3ca33 Compare July 10, 2025 16:58
@gl3lan gl3lan force-pushed the export-D78074709 branch from 5c3ca33 to f61c1f5 Compare July 14, 2025 08:36
gl3lan added a commit to gl3lan/pytorch that referenced this pull request Jul 14, 2025
Summary:

The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update.
This test is only meant to know whether the AveragedModel copy has been initialized or not.
This diff introduces a CPU-based boolean variable for that purpose.
When loading from checkpoint we also make sure the parameter is refreshed.

After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction).

Test Plan:
contbuild & OSS CI
Test plan from GitHub:
CI

Rollback Plan:

Differential Revision: D78074709
@gl3lan gl3lan marked this pull request as draft July 14, 2025 08:38
@gl3lan gl3lan force-pushed the export-D78074709 branch from f61c1f5 to ec192fb Compare July 22, 2025 08:06
gl3lan added a commit to gl3lan/pytorch that referenced this pull request Jul 22, 2025
@gl3lan gl3lan marked this pull request as ready for review July 22, 2025 09:01
@gl3lan gl3lan force-pushed the export-D78074709 branch from ec192fb to 6a831d2 Compare August 3, 2025 20:58
gl3lan added a commit to gl3lan/pytorch that referenced this pull request Aug 3, 2025
@gl3lan gl3lan force-pushed the export-D78074709 branch from 6a831d2 to ce429ea Compare August 3, 2025 22:02
gl3lan added a commit to gl3lan/pytorch that referenced this pull request Aug 3, 2025
@gl3lan gl3lan force-pushed the export-D78074709 branch from ce429ea to 49f50ec Compare August 4, 2025 18:41
gl3lan (Contributor, Author) commented Sep 16, 2025

@janeyx99 took me a while to get back to it.

…nter (pytorch#158017)

facebook-github-bot (Contributor) commented

@gl3lan has exported this pull request. If you are a Meta employee, you can view the originating diff in D78074709.

janeyx99 (Contributor) previously approved these changes Sep 16, 2025

thank u!

janeyx99 (Contributor) commented

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 16, 2025
pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Summary:
The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update.
This test is only meant to know whether the AveragedModel copy has been initialized or not.
This diff introduces a CPU-based variable for that purpose.
When loading from checkpoint we also make sure the parameter is refreshed.

After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction).

Test Plan:
contbuild & OSS CI
Test plan from GitHub:
CI

Rollback Plan:

Differential Revision: D78074709

Pull Request resolved: pytorch#158017
Approved by: https://github.com/janeyx99
wdvr (Contributor) commented Sep 17, 2025

@pytorchmergebot revert -m "discussed with author - expecting this to break checkpointing" -c ghfirst

pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a revert job. Check the current status here.

pytorchmergebot added a commit that referenced this pull request Sep 17, 2025
This reverts commit cb7f45f.

Reverted #158017 on behalf of https://github.com/wdvr due to discussed with author - expecting this to break checkpointing ([comment](#158017 (comment)))
pytorchmergebot (Collaborator) commented

@gl3lan your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Sep 17, 2025
@pytorch-bot pytorch-bot bot dismissed janeyx99’s stale review September 17, 2025 08:02

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

gl3lan (Contributor, Author) commented Sep 17, 2025

@janeyx99 I propose going back to the original idea: keep `n_averaged` as a tensor and use a proxy boolean. Otherwise this breaks DCP, which will search for `._extra_state.n_averaged` in the saved checkpoint. DCP reads the model's state_dict to learn which keys to look for in the checkpoint, and only then loads the updated state_dict back.
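
A simplified, pure-Python model of the failure described in this comment, under the assumption that the loader takes its list of keys from the model's current state_dict. The helper `load_like_dcp` and the dictionary contents are hypothetical and only mimic that behavior; this is not DCP's API.

```python
def load_like_dcp(model_state: dict, checkpoint: dict) -> dict:
    """Toy loader: the model's current keys decide what is read from the checkpoint."""
    missing = [key for key in model_state if key not in checkpoint]
    if missing:
        raise KeyError(f"keys not found in checkpoint: {missing}")
    return {key: checkpoint[key] for key in model_state}


# Checkpoint written while n_averaged was still a tensor buffer.
old_checkpoint = {"module.weight": [0.5, 0.5], "n_averaged": 10}

# Layout that keeps the buffer: same keys as the checkpoint, so loading works.
state_with_buffer = {"module.weight": [0.0, 0.0], "n_averaged": 0}
loaded = load_like_dcp(state_with_buffer, old_checkpoint)

# Layout that moves the counter into extra state: the model now asks for a
# key that the older checkpoint never contained.
state_with_extra = {"module.weight": [0.0, 0.0], "_extra_state.n_averaged": 0}
try:
    load_like_dcp(state_with_extra, old_checkpoint)
    error = None
except KeyError as exc:
    error = str(exc)
```

Keeping `n_averaged` as a tensor and tracking initialization with a separate host-side boolean leaves the set of state_dict keys unchanged, which is why the proxy-boolean variant avoids this mismatch in the toy model.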

mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
…h#158017)"

cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
…h#158017)"

dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
…h#158017)"
