Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leak in _LRScheduler.step() #85602

Closed

Conversation

KinglittleQ
Copy link
Contributor

Fixes #85410

This diff removed the cyclic references in _LRScheduler.step().

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 25, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85602

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 1 Pending

As of commit 266ac1b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Sep 26, 2022
@albanD
Copy link
Collaborator

albanD commented Sep 26, 2022

The change sounds ok but the CI failures are real right?

@KinglittleQ
Copy link
Contributor Author

Thanks for your comments. I will fix the CI.

1 similar comment
@KinglittleQ
Copy link
Contributor Author

Thanks for your comments. I will fix the CI.

@facebook-github-bot
Copy link
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@KinglittleQ
Copy link
Contributor Author

@albanD There are some updates to try to fix the tests. Could you please take another look?

@albanD
Copy link
Collaborator

albanD commented Oct 5, 2022

Thanks!
(GitHub is in a bad state right now, so CI might take a bit of time to trigger properly).

import gc
gc.collect()
run()
garbage = gc.collect()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't want the collect here I think?
THe whole point here is not to create ref cycles right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if there are any ref cycles created in run(), the second gc.collect() would return non-zero.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I was still under the idea of using weakref and didn't read the update carefully. Sounds good!

@KinglittleQ
Copy link
Contributor Author

@albanD The CI needs to be triggered by a maintainer since I am a first-time contributor. Could you take a look?

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks

import gc
gc.collect()
run()
garbage = gc.collect()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I was still under the idea of using weakref and didn't read the update carefully. Sounds good!

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 6, 2022
@albanD albanD added release notes: python_frontend release notes category topic: improvements topic category labels Oct 6, 2022
@albanD
Copy link
Collaborator

albanD commented Oct 6, 2022

@pytorchbot merge -g

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@albanD
Copy link
Collaborator

albanD commented Oct 6, 2022

@pytorchbot -h

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 6, 2022

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. 

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR

Merge

usage: @pytorchbot merge [-g | -f MESSAGE | -l] [-r [{viable/strict,master}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -g, --green           Merge when all status checks running on the PR pass. To add status checks, use labels like `ciflow/trunk`.
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditting purpose, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
  -l, --land-checks     [Deprecated - your PR instead now gets the `ciflow/trunk` label on approval] Merge with land time checks. This will create a new branch with your changes rebased on viable/strict and run a majority of trunk tests _before_ landing to increase trunk reliability and decrease risk of revert. The tests added are: pull, Lint and trunk. Note that periodic is excluded.
  -r [{viable/strict,master}], --rebase [{viable/strict,master}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or master as branch options and will default to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
You, along with any member of the pytorch organization, can rebase your PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR

positional arguments:
  labels  Labels to add to given Pull Request

@albanD
Copy link
Collaborator

albanD commented Oct 6, 2022

@pytorchbot revert -m "newly added test is flaky" -c nosignal

@albanD
Copy link
Collaborator

albanD commented Oct 6, 2022

For more information on the flaky test, see #86413

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@KinglittleQ your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 6, 2022
This reverts commit eb32330.

Reverted #85602 on behalf of https://github.com/albanD due to newly added test is flaky
@albanD
Copy link
Collaborator

albanD commented Oct 6, 2022

@KinglittleQ my best guess of the flakyness here is that if any other object happens to be ready for collection when you run gc.collect(), it will fail the test even though we didn't create a cycle in the function.

Could you update the test to use weakref again so that we make sure that: while the gc is disabled, the weakef remains alive after the function AND after running the gc manually the weakref is gone.

@albanD albanD reopened this Oct 6, 2022
@KinglittleQ
Copy link
Contributor Author

@albanD The test has been updated now.

gc.disable()
ref = run()
assert ref() is None
gc.enable() # restore
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheduler object should be freed after run() if there are no ref cycles.

This test ensures that the weakref will be unavailable even if the gc is disabled. (gc would collect objects with ref cycles)

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! Thanks!

@albanD
Copy link
Collaborator

albanD commented Oct 7, 2022

@pytorchbot merge -g

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

facebook-github-bot pushed a commit that referenced this pull request Oct 10, 2022
Summary:
Fixes #85410

This diff removed the cyclic references in `_LRScheduler.step()`.

Pull Request resolved: #85602
Approved by: https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/b3fdb02fb25508d9c61d70b594f8a7fac3b2a365

Reviewed By: seemethere

Differential Revision: D40197030

Pulled By: seemethere

fbshipit-source-id: ed4be542b6f19bc1030d0ae74b40994a055cff29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request cla signed Merged open source release notes: python_frontend release notes category Reverted topic: improvements topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

_LRScheduler memory leak
6 participants