Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix anomaly mode memory leak #51610

Closed
wants to merge 7 commits into from

Conversation

soulitzer
Copy link
Contributor

@soulitzer soulitzer commented Feb 3, 2021

Fixes #51349

The memory leak happens when 1) create_graph is True AND 2) detect anomaly mode is on. When a backward node's constructor is called during backward, the current evaluating node is assigned as a "parent" of the created node. The code that assigns the parent encounters the below issue:

functionToPyObject(parent_node) returns a new PyObject (with refcount 1) or if PyObject already exists, increments its refcount by 1. However PyDict_SetItem calls into insertdict which increments refcount again. This means that when dict is destroyed, the refcount of the PyObject is at least one. This keeps parent_node (the backward function) alive, which then keeps the saved tensor alive.

Similar calls in the codebase to functionToPyObject won't require Py_DECREF if it is then passed into a tuple (instead of dict), because the analogous PyTuple_SetItem call does not increment refcount.

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Feb 3, 2021

💊 CI failures summary and remediations

As of commit c11db23 (more details on the Dr. CI page):


  • 1/2 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 1/2 broken upstream at merge base 34d4d79 from Feb 02 until Feb 03

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@soulitzer soulitzer marked this pull request as ready for review February 3, 2021 05:34
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
Actually looking more closely into this file, it might not be the only one :/
In general, how we do things is:

  • If a function returns a borrowed reference, then we store it in a PyObject*. That way we know we don't own it and nothing needs to be done on destruction.
  • If a function returns a new reference, then we store it in a THPObjectPtr. Is is a custom class we have that ensures that the refcount is decremented when it is destructed.

So here the result of functionToPyObject should definitely be in a THPObjectPtr.
I think I didn't do a proper review of that on the parent printing PR because the calls to PyObject_GetAttrString and PyObject_CallMethod above should also be stored in THPObjectPtr.

@soulitzer soulitzer requested a review from albanD February 3, 2021 17:17
@soulitzer
Copy link
Contributor Author

Good catch!
Actually looking more closely into this file, it might not be the only one :/
In general, how we do things is:

* If a function returns a borrowed reference, then we store it in a `PyObject*`. That way we know we don't own it and nothing needs to be done on destruction.

* If a function returns a new reference, then we store it in a `THPObjectPtr`. Is is a custom class we have that ensures that the refcount is decremented when it is destructed.

So here the result of functionToPyObject should definitely be in a THPObjectPtr.
I think I didn't do a proper review of that on the parent printing PR because the calls to PyObject_GetAttrString and PyObject_CallMethod above should also be stored in THPObjectPtr.

Ahh good to know! I've made the updates.

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks great!

Do you think we can write a simple test to make sure this is properly fixed?
I think you can make one by using a similar logic as this one:

pytorch/test/test_autograd.py

Lines 6499 to 6527 in a3f2fe0

import weakref
def get_tensor_and_weak_ref():
# Helper function to get a Tensor and a weak ref that tells us
# if the c++ version of this Tensor is still alive or not.
#
# Create the following reference chain to do so:
# - python Tensor t
# - c++ Tensor corresponding by t
# - c++ Node corresponding to t.grad_fn
# - python dict of metadata from this Node
# - an object in this dict that we can take a weakref of
# Create a new Tensor and Node
t = torch.rand(2, requires_grad=True).clone()
# Create the metadata dict
meta_dict = t.grad_fn.metadata
# Create the object in the dict
class Foo(object):
pass
my_obj = Foo()
meta_dict[0] = my_obj
# After exiting this function, the python Tensor t is the only
# thing keeping ref alive
ref = weakref.ref(my_obj)
return t, ref

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test!
LGTM

@soulitzer
Copy link
Contributor Author

@albanD ahh I added another test in the meantime... this one checks that metadata dict is destroyed properly in the nested case. But now I'm thinking it might've not been worth the effort.

@albanD
Copy link
Collaborator

albanD commented Feb 3, 2021

More tests is always better. Especially if they run instantly like this one!

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soulitzer has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@soulitzer merged this pull request in 2e8e560.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

autograd.grad with set_detect_anomaly(True) will cause memory leak
3 participants