Fix anomaly mode memory leak #51610
Good catch!
Actually looking more closely into this file, it might not be the only one :/
In general, how we do things is:
- If a function returns a borrowed reference, then we store it in a `PyObject*`. That way we know we don't own it and nothing needs to be done on destruction.
- If a function returns a new reference, then we store it in a `THPObjectPtr`. It is a custom class we have that ensures that the refcount is decremented when it is destructed.
So here the result of `functionToPyObject` should definitely be in a `THPObjectPtr`.
I think I didn't do a proper review of that on the parent printing PR, because the calls to `PyObject_GetAttrString` and `PyObject_CallMethod` above should also be stored in `THPObjectPtr`.
Ahh good to know! I've made the updates.
That looks great!
Do you think we can write a simple test to make sure this is properly fixed?
I think you can make one by using a similar logic as this one:
Lines 6499 to 6527 in a3f2fe0
    import weakref

    def get_tensor_and_weak_ref():
        # Helper function to get a Tensor and a weak ref that tells us
        # if the c++ version of this Tensor is still alive or not.
        #
        # Create the following reference chain to do so:
        #  - python Tensor t
        #  - c++ Tensor corresponding to t
        #  - c++ Node corresponding to t.grad_fn
        #  - python dict of metadata from this Node
        #  - an object in this dict that we can take a weakref of

        # Create a new Tensor and Node
        t = torch.rand(2, requires_grad=True).clone()
        # Create the metadata dict
        meta_dict = t.grad_fn.metadata
        # Create the object in the dict
        class Foo(object):
            pass
        my_obj = Foo()
        meta_dict[0] = my_obj

        # After exiting this function, the python Tensor t is the only
        # thing keeping ref alive
        ref = weakref.ref(my_obj)
        return t, ref
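The weakref pattern in the helper above can be tried out without PyTorch. The sketch below is only an illustration: `FakeTensor`, `FakeNode`, and `Marker` are hypothetical stand-ins for the Tensor, its `grad_fn` Node, and the object stored in the metadata dict, and it assumes CPython's reference-counting behaviour.

```python
import gc
import weakref


class FakeNode:
    """Hypothetical stand-in for an autograd Node that owns a metadata dict."""

    def __init__(self):
        self.metadata = {}


class FakeTensor:
    """Hypothetical stand-in for a Tensor that owns its grad_fn Node."""

    def __init__(self):
        self.grad_fn = FakeNode()


class Marker:
    """Plain object that supports weak references."""


def get_owner_and_weak_ref():
    # Build the chain: owner -> node -> metadata dict -> marker.
    owner = FakeTensor()
    marker = Marker()
    owner.grad_fn.metadata[0] = marker
    # Once this function returns, `owner` is the only thing keeping marker alive.
    return owner, weakref.ref(marker)


def chain_is_freed_after_del():
    owner, ref = get_owner_and_weak_ref()
    alive_before = ref() is not None
    del owner
    gc.collect()
    # If nothing leaked a reference, the weak ref is now dead.
    return alive_before, ref() is None
```

A leak test then only needs to assert that the weak reference is alive while the owner exists and dead once it is deleted.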
Thanks for the test!
LGTM
@albanD ahh I added another test in the meantime... this one checks that the metadata dict is destroyed properly in the nested case. But now I'm thinking it might not have been worth the effort.
More tests are always better. Especially if they run instantly like this one!
Force-pushed from d051da2 to 30ea95d.
@soulitzer has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@soulitzer merged this pull request in 2e8e560.
Fixes #51349
The memory leak happens when 1) `create_graph` is True AND 2) detect anomaly mode is on. When a backward node's constructor is called during backward, the currently evaluating node is assigned as a "parent" of the created node. The code that assigns the parent encounters the following issue:

`functionToPyObject(parent_node)` returns a new PyObject (with refcount 1) or, if the PyObject already exists, increments its refcount by 1. However, `PyDict_SetItem` calls into `insertdict`, which increments the refcount again. This means that when the dict is destroyed, the refcount of the PyObject is still at least one. This keeps `parent_node` (the backward function) alive, which then keeps the saved tensor alive.

Similar calls to `functionToPyObject` elsewhere in the codebase don't require `Py_DECREF` if the result is then passed into a tuple (instead of a dict), because the analogous `PyTuple_SetItem` call does not increment the refcount.
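The refcount arithmetic described above can be reproduced from pure Python as a rough sketch. This is not the PyTorch code: `Node` and the `"parent"` key are hypothetical, `Py_IncRef` (called through `ctypes.pythonapi`) merely simulates the new reference that `functionToPyObject` hands back, and the whole thing assumes CPython.

```python
import ctypes
import gc
import weakref

# Function forms of Py_INCREF / Py_DECREF exported by CPython.
_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None
_decref = ctypes.pythonapi.Py_DecRef
_decref.argtypes = [ctypes.py_object]
_decref.restype = None


class Node:
    """Hypothetical stand-in for the Python object wrapping a backward node."""


def node_is_freed(release_owned_ref):
    node = Node()
    ref = weakref.ref(node)

    # Simulate receiving a *new* reference that the caller now owns.
    _incref(node)

    # Storing into a dict adds the dict's own reference, as PyDict_SetItem does.
    metadata = {"parent": node}

    if release_owned_ref:
        # The fix: give up the reference we own once the dict holds its own.
        _decref(node)

    # Destroy the dict and the local name; only a leaked reference can remain.
    del metadata
    del node
    gc.collect()
    return ref() is None
```

With `release_owned_ref=False` the object outlives the dict, which mirrors the leak; with `release_owned_ref=True` it is freed as soon as the dict goes away. In the C++ code the same effect comes from holding the result in a `THPObjectPtr`, as suggested in the review.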