Fix anomaly mode memory leak #51610

soulitzer · 2021-02-03T02:34:08Z

The memory leak happens when 1) create_graph is True AND 2) detect anomaly mode is on. When a backward node's constructor is called during backward, the current evaluating node is assigned as a "parent" of the created node. The code that assigns the parent encounters the below issue:

functionToPyObject(parent_node) returns a new PyObject (with refcount 1) or if PyObject already exists, increments its refcount by 1. However PyDict_SetItem calls into insertdict which increments refcount again. This means that when dict is destroyed, the refcount of the PyObject is at least one. This keeps parent_node (the backward function) alive, which then keeps the saved tensor alive.

Similar calls in the codebase to functionToPyObject won't require Py_DECREF if it is then passed into a tuple (instead of dict), because the analogous PyTuple_SetItem call does not increment refcount.

facebook-github-bot · 2021-02-03T02:34:16Z

💊 CI failures summary and remediations

As of commit c11db23 (more details on the Dr. CI page):

1/2 failures possibly* introduced in this PR
- 1/1 non-CircleCI failure(s)
1/2 broken upstream at merge base 34d4d79 from Feb 02 until Feb 03

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.

pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 from Feb 02 until Feb 03 (209e27e - 7918f37)
- 🔁 rerun

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

albanD

Good catch!
Actually looking more closely into this file, it might not be the only one :/
In general, how we do things is:

If a function returns a borrowed reference, then we store it in a PyObject*. That way we know we don't own it and nothing needs to be done on destruction.
If a function returns a new reference, then we store it in a THPObjectPtr. Is is a custom class we have that ensures that the refcount is decremented when it is destructed.

So here the result of functionToPyObject should definitely be in a THPObjectPtr.
I think I didn't do a proper review of that on the parent printing PR because the calls to PyObject_GetAttrString and PyObject_CallMethod above should also be stored in THPObjectPtr.

soulitzer · 2021-02-03T17:18:10Z

Good catch!
Actually looking more closely into this file, it might not be the only one :/
In general, how we do things is:
* If a function returns a borrowed reference, then we store it in a `PyObject*`. That way we know we don't own it and nothing needs to be done on destruction.

* If a function returns a new reference, then we store it in a `THPObjectPtr`. Is is a custom class we have that ensures that the refcount is decremented when it is destructed.
So here the result of functionToPyObject should definitely be in a THPObjectPtr.
I think I didn't do a proper review of that on the parent printing PR because the calls to PyObject_GetAttrString and PyObject_CallMethod above should also be stored in THPObjectPtr.

Ahh good to know! I've made the updates.

albanD

That looks great!

Do you think we can write a simple test to make sure this is properly fixed?
I think you can make one by using a similar logic as this one:

pytorch/test/test_autograd.py

Lines 6499 to 6527 in a3f2fe0

    
           import weakref 
        
           def get_tensor_and_weak_ref(): 
        
               # Helper function to get a Tensor and a weak ref that tells us 
        
               # if the c++ version of this Tensor is still alive or not. 
        
               # 
        
               # Create the following reference chain to do so: 
        
               #   - python Tensor t 
        
               #   - c++ Tensor corresponding by t 
        
               #   - c++ Node corresponding to t.grad_fn 
        
               #   - python dict of metadata from this Node 
        
               #   - an object in this dict that we can take a weakref of 
        
               # Create a new Tensor and Node 
        
               t = torch.rand(2, requires_grad=True).clone() 
        
               # Create the metadata dict 
        
               meta_dict = t.grad_fn.metadata 
        
               # Create the object in the dict 
        
               class Foo(object): 
        
                   pass 
        
               my_obj = Foo() 
        
               meta_dict[0] = my_obj 
        
               # After exiting this function, the python Tensor t is the only 
        
               # thing keeping ref alive 
        
               ref = weakref.ref(my_obj) 
        
               return t, ref

albanD

Thanks for the test!
LGTM

soulitzer · 2021-02-03T23:03:59Z

@albanD ahh I added another test in the meantime... this one checks that metadata dict is destroyed properly in the nested case. But now I'm thinking it might've not been worth the effort.

albanD · 2021-02-03T23:15:33Z

More tests is always better. Especially if they run instantly like this one!

facebook-github-bot

@soulitzer has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-02-04T19:57:47Z

@soulitzer merged this pull request in 2e8e560.

Decref extra ref

de59600

facebook-github-bot added the cla signed label Feb 3, 2021

soulitzer marked this pull request as ready for review February 3, 2021 05:34

soulitzer requested a review from albanD as a code owner February 3, 2021 05:34

albanD reviewed Feb 3, 2021

View reviewed changes

Wrap with THPObjectPtr

f123ce3

soulitzer requested a review from albanD February 3, 2021 17:17

albanD reviewed Feb 3, 2021

View reviewed changes

Add test

122aa49

albanD approved these changes Feb 3, 2021

View reviewed changes

Add test that checks print_stack

2242827

flake8

4e16a8d

Simplify test

30ea95d

soulitzer force-pushed the anomaly-mode-refcount branch from d051da2 to 30ea95d Compare February 3, 2021 23:42

Clean up

c11db23

facebook-github-bot reviewed Feb 4, 2021

View reviewed changes

facebook-github-bot closed this in 2e8e560 Feb 4, 2021

facebook-github-bot added the Merged label Feb 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix anomaly mode memory leak #51610

Fix anomaly mode memory leak #51610

soulitzer commented Feb 3, 2021 •

edited

facebook-github-bot commented Feb 3, 2021 •

edited

albanD left a comment

soulitzer commented Feb 3, 2021

albanD left a comment

albanD left a comment

soulitzer commented Feb 3, 2021

albanD commented Feb 3, 2021

facebook-github-bot left a comment

facebook-github-bot commented Feb 4, 2021

	import weakref

	def get_tensor_and_weak_ref():
	# Helper function to get a Tensor and a weak ref that tells us
	# if the c++ version of this Tensor is still alive or not.
	#
	# Create the following reference chain to do so:
	# - python Tensor t
	# - c++ Tensor corresponding by t
	# - c++ Node corresponding to t.grad_fn
	# - python dict of metadata from this Node
	# - an object in this dict that we can take a weakref of


	# Create a new Tensor and Node
	t = torch.rand(2, requires_grad=True).clone()
	# Create the metadata dict
	meta_dict = t.grad_fn.metadata
	# Create the object in the dict

	class Foo(object):
	pass
	my_obj = Foo()
	meta_dict[0] = my_obj

	# After exiting this function, the python Tensor t is the only
	# thing keeping ref alive
	ref = weakref.ref(my_obj)
	return t, ref

Fix anomaly mode memory leak #51610

Fix anomaly mode memory leak #51610

Conversation

soulitzer commented Feb 3, 2021 • edited

facebook-github-bot commented Feb 3, 2021 • edited

💊 CI failures summary and remediations

🚧 1 fixed upstream failure:

albanD left a comment

Choose a reason for hiding this comment

soulitzer commented Feb 3, 2021

albanD left a comment

Choose a reason for hiding this comment

albanD left a comment

Choose a reason for hiding this comment

soulitzer commented Feb 3, 2021

albanD commented Feb 3, 2021

facebook-github-bot left a comment

Choose a reason for hiding this comment

facebook-github-bot commented Feb 4, 2021

soulitzer commented Feb 3, 2021 •

edited

facebook-github-bot commented Feb 3, 2021 •

edited