Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix python gc race condition with THPVariable_traverse #4437

Merged
merged 2 commits into from
Jan 2, 2018

Conversation

zou3519
Copy link
Contributor

@zou3519 zou3519 commented Jan 2, 2018

Fixes #3883: asserts being triggered in DataParallel for long training regiments like imagenet.

The bug is that a python Function in the computation graph gets garbage collected due to a C++ shared pointer to the C++ Function it wraps changing use_count() from 1 to more than 1 during python garbage collection. This happens because the "python state" of Variable changes: the THPVariable_traverse method indicates that when the shared pointer's use_count() is 1, the python Variable holds a python reference to the python Function, but not when use_count() > 1. See this gist for a more detailed exploration of the diagnosis.

The bug could be fixed by treating the shared pointer (a Variable's grad_fn) as part of the python state and intelligently grabbing the GIL when necessary. The downside to this approach is the complexity and potential overhead.

Alternatively, after talking with @colesbury, we think it might be better to remove the checking of the shared pointer's use_count() in THPVariable_traverse. This effectively disables reference cycle garbage collection for python Functions.

Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment next to the deleted C++ code that links to your gist and maybe says a few words about why we don't want to traverse to the grad_fn. Also, mention that this in fact leaks when hooks are involved in a cycle

@apaszke apaszke merged commit 2060f35 into pytorch:master Jan 2, 2018
@zou3519 zou3519 deleted the fix-race branch January 3, 2018 19:50
@soumith soumith mentioned this pull request Jan 23, 2018
16 tasks
@soumith soumith added the 0.3.1 label Feb 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Assertion pos >= 0 && pos < buffer.size() failed
3 participants