-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Ensure that call before redispatch work well for PythonTLSSnapshot #73045
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CI Flow Status⚛️ CI FlowRuleset - Version:
|
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit 009a049 (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks pretty reasonable, I don't understand how the test case tests this though
test/test_python_dispatch.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a "Direct Subclass" and who shouldn't do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a direct subclass in the sense that the Tensor that we send to the backend and the subclass are the same.
It is ok to do it, but not composable and tricky in general. Since a lot of people use these tests as example, I prefered to warn here that this is not a good example to copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ you should probably put that into the comment in a code, as a user I would be more even confused haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related question: does the bug with the non-stack-based version only show up when you do a "Direct Subclass" (and not show up when you do a wrapper subclass)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The direct subclass is the simplest repro I could come up with.
@Chillee managed to hit the same problem with some of the functorch tests. In that case, it was while going down a clamp_min_
op, an extra clone
was done before executing the function itself. So I guess functionalization was involved? But I don't know how to enable that easily in a small repro. Hence the use of direct subclass and conjugate fallback.
@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
test/test_python_dispatch.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially a stupid question, but, why does call_clone go to PythonTLSSnapshot? I would have expected the redispatch from Conjugate to go into the Python dispatch key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a redispatch, this is a call. Namely:
auto resolved_tensor = at::clone(tensor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I think that's because the conjugate fallback ends up calling at::clone()
, which enters the dispatcher through Dispatcher::call()
. PythonTLSSnapshot is set up to always run first every time you call Dispatcher::call()
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the PythonTLSSnapshot key in the local exclude key set at this point? So I would expect the at::clone call to skip past PythonTLSSnapshot, AutogradCPU, and Conjugate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, you don't use ExcludeDispatchKeyGuards to do the redispatch...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redispatch here means actually using redispatch vs clone.
The ExcludeDispatchKeyGuards
is not used anymore except by autograd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh does this mean on the at::clone call, AutogradCPU doesn't get hit? AutogradCPU is in the local exclude set (because VariableType::blah both uses an exclude key guard AND at::redispatch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct! My bad!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed comment and explanations, I understand what's going on now. I agree that this test is testing the above code now.
On an orthogonal note, I'm a bit concerned that AutogradCPU doesn't get hit, but maybe that's not a problem because that's how things work even without the Python keys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what happens when your key is "below" autograd: autogad is already handled above so you don't need to care about it here :)
@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Hey @albanD. |
Reverted, as it clearly broke OSS CI |
This pull request has been reverted by 9a96604. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk). |
This pull request has been reverted by 9a96604. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk). |
…73045) Summary: Pull Request resolved: pytorch/pytorch#73045 Reviewed By: zou3519 Differential Revision: D34318185 Pulled By: albanD fbshipit-source-id: abc30fe69176ba474e28bb045406a410e17cfd79 (cherry picked from commit 4d9a305d3a2688e0d6264193f5dd692a2d44aedb)
Summary: Reland of pytorch/pytorch#73045 Tweak class visibility to avoid windows linking issues. Pull Request resolved: pytorch/pytorch#73231 Reviewed By: bdhirsh Differential Revision: D34402767 Pulled By: albanD fbshipit-source-id: 50aaadf5389ca516fa6a5034d42eee56abe3c7f7 (cherry picked from commit 0fe53bdfb770f8e39eb3017f98e1d210f85eb78b)
…73045) Summary: Pull Request resolved: pytorch/pytorch#73045 Reviewed By: zou3519 Differential Revision: D34318185 Pulled By: albanD fbshipit-source-id: abc30fe69176ba474e28bb045406a410e17cfd79 (cherry picked from commit 4d9a305d3a2688e0d6264193f5dd692a2d44aedb)
Summary: Reland of pytorch/pytorch#73045 Tweak class visibility to avoid windows linking issues. Pull Request resolved: pytorch/pytorch#73231 Reviewed By: bdhirsh Differential Revision: D34402767 Pulled By: albanD fbshipit-source-id: 50aaadf5389ca516fa6a5034d42eee56abe3c7f7 (cherry picked from commit 0fe53bdfb770f8e39eb3017f98e1d210f85eb78b)
No description provided.