Fix guarding issues w/ numpy #106431
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106431. Note: Links to docs will display an error until the docs builds have been completed.

As of commit f38030f: ⏳ 2 Pending, 3 Unrelated Failures.

BROKEN TRUNK - The following jobs failed but were also present on the merge base 3db2550. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
 # lose optimization opportunities this way. Devs, if your benchmark model is failing
 # this way, you should figure out why instead of suppressing it.
-suppress_errors = os.environ.get("TORCHDYNAMO_SUPPRESS_ERRORS", "1") == "1"
+suppress_errors = False
```
@lezcano this is intentional, this flag makes it hard to debug.
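If someone does want the old suppression behavior locally, here is a minimal usage note; it assumes this snippet feeds the setting exposed as `torch._dynamo.config.suppress_errors`, which is an assumption about where this line lives:

```python
import torch._dynamo

# Opt back in to swallowing backend errors at runtime; the default here is
# being hardcoded to False so failures surface during debugging.
torch._dynamo.config.suppress_errors = True
```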
```diff
         self._produce_guard_code(guard, [shape_guard], shape_env=True)

-    def TENSOR_MATCH(self, guard: Guard):
+    def TENSOR_MATCH(self, guard: Guard, value=None):
```
If `.get()` invokes a function, as the source added in this PR (`NumpyTensorSource`) does, we do not have id stability between the example values coming from `get()` and the traced values at fake-ification time. That stability is required because of how we build the sizes/strides for tensor guards.
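For illustration, a minimal self-contained sketch (the class name is a stand-in, not the real `NumpyTensorSource`) of why a source whose `get()` builds a fresh object on every call breaks id stability:

```python
import numpy as np
import torch

class NumpyTensorSourceSketch:
    """Stand-in for a source whose .get() converts an ndarray to a tensor."""
    def __init__(self, ndarray):
        self.ndarray = ndarray

    def get(self):
        # Each call builds a fresh tensor wrapper around the same buffer,
        # so id() of the result differs between calls.
        return torch.as_tensor(self.ndarray)

src = NumpyTensorSourceSketch(np.ones(3))
a, b = src.get(), src.get()
assert a is not b  # no id stability: anything keyed on id(example) would not match
```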
I think I really need a comment here explaining what the value arg does. Still not sure from the comment either, need to read more.
okay, will add a comment.
Note that, with this PR in its current state, a function that has only NumPy inputs and NumPy outputs is still not being compiled. A small modification of the example in the OP shows this.
OTOH, such functions seem to have correct guards, and their shapes are correctly traced with symints when dynamic shapes kick in.
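For concreteness, a hedged sketch of the kind of function being described (NumPy in, NumPy out, no explicit tensors); this is not the OP's example, just an illustration, and per the note above such a frame may simply fall back to eager rather than being compiled:

```python
import numpy as np
import torch

def numpy_only(x: np.ndarray) -> np.ndarray:
    # No torch.Tensor appears in this frame; per the note above, dynamo
    # may skip compiling it even though tracing NumPy ops is supported.
    return np.sin(x) ** 2 + 1.0

compiled = torch.compile(numpy_only)
out = compiled(np.linspace(0.0, 1.0, 8))  # runs, but may not actually be compiled
```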
Force-pushed from 1a8a61f to 49c975a
@lezcano I undid your mega merge; this PR is back to being based on the head of your branch. If you need work there, push it to that branch and rebase this branch; please don't push to this branch :P
```diff
     def add_graph_output(self, value):
-        graph_outputs_key = id(value.proxy)
+        graph_outputs_key = id(value.as_proxy())
```
does this... do anything lol
It's the better convention, but no, it should be the same.
```diff
-    def TENSOR_MATCH(self, guard: Guard):
+    def TENSOR_MATCH(self, guard: Guard, value=None):
         if guard.is_nn_module():
             self.ID_MATCH(guard)
```
numpy ndarray on nn module 🤔 hilarious failure mode
This will not fail, but will overspecialize. HOWEVER - we throw out nn module guards, so this will just be okay for now.
torch/_dynamo/variables/tensor.py (outdated)
```diff
         **kwargs,
     ):
         super().__init__(proxy, **kwargs)
+        self.proxy = proxy
```
What's going on here? Are there two proxies floating around now?
We probably don't need this line; `super().__init__` already does that.
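A tiny illustration of the point (class names are placeholders, not the actual hierarchy): when the base `__init__` already stores the proxy, re-assigning it in the subclass is redundant.

```python
class BaseVariableSketch:
    def __init__(self, proxy, **kwargs):
        # The base class already records the proxy...
        self.proxy = proxy

class NumpyNdarrayVariableSketch(BaseVariableSketch):
    def __init__(self, proxy, **kwargs):
        super().__init__(proxy, **kwargs)
        # ...so re-assigning it here would be a no-op and can be dropped:
        # self.proxy = proxy
```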
```diff
         )
-        options = {"source": source}
+        options = {"source": source, "guards": tensor_vt.guards}
         numpy_ndarray_variable = wrap_fx_proxy_cls(
```
It appears these are the substantive changes
torch/_dynamo/variables/builder.py (outdated)
```diff
             example_value=value,
-            guards=self.make_guards(GuardBuilder.TENSOR_MATCH),
+            guards=self.make_guards(
+                functools.partial(GuardBuilder.TENSOR_MATCH, value=value)
```
So are you passing value for EVERYBODY, not just ndarray?
Yes, this actually saves us an eval, and should always be sound.
Sounds good, will get this over.
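To make the "saves us an eval" point concrete, here is a conceptual sketch (not the real `GuardBuilder` API) of how binding `value` into the guard function via `functools.partial` lets the builder use the example object directly instead of re-evaluating the source expression to recover it:

```python
import functools

class GuardBuilderSketch:
    def get(self, name):
        # Stand-in for evaluating the guarded source expression; this is the
        # extra eval that the bound example value lets us skip.
        raise NotImplementedError("would eval the source expression for " + name)

    def TENSOR_MATCH(self, name, value=None):
        example = value if value is not None else self.get(name)
        return f"guard on sizes/strides/dtype of {name} (example: {example!r})"

builder = GuardBuilderSketch()
# The builder receives a guard function with the example value already bound:
guard_fn = functools.partial(GuardBuilderSketch.TENSOR_MATCH, value=[[1.0, 2.0]])
print(guard_fn(builder, "L['x']"))  # uses the bound value; no eval needed
```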
Force-pushed from 3f693fe to dcf6212
Force-pushed from dcf6212 to 3f693fe
```python
if isinstance(value, np.ndarray):
    return torch.as_tensor(value)
```
Better to remove this one: if our logic is correct, an np.ndarray should never get to this point. We should probably even `assert not isinstance(value, np.ndarray)`.
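For reference, a small hedged sketch of what that assertion variant could look like; the helper name is hypothetical, the point is just making an unexpected ndarray loud instead of silently converting it:

```python
import numpy as np

def wrap_example_value(value):  # hypothetical helper, not the real call site
    # If the upstream wrapping logic is correct, raw ndarrays were already
    # handled earlier, so reaching here with one is a bug worth surfacing.
    assert not isinstance(value, np.ndarray), (
        "np.ndarray should never reach this point"
    )
    return value
```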
Let me double check, I am pretty certain we do hit this, and that is why I added it. Why would you expect us to not get here?
The issue in #106431 (review) is still present (i.e., if you remove the tensor arg to the …
Let me take a look :)
BTW this would be in another PR; it's not blocking this one. I have not looked yet, but my money is on something that looks for tensor compute in frames. We do this in a few specific places. Will report back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, another PR SGTM.
On the interaction with `torch_np`, this PR LGTM. Let's wait for Ed's review about the general strategy tho.
@pytorchbot merge
Merge failed. Reason: This PR needs a […] label. If not, please add the […] label. To add a label, you can comment to pytorchbot, for example […]. For more information, see […]. Details for Dev Infra team: raised by workflow job […].
```python
        numpy_to_tensor_wrapper(func),
        *proxy_args_kwargs(args, kwargs),
    )
    return NumpyNdarrayVariable.create(tx, proxy, **options)
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference before and after? Or is this just refactoring?
I merged it because @voznesenskym was going to.
RFC: pytorch/rfcs#54

The first commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/. We have already been using this in core for the last few months as an external dependency; this PR pulls all of it into core. In the next commits, I do a number of things in this order:
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number of dynamo tests that were passing before (they were not testing anything, I think) and are not passing now

Missing from this PR (but not blocking):
- A flag that deactivates tracing NumPy functions and simply breaks. There used to be one, but it stopped working after the merge and I removed it. @lezcano to investigate.
- #106431 (comment). @voznesenskym to submit a fix after we merge.

All the tests in `tests/torch_np` take about 75s to run. This was joint work by @ev-br, @rgommers, @honno and me. I did not create this PR via ghstack (which would have been convenient) as this is a collaboration, and ghstack doesn't allow for shared contributions.

Pull Request resolved: #106211
Approved by: https://github.com/ezyang
cc @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @anijain2305