Use some new codegen functionality #2915
Conversation
**Overview**

This PR adds support for auto-generating in-place/out wrappers from a functional kernel for external backends.

- If the backend provides a functional kernel but not an out kernel, we generate one for them
- If the backend provides a functional kernel but not an inplace kernel, we generate one for them
- If they explicitly provide either kernel, we use whatever they provided

With this change, the codegen generates two new files for xla that are more in line with what our core codegen generates:

- `RegisterXLA.cpp`
- `RegisterAutogradXLA.cpp`

**Functionality delta**

For context, what XLA's codegen currently provides is:

- Support for generating `out` wrapper kernels, specified via a hardcoded list. This PR subsumes that, and generates out kernels whenever we encounter an op that has a functional kernel but no out kernel.
- No support for in-place wrappers. XLA has hand-written kernels for the in-place variants of most ops. I still need to confirm this, but I think we'll be able to remove a lot of that code in favor of these codegen'd wrappers later. As a test, I have a test PR [here](pytorch/xla#2915) that removes `add_` and uses the codegen'd version instead.

**Code changes**

The PR also splits the external backend codegen logic across two places:

- `register_dispatch_key.py`. Contains all the "new" stuff: kernel wrappers + dispatcher registrations for all xla kernels, including codegen'd inplace/out wrappers. It also provides support for namespaced definitions, e.g. `at::xla::{op}`, although this feature isn't "turned on" yet.
- `gen_external_aten_fallbacks.py`. I updated this file to *really* only do fallback-related stuff, which should make it easier to kill if we switch to a boxed fallback later. It no longer generates `out` wrappers, and it no longer generates dispatcher registrations for all kernels; only for CPU fallbacks.

**Logic re-use in the codegen**

Just want to call out my reasoning for what goes where in the codegen w.r.t. a few changes:

1. For normal unstructured kernels, I almost entirely re-used `RegisterDispatchKey.gen_unstructured`, with some changes to make it work for external backend kernels.
2. For generating the out/inplace wrappers, I pushed all of that logic into its own function; it's small enough + different enough from the existing structured/unstructured logic that I had a hard time trying to generalize the existing logic to handle this case.
3. One note on the split between CPU fallback logic vs. everything else. There's one area where the logic bleeds over: right now in `gen_unstructured()`, I actually generate namespaced definitions for all backend kernels, including those backed by a CPU fallback. My thought process was that the goal of a CPU fallback is to brush over deficiencies in what the backend has implemented, and in a similar vein it would be nice if `at::xla::{op}` "just worked" for all ops, if someone wanted to write C++ code in a uniform API that was meant just for xla.

Before (pictorially):

```
RegisterDispatchKey.__call__() -------> gen_unstructured()
```

After (pictorially). My ascii art is awful but hopefully you get the idea:

```
RegisterDispatchKey.__call__() ----------------------------------> gen_unstructured()
                              \                                   /
                               \                                 /
                                --> gen_inplace_out_wrapper() ---
```

`gen_inplace_out_wrapper()` has the specific task of generating anonymous definitions (wrappers) for inplace/out kernels, and otherwise forwards all other logic, including dispatcher registrations + namespaced definitions, to `gen_unstructured()`.

One wrinkle: `gen_unstructured` takes in a single `Union[NativeFunction, ExternalBackendFunction]`, which technically isn't enough context to decide when to generate registrations/namespaced definitions. That's because it now also needs the context of the entire function group: if the backend has not implemented an `out` kernel, then whether or not to generate a registration for it depends entirely on whether the backend has implemented a `functional` kernel. I ended up adding an extra optional argument to `gen_unstructured()` to deal with that: a bool that tells `gen_unstructured` whether it's dealing with an autogenerated wrapper.

**What this PR does not do**

Structured kernel support for external backends. To add at some point in the future.
I realized that if I don't stop making changes to ops' signatures, we can never get this PR merged..... We can probably decide on a time to get this PR merged (when it is ready) and I will update all of the pending PRs.
```cpp
at::native::alpha_check(at::result_type(self, other), alpha);
CheckBinaryOpTypePromotion(self, self, other);
XLATensor self_tensor = bridge::GetXlaTensor(self);
XLATensor::add_(self_tensor,
```
Actually, I should have removed the code for `XLATensor::add_` in this PR too. I'm planning a followup PR where I remove most other inplace ops, so I'll remove it there too.
@JackCaoG I think we're finally okay now that #2940 has landed 😛. I just updated this PR with the set of changes it provides, along with its corresponding pytorch PR. Still have to make sure that CI is green, but lmk if you have any questions! One of my followups will be to kill most of the inplace kernel code in pytorch/xla, now that we can code-generate it.
Thanks! You might need to update the inplace ops cpp test to check a different counter, like in https://github.com/pytorch/xla/blob/master/test/cpp/test_aten_xla_tensor.cpp#L446
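For reference, a rough sketch of the kind of counter check being suggested. This is not code from the PR; the test fixture and helper names (`ForEachDevice`, `CopyToDevice`, `ExpectCounterChanged`, `ExpectCounterNotChanged`, `cpp_test::GetIgnoredCounters`) are assumed to match the existing helpers in `test_aten_xla_tensor.cpp`, and the counter names are illustrative:

```cpp
// Sketch only: once the hand-written add_ lowering is removed, the test can no
// longer expect an "xla::add_" counter to change. Instead it would check the
// functional counter, since the codegen'd inplace wrapper routes through it.
TEST_F(AtenXlaTensorTest, TestAddInPlace) {
  torch::Tensor a = torch::rand({2, 2});
  torch::Tensor b = torch::rand({2, 2});
  ForEachDevice([&](const torch::Device& device) {
    torch::Tensor xla_a = CopyToDevice(a, device);
    torch::Tensor xla_b = CopyToDevice(b, device);
    a.add_(b);
    xla_a.add_(xla_b);
    EXPECT_TRUE(torch::allclose(a, xla_a.to(torch::kCPU)));
  });
  // No op should have fallen back to the at::native CPU implementation...
  ExpectCounterNotChanged("aten::.*", cpp_test::GetIgnoredCounters());
  // ...and the functional XLA kernel should have been hit.
  ExpectCounterChanged("xla::add", cpp_test::GetIgnoredCounters());
}
```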
Thanks for the pointer. Surprisingly, it looks like the tests all passed even though I removed `add_`. Side note: I'm probably going to try and merge this + the corresponding pytorch PR real soon, since it looks like CI is finally green (still need to double check that on the pytorch side, there are no errors aside from the xla tests).
```cpp
    CopyTensor(tensor, dst.scalar_type(), /*copy=*/false);
    dst.resize_as_(typed_tensor).copy_(typed_tensor);
  } else {
    // at this point we know dst is an XLA tensor
```
Could you help me understand why we know dst is an XLA tensor here? I guess I am trying to understand the use case of this function. What does checking `self_tensor` mean? Are we checking if they are empty tensors or CPU tensors?
I wrote this function to be similar to `AtenXlaType::_copy_from`, in that it handles all 3 cases:

- `self` is on CPU and `dst` is on XLA
- `self` is on XLA and `dst` is on CPU
- they're both on XLA

The main difference is just that in the case that `dst` is on XLA, it handles resizing properly. I got the resizing logic mostly by looking at the function that the codegen used to call.
> Could you help me understand why we know dst is an XLA tensor here?

In the `} else {` section here, we know that both `bridge::TryGetXlaTensor(self)` and `bridge::TryGetXlaTensor(dst)` returned valid `XLATensor` objects, which I took to mean that they both live on XLA. So this piece of the code handles the cases where both `self` and `dst` live on XLA, and I took the corresponding logic from here to figure out how to copy and resize the dst XLA tensor.

This piece of logic gets hit inside of the generated `out` wrappers. I pasted an example codegen'd `out` wrapper for `add` below:
```cpp
at::Tensor & wrapper_out_add_out(const at::Tensor & self, const at::Tensor & other,
                                 const at::Scalar & alpha, at::Tensor & out) {
  XLA_FN_TRACK(3);
  TF_VLOG(3) << "XLA wrapper_out_add_out :" << " self=" << self.toString()
             << " other=" << other.toString();
  auto wrapper_out_add_out_tmp = wrapper_Tensor_add(self, other, alpha);
  at::_copy_from_and_resize(wrapper_out_add_out_tmp, out);
  return out;
}
```
```cpp
auto self_tensor = bridge::TryGetXlaTensor(self);
if (!self_tensor) {
  XLA_CHECK(dst_tensor);
  dst_tensor->UpdateFromTensorOut(self);
```
qq, in here you chose to call `UpdateFromTensorOut`, which also handles the view check and defaults sync-update to false. Is there a reason to do this instead of `dst_tensor->UpdateFromTensor(self, /*sync=*/sync_update);` as in `AtenXlaType::_copy_from`?
Yeah, that was actually my original motivation for writing this op. When I call `dst_tensor->UpdateFromTensor(self, /*sync=*/sync_update);` instead (like in `_copy_from`), xla complains if `dst` and `self` have different sizes. The next thing that I tried was to explicitly resize the dest tensor before the call, but then I hit #2881. I got the idea to call `UpdateFromTensorOut` from the `bridge::` API that the codegen used previously, which is here. The only reason I didn't pass in `sync_update` was because `UpdateFromTensorOut` doesn't accept that arg :) but the original codegen also didn't use that argument, so I figured that this would be more in line with existing functionality.

Side note: implicitly resizing output tensors is actually currently allowed for in-tree kernels, but it's deprecated. So right now we want to allow that case, but eventually it'll probably go away. That's out of the scope of this function though, since that can be fixed through the codegen (we can just call `_copy_from` instead of `_copy_from_and_resize`).

Let me know if that all clears things up.
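To put the pieces together, here's a rough sketch of how the three branches discussed above fit together. This is a reconstruction from the snippets quoted in this review, not the exact code in the PR; in particular the XLA→XLA branch and the `ToTensor` call are my best guess based on `AtenXlaType::_copy_from`:

```cpp
// Sketch of the new _copy_from_and_resize lowering, assembled from the
// fragments quoted above. Exact names and details may differ from the PR.
at::Tensor _copy_from_and_resize(const at::Tensor& self, const at::Tensor& dst) {
  auto dst_tensor = bridge::TryGetXlaTensor(dst);
  auto self_tensor = bridge::TryGetXlaTensor(self);
  if (!self_tensor) {
    // self lives on CPU, dst lives on XLA: update dst in place from the CPU
    // data (UpdateFromTensorOut also handles the resize, per the discussion).
    XLA_CHECK(dst_tensor);
    dst_tensor->UpdateFromTensorOut(self);
  } else if (!dst_tensor) {
    // self lives on XLA, dst lives on CPU: materialize self on CPU, then
    // resize and copy into dst.
    at::Tensor tensor = self_tensor->ToTensor(/*detached=*/true);
    at::Tensor typed_tensor =
        CopyTensor(tensor, dst.scalar_type(), /*copy=*/false);
    dst.resize_as_(typed_tensor).copy_(typed_tensor);
  } else {
    // At this point we know both self and dst are XLA tensors: copy (and
    // resize) dst's underlying XLATensor from self's.
    XLATensor::copy_(*dst_tensor, *self_tensor);
  }
  return dst;
}
```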
Thanks for the explanation!
Thanks! @bdhirsh
LGTM!
nit: we'll get more signal in the followup PR when the inplace lowerings are removed from the xla codebase. But a nice addition in this PR would be comparing `_copy_from_and_resize` against the part it's trying to replace ;) so that we are confident this change is benign :D
Given that 1.9 will be cut for pytorch next Monday (likely sometime next week for pt/xla as well), what's our plan for this and the upstream codegen change? I assume we are not trying to push this into 1.9?
That's fair; two of the 3 code paths from that function are used directly in this PR, e.g. (XLA->XLA) in the generated `out` wrappers.
This definitely isn't an official feature of the 1.9 release, but given that CI is passing, I think it's probably fine to merge it unrelated to the branch cut deadline. AFAIK, we don't have a policy around merging new changes in close to the cut (?).
@bdhirsh Sounds good. My main concern is that pt/xla 1.9 is not cut yet. If the upstream pytorch 1.9 doesn't have your corresponding PR, we need to make sure not to include this pt/xla PR in our pt/xla 1.9 branch.
That's a good point, thanks @JackCaoG. I'll check and confirm that either (a) both or (b) neither of the commits are included.
This PR uses a few changes on the pytorch side - see everything at pytorch/pytorch#57510 plus the PRs below it in the stack.
Main changes:
(1) [Refactor]

A refactor of the codegen on the pytorch side that unifies it more with existing codegen: this by itself doesn't cause a behavior change, but makes the existing codegen a lot more maintainable. It generates two new files, `RegisterXLA.cpp` and `RegisterAutogradXLA.cpp`, which do some of the work that `aten_xla_type_default.cpp` used to do. The breakdown of work is now:

- `Register{XLA|AutogradXLA}.cpp` registers xla kernels to the pytorch dispatcher (see the registration sketch below).
- `aten_xla_type_default.cpp` generates cpu fallback kernels and registers them to the dispatcher.
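The registration half of that breakdown boils down to `TORCH_LIBRARY_IMPL` blocks in the generated `Register*.cpp` files. A minimal hand-written illustration of the idea follows; the wrapper declarations are placeholders standing in for codegen'd symbols, not actual generated output:

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Placeholder declarations standing in for the codegen'd XLA wrappers
// (the real definitions would live in the generated RegisterXLA.cpp).
at::Tensor wrapper_Tensor_add(const at::Tensor& self, const at::Tensor& other,
                              const at::Scalar& alpha);
at::Tensor& wrapper_out_add_out(const at::Tensor& self, const at::Tensor& other,
                                const at::Scalar& alpha, at::Tensor& out);

// RegisterXLA.cpp binds the XLA kernels (including the auto-generated
// inplace/out wrappers) to the pytorch dispatcher under the XLA key.
TORCH_LIBRARY_IMPL(aten, XLA, m) {
  m.impl("add.Tensor", TORCH_FN(wrapper_Tensor_add));
  m.impl("add.out", TORCH_FN(wrapper_out_add_out));
}
```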
(2) [Generate inplace/out wrappers]

The previous version of the codegen had some logic to code-generate `out` versions of operators, that were implemented by calling into the corresponding functional op. Some changes on the pytorch side do some additional generation:

- `out` wrappers for any xla ops that have a functional kernel but do not have an `out` kernel (rather than using a hardcoded list of operators).
- `inplace` wrappers for any xla ops that have a functional kernel but do not have an `inplace` kernel.

The inplace piece is new: as an example in this PR, I removed xla's entry for `add_.Tensor`, but the new codegen will actually let us remove all inplace kernels as long as there is a corresponding functional kernel implemented somewhere. Planning to do that code removal in a follow-up.
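To make the inplace piece concrete, here's a rough hand-written sketch of what an auto-generated inplace wrapper could look like for `add_`, mirroring the generated `out` wrapper pasted earlier in this thread. The wrapper name and the choice of copy op are illustrative assumptions, not the actual codegen output:

```cpp
// Illustrative sketch of a codegen'd inplace wrapper: call the backend's
// functional kernel (wrapper_Tensor_add, as in the out-wrapper example above),
// then copy the result back into self. Whether the real generated code uses
// _copy_from or _copy_from_and_resize here is an assumption on my part.
at::Tensor& wrapper_Tensor_add_(at::Tensor& self, const at::Tensor& other,
                                const at::Scalar& alpha) {
  auto tmp_output = wrapper_Tensor_add(self, other, alpha);
  at::_copy_from(tmp_output, self);
  return self;
}
```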
(3) [Remove bridge:: API from codegen]

The pytorch codegen currently hardcodes some xla-specific code in order to get everything working: the biggest piece of this is the API from `aten_xla_bridge.h` to convert between cpu and xla tensors. This API is actually unnecessary and can be implemented in an xla-independent way, since those calls really just involve moving tensor data between the CPU and the external backend device (XLA).

I had to add two new ops in order to get this to work though:

- `at::_to_cpu`: XLA needs to use special logic to move the xla tensor arguments onto cpu all at the same time, instead of moving them all independently onto CPU. That's the only purpose of this op. I also implemented a non-xla version of it in pytorch core, which just does the simple thing of iterating through each tensor in a list and moving each one to cpu (sketched below).
- `at::_copy_from_and_resize`: There's an existing `_copy_from` that xla uses to move data from cpu to xla, but it doesn't handle resizes properly. I also opened up an issue for a bug I noticed where calling `resize()` and then `copy()` doesn't work properly, but a `_copy_from_and_resize()` operator seems like a fine workaround.
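To illustrate the first op, a minimal sketch of the "simple thing" the non-xla fallback version of `at::_to_cpu` does. This is a paraphrase of the behavior described above, not the in-tree implementation:

```cpp
#include <ATen/ATen.h>
#include <vector>

// Fallback behavior sketch: move each tensor in the list to CPU one at a
// time. Backends like XLA override _to_cpu so that all arguments can be
// transferred to CPU together in a single batched copy.
std::vector<at::Tensor> to_cpu_fallback(at::TensorList tensors) {
  std::vector<at::Tensor> cpu_tensors;
  cpu_tensors.reserve(tensors.size());
  for (const at::Tensor& t : tensors) {
    cpu_tensors.push_back(t.cpu());
  }
  return cpu_tensors;
}
```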