Conversation

@bdhirsh (Collaborator) commented Apr 23, 2021

This PR relies on a few changes on the pytorch side; see everything at pytorch/pytorch#57510, plus the PRs below it in the stack.

Main changes:

(1) [Refactor]
A refactor of the codegen on the pytorch side that unifies it more with the existing in-tree codegen. This by itself doesn't cause a behavior change, but it makes the existing codegen a lot more maintainable. It generates two new files, RegisterXLA.cpp and RegisterAutogradXLA.cpp, which take over some of the work that aten_xla_type_default.cpp used to do. The breakdown of work is now (a rough sketch of the registration side follows the list):

  • Register{XLA|AutogradXLA}.cpp registers xla kernels to the pytorch dispatcher.
  • aten_xla_type_default.cpp generates cpu fallback kernels and registers them to the dispatcher.
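To make that division concrete, here is a rough sketch of what the registration side boils down to. This is illustrative only: it borrows the `wrapper_Tensor_add` name from the generated example pasted later in this thread, and the `AutogradXLA` block is left empty since which ops land there is backend-specific.

```
// Rough sketch only; kernel names are illustrative, not the exact generated symbols.
#include <ATen/ATen.h>
#include <torch/library.h>

// An XLA kernel wrapper, declared/defined by the generated code.
at::Tensor wrapper_Tensor_add(const at::Tensor& self, const at::Tensor& other,
                              const at::Scalar& alpha);

// RegisterXLA.cpp: route aten ops on XLA tensors to the backend's kernels.
TORCH_LIBRARY_IMPL(aten, XLA, m) {
  m.impl("add.Tensor", TORCH_FN(wrapper_Tensor_add));
}

// RegisterAutogradXLA.cpp: registrations under the AutogradXLA dispatch key.
TORCH_LIBRARY_IMPL(aten, AutogradXLA, m) {
  // m.impl("some_op", TORCH_FN(some_autograd_xla_kernel));
}
```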

(2) [Generate inplace/out wrappers]
The previous version of the codegen had some logic to code-generate out= versions of operators, implemented by calling into the corresponding functional op. The changes on the pytorch side now do some additional generation:

  • unconditionally generate out wrappers for any xla ops that have a functional kernel but do not have an out kernel (rather than using a hardcoded list of operators).
  • unconditionally generate inplace wrappers for any xla ops that have a functional kernel but do not have an inplace kernel.
    The inplace piece is new: as an example in this PR, I removed xla's entry for add_.Tensor, but the new codegen will actually let us remove all inplace kernels as long as a corresponding functional kernel is implemented somewhere. I'm planning to do that code removal in a follow-up (a sketch of such a generated wrapper is below).
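To make the inplace piece concrete, here is a hedged sketch of what such a generated wrapper amounts to. The names are illustrative and the logging/tracking boilerplate is omitted; whether the write-back goes through `_copy_from` or `_copy_from_and_resize` is a codegen detail I'm glossing over here.

```
// Sketch of a codegen'd in-place wrapper: run the backend's functional kernel,
// then write the result back into `self`. Names are illustrative.
at::Tensor& wrapper_add_(at::Tensor& self, const at::Tensor& other,
                         const at::Scalar& alpha) {
  auto tmp = wrapper_Tensor_add(self, other, alpha);  // functional XLA kernel
  at::_copy_from_and_resize(tmp, self);               // copy the result back into self
  return self;
}
```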

(3) [Remove bridge:: API from codegen]
The pytorch codegen currently hardcodes some xla-specific code in order to get everything working: the biggest piece of this is the API from aten_xla_bridge.h used to convert between cpu and xla tensors. That API is actually unnecessary here and can be replaced with xla-independent logic, since those calls really just involve moving tensor data between the CPU and the external backend device (XLA).

I had to add two new ops to get this to work, though (a sketch of how they fit together follows the list):

  • at::_to_cpu: XLA needs special logic to move all of the xla tensor arguments onto CPU at the same time, rather than moving each one to CPU independently. That's the only purpose of this op. I also implemented a non-xla version of it in pytorch core, which just does the simple thing of iterating through each tensor in the list and moving it to CPU.
  • at::_copy_from_and_resize: There's an existing _copy_from that xla uses to move data from cpu to xla, but it doesn't handle resizes properly. I also opened an issue for a bug I noticed where calling resize() and then copy() doesn't work properly, but a _copy_from_and_resize() operator seems like a fine workaround.
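To show how the two new ops fit into the now xla-independent generated code, here is a hedged sketch of a CPU fallback kernel. The function name and exact structure are illustrative, not the literal output of the codegen.

```
// Illustrative only: what a generated CPU fallback for add roughly boils down to.
at::Tensor add_cpu_fallback(const at::Tensor& self, const at::Tensor& other,
                            const at::Scalar& alpha) {
  // (1) Move all tensor arguments to CPU in a single call, so the backend can
  //     batch the device->host transfers instead of doing them one by one.
  auto cpu_tensors = at::_to_cpu({self, other});
  // (2) Run the regular CPU kernel.
  at::Tensor cpu_result = at::add(cpu_tensors[0], cpu_tensors[1], alpha);
  // (3) Move the result back to the original device. For out/in-place variants,
  //     the generated code instead writes back into the destination tensor via
  //     at::_copy_from_and_resize(cpu_result, out).
  return cpu_result.to(self.device());
}
```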

bdhirsh added a commit to pytorch/pytorch that referenced this pull request Apr 26, 2021
…xternal kernels"

**Overview**
This PR adds support for auto-generating in-place/out wrappers for external backends, given a functional kernel.
- If the backend provides a functional kernel but not an out kernel, we generate one for them
- If the backend provides a functional kernel but not an inplace kernel, we generate one for them
- If they explicitly provide either kernel, we use whatever they provided

With this change, the codegen generates two new files for xla that are more in line with what our core codegen generates:
- `RegisterXLA.cpp`
- `RegisterAutogradXLA.cpp`


**functionality delta**
For context, what XLA’s codegen currently provides is:
- Support for generating `out` wrapper kernels, specified via a hardcoded list. This PR subsumes that, and generates an out kernel whenever we encounter an op that has a functional kernel but no out kernel (a stripped-down sketch of that pattern follows this list).
- No support for in-place wrappers. XLA has hand-written kernels for the in-place variants of most ops. I need to confirm this, but I think we'll be able to remove a lot of that code in favor of these codegen'd wrappers later. As a test, I have a test PR [here](pytorch/xla#2915) that removes `add_` and uses the codegen'd version instead.
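For reference, the generated out-wrapper pattern is essentially "call the functional kernel, then copy (and resize) the result into `out`". A stripped-down sketch, with the logging boilerplate omitted and the full signature being my reconstruction rather than the exact generated code:

```
at::Tensor& wrapper_out_add_out(const at::Tensor& self, const at::Tensor& other,
                                const at::Scalar& alpha, at::Tensor& out) {
  auto tmp = wrapper_Tensor_add(self, other, alpha);  // the backend's functional kernel
  at::_copy_from_and_resize(tmp, out);                // write the result into out, resizing if needed
  return out;
}
```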


**code changes**

The PR also splits the external backend codegen logic across two places.
- `register_dispatch_key.py`. Contains all the "new" stuff: kernel wrappers + dispatcher registrations for all xla kernels, including codegen'd inplace/out wrappers. It also provides support for namespaced definitions, e.g. `at::xla::{op}`, although this feature isn't "turned on" yet.
- `gen_external_aten_fallbacks.py`. I updated this file to *really* only do fallback-related stuff, which should make it easier to kill if we switch to a boxed fallback later. It no longer generates `out` wrappers, and it no longer generates dispatcher registrations for all kernels; only for CPU fallbacks.


**logic re-use in the codegen**
Just want to call out my reasoning for what goes where in the codegen w.r.t. these changes:
1. For normal unstructured kernels, I almost entirely re-used `RegisterDispatchKey.gen_unstructured`, with some changes to make it work for external backend kernels.
2. For generating the out/inplace wrappers, I pushed all of that logic into its own function; it's small enough + different enough from the existing structured/unstructured logic that I had a hard time trying to generalize the existing logic to handle this case.
3. One note on the split between CPU fallback logic vs. everything else. There's one area where the logic bleeds over: right now in `gen_unstructured()`, I actually generate namespaced definitions for all backend kernels, including those covered by a CPU fallback. My thought process was that the goal of a CPU fallback is to paper over gaps in what the backend has implemented, and in a similar vein it would be nice if `at::xla::{op}` "just worked" for all ops, in case someone wanted to write C++ code against a uniform API meant just for xla (a hedged sketch of what that could look like follows this list).
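To illustrate the namespaced-definitions point from item 3: the idea is that a uniform `at::xla::{op}` entry point could eventually look something like the sketch below. This is an assumption about the eventual shape of the feature (it isn't turned on yet), with illustrative names.

```
// Assumed shape of a namespaced XLA entry point; not actually generated yet.
#include <ATen/ATen.h>

// The backend kernel wrapper (or its CPU fallback), provided by the generated code.
at::Tensor wrapper_Tensor_add(const at::Tensor& self, const at::Tensor& other,
                              const at::Scalar& alpha);

namespace at {
namespace xla {

// Calls straight into the XLA kernel (or the CPU fallback wrapper, for ops the
// backend hasn't implemented) without going back through the dispatcher.
inline at::Tensor add(const at::Tensor& self, const at::Tensor& other,
                      const at::Scalar& alpha) {
  return ::wrapper_Tensor_add(self, other, alpha);
}

}  // namespace xla
}  // namespace at
```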

Before (pictorially)
```
RegisterDispatchKey.__call__() -------> gen_unstructured()
```

After (pictorially). My ascii art is awful but hopefully you get the idea
```
RegisterDispatchKey.__call__() -------------------------------------> gen_unstructured()
                             \                                       /
                               \                                   /
                                     --> gen_inplace_out_wrapper()

```
`gen_inplace_out_wrapper()` has the specific task of generating anonymous definitions (wrappers) for inplace/out kernels, and otherwise forwards all other logic, including dispatcher registrations + namespaced definitions, to `gen_unstructured()`.

One wrinkle: `gen_unstructured` takes in a single `Union[NativeFunction, ExternalBackendFunction]`, which technically isn't enough context to decide when to generate registrations/namespaced definitions. That's because it now also needs the context of the entire function group: if the backend has not implemented an `out` kernel, then whether or not to generate a registration for it depends entirely on whether the backend has implemented a `functional` kernel. I ended up adding an extra optional argument to `gen_unstructured()` to deal with that: a bool that tells gen_unstructured whether it's dealing with an autogenerated wrapper.

**What this PR does not do**
Structured kernel support for external backends. To add at some point in the future.




[ghstack-poisoned]
@bdhirsh bdhirsh force-pushed the remove_bridge_api_from_codegen branch 2 times, most recently from e39f26e to 6179761 Compare May 4, 2021 17:40
@JackCaoG (Collaborator) commented May 8, 2021

I realized that if I don't stop making changes to ops' signatures, we can never get this PR merged... We can probably decide on a time to merge this PR (when it is ready), and I will update all of the pending PRs.

@bdhirsh bdhirsh force-pushed the remove_bridge_api_from_codegen branch from 6179761 to b1124ca Compare May 11, 2021 17:39
```
at::native::alpha_check(at::result_type(self, other), alpha);
CheckBinaryOpTypePromotion(self, self, other);
XLATensor self_tensor = bridge::GetXlaTensor(self);
XLATensor::add_(self_tensor,
```
bdhirsh (Collaborator, Author):

Actually, I should have removed the code for XLATensor::add_ in this PR too. I'm planning a followup PR where I remove most of the other inplace ops, so I'll remove it there too.

@bdhirsh bdhirsh changed the title from "[WIP] Remove bridge api from codegen" to "Use some new codegen functionality" May 11, 2021
@bdhirsh (Collaborator, Author) commented May 11, 2021

> I realized that if I don't stop making changes to ops' signatures, we can never get this PR merged... We can probably decide on a time to merge this PR (when it is ready), and I will update all of the pending PRs.

@JackCaoG I think we're finally okay now that #2940 has landed 😛.

I just updated this PR with the set of changes it provides, along with its corresponding pytorch PR. Still have to make sure that CI is green, but lmk if you have any questions! One of my followups will be to kill most of the inplace kernel code in pytorch/xla, now that we can code-generate it.

@JackCaoG (Collaborator) commented:

Thanks! You might need to update the inplace ops C++ test to check a different counter, like in https://github.com/pytorch/xla/blob/master/test/cpp/test_aten_xla_tensor.cpp#L446

@bdhirsh bdhirsh force-pushed the remove_bridge_api_from_codegen branch from e2fe1f4 to 1db52f3 Compare May 12, 2021 14:13
@bdhirsh (Collaborator, Author) commented May 12, 2021

> Thanks! You might need to update the inplace ops C++ test to check a different counter, like in https://github.com/pytorch/xla/blob/master/test/cpp/test_aten_xla_tensor.cpp#L446

Thanks for the pointer. Surprisingly, it looks like all of the tests passed even though I removed torch_xla::add_ in this PR. But I'm sure I'll hit the failures you're talking about when I try to remove the other inplace ops.

Side note: I'm probably going to try to merge this and the corresponding pytorch PR soon, since it looks like CI is finally green (I still need to double-check that, on the pytorch side, there are no errors aside from the xla tests).

```
CopyTensor(tensor, dst.scalar_type(), /*copy=*/false);
dst.resize_as_(typed_tensor).copy_(typed_tensor);
} else {
// at this point we know dst is an XLA tensor
```
Collaborator:

Could you help me understand why we know dst is an XLA tensor here? I guess I am trying to understand the use case of this function. What does checking self_tensor mean? Are we checking whether they are empty tensors or CPU tensors?

@bdhirsh (Collaborator, Author) commented May 13, 2021:

I wrote this function to be similar to AtenXlaType::_copy_from, in that it handles all 3 cases:

  • self is on CPU and dst is on XLA
  • self is on XLA and dst is on CPU
  • they're both on XLA

The main difference is just that in the case that dst is on XLA, it handles resizing properly. I got the resizing logic mostly by looking at the function that the codegen used to call.

Could you help me understand why we know dst is an XLA tensor here?

In the } else { section here, we know that both bridge::TryGetXlaTensor(self) and bridge::TryGetXlaTensor(dst) returned valid XLATensor objects, which I took to mean that they both live on XLA. So this piece of the code handles the cases where both self and dst live on XLA, and I took the corresponding logic from here to figure out how to copy and resize the dst XLA tensor.

This piece of logic gets hit inside of the generated out wrappers. I pasted an example codegen'd out wrapper for add below:

```
at::Tensor & wrapper_out_add_out(const at::Tensor & self, const at::Tensor & other, co
  XLA_FN_TRACK(3);
  TF_VLOG(3) << "XLA wrapper_out_add_out :" << " self=" << self.toString() << " other=

  auto wrapper_out_add_out_tmp = wrapper_Tensor_add(self, other, alpha);
  at::_copy_from_and_resize(wrapper_out_add_out_tmp, out);
  return out;
}
```

```
auto self_tensor = bridge::TryGetXlaTensor(self);
if (!self_tensor) {
  XLA_CHECK(dst_tensor);
  dst_tensor->UpdateFromTensorOut(self);
```
Collaborator:

qq: here you chose to call UpdateFromTensorOut, which also handles the view check and defaults sync-update to false. Is there a reason to do this instead of dst_tensor->UpdateFromTensor(self, /*sync=*/sync_update); like in AtenXlaType::_copy_from?

bdhirsh (Collaborator, Author):

Yeah, that was actually my original motivation for writing this op. When I call dst_tensor->UpdateFromTensor(self, /*sync=*/sync_update); instead (like in _copy_from), xla complains if dst and self have different sizes. The next thing I tried was to explicitly resize the destination tensor before the call, but then I hit #2881. I got the idea to call UpdateFromTensorOut from the bridge:: API that the codegen used previously, which is here. The only reason I didn't pass in sync_update is that UpdateFromTensorOut doesn't accept that arg :) but the original codegen also didn't use that argument, so I figured this would be more in line with existing functionality.

Side note: implicitly resizing output tensors is actually currently allowed for in-tree kernels, but it's deprecated. So right now we want to allow that case, but eventually it'll probably go away. That's out of the scope of this function though, since that can be fixed through the codegen (we can just call _copy_from instead of _copy_from_and_resize).

Let me know if that all clears things up.

Collaborator:

Thanks for the explanation!

@JackCaoG (Collaborator) left a comment:

Thanks! @bdhirsh

@ailzhang (Contributor) left a comment:

LGTM!
nit: we'll get more signal in the follow-up PR when the inplace lowerings are removed from the xla codebase. But a nice addition to this PR would be comparing _copy_from_and_resize against the part it's trying to replace ;) so that we are confident this change is benign :D

@JackCaoG (Collaborator) commented:

Given that 1.9 will be cut for pytorch next Monday (likely sometime next week for pt/xla as well), what's our plan for this and the upstream codegen change? I assume we are not trying to push this into 1.9?

@bdhirsh (Collaborator, Author) commented May 17, 2021

> But a nice addition to this PR would be comparing _copy_from_and_resize against the part it's trying to replace ;) so that we are confident this change is benign :D

That's fair; two of the three code paths from that function are exercised directly in this PR: (XLA->XLA) in the generated out wrappers, and (CPU->XLA) in the CPU fallbacks. But I can follow up with explicit tests.

> Given that 1.9 will be cut for pytorch next Monday (likely sometime next week for pt/xla as well), what's our plan for this and the upstream codegen change? I assume we are not trying to push this into 1.9?

This definitely isn't an official feature of the 1.9 release, but given that CI is passing, I think it's probably fine to merge independently of the branch-cut deadline. AFAIK, we don't have a policy around merging new changes close to the cut (?).

@bdhirsh bdhirsh merged commit 3e3704f into master May 17, 2021
@bdhirsh bdhirsh deleted the remove_bridge_api_from_codegen branch May 17, 2021 22:17
@JackCaoG (Collaborator) commented:

@bdhirsh Sounds good. My main concern is that pt/xla 1.9 is not cut yet. If the upstream pytorch 1.9 doesn't have your corresponding PR, we need to make sure not to include this pt/xla PR in our pt/xla 1.9 branch.

@bdhirsh (Collaborator, Author) commented May 17, 2021

> @bdhirsh Sounds good. My main concern is that pt/xla 1.9 is not cut yet. If the upstream pytorch 1.9 doesn't have your corresponding PR, we need to make sure not to include this pt/xla PR in our pt/xla 1.9 branch.

That's a good point, thanks @JackCaoG. I'll check and confirm that either (a) both commits or (b) neither commit is included.

antoniojkim pushed a commit to antoniojkim/pytorch_xla that referenced this pull request May 20, 2021