Add new keys for Graphcore IPU (DispatchKey / Backend / DeviceType) #74763
Conversation
💊 CI failures summary and remediations: as of commit 0c1f842 (more details on the Dr. CI page), 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
@@ -416,6 +418,7 @@ enum class DispatchKey : uint16_t {
   _QuantizedHIP,
   _QuantizedXLA,
   _QuantizedMLC,
+  _QuantizedIPU,
There have been some recent changes to the DispatchKey representation, and there are now a few questions to think through when adding a new dispatch key.
Does IPU currently have existing support (existing kernels) for quantization? Sparsity? Does it need to override any autograd formulas from core?
If the answer to all of those is no, then you probably don't need to assign a BackendComponent enum slot for IPU. That's also the case for e.g. the ORT, Metal and Vulkan dispatch keys, which don't currently override any of those functionalities (they just have backend kernels).
I probably misunderstood the code: I thought the _-prefixed enums were placeholders rather than implementations?
To answer your questions: yes, we override some formulas for Autograd. For Quantization / Sparsity it's too early to tell (which is why I used the _ version).
yes we override some formulas for Autograd.
Just took a quick look at the poptorch repo - is the autograd registration mostly to deal with convolution? (here)
There has actually been a bunch of work (from @jbschlosser) to clean up the convolution ops in our aten namespace, and today we have two main convolution entry points:
- at::convolution is a non-composite op that you can directly write a backend kernel for with the backend IPU key (here).
- Its derivative formula directly calls the at::convolution_backward op (here), which you can also directly write a kernel for with the IPU key.
I'm mostly just wondering if convolution was the only case where you had to override autograd (more because of the way that convolution was structured in core than anything specific to IPUs). cc @albanD.
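For illustration, a backend-only registration along those lines could look roughly like the sketch below. This is a hedged sketch, not poptorch's actual code: ipu_convolution is a made-up name, it assumes the IPU dispatch key added by this PR, the C++ signature should be checked against the generated ATen headers, and the kernel body is a placeholder.

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

at::Tensor ipu_convolution(
    const at::Tensor& input,
    const at::Tensor& weight,
    const c10::optional<at::Tensor>& bias,
    at::IntArrayRef stride,
    at::IntArrayRef padding,
    at::IntArrayRef dilation,
    bool transposed,
    at::IntArrayRef output_padding,
    int64_t groups) {
  // A real kernel would lower this to the IPU graph and return a tensor with
  // the correct output shape; this placeholder just fails loudly.
  TORCH_CHECK(false, "convolution: not implemented in this sketch");
  return at::Tensor();  // unreachable, keeps the compiler happy
}

TORCH_LIBRARY_IMPL(aten, IPU, m) {
  m.impl("convolution", TORCH_FN(ipu_convolution));
  // at::convolution_backward can be registered the same way, so no Autograd
  // override should be needed just for convolution.
}
```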
We haven't gone very far with testing the training part of the integration (which is why for now it's only convolution; you're right that we had noticed convolution changed in 1.11, so this will go away), but we think we'll potentially still need an Autograd key in two cases:
- For ops using random numbers, we need to make sure the PRNG reference is the same between forward and backward. (This is one example, but we might need to track more state to get correct and efficient behaviour.)
- If we add custom operations which don't exist in PyTorch, then we need to register the autograd function against the autograd key? (We haven't tried that part yet.)
Thanks for the info! I'm less familiar with the RNG example, but for point 2: if you have custom operators and you want to define backward formulas for them, then yep, you'll need an AutogradIPU key.
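As a rough illustration of that second case, the sketch below registers a custom op with its backward formula against the AutogradIPU key and its forward kernel against the plain IPU key. This is a hedged sketch, not poptorch code: poptorch::my_custom_op is a made-up name, both kernel bodies are placeholders, and it assumes the IPU / AutogradIPU keys discussed here.

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>
#include <torch/csrc/autograd/custom_function.h>

// Backend (forward) kernel: placeholder body.
at::Tensor my_custom_op_ipu(const at::Tensor& input) {
  TORCH_CHECK(false, "my_custom_op: not implemented in this sketch");
  return at::Tensor();  // unreachable, keeps the compiler happy
}

struct MyCustomOp : public torch::autograd::Function<MyCustomOp> {
  static at::Tensor forward(torch::autograd::AutogradContext* ctx,
                            const at::Tensor& input) {
    ctx->save_for_backward({input});
    // Call the backend kernel directly here; a real implementation would
    // redispatch through the dispatcher below the Autograd keys.
    return my_custom_op_ipu(input);
  }
  static torch::autograd::variable_list backward(
      torch::autograd::AutogradContext* ctx,
      torch::autograd::variable_list grad_outputs) {
    // Placeholder backward formula (identity).
    return {grad_outputs[0]};
  }
};

at::Tensor my_custom_op_autograd(const at::Tensor& input) {
  return MyCustomOp::apply(input);
}

TORCH_LIBRARY(poptorch, m) {
  m.def("my_custom_op(Tensor input) -> Tensor");
}

TORCH_LIBRARY_IMPL(poptorch, IPU, m) {
  m.impl("my_custom_op", TORCH_FN(my_custom_op_ipu));
}

TORCH_LIBRARY_IMPL(poptorch, AutogradIPU, m) {
  m.impl("my_custom_op", TORCH_FN(my_custom_op_autograd));
}
```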
Hey @AnthonyBarbier, just a few questions on IPU:

What's the state of the open-source IPU backend (https://github.com/graphcore/poptorch)? I just want to call out that we have a few PrivateUse dispatch keys, which you can use to prototype the integration entirely out of tree (just register all of your out-of-tree kernels to e.g. the PrivateUse3 dispatch key, and ensure that all pytorch operator calls route to that key). Those private use keys don't give you a device enum though - so if poptorch is in imminent need of the ability to do stuff like torch.ones(…, device="ipu"), then this seems ok (but if that isn't needed just yet, you can always prototype with the private use keys until then).

Separately: I discussed with @albanD, and there are two downsides to this type of PR that we want to fix in the long term (although until they're fixed, we'll still accept this type of PR):
(1) Adding a new backend out-of-tree is still pretty fraught, and requires a bunch of changes to our internal enums across many files (you can see that from the size of this PR). In the long term, we'll probably want to have some open device enum registration, so you can write

cc @ezyang
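For reference, a minimal sketch of the kind of out-of-tree registration described above, assuming the PrivateUse3 key mentioned in the comment. my_backend_abs is a made-up name with a placeholder body; a real integration would also need a boxed fallback or per-op kernels so that every operator routes to the key.

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

at::Tensor my_backend_abs(const at::Tensor& self) {
  // A real backend would lower this call to its own device / IR here.
  TORCH_CHECK(false, "abs: not implemented in this sketch");
  return at::Tensor();  // unreachable, keeps the compiler happy
}

TORCH_LIBRARY_IMPL(aten, PrivateUse3, m) {
  m.impl("abs", TORCH_FN(my_backend_abs));
}
```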
Hi @bdhirsh,

Since our last release we've got a prototype using the PrivateUse key: https://github.com/graphcore/poptorch/blob/sdk-release-2.4/poptorch/source/dispatch_tracer/RegisterAtenOverloads.cpp

PopTorch is still using jit.trace() by default for now, but we're hoping to fully switch to the dispatcher by our June/July release (we might need to temporarily steal another backend key for that particular release, but we should have most of the logic implemented by then).

And yes, once Torch 1.12 is out we're hoping to support IPU devices at the Python level, and generally we'd like to offer a more native PyTorch experience to our users.
Thanks for the info!
It's a little unclear to me what the second key would be for - is there a reason you can't use the new
Sorry for the confusion: we only need one key. It's just that PopTorch builds against Torch releases, so while we wait for this patch to make it into a release we might internally use somebody else's key for development purposes. The reason I'm suggesting that is that I couldn't find a way to move layers or tensors / parameters from the CPU to the device using the PrivateUse key (whereas if we register some callbacks for, let's say, XLA, then we start seeing the copies to and from CPU).

Do you have any idea why the CI is unhappy? Last time it was because my branch wasn't in sync with the tip of master, but this time I'm using the latest master AFAICT.
@@ -437,6 +440,7 @@ enum class DispatchKey : uint16_t {
   // [Masquerading as CUDA]
   _SparseXLA,
   _SparseMLC,
+  _SparseIPU,
Also on the leading underscore: this is fine. The idea is that by making IPU a "backend component", you're automatically adding space in the dispatcher to be able to register dense + sparse + quantized + autograd kernels. If the Sparse and Quantized keys aren't actually needed yet (like some of the other keys below), then they get a leading underscore just to make that clear.
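A very rough illustration of that layout (this is not the real c10/core/DispatchKey.h, just a sketch of the idea):

```cpp
#include <cstdint>

// Each backend that owns a BackendComponent bit reserves one runtime
// dispatch slot per functionality.
enum class BackendComponent : uint8_t {
  CPUBit,
  CUDABit,
  IPUBit,  // added by this PR
  // ...
};

// Conceptually, combining IPUBit with each functionality gives:
//   Dense     -> IPU            (backend kernels, used today)
//   Quantized -> _QuantizedIPU  (underscore: placeholder, not used yet)
//   Sparse    -> _SparseIPU     (underscore: placeholder, not used yet)
//   Autograd  -> AutogradIPU    (needed to override backward formulas)
```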
Looks OK to me! Hopefully I answered your question on the PrivateUse key limitations on Slack. Also, the failing CI looks unrelated to this PR.
@pytorchbot merge this please
Hey @AnthonyBarbier.
…74763)

Summary: We need a key to register our out of tree backend: https://github.com/graphcore/poptorch

Pull Request resolved: #74763
Approved by: https://github.com/bdhirsh
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/ce9e27a0fc49864a75e373a25ced7eaba41e37fc
Reviewed By: b0noI
Differential Revision: D35485705
fbshipit-source-id: ce4f1d9eaf1cfd60e2f701fb2445cb96c12436ce
We need a key to register our out of tree backend: https://github.com/graphcore/poptorch