
Add new keys for Graphcore IPU (DispatchKey / Backend / DeviceType) #74763


Closed
wants to merge 5 commits

Conversation

AnthonyBarbier
Collaborator

We need a key to register our out of tree backend: https://github.com/graphcore/poptorch

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 25, 2022


💊 CI failures summary and remediations

As of commit 0c1f842 (more details on the Dr. CI page):


  • 1/2 failures introduced in this PR
  • 1/2 broken upstream at merge base d88c454 on Apr 05 from 12:51am to 8:21am

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-04-05T15:46:51.9609233Z     #10 0x556e544c3c81 in run_mod /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:1037
2022-04-05T15:46:51.9609843Z     #11 0x556e544cec69 in PyRun_StringFlags /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:961
2022-04-05T15:46:51.9610274Z     #12 0x556e544ceccb in PyRun_SimpleStringFlags /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:455
2022-04-05T15:46:51.9611224Z     #13 0x556e544cedc8 in pymain_run_command /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:420
2022-04-05T15:46:51.9612802Z     #14 0x556e544cedc8 in pymain_run_python /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:2907
2022-04-05T15:46:51.9613453Z     #15 0x556e544cedc8 in pymain_main /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:3460
2022-04-05T15:46:51.9614232Z     #16 0x556e544cf18b in _Py_UnixMain /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:3495
2022-04-05T15:46:52.0118008Z     #17 0x7f2a90d9283f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
2022-04-05T15:46:52.0118479Z     #18 0x556e54474039 in _start (/opt/conda/bin/python3.7+0x1d8039)
2022-04-05T15:46:52.0118650Z 
2022-04-05T15:46:52.0118967Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in 
2022-04-05T15:46:52.0317497Z + retcode=1
2022-04-05T15:46:52.0317877Z + set -e
2022-04-05T15:46:52.0318137Z + return 1
2022-04-05T15:46:52.0321312Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]
2022-04-05T15:46:52.0321790Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X ]]
2022-04-05T15:46:52.0322390Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]
2022-04-05T15:46:52.0322893Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]]
2022-04-05T15:46:52.0323685Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]
2022-04-05T15:46:52.0324447Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\5\1\2 ]]
2022-04-05T15:46:52.0326720Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 25, 2022
@albanD albanD requested review from bdhirsh and removed request for albanD March 28, 2022 16:45
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 28, 2022
@@ -416,6 +418,7 @@ enum class DispatchKey : uint16_t {
_QuantizedHIP,
_QuantizedXLA,
_QuantizedMLC,
_QuantizedIPU,
Contributor

There have been some recent changes to the DispatchKey representation, and there are now a few questions to think through when adding a new dispatch key.

Does IPU currently have existing support (existing kernels) for quantization? Sparsity? Does it need to override any autograd formulas from core?

If the answer to all of those is no, then you probably don't need to assign a BackendComponent enum slot for IPU. That's also the case for e.g. ORT, Metal and Vulkan dispatch keys, which don't currently override any of those functionalities (they just have backend kernels)

Collaborator Author

I probably misunderstood the code: I thought the _ prefixed enums were placeholders rather than implementations?

To answer your questions: yes we override some formulas for Autograd.
For Quantization / Sparsity it's too early to tell (which is why I used the _ version).

Contributor

@bdhirsh bdhirsh Apr 5, 2022

yes we override some formulas for Autograd.

Just took a quick look at the poptorch repo - is the autograd registration mostly to deal with convolution? (here)

There has actually been a bunch of work (from @jbschlosser) to clean up the convolution ops in our aten namespace, and today we have two main convolution entry-points:

at::convolution is a non-composite op that you can directly write a backend kernel for with the backend IPU key (here)

And its derivative formula directly calls the at::convolution_backward op (here), which you can also directly write a kernel for with the IPU key.

I'm mostly just wondering if convolution was the only case where you had to override autograd (more because of the way that convolution was structured in core than anything specific to IPUs). cc @albanD.
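To make that concrete, here's a minimal sketch of registering a backend kernel for at::convolution directly against the new IPU key; the kernel name and body are illustrative placeholders, not poptorch code, and it assumes the IPU key added in this PR:

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Placeholder out-of-tree kernel matching the aten `convolution` schema.
at::Tensor ipu_convolution(
    const at::Tensor& input,
    const at::Tensor& weight,
    const c10::optional<at::Tensor>& bias,
    at::IntArrayRef stride,
    at::IntArrayRef padding,
    at::IntArrayRef dilation,
    bool transposed,
    at::IntArrayRef output_padding,
    int64_t groups) {
  // A real backend would lower the convolution onto the IPU here.
  TORCH_CHECK(false, "IPU convolution lowering not implemented in this sketch");
  return at::Tensor();  // unreachable
}

// Register directly against the IPU backend key; `convolution_backward`
// can be registered the same way, so no Autograd override is needed
// just for convolution.
TORCH_LIBRARY_IMPL(aten, IPU, m) {
  m.impl("convolution", &ipu_convolution);
}
```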

Collaborator Author

We haven't gone very far with testing the training part of the integration (which is why it's only convolution for now; you're right that convolution changed in 1.11, so this override will go away), but we think we'll potentially still need an Autograd key in two cases:

  • For ops using random numbers we need to make sure the PRNG reference is the same between forward and backward. (This is one example, but we might need to track more state to get correct and efficient behaviour.)
  • If we add custom operations which don't exist in PyTorch, then we need to register the autograd function against the autograd key? (We haven't tried that part yet.)

Contributor

Thanks for the info! I'm less familiar with the rng example, but for point 2: if you have custom operators, and you want to define backward formulas for them, then yep you'll need an AutogradIPU key.
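For reference, a rough sketch of what that could look like for a hypothetical custom op; the op name, kernels and the trivial backward are made up for illustration, and it assumes the IPU / AutogradIPU keys added by this PR:

```cpp
#include <torch/torch.h>
#include <torch/library.h>
#include <ATen/core/LegacyTypeDispatch.h>
#include <ATen/core/dispatch/Dispatcher.h>

using torch::Tensor;
using torch::autograd::AutogradContext;
using torch::autograd::variable_list;

// Placeholder IPU backend kernel for the custom op.
Tensor myadd_ipu(const Tensor& self, const Tensor& other) {
  TORCH_CHECK(false, "IPU lowering for myops::myadd would go here");
  return Tensor();  // unreachable
}

// Hand-written backward, attached via a custom autograd::Function.
class MyAddFunction : public torch::autograd::Function<MyAddFunction> {
 public:
  static Tensor forward(AutogradContext* ctx, Tensor self, Tensor other) {
    // Redispatch below autograd so the IPU backend kernel is reached.
    at::AutoDispatchBelowADInplaceOrView guard;
    static auto op = c10::Dispatcher::singleton()
        .findSchemaOrThrow("myops::myadd", "")
        .typed<Tensor(const Tensor&, const Tensor&)>();
    return op.call(self, other);
  }
  static variable_list backward(AutogradContext* ctx, variable_list grad_out) {
    // d(self + other) w.r.t. both inputs is just the incoming gradient.
    return {grad_out[0], grad_out[0]};
  }
};

Tensor myadd_autograd(const Tensor& self, const Tensor& other) {
  return MyAddFunction::apply(self, other);
}

// Schema definition plus per-key registrations.
TORCH_LIBRARY(myops, m) {
  m.def("myadd(Tensor self, Tensor other) -> Tensor");
}
TORCH_LIBRARY_IMPL(myops, IPU, m) {
  m.impl("myadd", &myadd_ipu);
}
TORCH_LIBRARY_IMPL(myops, AutogradIPU, m) {
  m.impl("myadd", &myadd_autograd);
}
```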

@soulitzer soulitzer removed their request for review April 4, 2022 20:54
@AnthonyBarbier AnthonyBarbier requested a review from awgu as a code owner April 5, 2022 14:41
@bdhirsh
Contributor

bdhirsh commented Apr 5, 2022

Hey @AnthonyBarbier, just a few questions on IPU:

What’s the state of the open-source IPU backend (https://github.com/graphcore/poptorch)? I just want to call out that we have a few PrivateUse dispatch keys, which you can use to prototype the integration entirely out of tree (just register all of your out-of-tree kernels to e.g. the PrivateUse3 dispatch key, and ensure that all pytorch operator calls route to that key).

Those private use keys don’t give you a device enum though - so if poptorch is in imminent need of the ability to do stuff like torch.ones(…, device=“ipu”), then this seems ok (but if that isn’t needed just yet, you can always prototype with the private use keys until then).
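As a concrete illustration of the PrivateUse route, an out-of-tree prototype could register its kernels along these lines; the kernel name and body are placeholders, not taken from poptorch:

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Placeholder out-of-tree kernel matching the aten `add.Tensor` schema.
at::Tensor prototype_add(
    const at::Tensor& self,
    const at::Tensor& other,
    const at::Scalar& alpha) {
  TORCH_CHECK(false, "out-of-tree add.Tensor implementation would go here");
  return at::Tensor();  // unreachable
}

// Route the op to the PrivateUse3 dispatch key while no dedicated backend
// key exists; tensors carrying that key will dispatch to this kernel.
TORCH_LIBRARY_IMPL(aten, PrivateUse3, m) {
  m.impl("add.Tensor", &prototype_add);
}
```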

Separately — I discussed with @albanD, and there are two downsides to this type of PR that we want to fix in the long term (although until they’re fixed, we’ll still accept this type of PR):

(1) Adding a new backend out-of-tree is still pretty fraught, and requires a bunch of changes to our internal enums across many files (you can see that from the size of this PR)
(2) We still don’t have that many dispatch keys that are free

In the long term, we’ll probably want to have some open device enum registration, so you can write torch.ones(…, device=“ipu”) without needing a bunch of boilerplate changes in core. We’d also probably like to eventually have them all route to the same dispatch key, so we don’t need to use up so much key space on backends (this has the minor downside that you wouldn’t be able to interleave calls to two different external backends within the same pytorch session).

cc @ezyang

@AnthonyBarbier
Collaborator Author


Hi @bdhirsh,

Since our last release we've got a prototype using the PrivateUse key: https://github.com/graphcore/poptorch/blob/sdk-release-2.4/poptorch/source/dispatch_tracer/RegisterAtenOverloads.cpp

PopTorch still uses jit.trace() by default for now, but we're hoping to fully switch to the dispatcher by our June/July release. (We might need to temporarily steal another backend key for that particular release, but we should have most of the logic implemented by then.)

And yes, once Torch 1.12 is out we're hoping to support IPU devices at the Python level, and generally we'd like to offer a more native PyTorch experience to our users.

@bdhirsh
Contributor

bdhirsh commented Apr 5, 2022

Thanks for the info!

we might need to temporarily steal another backend key for that particular release, but we should have most of the logic implemented by then

It's a little unclear to me what the second key would be for - is there a reason you can't use the new IPU key (and potentially temporarily use the PrivateUse keys if you end up needing a second one)?

@AnthonyBarbier
Collaborator Author


Sorry for the confusion: we only need one key. It's just that PopTorch builds against Torch releases, so while we wait for this patch to make it into a release we might internally use somebody else's key for development purposes.

The reason I'm suggesting that is that I couldn't find a way to move layers or tensors / parameters from the CPU to the device using the PrivateUse keys? (Whereas if we register some callbacks for, say, XLA, then we start seeing the copies to and from CPU.)
Might be easier to discuss this on Slack.

Do you have any idea why the CI is unhappy? Last time it was because my branch wasn't in sync with the tip of master, but this time I'm using the latest master AFAICT?

@@ -437,6 +440,7 @@ enum class DispatchKey : uint16_t {
// [Masquerading as CUDA]
_SparseXLA,
_SparseMLC,
_SparseIPU,
Contributor

Also on the leading underscore: this is fine, the idea is that by making IPU a "backend component", you're automatically adding space in the dispatcher to be able to register dense + sparse + quantized + autograd kernels. If the Sparse and Quantized keys aren't actually needed yet though (like some of the other keys below), then they have a leading underscore just to make that clear.

Contributor

@bdhirsh bdhirsh left a comment

Looks ok to me! Hopefully I answered your question about the PrivateUse key limitations on Slack. Also, the failing CI looks unrelated to this PR.

@bdhirsh
Contributor

bdhirsh commented Apr 7, 2022

@pytorchbot merge this please

@github-actions
Contributor

github-actions bot commented Apr 7, 2022

Hey @AnthonyBarbier.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Apr 8, 2022
…74763)

Summary:
We need a key to register our out of tree backend: https://github.com/graphcore/poptorch

Pull Request resolved: #74763
Approved by: https://github.com/bdhirsh

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/ce9e27a0fc49864a75e373a25ced7eaba41e37fc

Reviewed By: b0noI

Differential Revision: D35485705

fbshipit-source-id: ce4f1d9eaf1cfd60e2f701fb2445cb96c12436ce
Labels
cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue open source triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
5 participants