
Add new keys for Graphcore IPU (DispatchKey / Backend / DeviceType) #74763


Closed
wants to merge 5 commits

Conversation

AnthonyBarbier
Collaborator

We need a key to register our out of tree backend: https://github.com/graphcore/poptorch

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 25, 2022


💊 CI failures summary and remediations

As of commit 0c1f842 (more details on the Dr. CI page):


  • 1/2 failures introduced in this PR
  • 1/2 broken upstream at merge base d88c454 on Apr 05 from 12:51am to 8:21am

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-04-05T15:46:51.9609233Z     #10 0x556e544c3c81 in run_mod /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:1037
2022-04-05T15:46:51.9609843Z     #11 0x556e544cec69 in PyRun_StringFlags /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:961
2022-04-05T15:46:51.9610274Z     #12 0x556e544ceccb in PyRun_SimpleStringFlags /home/builder/tkoch/workspace/python_1648536129212/work/Python/pythonrun.c:455
2022-04-05T15:46:51.9611224Z     #13 0x556e544cedc8 in pymain_run_command /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:420
2022-04-05T15:46:51.9612802Z     #14 0x556e544cedc8 in pymain_run_python /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:2907
2022-04-05T15:46:51.9613453Z     #15 0x556e544cedc8 in pymain_main /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:3460
2022-04-05T15:46:51.9614232Z     #16 0x556e544cf18b in _Py_UnixMain /home/builder/tkoch/workspace/python_1648536129212/work/Modules/main.c:3495
2022-04-05T15:46:52.0118008Z     #17 0x7f2a90d9283f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
2022-04-05T15:46:52.0118479Z     #18 0x556e54474039 in _start (/opt/conda/bin/python3.7+0x1d8039)
2022-04-05T15:46:52.0118650Z 
2022-04-05T15:46:52.0118967Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in 
2022-04-05T15:46:52.0317497Z + retcode=1
2022-04-05T15:46:52.0317877Z + set -e
2022-04-05T15:46:52.0318137Z + return 1
2022-04-05T15:46:52.0321312Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]
2022-04-05T15:46:52.0321790Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X ]]
2022-04-05T15:46:52.0322390Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]
2022-04-05T15:46:52.0322893Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]]
2022-04-05T15:46:52.0323685Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]
2022-04-05T15:46:52.0324447Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\5\1\2 ]]
2022-04-05T15:46:52.0326720Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 25, 2022
@albanD albanD requested review from bdhirsh and removed request for albanD March 28, 2022 16:45
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 28, 2022
@@ -416,6 +418,7 @@ enum class DispatchKey : uint16_t {
_QuantizedHIP,
_QuantizedXLA,
_QuantizedMLC,
_QuantizedIPU,
Contributor

There have been some recent changes to the DispatchKey representation, and there are now a few questions to think through when adding a new dispatch key.

Does IPU currently have existing support (existing kernels) for quantization? Sparsity? Does it need to override any autograd formulas from core?

If the answer to all of those is no, then you probably don't need to assign a BackendComponent enum slot for IPU. That's also the case for e.g. ORT, Metal and Vulkan dispatch keys, which don't currently override any of those functionalities (they just have backend kernels)

Collaborator Author

I probably misunderstood the code: I thought the _ prefixed enums were placeholders rather than implementations?

To answer your questions: yes we override some formulas for Autograd.
For Quantization / Sparsity it's too early to tell (which is why I used the _ version).

Contributor

@bdhirsh bdhirsh Apr 5, 2022

yes we override some formulas for Autograd.

Just took a quick look at the poptorch repo - is the autograd registration mostly to deal with convolution? (here)

There has actually been a bunch of work (from @jbschlosser) to clean up the convolution ops in our aten namespace, and today we have two main convolution entry-points:

at::convolution is a non-composite op that you can directly write a backend kernel for with the backend IPU key (here)

And its derivative formula directly calls the at::convolution_backward op (here), which you can also directly write a kernel for with the IPU key.

I'm mostly just wondering if convolution was the only case where you had to override autograd (more because of the way that convolution was structured in core than anything specific to IPUs). cc @albanD.
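To make that concrete, here's a minimal sketch of registering a backend kernel for at::convolution directly against the new IPU key; the kernel name and body are illustrative placeholders, not poptorch code, and it assumes the IPU key added in this PR:

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Placeholder out-of-tree kernel matching the aten `convolution` schema.
at::Tensor ipu_convolution(
    const at::Tensor& input,
    const at::Tensor& weight,
    const c10::optional<at::Tensor>& bias,
    at::IntArrayRef stride,
    at::IntArrayRef padding,
    at::IntArrayRef dilation,
    bool transposed,
    at::IntArrayRef output_padding,
    int64_t groups) {
  // A real backend would lower the convolution onto the IPU here.
  TORCH_CHECK(false, "IPU convolution lowering not implemented in this sketch");
  return at::Tensor();  // unreachable
}

// Register directly against the IPU backend key; `convolution_backward`
// can be registered the same way, so no Autograd override is needed
// just for convolution.
TORCH_LIBRARY_IMPL(aten, IPU, m) {
  m.impl("convolution", &ipu_convolution);
}
```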

Collaborator Author

We haven't gone very far with testing the training part of the integration (which is why it's only convolution for now; you're right that convolution changed in 1.11, so this override will go away), but we think we'll potentially still need an Autograd key in two cases:

  • For ops using random numbers we need to make sure the PRNG reference is the same between forward and backward. (This is one example, but we might need to track more state to get correct and efficient behaviour.)
  • If we add custom operations which don't exist in PyTorch, then we need to register the autograd function against the autograd key? (We haven't tried that part yet.)

Contributor

Thanks for the info! I'm less familiar with the rng example, but for point 2: if you have custom operators, and you want to define backward formulas for them, then yep you'll need an AutogradIPU key.
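For reference, a rough sketch of what that could look like for a hypothetical custom op; the op name, kernels and the trivial backward are made up for illustration, and it assumes the IPU / AutogradIPU keys added by this PR:

```cpp
#include <torch/torch.h>
#include <torch/library.h>
#include <ATen/core/LegacyTypeDispatch.h>
#include <ATen/core/dispatch/Dispatcher.h>

using torch::Tensor;
using torch::autograd::AutogradContext;
using torch::autograd::variable_list;

// Placeholder IPU backend kernel for the custom op.
Tensor myadd_ipu(const Tensor& self, const Tensor& other) {
  TORCH_CHECK(false, "IPU lowering for myops::myadd would go here");
  return Tensor();  // unreachable
}

// Hand-written backward, attached via a custom autograd::Function.
class MyAddFunction : public torch::autograd::Function<MyAddFunction> {
 public:
  static Tensor forward(AutogradContext* ctx, Tensor self, Tensor other) {
    // Redispatch below autograd so the IPU backend kernel is reached.
    at::AutoDispatchBelowADInplaceOrView guard;
    static auto op = c10::Dispatcher::singleton()
        .findSchemaOrThrow("myops::myadd", "")
        .typed<Tensor(const Tensor&, const Tensor&)>();
    return op.call(self, other);
  }
  static variable_list backward(AutogradContext* ctx, variable_list grad_out) {
    // d(self + other) w.r.t. both inputs is just the incoming gradient.
    return {grad_out[0], grad_out[0]};
  }
};

Tensor myadd_autograd(const Tensor& self, const Tensor& other) {
  return MyAddFunction::apply(self, other);
}

// Schema definition plus per-key registrations.
TORCH_LIBRARY(myops, m) {
  m.def("myadd(Tensor self, Tensor other) -> Tensor");
}
TORCH_LIBRARY_IMPL(myops, IPU, m) {
  m.impl("myadd", &myadd_ipu);
}
TORCH_LIBRARY_IMPL(myops, AutogradIPU, m) {
  m.impl("myadd", &myadd_autograd);
}
```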

@soulitzer soulitzer removed their request for review April 4, 2022 20:54
@AnthonyBarbier AnthonyBarbier requested a review from awgu as a code owner April 5, 2022 14:41
@bdhirsh
Contributor

bdhirsh commented Apr 5, 2022

Hey @AnthonyBarbier, just a few questions on IPU:

What’s the state of the open-source IPU backend (https://github.com/graphcore/poptorch)? I just want to call out that we have a few PrivateUse dispatch keys, which you can use to prototype the integration entirely out of tree (just register all of your out-of-tree kernels to e.g. the PrivateUse3 dispatch key, and ensure that all pytorch operator calls route to that key).

Those private use keys don’t give you a device enum though - so if poptorch is in imminent need of the ability to do stuff like torch.ones(…, device=“ipu”), then this seems ok (but if that isn’t needed just yet, you can always prototype with the private use keys until then).
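As a concrete illustration of the PrivateUse route, an out-of-tree prototype could register its kernels along these lines; the kernel name and body are placeholders, not taken from poptorch:

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Placeholder out-of-tree kernel matching the aten `add.Tensor` schema.
at::Tensor prototype_add(
    const at::Tensor& self,
    const at::Tensor& other,
    const at::Scalar& alpha) {
  TORCH_CHECK(false, "out-of-tree add.Tensor implementation would go here");
  return at::Tensor();  // unreachable
}

// Route the op to the PrivateUse3 dispatch key while no dedicated backend
// key exists; tensors carrying that key will dispatch to this kernel.
TORCH_LIBRARY_IMPL(aten, PrivateUse3, m) {
  m.impl("add.Tensor", &prototype_add);
}
```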

Separately — I discussed with @albanD, and there are two downsides to this type of PR that we want to fix in the long term (although until they’re fixed, we’ll still accept this type of PR):

(1) Adding a new backend out-of-tree is still pretty fraught, and requires a bunch of changes to our internal enums across many files (you can see that from the size of this PR)
(2) We still don’t have that many dispatch keys that are free

In the long term, we’ll probably want to have some open device enum registration, so you can write torch.ones(…, device=“ipu”) without needing a bunch of boilerplate changes in core. We’d also probably like to eventually have them all route to the same dispatch key, so we don’t need to use up so much key space on backends (this has the minor downside that you wouldn’t be able to interleave calls to two different external backends within the same pytorch session).

cc @ezyang

@AnthonyBarbier
Collaborator Author


Hi @bdhirsh,

Since our last release we've got a prototype using the PrivateUse key: https://github.com/graphcore/poptorch/blob/sdk-release-2.4/poptorch/source/dispatch_tracer/RegisterAtenOverloads.cpp

PopTorch still uses jit.trace() by default for now, but we're hoping to fully switch to the dispatcher by our June/July release. (We might need to temporarily steal another backend key for that particular release, but we should have most of the logic implemented by then.)

And yes, once Torch 1.12 is out we're hoping to support IPU devices at the Python level, and generally we'd like to offer a more native PyTorch experience to our users.

@bdhirsh
Contributor

bdhirsh commented Apr 5, 2022

Thanks for the info!

we might need to temporarily steal another backend key for that particular release, but we should have most of the logic implemented by then

It's a little unclear to me what the second key would be for - is there a reason you can't use the new IPU key (and potentially temporarily use the PrivateUse keys if you end up needing a second one)?

@AnthonyBarbier
Collaborator Author


Sorry for the confusion: we only need one key. It's just that PopTorch builds against Torch releases, so while we wait for this patch to make it into a release we might internally use somebody else's key for development purposes.

The reason I'm suggesting that is that I couldn't find a way to move layers or tensors / parameters from the CPU to the device using the PrivateUse keys? (Whereas if we register some callbacks for, say, XLA, then we start seeing the copies to and from CPU.)
Might be easier to discuss this on Slack.

Do you have any idea why the CI is unhappy? Last time it was because my branch wasn't in sync with the tip of master, but this time I'm using the latest master AFAICT?

@@ -437,6 +440,7 @@ enum class DispatchKey : uint16_t {
// [Masquerading as CUDA]
_SparseXLA,
_SparseMLC,
_SparseIPU,
Contributor

Also on the leading underscore: this is fine, the idea is that by making IPU a "backend component", you're automatically adding space in the dispatcher to be able to register dense + sparse + quantized + autograd kernels. If the Sparse and Quantized keys aren't actually needed yet though (like some of the other keys below), then they have a leading underscore just to make that clear.

Contributor

@bdhirsh bdhirsh left a comment

Looks ok to me! Hopefully I answered your question about the PrivateUse key limitations on Slack. Also, the failing CI looks unrelated to this PR.

@bdhirsh
Contributor

bdhirsh commented Apr 7, 2022

@pytorchbot merge this please

@github-actions
Contributor

github-actions bot commented Apr 7, 2022

Hey @AnthonyBarbier.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Apr 8, 2022
…74763)

Summary:
We need a key to register our out of tree backend: https://github.com/graphcore/poptorch

Pull Request resolved: #74763
Approved by: https://github.com/bdhirsh

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/ce9e27a0fc49864a75e373a25ced7eaba41e37fc

Reviewed By: b0noI

Differential Revision: D35485705

fbshipit-source-id: ce4f1d9eaf1cfd60e2f701fb2445cb96c12436ce
Labels
cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue open source triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
5 participants