[inductor] Add prims._inductor_bucketize and add lowerings #104007

Conversation
Dr. CI: ✅ No failures as of commit 2cc9c94. See artifacts and rendered test results at hud.pytorch.org/pr/104007.
Force-pushed from 5c59c46 to b36bfbb.
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 0d1b96a to cb4be47.
torch/_inductor/codegen/common.py (Outdated)

Although this does indeed seem to correspond to the semantics of torch.bucketize, in the jagged-tensor context it seems we'll need to subtract 1 from the result (values falling into the first bucket should have index 0).
In some rough benchmarks I didn't see any measurable perf difference between subtracting the 1 and not having to do so (in fbgemm lowerings). But good point, we'll need to make sure not to forget this.
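To make the subtract-1 point concrete, a small runnable sketch (the offsets here are made-up example values, not from the PR):

```python
import torch

# Jagged-tensor style offsets: [0, 2, 5] delimits two rows, of lengths 2 and 3.
offsets = torch.tensor([0, 2, 5])
flat_idx = torch.arange(5)  # flat element indices 0..4

# torch.bucketize(right=True) counts boundaries <= idx, so the first row's
# elements get bucket index 1; subtracting 1 yields 0-based row ids.
rows = torch.bucketize(flat_idx, offsets, right=True) - 1
print(rows)  # tensor([0, 0, 1, 1, 1])
```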
Force-pushed from 95d13d7 to 8abfc45.
torch.bucketize takes a tensor of values and a "boundaries" tensor, which is a sorted list of values that define buckets. It returns the bucket that each value lies in. E.g. if values = [1, 5, 3, 6] and boundaries = [0, 2, 4, 6, 8], the output will be [1, 3, 2, 3] (with the default right=False, the value 6 equals the boundary 6 and therefore falls in bucket 3).

The current decomposition of this op doesn't work well with dynamic shapes: it performs a binary search, which bakes the number of binary-search iterations into the compiled code and requires recompiling (I don't completely understand why/where this happens). I'm not sure whether there's a good way to write a decomposition for this op that will work with dynamic shapes.

Use case: this op is very similar to some operations needed by jagged tensors. As a first step, I want to add a lowering for aten.bucketize and make use of OpInfos. #104007

Pull Request resolved: #104396
Approved by: https://github.com/Chillee
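A minimal sketch of a binary-search style decomposition in plain PyTorch ops (an illustration, not the actual decomposition in core) showing where the problem comes from: the loop trip count depends on boundaries.numel(), so it gets fixed at trace time, and a different boundaries size means a recompile:

```python
import torch

def bucketize_decomp_sketch(values: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    # Sketch only; assumes boundaries is non-empty and sorted.
    n = boundaries.numel()
    low = torch.zeros_like(values, dtype=torch.int64)
    high = torch.full_like(values, n, dtype=torch.int64)
    # ~log2(n) iterations; this count gets baked into the traced graph.
    for _ in range(max(n, 1).bit_length()):
        mid = ((low + high) // 2).clamp(max=n - 1)
        go_right = boundaries[mid] < values  # per-element comparison
        low = torch.where(go_right, mid + 1, low)
        high = torch.where(go_right, high, mid)
    return low  # count of boundaries strictly below each value (right=False)

values = torch.tensor([1, 5, 3, 6])
boundaries = torch.tensor([0, 2, 4, 6, 8])
assert torch.equal(bucketize_decomp_sketch(values, boundaries),
                   torch.bucketize(values, boundaries))  # tensor([1, 3, 2, 3])
```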
Force-pushed from 4d9fc32 to 1b993ec.
torch/_inductor/lowering.py (Outdated)

```python
):
    assert len(boundaries.get_size()) == 1

    if input.get_device().type != "cuda" or boundaries.get_device().type != "cuda":
```

Use the is_triton() helper (which does the same thing).
torch/_inductor/lowering.py (Outdated)

```python
    input_loader = input.make_loader()

    index_dtype = torch.int32 if out_int32 else torch.int64
    triton_dtype = "tl.int32" if out_int32 else "tl.int64"
```

Let's not expose the triton_dtype in the device-agnostic IR. This will make it harder to extend to non-triton backends.
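A minimal sketch of the suggested direction, with assumed (hypothetical) helper names: the device-agnostic IR carries a torch.dtype, and only the Triton codegen backend translates it into a type string:

```python
import torch

# Hypothetical mapping that would live in the Triton codegen backend only.
_TORCH_TO_TRITON_DTYPE = {
    torch.int32: "tl.int32",
    torch.int64: "tl.int64",
}

def triton_type_str(dtype: torch.dtype) -> str:
    return _TORCH_TO_TRITON_DTYPE[dtype]

# The lowering stays device-agnostic:
out_int32 = True
index_dtype = torch.int32 if out_int32 else torch.int64  # passed through the IR
print(triton_type_str(index_dtype))  # "tl.int32", resolved only at codegen time
```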
torch/_inductor/lowering.py
Outdated
boundaries.get_name(), | ||
ops.index_expr(boundaries_size, index_dtype), | ||
triton_dtype, | ||
not right, # ops.bucketize and torch.bucketize have opposite semantics for "right" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial feelings here are we should match the torch semantics rather than the numpy one. That will at least make our codebase self-consistent. I don't feel that strongly about that though.
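To make the two conventions concrete (plain torch/numpy calls, nothing PR-specific):

```python
import torch
import numpy as np

values = [1, 2, 3]
boundaries = [1, 3]

# torch: right=True sends a value equal to a boundary into the bucket to its right.
print(torch.bucketize(torch.tensor(values), torch.tensor(boundaries), right=True))
# tensor([1, 1, 2])

# numpy: right=True closes the *right* edge of each bin instead.
print(np.digitize(values, boundaries, right=True))
# [0 1 1]

# i.e. torch's right=False matches numpy's right=True:
print(torch.bucketize(torch.tensor(values), torch.tensor(boundaries)))
# tensor([0, 1, 1])
```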
1. Use is_triton().
2. Pass a torch.dtype instead of a triton dtype into the bucketize inductor op.
3. Switch the behavior of "right" back to torch semantics.
@pytorchbot merge

Merge started: the change will be merged once all checks pass (ETA 0-4 hours).
… where num_elements_per_warp=32" In binary search triton implementations, (#104007) num_elements_per_warp=32 performs a lot better than larger values. This PR adds an autotuning config option for this purpose. I benchmarked #104007 with and without this change on a 16MB pointwise kernel. This change reduces the latency from 1ms to 0.35ms. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 [ghstack-poisoned]
…nts_per_warp=32" In binary search triton implementations, (#104007) num_elements_per_warp=32 performs a lot better than larger values. This PR adds an autotuning config option for this purpose. I benchmarked #104007 with and without this change on a 16MB pointwise kernel. This change reduces the latency from 1ms to 0.35ms. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 [ghstack-poisoned]
… try using config with num_elements_per_warp=32" In binary search triton implementations, (#104007) num_elements_per_warp=32 performs a lot better than larger values. This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try this config if bucketize is present. This is done by adding an extra field to triton_meta which is used by the pointwise autotuning Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5. Before: ``` Eager 0.30088499188423157 ms PT2 0.9296960234642029 ms ``` After: ``` Eager 0.3011910021305084 ms PT2 0.22977299988269806 ms ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 [ghstack-poisoned]
…g with num_elements_per_warp=32" In binary search triton implementations, (#104007) num_elements_per_warp=32 performs a lot better than larger values. This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try this config if bucketize is present. This is done by adding an extra field to triton_meta which is used by the pointwise autotuning Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5. Before: ``` Eager 0.30088499188423157 ms PT2 0.9296960234642029 ms ``` After: ``` Eager 0.3011910021305084 ms PT2 0.22977299988269806 ms ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 [ghstack-poisoned]
… try using config with num_elements_per_warp=32" In binary search triton implementations, (#104007) num_elements_per_warp=32 performs a lot better than larger values. This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try this config if bucketize is present. This is done by adding an extra field to triton_meta which is used by the pointwise autotuning Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5. Before: ``` Eager 0.30088499188423157 ms PT2 0.9296960234642029 ms ``` After: ``` Eager 0.3011910021305084 ms PT2 0.22977299988269806 ms ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 Differential Revision: [D47237103](https://our.internmc.facebook.com/intern/diff/D47237103) [ghstack-poisoned]
…g with num_elements_per_warp=32" In binary search triton implementations, (#104007) num_elements_per_warp=32 performs a lot better than larger values. This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try this config if bucketize is present. This is done by adding an extra field to triton_meta which is used by the pointwise autotuning Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5. Before: ``` Eager 0.30088499188423157 ms PT2 0.9296960234642029 ms ``` After: ``` Eager 0.3011910021305084 ms PT2 0.22977299988269806 ms ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 Differential Revision: [D47237103](https://our.internmc.facebook.com/intern/diff/D47237103) [ghstack-poisoned]
… try using config with num_elements_per_warp=32 (#104456)

In binary search triton implementations (#104007), num_elements_per_warp=32 performs a lot better than larger values. This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try this config if bucketize is present. This is done by adding an extra field to triton_meta which is used by the pointwise autotuning.

Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5.

Before:
```
Eager 0.30088499188423157 ms
PT2   0.9296960234642029 ms
```

After:
```
Eager 0.3011910021305084 ms
PT2   0.22977299988269806 ms
```

Differential Revision: [D47237103](https://our.internmc.facebook.com/intern/diff/D47237103)
Pull Request resolved: #104456
Approved by: https://github.com/eellison
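For illustration, assuming the usual relationship between block size and warp count, the extra config tried for bucketize kernels differs from a typical pointwise config roughly like this (the 256-elements-per-warp default is an assumption for the sketch, not taken from the PR):

```python
import triton

XBLOCK = 1024
# Typical pointwise config: ~256 elements per warp -> 4 warps for this block.
default_cfg = triton.Config({"XBLOCK": XBLOCK}, num_warps=max(XBLOCK // 256, 1))
# Extra config tried when a bucketize op is present: 32 elements per warp
# -> 32 warps here, so each lane's binary search covers far fewer elements.
bucketize_cfg = triton.Config({"XBLOCK": XBLOCK}, num_warps=min(XBLOCK // 32, 32))
```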
TL;DR: This PR is a first step in adding lowerings for torch.bucketize. It adds an initial lowering for this op - but because this implementation is not currently efficient, it registers the lowering for prims._inductor_bucketize. After we make the implementation more efficient, we'll remove prims._inductor_bucketize and add the lowering directly to torch.bucketize.
Background - torch.bucketize: torch.bucketize(values, boundaries, right=False): for an arbitrary tensor of values and a non-decreasing 1D tensor of boundaries that define buckets, it returns the index of the bucket that each value falls in. E.g. for values [0, 1, 2, 3, 4] and boundaries [1, 3], it will return [0, 0, 1, 1, 2].
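As runnable calls (the right=True line is included here to set up the kwarg discussion below):

```python
import torch

values = torch.tensor([0, 1, 2, 3, 4])
boundaries = torch.tensor([1, 3])
print(torch.bucketize(values, boundaries))              # tensor([0, 0, 1, 1, 2])
print(torch.bucketize(values, boundaries, right=True))  # tensor([0, 1, 1, 2, 2])
```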
Implementation: This PR adds a new inductor op called "bucketize". In this PR it only has a triton implementation; for CPU it is a fallback. The triton implementation uses a binary search in triton_helpers.py. This PR also adds a new prim, prims._inductor_bucketize(), for testing purposes, and adds a lowering for this op.

"right": The current behavior of the "right" kwarg in the inductor op is the opposite of the behavior of the torch op. "right" controls how the op treats a value that is equal to one of the boundary values. In the torch op, right=True means "if a value is equal to a boundary value, put it in the bucket to the right". In the inductor op, right=True means "the right boundary of a bucket is closed". These are opposite. I'm open to switching the behavior of the inductor op, but I chose to implement it this way because I think it makes more sense, and I think the torch.bucketize behavior may have been a mistake; it's the opposite of numpy.digitize (see "bucketize() is wrong, and is contradicting itself", #91580). Update: switched the behavior of the inductor bucketize op to match the torch op.
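For reference, a sketch of the vectorized binary search in the spirit of the helper this PR adds to triton_helpers.py (the name, signature, and the handling of right/indexing dtype here are simplified assumptions, not the exact helper):

```python
import triton
import triton.language as tl

@triton.jit
def bucketize_binary_search_sketch(
    values,                          # block of values to bucketize
    boundaries_ptr,                  # the sorted 1D boundaries tensor
    BOUNDARIES_SIZE: tl.constexpr,
    BLOCK_SHAPE: tl.constexpr,
):
    # Each lane binary-searches the shared boundaries array for its own value;
    # `low` converges to the number of boundaries below the value, i.e. the
    # torch.bucketize(right=False) result.
    low = tl.zeros(BLOCK_SHAPE, dtype=tl.int32)
    high = tl.full(BLOCK_SHAPE, BOUNDARIES_SIZE, dtype=tl.int32)

    # BOUNDARIES_SIZE is a compile-time constant, so this Python-level loop
    # unrolls at compile time: the trip count is baked into the kernel.
    full_range = BOUNDARIES_SIZE + 1
    while full_range > 1:
        mid = (high + low) // 2
        mask = mid < BOUNDARIES_SIZE
        boundary = tl.load(boundaries_ptr + mid, mask=mask)
        is_above = (values > boundary) & mask  # use >= for closed right edges
        low = tl.where(is_above, mid + 1, low)
        high = tl.where(is_above, high, mid)
        full_range = (full_range + 1) // 2

    return low
```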
Performance: benchmark script with "values" as a [16, 1024, 1024] float32 tensor and "boundaries" as a [1025] tensor (i.e. defining 1024 buckets).

As is, PT2 is slower than eager; per the #104456 measurements of this same benchmark above, roughly 0.93 ms for PT2 vs. 0.30 ms for eager.

But performance improves significantly if we add an additional pointwise autotuning config (WIP in #104456): PT2 drops to roughly 0.23 ms, faster than eager.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78