[halide-backend] Generate standalone runtime #129025
Conversation
🔗 Helpful links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/129025
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 7644ed3 with merge base bc8883a. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```py
stride = []
prod = 1
for x in self.store_buffer_dimensions[arg.name]:
    stride.append(cexpr(prod))
```
Wondering why Halide puts the fastest-moving dimension at the beginning rather than at the end (like PyTorch).
So can Halide only support contiguous tensors?
This is the convention in image processing -- the opposite of PyTorch.
I see. The GPU layout in Triton also picks this kind of order (fastest-moving dimension first).
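To make the convention concrete, here is a minimal sketch (illustrative only, not code from this PR) of how contiguous strides come out under each ordering for a 16x32 tensor:

```py
shape = (16, 32)

# PyTorch: the last dimension moves fastest, so strides build right-to-left.
torch_strides = []
prod = 1
for extent in reversed(shape):
    torch_strides.append(prod)
    prod *= extent
torch_strides.reverse()
assert torch_strides == [32, 1]

# Halide / image-processing convention: dimension 0 moves fastest,
# so strides build left-to-right over the reversed extents.
halide_dims = tuple(reversed(shape))  # (32, 16)
halide_strides = []
prod = 1
for extent in halide_dims:
    halide_strides.append(prod)
    prod *= extent
assert halide_strides == [1, 32]
```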
```diff
@@ -134,6 +134,9 @@ class HalideInputSpec(typing.NamedTuple):
     ctype: str
     name: str
     shape: Optional[List[str]] = None
+    stride: Optional[List[str]] = None
+    offset: Optional[str] = None
+    alias_of: Optional[str] = None
```
Do Triton and Halide have different assumptions in alias analysis? Is that what necessitates this?
I wonder if we could be more aggressive and not annotate any aliases, given that our codegen already works under that assumption for Triton.
The alias_of field is Halide-specific and used in the next PR. We are adding extra aliases that don't exist in the Triton backend.
The issue is when we want to index a tensor with two different strides. In Triton we just use indexing formulas, but for Halide's dimension-based indexing we introduce aliases with the same data_ptr but different strides.
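As a rough illustration of that point (hypothetical field values, not the PR's actual codegen), two input specs can share one data_ptr while carrying different strides; the NamedTuple below mirrors the fields from the diff above:

```py
from typing import List, NamedTuple, Optional

class HalideInputSpec(NamedTuple):
    # Field layout as in the diff above; simplified for this sketch.
    ctype: str
    name: str
    shape: Optional[List[str]] = None
    stride: Optional[List[str]] = None
    offset: Optional[str] = None
    alias_of: Optional[str] = None

# One underlying allocation, indexed with two different stride patterns:
base = HalideInputSpec("float*", "in_ptr0", shape=["32", "16"], stride=["1", "32"])
view = HalideInputSpec(
    "float*",
    "in_ptr0_view",
    shape=["16", "32"],
    stride=["32", "1"],   # different strides over the same memory
    alias_of="in_ptr0",   # same data_ptr as in_ptr0
)
```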
target.append("strict_float") | ||
|
||
# without this we will initialize cuda once per kernel and hit errors | ||
target.append("no_runtime") |
Do we need this for the CPU device?
Not needed for CPU, but it will give a smaller binary size since we only get one copy of the runtime.
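For context, a simplified sketch of the target-feature selection being discussed (assumed structure; the real codegen in the Halide backend may differ):

```py
# Build a Halide target string feature by feature.
target = ["host"]
target.append("strict_float")
# "no_runtime" strips the Halide runtime from each kernel's object file.
# On CUDA this avoids initializing the device once per kernel; on CPU it is
# not required, but it still shrinks binaries, since the runtime is linked
# once from a shared library instead of being duplicated into every kernel.
target.append("no_runtime")
target_string = "-".join(target)  # e.g. "host-strict_float-no_runtime"
```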
ghstack-source-id: c8dbbf2b0479f97d7190a1baed4fa3964a1216f1
Pull Request resolved: pytorch#129025
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: b25c6c73c3cf0557f2c13040e0ec619d6db30328
Pull Request resolved: pytorch#129025
@pytorchbot revert -m "breaking internal builds" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 10c64c3. Reverted #129025 on behalf of https://github.com/fbgheith due to breaking internal builds (comment on #129025).
@jansel your PR has been successfully reverted.
Prior to this, the generated Halide code was a rather literal translation of the Triton code, with XBLOCK/YBLOCK/RBLOCK and 1D inputs. Halide prefers dimensions, and this 1D indexing triggers a lot of bugs and perf issues. This PR infers dimensions and changes the indexing in the generated code.

Before:
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        xindex = hl.Var('xindex')
        rindex = hl.Var('rindex')
        r1 = rindex
        x0 = xindex
        idom = hl.RDom([hl.Range(0, 16), hl.Range(0, 32)])
        odom = hl.RDom([hl.Range(0, 16)])
        rdom = hl.RDom([hl.Range(0, 32)])
        xindex_idom = idom.x
        xindex_odom = odom.x
        rindex_idom = idom.y
        r1_idom = rindex_idom
        x0_idom = xindex_idom
        x0_odom = xindex_odom
        tmp0 = hl.Func('tmp0')
        tmp0[rindex, xindex] = in_ptr0[r1 + (32*x0)]
        tmp1 = hl.Func('tmp1')
        tmp1[xindex] = hl.maximum(rdom, tmp0[rdom, xindex])
        tmp2 = hl.Func('tmp2')
        tmp2[rindex, xindex] = tmp0[rindex, xindex] - tmp1[xindex]
        tmp3 = hl.Func('tmp3')
        tmp3[rindex, xindex] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[rindex, xindex])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[rindex, xindex])
        tmp4 = hl.Func('tmp4')
        tmp4[xindex] = hl.sum(rdom, tmp3[rdom, xindex])
        tmp5 = hl.Func('tmp5')
        tmp5[rindex, xindex] = tmp3[rindex, xindex] / tmp4[xindex]
        out_ptr3_i0 = hl.Var('out_ptr3_i0')
        out_ptr3_i1 = hl.Var('out_ptr3_i1')
        out_ptr3[out_ptr3_i0, out_ptr3_i1] = hl.cast(out_ptr3.type(), tmp5[out_ptr3_i0, out_ptr3_i1])

        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 512)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

After:
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 2)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        h0 = hl.Var('h0')
        h1 = hl.Var('h1')
        rdom = hl.RDom([hl.Range(0, 32)])
        hr1 = rdom[0]
        tmp0 = hl.Func('tmp0')
        tmp0[h0, h1] = in_ptr0[h0, h1]
        tmp1 = hl.Func('tmp1')
        tmp1[h1] = hl.maximum(rdom, tmp0[hr1, h1])
        tmp2 = hl.Func('tmp2')
        tmp2[h0, h1] = tmp0[h0, h1] - tmp1[h1]
        tmp3 = hl.Func('tmp3')
        tmp3[h0, h1] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[h0, h1])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[h0, h1])
        tmp4 = hl.Func('tmp4')
        tmp4[h1] = hl.sum(rdom, tmp3[hr1, h1])
        tmp5 = hl.Func('tmp5')
        tmp5[h0, h1] = tmp3[h0, h1] / tmp4[h1]
        out_ptr3[h0, h1] = hl.cast(hl.Float(32), tmp5[h0, h1])

        assert g.using_autoscheduler()
        in_ptr0.dim(0).set_min(0)
        in_ptr0.dim(0).set_stride(1)
        in_ptr0.dim(0).set_extent(32)
        in_ptr0.dim(1).set_min(0)
        in_ptr0.dim(1).set_stride(32)
        in_ptr0.dim(1).set_extent(16)
        in_ptr0.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

Pull Request resolved: #129026
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025
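For reference, the Func chain above (max, subtract, exp, sum, divide over the 32-wide reduction) matches a row-wise softmax; here is a sketch of the equivalent eager PyTorch computation, assuming a contiguous (16, 32) float input:

```py
import torch

x = torch.randn(16, 32)
# tmp1 = max, tmp2 = x - max, tmp3 = exp, tmp4 = sum, tmp5 = normalize:
manual = (x - x.amax(dim=-1, keepdim=True)).exp()
manual = manual / manual.sum(dim=-1, keepdim=True)
torch.testing.assert_close(manual, torch.softmax(x, dim=-1))
```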
Pull Request resolved: #127506
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026
Requires halide/Halide#8255
Pull Request resolved: #129036
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506
In theory Halide doesn't need the split-reduction handling we do for Triton, since it can generate multiple kernels.
Pull Request resolved: #129320
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506, #129036
Stack from ghstack (oldest at bottom):
This puts the Halide runtime in a global shared object, rather than copying it into each kernel. Having many copies of the runtime causes many issues with CUDA.
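Conceptually (hypothetical file names below; the actual loading code lives in the backend), the scheme amounts to loading one runtime shared object with global symbol visibility, so every runtime-free kernel resolves its halide_* symbols against that single copy:

```py
import ctypes

# Load the standalone runtime once, exporting its symbols globally.
runtime = ctypes.CDLL("libStandaloneHalideRuntime.so", mode=ctypes.RTLD_GLOBAL)

# Kernels compiled with "no_runtime" contain no runtime of their own;
# their halide_* references bind to the single runtime loaded above.
kernel_a = ctypes.CDLL("halide_kernel_a.so")
kernel_b = ctypes.CDLL("halide_kernel_b.so")
```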
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang