[halide-backend] Generate standalone runtime #129025
Conversation
🔗 Helpful links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/129025
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 7644ed3 with merge base bc8883a. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```py
stride = []
prod = 1
for x in self.store_buffer_dimensions[arg.name]:
    stride.append(cexpr(prod))
```
Wondering why Halide puts the fastest-moving dimension at the beginning rather than at the end (like PyTorch).
So can Halide only support contiguous tensors?
This is the convention in image processing -- the opposite of PyTorch.
I see. The GPU layout in Triton also picks this kind of order (fastest-moving dimension first).
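To make the convention concrete, here is a minimal sketch (illustrative only, not code from this PR) of how contiguous strides come out under each ordering for a 16x32 tensor:

```py
shape = (16, 32)

# PyTorch: the last dimension moves fastest, so strides build right-to-left.
torch_strides = []
prod = 1
for extent in reversed(shape):
    torch_strides.append(prod)
    prod *= extent
torch_strides.reverse()
assert torch_strides == [32, 1]

# Halide / image-processing convention: dimension 0 moves fastest,
# so strides build left-to-right over the reversed extents.
halide_dims = tuple(reversed(shape))  # (32, 16)
halide_strides = []
prod = 1
for extent in halide_dims:
    halide_strides.append(prod)
    prod *= extent
assert halide_strides == [1, 32]
```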
```diff
@@ -134,6 +134,9 @@ class HalideInputSpec(typing.NamedTuple):
     ctype: str
     name: str
     shape: Optional[List[str]] = None
+    stride: Optional[List[str]] = None
+    offset: Optional[str] = None
+    alias_of: Optional[str] = None
```
Do Triton and Halide have different assumptions in alias analysis? Is that what necessitates this?
I wonder if we could be more aggressive and not annotate any aliases, given that our codegen already works under that assumption for Triton.
The alias_of field is Halide-specific and used in the next PR. We are adding extra aliases that don't exist in the Triton backend.
The issue is when we want to index a tensor with two different strides. In Triton we just use indexing formulas, but for Halide's dimension-based indexing we introduce aliases with the same data_ptr but different strides.
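As a rough illustration of that point (hypothetical field values, not the PR's actual codegen), two input specs can share one data_ptr while carrying different strides; the NamedTuple below mirrors the fields from the diff above:

```py
from typing import List, NamedTuple, Optional

class HalideInputSpec(NamedTuple):
    # Field layout as in the diff above; simplified for this sketch.
    ctype: str
    name: str
    shape: Optional[List[str]] = None
    stride: Optional[List[str]] = None
    offset: Optional[str] = None
    alias_of: Optional[str] = None

# One underlying allocation, indexed with two different stride patterns:
base = HalideInputSpec("float*", "in_ptr0", shape=["32", "16"], stride=["1", "32"])
view = HalideInputSpec(
    "float*",
    "in_ptr0_view",
    shape=["16", "32"],
    stride=["32", "1"],   # different strides over the same memory
    alias_of="in_ptr0",   # same data_ptr as in_ptr0
)
```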
target.append("strict_float") | ||
|
||
# without this we will initialize cuda once per kernel and hit errors | ||
target.append("no_runtime") |
Do we need this for the CPU device?
Not needed for CPU, but it will give a smaller binary size since we only get one copy of the runtime.
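For context, a simplified sketch of the target-feature selection being discussed (assumed structure; the real codegen in the Halide backend may differ):

```py
# Build a Halide target string feature by feature.
target = ["host"]
target.append("strict_float")
# "no_runtime" strips the Halide runtime from each kernel's object file.
# On CUDA this avoids initializing the device once per kernel; on CPU it is
# not required, but it still shrinks binaries, since the runtime is linked
# once from a shared library instead of being duplicated into every kernel.
target.append("no_runtime")
target_string = "-".join(target)  # e.g. "host-strict_float-no_runtime"
```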
ghstack-source-id: c8dbbf2b0479f97d7190a1baed4fa3964a1216f1
Pull Request resolved: pytorch#129025
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: b25c6c73c3cf0557f2c13040e0ec619d6db30328
Pull Request resolved: pytorch#129025
@pytorchbot revert -m "breaking internal builds" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 10c64c3. Reverted #129025 on behalf of https://github.com/fbgheith due to breaking internal builds (comment on #129025).
@jansel your PR has been successfully reverted.
Prior to this, the generated Halide code was a rather literal translation of the Triton code, with XBLOCK/YBLOCK/RBLOCK and 1D inputs. Halide prefers dimensions, and this 1D indexing triggers a lot of bugs and perf issues. This PR infers dimensions and changes the indexing in the generated code.

Before:
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        xindex = hl.Var('xindex')
        rindex = hl.Var('rindex')
        r1 = rindex
        x0 = xindex
        idom = hl.RDom([hl.Range(0, 16), hl.Range(0, 32)])
        odom = hl.RDom([hl.Range(0, 16)])
        rdom = hl.RDom([hl.Range(0, 32)])
        xindex_idom = idom.x
        xindex_odom = odom.x
        rindex_idom = idom.y
        r1_idom = rindex_idom
        x0_idom = xindex_idom
        x0_odom = xindex_odom
        tmp0 = hl.Func('tmp0')
        tmp0[rindex, xindex] = in_ptr0[r1 + (32*x0)]
        tmp1 = hl.Func('tmp1')
        tmp1[xindex] = hl.maximum(rdom, tmp0[rdom, xindex])
        tmp2 = hl.Func('tmp2')
        tmp2[rindex, xindex] = tmp0[rindex, xindex] - tmp1[xindex]
        tmp3 = hl.Func('tmp3')
        tmp3[rindex, xindex] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[rindex, xindex])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[rindex, xindex])
        tmp4 = hl.Func('tmp4')
        tmp4[xindex] = hl.sum(rdom, tmp3[rdom, xindex])
        tmp5 = hl.Func('tmp5')
        tmp5[rindex, xindex] = tmp3[rindex, xindex] / tmp4[xindex]
        out_ptr3_i0 = hl.Var('out_ptr3_i0')
        out_ptr3_i1 = hl.Var('out_ptr3_i1')
        out_ptr3[out_ptr3_i0, out_ptr3_i1] = hl.cast(out_ptr3.type(), tmp5[out_ptr3_i0, out_ptr3_i1])

        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 512)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

After:
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 2)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        h0 = hl.Var('h0')
        h1 = hl.Var('h1')
        rdom = hl.RDom([hl.Range(0, 32)])
        hr1 = rdom[0]
        tmp0 = hl.Func('tmp0')
        tmp0[h0, h1] = in_ptr0[h0, h1]
        tmp1 = hl.Func('tmp1')
        tmp1[h1] = hl.maximum(rdom, tmp0[hr1, h1])
        tmp2 = hl.Func('tmp2')
        tmp2[h0, h1] = tmp0[h0, h1] - tmp1[h1]
        tmp3 = hl.Func('tmp3')
        tmp3[h0, h1] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[h0, h1])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[h0, h1])
        tmp4 = hl.Func('tmp4')
        tmp4[h1] = hl.sum(rdom, tmp3[hr1, h1])
        tmp5 = hl.Func('tmp5')
        tmp5[h0, h1] = tmp3[h0, h1] / tmp4[h1]
        out_ptr3[h0, h1] = hl.cast(hl.Float(32), tmp5[h0, h1])

        assert g.using_autoscheduler()
        in_ptr0.dim(0).set_min(0)
        in_ptr0.dim(0).set_stride(1)
        in_ptr0.dim(0).set_extent(32)
        in_ptr0.dim(1).set_min(0)
        in_ptr0.dim(1).set_stride(32)
        in_ptr0.dim(1).set_extent(16)
        in_ptr0.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

Pull Request resolved: #129026
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025
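For reference, the Func chain above (max, subtract, exp, sum, divide over the 32-wide reduction) matches a row-wise softmax; here is a sketch of the equivalent eager PyTorch computation, assuming a contiguous (16, 32) float input:

```py
import torch

x = torch.randn(16, 32)
# tmp1 = max, tmp2 = x - max, tmp3 = exp, tmp4 = sum, tmp5 = normalize:
manual = (x - x.amax(dim=-1, keepdim=True)).exp()
manual = manual / manual.sum(dim=-1, keepdim=True)
torch.testing.assert_close(manual, torch.softmax(x, dim=-1))
```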
Pull Request resolved: #127506
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026
Requires halide/Halide#8255
Pull Request resolved: #129036
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506
In theory Halide doesn't need the split-reduction handling we do for Triton, since it can generate multiple kernels.
Pull Request resolved: #129320
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506, #129036
Stack from ghstack (oldest at bottom):
This puts the Halide runtime in a global shared object, rather than copying it into each kernel. Having many copies of the runtime causes many issues with CUDA.
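Conceptually (hypothetical file names below; the actual loading code lives in the backend), the scheme amounts to loading one runtime shared object with global symbol visibility, so every runtime-free kernel resolves its halide_* symbols against that single copy:

```py
import ctypes

# Load the standalone runtime once, exporting its symbols globally.
runtime = ctypes.CDLL("libStandaloneHalideRuntime.so", mode=ctypes.RTLD_GLOBAL)

# Kernels compiled with "no_runtime" contain no runtime of their own;
# their halide_* references bind to the single runtime loaded above.
kernel_a = ctypes.CDLL("halide_kernel_a.so")
kernel_b = ctypes.CDLL("halide_kernel_b.so")
```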
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang