
[halide-backend] Generate standalone runtime #129025

Closed · wants to merge 13 commits

Conversation
Conversation


pytorch-bot bot commented Jun 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129025

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7644ed3 with merge base bc8883a (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

stride = []
prod = 1
for x in self.store_buffer_dimensions[arg.name]:
    stride.append(cexpr(prod))
    prod = prod * x  # continuation inferred from context: the next dim's stride is the running product of extents
Contributor

Wondering why Halide puts the fastest-moving dimension at the beginning rather than at the end (like PyTorch).

So can Halide only support contiguous tensors?

Contributor Author


This is the convention in image processing -- the opposite of PyTorch.

Contributor


I see. The GPU layout in Triton also picks this kind of order (fastest-moving dimension first).
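
To make the two conventions concrete, a small illustrative sketch (names and values are hypothetical, not from this PR):

```py
# The same contiguous 16x32 row-major buffer described both ways.
shape = [16, 32]                   # PyTorch shape: fastest-moving dim listed last
pytorch_strides = [32, 1]          # PyTorch: the last dim has stride 1
halide_dims = [(32, 1), (16, 32)]  # Halide: dim(0) is fastest-moving; (extent, stride) pairs
```

Each Halide dimension carries its own stride, so the convention itself doesn't force contiguity.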

torch/_inductor/codecache.py
@@ -134,6 +134,9 @@ class HalideInputSpec(typing.NamedTuple):
     ctype: str
     name: str
     shape: Optional[List[str]] = None
+    stride: Optional[List[str]] = None
+    offset: Optional[str] = None
+    alias_of: Optional[str] = None
Contributor


Do Triton and Halide have different assumptions in alias analysis? Is that what necessitates this?

I wonder if we could be more aggressive and not annotate any aliases, given that our codegen already works under that assumption for Triton.

Contributor Author


The alias_of thing is Halide-specific and used in the next PR. We are adding extra aliases that don't exist in the Triton backend.

The issue arises when we want to index a tensor with two different strides. In Triton we just use indexing formulas, but for Halide's dimension-based indexing we introduce aliases with the same data_ptr but different strides.
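
As a hedged illustration of that idea (the field values here are made up; only the NamedTuple fields come from the diff above):

```py
# One underlying allocation exposed to the kernel under two stride patterns.
# alias_of tells codegen to reuse in_ptr0's data_ptr rather than pass a new one.
base = HalideInputSpec(ctype="float*", name="in_ptr0",
                       shape=["16", "32"], stride=["32", "1"])
view = HalideInputSpec(ctype="float*", name="in_ptr0_view",
                       shape=["32", "16"], stride=["1", "32"],
                       alias_of="in_ptr0")  # same data_ptr, different strides
```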

target.append("strict_float")

# without this we will initialize cuda once per kernel and hit errors
target.append("no_runtime")
Contributor

Do we need this for the CPU device?

Contributor Author


Not needed for CPU, but it will give a smaller binary size since we only get one copy of the runtime.
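
For context, Halide target strings are dash-joined feature lists, so the flags above end up serialized roughly like this (a minimal sketch; the variable name is assumed from the diff):

```py
# `target` is the list of Halide target feature strings built up by codegen.
target = ["host", "strict_float", "no_runtime"]
target_string = "-".join(target)  # -> "host-strict_float-no_runtime"
```

With no_runtime, each compiled kernel omits its copy of the Halide runtime, which is instead provided once as a standalone library (hence the PR title).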

jansel added the ciflow/trunk label Jun 21, 2024
OnlyFor pushed a commit to OnlyFor/pytorch that referenced this pull request Jun 21, 2024
ghstack-source-id: c8dbbf2b0479f97d7190a1baed4fa3964a1216f1
Pull Request resolved: pytorch#129025
Contributor Author

jansel commented Jun 22, 2024

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

OnlyFor pushed a commit to OnlyFor/pytorch that referenced this pull request Jun 23, 2024
ghstack-source-id: b25c6c73c3cf0557f2c13040e0ec619d6db30328
Pull Request resolved: pytorch#129025
fbgheith (Contributor)

@pytorchbot revert -m "breaking internal builds" -c ghfirst

pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jun 24, 2024
This reverts commit 10c64c3.

Reverted #129025 on behalf of https://github.com/fbgheith due to breaking internal builds.
pytorchmergebot (Collaborator)

@jansel your PR has been successfully reverted.

pytorchmergebot pushed a commit that referenced this pull request Jun 29, 2024
Prior to this, the generated Halide code was a rather literal translation of the Triton code, with XBLOCK/YBLOCK/RBLOCK and 1D inputs. Halide prefers dimensions, and this 1D indexing triggered a lot of bugs and perf issues. This PR infers dimensions and changes the indexing in the generated code.

Before
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        xindex = hl.Var('xindex')
        rindex = hl.Var('rindex')
        r1 = rindex
        x0 = xindex
        idom = hl.RDom([hl.Range(0, 16), hl.Range(0, 32)])
        odom = hl.RDom([hl.Range(0, 16)])
        rdom = hl.RDom([hl.Range(0, 32)])
        xindex_idom = idom.x
        xindex_odom = odom.x
        rindex_idom = idom.y
        r1_idom = rindex_idom
        x0_idom = xindex_idom
        x0_odom = xindex_odom
        tmp0 = hl.Func('tmp0')
        tmp0[rindex, xindex] = in_ptr0[r1 + (32*x0)]
        tmp1 = hl.Func('tmp1')
        tmp1[xindex] = hl.maximum(rdom, tmp0[rdom, xindex])
        tmp2 = hl.Func('tmp2')
        tmp2[rindex, xindex] = tmp0[rindex, xindex] - tmp1[xindex]
        tmp3 = hl.Func('tmp3')
        tmp3[rindex, xindex] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[rindex, xindex])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[rindex, xindex])
        tmp4 = hl.Func('tmp4')
        tmp4[xindex] = hl.sum(rdom, tmp3[rdom, xindex])
        tmp5 = hl.Func('tmp5')
        tmp5[rindex, xindex] = tmp3[rindex, xindex] / tmp4[xindex]
        out_ptr3_i0 = hl.Var('out_ptr3_i0')
        out_ptr3_i1 = hl.Var('out_ptr3_i1')
        out_ptr3[out_ptr3_i0, out_ptr3_i1] = hl.cast(out_ptr3.type(), tmp5[out_ptr3_i0, out_ptr3_i1])

        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 512)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

After
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 2)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        h0 = hl.Var('h0')
        h1 = hl.Var('h1')
        rdom = hl.RDom([hl.Range(0, 32)])
        hr1 = rdom[0]
        tmp0 = hl.Func('tmp0')
        tmp0[h0, h1] = in_ptr0[h0, h1,]
        tmp1 = hl.Func('tmp1')
        tmp1[h1] = hl.maximum(rdom, tmp0[hr1, h1])
        tmp2 = hl.Func('tmp2')
        tmp2[h0, h1] = tmp0[h0, h1] - tmp1[h1]
        tmp3 = hl.Func('tmp3')
        tmp3[h0, h1] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[h0, h1])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[h0, h1])
        tmp4 = hl.Func('tmp4')
        tmp4[h1] = hl.sum(rdom, tmp3[hr1, h1])
        tmp5 = hl.Func('tmp5')
        tmp5[h0, h1] = tmp3[h0, h1] / tmp4[h1]
        out_ptr3[h0, h1,] = hl.cast(hl.Float(32), tmp5[h0, h1])

        assert g.using_autoscheduler()
        in_ptr0.dim(0).set_min(0)
        in_ptr0.dim(0).set_stride(1)
        in_ptr0.dim(0).set_extent(32)
        in_ptr0.dim(1).set_min(0)
        in_ptr0.dim(1).set_stride(32)
        in_ptr0.dim(1).set_extent(16)
        in_ptr0.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

Pull Request resolved: #129026
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025
pytorchmergebot pushed a commit that referenced this pull request Jun 29, 2024
In theory, Halide doesn't need the split-reduction machinery we use for Triton, since it can generate multiple kernels.
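
For reference, a minimal sketch of that split-reduction pattern in plain Python (illustrative only; `split_sum` is a made-up name, not Inductor code):

```py
# A long reduction split into per-block partial results plus a combine step,
# mirroring the two-kernel structure Inductor emits for Triton.
def split_sum(xs, block=4):
    partials = [sum(xs[i:i + block]) for i in range(0, len(xs), block)]  # kernel 1
    return sum(partials)                                                 # kernel 2

assert split_sum(list(range(100))) == sum(range(100))
```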

Pull Request resolved: #129320
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506, #129036
pytorchmergebot pushed a commit that referenced this pull request Jun 29, 2024
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.

Pull Request resolved: #129321
Approved by: https://github.com/shunting314
ghstack dependencies: #126417, #129025, #129026, #127506, #129036, #129320
github-actions bot deleted the gh/jansel/353/head branch July 31, 2024 01:50