
Conversation

ydwu4
Contributor

@ydwu4 ydwu4 commented Dec 9, 2024

Stack from ghstack (oldest at bottom):

For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.
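For illustration, a minimal, hedged sketch of what "preserving the custom triton op" means (not code from this PR): the op name `mylib::add_one`, the kernel, and the module below are made up, and `torch.library.triton_op` / `wrap_triton` (called `capture_triton` in some earlier versions) are used to define the custom op. With this, `torch.export` is expected to keep a single custom-op call in the graph rather than a functional triton hop.

```python
# Hypothetical example; requires a CUDA device with triton installed.
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton


@triton.jit
def add_one_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + 1.0, mask=mask)


@triton_op("mylib::add_one", mutates_args=())
def add_one(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    # wrap_triton lets torch trace through the kernel launch while the op as a
    # whole stays opaque at the exported-program level.
    wrap_triton(add_one_kernel)[grid](x, out, n, BLOCK_SIZE=1024)
    return out


class M(torch.nn.Module):
    def forward(self, x):
        return add_one(x)


ep = torch.export.export(M(), (torch.randn(1024, device="cuda"),))
# Expected: the graph contains a call to torch.ops.mylib.add_one.default,
# not the triton kernel source or a triton_kernel_wrapper_functional node.
print(ep.graph)
```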

The alternative:

If we decompose the custom op into a functional hop and make it a node in the exported program, we need to figure out how to serialize the hop and its arguments, which can be triton.jit-ed Python functions and triton dtypes. This is undesirable because:

  • it can be tedious to maintain a layer that serializes the jit-ed function (e.g. as a string) and the triton dtypes;
  • changes to triton, or to the serialization logic for triton arguments, can be BC-breaking;
  • the exported program would expose an implementation detail (i.e. triton source code) of a specific backend (GPU) to users, which mixes levels of abstraction.

Future plans:

After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a cubin file **on the same machine where export is called**; this stage does the autotuning, removes the triton dependency, and lets the model be served with the cubin (see the sketch below). This guarantees that triton changes won't break BC.
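A hedged sketch of what such a separate aot_compile stage could look like, using AOTInductor packaging; the API spelling (`aoti_compile_and_package`, `aoti_load_package`) and the `model.pt2` path are one possible choice and may differ across releases. It assumes the exported program `ep` from the sketch above.

```python
import torch

# Compile the exported program ahead of time on the export machine; autotuning
# happens here and the triton kernels are baked into the package as cubins.
# (One possible API spelling; check the release you are on.)
pkg_path = torch._inductor.aoti_compile_and_package(ep, package_path="model.pt2")

# Serving side: load and run the package without a triton dependency.
runner = torch._inductor.aoti_load_package(pkg_path)
out = runner(torch.randn(1024, device="cuda"))
```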

In the long term, we may export multiple cubins for the triton op directly.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov


pytorch-bot bot commented Dec 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142426

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures

As of commit 450fab8 with merge base e56768f:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

```python
    )
    from torch._subclasses.functional_tensor import PythonFunctionalizeAPI

    if can_auto_functionalize(op):
```
Contributor Author

@ydwu4 ydwu4 Dec 9, 2024


This special implementation seems simpler but maybe it's better to just do mode.dispatch()?

ydwu4 added a commit that referenced this pull request Dec 10, 2024
@ydwu4 ydwu4 requested review from SherlockNoMad, bdhirsh and zou3519 and removed request for SherlockNoMad December 10, 2024 17:31
Comment on lines 155 to 171
```python
if is_exporting():
    from torch._higher_order_ops.auto_functionalize import (
        can_auto_functionalize,
        do_auto_functionalize,
    )
    from torch._subclasses.functional_tensor import PythonFunctionalizeAPI

    if can_auto_functionalize(op):
        return do_auto_functionalize(op, args, kwargs)

    assert (
        not op._schema.is_mutable
    ), "custom triton op need to be auto_functionalized if it's mutable"
    ctx = PythonFunctionalizeAPI(mode, mode.pre_dispatch)
    unwrapped_args, unwrapped_kwargs = ctx.unwrap_tensors((args, kwargs))
    return ctx.wrap_tensors(op(*unwrapped_args, **unwrapped_kwargs))
else:
```
Contributor

@zou3519 zou3519 Dec 12, 2024


```python
if is_exporting():
    return mode.__torch_dispatch__(op, types, args, kwargs)
```

Contributor Author

@ydwu4 ydwu4 Dec 13, 2024


Updated; this works out of the box.
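For reference, a hedged sketch of the simplified export branch, reconstructed from the quoted hunk and the suggestion above (not necessarily the exact landed code):

```python
# Sketch only: `op`, `mode`, `types`, `args`, `kwargs` come from the
# surrounding functionalization handler for the triton kernel wrapper.
if is_exporting():
    # Defer to FunctionalTensorMode's own dispatch, which auto-functionalizes
    # the custom op when its schema is mutable and otherwise calls it directly.
    return mode.__torch_dispatch__(op, types, args, kwargs)
```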

ydwu4 added a commit that referenced this pull request Dec 13, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@huydhn
Contributor

huydhn commented Dec 19, 2024

@pytorchbot revert -m 'This fails one internal MTIA test, checking with the author that we need to revert and reland this' -c ghfirst

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Dec 19, 2024
…2426)"

This reverts commit 10b9c59.

Reverted #142426 on behalf of https://github.com/huydhn due to This fails one internal MTIA test, checking with the author that we need to revert and reland this ([comment](#142426 (comment)))
@pytorchmergebot
Collaborator

@ydwu4 your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Dec 19, 2024
ydwu4 added a commit that referenced this pull request Jan 6, 2025
ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Jan 10, 2025
…orch#144284)

Summary:

A reland of pytorch#142426.


Test Plan: see new tests.

Differential Revision: D67879685
@ydwu4
Contributor Author

ydwu4 commented Jan 10, 2025

Closed as replaced by #144284.

@ydwu4 ydwu4 closed this Jan 10, 2025
pytorchmergebot pushed a commit that referenced this pull request Jan 11, 2025
…4284)

Summary:
A reland of #142426.


Test Plan: see new tests.

Differential Revision: D67879685

Pull Request resolved: #144284
Approved by: https://github.com/zou3519
@github-actions github-actions bot deleted the gh/ydwu4/189/head branch February 12, 2025 02:04

Labels

ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: inductor, Reverted, topic: not user facing (topic category)
