
Conversation

@zou3519 zou3519 (Contributor) commented Jul 5, 2024

Stack from ghstack (oldest at bottom):

When applied to a triton kernel, capture_triton allows the kernel to be
captured when tracing with make_fx. It does this by transforming calls to
the triton kernel into calls to the triton_kernel_wrapper_mutation HOP,
which make_fx can actually trace into a graph.
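
A minimal sketch of the intended usage is below. The kernel, the wrapper function, and the import path for capture_triton are illustrative assumptions, not code from this PR:

```python
import torch
import triton
import triton.language as tl
from torch.fx.experimental.proxy_tensor import make_fx

# Assumed import location for capture_triton; the actual path may differ.
from torch._higher_order_ops.triton_kernel_wrap import capture_triton


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x, y):
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 16),)
    # Wrapping the kernel lets make_fx record a call to the
    # triton_kernel_wrapper_mutation HOP instead of failing on the raw kernel.
    capture_triton(add_kernel)[grid](x, y, out, n_elements, BLOCK_SIZE=16)
    return out


x = torch.randn(32, device="cuda")
y = torch.randn(32, device="cuda")
gm = make_fx(add)(x, y)
gm.print_readable()  # the traced graph contains triton_kernel_wrapper_mutation
```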

We have two main use cases for this:

  • Non-strict export doesn't use Dynamo, but people want to use it to
    export programs with triton kernels. Since non-strict export uses
    make_fx tracing, this is a necessary step in that direction.
  • People want to write inductor passes that replace a sequence of
    operators with a call to a function that may contain a triton kernel.
    The way these passes work today is that we have an FX graph and want to
    replace a subgraph of it with a new subgraph, which we obtain by calling
    make_fx on the function. This doesn't work on raw triton kernels but
    does work if the kernel is wrapped with capture_triton.

Test Plan:

  • I wrote some manual tests that run make_fx over two of the triton
    kernels in test_triton_kernels. It would be nice to run make_fx through
    all of the tests in that file, but I'm not sure how to do that refactor
    right now.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Jul 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130178

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8b770a8 with merge base a5f816d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

zou3519 added a commit that referenced this pull request Jul 5, 2024
ghstack-source-id: fcc9614
Pull Request resolved: #130178
@zou3519 zou3519 requested review from eellison and oulgen July 8, 2024 13:47
        return grid(meta)

    def check_grid(self, grid):
        if not isinstance(grid, tuple):
@oulgen oulgen (Contributor) commented Jul 8, 2024:

  1. I haven't actually verified this, but is it really only ever a tuple? Would, say, a list fail here? IIRC triton only reads it as decomposition, but good to verify.

  2. Not sure how the orchestration here happens, but if grid is a function, do we already have it fully resolved before here? Can you point to who calls these?

@zou3519 zou3519 (Contributor Author) replied:

  1. I haven't actually verified this, but is it really only ever a tuple? Would, say, a list fail here? IIRC triton only reads it as decomposition, but good to verify.

A list works here, so I'll update the code to convert to tuple (and add a test for it)

  2. Not sure how the orchestration here happens, but if grid is a function, do we already have it fully resolved before here? Can you point to who calls these?

If grid is a function, it's fully resolved before here. The orchestrator resolves the function before calling check_grid; the code for this is over at https://github.com/pytorch/pytorch/pull/130178/files#diff-f5fa7d0e418e91c63fa56d577a92a294c87e19318a3c8b3736ac4254eaa51db9R907-R915
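
For reference, a minimal sketch of the list-to-tuple normalization mentioned above; the helper name and signature are made up for illustration and are not the actual code in this PR:

```python
from typing import Sequence, Tuple


def normalize_grid(grid: Sequence[int]) -> Tuple[int, ...]:
    # Accept both lists and tuples from users and canonicalize to a tuple so
    # downstream checks only need to handle one type. Callable grids are
    # assumed to have already been resolved to a concrete sequence by the
    # caller (see the orchestration code linked above).
    if not isinstance(grid, (list, tuple)):
        raise TypeError(f"grid must be a list or tuple, got {type(grid)}")
    return tuple(grid)
```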

non_graphable_args = {
    k: v
    for k, v in combined_args.items()
    if not isinstance(v, (torch.Tensor, int, float, bool))
}
@oulgen oulgen (Contributor):

How did you come up with this list? Pretty sure string also belongs in it.

@zou3519 zou3519 (Contributor Author) replied:

The list was from memory; I can add string to it. I'll go digging to see if there's a helper function for this in torch.fx already.

@oulgen oulgen (Contributor):

Side table in this file also contains a check for this, maybe use the same one as that one?

@zou3519 zou3519 (Contributor Author) Jul 9, 2024:

Side table in this file also contains a check for this, maybe use the same one as that one?

AFAICT, in the Dynamo path we put all constant args into the side table. In make_fx there's no concept of constant vs. non-constant args, so it doesn't look like we can reuse that check?

@oulgen oulgen (Contributor):

Ah ok, I did that because triton.dtype was passed as an arg and we couldn't put that on the FX graph.

@zou3519 zou3519 (Contributor Author) replied:

torch.fx has a way of checking this -- it encodes the acceptable types in BaseArgumentTypes. I'll use that check.
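
A rough sketch of what that check could look like, assuming BaseArgumentTypes is the typing.Union alias in torch.fx.node; combined_args and the helper name are placeholders rather than the actual diff:

```python
import typing

from torch.fx.node import BaseArgumentTypes

# BaseArgumentTypes is assumed to be a typing.Union of the leaf types an FX
# graph can hold directly (str, int, float, bool, torch.Tensor, torch.dtype,
# ...); get_args turns it into a tuple of classes usable with isinstance.
_GRAPHABLE_TYPES = typing.get_args(BaseArgumentTypes)


def split_graphable_args(combined_args):
    graphable = {k: v for k, v in combined_args.items() if isinstance(v, _GRAPHABLE_TYPES)}
    non_graphable = {k: v for k, v in combined_args.items() if not isinstance(v, _GRAPHABLE_TYPES)}
    return graphable, non_graphable
```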


    if not is_fx_tracing() or torch._dynamo.is_compiling():
        assert self.kernel is not None
        return self.kernel[self.grid](*args, **kwargs)
@oulgen oulgen (Contributor):

be consistent with above

Suggested change:
- return self.kernel[self.grid](*args, **kwargs)
+ return self.kernel.run(*args, **kwargs, grid=self.grid)

@zou3519 zou3519 (Contributor Author) replied:

Getting "TypeError: JITFunction.run() missing 1 required keyword-only argument: 'warmup'" when I use .run here -- going through __call__ doesn't seem to require the warmup arg.

@oulgen oulgen (Contributor):

@zou3519 zou3519 (Contributor Author) replied:

Thank you, that worked
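
For readers following along, a sketch of how the .run spelling can be kept working by passing warmup explicitly, reusing the names from the add_kernel sketch near the top of this page; whether this matches the exact fix applied in the PR is an assumption:

```python
# On the triton version in question, warmup is a required keyword-only
# argument of JITFunction.run but not of __call__; warmup=False launches the
# kernel normally instead of only compiling it.
add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=16)
add_kernel.run(x, y, out, n_elements, BLOCK_SIZE=16, grid=grid, warmup=False)
```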

zou3519 added a commit that referenced this pull request Jul 9, 2024
ghstack-source-id: 4b6b203
Pull Request resolved: #130178
@zou3519 zou3519 added the ciflow/trunk and topic: not user facing labels Jul 9, 2024
zou3519 added a commit that referenced this pull request Jul 9, 2024
ghstack-source-id: 5292d93
Pull Request resolved: #130178
@zou3519 zou3519 added the ci-no-td (Do not run TD on this PR) label Jul 10, 2024
@zou3519 zou3519 (Contributor Author) commented Jul 10, 2024

@pytorchbot merge -f "rocm tests not ending; everything else passed"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
Pull Request resolved: pytorch#130178
Approved by: https://github.com/oulgen
ghstack dependencies: pytorch#130177
@github-actions github-actions bot deleted the gh/zou3519/1018/head branch August 10, 2024 01:58

Labels: ci-no-td, ciflow/trunk, Merged, module: inductor, topic: not user facing
