Only call triton in worker process; run async_compile.triton ahead of time in Scheduler.codegen #146342

jamesjwu · 2025-02-03T20:48:17Z

Stack from ghstack (oldest at bottom):

(to be filled)

Big idea

This PR extends #144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism.

Implementation Overview

In total, the diff does the following:

Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes
Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers
Extend @eellison's future cache to a class, mostly as a refactor
Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start.

In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in memory cache on warm start with the existing triton kernels, avoiding calling triton altogether on warm starts.

Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen.

Can we split the diff for easier review?

It's best if this diff lands atomically with all of these changes, as doing the ahead of time codegen compile is only performant if we replace TritonFuture with LambdaFuture(as we don't need to load the triton kernel on the main process). However, I've made a diff stack for easier reviewing here:
D69070048 - Run async_compile.triton ahead of time in Scheduler.codegen
D68633454 - Only call triton in worker process

Differential Revision: D69013710

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang @aakhundov

@eellison

… time in Scheduler.codegen ### Big idea This PR extends #144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism. ### Implementation Overview In total, the diff does the following: - Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes - Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers - Extend @eellison's future cache to a class, mostly as a refactor - Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start. In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in memory cache on warm start with the existing triton kernels, avoiding calling triton altogether on warm starts. Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen. ### Can we split the diff for easier review? It's best if this diff lands atomically with all of these changes, as doing the ahead of time codegen compile is only performant if we replace TritonFuture with LambdaFuture(as we don't need to load the triton kernel on the main process). However, I've made a diff stack for easier reviewing here: D69070048 - Run async_compile.triton ahead of time in Scheduler.codegen D68633454 - Only call triton in worker process Differential Revision: [D69013710](https://our.internmc.facebook.com/intern/diff/D69013710/) [ghstack-poisoned]

pytorch-bot · 2025-02-03T20:48:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146342

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2025-02-03T20:48:51Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

pytorch-bot bot added ciflow/inductor module: inductor labels Feb 3, 2025

jamesjwu closed this Feb 3, 2025

github-actions bot deleted the gh/jamesjwu/103/head branch March 6, 2025 02:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Only call triton in worker process; run async_compile.triton ahead of time in Scheduler.codegen #146342

Only call triton in worker process; run async_compile.triton ahead of time in Scheduler.codegen #146342

Uh oh!

jamesjwu commented Feb 3, 2025 •

edited by pytorch-bot bot

Loading

Uh oh!

pytorch-bot bot commented Feb 3, 2025

Uh oh!

github-actions bot commented Feb 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Only call triton in worker process; run async_compile.triton ahead of time in Scheduler.codegen #146342

Only call triton in worker process; run async_compile.triton ahead of time in Scheduler.codegen #146342

Uh oh!

Conversation

jamesjwu commented Feb 3, 2025 • edited by pytorch-bot bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Big idea

Implementation Overview

Can we split the diff for easier review?

Uh oh!

pytorch-bot bot commented Feb 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146342

Uh oh!

github-actions bot commented Feb 3, 2025

This PR needs a release notes: label

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jamesjwu commented Feb 3, 2025 •

edited by pytorch-bot bot

Loading

This PR needs a `release notes:` label