@jamesjwu jamesjwu commented Feb 3, 2025

Stack from ghstack (oldest at bottom):

  • (to be filled)

Big idea

This PR extends #144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need the kernel. Then, when loading the generated inductor code, we can simply read from our in-memory future cache, considerably increasing parallelism.
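A minimal sketch of the idea, using hypothetical names (`CompiledTritonKernels`, `compile_kernel`) and a thread pool as a stand-in for the real worker processes; this is not the actual torch._inductor API:

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for triton.compile running inside a worker.
def compile_kernel(source: str) -> str:
    return f"compiled({source})"

class CompiledTritonKernels:
    """In-memory cache mapping kernel source -> Future of its compiled form."""

    def __init__(self, pool: ThreadPoolExecutor) -> None:
        self._pool = pool
        self._cache: dict[str, Future] = {}

    def kickoff(self, source: str) -> Future:
        # Called during codegen, as soon as we know the kernel is needed.
        if source not in self._cache:
            self._cache[source] = self._pool.submit(compile_kernel, source)
        return self._cache[source]

    def get(self, source: str) -> str:
        # Called when loading the generated inductor code; by then the
        # worker has often already finished, so this is just a cache read.
        return self.kickoff(source).result()
```

The second `async_compile.triton`-style call for the same kernel source finds the in-flight future rather than starting compilation from scratch.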

Implementation Overview

In total, the diff does the following:

  • Converts TritonFuture to LambdaFuture, so that triton.compile is called only on worker processes
  • Now that triton.compile() isn't called on the main process, calls TritonBundler on each compiled kernel when we get it back from a worker
  • Extends @eellison's future cache into a class, mostly as a refactor
  • Finally, calls async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up, so that the subsequent async_compile.triton call after codegen hits the cache on cold start
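The second bullet can be sketched as follows. Since triton.compile now runs only in workers, the main process collects each compiled kernel as its future resolves; `TritonBundler` and `compile_in_worker` here are illustrative stand-ins, not the real classes (and a thread pool stands in for worker processes):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class TritonBundler:
    """Stand-in bundler: accumulates compiled-kernel payloads on the main
    process so they can be persisted for warm starts."""

    def __init__(self) -> None:
        self.bundle: list[dict] = []

    def collect(self, compiled: dict) -> None:
        self.bundle.append(compiled)

# Stand-in for the worker-side compile: returns an artifact payload.
def compile_in_worker(src: str) -> dict:
    return {"src": src, "binary": f"cubin-for-{src}"}

pool = ThreadPoolExecutor(max_workers=2)
bundler = TritonBundler()
futures = [pool.submit(compile_in_worker, s) for s in ("k0", "k1")]
for fut in as_completed(futures):
    bundler.collect(fut.result())  # runs on the main process, post-worker
```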

In the diffs after this, I will extend CompiledTritonKernels so that, on a warm start, TritonBundler automatically populates the in-memory cache with the existing triton kernels, avoiding calling triton altogether.

Because LambdaFutures are much faster to kick off than TritonFutures (they don't need to load from TritonCodeCache at all), the time spent kicking off these worker jobs during inductor codegen is minimal.
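A rough illustration of why the kickoff is cheap, with assumed names rather than the real inductor classes: a TritonFuture-style object has to load the kernel from a code cache on the main process before it even exists, whereas a LambdaFuture just stores a callable whose work happens in the worker.

```python
from concurrent.futures import ThreadPoolExecutor

class LambdaFuture:
    """Minimal sketch: wraps a zero-arg callable yielding the worker's
    result. Constructing one does no compilation or code-cache loading
    on the main process."""

    def __init__(self, result_fn) -> None:
        self._result_fn = result_fn  # stored, not invoked, at creation

    def result(self):
        return self._result_fn()  # resolves the worker's output lazily

pool = ThreadPoolExecutor(max_workers=1)
raw = pool.submit(lambda: "compiled-kernel")  # heavy lifting in the worker
fut = LambdaFuture(raw.result)                # near-free on the main process
```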

Can we split the diff for easier review?

It's best if this diff lands atomically with all of these changes, since compiling ahead of time during codegen is only performant once TritonFuture is replaced with LambdaFuture (so we don't need to load the triton kernel on the main process). However, I've made a diff stack for easier reviewing here:
D69070048 - Run async_compile.triton ahead of time in Scheduler.codegen
D68633454 - Only call triton in worker process

Differential Revision: [D69013710](https://our.internmc.facebook.com/intern/diff/D69013710/)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang @aakhundov


pytorch-bot bot commented Feb 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146342

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


github-actions bot commented Feb 3, 2025

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@jamesjwu jamesjwu closed this Feb 3, 2025
@github-actions github-actions bot deleted the gh/jamesjwu/103/head branch March 6, 2025 02:07
