[inductor] Only call triton.compile in worker processes #144288
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/144288. Note: links to docs will display an error until the docs builds have completed. ❌ 3 new failures as of commit 6d26cf6 with merge base 10e4d3a.
These are refactors needed for #144288. Pull Request resolved: #144044. Approved by: https://github.com/eellison
Compile-time impacts of this PR are mixed: it shows some pretty big compile-time wins in some models, which tells me there is something here; however, it seems to slow down other models. I suspect this is due to:
This PR extends #144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in-memory future cache, considerably increasing parallelism. Because LambdaFutures are much faster to kick off than TritonFutures (they don't need to load from TritonCodeCache at all), the time spent kicking off these worker jobs is minimal during inductor codegen.

Differential Revision: [D69013710](https://our.internmc.facebook.com/intern/diff/D69013710/)
ghstack-source-id: 264436138
Pull Request resolved: #146334
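The future-cache pattern described above can be sketched as follows. This is a minimal illustration, not PyTorch's actual implementation: the class and function names (`CompiledKernelCache`, `fake_compile`, the kernel name) are hypothetical, and it uses a thread pool where the PR uses worker processes.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def fake_compile(source: str) -> str:
    # Stand-in for the expensive compile step (triton.compile runs on
    # a worker in the actual PR); here it just tags the source.
    return "compiled:" + source


class CompiledKernelCache:
    # Hypothetical sketch of the in-memory future cache: the first
    # request for a kernel kicks off compilation on the pool and stores
    # the Future; later requests for the same source return the
    # already-running Future instead of starting a fresh compile.
    def __init__(self, pool: ThreadPoolExecutor) -> None:
        self._pool = pool
        self._futures: dict[str, Future] = {}

    def get_or_compile(self, source: str) -> Future:
        if source not in self._futures:
            # Called eagerly during codegen, as soon as we know the
            # kernel will be needed.
            self._futures[source] = self._pool.submit(fake_compile, source)
        return self._futures[source]


pool = ThreadPoolExecutor(max_workers=2)
cache = CompiledKernelCache(pool)
first = cache.get_or_compile("triton_poi_fused_add_0")
second = cache.get_or_compile("triton_poi_fused_add_0")  # cache hit
assert first is second
print(first.result())  # -> compiled:triton_poi_fused_add_0
pool.shutdown()
```

The key point is that the second lookup (when the generated inductor code is loaded) never recompiles: it finds a Future whose work has been in flight since codegen, which is where the extra parallelism comes from.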
### Big idea

This PR extends #144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in-memory future cache, considerably increasing parallelism.

### Implementation overview

In total, the diff does the following:
- Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes.
- Now that triton.compile() isn't called on the main process, calls TritonBundler on all compiled kernels when we get them back from workers.
- Extends @eellison's future cache to a class, mostly as a refactor.
- Finally, calls async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up, so the subsequent async_compile.triton call that occurs after codegen cache-hits on cold start.

In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler automatically populates the in-memory cache with the existing triton kernels on warm start, avoiding calling triton altogether on warm starts. Because LambdaFutures are much faster to kick off than TritonFutures (they don't need to load from TritonCodeCache at all), the time spent kicking off these worker jobs is minimal during inductor codegen.

### Can we split the diff for easier review?

It's best if this diff lands atomically with all of these changes, as the ahead-of-time codegen compile is only performant if we replace TritonFuture with LambdaFuture (we don't need to load the triton kernel on the main process). However, I've made a diff stack for easier reviewing:
- D69070048 - Run async_compile.triton ahead of time in Scheduler.codegen
- D68633454 - Only call triton in worker process

Differential Revision: [D69013710](https://our.internmc.facebook.com/intern/diff/D69013710/)
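Why a LambdaFuture-style wrapper is so cheap to kick off can be sketched like this. Again this is an illustrative sketch, not the real class: the names (`LambdaFuture`, `compile_on_worker`, the kernel name) are hypothetical, and a thread pool stands in for the PR's worker processes. The idea is that nothing is loaded on the main process at kickoff time; the cheap post-processing callable runs once, the first time result() is requested.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_on_worker(source: str) -> bytes:
    # Stand-in for the expensive triton.compile call that the PR moves
    # entirely onto worker processes.
    return ("binary-for:" + source).encode()


class LambdaFuture:
    # Hypothetical sketch: wraps the worker's Future together with a
    # cheap callable applied on the main process. Constructing one does
    # no loading at all, so kicking many off during codegen is nearly
    # free; the deferred work runs lazily on first result().
    def __init__(self, future, post_fn):
        self._future = future
        self._post_fn = post_fn
        self._cached = None

    def result(self):
        if self._cached is None:
            self._cached = self._post_fn(self._future.result())
        return self._cached


pool = ThreadPoolExecutor(max_workers=2)
raw = pool.submit(compile_on_worker, "triton_poi_fused_mul_1")
# Kickoff on the main process is trivial: no code-cache load happens here.
fut = LambdaFuture(raw, post_fn=lambda blob: blob.decode())
print(fut.result())  # -> binary-for:triton_poi_fused_mul_1
pool.shutdown()
```

Contrast this with a TritonFuture-style kickoff, which (per the description above) has to load from TritonCodeCache on the main process before the job can even be submitted; that load is exactly the cost the LambdaFuture conversion removes from codegen.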
Pull Request resolved: #146417
ghstack-source-id: 265162875
Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/)
…ff worker processes earlier, during inductor codegen" ### Big idea This PR extends #144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism. ### Implementation Overview In total, the diff does the following: - Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes - Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers - Extend eellison's future cache to a class, mostly as a refactor - Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start. In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in memory cache on warm start with the existing triton kernels, avoiding calling triton altogether on warm starts. Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen. ### Can we split the diff for easier review? It's best if this diff lands atomically with all of these changes, as doing the ahead of time codegen compile is only performant if we replace TritonFuture with LambdaFuture(as we don't need to load the triton kernel on the main process). 
However, I've made a diff stack for easier reviewing here: - D69070048 - Run async_compile.triton ahead of time in Scheduler.codegen - D68633454 - Only call triton in worker process Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/) cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames desertfire chauhang aakhundov [ghstack-poisoned]
…ses earlier, during inductor codegen"

### Big idea
This PR extends #144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in-memory future cache, considerably increasing the parallelism.

### Implementation overview
In total, the diff does the following:
- Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes
- Now that triton.compile() isn't called on the main process, calls TritonBundler on all compiled kernels when we get them back from workers
- Extends @eellison's future cache into a class, mostly as a refactor
- Finally, calls async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up, so that the subsequent async_compile.triton call that occurs after codegen cache-hits on cold start

In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in-memory cache with the existing triton kernels, avoiding calling triton altogether on warm starts. Because LambdaFutures are much faster to kick off than TritonFutures (there is no need to load from TritonCodeCache at all), the time spent kicking off these worker jobs is pretty minimal for inductor codegen.

### Can we split the diff for easier review?
It's best if this diff lands atomically with all of these changes, since doing the ahead-of-time codegen compile is only performant once TritonFuture is replaced with LambdaFuture (we don't need to load the triton kernel on the main process).

However, I've made a diff stack for easier reviewing here:
- D69070048 - Run async_compile.triton ahead of time in Scheduler.codegen
- D68633454 - Only call triton in worker process

Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/)
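The future-cache flow described above can be sketched as follows. This is a minimal illustration with hypothetical names (`CompiledKernelCache`, `compile_kernel`) rather than the actual PyTorch classes: codegen submits each kernel's source to the worker pool as soon as the kernel is emitted, and the later load step simply collects the already-running future from the in-memory cache.

```python
from concurrent.futures import Future, ProcessPoolExecutor


def compile_kernel(source_code: str) -> str:
    # Stand-in for triton.compile running in a worker process;
    # returns a picklable "compiled" artifact.
    return f"compiled({source_code})"


class CompiledKernelCache:
    """In-memory cache mapping kernel source -> in-flight Future."""

    def __init__(self, pool):
        self._pool = pool
        self._futures: dict[str, Future] = {}

    def submit(self, source_code: str) -> Future:
        # Called during codegen: kick off compilation as soon as the
        # kernel source is known. Idempotent, so the load-time call
        # after codegen cache-hits on the same future.
        if source_code not in self._futures:
            self._futures[source_code] = self._pool.submit(
                compile_kernel, source_code
            )
        return self._futures[source_code]


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        cache = CompiledKernelCache(pool)
        f1 = cache.submit("kernel_a")  # kicked off during "codegen"
        f2 = cache.submit("kernel_a")  # later "load": same future, no recompile
        assert f1 is f2
        print(f1.result())  # compiled(kernel_a)
```

Because `submit` is idempotent on the kernel source, the post-codegen compile call becomes a cache hit, which is where the extra parallelism comes from.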
…, during inductor codegen (#146417)

Pull Request resolved: #146417
Approved by: https://github.com/jansel
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`. |
Stack from ghstack (oldest at bottom):
Before this PR, parallel compile would call triton.compile once in a subprocess (to warm the on-disk cache) and then again in the parent process, where it would load the result via a cache hit. With this PR, triton.compile is called once in the subprocess and the compiled result is pickled back to the parent.
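The before/after behavior can be illustrated with a toy stand-in (hypothetical `compile_kernel`, `old_strategy`, `new_strategy` — not triton's actual API): the old path invokes the compiler twice (the worker warms the cache, then the parent compiles again to load), while the new path pickles the worker's compiled object straight back to the parent.

```python
from concurrent.futures import ProcessPoolExecutor


def compile_kernel(src: str) -> str:
    # Stand-in for triton.compile; the real call is expensive.
    return f"binary<{src}>"


def old_strategy(pool, src: str) -> str:
    # Before: the worker compiles only to warm the on-disk cache
    # (its result is discarded)...
    pool.submit(compile_kernel, src).result()
    # ...then the parent calls compile *again* to load the result.
    return compile_kernel(src)


def new_strategy(pool, src: str) -> str:
    # After: the worker compiles once and the result is pickled back;
    # the parent never calls compile itself.
    return pool.submit(compile_kernel, src).result()


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        assert old_strategy(pool, "k") == new_strategy(pool, "k")
```

Note that the new path requires the compiled artifact to be picklable, since it now crosses the process boundary instead of being rebuilt from the cache in the parent.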
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov