Use ProcessPoolExecutor for triton compiles #1666
Conversation
Really cool! Learnt a lot from the code comments.
Would it make sense to switch to […]? You can test by creating a CUDA context first (with […]) and then starting the dynamo stuff:
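A minimal sketch of what such a test might look like (hypothetical — it assumes the pre-migration torchdynamo package and the inductor backend; the exact snippet the commenter had in mind isn't shown):

```python
import torch
import torchdynamo  # assumption: the old standalone torchdynamo package

# Create a CUDA context in the parent process first...
torch.cuda.init()
x = torch.randn(32, 32, device="cuda")

# ...then start the dynamo/inductor compile. The compile worker processes
# themselves never touch CUDA, so this should still work.
@torchdynamo.optimize("inductor")
def fn(a):
    return torch.relu(a @ a)

print(fn(x).shape)
```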
Triton also caches things under […].
I tested this with CUDA being live and it still works. The reason is that the Triton worker processes don't need CUDA, so they don't try to reinitialize it.
I checked, and it doesn't seem to affect the timing whether I delete this directory or not, whereas there is a clear difference in Triton compile times when the /tmp/torchinductor directory is emptied.
Benchmark results for this patch (the full description is below):

39.613s - warm
41.290s - cold, this patch
2m53.197s - cold, single threaded
1m7.092s - cold, old setup with n = 8 (its best config)
We have migrated this repository. More details and instructions to port this PR over can be found in #1588.
This patch significantly improves parallel compilation performance for Triton kernels by using ProcessPoolExecutor to create a persistent pool of compilation workers.
Previously, os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces the worker threads with a pool of processes that do the raw compilation, and keeps the serial work for everything else on the main thread. That other work couldn't be parallelized anyway, since it is mostly in Python.
In cold start situations, the time to get the workers started can be a significant portion of the total time. This patch starts the workers earlier so they are ready to perform compilation (see code comments) by the time dynamo gets to that point.
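To make the mechanism concrete, here is a minimal, hypothetical sketch of the pattern described above — a persistent process pool started early, with the raw compile work submitted to workers and the remaining serial work kept on the main thread. This is not the actual torchinductor code; names like CompileWorkerPool and _compile_kernel are invented for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def _compile_kernel(source: str) -> str:
    # Stand-in for the expensive, parallelizable part of a Triton compile.
    # It must be a module-level function so it can be pickled for the workers.
    return source.upper()

class CompileWorkerPool:
    def __init__(self, workers: int = os.cpu_count() or 1):
        # Created eagerly (e.g. at warm-up time) so process startup cost is
        # paid before dynamo actually needs compilation results.
        self._pool = ProcessPoolExecutor(max_workers=workers)

    def submit(self, source: str):
        return self._pool.submit(_compile_kernel, source)

    def shutdown(self):
        self._pool.shutdown()

if __name__ == "__main__":
    pool = CompileWorkerPool()
    futures = [pool.submit(f"kernel_{i}") for i in range(8)]
    # Serial, mostly-Python post-processing stays on the main thread.
    results = [f.result() for f in futures]
    pool.shutdown()
    print(results)
```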
I've only tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and a cold compilation. (Cold compilation is measured after running rm -rf /tmp/torchinductor_$USER.)
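For reference, a small hypothetical helper that clears the inductor on-disk cache the same way the rm -rf command above does, so cold-compile timings can be reproduced from Python:

```python
import getpass
import shutil
from pathlib import Path

# Equivalent to `rm -rf /tmp/torchinductor_$USER` from the description above.
cache_dir = Path(f"/tmp/torchinductor_{getpass.getuser()}")
shutil.rmtree(cache_dir, ignore_errors=True)
```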