Use ProcessPoolExecutor for triton compiles #1666
Conversation
Really cool! Learnt a lot from the code comments.
Would it make sense to switch to […]? You can test by creating a CUDA context first (with […]) and then starting the dynamo stuff:
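A minimal sketch of what such a test might look like (hypothetical — it assumes the pre-migration torchdynamo package and the inductor backend; the exact snippet the commenter had in mind isn't shown):

```python
import torch
import torchdynamo  # assumption: the old standalone torchdynamo package

# Create a CUDA context in the parent process first...
torch.cuda.init()
x = torch.randn(32, 32, device="cuda")

# ...then start the dynamo/inductor compile. The compile worker processes
# themselves never touch CUDA, so this should still work.
@torchdynamo.optimize("inductor")
def fn(a):
    return torch.relu(a @ a)

print(fn(x).shape)
```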
Triton also caches things under […].
I tested this with CUDA being live and it still works. The reason is that the Triton worker processes don't need CUDA, so they don't try to reinitialize it.
I checked, and it doesn't seem to affect the timing whether I delete this directory or not, whereas there is a clear difference in Triton compile times when the /tmp/torchinductor directory is emptied.
Benchmark results for this patch (the full description is below):

39.613s - warm
41.290s - cold, this patch
2m53.197s - cold, single threaded
1m7.092s - cold, old setup with n = 8 (its best config)
We have migrated this repository. More details and instructions to port this PR over can be found in #1588.
This patch significantly improves parallel compilation performance for Triton kernels by using ProcessPoolExecutor to create a persistent pool of compilation workers.
Previously, os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces the worker threads with a pool of processes that do the raw compilation, and keeps the serial work for everything else on the main thread. That other work couldn't be parallelized anyway, since it is mostly in Python.
In cold start situations, the time to get the workers started can be a significant portion of the total time. This patch starts the workers earlier so they are ready to perform compilation (see code comments) by the time dynamo gets to that point.
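To make the mechanism concrete, here is a minimal, hypothetical sketch of the pattern described above — a persistent process pool started early, with the raw compile work submitted to workers and the remaining serial work kept on the main thread. This is not the actual torchinductor code; names like CompileWorkerPool and _compile_kernel are invented for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def _compile_kernel(source: str) -> str:
    # Stand-in for the expensive, parallelizable part of a Triton compile.
    # It must be a module-level function so it can be pickled for the workers.
    return source.upper()

class CompileWorkerPool:
    def __init__(self, workers: int = os.cpu_count() or 1):
        # Created eagerly (e.g. at warm-up time) so process startup cost is
        # paid before dynamo actually needs compilation results.
        self._pool = ProcessPoolExecutor(max_workers=workers)

    def submit(self, source: str):
        return self._pool.submit(_compile_kernel, source)

    def shutdown(self):
        self._pool.shutdown()

if __name__ == "__main__":
    pool = CompileWorkerPool()
    futures = [pool.submit(f"kernel_{i}") for i in range(8)]
    # Serial, mostly-Python post-processing stays on the main thread.
    results = [f.result() for f in futures]
    pool.shutdown()
    print(results)
```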
I've only tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and a cold compilation. (Cold compilation is measured after running rm -rf /tmp/torchinductor_$USER.)
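For reference, a small hypothetical helper that clears the inductor on-disk cache the same way the rm -rf command above does, so cold-compile timings can be reproduced from Python:

```python
import getpass
import shutil
from pathlib import Path

# Equivalent to `rm -rf /tmp/torchinductor_$USER` from the description above.
cache_dir = Path(f"/tmp/torchinductor_{getpass.getuser()}")
shutil.rmtree(cache_dir, ignore_errors=True)
```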