This repository was archived by the owner on Aug 1, 2025. It is now read-only.

Conversation

@zdevito

@zdevito zdevito commented Oct 14, 2022

This patch significantly improves parallel compilation performance for Triton kernels
by using ProcessPoolExecutor to create a persistent pool of compilation workers.

Previously, os.fork overhead and GIL contention limited the achievable parallelism. This patch replaces
the worker threads with a pool of processes that do the raw compilation, while everything else runs serially
on the main thread. That other work couldn't be parallelized anyway, since it is mostly in Python.

In cold-start situations, the time to get the workers started can be a significant portion of the total.
This patch starts the workers earlier so they are ready to perform compilation (see code comments) by the time dynamo
gets to that point.
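
For illustration only (not the actual patch), a minimal sketch of the idea; compile_source is a hypothetical stand-in for the raw Triton compile step:

# Minimal sketch of the approach, not the actual patch. compile_source is a
# hypothetical placeholder for the raw (GIL-free, CUDA-free) compilation work.
import os
from concurrent.futures import ProcessPoolExecutor


def compile_source(source: str) -> str:
    # Placeholder for the expensive compilation done in a worker process.
    return f"compiled({source})"


def warm_up(pool: ProcessPoolExecutor) -> None:
    # Submitting a trivial task early forces the worker processes to start,
    # so they are already running by the time real compilation jobs arrive.
    pool.submit(compile_source, "")


def parallel_compile(pool: ProcessPoolExecutor, sources):
    # Raw compilation runs in the pool; everything else stays on the main thread.
    futures = [pool.submit(compile_source, s) for s in sources]
    return [f.result() for f in futures]


if __name__ == "__main__":
    pool = ProcessPoolExecutor(max_workers=os.cpu_count())
    warm_up(pool)  # started as early as possible, before kernels are generated
    print(parallel_compile(pool, ["kernel_a", "kernel_b"]))
    pool.shutdown()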

I have only tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between warm and cold compilation.

39.613s - warm
41.290s - cold, this patch

2m53.197s - cold, single-threaded
1m7.092s - cold, old setup with n = 8 (its best config)

(cold compilation is done after running rm -rf /tmp/torchinductor_$USER).

Contributor

@anijain2305 anijain2305 left a comment


Really cool! Learnt a lot from the code comments.

@Chillee
Contributor

Chillee commented Oct 14, 2022

!!!!!!!!!!!!!!!!!!!!

@soumith
Member

soumith commented Oct 14, 2022

Would it make sense to switch to spawn or forkserver?
The reason is that fork might not work if a CUDA context is already initialized (unless Triton is only doing compilation and no tuning, and hence not using the CUDA context).

You can test by creating a CUDA context first and then starting the dynamo stuff:

import torch
torch.randn(10).cuda()  # initializes a CUDA context in the parent process

import torchdynamo
torchdynamo.optimize()(model)
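
For reference, ProcessPoolExecutor already accepts an mp_context argument, so switching the pool off fork would be a small change; a minimal sketch (not part of this patch):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# forkserver (or spawn) avoids forking a parent that already holds a CUDA context
ctx = multiprocessing.get_context("forkserver")
pool = ProcessPoolExecutor(max_workers=8, mp_context=ctx)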

@desertfire
Contributor

Triton also caches things under $HOME/.triton/cache. For a true cold start measurement, should we clean those as well?

@zdevito
Author

zdevito commented Oct 14, 2022

Would it make sense to switch to spawn or forkserver? The reason is that fork might not work if a CUDA context is already initialized (unless Triton is only doing compilation and no tuning, and hence not using the CUDA context).

You can test by creating a CUDA context first and then starting the dynamo stuff:

import torch
torch.randn(10).cuda()

import torchdynamo
torchdynamo.optimize()(model)

I tested this with CUDA live and it still works. The reason is that the Triton worker processes don't need CUDA, so they don't try to reinitialize it.

@zdevito
Author

zdevito commented Oct 14, 2022

Triton also caches things under $HOME/.triton/cache. For a true cold start measurement, should we clean those as well?

I checked, and deleting this directory doesn't seem to affect the timing, whereas there is a clear difference in Triton compile times when the /tmp/torchinductor directory is emptied.
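
For a fully cold start, both caches could be cleared before measuring; a small sketch using the paths mentioned in this thread:

import getpass
import os
import shutil

# Sketch only: clear both caches before a cold-start measurement.
for path in (
    f"/tmp/torchinductor_{getpass.getuser()}",  # inductor's compile cache
    os.path.expanduser("~/.triton/cache"),      # Triton's own cache
):
    shutil.rmtree(path, ignore_errors=True)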


@jansel
Contributor

jansel commented Oct 15, 2022

We have migrated torchdynamo to torch._dynamo and will use the pytorch/pytorch repo for future development. Please resubmit this PR to https://github.com/pytorch/pytorch/

More details and instructions to port this PR over can be found in #1588

@jansel jansel closed this Oct 15, 2022