
[inductor] Parallelize Max Autotune step 2: Use multiple GPUs #109127

Closed
wants to merge 5 commits

Conversation

masnesral (Contributor) commented Sep 12, 2023

Stack from ghstack (oldest at bottom):

Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

pytorch-bot bot commented Sep 12, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109127

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cdd5ed3 with merge base 264f1e7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

masnesral added a commit that referenced this pull request Sep 12, 2023 (ghstack-source-id: 290da2b3040ca6a5913d13c0c5bc3ac37ee71d4a)
…PUs"

Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 12, 2023 (ghstack-source-id: 611a1f9b20378432130f1dd403ae12b6649e91b7)
…PUs"

Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 12, 2023 (ghstack-source-id: 611a1f9b20378432130f1dd403ae12b6649e91b7)
masnesral added the topic: not user facing label Sep 12, 2023
masnesral marked this pull request as ready for review September 12, 2023 19:22
masnesral (Contributor Author) commented:

@shunting314 FYI this is a redo of #107983 which we had to revert since it didn't work well in fbcode.

@aakhundov, FYI I took your suggestions from that PR.

masnesral added a commit that referenced this pull request Sep 12, 2023 (ghstack-source-id: 290da2b3040ca6a5913d13c0c5bc3ac37ee71d4a)
…PUs"

Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 13, 2023 (ghstack-source-id: 3ae61dec864266d43d5e539eabb41111a576973a)
eellison (Contributor) left a comment:

nice!

count = torch.cuda.device_count()

# If the user specified the visible devices in the env, use those.
if CUDA_VISIBLE_DEVICES in os.environ:
Contributor:
cool, so this includes the fix @aakhundov mentioned
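For readers skimming the diff excerpt above, here is a minimal sketch of what an env-aware device enumeration can look like; the helper name and the fallback behavior are assumptions for illustration, not the PR's exact code:

```python
import os

import torch

CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"


def get_candidate_devices() -> list[int]:
    # Hypothetical helper, not the PR's implementation.
    # If the user restricted the visible devices in the env, honor that list.
    if CUDA_VISIBLE_DEVICES in os.environ:
        return [int(d) for d in os.environ[CUDA_VISIBLE_DEVICES].split(",")]
    # Otherwise every device reported by torch is a candidate.
    return list(range(torch.cuda.device_count()))
```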

Contributor:
@masnesral Thanks for the fix!

masnesral (Contributor Author):
@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Sep 13, 2023
pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

log.debug("Entering TuningProcess child main")
log.debug("Entering TuningProcess child main: %s", device)
if device is not None:
    os.environ[CUDA_VISIBLE_DEVICES] = str(device)
Contributor:
Interesting that this works. I guess it means that PyTorch is able to pick up the device value automatically from the newly updated os.environ["CUDA_VISIBLE_DEVICES"]?

masnesral (Contributor Author):
@aakhundov Hmmmmm, good catch. It does not work as I expected. I saw speedups comparable to the first implementation, so I blindly assumed it was due to using multiple GPUs. But I guess the use of multiple subprocesses (even if they use the same GPU) provides perf benefits? I found nvidia-smi pmon and it's reporting all my parallel subprocesses on the same device. The reason for organizing it this way is that the multiprocessing API doesn't seem to allow passing a new environment before spawning the new process. Lemme dig some more.

Contributor:
The benchmark results will be very much distorted in this case. We have to make sure different subprocesses are running on different devices.

Contributor:
I wonder if anyone has any good ideas for testing this.

Contributor:
I agree it's incorrect to run more than one profiling on the same device in parallel. So in AITemplate we do this to launch a process with a specific CUDA_VISIBLE_DEVICES env var (dev_select_flag is "CUDA_VISIBLE_DEVICES"; device is a zero-based int id of the device). And this seems to work: whatever the process does on a GPU ends up on that device. Could we do something similar here?
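For reference, the AITemplate-style launch described here amounts to something like the following sketch; profile_on_device and the command argument are placeholder names, not AITemplate's or this PR's actual API:

```python
import os
import subprocess


def profile_on_device(cmd: list[str], device: int) -> subprocess.CompletedProcess:
    # Placeholder helper illustrating the idea: pin the child process to one
    # physical GPU by setting CUDA_VISIBLE_DEVICES before it starts.
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(device)  # zero-based physical device id
    # Inside the child, the selected GPU is the only visible device (cuda:0).
    return subprocess.run(cmd, env=env, check=True)
```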

Contributor:
@masnesral I believe the reason you saw speedups is that the actual GPU work during benchmarking is not that intensive. As I understand it, most of the time may be spent in launching and processing. But we can't guarantee that the kernels will not overlap (and hence distort the measurements), so we still must serialize them on each device.

masnesral (Contributor Author):
> So in AITemplate we do this

Yes, setting that env var works as expected, generally speaking. The problem in this current implementation is that setting the env var after the sub-process has already started does not have the desired effect. The previous impl that used a Popen (basically the same approach as the AIT impl) worked correctly in this regard. AFAICT, if we stick with the multiprocessing lib (which has other advantages), the only way to manipulate the env is to do it in the parent process before starting the sub-process. I'll put up an RFC soon.
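A minimal sketch of that parent-side approach with the multiprocessing library (worker is a hypothetical function; the key point is that the env var is set before Process.start(), so the spawned child inherits it and CUDA initializes in the child against the intended device):

```python
import multiprocessing
import os


def worker(queue):
    # Runs in the child process. CUDA_VISIBLE_DEVICES was set by the parent
    # before start(), so the child's CUDA context sees only that device.
    import torch
    queue.put(torch.cuda.device_count())  # expected: 1 on a multi-GPU machine


if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    queue = ctx.Queue()
    # Manipulate the environment in the parent *before* spawning the child;
    # the child receives a copy of the parent's environment at start().
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    p = ctx.Process(target=worker, args=(queue,))
    p.start()
    print(queue.get())
    p.join()
```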

facebook-github-bot deleted the gh/masnesral/10/head branch September 17, 2023 14:24