[inductor] Parallelize Max Autotune step 2: Use multiple GPUs #109127
Conversation
Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109127
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit cdd5ed3 with merge base 264f1e7. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@shunting314 FYI this is a redo of #107983 which we had to revert since it didn't work well in fbcode. @aakhundov, FYI I took your suggestions from that PR.
nice!
count = torch.cuda.device_count()

# If the user specified the visible devices in the env, use those.
if CUDA_VISIBLE_DEVICES in os.environ:
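For readers skimming the diff, a minimal sketch of what this kind of device enumeration can look like (the constant and function name here are illustrative, not the PR's actual code, and it assumes `CUDA_VISIBLE_DEVICES` contains integer device ids):

```python
import os

import torch

# Hypothetical constant mirroring the one referenced in the hunk above.
CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"


def get_candidate_devices() -> list:
    # Honor an explicit CUDA_VISIBLE_DEVICES setting if the user provided one
    # (assumes integer ids, not GPU UUIDs); otherwise fall back to every
    # device torch can see.
    if CUDA_VISIBLE_DEVICES in os.environ:
        return [int(d) for d in os.environ[CUDA_VISIBLE_DEVICES].split(",")]
    return list(range(torch.cuda.device_count()))
```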
cool, so this includes the fix @aakhundov mentioned
@masnesral Thanks for the fix!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
log.debug("Entering TuningProcess child main") | ||
log.debug("Entering TuningProcess child main: %s", device) | ||
if device is not None: | ||
os.environ[CUDA_VISIBLE_DEVICES] = str(device) |
Interesting that this works. I guess it means that PyTorch is able to pick up the device value automatically from the newly updated os.environ["CUDA_VISIBLE_DEVICES"]?
@aakhundov Hmmmmm, good catch. It does not work as I expected. I saw speedups comparable to the first implementation, so I blindly assumed it was due to using multiple GPUs. But I guess the use of multiple subprocesses (even if using the same GPU) provides perf benefits? I found `nvidia-smi pmon` and it's reporting all my parallel subprocesses on the same device. The reason for organizing it this way is that the multiprocessing API doesn't seem to allow passing a new environment before spawning the new process. Lemme dig some more.
The benchmark results will be very much distorted in this case. We have to make sure different subprocesses are running on different devices.
I wonder if anyone has a good idea for testing this.
I agree it's incorrect to run more than one profiling on the same device in parallel. So in AITemplate we do this to launch a process with a specific `CUDA_VISIBLE_DEVICES` env var (`dev_select_flag` is `"CUDA_VISIBLE_DEVICES"`; `device` is a zero-based int id of the device). And this seems to work: whatever the process does on a GPU ends up on device #`device`. Could we do something similar here?
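For context, a rough sketch of that Popen-style launch, where the environment is fixed before the child process starts (the worker script name and function here are placeholders, not AITemplate's or this PR's actual code):

```python
import os
import subprocess
import sys


def launch_on_device(device: int, script: str = "autotune_worker.py") -> subprocess.Popen:
    # Copy the parent's environment and pin the child to one physical GPU
    # before it starts, so CUDA initialization in the child sees only that device.
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    # Inside the child, the pinned GPU shows up as cuda:0.
    return subprocess.Popen([sys.executable, script], env=env)
```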
@masnesral I believe the reason you saw speedups was that the actual GPU work during benchmarking is not that intensive. As I understand it, most of the time may be spent in launching and processing. But we can't guarantee that the kernels will not overlap (and hence distort the measurements), so we still must serialize them on each device.
> So in AITemplate we do this
Yes, setting that env var works as expected, generally speaking. The problem in this current implementation is that setting the env var after the sub-process has already started does not have the desired effect. The previous impl that used a Popen (basically the same approach as the AIT impl) worked correctly in this regard. AFAICT, if we stick with the multiprocessing lib (which has other advantages), the only way to manipulate the env is to do it in the parent process before starting the sub-process. I'll put up an RFC soon.
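A hedged sketch of that parent-side workaround with the multiprocessing lib, assuming a "spawn" start method; the function names are illustrative, not the PR's actual implementation. The env var is set in the parent immediately before starting each child, and the child inherits it at process creation, before any CUDA initialization:

```python
import multiprocessing
import os


def child_main():
    # A spawned child inherits the environment as it was at Process.start()
    # time, so CUDA initialization in the child would see the pinned device.
    print("child sees:", os.environ.get("CUDA_VISIBLE_DEVICES"))


def start_child_pinned_to(device: int) -> multiprocessing.Process:
    ctx = multiprocessing.get_context("spawn")
    # Mutate the parent's env *before* spawning, then restore it so the parent
    # (and any children spawned later for other devices) is unaffected.
    saved = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
    try:
        proc = ctx.Process(target=child_main)
        proc.start()
    finally:
        if saved is None:
            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = saved
    return proc


if __name__ == "__main__":
    procs = [start_child_pinned_to(d) for d in range(2)]
    for p in procs:
        p.join()
```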
Stack from ghstack (oldest at bottom):
Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov