A Launch script with Best Recipe of Deep Learning on Intel Xeon CPU #63932
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit 39d88cd.
Force-pushed from 22b1e9c to 8e8561b.
Force-pushed from 84a2d5d to 9114db6.
Hi @VitalyFedyunin, any updates?
Given that the script is CPU-specific atm, should it be renamed to reflect that? @kiukchung might also want to take a look. It might not be a bad idea to embed it into `torchrun`.
@dzhulgakov It is a fantastic idea to merge this script into `torchrun`.
Hi @dzhulgakov, do you have updates about embedding this PR into `torchrun`?
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Could you remove the stale label?
- Please explicitly call out whether macOS is supported or not.
- Use f-strings rather than `.format`.
- And please use double quotes, per https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#strings
torch/utils/launch.py
Outdated
if len(matches) > 0:
    lst_valid.append(item)
else:
    logger.warning("{} doesn't exist. Removing it from LD_PRELOAD.".format(item))
We are in Python-3 world, please use f-strings
logger.warning("{} doesn't exist. Removing it from LD_PRELOAD.".format(item)) | |
logger.warning(f"{item} doesn't exist. Removing it from LD_PRELOAD.") |
Sure.
torch/utils/launch.py
Outdated
logger.info("Begin to validate the ip connect") | ||
args.master_addr = ip_list[0] | ||
for ip in ip_list[1:]: | ||
completed_process = subprocess.run("ssh -o PasswordAuthentication=no {} ':'".format(ip), shell=True) |
Please use f-strings
completed_process = subprocess.run("ssh -o PasswordAuthentication=no {} ':'".format(ip), shell=True) | |
completed_process = subprocess.run(f"ssh -o PasswordAuthentication=no {ip} ':'", shell=True) |
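For context, a minimal sketch of the connectivity check under discussion, with the f-string applied; the function name and the returncode handling are illustrative additions, not part of the original diff:

import subprocess

def validate_ip_connectivity(ip_list):
    # The first host becomes the master; every other host must be reachable
    # over passwordless SSH before the distributed launch proceeds.
    master_addr = ip_list[0]
    for ip in ip_list[1:]:
        completed_process = subprocess.run(
            f"ssh -o PasswordAuthentication=no {ip} ':'", shell=True
        )
        if completed_process.returncode != 0:
            raise RuntimeError(f"Passwordless SSH to {ip} failed; check SSH keys across nodes.")
    return master_addr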
Force-pushed from 451a387 to 9ac268f.
Thank you for working on this.
Please include unittests with sufficient and necessary test-cases to assert the correctness of the launcher. Feel free to mock certain system-dependent calls where appropriate.
setup.py
Outdated
@@ -874,7 +874,7 @@ def make_relative_rpath_args(path):
     'console_scripts': [
         'convert-caffe2-to-onnx = caffe2.python.onnx.bin.conversion:caffe2_to_onnx',
         'convert-onnx-to-caffe2 = caffe2.python.onnx.bin.conversion:onnx_to_caffe2',
-        'torchrun = torch.distributed.run:main',
+        'torchrun = torch.utils.launch:main',
Could we keep the verb consistent and rename this to `torch.utils.run` (and equivalently: `torch.utils.launch_cpu_inference` -> `torch.utils.inference.run`)?
> `torch.utils.launch_cpu_inference` -> `torch.utils.inference.run`

Should I create a directory `inference` inside `torch/utils`, and move `torch/utils/launch_cpu_inference.py` to `torch/utils/inference/run.py`?
torch/utils/launch.py
Outdated
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('--inference', default=False, action="store_true")
group.add_argument('--distributed', default=False, action="store_true")
This would break backwards compatibility. I suggest we make this default to True.
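A minimal sketch of the backward-compatible behavior being suggested; the flags match the diff above, while the dispatch variable is illustrative:

import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()  # no longer required=True
group.add_argument("--inference", action="store_true")
group.add_argument("--distributed", action="store_true")
args, _ = parser.parse_known_args()

# Neither flag given -> behave like today's torchrun (distributed by default).
run_distributed = args.distributed or not args.inference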
Sure.
Hi @kiukchung, do you expect it to be used as follows?
`torchrun xxxxx` => DDP
`torchrun --inference xxxxxx` => inference
torch/utils/launch.py
Outdated
import argparse
import sys
from torch.utils import launch_cpu_inference
from torch.distributed import run as launch_ddp
Let's avoid module renames when possible, or keep the import statement closer to the call-site. We can move this line down to L26 as:

if args.distributed:
    from torch.distributed import run
    run.main(args)
torch/utils/launch.py
Outdated
if args.inference:
    launch_cpu_inference.main()
if args.distributed:
    launch_ddp.main()
Pass the remaining args explicitly to `launch_ddp.main(args=argv_script)` instead of mutating `sys.argv`. The same goes for `launch_cpu_inference.main(args=argv_script)`.
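A sketch of the suggested dispatch, assuming both entry points accept an `args` parameter (true of `torch.distributed.run.main`; assumed for `launch_cpu_inference.main`):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--inference", action="store_true")
# argv_script holds everything this top-level parser did not consume.
args, argv_script = parser.parse_known_args()

if args.inference:
    from torch.utils import launch_cpu_inference
    launch_cpu_inference.main(args=argv_script)  # assumed signature, mirroring the suggestion
else:
    from torch.distributed import run
    run.main(args=argv_script)  # torch.distributed.run.main(args=None) parses this list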
torch/utils/launch_cpu_inference.py
Outdated
from argparse import ArgumentParser, REMAINDER
from argparse import RawTextHelpFormatter
import logging
import psutil
Is this import required? I didn't see any usages of it. AFAIK torch doesn't define a dependency on `psutil` already, so if this is required we'd have to add it to requirements.txt.
Removed.
torch/utils/launch_cpu_inference.py
Outdated
self.set_env("KMP_BLOCKTIME", "1") | ||
self.logger_env("LD_PRELOAD") | ||
|
||
class MultiInstanceLauncher(Launcher): |
Looks like this `MultiInstanceLauncher` now deals with single-instance launches as well (which makes sense, since single-instance launching is a subset of multi-instance launching). Do we need this class then? Can't we fold all of this into `Launcher` now?
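For illustration, a sketch of the proposed folding; the method names are invented for the sketch, not taken from the PR:

class Launcher:
    def launch(self, args):
        # A single-instance launch is just the ninstances == 1 case,
        # so the loop can live directly on Launcher.
        ninstances = max(getattr(args, "ninstances", 1), 1)
        for instance_idx in range(ninstances):
            self._launch_one(instance_idx, args)

    def _launch_one(self, instance_idx, args):
        ...  # bind cores for this instance, set env vars, spawn the process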
torch/utils/launch_cpu_inference.py
Outdated
# multi-instance control
group.add_argument("--ncore_per_instance", metavar="\b", default=-1, type=int,
                   help="Cores per instance")
group.add_argument("--ninstances", metavar="\b", default=-1, type=int,
Does this script support multi-host runs? An "instance" in this case maps to a process? If so, could we keep it consistent with `torchrun` and use `--nprocs` instead?
This script is intended for inference only at this moment. Both of these two arguments handle inference with multiple independent instances on a single node. An instance maps to a process. `--ncore_per_instance` indicates how many cores are bound to an instance. How about renaming `--ninstances` to `--nprocs`?
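A sketch of the rename being floated; the help strings are illustrative:

from argparse import ArgumentParser

parser = ArgumentParser()
group = parser.add_argument_group("multi-instance control")
group.add_argument("--ncore_per_instance", metavar="\b", default=-1, type=int,
                   help="Cores bound to each instance (one instance == one process)")
group.add_argument("--nprocs", metavar="\b", default=-1, type=int,
                   help="Number of independent inference processes (was --ninstances)")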
torch/utils/launch_cpu_inference.py
Outdated
or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or
{expanduser("~")}/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance""")

def logger_env(self, env_name=""):
This method logs the environment variable name and value, correct? Then it should be called `log_env_var(env_var_name)`.
OK. This prints an environment variable as a name/value pair.
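A minimal sketch of the renamed helper, with the body assumed from the description above:

import logging
import os

logger = logging.getLogger(__name__)

def log_env_var(env_var_name=""):
    # Print an environment variable as a name/value pair.
    logger.info(f"{env_var_name}={os.environ.get(env_var_name, '')}")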
torch/utils/launch_cpu_inference.py
Outdated
def set_env(self, env_name, env_value=None):
    if not env_value:
        logger.warning(f"{env_name} is None")
Was the default value of `None` for `env_value` intentional? If it was intentional (to clear env vars), then a warning is a bit misleading (change to info). Also, the message should be clarified as:

logger.info(f"Clearing environment variable: {env_name}. Previous value: {previous_value}")
I'll remove the default value of the `env_value` parameter.
torch/utils/launch_cpu_inference.py
Outdated
if env_name not in os.environ:
    os.environ[env_name] = env_value
elif os.environ[env_name] != env_value:
    logger.warning(f"{env_name} in environment variable is {os.environ[env_name]} while the value you set is {env_value}")
Same here: if it is a valid action to set env vars that are already present, this should be an "info"-level log. Also, the message can be improved:

logger.warning(f"Overriding already set environment variable: {env_name}. New value: {env_value}. Previous value: {prev_value}")
We don't mean to override the environment variable. We take values set by environment variables as the highest priority. This just notifies the user that they have set a value via the environment variable that differs from the one set in the script.
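Spelled out as a sketch (the log message wording is illustrative), that precedence looks like this:

import logging
import os

logger = logging.getLogger(__name__)

def set_env(env_name, env_value):
    # User-set environment variables win; the script only fills in unset
    # values and surfaces any mismatch it finds.
    if env_name not in os.environ:
        os.environ[env_name] = env_value
    elif os.environ[env_name] != env_value:
        logger.warning(f"{env_name} is already set to {os.environ[env_name]}, "
                       f"which differs from the recommended value {env_value}; "
                       f"keeping the user's setting.")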
Force-pushed from 75966c2 to b6017bc.
Got it. Done.
Force-pushed from 03dbb1a to 2368fa0.
Force-pushed from 6b09f56 to 96b728b.
Thanks for the update on this!
I mainly have questions about the test but the rest looks good.
test/backends/xeon/run_test.py
Outdated
""" | ||
from torch.backends.xeon.run_cpu import _CPUinfo | ||
cpuinfo = _CPUinfo(lscpu_info) | ||
assert cpuinfo._physical_core_nums() == 8 |
Nit: use `self.assertEqual()` for these.
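For illustration, the same check in the suggested style; the fixture data is an assumption about the `lscpu --parse` format, not taken from the test file:

import unittest
from torch.backends.xeon.run_cpu import _CPUinfo

class TestCPUInfo(unittest.TestCase):
    def test_physical_core_nums(self):
        # Assumed fixture: 8 cores, 1 socket, 1 NUMA node, no hyperthreading,
        # as rows of `lscpu --parse=CPU,Core,Socket,Node` output.
        lscpu_info = [f"{i},{i},0,0" for i in range(8)]
        cpuinfo = _CPUinfo(lscpu_info)
        # assertEqual reports both values on failure, unlike a bare assert.
        self.assertEqual(cpuinfo._physical_core_nums(), 8)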
test/backends/xeon/run_test.py
Outdated
assert num == 4, "Failed to launch multiple instances for inference"


if __name__ == "__main__":
    subprocess.call("yes | conda install -c conda-forge gperftools jemalloc", shell=True)
I don't think we ever do such things in our tests.
I think it is better to raise a nice error if they are not available and hint the user towards installing these.
This command is to enable testing of preloading jemalloc and tcmalloc. If this cannot be executed, then I'll need to remove `test_jemalloc` (line 51) and `test_tcmalloc` (line 62) from this UT script. Is it OK to remove these 2 test functions?
I think the way to go would be to check if these are installed and if not, either:
- Skip the tests that require it (see the sketch after this list).
- Raise an error asking the user to install these before running the test.
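A minimal sketch of the first option; probing with `ctypes.util.find_library` and the class name are assumptions, while the test names come from the discussion below:

import ctypes.util
import unittest

def _has_lib(name):
    # find_library resolves e.g. "jemalloc" to libjemalloc if it is installed.
    return ctypes.util.find_library(name) is not None

class TestAllocatorPreload(unittest.TestCase):
    @unittest.skipIf(not _has_lib("jemalloc"), "jemalloc not installed, skipping")
    def test_jemalloc(self):
        ...

    @unittest.skipIf(not _has_lib("tcmalloc"), "tcmalloc (gperftools) not installed, skipping")
    def test_tcmalloc(self):
        ...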
test/backends/xeon/run_test.py
Outdated
@@ -0,0 +1,112 @@
# Owner(s): ["module: unknown"]
Should this be run in CI? If so:
- The file should be renamed to `test_*` so that it actually runs.
- We should have a proper module for it. Should I add a `module: xeon` and add you to the cc list for it?
- We need to make sure that it runs on a machine that has all the proper requirements. Do we have any CI machine with the right hardware to run this?
- Yeah, sure. I'll rename it to `test_launch_xeon.py`.
- Yeah, sounds good to me.
- 4 out of 5 test functions depend on installation of the Jemalloc, TCMalloc and IOMP packages (`test_jemalloc`, `test_tcmalloc`, `test_iomp` and `test_default_config`). If the installation cannot be done, I'll remove them but keep `test_multi_threads` to verify that multiple instances can be launched. This one doesn't require any hardware resources, just a loop that executes a bash command several times.

How about we keep 2 test functions, `test_multi_threads` and `test_cpu_info`, in this UT?
cc @malfet, who is more familiar with the config on our current CI machines: do we have the right hardware? And if so, can we install these libraries by default on the runners?
torch/backends/xeon/run_cpu.py
Outdated
"""

from __future__ import absolute_import, division, print_function, unicode_literals
Is this needed? We only support Python 3.7+.
torch/backends/xeon/run_cpu.py
Outdated
format_str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(level=logging.INFO, format=format_str)
logger = logging.getLogger(__name__)
# from torch.distributed.elastic.utils.logging import get_logger
spurious comment?
from typing import List, Dict

format_str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(level=logging.INFO, format=format_str)
Should this be done inside the `if __name__ == "__main__"` block below?
This file is imported during testing, so this is going to change the global state of logging during testing?
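A sketch of one way to avoid that side effect: configure logging only in the script entry point, so importing the module leaves global logging state untouched.

import logging

logger = logging.getLogger(__name__)

def main():
    ...

if __name__ == "__main__":
    # basicConfig runs only for `python -m ...` execution, not on import.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    main()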
torch/backends/xeon/run_cpu.py
Outdated
lscpu_cmd = ["lscpu", "--parse=CPU,Core,Socket,Node"] | ||
lscpu_info = subprocess.check_output(lscpu_cmd, universal_newlines=True).split("\n") | ||
else: | ||
print("Running test to verify correctness of CPUinfo class") |
log?
test/backends/xeon/test_launch.py
Outdated
@@ -0,0 +1,63 @@
# Owner(s): ["module: unknown"]
This one should be `module: intel`?
Force-pushed from 0f7bce9 to 669b87f.
The update sounds good to me.
Can you please make sure the tests pass, though? If that is only going to work on Xeon machines, you can skip the test (with `unittest.skipIf()`) if you're not on Xeon.
Force-pushed from a926fe5 to 4193cca.
Hi @albanD, I've fixed the CI failures.
This will need to be fixed before merging.
Force-pushed from 4193cca to 62407a6.
CI failure is real.
Oh, the launch script doesn't support Windows because hardware info is read the Linux way... I'll skip this UT on Windows.
test/backends/xeon/test_launch.py
Outdated
def tearDown(self):
    shutil.rmtree(self._test_dir)

@unittest.skipIf(IS_LINUX, "Windows not supported")
- You can apply this decorator to the class itself if you want to skip all the tests in that class.
- You want to skip if it is NOT Linux, right?
- It is both Windows and macOS that won't run? The message could be "Only works on Linux" if that's the case (see the sketch after this list).
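Combining the three points, a sketch; the class and test names are illustrative:

import unittest
from torch.testing._internal.common_utils import IS_LINUX

# Decorating the class skips every test in it, and the inverted condition
# skips on anything that is not Linux (Windows and macOS alike).
@unittest.skipIf(not IS_LINUX, "Only works on Linux")
class TestXeonLaunch(unittest.TestCase):
    def test_multi_instances(self):
        ...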
Thanks for the update.
Looks good to me.
Could you rebase on latest master to ensure that CI does run fine?
Sure. Will do. Thanks.
Force-pushed from a640f5a to 39d88cd.
Commit summary:
- correct module name in run_cpu.py
- update UT
- update module name
- add __init__.py to torch/backends/xeon
- fix build doc error
- add skipif to the UT to run on Linux only
- import unittest in test_launch.sh
- fix improper skipif in test_launch.py
@pytorchbot merge -g
@pytorchbot successfully started a merge job. Check the current status here.
Hey @jingxu10.
A Launch script with Best Recipe of Deep Learning on Intel Xeon CPU (#63932) (#63932)

Summary: Fixes #63556
Usage: `python -m torch.backends.xeon.launch [--knobs] <script> [script parameters]`
Pull Request resolved: #63932
Approved by: https://github.com/albanD
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/5257d1d64ba8d5faffe8082342d2949c4e1c0e1e
Original Phabricator Test Plan: Imported from GitHub, without a `Test Plan:` line.
Reviewed By: osalpekar
Differential Revision: D38290825
Pulled By: osalpekar
fbshipit-source-id: 579fc5bb2b3075a2fa8de939796d1263c36ea311
Fixes #63556
Usage: `python -m torch.backends.xeon.launch [--knobs] <script> [script parameters]`