
Conversation

@jingxu10 (Contributor) commented Aug 25, 2021

Fixes #63556

Usage: python -m torch.backends.xeon.launch [--knobs] <script> [script parameters]
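
For illustration, a hypothetical invocation using the multi-instance knobs discussed in the review below (my_script.py and --batch_size are placeholders, and the knob values are examples only):

python -m torch.backends.xeon.launch --ninstances 2 --ncore_per_instance 4 my_script.py --batch_size 32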

@facebook-github-bot (Contributor) commented Aug 25, 2021

✅ No Failures (0 Pending)

As of commit 39d88cd (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

@jingxu10 changed the title from "add launch.py" to "A Launch script with Best Recipe of Deep Learning on Intel Xeon CPU" on Aug 25, 2021
@jgong5 (Collaborator) commented Aug 25, 2021

@VitalyFedyunin

@soulitzer added the "triaged" label on Aug 27, 2021
@msaroufim self-requested a review on November 2, 2021 22:53
@jingxu10 (Contributor, Author) commented Nov 3, 2021

Hi @VitalyFedyunin, any updates?

@dzhulgakov (Collaborator) commented:

Given that the script is CPU-specific at the moment, should it be called launch_cpu or something like that?

@kiukchung might also want to take a look. It might not be a bad idea to embed it into the torchrun command, which is the standard way of launching multi-worker jobs: https://pytorch.org/docs/stable/elastic/run.html. That way the UX can be similar for CPU and GPU.

@jingxu10 (Contributor, Author) commented:

@dzhulgakov It is a fantastic idea to merge this script into torchrun.

@jingxu10 (Contributor, Author) commented Dec 2, 2021

Hi @dzhulgakov, do you have any updates on embedding this PR into torchrun? torchrun is a wrapper script that imports torch.distributed.run. Should we do similar work to import this PR's launch module in torchrun, with a flag to choose between the two, or integrate the functionality into torch.distributed.run?

@github-actions (bot) commented:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the Stale label on Apr 13, 2022
@jingxu10 (Contributor, Author) commented:

Could you remove the stale label?

@kiukchung removed the Stale label on Apr 16, 2022
@malfet (Contributor) left a comment:

Please explicitly call out whether macOS is supported or not.
Use f-strings rather than .format, and please use double quotes per https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#strings

if len(matches) > 0:
    lst_valid.append(item)
else:
    logger.warning("{} doesn't exist. Removing it from LD_PRELOAD.".format(item))
Contributor:

We are in Python-3 world, please use f-strings

Suggested change
logger.warning("{} doesn't exist. Removing it from LD_PRELOAD.".format(item))
logger.warning(f"{item} doesn't exist. Removing it from LD_PRELOAD.")

Contributor Author:

Sure.

logger.info("Begin to validate the ip connect")
args.master_addr = ip_list[0]
for ip in ip_list[1:]:
completed_process = subprocess.run("ssh -o PasswordAuthentication=no {} ':'".format(ip), shell=True)
Contributor:

Please use f-strings

Suggested change
completed_process = subprocess.run("ssh -o PasswordAuthentication=no {} ':'".format(ip), shell=True)
completed_process = subprocess.run(f"ssh -o PasswordAuthentication=no {ip} ':'", shell=True)

@jingxu10 force-pushed the jingxu10/launch branch 2 times, most recently from 451a387 to 9ac268f on April 25, 2022
@kiukchung (Collaborator) left a comment:

Thank you for working on this.

Please include unittests with sufficient and necessary test-cases to assert the correctness of the launcher. Feel free to mock certain system-dependent calls where appropriate.

setup.py (outdated)
@@ -874,7 +874,7 @@ def make_relative_rpath_args(path):
'console_scripts': [
'convert-caffe2-to-onnx = caffe2.python.onnx.bin.conversion:caffe2_to_onnx',
'convert-onnx-to-caffe2 = caffe2.python.onnx.bin.conversion:onnx_to_caffe2',
'torchrun = torch.distributed.run:main',
'torchrun = torch.utils.launch:main',
Collaborator:

could we keep the verb consistent and rename this to torch.utils.run (and equivalently: torch.utils.launch_cpu_inference -> torch.utils.inference.run)

Contributor Author:

torch.utils.launch_cpu_inference -> torch.utils.inference.run
Should I create a directory inference inside torch/utils, and move the torch/utils/launch_cpu_inference.py to torch/utils/inference/run.py?

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('--inference', default=False, action="store_true")
group.add_argument('--distributed', default=False, action="store_true")
Collaborator:

this would break backwards compatibility. I suggest we make this default to True.

Contributor Author:

Sure.

Contributor Author:

Hi @kiukchung, do you expect it to be used as follows?
torchrun xxxxx => DDP
torchrun --inference xxxxxx => inference

import argparse
import sys
from torch.utils import launch_cpu_inference
from torch.distributed import run as launch_ddp
Collaborator:

Let's avoid module renames when possible, or keep the import statement closer to the call-site. We can move this line down to L26 as:

if args.distributed:
    from torch.distributed import run
    run.main(args)

if args.inference:
    launch_cpu_inference.main()
if args.distributed:
    launch_ddp.main()
Collaborator:

Pass the remaining args explicitly to launch_ddp.main(args=argv_script) instead of mutating sys.argv. The same goes for launch_cpu_inference.main(args=argv_script).
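
A minimal sketch of the dispatch kiukchung describes, using parse_known_args to split the launcher's own flags from the script arguments. The module names are taken from this revision of the PR (launch_cpu_inference later landed as torch.backends.xeon.run_cpu), so treat this as illustrative:

import argparse

def main():
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--inference", action="store_true")
    group.add_argument("--distributed", action="store_true")
    # Everything the launcher does not recognize belongs to the user script.
    args, argv_script = parser.parse_known_args()

    if args.inference:
        from torch.utils import launch_cpu_inference
        # Hand the script and its parameters over explicitly instead of
        # mutating sys.argv.
        launch_cpu_inference.main(args=argv_script)
    else:
        # Default to DDP launching, preserving torchrun's behavior.
        from torch.distributed import run
        run.main(args=argv_script)

if __name__ == "__main__":
    main()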

from argparse import ArgumentParser, REMAINDER
from argparse import RawTextHelpFormatter
import logging
import psutil
Collaborator:

Is this import required? I didn't see any usages of it. AFAIK torch doesn't already define a dependency on psutil, so if this is required, we'd have to add it to requirements.txt.

Contributor Author:

Removed.

self.set_env("KMP_BLOCKTIME", "1")
self.logger_env("LD_PRELOAD")

class MultiInstanceLauncher(Launcher):
Collaborator:

Looks like this MultiInstanceLauncher now deals with single-instance launches as well (which makes sense, since single-instance launching is a subset of multi-instance launching). Do we need this class, then? Can't we fold all of this into Launcher now?

# multi-instance control
group.add_argument("--ncore_per_instance", metavar="\b", default=-1, type=int,
                   help="Cores per instance")
group.add_argument("--ninstances", metavar="\b", default=-1, type=int,
Collaborator:

Does this script support multi-host runs? An "instance" in this case maps to a process? If so, could we keep it consistent with torchrun and use --nprocs instead?

Contributor Author:

This script is intended for inference only at the moment. Both of these two arguments handle inference with multiple independent instances on a single node. An instance maps to a process.
--ncore_per_instance indicates how many cores are bound to an instance. How about renaming --ninstances to --nprocs?
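
For illustration, a sketch of how the two knobs could partition a node's cores into per-instance affinity blocks (core_ranges is a hypothetical helper; the actual PR derives core counts from lscpu output):

def core_ranges(ncores_total: int, ninstances: int = -1, ncore_per_instance: int = -1):
    # -1 means "derive from the other knob", mirroring the argparse defaults above.
    if ncore_per_instance == -1:
        ncore_per_instance = ncores_total // ninstances
    if ninstances == -1:
        ninstances = ncores_total // ncore_per_instance
    for i in range(ninstances):
        start = i * ncore_per_instance
        # Each instance (process) is pinned to a contiguous block of cores.
        yield list(range(start, start + ncore_per_instance))

print(list(core_ranges(8, ninstances=2)))  # [[0, 1, 2, 3], [4, 5, 6, 7]]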

or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or
{expanduser("~")}/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance""")

def logger_env(self, env_name=""):
Collaborator:

This method logs the environment variable name and value, correct? Then it should be called log_env_var(env_var_name).

Contributor Author:

OK. This prints an environment variable as a name/value pair.
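
A sketch of the renamed helper (assuming a plain os.environ lookup with an empty-string fallback):

import logging
import os

logger = logging.getLogger(__name__)

def log_env_var(env_var_name=""):
    # Log an environment variable as a name/value pair.
    logger.info(f"{env_var_name}={os.environ.get(env_var_name, '')}")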

Comment on lines 288 to 290
def set_env(self, env_name, env_value=None):
    if not env_value:
        logger.warning(f"{env_name} is None")
Collaborator:

Was the default value of None for env_value intentional? If it was intentional (to clear env vars), then a warning is a bit misleading (change it to info). Also, the message should be clarified as:

logger.info(f"Clearing environment variable: {env_name}. Previous value: {previous_value}")

Contributor Author:

I'll remove the default value of the env_value parameter.

if env_name not in os.environ:
    os.environ[env_name] = env_value
elif os.environ[env_name] != env_value:
    logger.warning(f"{env_name} in environment variable is {os.environ[env_name]} while the value you set is {env_value}")
Collaborator:

Same here: if it is a valid action to set env vars that are already present, this should be an "info" level log. Also, the message can be improved:

logger.warning(f"Overriding already set environment variable: {env_name}. New value: {env_value}. Previous value: {prev_value}")

Contributor Author:

We don't mean to override the environment variable. We take values that are set via environment variables as the highest priority. This just notifies users that they have set a value via the environment variable that differs from the one set in the script.
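
A minimal sketch of that priority rule, based on the snippet quoted above: a value the user already exported wins, and the launcher only warns when it differs from the recommendation:

import logging
import os

logger = logging.getLogger(__name__)

def set_env(env_name, env_value):
    # Only fill in the recommended value when the user has not set one;
    # otherwise keep the user's value and point out the difference.
    if env_name not in os.environ:
        os.environ[env_name] = env_value
    elif os.environ[env_name] != env_value:
        logger.warning(f"{env_name} in environment variable is {os.environ[env_name]} "
                       f"while the value you set is {env_value}")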

@jingxu10 force-pushed the jingxu10/launch branch 2 times, most recently from 75966c2 to b6017bc on April 26, 2022
@jingxu10 (Contributor, Author) commented Jun 2, 2022

This empty __init__.py is to solve the following lint issue:
Cannot find implementation or library stub for module named "torch.utils.inference".
Does it make sense to add the empty __init__.py back? @albanD

Ah, mypy is not using plain Python imports and is divergent. That's unfortunate. We can add it back then, sure.

Got it. Done.

@yanbing-j added the "intel" label (This tag is for PRs from Intel) on Jun 29, 2022
@jingxu10 force-pushed the jingxu10/launch branch 2 times, most recently from 6b09f56 to 96b728b on June 29, 2022
@albanD (Collaborator) left a comment:

Thanks for the update on this!
I mainly have questions about the test but the rest looks good.

"""
from torch.backends.xeon.run_cpu import _CPUinfo
cpuinfo = _CPUinfo(lscpu_info)
assert cpuinfo._physical_core_nums() == 8
Collaborator:

nit: use self.assertEqual() for these.
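
Applied to the assertion quoted above, that would read (sketch):

self.assertEqual(cpuinfo._physical_core_nums(), 8)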

assert num == 4, "Failed to launch multiple instances for inference"

if __name__ == "__main__":
    subprocess.call("yes | conda install -c conda-forge gperftools jemalloc", shell=True)
Collaborator:

I don't think we ever do such things in our tests.
I think it is better to raise a nice error if they are not available and hint the user towards installing these.

Contributor Author:

This command is to enable testing of preloading jemalloc and tcmalloc. If this cannot be executed, then I'll need to remove test_jemalloc (line 51) and test_tcmalloc (line 62) from this UT script. Is it OK to remove these two test functions?

Collaborator:

I think the way to go would be to check if these are installed and, if not, either (see the sketch after this list):

  • Skip the tests that require it.
  • Raise an error asking the user to install these before running the test.
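
One way to implement the skip option, assuming ctypes.util.find_library can locate the allocator shared libraries (a sketch, not the code that landed; TestXeonLaunch is a hypothetical class name):

import unittest
from ctypes.util import find_library

HAS_JEMALLOC = find_library("jemalloc") is not None

class TestXeonLaunch(unittest.TestCase):
    @unittest.skipIf(not HAS_JEMALLOC, "jemalloc not installed; install it to run this test")
    def test_jemalloc(self):
        ...  # preload libjemalloc and launch the workload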

@@ -0,0 +1,112 @@
# Owner(s): ["module: unknown"]
Collaborator:

Should this be run in CI?
If so:

  • The file should be renamed to be test_* so that it actually runs
  • We should have a proper module for it. Should I add a module: xeon and add you to the cc list for it?
  • We need to make sure that it runs on a machine that has all the proper requirements. Do we have any CI machine with the right hardware to run this?

@jingxu10 (Contributor, Author) commented Jun 30, 2022:

  • Yeah, sure. I'll rename it to test_launch_xeon.py.
  • Yeah, sounds good to me.
  • 4 out of 5 test functions depend on the installation of the Jemalloc, TCMalloc and IOMP Python packages (test_jemalloc, test_tcmalloc, test_iomp and test_default_config). If the installation cannot be done, I'll remove them and just keep test_multi_threads to verify that multiple instances can be launched. This one doesn't require any hardware resources, just a loop that executes a bash command several times.

How about we keep 2 test functions, test_multi_threads and test_cpu_info, in this UT?

Collaborator:

cc @malfet, who is more familiar with the config on our current CI machines: do we have the right hardware? And if so, can we install these libraries by default on the runners?


"""

from __future__ import absolute_import, division, print_function, unicode_literals
Collaborator:

Is this needed? We only support Python 3.7+.

format_str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(level=logging.INFO, format=format_str)
logger = logging.getLogger(__name__)
# from torch.distributed.elastic.utils.logging import get_logger
Collaborator:

spurious comment?

from typing import List, Dict

format_str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(level=logging.INFO, format=format_str)
Collaborator:

Should this be done inside the if __name__ == "__main__" block below?

Collaborator:

This file is imported during testing so this is going to change the global state of logging during testing?
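
One way to avoid that, configuring the root logger only under the main guard so that importing the module leaves global logging state untouched (a sketch):

import logging

logger = logging.getLogger(__name__)  # module-level logger; no global config at import time

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")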

    lscpu_cmd = ["lscpu", "--parse=CPU,Core,Socket,Node"]
    lscpu_info = subprocess.check_output(lscpu_cmd, universal_newlines=True).split("\n")
else:
    print("Running test to verify correctness of CPUinfo class")
Collaborator:

log?

@yanbing-j added the "intel priority" label (matters to Intel architecture performance-wise) on Jul 14, 2022
@@ -0,0 +1,63 @@
# Owner(s): ["module: unknown"]
Collaborator:

This one should be module: intel ?

from typing import List, Dict

format_str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(level=logging.INFO, format=format_str)
Collaborator:

This file is imported during testing so this is going to change the global state of logging during testing?

@yanbing-j requested a review from albanD on July 27, 2022 03:46
@albanD (Collaborator) left a comment:

The update sounds good to me.
Can you please make sure the tests pass though? If that is only going to work on XEON machines, you can skip the test (with unittest.skipIf()) if you're not on XEON.

@jingxu10 (Contributor, Author) commented:

Hi @albanD, I've finished fixing the CI failures.
Two failing checks remain:

  1. Build docs: torch.backends.xeon is not added to the *.rst files. I think you will update the docs after the integration.
  2. usage_log.txt: "The process cannot access the file because it is being used by another process." This seems to be related to the CI environment.

@albanD (Collaborator) commented Jul 28, 2022

Build docs: torch.backends.xeon is not added to the *.rst files. I think you will update the docs after the integration.

This will need to be fixed before merging.
The .rst files in question are located in docs/
Depending on how you want to document this new submodule, you can update that. You most likely just want to add it with the others in https://pytorch.org/docs/stable/backends.html?highlight=backends#module-torch.backends by modifying backends.rst.

@albanD (Collaborator) commented Jul 28, 2022

CI failure is real.

@jingxu10 (Contributor, Author) commented Jul 28, 2022

Oh, the launch script doesn't support Windows because hardware info is read in a Linux-specific way... I'll skip this UT on Windows.

def tearDown(self):
    shutil.rmtree(self._test_dir)

@unittest.skipIf(IS_LINUX, "Windows not supported")
Collaborator:

  • You can apply this to the class constructor on top if you want to skip all the tests in that class.
  • You want to skip if it is NOT Linux, right?
  • It is both Windows and macOS that won't run? The comment could be "Only works on linux" if that's the case. (See the sketch after this list.)
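
A sketch combining all three points: invert the condition and apply it once at class level (TestRunCPU is a hypothetical class name; IS_LINUX is assumed to come from torch.testing._internal.common_utils, as in other PyTorch tests):

import unittest
from torch.testing._internal.common_utils import IS_LINUX

@unittest.skipIf(not IS_LINUX, "Only works on linux")
class TestRunCPU(unittest.TestCase):
    def test_multi_threads(self):
        ...  # all tests in the class are skipped together off Linux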

@albanD (Collaborator) left a comment:

Thanks for the update.
Looks good to me.
Could you rebase on latest master to ensure that CI does run fine?

@jingxu10 (Contributor, Author) commented:

Sure. Will do. Thanks.

correct module name in run_cpu.py

update UT

update module name

add __init__.py to torch/backends/xeon

fix build doc error

add skipif to the UT to run on Linux only

import unittest in test_launch.sh

fix improper skipif in test_launch.py
@albanD (Collaborator) commented Jul 29, 2022

@pytorchbot merge -g

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here

@github-actions (bot) commented:

Hey @jingxu10.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 1, 2022
…63932) (#63932)

Summary:
Fixes #63556

Usage: `python -m torch.backends.xeon.launch [--knobs] <script> [script parameters]`

Pull Request resolved: #63932
Approved by: https://github.com/albanD

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/5257d1d64ba8d5faffe8082342d2949c4e1c0e1e

Original Phabricator Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Reviewed By: osalpekar

Differential Revision: D38290825

Pulled By: osalpekar

fbshipit-source-id: 579fc5bb2b3075a2fa8de939796d1263c36ea311
@jingxu10 deleted the jingxu10/launch branch on June 5, 2025 23:35
Labels: cla signed · intel priority (matters to intel architecture from performance wise) · intel (This tag is for PR from Intel) · Merged · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Development: Successfully merging this pull request may close these issues:

[RFC] A Launch script with Best Recipe of Deep Learning on Intel Xeon CPU