A Launch script with Best Recipe of Deep Learning on Intel Xeon CPU #63932
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit 39d88cd.
Force-pushed from 22b1e9c to 8e8561b.
Force-pushed from 84a2d5d to 9114db6.
Hi @VitalyFedyunin, any updates?
Given that the script is CPU-specific atm, should it be renamed to reflect that? @kiukchung might also want to take a look. It might not be a bad idea to embed it into `torchrun`.
@dzhulgakov It is a fantastic idea to merge this script into `torchrun`.
Hi @dzhulgakov, do you have updates about embedding this PR into `torchrun`?
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Could you remove the stale label?
- Please explicitly call out whether macOS is supported or not.
- Use f-strings rather than `.format`.
- And please use double quotes, per https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#strings
torch/utils/launch.py
Outdated
if len(matches) > 0:
    lst_valid.append(item)
else:
    logger.warning("{} doesn't exist. Removing it from LD_PRELOAD.".format(item))
We are in Python-3 world, please use f-strings
logger.warning("{} doesn't exist. Removing it from LD_PRELOAD.".format(item)) | |
logger.warning(f"{item} doesn't exist. Removing it from LD_PRELOAD.") |
Sure.
torch/utils/launch.py
Outdated
logger.info("Begin to validate the ip connect") | ||
args.master_addr = ip_list[0] | ||
for ip in ip_list[1:]: | ||
completed_process = subprocess.run("ssh -o PasswordAuthentication=no {} ':'".format(ip), shell=True) |
Please use f-strings
completed_process = subprocess.run("ssh -o PasswordAuthentication=no {} ':'".format(ip), shell=True) | |
completed_process = subprocess.run(f"ssh -o PasswordAuthentication=no {ip} ':'", shell=True) |
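For context, a minimal sketch of the connectivity check under discussion, with the f-string applied; the function name and the returncode handling are illustrative additions, not part of the original diff:

import subprocess

def validate_ip_connectivity(ip_list):
    # The first host becomes the master; every other host must be reachable
    # over passwordless SSH before the distributed launch proceeds.
    master_addr = ip_list[0]
    for ip in ip_list[1:]:
        completed_process = subprocess.run(
            f"ssh -o PasswordAuthentication=no {ip} ':'", shell=True
        )
        if completed_process.returncode != 0:
            raise RuntimeError(f"Passwordless SSH to {ip} failed; check SSH keys across nodes.")
    return master_addr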
Force-pushed from 451a387 to 9ac268f.
Thank you for working on this.
Please include unittests with sufficient and necessary test-cases to assert the correctness of the launcher. Feel free to mock certain system-dependent calls where appropriate.
setup.py
Outdated
@@ -874,7 +874,7 @@ def make_relative_rpath_args(path):
     'console_scripts': [
         'convert-caffe2-to-onnx = caffe2.python.onnx.bin.conversion:caffe2_to_onnx',
         'convert-onnx-to-caffe2 = caffe2.python.onnx.bin.conversion:onnx_to_caffe2',
-        'torchrun = torch.distributed.run:main',
+        'torchrun = torch.utils.launch:main',
Could we keep the verb consistent and rename this to `torch.utils.run` (and equivalently: `torch.utils.launch_cpu_inference` -> `torch.utils.inference.run`)?
> `torch.utils.launch_cpu_inference` -> `torch.utils.inference.run`

Should I create a directory `inference` inside `torch/utils`, and move `torch/utils/launch_cpu_inference.py` to `torch/utils/inference/run.py`?
torch/utils/launch.py
Outdated
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('--inference', default=False, action="store_true")
group.add_argument('--distributed', default=False, action="store_true")
This would break backwards compatibility. I suggest we make this default to True.
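A minimal sketch of the backward-compatible behavior being suggested; the flags match the diff above, while the dispatch variable is illustrative:

import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()  # no longer required=True
group.add_argument("--inference", action="store_true")
group.add_argument("--distributed", action="store_true")
args, _ = parser.parse_known_args()

# Neither flag given -> behave like today's torchrun (distributed by default).
run_distributed = args.distributed or not args.inference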
Sure.
Hi @kiukchung, do you expect it to be used as follows?
`torchrun xxxxx` => DDP
`torchrun --inference xxxxxx` => inference
torch/utils/launch.py
Outdated
import argparse
import sys
from torch.utils import launch_cpu_inference
from torch.distributed import run as launch_ddp
Let's avoid module renames when possible, or keep the import statement closer to the call-site. We can move this line down to L26 as:

if args.distributed:
    from torch.distributed import run
    run.main(args)
torch/utils/launch.py
Outdated
if args.inference:
    launch_cpu_inference.main()
if args.distributed:
    launch_ddp.main()
Pass the remaining args explicitly to `launch_ddp.main(args=argv_script)` instead of mutating `sys.argv`. The same goes for `launch_cpu_inference.main(args=argv_script)`.
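A sketch of the suggested dispatch, assuming both entry points accept an `args` parameter (true of `torch.distributed.run.main`; assumed for `launch_cpu_inference.main`):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--inference", action="store_true")
# argv_script holds everything this top-level parser did not consume.
args, argv_script = parser.parse_known_args()

if args.inference:
    from torch.utils import launch_cpu_inference
    launch_cpu_inference.main(args=argv_script)  # assumed signature, mirroring the suggestion
else:
    from torch.distributed import run
    run.main(args=argv_script)  # torch.distributed.run.main(args=None) parses this list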
torch/utils/launch_cpu_inference.py
Outdated
from argparse import ArgumentParser, REMAINDER
from argparse import RawTextHelpFormatter
import logging
import psutil
Is this import required? I didn't see any usages of it. AFAIK torch doesn't define a dependency on `psutil` already, so if this is required we'd have to add it to requirements.txt.
Removed.
torch/utils/launch_cpu_inference.py
Outdated
self.set_env("KMP_BLOCKTIME", "1") | ||
self.logger_env("LD_PRELOAD") | ||
|
||
class MultiInstanceLauncher(Launcher): |
Looks like this `MultiInstanceLauncher` now deals with single-instance launches as well (which makes sense, since single-instance launching is a subset of multi-instance launching). Do we need this class then? Can't we fold all of this into `Launcher` now?
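For illustration, a sketch of the proposed folding; the method names are invented for the sketch, not taken from the PR:

class Launcher:
    def launch(self, args):
        # A single-instance launch is just the ninstances == 1 case,
        # so the loop can live directly on Launcher.
        ninstances = max(getattr(args, "ninstances", 1), 1)
        for instance_idx in range(ninstances):
            self._launch_one(instance_idx, args)

    def _launch_one(self, instance_idx, args):
        ...  # bind cores for this instance, set env vars, spawn the process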
torch/utils/launch_cpu_inference.py
Outdated
# multi-instance control
group.add_argument("--ncore_per_instance", metavar="\b", default=-1, type=int,
                   help="Cores per instance")
group.add_argument("--ninstances", metavar="\b", default=-1, type=int,
Does this script support multi-host runs? An "instance" in this case maps to a process? If so, could we keep it consistent with `torchrun` and use `--nprocs` instead?
This script is intended for inference only at this moment. Both of these two arguments handle inference with multiple independent instances on a single node. An instance maps to a process. `--ncore_per_instance` indicates how many cores are bound to an instance. How about renaming `--ninstances` to `--nprocs`?
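A sketch of the rename being floated; the help strings are illustrative:

from argparse import ArgumentParser

parser = ArgumentParser()
group = parser.add_argument_group("multi-instance control")
group.add_argument("--ncore_per_instance", metavar="\b", default=-1, type=int,
                   help="Cores bound to each instance (one instance == one process)")
group.add_argument("--nprocs", metavar="\b", default=-1, type=int,
                   help="Number of independent inference processes (was --ninstances)")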
torch/utils/launch_cpu_inference.py
Outdated
or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or
{expanduser("~")}/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance""")

def logger_env(self, env_name=""):
This method logs the environment variable name and value, correct? Then it should be called `log_env_var(env_var_name)`.
OK. This prints an environment variable as a name/value pair.
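A minimal sketch of the renamed helper, with the body assumed from the description above:

import logging
import os

logger = logging.getLogger(__name__)

def log_env_var(env_var_name=""):
    # Print an environment variable as a name/value pair.
    logger.info(f"{env_var_name}={os.environ.get(env_var_name, '')}")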
torch/utils/launch_cpu_inference.py
Outdated
def set_env(self, env_name, env_value=None):
    if not env_value:
        logger.warning(f"{env_name} is None")
Was the default value of `None` for `env_value` intentional? If it was intentional (to clear env vars), then a warning is a bit misleading (change to info). Also, the message should be clarified as:

logger.info(f"Clearing environment variable: {env_name}. Previous value: {previous_value}")
I'll remove the default value of the `env_value` parameter.
torch/utils/launch_cpu_inference.py
Outdated
if env_name not in os.environ:
    os.environ[env_name] = env_value
elif os.environ[env_name] != env_value:
    logger.warning(f"{env_name} in environment variable is {os.environ[env_name]} while the value you set is {env_value}")
Same here: if it is a valid action to set env vars that are already present, this should be an "info"-level log. Also, the message can be improved:

logger.warning(f"Overriding already set environment variable: {env_name}. New value: {env_value}. Previous value: {prev_value}")
We don't mean to override the environment variable. We take values set by environment variables as the highest priority. This just notifies the user that they have set a value via the environment variable that differs from the one set in the script.
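Spelled out as a sketch (the log message wording is illustrative), that precedence looks like this:

import logging
import os

logger = logging.getLogger(__name__)

def set_env(env_name, env_value):
    # User-set environment variables win; the script only fills in unset
    # values and surfaces any mismatch it finds.
    if env_name not in os.environ:
        os.environ[env_name] = env_value
    elif os.environ[env_name] != env_value:
        logger.warning(f"{env_name} is already set to {os.environ[env_name]}, "
                       f"which differs from the recommended value {env_value}; "
                       f"keeping the user's setting.")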
Force-pushed from 75966c2 to b6017bc.
Got it. Done.
Force-pushed from 03dbb1a to 2368fa0.
Force-pushed from 6b09f56 to 96b728b.
Thanks for the update on this!
I mainly have questions about the test but the rest looks good.
test/backends/xeon/run_test.py
Outdated
""" | ||
from torch.backends.xeon.run_cpu import _CPUinfo | ||
cpuinfo = _CPUinfo(lscpu_info) | ||
assert cpuinfo._physical_core_nums() == 8 |
Nit: use `self.assertEqual()` for these.
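For illustration, the same check in the suggested style; the fixture data is an assumption about the `lscpu --parse` format, not taken from the test file:

import unittest
from torch.backends.xeon.run_cpu import _CPUinfo

class TestCPUInfo(unittest.TestCase):
    def test_physical_core_nums(self):
        # Assumed fixture: 8 cores, 1 socket, 1 NUMA node, no hyperthreading,
        # as rows of `lscpu --parse=CPU,Core,Socket,Node` output.
        lscpu_info = [f"{i},{i},0,0" for i in range(8)]
        cpuinfo = _CPUinfo(lscpu_info)
        # assertEqual reports both values on failure, unlike a bare assert.
        self.assertEqual(cpuinfo._physical_core_nums(), 8)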
test/backends/xeon/run_test.py
Outdated
assert num == 4, "Failed to launch multiple instances for inference"


if __name__ == "__main__":
    subprocess.call("yes | conda install -c conda-forge gperftools jemalloc", shell=True)
I don't think we ever do such things in our tests.
I think it is better to raise a nice error if they are not available and hint the user towards installing these.
This command is to enable testing of preloading jemalloc and tcmalloc. If this cannot be executed, then I'll need to remove `test_jemalloc` (line 51) and `test_tcmalloc` (line 62) from this UT script. Is it OK to remove these 2 test functions?
I think the way to go would be to check if these are installed and if not, either:
- Skip the tests that require it (see the sketch after this list).
- Raise an error asking the user to install these before running the test.
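A minimal sketch of the first option; probing with `ctypes.util.find_library` and the class name are assumptions, while the test names come from the discussion below:

import ctypes.util
import unittest

def _has_lib(name):
    # find_library resolves e.g. "jemalloc" to libjemalloc if it is installed.
    return ctypes.util.find_library(name) is not None

class TestAllocatorPreload(unittest.TestCase):
    @unittest.skipIf(not _has_lib("jemalloc"), "jemalloc not installed, skipping")
    def test_jemalloc(self):
        ...

    @unittest.skipIf(not _has_lib("tcmalloc"), "tcmalloc (gperftools) not installed, skipping")
    def test_tcmalloc(self):
        ...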
test/backends/xeon/run_test.py
Outdated
@@ -0,0 +1,112 @@
# Owner(s): ["module: unknown"]
Should this be run in CI? If so:
- The file should be renamed to `test_*` so that it actually runs.
- We should have a proper module for it. Should I add a `module: xeon` and add you to the cc list for it?
- We need to make sure that it runs on a machine that has all the proper requirements. Do we have any CI machine with the right hardware to run this?
- Yeah, sure. I'll rename it to `test_launch_xeon.py`.
- Yeah, sounds good to me.
- 4 out of 5 test functions depend on installation of the Jemalloc, TCMalloc and IOMP packages (`test_jemalloc`, `test_tcmalloc`, `test_iomp` and `test_default_config`). If the installation cannot be done, I'll remove them but keep `test_multi_threads` to verify that multiple instances can be launched. This one doesn't require any hardware resources, just a loop that executes a bash command several times.

How about we keep 2 test functions, `test_multi_threads` and `test_cpu_info`, in this UT?
cc @malfet, who is more familiar with the config on our current CI machines: do we have the right hardware? And if so, can we install these libraries by default on the runners?
torch/backends/xeon/run_cpu.py
Outdated
"""

from __future__ import absolute_import, division, print_function, unicode_literals
Is this needed? We only support Python 3.7+.
torch/backends/xeon/run_cpu.py
Outdated
format_str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(level=logging.INFO, format=format_str)
logger = logging.getLogger(__name__)
# from torch.distributed.elastic.utils.logging import get_logger
spurious comment?
from typing import List, Dict

format_str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(level=logging.INFO, format=format_str)
Should this be done inside the `if __name__ == "__main__"` block below?
This file is imported during testing, so this is going to change the global state of logging during testing?
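A sketch of one way to avoid that side effect: configure logging only in the script entry point, so importing the module leaves global logging state untouched.

import logging

logger = logging.getLogger(__name__)

def main():
    ...

if __name__ == "__main__":
    # basicConfig runs only for `python -m ...` execution, not on import.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    main()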
torch/backends/xeon/run_cpu.py
Outdated
lscpu_cmd = ["lscpu", "--parse=CPU,Core,Socket,Node"] | ||
lscpu_info = subprocess.check_output(lscpu_cmd, universal_newlines=True).split("\n") | ||
else: | ||
print("Running test to verify correctness of CPUinfo class") |
log?
test/backends/xeon/test_launch.py
Outdated
@@ -0,0 +1,63 @@
# Owner(s): ["module: unknown"]
This one should be `module: intel`?
Force-pushed from 0f7bce9 to 669b87f.
The update sounds good to me.
Can you please make sure the tests pass, though? If that is only going to work on Xeon machines, you can skip the test (with `unittest.skipIf()`) if you're not on Xeon.
Force-pushed from a926fe5 to 4193cca.
Hi @albanD, I've fixed the CI failures.
This will need to be fixed before merging.
Force-pushed from 4193cca to 62407a6.
CI failure is real.
Oh, the launch script doesn't support Windows because hardware info is read the Linux way... I'll skip this UT on Windows.
test/backends/xeon/test_launch.py
Outdated
def tearDown(self):
    shutil.rmtree(self._test_dir)

@unittest.skipIf(IS_LINUX, "Windows not supported")
- You can apply this decorator to the class itself if you want to skip all the tests in that class.
- You want to skip if it is NOT Linux, right?
- It is both Windows and macOS that won't run? The message could be "Only works on Linux" if that's the case (see the sketch after this list).
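Combining the three points, a sketch; the class and test names are illustrative:

import unittest
from torch.testing._internal.common_utils import IS_LINUX

# Decorating the class skips every test in it, and the inverted condition
# skips on anything that is not Linux (Windows and macOS alike).
@unittest.skipIf(not IS_LINUX, "Only works on Linux")
class TestXeonLaunch(unittest.TestCase):
    def test_multi_instances(self):
        ...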
Thanks for the update.
Looks good to me.
Could you rebase on latest master to ensure that CI does run fine?
Sure. Will do. Thanks.
Force-pushed from a640f5a to 39d88cd.
Commit summary:
- correct module name in run_cpu.py
- update UT
- update module name
- add __init__.py to torch/backends/xeon
- fix build doc error
- add skipif to the UT to run on Linux only
- import unittest in test_launch.sh
- fix improper skipif in test_launch.py
@pytorchbot merge -g
@pytorchbot successfully started a merge job. Check the current status here.
Hey @jingxu10.
A Launch script with Best Recipe of Deep Learning on Intel Xeon CPU (#63932) (#63932)

Summary: Fixes #63556
Usage: `python -m torch.backends.xeon.launch [--knobs] <script> [script parameters]`
Pull Request resolved: #63932
Approved by: https://github.com/albanD
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/5257d1d64ba8d5faffe8082342d2949c4e1c0e1e
Original Phabricator Test Plan: Imported from GitHub, without a `Test Plan:` line.
Reviewed By: osalpekar
Differential Revision: D38290825
Pulled By: osalpekar
fbshipit-source-id: 579fc5bb2b3075a2fa8de939796d1263c36ea311
Fixes #63556
Usage: `python -m torch.backends.xeon.launch [--knobs] <script> [script parameters]`