Add option to log subprocess output to files in DDP launcher. #33193
Conversation
Will add reviewers after polishing.
@rohan-varma is there any official pytorch solution for this? We are running into this problem in YOLOv5, and are attempting to manually patch together a messy Python logger solution in ultralytics/yolov5#719, but it would be ideal if there were a more general solution for the wider community.
@rohan-varma What is the status of this PR? I would love to take advantage of this capability.
Thanks for checking in @glenn-jocher @hendrytl! I will clean up this PR and aim to publish it for review this week. |
@rohan-varma awesome, thank you! This will help a lot of YOLOv5 users training multi-GPU with DDP :)
torch/distributed/launch.py (outdated)

```python
subprocess_stdout = None
if args.logdir:
    directory_path = os.path.join(os.getcwd(), args.logdir)
    file_handle = open(os.path.join(directory_path, "process_{}".format(local_rank)), "w")
```
Would it make sense to also include the node_rank so that log file names from multiple machines do not collide? Perhaps "process_{}_{}".format(node_rank, local_rank)?
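For illustration, a minimal runnable sketch of the suggested naming, with placeholder values standing in for the launcher's parsed arguments:

```python
import os

# Placeholder values; in the launcher these come from the parsed arguments.
node_rank, local_rank = 0, 1
directory_path = os.path.join(os.getcwd(), "test_logdir")
os.makedirs(directory_path, exist_ok=True)

# Including node_rank keeps log file names from colliding across machines.
log_path = os.path.join(directory_path, "process_{}_{}".format(node_rank, local_rank))
file_handle = open(log_path, "w")
file_handle.close()
```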
@glenn-jocher @hendrytl I updated the PR, feel free to take a look and let me know if it works for your use case. I included an example in the PR description.
torch/distributed/launch.py (outdated)

```diff
@@ -255,6 +289,10 @@ def main():
             raise subprocess.CalledProcessError(returncode=process.returncode,
                                                 cmd=cmd)

     # close open file descriptors
     for file_handle in subprocess_file_handles:
```
Add to a finally block to ensure file handles are closed if an error is raised.
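A minimal sketch of that suggestion, with a placeholder command and file name rather than the PR's exact code:

```python
import subprocess

cmd = ["python", "train.py"]  # placeholder worker command
file_handle = open("process_0.log", "w")
subprocess_file_handles = [file_handle]

process = subprocess.Popen(cmd, stdout=file_handle, stderr=file_handle)
try:
    process.wait()
    if process.returncode != 0:
        raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
finally:
    # Runs whether or not CalledProcessError was raised above, so the
    # log files are always flushed and closed.
    for handle in subprocess_file_handles:
        handle.close()
```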
Codecov Report

```
@@            Coverage Diff             @@
##           master   #33193      +/-   ##
==========================================
- Coverage   68.99%   68.95%    -0.04%
==========================================
  Files         433      433
  Lines       55915    55941       +26
==========================================
- Hits        38576    38575        -1
- Misses      17339    17366       +27
```
@pritamdamania87 @mrshenli Would be great if you could take a look at this, as it seems like it would be a useful addition for users of `torch.distributed.launch`.
fwiw we were planning to add this exact feature to the torchelastic agent (and naturally we’d have to add it to torchelastic.distributed.launch).
torch/distributed/launch.py (outdated)

```python
print("passed in --logdir must be a relative path to a directory. Ignoring argument.")
args.logdir = None
```
We should probably raise an Error here instead of continuing without logging.
torch/distributed/launch.py (outdated)

```python
print(f"Note: Stdout for node {node_rank} rank {local_rank} will be written to {file_path}")

process = subprocess.Popen(cmd, env=current_env, stdout=subprocess_stdout)
```
Shouldn't we also write stderr to a file?
Updated the diff to do this.
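For reference, a hedged sketch of sending both streams to one log file (placeholder names, not necessarily the exact code in the updated diff):

```python
import subprocess

log_file = open("node_0_local_rank_0", "w")
# Point both stdout and stderr of the worker at the same log file;
# passing stderr=subprocess.STDOUT instead would interleave them via stdout.
process = subprocess.Popen(["python", "train.py"], stdout=log_file, stderr=log_file)
```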
torch/distributed/launch.py (outdated)

```diff
@@ -235,7 +236,7 @@ def main():
     # Possibly create the directory to write subprocess log output to.
     if os.path.exists(args.logdir):
         if not os.path.isdir(args.logdir):
-            print("passed in --logdir must be a relative path to a directory. Ignoring argument.")
+            raise ValueError("argument --logdir must be a path to a directory.")
             args.logdir = None
```
This will never be executed and can be removed now?
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@rohan-varma merged this pull request in ccb79f3.
Closes #7134. This PR adds an option to log the subprocess output (each subprocess is training a network with DDP) to a file instead of the default stdout.

The reason for this is that if we have N processes all writing to stdout, it is hard to decipher the output, and it is cleaner to log each process to a separate file.

To support this, we add an optional argument `--logdir` that sets each subprocess's stdout to a file of the form `node_{}_local_rank_{}` in the logging directory. With this enabled, none of the training processes write to the parent process's stdout; instead, they write to the aforementioned files. If a user accidentally passes in something that's not a directory, we fall back to ignoring this argument.

Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running:

```
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py
```

This results in a directory `test_logdir` with files `node_0_local_rank_0` and `node_0_local_rank_1` containing the training processes' stdout.