
Add option to log subprocess output to files in DDP launcher. #33193

Closed
wants to merge 14 commits into master from launch_log

Conversation

rohan-varma
Member

@rohan-varma rohan-varma commented Feb 11, 2020

Closes #7134. This PR adds an option to log the subprocess output (each subprocess is training a network with DDP) to a file instead of the default stdout.

The reason for this is that if we have N processes all writing to stdout, it'll be hard to decipher the output, and it would be cleaner to log these to separate files.

To support this, we add an optional argument --logdir that sets each subprocess's stdout to a file of the format "node_{}_local_rank_{}" in the logging directory. With this enabled, none of the training processes write to the parent process's stdout; instead they write to the aforementioned file. If a user accidentally passes in something that is not a directory, we fall back to ignoring this argument.

Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py. This results in a directory test_logdir with files "node_0_local_rank_0" and "node_0_local_rank_1" being created with the training process stdout.
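The redirection itself can be sketched roughly like this (illustrative names, not the launcher's exact code; the real launcher builds `cmd` from the training script and passes the distributed environment variables):

```python
import os
import subprocess

def launch_worker(cmd, logdir, node_rank, local_rank):
    """Start one training process with stdout redirected to a per-rank file."""
    os.makedirs(logdir, exist_ok=True)
    log_path = os.path.join(
        logdir, "node_{}_local_rank_{}".format(node_rank, local_rank))
    log_file = open(log_path, "w")
    # The child inherits this file descriptor, so its prints land in the
    # file instead of the parent's stdout.
    process = subprocess.Popen(cmd, stdout=log_file)
    return process, log_file
```

Each rank gets its own file, so output from N workers never interleaves.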

@rohan-varma
Member Author

Will add reviewers after polishing

@dr-ci

dr-ci bot commented Feb 11, 2020

💊 CI failures summary and remediations

As of commit 6a5261e (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

codecov.io: 1 failed



@glenn-jocher

@rohan-varma is there any official pytorch solution for this? We are running into this problem in YOLOv5, and are attempting to manually patch together a messy python logger solution here ultralytics/yolov5#719, but it would be ideal if there were a more general solution for the wider community.

@hendrytl

@rohan-varma What is the status of this PR? I would love to take advantage of this capability.

@rohan-varma
Member Author

Thanks for checking in @glenn-jocher @hendrytl! I will clean up this PR and aim to publish it for review this week.

@glenn-jocher

glenn-jocher commented Sep 21, 2020

@rohan-varma awesome, thank you! This will help a lot of YOLOv5 users training multi-gpu with DDP :)

subprocess_stdout = None
if args.logdir:
    directory_path = os.path.join(os.getcwd(), args.logdir)
    file_handle = open(os.path.join(directory_path, "process_{}".format(local_rank)), "w")


Would it make sense to also include the node_rank so that log file names from multiple machines do not collide? Perhaps "process_{}_{}".format(node_rank, local_rank)?

@rohan-varma rohan-varma changed the title [WIP] Add option to log subprocess output to files in DDP launcher. Add option to log subprocess output to files in DDP launcher. Sep 24, 2020
@rohan-varma
Member Author

@glenn-jocher @hendrytl I updated the PR, feel free to take a look and let me know if it works for your use case. I included an example in the PR description.

@@ -255,6 +289,10 @@ def main():
            raise subprocess.CalledProcessError(returncode=process.returncode,
                                                cmd=cmd)

    # close open file descriptors
    for file_handle in subprocess_file_handles:


Add to a finally block to ensure file handles are closed if an error is raised.
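The suggested change might look roughly like this (a sketch; `processes` and `subprocess_file_handles` stand in for the launcher's lists):

```python
import subprocess

def wait_for_workers(processes, subprocess_file_handles):
    """Wait on all workers; close log file handles even if a worker fails."""
    try:
        for process in processes:
            process.wait()
            if process.returncode != 0:
                raise subprocess.CalledProcessError(
                    returncode=process.returncode, cmd=process.args)
    finally:
        # Executes on both the success path and the error path.
        for file_handle in subprocess_file_handles:
            file_handle.close()
```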

@codecov

codecov bot commented Sep 25, 2020

Codecov Report

Merging #33193 into master will decrease coverage by 0.03%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #33193      +/-   ##
==========================================
- Coverage   68.99%   68.95%   -0.04%     
==========================================
  Files         433      433              
  Lines       55915    55941      +26     
==========================================
- Hits        38576    38575       -1     
- Misses      17339    17366      +27     

@rohan-varma
Member Author

@pritamdamania87 @mrshenli Would be great if you can take a look at this as it seems like it would be a useful addition for users of torch.distributed.launch.

@kiukchung
Collaborator

fwiw we were planning to add this exact feature to the torchelastic agent (and naturally we’d have to add it to torchelastic.distributed.launch).

Comment on lines 238 to 239
            print("passed in --logdir must be a relative path to a directory. Ignoring argument.")
            args.logdir = None
Contributor


We should probably raise an Error here instead of continuing without logging.

print(f"Note: Stdout for node {node_rank} rank {local_rank} will be written to {file_path}")


process = subprocess.Popen(cmd, env=current_env, stdout=subprocess_stdout)
Contributor


Shouldn't we also write stderr to a file?

Member Author


Updated the diff to do this.
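In sketch form, routing both streams to the same handle is a one-argument change (`subprocess_stdout` is `None` when no `--logdir` was given, which leaves both streams attached to the parent's stdout/stderr):

```python
import subprocess

def spawn_worker(cmd, current_env, subprocess_stdout=None):
    """Start a worker; stdout and stderr both go to the log file (or inherit)."""
    return subprocess.Popen(cmd, env=current_env,
                            stdout=subprocess_stdout,
                            stderr=subprocess_stdout)
```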

@@ -235,7 +236,7 @@ def main():
     # Possibly create the directory to write subprocess log output to.
     if os.path.exists(args.logdir):
         if not os.path.isdir(args.logdir):
-            print("passed in --logdir must be a relative path to a directory. Ignoring argument.")
+            raise ValueError("argument --logdir must be a path to a directory.")
             args.logdir = None
Contributor


This will never be executed and can be removed now?
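With the unreachable line dropped, the check might reduce to something like this (a sketch of the validation only, not the exact final diff):

```python
import os

def prepare_logdir(logdir):
    """Fail fast on a non-directory path; create the directory if missing."""
    if os.path.exists(logdir):
        if not os.path.isdir(logdir):
            raise ValueError("argument --logdir must be a path to a directory.")
    else:
        os.mkdir(logdir)
```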

Contributor

@facebook-github-bot facebook-github-bot left a comment


@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@rohan-varma merged this pull request in ccb79f3.


@facebook-github-bot facebook-github-bot deleted the launch_log branch January 27, 2021 18:26

Successfully merging this pull request may close these issues.

torch.distributed.launch outputs all processes to stdout (single node)
6 participants