Add option to log subprocess output to files in DDP launcher. #33193
Conversation
Will add reviewers after polishing.
@rohan-varma is there any official pytorch solution for this? We are running into this problem in YOLOv5, and are attempting to manually patch together a messy Python logger solution in ultralytics/yolov5#719, but it would be ideal if there were a more general solution for the wider community.
@rohan-varma What is the status of this PR? I would love to take advantage of this capability.
Thanks for checking in @glenn-jocher @hendrytl! I will clean up this PR and aim to publish it for review this week. |
@rohan-varma awesome, thank you! This will help a lot of YOLOv5 users training multi-GPU with DDP :)
torch/distributed/launch.py (outdated)

```python
subprocess_stdout = None
if args.logdir:
    directory_path = os.path.join(os.getcwd(), args.logdir)
    file_handle = open(os.path.join(directory_path, "process_{}".format(local_rank)), "w")
```
Would it make sense to also include the node_rank so that log file names from multiple machines do not collide? Perhaps "process_{}_{}".format(node_rank, local_rank)?
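For illustration, a minimal runnable sketch of the suggested naming, with placeholder values standing in for the launcher's parsed arguments:

```python
import os

# Placeholder values; in the launcher these come from the parsed arguments.
node_rank, local_rank = 0, 1
directory_path = os.path.join(os.getcwd(), "test_logdir")
os.makedirs(directory_path, exist_ok=True)

# Including node_rank keeps log file names from colliding across machines.
log_path = os.path.join(directory_path, "process_{}_{}".format(node_rank, local_rank))
file_handle = open(log_path, "w")
file_handle.close()
```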
@glenn-jocher @hendrytl I updated the PR, feel free to take a look and let me know if it works for your use case. I included an example in the PR description.
torch/distributed/launch.py (outdated)

```diff
@@ -255,6 +289,10 @@ def main():
             raise subprocess.CalledProcessError(returncode=process.returncode,
                                                 cmd=cmd)

     # close open file descriptors
     for file_handle in subprocess_file_handles:
```
Add to a finally block to ensure file handles are closed if an error is raised.
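A minimal sketch of that suggestion, with a placeholder command and file name rather than the PR's exact code:

```python
import subprocess

cmd = ["python", "train.py"]  # placeholder worker command
file_handle = open("process_0.log", "w")
subprocess_file_handles = [file_handle]

process = subprocess.Popen(cmd, stdout=file_handle, stderr=file_handle)
try:
    process.wait()
    if process.returncode != 0:
        raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
finally:
    # Runs whether or not CalledProcessError was raised above, so the
    # log files are always flushed and closed.
    for handle in subprocess_file_handles:
        handle.close()
```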
Codecov Report

```
@@            Coverage Diff             @@
##           master   #33193      +/-   ##
==========================================
- Coverage   68.99%   68.95%    -0.04%
==========================================
  Files         433      433
  Lines       55915    55941       +26
==========================================
- Hits        38576    38575        -1
- Misses      17339    17366       +27
```
@pritamdamania87 @mrshenli Would be great if you could take a look at this, as it seems like it would be a useful addition for users of `torch.distributed.launch`.
fwiw we were planning to add this exact feature to the torchelastic agent (and naturally we’d have to add it to torchelastic.distributed.launch).
torch/distributed/launch.py (outdated)

```python
print("passed in --logdir must be a relative path to a directory. Ignoring argument.")
args.logdir = None
```
We should probably raise an Error here instead of continuing without logging.
torch/distributed/launch.py (outdated)

```python
print(f"Note: Stdout for node {node_rank} rank {local_rank} will be written to {file_path}")

process = subprocess.Popen(cmd, env=current_env, stdout=subprocess_stdout)
```
Shouldn't we also write stderr to a file?
Updated the diff to do this.
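For reference, a hedged sketch of sending both streams to one log file (placeholder names, not necessarily the exact code in the updated diff):

```python
import subprocess

log_file = open("node_0_local_rank_0", "w")
# Point both stdout and stderr of the worker at the same log file;
# passing stderr=subprocess.STDOUT instead would interleave them via stdout.
process = subprocess.Popen(["python", "train.py"], stdout=log_file, stderr=log_file)
```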
torch/distributed/launch.py (outdated)

```diff
@@ -235,7 +236,7 @@ def main():
     # Possibly create the directory to write subprocess log output to.
     if os.path.exists(args.logdir):
         if not os.path.isdir(args.logdir):
-            print("passed in --logdir must be a relative path to a directory. Ignoring argument.")
+            raise ValueError("argument --logdir must be a path to a directory.")
             args.logdir = None
```
This will never be executed and can be removed now?
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@rohan-varma merged this pull request in ccb79f3.
Closes #7134. This PR adds an option to log the subprocess output (each subprocess is training a network with DDP) to a file instead of the default stdout.

The reason for this is that if we have N processes all writing to stdout, it is hard to decipher the output, and it is cleaner to log each process to a separate file.

To support this, we add an optional argument `--logdir` that sets each subprocess's stdout to a file of the form `node_{}_local_rank_{}` in the logging directory. With this enabled, none of the training processes write to the parent process's stdout; instead, they write to the aforementioned files. If a user accidentally passes in something that's not a directory, we fall back to ignoring this argument.

Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running:

```
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py
```

This results in a directory `test_logdir` with files `node_0_local_rank_0` and `node_0_local_rank_1` containing the training processes' stdout.