[dtensor][debug] Added functionality to convert log into a json file #129994

sinhaanshul · 2024-07-02T22:32:02Z

Stack from ghstack (oldest at bottom):

[dtensor][debug] added deviceMesh for relevant operations and module parameter sharding and module fqn #130072
-> [dtensor][debug] Added functionality to convert log into a json file #129994

Summary
Currently, users have two options for viewing the tracing data. The first is the console, where colored text helps users read the information. The second is to log the information to a text file, which is useful when the log is too long to fit in the console. However, depending on model complexity, these logs can run for thousands of lines, making it difficult to find specific information. To fix this, I have added functionality to convert the log into a JSON file, which will be used to create a tree view in a browser, allowing the user to collapse parts of the log that are not useful to them. The user can pass their own file path; a default path is used if none is provided. The beginning of the expected JSON file and the browser view for the MLP model are shown below:
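As a rough illustration of the idea in the summary (not the PR's actual code or its expected output), the sketch below turns a flat, depth-annotated log into a nested dictionary and writes it as JSON, falling back to a default path when none is given. The names `build_tree`, `write_trace_json`, and `DEFAULT_JSON_PATH`, as well as the `(depth, name)` entry format, are hypothetical and chosen only for this example.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical default path; the PR's real default is not shown in this thread.
DEFAULT_JSON_PATH = "trace_log.json"


def build_tree(entries: List[Tuple[int, str]]) -> Dict[str, Any]:
    """Convert flat (depth, name) log entries into a nested tree."""
    root: Dict[str, Any] = {"name": "root", "children": []}
    # Stack of (depth, node); the root sits at depth -1 so it is never popped.
    stack: List[Tuple[int, Dict[str, Any]]] = [(-1, root)]
    for depth, name in entries:
        node: Dict[str, Any] = {"name": name, "children": []}
        # Pop until the top of the stack is a strict ancestor of this entry.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root


def write_trace_json(
    entries: List[Tuple[int, str]], file_name: Optional[str] = None
) -> str:
    """Write the tree as JSON to file_name, or to the default path if omitted."""
    path = file_name or DEFAULT_JSON_PATH
    with open(path, "w") as json_file:
        json.dump(build_tree(entries), json_file, indent=4)
    return path
```

A nested structure of this kind is what lets a browser-based tree view collapse whole subtrees, which a flat text log cannot offer.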

Test Plan

torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @chauhang @d4l3k

pytorch-bot · 2024-07-02T22:32:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129994

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 0ac7e2b with merge base 784e3b4:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) (gh) (#130257)
test_decomp 11/12 failed!
trunk / win-vs2019-cpu-py3 / test (default, 3, 3, windows.4xlarge.nonephemeral) (gh) (#130257)
test_decomp 2/12 failed!

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 710a830 Pull Request resolved: #129994

XilunWu

LGTM except for 2 nits. You can merge this PR after addressing them.

XilunWu · 2024-07-03T19:00:57Z

torch/distributed/_tensor/debug/comm_mode.py

+
+        # converts dictionary into json file
+        with open(file_name, "w") as json_file:
+            json.dump(json_dict, json_file, indent=4)


Curious: is indent=4 a requirement of the JSON dataloader, or just a good number?

I chose 4 because it matches the indentation used in Python files; it just makes the JSON easier to read when debugging.
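To make the point about `indent=4` concrete, here is a minimal sketch using only the standard `json` module. The `log` dictionary is made-up sample data, not the PR's real output. It shows that `indent` changes only the layout of the serialized text, so any JSON loader reads both forms identically.

```python
import json

# Made-up sample data standing in for a small piece of a tracing log.
log = {"module": "MLP", "children": [{"module": "net1", "ops": ["aten.addmm"]}]}

compact = json.dumps(log)           # everything on a single line
pretty = json.dumps(log, indent=4)  # one key or item per line, 4-space indent

# Both strings parse back to the same Python object.
assert json.loads(compact) == json.loads(pretty)

print(pretty)
```

The indented form is larger on disk but far easier to scan by eye, which is the trade-off described in the reply above.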

XilunWu · 2024-07-03T19:03:45Z

torch/distributed/_tensor/examples/comm_mode_features_example.py

        print(comm_mode.generate_operation_tracing_table())
        comm_mode.log_operation_tracing_table_to_file()

+    def test_MLP_json_dump(self) -> None:


For an example snippet, I'm not sure we should name it with the test_ prefix. Do the other reviewers agree with this decision? If not, I suggest renaming them with the prefix example_.
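One way to see why the prefix matters is a name-based dispatcher: if examples are selected with `-e <name>`, the prefix is simply prepended to find the method to run. The sketch below is hypothetical and is not the actual code of `comm_mode_features_example.py`; `ExampleSuite`, `run_example`, and `main` are names invented for illustration, and the method bodies are placeholders.

```python
import argparse
from typing import List, Optional


class ExampleSuite:
    """Hypothetical container for runnable examples."""

    def example_MLP_json_dump(self) -> str:
        return "ran MLP_json_dump"  # placeholder body

    def example_transformer_json_dump(self) -> str:
        return "ran transformer_json_dump"  # placeholder body


def run_example(suite: ExampleSuite, name: str, prefix: str = "example_") -> str:
    """Look up and call the method named prefix + name."""
    method = getattr(suite, prefix + name, None)
    if method is None:
        raise ValueError(f"unknown example: {name}")
    return method()


def main(argv: Optional[List[str]] = None) -> str:
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--example", required=True)
    args = parser.parse_args(argv)
    return run_example(ExampleSuite(), args.example)
```

With this layout, renaming from `test_` to `example_` only requires changing the prefix in one place, and the names no longer look like test cases to a test collector.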

sinhaanshul · 2024-07-03T23:24:54Z

@pytorchbot merge

pytorchmergebot · 2024-07-03T23:26:23Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-07-04T01:35:23Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 5, 5, linux.4xlarge.nvidia.gpu)

Details for Dev Infra team

Raised by workflow job

sinhaanshul · 2024-07-05T22:23:26Z

@pytorchbot merge -f

pytorch-bot · 2024-07-05T22:23:28Z

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Try @pytorchbot --help for more info.
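The error above has the shape of a standard `argparse` message for an option that requires a value. The sketch below reproduces that behavior with a small parser modeled on part of the usage line; it is an assumption-laden illustration, not the bot's real source, and `build_merge_parser` and `try_parse` are hypothetical names.

```python
import argparse
import contextlib
import io
from typing import List, Optional, Tuple


def build_merge_parser() -> argparse.ArgumentParser:
    """Parser modeled on the `[-f MESSAGE | -i]` part of the usage line."""
    parser = argparse.ArgumentParser(prog="@pytorchbot merge", add_help=False)
    group = parser.add_mutually_exclusive_group()
    # -f takes exactly one argument: the reason for forcing the merge.
    group.add_argument("-f", "--force", metavar="MESSAGE")
    group.add_argument("-i", "--ignore-current", action="store_true")
    return parser


def try_parse(argv: List[str]) -> Tuple[Optional[argparse.Namespace], Optional[str]]:
    """Return (namespace, None) on success or (None, error text) on failure."""
    parser = build_merge_parser()
    stderr = io.StringIO()
    try:
        with contextlib.redirect_stderr(stderr):
            return parser.parse_args(argv), None
    except SystemExit:
        # argparse prints the usage and error to stderr, then exits.
        return None, stderr.getvalue()


# A bare -f fails; -f followed by a quoted reason succeeds.
print(try_parse(["-f"])[1])
print(try_parse(["-f", "unrelated CI failure"])[0])
```

This matches the sequence in the thread: the bare `-f` was rejected, and the later command that supplied a quoted reason was accepted.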

sinhaanshul · 2024-07-05T22:56:40Z

@pytorchbot merge

pytorchmergebot · 2024-07-05T22:58:18Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-07-06T00:21:14Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

sinhaanshul · 2024-07-08T17:14:03Z

@pytorchbot merge -f "unrelated CI failure"

pytorchmergebot · 2024-07-08T17:15:26Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

…parameter sharding and module fqn (#130072)

**Summary**
In order to give users more information, I have added the deviceMesh for operations with DTensor inputs, and module parameter sharding and FQN. These changes have only been placed in operation tracing log. In the future, I plan to just have one logging function with an argument to show how detailed a user wants the log to be, and will get rid of the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file as the module_depth dictionary has been replaced with module_helper_dict and have edited the example output for the MLP operation tracing which can be seen below:

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: #130072
Approved by: https://github.com/XilunWu
ghstack dependencies: #129994

Update

5c00206

[ghstack-poisoned]

sinhaanshul mentioned this pull request Jul 2, 2024

[dtensor][debug] Added forward and backward differentiation for module level tracing #129602

Closed

sinhaanshul mentioned this pull request Jun 28, 2024

[dtensor][be] Reduced redundant LOC by creating functions to set up models used in example #129613

Closed

pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jul 2, 2024

sinhaanshul added a commit that referenced this pull request Jul 2, 2024

[dtensor][debug] Added functionality to convert log into a json file

0730724

ghstack-source-id: 710a830 Pull Request resolved: #129994

sinhaanshul requested review from XilunWu, tianyu-l and wz337 July 2, 2024 22:42

sinhaanshul added the topic: not user facing label Jul 2, 2024

XilunWu approved these changes Jul 3, 2024

Update

fc16cbc

[ghstack-poisoned]

sinhaanshul mentioned this pull request Jul 3, 2024

[dtensor][debug] added deviceMesh for relevant operations and module parameter sharding and module fqn #130072

Closed

pytorch-bot bot added the ciflow/trunk label Jul 3, 2024

pytorchmergebot added the merging label Jul 3, 2024

pytorchmergebot removed the merging label Jul 4, 2024

Update

a543740

[ghstack-poisoned]

Update

0ac7e2b

[ghstack-poisoned]

pytorchmergebot added the merging label Jul 5, 2024

pytorchmergebot removed the merging label Jul 6, 2024

pytorchmergebot added the merging label Jul 8, 2024

pytorchmergebot closed this in a18568f Jul 8, 2024

pytorchmergebot added Merged and removed merging labels Jul 8, 2024

github-actions bot deleted the gh/sinhaanhsul/28/head branch August 8, 2024 01:58
