Skip to content

Conversation

sinhaanshul
Copy link
Contributor

@sinhaanshul sinhaanshul commented Jul 2, 2024

Stack from ghstack (oldest at bottom):

Summary
Currently, users have 2 options to view the tracing data. The first is through console where colored text is used to help users read the information. The second is they can log the information to a text file to view the log, which is useful in instances where the log is too long to fit in the console. However, depending on the model complexity, these logs could go on for thousands of lines making it difficult for the user to find specific information. In order to fix this, I have added the functionality to convert the log into a JSON file, which will be used to create a tree view in a browser, allowing the user to collapse parts of the log that will not be useful to them. I have given the user the option to pass their own file path, but have a default one in the event that none is provided. The expected output of the beginning json file and the browser view for the MLP model are shown below:

Screenshot 2024-07-02 at 3 40 41 PM Screenshot 2024-07-02 at 3 41 43 PM

Test Plan

  1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

  2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @chauhang @d4l3k

[ghstack-poisoned]
Copy link

pytorch-bot bot commented Jul 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129994

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 0ac7e2b with merge base 784e3b4 (image):

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 2, 2024
sinhaanshul added a commit that referenced this pull request Jul 2, 2024
@sinhaanshul sinhaanshul requested review from XilunWu, tianyu-l and wz337 July 2, 2024 22:42
@sinhaanshul sinhaanshul added the topic: not user facing topic category label Jul 2, 2024
Copy link
Contributor

@XilunWu XilunWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except 2 niches. You can merge this PR after addressing it.


# converts dictonary into json file
with open(file_name, "w") as json_file:
json.dump(json_dict, json_file, indent=4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious, is the indent=4 a requirement by json dataloader or just a good number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the number I chose as it matches the indentation used in python files, just makes json easy to read when debugging

print(comm_mode.generate_operation_tracing_table())
comm_mode.log_operation_tracing_table_to_file()

def test_MLP_json_dump(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example snippet, I'm not sure if we should name it with test_ prefix. Does other reviewer stand with this decision? If no, I suggest that we rename them with prefix example_.

[ghstack-poisoned]
@sinhaanshul
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 3, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 5, 5, linux.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

[ghstack-poisoned]
@sinhaanshul
Copy link
Contributor Author

@pytorchbot merge -f

Copy link

pytorch-bot bot commented Jul 5, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Try @pytorchbot --help for more info.

[ghstack-poisoned]
@sinhaanshul
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

Details for Dev Infra team Raised by workflow job

@sinhaanshul
Copy link
Contributor Author

@pytorchbot merge -f "unrelated CI failure"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Jul 8, 2024
…parameter sharding and module fqn (#130072)

**Summary**
In order to give users more information, I have added the deviceMesh for operations with DTensor inputs, and module parameter sharding and FQN. These changes have only been placed in operation tracing log. In the future, I plan to just have one logging function with an argument to show how detailed a user wants the log to be, and will get rid of the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file as the module_depth dictionary has been replaced with module_helper_dict and have edited the example output for the MLP operation tracing which can be seen below:

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: #130072
Approved by: https://github.com/XilunWu
ghstack dependencies: #129994
@github-actions github-actions bot deleted the gh/sinhaanhsul/28/head branch August 8, 2024 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants