Update FR tutorial to include file path, writer and print usage #3058

Merged
fduwjj merged 2 commits into main from fr_tutorial on Sep 23, 2024

Conversation

fduwjj
Contributor

@fduwjj fduwjj commented Sep 23, 2024

Description

This PR updates the Flight Recorder (FR) tutorial to reflect the latest changes and to provide more context on how to use it.

Checklist

  • The issue that is being fixed is referred to in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included into this pull request.

pytorch-bot bot commented Sep 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3058

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 143d736 with merge base 0b23f46 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@svekars svekars left a comment

Minor editorial nits.

prototype_source/flight_recorder_tutorial.rst: 8 review threads (outdated, resolved)
@fduwjj fduwjj merged commit cd7f684 into main Sep 23, 2024
18 of 19 checks passed
@fduwjj fduwjj deleted the fr_tutorial branch September 23, 2024 20:08
@@ -71,6 +73,9 @@ Additional Settings

``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
- If you prefer not to have the flight recorder data dumped to the local disk but rather to your own storage, you can define your own writer class.
This class should inherit from class ``::c10d::DebugInfoWriter`` and then be registered using ``::c10d::DebugInfoWriter::registerWriter``.
Contributor

@c-p-i-o c-p-i-o Sep 23, 2024

Suggestion:
Can you make this (DebugInfoWriter) a link to the code in PyTorch?

@@ -48,6 +48,8 @@ Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working.

- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Sets the path, including the file prefix, where the flight recorder data will be dumped. One file per
Contributor

Nit: Can you move this to the Optional settings section?
Otherwise the comment on Line 49 can be confusing to the customer (there are two required environment variables).
Or you can fix the comment.
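
For readers following the diff hunks above, here is a minimal sketch of how the settings they name might be applied from Python. Only `TORCH_NCCL_DEBUG_INFO_TEMP_FILE` and `TORCH_NCCL_TRACE_CPP_STACK` come from the tutorial text; the helper name `flight_recorder_env`, its parameters, the example path, and the value `"1"` are assumptions made for illustration.

```python
import os


def flight_recorder_env(dump_prefix, cpp_stack=False):
    """Return environment settings for the options named in the diff hunks.

    `flight_recorder_env` and its parameters are hypothetical names used
    only for this illustration.
    """
    env = {"TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix}
    if cpp_stack:
        # Assumption: "1" is accepted as a true value for this flag.
        env["TORCH_NCCL_TRACE_CPP_STACK"] = "1"
    return env


# The path below is a made-up example of a dump location with a file prefix.
settings = flight_recorder_env("/tmp/fr_dumps/trace_", cpp_stack=True)

# Assumption: the variables need to be in place before distributed
# initialization, so they are exported at the start of the program.
os.environ.update(settings)
```

Keeping the settings in a plain dictionary makes it easy to log or review them before they are exported.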

Currently, we support two modes for the analyzer script. The first mode allows the script to apply some heuristics to the parsed flight
recorder dumps to generate a report identifying potential culprits for the timeout. The second mode simply outputs the raw dumps.
By default, the script prints flight recorder dumps for all ranks and all ``ProcessGroups`` (PGs). This can be narrowed down to certain
ranks and PGs. An example command is:
Contributor

Suggested clarification to sentence.

can be narrowed down to certain ranks and PGs using the selected-ranks option.
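
To make the idea of narrowing output to certain ranks and PGs concrete, here is a small conceptual sketch. It is not the analyzer script's implementation: the entry layout (`rank`, `pg`, `op` keys) and the function `select_entries` are hypothetical and stand in for whatever the real dumps contain.

```python
def select_entries(entries, ranks=None, pgs=None):
    """Keep entries matching the given ranks and process groups.

    A filter left as None is not applied, which mirrors the default of
    printing everything.
    """
    return [
        entry
        for entry in entries
        if (ranks is None or entry["rank"] in ranks)
        and (pgs is None or entry["pg"] in pgs)
    ]


# Hypothetical parsed entries; real dump contents will differ.
entries = [
    {"rank": 0, "pg": "tp", "op": "all_reduce"},
    {"rank": 1, "pg": "dp", "op": "all_gather"},
    {"rank": 2, "pg": "tp", "op": "all_reduce"},
]

selected = select_entries(entries, ranks={0, 2}, pgs={"tp"})
```

With no filters, `select_entries(entries)` returns every entry, matching the "all ranks and all PGs" default described in the quoted text.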


4 participants