Skip to content

Conversation

Copy link

pytorch-bot bot commented Jun 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128466

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 6edbbe2 with merge base 8b6391e (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link

linux-foundation-easycla bot commented Jun 11, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jun 11, 2024
jamesperng added a commit that referenced this pull request Jun 11, 2024
@jamesperng jamesperng closed this Jun 11, 2024
@jamesperng jamesperng reopened this Jun 11, 2024
@jamesperng jamesperng marked this pull request as draft June 11, 2024 23:43
@jamesperng
Copy link
Contributor Author

@jamesperng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


bool ProcessGroupNCCL::WorkNCCL::checkTimeout(
std::optional<std::chrono::milliseconds> timeout) {
SCOPED_WAIT_COUNTER_US("pytorch.logging.wait_counter.ProcessGroupNCCL::WorkNCCL::checkTimeout");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be convention that all counters have the same prefix "pytorch.logging.wait_counter"? If so maybe its worth prepending that inside the counter impl and then users can just write the suffix "ProcessGroup..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for these wait counters, we would have the same prefix pytorch.wait_counter (i took out the logging).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the downside of having the prefix put in implicitly, is that it might make the counters less searchable in the code?

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58441466

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58441466

@jamesperng jamesperng requested a review from c-p-i-o June 24, 2024 20:09
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 24, 2024
@jamesperng jamesperng marked this pull request as ready for review June 24, 2024 21:03
cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58441466

@jamesperng
Copy link
Contributor Author

/easycla

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58441466

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58441466

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58441466

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Jun 26, 2024
…#128605)

* created fb internal implementation in `caffe2/torch/csrc/monitor/fb/instrumentation.cpp`
    * uses `facebook::data_preproc::WaitCounterUs` under the hood by having `WaitCounterImpl` trivially subclass it.
    * this makes `WaitCounterHandle` a glorified pointer to `facebook::data_preproc::WaitCounterUs` which is statically defined in the `STATIC_WAIT_COUNTER` macro making these pointers Meyer's singletons.
        * `facebook::data_preproc::WaitCounterUs` uses 3 singletons:
             1. `std::unique_ptr<DynamicCounter::State>` map — leaky singleton
             2. `std::weak_ptr<WaitCounterUs::State>` map — leaky singleton
             3. publisherSingleton — normal singleton since it manages resources (threads)
        * `facebook::data_preproc::WaitCounterUs` actually owns shared pointers to the state and its destructor will remove it from the `std::weak_ptr<WaitCounterUs::State>` map when the reference count for the state hits 0.
* linked `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` and added `//data_preproc/common:counters` (dpp dependency) to `caffe2/fb/fbcode/target_definitions.bzl`
* wrapped OSS null implementation in `#ifndef FBCODE_CAFFE2` so that internally we use the fb internal implementation.

as a follow-up I might move the counter implementation out of the data_preproc/counters library to a more common ai infra library?

Differential Revision: [D58458751](https://our.internmc.facebook.com/intern/diff/D58458751/)
Pull Request resolved: #128605
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #128466
@github-actions github-actions bot deleted the gh/jamesperng/5/head branch July 27, 2024 01:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants