[c10d] Add a logger for all nccl collectives with its time duration when completed #156008

fduwjj · 2025-06-15T00:33:03Z

Summary: We want to build a logging table that tracks the GPU time spent in collectives for all internal workloads. Because we already call cudaEventQuery on both the start and end events of a collective (ECudaEventStart (enableTiming) is already fully rolled out), we plan to add this logging inside the ProcessGroupNCCL watchdog in PyTorch so that we can record the duration of each collective once it completes.
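
For illustration only (this is not code from the PR): a minimal, CPU-only sketch of the pattern described above, where a watchdog polls the start and end events of each pending collective and appends a duration row once both have fired. `FakeEvent`, `Work`, and `watchdog_poll` are hypothetical names invented for this sketch; the actual change lives in the C++ ProcessGroupNCCL watchdog and relies on CUDA events created with timing enabled.

```python
from dataclasses import dataclass


@dataclass
class FakeEvent:
    """Hypothetical stand-in for a timing-enabled CUDA event (no GPU needed)."""
    fires_at_ms: float  # simulated timestamp at which the event completes

    def query(self, now_ms: float) -> bool:
        # Mirrors the idea of a non-blocking event query: True once completed.
        return now_ms >= self.fires_at_ms


@dataclass
class Work:
    """Hypothetical record of one enqueued collective."""
    op: str
    seq: int
    start: FakeEvent
    end: FakeEvent


def watchdog_poll(pending, now_ms, table):
    """One watchdog pass: log finished collectives, return the unfinished ones."""
    still_pending = []
    for work in pending:
        if work.start.query(now_ms) and work.end.query(now_ms):
            # Both events have fired, so the elapsed time is well defined.
            table.append({
                "seq": work.seq,
                "op": work.op,
                "duration_ms": work.end.fires_at_ms - work.start.fires_at_ms,
            })
        else:
            still_pending.append(work)
    return still_pending


# Usage: two simulated collectives, polled at two points in time.
log_table = []
queue = [
    Work("all_reduce", 1, FakeEvent(1.0), FakeEvent(3.5)),
    Work("all_gather", 2, FakeEvent(4.0), FakeEvent(9.0)),
]
queue = watchdog_poll(queue, now_ms=5.0, table=log_table)   # only seq 1 is done
queue = watchdog_poll(queue, now_ms=10.0, table=log_table)  # seq 2 is done too
for row in log_table:
    print(row)
```

The duration is only computed after both events report completion, which is why the logging sits on the watchdog's existing polling path rather than on the thread that launches the collective.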

Test Plan:
CI + dry run.

Rollback Plan:

Differential Revision: D76552340

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k

pytorch-bot · 2025-06-15T00:33:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156008

❌ 2 Cancelled Jobs

As of commit 163c971 with merge base 9ed0060:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

pull / linux-jammy-py3.9-gcc11-pch / build (gh)
pull / linux-jammy-xpu-2025.1-py3.9 / build (gh)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-06-15T00:33:34Z

This pull request was exported from Phabricator. Differential Revision: D76552340

fegin

PR is already accepted internally.

…hen completed (#156008)

Summary: Pull Request resolved: #156008

We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.

Test Plan: CI + dry run.

Rollback Plan:

Reviewed By: uthakore

Differential Revision: D76552340

…hen completed (pytorch#156008)

Summary: Pull Request resolved: pytorch#156008

We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.

Test Plan: CI + dry run.

Rollback Plan:

Reviewed By: fegin, H-Huang, uthakore

Differential Revision: D76552340

facebook-github-bot · 2025-06-18T07:47:49Z

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

pytorchmergebot · 2025-06-18T07:49:30Z

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3.9-gcc11-pch / build, pull / linux-jammy-xpu-2025.1-py3.9 / build

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Jun 15, 2025

facebook-github-bot added the fb-exported label Jun 15, 2025

fduwjj requested review from d4l3k, eqy and kwen2501 June 16, 2025 17:12

fduwjj force-pushed the export-D76552340 branch from 34f6965 to b2a8181 June 16, 2025 17:18

fduwjj force-pushed the export-D76552340 branch from b2a8181 to b5d2284 June 16, 2025 17:26

fduwjj force-pushed the export-D76552340 branch from b5d2284 to 7aacc3e June 16, 2025 19:04

fduwjj force-pushed the export-D76552340 branch from 7aacc3e to 5e9ec24 June 16, 2025 19:15

fegin approved these changes Jun 16, 2025

pytorch-bot bot added the ciflow/trunk label Jun 16, 2025

eqy approved these changes Jun 16, 2025

fduwjj force-pushed the export-D76552340 branch from 5e9ec24 to 78e221e June 17, 2025 17:30

fduwjj force-pushed the export-D76552340 branch from 78e221e to a3b0b65 June 17, 2025 17:48

fduwjj force-pushed the export-D76552340 branch from a3b0b65 to d8ba58a June 17, 2025 23:26

fduwjj force-pushed the export-D76552340 branch from d8ba58a to 351293f June 17, 2025 23:35

fduwjj force-pushed the export-D76552340 branch from 351293f to 80d3136 June 18, 2025 02:31

fduwjj force-pushed the export-D76552340 branch from 80d3136 to 60eaea4 June 18, 2025 02:39

fduwjj force-pushed the export-D76552340 branch from 60eaea4 to 6cadb38 June 18, 2025 04:00

fduwjj force-pushed the export-D76552340 branch from 6cadb38 to 163c971 June 18, 2025 04:06

pytorchmergebot added the merging label Jun 18, 2025

pytorchmergebot closed this in 577baa4 Jun 18, 2025

pytorchmergebot added Merged and removed merging labels Jun 18, 2025
