@fduwjj fduwjj commented Jun 15, 2025

Summary: We want to build a logging table that tracks the GPU time spent in collectives across all internal workloads. We already issue a cudaEventQuery for both the start and the end of each collective (ECudaEventStart, which enables timing, is fully rolled out), so we plan to populate this table from the PyTorch ProcessGroupNCCL watchdog, which lets us record the duration of each collective.

Test Plan:
CI + dry run.

Rollback Plan:

Differential Revision: D76552340

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k

pytorch-bot bot commented Jun 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156008

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs

As of commit 163c971 with merge base 9ed0060:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Jun 15, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D76552340

@fduwjj fduwjj force-pushed the export-D76552340 branch from 34f6965 to b2a8181 Compare June 16, 2025 17:18
@fduwjj fduwjj force-pushed the export-D76552340 branch from b2a8181 to b5d2284 Compare June 16, 2025 17:26
@fduwjj fduwjj force-pushed the export-D76552340 branch from b5d2284 to 7aacc3e Compare June 16, 2025 19:04
@fduwjj fduwjj force-pushed the export-D76552340 branch from 7aacc3e to 5e9ec24 Compare June 16, 2025 19:15

@fegin fegin left a comment

PR is already accepted internally.

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jun 16, 2025
pytorch-bot bot pushed a commit that referenced this pull request Jun 16, 2025
…hen completed (#156008)

Summary:
Pull Request resolved: #156008

We want to build a logging table that tracks the GPU time spent in collectives across all internal workloads. We already issue a cudaEventQuery for both the start and the end of each collective (ECudaEventStart, which enables timing, is fully rolled out), so we plan to populate this table from the PyTorch ProcessGroupNCCL watchdog, which lets us record the duration of each collective.

Test Plan:
CI + dry run.

Rollback Plan:

Reviewed By: uthakore

Differential Revision: D76552340
@fduwjj fduwjj force-pushed the export-D76552340 branch from 5e9ec24 to 78e221e Compare June 17, 2025 17:30
fduwjj added a commit to fduwjj/pytorch that referenced this pull request Jun 17, 2025
…hen completed (pytorch#156008)

Summary:
Pull Request resolved: pytorch#156008

We want to build a logging table that tracks the GPU time spent in collectives across all internal workloads. We already issue a cudaEventQuery for both the start and the end of each collective (ECudaEventStart, which enables timing, is fully rolled out), so we plan to populate this table from the PyTorch ProcessGroupNCCL watchdog, which lets us record the duration of each collective.

Test Plan:
CI + dry run.

Rollback Plan:

Reviewed By: fegin, H-Huang, uthakore

Differential Revision: D76552340
@fduwjj fduwjj force-pushed the export-D76552340 branch from 78e221e to a3b0b65 Compare June 17, 2025 17:48
@fduwjj fduwjj force-pushed the export-D76552340 branch from a3b0b65 to d8ba58a Compare June 17, 2025 23:26
@fduwjj fduwjj force-pushed the export-D76552340 branch from d8ba58a to 351293f Compare June 17, 2025 23:35
@fduwjj fduwjj force-pushed the export-D76552340 branch from 351293f to 80d3136 Compare June 18, 2025 02:31
@fduwjj fduwjj force-pushed the export-D76552340 branch from 80d3136 to 60eaea4 Compare June 18, 2025 02:39
@fduwjj fduwjj force-pushed the export-D76552340 branch from 60eaea4 to 6cadb38 Compare June 18, 2025 04:00
@fduwjj fduwjj force-pushed the export-D76552340 branch from 6cadb38 to 163c971 Compare June 18, 2025 04:06
@facebook-github-bot

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@pytorchmergebot

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3.9-gcc11-pch / build, pull / linux-jammy-xpu-2025.1-py3.9 / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Labels

ciflow/trunk, fb-exported, Merged, oncall: distributed, release notes: distributed (c10d)
