
Add size info to collective logs #100413

Closed · wants to merge 1 commit into from
Conversation

@kwen2501 (Contributor) commented May 1, 2023

The previous timeout log did not print size info, making it hard to debug hangs caused by message-size mismatches.

(The reason is that when copying the `WorkNCCL` object during work enqueue, we do not copy `outputs_` due to reference-lifetime concerns, so `output.size()` is never triggered.)

This PR logs sizes in separate fields, so it does not rely on `outputs_`.

New timeout log:

```
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=209715200, NumelOut=1677721600, Timeout(ms)=10000) ran for 10957 milliseconds before timing out.
```
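The approach can be sketched as follows: capture the element counts as plain integer fields at enqueue time, so the watchdog can format the log without touching `outputs_`. This is a minimal, hypothetical illustration; the `WorkInfo` struct and `timeoutLog` function are invented names for this sketch, not the actual PyTorch implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical sketch: sizes are stored as plain integers when the work is
// enqueued, so logging never needs to dereference the output tensors.
struct WorkInfo {
  uint64_t seq;         // collective sequence number
  std::string opType;   // e.g. "_ALLGATHER_BASE"
  int64_t numelIn;      // total input element count, captured at enqueue
  int64_t numelOut;     // total output element count, captured at enqueue
  int64_t timeoutMs;    // configured timeout
};

// Format a timeout message in the same shape as the new log line above.
std::string timeoutLog(int rank, const WorkInfo& w, int64_t ranMs) {
  std::ostringstream os;
  os << "[Rank " << rank << "] Watchdog caught collective operation timeout: "
     << "WorkNCCL(SeqNum=" << w.seq << ", OpType=" << w.opType
     << ", NumelIn=" << w.numelIn << ", NumelOut=" << w.numelOut
     << ", Timeout(ms)=" << w.timeoutMs << ") ran for " << ranMs
     << " milliseconds before timing out.";
  return os.str();
}
```

Because the integers are copied by value into the work object, the log line stays available even after the output tensors have been released.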

@pytorch-bot bot commented May 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100413

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 59a0b52:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kwen2501 (Contributor Author) commented May 1, 2023

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on May 1, 2023
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@kwen2501 (Contributor Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR is too stale; the last push was more than 3 days ago. Please rebase and try again. You can rebase and merge by leaving the following comment on this PR:
@pytorchbot merge -r
Or just rebase by leaving a @pytorchbot rebase comment.

Details for Dev Infra team: raised by workflow job.

@kwen2501 (Contributor Author) commented Jun 1, 2023

@pytorchbot merge -r

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased pg_nccl_log_size onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout pg_nccl_log_size && git pull --rebase)

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

shaoyf42 pushed a commit to shaoyf42/pytorch that referenced this pull request Jun 1, 2023
Pull Request resolved: pytorch#100413
Approved by: https://github.com/kumpera
alimoezzi pushed a commit to alimoezzi/pytorch that referenced this pull request Jun 3, 2023
Pull Request resolved: pytorch#100413
Approved by: https://github.com/kumpera
Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (c10d)

Projects: None yet

Development: successfully merging this pull request may close these issues: None yet

3 participants