
[Kineto][NCCL][5/n] Populate in/out split size info for all_to_all from CPU to CUDA kernel #112308

Closed
wants to merge 1 commit

Conversation

yoyoyocmu (Contributor)

Summary: This diff populates the all_to_all input and output split sizes from the CPU op to the GPU kernel when valid.

Test Plan:
Trace example:

https://pxl.cl/3H2nd

Differential Revision: D50762093
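
For orientation, here is a minimal sketch (not the PR's actual code) of how per-collective split sizes could be serialized into the profiler's string-keyed metadata map, in the style of the saveNcclMeta excerpt reviewed below; the helper names and key strings are hypothetical.

```cpp
// Hypothetical sketch: serialize per-rank split sizes into the profiler's
// string-keyed metadata map. Names (recordSplitSizes, formatSplitSizes,
// "In split size", "Out split size") are illustrative, not the PR's.
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Render a list of split sizes as a JSON-style array string, e.g. "[1, 2, 3]".
static std::string formatSplitSizes(const std::vector<int64_t>& sizes) {
  std::ostringstream oss;
  oss << '[';
  for (size_t i = 0; i < sizes.size(); ++i) {
    if (i != 0) {
      oss << ", ";
    }
    oss << sizes[i];
  }
  oss << ']';
  return oss.str();
}

// Record in/out split sizes only when the collective actually provides them
// (e.g. all_to_all); other collectives leave these fields absent.
static void recordSplitSizes(
    std::unordered_map<std::string, std::string>& map,
    const std::vector<int64_t>& inSplitSizes,
    const std::vector<int64_t>& outSplitSizes) {
  if (!inSplitSizes.empty()) {
    map.emplace("In split size", formatSplitSizes(inSplitSizes));
  }
  if (!outSplitSizes.empty()) {
    map.emplace("Out split size", formatSplitSizes(outSplitSizes));
  }
}
```

In the PR itself this logic sits inside saveNcclMeta, alongside the kInMsgSize/kOutMsgSize fields shown in the excerpt below.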

pytorch-bot bot commented Oct 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112308

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b53fc36 with merge base a50f6d3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D50762093

@@ -365,12 +366,18 @@ std::unordered_map<std::string, std::string> saveNcclMeta(
      kDtype, fmt::format("\"{}\"", c10::toString(debugInfo->getDType())));
  map.emplace(kInMsgSize, std::to_string(debugInfo->getInMessageSize()));
  map.emplace(kOutMsgSize, std::to_string(debugInfo->getOutMessageSize()));
  map.emplace(
  auto& inSplitSizes = debugInfo->getInputSplitSizes();
  if (!inSplitSizes.empty() && inSplitSizes.size() <= kTruncatLength) {
Member

Should we still record the first kTruncatLength in the list?

yoyoyocmu (Contributor, Author) commented Oct 30, 2023

Good point. I was debating whether to keep the first 1 or 30 elements in the list, or to skip recording entirely when the length is too long. Any recommended rule to follow?

I updated the code to record the first element when the total length is > 30.

Member

It would make sense to record the first 30 elements and then append a "..." when the length is greater than 30. Then we can claim to show all elements up to the first 30.

Member

We can address this in the future, if someone is interested.
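
To make the suggestion above concrete, here is a minimal sketch of the proposed truncation, assuming kTruncatLength is 30 as discussed in this thread; it is illustrative only, not the code that landed in the PR.

```cpp
// Hypothetical sketch of the truncation suggested above: show at most the
// first kTruncatLength entries and append "..." when the list is longer.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Assumed value, per the discussion in this thread.
constexpr size_t kTruncatLength = 30;

static std::string formatTruncatedSplitSizes(const std::vector<int64_t>& sizes) {
  std::ostringstream oss;
  oss << '[';
  const size_t shown = std::min(sizes.size(), kTruncatLength);
  for (size_t i = 0; i < shown; ++i) {
    if (i != 0) {
      oss << ", ";
    }
    oss << sizes[i];
  }
  if (sizes.size() > kTruncatLength) {
    oss << ", ..."; // marker: the recorded list was truncated
  }
  oss << ']';
  return oss.str();
}
```

Recording a bounded prefix keeps the trace metadata small while still letting a reader distinguish balanced from skewed all_to_all splits.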

yoyoyocmu added a commit to yoyoyocmu/kineto that referenced this pull request on Oct 30, 2023 (pytorch#822)

Summary:
X-link: pytorch/pytorch#112308

This diff populates the all_to_all input and output split sizes from the CPU op to the GPU kernel when valid.

Reviewed By: idning

Differential Revision: D50762093
yoyoyocmu added a commit to yoyoyocmu/pytorch that referenced this pull request on Oct 30, 2023: [Kineto][NCCL][5/n] Populate in/out split size info for all_to_all from CPU to CUDA kernel (pytorch#112308)

Summary:
X-link: pytorch/kineto#822

This diff populates the all_to_all input and output split sizes from the CPU op to the GPU kernel when valid.

Test Plan:
**Trace example**:
- For non-all_to_all collective functions: https://fburl.com/perfdoctor/4nobsu15 and https://pxl.cl/3GNVb
- For all_to_all: https://fburl.com/perfdoctor/f418goys and https://pxl.cl/3H2nd

Reviewed By: idning

Differential Revision: D50762093
facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D50762093

aaronenyeshi (Member) left a comment

LGTM!

yoyoyocmu added a commit to yoyoyocmu/kineto that referenced this pull request on Oct 30, 2023 (pytorch#822)

Summary:
X-link: pytorch/pytorch#112308

This diff populates the all_to_all input and output split sizes from the CPU op to the GPU kernel when valid.

Reviewed By: aaronenyeshi, idning

Differential Revision: D50762093
yoyoyocmu added a commit to yoyoyocmu/pytorch that referenced this pull request on Oct 30, 2023: [Kineto][NCCL][5/n] Populate in/out split size info for all_to_all from CPU to CUDA kernel (pytorch#112308)

Summary:
X-link: pytorch/kineto#822

This diff populates the all_to_all input and output split sizes from the CPU op to the GPU kernel when valid.

Test Plan:
**Trace example**:
- For non-all_to_all collective functions: https://fburl.com/perfdoctor/4nobsu15 and https://pxl.cl/3GNVb
- For all_to_all: https://fburl.com/perfdoctor/f418goys and https://pxl.cl/3H2nd

Reviewed By: aaronenyeshi, idning

Differential Revision: D50762093
facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D50762093

yoyoyocmu added a commit to yoyoyocmu/kineto that referenced this pull request on Nov 6, 2023 (pytorch#822)

Summary:
X-link: pytorch/pytorch#112308

This diff populates the all_to_all input and output split sizes from the CPU op to the GPU kernel when valid.

Reviewed By: aaronenyeshi, idning

Differential Revision: D50762093
facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D50762093

facebook-github-bot pushed a commit to pytorch/kineto that referenced this pull request on Nov 7, 2023 (#822)

Summary:
X-link: pytorch/pytorch#112308

Pull Request resolved: #822

This diff populates the all_to_all input and output split sizes from the CPU op to the GPU kernel when valid.

bypass-github-pytorch-ci-checks

Reviewed By: aaronenyeshi, idning

Differential Revision: D50762093

fbshipit-source-id: a118b9e2623ca0ac6b5f9e30cd554666a4c01a12
facebook-github-bot (Contributor)

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Nov 7, 2023
pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request on Nov 14, 2023: [Kineto][NCCL][5/n] Populate in/out split size info for all_to_all from CPU to CUDA kernel (pytorch#112308)

Summary: This diff populates the all_to_all input and output split sizes from the CPU op to the GPU kernel when valid.

Test Plan:
**Trace example**:
- For non-all_to_all collective functions: https://fburl.com/perfdoctor/4nobsu15 and https://pxl.cl/3GNVb
- For all_to_all: https://fburl.com/perfdoctor/f418goys and https://pxl.cl/3H2nd

Differential Revision: D50762093

Pull Request resolved: pytorch#112308
Approved by: https://github.com/aaronenyeshi
Labels: ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged, release notes: profiler