Skip to content

Conversation

d4l3k
Copy link
Member

@d4l3k d4l3k commented May 7, 2025

Summary:
This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as reduce_test and send_recv_test but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

  • support recv from any is not supported
  • AllreduceBcubeBase2 is failing

Differential Revision: D73291471

@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/gloo#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Differential Revision: D73291471
"Destructing buffer with pending sends, sendPending_=", sendPending_);
try {
for (int i = 0; i < sendPending; i++) {
waitSend();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this blocking so the destructor can hang?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually isn't needed anymore -- there was an issue that's since been fixed. Changed this to just be a warning now

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Aside from my n00b questions, it looks good to me overall.

attr.name = names[0];
}

attr.index = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N00b question, why this magic number? Is it possible that we can add a comment for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed -- this is infra specific, should be passing it in via the attr and not hardcoding

void UnboundBuffer::abortWaitRecv() {
std::lock_guard<std::mutex> guard(m_);
abortWaitRecv_ = true;
recvCv_.notify_one();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this does guarantee which thread to wakes up. Would this be a problem? Oh maybe it is per pair, so it won't be a problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abortWaitRecv only aborts a single wait request so we only want to wake one thread. I don't think we ever call multiple waitRecv on the same buffer. Usually each algorithm is only waiting for one thing at a time since it's a blocking call

namespace ibverbs {

Pair::Pair(
int rank,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure my understand correctly, this is always global rank right? For example, for Sub-PG scenario, we sometimes will have local ranks as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the rank of the Gloo process group/context. On the gloo side there's only a single rank concept per context, no understanding of sub PG ranks etc.


const struct ibv_mr* Pair::getMemoryRegion(int slot) {
std::unique_lock<std::mutex> lock(m_);
void Pair::getMemoryRegion(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, If we now return nothing, do you think we still name it getMemoryRegion. I mean when we call getSomething, we expect something will be returned?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading L360, maybe we can call it recv or fetchMemoryRegion? No strong opinion..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still gets the memory region -- it just returns it via the callback rather than immediately returning it. Renamed recv to match sendMemoryRegion

}
}

void Pair::send(Buffer* buffer, size_t offset, size_t length, size_t roffset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another n00b question, why there are two send? Is it possible that we can have comments to explain under what condition this send will be used? For example the comment of the first one says "// Send from the specified buffer to remote side of pair." I guess this one is similar. Even the code looks similar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for the two different types of buffers. Buffer is bound to a specific pair and reuses the memory region where as UnboundBuffer can be used across multiple pairs and will bind the memory region just for that transfer.

I tried to generalize this logic but had to split it given the differences in apis between the two

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for your explanation.

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Aside from my n00b questions, it looks good to me overall.

d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015


This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from 38a8de2 to a0f435b Compare May 7, 2025 21:17
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015


This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from a0f435b to d42d5ff Compare May 7, 2025 21:19
d4l3k added a commit to d4l3k/pytorch that referenced this pull request May 7, 2025
Summary:

X-link: pytorch/gloo#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Reviewed By: fduwjj

Differential Revision: D73291471
d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015


This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from d42d5ff to 685dbed Compare May 7, 2025 21:19
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015


This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from 685dbed to 91d2dcc Compare May 7, 2025 21:20
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015

Pull Request resolved: pytorch#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from 91d2dcc to 4644c93 Compare May 7, 2025 21:26
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015

Pull Request resolved: pytorch#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from 4644c93 to 0e3b319 Compare May 7, 2025 21:35
d4l3k added a commit to d4l3k/pytorch that referenced this pull request May 7, 2025
Summary:

X-link: pytorch/gloo#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Reviewed By: fduwjj

Differential Revision: D73291471
d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015


This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from 0e3b319 to a1e57a8 Compare May 7, 2025 22:11
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015

Pull Request resolved: pytorch#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from a1e57a8 to 6c31249 Compare May 7, 2025 22:14
d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015


This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from 6c31249 to addafab Compare May 7, 2025 22:18
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

d4l3k added a commit to d4l3k/pytorch that referenced this pull request May 7, 2025
Summary:

X-link: pytorch/gloo#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Reviewed By: fduwjj

Differential Revision: D73291471
d4l3k added a commit to d4l3k/pytorch that referenced this pull request May 7, 2025
Summary:
Pull Request resolved: pytorch#153015

X-link: pytorch/gloo#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Reviewed By: fduwjj

Differential Revision: D73291471
d4l3k added a commit to d4l3k/gloo that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#153015


This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from addafab to d9fcaf5 Compare May 7, 2025 22:32
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

Summary:
X-link: pytorch/pytorch#153015


This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Reviewed By: fduwjj

Differential Revision: D73291471
@d4l3k d4l3k force-pushed the export-D73291471 branch from d9fcaf5 to 0573f54 Compare May 7, 2025 22:39
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D73291471

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 8, 2025
Summary:
X-link: pytorch/gloo#437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Differential Revision: D73291471

Pull Request resolved: #153015
Approved by: https://github.com/fduwjj
@d4l3k d4l3k merged commit 7a4c857 into pytorch:main May 8, 2025
6 of 9 checks passed
@d4l3k d4l3k deleted the export-D73291471 branch May 8, 2025 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants