-
Notifications
You must be signed in to change notification settings - Fork 337
c10d/gloo: add ibverbs backend #437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/gloo#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Test Plan: ``` buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs buck2 test //gloo/test: GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f ``` We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run. Differential Revision: D73291471
gloo/transport/ibverbs/buffer.cc
Outdated
"Destructing buffer with pending sends, sendPending_=", sendPending_); | ||
try { | ||
for (int i = 0; i < sendPending; i++) { | ||
waitSend(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this blocking so the destructor can hang?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually isn't needed anymore -- there was an issue that's since been fixed. Changed this to just be a warning now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. Aside from my n00b questions, it looks good to me overall.
gloo/transport/ibverbs/device.cc
Outdated
attr.name = names[0]; | ||
} | ||
|
||
attr.index = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N00b question, why this magic number? Is it possible that we can add a comment for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed -- this is infra specific, should be passing it in via the attr and not hardcoding
void UnboundBuffer::abortWaitRecv() { | ||
std::lock_guard<std::mutex> guard(m_); | ||
abortWaitRecv_ = true; | ||
recvCv_.notify_one(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, this does guarantee which thread to wakes up. Would this be a problem? Oh maybe it is per pair, so it won't be a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abortWaitRecv only aborts a single wait request so we only want to wake one thread. I don't think we ever call multiple waitRecv on the same buffer. Usually each algorithm is only waiting for one thing at a time since it's a blocking call
namespace ibverbs { | ||
|
||
Pair::Pair( | ||
int rank, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make sure my understand correctly, this is always global rank right? For example, for Sub-PG scenario, we sometimes will have local ranks as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the rank of the Gloo process group/context. On the gloo side there's only a single rank concept per context, no understanding of sub PG ranks etc.
gloo/transport/ibverbs/pair.cc
Outdated
|
||
const struct ibv_mr* Pair::getMemoryRegion(int slot) { | ||
std::unique_lock<std::mutex> lock(m_); | ||
void Pair::getMemoryRegion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, If we now return nothing, do you think we still name it getMemoryRegion
. I mean when we call getSomething, we expect something will be returned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading L360, maybe we can call it recv or fetchMemoryRegion? No strong opinion..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still gets the memory region -- it just returns it via the callback rather than immediately returning it. Renamed recv to match sendMemoryRegion
} | ||
} | ||
|
||
void Pair::send(Buffer* buffer, size_t offset, size_t length, size_t roffset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another n00b question, why there are two send? Is it possible that we can have comments to explain under what condition this send will be used? For example the comment of the first one says "// Send from the specified buffer to remote side of pair." I guess this one is similar. Even the code looks similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for the two different types of buffers. Buffer is bound to a specific pair and reuses the memory region where as UnboundBuffer can be used across multiple pairs and will bind the memory region just for that transfer.
I tried to generalize this logic but had to split it given the differences in apis between the two
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for your explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. Aside from my n00b questions, it looks good to me overall.
Summary: X-link: pytorch/pytorch#153015 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/pytorch#153015 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
Summary: X-link: pytorch/gloo#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Test Plan: ``` buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs buck2 test //gloo/test: GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f ``` We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run. Reviewed By: fduwjj Differential Revision: D73291471
Summary: X-link: pytorch/pytorch#153015 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/pytorch#153015 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/pytorch#153015 Pull Request resolved: pytorch#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/pytorch#153015 Pull Request resolved: pytorch#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
Summary: X-link: pytorch/gloo#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Test Plan: ``` buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs buck2 test //gloo/test: GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f ``` We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run. Reviewed By: fduwjj Differential Revision: D73291471
Summary: X-link: pytorch/pytorch#153015 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/pytorch#153015 Pull Request resolved: pytorch#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
Summary: X-link: pytorch/pytorch#153015 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/gloo#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Test Plan: ``` buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs buck2 test //gloo/test: GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f ``` We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run. Reviewed By: fduwjj Differential Revision: D73291471
Summary: Pull Request resolved: pytorch#153015 X-link: pytorch/gloo#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Test Plan: ``` buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs buck2 test //gloo/test: GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f ``` We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run. Reviewed By: fduwjj Differential Revision: D73291471
Summary: X-link: pytorch/pytorch#153015 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/pytorch#153015 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Reviewed By: fduwjj Differential Revision: D73291471
This pull request was exported from Phabricator. Differential Revision: D73291471 |
Summary: X-link: pytorch/gloo#437 This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch. This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch. Known issues: * support recv from any is not supported * AllreduceBcubeBase2 is failing Test Plan: ``` buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs buck2 test //gloo/test: GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f ``` We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run. Differential Revision: D73291471 Pull Request resolved: #153015 Approved by: https://github.com/fduwjj
Summary:
This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.
This currently is passing basic tests such as
reduce_test
andsend_recv_test
but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.Known issues:
Differential Revision: D73291471