-
Notifications
You must be signed in to change notification settings - Fork 74.9k
Ibverbs-based RDMA path #8943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ibverbs-based RDMA path #8943
Conversation
Can one of the admins verify this patch? |
Via another channel I am in communication wtih @junshi15 about this PR. It needs to be updated slightly to the most recent version of the main repository, then he will submit again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just starting to enter a few comments one by one, because this tool will lose all unsubmitted comments on various minor browser events.
// analyzing request | ||
// the channel setting part is redundant. | ||
string remote_host_name = request->host_name(); | ||
RdmaChannel* rc = rdma_mgr_->FindChannel(remote_host_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CHECK(rc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
int i = 0; | ||
int idx[] = {1, 0, 3, 2}; | ||
std::vector<RdmaBuffer*> mb(rc->message_buffers()); | ||
for (const auto& mr : request->mr()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CHECK whether request->mr() has more than size 4?
} | ||
|
||
// Initiate recv | ||
for (int i = 0; i < 100; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose eventually this should be some kind of config option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, in the future revision.
rb = new RdmaMessageBuffer(this, name); | ||
} else if (buffer_type == ACK) { | ||
rb = new RdmaAckBuffer(this, name); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CHECK(!rb) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added CHECK(rb)
tensorflow/contrib/verbs/rdma.cc
Outdated
// None | ||
void RdmaChannel::InsertRecvCallback(string& key, | ||
std::function<void()> recv_done) { | ||
ct_mu_.lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you include tensorflow/core/platform/mutex.h you can use the scoped mutex_lock instead of explicit unlocks like here. It's a little less error prone in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed most places to scoped mutex_lock.
<< "send dev name: " << src_dev->name() | ||
<< " gpu_info: " << src_dev->tensorflow_gpu_device_info(); | ||
// "val" is on a GPU. Uses GPUUtil to fill the proto. | ||
s = VerbsUtil::SetProtoFromGPUSync(in, src_dev, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this stall the CQ processing thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, it can. will optimize in a future revision.
return; | ||
} | ||
AllocatorAttributes src_alloc_attr; | ||
src_alloc_attr.set_on_host(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where this is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
class VerbsServerFactory : public ServerFactory { | ||
public: | ||
bool AcceptsOptions(const ServerDef& server_def) override { | ||
return server_def.protocol() == "verbs"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use "grpc+verbs" as the protocol, since this is a hybrid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to "grpc+verbs"
GetRemoteAddressResponse* response) { | ||
// analyzing request | ||
// the channel setting part is redundant. | ||
string remote_host_name = request->host_name(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
tensorflow/contrib/verbs/BUILD
Outdated
srcs = ["verbs_service.proto"], | ||
cc_api_version = 2, | ||
visibility = [ | ||
"//tensorflow:internal", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it works, prefer tensorflow:subpackages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to tensorflow:subpackages
af23ab6
to
0a6c80b
Compare
rebased to master branch, conflicts resolved. |
Jenkins, test this please. |
Jenkins, test this please |
@junshi15 : the buildifier sanity check is failing, and that appears to be blocking all further testing. It looks like you can run |
TF_CHECK_OK(Stop()); | ||
TF_CHECK_OK(Join()); | ||
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete empty line
worker_service_ = | ||
NewGrpcWorkerService(worker_impl_.get(), &builder).release(); | ||
// extra service: | ||
if (service_func != NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use { } whenever 'if' construct spans more than one line.
std::unique_ptr<GrpcServer> ret( | ||
new GrpcServer(server_def, env == nullptr ? Env::Default() : env)); | ||
TF_RETURN_IF_ERROR(ret->Init()); | ||
ServiceCreationFunction service_func = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/NULL/nullptr
working on tensorflow/contrib/verbs/BUILD, buildifier (https://github.com/bazelbuild/buildtools) does not build for me, so I will edit it manually. |
const std::string& worker_name, WorkerCacheInterface* worker_cache)> | ||
RendezvousMgrCreationFunction; | ||
|
||
typedef std::function<void(const WorkerEnv*, ::grpc::ServerBuilder*)> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you document what this is supposed to do? I think it's an initialization for a derived class. So maybe call it ServiceInitFunction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ServiceCreationFunction" registers a service on the server. This is one way to register a service by the derived class. VerbsServer::Init() calls GrpcServer::Init(). But the server is already built and run after GrpcServer::Init(), so I have to pass a function to GrpcServer::Init() to register an extra service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with your naming.
Jenkins, test this please. |
1 similar comment
Jenkins, test this please. |
Add missing comma in verbs/BUILD file
Jenkins, test this please. |
What's the problem with two BUILD files that failed sanity checks? indentation? |
Not sure, trying to figure that out.
…On Fri, Apr 7, 2017 at 1:36 PM, Jun Shi ***@***.***> wrote:
What's the problem with two BUILD files that failed sanity checks?
indentation?
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#8943 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AO818e4BFtUkrUAwSSpqvX03yORDYEPLks5rtp47gaJpZM4MyR40>
.
|
Do we have an example where link option can be included/excluded with a flag? |
I found one here, will give it a try. |
Added a switch to turn on/off link option "-libverbs". Please initiate a build. Thank you. |
Jenkins test this please |
Can one of the admins verify this patch? |
Added "#ifdef TENSORFLOW_USE_VERBS" to two more files. Please initiate the build. Thanks. |
Jenkins, test this please |
@poxvoculi are we good with this then? |
Jenkins, test this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks
Typos in README.md: |
Right. @wydwww could you please send a follow-up PR? |
@wydwww Thanks for catching the typos. |
#9926 2017-05-18 09:52:27.325014: F tensorflow/contrib/verbs/rdma.cc:129] Check failed: wc_[i].status == IBV_WC_SUCCESS Failed status
|
verbs build failed on current master branch, last commit.
|
@junshi15 : we are not able to use NIC bonding feature with this PR. is this PR supports NIC bonding? If not is it possible to add NIC bonding feature with verbs. May I know how to handle in code? do you have any plans to add it in near future? |
@vdevaram I do not have experience with NIC bonding and do not have a plan to add this feature in the future. Your contributions are welcome. |
This patch introduces an ibverbs-based RDMA path between servers for tensor transfer (weights, gradients, etc). The existing gRPC path remains and is responsible for "administrative" tasks, such as setting up the Rdma path, exchanging computation graphs, etc. Design details can be found in the README file below:
https://github.com/yahoo/tensorflow/blob/verbs_rdma/tensorflow/contrib/verbs/README.md