junshi15
Contributor

@junshi15 junshi15 commented Apr 4, 2017

This patch introduces an ibverbs-based RDMA path between servers for tensor transfer (weights, gradients, etc.). The existing gRPC path remains and is responsible for "administrative" tasks, such as setting up the RDMA path, exchanging computation graphs, etc. Design details can be found in the README file below:
https://github.com/yahoo/tensorflow/blob/verbs_rdma/tensorflow/contrib/verbs/README.md

@tensorflow-jenkins
Collaborator

Can one of the admins verify this patch?

@rohan100jain rohan100jain requested a review from jhseu April 5, 2017 14:55
@jhseu jhseu requested review from poxvoculi and removed request for jhseu April 5, 2017 18:10
@poxvoculi
Contributor

Via another channel I am in communication with @junshi15 about this PR. It needs to be updated slightly to match the most recent version of the main repository, then he will submit again.

@poxvoculi poxvoculi self-assigned this Apr 5, 2017
Contributor

@poxvoculi poxvoculi left a comment

Just starting to enter a few comments one by one, because this tool will lose all unsubmitted comments on various minor browser events.

// analyzing request
// the channel setting part is redundant.
string remote_host_name = request->host_name();
RdmaChannel* rc = rdma_mgr_->FindChannel(remote_host_name);
Contributor

CHECK(rc)

Contributor Author

fixed

int i = 0;
int idx[] = {1, 0, 3, 2};
std::vector<RdmaBuffer*> mb(rc->message_buffers());
for (const auto& mr : request->mr()) {
Contributor

CHECK whether request->mr() has more than size 4?
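
A minimal sketch of the size check suggested here, under stated assumptions: `MemoryRegion` and `AssignRegions` are hypothetical stand-ins (not names from the PR), and an exception takes the place of TensorFlow's `CHECK` so the snippet compiles on its own. Only the permutation `{1, 0, 3, 2}` comes from the snippet under review.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for one memory-region entry carried by the request.
struct MemoryRegion {
  std::uint64_t remote_addr;
  std::uint32_t rkey;
};

// Copies the request's regions into four slots using the index permutation
// from the reviewed snippet. Rejects requests that carry more entries than
// the permutation table has, instead of indexing past its end.
std::array<MemoryRegion, 4> AssignRegions(
    const std::vector<MemoryRegion>& mrs) {
  static constexpr std::array<std::size_t, 4> kIdx = {{1, 0, 3, 2}};
  if (mrs.size() > kIdx.size()) {
    throw std::invalid_argument("request carries more than 4 memory regions");
  }
  std::array<MemoryRegion, 4> slots{};
  for (std::size_t i = 0; i < mrs.size(); ++i) {
    slots[kIdx[i]] = mrs[i];
  }
  return slots;
}
```

Without the check, a fifth entry would read `idx[4]`, which is out of bounds for the four-element table.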

}

// Initiate recv
for (int i = 0; i < 100; i++) {
Contributor

I suppose eventually this should be some kind of config option.

Contributor Author

Yes, in a future revision.
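
One way such a config option could look, as a sketch only: the environment variable name `RDMA_RECV_DEPTH_EXAMPLE`, the helper names, and the upper bound of 65536 are all hypothetical and not part of the PR. The default of 100 matches the literal in the reviewed loop.

```cpp
#include <cstdlib>

// Default matches the hard-coded 100 in the reviewed loop.
constexpr int kDefaultRecvDepth = 100;

// Parses a recv-depth override. Falls back to the default when the value is
// missing, not a plain decimal number, non-positive, or above an arbitrary
// illustrative cap.
int ParseRecvDepth(const char* raw, int fallback = kDefaultRecvDepth) {
  if (raw == nullptr || *raw == '\0') return fallback;
  char* end = nullptr;
  const long parsed = std::strtol(raw, &end, 10);
  if (*end != '\0' || parsed <= 0 || parsed > 65536) return fallback;
  return static_cast<int>(parsed);
}

// Reads the override from a hypothetical environment variable.
int RecvDepthFromEnv() {
  return ParseRecvDepth(std::getenv("RDMA_RECV_DEPTH_EXAMPLE"));
}
```

The loop bound would then become `RecvDepthFromEnv()` rather than the literal.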

rb = new RdmaMessageBuffer(this, name);
} else if (buffer_type == ACK) {
rb = new RdmaAckBuffer(this, name);
}
Contributor

CHECK(!rb) ?

Contributor Author

added CHECK(rb)

// None
void RdmaChannel::InsertRecvCallback(string& key,
std::function<void()> recv_done) {
ct_mu_.lock();
Contributor

If you include tensorflow/core/platform/mutex.h you can use the scoped mutex_lock instead of explicit unlocks like here. It's a little less error prone in practice.

Contributor Author

changed most places to scoped mutex_lock.
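
For readers unfamiliar with the pattern, here is a sketch of the scoped-lock style using only the standard library. It assumes `std::lock_guard` as an analogue of TensorFlow's `mutex_lock`; `CallbackTable` is a hypothetical, simplified class and not the PR's `RdmaChannel`.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified callback table guarded by a mutex.
class CallbackTable {
 public:
  // Stores a callback under `key`. The guard releases mu_ on every exit
  // path, so no explicit unlock() call is needed.
  void Insert(const std::string& key, std::function<void()> done) {
    std::lock_guard<std::mutex> lock(mu_);
    table_[key] = std::move(done);
  }

  // Removes the callback stored under `key` and runs it outside the lock.
  // Returns false if no callback was registered for `key`.
  bool RunAndRemove(const std::string& key) {
    std::function<void()> done;
    {
      std::lock_guard<std::mutex> lock(mu_);
      auto it = table_.find(key);
      if (it == table_.end()) return false;  // guard still unlocks here
      done = std::move(it->second);
      table_.erase(it);
    }
    done();
    return true;
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, std::function<void()>> table_;
};
```

The early `return false` is the kind of path where a hand-written `unlock()` is easy to forget; with a scoped guard the destructor handles it.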

<< "send dev name: " << src_dev->name()
<< " gpu_info: " << src_dev->tensorflow_gpu_device_info();
// "val" is on a GPU. Uses GPUUtil to fill the proto.
s = VerbsUtil::SetProtoFromGPUSync(in, src_dev,
Contributor

Can this stall the CQ processing thread?

Contributor Author

In theory, it can. Will optimize in a future revision.

return;
}
AllocatorAttributes src_alloc_attr;
src_alloc_attr.set_on_host(true);
Contributor

I don't see where this is used.

Contributor Author

removed

class VerbsServerFactory : public ServerFactory {
public:
bool AcceptsOptions(const ServerDef& server_def) override {
return server_def.protocol() == "verbs";
Contributor

Maybe use "grpc+verbs" as the protocol, since this is a hybrid.

Contributor Author

changed to "grpc+verbs"
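
To illustrate how a protocol string selects a server implementation, here is a self-contained sketch. The classes below are hypothetical miniatures, not TensorFlow's `ServerFactory` API; only the protocol strings `"grpc"` and `"grpc+verbs"` are taken from the discussion.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical miniature of a factory interface keyed on a protocol string.
struct ExampleFactory {
  virtual ~ExampleFactory() = default;
  virtual bool AcceptsProtocol(const std::string& protocol) const = 0;
  virtual std::string Name() const = 0;
};

struct GrpcOnlyFactory : ExampleFactory {
  bool AcceptsProtocol(const std::string& p) const override {
    return p == "grpc";
  }
  std::string Name() const override { return "GrpcServer"; }
};

// The hybrid name signals that gRPC is still used alongside verbs.
struct GrpcVerbsFactory : ExampleFactory {
  bool AcceptsProtocol(const std::string& p) const override {
    return p == "grpc+verbs";
  }
  std::string Name() const override { return "VerbsServer"; }
};

// Returns the name of the first factory that accepts `protocol`,
// or an empty string if none does.
std::string SelectServer(const std::string& protocol) {
  std::vector<std::unique_ptr<ExampleFactory>> factories;
  factories.emplace_back(new GrpcOnlyFactory());
  factories.emplace_back(new GrpcVerbsFactory());
  for (const auto& factory : factories) {
    if (factory->AcceptsProtocol(protocol)) return factory->Name();
  }
  return "";
}
```

With exact string matching as sketched, the earlier name `"verbs"` would no longer match any factory after the rename.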

GetRemoteAddressResponse* response) {
// analyzing request
// the channel setting part is redundant.
string remote_host_name = request->host_name();
Contributor

const string

Contributor Author

fixed

srcs = ["verbs_service.proto"],
cc_api_version = 2,
visibility = [
"//tensorflow:internal",
Contributor

If it works, prefer tensorflow:subpackages

Contributor Author

changed to tensorflow:subpackages

@junshi15 junshi15 force-pushed the verbs_rdma branch 2 times, most recently from af23ab6 to 0a6c80b Compare April 7, 2017 08:02
@junshi15
Contributor Author

junshi15 commented Apr 7, 2017

rebased to master branch, conflicts resolved.

@poxvoculi
Contributor

Jenkins, test this please.

@jhseu
Contributor

jhseu commented Apr 7, 2017

Jenkins, test this please

@poxvoculi
Contributor

poxvoculi commented Apr 7, 2017

@junshi15 : the buildifier sanity check is failing, and that appears to be blocking all further testing.

It looks like you can run
buildifier tensorflow/contrib/verbs/BUILD
or fix the errors by hand. It looks like some build rule deps are not ordered properly.

TF_CHECK_OK(Stop());
TF_CHECK_OK(Join());

Contributor

delete empty line

worker_service_ =
NewGrpcWorkerService(worker_impl_.get(), &builder).release();
// extra service:
if (service_func != NULL)
Contributor

use { } whenever 'if' construct spans more than one line.

std::unique_ptr<GrpcServer> ret(
new GrpcServer(server_def, env == nullptr ? Env::Default() : env));
TF_RETURN_IF_ERROR(ret->Init());
ServiceCreationFunction service_func = NULL;
Contributor

s/NULL/nullptr

@junshi15
Contributor Author

junshi15 commented Apr 7, 2017

working on tensorflow/contrib/verbs/BUILD, buildifier (https://github.com/bazelbuild/buildtools) does not build for me, so I will edit it manually.

const std::string& worker_name, WorkerCacheInterface* worker_cache)>
RendezvousMgrCreationFunction;

typedef std::function<void(const WorkerEnv*, ::grpc::ServerBuilder*)>
Contributor

Could you document what this is supposed to do? I think it's an initialization for a derived class. So maybe call it ServiceInitFunction?

Contributor Author

"ServiceCreationFunction" registers a service on the server. This is one way to register a service by the derived class. VerbsServer::Init() calls GrpcServer::Init(). But the server is already built and run after GrpcServer::Init(), so I have to pass a function to GrpcServer::Init() to register an extra service.

Contributor Author

I am ok with your naming.
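
The registration hook described in this thread can be sketched as follows. This is an illustration under assumptions, not the PR's code: `ExampleBuilder` stands in for `::grpc::ServerBuilder`, and `BaseServer` and `VerbsLikeServer` are hypothetical names. The point it shows is that the extra service must be registered inside the base `Init()`, before the server is built.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a server builder: records registered services
// and ignores registrations that arrive after the server has been built.
struct ExampleBuilder {
  std::vector<std::string> services;
  bool built = false;
  void RegisterService(const std::string& name) {
    if (!built) services.push_back(name);
  }
};

using ServiceInitFunction = std::function<void(ExampleBuilder*)>;

class BaseServer {
 public:
  // Registers the built-in services, then gives the caller one chance to
  // add its own before the server is built.
  void Init(const ServiceInitFunction& extra = nullptr) {
    builder_.RegisterService("master");
    builder_.RegisterService("worker");
    if (extra != nullptr) {
      extra(&builder_);
    }
    builder_.built = true;  // from here on, registration is too late
  }

  const std::vector<std::string>& services() const {
    return builder_.services;
  }

 protected:
  ExampleBuilder builder_;
};

class VerbsLikeServer : public BaseServer {
 public:
  // Passes a hook down so the extra service is registered before the build.
  void Init() {
    BaseServer::Init([](ExampleBuilder* b) { b->RegisterService("verbs"); });
  }
};
```

Calling `RegisterService` after `BaseServer::Init()` returns would have no effect in this sketch, which mirrors the constraint the author describes.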

@junshi15
Contributor Author

junshi15 commented Apr 7, 2017

Jenkins, test this please.

@poxvoculi
Contributor

Jenkins, test this please.

Add missing comma in verbs/BUILD file
@poxvoculi
Contributor

Jenkins, test this please.

@junshi15
Contributor Author

junshi15 commented Apr 7, 2017

What's the problem with two BUILD files that failed sanity checks? indentation?

@poxvoculi
Contributor

poxvoculi commented Apr 7, 2017 via email

@junshi15
Contributor Author

junshi15 commented Apr 9, 2017

Do we have an example where a link option can be included/excluded with a flag?
"#define" only takes care of header files. I need to disable a link option below:
https://github.com/yahoo/tensorflow/blob/5776b4ffc851cbcb742952c6721739f45f0dbf6f/tensorflow/contrib/verbs/BUILD#L138-L140

@junshi15
Contributor Author

junshi15 commented Apr 9, 2017

@junshi15
Contributor Author

junshi15 commented Apr 9, 2017

Added a switch to turn on/off link option "-libverbs". Please initiate a build. Thank you.

@poxvoculi
Contributor

Jenkins test this please

@poxvoculi poxvoculi closed this Apr 10, 2017
@poxvoculi poxvoculi reopened this Apr 10, 2017
@tensorflow-jenkins
Collaborator

Can one of the admins verify this patch?

@junshi15
Contributor Author

Added "#ifdef TENSORFLOW_USE_VERBS" to two more files. Please initiate the build. Thanks.
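
A sketch of how such a guard behaves, assuming only the macro name `TENSORFLOW_USE_VERBS` from the comment above; `TransportName` and `kVerbsEnabled` are hypothetical names for illustration. The file compiles either way, and the verbs-specific branch is only present when the macro is defined (for example via `-DTENSORFLOW_USE_VERBS`).

```cpp
#include <string>

// True only when the build defines TENSORFLOW_USE_VERBS.
constexpr bool kVerbsEnabled =
#ifdef TENSORFLOW_USE_VERBS
    true;
#else
    false;
#endif

// Returns the protocol this build would use. Without the macro, the
// verbs branch is compiled out entirely, so nothing in it needs to link
// against libibverbs.
std::string TransportName() {
#ifdef TENSORFLOW_USE_VERBS
  return "grpc+verbs";
#else
  return "grpc";
#endif
}
```

The preprocessor guard only removes source code; the linker flag still has to be switched separately in the build file, which is what the earlier commit in this thread addressed.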

@jhseu
Contributor

jhseu commented Apr 10, 2017

Jenkins, test this please

@drpngx
Contributor

drpngx commented Apr 15, 2017

@poxvoculi are we good with this then?

@poxvoculi
Contributor

Jenkins, test this please.

Contributor

@poxvoculi poxvoculi left a comment

ok, thanks

@drpngx drpngx merged commit dd40e98 into tensorflow:master Apr 19, 2017
@wydwww
Contributor

wydwww commented Apr 20, 2017

Typos in README.md:
fro in line 17
transimssion in line 39

@drpngx
Contributor

drpngx commented Apr 20, 2017

Right. @wydwww could you please send a follow-up PR?

@junshi15
Contributor Author

@wydwww Thanks for catching the typos.

@fanlu
Contributor

fanlu commented May 18, 2017

#9926
When training the Inception model, I got the following error:
2017-05-18 09:50:50.534662: I tensorflow/contrib/verbs/rdma_mgr.cc:56] connecting to remote node /job:worker/replica:0/task:2
2017-05-18 09:51:02.357013: I tensorflow/contrib/verbs/rdma.cc:518] channel already connected
2017-05-18 09:51:02.357102: I tensorflow/contrib/verbs/rdma_mgr.cc:56] connecting to remote node /job:worker/replica:0/task:0
2017-05-18 09:51:02.358775: I tensorflow/contrib/verbs/rdma_mgr.cc:56] connecting to remote node /job:ps/replica:0/task:0
2017-05-18 09:51:07.867035: I tensorflow/contrib/verbs/rdma.cc:518] channel already connected
2017-05-18 09:51:08.095058: I tensorflow/contrib/verbs/rdma.cc:518] channel already connected
INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=3; total_num_replicas=3
INFO:tensorflow:2017-05-18 09:52:16.455855 Supervisor
2017-05-18 09:52:19.322077: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session 126a016806cf6873 with config:
allow_soft_placement: true

2017-05-18 09:52:27.325014: F tensorflow/contrib/verbs/rdma.cc:129] Check failed: wc_[i].status == IBV_WC_SUCCESS Failed status
transport retry counter exceeded 12 40500432 129
Aborted (core dumped)
This is my command line:
CUDA_VISIBLE_DEVICES=1 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data0/imagenet-data \
--job_name='worker' \
--task_id=1 \
--ps_hosts='localhost:2222' \
--worker_hosts='localhost:2223,localhost:2224,localhost:2225' \
--protocol='grpc+verbs'

@junshi15 @llhe Can you help me to solve this problem?

@junshi15
Contributor Author

@fanlu Please use this link #9926

@suiyuan2009
Contributor

verbs build failed on current master branch, last commit.

ERROR: /home/dongziming/tensorflow/tensorflow/contrib/verbs/BUILD:158:1: C++ compilation of rule '//tensorflow/contrib/verbs:verbs_server_lib' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter ... (remaining 164 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
tensorflow/contrib/verbs/verbs_server_lib.cc: In constructor 'tensorflow::{anonymous}::VerbsServerRegistrar::VerbsServerRegistrar()':
tensorflow/contrib/verbs/verbs_server_lib.cc:153:5: error: 'gpr_allocation_functions' was not declared in this scope
     gpr_allocation_functions alloc_fns;
     ^
tensorflow/contrib/verbs/verbs_server_lib.cc:154:5: error: 'alloc_fns' was not declared in this scope
     alloc_fns.malloc_fn = port::Malloc;
     ^
tensorflow/contrib/verbs/verbs_server_lib.cc:157:43: error: 'gpr_set_allocation_functions' was not declared in this scope
     gpr_set_allocation_functions(alloc_fns);
                                           ^

@vdevaram

@junshi15: We are not able to use the NIC bonding feature with this PR. Does this PR support NIC bonding? If not, is it possible to add NIC bonding support to the verbs path? How would this be handled in the code? Do you have any plans to add it in the near future?

@junshi15
Contributor Author

@vdevaram I do not have experience with NIC bonding and do not have a plan to add this feature in the future. Your contributions are welcome.
