
Branch 196939548 #19345

Merged 71 commits on May 17, 2018
Commits (71)
c9298ed
Enable gpu tests for cross_tower_ops_test
guptapriya May 16, 2018
a77abd0
Trivial message cleanup.
martinwicke May 16, 2018
a09c0c8
Fix bug in `WorkerService::Logging()` handler.
mrry May 16, 2018
0fd579e
[TF:XLA] Make softplus more accurate
majnemer May 16, 2018
8cfbc0c
Add performance notes for in-context gradient calls.
tomhennigan May 16, 2018
07bb8c1
Implementation of transpose_conv
tensorflower-gardener May 16, 2018
b0e4e9f
Improving variable_scope documentation.
alextp May 16, 2018
15c319f
Refactor HloInstruction::Fuse and add a method for multi-output fusion.
tensorflower-gardener May 16, 2018
4eb8c8c
internal change
tensorflower-gardener May 16, 2018
d817bb3
Employ array flat sizes more directly in optimized_ops, some places i…
tensorflower-gardener May 16, 2018
373d83e
Migrate HloExecutionProfileTest to textual HLO
tensorflower-gardener May 16, 2018
b72f26e
Add support libraries in core/platform.
tensorflower-gardener May 16, 2018
c09c8ae
[XLA] Expose MinimumMemoryForComputation in hlo_scheduling.h
hawkinsp May 16, 2018
eaa78c1
Resolved inconsistency with shape inference for tf.reduce_join when p…
tensorflower-gardener May 16, 2018
61108cc
Modify tf.contrib.distributions.BatchReshape to behave a bit more like
tensorflower-gardener May 16, 2018
f8d92d7
Migrating BestModelExportStrategy to core library.
tensorflower-gardener May 16, 2018
ce41367
Don't initialize GPUs if none will be used.
tensorflower-gardener May 16, 2018
47e52f0
boosted_trees: accept integer labels properly now the same as float l…
yk5 May 16, 2018
f48c411
Add tf.contrib.data.make_tf_record_dataset() like make_csv_dataset() and
tensorflower-gardener May 16, 2018
1cb3552
[TF:XLA:INTERPRETER] speed up select and scatter by avoiding memory a…
nickdesaulniers May 16, 2018
319f0d6
Add TPUContext for input_fn invocation.
May 16, 2018
bd9ffaa
Removed C++ ABSL includes from tensorflow/core and tensorflow/compiler.
tensorflower-gardener May 16, 2018
1012c69
Remove more Estimator dependencies from core TensorFlow.
May 16, 2018
d82b6b4
[XLA:GPU] Teach ir_emitter_nested how to deal with multi output loop …
d0k May 16, 2018
a42d2f4
Fixes tflite_diff script.
May 16, 2018
0f7bb5b
[XLA:GPU] Emit the final write of the tuple pointers
d0k May 16, 2018
01ed446
Internal Change.
May 16, 2018
6f76e5c
Turn off MirroredStrategy Dataset prefetching in tests when using the
tensorflower-gardener May 16, 2018
de4a1a2
Expand tests to include int64 output type.
jpienaar May 16, 2018
1fdf522
Fix the CCFLAGS mismatch.
tensorflower-gardener May 16, 2018
5e9c94c
[TF:XLA:CPU] enable s32 reduce-window
nickdesaulniers May 16, 2018
d825bd4
Use sequence_length arg for dynamic_rnn within RNNEstimator
tensorflower-gardener May 16, 2018
ea3f7d1
Remove redundant initialization of collective params.
dubey May 16, 2018
37ec698
Add a test for compiled tfe.defun in GradientTape
iganichev May 16, 2018
6e6fb2f
Fix broken link.
tensorflower-gardener May 16, 2018
5ac249b
Remove sorted as types not sortable.
jpienaar May 16, 2018
1614536
Fix the gradient of reduce_prod for complex dtypes.
brianwa84 May 16, 2018
c9e4705
BUILD cleanup in contrib/lite/...
tensorflower-gardener May 16, 2018
41af978
Automated g4 rollback of changelist 196691101
hawkinsp May 16, 2018
da60097
Checkpointable: move python/training/checkpointable_* to python/train…
allenlavoie May 16, 2018
42b657e
Add a parameter to the adaptive shared batcher which allows the user …
tensorflower-gardener May 16, 2018
68134f0
Fix typo in comment
May 16, 2018
d5c075b
Add test for 64-bit clz and sign.
jpienaar May 16, 2018
ee21903
Make sparse_cross operations publicly available.
ispirmustafa May 16, 2018
8f216f5
Fixing test for Topk kernel in TFlite
tensorflower-gardener May 16, 2018
9fd3485
[TF:XLA] Take subcomputations into account during HLO scheduling.
dimvar May 16, 2018
2504156
Move DoesNotUseOperandBuffer and CanShareOperandBufferWithUser from
fdxmw May 16, 2018
9c1a186
Remove unused inclusions
tensorflower-gardener May 16, 2018
76728db
Allow for remote eager execution.
May 16, 2018
6dfde69
[XLA:GPU] Add op-tracing to XLA:GPU.
May 16, 2018
99bd4bc
Remove unused inclusions
tensorflower-gardener May 17, 2018
8baec7e
[XLA] Add documentation explaining FusionKind.
May 17, 2018
a5479bf
[XLA] Improve documentation on HloModule, HloComputation, and HloInst…
May 17, 2018
e1589a9
Adds basic TPU replicate training support for Keras.
May 17, 2018
d5f3097
Remove no-op statement. tf_additional_lib_srcs only selects .cc files…
tensorflower-gardener May 17, 2018
7f1f1b0
Re-enabling a test after a previous fix.
jsimsa May 17, 2018
1511cf2
Fix typo in TensorHandle
iganichev May 17, 2018
0c33e1f
Remove _USE_C_API staging in tests now that the C API is enabled by d…
skye May 17, 2018
d335efb
Remove _USE_C_API staging in tests now that the C API is enabled by d…
skye May 17, 2018
1d46d1a
[XLA] Remove XlaOp::GetShape. It only works when the builder of the X…
tensorflower-gardener May 17, 2018
1ac9c48
[TF:XLA] Make noinline function work with control flow.
tensorflower-gardener May 17, 2018
b33d100
Append device name in executor logging
protoget May 17, 2018
cf55582
Enhance DenseLayer + XLA compatibility test cases to cover compilatio…
tensorflower-gardener May 17, 2018
a5ed23e
Internal change.
jart May 17, 2018
0668573
[TF:XLA] Bump open source llvm revision to r332236
May 17, 2018
b2e53b9
Making GetOptionalInput from kernel_util.h return a pointer to const …
tensorflower-gardener May 17, 2018
deca317
Add more logging in BaseGPUDevice::ComputeHelper for kernel completion.
protoget May 17, 2018
147a31e
[tf.data] Accept NumPy dtype objects in `Dataset.from_generator(..., …
mrry May 17, 2018
c111fa1
Sort tags before logging.
tensorflower-gardener May 17, 2018
23e17c6
Update installation documentation to reflect that CUDA 8 and cuDNN 6 …
tensorflower-gardener May 17, 2018
c5a389d
Merge commit for internal changes
zheng-xq May 17, 2018
23 changes: 23 additions & 0 deletions tensorflow/c/eager/BUILD
@@ -49,6 +49,17 @@ tf_cuda_library(
"//conditions:default": [],
}) + [
"//tensorflow/core/common_runtime/eager:eager_operation",
"//tensorflow/core/distributed_runtime/eager:eager_client",
"//tensorflow/core/distributed_runtime/rpc/eager:grpc_eager_client",
"//tensorflow/core/distributed_runtime/rpc:grpc_channel",
"//tensorflow/core/distributed_runtime/rpc/eager:eager_grpc_server_lib",
"//tensorflow/core/distributed_runtime/rpc:grpc_server_lib",
"//tensorflow/core/distributed_runtime/rpc:grpc_worker_cache",
"//tensorflow/core/distributed_runtime/rpc:grpc_worker_service",
"//tensorflow/core/distributed_runtime/rpc:rpc_rendezvous_mgr",
"//tensorflow/core/distributed_runtime:remote_device",
"//tensorflow/core/distributed_runtime:server_lib",
"//tensorflow/core/distributed_runtime:worker_env",
"//tensorflow/core:gpu_runtime",
],
)
@@ -74,6 +85,17 @@ tf_cuda_library(
"//tensorflow/core/common_runtime/eager:eager_operation",
"//tensorflow/core/common_runtime/eager:kernel_and_device",
"//tensorflow/core/common_runtime/eager:tensor_handle",
"//tensorflow/core/distributed_runtime:remote_device",
"//tensorflow/core/distributed_runtime:server_lib",
"//tensorflow/core/distributed_runtime:worker_env",
"//tensorflow/core/distributed_runtime/eager:eager_client",
"//tensorflow/core/distributed_runtime/eager:remote_tensor_handle",
"//tensorflow/core/distributed_runtime/rpc:grpc_channel",
"//tensorflow/core/distributed_runtime/rpc:grpc_worker_cache",
"//tensorflow/core/distributed_runtime/rpc:grpc_worker_service",
"//tensorflow/core/distributed_runtime/rpc:rpc_rendezvous_mgr",
"//tensorflow/core/distributed_runtime/rpc/eager:eager_grpc_server_lib",
"//tensorflow/core/distributed_runtime/rpc/eager:grpc_eager_client",
],
)

@@ -92,6 +114,7 @@ tf_cuda_cc_test(
"//tensorflow/core:protos_all_cc",
"//tensorflow/core:test",
"//tensorflow/core:test_main",
"//tensorflow/core/distributed_runtime/rpc/eager:eager_grpc_server_lib",
],
)

147 changes: 143 additions & 4 deletions tensorflow/c/eager/c_api.cc
@@ -36,11 +36,17 @@ limitations under the License.
#include "tensorflow/core/common_runtime/eager/execute.h"
#include "tensorflow/core/common_runtime/function.h"
#include "tensorflow/core/common_runtime/rendezvous_mgr.h"
#include "tensorflow/core/distributed_runtime/rpc/eager/eager_grpc_server_lib.h"
#include "tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.h"
#include "tensorflow/core/distributed_runtime/rpc/grpc_channel.h"
#include "tensorflow/core/distributed_runtime/server_lib.h"
#include "tensorflow/core/distributed_runtime/worker_env.h"
#include "tensorflow/core/framework/node_def_util.h"
#include "tensorflow/core/framework/rendezvous.h"
#include "tensorflow/core/framework/tensor_shape.pb.h"
#include "tensorflow/core/framework/types.h"
#include "tensorflow/core/lib/core/refcount.h"
#include "tensorflow/core/lib/gtl/cleanup.h"
#include "tensorflow/core/lib/gtl/flatmap.h"
#include "tensorflow/core/lib/gtl/map_util.h"
#include "tensorflow/core/lib/gtl/stl_util.h"
@@ -71,6 +77,121 @@ string DeviceName(const tensorflow::Device* d) {
std::atomic_int_fast64_t func_id_generator(0);
#endif // TENSORFLOW_EAGER_USE_XLA

tensorflow::Status GetAllRemoteDevices(
const std::vector<string>& remote_workers,
tensorflow::WorkerCacheInterface* worker_cache,
std::unique_ptr<tensorflow::DeviceMgr>* device_mgr) {
std::vector<tensorflow::Device*> remote_devices;
tensorflow::Status status;
// TODO(nareshmodi) do this in parallel instead of serially.
for (const string& remote_worker : remote_workers) {
tensorflow::Notification n;
tensorflow::NewRemoteDevices(
tensorflow::Env::Default(), worker_cache, remote_worker,
[&status, &n, &remote_devices](
const tensorflow::Status& s,
std::vector<tensorflow::Device*>* devices) {
status = s;
if (s.ok()) {
for (tensorflow::Device* d : *devices) {
remote_devices.push_back(d);
}
}
n.Notify();
});
n.WaitForNotification();
}
std::unique_ptr<tensorflow::DeviceMgr> remote_device_mgr(
new tensorflow::DeviceMgr(remote_devices));

TF_RETURN_IF_ERROR(status);

*device_mgr = std::move(remote_device_mgr);
return tensorflow::Status::OK();
}

tensorflow::Status CreateRemoteContexts(
const std::vector<string>& remote_workers,
tensorflow::eager::EagerClientCache* remote_eager_workers, bool async,
tensorflow::gtl::FlatMap<string, tensorflow::uint64>* remote_contexts) {
for (int i = 0; i < remote_workers.size(); i++) {
const string& remote_worker = remote_workers[i];

tensorflow::eager::CreateContextRequest request;
tensorflow::eager::CreateContextResponse response;
tensorflow::DeviceNameUtils::ParsedName parsed_name;
if (!tensorflow::DeviceNameUtils::ParseFullName(remote_worker,
&parsed_name)) {
return tensorflow::errors::InvalidArgument(
"Unable to parse ", remote_worker, " as a device name");
}
request.mutable_server_def()->set_job_name(parsed_name.job);
request.mutable_server_def()->set_task_index(parsed_name.task);
request.set_async(async);
auto* eager_client = remote_eager_workers->GetClient(remote_worker);
if (eager_client == nullptr) {
return tensorflow::errors::Internal(
"Cannot find a client for the given target:", remote_worker);
}
tensorflow::Notification n;
tensorflow::Status status;
// TODO(nareshmodi) do this in parallel instead of serially.
eager_client->CreateContextAsync(
&request, &response, [&status, &n](const tensorflow::Status& s) {
status = s;
n.Notify();
});
n.WaitForNotification();
TF_RETURN_IF_ERROR(status);

remote_contexts->emplace(remote_worker, response.context_id());
}
return tensorflow::Status::OK();
}

tensorflow::Status NewRemoteAwareTFE_Context(const TFE_ContextOptions* opts,
TFE_Context** ctx) {
string worker_name = tensorflow::strings::StrCat(
"/job:", opts->server_def.job_name(),
"/replica:0/task:", opts->server_def.task_index());
std::unique_ptr<tensorflow::eager::EagerGrpcServer> server;
TF_RETURN_IF_ERROR(
tensorflow::eager::EagerGrpcServer::Create(opts->server_def, &server));

TF_RETURN_IF_ERROR(server->Start());

std::vector<string> remote_workers;
server->master_env()->worker_cache->ListWorkers(&remote_workers);
remote_workers.erase(
std::remove(remote_workers.begin(), remote_workers.end(), worker_name),
remote_workers.end());

std::unique_ptr<tensorflow::DeviceMgr> remote_device_mgr;
TF_RETURN_IF_ERROR(GetAllRemoteDevices(
remote_workers, server->master_env()->worker_cache, &remote_device_mgr));

std::shared_ptr<tensorflow::GrpcChannelCache> channel_cache =
server->channel_cache();
std::unique_ptr<tensorflow::eager::EagerClientCache> remote_eager_workers(
tensorflow::eager::NewGrpcEagerClientCache(channel_cache));

// Initialize remote eager workers.
tensorflow::gtl::FlatMap<string, tensorflow::uint64> remote_contexts;
TF_RETURN_IF_ERROR(CreateRemoteContexts(remote_workers,
remote_eager_workers.get(),
opts->async, &remote_contexts));

tensorflow::RemoteRendezvous* r =
server->worker_env()->rendezvous_mgr->Find(0);

auto* device_mgr = server->worker_env()->device_mgr;
*ctx = new TFE_Context(opts->session_options.options, opts->policy,
opts->async, device_mgr, r, std::move(server),
std::move(remote_eager_workers),
std::move(remote_device_mgr), remote_contexts);

return tensorflow::Status::OK();
}
} // namespace

extern "C" {
@@ -91,6 +212,15 @@ void TFE_ContextOptionsSetDevicePlacementPolicy(
options->policy = policy;
}

TF_CAPI_EXPORT extern void TFE_ContextOptionsSetServerDef(
TFE_ContextOptions* options, const void* proto, size_t proto_len,
TF_Status* status) {
if (!options->server_def.ParseFromArray(proto, proto_len)) {
status->status = tensorflow::errors::InvalidArgument(
"Invalid tensorflow.ServerDef protocol buffer");
}
}

TF_CAPI_EXPORT extern void TFE_ContextSetAsyncForThread(TFE_Context* ctx,
unsigned char async,
TF_Status* status) {
@@ -100,17 +230,23 @@ TF_CAPI_EXPORT extern void TFE_ContextSetAsyncForThread(TFE_Context* ctx,
void TFE_DeleteContextOptions(TFE_ContextOptions* options) { delete options; }

TFE_Context* TFE_NewContext(const TFE_ContextOptions* opts, TF_Status* status) {
if (!opts->server_def.job_name().empty()) {
TFE_Context* ctx = nullptr;
status->status = NewRemoteAwareTFE_Context(opts, &ctx);
return ctx;
}

std::vector<tensorflow::Device*> devices;
status->status = tensorflow::DeviceFactory::AddDevices(
opts->session_options.options, "/job:localhost/replica:0/task:0",
&devices);
if (!status->status.ok()) {
return nullptr;
}
if (!status->status.ok()) return nullptr;
std::unique_ptr<tensorflow::DeviceMgr> device_mgr(
new tensorflow::DeviceMgr(devices));

tensorflow::Rendezvous* r =
new tensorflow::IntraProcessRendezvous(device_mgr.get());

return new TFE_Context(opts->session_options.options, opts->policy,
opts->async, std::move(device_mgr), r);
}
@@ -119,7 +255,10 @@ void TFE_DeleteContext(TFE_Context* ctx, TF_Status* status) { delete ctx; }

TF_DeviceList* TFE_ContextListDevices(TFE_Context* ctx, TF_Status* status) {
TF_DeviceList* list = new TF_DeviceList;
ctx->context.device_mgr()->ListDeviceAttributes(&list->response);
ctx->context.local_device_mgr()->ListDeviceAttributes(&list->response);
if (ctx->context.remote_device_mgr()) {
ctx->context.remote_device_mgr()->ListDeviceAttributes(&list->response);
}
return list;
}

10 changes: 10 additions & 0 deletions tensorflow/c/eager/c_api.h
@@ -81,6 +81,16 @@ TF_CAPI_EXPORT extern void TFE_ContextOptionsSetAsync(TFE_ContextOptions*,
TF_CAPI_EXPORT extern void TFE_ContextOptionsSetDevicePlacementPolicy(
TFE_ContextOptions*, TFE_ContextDevicePlacementPolicy);

// A tensorflow.ServerDef specifies remote workers (in addition to the current
// worker's name). Operations created on this context can then be executed on
// any of these remote workers by setting an appropriate device.
//
// If a ServerDef is set, all servers identified by the
// ServerDef must be up when the context is created.
TF_CAPI_EXPORT extern void TFE_ContextOptionsSetServerDef(
TFE_ContextOptions* options, const void* proto, size_t proto_len,
TF_Status* status);

// Destroy an options object.
TF_CAPI_EXPORT extern void TFE_DeleteContextOptions(TFE_ContextOptions*);

26 changes: 26 additions & 0 deletions tensorflow/c/eager/c_api_internal.h
@@ -37,6 +37,14 @@ limitations under the License.
#include "tensorflow/core/common_runtime/eager/tensor_handle.h"
#include "tensorflow/core/common_runtime/function.h"
#include "tensorflow/core/common_runtime/rendezvous_mgr.h"
#include "tensorflow/core/distributed_runtime/eager/eager_client.h"
#include "tensorflow/core/distributed_runtime/remote_device.h"
#include "tensorflow/core/distributed_runtime/rpc/eager/eager_grpc_server_lib.h"
#include "tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.h"
#include "tensorflow/core/distributed_runtime/rpc/grpc_worker_service.h"
#include "tensorflow/core/distributed_runtime/rpc/rpc_rendezvous_mgr.h"
#include "tensorflow/core/distributed_runtime/server_lib.h"
#include "tensorflow/core/distributed_runtime/worker_env.h"
#include "tensorflow/core/framework/rendezvous.h"
#include "tensorflow/core/lib/core/stringpiece.h"
#include "tensorflow/core/lib/gtl/inlined_vector.h"
@@ -51,6 +59,7 @@ struct TFE_ContextOptions {
// true if async execution is enabled.
bool async = false;
TFE_ContextDevicePlacementPolicy policy{TFE_DEVICE_PLACEMENT_SILENT};
tensorflow::ServerDef server_def;
};

struct TFE_Context {
@@ -64,6 +73,23 @@ struct TFE_Context {
default_policy),
async, std::move(device_mgr), rendezvous) {}

explicit TFE_Context(
const tensorflow::SessionOptions& opts,
TFE_ContextDevicePlacementPolicy default_policy, bool async,
tensorflow::DeviceMgr* local_device_mgr,
tensorflow::Rendezvous* rendezvous,
std::unique_ptr<tensorflow::GrpcServer> server,
std::unique_ptr<tensorflow::eager::EagerClientCache> remote_eager_workers,
std::unique_ptr<tensorflow::DeviceMgr> remote_device_mgr,
const tensorflow::gtl::FlatMap<tensorflow::string, tensorflow::uint64>&
remote_contexts)
: context(opts,
static_cast<tensorflow::ContextDevicePlacementPolicy>(
default_policy),
async, local_device_mgr, rendezvous, std::move(server),
std::move(remote_eager_workers), std::move(remote_device_mgr),
remote_contexts) {}

tensorflow::EagerContext context;
};
