Error in Distribution Strategy with train_and_evaluate #21412

lenlen · 2018-08-06T15:07:05Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.4 LTS
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.10.0-rc1 and nightly
Python version: Python 3.5.2
Bazel version (if compiling from source): No
GCC/Compiler version (if compiling from source): No
CUDA/cuDNN version: 9.0.176
GPU model and memory: Tesla V100 16152MiB
Exact command to reproduce: Save the snippet in test.py and run "python test.py"

Describe the problem

When using a distribution strategy (OneDeviceStrategy), the estimator crashes in the train_and_evaluate method.
The code below works when train and evaluate are called as separate functions, but it fails with train_and_evaluate.
I have attached a snippet that calls train, evaluate, and train_and_evaluate together to show the difference in behaviour. The error log is at the end.
Note: it works with TensorFlow 1.9.0 and codalab.

from tensorflow import keras as ks
import numpy as np
import tensorflow as tf
from tensorflow.python.estimator import keras as keras_lib


tf.logging.set_verbosity(tf.logging.INFO)


def input_fn():
    x = np.random.random((1024, 10))
    y = np.random.randint(2, size=(1024, 1))
    x = tf.cast(x, tf.float32)
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.repeat(100)
    dataset = dataset.batch(1)
    return dataset


model = ks.Sequential()
model.add(ks.layers.Dense(16, activation='relu', input_shape=(10,)))
model.add(ks.layers.Dense(1, activation='sigmoid'))

optimizer = tf.train.GradientDescentOptimizer(0.2)

model.compile(loss='binary_crossentropy', optimizer=optimizer)

strategy = tf.contrib.distribute.OneDeviceStrategy("device:GPU:0")
config = tf.estimator.RunConfig(train_distribute=strategy)

keras_estimator = keras_lib.model_to_estimator(
  keras_model=model,
  config=config)

keras_estimator.train(input_fn=input_fn, steps=10)
keras_estimator.evaluate(input_fn=input_fn, steps=3)

train_spec = tf.estimator.TrainSpec(
    input_fn=input_fn,
    max_steps=20)
eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn,
    steps=3)

tf.estimator.train_and_evaluate(keras_estimator, train_spec, eval_spec)

Source code / logs

INFO:tensorflow:Using the Keras model provided.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmp_xh7nsci
INFO:tensorflow:Using config: {'_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_task_type': 'worker', '_session_config': None, '_is_chief': True, '_evaluation_master': '', '_log_step_count_steps': 100, '_model_dir': '/tmp/tmp_xh7nsci', '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_service': None, '_save_checkpoints_steps': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9a715b1240>, '_device_fn': None, '_keep_checkpoint_max': 5, '_num_worker_replicas': 1, '_train_distribute': <tensorflow.contrib.distribute.python.one_device_strategy.OneDeviceStrategy object at 0x7f9a630426d8>}
2018-08-06 11:45:02.557498: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-06 11:45:02.679737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-06 11:45:02.680248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1404] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-08-06 11:45:02.680294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:03.032268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:03.032321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:03.032339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:03.032678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2018-08-06 11:45:03.032984: E tensorflow/core/common_runtime/gpu/gpu_device.cc:228] Illegal GPUOptions.experimental.num_dev_to_dev_copy_streams=0 set to 1 instead.
2018-08-06 11:45:03.353069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:03.353135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:03.353154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:03.353162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:03.353297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:04.042046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:04.042111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:04.042128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:04.042148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:04.042305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/keras_model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:loss = 0.6695193, step = 0
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:Loss for final step: 0.3536753.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-06-11:45:04
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:04.822515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:04.822570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:04.822594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:04.822613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:04.822780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/3]
INFO:tensorflow:Evaluation [2/3]
INFO:tensorflow:Evaluation [3/3]
INFO:tensorflow:Finished evaluation at 2018-08-06-11:45:04
INFO:tensorflow:Saving dict for global step 10: global_step = 10, loss = 0.1408012
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10: /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:05.220540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:05.220581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:05.220600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:05.220615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:05.220760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:loss = 1.5840994, step = 10
INFO:tensorflow:Saving checkpoints for 20 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
Traceback (most recent call last):
  File "test2.py", line 45, in <module>
    tf.estimator.train_and_evaluate(keras_estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 590, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 691, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1143, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1368, in _train_model_distributed
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1451, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 695, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 727, in _close_internal
    h.end(self._coordinated_creator.tf_sess)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 470, in end
    self._save(session, last_step)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 489, in _save
    if l.after_save(session, step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 497, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 517, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 884, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 463, in evaluate
    input_fn, hooks, checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1474, in _evaluate_build_graph
    model_fn_lib.LOSS_METRIC_KEY] = metrics_lib.mean(estimator_spec.loss)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/metrics_impl.py", line 376, in mean
    mean_t = distribute_lib.get_tower_context().merge_call(
AttributeError: 'NoneType' object has no attribute 'merge_call'
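
A possible workaround on affected versions, based only on the observation above that separate train and evaluate calls succeed: alternate the two calls by hand instead of using train_and_evaluate. This is a minimal sketch, not the fix that landed in master. The helper name train_then_evaluate and its parameters total_steps, steps_per_round, and eval_steps are hypothetical. It assumes the estimator exposes train(input_fn, steps) and evaluate(input_fn, steps), as tf.estimator.Estimator does.

```python
def train_then_evaluate(estimator, train_input_fn, eval_input_fn,
                        total_steps, steps_per_round, eval_steps):
    """Alternate estimator.train and estimator.evaluate by hand.

    Hypothetical helper: it avoids train_and_evaluate, which runs the
    evaluation from inside the training session's checkpoint listener.
    """
    results = []
    done = 0
    while done < total_steps:
        # `steps` is incremental: train this many more steps from the
        # latest checkpoint, then return.
        n = min(steps_per_round, total_steps - done)
        estimator.train(input_fn=train_input_fn, steps=n)
        # Evaluate as a separate top-level call, as in the working part
        # of the script above.
        results.append(
            estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps))
        done += n
    return results
```

With the script above this would be called as train_then_evaluate(keras_estimator, input_fn, input_fn, total_steps=10, steps_per_round=5, eval_steps=3). Unlike train_and_evaluate, this loop does not run exporters or trigger evaluation on each saved checkpoint, and the graph is rebuilt on every round.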


lenlen · 2018-08-06T16:42:15Z

Probably related to #21180.

shivaniag · 2018-08-07T01:14:33Z

@guptapriya: could you take a look.

guptapriya · 2018-08-08T22:20:13Z

Yes, I looked into this briefly some time ago and will be working on it soon. Duplicate of #21180.

guptapriya · 2018-08-14T19:37:45Z

This should be fixed in master now

tensorflowbutler assigned shivaniag Aug 7, 2018

shivaniag assigned guptapriya and unassigned shivaniag Aug 7, 2018

shivaniag added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 7, 2018

guptapriya added the comp:dist-strat Distribution Strategy related issues label Aug 8, 2018

tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 9, 2018

tensorflow-copybara closed this as completed in 77fabbe Aug 14, 2018
