System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.4 LTS
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.10.0-rc1 and nightly
Python version: Python 3.5.2
Bazel version (if compiling from source): No
GCC/Compiler version (if compiling from source): No
CUDA/cuDNN version: 9.0.176
GPU model and memory: Tesla V100 16152MiB
Exact command to reproduce: Save the snippet in test.py and run "python test.py"
Describe the problem
With a distribution strategy (OneDeviceStrategy), the estimator's train_and_evaluate method crashes.
The code below works when train and evaluate are called as separate functions, but it fails with train_and_evaluate.
I have attached a snippet that calls train, evaluate, and train_and_evaluate together to show the difference in behaviour. The error log is at the end.
Note: it works with TensorFlow 1.9.0 and codalab.
from tensorflow import keras as ks
import numpy as np
import tensorflow as tf
from tensorflow.python.estimator import keras as keras_lib

tf.logging.set_verbosity(tf.logging.INFO)


def input_fn():
    x = np.random.random((1024, 10))
    y = np.random.randint(2, size=(1024, 1))
    x = tf.cast(x, tf.float32)
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.repeat(100)
    dataset = dataset.batch(1)
    return dataset


model = ks.Sequential()
model.add(ks.layers.Dense(16, activation='relu', input_shape=(10,)))
model.add(ks.layers.Dense(1, activation='sigmoid'))
optimizer = tf.train.GradientDescentOptimizer(0.2)
model.compile(loss='binary_crossentropy', optimizer=optimizer)

strategy = tf.contrib.distribute.OneDeviceStrategy("device:GPU:0")
config = tf.estimator.RunConfig(train_distribute=strategy)
keras_estimator = keras_lib.model_to_estimator(
    keras_model=model,
    config=config)

keras_estimator.train(input_fn=input_fn, steps=10)
keras_estimator.evaluate(input_fn=input_fn, steps=3)

train_spec = tf.estimator.TrainSpec(
    input_fn=input_fn,
    max_steps=20)
eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn,
    steps=3)
tf.estimator.train_and_evaluate(keras_estimator, train_spec, eval_spec)
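Since separate train and evaluate calls are reported to work, one possible stopgap is to alternate them by hand instead of calling train_and_evaluate. The sketch below is only an illustration of that idea: the helper name `train_then_evaluate` and its `steps_per_round` parameter are hypothetical, it relies only on the `train(input_fn=..., steps=...)` and `evaluate(input_fn=..., steps=...)` methods already used in the snippet, and it has not been verified against the failing TensorFlow versions. It also counts steps locally rather than using the global step, and it evaluates every `steps_per_round` steps rather than after each checkpoint, so it is not a drop-in replacement.

```python
def train_then_evaluate(estimator, train_input_fn, eval_input_fn,
                        max_steps, steps_per_round, eval_steps):
    """Alternate estimator.train() and estimator.evaluate().

    Hypothetical helper: trains in rounds of `steps_per_round` steps
    (the last round may be shorter) until `max_steps` steps have been
    run by this function, evaluating after every round.
    Returns the list of evaluation results, one per round.
    """
    results = []
    steps_done = 0
    while steps_done < max_steps:
        # Never train past the requested total number of steps.
        steps = min(steps_per_round, max_steps - steps_done)
        estimator.train(input_fn=train_input_fn, steps=steps)
        steps_done += steps
        # Evaluate outside of the training call, as in the working case.
        results.append(
            estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps))
    return results
```

With the names from the snippet above, a call might look like `train_then_evaluate(keras_estimator, input_fn, input_fn, max_steps=20, steps_per_round=10, eval_steps=3)`.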
Source code / logs
INFO:tensorflow:Using the Keras model provided.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmp_xh7nsci
INFO:tensorflow:Using config: {'_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_task_type': 'worker', '_session_config': None, '_is_chief': True, '_evaluation_master': '', '_log_step_count_steps': 100, '_model_dir': '/tmp/tmp_xh7nsci', '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_service': None, '_save_checkpoints_steps': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9a715b1240>, '_device_fn': None, '_keep_checkpoint_max': 5, '_num_worker_replicas': 1, '_train_distribute': <tensorflow.contrib.distribute.python.one_device_strategy.OneDeviceStrategy object at 0x7f9a630426d8>}
2018-08-06 11:45:02.557498: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-06 11:45:02.679737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-06 11:45:02.680248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1404] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-08-06 11:45:02.680294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:03.032268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:03.032321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970] 0
2018-08-06 11:45:03.032339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0: N
2018-08-06 11:45:03.032678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2018-08-06 11:45:03.032984: E tensorflow/core/common_runtime/gpu/gpu_device.cc:228] Illegal GPUOptions.experimental.num_dev_to_dev_copy_streams=0 set to 1 instead.
2018-08-06 11:45:03.353069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:03.353135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:03.353154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970] 0
2018-08-06 11:45:03.353162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0: N
2018-08-06 11:45:03.353297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:04.042046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:04.042111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:04.042128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970] 0
2018-08-06 11:45:04.042148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0: N
2018-08-06 11:45:04.042305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/keras_model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:loss = 0.6695193, step = 0
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:Loss for final step: 0.3536753.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-06-11:45:04
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:04.822515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:04.822570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:04.822594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970] 0
2018-08-06 11:45:04.822613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0: N
2018-08-06 11:45:04.822780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/3]
INFO:tensorflow:Evaluation [2/3]
INFO:tensorflow:Evaluation [3/3]
INFO:tensorflow:Finished evaluation at 2018-08-06-11:45:04
INFO:tensorflow:Saving dict for global step 10: global_step = 10, loss = 0.1408012
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10: /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:05.220540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:05.220581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:05.220600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970] 0
2018-08-06 11:45:05.220615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0: N
2018-08-06 11:45:05.220760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:loss = 1.5840994, step = 10
INFO:tensorflow:Saving checkpoints for 20 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
Traceback (most recent call last):
  File "test2.py", line 45, in <module>
    tf.estimator.train_and_evaluate(keras_estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 590, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 691, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1143, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1368, in _train_model_distributed
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1451, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 695, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 727, in _close_internal
    h.end(self._coordinated_creator.tf_sess)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 470, in end
    self._save(session, last_step)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 489, in _save
    if l.after_save(session, step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 497, in after_save
    self._evaluate(global_step_value) # updates self.eval_result
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 517, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 884, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 463, in evaluate
    input_fn, hooks, checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1474, in _evaluate_build_graph
    model_fn_lib.LOSS_METRIC_KEY] = metrics_lib.mean(estimator_spec.loss)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/metrics_impl.py", line 376, in mean
    mean_t = distribute_lib.get_tower_context().merge_call(
AttributeError: 'NoneType' object has no attribute 'merge_call'
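The last frame of the traceback shows that get_tower_context() returned None and that merge_call was then called on it. A plausible reading, which is an inference from the traceback and not something confirmed here, is that the evaluation triggered from the checkpoint listener builds its graph at a point where no tower context is active. The toy model below is plain Python and is not TensorFlow's implementation; `TowerContext`, `strategy_scope`, and this `mean` are hypothetical stand-ins that only reproduce the shape of the error.

```python
import threading

# Per-thread storage for the "current tower context" (toy model only).
_state = threading.local()


class TowerContext:
    """Hypothetical stand-in for a per-tower context object."""

    def merge_call(self, fn, *args):
        # In this toy model there is a single tower, so just call fn.
        return fn(*args)


def get_tower_context():
    """Return the active context, or None when no scope is active."""
    return getattr(_state, "tower_context", None)


class strategy_scope:
    """Hypothetical scope that installs a tower context while active."""

    def __enter__(self):
        self._previous = get_tower_context()
        _state.tower_context = TowerContext()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore whatever was active before entering the scope.
        _state.tower_context = self._previous
        return False


def mean(values):
    """Toy metric that assumes a tower context is always active."""
    return get_tower_context().merge_call(
        lambda v: sum(v) / len(v), values)
```

Inside `with strategy_scope():` the call `mean([1.0, 2.0, 3.0])` goes through merge_call normally. Outside any scope, the same call raises an AttributeError of the same form as the one in the log, because it dereferences None.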