Error in Distribution Strategy with train_and_evaluate #21412

Closed
lenlen opened this issue Aug 6, 2018 · 4 comments
Labels
comp:dist-strat Distribution Strategy related issues

Comments

@lenlen

lenlen commented Aug 6, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.4 LTS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.10.0-rc1 and nightly
  • Python version: Python 3.5.2
  • Bazel version (if compiling from source): No
  • GCC/Compiler version (if compiling from source): No
  • CUDA/cuDNN version: 9.0.176
  • GPU model and memory: Tesla V100 16152MiB
  • Exact command to reproduce: Save the snippet in test.py and run "python test.py"

Describe the problem

When I use a distribution strategy (OneDeviceStrategy), the estimator's train_and_evaluate method crashes.
The code below works when train and evaluate are called as separate functions, but it fails with train_and_evaluate.
I have attached a snippet that calls train, evaluate and train_and_evaluate together to show the difference in behaviour. The error log is at the end.
Note: it works with TensorFlow 1.9.0 and codalab

from tensorflow import keras as ks
import numpy as np
import tensorflow as tf
from tensorflow.python.estimator import keras as keras_lib


tf.logging.set_verbosity(tf.logging.INFO)


def input_fn():
    x = np.random.random((1024, 10))
    y = np.random.randint(2, size=(1024, 1))
    x = tf.cast(x, tf.float32)
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.repeat(100)
    dataset = dataset.batch(1)
    return dataset


model = ks.Sequential()
model.add(ks.layers.Dense(16, activation='relu', input_shape=(10,)))
model.add(ks.layers.Dense(1, activation='sigmoid'))

optimizer = tf.train.GradientDescentOptimizer(0.2)

model.compile(loss='binary_crossentropy', optimizer=optimizer)

strategy = tf.contrib.distribute.OneDeviceStrategy("device:GPU:0")
config = tf.estimator.RunConfig(train_distribute=strategy)

keras_estimator = keras_lib.model_to_estimator(
  keras_model=model,
  config=config)

keras_estimator.train(input_fn=input_fn, steps=10)
keras_estimator.evaluate(input_fn=input_fn, steps=3)

train_spec = tf.estimator.TrainSpec(
    input_fn=input_fn,
    max_steps=20)
eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn,
    steps=3)

tf.estimator.train_and_evaluate(keras_estimator, train_spec, eval_spec)
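
Because the separate train and evaluate calls above succeed, one possible stopgap is to alternate them by hand instead of calling train_and_evaluate. The sketch below is an assumption, not a tested fix: the helper name alternate_train_and_eval and its chunk_steps parameter are hypothetical, and it relies only on the train(input_fn=..., steps=...) and evaluate(input_fn=..., steps=...) calls already used in the snippet.

```python
def alternate_train_and_eval(estimator, train_input_fn, eval_input_fn,
                             total_steps, chunk_steps, eval_steps):
    """Train in chunks and evaluate after each chunk (hypothetical helper).

    `estimator` only needs `train(input_fn=..., steps=...)` and
    `evaluate(input_fn=..., steps=...)`, as in the snippet above.
    """
    if chunk_steps <= 0:
        raise ValueError("chunk_steps must be positive")
    results = []
    done = 0
    while done < total_steps:
        # Never train past the requested total.
        steps = min(chunk_steps, total_steps - done)
        estimator.train(input_fn=train_input_fn, steps=steps)
        done += steps
        # Evaluation runs as its own call, after the training call has returned.
        results.append(
            estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps))
    return results
```

With the estimator from the snippet, this could be called as alternate_train_and_eval(keras_estimator, input_fn, input_fn, total_steps=10, chunk_steps=10, eval_steps=3). Note that steps counts additional training steps, whereas max_steps in TrainSpec is a limit on the global step.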

Source code / logs

INFO:tensorflow:Using the Keras model provided.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmp_xh7nsci
INFO:tensorflow:Using config: {'_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_task_type': 'worker', '_session_config': None, '_is_chief': True, '_evaluation_master': '', '_log_step_count_steps': 100, '_model_dir': '/tmp/tmp_xh7nsci', '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_service': None, '_save_checkpoints_steps': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9a715b1240>, '_device_fn': None, '_keep_checkpoint_max': 5, '_num_worker_replicas': 1, '_train_distribute': <tensorflow.contrib.distribute.python.one_device_strategy.OneDeviceStrategy object at 0x7f9a630426d8>}
2018-08-06 11:45:02.557498: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-06 11:45:02.679737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-06 11:45:02.680248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1404] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-08-06 11:45:02.680294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:03.032268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:03.032321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:03.032339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:03.032678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2018-08-06 11:45:03.032984: E tensorflow/core/common_runtime/gpu/gpu_device.cc:228] Illegal GPUOptions.experimental.num_dev_to_dev_copy_streams=0 set to 1 instead.
2018-08-06 11:45:03.353069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:03.353135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:03.353154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:03.353162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:03.353297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:04.042046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:04.042111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:04.042128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:04.042148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:04.042305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/keras_model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:loss = 0.6695193, step = 0
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:Loss for final step: 0.3536753.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-06-11:45:04
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:04.822515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:04.822570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:04.822594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:04.822613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:04.822780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/3]
INFO:tensorflow:Evaluation [2/3]
INFO:tensorflow:Evaluation [3/3]
INFO:tensorflow:Finished evaluation at 2018-08-06-11:45:04
INFO:tensorflow:Saving dict for global step 10: global_step = 10, loss = 0.1408012
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10: /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-08-06 11:45:05.220540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-08-06 11:45:05.220581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-06 11:45:05.220600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0 
2018-08-06 11:45:05.220615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N 
2018-08-06 11:45:05.220760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14856 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from /tmp/tmp_xh7nsci/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:loss = 1.5840994, step = 10
INFO:tensorflow:Saving checkpoints for 20 into /tmp/tmp_xh7nsci/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
Traceback (most recent call last):
  File "test2.py", line 45, in <module>
    tf.estimator.train_and_evaluate(keras_estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 590, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 691, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1143, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1368, in _train_model_distributed
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1451, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 695, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 727, in _close_internal
    h.end(self._coordinated_creator.tf_sess)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 470, in end
    self._save(session, last_step)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 489, in _save
    if l.after_save(session, step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 497, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 517, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 884, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 463, in evaluate
    input_fn, hooks, checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1474, in _evaluate_build_graph
    model_fn_lib.LOSS_METRIC_KEY] = metrics_lib.mean(estimator_spec.loss)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/metrics_impl.py", line 376, in mean
    mean_t = distribute_lib.get_tower_context().merge_call(
AttributeError: 'NoneType' object has no attribute 'merge_call'
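
The last frame suggests that get_tower_context() returned None at the point where the mean metric called merge_call on it. The plain-Python sketch below only mimics that failure pattern. It is not TensorFlow code, and every name in it (_current_context, the TowerContext stand-in, tower_scope, mean_metric) is a hypothetical stand-in chosen for illustration. The assumption, which is not confirmed in this thread, is that the evaluation triggered from inside the training call runs where no tower context is active.

```python
import contextlib

_current_context = None  # stand-in for the library's per-thread state


class TowerContext:
    """Hypothetical stand-in for a per-tower context object."""

    def merge_call(self, fn, *args):
        return fn(*args)


def get_tower_context():
    # Returns None when no tower context is active.
    return _current_context


@contextlib.contextmanager
def tower_scope():
    # Activates a tower context for the duration of the with-block.
    global _current_context
    previous, _current_context = _current_context, TowerContext()
    try:
        yield
    finally:
        _current_context = previous


def mean_metric(values):
    # Mirrors the failing pattern: no None check before merge_call.
    return get_tower_context().merge_call(lambda v: sum(v) / len(v), values)


with tower_scope():
    inside = mean_metric([1.0, 2.0, 3.0])  # works: a context is active

try:
    mean_metric([1.0, 2.0, 3.0])  # fails: get_tower_context() is None here
    outside_error = None
except AttributeError as err:
    outside_error = str(err)
```

Inside tower_scope the call succeeds. Outside it, the same call raises an AttributeError of the kind shown at the end of the traceback.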

@lenlen
Author

lenlen commented Aug 6, 2018

probably related to #21180

@shivaniag shivaniag assigned guptapriya and unassigned shivaniag Aug 7, 2018
@shivaniag shivaniag added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 7, 2018
@shivaniag
Contributor

@guptapriya: could you take a look?

@guptapriya
Contributor

Yes, I looked into this briefly some time ago and will be working on it soon. This is a duplicate of #21180.

@guptapriya guptapriya added the comp:dist-strat Distribution Strategy related issues label Aug 8, 2018
@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 9, 2018
@guptapriya
Contributor

This should be fixed in master now.
