impl.OpError: file is too short to be an sstable & DataLossError: Checksum does not match - TensorFlow 2.1.0 #39033
Description
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (Linux Ubuntu 16.04):
- TensorFlow version (2.1.0):
- Python version: (3.6)
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
- Kubernetes version (v1.14.3)
- Kubeflow version (v1.0)
Describe the current behavior
Hi all! I am trying to run this TensorFlow job on a Kubernetes cluster using Kubeflow, but I keep getting nondeterministic errors that are hard to trace. I have to run the same job repeatedly with different TF_CONFIG settings, and each run has some chance of failing with one of the errors below. The job uses TensorFlow 2.0 and Kubeflow 1.0. Because the failure is intermittent, it is very hard to isolate: if I simply delete and restart the job, it sometimes runs fine, though there is still a slight chance of hitting the same error again. Could someone please point out the root cause of this behavior?
Describe the expected behavior
The jobs should not fail nondeterministically.
Error Log1
2020-04-16 03:32:10.840927: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-04-16 03:32:10.841009: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-04-16 03:32:10.841016: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
I0416 03:32:11.778649 140150905272128 dataset_builder.py:199] Overwrite dataset info from restored data version.
I0416 03:32:11.781358 140150905272128 dataset_builder.py:285] Reusing dataset cifar10 (./dataDir/cifar10/3.0.0)
I0416 03:32:11.781544 140150905272128 dataset_builder.py:458] Constructing tf.data.Dataset for split train, from ./dataDir/cifar10/3.0.0
2020-04-16 03:32:11.785308: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-04-16 03:32:11.785339: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-04-16 03:32:11.785361: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0): /proc/driver/nvidia/version does not exist
2020-04-16 03:32:11.785606: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-16 03:32:11.794453: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194825000 Hz
2020-04-16 03:32:11.796551: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4895200 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-16 03:32:11.796574: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:1587007938.6536536:tensorflow:TF_CONFIG environment variable: {'cluster': {'master': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222'], 'worker': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222']}, 'task': {'type': 'master', 'index': 0}, 'environment': 'cloud'}
I0416 03:32:18.653653 140150905272128 run_config.py:535] TF_CONFIG environment variable: {'cluster': {'master': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222'], 'worker': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222']}, 'task': {'type': 'master', 'index': 0}, 'environment': 'cloud'}
INFO:1587007938.6553543:tensorflow:Using the Keras model provided.
I0416 03:32:18.655354 140150905272128 keras.py:540] Using the Keras model provided.
WARNING:1587007938.7225273:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0416 03:32:18.722527 140150905272128 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:1587007950.8246615:tensorflow:Using config: {'_model_dir': './out/tftestaccuracy7-testjob-temp-1-7-3-3-0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 20000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'master': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222'], 'worker': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222']}), '_task_type': 'master', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222', '_evaluation_master': '', '_num_ps_replicas': 7, '_num_worker_replicas': 4, '_is_chief': True}
I0416 03:32:30.824661 140150905272128 estimator.py:216] Using config: {'_model_dir': './out/tftestaccuracy7-testjob-temp-1-7-3-3-0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 20000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'master': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222'], 'worker': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222']}), '_task_type': 'master', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222', '_evaluation_master': '', '_num_ps_replicas': 7, '_num_worker_replicas': 4, '_is_chief': True}
INFO:1587007950.825566:tensorflow:Not using Distribute Coordinator.
I0416 03:32:30.825566 140150905272128 estimator_training.py:186] Not using Distribute Coordinator.
INFO:1587007950.8260767:tensorflow:Start Tensorflow server.
I0416 03:32:30.826076 140150905272128 training.py:744] Start Tensorflow server.
2020-04-16 03:32:30.835044: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2020-04-16 03:32:30.835109: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job ps -> {0 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222, 1 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222, 2 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222, 3 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222, 4 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222, 5 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222, 6 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222}
2020-04-16 03:32:30.835125: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222, 1 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222, 2 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222}
2020-04-16 03:32:30.838051: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
WARNING:1587007950.8507316:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0416 03:32:30.850731 140150905272128 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0416 03:32:30.867805 140150905272128 dataset_builder.py:199] Overwrite dataset info from restored data version.
I0416 03:32:30.875023 140150905272128 dataset_builder.py:285] Reusing dataset cifar10 (./dataDir/cifar10/3.0.0)
I0416 03:32:30.875263 140150905272128 dataset_builder.py:458] Constructing tf.data.Dataset for split train, from ./dataDir/cifar10/3.0.0
INFO:1587007951.0019863:tensorflow:Calling model_fn.
I0416 03:32:31.001986 140150905272128 estimator.py:1151] Calling model_fn.
INFO:1587007957.5129364:tensorflow:Done calling model_fn.
I0416 03:32:37.512936 140150905272128 estimator.py:1153] Done calling model_fn.
INFO:1587007957.5134861:tensorflow:Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='./out/tftestaccuracy7-testjob-temp-1-7-3-3-0/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
I0416 03:32:37.513486 140150905272128 estimator.py:1372] Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='./out/tftestaccuracy7-testjob-temp-1-7-3-3-0/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
INFO:1587007957.513585:tensorflow:Warm-starting from: ./out/tftestaccuracy7-testjob-temp-1-7-3-3-0/keras/keras_model.ckpt
I0416 03:32:37.513585 140150905272128 warm_starting_util.py:464] Warm-starting from: ./out/tftestaccuracy7-testjob-temp-1-7-3-3-0/keras/keras_model.ckpt
INFO:1587007957.513649:tensorflow:Warm-starting variables only in TRAINABLE_VARIABLES.
I0416 03:32:37.513648 140150905272128 warm_starting_util.py:343] Warm-starting variables only in TRAINABLE_VARIABLES.
tfds.core.DatasetInfo(
name='cifar10',
version=3.0.0,
description='The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.',
homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
features=FeaturesDict({
'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
}),
total_num_examples=60000,
splits={
'test': 10000,
'train': 50000,
},
supervised_keys=('image', 'label'),
citation="""@TECHREPORT{Krizhevsky09learningmultiple,
author = {Alex Krizhevsky},
title = {Learning multiple layers of features from tiny images},
institution = {},
year = {2009}
}""",
redistribution_info=,
)
Input data (batch size 64): <DatasetV1Adapter shapes: ((None, 128, 128, 3), (None,)), types: (tf.float32, tf.int64)>
+++++ Building Keras model +++++
Output of feature extraction (original model): (64, 4, 4, 2048)
Output of 0th classification layer (<tensorflow.python.keras.layers.pooling.GlobalAveragePooling2D object at 0x7f766056aa58>): (64, 2048)
Output of 1th classification layer (<tensorflow.python.keras.layers.core.Dense object at 0x7f766056aba8>): (64, 10)
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
resnet50 (Model) (None, 4, 4, 2048) 23587712
_________________________________________________________________
sequential (Sequential) (None, 10) 20490
=================================================================
Total params: 23,608,202
Trainable params: 23,555,082
Non-trainable params: 53,120
_________________________________________________________________
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
global_average_pooling2d (Gl (None, 2048) 0
_________________________________________________________________
dense (Dense) (None, 10) 20490
=================================================================
Total params: 20,490
Trainable params: 20,490
Non-trainable params: 0
_________________________________________________________________
+++++ Train and evaluate the Estimator model +++++
tfds.core.DatasetInfo(
name='cifar10',
version=3.0.0,
description='The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.',
homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
features=FeaturesDict({
'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
}),
total_num_examples=60000,
splits={
'test': 10000,
'train': 50000,
},
supervised_keys=('image', 'label'),
citation="""@TECHREPORT{Krizhevsky09learningmultiple,
author = {Alex Krizhevsky},
title = {Learning multiple layers of features from tiny images},
institution = {},
year = {2009}
}""",
redistribution_info=,
)
Input data (batch size 64): <DatasetV1Adapter shapes: ((None, 128, 128, 3), (None,)), types: (tf.float32, tf.int64)>
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/py_checkpoint_reader.py", line 95, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: file is too short to be an sstable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/keras_to_est.py", line 242, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/app/keras_to_est.py", line 227, in main
tf.estimator.train_and_evaluate(model_est, train_spec, eval_spec)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 640, in run
getattr(self, task_to_run)()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 677, in run_master
self._start_distributed_training(saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 796, in _start_distributed_training
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1373, in _train_with_estimator_spec
warm_starting_util.warm_start(*self._warm_start_settings)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/warm_starting_util.py", line 476, in warm_start
ckpt_to_initialize_from, grouped_variables.keys())
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/warm_starting_util.py", line 397, in _get_object_checkpoint_renames
names_to_keys = saver_lib.object_graph_key_mapping(fname)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1617, in object_graph_key_mapping
reader = py_checkpoint_reader.NewCheckpointReader(checkpoint_path)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/py_checkpoint_reader.py", line 99, in NewCheckpointReader
error_translator(e)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/py_checkpoint_reader.py", line 48, in error_translator
raise errors_impl.OpError(None, None, error_message, errors_impl.UNKNOWN)
tensorflow.python.framework.errors_impl.OpError: file is too short to be an sstable
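The traceback above fails inside `NewCheckpointReader` while warm-starting from `keras/keras_model.ckpt`. One possible explanation, which this report does not confirm, is that the file is opened while another replica is still writing it to a shared `model_dir`. A minimal diagnostic sketch under that assumption follows: it retries a loader until the file becomes readable. The helper name `wait_until_readable` and its parameters are hypothetical; in TensorFlow the loader could be `tf.train.load_checkpoint`.

```python
import time


def wait_until_readable(load, path, attempts=5, delay_secs=2.0, sleep=time.sleep):
    """Call load(path) until it succeeds; re-raise the last error after `attempts` tries.

    `load` is any callable that raises while the file is incomplete.
    Assumption: with TensorFlow this could be tf.train.load_checkpoint.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return load(path)
        except Exception as err:  # broad on purpose for a diagnostic; narrow in real code
            last_error = err
            if attempt < attempts:
                sleep(delay_secs)  # give a concurrent writer time to finish
    raise last_error


# Hypothetical usage, before calling tf.estimator.train_and_evaluate:
#   import tensorflow as tf
#   reader = wait_until_readable(tf.train.load_checkpoint, ckpt_path)
```

If the loader succeeds on a later attempt, that points toward a write/read race rather than a permanently corrupt file. If it never succeeds, the checkpoint itself is more likely truncated.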
Error Log2
2020-04-17 11:10:04.594547: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-04-17 11:10:04.594650: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-04-17 11:10:04.594665: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
I0417 11:10:05.664701 140220500707136 dataset_builder.py:199] Overwrite dataset info from restored data version.
I0417 11:10:05.667569 140220500707136 dataset_builder.py:285] Reusing dataset cifar10 (./dataDir/cifar10/3.0.0)
I0417 11:10:05.667821 140220500707136 dataset_builder.py:458] Constructing tf.data.Dataset for split train, from ./dataDir/cifar10/3.0.0
2020-04-17 11:10:05.672440: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-04-17 11:10:05.672469: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-04-17 11:10:05.672494: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0): /proc/driver/nvidia/version does not exist
2020-04-17 11:10:05.672896: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-17 11:10:05.685860: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194855000 Hz
2020-04-17 11:10:05.688580: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4db8fe0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-17 11:10:05.688621: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:1587121814.71867:tensorflow:TF_CONFIG environment variable: {'cluster': {'master': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222'], 'worker': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222']}, 'task': {'type': 'master', 'index': 0}, 'environment': 'cloud'}
I0417 11:10:14.718669 140220500707136 run_config.py:535] TF_CONFIG environment variable: {'cluster': {'master': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222'], 'worker': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222']}, 'task': {'type': 'master', 'index': 0}, 'environment': 'cloud'}
INFO:1587121814.7204363:tensorflow:Using the Keras model provided.
I0417 11:10:14.720436 140220500707136 keras.py:540] Using the Keras model provided.
INFO:1587121814.7229145:tensorflow:Using config: {'_model_dir': './out/tftestaccuracy6-testjob-temp-1-6-11-11-0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 20000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'master': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222'], 'worker': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222']}), '_task_type': 'master', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222', '_evaluation_master': '', '_num_ps_replicas': 6, '_num_worker_replicas': 12, '_is_chief': True}
I0417 11:10:14.722914 140220500707136 estimator.py:216] Using config: {'_model_dir': './out/tftestaccuracy6-testjob-temp-1-6-11-11-0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 20000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'master': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222'], 'worker': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222']}), '_task_type': 'master', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222', '_evaluation_master': '', '_num_ps_replicas': 6, '_num_worker_replicas': 12, '_is_chief': True}
INFO:1587121814.7236319:tensorflow:Not using Distribute Coordinator.
I0417 11:10:14.723631 140220500707136 estimator_training.py:186] Not using Distribute Coordinator.
INFO:1587121814.7240767:tensorflow:Start Tensorflow server.
I0417 11:10:14.724076 140220500707136 training.py:744] Start Tensorflow server.
2020-04-17 11:10:14.731863: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2020-04-17 11:10:14.731898: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job ps -> {0 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222, 1 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222, 2 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222, 3 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222, 4 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222, 5 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222}
2020-04-17 11:10:14.731910: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222, 1 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222, 2 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222, 3 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222, 4 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222, 5 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222, 6 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222, 7 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222, 8 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222, 9 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222, 10 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222}
2020-04-17 11:10:14.733813: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
WARNING:1587121814.7500741:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0417 11:10:14.750074 140220500707136 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:1587121814.7521474:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0417 11:10:14.752147 140220500707136 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0417 11:10:14.775437 140220500707136 dataset_builder.py:199] Overwrite dataset info from restored data version.
I0417 11:10:14.781437 140220500707136 dataset_builder.py:285] Reusing dataset cifar10 (./dataDir/cifar10/3.0.0)
I0417 11:10:14.781795 140220500707136 dataset_builder.py:458] Constructing tf.data.Dataset for split train, from ./dataDir/cifar10/3.0.0
INFO:1587121814.9113212:tensorflow:Calling model_fn.
I0417 11:10:14.911321 140220500707136 estimator.py:1151] Calling model_fn.
INFO:1587121822.074707:tensorflow:Done calling model_fn.
I0417 11:10:22.074707 140220500707136 estimator.py:1153] Done calling model_fn.
INFO:1587121822.0750966:tensorflow:Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='./out/tftestaccuracy6-testjob-temp-1-6-11-11-0/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
I0417 11:10:22.075096 140220500707136 estimator.py:1372] Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='./out/tftestaccuracy6-testjob-temp-1-6-11-11-0/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
INFO:1587121822.075177:tensorflow:Warm-starting from: ./out/tftestaccuracy6-testjob-temp-1-6-11-11-0/keras/keras_model.ckpt
I0417 11:10:22.075176 140220500707136 warm_starting_util.py:464] Warm-starting from: ./out/tftestaccuracy6-testjob-temp-1-6-11-11-0/keras/keras_model.ckpt
INFO:1587121822.0752382:tensorflow:Warm-starting variables only in TRAINABLE_VARIABLES.
I0417 11:10:22.075238 140220500707136 warm_starting_util.py:343] Warm-starting variables only in TRAINABLE_VARIABLES.
INFO:1587121823.262496:tensorflow:Warm-started 214 variables.
I0417 11:10:23.262495 140220500707136 warm_starting_util.py:538] Warm-started 214 variables.
INFO:1587121823.2661114:tensorflow:Create CheckpointSaverHook.
I0417 11:10:23.266111 140220500707136 basic_session_run_hooks.py:546] Create CheckpointSaverHook.
INFO:1587121825.0649378:tensorflow:Graph was finalized.
I0417 11:10:25.064937 140220500707136 monitored_session.py:246] Graph was finalized.
tfds.core.DatasetInfo(
name='cifar10',
version=3.0.0,
description='The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.',
homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
features=FeaturesDict({
'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
}),
total_num_examples=60000,
splits={
'test': 10000,
'train': 50000,
},
supervised_keys=('image', 'label'),
citation="""@TECHREPORT{Krizhevsky09learningmultiple,
author = {Alex Krizhevsky},
title = {Learning multiple layers of features from tiny images},
institution = {},
year = {2009}
}""",
redistribution_info=,
)
Input data (batch size 64): <DatasetV1Adapter shapes: ((None, 128, 128, 3), (None,)), types: (tf.float32, tf.int64)>
+++++ Building Keras model +++++
Output of feature extraction (original model): (64, 4, 4, 2048)
Output of 0th classification layer (<tensorflow.python.keras.layers.pooling.GlobalAveragePooling2D object at 0x7f86ec14b9e8>): (64, 2048)
Output of 1th classification layer (<tensorflow.python.keras.layers.core.Dense object at 0x7f86ec14bb38>): (64, 10)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
resnet50 (Model)             (None, 4, 4, 2048)        23587712
_________________________________________________________________
sequential (Sequential)      (None, 10)                20490
=================================================================
Total params: 23,608,202
Trainable params: 23,555,082
Non-trainable params: 53,120
_________________________________________________________________
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
global_average_pooling2d (Gl (None, 2048)              0
_________________________________________________________________
dense (Dense)                (None, 10)                20490
=================================================================
Total params: 20,490
Trainable params: 20,490
Non-trainable params: 0
_________________________________________________________________
+++++ Train and evaluate the Estimator model +++++
tfds.core.DatasetInfo(
name='cifar10',
version=3.0.0,
description='The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.',
homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
features=FeaturesDict({
'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
}),
total_num_examples=60000,
splits={
'test': 10000,
'train': 50000,
},
supervised_keys=('image', 'label'),
citation="""@TECHREPORT{Krizhevsky09learningmultiple,
author = {Alex Krizhevsky},
title = {Learning multiple layers of features from tiny images},
institution = {},
year = {2009}
}""",
redistribution_info=,
)
Input data (batch size 64): <DatasetV1Adapter shapes: ((None, 128, 128, 3), (None,)), types: (tf.float32, tf.int64)>
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: From /job:ps/replica:0/task:3:
Checksum does not match: stored 3880206044 vs. calculated on the restored bytes 1782481297
[[{{node checkpoint_initializer_53}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/keras_to_est.py", line 242, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/app/keras_to_est.py", line 227, in main
tf.estimator.train_and_evaluate(model_est, train_spec, eval_spec)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 640, in run
getattr(self, task_to_run)()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 677, in run_master
self._start_distributed_training(saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 796, in _start_distributed_training
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1493, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 604, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1038, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 749, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1231, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1236, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 902, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 669, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 300, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: From /job:ps/replica:0/task:3:
Checksum does not match: stored 3880206044 vs. calculated on the restored bytes 1782481297
[[node checkpoint_initializer_53 (defined at usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1373) ]]
Original stack trace for 'checkpoint_initializer_53':
File "app/keras_to_est.py", line 242, in <module>
app.run(main)
File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "app/keras_to_est.py", line 227, in main
tf.estimator.train_and_evaluate(model_est, train_spec, eval_spec)
File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 640, in run
getattr(self, task_to_run)()
File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 677, in run_master
self._start_distributed_training(saving_listeners=saving_listeners)
File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 796, in _start_distributed_training
saving_listeners=saving_listeners)
File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
saving_listeners)
File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1373, in _train_with_estimator_spec
warm_starting_util.warm_start(*self._warm_start_settings)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/warm_starting_util.py", line 533, in warm_start
checkpoint_utils.init_from_checkpoint(ckpt_to_initialize_from, vocabless_vars)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
init_from_checkpoint_fn)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1949, in merge_call
return self._merge_call(merge_fn, args, kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1956, in _merge_call
return merge_fn(self._strategy, *args, **kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 286, in <lambda>
ckpt_dir_or_file, assignment_map)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 334, in _init_from_checkpoint
_set_variable_or_list_initializer(var, ckpt_file, tensor_name_in_ckpt)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 458, in _set_variable_or_list_initializer
_set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "")
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 412, in _set_checkpoint_initializer
ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1506, in restore_v2
name=name)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 742, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3322, in _create_op_internal
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1756, in __init__
self._traceback = tf_stack.extract_stack()