impl.OpError: file is too short to be an sstable & DataLossError: Checksum does not match - TensorFlow 2.1.0 #39033

@ali-raza-tariq

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow version: 2.1.0
  • Python version: 3.6
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Kubernetes version: v1.14.3
  • Kubeflow version: v1.0

Describe the current behavior
Hi all! I am trying to run this TensorFlow job on a Kubernetes cluster using Kubeflow, but I keep getting nondeterministic errors that are really hard to track down. I have to run the same job repeatedly with different TF_CONFIG settings, and each run has some chance of failing with one of the errors below. The job uses TensorFlow 2.1.0 and Kubeflow 1.0. Because the failures are intermittent, they are very hard to isolate. If I simply delete and restart the job, it sometimes runs fine, although there is still a slight chance that it fails with the same error again. Could someone please point out the root cause of this behavior?
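
Each replica receives its cluster layout through the TF_CONFIG environment variable, which the logs print at startup. For comparing one run's layout against another, a minimal sketch like the following could summarize it. The function name `summarize_tf_config` and the example host names are hypothetical and not part of the job's code. In the logs below, `_num_worker_replicas` is one higher than the number of `worker` entries, which is consistent with the master being counted as well.

```python
import json
import os


def summarize_tf_config(raw=None):
    """Return per-job replica counts and this task's role from a TF_CONFIG JSON string.

    If `raw` is None, the TF_CONFIG environment variable is read instead.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "replicas": {job: len(hosts) for job, hosts in cluster.items()},
        "task_type": task.get("type"),
        "task_index": task.get("index"),
    }


if __name__ == "__main__":
    # Hypothetical layout, for illustration only.
    example = json.dumps({
        "cluster": {
            "master": ["master-0.example:2222"],
            "ps": ["ps-0.example:2222", "ps-1.example:2222"],
            "worker": ["worker-0.example:2222"],
        },
        "task": {"type": "master", "index": 0},
    })
    print(summarize_tf_config(example))
```

Logging this summary at the start of every replica would make it easier to tell whether failing and succeeding runs differ in their layout.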

Describe the expected behavior
The jobs should not fail nondeterministically.

Error Log 1

2020-04-16 03:32:10.840927: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-04-16 03:32:10.841009: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-04-16 03:32:10.841016: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
I0416 03:32:11.778649 140150905272128 dataset_builder.py:199] Overwrite dataset info from restored data version.
I0416 03:32:11.781358 140150905272128 dataset_builder.py:285] Reusing dataset cifar10 (./dataDir/cifar10/3.0.0)
I0416 03:32:11.781544 140150905272128 dataset_builder.py:458] Constructing tf.data.Dataset for split train, from ./dataDir/cifar10/3.0.0
2020-04-16 03:32:11.785308: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-04-16 03:32:11.785339: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-04-16 03:32:11.785361: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0): /proc/driver/nvidia/version does not exist
2020-04-16 03:32:11.785606: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-16 03:32:11.794453: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194825000 Hz
2020-04-16 03:32:11.796551: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4895200 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-16 03:32:11.796574: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:1587007938.6536536:tensorflow:TF_CONFIG environment variable: {'cluster': {'master': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222'], 'worker': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222']}, 'task': {'type': 'master', 'index': 0}, 'environment': 'cloud'}
I0416 03:32:18.653653 140150905272128 run_config.py:535] TF_CONFIG environment variable: {'cluster': {'master': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222'], 'worker': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222']}, 'task': {'type': 'master', 'index': 0}, 'environment': 'cloud'}
INFO:1587007938.6553543:tensorflow:Using the Keras model provided.
I0416 03:32:18.655354 140150905272128 keras.py:540] Using the Keras model provided.
WARNING:1587007938.7225273:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0416 03:32:18.722527 140150905272128 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:1587007950.8246615:tensorflow:Using config: {'_model_dir': './out/tftestaccuracy7-testjob-temp-1-7-3-3-0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 20000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'master': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222'], 'worker': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222']}), '_task_type': 'master', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222', '_evaluation_master': '', '_num_ps_replicas': 7, '_num_worker_replicas': 4, '_is_chief': True}
I0416 03:32:30.824661 140150905272128 estimator.py:216] Using config: {'_model_dir': './out/tftestaccuracy7-testjob-temp-1-7-3-3-0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 20000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'master': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222'], 'worker': ['tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222', 'tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222']}), '_task_type': 'master', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://tftestaccuracy7-testjob-temp-1-7-3-3-0-master-0.ali.svc:2222', '_evaluation_master': '', '_num_ps_replicas': 7, '_num_worker_replicas': 4, '_is_chief': True}
INFO:1587007950.825566:tensorflow:Not using Distribute Coordinator.
I0416 03:32:30.825566 140150905272128 estimator_training.py:186] Not using Distribute Coordinator.
INFO:1587007950.8260767:tensorflow:Start Tensorflow server.
I0416 03:32:30.826076 140150905272128 training.py:744] Start Tensorflow server.
2020-04-16 03:32:30.835044: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2020-04-16 03:32:30.835109: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job ps -> {0 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-0.ali.svc:2222, 1 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-1.ali.svc:2222, 2 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-2.ali.svc:2222, 3 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-3.ali.svc:2222, 4 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-4.ali.svc:2222, 5 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-5.ali.svc:2222, 6 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-ps-6.ali.svc:2222}
2020-04-16 03:32:30.835125: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-0.ali.svc:2222, 1 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-1.ali.svc:2222, 2 -> tftestaccuracy7-testjob-temp-1-7-3-3-0-worker-2.ali.svc:2222}
2020-04-16 03:32:30.838051: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
WARNING:1587007950.8507316:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0416 03:32:30.850731 140150905272128 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0416 03:32:30.867805 140150905272128 dataset_builder.py:199] Overwrite dataset info from restored data version.
I0416 03:32:30.875023 140150905272128 dataset_builder.py:285] Reusing dataset cifar10 (./dataDir/cifar10/3.0.0)
I0416 03:32:30.875263 140150905272128 dataset_builder.py:458] Constructing tf.data.Dataset for split train, from ./dataDir/cifar10/3.0.0
INFO:1587007951.0019863:tensorflow:Calling model_fn.
I0416 03:32:31.001986 140150905272128 estimator.py:1151] Calling model_fn.
INFO:1587007957.5129364:tensorflow:Done calling model_fn.
I0416 03:32:37.512936 140150905272128 estimator.py:1153] Done calling model_fn.
INFO:1587007957.5134861:tensorflow:Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='./out/tftestaccuracy7-testjob-temp-1-7-3-3-0/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
I0416 03:32:37.513486 140150905272128 estimator.py:1372] Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='./out/tftestaccuracy7-testjob-temp-1-7-3-3-0/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
INFO:1587007957.513585:tensorflow:Warm-starting from: ./out/tftestaccuracy7-testjob-temp-1-7-3-3-0/keras/keras_model.ckpt
I0416 03:32:37.513585 140150905272128 warm_starting_util.py:464] Warm-starting from: ./out/tftestaccuracy7-testjob-temp-1-7-3-3-0/keras/keras_model.ckpt
INFO:1587007957.513649:tensorflow:Warm-starting variables only in TRAINABLE_VARIABLES.
I0416 03:32:37.513648 140150905272128 warm_starting_util.py:343] Warm-starting variables only in TRAINABLE_VARIABLES.
tfds.core.DatasetInfo(
    name='cifar10',
    version=3.0.0,
    description='The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.',
    homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
    features=FeaturesDict({
        'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=60000,
    splits={
        'test': 10000,
        'train': 50000,
    },
    supervised_keys=('image', 'label'),
    citation="""@TECHREPORT{Krizhevsky09learningmultiple,
        author = {Alex Krizhevsky},
        title = {Learning multiple layers of features from tiny images},
        institution = {},
        year = {2009}
    }""",
    redistribution_info=,
)

Input data (batch size 64): <DatasetV1Adapter shapes: ((None, 128, 128, 3), (None,)), types: (tf.float32, tf.int64)>
+++++ Building Keras model +++++
Output of feature extraction (original model): (64, 4, 4, 2048)
Output of 0th classification layer (<tensorflow.python.keras.layers.pooling.GlobalAveragePooling2D object at 0x7f766056aa58>): (64, 2048)
Output of 1th classification layer (<tensorflow.python.keras.layers.core.Dense object at 0x7f766056aba8>): (64, 10)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
resnet50 (Model)             (None, 4, 4, 2048)        23587712  
_________________________________________________________________
sequential (Sequential)      (None, 10)                20490     
=================================================================
Total params: 23,608,202
Trainable params: 23,555,082
Non-trainable params: 53,120
_________________________________________________________________
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
global_average_pooling2d (Gl (None, 2048)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                20490     
=================================================================
Total params: 20,490
Trainable params: 20,490
Non-trainable params: 0
_________________________________________________________________
+++++ Train and evaluate the Estimator model +++++
tfds.core.DatasetInfo(
    name='cifar10',
    version=3.0.0,
    description='The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.',
    homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
    features=FeaturesDict({
        'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=60000,
    splits={
        'test': 10000,
        'train': 50000,
    },
    supervised_keys=('image', 'label'),
    citation="""@TECHREPORT{Krizhevsky09learningmultiple,
        author = {Alex Krizhevsky},
        title = {Learning multiple layers of features from tiny images},
        institution = {},
        year = {2009}
    }""",
    redistribution_info=,
)

Input data (batch size 64): <DatasetV1Adapter shapes: ((None, 128, 128, 3), (None,)), types: (tf.float32, tf.int64)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/py_checkpoint_reader.py", line 95, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: file is too short to be an sstable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/keras_to_est.py", line 242, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/app/keras_to_est.py", line 227, in main
    tf.estimator.train_and_evaluate(model_est, train_spec, eval_spec)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 640, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 677, in run_master
    self._start_distributed_training(saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 796, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1373, in _train_with_estimator_spec
    warm_starting_util.warm_start(*self._warm_start_settings)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/warm_starting_util.py", line 476, in warm_start
    ckpt_to_initialize_from, grouped_variables.keys())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/warm_starting_util.py", line 397, in _get_object_checkpoint_renames
    names_to_keys = saver_lib.object_graph_key_mapping(fname)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1617, in object_graph_key_mapping
    reader = py_checkpoint_reader.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/py_checkpoint_reader.py", line 99, in NewCheckpointReader
    error_translator(e)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/py_checkpoint_reader.py", line 48, in error_translator
    raise errors_impl.OpError(None, None, error_message, errors_impl.UNKNOWN)
tensorflow.python.framework.errors_impl.OpError: file is too short to be an sstable
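
The traceback above fails while opening the warm-start checkpoint `keras_model.ckpt`, and the message says a file was shorter than the reader expected. One possible explanation, which this report does not confirm, is that the checkpoint files were read while still incomplete. The sketch below is a diagnostic aid only, not a fix: it uses just the standard library to poll the files that share the checkpoint prefix until their sizes stop changing. The helper names and the polling parameters are assumptions made for illustration.

```python
import glob
import os
import time


def checkpoint_file_sizes(prefix):
    """Map every file named `<prefix>.<something>` to its size in bytes."""
    sizes = {}
    for path in sorted(glob.glob(glob.escape(prefix) + ".*")):
        try:
            sizes[path] = os.path.getsize(path)
        except OSError:
            # The file disappeared between listing and stat; skip it.
            pass
    return sizes


def wait_until_stable(prefix, checks=3, interval_secs=1.0, timeout_secs=60.0):
    """Return the file sizes once they are non-empty and unchanged for
    `checks` consecutive polls; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_secs
    previous = None
    stable = 0
    while time.monotonic() < deadline:
        sizes = checkpoint_file_sizes(prefix)
        non_empty = bool(sizes) and all(size > 0 for size in sizes.values())
        if non_empty and sizes == previous:
            stable += 1
            if stable >= checks:
                return sizes
        else:
            stable = 0
        previous = sizes
        time.sleep(interval_secs)
    raise TimeoutError(f"files for {prefix!r} did not stabilize in {timeout_secs}s")
```

Calling `wait_until_stable` with the `ckpt_to_initialize_from` path from the log and printing the result on each replica would show whether the file sizes differ between failing and successful runs.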

Error Log 2

2020-04-17 11:10:04.594547: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-04-17 11:10:04.594650: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-04-17 11:10:04.594665: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
I0417 11:10:05.664701 140220500707136 dataset_builder.py:199] Overwrite dataset info from restored data version.
I0417 11:10:05.667569 140220500707136 dataset_builder.py:285] Reusing dataset cifar10 (./dataDir/cifar10/3.0.0)
I0417 11:10:05.667821 140220500707136 dataset_builder.py:458] Constructing tf.data.Dataset for split train, from ./dataDir/cifar10/3.0.0
2020-04-17 11:10:05.672440: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-04-17 11:10:05.672469: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-04-17 11:10:05.672494: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0): /proc/driver/nvidia/version does not exist
2020-04-17 11:10:05.672896: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-17 11:10:05.685860: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194855000 Hz
2020-04-17 11:10:05.688580: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4db8fe0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-17 11:10:05.688621: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:1587121814.71867:tensorflow:TF_CONFIG environment variable: {'cluster': {'master': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222'], 'worker': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222']}, 'task': {'type': 'master', 'index': 0}, 'environment': 'cloud'}
I0417 11:10:14.718669 140220500707136 run_config.py:535] TF_CONFIG environment variable: {'cluster': {'master': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222'], 'worker': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222']}, 'task': {'type': 'master', 'index': 0}, 'environment': 'cloud'}
INFO:1587121814.7204363:tensorflow:Using the Keras model provided.
I0417 11:10:14.720436 140220500707136 keras.py:540] Using the Keras model provided.
INFO:1587121814.7229145:tensorflow:Using config: {'_model_dir': './out/tftestaccuracy6-testjob-temp-1-6-11-11-0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 20000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'master': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222'], 'worker': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222']}), '_task_type': 'master', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222', '_evaluation_master': '', '_num_ps_replicas': 6, '_num_worker_replicas': 12, '_is_chief': True}
I0417 11:10:14.722914 140220500707136 estimator.py:216] Using config: {'_model_dir': './out/tftestaccuracy6-testjob-temp-1-6-11-11-0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 20000, '_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'master': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222'], 'ps': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222'], 'worker': ['tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222', 'tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222']}), '_task_type': 'master', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://tftestaccuracy6-testjob-temp-1-6-11-11-0-master-0.ali.svc:2222', '_evaluation_master': '', '_num_ps_replicas': 6, '_num_worker_replicas': 12, '_is_chief': True}
INFO:1587121814.7236319:tensorflow:Not using Distribute Coordinator.
I0417 11:10:14.723631 140220500707136 estimator_training.py:186] Not using Distribute Coordinator.
INFO:1587121814.7240767:tensorflow:Start Tensorflow server.
I0417 11:10:14.724076 140220500707136 training.py:744] Start Tensorflow server.
2020-04-17 11:10:14.731863: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2020-04-17 11:10:14.731898: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job ps -> {0 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-0.ali.svc:2222, 1 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-1.ali.svc:2222, 2 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-2.ali.svc:2222, 3 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-3.ali.svc:2222, 4 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-4.ali.svc:2222, 5 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-ps-5.ali.svc:2222}
2020-04-17 11:10:14.731910: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-0.ali.svc:2222, 1 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-1.ali.svc:2222, 2 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-2.ali.svc:2222, 3 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-3.ali.svc:2222, 4 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-4.ali.svc:2222, 5 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-5.ali.svc:2222, 6 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-6.ali.svc:2222, 7 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-7.ali.svc:2222, 8 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-8.ali.svc:2222, 9 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-9.ali.svc:2222, 10 -> tftestaccuracy6-testjob-temp-1-6-11-11-0-worker-10.ali.svc:2222}
2020-04-17 11:10:14.733813: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
WARNING:1587121814.7500741:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0417 11:10:14.750074 140220500707136 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:1587121814.7521474:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0417 11:10:14.752147 140220500707136 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0417 11:10:14.775437 140220500707136 dataset_builder.py:199] Overwrite dataset info from restored data version.
I0417 11:10:14.781437 140220500707136 dataset_builder.py:285] Reusing dataset cifar10 (./dataDir/cifar10/3.0.0)
I0417 11:10:14.781795 140220500707136 dataset_builder.py:458] Constructing tf.data.Dataset for split train, from ./dataDir/cifar10/3.0.0
INFO:1587121814.9113212:tensorflow:Calling model_fn.
I0417 11:10:14.911321 140220500707136 estimator.py:1151] Calling model_fn.
INFO:1587121822.074707:tensorflow:Done calling model_fn.
I0417 11:10:22.074707 140220500707136 estimator.py:1153] Done calling model_fn.
INFO:1587121822.0750966:tensorflow:Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='./out/tftestaccuracy6-testjob-temp-1-6-11-11-0/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
I0417 11:10:22.075096 140220500707136 estimator.py:1372] Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='./out/tftestaccuracy6-testjob-temp-1-6-11-11-0/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
INFO:1587121822.075177:tensorflow:Warm-starting from: ./out/tftestaccuracy6-testjob-temp-1-6-11-11-0/keras/keras_model.ckpt
I0417 11:10:22.075176 140220500707136 warm_starting_util.py:464] Warm-starting from: ./out/tftestaccuracy6-testjob-temp-1-6-11-11-0/keras/keras_model.ckpt
INFO:1587121822.0752382:tensorflow:Warm-starting variables only in TRAINABLE_VARIABLES.
I0417 11:10:22.075238 140220500707136 warm_starting_util.py:343] Warm-starting variables only in TRAINABLE_VARIABLES.
INFO:1587121823.262496:tensorflow:Warm-started 214 variables.
I0417 11:10:23.262495 140220500707136 warm_starting_util.py:538] Warm-started 214 variables.
INFO:1587121823.2661114:tensorflow:Create CheckpointSaverHook.
I0417 11:10:23.266111 140220500707136 basic_session_run_hooks.py:546] Create CheckpointSaverHook.
INFO:1587121825.0649378:tensorflow:Graph was finalized.
I0417 11:10:25.064937 140220500707136 monitored_session.py:246] Graph was finalized.
tfds.core.DatasetInfo(
    name='cifar10',
    version=3.0.0,
    description='The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.',
    homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
    features=FeaturesDict({
        'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=60000,
    splits={
        'test': 10000,
        'train': 50000,
    },
    supervised_keys=('image', 'label'),
    citation="""@TECHREPORT{Krizhevsky09learningmultiple,
        author = {Alex Krizhevsky},
        title = {Learning multiple layers of features from tiny images},
        institution = {},
        year = {2009}
    }""",
    redistribution_info=,
)

Input data (batch size 64): <DatasetV1Adapter shapes: ((None, 128, 128, 3), (None,)), types: (tf.float32, tf.int64)>
+++++ Building Keras model +++++
Output of feature extraction (original model): (64, 4, 4, 2048)
Output of 0th classification layer (<tensorflow.python.keras.layers.pooling.GlobalAveragePooling2D object at 0x7f86ec14b9e8>): (64, 2048)
Output of 1th classification layer (<tensorflow.python.keras.layers.core.Dense object at 0x7f86ec14bb38>): (64, 10)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
resnet50 (Model)             (None, 4, 4, 2048)        23587712  
_________________________________________________________________
sequential (Sequential)      (None, 10)                20490     
=================================================================
Total params: 23,608,202
Trainable params: 23,555,082
Non-trainable params: 53,120
_________________________________________________________________
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
global_average_pooling2d (Gl (None, 2048)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                20490     
=================================================================
Total params: 20,490
Trainable params: 20,490
Non-trainable params: 0
_________________________________________________________________
+++++ Train and evaluate the Estimator model +++++
tfds.core.DatasetInfo(
    name='cifar10',
    version=3.0.0,
    description='The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.',
    homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
    features=FeaturesDict({
        'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=60000,
    splits={
        'test': 10000,
        'train': 50000,
    },
    supervised_keys=('image', 'label'),
    citation="""@TECHREPORT{Krizhevsky09learningmultiple,
        author = {Alex Krizhevsky},
        title = {Learning multiple layers of features from tiny images},
        institution = {},
        year = {2009}
    }""",
    redistribution_info=,
)

Input data (batch size 64): <DatasetV1Adapter shapes: ((None, 128, 128, 3), (None,)), types: (tf.float32, tf.int64)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: From /job:ps/replica:0/task:3:
Checksum does not match: stored 3880206044 vs. calculated on the restored bytes 1782481297
	 [[{{node checkpoint_initializer_53}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/keras_to_est.py", line 242, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/app/keras_to_est.py", line 227, in main
    tf.estimator.train_and_evaluate(model_est, train_spec, eval_spec)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 640, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 677, in run_master
    self._start_distributed_training(saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 796, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1493, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 604, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1038, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 749, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1231, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1236, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 902, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 669, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 300, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: From /job:ps/replica:0/task:3:
Checksum does not match: stored 3880206044 vs. calculated on the restored bytes 1782481297
	 [[node checkpoint_initializer_53 (defined at usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1373) ]]

Original stack trace for 'checkpoint_initializer_53':
  File "app/keras_to_est.py", line 242, in <module>
    app.run(main)
  File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "app/keras_to_est.py", line 227, in main
    tf.estimator.train_and_evaluate(model_est, train_spec, eval_spec)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 640, in run
    getattr(self, task_to_run)()
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 677, in run_master
    self._start_distributed_training(saving_listeners=saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 796, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
    saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1373, in _train_with_estimator_spec
    warm_starting_util.warm_start(*self._warm_start_settings)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/warm_starting_util.py", line 533, in warm_start
    checkpoint_utils.init_from_checkpoint(ckpt_to_initialize_from, vocabless_vars)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
    init_from_checkpoint_fn)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1949, in merge_call
    return self._merge_call(merge_fn, args, kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1956, in _merge_call
    return merge_fn(self._strategy, *args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 286, in <lambda>
    ckpt_dir_or_file, assignment_map)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 334, in _init_from_checkpoint
    _set_variable_or_list_initializer(var, ckpt_file, tensor_name_in_ckpt)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 458, in _set_variable_or_list_initializer
    _set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "")
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 412, in _set_checkpoint_initializer
    ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1506, in restore_v2
    name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 742, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3322, in _create_op_internal
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1756, in __init__
    self._traceback = tf_stack.extract_stack()

Metadata

Labels

TF 2.1 (for tracking issues in 2.1 release); comp:data (tf.data related issues); comp:keras (Keras related issues); stat:awaiting response (Status - Awaiting response from author); type:bug (Bug)
