
[TF 2.0] Using keras.metrics in TPU training results in error #33517

Closed
georgealexandruvlad opened this issue Oct 18, 2019 · 16 comments
Labels: comp:dist-strat (Distribution Strategy related issues), comp:keras (Keras related issues), TF 2.0 (Issues relating to TensorFlow 2.0), type:bug (Bug)

Comments

@georgealexandruvlad

georgealexandruvlad commented Oct 18, 2019

I am trying to train a BERT model from https://github.com/tensorflow/models/tree/master/official/nlp on a TPU in Google Colab. I changed the metrics list passed to the model's compile method to:

bert_model.compile(optimizer=optimizer, loss=loss_fn, metrics=get_metrics())

where get_metrics is a function that returns a list of metrics ("accuracy" plus instances of the Recall and Precision classes built into tensorflow.keras.metrics):

from tensorflow.keras.metrics import Recall, Precision

def get_metrics():
    return ["accuracy",
            Recall(),
            Precision(), ]

Training results in the following error (after one epoch ends, before validation statistics are displayed):

I1018 16:34:07.313311 140541208393600 remote.py:151] Entering into master device scope: /job:worker/replica:0/task:0
2019-10-18 16:34:07.359467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-10-18 16:34:07.465723: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-10-18 16:34:07.465842: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7b6f1b4d4089): /proc/driver/nvidia/version does not exist
2019-10-18 16:34:07.466260: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-18 16:34:07.472748: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-10-18 16:34:07.473076: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3172f40 executing computations on platform Host. Devices:
2019-10-18 16:34:07.473114: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-10-18 16:34:07.475920: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 10.29.203.98:8470}
2019-10-18 16:34:07.475955: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30501}
2019-10-18 16:34:07.476742: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:30501
2019-10-18 16:34:07.497844: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 10.29.203.98:8470}
2019-10-18 16:34:07.497905: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30501}
INFO:tensorflow:Initializing the TPU system: 10.29.203.98:8470
I1018 16:34:07.499603 140541208393600 tpu_strategy_util.py:70] Initializing the TPU system: 10.29.203.98:8470
INFO:tensorflow:Clearing out eager caches
I1018 16:34:15.119202 140541208393600 tpu_strategy_util.py:94] Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
I1018 16:34:15.121769 140541208393600 tpu_strategy_util.py:114] Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
I1018 16:34:15.128222 140541208393600 tpu_system_metadata.py:148] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I1018 16:34:15.128440 140541208393600 tpu_system_metadata.py:149] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I1018 16:34:15.129121 140541208393600 tpu_system_metadata.py:150] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I1018 16:34:15.129209 140541208393600 tpu_system_metadata.py:152] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I1018 16:34:15.129295 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.129720 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I1018 16:34:15.129811 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I1018 16:34:15.129892 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I1018 16:34:15.129969 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I1018 16:34:15.130045 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I1018 16:34:15.130121 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I1018 16:34:15.130197 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I1018 16:34:15.130281 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I1018 16:34:15.130358 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I1018 16:34:15.130436 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I1018 16:34:15.130511 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.130593 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.248266 140541208393600 train.py:212] Training using TF 2.0 Keras compile/fit API with distrubuted strategy.
WARNING:tensorflow:Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
W1018 16:35:33.236943 140541208393600 training_utils.py:1547] Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
Train on 129 steps, validate on 65 steps
Epoch 1/5
2019-10-18 16:38:03.018892: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2019-10-18 16:38:03.020371: E tensorflow/core/platform/default/device_tracer.cc:70] CUDA error: CUDA_ERROR_NO_DEVICE
  1/129 [..............................] - ETA: 5:12:28 - loss: 1.0083 - accuracy: 0.2031 - recall: 0.1719 - precision: 0.2619WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.610206). Check your callbacks.
W1018 16:38:06.456197 140541208393600 callbacks.py:244] Method (on_train_batch_end) is slow compared to the batch update (1.610206). Check your callbacks.
128/129 [============================>.] - ETA: 1s - loss: 0.5022 - accuracy: 0.7563 - recall: 0.5862 - precision: 0.81392019-10-18 16:38:45.271991: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 0
Additional GRPC error information:
{"created":"@1571416725.271891392","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 0","grpc_status":3}
2019-10-18 16:38:45.272429: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 1
Additional GRPC error information:
{"created":"@1571416725.272350919","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 1","grpc_status":3}
2019-10-18 16:38:45.272841: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 2
Additional GRPC error information:
{"created":"@1571416725.272756237","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 2","grpc_status":3}
2019-10-18 16:38:45.273165: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 3
Additional GRPC error information:
{"created":"@1571416725.273105048","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 3","grpc_status":3}
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 340, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 332, in main
    run_bert(strategy, input_meta_data)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 287, in run_bert
    use_keras_compile_fit=FLAGS.use_keras_compile_fit)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 226, in run_bert_classifier
    custom_callbacks=None)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 143, in run_keras_compile_fit
    callbacks=custom_callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 685, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 439, in model_iteration
    steps_name='validation_steps')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 299, in model_iteration
    batch_outs = f(actual_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/distribute/distributed_training_utils.py", line 878, in execution_function
    return [out.numpy() for out in distributed_function(input_fn)]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 526, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnimplementedError:  Compilation failure: Asked to propagate a dynamic dimension from hlo %tuple.5198 = (pred[], f32[4,2]{1,0}) tuple(pred[] %convert.5196, f32[4,2]{1,0} %add.5004), metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}@{1}@0 to hlo %conditional.5209 = (pred[]) conditional(pred[] %convert.5196, (pred[], f32[4,2]{1,0}) %tuple.5198, (pred[], f32[4,2]{1,0}) %tuple.5198), true_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_true_127733_const_0__.5199, false_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_false_127734_const_0__.5204, metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}, which is not implemented.
	TPU compilation failed
	 [[{{node tpu_compile_succeeded_assert/_6193329545322784681/_7}}]]
Additional GRPC error information:
{"created":"@1571416725.270929013","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":" Compilation failure: Asked to propagate a dynamic dimension from hlo %tuple.5198 = (pred[], f32[4,2]{1,0}) tuple(pred[] %convert.5196, f32[4,2]{1,0} %add.5004), metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}@{1}@0 to hlo %conditional.5209 = (pred[]) conditional(pred[] %convert.5196, (pred[], f32[4,2]{1,0}) %tuple.5198, (pred[], f32[4,2]{1,0}) %tuple.5198), true_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_true_127733_const_0__.5199, false_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_false_127734_const_0__.5204, metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}, which is not implemented.\n\tTPU compilation failed\n\t [[{{node tpu_compile_succeeded_assert/_6193329545322784681/_7}}]]","grpc_status":12} [Op:__inference_distributed_function_127913]

Function call stack:
distributed_function -> distributed_function

2019-10-18 16:38:53.401848: E tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc:72] Remote EagerContext with id 6450803200035565614 does not seem to exist.

With only "accuracy" returned, training works fine and finishes all epochs. With a custom metric like:

import tensorflow as tf

eps = 1e-7  # small constant to avoid division by zero (assumed; not shown in the original snippet)

def precision(y_true, y_pred):
    y_pred = tf.math.rint(y_pred)                   # round sigmoid outputs to 0/1
    TP = tf.math.reduce_sum(y_pred * y_true)        # true positives
    FP = tf.math.reduce_sum(y_pred * (1 - y_true))  # false positives

    _precision = tf.math.divide(TP, (TP + FP + eps))
    return _precision

it works as well, but the values returned are not correct. I suppose this happens because the TPU computes several steps per loop and that somehow (I didn't dig too deeply into it) skews the reported metric. I tried the built-in metric classes to verify the behavior, but that resulted in the error mentioned above.
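For what it's worth, a stateful variant that accumulates counts across batches (instead of averaging per-batch values) can be written by subclassing tf.keras.metrics.Metric. This is only a sketch I put together (the class name and threshold are arbitrary, not code from the models repo); because it avoids the assert_greater_equal check inside the built-in Precision, it may also sidestep the compilation failure above, though I have not verified that on TPU:

import tensorflow as tf

class StatefulPrecision(tf.keras.metrics.Metric):
    """Precision that accumulates TP/FP across batches."""

    def __init__(self, name="stateful_precision", threshold=0.5, **kwargs):
        super().__init__(name=name, **kwargs)
        self.threshold = threshold
        self.tp = self.add_weight(name="tp", initializer="zeros")
        self.fp = self.add_weight(name="fp", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.cast(y_pred > self.threshold, tf.float32)
        self.tp.assign_add(tf.reduce_sum(y_pred * y_true))
        self.fp.assign_add(tf.reduce_sum(y_pred * (1.0 - y_true)))

    def result(self):
        return self.tp / (self.tp + self.fp + tf.keras.backend.epsilon())

    def reset_states(self):
        self.tp.assign(0.0)
        self.fp.assign(0.0)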

Snippet of the training call (the function is called run_keras_compile_fit in the GitHub link I provided; it can be found in bert/run_classifier.py, with almost no custom code added):

    with strategy.scope():
        training_dataset = train_input_fn()
        evaluation_dataset = eval_input_fn()
        bert_model, sub_model = model_fn()
        optimizer = bert_model.optimizer

        if init_checkpoint:
            checkpoint = tf.train.Checkpoint(model=sub_model)
            checkpoint.restore(init_checkpoint).assert_existing_objects_matched()

        bert_model.compile(optimizer=optimizer, loss=loss_fn, metrics=get_metrics())

        summary_dir = os.path.join(model_dir, 'summaries')
        summary_callback = tf.keras.callbacks.TensorBoard(summary_dir)
        checkpoint_path = os.path.join(model_dir, 'checkpoint')
        checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
            checkpoint_path, save_weights_only=True, save_best_only=True, mode='min')

        if custom_callbacks is not None:
            custom_callbacks += [summary_callback, checkpoint_callback]
        else:
            custom_callbacks = [summary_callback, checkpoint_callback]

        bert_model.fit(
            x=training_dataset,
            validation_data=evaluation_dataset,
            steps_per_epoch=steps_per_epoch,
            epochs=epochs,
            validation_steps=eval_steps,
            callbacks=custom_callbacks)

        return bert_model

In Colab I installed the stable release of TensorFlow 2.0, as the nightly version doesn't work well with Colab's TPUs for now. Are the Keras metrics supposed to work with TPUs, or is this not yet supported?

@gadagashwini-zz gadagashwini-zz self-assigned this Oct 21, 2019
@gadagashwini-zz gadagashwini-zz added comp:model Model related issues TF 2.0 Issues relating to TensorFlow 2.0 labels Oct 21, 2019
@gadagashwini-zz
Contributor

@georgealexandruvlad, as this is more related to TF models, please post this in the TF models repo. Thanks!

@gadagashwini-zz gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Oct 21, 2019
@georgealexandruvlad
Author

@gadagashwini The thing is that the model works with "accuracy" but doesn't with the other tf.keras.metrics, so the problem shouldn't be in the model's implementation if the only change is adding an additional metric to be evaluated. I implemented a simple example that still results in the error mentioned: https://colab.research.google.com/drive/18okZncYBJOrd9AZxy4tCrHQ3O3s_tNsl

The question is whether this is a bug or whether I am doing something wrong. There is very little documentation on TPUs, and for the other distribution strategies this case seems to be documented as working in the TensorFlow guide pages.
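The simple example in that notebook is along these lines (the toy model and data below are placeholders rather than the notebook's exact contents, and the TPU setup follows the standard Colab recipe of the time):

import os
import numpy as np
import tensorflow as tf

# Standard Colab TPU setup for TF 2.0/2.1. On TF 2.0 the connection call may be
# tf.config.experimental_connect_to_host(resolver.master()) instead.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu="grpc://" + os.environ["COLAB_TPU_ADDR"])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Toy binary-classification data with a fixed batch size.
x = np.random.rand(1024, 32).astype("float32")
y = (np.random.rand(1024, 1) > 0.5).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(128, drop_remainder=True)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    # Works with metrics=["accuracy"] alone; adding Recall/Precision triggers
    # the TPU compilation failure described above.
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.Recall(),
                           tf.keras.metrics.Precision()])

model.fit(dataset, epochs=1, steps_per_epoch=8)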

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 22, 2019
@gadagashwini-zz gadagashwini-zz added comp:dist-strat Distribution Strategy related issues type:support Support issues comp:keras Keras related issues and removed comp:model Model related issues labels Oct 22, 2019
@georgealexandruvlad
Author

@ymodak Could I have some insight into what is causing the problem? Is it a bug, or are the keras.metrics not yet supported on TPUs? And if they are not supported, is there any alternative?

@georgealexandruvlad
Author

@gadagashwini could you give me some information regarding the issue? I would like to know whether this is a problem (I didn't really get a response) that can't be resolved in the next couple of days or weeks, or whether I should look for other solutions (like using TF 1.x).

@yunxing yunxing self-assigned this Oct 24, 2019
@yunxing
Contributor

yunxing commented Oct 24, 2019

@georgealexandruvlad

Hi, sorry about the breakage. The internal version of this issue got routed to me yesterday, and we should have a fix out today (at least in our nightly release).

The root cause is that our compiler has trouble handling conditionals with dynamic shapes, which are introduced by the "Assert" operation in the metric.

@rxsang also added an option to disable the dynamic-shapes behavior; IIRC you can enable it by setting strategy.experimental_enable_dynamic_batch_size = False.
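Roughly, that would look like the sketch below once the strategy is created (the attribute name is taken verbatim from the line above and not verified against the API docs, so treat it as a sketch; tpu_address is your grpc://... endpoint):

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Opt out of dynamic batch sizes so partial batches don't introduce
# dynamic shapes into the compiled TPU program.
strategy.experimental_enable_dynamic_batch_size = False

with strategy.scope():
    ...  # build and compile the model as before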

@georgealexandruvlad
Author

@yunxing Thanks for the update! Looking forward to the next nightly release then.

@ymodak ymodak removed their assignment Oct 28, 2019
@ymodak ymodak added type:bug Bug and removed type:support Support issues labels Oct 28, 2019
@ahmadsalim

Hi, I am having a similar dynamic dimension issue when using OneHot:

tensorflow.python.framework.errors_impl.UnimplementedError: From /job:worker/replica:0/task:0:
ERROR   2019-10-30 11:01:09 +0100       master-replica-0                Compilation failure: Asked to propagate a dynamic dimension from hlo %select.513 = f32[12,305,215,56]{3,2,1,0} select(pred[12,305,215,56]{3,2,1,0} %compare.510, f32[12,305,215,56]{3,2,1,0} %broadcast.511, f32[12,305,215,56]{3,2,1,0} %broadcast.512), metadata={op_type="OneHot" op_name="chargrid_encoder_2/chargrid_encoder_a0/one_hot/one_hot"}@{}@0 to hlo %concatenate.529 = f32[12,305,215,58]{3,2,1,0} concatenate(f32[12,305,215,56]{3,2,1,0} %select.513, f32[12,305,215,2]{3,2,1,0} %convert.528), dimensions={3}, metadata={op_type="ConcatV2" op_name="chargrid_encoder_2/chargrid_encoder_a0/concat"}, which is not implemented.

Is that related, or should I open a separate issue? 😄

@yunxing
Contributor

yunxing commented Nov 1, 2019

Hi ahmadsalim@, we should have already fixed this issue a while ago. Which tf version are you using?

@ahmadsalim

@yunxing Thanks for the response. I am using 1.14, should I upgrade to 2.0 instead? 😄

@yunxing
Contributor

yunxing commented Nov 7, 2019

Missed the notification. This should be fixed in the nightly releases, do you have access to those? I remember we also have a 1.x nightly release which should also include the fix. cc @rxsang who is more familiar with this than me.

@yunxing yunxing closed this as completed Nov 7, 2019

@CalebEverett

CalebEverett commented Dec 8, 2020

tf.keras.metrics are not working with TPUs on Colab with TF 2.3.

https://colab.research.google.com/drive/1C09OUXP-7Es4KIthVA6daRcGq_bJKGe8#scrollTo=hbXc0o4p2W0a

It looks like tf.metrics.AUC is causing the error:

/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

NotFoundError: 9 root error(s) found.
  (0) Not found: {{function_node __inference_train_step_4729}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
	 [[{{node TPUVariableReshard/reshard/_14520013377784163202/_28}}]]
	 [[while/body/_1/while/Const_7/_337]]
  (1) Not found: {{function_node __inference_train_step_4729}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
	 [[{{node TPUVariableReshard/reshard/_2758051469797844661/_31}}]]
  (2) Not found: {{function_node __inference_train_step_4729}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
	 [[{{node TPUVariableReshard/reshard/_3997717623177238997/_19}}]]
  (3) Not found: {{function_node __inference_train_step_4729}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
	 [[{{node TPUVariableReshard/reshard/_14520013377784163202/_28}}]]
	 [[while/loop_body_control/_50/_148]]
  (4) Not found: {{function_node __inference_train_step_4729}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
	 [[{{node TPUVariableReshard/reshard/_12310723377697630185/_10}}]]
  (5) Not found: {{function_node __inference_train_step_4729}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
	 [[{{node TPUVariableReshard/reshard/_13160019155312455749/_25}}]]
  (6) Invalid argument: {{function_node __inference_train_step_4729}} Compilation failure: Incompatible shapes: [200,128] vs. [200,16384]
	 [[{{node while/body/_1/while/LogicalAnd}}]]
	TPU compilation failed
	 [[tpu_compile_succeeded_assert/_18307140185544563773/_5]]
  (7) Not found: {{function_node __inference_train_step_4729}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
	 [[{{node TPUVariableReshard/reshard/_14520013377784163202/_28}}]]
	 [[while/body/_1/while/Const_3/_281]]
  (8) Not found: {{function_node __inference_train_step_4729}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
	 [[{{node TPUVariableReshard/reshard/_14520013377784163202/_28}}]]
0 successful operations.
0 derived errors ignored.

@shenzhun

TPUs are not working on Colab either.

tensorflow: 2.4

https://colab.research.google.com/drive/1slQKTzSOnE9U70QCQCoGJrcyXCdqRJwz#scrollTo=3Qz6XSPEDsyZ

@rxsang
Member

rxsang commented Dec 28, 2020

Could you try nightly? The fix yunxing@ mentioned may not be in the TF 2.4 release yet; it would be in TF 2.5 and nightly. Quoting the earlier reply:

"Missed the notification. This should be fixed in the nightly releases, do you have access to those? I remember we also have a 1.x nightly release which should also include the fix. cc @rxsang who is more familiar with this than me."
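On Colab, trying the nightly build amounts to something like this (a rough sketch of a notebook cell; restart the runtime after installing, then re-run the TPU setup):

!pip install -q -U tf-nightly

import tensorflow as tf
print(tf.__version__)  # should report a dev/nightly version string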

@neekey

neekey commented Apr 12, 2021

Tried nightly, still experiencing the same issue; see my notebook on Colab: https://colab.research.google.com/drive/1PVBomGMDz5zbbCgSUTCKm6gIMQ2eG6V-?usp=sharing

@pn12

pn12 commented Jun 18, 2021

Was anyone able to resolve this error? I am facing the same issue.
