TPU support has regressed in tf-nightly (worked well in tf=2.0.0) - operation or function not registered in the binary running in this process #33944
Labels: comp:dist-strat (Distribution Strategy related issues), comp:tpus (tpu, tpuestimator), stale (marks the issue/PR stale; to be closed automatically if no activity), stat:awaiting response (awaiting response from author), TF 2.0 (issues relating to TensorFlow 2.0), type:bug
System information
Describe the current behavior
With tf-nightly installed, following the setup instructions below and running the attached code (yourcode.py) produces an error at lines 54-56 of yourcode.py (the traceback points to line 56):
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
I have marked the place at line 58 with "# This is where the error occurs".
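Since the failure appears only on the nightly build (2.1.0-dev20191102 here) and not on the 2.0.0 release, a small guard that inspects tf.__version__ before attempting TPU initialization makes the regression explicit when the script runs. This is a minimal sketch, not part of yourcode.py; the helper name is mine, and it only assumes pip's dev-build version convention:

```python
def is_nightly_build(version: str) -> bool:
    """Return True for tf-nightly version strings such as '2.1.0-dev20191102'."""
    return "dev" in version

# Usage (assumes tensorflow is importable):
# import tensorflow as tf
# if is_nightly_build(tf.__version__):
#     print("warning: nightly build detected; TPU init may fail (this issue)")
```

On the 2.0.0 release the guard stays silent, so the same script can be used to compare both installs.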
The full output, including the error (obtained by following the steps under "Code to reproduce this issue"), is:
2019-11-03 06:59:16.549391: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-03 06:59:16.549450: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2.1.0-dev20191102
2019-11-03 06:59:18.662355: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2019-11-03 06:59:18.662422: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2019-11-03 06:59:18.662459: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (mimetic): /proc/driver/nvidia/version does not exist
2019-11-03 06:59:18.663151: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-03 06:59:18.672455: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-11-03 06:59:18.673297: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ca94e697b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-11-03 06:59:18.673382: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (gs://mimetic_store/glue/mrpc/0.0.2)
INFO:absl:Constructing tf.data.Dataset for split None, from gs://mimetic_store/glue/mrpc/0.0.2
Saved glue_mnli_train.
Saved glue_mnli_valid.
2019-11-03 07:00:41.886095: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2019-11-03 07:00:41.886162: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:54535}
2019-11-03 07:01:53.957214: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2019-11-03 07:01:53.957297: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:54535}
2019-11-03 07:01:53.958220: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:54535
INFO:absl:Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
Traceback (most recent call last):
File "yourcode.py", line 56, in <module>
tf.tpu.experimental.initialize_tpu_system(tpu)
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/tpu/tpu_strategy_util.py", line 103, in initialize_tpu_system
serialized_topology = output.numpy()
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 942, in numpy
maybe_arr = self._numpy() # pylint: disable=protected-access
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 910, in _numpy
six.raise_from(core._status_to_exception(e.code, e.message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: '__inference__tpu_init_fn_4206710' is neither a type of a primitive operation nor a name of a function registered in binary running on n-48a744b7-w-0. Make sure the operation or function is registered in the binary running in this process.
2019-11-03 07:01:54.492446: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:75] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: '__inference__tpu_init_fn_4206710' is neither a type of a primitive operation nor a name of a function registered in binary running on n-48a744b7-w-0. Make sure the operation or function is registered in the binary running in this process.
Describe the expected behavior
When tested with tensorflow==2.0.0 from pip, the same code completes.
The whole run (with tf 2.0.0) ends with:
Epoch: [2] Validation accuracy = 0.843137264251709
Code to reproduce the issue
Create a Google Cloud master VM and TPU pair:
ctpu up --name=yourtpupair --zone=us-central1-a --tpu-size=v3-8 --machine-type=n1-standard-8 --disk-size-gb=40
Set up a conda python 3.7 development environment with tf-nightly on the master VM
sudo apt update && sudo apt install bzip2 libxml2-dev libxslt-dev -y && wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh && bash Anaconda3-2019.10-Linux-x86_64.sh
Accept the defaults and initialize Anaconda
rm Anaconda3-2019.10-Linux-x86_64.sh && . ~/.bashrc && conda config --add channels anaconda && conda config --add channels conda-forge && conda config --set channel_priority strict && conda create -n yourconda python=3.7 -y && conda activate yourconda
conda install tqdm
pip install tensorflow-datasets transformers && pip install --upgrade google-api-python-client && pip install --upgrade oauth2client && pip install --ignore-installed --upgrade tf-nightly
Download the "glue/mrpc" dataset to ~/tensorflow_datasets in a python shell:
python
import tensorflow as tf
import tensorflow_datasets
data = tensorflow_datasets.load("glue/mrpc")
Create a Google storage bucket named "your_bucket".
Copy the entire folder (~/tensorflow_datasets/glue) to gs://your_bucket
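The copy step above can be done with gsutil from the master VM. A hedged sketch follows; the local path and bucket name are the placeholders from the steps above, and the dry_run wrapper is mine (the actual transfer is just the gsutil command it builds):

```python
import subprocess

def copy_glue_to_bucket(local_dir="~/tensorflow_datasets/glue",
                        bucket="gs://your_bucket",
                        dry_run=True):
    # gsutil -m parallelizes transfers; cp -r copies the directory recursively.
    cmd = ["gsutil", "-m", "cp", "-r", local_dir, bucket]
    if dry_run:
        # Return the command for inspection instead of executing it.
        return " ".join(cmd)
    subprocess.run(cmd, check=True)
    return None
```

Run with dry_run=False on the master VM once the bucket exists, or paste the returned command into a shell directly.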
Run yourcode.py in the master VM's conda environment (yourconda) while connected to the TPU:
python yourcode.py
The output above (including the error) is produced.
Now install tensorflow==2.0.0 and rerun; training completes:
pip install --ignore-installed --upgrade tensorflow==2.0.0
python yourcode.py
Attachment: yourcode.py.txt (rename it to yourcode.py)