TPU support has regressed in tf-nightly (worked well in tf=2.0.0) - operation or function not registered in the binary running in this process #33944
Labels: comp:dist-strat (Distribution Strategy related issues), comp:tpus (tpu, tpuestimator), stale (marks the issue/PR stale; to be closed automatically if no activity), stat:awaiting response (awaiting response from author), TF 2.0 (issues relating to TensorFlow 2.0), type:bug
System information
Describe the current behavior
With tf-nightly installed, following the setup instructions below and running the attached code (yourcode.py) produces an error at lines 54-56 of yourcode.py (the traceback points to line 56):
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
I have marked the place at line 58 with "# This is where the error occurs".
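Since the failure appears only on the nightly build (2.1.0-dev20191102 here) and not on the 2.0.0 release, a small guard that inspects tf.__version__ before attempting TPU initialization makes the regression explicit when the script runs. This is a minimal sketch, not part of yourcode.py; the helper name is mine, and it only assumes pip's dev-build version convention:

```python
def is_nightly_build(version: str) -> bool:
    """Return True for tf-nightly version strings such as '2.1.0-dev20191102'."""
    return "dev" in version

# Usage (assumes tensorflow is importable):
# import tensorflow as tf
# if is_nightly_build(tf.__version__):
#     print("warning: nightly build detected; TPU init may fail (this issue)")
```

On the 2.0.0 release the guard stays silent, so the same script can be used to compare both installs.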
The full output, including the error (obtained by following the steps under "Code to reproduce this issue"), is:
2019-11-03 06:59:16.549391: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-03 06:59:16.549450: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2.1.0-dev20191102
2019-11-03 06:59:18.662355: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2019-11-03 06:59:18.662422: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2019-11-03 06:59:18.662459: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (mimetic): /proc/driver/nvidia/version does not exist
2019-11-03 06:59:18.663151: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-03 06:59:18.672455: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-11-03 06:59:18.673297: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ca94e697b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-11-03 06:59:18.673382: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (gs://mimetic_store/glue/mrpc/0.0.2)
INFO:absl:Constructing tf.data.Dataset for split None, from gs://mimetic_store/glue/mrpc/0.0.2
Saved glue_mnli_train.
Saved glue_mnli_valid.
2019-11-03 07:00:41.886095: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2019-11-03 07:00:41.886162: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:54535}
2019-11-03 07:01:53.957214: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2019-11-03 07:01:53.957297: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:54535}
2019-11-03 07:01:53.958220: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:54535
INFO:absl:Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
Traceback (most recent call last):
File "yourcode.py", line 56, in <module>
tf.tpu.experimental.initialize_tpu_system(tpu)
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/tpu/tpu_strategy_util.py", line 103, in initialize_tpu_system
serialized_topology = output.numpy()
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 942, in numpy
maybe_arr = self._numpy() # pylint: disable=protected-access
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 910, in _numpy
six.raise_from(core._status_to_exception(e.code, e.message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: '__inference__tpu_init_fn_4206710' is neither a type of a primitive operation nor a name of a function registered in binary running on n-48a744b7-w-0. Make sure the operation or function is registered in the binary running in this process.
2019-11-03 07:01:54.492446: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:75] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: '__inference__tpu_init_fn_4206710' is neither a type of a primitive operation nor a name of a function registered in binary running on n-48a744b7-w-0. Make sure the operation or function is registered in the binary running in this process.
Describe the expected behavior
When tested with tensorflow==2.0.0 from pip, the same code completes.
The whole run (with tf 2.0.0) ends with:
Epoch: [2] Validation accuracy = 0.843137264251709
Code to reproduce the issue
Create a Google Cloud master VM and TPU pair:
ctpu up --name=yourtpupair --zone=us-central1-a --tpu-size=v3-8 --machine-type=n1-standard-8 --disk-size-gb=40
Set up a conda python 3.7 development environment with tf-nightly on the master VM
sudo apt update && sudo apt install bzip2 libxml2-dev libxslt-dev -y && wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh && bash Anaconda3-2019.10-Linux-x86_64.sh
Accept the defaults and initialize Anaconda
rm Anaconda3-2019.10-Linux-x86_64.sh && . ~/.bashrc && conda config --add channels anaconda && conda config --add channels conda-forge && conda config --set channel_priority strict && conda create -n yourconda python=3.7 -y && conda activate yourconda
conda install tqdm
pip install tensorflow-datasets transformers && pip install --upgrade google-api-python-client && pip install --upgrade oauth2client && pip install --ignore-installed --upgrade tf-nightly
Download the "glue/mrpc" dataset to ~/tensorflow_datasets in a python shell:
python
import tensorflow as tf
import tensorflow_datasets
data = tensorflow_datasets.load("glue/mrpc")
Create a Google storage bucket named "your_bucket".
Copy the entire folder (~/tensorflow_datasets/glue) to gs://your_bucket
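The copy step above can be done with gsutil from the master VM. A hedged sketch follows; the local path and bucket name are the placeholders from the steps above, and the dry_run wrapper is mine (the actual transfer is just the gsutil command it builds):

```python
import subprocess

def copy_glue_to_bucket(local_dir="~/tensorflow_datasets/glue",
                        bucket="gs://your_bucket",
                        dry_run=True):
    # gsutil -m parallelizes transfers; cp -r copies the directory recursively.
    cmd = ["gsutil", "-m", "cp", "-r", local_dir, bucket]
    if dry_run:
        # Return the command for inspection instead of executing it.
        return " ".join(cmd)
    subprocess.run(cmd, check=True)
    return None
```

Run with dry_run=False on the master VM once the bucket exists, or paste the returned command into a shell directly.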
Run yourcode.py in the master VM's conda environment (yourconda) while connected to the TPU:
python yourcode.py
The output above (including the error) is produced.
Now install tensorflow==2.0.0 and rerun; training completes:
pip install --ignore-installed --upgrade tensorflow==2.0.0
python yourcode.py
Attachment: yourcode.py.txt (rename it to yourcode.py)