
TPU support has regressed in tf-nightly (worked well in tf=2.0.0) - operation or function not registered in the binary running in this process #33944

Closed
dbonner opened this issue Nov 3, 2019 · 4 comments
Labels: comp:dist-strat, comp:tpus, stale, stat:awaiting response, TF 2.0, type:bug

Comments

dbonner commented Nov 3, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes (see below). The code has been adapted from the Colab notebook (https://colab.research.google.com/drive/1yWaLpCWImXZE2fPV0ZYDdWWI8f52__9A#scrollTo=mnhwpzb73KIL); instructions on how to run it on a TPU using ctpu are below. I have 90 days of free access through TFRC.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Cloud TPU pair (master VM is Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64 GNU/Linux; TPU is v3-8, 8 cores)
  • TensorFlow installed from (source or binary): binary (pip)
  • TensorFlow version: tested on the tf-nightly binary (2.1.0-dev20191102) from pip.
  • Python version: 3.7.3 (conda-forge)

Describe the current behavior
When tf-nightly is installed, following the setup instructions below and running the attached code (yourcode.py) produces an error. It occurs at lines 54-56 of yourcode.py (the traceback points to line 56):
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

I have marked the place at line 58 with "# This is where the error occurs".
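
For reference, here is a minimal standalone sketch of the failing initialization sequence (assuming, as in the attached script, that the TPU name is resolved automatically in the environment set up by ctpu; the TPUStrategy lines are only illustrative of what typically follows and are not part of the failing call):

import tensorflow as tf

# Resolve the TPU attached to this VM (on a ctpu-created pair no explicit name is needed).
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)

# This call raises the NotFoundError below on tf-nightly, but succeeds on tensorflow==2.0.0.
tf.tpu.experimental.initialize_tpu_system(tpu)

# With the TPU system initialized, a distribution strategy can then be created (illustrative).
strategy = tf.distribute.experimental.TPUStrategy(tpu)
print("Replicas in sync:", strategy.num_replicas_in_sync)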

The full output, including the error, is as follows (after following the steps under "Code to reproduce the issue"):

2019-11-03 06:59:16.549391: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-03 06:59:16.549450: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2.1.0-dev20191102
2019-11-03 06:59:18.662355: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2019-11-03 06:59:18.662422: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2019-11-03 06:59:18.662459: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (mimetic): /proc/driver/nvidia/version does not exist
2019-11-03 06:59:18.663151: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-03 06:59:18.672455: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-11-03 06:59:18.673297: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ca94e697b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-11-03 06:59:18.673382: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (gs://mimetic_store/glue/mrpc/0.0.2)
INFO:absl:Constructing tf.data.Dataset for split None, from gs://mimetic_store/glue/mrpc/0.0.2
Saved glue_mnli_train.
Saved glue_mnli_valid.
2019-11-03 07:00:41.886095: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2019-11-03 07:00:41.886162: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:54535}
2019-11-03 07:01:53.957214: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2019-11-03 07:01:53.957297: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:54535}
2019-11-03 07:01:53.958220: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:54535
INFO:absl:Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
Traceback (most recent call last):
File "yourcode.py", line 56, in
tf.tpu.experimental.initialize_tpu_system(tpu)
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/tpu/tpu_strategy_util.py", line 103, in initialize_tpu_system
serialized_topology = output.numpy()
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 942, in numpy
maybe_arr = self._numpy() # pylint: disable=protected-access
File "/home/daniel_bonner_anu_edu_au/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 910, in _numpy
six.raise_from(core._status_to_exception(e.code, e.message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: '__inference__tpu_init_fn_4206710' is neither a type of a primitive operation nor a name of a function registered in binary running on n-48a744b7-w-0. Make sure the operation or function is registered in the binary running in this process.
2019-11-03 07:01:54.492446: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:75] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: '__inference__tpu_init_fn_4206710' is neither a type of a primitive operation nor a name of a function registered in binary running on n-48a744b7-w-0. Make sure the operation or function is registered in the binary running in this process.

Describe the expected behavior
When tested with tensorflow==2.0.0 from pip, the code completes.
The whole run (with tf 2.0.0) ends with:
Epoch: [2] Validation accuracy = 0.843137264251709

Code to reproduce the issue

Create a Google Cloud master VM and TPU pair:

ctpu up --name=yourtpupair --zone=us-central1-a --tpu-size=v3-8 --machine-type=n1-standard-8 --disk-size-gb=40

Set up a conda Python 3.7 development environment with tf-nightly on the master VM:

sudo apt update && sudo apt install bzip2 libxml2-dev libxslt-dev -y && wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh && bash Anaconda3-2019.10-Linux-x86_64.sh

Accept the defaults and initialize Anaconda

rm Anaconda3-2019.10-Linux-x86_64.sh && . ~/.bashrc && conda config --add channels anaconda && conda config --add channels conda-forge && conda config --set channel_priority strict && conda create -n yourconda python=3.7 -y && conda activate yourconda

conda install tqdm

pip install tensorflow-datasets transformers && pip install --upgrade google-api-python-client && pip install --upgrade oauth2client && pip install --ignore-installed --upgrade tf-nightly

Download the "glue/mrpc" dataset to ~/tensorflow_datasets in a python shell:

python
import tensorflow as tf
import tensorflow_datasets
data = tensorflow_datasets.load("glue/mrpc")

Create a Google storage bucket named "your_bucket".

Copy the entire folder (~/tensorflow_datasets/glue) to gs://your_bucket
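
For example, with gsutil (assuming the bucket already exists and you have write access to it):

gsutil -m cp -r ~/tensorflow_datasets/glue gs://your_bucket/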

Run yourcode.py in the master VM's conda environment (yourconda), connected to the TPU:

python yourcode.py

The output above (including the error) is produced.

Now install tensorflow==2.0.0 and rerun the script; training will complete:

pip install --ignore-installed --upgrade tensorflow==2.0.0
python yourcode.py

yourcode.py.txt (rename it to yourcode.py)

@dbonner dbonner changed the title from "TPU support has regressed in tf-nightly (worked perfectly in tf=2.0.0) - operation or function not registered in the binary running in this process" to "TPU support has regressed in tf-nightly (worked well in tf=2.0.0) - operation or function not registered in the binary running in this process" Nov 3, 2019
@gadagashwini-zz gadagashwini-zz self-assigned this Nov 4, 2019
@gadagashwini-zz gadagashwini-zz added TF 2.0, comp:tpus, comp:dist-strat, and type:support labels Nov 4, 2019
@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower and type:bug labels and removed the type:support label Nov 15, 2019
@amahendrakar amahendrakar self-assigned this Jan 18, 2021
@amahendrakar
Contributor

@dbonner,
Could you please update TensorFlow to the latest stable version (v2.4) and check whether you are still facing the same error? Thanks!
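
For example (a sketch, assuming the same pip-based conda environment described in the reproduction steps):

pip install --ignore-installed --upgrade tensorflow==2.4.0
python yourcode.py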

@amahendrakar amahendrakar added the stat:awaiting response label and removed the stat:awaiting tensorflower label Jan 18, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale label Jan 25, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

@google-ml-butler

Are you satisfied with the resolution of your issue?
