How to use TF Hub on a distributed setting? #48

cgarciae · 2018-05-11T20:54:35Z

I want to use the ResNet-101-v2 feature vectors to do some transfer learning. I am training with the Estimator API on GCP, and I call the hub Module at the beginning of the model_fn.

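import tensorflow as tf
import tensorflow_hub as hub

# Inside model_fn: load the module and compute feature vectors.
module_url = "https://tfhub.dev/google/imagenet/resnet_v2_101/feature_vector/1"
module = hub.Module(module_url)
height, width = hub.get_expected_image_size(module)
# input_tensor is the batch of input images received by model_fn.
images = tf.image.resize_images(input_tensor, [height, width])
feature_vectors = module(images)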

When I run on a single node ("basic-gpu") all is well. However, when I run the same code in distributed mode ("standard-1"), I get this error:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 337, in _set_variable_or_list_initializer
    _set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 299, in _set_checkpoint_initializer
    ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1458, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/e0c607f95a3d67bc8928a5c20d09d1915322cfcb/variables/variables: Not found: /tmp/tfhub_modules/e0c607f95a3d67bc8928a5c20d09d1915322cfcb/variables; No such file or directory
  [[Node: checkpoint_initializer_537 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](checkpoint_initializer_537/prefix, checkpoint_initializer_537/tensor_names, checkpoint_initializer_537/shape_and_slices)]]
  [[Node: init/NoOp_3_S22 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=-7983147897712139617, tensor_name="edge_3296_init/NoOp_3", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
To find out more about why your job exited please check the logs: ....

How should I structure my code for TF Hub to work with the Estimator API for distributed training?


svsgoogle · 2018-05-14T11:22:15Z

The issue seems to be that the module is cached in a directory that is not accessible to all the machines that initialize variables.

Can you try to specify a location for module caching that all the jobs can read from? You can look at https://www.tensorflow.org/hub/basics section "Caching Modules" for instructions.
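A minimal sketch of pointing the module cache at a shared location, assuming tensorflow_hub reads the `TFHUB_CACHE_DIR` environment variable (the one named in #50) when a module handle is first resolved. The helper name `use_shared_hub_cache` and the path are hypothetical. The variable would need to be set on every job before `hub.Module(...)` is called. Note that #50 reports the download freezing when this directory is a GS bucket.

```python
import os


def use_shared_hub_cache(cache_dir):
    """Point the TF Hub module cache at a location every job can read."""
    # Assumption: tensorflow_hub consults TFHUB_CACHE_DIR when it resolves
    # a module handle, so this must run before hub.Module(...) is called.
    os.environ["TFHUB_CACHE_DIR"] = cache_dir
    return os.environ["TFHUB_CACHE_DIR"]


# Hypothetical shared path; replace it with one all jobs can access.
use_shared_hub_cache("/mnt/shared/tfhub_modules")
```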

cgarciae · 2018-05-14T14:04:43Z

@svsgoogle thanks! Would a GS bucket be a valid location?

svsgoogle · 2018-05-14T14:58:24Z

It should be :)

bhack · 2018-05-15T10:10:51Z

It should not :) #50

yu-iskw · 2018-05-16T17:44:58Z

I have the same issue with Google ML Engine. In my opinion, the issue is caused by tensorflow-hub itself under distributed training, not by Google ML Engine.

I have made a repository to reproduce the issue.
https://github.com/yu-iskw/tensorflow-hub-with-ml-engine

We are also discussing it on the Google issue tracker.
https://issuetracker.google.com/issues/78898344

cgarciae · 2018-05-16T21:55:47Z

For those interested, the issue can be solved by saving the module to a GCS bucket. You can do it with the following steps:

On some local machine:

1. [optional] Clear all modules: `rm -fr /tmp/tfhub_modules`
2. Run some code that downloads the module: `hub.Module(...)`
3. Upload the module to some GCS bucket:

gsutil -m cp -R /tmp/tfhub_modules/{module_hash} gs://bucket/some/path/to/module

Now in your code you can pass the location of the bucket to TF Hub:

hub.Module("gs://bucket/some/path/to/module")
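The `{module_hash}` placeholder in the upload step is the name of a directory under the local cache. A small sketch for finding it and assembling the upload command, assuming the default cache location `/tmp/tfhub_modules` used in this thread. The helper names `cached_module_dirs` and `gsutil_upload_args` are hypothetical, and the sketch only prints the command rather than running it.

```python
import os


def cached_module_dirs(cache_dir="/tmp/tfhub_modules"):
    """Return the directory names in the cache (the {module_hash} values)."""
    if not os.path.isdir(cache_dir):
        return []
    return sorted(
        name for name in os.listdir(cache_dir)
        if os.path.isdir(os.path.join(cache_dir, name))
    )


def gsutil_upload_args(module_hash, destination, cache_dir="/tmp/tfhub_modules"):
    """Build the argument list for the gsutil copy command from the steps above."""
    source = os.path.join(cache_dir, module_hash)
    return ["gsutil", "-m", "cp", "-R", source, destination]


# Print one upload command per cached module (nothing is executed here).
for module_hash in cached_module_dirs():
    print(" ".join(gsutil_upload_args(module_hash, "gs://bucket/some/path/to/module")))
```

If only one module was downloaded after clearing the cache, the list should contain a single entry.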

andresusanopinto · 2018-06-18T16:03:05Z

This was fixed in 6860621, but users will have to wait for a new PyPI release and for it to be picked up by Cloud.

bhack · 2018-06-18T16:11:35Z

I suppose so. Does it also work on S3?

akhorlin · 2018-06-18T16:15:29Z

It should, though I haven't explicitly tested on S3, so please let us know if you see any issues.


bhack · 2018-06-18T16:23:04Z

Isn't 6860621#diff-781a53e648f3df8d16a08ec083b04bf4R18 a little bit incorrect? We are using TensorFlow file://, as I see in 6860621#diff-cd783e402e7064fee42578c5b35d1c3c.

akhorlin · 2018-06-18T16:29:11Z

What error are you observing with the fix in question? The test uses file:// because we cannot use GCS or S3 in a unit test, but in offline testing, we did test gs://. The breakage that was addressed by the fix applies to all custom filesystems in TF (file://, gs://, etc.).


bhack · 2018-06-18T17:38:42Z

It's that the changelog claims GCS, but file:// is not GCS only.

akhorlin · 2018-06-18T17:40:45Z

Yep, we will be clearer with the CL description next time ))


budzianowski · 2018-11-14T18:21:26Z

The approach @cgarciae described above (uploading the cached module to a GCS bucket and passing the gs:// path to hub.Module) does not work for me. Did you manage to make it run? I have version 0.1.1.

akhorlin · 2018-11-14T20:37:01Z

What error are you seeing?


budzianowski · 2018-11-15T00:01:36Z

I'm trying to do the caching through GCS, but the model gets stuck at loading the module, as in #50.

akhorlin · 2018-11-15T08:56:13Z

One option for further debugging is to run a small test program that reads from and writes to the GCS bucket in question using tf.gfile <https://www.tensorflow.org/api_docs/python/tf/gfile/GFile>. This will confirm that the bucket and the local machine are set up correctly.
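Such a test program might look like the sketch below. It is only an illustration: the helper names `roundtrip` and `check_bucket` and the object path are hypothetical, and it assumes a TensorFlow version that exposes `tf.io.gfile.GFile` (older 1.x releases expose the same class as `tf.gfile.GFile`, the name used above).

```python
def roundtrip(path, open_fn, payload="tfhub bucket check"):
    """Write payload to path, read it back, and report whether it matched."""
    with open_fn(path, "w") as f:
        f.write(payload)
    with open_fn(path, "r") as f:
        return f.read() == payload


def check_bucket(gcs_path):
    """Run the round trip against GCS through TensorFlow's file API."""
    # Imported lazily so that roundtrip() stays usable without TensorFlow.
    import tensorflow as tf
    # Assumption: tf.io.gfile.GFile is available; on older TF 1.x releases
    # use tf.gfile.GFile instead.
    return roundtrip(gcs_path, tf.io.gfile.GFile)


# Example call (hypothetical bucket and object name):
# check_bucket("gs://bucket/some/path/tfhub_check.txt")
```

If the call raises or hangs, the problem is likely in the bucket permissions or the machine's GCS setup rather than in TF Hub itself.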


edumotya · 2019-02-04T11:16:33Z

Regarding the approach @cgarciae described above (uploading the cached module to a GCS bucket and passing the gs:// path to hub.Module):

~~This approach does not work for me either. I am seeing the following error:~~

  File "/root/.local/lib/python3.5/site-packages/tensorflow_hub/module.py", line 58, in load_module_spec
    return registry.loader(path)
  File "/root/.local/lib/python3.5/site-packages/tensorflow_hub/registry.py", line 45, in __call__
    self._name, args, kwargs))
RuntimeError: Missing implementation that supports: loader(*('gs://xxx/07Nasnet',), **{})

-------------------- EDITED

It actually works. I just had to add the inner module hash to the gs path. I mean this:

~~hub.Module("gs://bucket/some/path/to/module/")~~

hub.Module("gs://bucket/some/path/to/module/{module_hash}")
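A small sketch of building that handle, so the hash directory is not forgotten. The helper name `module_handle` is hypothetical, and whether the extra path component is needed may depend on how the upload laid out the files in the bucket.

```python
import posixpath


def module_handle(bucket_prefix, module_hash):
    """Join the bucket path and the module hash directory into one handle."""
    # rstrip avoids a double slash when the prefix already ends with "/".
    return posixpath.join(bucket_prefix.rstrip("/"), module_hash)


# "{module_hash}" is left as a placeholder, as in the comment above.
handle = module_handle("gs://bucket/some/path/to/module/", "{module_hash}")
print(handle)
# The resulting string is what would be passed to hub.Module(handle).
```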

rmothukuru · 2019-04-30T08:06:52Z

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!

cgarciae changed the title ~~How to use TensorFlow Hub on distributed setting?~~ How to use TF Hub on a distributed setting? May 11, 2018

cgarciae mentioned this issue May 15, 2018

Module download freezes if TFHUB_CACHE_DIR is a GS Bucket #50

Closed

Harshini-Gadige added type:support subtype:image-feature-vector labels Feb 26, 2019

rmothukuru self-assigned this Apr 30, 2019

rmothukuru added the hub For all issues related to tf hub library and tf hub tutorials or examples posted by hub team label Apr 30, 2019

rmothukuru closed this as completed Apr 30, 2019

jhihn mentioned this issue May 20, 2019

DataLossError: Checksum does not match: #305

Closed
