How to use TF Hub on a distributed setting? #48

Closed
cgarciae opened this issue May 11, 2018 · 19 comments
Assignees
Labels
hub For all issues related to tf hub library and tf hub tutorials or examples posted by hub team subtype:image-feature-vector type:support

Comments

@cgarciae

cgarciae commented May 11, 2018

I want to use the ResNet-101-v2 feature vectors to do some transfer learning. I am training with the Estimator API on GCP, and I call the hub Module at the beginning of the model_fn:

module_url = "https://tfhub.dev/google/imagenet/resnet_v2_101/feature_vector/1"
module = hub.Module(module_url)
height, width = hub.get_expected_image_size(module)
images = tf.image.resize_images(input_tensor, [height, width])
feature_vectors = module(images)

When I run on a single node ("basic-gpu"), all is well. However, when I run the same code in distributed mode ("standard-1"), I get this error:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 337, in _set_variable_or_list_initializer
    _set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 299, in _set_checkpoint_initializer
    ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1458, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/e0c607f95a3d67bc8928a5c20d09d1915322cfcb/variables/variables: Not found: /tmp/tfhub_modules/e0c607f95a3d67bc8928a5c20d09d1915322cfcb/variables; No such file or directory
  [[Node: checkpoint_initializer_537 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](checkpoint_initializer_537/prefix, checkpoint_initializer_537/tensor_names, checkpoint_initializer_537/shape_and_slices)]]
  [[Node: init/NoOp_3_S22 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=-7983147897712139617, tensor_name="edge_3296_init/NoOp_3", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]

To find out more about why your job exited please check the logs: ....

How should I structure my code for TF Hub to work with the Estimator API for distributed training?

@cgarciae cgarciae changed the title How to use TensorFlow Hub on distributed setting? How to use TF Hub on a distributed setting? May 11, 2018
@svsgoogle
Contributor

The issue seems to be that the module is cached in a directory that is not accessible to all the machines that initialize variables.

Can you try to specify a location for module caching that all the jobs can read from? You can look at https://www.tensorflow.org/hub/basics section "Caching Modules" for instructions.
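One way to act on this suggestion is to set the `TFHUB_CACHE_DIR` environment variable before any module is created, so that downloaded modules land in a location every job can read. The following is only a minimal sketch under that assumption: the helper names `configure_hub_cache` and `build_feature_extractor` are hypothetical, and the cache location you pass in must be storage that all workers and parameter servers can actually reach.

```python
import os


def configure_hub_cache(cache_dir):
    """Point TF Hub's module cache at `cache_dir` via TFHUB_CACHE_DIR."""
    os.environ["TFHUB_CACHE_DIR"] = cache_dir
    return cache_dir


def build_feature_extractor(module_url, cache_dir):
    """Create a TF1-style hub.Module after the cache location is configured."""
    # Imported lazily; calling this function requires tensorflow_hub to be installed.
    import tensorflow_hub as hub

    configure_hub_cache(cache_dir)
    return hub.Module(module_url)
```

Calling `configure_hub_cache(...)` at the top of the training entry point, before the model_fn runs, keeps the setting in one place for every job in the cluster.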

@cgarciae
Author

@svsgoogle thanks! Would a GS bucket be a valid location?

@svsgoogle
Contributor

It should be :)

@bhack

bhack commented May 15, 2018

It should not :) #50

@yu-iskw

yu-iskw commented May 16, 2018

I have the same issue with Google ML Engine. In my opinion, the issue is caused by tensorflow-hub itself when it is used with distributed training, not by Google ML Engine.

I have made a repository to reproduce the issue.
https://github.com/yu-iskw/tensorflow-hub-with-ml-engine

We are also discussing it on the Google issue tracker.
https://issuetracker.google.com/issues/78898344

@cgarciae
Author

cgarciae commented May 16, 2018

For those interested, the issue can be solved by saving the module to a GCS bucket. You can do it with the following steps.

On a local machine:

  1. [optional] Clear all cached modules: rm -fr /tmp/tfhub_modules
  2. Run some code that downloads the module: hub.Module(...)
  3. Upload the module to a GCS bucket:
gsutil -m cp -R /tmp/tfhub_modules/{module_hash} gs://bucket/some/path/to/module

Now, in your code, you can pass the location of the bucket to TF Hub:

hub.Module("gs://bucket/some/path/to/module")
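To find the `{module_hash}` value for step 3, it can help to list what is in the local cache. The sketch below is only an illustration: the helper names are hypothetical, it assumes each cached module is a subdirectory of `/tmp/tfhub_modules`, and it prints the `gsutil` command rather than running it.

```python
import os


def cached_module_dirs(cache_dir="/tmp/tfhub_modules"):
    """Return the names of the subdirectories found in the local TF Hub cache."""
    if not os.path.isdir(cache_dir):
        return []
    return sorted(
        name for name in os.listdir(cache_dir)
        if os.path.isdir(os.path.join(cache_dir, name))
    )


def gsutil_upload_command(module_hash, destination, cache_dir="/tmp/tfhub_modules"):
    """Build the `gsutil -m cp -R` argument list from step 3 for one cached module."""
    source = os.path.join(cache_dir, module_hash)
    return ["gsutil", "-m", "cp", "-R", source, destination]


if __name__ == "__main__":
    for module_hash in cached_module_dirs():
        # Print rather than execute, so nothing is uploaded by accident.
        print(" ".join(gsutil_upload_command(module_hash, "gs://bucket/some/path/to/module")))
```

If step 1 was run first, the cache should contain a single module directory, so there is no ambiguity about which hash to upload.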

@andresusanopinto
Contributor

andresusanopinto commented Jun 18, 2018

This was fixed in 6860621, but users will have to wait for a new PyPI release and for Cloud to pick it up.

@bhack

bhack commented Jun 18, 2018

I suppose... does it also work on S3?

@akhorlin
Collaborator

akhorlin commented Jun 18, 2018 via email

@bhack

bhack commented Jun 18, 2018

Isn't 6860621#diff-781a53e648f3df8d16a08ec083b04bf4R18 probably a little bit incorrect? We are using tensorflow file://, as I see in 6860621#diff-cd783e402e7064fee42578c5b35d1c3c

@akhorlin
Collaborator

akhorlin commented Jun 18, 2018 via email

@bhack

bhack commented Jun 18, 2018

It's that the changelog claims GCS, but file:// is not GCS-only.

@akhorlin
Collaborator

akhorlin commented Jun 18, 2018 via email

@budzianowski

budzianowski commented Nov 14, 2018

> For those interested, the issue can be solved by saving the module to a GCS bucket. You can do it with the following steps.
>
> On a local machine:
>
> 1. [optional] Clear all cached modules: `rm -fr /tmp/tfhub_modules`
> 2. Run some code that downloads the module: `hub.Module(...)`
> 3. Upload the module to a GCS bucket:
> gsutil -m cp -R /tmp/tfhub_modules/{module_hash} gs://bucket/some/path/to/module
>
> Now, in your code, you can pass the location of the bucket to TF Hub:
>
> hub.Module("gs://bucket/some/path/to/module")

This approach does not work for me. Did you manage to make it run? I have version 0.1.1.

@akhorlin
Collaborator

akhorlin commented Nov 14, 2018 via email

@budzianowski

I'm trying to do the caching through GCS, but the model gets stuck while loading the module, as in #50.

@akhorlin
Collaborator

akhorlin commented Nov 15, 2018 via email

@edumotya

edumotya commented Feb 4, 2019

> For those interested, the issue can be solved by saving the module to a GCS bucket. You can do it with the following steps.
>
> On a local machine:
>
> 1. [optional] Clear all cached modules: rm -fr /tmp/tfhub_modules
> 2. Run some code that downloads the module: hub.Module(...)
> 3. Upload the module to a GCS bucket:
> gsutil -m cp -R /tmp/tfhub_modules/{module_hash} gs://bucket/some/path/to/module
>
> Now, in your code, you can pass the location of the bucket to TF Hub:
>
> hub.Module("gs://bucket/some/path/to/module")

This approach does not work for me either. I am seeing the following error:

  File "/root/.local/lib/python3.5/site-packages/tensorflow_hub/module.py", line 58, in load_module_spec
    return registry.loader(path)
  File "/root/.local/lib/python3.5/site-packages/tensorflow_hub/registry.py", line 45, in __call__
    self._name, args, kwargs))
RuntimeError: Missing implementation that supports: loader(*('gs://xxx/07Nasnet',), **{})

-------------------- EDITED

It actually works; I just had to add the inner module hash to the GCS path. That is, instead of this:

hub.Module("gs://bucket/some/path/to/module/")

I now use this:

hub.Module("gs://bucket/some/path/to/module/{module_hash}")
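The difference between the two paths can be captured in a small helper. This is only a sketch with a hypothetical function name, `module_path`; whether the hash subdirectory exists in the bucket depends on how the upload was done, so it is worth checking the bucket layout (for example with `gsutil ls`) before choosing which form to pass to `hub.Module`.

```python
import posixpath


def module_path(base, module_hash=None):
    """Join a bucket prefix and an optional cache hash into a module path."""
    base = base.rstrip("/")
    # Without a hash, the module files are assumed to sit directly under `base`.
    return posixpath.join(base, module_hash) if module_hash else base
```

For example, `module_path("gs://bucket/some/path/to/module/", "{module_hash}")` yields the second form shown above, with the placeholder replaced by the real directory name in practice.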

@rmothukuru rmothukuru self-assigned this Apr 30, 2019
@rmothukuru rmothukuru added the hub For all issues related to tf hub library and tf hub tutorials or examples posted by hub team label Apr 30, 2019
@rmothukuru

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
