Tensorflow Hub: Support multi-GPU training in Keras or Estimator #64

rsethur · 2018-05-29T16:48:58Z

In my project I use TF-Hub with Estimators. However, when I try to use multiple GPUs (single machine) with tf.contrib.estimator.replicate_model_fn, I get the following error:

variable_scope was unused but the corresponding name_scope was already taken.

It probably originates from this source line: link

Any help is much appreciated. Thanks in advance!

CC: @arnoegw

arnoegw · 2018-05-30T09:33:26Z

Thanks for your report. Unfortunately, the straightforward way of instantiating a hub.Module in the model_fn of an Estimator does not currently work with tf.contrib.estimator.replicate_model_fn and how it calls the same model_fn repeatedly. To hack around this, one would have to share hub.Module instances for each graph that model_fn gets called in (e.g., through a custom collection). After that, applying a Module object multiple times should basically just work.

If anyone else is hampered by this issue as well, please speak up here.
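One way to read the "share hub.Module instances for each graph" suggestion is a cache keyed by graph: the first model_fn call in a graph creates the module, and later calls in the same graph get the same object back. The sketch below is plain Python under that reading; the helper name get_or_create_per_graph is hypothetical, the TF1 usage in the trailing comments is untested, and a collection-based attempt later in this thread was reported not to work, so treat it as an illustration of the idea rather than a confirmed fix.

```python
import weakref

# Maps each graph to a dict of objects shared by every model_fn call in it.
# Keys are held weakly, so this cache does not by itself keep a graph alive
# (a cached value that refers back to its graph still would).
_PER_GRAPH_CACHE = weakref.WeakKeyDictionary()


def get_or_create_per_graph(graph, key, factory):
    """Hypothetical helper: return the object stored under `key` for `graph`.

    The first call for a (graph, key) pair runs `factory()` and caches the
    result; later calls for the same pair (e.g. from other towers of a
    replicated model_fn) return that same object instead of building a new one.
    """
    per_graph = _PER_GRAPH_CACHE.setdefault(graph, {})
    if key not in per_graph:
        per_graph[key] = factory()
    return per_graph[key]


# Hypothetical TF1 usage inside model_fn (untested here):
#   graph = tf.compat.v1.get_default_graph()
#   elmo = get_or_create_per_graph(
#       graph, "elmo",
#       lambda: hub.Module("https://tfhub.dev/google/elmo/2", trainable=True))
#   outputs = elmo(inputs={"tokens": tokens, "sequence_len": tokens_length},
#                  signature="tokens", as_dict=True)["elmo"]
```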

rsethur · 2018-06-06T02:33:25Z

Hello @arnoegw, can you please provide more guidance? Some pseudocode would help. TF-Hub + Estimators have awesome potential for developers; ironing out these kinks would definitely help.

arnoegw · 2018-06-06T15:31:54Z

I very much agree: it would be great to iron out the kinks that prevent straightforward use of Hub modules with multi-GPU Estimators. Unfortunately, at this time, I have neither that code nor worked-out example code for the workaround I sketched above. Sorry.

Leaving this open for the feature request...

matthew-z · 2018-08-21T15:31:09Z

+1, same problem when using Estimator.

I also look forward to trying multi-GPU training with tf-hub.

nikolausWest · 2018-09-17T08:24:40Z

+1, same issue here. I would like to use tf-hub with Estimators and multiple GPUs.

In the meantime, some pseudocode or a more detailed explanation of how to work around it would be really appreciated.

akhilkatpally · 2018-09-25T17:49:12Z

+1, same problem when using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

marhlder · 2018-11-19T14:46:55Z

Did anyone manage to conjure a working hack for this?
I was unable to get it to work through a TF graph collection.

marhlder · 2018-11-19T15:17:54Z

> Thanks for your report. Unfortunately, the straightforward way of instantiating a hub.Module in the model_fn of an Estimator does not currently work with tf.contrib.estimator.replicate_model_fn and how it calls the same model_fn repeatedly. To hack around this, one would have to share hub.Module instances for each graph that model_fn gets called in (e.g., through a custom collection). After that, applying a Module object multiple times should basically just work.
>
> If anyone else is hampered by this issue as well, please speak up here.

Where would the shared instance have to be created?

Doing something like this in the model_fn does not work:

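```python
import tensorflow as tf  # TF 1.x graph mode
import tensorflow_hub as hub

# Inside model_fn; `tokens` and `tokens_length` are the model's input tensors.
# Attempt: create the ELMo module on the first call only, store it in a
# graph collection, and reuse that instance on later calls.
if not tf.get_collection("SHARED_ELMO_INSTANCE_COLLECTION"):
    elmo = hub.Module("https://tfhub.dev/google/elmo/2", name="ELMO", trainable=True)
    tf.add_to_collection("SHARED_ELMO_INSTANCE_COLLECTION", elmo)

elmo = tf.get_collection("SHARED_ELMO_INSTANCE_COLLECTION")[0]

elmo_representations = elmo(
    inputs={
        "tokens": tokens,
        "sequence_len": tokens_length,
    },
    signature="tokens",
    as_dict=True)["elmo"]
```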

jasonkrone · 2018-12-17T04:23:55Z

+1, I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

edumotya · 2019-02-05T09:32:49Z

+1, I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

bjayakumar · 2019-02-07T17:05:57Z

Came here to report that it is still not fixed. I hope they fix it soon.

Harshini-Gadige · 2019-03-14T17:33:45Z

@arnoegw Any update or ETA on this?

arnoegw · 2019-03-15T09:31:14Z

Hi all, thanks for your patience. We understand that multi-GPU training is important. While it was possible in low-level TensorFlow early on, its support by high-level frameworks has been a moving target. With the advent of TensorFlow 2 (see the recent Dev Summit), both sides of the story are changing again, but for the better:

- Hub modules for TF2 will be SavedModels in the TF2 version of that format, loaded natively with tf.saved_model.load(). Under the hood, this provides a clean separation of computation and state, which helps the cause.
- DistributionStrategy is the new, more powerful abstraction for various kinds of parallel training.

So the TF2 version of this feature request is DistributionStrategy support for model pieces brought in by loading a SavedModel, preferably through Keras (not low-level TF). This is on the radar for the TensorFlow and TF Hub teams, but there is no specific timeline.

tf.contrib.estimator.replicate_model_fn is deprecated by now. We do not plan to go back and work on supporting it. Let me change the issue title accordingly.

arnoegw · 2019-03-15T09:35:47Z

For those especially interested in retraining of image models faster than with retrain.py:

If you are ready to live on the cutting edge of TF 2.0.0alpha0, take a look at Hub's examples/colab/tf2_image_retraining.ipynb, which is considerably smaller, faster (if you use a GPU), and even supports fine-tuning the image module. However, it still uses a single GPU.

o-90 · 2019-05-12T23:40:10Z

> Thanks for your report. Unfortunately, the straightforward way of instantiating a hub.Module in the model_fn of an Estimator does not currently work with tf.contrib.estimator.replicate_model_fn and how it calls the same model_fn repeatedly. To hack around this, one would have to share hub.Module instances for each graph that model_fn gets called in (e.g., through a custom collection). After that, applying a Module object multiple times should basically just work.
>
> If anyone else is hampered by this issue as well, please speak up here.

Really hampered by this issue.

From what I understand, tensorflow_hub.Module._try_get_state_scope is complaining because the embeddings are trying to be placed on all available GPUs.

> one would have to share hub.Module instances **for each graph**
> that model_fn gets called in

A little more detail on what is meant by that sentence would go a long way. I'm not asking for a solution, but some pseudocode would be great.

r-wheeler · 2019-05-13T17:33:51Z

I am really hampered by this issue as well.

rsethur · 2019-05-13T18:30:35Z

@arnoegw Many thanks for the development. Question: how is Hub positioned in comparison to the Keras Applications models? They seem quite similar. Will they be unified in the future?
Also, some of the models (object detection) do not support fine-tuning. Do you plan to fix this in future releases?

Thanks again!

arnoegw · 2019-05-14T11:47:16Z

@rsethur: There are no plans for unification at this time. TF Hub overlaps with Keras Applications for the particular case of reusing CNNs for image classification / feature extraction, but TF Hub offers modules (sometimes entire models) for a number of other domains, and requires neither the module consumer nor the module publisher to use Keras.

arnoegw · 2019-05-14T11:52:32Z

@gobrewers14, @r-wheeler: There is no great solution for TF1, but for TF2, there are the plans I described on March 15, and the already available examples/colab/tf2_image_retraining.ipynb with decent fine-tuning performance on a single GPU. Hope that helps.

littleDing · 2019-06-18T09:21:01Z

+1, I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

mhajiaghayi · 2019-06-27T01:44:42Z

I have the same problem with tf-hub and Estimator, and I am very disappointed by the TF team's response. Sadly, there are lots of changes in TensorFlow from one version to the next.

Aashish-1008 · 2019-08-26T18:51:27Z

+1, I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy(num_gpus=8)).

serdarbozoglan · 2019-10-18T13:38:02Z

I am also getting the same error: "RuntimeError: variable_scope module_8/ was unused but the corresponding name_scope was already taken."

akshaydnicator · 2020-04-19T09:56:18Z

Still not fixed, I believe. Please help!

RuntimeError: variable_scope module_3/ was unused but the corresponding name_scope was already taken.

Full Traceback:

```
RuntimeError                              Traceback (most recent call last)
in
      6 tf.compat.v1.disable_eager_execution()
      7
----> 8 elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

/opt/conda/lib/python3.6/site-packages/tensorflow_hub/module.py in __init__(self, spec, trainable, name, tags)
    160 raise ValueError("No such graph variant: tags=%r" % tags)
    161
--> 162 abs_state_scope = _try_get_state_scope(name, mark_name_scope_used=False)
    163 self._name = abs_state_scope.split("/")[-2]
    164

/opt/conda/lib/python3.6/site-packages/tensorflow_hub/module.py in _try_get_state_scope(name, mark_name_scope_used)
    393 raise RuntimeError(
    394 "variable_scope %s was unused but the corresponding "
--> 395 "name_scope was already taken." % abs_state_scope)
    396 return abs_state_scope
    397

RuntimeError: variable_scope module_3/ was unused but the corresponding name_scope was already taken.
```

sbecon · 2020-08-26T02:58:28Z

I have the same issue.

frozenzo · 2020-12-30T05:54:37Z

Still hampered by the same issue. Is there any solution, even a hack?

arnoegw · 2021-02-11T10:12:24Z

This won't be fixed for TF1 and the libraries that target it (hub.Module, Estimator).

For TF2, Keras, and the TF2 SavedModels loaded from TF Hub with hub.KerasLayer, the usual way of building and compiling a Keras model under a tf.distribute.MirroredStrategy and then calling .fit() on a tf.data.Dataset should just work. What we don't have yet is a great example to demonstrate that, say, on a multi-GPU machine on Google Cloud.

maringeo · 2021-06-11T15:20:14Z

TF Hub's make_image_classifier tool has been updated to use tf.data.Dataset and to demonstrate distributed training, including multi-GPU: https://github.com/tensorflow/hub/tree/master/tensorflow_hub/tools/make_image_classifier.

The make_image_classifier code is not a minimal working example, but as #64 (comment) says, a Keras model built under tf.distribute.MirroredStrategy that uses tf.data.Dataset should work on multi-GPU.

I plan to keep this issue open for a few weeks, in case anyone encounters any issues that I've missed during testing.

Harshini-Gadige added the type:bug label Feb 26, 2019

Harshini-Gadige added the hub For all issues related to tf hub library and tf hub tutorials or examples posted by hub team label Mar 14, 2019

Harshini-Gadige assigned arnoegw Mar 14, 2019

Harshini-Gadige added the stat:awaiting tensorflower label Mar 14, 2019

arnoegw changed the title ~~Tensorflow Hub: Failed Multi gpu execution with tf.contrib.estimator.replicate_model_fn~~ Tensorflow Hub: Support multi-GPU training in Keras or Estimator Mar 15, 2019

Harshini-Gadige added type:feature and removed stat:awaiting tensorflower type:bug labels Mar 15, 2019

woxue mentioned this issue Dec 26, 2020 in "How to run VTAB on multi-GPUs?" google-research/task_adaptation#14 (open)

arnoegw assigned maringeo and unassigned arnoegw Feb 11, 2021

arghyaganguly added the stat:awaiting tensorflower label Jun 4, 2021

maringeo closed this as completed Jun 24, 2021
