Tensorflow Hub: Support multi-GPU training in Keras or Estimator #64

Closed
rsethur opened this issue May 29, 2018 · 28 comments
Labels: hub, stat:awaiting tensorflower, type:feature

Comments

@rsethur

rsethur commented May 29, 2018

In my project I use TF-Hub with Estimators. However, when I try to use multiple GPUs (single machine) via tf.contrib.estimator.replicate_model_fn, I get the following error:

variable_scope was unused but the corresponding name_scope was already taken.

It probably comes from this source line: link

Any help is much appreciated - received with thanks.

CC: @arnoegw

@arnoegw
Contributor

arnoegw commented May 30, 2018

Thanks for your report. Unfortunately, the straightforward way of instantiating a hub.Module in the model_fn of an Estimator does not currently work with tf.contrib.estimator.replicate_model_fn, because of how it calls the same model_fn repeatedly. To work around this, one would have to share hub.Module instances for each graph that model_fn gets called in (e.g., through a custom collection). After that, applying a Module object multiple times should basically just work.
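
The sharing pattern described above can be sketched without TensorFlow. This is only an illustration under stated assumptions: get_shared, FakeGraph, and make_module are hypothetical names, FakeGraph stands in for a tf.Graph, and the factory stands in for the hub.Module(...) constructor call. It has not been verified against tf.contrib.estimator.replicate_model_fn, and a later comment in this thread reports that a collection-based attempt did not work.

```python
import weakref

# One registry entry per graph; an entry disappears when its graph is garbage-collected.
_shared_per_graph = weakref.WeakKeyDictionary()


def get_shared(graph, key, factory):
    """Return the object stored for (graph, key), calling factory() at most once per graph."""
    per_graph = _shared_per_graph.setdefault(graph, {})
    if key not in per_graph:
        per_graph[key] = factory()
    return per_graph[key]


class FakeGraph:
    """Hypothetical stand-in for a tf.Graph so the sketch runs without TensorFlow."""


calls = []


def make_module():
    """Hypothetical stand-in for constructing a hub.Module; records each construction."""
    calls.append(1)
    return object()


g1, g2 = FakeGraph(), FakeGraph()
a = get_shared(g1, "elmo", make_module)  # first model_fn call in graph g1
b = get_shared(g1, "elmo", make_module)  # repeated call in the same graph reuses it
c = get_shared(g2, "elmo", make_module)  # a different graph gets its own instance
print(a is b, a is c, len(calls))  # True False 2
```

In a TF1 model_fn, the graph key would presumably be tf.get_default_graph(), with the factory wrapping the hub.Module construction.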

If anyone else is hampered by this issue as well, please speak up here.

@rsethur
Author

rsethur commented Jun 6, 2018

Hello @arnoegw, could you please provide more guidance? Some pseudocode would help. TF-Hub + Estimators have awesome potential for developers, and ironing out these kinks would definitely help.

@arnoegw
Contributor

arnoegw commented Jun 6, 2018

I very much agree: it would be great to iron out the kinks that prevent straightforward use of Hub modules with multi-GPU Estimators. Unfortunately, at this time, I have neither that code nor worked-out example code for the workaround that I sketched above. Sorry.

Leaving this open for the feature request...

@matthew-z

matthew-z commented Aug 21, 2018

+1 Same problem when using Estimator.

I also look forward to trying multiGPU with tf-hub

@nikolausWest

nikolausWest commented Sep 17, 2018

+1 Same issue here. Would like to use tf-hub with estimators and multi GPU.

In the meantime, some pseudocode or a more detailed explanation of how to work around it would be really appreciated.

@akhilkatpally

akhilkatpally commented Sep 25, 2018

+1 Same problem when using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

@marhlder

marhlder commented Nov 19, 2018

Did anyone manage to conjure a working hack for this?
I was unable to get it to work through a tf.collection.

@marhlder

marhlder commented Nov 19, 2018

> Thanks for your report. Unfortunately, the straightforward way of instantiating a hub.Module in the model_fn of an Estimator does not currently work with tf.contrib.estimator.replicate_model_fn and how it calls the same model_fn repeatedly. To hack around this, one would have to share hub.Module instances for each graph that model_fn gets called in (e.g., through a custom collection). After that, applying a Module object multiple times should basically just work.
>
> If anyone else is hampered by this issue as well, please speak up here.

Where would the shared instance have to be created?

Doing something like this in the model_fn does not work:

    # Inside model_fn; assumes `import tensorflow as tf` and `import tensorflow_hub as hub`.
    collection = "SHARED_ELMO_INSTANCE_COLLECTION"

    # Create the module only if this graph's collection is still empty.
    if not tf.get_collection(collection):
        elmo = hub.Module("https://tfhub.dev/google/elmo/2", name="ELMO", trainable=True)
        tf.add_to_collection(collection, elmo)

    # Reuse the instance stored in the collection.
    elmo = tf.get_collection(collection)[0]

    elmo_representations = elmo(
        inputs={"tokens": tokens, "sequence_len": tokens_length},
        signature="tokens",
        as_dict=True)["elmo"]


@jasonkrone

+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

@edumotya

edumotya commented Feb 5, 2019

+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

@bjayakumar

Came here to report that it is still not fixed. I hope they fix it soon.

Harshini-Gadige added the hub label Mar 14, 2019
@Harshini-Gadige

@arnoegw Any update or ETA on this?

@arnoegw
Contributor

arnoegw commented Mar 15, 2019

Hi all, thanks for your patience. We understand that multi-GPU training is important. While it was possible in low-level TensorFlow early on, its support by high-level frameworks has been a moving target. With the advent of TensorFlow 2 (see the recent Dev Summit), both sides of the story are changing again, but for the better:

  • Hub modules for TF2 will be SavedModels in the TF2 version of that format, loaded natively
    with tf.saved_model.load(). Under the hood, this provides a clean separation of computation
    and state, which helps the cause.
  • DistributionStrategy is the new, more powerful abstraction for various kinds of parallel
    training.

So the TF2 version of this feature request is DistributionStrategy support for model pieces brought in by loading a SavedModel, preferably through Keras (not low-level TF). This is on the radar for the TensorFlow and TF Hub teams, but there is no specific timeline.

tf.contrib.estimator.replicate_model_fn is deprecated by now. We do not plan to go back and work on supporting it. Let me change the issue title accordingly....

arnoegw changed the title from "Tensorflow Hub: Failed Multi gpu execution with tf.contrib.estimator.replicate_model_fn" to "Tensorflow Hub: Support multi-GPU training in Keras or Estimator" on Mar 15, 2019
@arnoegw
Contributor

arnoegw commented Mar 15, 2019

For those especially interested in retraining of image models faster than with retrain.py:

If you are ready to live on the cutting edge of TF 2.0.0alpha0, take a look at Hub's examples/colab/tf2_image_retraining.ipynb which is considerably smaller, faster (if you use a GPU), and even supports fine-tuning the image module. However, this is still with a single GPU.

@o-90

o-90 commented May 12, 2019

> Thanks for your report. Unfortunately, the straightforward way of instantiating a hub.Module in the model_fn of an Estimator does not currently work with tf.contrib.estimator.replicate_model_fn and how it calls the same model_fn repeatedly. To hack around this, one would have to share hub.Module instances for each graph that model_fn gets called in (e.g., through a custom collection). After that, applying a Module object multiple times should basically just work.
>
> If anyone else is hampered by this issue as well, please speak up here.

Really hampered by this issue.

From what I understand, tensorflow_hub.Module._try_get_state_scope is complaining because the embeddings are being placed on all available GPUs.

> one would have to share hub.Module instances **for each graph**
> that model_fn gets called in

A little more detail on what is meant by that sentence would go a long way. I'm not asking for a solution, but some pseudocode would be great.

@r-wheeler

I am really hampered by this issue as well.

@rsethur
Author

rsethur commented May 13, 2019

@arnoegw Many thanks for the development. Question: how is Hub positioned in comparison to the Keras Applications models? They seem to be quite similar. Will there be some unification in the future?
Also, some of the models (object detection) do not support fine-tuning. Do you plan to fix this in future releases?

Thanks again!

@arnoegw
Contributor

arnoegw commented May 14, 2019

@rsethur: There are no plans for unification at this time. TF Hub overlaps with Keras Applications for the particular case of reusing CNNs for image classification / feature extraction, but TF Hub offers modules (sometimes entire models) for a number of other domains, and requires neither the module consumer nor the module publisher to use Keras.

@arnoegw
Contributor

arnoegw commented May 14, 2019

@gobrewers14, @r-wheeler: There is no great solution for TF1, but for TF2, there are the plans I described on March 15, and the already available examples/colab/tf2_image_retraining.ipynb with decent fine-tuning performance on a single GPU. Hope that helps.

@littleDing

+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

@mhajiaghayi

I have the same problem with tf-hub and Estimator, and I am very disappointed by the response of the TF team. Sadly, there are lots of changes in TensorFlow from one version to another.

@Aashish-1008

+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy(num_gpus=8)).

@serdarbozoglan

I am also getting the same error: "RuntimeError: variable_scope module_8/ was unused but the corresponding name_scope was already taken."

@akshaydnicator

Still not fixed I believe. Please help!

RuntimeError: variable_scope module_3/ was unused but the corresponding name_scope was already taken.

Full Traceback:


RuntimeError Traceback (most recent call last)
in
6 tf.compat.v1.disable_eager_execution()
7
----> 8 elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

/opt/conda/lib/python3.6/site-packages/tensorflow_hub/module.py in __init__(self, spec, trainable, name, tags)
160 raise ValueError("No such graph variant: tags=%r" % tags)
161
--> 162 abs_state_scope = _try_get_state_scope(name, mark_name_scope_used=False)
163 self._name = abs_state_scope.split("/")[-2]
164

/opt/conda/lib/python3.6/site-packages/tensorflow_hub/module.py in _try_get_state_scope(name, mark_name_scope_used)
393 raise RuntimeError(
394 "variable_scope %s was unused but the corresponding "
--> 395 "name_scope was already taken." % abs_state_scope)
396 return abs_state_scope
397

RuntimeError: variable_scope module_3/ was unused but the corresponding name_scope was already taken.

@sbecon

sbecon commented Aug 26, 2020

I have the same issue

@frozenzo

Still hampered by the same issue at this time. Is there any (hack) solution?

@arnoegw
Contributor

arnoegw commented Feb 11, 2021

This won't be fixed for TF1 and the libraries that target it (hub.Module, Estimator).

For TF2, Keras, and the TF2 SavedModels loaded from TF Hub with hub.KerasLayer, the usual way of building and compiling a Keras model under a tf.distribute.MirroredStrategy and then calling .fit() on a tf.data.Dataset should just work. What we don't have yet is a great example to demonstrate that, say, on a multi-GPU machine on Google Cloud.
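
A minimal sketch of the TF2 setup described above, assuming TensorFlow 2 and tensorflow_hub are installed. The names train_on_all_gpus, global_batch_size, module_handle, and make_dataset are hypothetical: module_handle is whichever TF2 SavedModel handle you use, and make_dataset is assumed to return an unbatched tf.data.Dataset of (features, integer label) pairs. This sketch has not been run on a multi-GPU machine.

```python
def global_batch_size(per_replica_batch_size, num_replicas):
    """Batch size for Dataset.batch(); Keras splits each global batch across replicas."""
    if per_replica_batch_size < 1 or num_replicas < 1:
        raise ValueError("batch size and replica count must be positive")
    return per_replica_batch_size * num_replicas


def train_on_all_gpus(module_handle, make_dataset, num_classes,
                      per_replica_batch_size=32, epochs=1):
    """Build, compile, and fit a Keras model around a Hub layer under MirroredStrategy."""
    # Imported here so the helper above stays usable without TensorFlow installed.
    import tensorflow as tf
    import tensorflow_hub as hub

    # With no arguments, MirroredStrategy uses all GPUs visible to TensorFlow.
    strategy = tf.distribute.MirroredStrategy()
    batch_size = global_batch_size(per_replica_batch_size,
                                   strategy.num_replicas_in_sync)

    # The model, including the Hub layer, is built and compiled inside the scope
    # so that its variables are mirrored across replicas.
    with strategy.scope():
        model = tf.keras.Sequential([
            hub.KerasLayer(module_handle, trainable=True),
            tf.keras.layers.Dense(num_classes),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )

    # The dataset is batched with the global batch size, not the per-replica one.
    dataset = make_dataset().batch(batch_size)
    model.fit(dataset, epochs=epochs)
    return model
```

The global batch size is scaled by the replica count because Keras divides each batch from the dataset among the replicas; whether the learning rate should be scaled as well depends on the model.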

@maringeo
Collaborator

maringeo commented Jun 11, 2021

TF Hub's make_image_classifier tool has been updated to use tf.data.Dataset and to demonstrate distributed training, including multi-GPU: https://github.com/tensorflow/hub/tree/master/tensorflow_hub/tools/make_image_classifier.

The make_image_classifier code is not a minimal working example, but as #64 (comment) says, a Keras model built under tf.distribute.MirroredStrategy that uses tf.data.Dataset should work on multi-GPU.

I plan to keep this issue open for a few weeks, in case anyone encounters any issues that I've missed during testing.
