Compiling model including hub.KerasLayer fails in distribution strategy scope. #469

Closed
uchua opened this issue Jan 8, 2020 · 11 comments

Comments

@uchua

uchua commented Jan 8, 2020

When trying to compile a model that includes a hub.KerasLayer I get the following error:

ValueError: Variable (<tf.Variable 'bert/embeddings/word_embeddings:0' shape=(119547, 768) dtype=float32>) was not created in the distribution strategy scope of (<tensorflow.python.distribute.tpu_strategy.TPUStrategy object at 0x7fb17c83f3c8>). It is most likely due to not all layers or the model or optimizer being created outside the distribution strategy scope. Try to make sure your code looks similar to the following.
with strategy.scope():
  model=_create_model()
  model.compile(...)

Code used to build and compile model:

with tpu_strategy.scope():
    in_id = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), name="input_ids", dtype=np.int32)
    in_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), name="input_masks", dtype=np.int32)
    in_segment = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), name="segment_ids", dtype=np.int32)
    bert_inputs = {"input_ids": in_id, "input_mask": in_mask, "segment_ids": in_segment}

    bert_output = hub.KerasLayer(BERT_MODEL_HUB, trainable=True, signature="tokens", output_key="pooled_output")(bert_inputs)

    dense = tf.keras.layers.Dense(256, input_shape=(768,), activation='relu')(bert_output)
    pred = tf.keras.layers.Dense(len(unique_labels), activation='sigmoid')(dense)

    model = tf.keras.models.Model(inputs=bert_inputs, outputs=pred)

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        loss=tf.keras.losses.binary_crossentropy,
        metrics=["accuracy"]
    )

This is the model I'm loading in the KerasLayer: https://tfhub.dev/google/bert_multi_cased_L-12_H-768_A-12/1

The entire model is created within the scope of the distribution strategy, so I'm not sure what the cause of the error would be, other than a bug.

@rmothukuru rmothukuru self-assigned this Jan 9, 2020
@arnoegw arnoegw assigned arnoegw and unassigned rmothukuru Jan 9, 2020
@arnoegw
Contributor

arnoegw commented Jan 9, 2020

Yes, this may well be due to a bug in TensorFlow 2.0. Sorry about that. Please try a recent tf-nightly PIP package and reopen if the issue persists.

@arnoegw arnoegw closed this as completed Jan 9, 2020
@arnoegw
Contributor

arnoegw commented Jan 9, 2020

@uchua
Author

uchua commented Jan 9, 2020

Edit:

Apparently tf-nightly did install correctly, but when you run tf.version.GIT_VERSION it tells you it's version 1.12.
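A hedged aside on checking the installed release: since tf.version.GIT_VERSION reported 1.12 here even though the nightly package installed correctly, tf.version.VERSION (the same string as tf.__version__) is probably the more direct value to inspect. The sketch below parses such a release string in plain Python. release_tuple is a hypothetical helper, and "2.1.0-dev20200109" is an assumed example format, not output captured from this thread.

```python
import re


def release_tuple(version: str) -> tuple:
    """Extract (major, minor, patch) from a version string such as '2.1.0-rc2'.

    Hypothetical helper for illustration; a leading 'v' (as in
    git-describe style strings) is tolerated.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


# Assumed example strings; in a real session pass tf.version.VERSION instead.
print(release_tuple("2.1.0-dev20200109") >= (2, 1, 0))
print(release_tuple("v2.1.0-rc2-17-ge5bf8de"))
```

Comparing tuples rather than raw strings avoids surprises such as "2.10.0" sorting before "2.3.0".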

Strangely enough, installing tf-nightly gave me TensorFlow version 1.12. I was able to install TensorFlow v2.1.0-rc2-17-ge5bf8de by running pip install tensorflow-gpu==2.1, but then I get these errors when trying to create the TPU distribution strategy:

UnimplementedError                        Traceback (most recent call last)
<ipython-input-25-6596c0485615> in <module>
      7 
      8     resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
----> 9     tf.config.experimental_connect_to_host(resolver.master())
     10     tf.tpu.experimental.initialize_tpu_system(resolver)
     11     tpu_strategy = tf.distribute.experimental.TPUStrategy(resolver)

~/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/remote.py in connect_to_remote_host(remote_host, job_name)
     75       {job_name: [_strip_prefix(host, _GRPC_PREFIX) for host in remote_hosts]})
     76 
---> 77   connect_to_cluster(cluster_spec)
     78 
     79 

~/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/remote.py in connect_to_cluster(cluster_spec_or_resolver, job_name, task_index, protocol, make_master_device_default)
    140     context.set_server_def(server_def)
    141   else:
--> 142     context.update_server_def(server_def)
    143 
    144   if make_master_device_default and isinstance(

~/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/context.py in update_server_def(server_def)
   2017 
   2018 def update_server_def(server_def):
-> 2019   context().update_server_def(server_def)
   2020 
   2021 

~/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/context.py in update_server_def(self, server_def, keep_alive_secs)
    592       pywrap_tensorflow.TFE_ContextUpdateServerDef(self._context_handle,
    593                                                    keep_alive_secs,
--> 594                                                    server_def_str)
    595       self._initialize_logical_devices()
    596 

UnimplementedError: 
Additional GRPC error information:
{"created":"@1578592686.165655670","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"","grpc_status":12}

This is the code for creating the strategy:

TPU_ADDRESS = "grpc://" + "10.0.0.2:8470"

with tf.compat.v1.Session(TPU_ADDRESS) as session:
    print('TPU devices:')
    pprint.pprint(session.list_devices())

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
tf.config.experimental_connect_to_host(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
tpu_strategy = tf.distribute.experimental.TPUStrategy(resolver)

So it looks like there might be a bug in TensorFlow 2.1? Or should I be creating the TPU distribution strategy differently?

@arnoegw
Contributor

arnoegw commented Jan 10, 2020

Please help me investigate further. Do I guess correctly that your code runs on a Colab notebook hosted on colab.research.google.com? Please share a Colab notebook with me to reproduce the issue with TF 2.1 (possibly prerelease) and BERT from a TF2-style SavedModel, such as https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/1.

@arnoegw arnoegw reopened this Jan 10, 2020
@uchua
Author

uchua commented Jan 10, 2020

@arnoegw I'm using Colab and a Google Cloud TPU instance. The error I posted previously appears to be caused by tf.config.experimental_connect_to_host(resolver.master()) having already been run. However, I still encounter another error while trying to create the TPU strategy, and it appears to be an issue with TensorFlow, not TensorFlow Hub.
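For anyone hitting the same UnimplementedError: one way to keep the connection step from running twice in a notebook session is to cache the strategy per TPU address. This is only a sketch under assumptions. get_tpu_strategy is a hypothetical helper, it uses tf.config.experimental_connect_to_cluster (available from TF 2.1) in place of experimental_connect_to_host, and it has not been run against the setup described in this thread.

```python
import functools

import tensorflow as tf


@functools.lru_cache(maxsize=None)
def get_tpu_strategy(tpu_address: str):
    """Connect to the TPU once per address and return a cached strategy.

    Hypothetical helper: later calls with the same address return the
    cached object instead of reconnecting.
    """
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # Newer 2.x releases also expose this class as tf.distribute.TPUStrategy.
    return tf.distribute.experimental.TPUStrategy(resolver)


# Usage (needs a reachable TPU, so it is left commented out):
# tpu_strategy = get_tpu_strategy("grpc://10.0.0.2:8470")
# tpu_strategy = get_tpu_strategy("grpc://10.0.0.2:8470")  # cached, no reconnect
```

The cache lives on the function object, so re-running the cell that defines get_tpu_strategy resets it; keeping the definition in a cell that is executed once avoids that.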

@makecent

makecent commented Dec 6, 2020

Using TF-2.3, still got the same problem. I use distribute.MirroredStrategy().scope(). Any idea?

@ucesfpa

ucesfpa commented Dec 21, 2020

Using TF-2.3, still got the same problem. I use distribute.MirroredStrategy().scope(). Any idea?

I also have the same issue. Please help!

@arnoegw
Contributor

arnoegw commented Feb 11, 2021

FWIW, https://www.tensorflow.org/tutorials/text/solve_glue_tasks_using_bert_on_tpu shows how it works for TF2.4. (It also worked for TF2.3, but in Colab you can't change the TPU worker version to try.)

@HTGorji

HTGorji commented May 16, 2022

I get an error when I compile my model, which is a DeepLab network, under TensorFlow 2.2.0.
Could you please help me with that?

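def create_model(args, model_name, output_dim):
    # Build the requested model; only DeepLabV3+ is handled here.
    if model_name == "deeplabV3_plus":
        input_shape = (args.input_size, args.input_size, 3)
        return DeepLabV3Plus(input_shape)
    # Fail loudly instead of returning an unbound `model`.
    raise ValueError(f"Unknown model name: {model_name!r}")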

Then later, in one section of my code, I want to train my model, but I get the following error.

        strategy = tf.distribute.MirroredStrategy()
        with strategy.scope():
            METRICS = [
                keras.metrics.TruePositives(name='tp'),
                keras.metrics.FalsePositives(name='fp'),
                keras.metrics.TrueNegatives(name='tn'),
                keras.metrics.FalseNegatives(name='fn')
            ]
            model = self.model
            model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=METRICS)

        history = model.fit(train_data,
                            verbose=2,
                            epochs=args.epochs,
                            shuffle=False)
 raise ValueError(
ValueError: Variable (<tf.Variable 'conv1_conv/kernel:0' shape=(7, 7, 3, 64) dtype=float32, numpy=
array([[[[-0.02878591,  0.03968531,  0.03641525, ..., -0.03084487,
          -0.02158457, -0.014132  ],
         ...,
         [ 0.01785033, -0.00887285, -0.03673003, ...,  0.03627278,
          -0.00523347, -0.01229645]]]], dtype=float32)>) was not created in the distribution strategy scope of (<tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x0000020E2492E970>). It is most likely because some layers, model, or optimizer was being created outside the distribution strategy scope. Try to make sure your code looks similar to the following.
with strategy.scope():
  model=_create_model()
  model.compile(...)

@HTGorji

HTGorji commented May 16, 2022

Update:
When I create the model directly with model = DeepLabV3Plus(input_shape) instead of using self.model, everything is fine.
The difference is that when I create the model directly, all layer variables are generated as MirroredVariable, but when I use self.model they are plain tf.Variable. Could you let me know how I can convert a tf.Variable to a MirroredVariable? Maybe that would solve my problem.
It's weird: self.model and model = DeepLabV3Plus(input_shape) are the same model, so I don't know why the result under strategy.scope() differs between the two.

[MirroredVariable:{
0: <tf.Variable 'conv1_conv/kernel:0' shape=(7, 7, 3, 64) dtype=float32, numpy=
array([[[[ 2.74782069e-02, -2.23000664e-02, 4.21437509e-02, ...,
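On the question above: there does not appear to be a supported way to convert an existing tf.Variable into a MirroredVariable. The usual approach is to construct the model (not only compile it) inside strategy.scope(), as the error message suggests. Assuming self.model was built before the scope was entered, that would explain the plain tf.Variable objects. The sketch below shows the difference with bare variables; created_in_scope is a hypothetical helper built on strategy.extended.variable_created_in_scope, and the variable names are made up for illustration.

```python
import tensorflow as tf


def created_in_scope(strategy, variables):
    """Return True if every variable was created under strategy.scope().

    Hypothetical helper wrapping the per-variable check exposed by
    tf.distribute.StrategyExtended.
    """
    return all(strategy.extended.variable_created_in_scope(v) for v in variables)


strategy = tf.distribute.MirroredStrategy()

# Created before entering the scope: a plain variable the strategy does not own.
outside = tf.Variable(tf.zeros([7, 7, 3, 64]), name="outside_kernel")

with strategy.scope():
    # Created inside the scope: the strategy wraps it as a distributed variable.
    inside = tf.Variable(tf.zeros([7, 7, 3, 64]), name="inside_kernel")

print(type(outside).__name__, created_in_scope(strategy, [outside]))
print(type(inside).__name__, created_in_scope(strategy, [inside]))
```

Applied to the code in this comment, the analogous change would be to call DeepLabV3Plus(input_shape) (or create_model(...)) inside the with strategy.scope(): block rather than reusing a model that was constructed earlier.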

@Stanfordrods1999

Why is this issue closed? Is there a solution??
