
Fine tuning a model by building a Scaffold in Mirrored Strategy is not supported #21615

Closed
lamsalab opened this issue Aug 14, 2018 · 11 comments
Labels
comp:dist-strat (Distribution Strategy related issues), type:bug (Bug)

Comments


lamsalab commented Aug 14, 2018

System information

Have I written custom code: Yes
OS Platform and Distribution: Linux Ubuntu 18.04
Mobile device: N/A
TensorFlow installed from: source
TensorFlow version: ('v1.9.0-rc2-2165-g7b80190d52', '1.10.0-rc1')
Python version: 2.7
Bazel version: 0.15.2
GCC/Compiler version: gcc version 6.4.0 20180424
CUDA/cuDNN version: 9.1 / 7.1
GPU model and memory: K80, 12GB
Exact command to reproduce: N/A

Describe the problem

I have been trying to fine-tune my model using MirroredStrategy. To initialize the weights from a given checkpoint, I pass a Scaffold object, constructed with an init_fn, to the EstimatorSpec. However, when I run it, the program crashes (refer to the stack trace below for more details). I looked around for a similar issue but couldn't find any, so I had to dig through the source code to figure out the cause. While exploring, I ran into this message:
TODO(anjalisridhar): Figure out how to resolve the following scaffold parameters: init_feed_dict, init_fn
called here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/estimator.py#L1599

I was wondering whether fixing this is on the priority list, or whether it has been sitting on the back burner until other issues are resolved.

I am not sure how to present reproducible code in this context, but let me know if I need to clarify anything further.
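
That said, a minimal sketch of the kind of setup that triggers the crash looks roughly like the following (the model, checkpoint path, and helpers are placeholders, not the actual code):

    import tensorflow as tf

    def model_fn( features, labels, mode ):
        # Toy network standing in for the real model; assumes `features` is a
        # dense float tensor and `labels` holds integer class ids.
        logits = tf.layers.dense( features, 10 )
        loss = tf.losses.sparse_softmax_cross_entropy( labels=labels, logits=logits )
        train_op = tf.train.GradientDescentOptimizer( 0.01 ).minimize(
            loss, global_step=tf.train.get_or_create_global_step() )

        # Restore pretrained weights via the scaffold's init_fn. The Saver is
        # built here, before the graph is finalized, and captured by init_fn.
        saver = tf.train.Saver()
        def init_fn( scaffold, session ):
            saver.restore( session, '/path/to/pretrained/checkpoint' )

        scaffold = tf.train.Scaffold( init_fn=init_fn )
        return tf.estimator.EstimatorSpec( mode=mode, loss=loss,
                                           train_op=train_op, scaffold=scaffold )

    distribution = tf.contrib.distribute.MirroredStrategy( num_gpus=2 )
    config = tf.estimator.RunConfig( train_distribute=distribution )
    classifier = tf.estimator.Estimator( model_fn=model_fn, config=config )
    # classifier.train( input_fn=..., max_steps=... )  # fails as shown below

The same EstimatorSpec works when train_distribute is not set; the failing call, _combine_distributed_scaffold, is only reached on the distributed training path (see the traceback below).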

Source code / logs


INFO:tensorflow:Using config: {'_save_checkpoints_secs': 120, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7fd221bfad10>, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd221bfacd0>, '_model_dir': '../data/model2', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_global_id_in_cluster': 0, '_master': ''}
2018-08-11 08:50:44.010343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-08-11 08:50:44.164489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-08-11 08:50:44.164798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0, 1
2018-08-11 08:50:44.719187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 08:50:44.719236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] 0 1
2018-08-11 08:50:44.719243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0: N Y
2018-08-11 08:50:44.719247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 1: Y N
2018-08-11 08:50:44.719771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0, compute capability: 3.7)
2018-08-11 08:50:44.885422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/device:GPU:1 with 10757 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Configured nccl all-reduce.
2018-08-11 08:50:45.166638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0, 1
2018-08-11 08:50:45.166758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 08:50:45.166769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] 0 1
2018-08-11 08:50:45.166777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0: N Y
2018-08-11 08:50:45.166793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 1: Y N
2018-08-11 08:50:45.167078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0, compute capability: 3.7)
2018-08-11 08:50:45.167251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10757 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:batch_all_reduce invoked for batches size = 34 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
Traceback (most recent call last):
File "train.py", line 167, in
tf.app.run()
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train.py", line 164, in main
classifier.train( input_fn=_get_input, max_steps=FLAGS.max_num_steps )
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 343, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1127, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1256, in _train_model_distributed
grouped_estimator_spec.scaffold, self._train_distribution)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1576, in _combine_distributed_scaffold
init_fn = distribution.group(init_fn)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/training/distribute.py", line 1007, in group
return control_flow_ops.group(value, name=name)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3396, in group
"'%s' with type '%s'" % (inp, type(inp)))
TypeError: Expected tf.group() expected Tensor arguments not '<function at 0x7fcfe46a8c08>' with type '<type 'function'>'
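
For context on where this comes from: _combine_distributed_scaffold takes the scaffold's init_fn, which is a plain Python callable, and passes it to distribution.group(), which in turn hands it to tf.group(). Since tf.group() only accepts tensors and operations, the TypeError above is raised. A tiny illustration of that constraint, independent of Estimator:

    import tensorflow as tf

    noop = tf.no_op()
    tf.group( noop )           # fine: operations and tensors are accepted
    tf.group( lambda: None )   # TypeError: Expected tf.group() expected Tensor arguments ...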

tensorflowbutler added the stat:awaiting response (Status - Awaiting response from author) label Aug 15, 2018
@tensorflowbutler (Member)

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
Exact command to reproduce
Mobile device


weinman commented Sep 1, 2018

Yes, the code still fails with a freshly compiled master, 1bc856b.


robieta commented Sep 17, 2018

@anj-s @guptapriya Can you comment?


weinman commented Sep 17, 2018

Thanks. I'm not sure how best to construct a minimal working example, but here's a more concrete sketch than the description in the OP.

In our model_fn.py, which contains all of the custom Estimator code, we have something like ...

def train_fn( tune_from ):
    """Returns a model_fn that trains the model, restoring initial weights
    from the checkpoint given by tune_from"""

    def train( features, labels, mode ):

        # _get_training and _get_init_pretrained are helpers defined
        # elsewhere in model_fn.py
        train_op, loss = _get_training( )
        scaffold = tf.train.Scaffold( init_fn=
                                      _get_init_pretrained( tune_from ) )

        return tf.estimator.EstimatorSpec( mode=mode,
                                           loss=loss,
                                           train_op=train_op,
                                           scaffold=scaffold )
    return train

and then in our train.py driver

    # Initialize the classifier
    classifier = tf.estimator.Estimator( config=_get_config(), 
                                         model_fn=model_fn.train_fn(FLAGS.tune_from),
                                         model_dir=FLAGS.output )
   
    # Train the model
    classifier.train( input_fn=_get_input, max_steps=FLAGS.max_num_steps )

The root issue seems to be tied to this line/comment in estimator.py.
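
One possible alternative to a Scaffold init_fn, sketched here but not verified under MirroredStrategy, is tf.train.init_from_checkpoint, which rewrites the variables' initializers at graph-construction time inside the model_fn, so nothing has to be funneled through the scaffold. Assuming the same train_fn( tune_from ) wrapper as above:

    def train( features, labels, mode ):

        train_op, loss = _get_training( )

        # {'/': '/'} maps the checkpoint's root scope onto the current graph's
        # root scope, i.e. restores every variable whose name matches.
        tf.train.init_from_checkpoint( tune_from, { '/': '/' } )

        return tf.estimator.EstimatorSpec( mode=mode,
                                           loss=loss,
                                           train_op=train_op )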

anj-s added the comp:dist-strat (Distribution Strategy related issues) label Sep 18, 2018
guptapriya assigned anj-s and robieta and unassigned robieta Sep 18, 2018

weinman commented Oct 3, 2018

Yes.

seemuch self-assigned this Oct 19, 2018
anj-s removed their assignment Oct 28, 2018
Harshini-Gadige removed the stat:awaiting response (Status - Awaiting response from author) label Dec 14, 2018
@guptapriya (Contributor)

Does tensorflow/estimator@1f478bf#diff-d934b9f2c2d9384e077e7ab45e001f69 fix this issue? (My guess is not, but I want to check.)


bamine commented Feb 28, 2019

Hello, is there anything new on this issue? It seems Scaffold is simply unusable with MirroredStrategy; is there a better way?


weinman commented Mar 2, 2019

@bamine I haven't yet tried the new build suggested above, but I hope to soon. I don't know of any convenient alternatives at this time. Are there other ways of loading model parameters?

@rabintang

I have tried version 1.13.1 and encountered the same error.

robieta removed their assignment Feb 8, 2020
@sushreebarsa (Contributor)

Hi there,

We are moving this issue to closed status, as you are using an older version of TensorFlow (1.x), which has officially reached end of life. We recommend that you upgrade to version 2.4 or later and let us know if the issue still persists in newer versions.
Please open a new issue for any help you need with 2.x, and we will get you the right help. Thanks!

