
Fine tuning a model by building a Scaffold in Mirrored Strategy is not supported #21615

Closed
lamsalab opened this issue Aug 14, 2018 · 11 comments
Labels
comp:dist-strat (Distribution Strategy related issues), type:bug (Bug)

Comments


lamsalab commented Aug 14, 2018

System information

Have I written custom code: Yes
OS Platform and Distribution: Linux Ubuntu 18.04
Mobile device: N/A
TensorFlow installed from: source
TensorFlow version: ('v1.9.0-rc2-2165-g7b80190d52', '1.10.0-rc1')
Python version: 2.7
Bazel version: 0.15.2
GCC/Compiler version: gcc version 6.4.0 20180424
CUDA/cuDNN version: 9.1 / 7.1
GPU model and memory: K80, 12GB
Exact command to reproduce: N/A

Describe the problem

I have been trying to fine-tune my model using MirroredStrategy. To initialize the weights from a given checkpoint, I pass a Scaffold object, constructed with an init_fn, to the EstimatorSpec. However, when I run it, the program crashes (refer to the stack trace below for more details). I looked around for a similar issue but couldn't find any, so I had to dig through the source code to figure out the cause. While exploring, I ran into this message:
TODO(anjalisridhar): Figure out how to resolve the following scaffold parameters: init_feed_dict, init_fn
called here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/estimator.py#L1599

I was wondering whether fixing this is on the priority list, or whether it has been sitting on the back burner until other issues are resolved.

I am not sure how to present reproducible code in this context, but let me know if I need to clarify anything further.
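
That said, a minimal sketch of the kind of setup that triggers the crash looks roughly like the following (the model, checkpoint path, and helpers are placeholders, not the actual code):

    import tensorflow as tf

    def model_fn( features, labels, mode ):
        # Toy network standing in for the real model; assumes `features` is a
        # dense float tensor and `labels` holds integer class ids.
        logits = tf.layers.dense( features, 10 )
        loss = tf.losses.sparse_softmax_cross_entropy( labels=labels, logits=logits )
        train_op = tf.train.GradientDescentOptimizer( 0.01 ).minimize(
            loss, global_step=tf.train.get_or_create_global_step() )

        # Restore pretrained weights via the scaffold's init_fn. The Saver is
        # built here, before the graph is finalized, and captured by init_fn.
        saver = tf.train.Saver()
        def init_fn( scaffold, session ):
            saver.restore( session, '/path/to/pretrained/checkpoint' )

        scaffold = tf.train.Scaffold( init_fn=init_fn )
        return tf.estimator.EstimatorSpec( mode=mode, loss=loss,
                                           train_op=train_op, scaffold=scaffold )

    distribution = tf.contrib.distribute.MirroredStrategy( num_gpus=2 )
    config = tf.estimator.RunConfig( train_distribute=distribution )
    classifier = tf.estimator.Estimator( model_fn=model_fn, config=config )
    # classifier.train( input_fn=..., max_steps=... )  # fails as shown below

The same EstimatorSpec works when train_distribute is not set; the failing call, _combine_distributed_scaffold, is only reached on the distributed training path (see the traceback below).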

Source code / logs


INFO:tensorflow:Using config: {'_save_checkpoints_secs': 120, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7fd221bfad10>, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd221bfacd0>, '_model_dir': '../data/model2', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_global_id_in_cluster': 0, '_master': ''}
2018-08-11 08:50:44.010343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-08-11 08:50:44.164489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-08-11 08:50:44.164798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0, 1
2018-08-11 08:50:44.719187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 08:50:44.719236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] 0 1
2018-08-11 08:50:44.719243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0: N Y
2018-08-11 08:50:44.719247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 1: Y N
2018-08-11 08:50:44.719771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0, compute capability: 3.7)
2018-08-11 08:50:44.885422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/device:GPU:1 with 10757 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Configured nccl all-reduce.
2018-08-11 08:50:45.166638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0, 1
2018-08-11 08:50:45.166758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 08:50:45.166769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] 0 1
2018-08-11 08:50:45.166777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0: N Y
2018-08-11 08:50:45.166793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 1: Y N
2018-08-11 08:50:45.167078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0, compute capability: 3.7)
2018-08-11 08:50:45.167251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10757 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:batch_all_reduce invoked for batches size = 34 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
Traceback (most recent call last):
File "train.py", line 167, in
tf.app.run()
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train.py", line 164, in main
classifier.train( input_fn=_get_input, max_steps=FLAGS.max_num_steps )
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 343, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1127, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1256, in _train_model_distributed
grouped_estimator_spec.scaffold, self._train_distribution)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1576, in _combine_distributed_scaffold
init_fn = distribution.group(init_fn)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/training/distribute.py", line 1007, in group
return control_flow_ops.group(value, name=name)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3396, in group
"'%s' with type '%s'" % (inp, type(inp)))
TypeError: Expected tf.group() expected Tensor arguments not '<function at 0x7fcfe46a8c08>' with type '<type 'function'>'
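
For context on where this comes from: _combine_distributed_scaffold takes the scaffold's init_fn, which is a plain Python callable, and passes it to distribution.group(), which in turn hands it to tf.group(). Since tf.group() only accepts tensors and operations, the TypeError above is raised. A tiny illustration of that constraint, independent of Estimator:

    import tensorflow as tf

    noop = tf.no_op()
    tf.group( noop )           # fine: operations and tensors are accepted
    tf.group( lambda: None )   # TypeError: Expected tf.group() expected Tensor arguments ...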

tensorflowbutler added the stat:awaiting response (Status - Awaiting response from author) label Aug 15, 2018
@tensorflowbutler (Member)

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
Exact command to reproduce
Mobile device


weinman commented Sep 1, 2018

Yes, the code still fails with a freshly compiled master, 1bc856b.


robieta commented Sep 17, 2018

@anj-s @guptapriya Can you comment?


weinman commented Sep 17, 2018

Thanks. I'm not sure how best to construct a minimal working example, but here's a more concrete sketch than the description in the OP.

In our model_fn.py, which contains all of the custom Estimator code, we have something like ...

def train_fn( tune_from ):
    """Returns a model_fn that trains the model, restoring initial weights
    from the checkpoint given by tune_from"""

    def train( features, labels, mode ):

        # _get_training and _get_init_pretrained are helpers defined
        # elsewhere in model_fn.py
        train_op, loss = _get_training( )
        scaffold = tf.train.Scaffold( init_fn=
                                      _get_init_pretrained( tune_from ) )

        return tf.estimator.EstimatorSpec( mode=mode,
                                           loss=loss,
                                           train_op=train_op,
                                           scaffold=scaffold )
    return train

and then in our train.py driver

    # Initialize the classifier
    classifier = tf.estimator.Estimator( config=_get_config(), 
                                         model_fn=model_fn.train_fn(FLAGS.tune_from),
                                         model_dir=FLAGS.output )
   
    # Train the model
    classifier.train( input_fn=_get_input, max_steps=FLAGS.max_num_steps )

The root issue seems to be tied to this line/comment in estimator.py.
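
One possible alternative to a Scaffold init_fn, sketched here but not verified under MirroredStrategy, is tf.train.init_from_checkpoint, which rewrites the variables' initializers at graph-construction time inside the model_fn, so nothing has to be funneled through the scaffold. Assuming the same train_fn( tune_from ) wrapper as above:

    def train( features, labels, mode ):

        train_op, loss = _get_training( )

        # {'/': '/'} maps the checkpoint's root scope onto the current graph's
        # root scope, i.e. restores every variable whose name matches.
        tf.train.init_from_checkpoint( tune_from, { '/': '/' } )

        return tf.estimator.EstimatorSpec( mode=mode,
                                           loss=loss,
                                           train_op=train_op )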

anj-s added the comp:dist-strat (Distribution Strategy related issues) label Sep 18, 2018
guptapriya assigned anj-s and robieta and unassigned robieta Sep 18, 2018

weinman commented Oct 3, 2018

Yes.

seemuch self-assigned this Oct 19, 2018
anj-s removed their assignment Oct 28, 2018
Harshini-Gadige removed the stat:awaiting response (Status - Awaiting response from author) label Dec 14, 2018
@guptapriya (Contributor)

Does tensorflow/estimator@1f478bf#diff-d934b9f2c2d9384e077e7ab45e001f69 fix this issue? (My guess is not, but I want to check.)


bamine commented Feb 28, 2019

Hello, is there anything new on this issue? It seems Scaffold is simply unusable with MirroredStrategy; is there a better way?


weinman commented Mar 2, 2019

@bamine I haven't yet tried the new build suggested above, but I hope to soon. I don't know of any convenient alternatives at this time. Are there other ways of loading model parameters?

@rabintang

I have tried version 1.13.1 and encountered the same error.

robieta removed their assignment Feb 8, 2020
@sushreebarsa (Contributor)

Hi there,

We are moving this issue to closed status, as you are using an older version of TensorFlow (1.x), which has officially reached end of life. We recommend that you upgrade to version 2.4 or later and let us know if the issue still persists in newer versions.
Please open a new issue for any help you need with 2.x, and we will get you the right help. Thanks!

