Fine tuning a model by building a Scaffold in Mirrored Strategy is not supported #21615
Comments
Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Yes, the code still fails with a freshly compiled master, 1bc856b.
@anj-s @guptapriya Can you comment?
Thanks. I'm not sure how to best construct a minimum working example, but here's a more concrete sketch than the description in the OP. In our
and then in our
The root issue seems to be tied to this line/comment in
Yes.
Does tensorflow/estimator@1f478bf#diff-d934b9f2c2d9384e077e7ab45e001f69 fix this issue? (My guess is not, but I want to check.)
Hello, anything new on this issue? It seems Scaffold is simply unusable with MirroredStrategy; is there a better way?
@bamine I haven't tried the new build suggested above yet, but I hope to soon. I don't know of any convenient alternatives at this time. Are there other ways of loading model parameters?
I have tried version 1.13.1 and encountered the same error.
Hi there, we are moving this issue to closed status, as you are using an older version of TensorFlow (1.x), which has officially reached end of life. We recommend that you upgrade to version 2.4 or later and let us know if the issue still persists in newer versions.
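One possible workaround (untested here; the scope names, paths, and helper names below are hypothetical) is to skip `Scaffold` entirely and call `tf.train.init_from_checkpoint` inside `model_fn`. That API rewrites variable initializers instead of registering an `init_fn` callable, so it never reaches the `distribution.group()` path that crashes. A sketch, with the name-mapping logic kept in plain Python:

```python
def build_assignment_map(variable_names, ckpt_scope, model_scope):
    """Map checkpoint variable names (under `ckpt_scope`) to the
    corresponding variable names under `model_scope` in the current
    graph. Keys are names in the checkpoint, values are names in the
    graph, matching tf.train.init_from_checkpoint's assignment_map."""
    prefix = model_scope + "/"
    return {ckpt_scope + "/" + name[len(prefix):]: name
            for name in variable_names
            if name.startswith(prefix)}


def restore_pretrained(checkpoint_path, assignment_map):
    # TF 1.x only; imported lazily so the pure-Python helper above
    # can be used and tested without TensorFlow installed.
    import tensorflow as tf
    # Rewrites the initializers of the mapped variables; called from
    # inside model_fn, before the EstimatorSpec is returned, so no
    # Scaffold/init_fn is needed at all.
    tf.train.init_from_checkpoint(checkpoint_path, assignment_map)
```

For example, `restore_pretrained("/path/to/model.ckpt", build_assignment_map(names, "pretrained", "encoder"))` would map `pretrained/conv/kernel` in the checkpoint onto `encoder/conv/kernel` in the graph.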
System information
Have I written custom code: Yes
OS Platform and Distribution: Linux Ubuntu 18.04
Mobile device: N/A
TensorFlow installed from: source
TensorFlow version: ('v1.9.0-rc2-2165-g7b80190d52', '1.10.0-rc1')
Python version: 2.7
Bazel version: 0.15.2
GCC/Compiler version: gcc version 6.4.0 20180424
CUDA/cuDNN version: 9.1 / 7.1
GPU model and memory: K80, 12GB
Exact command to reproduce: N/A
Describe the problem
I have been trying to fine-tune my model using MirroredStrategy. To initialize the weights from a given checkpoint, I pass a Scaffold object to the EstimatorSpec, with init_fn as a parameter. However, when I run it, the program crashes (see the stack trace below for details). I looked around for similar issues but couldn't find any, so I dug through the source code to figure out the cause. While exploring, I ran into this message:
TODO(anjalisridhar): Figure out how to resolve the following scaffold parameters: init_feed_dict, init_fn
called here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/estimator.py#L1599
I was wondering whether fixing this is on the priority list, or whether it has been sitting on the back burner until other issues are resolved.
I am not sure how to present reproducible code in this context, but do let me know if I need to clarify further.
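For reference, a minimal sketch of the setup that triggers the failure (assuming TF 1.10 with `tf.contrib.distribute`; the model, input pipeline, and checkpoint path are hypothetical placeholders, not the actual code):

```python
import numpy as np
import tensorflow as tf


def model_fn(features, labels, mode):
    logits = tf.layers.dense(features, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())

    def init_fn(scaffold, session):
        # Restore weights from a prior checkpoint (placeholder path).
        tf.train.Saver().restore(session, "/path/to/pretrained/model.ckpt")

    # Passing init_fn via Scaffold is what trips up MirroredStrategy.
    scaffold = tf.train.Scaffold(init_fn=init_fn)
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, train_op=train_op, scaffold=scaffold)


def input_fn():
    x = np.random.rand(8, 4).astype(np.float32)
    y = np.random.randint(0, 10, size=8)
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(4)


config = tf.estimator.RunConfig(
    train_distribute=tf.contrib.distribute.MirroredStrategy(num_gpus=2))
est = tf.estimator.Estimator(model_fn=model_fn, config=config)
# est.train(input_fn, max_steps=1)  # raises the TypeError shown below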
Source code / logs
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 120, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7fd221bfad10>, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd221bfacd0>, '_model_dir': '../data/model2', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_global_id_in_cluster': 0, '_master': ''}
2018-08-11 08:50:44.010343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-08-11 08:50:44.164489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-08-11 08:50:44.164798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0, 1
2018-08-11 08:50:44.719187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 08:50:44.719236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] 0 1
2018-08-11 08:50:44.719243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0: N Y
2018-08-11 08:50:44.719247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 1: Y N
2018-08-11 08:50:44.719771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0, compute capability: 3.7)
2018-08-11 08:50:44.885422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/device:GPU:1 with 10757 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Configured nccl all-reduce.
2018-08-11 08:50:45.166638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0, 1
2018-08-11 08:50:45.166758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 08:50:45.166769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] 0 1
2018-08-11 08:50:45.166777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0: N Y
2018-08-11 08:50:45.166793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 1: Y N
2018-08-11 08:50:45.167078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0, compute capability: 3.7)
2018-08-11 08:50:45.167251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10757 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:batch_all_reduce invoked for batches size = 34 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
Traceback (most recent call last):
File "train.py", line 167, in <module>
tf.app.run()
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train.py", line 164, in main
classifier.train( input_fn=_get_input, max_steps=FLAGS.max_num_steps )
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 343, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1127, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1256, in _train_model_distributed
grouped_estimator_spec.scaffold, self._train_distribution)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1576, in _combine_distributed_scaffold
init_fn = distribution.group(init_fn)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/training/distribute.py", line 1007, in group
return control_flow_ops.group(value, name=name)
File "/home/weinman/virtualenv/tf-master-7b80190d52/local/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3396, in group
"'%s' with type '%s'" % (inp, type(inp)))
TypeError: Expected tf.group() expected Tensor arguments not '<function at 0x7fcfe46a8c08>' with type '<type 'function'>'
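The error is consistent with the TODO quoted above: `_combine_distributed_scaffold` passes each replica's `init_fn` (a plain Python callable) to `distribution.group()`, which forwards it to `tf.group()`, and `tf.group()` only accepts ops and tensors. A plain-Python analogue of that type check (the `op` attribute stands in for a real graph op; `group_like_tf` and `FakeOp` are illustrative names, not TensorFlow APIs):

```python
def group_like_tf(*inputs):
    # Mimics tf.group()'s argument validation: each input must be
    # "op-like" (the real tf.group checks for Tensor/Operation types).
    for inp in inputs:
        if not hasattr(inp, "op"):
            raise TypeError(
                "Expected tf.group() expected Tensor arguments not "
                "'%s' with type '%s'" % (inp, type(inp)))
    return "grouped"


class FakeOp:
    op = "noop"  # stands in for a graph operation


init_fn = lambda scaffold, session: None  # what Scaffold carries

group_like_tf(FakeOp())     # fine: op-like input
try:
    group_like_tf(init_fn)  # reproduces the TypeError above
except TypeError as e:
    print(type(e).__name__)
```

This is why any `Scaffold` with `init_fn` (or `init_feed_dict`) set fails under MirroredStrategy: the callable cannot be grouped like an op.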