Multiple GPU in model_main.py (since there is no more train.py) #5421
Comments
You may find train.py in the "legacy" folder.
It doesn't work... It says there are not enough values to unpack (expected 7, got 0).
train.py no longer works, returning the following error:
We need multi-GPU support for model_main.py if it's going to be the only way to use the Object Detection API.
I think you need to change the batch size in the config file. (Batch size here is not per GPU, but the sum across all GPUs; so batch size = number of GPUs * batch size per GPU.)
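For reference, this is set in the pipeline config's train_config. A minimal sketch, assuming a 2-GPU setup (the field name comes from the Object Detection API pipeline config; the value is illustrative):

train_config {
  # Total batch size summed across all GPUs, e.g. 2 GPUs * 4 images per GPU:
  batch_size: 8
}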
Has anybody done multi-GPU training using train.py?
I just tried changing the batch size to 2, and noticed that the second GPU was still not utilized. In fact, the one GPU spun up to ~80% usage.
I'm seeing the same thing as @nigelmathes
Set num_clones = 2
@varun19299 that option is only available for the old train.py.
I don't think the new model_main.py supports multiple GPUs. This is probably because Estimator distribution strategies don't work with tf.contrib.slim. Could one of the maintainers explain this?
@varun19299, @nealwu
As @varun19299 said, model_main.py doesn't support multi-GPU because Estimator distribution strategies don't work with tf.contrib.slim. But you can still train with multiple GPUs via the legacy train.py. Actually, all configs work with both the new and the legacy binary. Just note that you need to set replicas_to_aggregate in train_config properly.
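For illustration, a hedged sketch of the relevant train_config fields for a 2-GPU setup (field names from the Object Detection API's train.proto; values are placeholders):

train_config {
  sync_replicas: true
  replicas_to_aggregate: 2  # e.g. one replica per GPU when training on 2 GPUs
}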
Hi @pkulzc, I'm trying to train an RCNN using Cloud ML (Object Detection API), runtime version 1.9, and if I set the num_clones option along with train.py, I get an unrecognized-option error. Is there a specific runtime version necessary to work with train.py?
My implementation of Faster R-CNN and Mask R-CNN supports multi-GPU and distributed training.
@pkulzc I tried to train a quantized model using the legacy train.py with multiple GPUs, but it seems not to work.
@oopsodd I met the same problem. Have you solved it?
@pkulzc What is the meaning of replicas_to_aggregate? If I want to run SSD training on two GPUs, what value should I set replicas_to_aggregate to?
@pkulzc Then if I replace all the slim layers with tf.keras.layers, will model_main.py be able to run on multiple GPUs with ease, or would I need to make many other contributions for multi-GPU training? If so, could you give me some useful keywords to study multi-GPU training and contribute some code for this API?
imo, the bigger incompatibility would be with
@varun19299 There's already a Keras model in this API repository that doesn't use any slim code; see https://github.com/tensorflow/models/blob/master/research/object_detection/models/keras_applications/mobilenet_v2.py. Have you tested multi-GPU training with this Keras model? It would be appreciated if you shared your observations about multi-GPU training with Estimators. Of course, I will also try this in a few days.
@samuel1208 Sorry, I didn't. How about you?
@donghyeon Have you tested multi-GPU training with the Keras model you mentioned above? Thanks very much. For multi-GPU training, I tried removing everything related to slim.arg_scope and explicitly re-constructing/re-initializing the model network with tf.layers, but it didn't work with multiple GPUs. Did you succeed, or do I need to put in a lot more effort for multi-GPU training? Thanks.
I think it would be nice to treat this as a priority: training on 1 GPU is almost impossible.
I used the following script to run single-machine multi-GPU training of faster_rcnn_resnet50:
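(The script itself was not captured in this thread; below is a minimal sketch of the invocation, assuming the legacy trainer and the flags quoted later in this thread. Paths are placeholders.)

python object_detection/legacy/train.py \
    --logtostderr \
    --pipeline_config_path=/path/to/faster_rcnn_resnet50.config \
    --train_dir=/path/to/train_dir \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1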
@lighTQ train.py is legacy at this point, and the API has moved away from it. In fact, with the most recent version of this repository, train.py throws errors. As far as I can tell, there is still no multi-GPU training in model_main.py.
@waltermaldonado I have tested the official model_main.py with MirroredStrategy, and it worked. So it seems that the slim module can work under MirroredStrategy after all.
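(For context, the kind of change being discussed looks roughly like the following inside model_main.py; a hedged sketch for the TF 1.x Estimator API, not the exact line shared in this thread. num_gpus is illustrative.)

# Replace the plain RunConfig in model_main.py with one that carries
# a distribution strategy (TF 1.x contrib API):
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,
                                train_distribute=strategy)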
UPDATE: as for the problem I mentioned above when using MirroredStrategy, it seems to be due to an incompatibility between Python 2 and Python 3.
@CasiaFan I can't make model_main.py work with MirroredStrategy. I am using TensorFlow 1.13.1 with ssd_mobilenet_v2_coco.
Traceback (most recent call last):
File "model_main.py", line 112, in <module>
tf.app.run()
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "model_main.py", line 108, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
return self.run_local()
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
saving_listeners=saving_listeners)
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
self._config._train_distribute, input_fn, hooks, saving_listeners)
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1254, in _actual_train_model_distributed
self.config))
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1199, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 641, in _call_for_each_replica
return _call_for_each_replica(self._container_strategy(), fn, args, kwargs)
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 189, in _call_for_each_replica
coord.join(threads)
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/prog/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 167, in _call_for_each_replica
merge_args = values.regroup({t.device: t.merge_args for t in threads})
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 997, in regroup
for i in range(len(v0)))
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 997, in <genexpr>
for i in range(len(v0)))
File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 1010, in regroup
assert set(v.keys()) == v0keys
AssertionError
Any idea for me?
@paillardf Umm... it seems some nodes in the graph on each device are different. Did you add some operations on a specific device, or did you just use the default Object Detection API?
@CasiaFan I didn't change anything in the object_detection folder except the line you gave us. I am up to date with the models repo as well.
@CasiaFan Would you please post the commit hash where you cloned/last merged from the repo? I am running TF 1.13.1 but getting a different error message.
@lighTQ Thanks, it's working with the options: --worker_replicas=2 --num_clones=2 --ps_tasks=1
Yes, this is the way to do it with the legacy train scripts. We just can't do the training with multiple GPUs using model_main.py.
@ChanZou @paillardf Sorry for responding late. This issue seems to be related to the combined usage of MirroredStrategy and the saver defined in the EstimatorSpec's scaffold (see below). I have now upgraded my TF to version 1.14. Adding only the following code will cause problems like this:

# FLAGS.gpus is a comma-separated list of GPU indices, e.g. "0,1"
gpu_devices = ["/device:GPU:{}".format(x) for x in range(len(FLAGS.gpus.split(",")))]
# Mirror the model across the listed devices, reducing gradients with
# a hierarchical copy all-reduce
strategy = tf.distribute.MirroredStrategy(devices=gpu_devices,
                                          cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(num_packs=1))
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,
                                train_distribute=strategy,
                                save_checkpoints_steps=1500)
And this error seems to be caused by the scaffold defining a saver for the EstimatorSpec. After commenting out that snippet of code (lines 501-509), training with MirroredStrategy proceeds.
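(For readers following along, the pattern being blamed looks roughly like this; a hedged illustration, not the exact code at those lines. mode, total_loss, and train_op are placeholders.)

# An EstimatorSpec whose scaffold supplies its own Saver; this custom
# saver is what reportedly conflicts with MirroredStrategy here.
scaffold = tf.train.Scaffold(saver=tf.train.Saver(sharded=True))
return tf.estimator.EstimatorSpec(mode=mode, loss=total_loss,
                                  train_op=train_op, scaffold=scaffold)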
@CasiaFan Thank you for sharing; it is super helpful! I made the same discovery and followed #27392 to modify the TF code.
@lighTQ, @pkulzc if we are scaling on a single machine with 2 GPUs, should we keep
@CasiaFan Thanks to your comments, I was able to get model_main.py running with multi-GPU training. Thanks a lot. But when I compared the two ways of training with multiple GPUs, 1) legacy/train.py and 2) model_main.py with MirroredStrategy, with the same model (faster_rcnn_resnet101 and its configs), I found some errors. Q1) When I set batch_size = 1 and gpu = 2, 1) threw an error. Q2) When I set batch_size = 8 with gpu = 1, both 1) and 2) worked fine. So I guess that MirroredStrategy does not work fine with batch_size = batch per GPU * number of GPUs. And from those results, I think the training result of (batch_size=16, gpu=2 with legacy/train.py) is the same as (batch_size=8, gpu=2 with model_main.py + MirroredStrategy). Thanks.
@kjkim-kr Thanks for reporting this case. I didn't try legacy/train.py, but I checked trainer.py and found it allocates the data queue and training process to each GPU using tf.device directly, which should be more efficient since that's a fairly fundamental API. Have you tried setting a smaller batch size in both modes to check whether the MirroredStrategy mode occupies more memory?
Yes, I switched to legacy/train and it worked.
Yes, as I said above, I was wondering whether or not there was a fix for multi-GPU training with model_main.py rather than legacy/train.py. model_main.py is the more recent release that allows evaluation on the fly, and ideally it should support multi-GPU for an object detection task.
@harsh306, have you managed to run eval.py after training with num_clones > 1?
@royshil That solution seems to be extremely slow, and it doesn't utilize the full power of all GPUs.
Hi, did anyone manage to get multi-GPU training working with model_main.py?
Hi! Has anyone achieved running model_main.py on multiple GPUs?
@laksgreen @lighTQ
I have done the multi-GPU training using the above commands, and training completed successfully. Keep in mind that after training, when you export the model checkpoints, you will run into a clone-related issue.
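(For reference, the standard export invocation looks like this; flag names are from the repo's export_inference_graph.py, paths and the checkpoint number are placeholders, and this alone does not address the clone issue.)

python object_detection/export_inference_graph.py \
    --input_type=image_tensor \
    --pipeline_config_path=/path/to/pipeline.config \
    --trained_checkpoint_prefix=/path/to/model.ckpt-XXXX \
    --output_directory=/path/to/exported_model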
Could you please share the code you have written to tackle the clone issue?
Thanks
Greetings,
I would like to know how to proceed to use both of my GPUs for training a Faster R-CNN with NASNet-A featurization model with the model_main.py file included in the object_detection tools, now that train.py is gone. If it is not possible, I would like to request this feature or a workaround to make it work.
Thanks in advance.