
Multiple GPU in model_main.py (since there is no more train.py) #5421

waltermaldonado opened this issue Oct 2, 2018 · 63 comments

@waltermaldonado

waltermaldonado commented Oct 2, 2018


System information

  • What is the top-level directory of the model you are using: research/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: v1.10.1-0-g4dcfddc5d1 1.10.1
  • Bazel version (if compiling from source): NA
  • CUDA/cuDNN version: V9.0.176
  • GPU model and memory: 2x Tesla P100 16Gb
  • Exact command to reproduce: NA

Describe the problem


Greetings,

I would like to know how to use both of my GPUs to train a Faster R-CNN with NASNet-A featurization model with the model_main.py file included in the object_detection tools, now that train.py is gone. If that is not possible, I would like to request this feature or a workaround to make it work.

Thanks in advance.

@Burgomehl

You may find train.py in the "legacy" folder.

@waltermaldonado
Author

It doesn't work.

It says there are not enough values to unpack (expected 7, got 0).

@nigelmathes

train.py no longer works, returning the following error:

 File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

We need multiple GPU support for model_main.py if it's going to be the only way to use the Object Detection API.

@varun19299

I think you need to change the batch size in the config file. (The batch size here is not per GPU but the total across all GPUs, i.e. batch size = number of GPUs * batch size per GPU.)

train.py no longer works, returning the following error:

 File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

We need multiple GPU support for model_main.py if it's going to be the only way to use the Object Detection API.

@rickragv

rickragv commented Oct 6, 2018

Has anybody done multi-GPU training using train.py?

@nigelmathes

I think you need to change the batch size in the config file. (Batch size here is not per GPU, rather sum of all ; so Batch size = No of GPUs * Batch size per GPU)

train.py no longer works, returning the following error:

 File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

I just tried changing the batch size to 2 and noticed that the second GPU was still not utilized. In fact, one GPU spun up to ~80% usage in nvidia-smi (as expected), but then the whole system hit a core dump memory error with the following stack trace:

INFO:tensorflow:loss = 3.4643404, step = 23013
I1009 13:08:23.581555 140509247522560 tf_logging.py:115] loss = 3.4643404, step = 23013
*** Error in `python': double free or corruption (fasttop): 0x00007fc3b0022e90 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fca520eb7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fca520f437a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fca520f853c]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6Tensor16CopyFromInternalERKS0_RKNS_11TensorShapeE+0xbe)[0x7fca042dca9e]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow25NonMaxSuppressionV3V4Base7ComputeEPNS_15OpKernelContextE+0x7a)[0x7fca0796087a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow13TracingDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0xbc)[0x7fca0443c37c]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x63a4bc)[0x7fca044804bc]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x63ae2a)[0x7fca04480e2a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x21a)[0x7fca044ee96a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x32)[0x7fca044eda12]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fc9fac6dc80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fca524456ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fca5217b41d]

@yukw777

yukw777 commented Oct 10, 2018

I'm seeing the same thing as @nigelmathes

@varun19299

varun19299 commented Oct 10, 2018 via email

@yukw777

yukw777 commented Oct 10, 2018

@varun19299 that option is only available for the old train.py, which has been deprecated and no longer works, as @waltermaldonado and @nigelmathes pointed out. We need an option for the new model_main.py script.

@varun19299

I don't think the new model_main.py supports multi-GPU.

This is probably because Estimator distribution strategies don't work with tf.contrib.slim or tf.contrib.layers.

Could one of the maintainers explain this?

@Hafplo

Hafplo commented Oct 11, 2018

@varun19299 , @nealwu
Until this is solved, could I use clusters (multi-node single-gpu) with the new Estimator module?

@pkulzc
Contributor

pkulzc commented Oct 29, 2018

As @varun19299 said, model_main.py doesn't support multi-GPU because Estimator distribution strategies don't work with tf.contrib.slim.

But you can still train with multiple GPUs via the legacy train.py. All configs actually work with both the new and the legacy binaries. Just note that you need to set replicas_to_aggregate in train_config properly.
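
For reference, a rough sketch of how the relevant train_config fields might look (field names as in object_detection/protos/train.proto; the values are only placeholders, not a recommendation):

train_config: {
  batch_size: 1               # per-clone batch size when using legacy train.py
  sync_replicas: true
  replicas_to_aggregate: 2    # e.g. one replica per GPU / clone
}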

@pchowdhry

Hi @pkulzc, I'm trying to train RCNN using Cloud ML (object detection API), runtime version 1.9, and if I try to set the num_clones option with train.py, I get an unrecognized option error. Is there a specific runtime necessary to work with train.py?

@ppwwyyxx
Contributor

My implementation of Faster R-CNN and Mask R-CNN supports multi-gpu and distributed training.

@oopsodd

oopsodd commented Nov 28, 2018

@pkulzc I tried to train a quantized model using the legacy train.py with multi-GPU, but it doesn't seem to work.
I got this when running the legacy eval.py: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint.
Is training a quantized model with multi-GPU supported?

@toddwyl

toddwyl commented Dec 16, 2018

As @varun19299 said, model_main.py doesn't support multi-GPU due to the fact that Estimator distribution strategies don’t work with tf.contrib.slim.

But you can still train with multi-GPU via legacy train.py. Actually all configs work with both new and legacy binary. Just notice that you need to set replicas_to_aggregate in train_config properly.

@pkulzc What is the meaning of replicas_to_aggregate? If I want to run SSD training on two GPUs, what value should I set replicas_to_aggregate to?

@samuel1208

@oopsodd I met the same problem, have you solved it?

@FiveMaster

@pkulzc What is the meaning of replicas_to_aggregate? If I want to run SSD training on two GPUs, what value should I set replicas_to_aggregate to?

@donghyeon

@pkulzc If I replace all the slim layers with tf.keras.layers, will model_main.py then be able to run on multiple GPUs with ease, or would I need to make many other contributions for multi-GPU training? If so, could you give me some useful keywords to study multi-GPU training and to contribute some code to this API?

@varun19299

imo, the bigger incompatibility would be with slim.arg_scope

@donghyeon

@varun19299 There's already a keras model that doesn't use any slim code in this API repository. See: https://github.com/tensorflow/models/blob/master/research/object_detection/models/keras_applications/mobilenet_v2.py. Have you tested multi-GPU training with this keras model? It would be appreciated if you shared your observations about multi-GPU training with estimators. Of course, I will try this myself in a few days.

@oopsodd

oopsodd commented Jan 21, 2019

@samuel1208 sorry, I didn't. How about you?

@v-qjqs

v-qjqs commented Mar 1, 2019

@donghyeon Have you tested multi-GPU training with the keras model you mentioned above? Thanks very much. For multi-GPU training, I have tried removing everything related to slim.arg_scope and explicitly re-constructing/re-initializing the model network with tf.layers, but it didn't work with multiple GPUs. Did you succeed, or should I put much more effort into multi-GPU training? Thanks.

@Tantael

Tantael commented Mar 4, 2019

I think it would be nice to treat this as a priority: training on 1 GPU is almost impossible.

@lighTQ

lighTQ commented Mar 6, 2019

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

@nigelmathes

@lighTQ train.py is legacy at this point, and the API has moved away from it. In fact, using the most updated version of this repository, train.py throws errors.

As far as I can tell, there is still no multi-gpu training in model_main.py.

@CasiaFan

@waltermaldonado I have tested using the official ssd_mobilenet_v2_coco.config in the config directory. The python3 process with PID 38956 on GPUs 3 and 4 is my training process, and I set batch_size to 4 just for testing.

[nvidia-smi screenshot]

So it seems that the slim module can work under tf.contrib.distribute.MirroredStrategy. BTW, I wrote a mobilenet v3 backbone using tf.keras which works fine with this strategy. However, if I use the off-the-shelf ssd_mobilenet_v2_keras, an error occurs: the input_shape received by the convolutional_keras_box_predictor.py build function at line 135 is None. I'm working on this error now.
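
For anyone trying to reproduce this, a minimal sketch of the kind of change involved (TF 1.13-era APIs; this is not the stock model_main.py code, and the placeholder path is illustrative):

# Sketch only: pass a MirroredStrategy into the RunConfig that model_main.py
# hands to the Estimator (TF 1.x contrib API).
import tensorflow as tf

model_dir = "/path/to/model_dir"  # placeholder; model_main.py uses FLAGS.model_dir

strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
run_config = tf.estimator.RunConfig(model_dir=model_dir,
                                    train_distribute=strategy)
# run_config then replaces the plain RunConfig in main() and is passed on to
# model_lib.create_estimator_and_inputs(...) as in the unmodified script.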

@CasiaFan

CasiaFan commented May 31, 2019

UPDATE: as for the problem I mentioned above when using ssd_mobilenet_v2_keras as feature extractor,

File "/home/arkenstone/models/research/object_detection/predictors/convolutional_keras_box_predictor.py", line 135, in build
if len(input_shapes) != len(self._prediction_heads[BOX_ENCODINGS]):
TypeError: object of type 'NoneType' has no len()

It seems to be due to an incompatibility between Python 2 and Python 3 in what dict.values() returns. Just add feature_maps = list(feature_maps) after extracting the feature maps at line 570 in ssd_meta_arch.py.
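
To illustrate the Python 2 vs 3 difference being referred to (a generic example, not the object detection code itself):

# In Python 3, dict.values() returns a view object rather than a list, so code
# written against Python 2 list semantics (indexing, slicing, APIs that expect
# a concrete sequence) can misbehave. Wrapping the result in list(...) is the
# same kind of fix as feature_maps = list(feature_maps) above.
feature_maps = {"layer_1": "fmap_1", "layer_2": "fmap_2"}.values()

print(type(feature_maps))       # <class 'dict_values'> on Python 3
# feature_maps[0]               # would raise TypeError on Python 3

feature_maps = list(feature_maps)
print(feature_maps[0])          # 'fmap_1' -- list semantics restored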

@paillardf

@CasiaFan I can't make model_main.py work with MirroredStrategy. I am using tensorflow 1.13.1 with ssd_mobilenet_v2_coco.
I get this error at start:

Traceback (most recent call last):
  File "model_main.py", line 112, in <module>
    tf.app.run()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "model_main.py", line 108, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1254, in _actual_train_model_distributed
    self.config))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1199, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 641, in _call_for_each_replica
    return _call_for_each_replica(self._container_strategy(), fn, args, kwargs)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 189, in _call_for_each_replica
    coord.join(threads)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 167, in _call_for_each_replica
    merge_args = values.regroup({t.device: t.merge_args for t in threads})
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 997, in regroup
    for i in range(len(v0)))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 997, in <genexpr>
    for i in range(len(v0)))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 1010, in regroup
    assert set(v.keys()) == v0keys
AssertionError

Any ideas?
Thank you

@CasiaFan

@paillardf Hmm... it seems some nodes in the graph on each device are different. Did you add any operations on a specific device, or are you just using the default object detection API?

@paillardf

@CasiaFan I didn't change anything in the object_detection folder except the line you gave us. I am up to date with the models repo as well.

@ChanZou

ChanZou commented Jun 12, 2019

@CasiaFan Would you please post the commit hash you cloned/last merged from the repo? I am running TF 1.13.1 but getting a different error message:
ValueError: Variable FeatureExtractor/MobilenetV2/Conv/weights/replica_1/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?
The error message persists across keras and slim models, so I think your commit hash would be very helpful for both @paillardf and myself. Thanks.

@laksgreen

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

@lighTQ - Thanks, it's working with the options: "--worker_replicas=2 --num_clones=2 --ps_tasks=1"

@waltermaldonado
Author

waltermaldonado commented Jul 4, 2019

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

@lighTQ - Thanks, it's working with the options: "--worker_replicas=2 --num_clones=2 --ps_tasks=1"

Yes, this is the way to do it with the legacy train scripts. We just can't do the training with multiple GPUs using model_main.py.

@CasiaFan

@ChanZou @paillardf Sorry for the late response. This issue seems to be related to the combined usage of tf.contrib.distribute.MirroredStrategy and tf.train.ExponentialMovingAverage.
See #27392 for detailed information. For simplicity, I turn off use_moving_average during training, like this in the config:

train_config: {
  batch_size: 2
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
    use_moving_average: false  # add this line
  }

Now I have upgraded my tf to version 1.14. Adding only the following code causes problems like this:
code

gpu_devices = ["/device:GPU:{}".format(x) for x in range(len(FLAGS.gpus.split(",")))]
strategy = tf.distribute.MirroredStrategy(
    devices=gpu_devices,
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(num_packs=1))
config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    train_distribute=strategy,
    save_checkpoints_steps=1500)

error

...
  File "/data/fanzong/miniconda3/envs/tf_cuda10/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 126, in _require_cross_replica_or_default_context_extended
    raise RuntimeError("Method requires being in cross-replica context, use "
RuntimeError: Method requires being in cross-replica context, use get_replica_context().merge_call()

And this error seems to be caused by the scaffold defining a saver for the EstimatorSpec. When I comment out that snippet of code (lines 501-509), training with model_main.py works fine! I will dig into this problem later. But since I have added a lot of custom code, I'm sorry I cannot push a commit that works under tf 1.13.1. The previous experiment was run on a freshly cloned branch, and I'd appreciate it if you could test it.
BTW, as for the slim concern mentioned in previous posts, I have checked the object detection API project roughly (both the tf1.13.1 and tf1.14.0 versions). Only a small amount of code still uses slim, and most of it is the slim nets defining the network architecture. All other modules have counterparts written in tf.keras. So I don't think slim is the obstacle to multi-GPU training.

@ChanZou

ChanZou commented Jul 12, 2019

@CasiaFan Thank you for sharing, it is super helpful! I made the same discovery and followed #27392 to modify the TF code.
To add a little more info: TowerOptimizer and replicate_model_fn are tempting, but using them in model_main.py can be really dangerous. The obvious reason is that they were deprecated more than a year ago. What makes them even more dangerous is that the code might still run. They use variable_scope to handle variable reuse, and tf.keras seems not to be affected by that. So instead of having one replica of the model, I had 4 replicas (one per GPU) trained at the same time, which led to massive checkpoints and degraded model performance.

@harsh306

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

@lighTQ , @pkulzc if we are scaling on a single machine with 2 GPUs, should we keep
--worker_replicas=1?

python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=1 \
    --num_clones=2 \
    --ps_tasks=1

@kjkim-kr

kjkim-kr commented Aug 20, 2019

@CasiaFan Following your comments, I was able to get model_main.py running with multi-GPU training. Thanks a lot. But when comparing two ways of training with multiple GPUs,

  1. legacy/train.py --num_clones=2 --ps_tasks=1
  2. model_main.py + tf.contrib.distribute.MirroredStrategy(num_gpus=2)

with the same model (faster_rcnn_resnet101 and its configs), I found some issues.

Q1) When I set batch_size = 1 and gpu = 2, then 1) threw an error
(ValueError: not enough values to unpack (expected 7, got 0))
but 2) worked fine.

Q2) When I set batch_size = 8 with gpu = 1, both 1) and 2) worked fine,
but if I increased batch_size to 16 and set gpu = 2, then 1) worked fine
while 2) threw an OOM error.
(I use 8 GPUs of RTX 2080 Ti - 11GB, tensorflow 1.14.0)

So I guess that MirroredStrategy does not treat the batch size as batch size per GPU * number of GPUs. From these results, I think the training result of (batch_size=16, gpu=2 with legacy/train.py) is the same as (batch_size=8, gpu=2 with model_main.py and MirroredStrategy).
Did you check this point? If you did, or know about this issue, could you let me know?

Thanks.

@CasiaFan

@kjkim-kr Thanks for reporting this case. I didn't try legacy/train.py, but I checked trainer.py and found that it allocates the data queue and the training process to each GPU using tf.device directly, which should be more efficient since it is a fairly low-level API. Have you tried setting a smaller batch size in both modes to check whether the MirroredStrategy mode occupies more memory?
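
For context, the pattern being described is roughly the classic in-graph replication ("towers") approach; a simplified TF 1.x sketch is below (this is not the actual trainer.py code, and the toy model/input are placeholders):

# Simplified "one clone per GPU" sketch: each clone builds its loss on its own
# device with shared variables, gradients are averaged and applied once.
import tensorflow as tf

def tower_loss(batch):
    # toy stand-in for the real detection model
    logits = tf.layers.dense(batch, 10)
    return tf.reduce_mean(tf.square(logits))

num_clones = 2
optimizer = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
for i in range(num_clones):
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        batch = tf.random_normal([8, 128])   # stand-in for the per-clone input queue
        tower_grads.append(optimizer.compute_gradients(tower_loss(batch)))

with tf.device("/cpu:0"):
    # average each variable's gradient across clones, then apply once
    avg_grads = [(tf.reduce_mean(tf.stack([g for g, _ in gv]), axis=0), gv[0][1])
                 for gv in zip(*tower_grads)]
    train_op = optimizer.apply_gradients(avg_grads)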

@sp-ananth

sp-ananth commented Nov 26, 2019

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.

Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

@pkulzc

@harsh306

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.

Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

Yes, I switched to legacy/train and it worked.

@sp-ananth

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.
Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

Yes, I switched to legacy/train and it worked.

Yes, as I said above, I was asking whether or not there is a fix for multi-GPU training with model_main.py, not legacy/train.py.

model_main.py is the more recent release that allows evaluation on the fly and ideally should support multi-GPU for an object detection task.

@shiragit

shiragit commented Dec 3, 2019

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.
Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

Yes, I switched to legacy/train and it worked.

@harsh306, have you managed to run eval.py after training with num_clones > 1?

@Adblu

Adblu commented Apr 23, 2020

@royshil That solution seems to be extremely slow and it does not utilize the full power of all the GPUs.
Any newer updates from this year?

@qraleq

qraleq commented Sep 16, 2020

Hi, did anyone manage to get model_main.py working in a multi-GPU setting and in an efficient manner (not being slower, and utilizing all the GPUs)?

@davitv

davitv commented Dec 3, 2020

Hi! Has anyone managed to run model_main in a multi-GPU environment? Thanks to @CasiaFan's answer I was able to run it, but it is extremely slow (nvidia-smi shows that all GPUs are used, but training actually appears to freeze at step 0).

@sainisanjay

sainisanjay commented Mar 30, 2021

--worker_replicas=2 --num_clones=2 --ps_tasks=1

@laksgreen @lighTQ
Could you please explain the difference between --worker_replicas=2, --num_clones=2 and --ps_tasks=1?
I can start multi-GPU training using train.py, but I can see that each of my GPUs is only utilized 30-40%. How can I increase the GPU utilization?
I have 8 GPUs in a single machine.

@sainisanjay

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

I have done multi-GPU training using the above command, and training finished successfully. Keep in mind that after training, when you export the model checkpoints using object_detection/export_inference_graph.py, it will give you an error like issue #5625. That is because when you use multiple GPUs for training, the variable names are changed in the graph (each node or variable is prefixed with clone_1/nodeName, and similarly for the others). To solve this issue you have to remove the clone or clone_1 prefix from the graph and then export the checkpoints. I have written a Python script to remove these extra clone and clone_1 prefixes.

@Source82


Please can you share the code you have written to tackle the clone issue.
Thanks

@sainisanjay

@Source82

import sys, getopt

import tensorflow as tf

usage_str = 'python tensorflow_rename_variables.py --checkpoint_dir=path/to/dir/ --dry_run'


def rename(checkpoint_dir, dry_run):
    checkpoint = tf.train.get_checkpoint_state(checkpoint_dir)
    with tf.Session() as sess:
        for var_name, _ in tf.contrib.framework.list_variables(checkpoint_dir):
            var = tf.contrib.framework.load_variable(checkpoint_dir, var_name)
            pos1 = var_name.find('clone_')
            pos2 = var_name.find('/clone_')
            posf = len(var_name)
            # Strip the 'clone_N/' prefix that multi-GPU training adds to variable names
            new_name = var_name
            if (pos1 != -1) and (pos2 != -1):
                new_name = var_name[pos2 + 9:posf]
            if (pos1 == 0) and (pos2 == -1):
                new_name = var_name[8:posf]
            if dry_run:
                print('%s would be renamed to %s.' % (var_name, new_name))
            else:
                print('Renaming %s to %s.' % (var_name, new_name))
                # Re-create the variable under the new name
                var = tf.Variable(var, name=new_name)

        if not dry_run:
            # Save the renamed variables back to the checkpoint
            saver = tf.train.Saver()
            sess.run(tf.global_variables_initializer())
            saver.save(sess, checkpoint.model_checkpoint_path)


def main(argv):
    checkpoint_dir = None
    dry_run = False

    try:
        opts, args = getopt.getopt(argv, 'h', ['help=', 'checkpoint_dir=', 'dry_run='])
    except getopt.GetoptError:
        print(usage_str)
        sys.exit(2)

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            print(usage_str)
            sys.exit()
        elif opt == '--checkpoint_dir':
            checkpoint_dir = arg
        elif opt == '--dry_run':
            dry_run = True

    if not checkpoint_dir:
        print('Please specify a checkpoint_dir. Usage:')
        print(usage_str)
        sys.exit(2)
    rename(checkpoint_dir, dry_run)


if __name__ == '__main__':
    main(sys.argv[1:])

@Source82


Thanks
