
Multiple GPU in model_main.py (since there is no more train.py) #5421

waltermaldonado opened this issue Oct 2, 2018 · 63 comments

@waltermaldonado

waltermaldonado commented Oct 2, 2018


System information

  • What is the top-level directory of the model you are using: research/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: v1.10.1-0-g4dcfddc5d1 1.10.1
  • Bazel version (if compiling from source): NA
  • CUDA/cuDNN version: V9.0.176
  • GPU model and memory: 2x Tesla P100 16Gb
  • Exact command to reproduce: NA

Describe the problem


Greetings,

I would like to know how to use both of my GPUs to train a Faster R-CNN with NASNet-A featurization model with the model_main.py file included in the object_detection tools, now that train.py is gone. If that is not possible, I would like to request this feature or a workaround to make it work.

Thanks in advance.

@Burgomehl

You may find train.py in the "legacy" folder.

@waltermaldonado
Author

It doesn't work.

It says there are not enough values to unpack (expected 7, got 0).

@nigelmathes

train.py no longer works, returning the following error:

 File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

We need multiple GPU support for model_main.py if it's going to be the only way to use the Object Detection API.

@varun19299

I think you need to change the batch size in the config file. (The batch size here is not per GPU but the total across all GPUs, i.e. batch size = number of GPUs * batch size per GPU.)

train.py no longer works, returning the following error:

 File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

We need multiple GPU support for model_main.py if it's going to be the only way to use the Object Detection API.

@rickragv

rickragv commented Oct 6, 2018

Has anybody done multi-GPU training using train.py?

@nigelmathes

I think you need to change the batch size in the config file. (Batch size here is not per GPU, rather sum of all ; so Batch size = No of GPUs * Batch size per GPU)

train.py no longer works, returning the following error:

 File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

I just tried changing the batch size to 2 and noticed that the second GPU was still not utilized. In fact, one GPU spun up to ~80% usage in nvidia-smi (as expected), but then the whole system hit a core dump memory error with the following stack trace:

INFO:tensorflow:loss = 3.4643404, step = 23013
I1009 13:08:23.581555 140509247522560 tf_logging.py:115] loss = 3.4643404, step = 23013
*** Error in `python': double free or corruption (fasttop): 0x00007fc3b0022e90 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fca520eb7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fca520f437a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fca520f853c]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6Tensor16CopyFromInternalERKS0_RKNS_11TensorShapeE+0xbe)[0x7fca042dca9e]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow25NonMaxSuppressionV3V4Base7ComputeEPNS_15OpKernelContextE+0x7a)[0x7fca0796087a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow13TracingDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0xbc)[0x7fca0443c37c]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x63a4bc)[0x7fca044804bc]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x63ae2a)[0x7fca04480e2a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x21a)[0x7fca044ee96a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x32)[0x7fca044eda12]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fc9fac6dc80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fca524456ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fca5217b41d]

@yukw777

yukw777 commented Oct 10, 2018

I'm seeing the same thing as @nigelmathes

@varun19299

varun19299 commented Oct 10, 2018 via email

@yukw777

yukw777 commented Oct 10, 2018

@varun19299 that option is only available for the old train.py, which has been deprecated and no longer works, as @waltermaldonado and @nigelmathes pointed out. We need an option for the new model_main.py script.

@varun19299

I don't think the new model_main.py supports multi-GPU.

This is probably because Estimator distribution strategies don't work with tf.contrib.slim or tf.contrib.layers.

Could one of the maintainers explain this?

@Hafplo

Hafplo commented Oct 11, 2018

@varun19299 , @nealwu
Until this is solved, could I use clusters (multi-node single-gpu) with the new Estimator module?

@pkulzc
Contributor

pkulzc commented Oct 29, 2018

As @varun19299 said, model_main.py doesn't support multi-GPU because Estimator distribution strategies don't work with tf.contrib.slim.

But you can still train with multiple GPUs via the legacy train.py. All configs actually work with both the new and the legacy binaries. Just note that you need to set replicas_to_aggregate in train_config properly.
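
For reference, a rough sketch of how the relevant train_config fields might look (field names as in object_detection/protos/train.proto; the values are only placeholders, not a recommendation):

train_config: {
  batch_size: 1               # per-clone batch size when using legacy train.py
  sync_replicas: true
  replicas_to_aggregate: 2    # e.g. one replica per GPU / clone
}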

@pchowdhry

Hi @pkulzc, I'm trying to train RCNN using Cloud ML (object detection API), runtime version 1.9, and if I try to set the num_clones option with train.py, I get an unrecognized option error. Is there a specific runtime necessary to work with train.py?

@ppwwyyxx
Contributor

My implementation of Faster R-CNN and Mask R-CNN supports multi-gpu and distributed training.

@oopsodd

oopsodd commented Nov 28, 2018

@pkulzc I tried to train a quantized model using the legacy train.py with multi-GPU, but it doesn't seem to work.
I got this when running the legacy eval.py: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint.
Is training a quantized model with multi-GPU supported?

@toddwyl

toddwyl commented Dec 16, 2018

As @varun19299 said, model_main.py doesn't support multi-GPU due to the fact that Estimator distribution strategies don’t work with tf.contrib.slim.

But you can still train with multi-GPU via legacy train.py. Actually all configs work with both new and legacy binary. Just notice that you need to set replicas_to_aggregate in train_config properly.

@pkulzc What is the meaning of replicas_to_aggregate? If I want to run SSD training on two GPUs, what value should I set replicas_to_aggregate to?

@samuel1208

@oopsodd I met the same problem, have you solved it?

@FiveMaster

@pkulzc What is the meaning of replicas_to_aggregate? If I want to run SSD training on two GPUs, what value should I set replicas_to_aggregate to?

@donghyeon

@pkulzc If I replace all the slim layers with tf.keras.layers, will model_main.py then be able to run on multiple GPUs with ease, or would I need to make many other contributions for multi-GPU training? If so, could you give me some useful keywords to study multi-GPU training and to contribute some code to this API?

@varun19299

imo, the bigger incompatibility would be with slim.arg_scope

@donghyeon

@varun19299 There's already a keras model that doesn't use any slim code in this API repository. See: https://github.com/tensorflow/models/blob/master/research/object_detection/models/keras_applications/mobilenet_v2.py. Have you tested multi-GPU training with this keras model? It would be appreciated if you shared your observations about multi-GPU training with estimators. Of course, I will try this myself in a few days.

@oopsodd

oopsodd commented Jan 21, 2019

@samuel1208 sorry, I didn't. How about you?

@v-qjqs

v-qjqs commented Mar 1, 2019

@donghyeon Have you tested multi-GPU training with the keras model you mentioned above? Thanks very much. For multi-GPU training, I have tried removing everything related to slim.arg_scope and explicitly re-constructing/re-initializing the model network with tf.layers, but it didn't work with multiple GPUs. Did you succeed, or should I put much more effort into multi-GPU training? Thanks.

@Tantael

Tantael commented Mar 4, 2019

I think it would be nice to treat this as a priority: training on 1 GPU is almost impossible.

@lighTQ

lighTQ commented Mar 6, 2019

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

@nigelmathes

@lighTQ train.py is legacy at this point, and the API has moved away from it. In fact, using the most updated version of this repository, train.py throws errors.

As far as I can tell, there is still no multi-gpu training in model_main.py.

@CasiaFan

@waltermaldonado I have tested using the official ssd_mobilenet_v2_coco.config in the config directory. The python3 process with PID 38956 on GPUs 3 and 4 is my training process, and I set batch_size to 4 just for testing.

[nvidia-smi screenshot]

So it seems that the slim module can work under tf.contrib.distribute.MirroredStrategy. BTW, I wrote a mobilenet v3 backbone using tf.keras which works fine with this strategy. However, if I use the off-the-shelf ssd_mobilenet_v2_keras, an error occurs: the input_shape received by the convolutional_keras_box_predictor.py build function at line 135 is None. I'm working on this error now.
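
For anyone trying to reproduce this, a minimal sketch of the kind of change involved (TF 1.13-era APIs; this is not the stock model_main.py code, and the placeholder path is illustrative):

# Sketch only: pass a MirroredStrategy into the RunConfig that model_main.py
# hands to the Estimator (TF 1.x contrib API).
import tensorflow as tf

model_dir = "/path/to/model_dir"  # placeholder; model_main.py uses FLAGS.model_dir

strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
run_config = tf.estimator.RunConfig(model_dir=model_dir,
                                    train_distribute=strategy)
# run_config then replaces the plain RunConfig in main() and is passed on to
# model_lib.create_estimator_and_inputs(...) as in the unmodified script.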

@CasiaFan

CasiaFan commented May 31, 2019

UPDATE: as for the problem I mentioned above when using ssd_mobilenet_v2_keras as feature extractor,

File "/home/arkenstone/models/research/object_detection/predictors/convolutional_keras_box_predictor.py", line 135, in build
if len(input_shapes) != len(self._prediction_heads[BOX_ENCODINGS]):
TypeError: object of type 'NoneType' has no len()

It seems to be due to an incompatibility between Python 2 and Python 3 in what dict.values() returns. Just add feature_maps = list(feature_maps) after extracting the feature maps at line 570 in ssd_meta_arch.py.
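
To illustrate the Python 2 vs 3 difference being referred to (a generic example, not the object detection code itself):

# In Python 3, dict.values() returns a view object rather than a list, so code
# written against Python 2 list semantics (indexing, slicing, APIs that expect
# a concrete sequence) can misbehave. Wrapping the result in list(...) is the
# same kind of fix as feature_maps = list(feature_maps) above.
feature_maps = {"layer_1": "fmap_1", "layer_2": "fmap_2"}.values()

print(type(feature_maps))       # <class 'dict_values'> on Python 3
# feature_maps[0]               # would raise TypeError on Python 3

feature_maps = list(feature_maps)
print(feature_maps[0])          # 'fmap_1' -- list semantics restored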

@paillardf

@CasiaFan I can't make model_main.py work with MirroredStrategy. I am using tensorflow 1.13.1 with ssd_mobilenet_v2_coco.
I get this error at start:

Traceback (most recent call last):
  File "model_main.py", line 112, in <module>
    tf.app.run()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "model_main.py", line 108, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1254, in _actual_train_model_distributed
    self.config))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1199, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 641, in _call_for_each_replica
    return _call_for_each_replica(self._container_strategy(), fn, args, kwargs)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 189, in _call_for_each_replica
    coord.join(threads)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 167, in _call_for_each_replica
    merge_args = values.regroup({t.device: t.merge_args for t in threads})
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 997, in regroup
    for i in range(len(v0)))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 997, in <genexpr>
    for i in range(len(v0)))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 1010, in regroup
    assert set(v.keys()) == v0keys
AssertionError

Any ideas?
Thank you

@CasiaFan

@paillardf Hmm... it seems some nodes in the graph on each device are different. Did you add any operations on a specific device, or are you just using the default object detection API?

@paillardf

@CasiaFan I didn't change anything in the object_detection folder except the line you gave us. I am up to date with the models repo as well.

@ChanZou

ChanZou commented Jun 12, 2019

@CasiaFan Would you please post the commit hash you cloned/last merged from the repo? I am running TF 1.13.1 but getting a different error message:
ValueError: Variable FeatureExtractor/MobilenetV2/Conv/weights/replica_1/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?
The error message persists across keras and slim models, so I think your commit hash would be very helpful for both @paillardf and myself. Thanks.

@laksgreen

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

@lighTQ - Thanks, it's working with the options: "--worker_replicas=2 --num_clones=2 --ps_tasks=1"

@waltermaldonado
Author

waltermaldonado commented Jul 4, 2019

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

@lighTQ - Thanks, it's working with the options: "--worker_replicas=2 --num_clones=2 --ps_tasks=1"

Yes, this is the way to do it with the legacy train scripts. We just can't do the training with multiple GPUs using model_main.py.

@CasiaFan

@ChanZou @paillardf Sorry for the late response. This issue seems to be related to the combined usage of tf.contrib.distribute.MirroredStrategy and tf.train.ExponentialMovingAverage.
See #27392 for detailed information. For simplicity, I turn off use_moving_average during training, like this in the config:

train_config: {
  batch_size: 2
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
    use_moving_average: false  # add this line
  }

Now I have upgraded my tf to version 1.14. Adding only the following code causes problems like this:
code

gpu_devices = ["/device:GPU:{}".format(x) for x in range(len(FLAGS.gpus.split(",")))]
strategy = tf.distribute.MirroredStrategy(
    devices=gpu_devices,
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(num_packs=1))
config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    train_distribute=strategy,
    save_checkpoints_steps=1500)

error

...
  File "/data/fanzong/miniconda3/envs/tf_cuda10/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 126, in _require_cross_replica_or_default_context_extended
    raise RuntimeError("Method requires being in cross-replica context, use "
RuntimeError: Method requires being in cross-replica context, use get_replica_context().merge_call()

And this error seems to be caused by the scaffold defining a saver for the EstimatorSpec. When I comment out that snippet of code (lines 501-509), training with model_main.py works fine! I will dig into this problem later. But since I have added a lot of custom code, I'm sorry I cannot push a commit that works under tf 1.13.1. The previous experiment was run on a freshly cloned branch, and I'd appreciate it if you could test it.
BTW, as for the slim concern mentioned in previous posts, I have checked the object detection API project roughly (both the tf1.13.1 and tf1.14.0 versions). Only a small amount of code still uses slim, and most of it is the slim nets defining the network architecture. All other modules have counterparts written in tf.keras. So I don't think slim is the obstacle to multi-GPU training.

@ChanZou

ChanZou commented Jul 12, 2019

@CasiaFan Thank you for sharing, it is super helpful! I made the same discovery and followed #27392 to modify the TF code.
To add a little more info: TowerOptimizer and replicate_model_fn are tempting, but using them in model_main.py can be really dangerous. The obvious reason is that they were deprecated more than a year ago. What makes them even more dangerous is that the code might still run. They use variable_scope to handle variable reuse, and tf.keras seems not to be affected by that. So instead of having one replica of the model, I had 4 replicas (one per GPU) trained at the same time, which led to massive checkpoints and degraded model performance.

@harsh306

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

@lighTQ , @pkulzc if we are scaling on a single machine with 2 GPUs, should we keep
--worker_replicas=1?

python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=1 \
    --num_clones=2 \
    --ps_tasks=1

@kjkim-kr

kjkim-kr commented Aug 20, 2019

@CasiaFan Following your comments, I was able to get model_main.py running with multi-GPU training. Thanks a lot. But when comparing two ways of training with multiple GPUs,

  1. legacy/train.py --num_clones=2 --ps_tasks=1
  2. model_main.py + tf.contrib.distribute.MirroredStrategy(num_gpus=2)

with the same model (faster_rcnn_resnet101 and its configs), I found some issues.

Q1) When I set batch_size = 1 and gpu = 2, then 1) threw an error
(ValueError: not enough values to unpack (expected 7, got 0))
but 2) worked fine.

Q2) When I set batch_size = 8 with gpu = 1, both 1) and 2) worked fine,
but if I increased batch_size to 16 and set gpu = 2, then 1) worked fine
while 2) threw an OOM error.
(I use 8 GPUs of RTX 2080 Ti - 11GB, tensorflow 1.14.0)

So I guess that MirroredStrategy does not treat the batch size as batch size per GPU * number of GPUs. From these results, I think the training result of (batch_size=16, gpu=2 with legacy/train.py) is the same as (batch_size=8, gpu=2 with model_main.py and MirroredStrategy).
Did you check this point? If you did, or know about this issue, could you let me know?

Thanks.

@CasiaFan

@kjkim-kr Thanks for reporting this case. I didn't try legacy/train.py, but I checked trainer.py and found that it allocates the data queue and the training process to each GPU using tf.device directly, which should be more efficient since it is a fairly low-level API. Have you tried setting a smaller batch size in both modes to check whether the MirroredStrategy mode occupies more memory?
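
For context, the pattern being described is roughly the classic in-graph replication ("towers") approach; a simplified TF 1.x sketch is below (this is not the actual trainer.py code, and the toy model/input are placeholders):

# Simplified "one clone per GPU" sketch: each clone builds its loss on its own
# device with shared variables, gradients are averaged and applied once.
import tensorflow as tf

def tower_loss(batch):
    # toy stand-in for the real detection model
    logits = tf.layers.dense(batch, 10)
    return tf.reduce_mean(tf.square(logits))

num_clones = 2
optimizer = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
for i in range(num_clones):
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        batch = tf.random_normal([8, 128])   # stand-in for the per-clone input queue
        tower_grads.append(optimizer.compute_gradients(tower_loss(batch)))

with tf.device("/cpu:0"):
    # average each variable's gradient across clones, then apply once
    avg_grads = [(tf.reduce_mean(tf.stack([g for g, _ in gv]), axis=0), gv[0][1])
                 for gv in zip(*tower_grads)]
    train_op = optimizer.apply_gradients(avg_grads)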

@sp-ananth

sp-ananth commented Nov 26, 2019

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.

Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

@pkulzc

@harsh306

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.

Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

Yes, I switched to legacy/train and it worked.

@sp-ananth

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.
Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

Yes, I switched to legacy/train and it worked.

Yes, as I said above, I was asking whether or not there is a fix for multi-GPU training with model_main.py, not legacy/train.py.

model_main.py is the more recent release that allows evaluation on the fly and ideally should support multi-GPU for an object detection task.

@shiragit

shiragit commented Dec 3, 2019

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.
Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

Yes, I switched to legacy/train and it worked.

@harsh306, have you managed to run eval.py after training with num_clones > 1?

@Adblu

Adblu commented Apr 23, 2020

@royshil That solution seems to be extremely slow and it does not utilize the full power of all the GPUs.
Any newer updates from this year?

@qraleq

qraleq commented Sep 16, 2020

Hi, did anyone manage to get model_main.py working in a multi-GPU setting and in an efficient manner (not being slower, and utilizing all the GPUs)?

@davitv

davitv commented Dec 3, 2020

Hi! Has anyone managed to run model_main in a multi-GPU environment? Thanks to @CasiaFan's answer I was able to run it, but it is extremely slow (nvidia-smi shows that all GPUs are used, but training actually appears to freeze at step 0).

@sainisanjay

sainisanjay commented Mar 30, 2021

--worker_replicas=2 --num_clones=2 --ps_tasks=1

@laksgreen @lighTQ
Could you please explain the difference between --worker_replicas=2, --num_clones=2 and --ps_tasks=1?
I can start multi-GPU training using train.py, but I can see that each of my GPUs is only utilized 30-40%. How can I increase the GPU utilization?
I have 8 GPUs in a single machine.

@sainisanjay

I used the following command to run single-machine multi-GPU training for faster_rcnn_resnet50:
python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

I have done multi-GPU training using the above command, and training finished successfully. Keep in mind that after training, when you export the model checkpoints using object_detection/export_inference_graph.py, it will give you an error like issue #5625. That is because when you use multiple GPUs for training, the variable names are changed in the graph (each node or variable is prefixed with clone_1/nodeName, and similarly for the others). To solve this issue you have to remove the clone or clone_1 prefix from the graph and then export the checkpoints. I have written a Python script to remove these extra clone and clone_1 prefixes.

@Source82


Please can you share the code you have written to tackle the clone issue.
Thanks

@sainisanjay

@Source82

import sys, getopt

import tensorflow as tf

usage_str = 'python tensorflow_rename_variables.py --checkpoint_dir=path/to/dir/ --dry_run'


def rename(checkpoint_dir, dry_run):
    checkpoint = tf.train.get_checkpoint_state(checkpoint_dir)
    with tf.Session() as sess:
        for var_name, _ in tf.contrib.framework.list_variables(checkpoint_dir):
            var = tf.contrib.framework.load_variable(checkpoint_dir, var_name)
            pos1 = var_name.find('clone_')
            pos2 = var_name.find('/clone_')
            posf = len(var_name)
            # Strip the 'clone_N/' prefix that multi-GPU training adds to variable names
            new_name = var_name
            if (pos1 != -1) and (pos2 != -1):
                new_name = var_name[pos2 + 9:posf]
            if (pos1 == 0) and (pos2 == -1):
                new_name = var_name[8:posf]
            if dry_run:
                print('%s would be renamed to %s.' % (var_name, new_name))
            else:
                print('Renaming %s to %s.' % (var_name, new_name))
                # Re-create the variable under the new name
                var = tf.Variable(var, name=new_name)

        if not dry_run:
            # Save the renamed variables back to the checkpoint
            saver = tf.train.Saver()
            sess.run(tf.global_variables_initializer())
            saver.save(sess, checkpoint.model_checkpoint_path)


def main(argv):
    checkpoint_dir = None
    dry_run = False

    try:
        opts, args = getopt.getopt(argv, 'h', ['help=', 'checkpoint_dir=', 'dry_run='])
    except getopt.GetoptError:
        print(usage_str)
        sys.exit(2)

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            print(usage_str)
            sys.exit()
        elif opt == '--checkpoint_dir':
            checkpoint_dir = arg
        elif opt == '--dry_run':
            dry_run = True

    if not checkpoint_dir:
        print('Please specify a checkpoint_dir. Usage:')
        print(usage_str)
        sys.exit(2)
    rename(checkpoint_dir, dry_run)


if __name__ == '__main__':
    main(sys.argv[1:])

@Source82


Thanks
