
GPU memory error with train.py and eval.py running together #1854

Closed
aloerch opened this issue Jul 4, 2017 · 36 comments
Labels
stat:awaiting model gardener Waiting on input from TensorFlow model gardener

Comments

@aloerch

aloerch commented Jul 4, 2017

System information

  • What is the top-level directory of the model you are using: object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): not yet...
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint 18
  • TensorFlow installed from (source or binary): pip tensorflow-gpu
  • TensorFlow version (use command below): ('v1.2.0-rc2-21-g12f033d', '1.2.0')
  • CUDA/cuDNN version: Cuda compilation tools, release 8.0, V8.0.44
  • GPU model and memory: eVGA GTX 1080 Ti, 11 GB
  • Exact command to reproduce:
    Run 1st in one terminal:
python object_detection/train.py \
    --logtostderr \
    --pipeline_config_path=/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/samples/configs/faster_rcnn_resnet101_pets_learn.config \
    --train_dir=/home/meh/.virtualenvs/Project/models/model/train

That runs fine and training works... then run in a 2nd terminal:

python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/samples/configs/faster_rcnn_resnet101_pets_learn.config \
    --checkpoint_dir=/home/meh/.virtualenvs/Project/models/model/train \
    --eval_dir=/home/meh/.virtualenvs/Project/models/model/eval

Evaluation fails with an error about the GPU being out of memory (training continues in the other terminal window with no problem). Here is the traceback:

tensorflow/models $ python object_detection/eval.py \
>     --logtostderr \
>     --pipeline_config_path=/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/samples/configs/faster_rcnn_resnet101_pets_learn.config \
>     --checkpoint_dir=/home/meh/.virtualenvs/Project//models/model/train \
>     --eval_dir=/home/meh/.virtualenvs/Project//models/model/eval
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
2017-07-04 13:05:20.813917: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:20.813976: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:20.813987: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:20.813995: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:20.814005: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:21.414540: E tensorflow/core/common_runtime/direct_session.cc:138] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11713708032
Traceback (most recent call last):
  File "object_detection/eval.py", line 161, in <module>
    tf.app.run()
  File "/home/meh/.virtualenvs/Project/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "object_detection/eval.py", line 157, in main
    FLAGS.checkpoint_dir, FLAGS.eval_dir)
  File "/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/evaluator.py", line 211, in evaluate
    save_graph_dir=(eval_dir if eval_config.save_graph else ''))
  File "/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/eval_util.py", line 515, in repeated_checkpoint_run
    keys_to_exclude_from_results)
  File "/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/eval_util.py", line 359, in run_checkpoint_once
    sess = tf.Session(master, graph=tf.get_default_graph())
  File "/home/meh/.virtualenvs/Project/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1292, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/meh/.virtualenvs/Project/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 562, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/meh/.virtualenvs/Project/local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Based on:

Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11713708032

it looks like eval.py is trying to run the evaluation on the same GPU that is actively doing the training. I have a 2nd GPU, an eVGA GTX 1080 FTW with 8 GB of memory, that I would be happy to run eval.py on, and generally speaking, I know how to write my own TensorFlow graph using tf.device('/gpu:1'), but I cannot figure out where to insert this in the object_detection code.

I would recommend adding the ability to select the GPU used for both training and evaluation, possibly as a flag. In the meantime, any help you can offer regarding where I can insert that in the eval.py tree would be much appreciated.

@schesho

schesho commented Jul 5, 2017

A workaround would be to add the following line in trainer.py:

session_config.gpu_options.per_process_gpu_memory_fraction = 0.8
(or any fraction you want, such that 1 - fraction is sufficient for your eval.py to run)

after the line:

    session_config = tf.ConfigProto(allow_soft_placement=True,
                                    log_device_placement=False)
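
For context, the resulting block in trainer.py would look roughly like this (a minimal sketch; the 0.8 fraction is just an example value):

    session_config = tf.ConfigProto(allow_soft_placement=True,
                                    log_device_placement=False)
    # Cap the training process at ~80% of GPU memory, leaving the
    # remainder free for eval.py on the same device.
    session_config.gpu_options.per_process_gpu_memory_fraction = 0.8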

@drpngx
Contributor

drpngx commented Jul 6, 2017

Yes, you can use CUDA_VISIBLE_DEVICES or the gpu_options to select what you have access to.
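
For example, from the shell (a sketch; the ${...} variables are placeholders for your own paths), pinning eval.py to the second GPU:

CUDA_VISIBLE_DEVICES=1 python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --checkpoint_dir=${PATH_TO_TRAIN_DIR} \
    --eval_dir=${PATH_TO_EVAL_DIR}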

@cy89 cy89 added the stat:awaiting response Waiting on input from the contributor label Jul 7, 2017
@cy89

cy89 commented Jul 7, 2017

@aloerch does that work for you?

@aloerch
Author

aloerch commented Jul 8, 2017

I will be trying @drpngx's solution tonight and will let you know. @schectman's solution would only be ideal for me if I were limited to 1 GPU, but I have 2.

Cheers!

@aselle aselle removed the stat:awaiting response Waiting on input from the contributor label Jul 8, 2017
@aloerch
Author

aloerch commented Jul 8, 2017

Ok, so I do not know how to create commits or pull requests to submit my suggested changes on GitHub, but here's what I've worked out. As currently coded, object detection's eval process will in some cases fail with a GPU out-of-memory error when run at the same time as the train process. The options are 1) change the hard-coded fraction of GPU memory used for training in the code on a project-by-project basis, or 2) make it possible for a user to select a different GPU for the eval process. I've confirmed that the changes listed below work, and I would recommend adding them to eval.py to improve the usability of object detection.

Add this line to eval.py above import functools:
import os

Add this line to eval.py in the flags:

flags.DEFINE_string('gpudev','',
                    'Select a GPU Device.')

Add this line below FLAGS = flags.FLAGS:
os.environ["CUDA_VISIBLE_DEVICES"] = 'FLAGS.gpudev'

Finally, when running eval.py from the terminal, a person could use:
--gpudev=1

or some other number (0, 1, 2, etc.) to designate the GPU to be used for the process.
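
Putting the pieces together, the top of eval.py would look roughly like this (a sketch; note that, per the correction further down this thread, FLAGS.gpudev should be passed without quotes):

import os  # new import, placed above the existing functools import

flags = tf.app.flags
flags.DEFINE_string('gpudev', '',
                    'Select a GPU Device.')
FLAGS = flags.FLAGS
# Must be set before any session is created, so TensorFlow only sees this device.
os.environ["CUDA_VISIBLE_DEVICES"] = FLAGS.gpudev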

This workaround has resolved my own problem with running the 2 processes simultaneously. Thanks for all of your help!

@slandersson

slandersson commented Jul 10, 2017

Hi @aloerch, I also have 2 GPUs and have tried what you've done, but I get an error saying it cannot detect the CUDA device. Both of them have the same bus ID, so maybe this is the issue?

I can run the train and evaluation at 70% - 30% on a single GPU and memory is being allocated across both.

I might just run 2 VMs one for each job and GPU.

Edit: On second thought, the eval probably doesn't slow down the process much at all. Is there maybe a way to get the training to utilise both GPUs instead?

[image: screenshot of GPU memory usage across both cards]

@aloerch
Author

aloerch commented Jul 10, 2017

@slandersson I'm not sure why you got the error about it not detecting the CUDA device. Maybe you could post the traceback and the lines of code you edited from my recommendation so I can see if I can locate the error? You may have entered a typo, or something else could be going on. Also, make sure you check your GPU device number: your first GPU would be GPU 0, the second would be GPU 1, and the third would be GPU 2. This means that if you have 2 GPUs and you entered --gpudev=2, that would not work, because you don't have a third GPU.

Even if the eval process doesn't use much GPU, if you hardcode a 70/30 limit for train/eval, your training will be limited to 70%, so that is not a solution I would want to use myself.

@michaelisard michaelisard added the stat:awaiting response Waiting on input from the contributor label Jul 10, 2017
@thess24
Contributor

thess24 commented Jul 11, 2017

I agree there should be support for running eval and train on separate GPUs. I think another nice feature would be a flag that lets the GPU memory allocation grow rather than using all the memory at the start.

Because I'm working with a single GPU, I've modified the session_config in trainer.py so train.py doesn't automatically consume all the GPU memory:

session_config.gpu_options.allow_growth=True

After I let this run for a while, I'll start up the eval.py script, which seems to be working, but there is certainly a better way to do this. Possibly both the evaluation and training jobs could take a flag to allow growth, so this would automatically work on a single GPU; a sketch of such a flag follows.
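
A hypothetical version of such a flag in trainer.py might look like this (a sketch only; the flag name is made up and not part of the current code):

# Hypothetical flag, defined alongside the existing ones:
flags.DEFINE_boolean('gpu_allow_growth', False,
                     'Allocate GPU memory on demand instead of all up front.')

# ...then, where trainer.py builds its session config:
session_config = tf.ConfigProto(allow_soft_placement=True,
                                log_device_placement=False)
if FLAGS.gpu_allow_growth:
    session_config.gpu_options.allow_growth = True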

@aloerch
Author

aloerch commented Jul 12, 2017

@thess24 that might be a good feature request. It would be easy to do... my modifications for fixing the GPU out-of-memory error took a very small amount of time. It would take me longer to learn how to create pull requests than to implement something like that, haha.

@slandersson did my feedback help with your issue?

@slandersson

slandersson commented Aug 1, 2017

@aloerch

Add this line below FLAGS = flags.FLAGS:
os.environ["CUDA_VISIBLE_DEVICES"] = 'FLAGS.gpudev'

Perhaps this should be FLAGS.gpudev without quotes?

os.environ["CUDA_VISIBLE_DEVICES"] = FLAGS.gpudev

@aloerch
Author

aloerch commented Aug 1, 2017

@slandersson
Did you try it successfully without the quotes around FLAGS.gpudev? My current Python version is 2.7, and it worked with the quotes...

@slandersson

@aloerch Yep, I can confirm it works without the quotes. Python 2.7 too.

@pkdogcom

@michaelisard Any updates on this issue?

@michaelisard

@tombstone ?

@pkdogcom

@aloerch @slandersson I also had to remove the quotes to make it work. Otherwise, it would keep throwing a CUDA_ERROR_NO_DEVICE error and end up using the CPU.

@aloerch
Author

aloerch commented Aug 12, 2017

@tombstone this is low-hanging fruit for a pull request :) I'd be happy to provide additional contributions in the future too.

@joeysu

joeysu commented Sep 14, 2017

Train on a single GPU and eval on CPU:

  • Training: Allow growth

In trainer.py, add

session_config.gpu_options.allow_growth = True

after

session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)

  • Evaluation: Use CPU

In eval.py, add

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

@ghost

ghost commented Nov 3, 2017

When I try to reduce the memory allocated by my eval.py, it still runs out of memory. What could the issue be?
Thanks

@PythonImageDeveloper

PythonImageDeveloper commented Feb 28, 2018

@s5plus1, I followed your instructions. I opened 2 terminal windows; in one I ran train.py, which is running well, but when I run eval.py in the other terminal window, I get this error:

python3 eval.py \
    --logtostderr \
    --checkpoint_dir=training_ssd_mobile \
    --eval_dir=eval_caltech \
    --pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_coco.config

INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-28 19:37:41.333605: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-28 19:37:41.336325: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-02-28 19:37:41.336370: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: mm
2018-02-28 19:37:41.336378: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: mm
2018-02-28 19:37:41.336415: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.111.0
2018-02-28 19:37:41.336437: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
"""
2018-02-28 19:37:41.336452: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.111.0
2018-02-28 19:37:41.336460: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.111.0
INFO:tensorflow:Restoring parameters from training_ssd_mobile/model.ckpt-0
INFO:tensorflow:Restoring parameters from training_ssd_mobile/model.ckpt-0
WARNING:root:image 0 does not have groundtruth difficult flag specified
WARNING:root:image 1000 does not have groundtruth difficult flag specified
WARNING:root:image 2000 does not have groundtruth difficult flag specified
[......]

@aloerch
Author

aloerch commented Feb 28, 2018

@zeynali following @s5plus1's solution disables CUDA devices from being found by eval.py, which is why you see a message about no CUDA devices being available. Does the eval keep running? It should have used the CPU and kept running the eval process.


@PythonImageDeveloper

PythonImageDeveloper commented Feb 28, 2018

@s5plus1, I followed your instructions. You mention "Evaluation: Use CPU", so it should not be running on the GPU, right?

@angerson angerson added the stat:awaiting model gardener Waiting on input from TensorFlow model gardener label Feb 28, 2018
@joeysu

joeysu commented Mar 2, 2018

@zeynali It seems that the CUDA devices were not disabled correctly and you were using the GPU instead of the CPU, which doesn't make sense if you followed my code:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

@PythonImageDeveloper

PythonImageDeveloper commented Mar 2, 2018

@s5plus1, yes, I did the same. It seems that os.environ['CUDA_VISIBLE_DEVICES'] = '-1' didn't work for me. I have TF 1.5, CUDA 8, cuDNN 7; perhaps I must change '-1' to something else. What do you think?

@joeysu

joeysu commented Mar 3, 2018

@zeynali
Yes, you can change it if you have multiple GPUs.

'-1' means CPU,
'0' means GPU 0,
'1' means GPU 1,
etc.

Or you can try another way:

config = tf.ConfigProto(
    device_count={'GPU': 0}  # caps the number of visible GPUs at zero, forcing CPU
)
sess = tf.Session(config=config)

According to https://groups.google.com/a/tensorflow.org/forum/m/#!topic/discuss/cFsmoeO9Nd4

@PythonImageDeveloper

PythonImageDeveloper commented Mar 3, 2018

@s5plus1, where do I add these lines? And why 'GPU': 0? I only want to run on the CPU; I don't have enough GPU memory. When I train my model on the GPU, the process allocates the whole GPU memory to the training phase, even though my model is small and can run in only 4 GB of GPU memory. Do you have any ideas to solve this problem?

config = tf.ConfigProto(
    device_count={'GPU': 0}
)
sess = tf.Session(config=config)

@frostell

A simple (but effective) way of preventing the eval job or TensorBoard from crashing the training is to create a minimal virtual environment and install TensorFlow without GPU support in that environment. Then simply activate the virtual environment and start the eval job, and it will never take up GPU memory...

@PythonImageDeveloper

@frostell , Have you tried it yourself?

@frostell

frostell commented Mar 21, 2018

Yes @zeynali, of course! This is how I always run my models...

  1. install virtualenv (I use "~/virtual_tf" as my {path-to-your-virtualenv})
    sudo apt-get install python3-pip python3-dev python-virtualenv
    virtualenv --system-site-packages -p python3 {path-to-your-virtualenv}

  2. activate virtualenv
    source {path-to-your-virtualenv}/bin/activate

  3. upgrade pip and install tensorflow without GPU support
    easy_install -U pip
    pip3 install --upgrade tensorflow

  4. now when training on your GPU, just activate your virtualenv again with...
    source {path-to-your-virtualenv}/bin/activate

...and start your eval job with whatever command you are using.

Some additional tests:

  1. To test that it's working as expected, try
    pip3 list | grep tensorflow
    I get:
    tensorflow (1.6.0)
    tensorflow-gpu (1.5.0)

  2. To see what tensorflow version is imported in a specific virtualenv, you can run
    python3 -c "import tensorflow as tf; print(tf.__version__)"
    in that virtualenv

For my GPU-installation I get:
1.5.0
For my CPU-installation I get:
1.6.0

Confirming that Python imports a different version of TensorFlow in each environment...

Good luck!!

@frostell

(As I understand it, the virtual environment is like an isolated part of your OS where everything from locale to PATH variables can be set separately. It also contains Python and TensorFlow installations separate from the ones you're using in your OS. This is good because you can tweak things without affecting other Python environments. I've actually put my tensorflow-gpu installation in a separate virtualenv as well.)

@madhavajay

A much easier way to stop eval from using the GPU is just to set the environment variable before running it and then unset it afterwards, e.g. put this in a .sh file.

export CUDA_VISIBLE_DEVICES=-1
python ./object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --checkpoint_dir=${PATH_TO_TRAIN_DIR} \
    --eval_dir=${PATH_TO_EVAL_DIR}
unset CUDA_VISIBLE_DEVICES
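
Equivalently (a standard shell idiom, nothing specific to these scripts), the variable can be scoped to a single command, so there is nothing to unset afterwards:

CUDA_VISIBLE_DEVICES=-1 python ./object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --checkpoint_dir=${PATH_TO_TRAIN_DIR} \
    --eval_dir=${PATH_TO_EVAL_DIR}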

@NightFury13

Just wondering, is there also a way to set the GPU memory fraction from the command line, like CUDA_VISIBLE_DEVICES?

@joydeepmedhi

@Schechtman tensorflow-gpu is still taking the full memory even after setting the fraction to 0.5 (object detection).
Did that work for you?

@madhavajay

Do you also need to set allow_growth=True?

@jvishnuvardhan
Contributor

Automatically closing this out since I understand it to be resolved, but please let me know if I'm mistaken. Please open a new issue if there are any unresolved issues. Thanks!

@madhavajay

You can also disable the GPU directly in Python or Jupyter by placing this in a cell before you load TensorFlow. This works great if you have notebooks or code that you want to run while you're also trying to train models, etc.

# disable the GPU so TensorFlow doesn't allocate the whole GPU memory to this notebook
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
