
GPU memory error with train.py and eval.py running together #1854

Closed
aloerch opened this issue Jul 4, 2017 · 36 comments
Labels
stat:awaiting model gardener Waiting on input from TensorFlow model gardener

Comments

@aloerch

aloerch commented Jul 4, 2017

System information

  • What is the top-level directory of the model you are using: object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): not yet...
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint 18
  • TensorFlow installed from (source or binary): pip tensorflow-gpu
  • TensorFlow version (use command below): ('v1.2.0-rc2-21-g12f033d', '1.2.0')
  • CUDA/cuDNN version: Cuda compilation tools, release 8.0, V8.0.44
  • GPU model and memory: eVGA GTX 1080 Ti, 11 GB
  • Exact command to reproduce:
    Run 1st in one terminal:
python object_detection/train.py \
    --logtostderr \
    --pipeline_config_path=/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/samples/configs/faster_rcnn_resnet101_pets_learn.config \
    --train_dir=/home/meh/.virtualenvs/Project/models/model/train

That runs fine and training works... then run in a 2nd terminal:

python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/samples/configs/faster_rcnn_resnet101_pets_learn.config \
    --checkpoint_dir=/home/meh/.virtualenvs/Project/models/model/train \
    --eval_dir=/home/meh/.virtualenvs/Project/models/model/eval

Evaluation fails with an error about the GPU being out of memory (training continues in the other terminal window with no problem). Here is the traceback:

tensorflow/models $ python object_detection/eval.py \
>     --logtostderr \
>     --pipeline_config_path=/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/samples/configs/faster_rcnn_resnet101_pets_learn.config \
>     --checkpoint_dir=/home/meh/.virtualenvs/Project//models/model/train \
>     --eval_dir=/home/meh/.virtualenvs/Project//models/model/eval
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
2017-07-04 13:05:20.813917: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:20.813976: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:20.813987: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:20.813995: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:20.814005: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-04 13:05:21.414540: E tensorflow/core/common_runtime/direct_session.cc:138] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11713708032
Traceback (most recent call last):
  File "object_detection/eval.py", line 161, in <module>
    tf.app.run()
  File "/home/meh/.virtualenvs/Project/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "object_detection/eval.py", line 157, in main
    FLAGS.checkpoint_dir, FLAGS.eval_dir)
  File "/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/evaluator.py", line 211, in evaluate
    save_graph_dir=(eval_dir if eval_config.save_graph else ''))
  File "/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/eval_util.py", line 515, in repeated_checkpoint_run
    keys_to_exclude_from_results)
  File "/home/meh/.virtualenvs/Project/lib/python2.7/site-packages/tensorflow/models/object_detection/eval_util.py", line 359, in run_checkpoint_once
    sess = tf.Session(master, graph=tf.get_default_graph())
  File "/home/meh/.virtualenvs/Project/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1292, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/meh/.virtualenvs/Project/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 562, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/meh/.virtualenvs/Project/local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Based on:

Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11713708032

it looks like eval.py is trying to run the evaluation on the same GPU that is actively doing the training. I have a 2nd GPU, an eVGA GTX 1080 FTW with 8 GB of memory, that I would be happy to run eval.py on, and generally speaking, I know how to write my own TensorFlow graph using tf.device('/gpu:1'), but I cannot figure out where to insert this in the object_detection code.

I would recommend adding the ability to select the GPU used for both training and evaluation, possibly as a flag. In the meantime, any help you can offer regarding where I can insert that in the eval.py tree would be much appreciated.

@schesho

schesho commented Jul 5, 2017

A workaround would be to add the following line in trainer.py:

session_config.gpu_options.per_process_gpu_memory_fraction = 0.8
(or any fraction you want, such that 1 - fraction is sufficient for your eval.py to run)

after the line:

    session_config = tf.ConfigProto(allow_soft_placement=True,
                                    log_device_placement=False)
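
For context, the resulting block in trainer.py would look roughly like this (a minimal sketch; the 0.8 fraction is just an example value):

    session_config = tf.ConfigProto(allow_soft_placement=True,
                                    log_device_placement=False)
    # Cap the training process at ~80% of GPU memory, leaving the
    # remainder free for eval.py on the same device.
    session_config.gpu_options.per_process_gpu_memory_fraction = 0.8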

@drpngx
Contributor

drpngx commented Jul 6, 2017

Yes, you can use CUDA_VISIBLE_DEVICES or the gpu_options to select what you have access to.
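
For example, from the shell (a sketch; the ${...} variables are placeholders for your own paths), pinning eval.py to the second GPU:

CUDA_VISIBLE_DEVICES=1 python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --checkpoint_dir=${PATH_TO_TRAIN_DIR} \
    --eval_dir=${PATH_TO_EVAL_DIR}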

@cy89 cy89 added the stat:awaiting response Waiting on input from the contributor label Jul 7, 2017
@cy89

cy89 commented Jul 7, 2017

@aloerch does that work for you?

@aloerch
Author

aloerch commented Jul 8, 2017

I will be trying @drpngx's solution tonight and will let you know. @schectman's solution would only be ideal for me if I were limited to 1 GPU, but I have 2.

Cheers!

@aselle aselle removed the stat:awaiting response Waiting on input from the contributor label Jul 8, 2017
@aloerch
Author

aloerch commented Jul 8, 2017

Ok, so I do not know how to create commits or pull requests to submit my suggested changes on GitHub, but here's what I've worked out. As currently coded, object detection's eval process will in some cases fail with a GPU out-of-memory error when run at the same time as the train process. The options are 1) change the hard-coded fraction of GPU memory used for training in the code on a project-by-project basis, or 2) make it possible for a user to select a different GPU for the eval process. I've confirmed that the changes listed below work, and I would recommend adding them to eval.py to improve the usability of object detection.

Add this line to eval.py above import functools:
import os

Add this line to eval.py in the flags:

flags.DEFINE_string('gpudev','',
                    'Select a GPU Device.')

Add this line below FLAGS = flags.FLAGS:
os.environ["CUDA_VISIBLE_DEVICES"] = 'FLAGS.gpudev'

Finally, when running eval.py from the terminal, a person could use:
--gpudev=1

or some other number (0, 1, 2, etc.) to designate the GPU to be used for the process.
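
Putting the pieces together, the top of eval.py would look roughly like this (a sketch; note that, per the correction further down this thread, FLAGS.gpudev should be passed without quotes):

import os  # new import, placed above the existing functools import

flags = tf.app.flags
flags.DEFINE_string('gpudev', '',
                    'Select a GPU Device.')
FLAGS = flags.FLAGS
# Must be set before any session is created, so TensorFlow only sees this device.
os.environ["CUDA_VISIBLE_DEVICES"] = FLAGS.gpudev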

This workaround has resolved my own problem with running the 2 processes simultaneously. Thanks for all of your help!

@slandersson

slandersson commented Jul 10, 2017

Hi @aloerch, I also have 2 GPUs and have tried what you've done, but I get an error saying it cannot detect the CUDA device. Both of them have the same bus ID, so maybe this is the issue?

I can run the train and evaluation at 70% - 30% on a single GPU and memory is being allocated across both.

I might just run 2 VMs one for each job and GPU.

Edit: On second thought, the eval probably doesn't slow down the process much at all. Is there maybe a way to get the training to utilise both GPUs instead?

[image: screenshot of GPU memory usage across both cards]

@aloerch
Author

aloerch commented Jul 10, 2017

@slandersson I'm not sure why you got the error about it not detecting the CUDA device. Maybe you could post the traceback and the lines of code you edited from my recommendation so I can see if I can locate the error? You may have entered a typo, or something else could be going on. Also, make sure you check your GPU device number: your first GPU would be GPU 0, the second would be GPU 1, and the third would be GPU 2. This means that if you have 2 GPUs and you entered --gpudev=2, that would not work, because you don't have a third GPU.

Even if the eval process doesn't use much GPU, if you hardcode a 70/30 limit for train/eval, your training will be limited to 70%, so that is not a solution I would want to use myself.

@michaelisard michaelisard added the stat:awaiting response Waiting on input from the contributor label Jul 10, 2017
@thess24
Contributor

thess24 commented Jul 11, 2017

I agree there should be support for running eval and train on separate GPUs. I think another nice feature would be a flag that lets the GPU memory allocation grow rather than using all the memory at the start.

Because I'm working with a single GPU, I've modified the session_config in trainer.py so train.py doesn't automatically consume all the GPU memory:

session_config.gpu_options.allow_growth=True

After I let this run for a while, I'll start up the eval.py script, which seems to be working, but there is certainly a better way to do this. Possibly both the evaluation and training jobs could take a flag to allow growth, so this would automatically work on a single GPU; a sketch of such a flag follows.
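
A hypothetical version of such a flag in trainer.py might look like this (a sketch only; the flag name is made up and not part of the current code):

# Hypothetical flag, defined alongside the existing ones:
flags.DEFINE_boolean('gpu_allow_growth', False,
                     'Allocate GPU memory on demand instead of all up front.')

# ...then, where trainer.py builds its session config:
session_config = tf.ConfigProto(allow_soft_placement=True,
                                log_device_placement=False)
if FLAGS.gpu_allow_growth:
    session_config.gpu_options.allow_growth = True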

@aloerch
Author

aloerch commented Jul 12, 2017

@thess24 that might be a good feature request. It would be easy to do... my modifications for fixing the GPU out-of-memory error took a very small amount of time. It would take me longer to learn how to create pull requests than to implement something like that, haha.

@slandersson did my feedback help with your issue?

@slandersson

slandersson commented Aug 1, 2017

@aloerch

Add this line below FLAGS = flags.FLAGS:
os.environ["CUDA_VISIBLE_DEVICES"] = 'FLAGS.gpudev'

Perhaps this should be FLAGS.gpudev without quotes?

os.environ["CUDA_VISIBLE_DEVICES"] = FLAGS.gpudev

@aloerch
Author

aloerch commented Aug 1, 2017

@slandersson
Did you try it successfully without the quotes around FLAGS.gpudev? My current Python version is 2.7, and it worked with the quotes...

@slandersson

@aloerch Yep, I can confirm it works without the quotes. Python 2.7 too.

@pkdogcom

@michaelisard Any updates on this issue?

@michaelisard

@tombstone ?

@pkdogcom

@aloerch @slandersson I also had to remove the quotes to make it work. Otherwise, it would keep throwing a CUDA_ERROR_NO_DEVICE error and end up using the CPU.

@aloerch
Author

aloerch commented Aug 12, 2017

@tombstone this is low-hanging fruit for a pull request :) I'd be happy to provide additional contributions in the future too.

@joeysu

joeysu commented Sep 14, 2017

Train on a single GPU and eval on CPU:

  • Training: Allow growth

In trainer.py, add

session_config.gpu_options.allow_growth = True

after

session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)

  • Evaluation: Use CPU

In eval.py, add

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

@ghost

ghost commented Nov 3, 2017

When I try to reduce the memory allocated by my eval.py, it still runs out of memory. What could the issue be?
Thanks

@PythonImageDeveloper

PythonImageDeveloper commented Feb 28, 2018

@s5plus1, I followed your instructions. I opened 2 terminal windows; in one I ran train.py, which is running well, but when I run eval.py in the other terminal window, I get this error:

python3 eval.py \
    --logtostderr \
    --checkpoint_dir=training_ssd_mobile \
    --eval_dir=eval_caltech \
    --pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_coco.config

INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-28 19:37:41.333605: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-28 19:37:41.336325: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-02-28 19:37:41.336370: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: mm
2018-02-28 19:37:41.336378: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: mm
2018-02-28 19:37:41.336415: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.111.0
2018-02-28 19:37:41.336437: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
"""
2018-02-28 19:37:41.336452: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.111.0
2018-02-28 19:37:41.336460: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.111.0
INFO:tensorflow:Restoring parameters from training_ssd_mobile/model.ckpt-0
INFO:tensorflow:Restoring parameters from training_ssd_mobile/model.ckpt-0
WARNING:root:image 0 does not have groundtruth difficult flag specified
WARNING:root:image 1000 does not have groundtruth difficult flag specified
WARNING:root:image 2000 does not have groundtruth difficult flag specified
[......]

@aloerch
Author

aloerch commented Feb 28, 2018

@zeynali following @s5plus1's solution disables CUDA devices from being found by eval.py, which is why you see a message about no CUDA devices being available. Does the eval keep running? It should have used the CPU and kept running the eval process.


@PythonImageDeveloper

PythonImageDeveloper commented Feb 28, 2018

@s5plus1, I followed your instructions. You mention "Evaluation: Use CPU", so it should not be running on the GPU, right?

@angerson angerson added the stat:awaiting model gardener Waiting on input from TensorFlow model gardener label Feb 28, 2018
@joeysu

joeysu commented Mar 2, 2018

@zeynali It seems that the CUDA devices were not disabled correctly and you were using the GPU instead of the CPU, which doesn't make sense if you followed my code:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

@PythonImageDeveloper

PythonImageDeveloper commented Mar 2, 2018

@s5plus1, yes, I did the same. It seems that os.environ['CUDA_VISIBLE_DEVICES'] = '-1' didn't work for me. I have TF 1.5, CUDA 8, cuDNN 7; perhaps I must change '-1' to something else. What do you think?

@joeysu

joeysu commented Mar 3, 2018

@zeynali
Yes, you can change it if you have multiple GPUs.

'-1' means CPU,
'0' means GPU 0,
'1' means GPU 1,
etc.

Or you can try another way:

config = tf.ConfigProto(
    device_count={'GPU': 0}  # caps the number of visible GPUs at zero, forcing CPU
)
sess = tf.Session(config=config)

According to https://groups.google.com/a/tensorflow.org/forum/m/#!topic/discuss/cFsmoeO9Nd4

@PythonImageDeveloper

PythonImageDeveloper commented Mar 3, 2018

@s5plus1, where do I add these lines? And why 'GPU': 0? I only want to run on the CPU; I don't have enough GPU memory. When I train my model on the GPU, the process allocates the whole GPU memory to the training phase, even though my model is small and can run in only 4 GB of GPU memory. Do you have any ideas to solve this problem?

config = tf.ConfigProto(
    device_count={'GPU': 0}
)
sess = tf.Session(config=config)

@frostell

A simple (but effective) way of preventing the eval job or TensorBoard from crashing the training is to create a minimal virtual environment and install TensorFlow without GPU support in that environment. Then simply activate the virtual environment and start the eval job, and it will never take up GPU memory...

@PythonImageDeveloper

@frostell , Have you tried it yourself?

@frostell

frostell commented Mar 21, 2018

Yes @zeynali, of course! This is how I always run my models...

  1. install virtualenv (I use "~/virtual_tf" as my {path-to-your-virtualenv})
    sudo apt-get install python3-pip python3-dev python-virtualenv
    virtualenv --system-site-packages -p python3 {path-to-your-virtualenv}

  2. activate virtualenv
    source {path-to-your-virtualenv}/bin/activate

  3. upgrade pip and install tensorflow without GPU support
    easy_install -U pip
    pip3 install --upgrade tensorflow

  4. now when training on your GPU, just activate your virtualenv again with...
    source {path-to-your-virtualenv}/bin/activate

...and start your eval job with whatever command you are using.

Some additional tests:

  1. To test that it's working as expected, try
    pip3 list | grep tensorflow
    I get:
    tensorflow (1.6.0)
    tensorflow-gpu (1.5.0)

  2. To see what tensorflow version is imported in a specific virtualenv, you can run
    python3 -c "import tensorflow as tf; print(tf.__version__)"
    in that virtualenv

For my GPU-installation I get:
1.5.0
For my CPU-installation I get:
1.6.0

Confirming that Python imports a different version of TensorFlow in each environment...

Good luck!!

@frostell

(As I understand it, the virtual environment is like an isolated part of your OS where everything from locale to PATH variables can be set separately. It also contains Python and TensorFlow installations separate from the ones you're using in your OS. This is good because you can tweak things without affecting other Python environments. I've actually put my tensorflow-gpu installation in a separate virtualenv as well.)

@madhavajay

A much easier way to stop eval from using the GPU is just to set the environment variable before running it and then unset it afterwards, e.g. put this in a .sh file.

export CUDA_VISIBLE_DEVICES=-1
python ./object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --checkpoint_dir=${PATH_TO_TRAIN_DIR} \
    --eval_dir=${PATH_TO_EVAL_DIR}
unset CUDA_VISIBLE_DEVICES
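
Equivalently (a standard shell idiom, nothing specific to these scripts), the variable can be scoped to a single command, so there is nothing to unset afterwards:

CUDA_VISIBLE_DEVICES=-1 python ./object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --checkpoint_dir=${PATH_TO_TRAIN_DIR} \
    --eval_dir=${PATH_TO_EVAL_DIR}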

@NightFury13

Just wondering, is there also a way to set the GPU memory fraction from the command line, like CUDA_VISIBLE_DEVICES?

@joydeepmedhi

@Schechtman tensorflow-gpu is still taking the full memory even after setting the fraction to 0.5 (object detection).
Did that work for you?

@madhavajay

Do you also need to set allow_growth=True?

@jvishnuvardhan
Contributor

Automatically closing this out since I understand it to be resolved, but please let me know if I'm mistaken. Please open a new issue if there are any unresolved issues. Thanks!

@madhavajay

You can also disable the GPU directly in Python or Jupyter by placing this in a cell before you load TensorFlow. This works great if you have notebooks or code that you want to run while you're also trying to train models, etc.

# disable the GPU so TensorFlow doesn't allocate the whole GPU memory to this notebook
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
