GPU memory error with train.py and eval.py running together #1854
A workaround would be to add the following in the trainer.py file (sketched below):
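The snippet itself was lost in formatting; a plausible sketch, assuming the TF 1.x trainer.py where a tf.ConfigProto named session_config is built (0.5 is the fraction discussed further down the thread):

```python
# Cap the share of GPU memory the training session may claim,
# leaving the remainder free for an eval job on the same GPU:
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5
```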
Yes, you can use
@aloerch does that work for you?
I will be trying @drpngx's solution tonight and let you know. @schectman's solution would only be ideal to me if I were limited to 1 GPU, but I have 2. Cheers!
Ok, so I do not know how to add commits or pull requests or submit my suggested changes to the code on GitHub, but here's what I've worked out. As currently coded, object detection's eval process will, in some cases, fail with a GPU out-of-memory error when run at the same time as the train process. The options are 1) to change the hard-coded fraction of GPU memory used for training within the code on a project-by-project basis, or 2) to make it possible for a user to select a different GPU for the eval process. I've confirmed that my changes listed below work, and I would recommend they be added to eval.py in order to improve the usability of object detection:
- Add one line to eval.py above import functools.
- Add one line to eval.py in the flags.
- Add one line below FLAGS = flags.FLAGS.

Finally, when running eval.py from the terminal, a person could pass the new flag with some number (0, 1, 2, etc.) to designate the GPU to be used for the process. This workaround has resolved my own problem with running the 2 processes simultaneously. Thanks for all of your help!
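The three snippets were lost in formatting. A sketch reconstructed from the rest of the thread (the gpudev flag name and the CUDA_VISIBLE_DEVICES mechanism are both referenced below; the help string is an assumption, and the quotes fix discussed later in the thread is already applied):

```python
# 1) Above "import functools":
import os

# 2) With the other flag definitions:
flags.DEFINE_string('gpudev', '0', 'ID of the GPU to run the evaluation on.')

# 3) Below "FLAGS = flags.FLAGS" -- restrict this process to the chosen GPU:
os.environ['CUDA_VISIBLE_DEVICES'] = FLAGS.gpudev
```

Then, for example, `python eval.py ... --gpudev=1` would run the evaluation on the second GPU.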
Hi @aloerch, I also have 2 GPUs and have tried what you've done, but I get an error saying it cannot detect the CUDA device. Both of them have the same bus ID, so maybe this is the issue? I can run the train and evaluation at 70% - 30% on a single GPU, and memory is being allocated across both. I might just run 2 VMs, one for each job and GPU. Edit: On second thought, the eval probably doesn't slow down the process much at all. Then maybe there is a way to get the training to utilise both GPUs?
@slandersson I'm not sure why you got the error about it not detecting the CUDA device. Maybe you could post the traceback and the lines of code you edited from my recommendation so I can try to locate the error? You may have entered a typo, or something else could be going on. Also, make sure you check your GPU device number: your first GPU would be GPU 0, the second would be GPU 1, and the third would be GPU 2. This means that if you have 2 GPUs and you entered --gpudev=2, that would not work, because you don't have a third GPU. Even if the eval process doesn't use much GPU, if you hardcode a 70/30 limit for train/eval, your training will be limited to 70%, so that is not a solution I would want to use for myself.
I agree there should be support for running eval and train on separate GPUs. I think another nice feature would be a flag so the GPU can allow for growth rather than using all the memory at the start. Because I'm working with a single GPU, I've modified the session_config in trainer.py so train.py doesn't automatically consume all the GPU memory (sketched below):
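The modified config was lost in formatting; a minimal sketch, assuming the tf.ConfigProto that trainer.py builds (constructor arguments assumed):

```python
# Build the session config so GPU memory is allocated on demand
# rather than all at once when the session starts:
session_config = tf.ConfigProto(allow_soft_placement=True,
                                log_device_placement=False)
session_config.gpu_options.allow_growth = True
```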
After I let this run for a while, I'll start up the eval.py script, which seems to be working, but there is certainly a better way to do this. Possibly both the evaluation and training jobs could have flags to allow for growth, so this would automatically work on a single GPU.
@thess24 that might be a good feature request. It would be easy to do: my modifications for fixing the GPU out-of-memory error took a very small amount of time. It would take longer for me to learn how to create pull requests than to implement something like that, haha. @slandersson did my feedback help with your issue?
Perhaps this should be FLAGS.gpudev without quotes?
@slandersson
@aloerch yep, confirmed it works without the quotes. Python 2.7 too.
@michaelisard Any updates on this issue?
@aloerch @slandersson I also had to remove the quotes to make it work. Otherwise, it would keep throwing a CUDA_ERROR_NO_DEVICE error and end up using the CPU.
@tombstone this is low-hanging fruit for a pull request :) I'd be happy to provide additional contributions in the future too.
Train on a single GPU and eval on CPU: in trainer.py, add one line behind an existing one, and in eval.py, add a line that hides the CUDA devices. Both are sketched below.
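The two snippets were lost in formatting. The eval.py line is the one @s5plus1 quotes again further down; the trainer.py line and its placement are assumptions based on the allow-growth discussion above:

```python
# In trainer.py, behind the line that builds the session config (assumed placement):
session_config.gpu_options.allow_growth = True

# In eval.py, near the top, before TensorFlow touches the GPU:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # no visible CUDA devices: eval falls back to CPU
```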
When I try to reduce the memory allocated by my eval.py, it runs out of memory. What could the issue be?
@s5plus1, I followed your commands and opened 2 terminal windows separately. In one I run train.py, which is OK and running well, but when I run eval.py in the other terminal window, I get this error:
INFO:tensorflow:depth of additional conv before box predictor: 0
@zeynali following @s5plus1's solution disables CUDA devices from being found by eval.py, which is why you see a message about no CUDA devices being available. Does the eval keep running? It should have fallen back to the CPU and kept running for the eval process.
@s5plus1, I followed your commands. You mention "Evaluation: use CPU"; therefore it should not run on the GPU, right?
@zeynali It seems that the CUDA devices were not disabled correctly and you were using the GPU instead of the CPU, which doesn't make sense if you followed my code.
@s5plus1, yes, I did the same. What it seems to me: os.environ['CUDA_VISIBLE_DEVICES'] = '-1' didn't work for me. I have TF 1.5, CUDA 8, cuDNN 7; perhaps I must change '-1' to something else. What do you think?
@zeynali '-1' means CPU. Or you can try another way, according to https://groups.google.com/a/tensorflow.org/forum/m/#!topic/discuss/cFsmoeO9Nd4 (sketched below):
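The snippet from the linked thread was lost in formatting; a sketch, inferred from the tf.ConfigProto fragment @zeynali quotes in the next comment:

```python
import tensorflow as tf

# Expose zero GPU devices to this session, forcing every op onto the CPU:
config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)
```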
@s5plus1, where do I add these lines, and why 'GPU': 0? I want to run only on the CPU; I don't have enough memory on the GPU. When I train my model on the GPU, the process allocates the whole GPU memory to the training phase. My model is small and can run on only a 4GB GPU. Do you have an idea how to solve this problem? config = tf.ConfigProto(device_count={'GPU': 0})
A simple (but effective) way of preventing the eval job or TensorBoard from crashing the training is to create a minimal virtual environment and install TensorFlow without GPU support in it. Then simply activate the virtual environment and start the eval job, and it will never take up GPU memory.
@frostell, have you tried it yourself?
Yes @zeynali, of course! This is how I'm always running my models...
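The setup commands were lost in formatting; a minimal sketch, assuming virtualenv and the pip CPU-only build (environment name illustrative):

```bash
# Create a minimal environment whose TensorFlow has no GPU support:
virtualenv ~/tf-cpu
source ~/tf-cpu/bin/activate
pip install tensorflow    # the CPU-only package, not tensorflow-gpu
```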
...and start your eval job with whatever command you are using. Some additional tests:
For my GPU installation I get a different version, confirming that Python imports different versions of TensorFlow in the different environments... Good luck!!
(As I understand it, the virtual environment is like an isolated part of your OS where everything from locale to PATH variables can be set separately. It also contains a Python and TensorFlow installation that is separate from the one you're using in your OS. This is good because you can tweak things without affecting other Python environments. I've actually put my TensorFlow-GPU installation in a separate virtualenv as well.)
A much easier way to keep eval from using the GPU is just to set the CUDA_VISIBLE_DEVICES variable before running it and then unset it after, e.g. put this in a .sh file (sketched below):
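The script body was lost in formatting; a sketch of such a .sh file (paths illustrative, flag names assumed from the object_detection eval.py):

```bash
#!/bin/sh
# Hide all GPUs from the eval job so it falls back to the CPU:
export CUDA_VISIBLE_DEVICES=""

python eval.py --logtostderr \
    --checkpoint_dir=path/to/train_dir \
    --eval_dir=path/to/eval_dir \
    --pipeline_config_path=path/to/pipeline.config

# Restore GPU visibility for anything run later in this shell:
unset CUDA_VISIBLE_DEVICES
```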
Just wondering: is there also a way to set the GPU fraction from the command line, like CUDA_VISIBLE_DEVICES?
@Schechtman tensorflow-gpu is still taking full memory even after setting the fraction to 0.5? (object detection)
Do you also need to set allow_growth=True?
Automatically closing this out since I understand it to be resolved, but please let me know if I'm mistaken. Please open a new issue if there are any unresolved issues. Thanks!
You can also disable the GPU directly in Python or Jupyter by placing this in a cell before you load TensorFlow. This works great if you have notebooks or code that you want to run while you're also trying to train models, etc.
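The cell contents were lost in formatting; a minimal sketch of the pattern:

```python
# Must run before TensorFlow is imported anywhere in the process:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'   # hide every GPU

import tensorflow as tf  # this import (and the rest of the notebook) now uses the CPU
```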
System information
First, run this in one terminal:
That runs fine and training works... then run this in a 2nd terminal:
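The two commands were lost in formatting; a sketch of the usual invocations (paths illustrative, flag names assumed from the object_detection scripts):

```bash
# Terminal 1: training
python train.py --logtostderr \
    --train_dir=path/to/train_dir \
    --pipeline_config_path=path/to/pipeline.config

# Terminal 2: evaluation
python eval.py --logtostderr \
    --checkpoint_dir=path/to/train_dir \
    --eval_dir=path/to/eval_dir \
    --pipeline_config_path=path/to/pipeline.config
```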
Evaluation fails with an error about the GPU being out of memory (training continues though in the other terminal window with no problem). Here is the traceback:
Based on the traceback, it looks like eval.py is trying to run the evaluation on the same GPU that is actively doing the training. I have a 2nd GPU, an EVGA GTX 1080 FTW with 8GB of RAM, that I would be happy to run eval.py on, and generally speaking, I know how to write my own TensorFlow graph using tf.device('/gpu:1')... but I cannot figure out where to insert this in the object_detection code.
I would recommend adding the ability to select the GPU to use for both training and evaluation, possibly as a flag. In the meantime, any help you can offer regarding where I can insert that in the eval.py tree would be much appreciated.