
CUDA_ERROR_OUT_OF_MEMORY: out of memory with RTX 2070 #25337

Closed
realEthanZou opened this issue Jan 30, 2019 · 7 comments
@realEthanZou commented Jan 30, 2019


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Win 10 1809
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: 1.12.0
  • Python version: 3.6.8
  • Installed using virtualenv? pip? conda?: venv
  • Bazel version (if compiling from source): NA
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: 9.0.176 / 7.4.2.24
  • GPU model and memory: RTX 2070 8G

Describe the problem
Out of memory. The same code ran perfectly in the same environment with a GTX 1070, so memory should not be the limiting factor; I only swapped the graphics card and then the error arose.
The error output was initially CUBLAS_STATUS_ALLOC_FAILED, so I looked around a bit and used this chunk of code from #7072:

import tensorflow as tf
from keras.backend import set_session

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
sess = tf.Session(config=config)
set_session(sess)
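For reference (this issue is against TF 1.12): on TensorFlow 2.x the same on-demand growth behavior is configured through `tf.config` instead of a session config. A minimal sketch, assuming a 2.x install:

```python
import tensorflow as tf

# TF 2.x equivalent of gpu_options.allow_growth: request memory on demand
# for every visible GPU instead of reserving the whole card up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```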

After applying this code, the error output is as attached below in the "Any other info / logs" section.
From "failed to allocate 2.00G (2147483648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory", I think the "freeMemory: 6.57GiB" is reported as usual but is somehow not available to CUDA.
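A quick pure-Python sanity check of the byte counts from the log (no GPU needed) confirms the point: every failed request is well under the reported 6.57 GiB of free memory.

```python
# Convert the failed allocation sizes from the log into GiB.
GIB = 2 ** 30

failed_requests = [2147483648, 1932735232, 1739461632, 1565515520]
for nbytes in failed_requests:
    print(f"{nbytes} bytes = {nbytes / GIB:.2f} GiB")
# Largest failed request is 2.00 GiB, far below the 6.57 GiB reported free.
```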

Provide the exact sequence of commands / steps that you executed before running into the problem
Try to train a CNN model with

model.fit_generator(train_datagen, steps_per_epoch=train_datagen.n // batch_size, epochs=epochs, verbose=2, validation_data=dev_datagen, validation_steps=dev_datagen.n // batch_size)
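For reference, the steps_per_epoch expression above is just floor division of the dataset size by the batch size. With the 209,023 training images reported in the log and a hypothetical batch_size of 32 (the actual value is not shown in the issue), it works out as:

```python
# Hypothetical illustration: n_train comes from the log line
# "Found 209023 images belonging to 2 classes."; batch_size is assumed.
n_train = 209023
batch_size = 32  # assumption for illustration only

steps_per_epoch = n_train // batch_size  # floor division drops the partial batch
print(steps_per_epoch)  # 6531
```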

Any other info / logs

Using TensorFlow backend.
2019-01-30 22:54:18.923758: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-01-30 22:54:19.139480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.57GiB
2019-01-30 22:54:19.139744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-30 22:54:20.029695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-30 22:54:20.029852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-30 22:54:20.029938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-30 22:54:20.030160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6309 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Found 209023 images belonging to 2 classes.
Found 11002 images belonging to 2 classes.
Epoch 1/200
2019-01-30 22:54:52.312147: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.00G (2147483648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.312662: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.80G (1932735232 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.312846: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.62G (1739461632 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.313021: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.46G (1565515520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.480583: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.480762: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.52GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.481276: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.481446: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.52GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.581230: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.581404: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.52GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.581690: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.581856: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.52GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.603866: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.604042: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.604320: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.604489: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.609692: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.609966: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.32GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.610424: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.610587: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.32GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.626099: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.626276: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.626553: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-01-30 22:54:52.626718: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-30 22:54:52.665175: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.85G (3060860928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

@bmiftah commented Apr 18, 2019

Has this problem been solved?

I am facing exactly the same problem with:
python 3.6.7
CUDA 9.0
cuDNN 7.3.1
GPU model: 2 GPUs, each an identical GTX 1080 Ti
tensorflow-gpu 1.11.0

running in a Python virtual environment.

I am not running a heavy model; I am running simple code just to check whether the GPU is loaded and used, but I got many lines of errors, all looking like the following:
2019-04-18 21:33:39.704417: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 8.62G (9253279744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

I restarted PyCharm and my system, but there was no change. Any help please?

@realEthanZou (Author) commented Apr 18, 2019

@bmiftah I did a clean install of everything with DDU, and it somehow works; that's why I closed the issue.

@bmiftah commented Apr 19, 2019

If I may ask, what do you mean by a clean install?

@GaranceRichard commented Sep 24, 2019

Same problem here. What is a clean install?

@realEthanZou (Author) commented Sep 24, 2019

I uninstalled all related packages and libraries, uninstalled the graphics driver with DDU, and then reinstalled everything. I don't know why it works, but my best guess is that it has something to do with the graphics driver. When I upgraded to a 2070 Super I also had to reinstall the driver to make it work.

@sadicu commented Jan 7, 2020

I think it happens because of a property of RTX graphics cards: a certain portion of RTX 20xx graphics memory (2.9 GB of the 7994 MB on an RTX 2070 Super) is only available when using the float16 data type in TensorFlow. If you want to allocate the whole card's memory, you must use both data types, float32 and float16.

import tensorflow as tf

# Wrap the optimizer so the graph rewrite inserts float16 casts automatically
# (model and custom_loss are defined elsewhere in the user's code).
opt = tf.keras.optimizers.Adam(1e-4)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
model.compile(loss=custom_loss, optimizer=opt)

@sadicu commented Jan 7, 2020

If you use opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt), nodes are automatically converted to float16 and all of your graphics card memory can be allocated.
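For anyone reading this on a newer stack: the graph-rewrite API mentioned above was later deprecated in favor of the Keras mixed-precision policy (TensorFlow 2.4+). A minimal sketch, assuming a 2.x install:

```python
import tensorflow as tf

# Enable mixed precision globally: layers compute in float16 while
# variables stay in float32 for numeric stability.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Under this policy, Keras automatically wraps the optimizer with loss
# scaling when the model is compiled, so no manual rewrite call is needed.
```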
