
Internal error Blas xGEMV launch failed on TensorFlow v2.8.0 for the same block of code that runs perfectly well on TensorFlow v2.4.1 #54463

Closed
arvindrajan92 opened this issue Feb 21, 2022 · 12 comments
Labels
TF 2.8 · type:bug (Bug) · type:build/install (Build and install issues)

Comments

@arvindrajan92

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Unknown
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): v2.8.0-rc1-32-g3f878cff5b6 2.8.0
  • Python version: 3.8
  • Bazel version (if compiling from source): N.A.
  • GCC/Compiler version (if compiling from source): N.A.
  • CUDA/cuDNN version: CUDA 11.2.1
  • GPU model and memory: Tesla T4 / 16 GB

Describe the current behavior
Running a block of code with TensorFlow v2.8.0 / CUDA 11.2 / cuDNN 8.1 returns an internal error, Blas xGEMV launch failed, while the same code runs perfectly well with TensorFlow v2.4.1 / CUDA 11.0 / cuDNN 8.0.

Describe the expected behavior
Return the same output as TensorFlow v2.4.1 / CUDA 11.0 / cuDNN 8.0.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution (if contributing): N.A.

Standalone code to reproduce the issue

The following block of code works perfectly well with TensorFlow v2.4.1 / CUDA 11.0 / cuDNN 8.0, but not with TensorFlow v2.8.0 / CUDA 11.2 / cuDNN 8.1.

import tensorflow as tf
empty_image = tf.zeros(shape=[1280, 1280, 3], dtype=tf.float32)
gray_image = tf.image.rgb_to_grayscale(empty_image)

An important point to note is that when I reduce the shape of empty_image to [512, 512, 3], there is no issue. However, I do not believe this is a device memory issue, as I can reproduce it on a GeForce RTX 2080 Ti (11 GB) as well as on a Tesla T4 (16 GB).
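For completeness, one way to sidestep the failing MatMul is to compute the grayscale conversion with an element-wise weighted sum instead of tf.image.rgb_to_grayscale. This is only a sketch: it assumes the [0.2989, 0.5870, 0.1140] luma weights documented for rgb_to_grayscale, and I have not verified that it is numerically identical.

import tensorflow as tf

# Hypothetical workaround: do the luma transform as a broadcast multiply plus
# reduce_sum so no batched GEMV/MatMul kernel is launched.
def rgb_to_grayscale_no_gemv(image):
    weights = tf.constant([0.2989, 0.5870, 0.1140], dtype=image.dtype)
    return tf.reduce_sum(image * weights, axis=-1, keepdims=True)

empty_image = tf.zeros(shape=[1280, 1280, 3], dtype=tf.float32)
gray_image = rgb_to_grayscale_no_gemv(empty_image)  # shape [1280, 1280, 1]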

Other info / logs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/docrec/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ubuntu/miniconda3/envs/docrec/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMV launch failed : a.shape=[1,1638400,3], b.shape=[1,3,1], m=1638400, n=1, k=3 [Op:MatMul]
@arvindrajan92
Author

arvindrajan92 commented Feb 21, 2022

Currently, the workaround for me is to use CUDA 11.1 with cuDNN 8.1.1. I arrived at this after finding out that Google Colab has TensorFlow 2.8.0 installed but runs on CUDA 11.1, although TensorFlow's compatibility matrix recommends CUDA 11.2. When I installed CUDA 11.1, which is usually bundled with cuDNN 8.0.x, TensorFlow threw an error saying it requires cuDNN 8.1.x. Hence, upgrading cuDNN to 8.1.1 does the trick.

Having said that, I believe the reported bug is something to be looked at and addressed. I have a feeling this problem would appear in all TensorFlow versions that recommend CUDA 11.2 and cuDNN 8.1, i.e., TensorFlow >= 2.5.0, and I am saying this because I was getting the same error after downgrading to TensorFlow 2.7.0 on CUDA 11.2 and cuDNN 8.1.

For those who have CUDA 11.1 installed with cuDNN 8.0.x on Ubuntu 18.04 / 20.04, the following commands will upgrade your cuDNN version from 8.0.x to 8.1.1.

wget https://developer.download.nvidia.com/compute/redist/cudnn/v8.1.1/cudnn-11.2-linux-x64-v8.1.1.33.tgz -O /tmp/cudnn-11.2-linux-x64-v8.1.1.33.tgz
tar -xzvf /tmp/cudnn-11.2-linux-x64-v8.1.1.33.tgz -C /tmp/
sudo cp /tmp/cuda/include/cudnn*.h /usr/local/cuda/include
sudo cp /tmp/cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
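After the upgrade, a quick way to confirm which CUDA and cuDNN versions the installed TensorFlow wheel was built against is tf.sysconfig.get_build_info() (a minimal sketch; the exact dictionary keys may differ slightly between releases, and this reports the build-time versions rather than the locally installed libraries):

import tensorflow as tf

# Print the CUDA/cuDNN versions this TensorFlow binary was built against.
info = tf.sysconfig.get_build_info()
print(tf.__version__, info.get("cuda_version"), info.get("cudnn_version"))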

@tilakrayal
Contributor

@arvindrajan92 ,
Google Colab has TensorFlow 2.8.0 installed and runs on CUDA 11.2, and I was able to execute the given code without any issues. Please find the gist here. Thanks!

@arvindrajan92
Author

@arvindrajan92 , Google Colab has TensorFlow 2.8.0 installed and runs on CUDA 11.2, and I was able to execute the given code without any issues. Please find the gist here. Thanks!

Hi @tilakrayal, thank you for getting back to me. Your gist brings me back to this issue, though. Could you check your link, please? Also, this is my Google Colab notebook, which says CUDA 11.1 when I execute nvcc --version.

@gadagashwini
Contributor

@arvindrajan92,
The given configuration is among the tested build configurations, which have been verified on different platforms.

This error is usually due to one of the following:

  • An OOM error - the GPU is running out of memory
  • The GPU does not have enough compute capability
  • There is a driver issue

Can you verify the memory usage with nvidia-smi and check whether any other processes are using the GPU? Please also check the CUDA compute capability for the given NVIDIA drivers.

@arvindrajan92
Author

arvindrajan92 commented Feb 22, 2022

@arvindrajan92, The given configuration is among the tested build configurations, which have been verified on different platforms.

This error is usually due to one of the following:

  • An OOM error - the GPU is running out of memory
  • The GPU does not have enough compute capability
  • There is a driver issue

Can you verify the memory usage with nvidia-smi and check whether any other processes are using the GPU? Please also check the CUDA compute capability for the given NVIDIA drivers.

Thank you for taking a look at this issue @gadagashwini. Please allow me to address your points.

An OOM error - the GPU is running out of memory
Below is the block of code and its output when run on the AWS Deep Learning AMI GPU CUDA 11.2.1 (Ubuntu 20.04) 20220208. I have attached a screenshot from nvidia-smi taken after running the code. I can confirm that the GPU did not run out of memory and that no other processes were using the GPU, as you can see from nvidia-smi. Furthermore, running this code used less than 1 GB of GPU memory.

[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> physical_devices = tf.config.list_physical_devices('GPU')
2022-02-22 12:46:37.311936: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:46:41.624784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:46:41.625446: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>>> tf.config.experimental.set_memory_growth(physical_devices[0], True)
>>> empty_image = tf.zeros(shape=[1280, 1280, 3], dtype=tf.float32)
2022-02-22 12:47:11.120757: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-22 12:47:11.121428: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:11.122072: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:11.122620: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:12.496265: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:12.496875: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:12.497420: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:12.497989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13795 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5
>>> gray_image = tf.image.rgb_to_grayscale(empty_image)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/docrec/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ubuntu/miniconda3/envs/docrec/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMV launch failed : a.shape=[1,1638400,3], b.shape=[1,3,1], m=1638400, n=1, k=3 [Op:MatMul]

[nvidia-smi screenshot showing GPU memory usage after running the code]
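As an additional in-process cross-check, the memory TensorFlow has allocated can also be queried directly (a minimal sketch; tf.config.experimental.get_memory_info is available in recent TF 2.x releases and reports bytes):

import tensorflow as tf

# Report how much GPU memory TensorFlow has allocated so far, in MB.
mem = tf.config.experimental.get_memory_info('GPU:0')
print('current: %.1f MB, peak: %.1f MB' % (mem['current'] / 1e6, mem['peak'] / 1e6))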

The GPU does not have enough compute capability
From the code block above, you can see that the compute capability is 7.5. Is this not enough?
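If it helps, the compute capability can also be read programmatically (a minimal sketch; tf.config.experimental.get_device_details is available in recent TF 2.x releases):

import tensorflow as tf

# Print the name and compute capability of each visible GPU,
# e.g. ('Tesla T4', (7, 5)).
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(details.get('device_name'), details.get('compute_capability'))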

There is a driver issue
I can reproduce this on any AWS AMI with a GPU and CUDA 11.2.1 installed. Similarly, I can reproduce it on my local machine with a GeForce RTX 3060 (compute capability 8.6) on which the NVIDIA driver, CUDA 11.2.1, and cuDNN 8.1 are freshly installed. However, I don't see this issue on any AWS AMI with CUDA 11.1.1 installed after upgrading cuDNN from version 8.0 to 8.1, and I observe the same behaviour when installing CUDA 11.1.1 and cuDNN 8.1 on my local machine with the GeForce RTX 3060.

Are you able to run this on a physical machine with CUDA 11.2.1 and cuDNN 8.1 without issues?

@arvindrajan92
Author

arvindrajan92 commented Mar 2, 2022

Hi @gadagashwini, are you still looking into this issue? Thanks.

@gadagashwini
Contributor

@arvindrajan92,

However, I don't see this issue on any of AWS's AMI with CUDA 11.1.1 installed after upgrading cuDNN to version 8.1 (from version 8.0) - I observe the same behaviour when installing CUDA 11.1.1 and cuDNN 8.1 on my local machine with Geforce RTX 3060.

Indeed, this is expected behaviour. As per the TensorFlow documentation, CUDA 11.2 and cuDNN 8.1 are the compatible versions. I could run the given code on CUDA 11.2 with cuDNN 8.1. Thanks!

@njzjz
Contributor

njzjz commented Mar 8, 2022

It may be a bug in cuBLAS. cuBLAS 11.4 resolved an issue:

Some gemv cases were producing incorrect results if the matrix dimension (n or m) was large, for example 2^20.

In your case, m = 1638400 > 2^20. As cuBLAS is not open source, it's unclear which versions of cuBLAS have this issue.

@arvindrajan92
Author

Thank you @njzjz. Looking at tensorflow.python.framework.errors_impl.InternalError: Blas xGEMV launch failed : a.shape=[1,1638400,3], b.shape=[1,3,1], m=1638400, n=1, k=3 [Op:MatMul], it seems this is probably due to the bug in cuBLAS. When I change the shape to [512, 512, 3], I get the expected output.

From trying out different versions of CUDA, it seems the bug was introduced in CUDA 11.2 and only resolved in CUDA 11.4. I don't see TensorFlow throwing the error on CUDA 11.1.
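The sizes line up with that explanation (a quick check of my own, not taken from the release notes): the failing case has m = 1280 * 1280 = 1,638,400, above the 2^20 = 1,048,576 threshold, while the working [512, 512, 3] case has m = 262,144, below it.

# Quick arithmetic check of the image sizes against the 2^20 threshold.
print(1280 * 1280, 2 ** 20, 512 * 512)  # 1638400 1048576 262144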

Hi @gadagashwini, I am happy to close the issue since it is a bug in cuBLAS 11.2. I suppose this is something to keep in mind so that upcoming TensorFlow versions are not built against CUDA 11.3, which may also have the same bug in cuBLAS.

@gadagashwini
Contributor

@arvindrajan92,
Thanks for confirming. Since the issue is related to cuBLAS rather than TensorFlow, I will move this to closure. Thanks!


@Irfan-S-1

Blas xGEMV launch failed : a.shape=[1,8696332,3], b.shape=[1,3,1], m=8696332, n=1, k=3 [Op:MatMul]

I am getting the error above. I am using CUDA 11.2 and TensorFlow version 2.11.1.
