
Internal error Blas xGEMV launch failed on TensorFlow v2.8.0 for the same block of code that runs perfectly well on TensorFlow v2.4.1 #54463

Closed
arvindrajan92 opened this issue Feb 21, 2022 · 12 comments
Labels
TF 2.8 · type:bug (Bug) · type:build/install (Build and install issues)

Comments

@arvindrajan92

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Unknown
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): v2.8.0-rc1-32-g3f878cff5b6 2.8.0
  • Python version: 3.8
  • Bazel version (if compiling from source): N.A.
  • GCC/Compiler version (if compiling from source): N.A.
  • CUDA/cuDNN version: CUDA 11.2.1
  • GPU model and memory: Tesla T4 / 16 GB

Describe the current behavior
Running a block of code with TensorFlow v2.8.0 / CUDA 11.2 / cuDNN 8.1 returns an internal error, Blas xGEMV launch failed, while the same code runs perfectly well with TensorFlow v2.4.1 / CUDA 11.0 / cuDNN 8.0.

Describe the expected behavior
Return the same output as TensorFlow v2.4.1 / CUDA 11.0 / cuDNN 8.0.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution (if contributing): N.A.

Standalone code to reproduce the issue

The following block of code works perfectly well with TensorFlow v2.4.1 / CUDA 11.0 / cuDNN 8.0, but not with TensorFlow v2.8.0 / CUDA 11.2 / cuDNN 8.1.

import tensorflow as tf
empty_image = tf.zeros(shape=[1280, 1280, 3], dtype=tf.float32)
gray_image = tf.image.rgb_to_grayscale(empty_image)

An important point to note is that when I reduce the shape of empty_image to [512, 512, 3], there is no issue. However, I do not believe this is a device memory issue, as I can reproduce it on a GeForce RTX 2080 Ti (11 GB) as well as on a Tesla T4 (16 GB).
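For completeness, one way to sidestep the failing MatMul is to compute the grayscale conversion with an element-wise weighted sum instead of tf.image.rgb_to_grayscale. This is only a sketch: it assumes the [0.2989, 0.5870, 0.1140] luma weights documented for rgb_to_grayscale, and I have not verified that it is numerically identical.

import tensorflow as tf

# Hypothetical workaround: do the luma transform as a broadcast multiply plus
# reduce_sum so no batched GEMV/MatMul kernel is launched.
def rgb_to_grayscale_no_gemv(image):
    weights = tf.constant([0.2989, 0.5870, 0.1140], dtype=image.dtype)
    return tf.reduce_sum(image * weights, axis=-1, keepdims=True)

empty_image = tf.zeros(shape=[1280, 1280, 3], dtype=tf.float32)
gray_image = rgb_to_grayscale_no_gemv(empty_image)  # shape [1280, 1280, 1]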

Other info / logs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/docrec/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ubuntu/miniconda3/envs/docrec/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMV launch failed : a.shape=[1,1638400,3], b.shape=[1,3,1], m=1638400, n=1, k=3 [Op:MatMul]
@arvindrajan92
Author

arvindrajan92 commented Feb 21, 2022

Currently, the workaround for me is to use CUDA 11.1 with cuDNN 8.1.1. I arrived at this after finding out that Google Colab has TensorFlow 2.8.0 installed but runs on CUDA 11.1, although TensorFlow's compatibility matrix recommends CUDA 11.2. When I installed CUDA 11.1, which is usually bundled with cuDNN 8.0.x, TensorFlow threw an error saying it requires cuDNN 8.1.x. Hence, upgrading cuDNN to 8.1.1 does the trick.

Having said that, I believe the reported bug is something to be looked at and addressed. I have a feeling this problem would appear in all TensorFlow versions that recommend CUDA 11.2 and cuDNN 8.1, i.e., TensorFlow >= 2.5.0, and I am saying this because I was getting the same error after downgrading to TensorFlow 2.7.0 on CUDA 11.2 and cuDNN 8.1.

For those who have CUDA 11.1 installed with cuDNN 8.0.x on Ubuntu 18.04 / 20.04, the following commands will upgrade your cuDNN version from 8.0.x to 8.1.1.

wget https://developer.download.nvidia.com/compute/redist/cudnn/v8.1.1/cudnn-11.2-linux-x64-v8.1.1.33.tgz -O /tmp/cudnn-11.2-linux-x64-v8.1.1.33.tgz
tar -xzvf /tmp/cudnn-11.2-linux-x64-v8.1.1.33.tgz -C /tmp/
sudo cp /tmp/cuda/include/cudnn*.h /usr/local/cuda/include
sudo cp /tmp/cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
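After the upgrade, a quick way to confirm which CUDA and cuDNN versions the installed TensorFlow wheel was built against is tf.sysconfig.get_build_info() (a minimal sketch; the exact dictionary keys may differ slightly between releases, and this reports the build-time versions rather than the locally installed libraries):

import tensorflow as tf

# Print the CUDA/cuDNN versions this TensorFlow binary was built against.
info = tf.sysconfig.get_build_info()
print(tf.__version__, info.get("cuda_version"), info.get("cudnn_version"))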

@tilakrayal
Contributor

@arvindrajan92 ,
Google Colab has TensorFlow 2.8.0 installed and runs on CUDA 11.2, and I was able to execute the given code without any issues. Please find the gist here. Thanks!

@arvindrajan92
Author

@arvindrajan92 , Google Colab has TensorFlow 2.8.0 installed and runs on CUDA 11.2, and I was able to execute the given code without any issues. Please find the gist here. Thanks!

Hi @tilakrayal, thank you for getting back to me. Your gist brings me back to this issue, though. Could you check your link, please? Also, this is my Google Colab notebook, which says CUDA 11.1 when I execute nvcc --version.

@gadagashwini
Contributor

@arvindrajan92,
The given configuration is among the tested build configurations, which have been verified on different platforms.

This error is usually due to one of the following:

  • An OOM error - the GPU is running out of memory
  • The GPU does not have enough compute capability
  • There is a driver issue

Can you verify the memory usage with nvidia-smi and check whether any other processes are using the GPU? Please also check the CUDA compute capability for the given NVIDIA drivers.

@arvindrajan92
Author

arvindrajan92 commented Feb 22, 2022

@arvindrajan92, The given configuration is among the tested build configurations, which have been verified on different platforms.

This error is usually due to one of the following:

  • An OOM error - the GPU is running out of memory
  • The GPU does not have enough compute capability
  • There is a driver issue

Can you verify the memory usage with nvidia-smi and check whether any other processes are using the GPU? Please also check the CUDA compute capability for the given NVIDIA drivers.

Thank you for taking a look at this issue @gadagashwini. Please allow me to address your points.

An OOM error - the GPU is running out of memory
Below is the block of code and its output when run on the AWS Deep Learning AMI GPU CUDA 11.2.1 (Ubuntu 20.04) 20220208. I have attached a screenshot from nvidia-smi taken after running the code. I can confirm that the GPU did not run out of memory and that no other processes were using the GPU, as you can see from nvidia-smi. Furthermore, running this code used less than 1 GB of GPU memory.

[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> physical_devices = tf.config.list_physical_devices('GPU')
2022-02-22 12:46:37.311936: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:46:41.624784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:46:41.625446: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>>> tf.config.experimental.set_memory_growth(physical_devices[0], True)
>>> empty_image = tf.zeros(shape=[1280, 1280, 3], dtype=tf.float32)
2022-02-22 12:47:11.120757: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-22 12:47:11.121428: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:11.122072: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:11.122620: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:12.496265: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:12.496875: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:12.497420: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-22 12:47:12.497989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13795 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5
>>> gray_image = tf.image.rgb_to_grayscale(empty_image)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/docrec/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ubuntu/miniconda3/envs/docrec/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMV launch failed : a.shape=[1,1638400,3], b.shape=[1,3,1], m=1638400, n=1, k=3 [Op:MatMul]

[nvidia-smi screenshot showing GPU memory usage after running the code]
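As an additional in-process cross-check, the memory TensorFlow has allocated can also be queried directly (a minimal sketch; tf.config.experimental.get_memory_info is available in recent TF 2.x releases and reports bytes):

import tensorflow as tf

# Report how much GPU memory TensorFlow has allocated so far, in MB.
mem = tf.config.experimental.get_memory_info('GPU:0')
print('current: %.1f MB, peak: %.1f MB' % (mem['current'] / 1e6, mem['peak'] / 1e6))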

The GPU does not have enough compute capability
From the code block above, you can see that the compute capability is 7.5. Is this not enough?
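If it helps, the compute capability can also be read programmatically (a minimal sketch; tf.config.experimental.get_device_details is available in recent TF 2.x releases):

import tensorflow as tf

# Print the name and compute capability of each visible GPU,
# e.g. ('Tesla T4', (7, 5)).
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(details.get('device_name'), details.get('compute_capability'))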

There is a driver issue
I can reproduce this on any AWS AMI with a GPU and CUDA 11.2.1 installed. Similarly, I can reproduce it on my local machine with a GeForce RTX 3060 (compute capability 8.6) on which the NVIDIA driver, CUDA 11.2.1, and cuDNN 8.1 are freshly installed. However, I don't see this issue on any AWS AMI with CUDA 11.1.1 installed after upgrading cuDNN from version 8.0 to 8.1, and I observe the same behaviour when installing CUDA 11.1.1 and cuDNN 8.1 on my local machine with the GeForce RTX 3060.

Are you able to run this on a physical machine with CUDA 11.2.1 and cuDNN 8.1 without issues?

@arvindrajan92
Author

arvindrajan92 commented Mar 2, 2022

Hi @gadagashwini, are you still looking into this issue? Thanks.

@gadagashwini
Contributor

@arvindrajan92,

However, I don't see this issue on any of AWS's AMI with CUDA 11.1.1 installed after upgrading cuDNN to version 8.1 (from version 8.0) - I observe the same behaviour when installing CUDA 11.1.1 and cuDNN 8.1 on my local machine with Geforce RTX 3060.

Indeed, this is expected behaviour. As per the TensorFlow documentation, CUDA 11.2 and cuDNN 8.1 are the compatible versions. I could run the given code on CUDA 11.2 with cuDNN 8.1. Thanks!

@njzjz
Contributor

njzjz commented Mar 8, 2022

It may be a bug in cuBLAS. cuBLAS 11.4 resolved an issue:

Some gemv cases were producing incorrect results if the matrix dimension (n or m) was large, for example 2^20.

In your case, m = 1638400 > 2^20. As cuBLAS is not open source, it's unclear which versions of cuBLAS have this issue.

@arvindrajan92
Author

Thank you @njzjz. Looking at tensorflow.python.framework.errors_impl.InternalError: Blas xGEMV launch failed : a.shape=[1,1638400,3], b.shape=[1,3,1], m=1638400, n=1, k=3 [Op:MatMul], it seems this is probably due to the bug in cuBLAS. When I change the shape to [512, 512, 3], I get the expected output.

From trying out different versions of CUDA, it seems the bug was introduced in CUDA 11.2 and only resolved in CUDA 11.4. I don't see TensorFlow throwing the error on CUDA 11.1.
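The sizes line up with that explanation (a quick check of my own, not taken from the release notes): the failing case has m = 1280 * 1280 = 1,638,400, above the 2^20 = 1,048,576 threshold, while the working [512, 512, 3] case has m = 262,144, below it.

# Quick arithmetic check of the image sizes against the 2^20 threshold.
print(1280 * 1280, 2 ** 20, 512 * 512)  # 1638400 1048576 262144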

Hi @gadagashwini, I am happy to close the issue since it is a bug in cuBLAS 11.2. I suppose this is something to keep in mind so that upcoming TensorFlow versions are not built against CUDA 11.3, which may also have the same bug in cuBLAS.

@gadagashwini
Contributor

@arvindrajan92,
Thanks for confirming. Since the issue is related to cuBLAS rather than TensorFlow, I will move this to closure. Thanks!


@Irfan-S-1

Blas xGEMV launch failed : a.shape=[1,8696332,3], b.shape=[1,3,1], m=8696332, n=1, k=3 [Op:MatMul]

I am getting the error above. I am using CUDA 11.2 and TensorFlow version 2.11.1.
