
Simple graph invoking tf.complex() doesn't work on GPU, but works on CPU #38443

Closed
isaacgerg opened this issue Apr 10, 2020 · 12 comments
Labels: comp:gpu, stale, stat:awaiting response, TF 2.1, type:bug

Comments

@isaacgerg

isaacgerg commented Apr 10, 2020

Environment: Windows 10, Python 3.6, TensorFlow 2.1.0-rc2

The code below is a minimal working example that demonstrates the bug. It results in CUDA_ERROR_LAUNCH_FAILED when run on the GPU, but runs without issues on the CPU. I suspect the problem lies in the tensor coming out of tf.complex(): if I do not use that function, the issue seems to go away.

The log below shows the error I get, followed by the code that reproduces it on Windows 10.

2020-04-10 16:19:43.846387: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-10 16:19:44.860247: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-04-10 16:19:44.879431: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-10 16:19:45.231402: E tensorflow/stream_executor/cuda/cuda_driver.cc:948] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-10 16:19:45.231880: E tensorflow/stream_executor/gpu/gpu_timer.cc:55] Internal: Error destroying CUDA event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-10 16:19:45.232665: E tensorflow/stream_executor/gpu/gpu_timer.cc:60] Internal: Error destroying CUDA event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-10 16:19:45.233121: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 8B (8 bytes) from device: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-10 16:19:45.233532: E tensorflow/stream_executor/stream.cc:5452] Internal: Failed to enqueue async memset operation: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-10 16:19:45.233951: E tensorflow/stream_executor/cuda/cuda_driver.cc:613] failed to load PTX text as a module: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-10 16:19:45.234331: E tensorflow/stream_executor/cuda/cuda_driver.cc:618] error log buffer (1024 bytes): 
2020-04-10 16:19:45.234634: W tensorflow/core/kernels/gpu_utils.cc:68] Failed to check cudnn convolutions for out-of-bounds reads and writes with an error message: 'Failed to load PTX text as a module: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure'; skipping this check. This only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2020-04-10 16:19:45.235499: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 8B (8 bytes) from device: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-10 16:19:45.235957: I tensorflow/stream_executor/stream.cc:4963] [stream=000001EA81FD2D60,impl=000001EA92405DC0] did not memzero GPU location; source: 0000008AE2D3C858
2020-04-10 16:19:45.236342: E tensorflow/stream_executor/cuda/cuda_driver.cc:613] failed to load PTX text as a module: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-10 16:19:45.236834: E tensorflow/stream_executor/cuda/cuda_driver.cc:618] error log buffer (1024 bytes): 
2020-04-10 16:19:45.237205: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: cuDNN launch failure : input shape([5,16,256,256]) filter shape([1,1,16,1])
	 [[{{node model/conv2d_1/Conv2D}}]]
import numpy as np
import tensorflow as tf
print(tf.__version__)

input_size = (256,256,1)
input_real = tf.keras.layers.Input(input_size)
input_imag = tf.keras.layers.Input(input_size)

# Get input into mag and phase 
cpx_input = tf.keras.layers.Lambda(lambda x: tf.complex(x[0], x[1]))([input_real, input_imag])    
abs_of_input = tf.math.abs(cpx_input)
phase_of_input =  tf.math.angle(cpx_input) 

# Add some trainable weights
conv1 = tf.keras.layers.Conv2D(16, 5, padding = 'same')(abs_of_input)
mask = tf.keras.layers.Conv2D(1, 1)(conv1) 
filtered_freq = mask * abs_of_input
reconstructedFreq_dc_centered = tf.complex(mask, 0.0) * tf.math.exp(tf.complex(0.0,1.0)*tf.complex(phase_of_input, 0.0))  # I believe this is the offending line
tmp = tf.math.abs(reconstructedFreq_dc_centered)

model = tf.keras.models.Model([input_real, input_imag], tmp)

model.summary()

model.compile(optimizer='SGD', loss = 'mse')

x_real = np.random.randn(5, 256, 256, 1)
x_imag = np.random.randn(5, 256, 256, 1)

model.train_on_batch(x = [x_real, x_imag], y = x_real)

EDIT 1: Simplified the code further.
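
As a workaround sketch while this is being debugged (assuming the goal is simply to keep tf.complex out of the graph), the complex-valued lines above could be replaced with real-valued equivalents like the following. This is only a sketch, not tested on the reporting setup:

# Sketch: real-valued equivalents of the tf.complex()-based steps above.
abs_of_input = tf.math.sqrt(tf.math.square(input_real) + tf.math.square(input_imag))
phase_of_input = tf.math.atan2(input_imag, input_real)
# mask * exp(i*phase) has real part mask*cos(phase) and imaginary part mask*sin(phase),
# so the magnitude of the reconstruction needs no complex tensors:
recon_real = mask * tf.math.cos(phase_of_input)
recon_imag = mask * tf.math.sin(phase_of_input)
tmp = tf.math.sqrt(tf.math.square(recon_real) + tf.math.square(recon_imag))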

isaacgerg added the type:bug label on Apr 10, 2020
Saduf2019 added the TF 2.1 label on Apr 12, 2020
@Saduf2019
Contributor

@isaacgerg
I ran the code you shared on tf-nightly and do not face any errors; please find the gist here on CPU, same on GPU.

Saduf2019 added the comp:gpu and stat:awaiting response labels on Apr 12, 2020
@isaacgerg
Author

@Saduf2019
I updated to tf-nightly and the bug still exists. Can you rerun in a Windows 10 environment (the environment where the error occurs, as noted in the first post)?

Saduf2019 assigned gowthamkpr and unassigned Saduf2019 on Apr 14, 2020
@gowthamkpr

As mentioned in the error message

Invoking GPU asm compilation is supported on Cuda non-Windows platforms only Relying on driver to perform ptx compilation. This message will be only logged once.

This is why you are running into the error on Windows, @isaacgerg.

@isaacgerg
Author

isaacgerg commented Apr 14, 2020

@gowthamkpr Why doesn't the driver perform the PTX compilation then? The operation is simple: a FOIL multiply of complex numbers.
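
For reference, that product amounts to four real multiplies and two real adds; a hypothetical helper (just a sketch, not a TensorFlow API) that performs the FOIL expansion on real-valued tensors:

import tensorflow as tf

def complex_multiply(a_re, a_im, b_re, b_im):
    # (a_re + i*a_im) * (b_re + i*b_im)
    #   = (a_re*b_re - a_im*b_im) + i*(a_re*b_im + a_im*b_re)
    real = a_re * b_re - a_im * b_im
    imag = a_re * b_im + a_im * b_re
    return real, imag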

tensorflowbutler removed the stat:awaiting response label on Apr 16, 2020
gowthamkpr assigned sanjoy and unassigned gowthamkpr on May 28, 2020
gowthamkpr added the stat:awaiting tensorflower label on May 28, 2020
@sanjoy
Contributor

sanjoy commented May 28, 2020

I think the message from redzone_allocator.cc is a red herring and that the CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure errors have some other root cause. Can you please attach the full log?

@isaacgerg
Author

Hi @sanjoy, the full log is in the first post. Let me know if you need anything else.

tensorflowbutler removed the stat:awaiting tensorflower label on May 31, 2020
@isaacgerg
Author

@sanjoy Any update on this? How can I help?

@sanjoy
Contributor

sanjoy commented Jul 15, 2020

Hi @isaacgerg,

It is quite difficult to say much from

2020-04-10 16:19:45.231402: E tensorflow/stream_executor/cuda/cuda_driver.cc:948] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure

if that's all the logs say. Can you try running with CUDA_LAUNCH_BLOCKING=1 set? Maybe that will help narrow this down.
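
One way to do that is to set the variable before TensorFlow initializes the GPU, either in the shell or at the very top of the script (a sketch; CUDA reads the variable at initialization, so it must be set before the first GPU op):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA kernel launches synchronous

import tensorflow as tf  # import TensorFlow only after setting the variable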

@Saduf2019
Contributor

@isaacgerg
Could you please update with respect to the above comment, or verify with a later TF version [2.4.1] whether you still face the issue.

Saduf2019 added the stat:awaiting response label on May 3, 2021
Saduf2019 self-assigned this on May 3, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler bot added the stale label on May 10, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.
