Incorrect gradient for ctc_loss on GPU when using logit_length #41280
Comments
@Saduf2019
@pvanhaes,
As far as I can tell, yes? It seems to be related, since the problem occurs only when using the cudnn ctc loss implementation.
I tried the script on V100 a couple of times and I can see the flakiness:
Run X:
Looking into it.
I think the issue is that cudnn doesn't zero out the grads beyond the sequence length. So, if the grads array happens to contain some large numbers there, you will encounter the reported high error. One workaround is to explicitly apply the mask like below, where I use sequence_mask() to generate a mask based on `logits_lengths`. Also, I will file a bug towards our cudnn team to fix this issue.

```python
import tensorflow as tf
tf.random.set_seed(1)
use_logits_lengths = True
batch_size = 8
num_labels = 27
max_labels_length = 32
max_logits_length = 128
#batch_size = 4
#num_labels = 6
#max_labels_length = 32
#max_logits_length = 64
labels = []
labels_lengths = []
logits = []
logits_lengths = []
for i in range(batch_size):
    labels_lengths.append(tf.random.uniform([], 1, max_labels_length, tf.int32))
    labels.extend(tf.random.uniform([labels_lengths[-1]], 0, num_labels-1, tf.int32))
    # I multiply label_length by 2 to make sure there are enough frames
    logits_lengths.append(tf.random.uniform([], labels_lengths[-1].numpy()*2, max_logits_length+1, tf.int32))
labels = tf.RaggedTensor.from_row_lengths(labels, labels_lengths).to_sparse()
labels_lengths = tf.stack(labels_lengths, 0)  # tf.stack, not tf.concat: the entries are scalar tensors
logits_lengths = tf.stack(logits_lengths, 0)
logits_lengths_full = tf.constant([max_logits_length]*batch_size)
logits = tf.random.uniform([batch_size, max_logits_length, num_labels])
logit_mask = tf.sequence_mask(logits_lengths, max_logits_length, tf.dtypes.float32)
logit_mask = tf.expand_dims(logit_mask, axis=2)
#print("XXX", logit_mask)
def ctc_compare_cpu_gpu(logits_lengths, mask=None):
    print("logits_lengths", logits_lengths.numpy())
    print("labels_lengths", labels_lengths.numpy())
    with tf.device("/gpu:0"):
        with tf.GradientTape() as t:
            t.watch(logits)
            gpu_loss = tf.nn.ctc_loss(labels, logits, labels_lengths, logits_lengths,
                                      logits_time_major=False, blank_index=-1)
        gpu_grad = t.gradient(gpu_loss, [logits])[0]
        if mask is not None:
            gpu_grad = gpu_grad * mask
    with tf.device("/cpu:0"):
        with tf.GradientTape() as t:
            t.watch(logits)
            cpu_loss = tf.nn.ctc_loss(labels, logits, labels_lengths, logits_lengths,
                                      logits_time_major=False, blank_index=-1)
        cpu_grad = t.gradient(cpu_loss, [logits])[0]
    print("Max loss error", tf.math.abs(gpu_loss - cpu_loss).numpy().max())
    print("Max grad error", tf.math.abs(gpu_grad - cpu_grad).numpy().max())
    return cpu_loss, gpu_loss, cpu_grad, gpu_grad
ctc_compare_cpu_gpu(logits_lengths_full)
ctc_compare_cpu_gpu(logits_lengths, mask=logit_mask)
#ctc_compare_cpu_gpu(logits_lengths)
```
Hi @kaixih
I am not sure the NaN issue is caused by those "unused and not correctly initialized" gradients output by the cudnn backend. Do you mean you still hit the NaN issue even after manually applying the masks over the gradients returned by the ctc loss call on GPU?
What happened was that even when masking the gradients (using the workaround above), I still ran into the NaN issue.
Thanks. It sounds like a numeric precision issue. If possible, could you also give it a shot with TF_CUDNN_DETERMINISTIC=1, which will force TF to use a deterministic CTC algorithm (however, this requires the label size to be under 256). By default, TF uses a non-deterministic algorithm.
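For reference, a minimal sketch of turning the flag on from Python (assuming a TF 2.x script; exporting the variable in the shell before launching works just as well):

```python
import os

# Must be set before TensorFlow initializes its cuDNN kernels,
# so set it before the first GPU op runs (safest: before the import).
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"

import tensorflow as tf  # imported after the flag is set
```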
Hi, I encountered the same problem: the CPU version of ctc_loss works fine, but the GPU version gives NaN. I've tried setting TF_CUDNN_DETERMINISTIC=1 with batch_size=512, but the issue persists. However, setting batch_size=128 or 256 seems to fix the issue.
Hi @AveryLiu That is interesting. The deterministic algorithm shouldn't be sensitive to the batch size. Are you working with variable sequence lengths or fixed sequence lengths? If it is variable sequence lengths, maybe it is a fluke that batch sizes of 128 or 256 work.
I'm using variable sequence lengths. Indeed, setting batch_size to 128 or 256 does not solve the problem: it does not give me NaNs, but the gradients seem incorrect and the loss stagnates (the CPU version is fine). I am not sure how to apply the gradient masking mentioned above, since I am using a fully encapsulated Keras model. Another thing I observed is that the model can be trained on data where all label lengths are 1.
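One possible way to apply the mask without a custom training loop (a sketch with an illustrative helper name, not code from this thread) is to insert an identity op with a custom gradient in front of the logits, so the mask is applied wherever the loss consumes them:

```python
import tensorflow as tf

@tf.custom_gradient
def zero_grads_past_length(logits, mask):
    # Identity on the forward pass; on the backward pass, multiply the
    # incoming gradient by `mask` so entries past each sequence end are zeroed.
    def grad(upstream):
        return upstream * mask, None  # no gradient w.r.t. the mask
    return tf.identity(logits), grad

# Hypothetical usage inside a model or loss function, with `logit_mask`
# built via tf.sequence_mask as in the workaround earlier in the thread:
# masked_logits = zero_grads_past_length(logits, logit_mask)
# loss = tf.nn.ctc_loss(labels, masked_logits, labels_lengths, logits_lengths,
#                       logits_time_major=False, blank_index=-1)
```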
Was able to replicate the issue in TF v2.5, please find the gist here. Thanks!
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
Hi @pvanhaes, you need to set GPU memory growth to True. The code snippet below worked.
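A minimal sketch of enabling memory growth with the standard `tf.config` API (TF 2.x; not necessarily the exact snippet from this comment), which must run before any GPU has been initialized:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving the whole device
# up front; must be configured before the first GPU op runs.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```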
Any update on this? In TF 2.11, the gradients still seem to be unequal when using a SparseTensor.
System information
Describe the current behavior
I have experienced inconsistencies in the computation of the gradient of `tf.nn.ctc_loss` between the CPU and GPU implementations when the `logit_length` argument contains something other than `[num_frames]*batch_size`.

Mostly I observe that the gradient relative to `logits` for the GPU implementation does not contain zeros after the end of the sequence as given by `logit_length`, whereas this is the case for the CPU implementation, which seems to work correctly.

I have noticed that the unit tests for this op do not test this case in particular (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/kernel_tests/ctc_loss_op_test.py#L993).
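A minimal sketch of the check this implies (illustrative shapes and values, not the repro script originally attached to the issue):

```python
import tensorflow as tf

batch_size, num_frames, num_labels = 2, 10, 5
logits = tf.random.uniform([batch_size, num_frames, num_labels])
labels = tf.sparse.from_dense(tf.constant([[1, 2, 3], [2, 1, 0]]))  # 0 acts as padding
logit_length = tf.constant([8, 6])  # deliberately shorter than num_frames

with tf.GradientTape() as tape:
    tape.watch(logits)
    loss = tf.nn.ctc_loss(labels, logits, None, logit_length,
                          logits_time_major=False, blank_index=-1)
grad = tape.gradient(loss, logits)

# Expected (and observed on CPU): all-zero gradients past each sequence end.
# On GPU (cuDNN), this issue reports that these slices can contain garbage.
print(grad[0, 8:].numpy())
print(grad[1, 6:].numpy())
```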
Standalone code to reproduce the issue
Output: