Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CudnnLSTM variable sequence length sometimes fails with CUDNN_STATUS_EXECUTION_FAILED #41630

Closed
lissyx opened this issue Jul 22, 2020 · 35 comments
Closed
Assignees
Labels
comp:gpu TF 1.15 type:bug

Comments

@lissyx
Copy link
Contributor

@lissyx lissyx commented Jul 22, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian/Sid (2020-07-01), Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source and binary
  • TensorFlow version (use command below): 1.15
  • Python version: 3.6, 3.7.8
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 9.0
  • CUDA/cuDNN version: 10.0/7.4.1 ; 10.0/7.4.2.1 ; 10.0/7.5.1.10 ; 10.0/7.6.5.32
  • GPU model and memory: 2x RTX 2080 Ti ; 4x GTX 1080 Ti ;

You can collect some of this information using our environment capture
script
You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)": v1.15.3-0-g4386a6640c

Describe the current behavior
Training with some dataset triggers:

2020-07-22 16:15:42.108252: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED                                                                                                          
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_de
sc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 16:15:42.108385: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 

Describe the expected behavior
Training should succeed, or TensorFlow or CUDNN should expose a more actionable error

Standalone code to reproduce the issue
Will be provided after.

Other info / logs
Will be provided after. Some noisy debugging session can be seen at mozilla/DeepSpeech#3088

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 22, 2020

Repro material.

To repro with r1.15 tensorflow on current Debian Sid, I needed to use pyenv to have a local 3.7 python (latest version supported by r1.15), but if you run it on a system-provided 3.7, you'd prefer remove those line and just setup a virtualenv.

STR:

  1. Create ~/tmp/issue3308
  2. Placed this script in ~/tmp/issue3088/
$ cat run_local.sh 
#!/bin/bash

set -xe

COMMAND_PREFIX=$1
echo $COMMAND_PREFIX

rm -fr ~/.local/share/deepspeech/

export PYENV_VERSION=3.7.8
export PYENV_ROOT=$HOME/pyenv/
export PATH=$PYENV_ROOT/bin/:$PATH

eval "$(pyenv init -)"

pip --version

for p in tensorboard tensorflow tensorflow-estimator tensorflow-gpu-local;
do
    pip list $p && pip uninstall --yes $p
done;

cd ds/ && \
        pip3 install --upgrade pip==20.0.2 wheel==0.34.2 setuptools==46.1.3 && \
        DS_NOTENSORFLOW=y pip3 install --upgrade -e . && \
        pip3 install --upgrade tensorflow-gpu==1.15.3
cd ..

#pip install --upgrade $HOME/tmp/issue3088/wheel_dst/tensorflow_gpu_local-1.15.0-cp37-cp37m-linux_x86_64.whl
#pip install --upgrade $HOME/tmp/issue3088/wheel_dst/tensorflow_gpu_local-*-cp37-cp37m-linux_x86_64.whl

### 1.13
#patch -p1 -d $PYENV_ROOT/versions/$PYENV_VERSION < disable_tensorflow_contrib_cloud.patch
# tpu hack
#pip install --upgrade tensorflow-estimator==1.14.0

### after
#patch -p1 -d $PYENV_ROOT/versions/$PYENV_VERSION < disable_tensorflow_core_contrib_cloud.patch

nvidia-smi

python --version

#TF_CPP_MIN_VLOG_LEVEL=1
python -c 'import tensorflow as tf; tf.test.is_gpu_available()'

$COMMAND_PREFIX $PYENV_ROOT/versions/$PYENV_VERSION/bin/python ds/DeepSpeech.py \
        --show_progressbar true \
        --log_level 0 \
        --train_cudnn true \
        --alphabet_config_path ds/data/alphabet.txt \
        --scorer "" \
        --train_files csvs/train_debug_mini_As_Bs_Cs_local.csv \
        --train_batch_size 2 \
        --n_hidden 2048 \
        --audio_sample_rate 16000 \
        --epochs 3 \
        --learning_rate 0.0001 \
        --dropout_rate 0.4 \
        --lm_alpha 0.75 \
        --lm_beta 1.85
  1. git clone https://github.com/mozilla/DeepSpeech/ ds/ (you might also need git-lfs installed prior)
  2. Extract attached zip file for repro dataset

The zip file also features some Dockerfile, if it helps for repro.
tensorflow_issue_41630.zip

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 22, 2020

More infos / logs

Failure log is (with some extra debug by myself):

D Session opened.                                                      
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.                                                                                                                                                                    
I Initializing all variables.                                                                                                                                                                               
2020-07-22 16:06:28.419358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7   
I STARTING Optimization                                                                                                                                                                                     
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                                                                                                                                                                                                                                                                                2
020-07-22 16:06:29.819427: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0   
generate_values <deepspeech_training.util.sample_collections.CSV object at 0x7f497418e610>                                                                                                                  
yield generate_values 0 csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f4d1710a8d0>                                                                                                                                                                                                               
yield generate_values 1 csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f4d1710a850>                                                                                                                                                                                                               
yield generate_values 2 csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f4d1710ab10>
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
yield generate_values 3 csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f4d1710a9d0>                                                                                                                                                                                                               
yield generate_values 4 csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f4d1710a910>
yield generate_values 5 csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f4d1710abd0>
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>                                                                                                                                                                                                                                                      
CudnnRNNForwardOp         
ShouldUsePaddedIO time_major=1                     
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74              
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
files: ["csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav" "csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav"] [74 74] [[[-0.245490551 0.00717380643 0.10210821 ... 0.117108196 -0.0276376158 -0.126444399]
  [0.0757513046 -0.0512054712 -0.291382492 ... 0.243732437 -0.0566715039 0.0398223847]]
                                                                                                                                                                                                                                                                                                                                                                                                                         
 [[-0.195820108 0.138200715 0.0146510527 ... 0.177044764 0.0705049634 -0.189360529]
  [-0.028521724 0.25693503 -0.152943641 ... 0.121622048 -0.342009 0.0648547485]]
                                                                                                      
 [[-0.22990115 0.229410842 0.0562512279 ... 0.0885351449 -0.210647494 -0.115016311]
  [-0.522053659 -0.152149498 -0.0655985773 ... 0.0328139216 -0.0536242127 0.0909850895]]
                                                                                                      
 ...                                    
                                                                                                      
 [[0.125317723 -0.201877266 0.342403591 ... -0.461629033 -0.7424016 -0.634134829]
  [-0.495484978 0.0653357953 0.444910228 ... 0.356240422 0.116222136 -0.165624559]]
                                                                                                      
 [[-0.559318066 -0.360073239 0.386285067 ... 0.250025511 -0.437686384 -0.0856590122]
  [-0.429158121 -0.0336664394 -0.304231912 ... 0.120450795 -0.136865944 -0.0517320298]]
                                                                                                      
 [[-0.30379957 0.109998763 0.230347365 ... 0.213008478 -0.100003451 -0.244150743]
  [-0.19179742 -0.0341652408 0.393984377 ... 0.300181448 -0.380117238 -0.0264749527]]] 'SparseTensor(indices=[[0 0]
 [0 1]                                                                
 [0 2]
 ...                                                                                                                                                                                                                                                                                                                                                                                                                     
 [1 22]                   
 [1 23]                                                                                   
 [1 24]], values=[5 14 0 ... 8 9 10], shape=[2 27])'                                                                                                                                                                                                                                                                                                                                                                     
CudnnRNNBackwardOp        
ShouldUsePaddedIO time_major=1                                                            
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74                                                    
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                     
ShouldUsePaddedIO rv=false all_max_seq_length=true
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 1 | Loss: 188.119797                                                                                                                                                                                                                                                                                                                                              b
atch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
2020-07-22 16:06:33.612395: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_de
sc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 16:06:33.612438: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 

We verified a few things:

  • fine with tensorflow r1.14
  • breaks on r1.15 with cudnn 7.6, 7.5, more intermittent with 7.4 (I got severel non repro, reporting contributor still hit the issue)
  • repro with several versions of the nvidia driver (currently using 440.100 packaged by debian sid, contributor using similar version tested down to 430.64: mozilla/DeepSpeech#3088 (comment))
  • git bisected tensorflow several time to ensure the first commit really exposing the issue is #30889 that @kaixih pushed (confirmed by local revert and no repro at all over 25 runs)
  • forcing ShouldUsePaddedIO to return false would repro, forcing true would never repro in my case (not verified on big dataset from contributors)
  • tried changing the batch ordering when we feed / call ctc_loss but this would not help
  • a lot of other trial/experiments are to be found in the deepspeech bug report

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 22, 2020

Please do not hesitate to ping myself or @applied-machinelearning if you have troubles reproducing the issue, but with the provided material it should be straightforward.

@ravikyram ravikyram added comp:gpu TF 1.15 labels Jul 23, 2020
@ravikyram ravikyram assigned ymodak and unassigned ravikyram Jul 23, 2020
@ymodak ymodak assigned sanjoy and unassigned ymodak Jul 23, 2020
@sanjoy
Copy link
Contributor

@sanjoy sanjoy commented Jul 23, 2020

@timshen91 can you PTAL?

@kaixih
Copy link
Contributor

@kaixih kaixih commented Jul 23, 2020

Can you try to fetch the cudnn logs with the following env vars? And attach the somefile.log (If this is too large, we might only need the last part which contains the cudnnRNNForwardTrainingEx). @lissyx

export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=somefile.log

More details: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 23, 2020

issue3088.log
I might not be able to react before tomorrow, it's getting later over there.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 23, 2020

More details: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging

I swear I search a long time for a CUDNN-related debug env variable but never found that :/

@kaixih
Copy link
Contributor

@kaixih kaixih commented Jul 23, 2020

I! CuDNN (v7600) function cudnnRNNForwardTrainingEx() called:
...
i!         paddingMode: type=cudnnRNNPaddingMode_t; val=CUDNN_RNN_PADDED_IO_DISABLED (0);
i!         plan: type=cudnnPersistentRNNPlan_t; val=NULL_PTR;
i!     xDesc: type=cudnnRNNDataDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         dimA: type=int; val=[75,2,2048];
i!         seqLengthArray: type=int; val=[74,75];
i!         layout: type=cudnnRNNDataLayout_t; val=CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED (0);
i!         paddingFill: type=CUDNN_DATA_FLOAT; val=0.000000;
...

An initial investigation shows the padding mode is suspicious. We shouldn't see CUDNN_RNN_PADDED_IO_DISABLED here, which is incompatible with the CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED below.

This case uses the variable sequence lengths [74, 75] and max seq length is 75. The time/seq major layout is used. From the logic here: https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/cudnn_rnn_ops.cc#L944-L956, the use_padded_io should be set True and subsequently the CUDNN_RNN_PADDED_IO_ENABLED should have been used here https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/stream_executor/cuda/cuda_dnn.cc#L1082. I checked how the use_padded_io gets passed in the TF codebase and it looks no issue to me.

@kaixih
Copy link
Contributor

@kaixih kaixih commented Jul 24, 2020

Just one more thing: can you give a shot with TF_CUDNN_RESET_RND_GEN_STATE=1? The only thing that I suspect now is this branch might not be taken.

if (rnn_state.rnn_desc == nullptr || ResetRndGenState()) {

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 24, 2020

Wait, I might have made a mistake when generating those logs and left a false being forced for use_padded_io (I should maybe not have done that being sleepy). I'm cross-checking and will provide newer logs.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 24, 2020

Here is the log, with debug enabled, after ensuring ShouldUsePaddedIO does not hardcode a false:
issue3088_new_noResetRnnGenState.log
As you can see, this is still failing the same way.

Now, here are the logs with the same TensorFlow setup, and TF_CUDNN_RESET_RND_GEN_STATE=1
issue3088_new_withResetRnnGenState_0.log
issue3088_new_withResetRnnGenState_1.log
issue3088_new_withResetRnnGenState_2.log
issue3088_new_withResetRnnGenState_3.log
issue3088_new_withResetRnnGenState_4.log
As you can see, no repro at all now.

This does makes me wonder if we might not be in trouble because of how we use CudnnLSTM: https://github.com/mozilla/DeepSpeech/blob/9e023660ef1c0947b0f8c8db54b5cc9174c7076a/training/deepspeech_training/train.py#L108-L132
Context, as much as I can remember (I'm not the one who wrote that) is that we needed to handle variables properly for being able to reload checkpoint on non-CuDNN setup. Maybe @reuben can elaborate if that's required.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 24, 2020

This does makes me wonder if we might not be in trouble because of how we use CudnnLSTM: https://github.com/mozilla/DeepSpeech/blob/9e023660ef1c0947b0f8c8db54b5cc9174c7076a/training/deepspeech_training/train.py#L108-L132
Context, as much as I can remember (I'm not the one who wrote that) is that we needed to handle variables properly for being able to reload checkpoint on non-CuDNN setup. Maybe @reuben can elaborate if that's required.

So I verified, changed our code to avoid this like that:

diff --git a/training/deepspeech_training/train.py b/training/deepspeech_training/train.py
index 93d0c727..85c581ae 100644
--- a/training/deepspeech_training/train.py
+++ b/training/deepspeech_training/train.py
@@ -115,22 +115,19 @@ def rnn_impl_cudnn_rnn(x, seq_length, previous_state, _):
     # reuse=True to reuse variables, we can't easily make use of the object oriented
     # way CudnnLSTM is implemented, so we save a singleton instance in the function,
     # emulating a static function variable.
-    if not rnn_impl_cudnn_rnn.cell:
+    with tfv1.variable_scope('cudnn_lstm/rnn/multi_rnn_cell'):
         # Forward direction cell:
         fw_cell = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=1,
                                                  num_units=Config.n_cell_dim,
                                                  input_mode='linear_input',
                                                  direction='unidirectional',
                                                  dtype=tf.float32)
-        rnn_impl_cudnn_rnn.cell = fw_cell
 
-    output, output_state = rnn_impl_cudnn_rnn.cell(inputs=x,
-                                                   sequence_lengths=seq_length)
+        output, output_state = fw_cell(inputs=x,
+                                       sequence_lengths=seq_length)
 
     return output, output_state
 
-rnn_impl_cudnn_rnn.cell = None
-
 
 def rnn_impl_static_rnn(x, seq_length, previous_state, reuse):
     with tfv1.variable_scope('cudnn_lstm/rnn/multi_rnn_cell'):

Unfortunately, the issue does still repro.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 27, 2020

So @kaixih what's your take with those new logs and the extra verification I performed on our code ? Don't hesitate if you need more logging, I'll do it as quickly as I can :)

@kaixih
Copy link
Contributor

@kaixih kaixih commented Jul 27, 2020

So, for now two things can be verified:
(1) If the TF_CUDNN_RESET_RND_GEN_STATE=1, there is no repro, correct?
(2) If the TF_CUDNN_RESET_RND_GEN_STATE=0 (default setting), the use_padded_io is True, correct? If yes, can you track it and see if this branch is taken in your training:

?

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

(1) If the TF_CUDNN_RESET_RND_GEN_STATE=1, there is no repro, correct?

Yes, that is for sure.

(2) If the TF_CUDNN_RESET_RND_GEN_STATE=0 (default setting), the use_padded_io is True

Should I check that in the cudnn logs or at tensorflow framework level (i.e., return value of ShouldUsePaddedIO) ?

If yes, can you track it and see if this branch is taken in your training:

I will verify that.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

(2) If the TF_CUDNN_RESET_RND_GEN_STATE=0 (default setting), the use_padded_io is True, correct? If yes, can you track it and see if this branch is taken in your training:

It is indeed weird: so far, it looks like we do return true from ShouldUsePaddedIO but when we reach the code you link, the value is false there, and so we don't set the proper value for CUDNN side.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

So, it would looks like we are calling GetCachedRnnDescriptor() https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/cudnn_rnn_ops.cc#L1077-L1096 with proper use_padded_io=true, but the cache returns a hit and the matching rnn_desc we get is with CUDNN_RNN_PADDED_IO_DISABLED.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                                                                                                                                                                                                                                                                                2
020-07-28 13:51:55.003459: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
generate_values <deepspeech_training.util.sample_collections.CSV object at 0x7fce547bcc10>
yield generate_values 0 csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7fd2167e4090>
yield generate_values 1 csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7fd2167d8fd0>
yield generate_values 2 csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7fd2167d8cd0>
yield generate_values 3 csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7fce547bcc50>
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
yield generate_values 4 csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7fd2167d8f90>
yield generate_values 5 csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7fd2167d8e50>
ShouldUsePaddedIO time_major=1                                                                    
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=74                                                                  
ShouldUsePaddedIO [0]: seq_array[i]=74                    
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO rv=false all_max_seq_length=true
CudnnRNNForwardOp use_padded_io=0                                                                                                                                                                           
CudnnRNNForwardOp continue use_padded_io=0  
CudnnRNNForwardOp GetCachedRnnDescriptor use_padded_io=0                                                                                                                                                    
GetCachedRnnDescriptor call with use_padded_io=0
--> GetCachedRnnDescriptor[num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 74, 2, 2048]                            
--> GetCachedRnnDescriptor cache miss use_padded_io=0
--> GetCachedRnnDescriptor call CreateRnnDescriptor use_padded_io=0              
-->  CreateRnnDescriptor use_padded_io=0                                                                                                                                                                                                                                                                                                                                                                                      
static stream_executor::port::StatusOr<stream_executor::gpu::CudnnRnnDescriptor> stream_executor::gpu::CudnnRnnDescriptor::Create(const stream_executor::gpu::{anonymous}::CudnnHandle&, int, int, int, int, int, cudnnRNNInputMode_t, cudnnDirectionMode_t, cudnnRNNMode_t, cudnnDataType_t, cudnnDataType_t, const stream_executor::dnn::AlgorithmConfig&, float, tensorflow::uint64, stream_executor::ScratchAllocator
*, bool) IN use_padded_io=0
static stream_executor::port::StatusOr<stream_executor::gpu::CudnnRnnDescriptor> stream_executor::gpu::CudnnRnnDescriptor::Create(const stream_executor::gpu::{anonymous}::CudnnHandle&, int, int, int, int, int, cudnnRNNInputMode_t, cudnnDirectionMode_t, cudnnRNNMode_t, cudnnDataType_t, cudnnDataType_t, const stream_executor::dnn::AlgorithmConfig&, float, tensorflow::uint64, stream_executor::ScratchAllocator
*, bool) Create use_padded_io=0                                                                                                                                                                                                                                                                                                                                                                                          
static stream_executor::port::StatusOr<stream_executor::gpu::CudnnRnnDescriptor> stream_executor::gpu::CudnnRnnDescriptor::Create(const stream_executor::gpu::{anonymous}::CudnnHandle&, int, int, int, int, int, cudnnRNNInputMode_t, cudnnDirectionMode_t, cudnnRNNMode_t, cudnnDataType_t, cudnnDataType_t, const stream_executor::dnn::AlgorithmConfig&, float, tensorflow::uint64, stream_executor::ScratchAllocator
*, bool) Set_v6 use_padded_io=0
static stream_executor::port::StatusOr<stream_executor::gpu::CudnnRnnDescriptor> stream_executor::gpu::CudnnRnnDescriptor::Create(const stream_executor::gpu::{anonymous}::CudnnHandle&, int, int, int, int, int, cudnnRNNInputMode_t, cudnnDirectionMode_t, cudnnRNNMode_t, cudnnDataType_t, cudnnDataType_t, const stream_executor::dnn::AlgorithmConfig&, float, tensorflow::uint64, stream_executor::ScratchAllocator
*, bool) checking use_padded_io=0                                                      
ShouldUsePaddedIO time_major=1                 
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=74                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO rv=false all_max_seq_length=true
CudnnRNNBackwardOp use_padded_io=0                                                                
GetCachedRnnDescriptor call with use_padded_io=0
--> GetCachedRnnDescriptor[num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 74, 2, 2048] 
--> GetCachedRnnDescriptor cache miss use_padded_io=0                                            
--> GetCachedRnnDescriptor call CreateRnnDescriptor use_padded_io=0                                                                                                                                             
--> CreateRnnDescriptor use_padded_io=0                                                                                                                                                                         
static stream_executor::port::StatusOr<stream_executor::gpu::CudnnRnnDescriptor> stream_executor::gpu::CudnnRnnDescriptor::Create(const stream_executor::gpu::{anonymous}::CudnnHandle&, int, int, int, int, int, cudnnRNNInputMode_t, cudnnDirectionMode_t, cudnnRNNMode_t, cudnnDataType_t, cudnnDataType_t, const stream_executor::dnn::AlgorithmConfig&, float, tensorflow::uint64, stream_executor::ScratchAllocator
*, bool) IN use_padded_io=0                                                                  
static stream_executor::port::StatusOr<stream_executor::gpu::CudnnRnnDescriptor> stream_executor::gpu::CudnnRnnDescriptor::Create(const stream_executor::gpu::{anonymous}::CudnnHandle&, int, int, int, int, int, cudnnRNNInputMode_t, cudnnDirectionMode_t, cudnnRNNMode_t, cudnnDataType_t, cudnnDataType_t, const stream_executor::dnn::AlgorithmConfig&, float, tensorflow::uint64, stream_executor::ScratchAllocator
*, bool) Create use_padded_io=0                                                
static stream_executor::port::StatusOr<stream_executor::gpu::CudnnRnnDescriptor> stream_executor::gpu::CudnnRnnDescriptor::Create(const stream_executor::gpu::{anonymous}::CudnnHandle&, int, int, int, int, int, cudnnRNNInputMode_t, cudnnDirectionMode_t, cudnnRNNMode_t, cudnnDataType_t, cudnnDataType_t, const stream_executor::dnn::AlgorithmConfig&, float, tensorflow::uint64, stream_executor::ScratchAllocator
*, bool) Set_v6 use_padded_io=0 
static stream_executor::port::StatusOr<stream_executor::gpu::CudnnRnnDescriptor> stream_executor::gpu::CudnnRnnDescriptor::Create(const stream_executor::gpu::{anonymous}::CudnnHandle&, int, int, int, int, int, cudnnRNNInputMode_t, cudnnDirectionMode_t, cudnnRNNMode_t, cudnnDataType_t, cudnnDataType_t, const stream_executor::dnn::AlgorithmConfig&, float, tensorflow::uint64, stream_executor::ScratchAllocator
*, bool) checking use_padded_io=0                                 
Epoch 0 |   Training | Elapsed Time: 0:00:04 | Steps: 1 | Loss: 188.119797                                                                                                                                                                                                                                                                                                                                              b
atch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
ShouldUsePaddedIO time_major=1                                                                                                                                                                              
ShouldUsePaddedIO seq_array[0]=74                  
ShouldUsePaddedIO seq_array[1]=75                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO rv=true all_max_seq_length=false
CudnnRNNForwardOp use_padded_io=1                                                                                                                                                                           
CudnnRNNForwardOp continue use_padded_io=1
CudnnRNNForwardOp GetCachedRnnDescriptor use_padded_io=1                                                                                                                                                    
GetCachedRnnDescriptor call with use_padded_io=1
--> GetCachedRnnDescriptor[num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]                                                   
--> GetCachedRnnDescriptor cache hit use_padded_io=1                             
2020-07-28 13:51:59.388865: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED                                                                                                          
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1783): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_de
sc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-28 13:51:59.388909: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1537 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

@kaixih So, after checking, I'm wondering if there's not something missing for creating the cache key:
https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/cudnn_rnn_ops.cc#L1083

We use model_shapes there, which should NOT allow cache hit:

GetCachedRnnDescriptor[num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 74, 2, 2048]

is obviously different from:

GetCachedRnnDescriptor[num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]

But according to the log above, we still had a cache hit on this.

Checking the definition of CudnnRnnModelShapes https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/cudnn_rnn_ops.cc#L494-L520, it is my understanding that IsCompatibleWith is used to compare the model shape: https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/cudnn_rnn_ops.cc#L507-L512

One can see that this method does not take into account max_seq_length, which would explain why we get a cache hit with padded disabled when we had it enabled.

At least, with that patch, I can't repro anymore the issue, and I can see cache misses and cache hits:

diff --git a/tensorflow/core/kernels/cudnn_rnn_ops.cc b/tensorflow/core/kernels/cudnn_rnn_ops.cc
index 4a27394f28..9098b97462 100644
--- a/tensorflow/core/kernels/cudnn_rnn_ops.cc
+++ b/tensorflow/core/kernels/cudnn_rnn_ops.cc
@@ -508,7 +508,7 @@ struct CudnnRnnModelShapes {
   bool IsCompatibleWith(const CudnnRnnModelShapes& rhs) const {
     return num_layers == rhs.num_layers && input_size == rhs.input_size &&
            num_units == rhs.num_units && dir_count == rhs.dir_count &&
-           cell_num_units == rhs.cell_num_units;
+           cell_num_units == rhs.cell_num_units && max_seq_length == rhs.max_seq_length;
   }
   string DebugString() const {
     return strings::Printf(

I'm unsure if this is correct, maybe we should just include max_seq_length into the hash itself?
Looking at CudnnRnnConfigHasher defined at https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/cudnn_rnn_ops.cc#L522-L550, it computes a hash from the model's shape https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/cudnn_rnn_ops.cc#L531-L533 which is directly used for computing the key.

This patch also works (and honestly, it does feel nicer):

diff --git a/tensorflow/core/kernels/cudnn_rnn_ops.cc b/tensorflow/core/kernels/cudnn_rnn_ops.cc
index 4a27394f28..db7ff98df2 100644
--- a/tensorflow/core/kernels/cudnn_rnn_ops.cc
+++ b/tensorflow/core/kernels/cudnn_rnn_ops.cc
@@ -530,7 +530,7 @@ struct CudnnRnnConfigHasher {
 
     uint64 hash =
         HashList({shapes.num_layers, shapes.input_size, shapes.num_units,
-                  shapes.dir_count, shapes.batch_size});
+                  shapes.dir_count, shapes.max_seq_length, shapes.batch_size});
     if (algo_desc.has_value()) {
       hash = Hash64Combine(hash, algo_desc->hash());
     }

What's your take @kaixih on this? Does that looks like the root cause of the problem to you?

@applied-machinelearning

@lissyx
Looks very plausible to be the root cause to me!
From a code point of view and it also fits with and explains all the patterns we saw while testing and debugging this issue. The same code seems also still present in TF2.x and master, which correlates with all the other reports you found about LSTM with CUDA/CUDNN being unstable while training and are likely related.
Great catch !

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

Great catch !

Thanks, no offense but I'll claim victory once I get feedback from @kaixih :)

@kaixih
Copy link
Contributor

@kaixih kaixih commented Jul 28, 2020

@lissyx Thanks, I agree with you that this should be the root cause. (This also reminds me that I probably forgot to add corresponding changes to the cached RNN desc searching back when adding the variable sequence length feature for cuDNN).

For your two patches, I prefer to change both or only the first one. The first one about the IsCompatibleWith used here is mainly for the probing in the hash table, but might be used elsewhere to compare the RNN states. And I think it makes sense to distinguish RNNs with different max_seq_length. The second one about the hash function is mainly for generating the hash key for the RNN states and your patch could give us a little faster searching performance (because we can skip the probing). So, I think the first one is probably a necessity or we simply change both.

For the next step, can you create a PR to fix this bug and maybe link to this issue? Thanks.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

For the next step, can you create a PR to fix this bug and maybe link to this issue? Thanks.

Sure!

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

I suspect there won't be a 1.15 dot release including that kind of change, unfortunately?

@kaixih
Copy link
Contributor

@kaixih kaixih commented Jul 28, 2020

Right, I think Google only fixes major issues for 1.15. Can you first create a PR against master? And later they might pick it for 1.15 if necessary? @sanjoy

lissyx added a commit to lissyx/tensorflow that referenced this issue Jul 28, 2020
@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 28, 2020

Right, I think Google only fixes major issues for 1.15. Can you first create a PR against master? And later they might pick it for 1.15 if necessary? @sanjoy

Yep, it's done: #41832

lissyx added a commit to lissyx/tensorflow that referenced this issue Jul 29, 2020
lissyx added a commit to lissyx/tensorflow that referenced this issue Jul 29, 2020
lissyx added a commit to lissyx/tensorflow that referenced this issue Jul 30, 2020
lissyx added a commit to lissyx/tensorflow that referenced this issue Jul 30, 2020
@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Jul 31, 2020

@kaixih So I have been running experiment on ~1000h of french training data, and setting TF_CUDNN_RESET_RND_GEN_STATE=1 would make one epoch of training about 20 seconds longer, ~23:30 minutes on my hardware from the ~23:10 minutes when using TF_CUDNN_RESET_RND_GEN_STATE=0

Does that sounds reasonnable to you? Are there any side-effects to using that env variable we should know about?

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Aug 7, 2020

@kaixih So I have been running experiment on ~1000h of french training data, and setting TF_CUDNN_RESET_RND_GEN_STATE=1 would make one epoch of training about 20 seconds longer, ~23:30 minutes on my hardware from the ~23:10 minutes when using TF_CUDNN_RESET_RND_GEN_STATE=0

Does that sounds reasonnable to you? Are there any side-effects to using that env variable we should know about?

@kaixih Gentle ping: we are close to release 1.0, and we'd like to know exactly what are the consequences of that flag so that we can either force it by default until a (potential 1.15.4 is released) or at least recommend people to use it?

Thanks!

@google-ml-butler
Copy link

@google-ml-butler google-ml-butler bot commented Aug 7, 2020

Are you satisfied with the resolution of your issue?
Yes
No

lissyx added a commit to lissyx/tensorflow that referenced this issue Aug 24, 2020
@applied-machinelearning
Copy link

@applied-machinelearning applied-machinelearning commented Sep 3, 2020

@kaixih
Is the fix from @lissyx fix actually enough ?
I have now rebuild tensorflow (1.15.3) with this fix applied and it now doesn't blow up in the training stage of DeepSpeech, but in the validation stage.

If i look at the code of:
Status GetCachedRnnDescriptor(
OpKernelContext* context,
const CudnnRnnModelShapes& model_shapes,
const RnnInputMode& input_mode,
const AlgorithmConfig& algo_config,
RnnStateCache* cache, RnnDescriptor** rnn_desc,
bool use_padded_io
)

  • Is it really enough to only check if the model_shapes match and not input_mode, algo_config and use_padded_io ?
  • Another question is if the whole caching is really worthwhile, since lissyx has reported on a test run he didn't find a stellar difference in training duration with or without the caching ?

mihaimaruseac added a commit that referenced this issue Sep 18, 2020
Fix #41630: include max_seq_length in cudnn descriptor cache key
@geetachavan1 geetachavan1 added this to To do in TensorFlow 2.4.0 via automation Sep 28, 2020
@geetachavan1 geetachavan1 moved this from To do to Done in TensorFlow 2.4.0 Sep 28, 2020
@gangaswamy
Copy link

@gangaswamy gangaswamy commented Nov 30, 2020

Just let you know guys that in my system I need to set TF_CUDNN_RESET_RND_GEN_STATE=1 flag to make it to work even after upgrading to 1.15.4

@soerengustenhoff
Copy link

@soerengustenhoff soerengustenhoff commented Dec 4, 2020

me aswell, it does not work without: "export TF_CUDNN_RESET_RND_GEN_STATE=1 " in the .sh file i run to fine-tune the model.

But it does work with it ! so i am a happy camper again.

perhaps the documentation should be updated again @lissyx ?

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Dec 4, 2020

me aswell, it does not work without: "export TF_CUDNN_RESET_RND_GEN_STATE=1 " in the .sh file i run to fine-tune the model.

But it does work with it ! so i am a happy camper again.

perhaps the documentation should be updated again @lissyx ?

No, it means there might be other cases triggering some issue. You are welcome to investigate the internals of TensorFlow / CUDNN.

@soerengustenhoff
Copy link

@soerengustenhoff soerengustenhoff commented Dec 4, 2020

I am in the same boat as you, insanely busy with work and my toddler.
i am writing my bachelor thesis now, and it is to be handed 4.th of January.

what i meant is that it might be a good idea to mention this tensorflow cudnn flag in the documentation.
but i guess that must folks will google the error and find these 3 threads.

i might look into this in February though.

@lissyx
Copy link
Contributor Author

@lissyx lissyx commented Dec 4, 2020

I am in the same boat as you, insanely busy with work and my toddler.
i am writing my bachelor thesis now, and it is to be handed 4.th of January.

Good luck, been there, done that, I know how you are right now.

what i meant is that it might be a good idea to mention this tensorflow cudnn flag in the documentation.

As you can see from the discussion, this flag is a workaround not a fix. If there's a documentation that might benefit from it, it would be deepspeech, not tensorflow as much as I can tell.

but i guess that must folks will google the error and find these 3 threads.

i might look into this in February though.

Please understand that tracking and fixing the root cause in TensorFlow to ensure DeepSpeech users are safe already took me weeks, specifically from the lack of reproductibility. That was when I was 100% of the time on the DeepSpeech project, which is not the case anymore, so, I'm sorry, but I really can't look into that.

chenyu-jiang pushed a commit to chenyu-jiang/tensorflow that referenced this issue Dec 7, 2020
chenyu-jiang added a commit to chenyu-jiang/tensorflow that referenced this issue Dec 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:gpu TF 1.15 type:bug
Projects
Development

No branches or pull requests

9 participants