"Not found: No algorithm worked!" error for convolutional layer during search #703

Closed
robin-p-schmitt opened this issue Oct 1, 2021 · 23 comments

Comments

@robin-p-schmitt
Contributor

robin-p-schmitt commented Oct 1, 2021

My config looks as follows: https://gist.github.com/robin-p-schmitt/fe9880b8ff3cd1a4c7201626776fdaab. The relevant part seems to be this:

"source": {"class": "eval", "eval": <...specaug...>},
"source0": {"class": "split_dims", "axis": "F", "dims": (-1, 1), "from": "source"},  # (T,40,1)

"conv0": {
  "class": "conv", "from": "source0",
  "padding": "same", "filter_size": (3, 3),
  "n_out": 32, "activation": None, "with_bias": True},  # (T,40,32)

It works in both train and search mode when I am using /u/merboldt/setups/2020-01-08--rnnt-rna/crnn as my RETURNN version. However, when I am using the most recent RETURNN version (/u/schmitt/src/returnn), only train mode seems to work. When I try to run search with the current RETURNN version, I get the following error: https://gist.github.com/robin-p-schmitt/5f89cd01ed4c5b74d9aebee256a4707c.

The main error seems to be:

2021-09-30 12:52:25.768174: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:1115 : Not found: No algorithm worked!
TensorFlow exception: 2 root error(s) found.
  (0) Not found: No algorithm worked!
         [[node conv0/convolution (defined at u/schmitt/src/returnn/returnn/tf/layers/basic.py:4061) ]]
         [[output/rec/while/Switch_14/_553]]
  (1) Not found: No algorithm worked!
         [[node conv0/convolution (defined at u/schmitt/src/returnn/returnn/tf/layers/basic.py:4061) ]]
0 successful operations.

The log for the relevant layer is:

layer root/'source' output: Data{'source_output', [B,T|'time'[B],F|F'feature:data'(40)]}
layer root/'source0' output: Data{'source0_output', [B,T|'time'[B],F|F'feature:data'(40),'source0_split_dims1'(1)]}
layer root/'conv0' output: Data{'conv0_output', [B,T|'time'[B],'source0_split_dims1'(1),F|F'conv0:channel'(32)]}
layer root/'conv0p' output: Data{'conv0p_output', [B,T|'time'[B],'conv0p:pool:s1'(1),F|F'conv0:channel'(32)]}

I am using Python 3.8.0 with TF 2.3 and CUDA 10.1.


@albertz
Member

albertz commented Oct 1, 2021

The RETURNN version is not printed correctly (RETURNN starting up, version 1.0.0+unknown). Can you fix that? I assume it is the latest RETURNN? Please mention that in the initial description.

Also explicitly mention the TF version. It looks like Python 3.8.0 with TF 2.3.

Also mention the CUDA version. I assume CUDA 10.1 from the log.

@albertz
Member

albertz commented Oct 1, 2021

Oh fuck, I just noticed another bug: SplitDimsLayer (source0 here) marks the wrong feature dim axis. Thus ConvLayer gets the wrong shape. (That's why explicitly specifying the shape via out_shape or so, as in #706, would be helpful...) I reported this in #704. It's probably unrelated.

@robin-p-schmitt
Contributor Author

The RETURNN version is not printed correctly (RETURNN starting up, version 1.0.0+unknown). Can you fix that? I assume it is the latest RETURNN? Please mention that in the initial description.

I will try to fix this on the weekend.

@albertz
Member

albertz commented Oct 1, 2021

Some of these report that the GPU has too little memory and that this causes such an error.
Can you try with a smaller batch size?
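
For reference, a minimal sketch of how that could look in a RETURNN config (the values are arbitrary examples, not recommendations):

# RETURNN config snippet (sketch): lower the max number of frames per batch.
batch_size = 500
# Optionally also cap the number of sequences per batch.
max_seqs = 50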

@robin-p-schmitt
Contributor Author

Some of these report that the GPU has too little memory and that this causes such an error. Can you try with a smaller batch size?

Yes, I will also do this over the weekend. If this and the other links don't help, I will also try to create a minimal config which reproduces the error.

albertz added a commit that referenced this issue Oct 1, 2021
WARNING: This potentially changed the behavior of configs, e.g. the one reported in #703, and most of the attention and transducer encoders which used initial convolutional layers.

However, this bug probably did not exist for a long time.

Fix #704.
@tbscode
Contributor

tbscode commented Oct 1, 2021

I also get this on some of my setups, which were working before.
Only the ones using conv layers, and only in recog.

@albertz
Member

albertz commented Oct 1, 2021

Maybe #704 is actually relevant. This bug has probably not existed for too long, only since recently (I need to check since when exactly), although it is somewhat strange that it could cause such an error.

In any case, I would also test with a smaller batch size.

@albertz
Member

albertz commented Oct 1, 2021

I have now fixed #704 via #705. So also try again with current master (and the original batch size).

But in any case, also test with a smaller batch size (before testing with current master). I want to see if that alone already solves it.

@tbscode
Contributor

tbscode commented Oct 1, 2021

I've tried reducing the batch size from 4000 to 500 for the search: same Not found: No algorithm worked! errors with the old commit.
Search works with batch size 4000 on current master.

@albertz
Member

albertz commented Oct 1, 2021

Interesting.

But one other thing: The model was trained with which RETURNN version exactly? Some version before 2ff056d from Wed Sep 22 (which introduced the bug)?

Because if so, I wonder how the model checkpoint loads properly at all.

Without the bug, i.e. before 2ff056d or after a0320ea, the conv0 layer should have a kernel of shape list(filter_size) + [n_in, n_out], i.e. [3, 3, 1, 32].
With the bug, i.e. within that range of commits, the kernel shape is [3, 3, 40, 32].
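
To make the difference concrete, a small plain-Python sketch using the numbers from the config in the initial report:

# Kernel shape is list(filter_size) + [n_in, n_out].
filter_size = (3, 3)
n_out = 32
n_in_correct = 1    # without the bug: the split-off channel dim of source0
n_in_buggy = 40     # with the bug: the 40-dim feature axis is taken as the channel dim
print(list(filter_size) + [n_in_correct, n_out])  # [3, 3, 1, 32]
print(list(filter_size) + [n_in_buggy, n_out])    # [3, 3, 40, 32]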

@albertz
Member

albertz commented Oct 1, 2021

I was able to reproduce the TF error in a simple example using RETURNN (without split dims): https://gist.github.com/albertz/21e00a500e41eb0c8d27a8519e763f0e

@albertz
Member

albertz commented Oct 1, 2021

I actually tested the TF model checkpoint behavior, because I expected that this should throw some error, and it turns out that this is already buggy in TF. This maybe causes the problem here. I reported it here: tensorflow/tensorflow#52220
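
For illustration, a rough sketch of the kind of check being described (not the actual reproduction from the linked TF issue; the path and variable names are made up): save a checkpoint where the conv kernel has one shape, then restore it into a graph where the same variable has a different shape. One would expect this to fail loudly with a shape mismatch.

import tensorflow as tf

# Save a checkpoint with a [3, 3, 1, 32] kernel.
g1 = tf.Graph()
with g1.as_default():
    tf.compat.v1.get_variable("conv0/W", shape=[3, 3, 1, 32])
    saver = tf.compat.v1.train.Saver()
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        saver.save(sess, "/tmp/ckpt-shape-test")

# Restore it into a graph where the same variable has shape [3, 3, 40, 32].
g2 = tf.Graph()
with g2.as_default():
    tf.compat.v1.get_variable("conv0/W", shape=[3, 3, 40, 32])
    saver = tf.compat.v1.train.Saver()
    with tf.compat.v1.Session() as sess:
        saver.restore(sess, "/tmp/ckpt-shape-test")  # expected: a clear shape mismatch error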

@albertz
Member

albertz commented Oct 1, 2021

Now I could also reproduce the convolution error in pure TF. It is correct that it throws an exception, but the error message is very misleading. I reported that as a bug here: tensorflow/tensorflow#52223
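
For context, roughly the kind of mismatched convolution involved (a sketch, not the actual reproduction from the linked TF issue): the kernel expects a different number of input channels than the input actually has, so an error is correct, but per the comment above the message it produces is very misleading.

import tensorflow as tf

x = tf.random.normal([1, 100, 40, 1])   # NHWC input with 1 channel
w = tf.random.normal([3, 3, 40, 32])    # kernel expecting 40 input channels
# Inconsistent shapes: this should fail, but the error message can be misleading.
y = tf.nn.conv2d(x, w, strides=1, padding="SAME")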

@robin-p-schmitt
Contributor Author

But one other thing: The model was trained with which RETURNN version exactly? Some version before 2ff056d from Wed Sep 22 (which introduced the bug)?

The model was trained with a RETURNN version from before Sep 22. I will try to train one epoch with the current version of RETURNN and then try to do the recognition on that model.

@robin-p-schmitt
Contributor Author

I will try to train one epoch with the current version of RETURNN and then try to do the recognition on that model.

So this actually worked!

@albertz
Member

albertz commented Oct 2, 2021

The model was trained with a RETURNN version from before Sep 22. I will try to train one epoch with the current version of RETURNN and then try to do the recognition on that model.

But then you don't need to retrain. Any version before Sep 22 was fine. There should be no difference from current master.

@robin-p-schmitt
Contributor Author

robin-p-schmitt commented Oct 2, 2021

But then you don't need to retrain. Any version before Sep 22 was fine. There should be no difference from current master.

I just checked, and the config I previously trained on (where the error occurs) imported tf as from returnn.tf.compat import v1 as tf instead of import tensorflow as tf. Could this also have something to do with it? I am not totally sure about the RETURNN version I used there, but I think it was one from before Sep 22.
In any case, the recognition is working for me now if I train on https://gist.github.com/robin-p-schmitt/fe9880b8ff3cd1a4c7201626776fdaab using the RETURNN repo at commit ea3e91f.

@albertz
Member

albertz commented Oct 4, 2021

I just checked, and the config I previously trained on (where the error occurs) imported tf as from returnn.tf.compat import v1 as tf instead of import tensorflow as tf. Could this also have something to do with it?

No, I don't think so. If there had been any issue with that, you would have gotten some exception.

@robin-p-schmitt
Contributor Author

I also get this on some of my setups, which were working before. Only the ones using conv layers, and only in recog.

Does it also work for you if you retrain the model with a current RETURNN version and then do the recognition on this one? Or do you still get the same error then?

@albertz
Member

albertz commented Oct 18, 2021

So, on the RETURNN side, this was #704, which is fixed now.

On the TF side, there are still the two outstanding issues (see my comments above), and we need to wait for fixes there. However, they are not critical; fixing them would just have made it easier to detect and understand the problem.

@albertz albertz closed this as completed Oct 18, 2021