"Not found: No algorithm worked!" error for convolutional layer during search #703

Closed
robin-p-schmitt opened this issue Oct 1, 2021 · 23 comments

Comments

@robin-p-schmitt
Contributor

robin-p-schmitt commented Oct 1, 2021

My config looks as follows: https://gist.github.com/robin-p-schmitt/fe9880b8ff3cd1a4c7201626776fdaab. The relevant part seems to be this:

"source": {"class": "eval", "eval": <...specaug...>},
"source0": {"class": "split_dims", "axis": "F", "dims": (-1, 1), "from": "source"},  # (T,40,1)

"conv0": {
  "class": "conv", "from": "source0",
  "padding": "same", "filter_size": (3, 3),
  "n_out": 32, "activation": None, "with_bias": True},  # (T,40,32)

It works in both train and search mode when I am using /u/merboldt/setups/2020-01-08--rnnt-rna/crnn as my RETURNN version. However, when I am using the most recent RETURNN version (/u/schmitt/src/returnn), only train mode seems to work. When I try to run search with the current RETURNN version, I get the following error: https://gist.github.com/robin-p-schmitt/5f89cd01ed4c5b74d9aebee256a4707c.

The main error seems to be:

2021-09-30 12:52:25.768174: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:1115 : Not found: No algorithm worked!
TensorFlow exception: 2 root error(s) found.
  (0) Not found: No algorithm worked!
         [[node conv0/convolution (defined at u/schmitt/src/returnn/returnn/tf/layers/basic.py:4061) ]]
         [[output/rec/while/Switch_14/_553]]
  (1) Not found: No algorithm worked!
         [[node conv0/convolution (defined at u/schmitt/src/returnn/returnn/tf/layers/basic.py:4061) ]]
0 successful operations.

The log for the relevant layer is:

layer root/'source' output: Data{'source_output', [B,T|'time'[B],F|F'feature:data'(40)]}
layer root/'source0' output: Data{'source0_output', [B,T|'time'[B],F|F'feature:data'(40),'source0_split_dims1'(1)]}
layer root/'conv0' output: Data{'conv0_output', [B,T|'time'[B],'source0_split_dims1'(1),F|F'conv0:channel'(32)]}
layer root/'conv0p' output: Data{'conv0p_output', [B,T|'time'[B],'conv0p:pool:s1'(1),F|F'conv0:channel'(32)]}

I am using Python 3.8.0 with TF 2.3 and CUDA 10.1.


@albertz
Member

albertz commented Oct 1, 2021

The RETURNN version is not printed correctly (RETURNN starting up, version 1.0.0+unknown). Can you fix that? I assume it is the latest RETURNN? Please mention that in the initial description.

Also explicitly mention the TF version. It looks like Python 3.8.0 with TF 2.3.

Also mention the CUDA version. I assume CUDA 10.1 from the log.

@albertz
Member

albertz commented Oct 1, 2021

Oh fuck, I just noticed another bug: SplitDimsLayer (source0 here) marks the wrong feature dim axis. Thus ConvLayer gets the wrong shape. (That's why explicitly specifying the shape via out_shape or so, as in #706, would be helpful...) I reported this in #704. It's probably unrelated.

@robin-p-schmitt
Contributor Author

The RETURNN version is not printed correctly (RETURNN starting up, version 1.0.0+unknown). Can you fix that? I assume it is the latest RETURNN? Please mention that in the initial description.

I will try to fix this on the weekend.

@albertz
Member

albertz commented Oct 1, 2021

Some of these report that the GPU has too little memory and that this causes such an error.
Can you try with a smaller batch size?
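
For reference, a minimal sketch of how that could look in a RETURNN config (the values are arbitrary examples, not recommendations):

# RETURNN config snippet (sketch): lower the max number of frames per batch.
batch_size = 500
# Optionally also cap the number of sequences per batch.
max_seqs = 50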

@robin-p-schmitt
Contributor Author

Some of these report that the GPU has too little memory and that this causes such an error. Can you try with a smaller batch size?

Yes, I will also do this over the weekend. If this and the other links don't help, I will also try to create a minimal config which reproduces the error.

albertz added a commit that referenced this issue Oct 1, 2021
WARNING: This potentially changed the behavior of configs, e.g. the one reported in #703, and most of the attention and transducer encoders which used initial convolutional layers.

However, this bug probably did not exist for a long time.

Fix #704.
@tbscode
Contributor

tbscode commented Oct 1, 2021

I also get this on some of my setups, which were working before.
Only the ones using conv layers, and only in recog.

@albertz
Member

albertz commented Oct 1, 2021

Maybe #704 is actually relevant. This bug has probably not existed for too long, only since recently (I need to check since when exactly), although it is somewhat strange that it could cause such an error.

In any case, I would also test with a smaller batch size.

@albertz
Member

albertz commented Oct 1, 2021

I have now fixed #704 via #705. So also try again with current master (and the original batch size).

But in any case, also test with a smaller batch size (before testing with current master). I want to see if that alone already solves it.

@tbscode
Contributor

tbscode commented Oct 1, 2021

I've tried reducing the batch size from 4000 to 500 for the search: same Not found: No algorithm worked! errors with the old commit.
Search works with batch size 4000 on current master.

@albertz
Member

albertz commented Oct 1, 2021

Interesting.

But one other thing: The model was trained with which RETURNN version exactly? Some version before 2ff056d from Wed Sep 22 (which introduced the bug)?

Because if so, I wonder how the model checkpoint loads properly at all.

Without the bug, i.e. before 2ff056d or after a0320ea, the conv0 layer should have a kernel of shape list(filter_size) + [n_in, n_out], i.e. [3, 3, 1, 32].
With the bug, i.e. within that range of commits, the kernel shape is [3, 3, 40, 32].
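
To make the difference concrete, a small plain-Python sketch using the numbers from the config in the initial report:

# Kernel shape is list(filter_size) + [n_in, n_out].
filter_size = (3, 3)
n_out = 32
n_in_correct = 1    # without the bug: the split-off channel dim of source0
n_in_buggy = 40     # with the bug: the 40-dim feature axis is taken as the channel dim
print(list(filter_size) + [n_in_correct, n_out])  # [3, 3, 1, 32]
print(list(filter_size) + [n_in_buggy, n_out])    # [3, 3, 40, 32]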

@albertz
Member

albertz commented Oct 1, 2021

I was able to reproduce the TF error in a simple example using RETURNN (without split dims): https://gist.github.com/albertz/21e00a500e41eb0c8d27a8519e763f0e

@albertz
Member

albertz commented Oct 1, 2021

I actually tested the TF model checkpoint behavior, because I expected that this should throw some error, and it turns out that this is already buggy in TF. This maybe causes the problem here. I reported it here: tensorflow/tensorflow#52220
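
For illustration, a rough sketch of the kind of check being described (not the actual reproduction from the linked TF issue; the path and variable names are made up): save a checkpoint where the conv kernel has one shape, then restore it into a graph where the same variable has a different shape. One would expect this to fail loudly with a shape mismatch.

import tensorflow as tf

# Save a checkpoint with a [3, 3, 1, 32] kernel.
g1 = tf.Graph()
with g1.as_default():
    tf.compat.v1.get_variable("conv0/W", shape=[3, 3, 1, 32])
    saver = tf.compat.v1.train.Saver()
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        saver.save(sess, "/tmp/ckpt-shape-test")

# Restore it into a graph where the same variable has shape [3, 3, 40, 32].
g2 = tf.Graph()
with g2.as_default():
    tf.compat.v1.get_variable("conv0/W", shape=[3, 3, 40, 32])
    saver = tf.compat.v1.train.Saver()
    with tf.compat.v1.Session() as sess:
        saver.restore(sess, "/tmp/ckpt-shape-test")  # expected: a clear shape mismatch error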

@albertz
Member

albertz commented Oct 1, 2021

Now I could also reproduce the convolution error in pure TF. It is correct that it throws an exception, but the error message is very misleading. I reported that as a bug here: tensorflow/tensorflow#52223
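
For context, roughly the kind of mismatched convolution involved (a sketch, not the actual reproduction from the linked TF issue): the kernel expects a different number of input channels than the input actually has, so an error is correct, but per the comment above the message it produces is very misleading.

import tensorflow as tf

x = tf.random.normal([1, 100, 40, 1])   # NHWC input with 1 channel
w = tf.random.normal([3, 3, 40, 32])    # kernel expecting 40 input channels
# Inconsistent shapes: this should fail, but the error message can be misleading.
y = tf.nn.conv2d(x, w, strides=1, padding="SAME")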

@robin-p-schmitt
Contributor Author

But one other thing: The model was trained with which RETURNN version exactly? Some version before 2ff056d from Wed Sep 22 (which introduced the bug)?

The model was trained with a RETURNN version from before Sep 22. I will try to train one epoch with the current version of RETURNN and then try to do the recognition on that model.

@robin-p-schmitt
Contributor Author

I will try to train one epoch with the current version of RETURNN and then try to do the recognition on that model.

So this actually worked!

@albertz
Member

albertz commented Oct 2, 2021

The model was trained with a RETURNN version from before Sep 22. I will try to train one epoch with the current version of RETURNN and then try to do the recognition on that model.

But then you don't need to retrain. Any version before Sep 22 was fine. There should be no difference from current master.

@robin-p-schmitt
Contributor Author

robin-p-schmitt commented Oct 2, 2021

But then you don't need to retrain. Any version before Sep 22 was fine. There should be no difference from current master.

I just checked, and the config I previously trained on (where the error occurs) imported tf as from returnn.tf.compat import v1 as tf instead of import tensorflow as tf. Could this also have something to do with it? I am not totally sure about the RETURNN version I used there, but I think it was one from before Sep 22.
In any case, the recognition is working for me now if I train on https://gist.github.com/robin-p-schmitt/fe9880b8ff3cd1a4c7201626776fdaab using the RETURNN repo at commit ea3e91f.

@albertz
Member

albertz commented Oct 4, 2021

I just checked, and the config I previously trained on (where the error occurs) imported tf as from returnn.tf.compat import v1 as tf instead of import tensorflow as tf. Could this also have something to do with it?

No, I don't think so. If there had been any issue with that, you would have gotten some exception.

@robin-p-schmitt
Contributor Author

I also get this on some of my setups, which were working before. Only the ones using conv layers, and only in recog.

Does it also work for you if you retrain the model with a current RETURNN version and then do the recognition on this one? Or do you still get the same error then?

@albertz
Member

albertz commented Oct 18, 2021

So, on the RETURNN side, this was #704, which is fixed now.

On the TF side, there are still the two outstanding issues (see my comments above), and we need to wait for fixes there. However, they are not critical; fixing them would just have made it easier to detect and understand the problem.

@albertz albertz closed this as completed Oct 18, 2021