"Not found: No algorithm worked!" error for convolutional layer during search #703
The RETURNN version is not printed correctly (RETURNN starting up, version 1.0.0+unknown). Can you fix that? I assume you are using the latest RETURNN? Please mention that in the initial description. Also explicitly mention the TF version; it looks like Python 3.8.0 with TF 2.3. Also mention the CUDA version; I assume CUDA 10.1 from the log.
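For reference, a minimal sketch of how the Python, TF and CUDA versions asked for above could be collected for a bug report. The helper name `collect_versions` is hypothetical and not part of RETURNN. It assumes that `tf.sysconfig.get_build_info()` is available in the installed TF and falls back gracefully when it is not.

```python
import platform


def collect_versions():
    """Collect version strings that are useful in a bug report.

    Hypothetical helper, not part of RETURNN or TF.
    """
    info = {"python": platform.python_version()}
    try:
        import tensorflow as tf
    except ImportError:
        # TF is not installed in this environment; report only Python.
        info["tensorflow"] = "not installed"
        return info
    info["tensorflow"] = tf.__version__
    # Assumption: get_build_info() exists in this TF release.
    # We guard the lookup so older releases do not crash here.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_build_info is not None:
        build = get_build_info()
        # CPU-only builds may not have these keys, hence .get().
        info["cuda"] = str(build.get("cuda_version", "unknown"))
        info["cudnn"] = str(build.get("cudnn_version", "unknown"))
    return info


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Pasting the printed lines into the issue description avoids having to guess the versions from the log.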
I will try to fix this on the weekend.
Some of these report that the GPU has too little memory, and that this causes such an error.
Yes, I will also do this over the weekend. If this and the other links don't help, I will also try to create a minimal config which reproduces the error.
I also get this on some of my setups, which were working before.
Maybe #704 is actually relevant. That bug has probably not been around for long, only since recently (I need to check since when exactly), although it is somewhat strange that it could cause such an error. In any case, I would also test with a smaller batch size.
I've tried reducing the batch size from 4000 to 500 for the search, with the same result.
Interesting. But one other thing: The model was trained with what RETURNN version exactly? Some version before 2ff056d from Wed Sep 22 (which introduced the bug)? Because if so, I am surprised that the model checkpoint loads properly at all. Without the bug, i.e. before 2ff056d or after a0320ea, the
I was able to reproduce the TF error in a simple example using RETURNN (without split dims): https://gist.github.com/albertz/21e00a500e41eb0c8d27a8519e763f0e
I actually tested the TF model checkpoint behavior, because I expected that this should throw some error, and it turns out that this is already buggy in TF. This may be what causes the problem here. I reported the problem here: tensorflow/tensorflow#52220
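Until TF reports such problems itself, one could check a checkpoint against the model by hand. The following is only a sketch under an assumption the thread does not spell out: that the problem shows up as a difference between the variable shapes stored in the checkpoint and the shapes the current model expects. The function `find_shape_mismatches` and the variable names in the demo are hypothetical.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Compare two name -> shape mappings.

    Returns a sorted list of (name, checkpoint_shape, model_shape) for every
    variable whose shapes differ. A variable that is missing on one side is
    reported with None for that side.
    """
    problems = []
    for name in sorted(set(checkpoint_shapes) | set(model_shapes)):
        ckpt = checkpoint_shapes.get(name)
        model = model_shapes.get(name)
        # Normalize lists and tuples so that [32] and (32,) compare equal.
        ckpt = tuple(ckpt) if ckpt is not None else None
        model = tuple(model) if model is not None else None
        if ckpt != model:
            problems.append((name, ckpt, model))
    return problems


if __name__ == "__main__":
    # Made-up shapes for illustration. With TF installed, the checkpoint side
    # could be built from dict(tf.train.list_variables(checkpoint_path)),
    # which yields (name, shape) pairs.
    ckpt_shapes = {"conv0/W": [3, 3, 1, 32], "conv0/bias": [32]}
    model_shapes = {"conv0/W": [3, 3, 32, 1], "conv0/bias": [32]}
    for name, ckpt, model in find_shape_mismatches(ckpt_shapes, model_shapes):
        print(f"{name}: checkpoint {ckpt} vs. model {model}")
```

Running such a check before search would point at the affected variable directly, instead of a misleading error from a later op.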
I could now also reproduce the convolution error in pure TF. It's correct that it throws an exception, but the error message is very misleading. I reported that as a bug here: tensorflow/tensorflow#52223
The model was trained with a RETURNN version from before Sep 22. I will try to train one epoch with the current version of RETURNN and then try to do the recognition on that model.
So this actually worked!
But then you don't need to retrain. Any version before Sep 22 was fine. There should be no difference to current master.
I just checked and the config I previously trained on (where the error occurs) imported
No, I don't think so. If there had been any issue with that, you would have gotten some exception.
Does it also work for you if you retrain the model with a current RETURNN version and then do the recognition on this one? Or do you still get the same error then?
So, on the RETURNN side, this was #704, and this is fixed now. On the TF side, there are still the two outstanding issues (see my comments above), and we need to wait for fixes for them. However, they are not critical. They just would have made it easier to detect and understand the problem.
My config looks as follows: https://gist.github.com/robin-p-schmitt/fe9880b8ff3cd1a4c7201626776fdaab. The relevant part seems to be this:
It works for both train and search mode when I am using /u/merboldt/setups/2020-01-08--rnnt-rna/crnn as my RETURNN version. However, when I am using the most recent RETURNN version (/u/schmitt/src/returnn), only train mode seems to work. When I try to run search with the current RETURNN version, I get the following error: https://gist.github.com/robin-p-schmitt/5f89cd01ed4c5b74d9aebee256a4707c.
The main error seems to be:
The log for the relevant layer is:
I am using Python 3.8.0 with TF 2.3 and CUDA 10.1.