Recurrent layers fail on (at least some) Linuxes when use_bias=True, succeed on macOS #48195
Comments
I don't know what a Colab gist is. I put the minimal code necessary to reproduce the error in the gist I linked. It needs some supporting code; the code I pasted in the issue box was just to point out the relevant code the error occurs in, and I didn't post the whole thing to avoid taking up space. Edit: It's three files, or two if you don't use the environment.yml to set up the conda environment. You could easily merge two of them into one by copy+pasting if desired; I only split them to make it easier to see where the error was.
I fixed the code in your link by inlining the code from the micro-module: it runs successfully in the Colab gist, but it does NOT run on what appears to be x86_64 Ubuntu 18.04, and possibly other environments. I'm not surprised it runs on whatever the Google Compute backend is; since TF is developed and maintained by Google, that's expected, but the error triggers under other architectures. I searched numerous SO answers and the like when researching this, including the one you linked, to make sure there isn't another problem. The issue is that on some machines, as you can see, the code runs successfully, as it should since the shape is correct, but on certain architectures it does not. I've now tried two different Ubuntu 18.04 AWS instances (one with a GPU and one without) and both failed. I could also try running it locally via a Linux live installation if that's important, but it appears that to reproduce this, someone on the team will need to run it in an environment other than that gist. If you want full reproduction steps:
I can also attempt to see if it can be reproduced in a Docker container if that would be easier.
For those looking for a workaround, this worked for me when trying to load a serialized model: `tf.keras.backend.set_image_data_format("channels_last")`.
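A minimal sketch of applying that workaround before deserializing a model; the model path here is a placeholder, not something from the original comment:

```python
import tensorflow as tf

# Force channels-last so the bias add inside the recurrent layer is not
# built with the NCHW data format that appears in the reported error.
tf.keras.backend.set_image_data_format("channels_last")

# Placeholder path; point this at the actual serialized model.
model = tf.keras.models.load_model("path/to/saved_model")
```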
@zoxoi We see that you are using an older version of TensorFlow. Many bugs have been fixed in the latest version. We recommend that you upgrade to the latest stable version of TensorFlow, 2.6.0, and let us know if the issue still persists in newer versions. Thanks!
I'm not going to be able to test this for a bit, but it's on my to-do list. Just to stave off the inactivity autoclose.
Bumping this to keep it open - still seeing this issue in tensorflow==2.9.1. Same pattern - no issue on macOS, but as soon as I run on a SageMaker notebook instance using Amazon Linux 2, the issue happens. The suggestion from @Jack-Fawcett does prevent the error from happening, for anyone else looking for a workaround (`tf.keras.backend.set_image_data_format("channels_last")`).
cc'ing @qlzh727
Bumping this to keep it open - still seeing this issue in tensorflow==2.12 on a SageMaker notebook instance using Amazon Linux 2. As mentioned by @very-meanly: "The suggestion from @Jack-Fawcett does prevent the error from happening, for anyone else looking for a workaround (`tf.keras.backend.set_image_data_format("channels_last")`)".
System information
Creating a subclass model with an LSTM recurrent layer when `use_bias=True` (i.e. the default) fails. The code can be found in the following gist: https://gist.github.com/zoxoi/9c331f9df1fa22112e061638ea200ec8

Note that the model is the only important bit, as far as I know. The rest is a minor bit of supporting code just to trigger the issue, such as compiling the model and creating a dummy dataset generator. To save space, the full traceback is in the linked gist, but the ultimate error given is:

ValueError: Shape must be at least rank 3 but is rank 2 for '{{node BiasAdd}} = BiasAdd[T=DT_FLOAT, data_format="NCHW"](add, bias)' with input shapes: [?,256], [256].

A few facts I've noticed:

- The error only occurs in `model.fit()`. Explicitly calling `model(tf.zeros((1, 96, 100)))` works fine from my testing.
- The [256] seems to always be 4x the number of cells (e.g. for 64 it's 256, for 100 it's 400, etc.). I don't know if this is surprising or not.
- I've tried setting the `time_major` argument to true, both leaving the data the same and (more correctly) manually permuting/transposing the input into time-major order; no success.

For anybody else dealing with this: a workaround is to set `use_bias=False` when creating the LSTM. This has tradeoffs, however (beyond the potential mathematical issue of dropping the bias and the correction it allows): the layer won't use the cuDNN kernel when `use_bias` is off, which from my testing carries a fairly drastic speed penalty (it trains more slowly than on my laptop CPU, by about 40 ms/step).
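For illustration only, a minimal sketch of the failing pattern described above. The layer sizes and input shape (96 timesteps, 100 features, 64 units, giving a bias length of 4 × 64 = 256 that matches the error) are assumptions consistent with this thread, and the Dense head, loss, and dummy data are placeholders; the real model is in the linked gist. The 4x factor itself is expected, since Keras packs the biases of the four LSTM gates into a single vector of shape (4 * units,).

```python
import numpy as np
import tensorflow as tf

class MinimalModel(tf.keras.Model):
    """Subclass model with a single recurrent layer, mirroring the structure in the gist."""

    def __init__(self):
        super().__init__()
        # use_bias=True is the default; switching to use_bias=False is the
        # workaround mentioned above, at the cost of losing the cuDNN kernel.
        self.lstm = tf.keras.layers.LSTM(64, use_bias=True)
        self.head = tf.keras.layers.Dense(10)  # placeholder output layer

    def call(self, inputs):
        return self.head(self.lstm(inputs))

model = MinimalModel()
model.compile(optimizer="adam", loss="mse")

# A direct call works on every platform tested in this thread.
_ = model(tf.zeros((1, 96, 100)))

# model.fit() is where the BiasAdd rank error reportedly appears on the
# affected Linux environments (dummy arrays stand in for the generator).
x = np.zeros((8, 96, 100), dtype=np.float32)
y = np.zeros((8, 10), dtype=np.float32)
model.fit(x, y, epochs=1, verbose=0)
```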
If you want to test on the exact same setup I'm using, I'm experiencing the issue on the AWS Deep Learning AMI Version 42.1.
Further important things I haven't tested:
- If this is limited to LSTM or affects other recurrent layers. [Update: It affects GRU and SimpleRNN as well, updating title]
- If, even though it fails on CPU, it's perhaps related to CUDA 11 specifically somehow. [Also failed on a GPU-less Linux AMI]