
Recurrent layers fail on (at least some) Linuxes where use_bias=True, succeeds on Macos #48195

Open
zoxoi opened this issue Mar 30, 2021 · 10 comments
Assignees: ymodak
Labels: comp:keras (Keras related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:bug (Bug)

Comments

@zoxoi

zoxoi commented Mar 30, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Custom code
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 (AWS Deep Learning AMI)
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.3, 2.4.1, several 2.5 and 2.6-nightly samples
  • Python version: 3.6
  • CUDA/cuDNN version: 11
  • GPU model and memory: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]; BUT also occurs on CPU wheels

Creating a subclassed model with an LSTM recurrent layer fails when use_bias=True (i.e. the default). The code can be found in the following gist:

https://gist.github.com/zoxoi/9c331f9df1fa22112e061638ea200ec8

Note that the model

from tensorflow.keras import Model, layers

class FailModel(Model):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # The default use_bias=True is what triggers the failure on the
        # affected Linux setups.
        self.lstm = layers.LSTM(64, use_bias=True)

    def call(self, inputs, training=False):
        return self.lstm(inputs)

is the only important bit, as far as I know. The rest is a small amount of supporting code needed to trigger the issue, such as compiling the model and creating a dummy dataset generator. To save space, the full traceback is in the linked gist; the ultimate error is:

ValueError: Shape must be at least rank 3 but is rank 2 for '{{node BiasAdd}} = BiasAdd[T=DT_FLOAT, data_format="NCHW"](add, bias)' with input shapes: [?,256], [256].
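
For reference, the supporting code in the gist amounts roughly to the following sketch; the optimizer, loss, batch size, and random dummy data here are assumptions for illustration, using the (batch, timesteps=96, features=100) shape mentioned below, and FailModel is the class shown above.

import numpy as np

model = FailModel()
model.compile(optimizer="adam", loss="mse")  # assumed optimizer/loss

# Dummy data: (batch, timesteps=96, features=100) inputs, targets matching
# the 64-unit LSTM output.
x = np.random.random((32, 96, 100)).astype("float32")
y = np.random.random((32, 64)).astype("float32")

# On the affected Linux machines this call raises the BiasAdd rank error above.
model.fit(x, y, epochs=1, batch_size=8)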

A few facts I've noticed:

  • It doesn't seem to fail when calling the model directly; the failure is limited to things such as model.fit(). Explicitly calling model(tf.zeros((1, 96, 100))) works fine in my testing.
  • The bias size given in the error always seems to be 4x the number of units (e.g. for 64 it's 256, for 100 it's 400, etc.), which presumably is the combined bias for the four LSTM gates. I don't know if this is surprising or not.
  • Since it may be related to the NCHW format, I tried setting the time_major argument to True, both leaving the data as-is and (more correctly) manually permuting/transposing the input into time-major order; no success (see the sketch after this list).
  • This bug happens on multiple TF versions; 2.3 is the earliest I can test, and it persists up to the most recent 2.6 nightly.
  • This happens even if you force TF not to discover a GPU, or install the CPU-only wheel instead.
  • However, this does NOT happen on my macOS Catalina 10.15.7 laptop with an integrated GPU (i.e. the CPU-only build is installed automatically): the layer works as intended with no errors.
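
As a concrete illustration of the time-major attempt mentioned above, here is a rough sketch (the shapes are the same dummy shapes used elsewhere in this report):

import tensorflow as tf
from tensorflow.keras import layers

# time_major=True expects inputs shaped (time, batch, features) instead of
# the default (batch, time, features).
lstm = layers.LSTM(64, use_bias=True, time_major=True)

x = tf.zeros((1, 96, 100))                 # (batch, time, features)
x_time_major = tf.transpose(x, [1, 0, 2])  # (time, batch, features)
out = lstm(x_time_major)  # direct calls like this work; the failure shows up under fit()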

For anybody else dealing with this: a workaround is to set use_bias=False when creating the LSTM. This has trade-offs, however (beyond the modelling cost of dropping the bias term): with use_bias off the layer won't use the cuDNN kernel, which in my testing carries a fairly drastic speed penalty (training is about 40 ms/step slower than on my laptop's CPU). A minimal sketch of the workaround follows.
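
from tensorflow.keras import Model, layers

class NoBiasModel(Model):  # hypothetical name, for illustration only
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # use_bias=False sidesteps the BiasAdd error, but the layer then no
        # longer qualifies for the cuDNN kernel and trains much more slowly.
        self.lstm = layers.LSTM(64, use_bias=False)

    def call(self, inputs, training=False):
        return self.lstm(inputs)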

If you want to test on the exact same setup I'm using, I'm experiencing the issue on the AWS Deep Learning AMI Version 42.1.

Further important things I haven't tested:

  • Whether this is limited to LSTM or affects other recurrent layers. [Update: it affects GRU and SimpleRNN as well; updating the title]
  • Whether this affects Windows in either configuration.
  • Whether, even though it fails on CPU, it's somehow related to CUDA 11 specifically. [Update: it also failed on a GPU-less Linux AMI]
  • Whether this affects the functional API as well; I've only used subclassed models.
@zoxoi zoxoi added the type:bug Bug label Mar 30, 2021
@zoxoi zoxoi changed the title LSTM fails on (at least some) Linuxes where use_bias=True, succeeds on OS X LSTM fails on (at least some) Linuxes where use_bias=True, succeeds on Macos Mar 30, 2021
@zoxoi zoxoi changed the title LSTM fails on (at least some) Linuxes where use_bias=True, succeeds on Macos Recurrent layers fail on (at least some) Linuxes where use_bias=True, succeeds on Macos Mar 30, 2021
@Saduf2019
Contributor

@zoxoi
I ran the code you shared on TF 2.4 and got a different error; please find the gist here.
Could you please share a Colab gist reproducing the reported issue?

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Mar 31, 2021
@zoxoi
Author

zoxoi commented Mar 31, 2021

I don't know what a Colab gist is. I put the minimal code necessary to reproduce the error in the gist I linked. It needs some supporting code; the snippet I pasted in the issue was just to point out the relevant code where the error occurs, and I didn't post the whole thing to avoid taking up space.

Edit: It's three files, or two if you don't use the environment.yml to set up the conda environment. You could easily merge the two .py files into one by copy-pasting if desired; I only split them to make it easier to see where the error was.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Apr 2, 2021
@Saduf2019
Contributor

@zoxoi
I ran the code but still get a different error; please find the gist here.

As per the reported error, could you refer to this link and let us know if it helps: link

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Apr 6, 2021
@zoxoi
Author

zoxoi commented Apr 9, 2021

I fixed the code in your link by inlining the code from the micro-module:
https://colab.research.google.com/drive/13YOXgRIjQ-QtKCXP8EcfUJGer8vCER03#scrollTo=R4gBFl1dMz1-&line=10&uniqifier=1

It runs successfully in the Colab gist, but it does NOT run on what appears to be x86_64 Ubuntu 18.04, and possibly other environments. I'm not surprised it runs on whatever the Google Compute backend is (TF is developed and maintained by Google, so that's expected), but the error still triggers on other setups.

I searched numerous Stack Overflow answers when researching this, including the one you linked, to make sure there isn't another problem. The issue is that on some machines, as you can see, the code runs successfully (as it should, since the shape is correct), but it fails on certain setups. I've now tried two different Ubuntu 18.04 AWS instances (one with a GPU and one without) and both failed. I could also try running it locally via a Linux live installation if that's important, but it appears that to reproduce this, someone on the team will need to run it in an environment other than that Colab gist.

If you want full reproduction steps:

  1. Start with a clean Ubuntu 18.04 installation on an x86_64 architecture. I'm unsure whether AMD or Intel matters; the presence of a GPU and CUDA does NOT seem to matter.
  2. Install any core dependencies not already present on that installation (e.g. build-essential, etc.).
  3. Install conda or miniconda and create and activate an environment via the environment.yml above, or install Python 3 via your preferred method (for full reproduction: I was using 3.6 because that's what we're pinned to due to other dependencies) and manually install TensorFlow 2.3+.
  4. Either copy that Colab gist into one file, or download the two .py files from the gist in the OP and place them in a folder.
  5. Run the file containing the model definition via python3 ./main.py (or whatever you named it).

I can also attempt to see if it can be reproduced in a Docker container if that would be easier.

@Saduf2019 Saduf2019 added comp:keras Keras related issues TF 2.4 for issues related to TF 2.4 and removed stat:awaiting response Status - Awaiting response from author labels Apr 9, 2021
@Saduf2019 Saduf2019 assigned ymodak and unassigned Saduf2019 Apr 9, 2021
@ymodak ymodak added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Apr 22, 2021
@Jack-Fawcett

For those looking for a workaround, this worked for me when trying to load a serialized model:
tf.keras.backend.set_image_data_format("channels_last")
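
A sketch of how this can be applied, assuming the data format is set before the model is loaded or built (the model path below is a placeholder):

import tensorflow as tf

# Setting the global Keras image data format to channels_last appears to
# keep the recurrent bias add off the NCHW code path shown in the error.
tf.keras.backend.set_image_data_format("channels_last")

# Placeholder path; load (or build) the model only after the call above.
model = tf.keras.models.load_model("path/to/saved_model")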

@sushreebarsa
Contributor

@zoxoi We see that you are using an older version of TensorFlow. Many bugs have been fixed in the latest version. We recommend that you upgrade to the latest stable version, TensorFlow 2.6.0, and let us know if the issue still persists in the newer versions. Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Aug 25, 2021
@zoxoi
Author

zoxoi commented Aug 31, 2021

I'm not going to be able to test this for a bit, but it's on my to-do list; commenting just to stave off the inactivity autoclose.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Sep 2, 2021
@very-meanly

Bumping this to keep it open: still seeing this issue in tensorflow==2.9.1. Same pattern: no issue on macOS, but as soon as I run on a SageMaker notebook instance using Amazon Linux 2, the issue happens. For anyone else looking for a workaround, the suggestion from @Jack-Fawcett does prevent the error: tf.keras.backend.set_image_data_format("channels_last").

@rchao rchao assigned qlzh727 and unassigned qlzh727 Jul 14, 2022
@rchao
Contributor

rchao commented Jul 14, 2022

cc'ing @qlzh727

@Anirudh-Murali

Bumping this to keep it open: still seeing this issue in tensorflow==2.12 on a SageMaker notebook instance using Amazon Linux 2. As mentioned by @very-meanly, the suggestion from @Jack-Fawcett does prevent the error, for anyone else looking for a workaround: tf.keras.backend.set_image_data_format("channels_last").
