
Recurrent layers fail on (at least some) Linuxes where use_bias=True, succeeds on Macos #48195

Open
zoxoi opened this issue Mar 30, 2021 · 10 comments
Assignees: ymodak
Labels: comp:keras (Keras related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:bug (Bug)

Comments

@zoxoi

zoxoi commented Mar 30, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Custom code
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 (AWS Deep Learning AMI)
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.3, 2.4.1, several 2.5 and 2.6-nightly samples
  • Python version: 3.6
  • CUDA/cuDNN version: 11
  • GPU model and memory: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]; BUT also occurs on CPU wheels

Creating a subclassed model with an LSTM recurrent layer fails when use_bias=True (i.e. the default). The code can be found in the following gist:

https://gist.github.com/zoxoi/9c331f9df1fa22112e061638ea200ec8

Note that the model

from tensorflow.keras import Model, layers

class FailModel(Model):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # The default use_bias=True is what triggers the failure on the
        # affected Linux setups.
        self.lstm = layers.LSTM(64, use_bias=True)

    def call(self, inputs, training=False):
        return self.lstm(inputs)

is the only important bit, as far as I know. The rest is a small amount of supporting code needed to trigger the issue, such as compiling the model and creating a dummy dataset generator. To save space, the full traceback is in the linked gist; the ultimate error is:

ValueError: Shape must be at least rank 3 but is rank 2 for '{{node BiasAdd}} = BiasAdd[T=DT_FLOAT, data_format="NCHW"](add, bias)' with input shapes: [?,256], [256].
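
For reference, the supporting code in the gist amounts roughly to the following sketch; the optimizer, loss, batch size, and random dummy data here are assumptions for illustration, using the (batch, timesteps=96, features=100) shape mentioned below, and FailModel is the class shown above.

import numpy as np

model = FailModel()
model.compile(optimizer="adam", loss="mse")  # assumed optimizer/loss

# Dummy data: (batch, timesteps=96, features=100) inputs, targets matching
# the 64-unit LSTM output.
x = np.random.random((32, 96, 100)).astype("float32")
y = np.random.random((32, 64)).astype("float32")

# On the affected Linux machines this call raises the BiasAdd rank error above.
model.fit(x, y, epochs=1, batch_size=8)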

A few facts I've noticed:

  • It doesn't seem to fail when calling the model directly; the failure is limited to things such as model.fit(). Explicitly calling model(tf.zeros((1, 96, 100))) works fine in my testing.
  • The bias size given in the error always seems to be 4x the number of units (e.g. for 64 it's 256, for 100 it's 400, etc.), which presumably is the combined bias for the four LSTM gates. I don't know if this is surprising or not.
  • Since it may be related to the NCHW format, I tried setting the time_major argument to True, both leaving the data as-is and (more correctly) manually permuting/transposing the input into time-major order; no success (see the sketch after this list).
  • This bug happens on multiple TF versions; 2.3 is the earliest I can test, and it persists up to the most recent 2.6 nightly.
  • This happens even if you force TF not to discover a GPU, or install the CPU-only wheel instead.
  • However, this does NOT happen on my macOS Catalina 10.15.7 laptop with an integrated GPU (i.e. the CPU-only build is installed automatically): the layer works as intended with no errors.
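
As a concrete illustration of the time-major attempt mentioned above, here is a rough sketch (the shapes are the same dummy shapes used elsewhere in this report):

import tensorflow as tf
from tensorflow.keras import layers

# time_major=True expects inputs shaped (time, batch, features) instead of
# the default (batch, time, features).
lstm = layers.LSTM(64, use_bias=True, time_major=True)

x = tf.zeros((1, 96, 100))                 # (batch, time, features)
x_time_major = tf.transpose(x, [1, 0, 2])  # (time, batch, features)
out = lstm(x_time_major)  # direct calls like this work; the failure shows up under fit()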

For anybody else dealing with this: a workaround is to set use_bias=False when creating the LSTM. This has trade-offs, however (beyond the modelling cost of dropping the bias term): with use_bias off the layer won't use the cuDNN kernel, which in my testing carries a fairly drastic speed penalty (training is about 40 ms/step slower than on my laptop's CPU). A minimal sketch of the workaround follows.
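
from tensorflow.keras import Model, layers

class NoBiasModel(Model):  # hypothetical name, for illustration only
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # use_bias=False sidesteps the BiasAdd error, but the layer then no
        # longer qualifies for the cuDNN kernel and trains much more slowly.
        self.lstm = layers.LSTM(64, use_bias=False)

    def call(self, inputs, training=False):
        return self.lstm(inputs)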

If you want to test on the exact same setup I'm using, I'm experiencing the issue on the AWS Deep Learning AMI Version 42.1.

Further important things I haven't tested:

  • Whether this is limited to LSTM or affects other recurrent layers. [Update: it affects GRU and SimpleRNN as well; updating the title]
  • Whether this affects Windows in either configuration.
  • Whether, even though it fails on CPU, it's somehow related to CUDA 11 specifically. [Update: it also failed on a GPU-less Linux AMI]
  • Whether this affects the functional API as well; I've only used subclassed models.
@zoxoi zoxoi added the type:bug Bug label Mar 30, 2021
@zoxoi zoxoi changed the title LSTM fails on (at least some) Linuxes where use_bias=True, succeeds on OS X LSTM fails on (at least some) Linuxes where use_bias=True, succeeds on Macos Mar 30, 2021
@zoxoi zoxoi changed the title LSTM fails on (at least some) Linuxes where use_bias=True, succeeds on Macos Recurrent layers fail on (at least some) Linuxes where use_bias=True, succeeds on Macos Mar 30, 2021
@Saduf2019
Contributor

@zoxoi
I ran the code you shared on TF 2.4 and got a different error; please find the gist here.
Could you please share a Colab gist reproducing the reported issue?

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Mar 31, 2021
@zoxoi
Author

zoxoi commented Mar 31, 2021

I don't know what a Colab gist is. I put the minimal code necessary to reproduce the error in the gist I linked. It needs some supporting code; the snippet I pasted in the issue was just to point out the relevant code where the error occurs, and I didn't post the whole thing to avoid taking up space.

Edit: It's three files, or two if you don't use the environment.yml to set up the conda environment. You could easily merge the two .py files into one by copy-pasting if desired; I only split them to make it easier to see where the error was.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Apr 2, 2021
@Saduf2019
Contributor

@zoxoi
I ran the code but still get a different error; please find the gist here.

As per the reported error, could you refer to this link and let us know if it helps: link

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Apr 6, 2021
@zoxoi
Author

zoxoi commented Apr 9, 2021

I fixed the code in your link by inlining the code from the micro-module:
https://colab.research.google.com/drive/13YOXgRIjQ-QtKCXP8EcfUJGer8vCER03#scrollTo=R4gBFl1dMz1-&line=10&uniqifier=1

It runs successfully in the Colab gist, but it does NOT run on what appears to be x86_64 Ubuntu 18.04, and possibly other environments. I'm not surprised it runs on whatever the Google Compute backend is (TF is developed and maintained by Google, so that's expected), but the error still triggers on other setups.

I searched numerous Stack Overflow answers when researching this, including the one you linked, to make sure there isn't another problem. The issue is that on some machines, as you can see, the code runs successfully (as it should, since the shape is correct), but it fails on certain setups. I've now tried two different Ubuntu 18.04 AWS instances (one with a GPU and one without) and both failed. I could also try running it locally via a Linux live installation if that's important, but it appears that to reproduce this, someone on the team will need to run it in an environment other than that Colab gist.

If you want full reproduction steps:

  1. Start with a clean Ubuntu 18.04 installation on an x86_64 architecture. I'm unsure whether AMD or Intel matters; the presence of a GPU and CUDA does NOT seem to matter.
  2. Install any core dependencies not already present on that installation (e.g. build-essential, etc.).
  3. Install conda or miniconda and create and activate an environment via the environment.yml above, or install Python 3 via your preferred method (for full reproduction: I was using 3.6 because that's what we're pinned to due to other dependencies) and manually install TensorFlow 2.3+.
  4. Either copy that Colab gist into one file, or download the two .py files from the gist in the OP and place them in a folder.
  5. Run the file containing the model definition via python3 ./main.py (or whatever you named it).

I can also attempt to see if it can be reproduced in a Docker container if that would be easier.

@Saduf2019 Saduf2019 added comp:keras Keras related issues TF 2.4 for issues related to TF 2.4 and removed stat:awaiting response Status - Awaiting response from author labels Apr 9, 2021
@Saduf2019 Saduf2019 assigned ymodak and unassigned Saduf2019 Apr 9, 2021
@ymodak ymodak added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Apr 22, 2021
@Jack-Fawcett

For those looking for a workaround, this worked for me when trying to load a serialized model:
tf.keras.backend.set_image_data_format("channels_last")
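
A sketch of how this can be applied, assuming the data format is set before the model is loaded or built (the model path below is a placeholder):

import tensorflow as tf

# Setting the global Keras image data format to channels_last appears to
# keep the recurrent bias add off the NCHW code path shown in the error.
tf.keras.backend.set_image_data_format("channels_last")

# Placeholder path; load (or build) the model only after the call above.
model = tf.keras.models.load_model("path/to/saved_model")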

@sushreebarsa
Contributor

@zoxoi We see that you are using an older version of TensorFlow. Many bugs have been fixed in the latest version. We recommend that you upgrade to the latest stable version, TensorFlow 2.6.0, and let us know if the issue still persists in the newer versions. Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Aug 25, 2021
@zoxoi
Author

zoxoi commented Aug 31, 2021

I'm not going to be able to test this for a bit, but it's on my to-do list; commenting just to stave off the inactivity autoclose.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Sep 2, 2021
@very-meanly

Bumping this to keep it open: still seeing this issue in tensorflow==2.9.1. Same pattern: no issue on macOS, but as soon as I run on a SageMaker notebook instance using Amazon Linux 2, the issue happens. For anyone else looking for a workaround, the suggestion from @Jack-Fawcett does prevent the error: tf.keras.backend.set_image_data_format("channels_last").

@rchao rchao assigned qlzh727 and unassigned qlzh727 Jul 14, 2022
@rchao
Contributor

rchao commented Jul 14, 2022

cc'ing @qlzh727

@Anirudh-Murali

Bumping this to keep it open: still seeing this issue in tensorflow==2.12 on a SageMaker notebook instance using Amazon Linux 2. As mentioned by @very-meanly, the suggestion from @Jack-Fawcett does prevent the error, for anyone else looking for a workaround: tf.keras.backend.set_image_data_format("channels_last").
