
tf.keras computes incorrect loss values with 3+D data #25970

Closed
bersbersbers opened this issue Feb 21, 2019 · 19 comments
Labels: comp:keras (Keras related issues), TF 1.13 (Issues related to TF 1.13), type:bug (Bug)

@bersbersbers commented Feb 21, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):

Yes. For a minimal example, run

from tensorflow import keras

layer = keras.layers.Input(shape=(1, 1, 1))
model = keras.models.Model(inputs=layer, outputs=layer)
model.compile(optimizer='adam', loss='poisson', metrics=['poisson'])
data = [[[[1]]], [[[2]]], [[[3]]]]
model.fit(x=data, y=data, batch_size=2, verbose=1, epochs=10)

and observe that the loss and poisson values differ, and that the loss varies from epoch to epoch:

Epoch 1/10
3/3 [==============================] - 1s 236ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 2/10
3/3 [==============================] - 0s 2ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 3/10
3/3 [==============================] - 0s 40ms/sample - loss: 0.5452 - poisson: 0.4393
Epoch 4/10
3/3 [==============================] - 0s 96ms/sample - loss: 0.5452 - poisson: 0.4393
Epoch 5/10
3/3 [==============================] - 0s 1ms/sample - loss: 0.9772 - poisson: 0.4393
Epoch 6/10
3/3 [==============================] - 0s 2ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 7/10
3/3 [==============================] - 1s 201ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 8/10
3/3 [==============================] - 0s 2ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 9/10
3/3 [==============================] - 0s 999us/sample - loss: 0.9772 - poisson: 0.4393
Epoch 10/10
3/3 [==============================] - 1s 327ms/sample - loss: 0.9772 - poisson: 0.4393
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Windows 10

  • TensorFlow installed from (source or binary):
    pip install tensorflow

  • TensorFlow version (use command below):
    v1.13.0-rc1-19-gc865ec5621, 1.13.0-rc2

  • Python version:
    3.7.2 x64

  • CUDA/cuDNN version:
    n/a

  • GPU model and memory:
    n/a

Describe the current behavior
loss values are incorrect. They vary from epoch to epoch.

Describe the expected behavior
When fitting a model with loss='poisson', I would expect the reported loss and poisson values to be identical.

Code to reproduce the issue
See above.

Other info / logs

More code examples and investigations at https://stackoverflow.com/q/54802328/880783
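
For reference, a quick hand computation (an added sketch, not part of the original report) confirms that the 0.4393 reported by the poisson metric is the correct value for this identity model: the Keras Poisson loss is mean(y_pred - y_true * log(y_pred + eps)), and for y_pred == y_true == [1, 2, 3] that mean is about 0.4393, matching the metric but not the fluctuating loss values in the log.

import numpy as np

y = np.array([1.0, 2.0, 3.0])
eps = 1e-7  # keras.backend.epsilon()
# Poisson loss when predictions equal targets, as in the identity model above
print(np.mean(y - y * np.log(y + eps)))  # ~0.4393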

@facaiy added the type:bug (Bug) and comp:keras (Keras related issues) labels Feb 22, 2019
@pavithrasv (Member) commented:

@bersbersbers are you still seeing this issue? I was not able to repro this on the latest nightly.

@bersbersbers (Author) commented:

@pavithrasv you are right, tf_nightly-1.13.0.dev20190227 does not have this issue. I can still repro it in 1.13.0rc2 as well as in 1.13.1, which has been released in the meantime. Since the issue reproduces in a stable release, I would be very interested in what the underlying cause is and, in particular, how long it has existed.

@bersbersbers (Author) commented Feb 28, 2019

Here's the pip freeze of my current installation, which does repro the issue:

absl-py==0.7.0
altgraph==0.16.1
astor==0.7.1
astroid==2.1.0
astropy==3.1.2
autopep8==1.4.3
awkward==0.8.4
bz2file==0.98
cachetools==3.1.0
certifi==2018.11.29
chardet==3.0.4
colorama==0.4.1
cycler==0.10.0
decorator==4.3.2
future==0.17.1
gast==0.2.2
grpcio==1.18.0
h5py==2.9.0
idna==2.8
imageio==2.5.0
imageio-ffmpeg==0.2.0
isbnlib==3.9.6
isbnlib-dnb==0.0.3
isbntools==4.3.19
isort==4.3.9
Keras==2.2.4
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
kiwisolver==1.0.1
lazy-object-proxy==1.3.1
macholib==1.11
Markdown==3.0.1
matplotlib==3.0.2
mccabe==0.6.1
mock==2.0.0
moviepy==1.0.0
nibabel==2.3.3
numpy==1.16.1
packaging==19.0
pandas==0.24.1
pbr==5.1.2
pefile==2018.8.8
Pillow==5.4.1
pip-review==1.0
pipdeptree==0.13.2
proglog==0.1.9
protobuf==3.6.1
psutil==5.5.1
py-essentials==1.4.12
pycodestyle==2.5.0
pydicom==1.2.2
pyhibp==3.0.0
PyInstaller==3.4
PyJWT==1.7.1
pylint==2.2.2
pyparsing==2.3.1
pypng==0.0.19
python-dateutil==2.8.0
pytz==2018.9
pywin32==224
pywin32-ctypes==0.2.0
PyYAML==3.13
requests==2.21.0
rope==0.12.0
scikit-learn==0.20.2
scipy==1.2.1
seaborn==0.9.0
six==1.12.0
sklearn==0.0
tee==0.0.3
tensorboard==1.13.0
tensorflow==1.13.1
tensorflow-estimator==1.13.0
termcolor==1.1.0
tqdm==4.31.1
uproot==3.4.6
uproot-methods==0.4.3
urllib3==1.24.1
Werkzeug==0.14.1
wrapt==1.11.1

@bersbersbers (Author) commented Mar 6, 2019

I tried to pin down when this issue was introduced and fixed:

True = has bug

tf-nightly==
1.13.0-dev20190101    True
1.13.0-dev20190124    True
1.13.0-dev20190125    True
1.13.0-dev20190129    False
1.13.0-dev20190206    False
1.13.0.dev20190227    False
1.14.1-dev20190306    False

tensorflow==
1.10.0        False (Aug 8, 2018)
1.11.0-rc0    True (Sep 13, 2018)
1.11.0        True
1.12.0        True
1.13.0-rc1    True
1.13.1        True (Feb 26, 2019)
2.0.0-alpha0  False (Mar 6, 2019)

So the bug was introduced some time in Aug/Sep 2018; due to missing tf-nightly packages on PyPI from that period, I cannot narrow it down further. It was fixed some time between Jan 25 and 29, 2019, which leaves about 600 commits:
https://github.com/tensorflow/tensorflow/search?q=committer-date%3A2019-01-24..2019-01-30&unscoped_q=committer-date%3A2019-01-24..2019-01-30&type=Commits
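
For anyone repeating the bisection, a rough sketch of the kind of check used to classify a given install as "True = has bug" could look like this (the function name and tolerance are illustrative, not from the original posts): it fits the identity model and flags the install if any reported loss deviates from the constant poisson metric value.

import numpy as np
from tensorflow import keras

def has_loss_bug(tolerance=1e-3):
    # Identity model: there are no trainable weights, so every reported loss
    # value should equal the (constant) poisson metric value of ~0.4393.
    layer = keras.layers.Input(shape=(1, 1, 1))
    model = keras.models.Model(inputs=layer, outputs=layer)
    model.compile(optimizer='adam', loss='poisson', metrics=['poisson'])
    data = np.array([1.0, 2.0, 3.0]).reshape(3, 1, 1, 1)
    history = model.fit(x=data, y=data, batch_size=2, verbose=0, epochs=10).history
    return any(abs(l - p) > tolerance
               for l, p in zip(history['loss'], history['poisson']))

print(has_loss_bug())  # True = has bug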

@goldiegadde added the TF 1.13 (Issues related to TF 1.13) label Mar 11, 2019
@pavithrasv (Member) commented:

I am closing this issue as it has been fixed. Thank you for digging into the release details!

@bersbersbers (Author) commented Apr 29, 2019

This bug is still present in the most current release, 1.13.1. Is a release that fixes it scheduled soon? If not, I would be glad to use a workaround, but so far I have not found one.

By the way, this is a reduced example where the batch size does divide the number of samples:

from tensorflow.keras import layers, metrics, models

layer = layers.Input(shape=(1, 1, 1))
model = models.Model(inputs=layer, outputs=layer)
model.compile(optimizer='adam', loss=metrics.mse, metrics=[metrics.mse])
model.fit(x=[[[[0]]], [[[0]]]], y=[[[[1]]], [[[1]]]])

It outputs:

2/2 [==============================] - 0s 23ms/sample - loss: 2.0000 - mean_squared_error: 1.0000

Note how loss and mean_squared_error are different. How can I get them to be identical?
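
For reference (a quick check added here, not part of the original comment): the element-wise MSE between the all-zero prediction of this identity model and the all-one target is exactly 1.0, so the mean_squared_error metric is correct and the reported loss of 2.0 is not.

import numpy as np

y_pred = np.zeros((2, 1, 1, 1))  # the model is an identity, so predictions equal x
y_true = np.ones((2, 1, 1, 1))
print(np.mean((y_true - y_pred) ** 2))  # 1.0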

@bersbersbers changed the title from "tf.keras computes incorrect losses with 3+D data if batch size does not divide number of samples" to "tf.keras computes incorrect loss values with 3+D data" Apr 29, 2019
@pavithrasv (Member) commented:

Have you tried https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0? We will also have a TF 1.14 release very soon.

@bersbersbers (Author) commented:

> Have you tried https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0?

Yes, I have, see #25970 (comment). TF2 does not have that issue, but I don't want my research to rely on alpha software currently :)

> We will also have a TF 1.14 release very soon.

That is good news, thank you!

@bersbersbers (Author) commented:

I can confirm that this bug has been fixed in 1.14.0rc0:

from tensorflow import keras
layer = keras.layers.Input(shape=(1, 1, 1))
model = keras.models.Model(inputs=layer, outputs=layer)
model.compile(optimizer='adam', loss='poisson')
data = [[[[1]]], [[[2]]], [[[3]]]]
model.fit(x=data, y=data, batch_size=2, verbose=2, epochs=10)

Output:

WARNING: Logging before flag parsing goes to stderr.
W0525 11:03:20.369000  6872 __init__.py:308] Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Epoch 1/10
3/3 - 0s - loss: 0.4393
Epoch 2/10
3/3 - 0s - loss: 0.4393
Epoch 3/10
3/3 - 0s - loss: 0.4393
Epoch 4/10
3/3 - 0s - loss: 0.4393
Epoch 5/10
3/3 - 0s - loss: 0.4393
Epoch 6/10
3/3 - 0s - loss: 0.4393
Epoch 7/10
3/3 - 0s - loss: 0.4393
Epoch 8/10
3/3 - 0s - loss: 0.4393
Epoch 9/10
3/3 - 0s - loss: 0.4393
Epoch 10/10
3/3 - 0s - loss: 0.4393

@pavithrasv (Member) commented:

Thank you!

@arwen-x commented Apr 10, 2020

I am using TF 2.1.0 and am experiencing the same problem. Can you suggest anything?
Thank you!

@alrifai commented May 10, 2020

I am also getting a delta between mse loss and mse metric values, but only when applying regularization (l2 or dropout).

@oO0oO0oO0o0o00 commented:

+1

@AlmCoding commented:

I have the same problem with tensorflow 2.3.0 when using l1/l2 regularization.

@michalCyberfish commented:

I'm having the same issue with TensorFlow GPU 2.1.0 and no regularization. However, this happens only on the validation step.

@ofiryaish commented:

@oO0oO0oO0o0o00 I have the same strange problem.

@HoltSpalding commented:

same problem

@BoubakerA commented Dec 7, 2022

When using weight regularization, it seems normal to me that the two do not report the same value: regularization adds the (squared) weight penalties to the loss function, whereas the MSE metric only computes the MSE between the true output and the predicted one.
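
A small sketch illustrating this point (the model and values below are illustrative, not from this thread): with an L2 kernel regularizer, the reported loss equals the MSE plus the penalty terms collected in model.losses, while the mse metric measures only the prediction error, so the two differ by the penalty.

import numpy as np
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(1, kernel_regularizer=keras.regularizers.l2(0.1),
                       input_shape=(4,))
])
model.compile(optimizer="adam", loss="mse", metrics=["mse"])

x = np.random.rand(8, 4).astype("float32")
y = np.random.rand(8, 1).astype("float32")
results = model.evaluate(x, y, verbose=0)  # [loss, mse metric]
penalty = float(tf.add_n(model.losses))    # L2 penalty on the kernel
print(results[0], results[1] + penalty)    # approximately equal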

@svh7rng commented Mar 24, 2023

I have a similar problem.
I have written my own test_step(self, data):

    def test_step(self, data):
        # Unpack the data (data_adapter here is keras' engine.data_adapter module;
        # tf.keras.utils.unpack_x_y_sample_weight is the public equivalent)
        x, y, sample_weight = data_adapter.unpack_x_y_sample_weight(data)

        y_pred = self.student(x[0], training=False)
        student_loss = self.compute_loss(x[0], y, y_pred, sample_weight)
        distillation_loss = self.distillation_loss_fn(
            tf.nn.softmax(x[1] / self.temperature, axis=1),
            tf.nn.softmax(y_pred / self.temperature, axis=1),
        ) * self.temperature ** 2
        loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss

        self.compute_metrics(x[0], y, y_pred, sample_weight)

        results = {m.name: m.result() for m in self.metrics}
        results.update(
            {"student_loss": student_loss,
             "distillation_loss": distillation_loss,
             "minimized_loss": loss}
        )
        return results

During validation within the fit call, I get these outputs:

Epoch 1: LearningRateScheduler setting learning rate to 0.004999999888241291.
Epoch 1/15
9050/9050 [==============================] - 31s 3ms/step - loss: 0.4114 - sparse_categorical_accuracy: 0.8437 - student_loss: 0.4114 - distillation_loss: 0.5873 - minimized_loss: 0.4378 - val_loss: 0.3046 - val_sparse_categorical_accuracy: 0.9034 - val_student_loss: 0.2388 - val_distillation_loss: 0.4188 - val_minimized_loss: 0.2658 - lr: 0.0050

Epoch 2: LearningRateScheduler setting learning rate to 0.004999999888241291.
Epoch 2/15
9050/9050 [==============================] - 30s 3ms/step - loss: 0.3719 - sparse_categorical_accuracy: 0.8597 - student_loss: 0.3720 - distillation_loss: 0.5534 - minimized_loss: 0.3992 - val_loss: 0.3187 - val_sparse_categorical_accuracy: 0.8923 - val_student_loss: 0.3967 - val_distillation_loss: 0.3904 - val_minimized_loss: 0.3958 - lr: 0.0050

I wonder why val_loss is not the same as val_student_loss, since they should be identical. When I call model.evaluate() after training, they are the same.
I also noticed that when the batch size of my generator (a tf.keras.utils.Sequence object used for the validation data) equals the size of the dataset, the correct results are calculated, so I assume the reported values are computed from the last batch only.
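
One way to address the last-batch behaviour, sketched below under the assumption of a TF 2.x subclassed model (the class and names are illustrative, not svh7rng's actual model): track the extra quantities with keras.metrics.Mean objects and expose them via the metrics property, so Keras averages them over all validation batches and resets them each epoch instead of reporting the raw value of the last batch.

import tensorflow as tf
from tensorflow import keras

class TrackedModel(keras.Model):
    """Toy model showing the Mean-tracker pattern for extra losses in test_step."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.dense = keras.layers.Dense(1)
        self.extra_loss_tracker = keras.metrics.Mean(name="extra_loss")

    def call(self, inputs):
        return self.dense(inputs)

    @property
    def metrics(self):
        # Listing the tracker here makes Keras reset it each epoch and report
        # its running average instead of the value from the last batch only.
        return super().metrics + [self.extra_loss_tracker]

    def test_step(self, data):
        x, y = data
        y_pred = self(x, training=False)
        self.compiled_loss(y, y_pred)
        self.compiled_metrics.update_state(y, y_pred)
        # Stand-in for an extra loss (e.g. a student loss); averaged across batches.
        self.extra_loss_tracker.update_state(tf.reduce_mean(tf.abs(y - y_pred)))
        return {m.name: m.result() for m in self.metrics}

model = TrackedModel()
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
x = tf.random.normal((64, 4))
y = tf.random.normal((64, 1))
model.fit(x, y, validation_data=(x, y), batch_size=8, epochs=1)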
