Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tf.keras computes incorrect loss values with 3+D data #25970

Closed
bersbersbers opened this issue Feb 21, 2019 · 19 comments
Closed

tf.keras computes incorrect loss values with 3+D data #25970

bersbersbers opened this issue Feb 21, 2019 · 19 comments
Assignees
Labels
comp:keras Keras related issues TF 1.13 Issues related to TF 1.13 type:bug Bug

Comments

@bersbersbers
Copy link

bersbersbers commented Feb 21, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):

Yes. For a minimal example, run

from tensorflow import keras

layer = keras.layers.Input(shape=(1, 1, 1))
model = keras.models.Model(inputs=layer, outputs=layer)
model.compile(optimizer='adam', loss='poisson', metrics=['poisson'])
data = [[[[[1]]], [[[2]]], [[[3]]]]]
model.fit(x=data, y=data, batch_size=2, verbose=1, epochs=10)

and observe that loss and poisson values are different, and loss values vary:

Epoch 1/10
3/3 [==============================] - 1s 236ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 2/10
3/3 [==============================] - 0s 2ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 3/10
3/3 [==============================] - 0s 40ms/sample - loss: 0.5452 - poisson: 0.4393
Epoch 4/10
3/3 [==============================] - 0s 96ms/sample - loss: 0.5452 - poisson: 0.4393
Epoch 5/10
3/3 [==============================] - 0s 1ms/sample - loss: 0.9772 - poisson: 0.4393
Epoch 6/10
3/3 [==============================] - 0s 2ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 7/10
3/3 [==============================] - 1s 201ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 8/10
3/3 [==============================] - 0s 2ms/sample - loss: 0.6740 - poisson: 0.4393
Epoch 9/10
3/3 [==============================] - 0s 999us/sample - loss: 0.9772 - poisson: 0.4393
Epoch 10/10
3/3 [==============================] - 1s 327ms/sample - loss: 0.9772 - poisson: 0.4393
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Windows 10

  • TensorFlow installed from (source or binary):
    pip install tensorflow

  • TensorFlow version (use command below):
    v1.13.0-rc1-19-gc865ec5621, 1.13.0-rc2

  • Python version:
    3.7.2 x64

  • CUDA/cuDNN version:
    n/a

  • GPU model and memory:
    n/a

Describe the current behavior
When fitting a model with loss="poisson", I would expect reported loss and poisson values to be identical.

Describe the expected behavior
loss values are incorrect. They vary from epoch to epoch.

Code to reproduce the issue
See above.

Other info / logs

More code examples and investigations at https://stackoverflow.com/q/54802328/880783

@facaiy facaiy added type:bug Bug comp:keras Keras related issues labels Feb 22, 2019
@pavithrasv
Copy link
Member

@bersbersbers are you still seeing this issue? I was not able to repro this on the latest nightly.

@bersbersbers
Copy link
Author

@pavithrasv you are right, tf_nightly-1.13.0.dev20190227 does not have this issue. I can still repro it in 1.13.0rc2 as well as in 1.13.1 which has been released in the meantime. Since the issue reproduces in a stable release, I would be very interested what the underlying cause is and, in particular, how long it has existed.

@bersbersbers
Copy link
Author

bersbersbers commented Feb 28, 2019

Here's my pip freeze in my current installation that does repro the issue:

absl-py==0.7.0
altgraph==0.16.1
astor==0.7.1
astroid==2.1.0
astropy==3.1.2
autopep8==1.4.3
awkward==0.8.4
bz2file==0.98
cachetools==3.1.0
certifi==2018.11.29
chardet==3.0.4
colorama==0.4.1
cycler==0.10.0
decorator==4.3.2
future==0.17.1
gast==0.2.2
grpcio==1.18.0
h5py==2.9.0
idna==2.8
imageio==2.5.0
imageio-ffmpeg==0.2.0
isbnlib==3.9.6
isbnlib-dnb==0.0.3
isbntools==4.3.19
isort==4.3.9
Keras==2.2.4
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
kiwisolver==1.0.1
lazy-object-proxy==1.3.1
macholib==1.11
Markdown==3.0.1
matplotlib==3.0.2
mccabe==0.6.1
mock==2.0.0
moviepy==1.0.0
nibabel==2.3.3
numpy==1.16.1
packaging==19.0
pandas==0.24.1
pbr==5.1.2
pefile==2018.8.8
Pillow==5.4.1
pip-review==1.0
pipdeptree==0.13.2
proglog==0.1.9
protobuf==3.6.1
psutil==5.5.1
py-essentials==1.4.12
pycodestyle==2.5.0
pydicom==1.2.2
pyhibp==3.0.0
PyInstaller==3.4
PyJWT==1.7.1
pylint==2.2.2
pyparsing==2.3.1
pypng==0.0.19
python-dateutil==2.8.0
pytz==2018.9
pywin32==224
pywin32-ctypes==0.2.0
PyYAML==3.13
requests==2.21.0
rope==0.12.0
scikit-learn==0.20.2
scipy==1.2.1
seaborn==0.9.0
six==1.12.0
sklearn==0.0
tee==0.0.3
tensorboard==1.13.0
tensorflow==1.13.1
tensorflow-estimator==1.13.0
termcolor==1.1.0
tqdm==4.31.1
uproot==3.4.6
uproot-methods==0.4.3
urllib3==1.24.1
Werkzeug==0.14.1
wrapt==1.11.1

@bersbersbers
Copy link
Author

bersbersbers commented Mar 6, 2019

I tried to pin down when this issue was introduced and fixed:

True = has bug

tf-nightly==
1.13.0-dev20190101    True
1.13.0-dev20190124    True
1.13.0-dev20190125    True
1.13.0-dev20190129    False
1.13.0-dev20190206    False
1.13.0.dev20190227    False
1.14.1-dev20190306    False

tensorflow==
1.10.0        False (Aug 8, 2018)
1.11.0-rc0    True (Sep 13, 2018)
1.11.0        True
1.12.0        True
1.13.0-rc1    True
1.13.1        True (Feb 26, 2019)
2.0.0-alpha0  False (Mar 6, 2019)

So, introduced some time in Aug/Sep 2018 - due to missing tf-nightly packages on PyPi from that time, I cannot get any closer. Fixed some time between Jan 25 and 29. That makes about 600 commits:
https://github.com/tensorflow/tensorflow/search?q=committer-date%3A2019-01-24..2019-01-30&unscoped_q=committer-date%3A2019-01-24..2019-01-30&type=Commits

@goldiegadde goldiegadde added the TF 1.13 Issues related to TF 1.13 label Mar 11, 2019
@pavithrasv
Copy link
Member

I am closing this issue as it has been fixed. Thank you for digging into the release details!

@bersbersbers
Copy link
Author

bersbersbers commented Apr 29, 2019

This bug is still in the most current version, 1.13.1. Is there any release scheduled to be released soon to fix this? If not, I would be glad to use a workaround, but so far, I have not found any.

By the way, this is a reduced example where the batch size does divide the number of samples:

from tensorflow.keras import layers, metrics, models

layer = layers.Input(shape=(1, 1, 1))
model = models.Model(inputs=layer, outputs=layer)
model.compile(optimizer='adam', loss=metrics.mse, metrics=[metrics.mse])
model.fit(x=[[[[[0]]], [[[0]]]]], y=[[[[[1]]], [[[1]]]]])

It outputs:

2/2 [==============================]
 - 0s 23ms/sample - loss: 2.0000 - mean_squared_error: 1.0000

Note how loss and mean_squared_error are different. How can I get them to be identical?

@bersbersbers bersbersbers changed the title tf.keras computes incorrect losses with 3+D data if batch size does not divide number of samples tf.keras computes incorrect loss values with 3+D data Apr 29, 2019
@pavithrasv
Copy link
Member

Have you tried https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0? We will also have a TF 1.14 release very soon.

@bersbersbers
Copy link
Author

Have you tried https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0?

Yes, I have, see #25970 (comment). TF2 does not have that issue, but I don't want my research to rely on Alpha software currently :)

We will also have a TF 1.14 release very soon.

That is good news, thank you!

@bersbersbers
Copy link
Author

I can confirm that this bug has been fixed in 1.14.0rc0:

from tensorflow import keras
layer = keras.layers.Input(shape=(1, 1, 1))
model = keras.models.Model(inputs=layer, outputs=layer)
model.compile(optimizer='adam', loss='poisson')
data = [[[[[1]]], [[[2]]], [[[3]]]]]
model.fit(x=data, y=data, batch_size=2, verbose=2, epochs=10)

Output:

WARNING: Logging before flag parsing goes to stderr.
W0525 11:03:20.369000  6872 __init__.py:308] Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Epoch 1/10
3/3 - 0s - loss: 0.4393
Epoch 2/10
3/3 - 0s - loss: 0.4393
Epoch 3/10
3/3 - 0s - loss: 0.4393
Epoch 4/10
3/3 - 0s - loss: 0.4393
Epoch 5/10
3/3 - 0s - loss: 0.4393
Epoch 6/10
3/3 - 0s - loss: 0.4393
Epoch 7/10
3/3 - 0s - loss: 0.4393
Epoch 8/10
3/3 - 0s - loss: 0.4393
Epoch 9/10
3/3 - 0s - loss: 0.4393
Epoch 10/10
3/3 - 0s - loss: 0.4393

@pavithrasv
Copy link
Member

Thank you!

@arwen-x
Copy link

arwen-x commented Apr 10, 2020

I am using tf 2.1.0 and experience the same problem, can you suggest anything?
Thank you!

@alrifai
Copy link

alrifai commented May 10, 2020

I am also getting a delta between mse loss and mse metric values, but only when applying regularization (l2 or dropout).

@oO0oO0oO0o0o00
Copy link

image

+1

@AlmCoding
Copy link

I have the same problem with tensorflow 2.3.0 when using l1/l2 regularization.

@michalCyberfish
Copy link

I'm having the same issue with TensorFlow GPU 2.1.0 and no regularization. However, this happens only on the validation step.

@ofiryaish
Copy link

@oO0oO0oO0o0o00 I have the same strange problem.

@HoltSpalding
Copy link

same problem

@BoubakerA
Copy link

BoubakerA commented Dec 7, 2022

When using weight regularization it seems to me that it is normal that the two functions don't ouptut the same results. That is because regularization adds the squared values of all weights to the loss function. In the other hand the MSE metric computes the MSE between the true output and the predicted one.

@svh7rng
Copy link

svh7rng commented Mar 24, 2023

I have a similar problem.
I have written my own test_step(self, data):

    def test_step(self, data):
        # Unpack the data
        x, y, sample_weight = data_adapter.unpack_x_y_sample_weight(data)

        y_pred = self.student(x[0], training=False)
        student_loss = self.compute_loss(x[0], y, y_pred, sample_weight)
        distillation_loss = self.distillation_loss_fn(tf.nn.softmax(x[1] / self.temperature, axis=1), tf.nn.softmax(y_pred / self.temperature, axis=1))* self.temperature**2          
        loss = self.alpha *student_loss + (1 - self.alpha) *distillation_loss
        
        self.compute_metrics(x[0], y, y_pred, sample_weight)
       
        results = {m.name: m.result() for m in self.metrics}
        results.update(
            {"student_loss": student_loss, "distillation_loss": distillation_loss, "minimized_loss": loss}
        )
        return results

during the validation of my model in the fit call, I get these outputs for the validation:

Epoch 1: LearningRateScheduler setting learning rate to 0.004999999888241291.
Epoch 1/15
9050/9050 [==============================] - 31s 3ms/step - loss: 0.4114 - sparse_categorical_accuracy: 0.8437 - student_loss: 0.4114 - distillation_loss: 0.5873 - minimized_loss: 0.4378 - val_loss: 0.3046 - val_sparse_categorical_accuracy: 0.9034 - val_student_loss: 0.2388 - val_distillation_loss: 0.4188 - val_minimized_loss: 0.2658 - lr: 0.0050

Epoch 2: LearningRateScheduler setting learning rate to 0.004999999888241291.
Epoch 2/15
9050/9050 [==============================] - 30s 3ms/step - loss: 0.3719 - sparse_categorical_accuracy: 0.8597 - student_loss: 0.3720 - distillation_loss: 0.5534 - minimized_loss: 0.3992 - val_loss: 0.3187 - val_sparse_categorical_accuracy: 0.8923 - val_student_loss: 0.3967 - val_distillation_loss: 0.3904 - val_minimized_loss: 0.3958 - lr: 0.0050

I wonder why val_loss is not the same as val_student_loss, since they are supposed to be the same. When I call model.evaluate() after training, val_loss and val_student_loss are the same.
I noticed that when the batch size for my generator is equal to the size of the dataset, the correct results are calculated. The generator is called in a tf.keras.utils.Sequence Object which is used for the validation data. I assume that the output is calculated based on the last batch only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:keras Keras related issues TF 1.13 Issues related to TF 1.13 type:bug Bug
Projects
None yet
Development

No branches or pull requests