
KerasClassifier.score is ... broken!? #38004

Closed
FirefoxMetzger opened this issue Mar 28, 2020 · 13 comments
Labels: comp:keras (Keras related issues), TF 2.2 (Issues related to TF 2.2), type:bug (Bug)
@FirefoxMetzger

I am using the scikit_learn wrapper to wrap a Keras model and train/evaluate it in scikit-learn. Calling KerasClassifier.score should return the accuracy of the classifier; however, no matter what I do, it just doesn't.

Looking at the source, the score method does two things:

  1. In case of sparse labels, it converts them to a one-hot matrix (lines 296-300).
  2. It calls Sequential.evaluate and then hopes to find a metric called acc or accuracy, which it treats as the accuracy of the model (lines 302-307).

If it doesn't find a metric with the right name, it raises an exception.
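In pseudocode, the score logic amounts to something like the following. This is a paraphrase of the description above, not the verbatim source; _maybe_one_hot is a hypothetical stand-in for the label conversion:

# Paraphrase of KerasClassifier.score -- a sketch, not the actual TF source.
def score(self, x, y, **kwargs):
    y = _maybe_one_hot(y)  # hypothetical helper: sparse labels -> one-hot
    outputs = self.model.evaluate(x, y, **kwargs)
    for name, output in zip(self.model.metrics_names, outputs):
        if name in ['accuracy', 'acc']:  # <-- the problematic name check
            return output
    raise ValueError('The model is not configured to compute accuracy.')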

I don't understand how this could possibly work (and it doesn't work for me). Given that the target labels are one-hot encoded, the correct metric to use is CategoricalAccuracy; however, it is named

def __init__(self, name='categorical_accuracy', dtype=None):

As a result, KerasClassifier.score raises an exception. Worse, the error message suggests adding the Accuracy metric to the model. This is misleading: following the suggestion makes the error disappear and returns a value, but that value is not ... accurate (pun intended).

I suggest renaming accuracy in

if name in ['accuracy', 'acc']:

to categorical_accuracy. While at it, I also suggest adding _estimator_type = "classifier" as a class variable. scikit-learn checks for it to identify KerasClassifier as a classifier, and without it a lot of functionality doesn't work as intended.
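For reference, scikit-learn's is_classifier check is just an attribute comparison, so the proposed one-line class attribute would be enough. A minimal sketch (TaggedKerasClassifier is a hypothetical name):

from sklearn.base import is_classifier
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Hypothetical subclass carrying the proposed class attribute:
class TaggedKerasClassifier(KerasClassifier):
    _estimator_type = "classifier"

# is_classifier(est) boils down to:
#   getattr(est, "_estimator_type", None) == "classifier"
print(is_classifier(KerasClassifier()))        # False (attribute missing)
print(is_classifier(TaggedKerasClassifier()))  # True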

If there is agreement for this change I can submit a PR.

@amahendrakar
Contributor

@FirefoxMetzger,
In order to expedite the troubleshooting process, could you please provide the complete code to reproduce the issue reported here, along with the TensorFlow version you are using? Thanks!

@amahendrakar amahendrakar added the stat:awaiting response Status - Awaiting response from author label Mar 30, 2020
@FirefoxMetzger
Author

Sure: tf.__version__ == '2.0.0'
Minimal (not) working example to demonstrate the problem:

import tensorflow as tf
import numpy as np
import sklearn.preprocessing
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier 

## --- LOAD DATA ---
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()

encoder = sklearn.preprocessing.OneHotEncoder(dtype=np.float32)
encoder.fit(Y_train.reshape((-1, 1)))

Y_train_encoded = encoder.transform(Y_train.reshape((-1, 1))).toarray()
Y_test_encoded = encoder.transform(Y_test.reshape((-1, 1))).toarray()


## --- SETUP MODEL ---
def setupModel():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(28,28)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Softmax())

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=[tf.keras.metrics.CategoricalAccuracy()]
                 )
    return model

# this model now uses the sklearn API instead of the Keras API
# it still trains using tensorflow under the hood
model = KerasClassifier(build_fn=setupModel,
                        epochs=2, # show that training executes
                        batch_size=256,
                       )

# train the model
model.fit(X_train, Y_train_encoded)

# evaluate the test accuracy
test_accuracy = model.score(X_test, Y_test_encoded, verbose=0)
print(f"The test accuracy is {test_accuracy * 100:.2f}%")

Minimal (not) working example to demonstrate the Accuracy metric problem:

import tensorflow as tf
import numpy as np
import sklearn.preprocessing
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier 

## --- LOAD DATA ---
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()

encoder = sklearn.preprocessing.OneHotEncoder(dtype=np.float32)
encoder.fit(Y_train.reshape((-1, 1)))

Y_train_encoded = encoder.transform(Y_train.reshape((-1, 1))).toarray()
Y_test_encoded = encoder.transform(Y_test.reshape((-1, 1))).toarray()


## --- SETUP MODEL ---
def setupModel():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(28,28)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Softmax())

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=[tf.keras.metrics.Accuracy()]
                 )
    return model

# this model now uses the sklearn API instead of the Keras API
# it still trains using tensorflow under the hood
model = KerasClassifier(build_fn=setupModel,
                        epochs=2, # show that training executes
                        batch_size=256,
                       )

# train the model
model.fit(X_train, Y_train_encoded)

# evaluate the test accuracy
test_accuracy = model.score(X_test, Y_test_encoded, verbose=0)
print(f"The test accuracy is {test_accuracy * 100:.2f}%")

Y_pred = model.model.predict(X_test)
acc = tf.keras.metrics.CategoricalAccuracy()
acc.update_state(Y_test_encoded, Y_pred)  # CategoricalAccuracy expects one-hot targets
print(f"The (true) categorical accuracy is {acc.result().numpy()*100:.2f}%")

Output:

Train on 60000 samples
Epoch 1/2
60000/60000 [==============================] - 1s 13us/sample - loss: 67.0612 - accuracy: 0.6470
Epoch 2/2
60000/60000 [==============================] - 1s 9us/sample - loss: 29.1524 - accuracy: 0.7125
The test accuracy is 72.88%
The (true) categorical accuracy is 10.40%

The reason for this discrepancy is described in the opening comment.
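As an extra sanity check, the same reference number can be computed with scikit-learn (a hypothetical addition to the script above, reusing Y_pred):

import sklearn.metrics

# accuracy_score expects class indices, so take the argmax of the probabilities
sk_acc = sklearn.metrics.accuracy_score(Y_test, Y_pred.argmax(axis=1))
print(f"scikit-learn accuracy: {sk_acc * 100:.2f}%")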

@amahendrakar
Contributor

Was able to reproduce the issue with TF v2.0, TF v2.1 and TF-nightly. Please find the attached gist. Thanks!

@amahendrakar amahendrakar added comp:keras Keras related issues TF 2.0 Issues relating to TensorFlow 2.0 type:bug Bug and removed stat:awaiting response Status - Awaiting response from author labels Apr 1, 2020
@jvishnuvardhan
Contributor

@FirefoxMetzger As mentioned in the error message, I changed the metric in compile from metrics=[tf.keras.metrics.CategoricalAccuracy()] to metrics=['accuracy']. With that change everything worked as expected. Please check the gist here.

Please close the issue if this was resolved for you. Thanks!

@jvishnuvardhan jvishnuvardhan added the stat:awaiting response Status - Awaiting response from author label Apr 1, 2020
@FirefoxMetzger
Author

You are correct: it does resolve the exception, as I mentioned in my issue. It does not solve the underlying problem, though. The code simply doesn't crash anymore and instead silently computes the wrong accuracy, which is arguably much worse, because it is easier to miss.

Using metrics=['accuracy'] is, as far as I am aware, shorthand for metrics=[tf.keras.metrics.Accuracy()], and, as demonstrated in my reply to @amahendrakar, that metric is unsuited for one-hot encoded data. Running the test case I posted above against @jvishnuvardhan's modified code produces the following output:

Epoch 1/2
235/235 [==============================] - 1s 3ms/step - loss: 51.1440 - accuracy: 0.3992
Epoch 2/2
235/235 [==============================] - 1s 3ms/step - loss: 25.3406 - accuracy: 0.5925
The test accuracy is 61.51%
The (true) categorical accuracy is 10.04%

Both accuracy numbers should be identical. Both represent the fraction of the test dataset that has been assigned the correct label by the model, and both numbers use the same model and same test data.
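The gap comes from what the two metrics actually compare: Accuracy checks element-wise equality of labels and predictions, while CategoricalAccuracy compares the argmax of each row. A toy sketch (values chosen purely for illustration):

import tensorflow as tf

y_true = [[0., 0., 1.], [0., 1., 0.]]        # one-hot targets
y_pred = [[0.1, 0.2, 0.7], [0.2, 0.6, 0.2]]  # softmax-style outputs

acc = tf.keras.metrics.Accuracy()
acc.update_state(y_true, y_pred)
print(acc.result().numpy())  # 0.0 -- element-wise equality, no argmax

cat = tf.keras.metrics.CategoricalAccuracy()
cat.update_state(y_true, y_pred)
print(cat.result().numpy())  # 1.0 -- compares the argmax of each row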

For reference, here is the code:

import tensorflow as tf
import numpy as np
import sklearn.preprocessing
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier 

## --- LOAD DATA ---
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()

encoder = sklearn.preprocessing.OneHotEncoder(dtype=np.float32)
encoder.fit(Y_train.reshape((-1, 1)))

Y_train_encoded = encoder.transform(Y_train.reshape((-1, 1))).toarray()
Y_test_encoded = encoder.transform(Y_test.reshape((-1, 1))).toarray()


## --- SETUP MODEL ---
def setupModel():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(28,28)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Softmax())

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=['accuracy']  # was: metrics=[tf.keras.metrics.CategoricalAccuracy()]
                 )
    return model

# this model now uses the sklearn API instead of the Keras API
# it still trains using tensorflow under the hood
model = KerasClassifier(build_fn=setupModel,
                        epochs=2, # show that training executes
                        batch_size=256,
                       )

# train the model
model.fit(X_train, Y_train_encoded)

# evaluate the test accuracy
test_accuracy = model.score(X_test, Y_test_encoded, verbose=0)
print(f"The test accuracy is {test_accuracy * 100:.2f}%")

# test case to compare metrics
Y_pred = model.model.predict(X_test)
acc = tf.keras.metrics.CategoricalAccuracy()
acc.update_state(Y_test_encoded, Y_pred)  # CategoricalAccuracy expects one-hot targets
print(f"The (true) categorical accuracy is {acc.result().numpy()*100:.2f}%")

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Apr 5, 2020
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Apr 8, 2020
@bentyeh

bentyeh commented Apr 16, 2020

Please take a look at issue #38596, linked above. I believe this is the underlying reason why the Keras model.evaluate() method computes the wrong loss.

@amahendrakar amahendrakar self-assigned this Jun 18, 2020
@amahendrakar
Contributor

> Please take a look at issue #38596, linked above. I believe this is the underlying reason why the Keras model.evaluate() method computes the wrong loss.

@FirefoxMetzger,
Please take a look at @bentyeh's comment and let us know if it works. Thanks!

@amahendrakar amahendrakar added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jun 18, 2020
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 25, 2020
@FirefoxMetzger
Author

Hey @amahendrakar, I checked out the comment and, unfortunately, it doesn't resolve the issue.

It really does seem to be the naming of the expected metric in the line

if name in ['accuracy', 'acc']:

as I quoted in the opening post.

If you check the Keras metrics source, you find the metric with name='accuracy' defined at

def __init__(self, name='accuracy', dtype=None):

whereas the desired metric is defined at

def __init__(self, name='categorical_accuracy', dtype=None):

under the name categorical_accuracy.

That is what I believe to be the issue, unless I misunderstand how the wrapper is supposed to work.
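This is quick to confirm at runtime; the default names come straight from the constructors quoted above:

import tensorflow as tf

print(tf.keras.metrics.Accuracy().name)             # 'accuracy'
print(tf.keras.metrics.CategoricalAccuracy().name)  # 'categorical_accuracy'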

@google-ml-butler google-ml-butler bot removed the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 25, 2020
@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jun 27, 2020
@amahendrakar
Contributor

Was able to reproduce the issue with TF v2.2 and TF-nightly. Please find the attached gist. Thanks!

@amahendrakar amahendrakar added TF 2.2 Issues related to TF 2.2 and removed TF 2.0 Issues relating to TensorFlow 2.0 labels Jul 2, 2020
@amahendrakar amahendrakar removed their assignment Jul 2, 2020
@sachinprasadhs
Contributor

Was able to reproduce your issue with TF-nightly 2.6.0-dev20210524. Please find the gist here. Thanks!

@tensorflowbutler
Member

Hi there,

This is a stale issue. As you are using an older version of TensorFlow, we are checking to see if you still need help with it. Please test the issue with the latest TensorFlow (TF 2.7 and tf-nightly). If the issue still persists with the newer versions of TF, please feel free to open it in the keras-team/keras repository, providing details about the issue and standalone code to reproduce it. Thanks!

Please note that Keras development has moved to the separate keras-team/keras repository to focus entirely on Keras. Thanks!

