
'Can save best model only with val_loss available, skipping.' during tf.keras.callbacks.ModelCheckpoint #33163

Closed
raceee opened this issue Oct 9, 2019 · 33 comments
Labels: comp:keras (Keras related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.0 (Issues relating to TensorFlow 2.0), type:support (Support issues)

Comments

@raceee

raceee commented Oct 9, 2019

My model is built like:

Model:  "sequential"
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1379)              3801903   
_________________________________________________________________
dropout (Dropout)            (None, 1379)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_3 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 1380      
=================================================================
Total params: 9,512,343
Trainable params: 9,512,343
Non-trainable params: 0

Here the model is fitted with a callback.

checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=filepath, mode='max', monitor='val_acc', verbose=2, save_best_only=True)
callbacks_list = [checkpoint]
model.fit(train_dataset, epochs=1000, callbacks=callbacks_list, verbose=2, steps_per_epoch=(number_of_samples//BATCH_SIZE))

This gives me the warning:

Epoch 2/1000
W1009 01:11:32.446842 11824 callbacks.py:990] Can save best model only with val_acc available, skipping.
@jubjamie
Contributor

jubjamie commented Oct 9, 2019

Please make sure you fill in the issue template so that your issue can be troubleshot correctly. Key information includes system specs and TensorFlow/CUDA/cuDNN versions. Without this it's harder to help!

A quick look at this suggests it may be because you are not actually supplying validation data on which to compute the val_acc metric. Try setting validation_data in the fit() call and see how you get on.
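For example, a minimal sketch (x_val/y_val are illustrative names for a held-out validation split):

checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=filepath, mode='max', monitor='val_acc',
                                                verbose=2, save_best_only=True)
model.fit(train_dataset,
          validation_data=(x_val, y_val),  # supplies the data behind the val_* metrics
          epochs=1000,
          callbacks=[checkpoint],
          verbose=2,
          steps_per_epoch=number_of_samples // BATCH_SIZE)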

@oanush oanush self-assigned this Oct 10, 2019
@oanush oanush added comp:keras Keras related issues type:support Support issues labels Oct 10, 2019
@oanush

oanush commented Oct 10, 2019

@raceee,
Can you please try the solution provided by @jubjamie? Thanks!

@oanush oanush added the stat:awaiting response Status - Awaiting response from author label Oct 10, 2019
@raceee
Author

raceee commented Oct 10, 2019

@jubjamie Thanks for the friendly correction. I am new to posting issues here.

I have tried your fix using a tf.data.Dataset as my validation_data:

checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=filepath, mode='max', monitor='val_acc', verbose=2, save_best_only=True)
callbacks_list = [checkpoint]
model.fit(train_dataset, validation_data=x_test_dataset, epochs=1000, callbacks=callbacks_list, verbose=2, steps_per_epoch=(X_train_deleted_nans.shape[0]//BATCH_SIZE))

Then this error comes up:
ValueError: When passing an infinitely repeating dataset, you must specify the "validation_steps" argument.

So I set validation_steps to my batch size:

checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=filepath, mode='max', monitor='val_acc', verbose=2, save_best_only=True)
callbacks_list = [checkpoint]
model.fit(train_dataset, validation_data=x_test_dataset, validation_steps=BATCH_SIZE, epochs=1000, callbacks=callbacks_list, verbose=2, steps_per_epoch=(X_train_deleted_nans.shape[0]//BATCH_SIZE))

and then I get the original warning:
W1009 22:43:42.809758 5456 callbacks.py:990] Can save best model only with val_acc available, skipping.
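(Side note: validation_steps is the number of validation batches to draw per epoch, not the batch size; any positive value avoids the ValueError, which is why training proceeds to the original warning. A minimal sketch, with number_of_val_samples as an illustrative name:

model.fit(train_dataset,
          validation_data=x_test_dataset,
          validation_steps=number_of_val_samples // BATCH_SIZE,  # batches, not batch size
          epochs=1000,
          callbacks=callbacks_list,
          verbose=2,
          steps_per_epoch=X_train_deleted_nans.shape[0] // BATCH_SIZE))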

@Benzion18

Benzion18 commented Oct 10, 2019

Hi,

I had a similar problem and found that in my environment the metric name was "val_accuracy", not "val_acc". I fixed it as below and it started to work:

best_weights = ModelCheckpoint('best.h5', verbose=1, monitor='val_accuracy', save_best_only=True, mode='auto')

To check the metric names in your environment, do this; it will print the correct names:

hist = model.fit(...)
for key in hist.history:
    print(key)
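With validation data supplied, this will typically print keys such as loss, accuracy, val_loss and val_accuracy, though the exact names depend on the Keras version and on what was passed to compile().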

@raceee
Author

raceee commented Oct 10, 2019

@Benzion18 Thanks! That worked for me. In the documentation it refers to the monitor as 'val_acc'. I wonder if it is just you and I that have this difference.

@Benzion18

In Google Colab it works with "val_acc", but in Kaggle with "val_accuracy". Why? No idea.

@raceee
Author

raceee commented Oct 10, 2019

I was working in PyCharm on my machine and it was 'val_accuracy'.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 11, 2019
@oanush oanush assigned jvishnuvardhan and unassigned oanush Oct 11, 2019
@jvishnuvardhan
Contributor

jvishnuvardhan commented Oct 14, 2019

@Benzion18 Thanks! That worked for me. In the documentation it refers to the monitor as 'val_acc'. I wonder if it is just you and I that have this difference.

@raceee Could you provide a link to the web page where it mentions val_acc? You could create a PR to update it. Thanks!

@jvishnuvardhan jvishnuvardhan added stat:awaiting response Status - Awaiting response from author TF 2.0 Issues relating to TensorFlow 2.0 labels Oct 14, 2019
jvishnuvardhan added a commit that referenced this issue Oct 14, 2019
Closes the issue #33163
@jvishnuvardhan
Contributor

@raceee Thanks for finding this typo. I found the link here to the web page where we need to update the typo. Thanks!

@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Oct 14, 2019
@abyanwang

@Benzion18 Thank you so much!!

@MarkDaoust
Member

Hi Everyone,

Nice work, it looks like you've worked out a solution.

To clarify what's happening here: the source of the confusion is the model.compile line. The metric's name later on matches whatever you passed to compile, and if you pass a metric object it will use the object's name attribute:

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
history_acc = model.fit(train_images, train_labels, epochs=1)
history_acc.history
# {'acc': [0.82243335], 'loss': [0.5034547645966212]}

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history_accuracy = model.fit(train_images, train_labels, epochs=1)
history_accuracy.history
# {'accuracy': [0.82243335], 'loss': [0.5034547645966212]}
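And if you pass a metric object instead of a string, the key follows the object's name attribute (a sketch; the custom name 'my_acc' is illustrative):

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy(name='my_acc')])
# With validation data, the history keys become 'my_acc' and 'val_my_acc',
# so a ModelCheckpoint would need monitor='val_my_acc'.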

@maiskovich

This thread has helped me use the right metric for TF 1.x, but it seems that for TF 2.x it was fixed and 'val_acc' should be used.

https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
From TF 2 docs:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_acc',
    mode='max',
    save_best_only=True)

@finknj

finknj commented Jun 7, 2020

Was able to save an h5 file to the designated drive folder while monitoring val_accuracy using the following:

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpointPath + 'best.h5',
                                                               monitor='val_accuracy', mode='max',
                                                               save_best_only=True)

history = model.fit(x=x_train, y=y_train, validation_data=(x_val, y_val),
                    batch_size=BATCH_SIZE, validation_batch_size=BATCH_SIZE,
                    callbacks=[model_checkpoint_callback],
                    epochs=EPOCHS, shuffle=True)

saveTrainedModel(model, history)

@Terkea

Terkea commented Jul 4, 2020

Try replacing val_acc with val_accuracy; it worked in my case.

@Burton2000

Burton2000 commented Jul 31, 2020

So should the documentation here (https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) be updated, or is this fixed in 2.3? It still needs to be val_accuracy on 2.2.

@stefan-falk

Why is this not working for me (2.3.0)?

model.fit(
    train_dataset,
    batch_size=batch_size,
    steps_per_epoch=15,
    epochs=epochs,
    validation_steps=5,
    validation_data=dev_dataset,
    validation_batch_size=validation_batch_size,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=2),
        tf.keras.callbacks.TensorBoard(
            log_dir=os.path.join(model_dir, 'logs'),
            update_freq=10
        ),
        tf.keras.callbacks.ModelCheckpoint(
            os.path.join(model_dir, 'ckpt-{epoch:06d}'),
            monitor='val_loss',
            save_best_only=True,
            save_weights_only=False,
            save_freq=15
        ),
    ]
)

Getting

WARNING:tensorflow:Can save best model only with val_loss available, skipping

@MarkDaoust
Member

Remember that there's no "right" setting here; it just depends on what you name the metrics you pass to model.compile.

@Burton2000: Yeah, that page could be clearer. I'll send a quick update.

@stefan-falk: That's worrying, loss is the one metric that's populated automatically, so it would be hard to get the name wrong. Can you provide a minimal end-to-end reproduction? There's not enough context to reproduce the issue.

@stefan-falk

@MarkDaoust I think I just used it wrong. I wanted to save checkpoints during an epoch and became a victim of Keras wizardry a bit. I have a custom model which now returns {m.name: m.result() for m in self.metrics} in train_step() and dict(loss=val_loss) in test_step(), and I let it save the model after an epoch.

tensorflow-copybara pushed a commit that referenced this issue Sep 29, 2020
#33163

PiperOrigin-RevId: 334404361
Change-Id: I07c768e830cf1e401ba8bba3be8a53a3db5c83af
@chopwoodwater

chopwoodwater commented Oct 13, 2020

@stefan-falk have you solved it? I have similar issues. All of the options 'val_loss', 'val_acc', and 'val_accuracy' produce the warning 'Can save best model only with XXX available, skipping'. The TensorFlow version is 1.14; using tf.keras produces the above warnings. However, the code works correctly in pure Keras.

@MichaelSoegaard

MichaelSoegaard commented Nov 29, 2020

This problem seems to persist unless I'm missing something. I've tried with TensorFlow 1.14 and 2.1. I have also tried the "acc"/"val_acc" and the "accuracy"/"val_accuracy" variants mentioned. No matter what I try I get: "WARNING:tensorflow:Can save best model only with val_acc available, skipping."

opt = tf.keras.optimizers.SGD(lr=0.000001, decay=1e-6, momentum=0.1, nesterov=False)

# Compile model
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

checkpoint = ModelCheckpoint(f"models/{MODEL_ID}_best_epoch.hdf5", monitor='val_accuracy', verbose=1,
                             save_best_only=True, mode='max', save_freq=1)

model.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(X_val, y_val),
    callbacks=[tensorboard, checkpoint]
)

@MichaelSoegaard

I just found the solution to my problem. save_freq was set to 1, meaning it would try to save the model after each batch if val_acc improved. But as val_acc is computed only after each epoch, the callback had no metric to check at the batch level. Changing to save_freq='epoch' solved my problem. Please note that save_freq was called something else in a previous version; I don't remember the old name and couldn't find it with a quick Google search.
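A minimal sketch of the corrected callback from my snippet above (only save_freq changes):

checkpoint = ModelCheckpoint(f"models/{MODEL_ID}_best_epoch.hdf5", monitor='val_accuracy', verbose=1,
                             save_best_only=True, mode='max', save_freq='epoch')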

@LEChaney

This issue should not be closed. As noted by @MichaelSoegaard if save_freq is specified, then no matter what validation metric you specify as the monitor key you will get this error and model check-pointing will not work. Even specifying the number of batches to be a multiple of the number of batches per epoch does not solve the problem. This could be easily fixed by instead being able to specify the number of epochs after which check-pointing will occur. I cannot use epoch mode in my case as it takes far too long to save the model relative to the time it takes to process one epoch.

@stefan-falk

@marchss Yes, as in my previous comment:

I have a custom model which now returns {m.name: m.result() for m in self.metrics} in train_step() and dict(loss=val_loss) in test_step(), and I let it save the model after an epoch.

but that was with Tensorflow 2.1.0 or 2.3.0 (or something around that).

@martsim6

save_freq was set to 1, meaning it would try to save the model after each batch. Changing to save_freq='epoch' solved my problem.

This worked for me, thanks!

@Roohi-Sharma

save_freq was set to 1, meaning it would try to save the model after each batch. Changing to save_freq='epoch' solved my problem.

Thank you so much, this solved the issue for me.

@ahnaflodhi

Please note that save_freq was called something else in a previous version; I don't remember the old name.

It was previously called 'period' before the rename to save_freq.
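For anyone on an older version, a sketch of the two spellings (period counted epochs between checkpoint attempts; save_freq takes 'epoch' or an integer number of batches):

# Older Keras / tf.keras releases:
checkpoint = ModelCheckpoint('best.h5', monitor='val_loss', save_best_only=True, period=5)
# Current releases:
checkpoint = ModelCheckpoint('best.h5', monitor='val_loss', save_best_only=True, save_freq='epoch')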

@anomp

anomp commented Apr 29, 2021

The issue still exists in TensorFlow 2.4.1 (Ubuntu 20.04). I tried using the least common multiple of the batch size and Keras.model.history.params.steps as save_freq; however, the validation loss seems to be calculated after the callback runs, so still no validation metric can be used.

LEChaney wrote:
This issue should not be closed. As noted by @MichaelSoegaard if save_freq is specified, then no matter what validation metric you specify as the monitor key you will get this error and model check-pointing will not work. Even specifying the number of batches to be a multiple of the number of batches per epoch does not solve the problem. This could be easily fixed by instead being able to specify the number of epochs after which check-pointing will occur. I cannot use epoch mode in my case as it takes far too long to save the model relative to the time it takes to process one epoch.

@MarkDaoust
Member

Maybe we need a clearer warning about save_freq=N.

@anomp, callbacks are easy to modify, I bet this will do what you want:

class MyModelCheckpoint(tf.keras.callbacks.ModelCheckpoint):
  def __init__(self, epoch_per_save=1, *args, **kwargs):
    self.epochs_per_save = epoch_per_save
    super().__init__(save_freq='epoch', *args, **kwargs)

  def on_epoch_end(self, epoch, logs=None):
    # Only let the parent callback check `monitor` and save every N epochs.
    if epoch % self.epochs_per_save == 0:
      super().on_epoch_end(epoch, logs)

@MarkDaoust MarkDaoust reopened this Apr 29, 2021
@anomp

anomp commented May 6, 2021

Maybe we need a clearer warning about save_freq=N.

@anomp, callbacks are easy to modify, I bet this will do what you want:

class MyModelCheckpoint(tf.keras.callbacks.ModelCheckpoint):
  def __init__(self, epoch_per_save=1, *args, **kwargs):
    self.epochs_per_save = epoch_per_save
    super().__init__(save_freq='epoch', *args, **kwargs)

  def on_epoch_end(self, epoch, logs=None):
    # Only let the parent callback check `monitor` and save every N epochs.
    if epoch % self.epochs_per_save == 0:
      super().on_epoch_end(epoch, logs)

@MarkDaoust worked like a charm, thank you! These are my changes for anyone interested.

import logging

class MyModelCheckpoint(tf.keras.callbacks.ModelCheckpoint):
    def __init__(self, epoch_per_save=1, *args, **kwargs):
        logging.debug("MyModelCheckpoint called with epoch_per_save={}".format(epoch_per_save))
        self.epochs_per_save = epoch_per_save
        super().__init__(save_freq="epoch", *args, **kwargs)

    def on_epoch_end(self, epoch, logs=None):
        if epoch % self.epochs_per_save == 0:
            super().on_epoch_end(epoch, logs)

Callback definition:

callback_checkpoint = MyModelCheckpoint(epoch_per_save=5,
                                        filepath=str(self.path_to_save_model),
                                        monitor="val_loss",
                                        verbose=1,
                                        save_best_only=True,
                                        mode="min")

Produces output:

Epoch 1/1000                                                                                                                        
1244/1245 [============================>.] - ETA: 0s - loss: 2.3031 - accuracy: 0.2450                                                     
Epoch 00001: val_loss improved from inf to 1.79107, saving model to output_directory/reports
WARNING:tensorflow:From python/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: output_directory/reports/assets                  
1245/1245 [==============================] - 11s 8ms/step - loss: 2.3031 - accuracy: 0.2450 - val_loss: 1.7911 - val_accuracy: 0.4351
Epoch 2/1000                                                                                                                        
1245/1245 [==============================] - 9s 7ms/step - loss: 1.8651 - accuracy: 0.3727 - val_loss: 1.5282 - val_accuracy: 0.4881
Epoch 3/1000                                                                                                                        
1245/1245 [==============================] - 9s 7ms/step - loss: 1.6901 - accuracy: 0.4361 - val_loss: 1.3601 - val_accuracy: 0.5576    
Epoch 4/1000                                                                                                                        
1245/1245 [==============================] - 9s 7ms/step - loss: 1.5703 - accuracy: 0.4826 - val_loss: 1.2823 - val_accuracy: 0.5964
Epoch 5/1000                                                                          
1245/1245 [==============================] - 9s 7ms/step - loss: 1.4985 - accuracy: 0.5120 - val_loss: 1.1593 - val_accuracy: 0.6311
Epoch 6/1000
1240/1245 [============================>.] - ETA: 0s - loss: 1.4409 - accuracy: 0.5312
Epoch 00006: val_loss improved from 1.79107 to 1.08370, saving model to output_directory/reports
INFO:tensorflow:Assets written to: output_directory/reports/assets
1245/1245 [==============================] - 10s 8ms/step - loss: 1.4405 - accuracy: 0.5313 - val_loss: 1.0837 - val_accuracy: 0.6506
Epoch 7/1000
1245/1245 [==============================] - 9s 7ms/step - loss: 1.4030 - accuracy: 0.5499 - val_loss: 1.0493 - val_accuracy: 0.6690
Epoch 8/1000
1245/1245 [==============================] - 9s 7ms/step - loss: 1.3724 - accuracy: 0.5596 - val_loss: 1.0612 - val_accuracy: 0.6677
Epoch 9/1000
1245/1245 [==============================] - 9s 7ms/step - loss: 1.3402 - accuracy: 0.5700 - val_loss: 0.9856 - val_accuracy: 0.6969
Epoch 10/1000
1245/1245 [==============================] - 9s 7ms/step - loss: 1.3217 - accuracy: 0.5779 - val_loss: 0.9939 - val_accuracy: 0.6856
Epoch 11/1000
1245/1245 [==============================] - ETA: 0s - loss: 1.3052 - accuracy: 0.5834
Epoch 00011: val_loss improved from 1.08370 to 0.94162, saving model to output_directory/reports
INFO:tensorflow:Assets written to: output_directory/reports/assets

@jvishnuvardhan
Contributor

I think this was resolved. The recent TF page on ModelCheckpoint (https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) has most of the updates discussed in this thread.

If you still have issues with the ModelCheckpoint callback, please feel free to open a new issue.

I am closing this issue as resolved. Thanks!

@0xBLCKLPTN

filepath = "weights-{epoch:02d}-{accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='accuracy', verbose=2, save_best_only=True, mode="max")

Works fine.

@gulf1324

@MichaelSoegaard THIS SAVED MY 3 HOURS OF MISERY, TYSM!
