'Can save best model only with val_loss available, skipping.' during tf.keras.callbacks.ModelCheckpoint #33163
Comments
Please make sure you fill in the issue template so that your issue can be troubleshot correctly. Key information includes system specs and TensorFlow/CUDA/cuDNN versions. Without this it's harder to help! A quick look suggests this may be because you are not actually supplying validation data on which to compute the val_acc metric. Try setting validation_data in the fit() call and see how you get on.
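A minimal sketch of that suggestion, with toy data and model names invented purely for illustration:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, purely for illustration.
x_train, y_train = np.random.rand(100, 4), np.random.randint(2, size=100)
x_val, y_val = np.random.rand(20, 4), np.random.randint(2, size=20)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best.h5', monitor='val_accuracy', save_best_only=True)

# validation_data is what produces the val_* metrics the callback monitors.
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=3,
          callbacks=[checkpoint])
```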
@jubjamie Thanks for the friendly correction. I am new to posting issues here. I have tried your fix using a tf.data.Dataset as my validation_data:
Then this error comes up, so I set validation_steps to my batch size:
and then I get the original warning.
Hi, I had a similar problem and found that in my environment the metric name was "val_accuracy", not "val_acc". I fixed it as below and it started to work:

```python
best_weights = ModelCheckpoint('best.h5', verbose=1, monitor='val_accuracy',
                               save_best_only=True, mode='auto')
```

To check what the metric names are in your environment, do this; it will print the correct metric names:

```python
hist = model.fit(...)
```
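Presumably the elided check was something like this (my completion, assuming a compiled `model` and arrays `x_train`/`y_train`, `x_val`/`y_val` as above):

```python
hist = model.fit(x_train, y_train,
                 validation_data=(x_val, y_val),
                 epochs=1)
# The keys printed here (e.g. 'val_acc' vs. 'val_accuracy') are exactly
# the names that ModelCheckpoint's monitor argument must match.
print(hist.history.keys())
```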
@Benzion18 Thanks! That worked for me. The documentation refers to the monitor as 'val_acc'. I wonder if it is just you and me who have this difference.
In Google Colab it works with "val_acc", but in Kaggle with "val_accuracy". Why? No idea.
I was working in PyCharm on my machine and it was 'val_accuracy'.
@raceee Could you provide a link to the web page where it mentions 'val_acc'?
@Benzion18 Thank you so much!!
Hi everyone, nice work, it looks like you've worked out a solution. To clarify what's happening here: the source of the confusion is the metric naming. The key that ModelCheckpoint monitors comes from whatever string you pass to `metrics=` in `compile()`, so `metrics=['acc']` yields `val_acc` while `metrics=['accuracy']` yields `val_accuracy`.
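A sketch of that naming behaviour, assuming recent TF 2.x tf.keras (the model here is a placeholder):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])

# The monitored key mirrors the string passed to compile():
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['acc'])       # -> history keys 'acc', 'val_acc'
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])  # -> history keys 'accuracy', 'val_accuracy'
```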
Without this fix, the model will not save, and the code will fail after the first pass of INCV when it tries to load a model that was not saved due to callback failure. This is a standard fix (see: tensorflow/tensorflow#33163).
This thread helped me use the right metric for TF 1.x, but it seems that for TF 2.x it was fixed and 'val_acc' should be used: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
Was able to save an h5 file to the designated drive folder while monitoring val_accuracy using the following:

```python
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpointPath + 'best.h5',
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

history = model.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_val, y_val),
    callbacks=[model_checkpoint_callback])

saveTrainedModel(model, history)
```
Try replacing `val_acc` with `val_accuracy`.
So should the documentation here (https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) be updated, or is this fixed in 2.3? Does it still need to be 'val_accuracy'?
Why is this not working for me (2.3.0)?

```python
model.fit(
    train_dataset,
    batch_size=batch_size,
    steps_per_epoch=15,
    epochs=epochs,
    validation_steps=5,
    validation_data=dev_dataset,
    validation_batch_size=validation_batch_size,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=2),
        tf.keras.callbacks.TensorBoard(
            log_dir=os.path.join(model_dir, 'logs'),
            update_freq=10
        ),
        tf.keras.callbacks.ModelCheckpoint(
            os.path.join(model_dir, 'ckpt-{epoch:06d}'),
            monitor='val_loss',
            save_best_only=True,
            save_weights_only=False,
            save_freq=15
        ),
    ]
)
```

Getting the "Can save best model only with val_loss available, skipping." warning.
Remember that there's no "right" setting here; it just depends on what you name the metrics you pass to `compile()`. @Burton2000: Yeah, that page could be clearer. I'll send a quick update. @stefan-falk: That's worrying.
@MarkDaoust I think I just used it wrong. I wanted to save checkpoints during an epoch and became a bit of a victim of Keras wizardry. I have a custom model which now returns
@stefan-falk have you solved it? I have similar issues. All of the options 'val_loss', 'val_acc', and 'val_accuracy' produce the warning 'Can save best model only with XXX available, skipping'. The TensorFlow version is 1.14; using tf.keras produces the above warnings. However, the code works correctly in pure Keras.
This problem seems to persist, unless I'm missing something. I've tried with TensorFlow 1.14 AND 2.1. I have also tried the "acc"/"val_acc" and the "accuracy"/"val_accuracy" variants mentioned. No matter what I try I get: "WARNING:tensorflow:Can save best model only with val_acc available, skipping."

```python
opt = tf.keras.optimizers.SGD(lr=0.000001, decay=1e-6, momentum=0.1, nesterov=False)
```
I just found the solution to my problem. "save_freq" was set to 1, meaning it would try to save the model after every batch if val_acc improved. But as val_acc is computed only after each epoch, the callback had no such metric to check at batch boundaries. Changing to save_freq='epoch' solved my problem. Please note that 'save_freq' was called something else in a previous version; I don't remember it now, and I couldn't find the info with a quick Google search.
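For reference, a sketch of the corrected callback (the filename and monitor name are illustrative):

```python
import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best.h5',
    monitor='val_accuracy',
    save_best_only=True,
    # 'epoch' (the default) evaluates the monitor once per epoch, after
    # validation has run and the val_* metrics exist. An integer save_freq
    # triggers after that many batches, when val_* is not in logs yet.
    save_freq='epoch')
```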
This issue should not be closed. As noted by @MichaelSoegaard, if save_freq is specified then, no matter what validation metric you pass as the monitor key, you get this error and model checkpointing does not work. Even specifying a number of batches that is a multiple of the number of batches per epoch does not solve the problem. This could easily be fixed by allowing the number of epochs between checkpoints to be specified instead. I cannot use epoch mode in my case, as saving the model takes far too long relative to the time it takes to process one epoch.
@marchss Yes, as in my previous comment, but that was with TensorFlow 2.1.0 or 2.3.0 (or something around that).
This worked for me, thanks!
Thank you so much, this solved the issue for me.
It was previously called 'period' before the rename to save_freq. |
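In other words (illustrative values, in versions that still accept the old argument):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Legacy API: 'period' counted epochs between checkpoints.
ModelCheckpoint('ckpt.h5', period=5)

# Current API: save_freq is 'epoch' or an integer number of batches.
ModelCheckpoint('ckpt.h5', save_freq='epoch')
```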
The issue still exists in TensorFlow 2.4.1 (Ubuntu 20.04). I tried to use the least common multiple of the batch size and
Maybe we need a clearer warning about this. @anomp, callbacks are easy to modify; I bet this will do what you want:
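One possible shape for such a callback (a sketch of mine with hypothetical names, not the snippet from the thread): save on a batch interval without consulting a val_* monitor, since those metrics only appear at epoch end.

```python
import tensorflow as tf

class IntervalCheckpoint(tf.keras.callbacks.Callback):
    """Hypothetical helper: saves weights every `freq` training batches,
    unconditionally, because val_* metrics only exist at epoch end."""

    def __init__(self, path, freq):
        super().__init__()
        self.path = path   # e.g. 'ckpt-{step:06d}.h5'
        self.freq = freq
        self.step = 0

    def on_train_batch_end(self, batch, logs=None):
        self.step += 1
        if self.step % self.freq == 0:
            self.model.save_weights(self.path.format(step=self.step))

# Usage: model.fit(..., callbacks=[IntervalCheckpoint('ckpt-{step:06d}.h5', freq=100)])
```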
@MarkDaoust worked like a charm, thank you! These are my changes for anyone interested.
Callback definition:
Produces output:
I think this was resolved. The recent TF page on ModelCheckpoint (https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) has most of the updates discussed in this thread. If you still have issues with the latest TF version, please open a new issue. I am closing this issue as resolved. Thanks!
filepath = "weights-{epoch:02d}-{accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, optimizer="adam" ,monitor='accuracy', verbose=2, save_best_only=True, mode="max") Works fine. |
THIS SAVED MY 3 HOURS OF MISERY TYSM! |
My model is built like:
Here the model is fitted with a callback.
This gives me the warning: "Can save best model only with val_loss available, skipping."
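A minimal sketch of the kind of setup that reproduces this warning (toy names of my own; the key is that no validation data is supplied, so val_loss never appears in the callback's logs):

```python
import numpy as np
import tensorflow as tf

x, y = np.random.rand(64, 4), np.random.randint(2, size=64)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

ckpt = tf.keras.callbacks.ModelCheckpoint(
    'best.h5', monitor='val_loss', save_best_only=True)

# No validation_data -> no val_loss in logs, so each epoch prints:
# "WARNING:tensorflow:Can save best model only with val_loss available, skipping."
model.fit(x, y, epochs=2, callbacks=[ckpt])
```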