
WARNING:tensorflow:Can save best model only with val_loss available, skipping. #44107

Closed · alexliyihao opened this issue Oct 16, 2020 · 14 comments

Labels: comp:keras (Keras related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.3 (Issues related to TF 2.3), type:bug (Bug)

Comments

@alexliyihao

System information
Google Colab Pro notebook (current), TensorFlow version 2.3.0

Describe the current behavior
When running .fit() with a callbacks.ModelCheckpoint monitoring either val_loss or val_sparse_categorical_accuracy, the warning "WARNING:tensorflow:Can save best model only with val_loss available, skipping." pops up, even though both keys are present in the hist.history returned from fit().

Describe the expected behavior
The checkpoint should be saved without the warning, since the monitored quantity is present in hist.history.

Standalone code to reproduce the issue

model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.05),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

hist = model.fit(X_train,
                 y_train,
                 batch_size=BATCH_SIZE,
                 epochs=EPOCH,
                 validation_data=(X_val, y_val),
                 shuffle=True,
                 callbacks=[ModelCheckpoint("model.hdf5",
                                            monitor='val_loss',
                                            save_best_only=True,
                                            save_weights_only=False,
                                            save_freq=1,
                                            verbose=0)],
                 verbose=0)

where the data arrays are np.ndarray, and

for key in hist.history:
  print(key)

prints the following:

loss
sparse_categorical_accuracy
val_loss
val_sparse_categorical_accuracy
lr

@alexliyihao alexliyihao added the type:bug Bug label Oct 16, 2020
@Saduf2019 Saduf2019 added the TF 2.3 Issues related to TF 2.3 label Oct 19, 2020
@ThanasisMattas

As stated here, the monitor argument has to match one of the metric names passed to the metrics argument of model.compile(). So, you would need to add tf.keras.metrics.SparseCategoricalCrossentropy to your metrics list (though this prints the loss twice at each epoch) and set monitor to "sparse_categorical_crossentropy". But this is not a solution, since it monitors the training-set loss. I tried prepending val_ ("val_sparse_categorical_crossentropy"), which is a valid monitor that is printed at each epoch, but got the same error.
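To illustrate that naming rule, here is a minimal sketch (the compiled model is assumed to already exist, as in the snippet above):

import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              # the metric's name, "sparse_categorical_accuracy", is the log key;
              # its validation copy is logged as "val_sparse_categorical_accuracy"
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

# `monitor` must match one of those log keys exactly
checkpoint = tf.keras.callbacks.ModelCheckpoint("model.hdf5",
                                                monitor="val_sparse_categorical_accuracy",
                                                save_best_only=True)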

@Saduf2019
Contributor

@alexliyihao
Please share the complete code; when I ran the code shared, I faced an error. If possible, share a Colab gist reproducing the issue.

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Oct 19, 2020
@alexliyihao
Author

Ah sorry, the complete code is part of a larger wrapped codebase and it took me a while to extract it. A complete reproducible example follows.

You can run it at https://colab.research.google.com/drive/1glZ4Mm5mo-Ev3YI9IlucfNH43AVZLFm0?usp=sharing — I ran it on a normal Colab notebook and the warning popped up there as well. I'm not sure about running locally, because my old MacBook cannot handle much training.

# create some quick dummy data
import numpy as np
X_train = np.random.randn(5,200,200,3)
y_train = np.array([1,2,3,4,5])
X_val = np.random.randn(5,200,200,3)
y_val = np.array([1,2,3,4,5])
#-------------------------------------------------------
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Flatten, Dense, BatchNormalization
from tensorflow.keras.callbacks import ModelCheckpoint

# init a normal model efn+dense+dense
efn = tf.keras.applications.EfficientNetB2(weights='imagenet', include_top = False)
input = Input(shape= (200,200,3))
x = efn(input)
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
x = BatchNormalization()(x)
output = Dense(30, activation='softmax')(x) 
model = Model(input,output)

# compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(lr = 0.05),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
#-------------------------------------------------------
MONITOR = "val_loss" #try MONITOR = "val_sparse_categorical_accuracy" as well

hist = model.fit(X_train,
                 y_train,
                 batch_size = 64,
                 epochs=5,
                 validation_data=(X_val, y_val),
                 shuffle = True,
                 callbacks = [ModelCheckpoint("model.hdf5",
                              monitor = MONITOR,
                              save_best_only = True,
                              save_weights_only = False,
                              save_freq= 1,
                              verbose = 0)],
                 verbose = 0)
print(f"{MONITOR} in keys: {MONITOR in hist.history.keys()}", )

@Saduf2019 Saduf2019 added comp:keras Keras related issues and removed stat:awaiting response Status - Awaiting response from author labels Oct 20, 2020
@Saduf2019
Contributor

@alexliyihao
Could you please try with the actual data (and verbose=1) instead of the np.random dummy data, and let us know if the issue persists?

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Oct 21, 2020
@alexliyihao
Author

@Saduf2019

The data is project-specific, so I'm afraid I cannot provide the actual dataset, but I can confirm that the issue still exists.

verbose=1 creates this multi-line format in a Jupyter notebook, so I use tqdm.keras.TqdmCallback most of the time. The output below (after the dataset note) is for 2 epochs without tqdm.keras.TqdmCallback, using verbose=1 for both fit() and ModelCheckpoint.

I can also add that I tried a tf.data.Dataset in <BatchDataset shapes: ((None, 200, 200, 3), (None,)), types: (tf.float64, tf.int64)> format, and the error still exists, so I don't think it is related to the input data.
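For reference, a sketch of how such a dataset could be built from arrays like those in the reproduction above (shapes and dtypes chosen to match the printed BatchDataset; the names are illustrative):

import numpy as np
import tensorflow as tf

X = np.random.randn(128, 200, 200, 3)   # float64 images
y = np.random.randint(0, 30, size=128)  # int64 labels
train_ds = (tf.data.Dataset
            .from_tensor_slices((X, y))
            .shuffle(buffer_size=len(X))
            .batch(64))
# -> <BatchDataset shapes: ((None, 200, 200, 3), (None,)), types: (tf.float64, tf.int64)>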

Epoch 1/100
WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 1/28 [>.............................] - ETA: 0s - loss: 3.9611 - sparse_categorical_accuracy: 0.0156WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 2/28 [=>............................] - ETA: 4s - loss: 4.0169 - sparse_categorical_accuracy: 0.0547WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0958s vs `on_train_batch_end` time: 0.2455s). Check your callbacks.
WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 3/28 [==>...........................] - ETA: 5s - loss: 3.9681 - sparse_categorical_accuracy: 0.0573WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 4/28 [===>..........................] - ETA: 6s - loss: 3.9270 - sparse_categorical_accuracy: 0.0430WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 5/28 [====>.........................] - ETA: 6s - loss: 3.8824 - sparse_categorical_accuracy: 0.0437WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 6/28 [=====>........................] - ETA: 6s - loss: 3.8315 - sparse_categorical_accuracy: 0.0521WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 7/28 [======>.......................] - ETA: 5s - loss: 3.8645 - sparse_categorical_accuracy: 0.0536WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 8/28 [=======>......................] - ETA: 5s - loss: 3.8801 - sparse_categorical_accuracy: 0.0488WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 9/28 [========>.....................] - ETA: 5s - loss: 3.8523 - sparse_categorical_accuracy: 0.0486WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
10/28 [=========>....................] - ETA: 5s - loss: 3.8176 - sparse_categorical_accuracy: 0.0453WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
11/28 [==========>...................] - ETA: 5s - loss: 3.8034 - sparse_categorical_accuracy: 0.0455WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
12/28 [===========>..................] - ETA: 4s - loss: 3.7804 - sparse_categorical_accuracy: 0.0443WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
13/28 [============>.................] - ETA: 4s - loss: 3.7688 - sparse_categorical_accuracy: 0.0421WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
14/28 [==============>...............] - ETA: 4s - loss: 3.7455 - sparse_categorical_accuracy: 0.0435WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
15/28 [===============>..............] - ETA: 4s - loss: 3.7266 - sparse_categorical_accuracy: 0.0458WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
16/28 [================>.............] - ETA: 3s - loss: 3.7036 - sparse_categorical_accuracy: 0.0488WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
17/28 [=================>............] - ETA: 3s - loss: 3.7000 - sparse_categorical_accuracy: 0.0478WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
18/28 [==================>...........] - ETA: 3s - loss: 3.7036 - sparse_categorical_accuracy: 0.0469WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
19/28 [===================>..........] - ETA: 2s - loss: 3.6926 - sparse_categorical_accuracy: 0.0461WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
20/28 [====================>.........] - ETA: 2s - loss: 3.6940 - sparse_categorical_accuracy: 0.0469WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
21/28 [=====================>........] - ETA: 2s - loss: 3.6725 - sparse_categorical_accuracy: 0.0469WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
22/28 [======================>.......] - ETA: 1s - loss: 3.6490 - sparse_categorical_accuracy: 0.0490WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
23/28 [=======================>......] - ETA: 1s - loss: 3.6498 - sparse_categorical_accuracy: 0.0496WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
24/28 [========================>.....] - ETA: 1s - loss: 3.6471 - sparse_categorical_accuracy: 0.0482WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
25/28 [=========================>....] - ETA: 0s - loss: 3.6624 - sparse_categorical_accuracy: 0.0475WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
26/28 [==========================>...] - ETA: 0s - loss: 3.6560 - sparse_categorical_accuracy: 0.0475WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
27/28 [===========================>..] - ETA: 0s - loss: 3.6472 - sparse_categorical_accuracy: 0.0475WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
28/28 [==============================] - 12s 419ms/step - loss: 3.6460 - sparse_categorical_accuracy: 0.0470 - val_loss: 382778278225658249216.0000 - val_sparse_categorical_accuracy: 0.0317
Epoch 2/100
WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 1/28 [>.............................] - ETA: 0s - loss: 3.3321 - sparse_categorical_accuracy: 0.0156WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 2/28 [=>............................] - ETA: 4s - loss: 3.3042 - sparse_categorical_accuracy: 0.0469WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 3/28 [==>...........................] - ETA: 5s - loss: 3.2632 - sparse_categorical_accuracy: 0.0521WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 4/28 [===>..........................] - ETA: 5s - loss: 3.2630 - sparse_categorical_accuracy: 0.0586WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 5/28 [====>.........................] - ETA: 5s - loss: 3.2139 - sparse_categorical_accuracy: 0.0594WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 6/28 [=====>........................] - ETA: 6s - loss: 3.2333 - sparse_categorical_accuracy: 0.0573WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 7/28 [======>.......................] - ETA: 5s - loss: 3.1948 - sparse_categorical_accuracy: 0.0647WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 8/28 [=======>......................] - ETA: 5s - loss: 3.2231 - sparse_categorical_accuracy: 0.0586WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
 9/28 [========>.....................] - ETA: 5s - loss: 3.1942 - sparse_categorical_accuracy: 0.0677WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
10/28 [=========>....................] - ETA: 5s - loss: 3.1784 - sparse_categorical_accuracy: 0.0688WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
11/28 [==========>...................] - ETA: 5s - loss: 3.1815 - sparse_categorical_accuracy: 0.0696WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
12/28 [===========>..................] - ETA: 4s - loss: 3.1655 - sparse_categorical_accuracy: 0.0703WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
13/28 [============>.................] - ETA: 4s - loss: 3.1538 - sparse_categorical_accuracy: 0.0709WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
14/28 [==============>...............] - ETA: 4s - loss: 3.1474 - sparse_categorical_accuracy: 0.0703WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
15/28 [===============>..............] - ETA: 4s - loss: 3.1502 - sparse_categorical_accuracy: 0.0698WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
16/28 [================>.............] - ETA: 3s - loss: 3.1374 - sparse_categorical_accuracy: 0.0732WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
17/28 [=================>............] - ETA: 3s - loss: 3.1316 - sparse_categorical_accuracy: 0.0744WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
18/28 [==================>...........] - ETA: 3s - loss: 3.1259 - sparse_categorical_accuracy: 0.0764WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
19/28 [===================>..........] - ETA: 2s - loss: 3.1073 - sparse_categorical_accuracy: 0.0806WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
20/28 [====================>.........] - ETA: 2s - loss: 3.0857 - sparse_categorical_accuracy: 0.0859WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
21/28 [=====================>........] - ETA: 2s - loss: 3.0644 - sparse_categorical_accuracy: 0.0893WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
22/28 [======================>.......] - ETA: 1s - loss: 3.0625 - sparse_categorical_accuracy: 0.0888WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
23/28 [=======================>......] - ETA: 1s - loss: 3.0482 - sparse_categorical_accuracy: 0.0890WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
24/28 [========================>.....] - ETA: 1s - loss: 3.0320 - sparse_categorical_accuracy: 0.0911WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
25/28 [=========================>....] - ETA: 0s - loss: 3.0124 - sparse_categorical_accuracy: 0.0950WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
26/28 [==========================>...] - ETA: 0s - loss: 2.9966 - sparse_categorical_accuracy: 0.0974WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
27/28 [===========================>..] - ETA: 0s - loss: 2.9805 - sparse_categorical_accuracy: 0.1001WARNING:tensorflow:Can save best model only with val_sparse_categorical_crossentropy available, skipping.
28/28 [==============================] - 9s 332ms/step - loss: 2.9732 - sparse_categorical_accuracy: 0.1002 - val_loss: 9081760776192.0000 - val_sparse_categorical_accuracy: 0.0362

@Saduf2019 Saduf2019 removed the stat:awaiting response Status - Awaiting response from author label Oct 21, 2020
@Saduf2019 Saduf2019 assigned ymodak and unassigned Saduf2019 Oct 21, 2020
@ymodak ymodak added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Oct 21, 2020
@ucesfpa

ucesfpa commented Dec 17, 2020

Hello, I am having exactly the same error.
Environment:

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:05:00.0 Off |                  N/A |
| 95%   84C    P2   242W / 260W |  10776MiB / 11016MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
| 95%   84C    P2   240W / 260W |  10778MiB / 11019MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 28%   45C    P0    55W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   39C    P5    16W / 275W |      0MiB / 11178MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Running in nvidia-docker:

tensorflow/tensorflow:latest-gpu "/bin/bash" 6 days ago Up 6 days 0.0.0.0:5001->6006/tcp, 0.0.0.0:5000->8888/tcp tf

Installed with the following command:

docker run --gpus all -d --name tf -it -p 5000:8888 -p 5001:6006 -v /home:/home tensorflow/tensorflow:latest-gpu

Python 3.6.9, TensorFlow 2.3.1

THE CODE:

import os
import tensorflow as tf
from itertools import cycle
from tensorflow import keras
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
import utils.tf_losses as tfl
import utils.tf_utils as tf_utils
import utils.data as data
import argparse
import numpy as np
import pandas as pd
import io
from tensorflow.keras.callbacks import TensorBoard, ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from tensorflow.keras.metrics import Mean, MeanIoU, PrecisionAtRecall, Precision, Recall, Accuracy, AUC
import json
from glob import glob
from datetime import datetime

random_seed = 42 # Fixing the seed for PRNGs, to help reproducibility
np.random.seed(random_seed)
tf.random.set_seed(random_seed)

# # def gaussian_blur(img, kernel_size=11, sigma=5):
# # 	def gauss_kernel(channels, kernel_size, sigma):
# # 		ax = tf.range(-kernel_size // 2 + 1.0, kernel_size // 2 + 1.0)
# # 		xx, yy = tf.meshgrid(ax, ax)
# # 		kernel = tf.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
# # 		kernel = kernel / tf.reduce_sum(kernel)
# # 		kernel = tf.tile(kernel[..., tf.newaxis], [1, 1, channels])
# # 		return kernel
# #
# # 	gaussian_kernel = gauss_kernel(tf.shape(img)[-1], kernel_size, sigma)
# # 	gaussian_kernel = gaussian_kernel[..., tf.newaxis]
# #
# # 	return tf.nn.depthwise_conv2d(img, gaussian_kernel, [1, 1, 1, 1], padding='SAME', data_format='NHWC')
#
def _morph_dilation(input, k_size=3):
	if len(input.shape) != 4:
		raise ValueError("Input must be a 4-D tensor in NHWC format")
	input = tf.cast(input,tf.float32)
	kernel = tf.zeros((k_size, k_size, 1))
	dilated_label = tf.nn.dilation2d(
		input,
		filters=kernel,
		strides=(1, 1, 1, 1),
		dilations=(1, 1, 1, 1),
		padding="SAME",
		data_format='NHWC'
	)
	return dilated_label

# # def _smooth_labels(onehot_tensor,label_smoothing):
# # 	num_classes = tf.cast(onehot_tensor.shape[-1],tf.float32)
# # 	return onehot_tensor * (1.0 - label_smoothing) + (label_smoothing / num_classes)

class CustomModel(Model):

	def train_step(self, data):
		# Unpack the data. Its structure depends on your model and
		# on what you pass to `fit()`.
		x, y = data
		with tf.GradientTape() as tape:
			y_pred = self(x, training=True)  # Forward pass
			# Compute the loss value
			# (the loss function is configured in `compile()`)
			loss = self.compiled_loss(y, y_pred)

		# Compute gradients
		trainable_vars = self.trainable_variables
		gradients = tape.gradient(loss, trainable_vars)
		# Update weights
		self.optimizer.apply_gradients(zip(gradients, trainable_vars))
		# Update metrics (includes the metric that tracks the loss)
		self.compiled_metrics.update_state(tf.squeeze(y,axis=-1), tf.argmax(tf.nn.softmax(y_pred,axis=-1),axis=-1))
		# Return a dict mapping metric names to current value
		return {m.name: m.result() for m in self.metrics}

	def test_step(self, data):
		# Unpack the data
		x, y = data
		if args['morph_dilation']==True:
			y = _morph_dilation(y,3)
		# Compute predictions
		y_pred = self(x, training=False)
		# Updates the metrics tracking the loss
		self.compiled_loss(y, y_pred)
		# Update the metrics.
		self.compiled_metrics.update_state(tf.squeeze(y,axis=-1), tf.argmax(tf.nn.softmax(y_pred,axis=-1),axis=-1))
		with file_writer.as_default():
			tf.summary.image("input", x, step=self.optimizer.iterations, max_outputs=params['batch_size'])
			tf.summary.image("ground_truth", tf.one_hot(tf.squeeze(y, axis=-1), depth=2, axis=-1), step=self.optimizer.iterations, max_outputs=params['batch_size'])
			tf.summary.image("prediction_output", y_pred, step=self.optimizer.iterations, max_outputs=params['batch_size'])
		# Return a dict mapping metric names to current value.
		# Note that it will include the loss (tracked in self.metrics).
		return {m.name: m.result() for m in self.metrics}

def train():
	train_dataset, val_dataset, train_steps_per_epoch, val_steps_per_epoch = data.dataset(args["data_dir"], batch_size=params['batch_size'], image_size = params['image_size'])
	inputs = Input(shape=(*params['image_size'], params['num_channels']), name='input')

	outputs = tf_utils.get_model(args, params, inputs)

	model = CustomModel(inputs, outputs)
#	tf_utils.print_summary(model=model,logdir=args['log_dir'],params=params,args=args)
	with file_writer.as_default():
		tf.summary.text('config', tf_utils.tb_config(logdir=args['log_dir'],params=params,args=args), step=0)

	callbacks = [
		# Callback to reduce the learning rate once the plateau has been reached:
		tf.keras.callbacks.ReduceLROnPlateau(
			monitor='val_loss',
			factor=0.1,
			patience=8,
			mode='auto',
			min_delta=0.001,
			cooldown=0,
			min_lr=1e-8
		),
		# Callback to stop the training once no more improvements are recorded:
		tf.keras.callbacks.EarlyStopping(
			monitor='val_loss',
			min_delta=0.001,
			patience=16,
			mode='auto',
			restore_best_weights=True
		),
		# Callback to log the graph, losses and metrics into TensorBoard:
		tf.keras.callbacks.TensorBoard(
			log_dir=args['log_dir'],
			histogram_freq=0,
			update_freq='epoch',
			write_graph=True,
			write_images=False
		),
		# Callback to save the model  specifying the epoch and val-loss in the filename:
		tf.keras.callbacks.ModelCheckpoint(
			filepath=save_path,
			save_freq=5,
			monitor='val_loss',
			verbose=0,
			save_best_only=True,
			save_weights_only=False
		)
	]

	## Train Metrics ##
	train_metrics = [tf.keras.metrics.Mean() for _ in range(5)]
	train_metrics[0] = tf.keras.metrics.MeanIoU(num_classes=2, name='mIoU')
	train_metrics[1] = tf.keras.metrics.Precision(name='Precision')
	train_metrics[2] = tf.keras.metrics.Recall(name='Recall')
	train_metrics[3] = tf.keras.metrics.Accuracy(name='Accuracy')
	train_metrics[4] = tf.keras.metrics.AUC(curve='PR', name='AUC_PR')
	optimizer = tf.keras.optimizers.Adam(learning_rate=args['learning_rate'], beta_1=0.9, beta_2=0.999, epsilon=1e-07,
										 amsgrad=False, name='Adam')
	model.compile(
		optimizer=optimizer,
		loss = tfl.DiceLoss(from_logits=True),
		metrics=train_metrics
	)
	hist=model.fit(
		x=train_dataset,
		batch_size=params['batch_size'],
		epochs=params['Epochs'],
		verbose=1,
		callbacks=callbacks,
		validation_data=val_dataset,
		shuffle=True,
		class_weight=None,
		sample_weight=None,
		initial_epoch=0,
		steps_per_epoch=train_steps_per_epoch,
		validation_steps=val_steps_per_epoch,
		validation_freq=1,
		max_queue_size=10,
		workers=1,
		use_multiprocessing=False,
	)
	for key in hist.history:
		print(key)

if __name__ == '__main__':
	# -- # -- # -- # -- # -- # -- # -- # -- # -- # -- # -- # -- # -- # --
	# -- # -- Retrieve the config files and parse the arguments # -- # --
	# -- # -- # -- # -- # -- # -- # -- # -- # -- # -- # -- # -- # -- # --
	with open('config_files/model_config.json','r') as f:
		configs = json.load(f)
	args = configs['args']
	params= configs['params']

	# -- # -- Add extra parameters
	params['final_activation'] = None
	params['pooling'] = True
	params['skipping'] = False
	args['morph_dilation'] = False

	# -- # -- Change the log_dir name coherently with the pooling parameter
	if params['pooling']==False:
	#	params['dilation_rate']=2 #uncomment to use dilated convolution
	#	params['strides']=1 #uncomment to use dilated convolution
		args['log_dir'] = os.path.join(args['log_dir'],'Un_'+args['model']+'_'+str(int(datetime.now().strftime("%Y%m%d%H%M%S"))))
	else:
		args['log_dir'] = os.path.join(args['log_dir'],args['model']+'_'+str(int(datetime.now().strftime("%Y%m%d%H%M%S"))))

	# -- # -- If log_dir does not exist, create it
	if not os.path.isdir(args['log_dir']):
		os.makedirs(args['log_dir'])

	# -- # -- If save_path does not exist, create it (it is the checkpoint saving dir)
	save_path=os.path.join(args['log_dir'],'checkpoint')
	if not os.path.isdir(save_path):
		os.makedirs(save_path)

	#args['dropout']= None
	file_writer = tf.summary.create_file_writer(args['log_dir'] + '/images')
	train()
	configs_savepath=args['log_dir']
	params = json.dumps(params, indent=4)
	with open(os.path.join(configs_savepath, 'model_params.json'), 'w') as params_file:
		params_file.write(params)
	args = json.dumps(args, indent=4)
	with open(os.path.join(configs_savepath, 'model_args.json'), 'w') as args_file:
		args_file.write(args)

Keys of hist.history

loss
mIoU
Precision
Recall
Accuracy
AUC_PR
val_loss
val_mIoU
val_Precision
val_Recall
val_Accuracy
val_AUC_PR
lr

When running I receive the following warning:

Epoch 20/300
15795/15795 [==============================] - 3008s 190ms/step - loss: 0.1272 - mIoU: 0.7991 - Precision: 0.7277 - Recall: 0.7746 - Accuracy: 0.9977 - AUC_PR: 0.5681 - val_loss: 0.1471 - val_mIoU: 0.7835 - val_Precision: 0.7082 - val_Recall: 0.7437 - val_Accuracy: 0.9977 - val_AUC_PR: 0.5313
Epoch 21/300 [=> . . . . . . . . . . . . . . . . . . . . . . . . . . . . .] - ETA: 41:43 - loss: 0.1243 - mIoU: 0.8025 - Precision: 0.7320 - Recall: 0.7808 - Accuracy: 0.9978 - AUC_PR: 0.5755WARNING:tensorflow:Can save best model only with val_loss available, skipping.

I tried to change

tf.keras.callbacks.ModelCheckpoint(
			filepath=save_path,
			save_freq=5,
			monitor='val_AUC_PR', #'val_loss'
			verbose=0,
			save_best_only=True,
			save_weights_only=False
		)

But I receive the same warning:
WARNING:tensorflow:Can save best model only with val_AUC_PR available, skipping.

@ucesfpa

ucesfpa commented Dec 17, 2020

I just found the solution to my problem. save_freq was set to 5, meaning the callback tried to save every 5 batches whenever the monitored quantity improved. But since validation metrics are computed only at the end of each epoch, no val_ value is available at batch level, hence the warning. Changing to save_freq='epoch' solved my problem. Please note that 'save_freq' was called something else in a previous version; I don't remember it now and couldn't find it with a quick Google search.
Refer to https://github.com/tensorflow/tensorflow/issues/33163 and look for the comment by MichaelSoegaard
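In code, the fix is just the save_freq argument; a minimal sketch (filename illustrative):

from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint("model.hdf5",
                             monitor="val_loss",   # only computed at the end of each epoch
                             save_best_only=True,
                             save_freq="epoch")    # run per epoch, not every N batches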

@hanhanwu

Changing save_freq=1 to save_freq='epoch' solved the problem for me as well.

@ymodak
Contributor

ymodak commented May 20, 2021

Closing this issue since the following setting resolves the problem for many users. Thank you.

save_freq='epoch' solved my problem.

@ymodak ymodak closed this as completed May 20, 2021

@harisushehu

> Changing to save_freq='epoch' solved my problem. Please note that 'save_freq' in a previous version was called something else.

save_freq was called 'period' in previous versions.
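For reference, a sketch of the rename (the old signature as described in this thread; treat the exact old argument as an assumption if you are on a different Keras version):

from tensorflow.keras.callbacks import ModelCheckpoint

# Older standalone Keras: checkpoint every 5 epochs
#   ModelCheckpoint("model.hdf5", monitor="val_loss", save_best_only=True, period=5)

# Current tf.keras: an integer save_freq counts batches, not epochs,
# so epoch-level behavior needs the literal string "epoch"
checkpoint = ModelCheckpoint("model.hdf5", monitor="val_loss",
                             save_best_only=True, save_freq="epoch")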

@Oriel-Barroso

Oriel-Barroso commented Apr 3, 2023

I can't fix it using save_freq='epoch'. Does 'epoch' need to be a value, or is it just a string?

@harisushehu

> I can't fix it using save_freq='epoch'. Does 'epoch' need to be a value, or is it just a string?

'epoch' is a literal string: save_freq='epoch' tells the callback to run at the end of every epoch. The number of epochs is set separately, via the epochs argument of model.fit(); for instance, epochs=5 in fit() together with save_freq='epoch' in the callback.
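A minimal end-to-end sketch of that distinction, reusing the names from the reproduction earlier in the thread (model, X_train, y_train, X_val, y_val):

hist = model.fit(X_train,
                 y_train,
                 epochs=5,                        # number of training epochs
                 validation_data=(X_val, y_val),
                 callbacks=[tf.keras.callbacks.ModelCheckpoint(
                     "model.hdf5",
                     monitor="val_loss",
                     save_best_only=True,
                     save_freq="epoch")])         # literal string, not a count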
