# Surpassing Early Stopping: <br> A Correlation-Driven Stopping Criterion for Machine Learning Models

### *A hands-on guide with the MNIST Handwritten Digit Classification Problem.*

Machine learning models are easy to overfit and hard to regularize. Out-of-sample performance often receives too little emphasis, which is why models that look good during training underperform on unseen data. There are numerous methods to regularize machine learning models; however, the best way is still to stop the training at the right time, before overfitting occurs. The vast majority of machine learning models simply stop training at a predefined epoch number or use early stopping. Is there a better way to do it?

## The Correlation-driven stopping criterion (CDSC)

The Correlation-Driven Stopping Criterion (CDSC) is designed to address the limitations of existing stopping strategies such as early stopping and maximum-epoch methods. CDSC monitors the rolling Pearson correlation between the training loss and the validation loss. Training is stopped when this correlation drops below a predefined threshold and stays below it for a predefined number of epochs. This helps to determine the point at which training should stop, which prevents overfitting and improves the model's generalization to unseen data. The method is flexible: its parameters can be tuned for different scenarios, and in the experiments reported below it improved on the traditional methods. <br> Let's see how it works:

![The method](./method.jpeg)

The method is based on the rolling Pearson correlation coefficient between the training loss and the validation loss. The correlation is calculated over a window that contains the metrics of the last ω epochs, and the window shifts forward as the training progresses. Before epoch ω, no correlation is calculated. If the first rolling correlation, at epoch ω, is negative or very low, the model or the problem is likely to be ill-formulated.
We introduce a threshold value μ. If the correlation falls below the threshold, we increment a counter. In the figure, the epoch where the correlation falls below the threshold is denoted with a vertical blue dashed line.
We introduce a patience value λ and stop the training when the counter reaches λ. In the figure, this epoch is denoted with a vertical red dashed line. Finally, we choose the model with the best validation error.
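
The stopping rule described above can be sketched in a few lines of plain Python. This is only an illustration and not the code of the `CDSC_callback` module: the function names `pearson` and `cdsc_stop_epoch` are hypothetical, and the sketch assumes that the counter is reset whenever the correlation climbs back to the threshold or above (one reading of "stays below"). A constant window, for which the Pearson correlation is undefined, is treated as a correlation of 0.0, which is also an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (assumption: the
    correlation is undefined in that case, so we pick a neutral value).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cdsc_stop_epoch(train_loss, val_loss, window_size, threshold, patience):
    """Return (stop_epoch, best_epoch), both counted from 1.

    stop_epoch is None if the criterion never triggers.
    """
    counter = 0
    stop_epoch = None
    # No correlation is available before epoch `window_size` (omega).
    for epoch in range(window_size, len(train_loss) + 1):
        window = slice(epoch - window_size, epoch)
        corr = pearson(train_loss[window], val_loss[window])
        # Assumption: the counter only counts consecutive epochs below mu.
        counter = counter + 1 if corr < threshold else 0
        if counter >= patience:
            stop_epoch = epoch
            break
    # Select the epoch with the lowest validation loss seen before stopping.
    last = stop_epoch if stop_epoch is not None else len(val_loss)
    best_epoch = min(range(last), key=lambda i: val_loss[i]) + 1
    return stop_epoch, best_epoch


# Synthetic losses: training loss keeps falling, validation loss turns at epoch 10.
train = [1.0 - 0.02 * i for i in range(40)]
val = [1.0 - 0.05 * i for i in range(10)] + [0.55 + 0.03 * (i - 9) for i in range(10, 40)]
print(cdsc_stop_epoch(train, val, window_size=5, threshold=0.4, patience=3))  # (15, 10)
```

On this synthetic curve the correlation first drops below the threshold at epoch 13, the counter reaches the patience at epoch 15, and the model from epoch 10 (lowest validation loss) would be kept.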

# Clone the repo and install the necessary dependencies

In [None]:
#!git clone https://github.com/vathyfogarassy/CDSC
!pip install tensorflow
!pip install matplotlib
!pip install numpy

# About the dataset

The MNIST dataset is a large database of handwritten digits that is widely used for training and testing in the field of machine learning. MNIST stands for "Modified National Institute of Standards and Technology." The dataset contains 70,000 images of handwritten digits, from 0 to 9, which are divided into a training set of 60,000 examples and a test set of 10,000 examples.

Each image in the MNIST dataset is a 28x28 pixel grayscale representation of a digit. These images have been size-normalized and centered in a fixed-size image. The simplicity of the MNIST dataset makes it a standard benchmark for evaluating the performance of a wide range of machine learning algorithms, especially those involving image recognition and computer vision.

We are going to divide the dataset into training, validation and test datasets, to benchmark the CDSC method and to see how it operates.

# Dataset and data preparation


In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist


num_classes = 10
input_shape = (28, 28, 1)

# Load MNIST: 60,000 training images and 10,000 test images.
(train_data, train_targets), (test_data, test_targets) = mnist.load_data()

# Shuffle the original test set before splitting it in half.
randomizer = np.arange(len(test_data))
np.random.shuffle(randomizer)
test_data = test_data[randomizer]
test_targets = test_targets[randomizer]

print(train_data.shape)

# Even positions become the validation set, odd positions the test set.
valid_data = test_data[::2]
valid_targets = test_targets[::2]
test_data = test_data[1::2]
test_targets = test_targets[1::2]

# Scale pixel values to [0, 1] and add a channel axis: (N, 28, 28) -> (N, 28, 28, 1).
train_data = np.expand_dims(train_data.astype("float32") / 255, -1)
valid_data = np.expand_dims(valid_data.astype("float32") / 255, -1)
test_data = np.expand_dims(test_data.astype("float32") / 255, -1)



# One-hot encode the digit labels.
train_targets = tf.keras.utils.to_categorical(train_targets, num_classes)
valid_targets = tf.keras.utils.to_categorical(valid_targets, num_classes)
test_targets = tf.keras.utils.to_categorical(test_targets, num_classes)

# Directory prefix for the saved arrays ("" means the current directory).
dataset_save_path = ""
np.save(dataset_save_path+"TrainX.npy",train_data)
np.save(dataset_save_path+"ValidX.npy",valid_data)
np.save(dataset_save_path+"TestX.npy",test_data)
np.save(dataset_save_path+"TrainY.npy",train_targets)
np.save(dataset_save_path+"ValidY.npy",valid_targets)
np.save(dataset_save_path+"TestY.npy",test_targets)

The code above creates six arrays: the input data and the targets for each of the training, validation and test sets. We save them for later use.
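
Before training, it can be worth checking that each pair of arrays has the layout the model expects. The helper below is a minimal sketch of such a check; `validate_split` is a hypothetical name and not part of the CDSC repository. It assumes images of shape `(28, 28, 1)` scaled to `[0, 1]` and one-hot targets with 10 classes, as produced by the preparation code above.

```python
import numpy as np


def validate_split(data, targets, num_classes=10, image_shape=(28, 28, 1)):
    """Return the number of samples, or raise ValueError if the pair looks malformed."""
    if data.shape[1:] != image_shape:
        raise ValueError(f"expected images of shape {image_shape}, got {data.shape[1:]}")
    if targets.shape != (len(data), num_classes):
        raise ValueError(f"expected targets of shape {(len(data), num_classes)}, got {targets.shape}")
    if data.min() < 0.0 or data.max() > 1.0:
        raise ValueError("pixel values should be scaled to [0, 1]")
    # Every one-hot row contains a single 1, so each row must sum to 1.
    if not np.allclose(targets.sum(axis=1), 1.0):
        raise ValueError("each target row should be one-hot encoded")
    return len(data)
```

With the arrays from this guide, `validate_split(train_data, train_targets)` should return 60000, and the validation and test pairs should each return 5000.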

# Define the model

Next, we define the model that solves the handwritten digit classification problem. We use a CNN-based model; despite its computational cost, this kind of architecture is still commonly used for image processing tasks. We also import the CDSC stopping method, which is implemented as a TensorFlow callback.

In [None]:
import tensorflow as tf
from CDSC_callback import CDSC

physical_devices = tf.config.list_physical_devices('GPU')
print(physical_devices)
#tf.config.experimental.set_memory_growth(physical_devices[0], True)

num_classes = 10
input_shape = (28, 28, 1)
    
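# File name pattern for the saved model, passed to the CDSC callback.
filepath = "Best{val_loss:.2f}.hdf5"

# CDSC callback: rolling window size (ω), correlation threshold (μ), patience (λ).
CBCB = CDSC(
    filepath=filepath,
    window_size=5,
    threshold=0.4,
    patience=10,
)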

model = tf.keras.models.Sequential(
    [
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ]
)

We compile the model and observe the summary.

In [None]:
model.compile(loss="categorical_crossentropy", optimizer="adam",metrics = ["accuracy"])
model.summary()

We train the model with the CDSC method.

In [None]:
from CDSC_callback import CDSC

model.fit(
    train_data,
    train_targets,
    validation_data=(valid_data, valid_targets),
    epochs=150,
    shuffle=True,
    batch_size=256,
    # CDSC decides when to stop; CSVLogger writes the per-epoch metrics to LOG.csv.
    callbacks=[
        CBCB,
        tf.keras.callbacks.CSVLogger("LOG.csv", separator=",", append=False),
    ],
)

During training, the callback prints the training loss and the validation loss along with the correlation between them. If the problem is formulated correctly, the correlation starts high once epoch ω is reached and then decreases as training progresses, until the stopping criterion triggers.
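
The same behaviour can be inspected after training from the `LOG.csv` file written by `CSVLogger`. The sketch below recomputes the rolling correlation with NumPy. The helper names `read_losses` and `rolling_correlation` are hypothetical, and the code assumes that the log contains the `loss` and `val_loss` columns that Keras writes when validation data is supplied. It is independent of the CDSC callback and may not match its internal computation exactly.

```python
import csv

import numpy as np


def read_losses(lines):
    """Parse CSVLogger output from an open file (or any iterable of lines)."""
    train_loss, val_loss = [], []
    for row in csv.DictReader(lines):
        train_loss.append(float(row["loss"]))
        val_loss.append(float(row["val_loss"]))
    return np.array(train_loss), np.array(val_loss)


def rolling_correlation(train_loss, val_loss, window_size):
    """Pearson correlation over each window of `window_size` consecutive epochs.

    Entry k belongs to the window that ends at epoch k + window_size.
    A window with constant values yields nan (NumPy also emits a warning).
    """
    values = []
    for end in range(window_size, len(train_loss) + 1):
        window = slice(end - window_size, end)
        values.append(np.corrcoef(train_loss[window], val_loss[window])[0, 1])
    return np.array(values)


# Usage with the log written during training:
# with open("LOG.csv", newline="") as handle:
#     train_loss, val_loss = read_losses(handle)
# print(rolling_correlation(train_loss, val_loss, window_size=5))
```

Plotting the returned values against the epoch number gives a curve comparable to the correlation panel in the method figure.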


# Results with cross-validation

We tested the CDSC method on several different datasets and models. We used a cross-validation of 50 and conducted an extensive hyperparameter search over all three parameters (μ, ω and λ).
We summarized our testing methodology in a figure:

![The testing methodology](./Testmethod.jpeg)

The results can be observed in the table below:

| Dataset                       | $\mu_{Best}$ | $\omega_{Best}$ | $\lambda_{Best}$ | $\overline{n}_{Stop}^{CD(pset_{Best})}$ | $\overline{n}_{Best}^{CD(pset_{Best})}$ | $\overline{\%e}_{ts}^{CD(pset_{Best})}$ |
|-------------------------------|--------------|-----------------|------------------|------------------------------------------|------------------------------------------|------------------------------------------|
| Credit Card Fraud Detection   | 0.75         | 30              | 65               | 119.28                                   | 79.00                                    | -1.35                                    |
| MNIST                         | 0.40         | 35              | 10               | 76.30                                    | 61.26                                    | -2.89                                    |
| Boston Housing                | -0.40        | 10              | 25               | 197.76                                   | 108.28                                   | -1.56                                    |
| Gold Daily Price Change       | -0.15        | 30              | 25               | 99.54                                    | 39.22                                    | -0.27                                    |


In the table, $\mu_{Best}$, $\omega_{Best}$ and $\lambda_{Best}$ are the best-performing threshold, window size and patience values. $\overline{n}_{Stop}^{CD(pset_{Best})}$ is the epoch where the training was stopped, and $\overline{n}_{Best}^{CD(pset_{Best})}$ is the epoch where the validation error was the lowest; the model from this epoch was selected for further use. $\overline{\%e}_{ts}^{CD(pset_{Best})}$ is the percentage test error reduction relative to the baseline used for the testing, which was an epoch limit.
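
The exact formula behind the percentage test error reduction is not given in this guide. One plausible reading, assumed here, is the relative change of the test error with respect to the baseline, so that a negative value means a lower error than the baseline. The function name `percent_error_change` is hypothetical and the numbers in the example are illustrative, not taken from the publication.

```python
def percent_error_change(method_error, baseline_error):
    """Relative change of the test error versus a baseline, in percent.

    Assumed definition: 100 * (method - baseline) / baseline.
    Negative values mean the method has a lower test error than the baseline.
    """
    if baseline_error == 0:
        raise ValueError("baseline error must be non-zero")
    return 100.0 * (method_error - baseline_error) / baseline_error


# Illustrative values only: a test error of 0.097 against a baseline of 0.100.
print(round(percent_error_change(0.097, 0.100), 2))  # -3.0
```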

But how does it compare to the early stopping and the epoch limit methods with optimal parameters? Let's put these results in perspective.


| Dataset                  | $\overline{\%e}_{ts}^{ME(m_{Best})}$ | $\overline{\%e}_{ts}^{ES(p_{Best})}$ | $\overline{\%e}_{ts}^{CD(pset_{Best})}$ |
|--------------------------|--------------------------------------|--------------------------------------|------------------------------------------|
| Credit Card Fraud Detection | -1.20                               | -1.25                                | -1.35                                   |
| MNIST                    | -2.72                                | -0.95                                | -2.89                                   |
| Boston Housing           | -0.18                                | 0.00                                 | -1.56                                   |
| Gold Daily Price Change  | -0.05                                | -0.15                                | -0.27                                   |


where $\overline{\%e}_{ts}^{ME(m_{Best})}$, $\overline{\%e}_{ts}^{ES(p_{Best})}$, and $\overline{\%e}_{ts}^{CD(pset_{Best})}$ are the percentage error reductions from the baseline for the maximum number of epochs, early stopping and CDSC methods, respectively, each with its best parameters.

# Conclusion

In this guide, we learned how to use the Correlation-Driven Stopping Criterion with the MNIST dataset and got a sense of the algorithm's potential. In the experiments summarized above, the new method achieved better results than both the early stopping and the maximum number of epochs methods.

All results presented in this article are based on the following publication:

Miseta, Tamás, Attila Fodor, and Ágnes Vathy-Fogarassy. "Surpassing early stopping: A novel correlation-based stopping criterion for neural networks." Neurocomputing 567 (2024): 127028.

If this guide was useful for your research work, please consider citing the article.
