The paper is available at https://arxiv.org/abs/2212.09921.
We propose a novel optimization algorithm for training machine learning models called Input Normalized Stochastic Gradient Descent (INSGD), inspired by the Normalized Least Mean Squares (NLMS) algorithm used in adaptive filtering. When training complex models on large datasets, the choice of optimizer parameters, particularly the learning rate, is crucial to avoid divergence. Our algorithm updates the network weights using stochastic gradient descent with the learning rate normalized by the power of the input to each layer, similar to NLMS.
Let us assume that we have a linear neuron at the last stage of the network. Following NLMS, the weights of this neuron are updated with a stochastic gradient step whose learning rate is normalized by the energy of the neuron's input vector.
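For reference, a minimal sketch of the classical NLMS recursion for such a linear neuron, written in our own notation (input vector $\mathbf{x}_t$, weight vector $\mathbf{w}_t$, desired output $d_t$, learning rate $\lambda$, and a small constant $\epsilon$), not necessarily the notation used in the paper:

$$
e_t = d_t - \mathbf{w}_t^\top \mathbf{x}_t, \qquad
\mathbf{w}_{t+1} = \mathbf{w}_t + \lambda\,\frac{e_t}{\|\mathbf{x}_t\|_2^2 + \epsilon}\,\mathbf{x}_t .
$$

INSGD carries the same idea, dividing the step size by the input energy, over to the stochastic gradient update of the last-layer weights.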
The figure illustrates how the optimizer algorithm works for an arbitrary layer. For layers other than the last one, the formula differs slightly: we use the gradient term as it is and normalize it by the input to that layer.
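A minimal sketch of what such a per-layer update could look like, assuming $\mathbf{x}^{(l)}_t$ denotes the input to layer $l$ at iteration $t$, $\mathbf{w}^{(l)}_t$ its weights, $\mathcal{L}$ the loss, $\lambda$ the learning rate, and $\epsilon$ a small constant (notation ours):

$$
\mathbf{w}^{(l)}_{t+1} = \mathbf{w}^{(l)}_t - \lambda\,
\frac{\nabla_{\mathbf{w}^{(l)}}\mathcal{L}}{\|\mathbf{x}^{(l)}_t\|_2^2 + \epsilon} .
$$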
We incorporate momentum, a technique that aids in navigating regions of high error and low curvature. In the INSGD algorithm, we introduce an input momentum term to estimate the power of the dataset, enabling power normalization. By replacing the denominator term with the estimated input power, we emphasize the significance of power estimation in our algorithm. Furthermore, the input momentum term captures the norms of all the inputs observed during training, as sketched below.
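One way to write such a running power estimate, assuming a momentum coefficient $\beta$ and using $v_t$ as our own symbol for the estimated input power at iteration $t$:

$$
v_t = \beta\, v_{t-1} + (1-\beta)\,\|\mathbf{x}_t\|_2^2 .
$$

The quantity $v_t$ then stands in for the instantaneous $\|\mathbf{x}_t\|_2^2$ in the denominator of the update.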
While estimating the input power is crucial, we encounter a challenge similar to AdaGrad: the normalization factor can grow excessively, resulting in infinitesimally small updates. To address this, we apply the logarithm function to stabilize the normalization factor. However, the logarithm introduces the risk of negative values: if the power is too low, it could yield a negative value and reverse the direction of the update. To mitigate this, we pass the result through a rectified linear unit, which avoids negative values; adding a regularizer alone would not be sufficient to resolve this problem, hence the choice of the rectified linear unit. These elements combine into the update equation sketched below.
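To make the full step concrete, here is a minimal NumPy sketch of one such update for a single layer. It is illustrative only and not the implementation in this repository; the function name `insgd_step`, the hyperparameter defaults, and the exact placement of the $\mathrm{ReLU}(\log(\cdot))$ normalizer are assumptions chosen to mirror the description above.

```python
import numpy as np

def insgd_step(weight, grad, layer_input, state, lr=0.01, beta=0.9, eps=1e-8):
    """One illustrative INSGD-style update for a single layer (not the official code).

    weight      : current weight array of the layer
    grad        : gradient of the loss w.r.t. `weight`
    layer_input : mini-batch input to the layer, shape (batch_size, num_features)
    state       : dict holding the running input-power estimate under key "v"
    """
    # Momentum-style running estimate of the input power ||x||^2,
    # averaged over the mini-batch.
    power = np.mean(np.sum(layer_input ** 2, axis=1))
    state["v"] = beta * state.get("v", power) + (1.0 - beta) * power

    # Stabilize the normalizer with a logarithm, then clip negative values
    # with a ReLU so that a very small input power cannot reverse the update.
    denom = np.maximum(np.log(state["v"] + eps), 0.0) + eps

    return weight - lr * grad / denom


# Toy usage with random placeholders (not real training signals).
state = {}
w = np.zeros(4)
x = np.random.randn(32, 4)   # mini-batch of layer inputs
g = np.random.randn(4)       # gradient from backpropagation
w = insgd_step(w, g, x, state)
```

In a real training loop, the layer inputs would be captured during the forward pass (for example, with forward hooks in PyTorch) and the running power state kept per parameter group.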