Why should we add a small number to the output and multiply a small number by the input of the softplus function? #703

Closed
nbro opened this issue Dec 28, 2019 · 7 comments


@nbro
Contributor

nbro commented Dec 28, 2019

In the article https://blog.tensorflow.org/2019/03/regression-with-probabilistic-layers-in.html, you have the following code

model = tfk.Sequential([
  tf.keras.layers.Dense(1 + 1),
  tfp.layers.DistributionLambda(
      lambda t: tfd.Normal(loc=t[..., :1],
                           scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])

where you multiply 0.05 by the input to the softplus function (operation 1) and you add 1e-3 to its output (operation 2).

We want the scale (the standard deviation) to be non-negative. However, the softplus never produces a negative number, so there should be no need to add 1e-3 to its output. Similarly, I don't see the need to multiply t[..., 1:] by 0.05.
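
For concreteness, here is a quick numerical check of that premise (a sketch of mine, assuming only the tf alias from the snippet above): both parameterizations already yield strictly positive scales for ordinary activations.

import tensorflow as tf

t = tf.constant([-5.0, 0.0, 5.0])

# Plain softplus is already strictly positive...
print(tf.math.softplus(t))                # [0.0067, 0.6931, 5.0067]
# ...as is the version with operations 1 and 2 applied.
print(1e-3 + tf.math.softplus(0.05 * t))  # [0.5770, 0.6941, 0.8269]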

I tried to train a network that models aleatoric uncertainty both with and without operations 1 and 2. The results are indeed slightly different in the two cases. Without operations 1 and 2, the uncertainty (i.e. the variance) does not appear to be modeled correctly in the regions where the points lie in a small range.

With operations 1 and 2:

[plot: training results with operations 1 and 2]

Without operations 1 and 2:

[plot: training results without operations 1 and 2]

@nbro nbro changed the title Why should we add a small number to the input to the softplus function? Why should we add a small number to the output and multiply a small number by the input of the softplus function? Dec 28, 2019
@davmre
Contributor

davmre commented Dec 30, 2019 via email

@nbro
Contributor Author

nbro commented Dec 30, 2019

@davmre You will get numerical errors whether or not you perform these two operations. If you look at the plot of the softplus function, it is roughly zero at x = -150, so the addition of 1e-3 to the output of the softplus corresponds to the addition of a (significant) bias.

I don't understand what you mean by "optimal scale".

If, by reparameterization, you mean something similar to the reparameterization trick, then, yes, at first glance, it seems to be something similar. I am not sure I follow your reasoning, though. The derivatives are taken with respect to the parameters of the model.

@davmre
Contributor

davmre commented Dec 30, 2019

I just mean 'reparameterization' in the ordinary mathematical sense of the
term---you have a function f(x) that you want to express in terms of some
other parameter z=g(x), so you write it as f(g^-1(z)).

Here you might ordinarily define a normal RV in terms of its scale
parameter, but this code is implicitly choosing to express it in terms of a
different but equivalent parameter z = 20 * inverse_softplus(scale), whose
inverse g^-1(z) = softplus(0.05 * z) recovers the original scale parameter.
Note that z here corresponds to the activation t[..., 1] in the code, so
another way to view this is that we're asking the previous layer to produce
activations of 20 * inverse_softplus(scale), which will be 20 times larger
than if we'd just asked it to produce inverse_softplus(scale).

The effect of doing this is that optimization wrt scale will move roughly 400
times slower (since the reparameterization divides the gradients by 20, and
meanwhile the parameter we're optimizing is 20 times as large as before).
This is slightly complicated by the fact that we're actually optimizing
over the model weights θ, not the activations t(θ) directly, but the effect
flows downstream: consider breaking down the gradient into the sum of
contributions through the normal loc and the scale,

  ∂ loss / ∂ θ  = ( ∂ loss / ∂ loc(θ)  *  ∂ loc(θ) / ∂ θ  +
                    ∂ loss / ∂ scale(θ) * ∂ scale(θ) / ∂ t(θ) *  ∂ t(θ) / ∂ θ )

where t(θ) are the activations defining the scale from the previous layer.
Choosing the parameterization scale = softplus(0.05 * t(θ)) instead
of scale = softplus(t(θ)) makes the ∂ scale(θ) / ∂ t(θ) term 20 times smaller, while
leaving all others unchanged (and has the effect of requiring the weights
in the previous layer to be 20 times larger), so you'd still expect that
this change would result in the optimizer changing the scale much more
slowly than the loc parameter.
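
A small sketch (mine, not part of the original reply) that checks the ∂ scale / ∂ t factor directly with tf.GradientTape:

import tensorflow as tf

t = tf.Variable(2.0)  # stands in for the pre-softplus activation t[..., 1:]

with tf.GradientTape(persistent=True) as tape:
    scale_plain = tf.math.softplus(t)          # scale = softplus(t)
    scale_scaled = tf.math.softplus(0.05 * t)  # scale = softplus(0.05 * t)

g_plain = tape.gradient(scale_plain, t)    # sigmoid(2.0)               ~ 0.88
g_scaled = tape.gradient(scale_scaled, t)  # 0.05 * sigmoid(0.05 * 2.0) ~ 0.026

# The chain rule contributes the factor of 0.05 (i.e. 20x smaller); the rest
# of the difference comes from evaluating sigmoid at 0.05 * t instead of t.
print(float(g_plain), float(g_scaled))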

If you look at the plot of the softplus function, it is roughly zero at x = -150, so the addition of 1e-3 to the output of the softplus corresponds to the addition of a bias.

Yup, exactly. The bias is that the optimizer won't consider values of less
than 1e-3 for the Gaussian scale. Adding this bias would be a bad idea if
we had reason to think the optimum was between 0 and 1e-3, but generally
it's not. The numerical issue we're worried about is that an optimizer
might misguidedly consider a near-zero value, and in that case evaluating
the Gaussian density

log p(x) = -.5 log(2 * pi * scale**2) - .5 * (x - loc)**2 / scale**2

will yield NaN because it's both dividing by and trying to take the log of
scale**2, which will be effectively zero. Since a NaN gradient ruins the
whole optimization, you generally want to prevent this expression from
ever being evaluated with near-zero scale, and adding a small constant
like 1e-3 is a cheap way to do that.
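
A minimal sketch (mine, assuming the usual tfd = tfp.distributions alias) of that failure mode, using the x = -150 activation mentioned above:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

t = tf.constant(-150.0)                  # a very negative activation
raw_scale = tf.math.softplus(t)          # underflows to exactly 0.0 in float32
safe_scale = 1e-3 + tf.math.softplus(t)  # floored at 1e-3

print(tfd.Normal(loc=0., scale=raw_scale).log_prob(1.0))   # nan
print(tfd.Normal(loc=0., scale=safe_scale).log_prob(1.0))  # ~ -5.0e5, finite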

@nbro
Contributor Author

nbro commented Jan 7, 2020

I've noticed that this trick is also used in Keras and TensorFlow. You may want to have a look at https://github.com/tensorflow/tensorflow/blob/7fda1add7cc637693781f4967ca290b6b659072b/tensorflow/python/keras/backend_config.py#L33, where the following function is defined

# Epsilon fuzz factor used throughout the codebase.
_EPSILON = 1e-7

@keras_export('keras.backend.epsilon')
def epsilon():
  return _EPSILON

which is, for example, used in the calculation of the binary cross-entropy loss https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/backend.py

# Compute cross entropy from probabilities.
bce = target * math_ops.log(output + epsilon())
bce += (1 - target) * math_ops.log(1 - output + epsilon())
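
A tiny illustrative sketch (mine, not from the TF source) of what that epsilon fuzz factor buys: a predicted probability of exactly 0 or 1 would otherwise send the log to -inf.

import numpy as np

p = np.float32(0.0)     # a predicted probability that hit exactly 0
eps = np.float32(1e-7)  # the value returned by keras.backend.epsilon()

print(np.log(p))        # -inf, which poisons the loss and its gradients
print(np.log(p + eps))  # ~ -16.1, large but finite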

The TensorFlow Probability creators may be interested in implementing a similar thing in TFP.

@cserpell

Great answer. I understand that the gradient changes when a constant is multiplied inside the softplus. Nevertheless, I don't understand why, in this example, a constant is added inside the softplus: an additive constant should not affect the optimization, so it looks like a reparameterization that does nothing.

Copying it here in case someone cannot follow the link:

# Specify the surrogate posterior over `keras.layers.Dense` `kernel` and `bias`.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
  n = kernel_size + bias_size
  c = np.log(np.expm1(1.))
  return tf.keras.Sequential([
      tfp.layers.VariableLayer(2 * n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t[..., :n],
                     scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
          reinterpreted_batch_ndims=1)),
  ])
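
A short check (my reading, not stated in the thread): c = np.log(np.expm1(1.)) is the softplus inverse of 1, so the added constant just re-centres the parameterization so that a zero activation maps to a scale of about 1; unlike the 0.05 factor, an additive shift introduces no extra multiplicative factor in ∂ scale / ∂ t, it only shifts the point at which the sigmoid is evaluated.

import numpy as np
import tensorflow as tf

c = np.log(np.expm1(1.))                 # softplus^{-1}(1) ~= 0.5413

print(float(tf.math.softplus(c)))        # ~= 1.0
print(float(1e-5 + tf.math.softplus(c))) # ~= 1.00001, the scale at t = 0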

@JP-MRPhys

This is great work. To address this, one of the papers employed posterior sharpening. I also experienced that variable sequence lengths cause issues with backpropagation through time; this was in stock TensorFlow. I am not sure whether that has any relationship to the performance issue here, but I thought I would mention it in case other folks have made similar observations.

@srvasude
Member

Closing this as I believe davmre has answered the issue.

Basically, we want to avoid bad regions of parameter space during optimization: by using eps + tf.nn.softplus(x), we make sure the parameters are constrained to be positive and don't get into problematic regions.

Another example is when you train a Gaussian Process with a tfp.math.psd_kernels.ExponentiatedQuadratic kernel. Technically, all positive values of amplitude and length_scale work for the kernel, but when the values get really small (close to zero) we run into a host of numerical issues, since the kernel matrices have eigenvalues very close to zero. By constraining those parameters with something like amplitude = 1e-3 + tf.nn.softplus(raw_amplitude), we are saying that the parameters shouldn't get too small, and thus we avoid these numerically problematic regions.
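
A rough sketch (variable names are mine, not from an actual TFP tutorial) of that constraint pattern for a kernel's hyperparameters:

import tensorflow as tf
import tensorflow_probability as tfp

# Unconstrained variables that the optimizer actually updates.
raw_amplitude = tf.Variable(0.0)
raw_length_scale = tf.Variable(0.0)

def build_kernel():
    # eps + softplus keeps both parameters strictly above 1e-3, away from the
    # near-zero region where the kernel matrix becomes close to singular.
    amplitude = 1e-3 + tf.nn.softplus(raw_amplitude)
    length_scale = 1e-3 + tf.nn.softplus(raw_length_scale)
    return tfp.math.psd_kernels.ExponentiatedQuadratic(
        amplitude=amplitude, length_scale=length_scale)

x = tf.constant([[0.0], [1.0]])
print(build_kernel().matrix(x, x))  # a 2x2 positive-definite kernel matrix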
