Why should we add a small number to the output and multiply a small number by the input of the softplus function? #703

Closed
nbro opened this issue Dec 28, 2019 · 7 comments


@nbro
Contributor

nbro commented Dec 28, 2019

In the article https://blog.tensorflow.org/2019/03/regression-with-probabilistic-layers-in.html, you have the following code

model = tfk.Sequential([
  tf.keras.layers.Dense(1 + 1),
  tfp.layers.DistributionLambda(
      lambda t: tfd.Normal(loc=t[..., :1],
                           scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])

where you multiply 0.05 by the input to the softplus function (operation 1) and you add 1e-3 to its output (operation 2).

We want the scale (the standard deviation) to be non-negative. However, the softplus never produces a negative number, so there should be no need to add 1e-3 to its output. Similarly, I don't see the need to multiply t[..., 1:] by 0.05.
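
For concreteness, here is a quick numerical check of that premise (a sketch of mine, assuming only the tf alias from the snippet above): both parameterizations already yield strictly positive scales for ordinary activations.

import tensorflow as tf

t = tf.constant([-5.0, 0.0, 5.0])

# Plain softplus is already strictly positive...
print(tf.math.softplus(t))                # [0.0067, 0.6931, 5.0067]
# ...as is the version with operations 1 and 2 applied.
print(1e-3 + tf.math.softplus(0.05 * t))  # [0.5770, 0.6941, 0.8269]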

I tried to train a network that models aleatoric uncertainty both with and without operations 1 and 2. The results are indeed slightly different in the two cases. Without operations 1 and 2, the uncertainty (i.e. the variance) does not appear to be modeled correctly in the regions where the points lie in a small range.

With operations 1 and 2:

[plot: training results with operations 1 and 2]

Without operations 1 and 2:

[plot: training results without operations 1 and 2]

@nbro nbro changed the title Why should we add a small number to the input to the softplus function? Why should we add a small number to the output and multiply a small number by the input of the softplus function? Dec 28, 2019
@davmre
Contributor

davmre commented Dec 30, 2019 via email

@nbro
Contributor Author

nbro commented Dec 30, 2019

@davmre You will get numerical errors whether or not you perform these two operations. If you look at the plot of the softplus function, it is roughly zero at x = -150, so the addition of 1e-3 to the output of the softplus corresponds to the addition of a (significant) bias.

I don't understand what you mean by "optimal scale".

If, by reparameterization, you mean something similar to the reparameterization trick, then, yes, at first glance, it seems to be something similar. I am not sure I follow your reasoning, though. The derivatives are taken with respect to the parameters of the model.

@davmre
Contributor

davmre commented Dec 30, 2019

I just mean 'reparameterization' in the ordinary mathematical sense of the
term---you have a function f(x) that you want to express in terms of some
other parameter z=g(x), so you write it as f(g^-1(z)).

Here you might ordinarily define a normal RV in terms of its scale
parameter, but this code is implicitly choosing to express it in terms of a
different but equivalent parameter z = 20 * inverse_softplus(scale), whose
inverse g^-1(z) = softplus(0.05 * z) recovers the original scale parameter.
Note that z here corresponds to the activation t[..., 1] in the code, so
another way to view this is that we're asking the previous layer to produce
activations of 20 * inverse_softplus(scale), which will be 20 times larger
than if we'd just asked it to produce inverse_softplus(scale).

The effect of doing this is that optimization wrt scale will move roughly 400
times slower (since the reparameterization divides the gradients by 20, and
meanwhile the parameter we're optimizing is 20 times as large as before).
This is slightly complicated by the fact that we're actually optimizing
over the model weights θ, not the activations t(θ) directly, but the effect
flows downstream: consider breaking down the gradient into the sum of
contributions through the normal loc and the scale,

  ∂ loss / ∂ θ  = ( ∂ loss / ∂ loc(θ)  *  ∂ loc(θ) / ∂ θ  +
                    ∂ loss / ∂ scale(θ) * ∂ scale(θ) / ∂ t(θ) *  ∂ t(θ) / ∂ θ )

where t(θ) are the activations defining the scale from the previous layer.
Choosing the parameterization scale = softplus(0.05 * t(θ)) instead
of scale = softplus(t(θ)) makes the ∂ scale(θ) / ∂ t(θ) term 20 times smaller, while
leaving all others unchanged (and has the effect of requiring the weights
in the previous layer to be 20 times larger), so you'd still expect that
this change would result in the optimizer changing the scale much more
slowly than the loc parameter.
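
A small sketch (mine, not part of the original reply) that checks the ∂ scale / ∂ t factor directly with tf.GradientTape:

import tensorflow as tf

t = tf.Variable(2.0)  # stands in for the pre-softplus activation t[..., 1:]

with tf.GradientTape(persistent=True) as tape:
    scale_plain = tf.math.softplus(t)          # scale = softplus(t)
    scale_scaled = tf.math.softplus(0.05 * t)  # scale = softplus(0.05 * t)

g_plain = tape.gradient(scale_plain, t)    # sigmoid(2.0)               ~ 0.88
g_scaled = tape.gradient(scale_scaled, t)  # 0.05 * sigmoid(0.05 * 2.0) ~ 0.026

# The chain rule contributes the factor of 0.05 (i.e. 20x smaller); the rest
# of the difference comes from evaluating sigmoid at 0.05 * t instead of t.
print(float(g_plain), float(g_scaled))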

If you look at the plot of the softplus function, it is roughly zero at x = -150, so the addition of 1e-3 to the output of the softplus corresponds to the addition of a bias.

Yup, exactly. The bias is that the optimizer won't consider values of less
than 1e-3 for the Gaussian scale. Adding this bias would be a bad idea if
we had reason to think the optimum was between 0 and 1e-3, but generally
it's not. The numerical issue we're worried about is that an optimizer
might misguidedly consider a near-zero value, and in that case evaluating
the Gaussian density

log p(x) = -.5 log(2 * pi * scale**2) - .5 * (x - loc)**2 / scale**2

will yield NaN because it's both dividing by and trying to take the log of
scale**2, which will be effectively zero. Since a NaN gradient ruins the
whole optimization, you generally want to prevent this expression from
ever being evaluated with near-zero scale, and adding a small constant
like 1e-3 is a cheap way to do that.
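
A minimal sketch (mine, assuming the usual tfd = tfp.distributions alias) of that failure mode, using the x = -150 activation mentioned above:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

t = tf.constant(-150.0)                  # a very negative activation
raw_scale = tf.math.softplus(t)          # underflows to exactly 0.0 in float32
safe_scale = 1e-3 + tf.math.softplus(t)  # floored at 1e-3

print(tfd.Normal(loc=0., scale=raw_scale).log_prob(1.0))   # nan
print(tfd.Normal(loc=0., scale=safe_scale).log_prob(1.0))  # ~ -5.0e5, finite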

@nbro
Contributor Author

nbro commented Jan 7, 2020

I've noticed that this trick is also used in Keras and TensorFlow. You may want to have a look at https://github.com/tensorflow/tensorflow/blob/7fda1add7cc637693781f4967ca290b6b659072b/tensorflow/python/keras/backend_config.py#L33, where the following function is defined

# Epsilon fuzz factor used throughout the codebase.
_EPSILON = 1e-7

@keras_export('keras.backend.epsilon')
def epsilon():
  return _EPSILON

which is, for example, used in the calculation of the binary cross-entropy loss https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/backend.py

# Compute cross entropy from probabilities.
bce = target * math_ops.log(output + epsilon())
bce += (1 - target) * math_ops.log(1 - output + epsilon())
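
A tiny illustrative sketch (mine, not from the TF source) of what that epsilon fuzz factor buys: a predicted probability of exactly 0 or 1 would otherwise send the log to -inf.

import numpy as np

p = np.float32(0.0)     # a predicted probability that hit exactly 0
eps = np.float32(1e-7)  # the value returned by keras.backend.epsilon()

print(np.log(p))        # -inf, which poisons the loss and its gradients
print(np.log(p + eps))  # ~ -16.1, large but finite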

The TensorFlow Probability creators may be interested in implementing a similar thing in TFP.

@cserpell

Great answer. I understand that the gradient changes when a constant is multiplied inside the softplus. Nevertheless, I don't understand why, in this example, a constant is added inside the softplus: an additive constant should not affect the optimization, so it looks like a reparameterization that does nothing.

Copying it here in case someone cannot follow the link:

# Specify the surrogate posterior over `keras.layers.Dense` `kernel` and `bias`.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
  n = kernel_size + bias_size
  c = np.log(np.expm1(1.))
  return tf.keras.Sequential([
      tfp.layers.VariableLayer(2 * n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t[..., :n],
                     scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
          reinterpreted_batch_ndims=1)),
  ])
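
A short check (my reading, not stated in the thread): c = np.log(np.expm1(1.)) is the softplus inverse of 1, so the added constant just re-centres the parameterization so that a zero activation maps to a scale of about 1; unlike the 0.05 factor, an additive shift introduces no extra multiplicative factor in ∂ scale / ∂ t, it only shifts the point at which the sigmoid is evaluated.

import numpy as np
import tensorflow as tf

c = np.log(np.expm1(1.))                 # softplus^{-1}(1) ~= 0.5413

print(float(tf.math.softplus(c)))        # ~= 1.0
print(float(1e-5 + tf.math.softplus(c))) # ~= 1.00001, the scale at t = 0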

@JP-MRPhys

This is great work. To address this, one of the papers employed posterior sharpening. I also experienced that variable sequence lengths cause issues with backpropagation through time; this was in stock TensorFlow. I am not sure whether that has any relationship to the performance issue here, but I thought I would mention it in case other folks have made similar observations.

@srvasude
Member

Closing this as I believe davmre has answered the issue.

Basically, we want to avoid bad regions of parameter space during optimization: by using eps + tf.nn.softplus(x), we make sure the parameters are constrained to be positive and don't get into problematic regions.

Another example is when you train a Gaussian Process with a tfp.math.psd_kernels.ExponentiatedQuadratic kernel. Technically, all positive values of amplitude and length_scale work for the kernel, but when the values get really small (close to zero) we run into a host of numerical issues, since the kernel matrices have eigenvalues very close to zero. By constraining those parameters with something like amplitude = 1e-3 + tf.nn.softplus(raw_amplitude), we are saying that the parameters shouldn't get too small, and thus we avoid these numerically problematic regions.
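
A rough sketch (variable names are mine, not from an actual TFP tutorial) of that constraint pattern for a kernel's hyperparameters:

import tensorflow as tf
import tensorflow_probability as tfp

# Unconstrained variables that the optimizer actually updates.
raw_amplitude = tf.Variable(0.0)
raw_length_scale = tf.Variable(0.0)

def build_kernel():
    # eps + softplus keeps both parameters strictly above 1e-3, away from the
    # near-zero region where the kernel matrix becomes close to singular.
    amplitude = 1e-3 + tf.nn.softplus(raw_amplitude)
    length_scale = 1e-3 + tf.nn.softplus(raw_length_scale)
    return tfp.math.psd_kernels.ExponentiatedQuadratic(
        amplitude=amplitude, length_scale=length_scale)

x = tf.constant([[0.0], [1.0]])
print(build_kernel().matrix(x, x))  # a 2x2 positive-definite kernel matrix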
