
Infinity mask breaks gradient #11756

Open
hongzimao opened this issue Jul 25, 2017 · 13 comments


@hongzimao commented Jul 25, 2017

I'm trying to do a softmax over selected indices, using an infinity mask to silence the unwanted ones. However, the gradients of those unwanted entries become nan instead of 0.

The reason I didn't use a boolean mask is that the mask indices differ across my batch, so they don't fit into a nice matrix form. If there's a workaround here, I'll be more than happy to adopt it.

The code I used to test the infinity mask is:

import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.float32, [5])
inf_mask = tf.placeholder(tf.float32, [5])

b = tf.multiply(a, inf_mask)
sf = tf.nn.softmax(b)

loss = (sf[2] - 0)
grad = tf.gradients(loss, a)

sess = tf.Session()

a_np = np.ones([5])
np_mask = np.ones([5]) * 4
np_mask[1] = -np.inf

print(sess.run([sf, grad], feed_dict={
    a: a_np,
    inf_mask: np_mask
}))

sess.close()

The output is [array([ 0.25, 0. , 0.25, 0.25, 0.25], dtype=float32), [array([-0.25, nan, 0.75, -0.25, -0.25], dtype=float32)]]

The mask works, but the gradient contains a nan where I think it should be 0.

@aselle (Member) commented Jul 25, 2017

Could you ask on StackOverflow to see if anybody has a more elegant solution?

@hongzimao (Author) commented Jul 25, 2017

Yes, there's a solution on StackOverflow that uses np.finfo(np.float32).min instead of -np.inf: https://stackoverflow.com/questions/45310221/tensorflow-infinity-mask-breaks-gradient/.
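
For reference, a minimal sketch of that substitution (my own code, mirroring the original example from the first post): the masked entry gets the most negative finite float32 instead of -inf, so every term in the gradient stays finite.

import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.float32, [5])
mask = tf.placeholder(tf.float32, [5])

# Same multiply-based mask as the original example, but finite.
sf = tf.nn.softmax(tf.multiply(a, mask))
grad = tf.gradients(sf[2], a)

np_mask = np.ones([5], dtype=np.float32) * 4
np_mask[1] = np.finfo(np.float32).min  # most negative finite float32

with tf.Session() as sess:
    print(sess.run([sf, grad], feed_dict={a: np.ones([5]), mask: np_mask}))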

Also, a 0-1 mask (instead of a negative-infinity mask) applied after the exponential seems to work as well; see the code below.

I'm not sure how robust these methods are though. Hope someone can clarify. Thanks!

import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.float32, [5])
inf_mask = tf.placeholder(tf.float32, [5])
zero_one_mask = tf.placeholder(tf.float32, [5])
exp_a = tf.exp(a)

b = tf.multiply(exp_a, zero_one_mask)
sf = b / tf.reduce_sum(b)

# b = tf.multiply(a, inf_mask)
# sf = tf.nn.softmax(b)

loss = (sf[2] - 0)
grad = tf.gradients(loss, a)

sess = tf.Session()

a_np = np.ones([5])
np_mask = np.ones([5])
np_mask[1] = 0

print(sess.run([sf, grad], feed_dict={
    a: a_np,
    zero_one_mask: np_mask
}))

sess.close()

The output is [array([ 0.25, 0. , 0.25, 0.25, 0.25], dtype=float32), [array([-0.0625 , -0. , 0.18750001, -0.0625 , -0.0625 ], dtype=float32)]].

@aselle (Member) commented Jul 27, 2017

softmax is written to avoid numerical inaccuracy for ill-conditioned finite values. It does this by subtracting off the maximum value and doing the computation around that, which means that injecting infinities into its arguments will give you nans, as you are seeing. This numerically robust computation is key for many models. If you can get away with the 0-to-1 solution, I think that is pretty decent. For maximum robustness, and the ability to work with a sparse subset of values, you could look at the sparse softmax cross-entropy with logits functions.
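
To make the failure mode concrete, a small NumPy illustration (mine, not TensorFlow internals) of the IEEE-754 identities that turn an infinite logit into a nan somewhere in the forward or backward pass:

import numpy as np

print(np.float32(np.inf) - np.float32(np.inf))  # nan: subtracting infinities is undefined
print(np.float32(0.0) * np.float32(-np.inf))    # nan: the pattern in the gradient above (zero softmax gradient times the -inf mask)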

@hongzimao (Author) commented Jul 27, 2017

Do you have some examples for "sparse cross entropy softmax with logits functions"? Thanks!
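
For reference, a minimal sketch of the kind of op presumably meant above, tf.nn.sparse_softmax_cross_entropy_with_logits (TF 1.x API); note that "sparse" here refers to integer class labels rather than a sparse subset of logits:

import numpy as np
import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 5])  # unnormalized scores
labels = tf.placeholder(tf.int32, [None])       # integer class indices

# Fuses softmax and cross-entropy into one numerically stable op.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(loss, feed_dict={
        logits: np.ones([2, 5], dtype=np.float32),
        labels: np.array([2, 4], dtype=np.int32),
    }))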

@yaroslavvb (Contributor) commented Jul 27, 2017

Once you start feeding in infinities, I'd expect NaNs in the gradients. The fundamental reason is that gradients are obtained using simple algebraic transformations which only make sense for finite numbers. The robust way of handling things would be to use boolean masks.

It looks like a "really large number" works in this case, although in general things can overflow and give an unexpected nan. For instance, the example below produces a nan gradient.

import tensorflow as tf

sess = tf.Session()
x = tf.placeholder(tf.float32)
y = tf.exp(x)
z = tf.exp(-y)
grad = tf.gradients(z, [x])[0]
print(sess.run(grad, feed_dict={x: 1e10}))    # => nan
@sy2737 commented May 3, 2018

So between masking with a very big negative number and doing softmax by hand (calculating sum of exponentials and stuff), which one is more robust/numerically stable?

Or is there a better way to do this now?

@bstriner (Contributor) commented Jun 12, 2018

@hongzimao @sy2737 I think you were on the right track originally, you just didn't debug things quite correctly. You want a - inf_mask, not a multiply. The second solution posted above is still dangerous; a stable softmax should compute e^(a - max(a)).

The key is that exp(-inf)==0, max(a, -inf)==a and a-inf==-inf. Unfortunately, 0*inf==nan, so making the mask correctly is tricky.

The two most numerically stable options are either a -inf mask or a sparse softmax (which might be better depending on what you are doing).

Below is an example of using a -inf mask. It has some specifics because of broadcasting, but you should be able to adapt it to whatever you need. Note that if your intention is to use this for loss calculations, you should be doing something else; softmax itself should only be used for things like attention.

  • Use tf.sequence_mask to create a mask from sequence lengths
  • Create an infinity mask (this is the ugly part)
    -- tf.where to get the indices
    -- tf.tile to make as many infs as required (broadcasting doesn't seem to work)
    -- tf.scatter_nd to make the mask using the indices and the infs
  • Then just tf.nn.softmax(logits - infmask, axis=1)

import numpy as np
import tensorflow as tf


def masked_softmax(logits, mask):
    """
    Masked softmax over dim 1, mask broadcasts over dim 2
    :param logits: (N, L, T)
    :param mask: (N, L)
    :return: probabilities (N, L, T)
    """
    v = tf.shape(logits)[2]
    indices = tf.cast(tf.where(tf.logical_not(mask)), tf.int32)
    inf = tf.constant(np.array([[np.inf]], dtype=np.float32), dtype=tf.float32)
    infs = tf.tile(inf, [tf.shape(indices)[0], v])
    infmask = tf.scatter_nd(
        indices=indices,
        updates=infs,
        shape=tf.shape(logits))
    _p = tf.nn.softmax(logits - infmask, axis=1)
    return _p
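
A hypothetical usage sketch (shapes and values are mine), assuming masked_softmax above is in scope:

import tensorflow as tf

lengths = tf.constant([2, 3])               # (N,) valid lengths per example
logits = tf.random_normal([2, 4, 1])        # (N, L, T)
mask = tf.sequence_mask(lengths, maxlen=4)  # (N, L) boolean, True = keep
probs = masked_softmax(logits, mask)        # masked positions get probability 0

with tf.Session() as sess:
    print(sess.run(probs))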
@SijanC147 commented Sep 15, 2018

Unfortunately, as @yaroslavvb mentioned, the masked_softmax implementation by @bstriner broke for me when computing gradients, producing NaNs in the loss.

A simple workaround that got it working for me was replacing np.inf with tf.float32.max. This, of course, incurs some penalty, since the padded values will not be completely negligible, but I think it is the most numerically stable approach.

I'm also asking whether there are any other downsides to this approach. I'm only just starting out with TensorFlow and machine learning in general, so I'd appreciate knowing if it actually breaks anything.
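
For concreteness, one way to apply a large-but-finite negative value as the mask (a tf.where variant of the same idea, not the exact change to the function above; the names are mine):

import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 5])
keep = tf.placeholder(tf.bool, [None, 5])  # True = keep, False = mask out

# Push masked positions to the most negative finite float32 so softmax
# effectively ignores them while gradients stay finite.
masked_logits = tf.where(keep, logits, tf.fill(tf.shape(logits), tf.float32.min))
probs = tf.nn.softmax(masked_logits)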

@LordBlackhawk commented Oct 10, 2018

My solution to this problem:

import tensorflow as tf


def maskedSoftmax(logits, mask):
    """
    Masked softmax over dim 1
    :param logits: (N, L)
    :param mask: (N, L)
    :return: probabilities (N, L)
    """
    indices = tf.where(mask)
    values = tf.gather_nd(logits, indices)
    denseShape = tf.cast(tf.shape(logits), tf.int64)
    sparseResult = tf.sparse_softmax(tf.SparseTensor(indices, values, denseShape))
    result = tf.scatter_nd(sparseResult.indices, sparseResult.values, sparseResult.dense_shape)
    result.set_shape(logits.shape)
    return result

(Edit: my first proposal had problems with a None in the shape of logits.)
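A hypothetical usage sketch with toy values (mine), assuming maskedSoftmax above is in scope:

import tensorflow as tf

logits = tf.constant([[1.0, 2.0, 3.0],
                      [1.0, 1.0, 1.0]])
mask = tf.constant([[True, False, True],
                    [True, True, False]])
probs = maskedSoftmax(logits, mask)

with tf.Session() as sess:
    print(sess.run(probs))  # masked positions come out as exactly 0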

@NickRyder commented Oct 29, 2018

@LordBlackhawk's solution works very well for getting the probabilities. However, I was not able to adapt it to compute cross entropy, since sparse_softmax_cross_entropy_with_logits apparently cannot take sparse tensors as inputs.

Is the only way to calculate a masked softmax cross entropy to apply a -tf.float32.max mask beforehand?

@bstriner (Contributor) commented Oct 29, 2018

@NickRyder You can adapt the sparse_logsoftmax below. Its inputs are dense logits and sparse indices, and it gives you the normalized log-probabilities as a dense matrix. You can then use the sparse_crossentropy_loss below to gather the values at the labels and form the loss.

import tensorflow as tf


def sparse_logsoftmax(logits, idx):
    dense_shape = tf.cast(tf.shape(logits), tf.int64)
    logits_values = tf.gather_nd(params=logits, indices=idx)
    sparse_logits = tf.SparseTensor(indices=idx, values=logits_values, dense_shape=dense_shape)
    lmax = tf.sparse_reduce_max(sp_input=sparse_logits, axis=-1, keep_dims=True)
    lmax = tf.stop_gradient(lmax)
    normed_logits = logits - lmax
    normed_exp_values = tf.exp(tf.gather_nd(params=normed_logits, indices=idx))
    sparse_normed_exp = tf.SparseTensor(indices=idx, values=normed_exp_values, dense_shape=dense_shape)
    normed_sum = tf.log(tf.sparse_reduce_sum(sp_input=sparse_normed_exp, axis=-1, keep_dims=True)) + lmax
    lsm = logits - normed_sum
    return lsm


def sparse_crossentropy_loss(logits, labels):
    n = tf.shape(labels)[0]
    idx = tf.stack((tf.range(n), labels), axis=-1)
    nll = - tf.reduce_mean(tf.gather_nd(params=logits, indices=idx))
    return nll
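
A hypothetical wiring of the two functions (the placeholder shapes are mine):

import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 1000])  # dense scores
idx = tf.placeholder(tf.int64, [None, 2])          # (row, allowed column) pairs in row-major order
labels = tf.placeholder(tf.int32, [None])          # true class per row; must be one of the allowed columns

log_probs = sparse_logsoftmax(logits, idx)          # dense, log-sum-exp taken over the allowed columns only
loss = sparse_crossentropy_loss(log_probs, labels)  # mean negative log-likelihood at the labels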
@carlosg-m commented Sep 26, 2019

(Quoting @bstriner's -inf masked_softmax comment from Jun 12, 2018, above.)

I've adapted the -np.inf solution together with sparse_softmax_cross_entropy_with_logits and it worked flawlessly.

The only problem is that I expected it to be faster since we are only applying softmax to a subset of labels, but it takes the same time as the standard softmax over all the labels (10000+).

What is going on here? Are all the gradients being calculated even for the infinity mask?

Is there a more efficient solution that skips or ignores masked values, like the tf.sparse.softmax suggested by @bstriner?

@carlosg-m commented Sep 26, 2019

I've just tried @LordBlackhawk's solution that uses tf.sparse.softmax. It's 2x slower than @bstriner's np.inf mask together with sparse_softmax_cross_entropy_with_logits.

Both solutions are very elegant, and using them gave sharper convergence and lower loss since the labels are restricted. But neither is faster than the standard tf.nn.softmax.

This is baffling to me since I’m using thousands of labels but each record only has a really small percentage of non-masked labels that need to be updated. I expected the cost to be similar to having a single softmax with the same number of logits as the length of the largest subset of labels, in this case hundreds.
