
seq2seq.SequenceLoss is probably incompatible with tf.keras.models.Model.compile #375

Closed
kazemnejad opened this issue Jul 27, 2019 · 15 comments

@kazemnejad (Contributor) commented Jul 27, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.0.0.b1
  • TensorFlow Addons installed from (source, PyPi): PyPi
  • TensorFlow Addons version: 0.4.0
  • Python version and type (eg. Anaconda Python, Stock Python as in Mac, or homebrew installed Python etc): Stock Python
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • Is the GPU used? (yes/no): Yes

Describe the bug

I think there are two incompatibilities when we use tfa.seq2seq.SequenceLoss and tf.keras.Model at the same time. In the following example, I replicated a common scenario in building a seq2seq model, i.e. the input is a 2D int32 tensor containing word ids, and the output/logits is a 3D tensor representing a distribution over the vocabulary at each time step.

import tensorflow as tf

vocab_size = 1000  # example value; any positive integer works for this repro

model = tf.keras.Sequential([
    # input_shape=(None,) builds the model for variable-length sequences so summary() works
    tf.keras.layers.Embedding(vocab_size, 50, mask_zero=True, input_shape=(None,)),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
model.summary()

Now, let's compile the model and use tfa.seq2seq.SequenceLoss() as its loss function:

import tensorflow_addons as tfa
loss_obj = tfa.seq2seq.SequenceLoss()
model.compile(optimizer='adam', loss=loss_obj)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
    453   else:
    454     if values is None:
--> 455       raise ValueError("None values not supported.")
    456     # if dtype is provided, forces numpy array to be the type
    457     # provided if possible.

ValueError: None values not supported.

As you can see, it gives us an ambiguous error (complete traceback: none_value.log). However, I suppose this issue is rooted in the custom __call__ implementation. We had to override the superclass __call__ method to handle the loss reduction ourselves. There is nothing wrong with this approach, and indeed it is supported by the Keras APIs, but Keras checks for it in an odd way. I guess training.py#L1673 is the corresponding condition: it decides whether to invoke our loss function through .call() or .__call__() based on whether it has a reduction attribute. Since SequenceLoss inherits from the tf.keras.losses.Loss class, the reduction attribute is automatically added to our class (see the code at losses.py#L91). Thus Keras ends up calling its .call() method, where we basically return None.
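Roughly, my reading of that dispatch is the following (a paraphrased sketch, not the verbatim Keras source):

if hasattr(loss_fn, 'reduction'):
    # Keras assumes a standard tf.keras.losses.Loss: it invokes call() and applies
    # the reduction itself; SequenceLoss.call() returns None, hence the error above.
    per_sample_loss = loss_fn.call(y_true, y_pred)
else:
    # Plain callable: Keras invokes it directly, so our __call__ can do its own reduction.
    per_sample_loss = loss_fn(y_true, y_pred)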
Then I tried to work around this temporarily by deleting the attribute:

loss_obj = tfa.seq2seq.SequenceLoss()
delattr(loss_obj, 'reduction')
model.compile(optimizer='adam', loss=loss_obj)

It successfully passes that step, but it raises another exception (complete traceback: shape_mismatch.log):

/usr/local/lib/python3.6/dist-packages/tensorflow_addons/seq2seq/loss.py in sequence_loss(logits, targets, weights, average_across_timesteps, average_across_batch, sum_over_timesteps, sum_over_batch, softmax_loss_function, name)
     92     if len(targets.get_shape()) != 2:
     93         raise ValueError(
---> 94             "Targets must be a [batch_size x sequence_length] tensor")
     95     if len(weights.get_shape()) != 2:
     96         raise ValueError(

ValueError: Targets must be a [batch_size x sequence_length] tensor

I followed the traceback, and it turns out that Keras does some loss preparation in the model.compile() method, where it passes our loss a symbolic 3D tensor with the same shape as the model output, regardless of the actual target labels. And 3D target tensors are apparently unsupported by sequence_loss.
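For what it's worth, here is a quick eager check that I would expect to pass (my own sketch; the shapes and values are arbitrary examples), which suggests the failure is specific to the symbolic 3D targets built by compile():

logits = tf.random.uniform([2, 3, vocab_size])  # (batch, seq_length, vocab_size)
targets = tf.zeros([2, 3], dtype=tf.int32)      # (batch, seq_length), sparse word ids
weights = tf.ones([2, 3])                       # (batch, seq_length)
print(loss_obj(targets, logits, weights))       # direct call with the documented shapes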

P.S.: I encountered this issue while working on #231.

@kazemnejad kazemnejad changed the title from "seq2seq.SequenceLoss is probably incompatible with tf.keras.Model.compile" to "seq2seq.SequenceLoss is probably incompatible with tf.keras.models.Model.compile" on Jul 27, 2019
@WindQAQ WindQAQ added the seq2seq label Jul 30, 2019

@WindQAQ (Member) commented Jul 30, 2019

cc @qlzh727 for visibility

@facaiy (Member) commented Sep 3, 2019

cc @guillaumekln Hi, Guillaume, can you take a look? Thanks!

@guillaumekln (Contributor) commented Sep 3, 2019

@kazemnejad, thanks for the detective work.

For the first error, I think your approach is reasonable: we could delete the reduction attribute in the SequenceLoss constructor to let Keras know that we are taking care of the loss reduction.
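A minimal sketch of just the constructor-side change (illustrative only, not an actual patch):

import tensorflow as tf

class SequenceLoss(tf.keras.losses.Loss):
    # Sketch: same class name as tfa.seq2seq.SequenceLoss, but only the constructor is shown.
    def __init__(self, name=None):
        super().__init__(name=name)
        # Removing the inherited attribute makes Keras (training.py#L1673) invoke
        # __call__ directly, leaving the loss reduction to this class.
        delattr(self, 'reduction')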

For the second error, this looks like strange Keras behavior. It creates a placeholder with the same static rank and dtype as the model output, but when running the graph the dynamic rank of the placeholder is different... It seems we could just accept 3D targets and cast them to tf.int64. Could you give it a try and, if it works, send a PR?

@kazemnejad (Contributor, Author) commented Sep 3, 2019

> @kazemnejad, thanks for the detective work.
>
> For the first error, I think your approach is reasonable: we could delete the reduction attribute in the SequenceLoss constructor to let Keras know that we are taking care of the loss reduction.
>
> For the second error, this looks like strange Keras behavior. It creates a placeholder with the same static rank and dtype as the model output, but when running the graph the dynamic rank of the placeholder is different... It seems we could just accept 3D targets and cast them to tf.int64. Could you give it a try and, if it works, send a PR?

Yes, of course. I'll give it a try by the end of this week.

@guillaumekln (Contributor) commented Sep 9, 2019

Hi @kazemnejad, did you have a chance to try that?

@kazemnejad (Contributor, Author) commented Sep 9, 2019

@qlzh727 (Member) commented Sep 9, 2019

Sorry for the very late reply. I was on a very long holiday.

For the reduction property, it might have something to do with the new distribution strategy, if one is used. Please see the docstring for more details.

reduction: (Optional) Type of `tf.keras.losses.Reduction` to apply to loss.
      Default value is `AUTO`. `AUTO` indicates that the reduction option will
      be determined by the usage context. For almost all cases this defaults to
      `SUM_OVER_BATCH_SIZE`.
      When used with `tf.distribute.Strategy`, outside of built-in training
      loops such as `tf.keras` `compile` and `fit`, using `AUTO` or
      `SUM_OVER_BATCH_SIZE` will raise an error. Please see
      https://www.tensorflow.org/alpha/tutorials/distribute/training_loops
      for more details on this.

@qlzh727 (Member) commented Sep 9, 2019

@guillaumekln (Contributor) commented Sep 9, 2019

@qlzh727 Do you suggest changing the SequenceLoss constructor to read from reduction?

@kazemnejad (Contributor, Author) commented Sep 9, 2019

@qlzh727
Unfortunately, the mentioned issue occurs when it is used with the built-in training loops such as compile. Nonetheless, removing the reduction attribute works fine, and according to training.py#L1673, it seems this is the only way to force Keras to call our __call__ method instead of .call(). Do you suggest that we let Keras do the reduction for us?

@qlzh727 (Member) commented Sep 9, 2019

Ah, thanks for the reference to the training loop. Then I think removing the "reduction" should be the way to go.

@kazemnejad (Contributor, Author) commented Sep 9, 2019

@qlzh727 @guillaumekln

There is also another incompatibility. As you know, we need to pass sample_weight to the SequenceLoss class (to eliminate the effect of pad tokens on the loss value). To do this in the Keras fashion, we have to use the following setup:

model.compile(optimizer='adam', loss=loss_obj, sample_weight_mode="temporal")
model.fit(x, y, sample_weight=weights, ...)

where y's shape is (batch, seq_length), weights' shape is (batch, seq_length), and model output shape is (batch, seq_length, vocab_size). In this configuration, Keras raises the following exception:

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 885, in standardize_weights
    'an input with shape ' + str(y.shape) + '. '
ValueError: Found a sample_weight array for an input with shape (2, 3). Timestep-wise sample weighting (use of sample_weight_mode="temporal") is restricted to outputs that are at least 3D, i.e. that have a time dimension.

If we take a look at training_utils.py#L883, we find that the following condition is the cause of the problem.

# line 883 -> 889
if len(y.shape) < 3:
      raise ValueError('Found a sample_weight array for '
                       'an input with shape ' + str(y.shape) + '. '
                       'Timestep-wise sample weighting (use of '
                       'sample_weight_mode="temporal") is restricted to '
                       'outputs that are at least 3D, i.e. that have '
                       'a time dimension.')

And this condition is pretty much a dead-end for us.

So what is the cause of all these incompatibilities?
I think these issues are rooted in the difference between two mindsets: (1) Keras and (2) SequenceLoss.

The Keras mindset assumes:

  • model_output shape = (batch, seq_length, vocab_size)
  • y shape = (batch, seq_length, vocab_size) (in the one-hot format)
  • weights shape = (batch, seq_length)

But SequenceLoss assumes the following:

  • model_output shape = (batch, seq_length, vocab_size)
  • y shape = (batch, seq_length) (sparse representation where each element is a label index)
  • weights shape = (batch, seq_length)

As you can see, there is no way to use them together, since one of them will always raise an exception on incompatible shapes (see the sketch below for the Keras-side expectation).
To fix this problem, we either have to open an issue against the Keras APIs and ask for more flexible setups, or change SequenceLoss to work with both mindsets.
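For concreteness, this is roughly what the Keras-side setup would look like (a sketch reusing the x, y, and weights from above; SequenceLoss would still reject the one-hot targets):

# One-hot targets match the (batch, seq_length, vocab_size) shape Keras expects here.
y_onehot = tf.one_hot(y, depth=vocab_size)
model.compile(optimizer='adam', loss=loss_obj, sample_weight_mode='temporal')
# Passes Keras' shape checks, but sequence_loss then rejects the rank-3 targets.
model.fit(x, y_onehot, sample_weight=weights)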

@kazemnejad (Contributor, Author) commented Sep 10, 2019

Hi @guillaumekln & @qlzh727, did you have a chance to look at my comment? Thanks.

@guillaumekln (Contributor) commented Sep 10, 2019

Thanks for trying to make this work. Can we support both sparse and one-hot labels in sequence_loss? Checking the tensor rank should be enough to make the distinction.
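Something along these lines, perhaps (a rough illustration using existing TF ops, not a final implementation):

# Inside sequence_loss: dispatch on the rank of targets relative to logits.
if len(targets.get_shape()) == len(logits.get_shape()) - 1:
    # Sparse labels: (batch, seq_length) integer ids.
    crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=tf.cast(targets, tf.int64), logits=logits)
else:
    # One-hot (or soft) labels: (batch, seq_length, vocab_size).
    crossent = tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.cast(targets, logits.dtype), logits=logits)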

@kazemnejad (Contributor, Author) commented Sep 10, 2019

Yes, I think that would work perfectly fine. Actually, I was waiting for your comment. Thanks.
