Clarifying How to Correctly Access Alignment History #36
Comments
Hi Rylan, it'd be hard for me to debug your code. Two suggestions: keep testing your code (e.g., try examples with shorter input sequences instead of those with length 500), or try adapting this tutorial to your problem to see whether the same issue remains. On your questions:
@lmthang I spent yesterday running different tests, and I found that an encoder/decoder, each with LSTM cells with num_units = 2 and no attention mechanism, still converges to 100% accuracy (just more slowly). Is it possible that by using

Edit: I think I'm misusing TrainingHelper. What precisely are the

Edit 2: The purpose of the TrainingHelper is to pass the correct output of the previous time step to the decoder, yes? I thought that TrainingHelper automatically passes the start symbol into the decoder at the first step. Now I'm starting to think that this is incorrect.
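For what it's worth, a minimal sketch of the usual pattern (variable names like start_token, targets, and decoder_lengths are assumptions, not the original code): TrainingHelper reads the inputs tensor it is given verbatim, one step per time step, and does not insert a start symbol itself, so the caller typically prepends the start token and shifts the targets right by one step.

```python
import tensorflow as tf

# TrainingHelper reads `decoder_inputs` verbatim, one step at a time; it does
# not prepend a start symbol on its own, so the caller does that explicitly.
# (Names below are illustrative, not taken from the original code.)
batch_size = tf.shape(targets)[0]
start_tokens = tf.tile(start_token[tf.newaxis, tf.newaxis, :],
                       [batch_size, 1, 1])                   # (batch, 1, dim)
decoder_inputs = tf.concat([start_tokens, targets[:, :-1, :]], axis=1)
helper = tf.contrib.seq2seq.TrainingHelper(decoder_inputs, decoder_lengths)
```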
@lmthang, regarding 3, yes, that is clearer :)
I get the same problem with alignment: though the loss is very low, the alignment is still random. I have tested with different attention options, "scaled_luong" and "normed_bahdanau". Unfortunately, the alignment is not correct.
I did! I wasn't using start and stop tokens properly. What do your sequences look like?
@RylanSchaeffer, my encoder input is an audio wave's logfbank features with shape [batch_size, wave_len, feature_dim], and there are no start or stop tokens for it. My decoder sequence is English words with shape [batch_size, sentence_len, vocab_dim]. The start token is tf.one_hot([-1], vocab_size) = (0, 0, ..., 0, 0) and the stop token is tf.one_hot([vocab_size], vocab_size) = (0, 0, ..., 0, 1).
@RylanSchaeffer, in addition, I think the loss can be reduced to zero over a large number of epochs with just one training sample, but the alignment can't be learned from only one training sample. Is this correct? Thanks.
But the alignment can't be learned using only one training sample

I think the loss can be reduced to zero for large epoch number with just one training sample

If both of these are true, why is the loss reduced to zero with the wrong alignment?
A couple of probable issues here, which I ran into myself.

a) The TrainingHelper uses 100% teacher forcing. This means that the decoder is given the ground truth from the previous step instead of what it actually came up with. This makes it very difficult to interpret training loss; you'll want a separate run on test data to determine how well the model is doing.

b) Keep in mind that the decoder is fully capable of learning a language model. So if you're producing characters, it can learn to guess the next character in a sequence without knowing the input data. Again, this makes it very hard to judge model quality based on training loss.

You'll probably do better with the ScheduledOutputTrainingHelper. This lets you reduce the amount of teacher forcing, which will encourage the model to pay more attention to the actual inputs and less attention to the supplied ground truth. If you read the 'Listen, Attend and Spell' paper, they use 80% teacher forcing during training.

One trick I've been using to make the training loss *slightly* more interpretable is to build a model with all of the input sequences zeroed out, to get a sense of how well it's possible to do with just the amount of teacher forcing you're providing.
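For concreteness, a minimal sketch of the swap being suggested, assuming the TF 1.x tf.contrib.seq2seq API and placeholder names decoder_inputs / decoder_lengths. The sampling_probability argument is the chance of feeding the decoder its own previous output instead of the ground truth, so 80% teacher forcing corresponds to 0.2.

```python
import tensorflow as tf

# Replace the fully teacher-forced TrainingHelper with a scheduled one.
# sampling_probability=0.2 is roughly 80% teacher forcing.
helper = tf.contrib.seq2seq.ScheduledOutputTrainingHelper(
    inputs=decoder_inputs,            # ground-truth decoder inputs
    sequence_length=decoder_lengths,
    sampling_probability=0.2)         # 0.0 would be 100% teacher forcing
```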
I have a problem with the Seq2Seq library, and I'm trying to use this tutorial to find out where my bug is. My problem is that my model's alignment values are initially uniformly distributed across encoder outputs; however, the alignment values remain the same even after training, despite the fact that my model's accuracy climbs from chance (25%) to 100%. My problem is structured such that the only way to do well at it is for the decoder to learn to pay attention.
Copying and pasting from the earlier GitHub issue:
As background, both my inputs and labeled outputs at each time step are vectors of shape (4,). I run my encoder for 500 steps, i.e. inputs have shape (minibatch_size, 500, 4), and my decoder runs for approximately 40-41 steps, i.e. the final output has shape (minibatch_size, 41, 4). Each output label depends roughly on 12 sequential inputs, so for example, the first output depends on inputs 1-12, the second output depends on inputs 13-24, etc. I don't use embeddings since doing so isn't applicable for my problem.
I reduced my model to a single-layer encoder and single-layer decoder to eliminate any mistake I might be making with multi-layered architectures. The encoder is a bidirectional RNN.
At the start of training, my alignment_history has roughly uniform random weights. Its shape is (41, minibatch_size, 500) (although I could transpose it from time-major to batch-major). alignment_history will have values between 0.001739 and 0.002241, which makes sense: randomly initialized attention should be around 1/500 = 0.002. Additionally, my model performs at chance (25% classification accuracy).

During training, my model converges to 100% classification accuracy on both training and validation data. The model never sees the same training data twice, so I'm 99% confident that the model isn't memorizing the training data. However, after training, the values of alignment_history have effectively not changed; they now look randomly chosen from between 0.00185 and 0.00219.
My code is relatively straightforward. I have a class encapsulating my model. One method instantiates an RNN cell:
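(The original code block isn't included in this thread. As a rough, hypothetical sketch of what such a method might look like with the TF 1.x API, written as a free function rather than a class method, with illustrative names:)

```python
import tensorflow as tf

def build_rnn_cell(num_units):
    # Single LSTM cell; the issue describes reducing the model to one layer.
    return tf.nn.rnn_cell.LSTMCell(num_units=num_units)
```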
I have one method for building the encoder:
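(Again the original body isn't shown; below is a hypothetical sketch of a bidirectional encoder over inputs shaped (batch, 500, 4), reusing build_rnn_cell from the sketch above. Names and sizes are illustrative, not the author's code.)

```python
def build_encoder(encoder_inputs, sequence_lengths, num_units):
    fw_cell = build_rnn_cell(num_units)
    bw_cell = build_rnn_cell(num_units)
    (out_fw, out_bw), states = tf.nn.bidirectional_dynamic_rnn(
        fw_cell, bw_cell, encoder_inputs,
        sequence_length=sequence_lengths, dtype=tf.float32)
    # Concatenate forward and backward outputs to use as the attention memory.
    encoder_outputs = tf.concat([out_fw, out_bw], axis=-1)
    return encoder_outputs, states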
I similarly have another method for building the decoder:
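(A hypothetical sketch of such a decoder, again with illustrative names; alignment_history=True is the flag that exposes the attention weights afterwards.)

```python
def build_decoder(decoder_inputs, decoder_lengths, encoder_outputs, num_units):
    attention = tf.contrib.seq2seq.BahdanauAttention(
        num_units=num_units, memory=encoder_outputs)
    cell = tf.contrib.seq2seq.AttentionWrapper(
        build_rnn_cell(num_units), attention,
        attention_layer_size=num_units,
        alignment_history=True)   # keep alignments for inspection later
    helper = tf.contrib.seq2seq.TrainingHelper(decoder_inputs, decoder_lengths)
    batch_size = tf.shape(decoder_inputs)[0]
    decoder = tf.contrib.seq2seq.BasicDecoder(
        cell, helper,
        initial_state=cell.zero_state(batch_size, tf.float32))
    outputs, final_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
    return outputs, final_state
```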
I use both of these methods, and then project the output of the decoder to the same dimensionality as my labels.
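(A hypothetical sketch of wiring the two together and projecting the decoder output back to the 4-dimensional label space; inputs, input_lengths, decoder_inputs, and decoder_lengths are assumed placeholders, and the hidden size is arbitrary.)

```python
encoder_outputs, _ = build_encoder(inputs, input_lengths, num_units=64)
decoder_outputs, final_state = build_decoder(
    decoder_inputs, decoder_lengths, encoder_outputs, num_units=64)
# Project the decoder outputs to the same dimensionality as the labels.
logits = tf.layers.dense(decoder_outputs.rnn_output, units=4)
```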
Most of my code was written before the NMT tutorial was released, so I read the code and then stepped through it, but I can't find any glaring differences. I do have a couple of additional questions.
I have two hypotheses. One is that I'm incorrectly accessing my model's alignments, and the other is that I'm screwing something up in a much more significant way. Just to eliminate the first as a possibility, the correct way to access the decoder's alignments is through setting alignment_history=True in AttentionWrapper and then examining the values in decoder_final_states[0].alignment_history.stack(). Is this correct?
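(To make the pattern being asked about concrete, here is a sketch assuming the TF 1.x tf.contrib.seq2seq API, where `decoder` is the BasicDecoder built above. alignment_history is a TensorArray stored on the AttentionWrapper's state, so it has to be stacked into a Tensor; with a bare AttentionWrapper cell the final state is an AttentionWrapperState, whereas the [0] indexing in the question applies when the wrapper is nested inside a state tuple such as a MultiRNNCell's.)

```python
_, final_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
# Time-major shape: (decoder_steps, batch_size, encoder_steps), e.g. (41, B, 500).
alignments = final_state.alignment_history.stack()
```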
How is the attention mechanism's num_units chosen? Is the attention mechanism's number of units required to match the number of units in the RNN cell as well as the number of units in the AttentionWrapper, or is that not necessary?
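(Just to make the question concrete, the three independent size parameters being asked about, with arbitrary illustrative values; encoder_outputs is assumed to exist:)

```python
cell = tf.nn.rnn_cell.LSTMCell(num_units=128)              # RNN cell size
attention = tf.contrib.seq2seq.BahdanauAttention(
    num_units=64, memory=encoder_outputs)                   # attention depth
wrapped = tf.contrib.seq2seq.AttentionWrapper(
    cell, attention, attention_layer_size=32)               # attention layer size
```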
I'm confused by the terminology used regarding memory, queries and keys. Memory and keys are both defined in English as "the set of source hidden states", but mathematically they're defined differently, i.e. the memory is W_2 \overline{h}_s for Bahdanau attention, but the keys are W_1 h_t for Bahdanau attention. My guess is that the tutorial means to say that the query h_t is converted into a key using W_1, and that key is then compared against keys generated from the encoder's hidden states, i.e. W_2 \overline{h}_s. Is this correct, or am I misunderstanding something?
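(For reference, the Bahdanau-style additive score as it appears in the NMT tutorial, which is the expression the question is interpreting:)

```latex
% Bahdanau-style additive attention score: the decoder state h_t (the query)
% and each encoder state \overline{h}_s are projected by W_1 and W_2
% respectively before being combined.
\mathrm{score}(h_t, \overline{h}_s) = v_a^{\top} \tanh\left(W_1 h_t + W_2 \overline{h}_s\right)
```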