
Feature Request: How to Access Attention Weights of Attention Wrapper #11067

Closed
RylanSchaeffer opened this issue Jun 26, 2017 · 12 comments

@RylanSchaeffer

OS: macOS Sierra version 10.12.5
TensorFlow Version: v1.2.0-rc2-21-g12f033d 1.2.0

This is a two-part request related to tensorflow.contrib.seq2seq. I would like the ability to visualize the attention weights of the AttentionWrapper, but I'm hampered by the lack of examples, and I'm struggling to infer what to pass as the previous_alignments argument of BahdanauAttention's __call__ method.

First, could someone clarify how to access the attention weights?

Second, would it be possible to add some tool that visualizes the attention weights (possibly to TensorBoard)?

@aselle
Contributor

aselle commented Jun 26, 2017

Thanks for the issue, but this looks like a feature request that belongs in TensorBoard:
http://github.com/tensorflow/tensorboard/issues
Thanks!

@aselle aselle closed this as completed Jun 26, 2017
@RylanSchaeffer
Author

RylanSchaeffer commented Jun 26, 2017

@aselle, perhaps the second question belongs in TensorBoard, but I don't think the first does at all. Even if TensorBoard had the visualization capability I'm looking for, I still wouldn't know how to access the attention weights in order to add them as summaries. I think either an example or additional documentation would be useful, and that's relevant to TensorFlow, not TensorBoard.

@RylanSchaeffer
Author

@aselle If you have time, maybe you could answer the question on Stack Overflow? It's clear I'm not the only person struggling to access the AttentionWrapper weights.

@RylanSchaeffer
Author

@ebrevdo , you were helpful on an earlier thread. If you have the time, could you please provide an answer to my first question of how to access the attention mechanism's weights?

@RylanSchaeffer RylanSchaeffer changed the title Feature Request: Visualizing Attention Weights of Attention Wrapper Feature Request: How to Access Attention Weights of Attention Wrapper Jun 27, 2017
@RylanSchaeffer
Author

RylanSchaeffer commented Jun 27, 2017

For anyone else wondering, you can access the alignments by setting alignment_history=True in AttentionWrapper. So if I have an object called model with my decoder's final state as a member variable, I can access the alignments as follows:

alignments = sess.run([model.decoder_final_states[0].alignment_history.stack()], feed_dict={whatever})
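
If it helps anyone, here's a rough sketch of turning that stacked history into a heatmap. This assumes the fetched array is time-major, i.e. shaped (decoder_steps, batch_size, encoder_steps), and uses matplotlib, which isn't part of TensorFlow:

import matplotlib.pyplot as plt
import numpy as np

# sess.run was given a list with a single fetch, so take element 0
alignment_history = np.array(alignments[0])  # (decoder_steps, batch_size, encoder_steps)

# plot decoder steps vs. encoder steps for the first example in the batch
example = alignment_history[:, 0, :]
plt.imshow(example, aspect='auto', cmap='viridis')
plt.xlabel('encoder step')
plt.ylabel('decoder step')
plt.colorbar(label='attention weight')
plt.show()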

@ebrevdo
Contributor

ebrevdo commented Jun 28, 2017 via email

@RylanSchaeffer
Author

RylanSchaeffer commented Jun 28, 2017

@ebrevdo , can you clarify what _keys are for AttentionWrapper?

Also, to confirm: the normalizer referred to in Step 4 of AttentionWrapper's call method ("Calculate the alignments by passing the score through the normalizer") is the attention mechanism's probability_fn?

@RylanSchaeffer
Author

RylanSchaeffer commented Jul 24, 2017

@ebrevdo , I'm having a strange problem with my Sequence to Sequence model's alignment history.

My model's alignment values are initially uniformly distributed (~0.002, for encoder outputs of 500 steps), which makes sense. However, the alignment values remain roughly the same (~0.002) even after training, despite the fact that my model's accuracy climbs from chance (25%) to 100%. My problem is structured in a way that the only way to do well at it is for the decoder to learn to pay attention.

I have no idea what might be causing this - has anyone experienced anything similar to this? Alternatively, does anyone have suggestions for debugging this issue?

Edit: I'm posting this on Stack Overflow. @aselle , if you know the answer, or know someone who knows, I'd appreciate it!

@RylanSchaeffer
Author

@oahziur , do you have an idea of what might be causing my problem?

@RylanSchaeffer
Author

RylanSchaeffer commented Jul 29, 2017

I thought I'd add more information and code, in case that can help someone help me.

As background, both my inputs and labeled outputs at each time step are vectors of shape (4, ). I run my encoder for 500 steps i.e. inputs have shape (minibatch size, 500, 4), and my decoder runs for approximately 40-41 steps i.e. final output has shape (minibatch size, 41, 4). Each output label depends roughly on 12 sequential inputs, so for example, the first output depends on inputs 1-12, the second output depends on inputs 13-24, etc. I don't use embeddings.
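
For reference, here is a sketch of those shapes as placeholders (my actual self.x / self.y / length tensors are built elsewhere in the class, so treat this as illustrative):

self.x = tf.placeholder(tf.float32, shape=(None, 500, 4))   # encoder inputs
self.x_lengths = tf.placeholder(tf.int32, shape=(None,))    # encoder sequence lengths
self.y = tf.placeholder(tf.float32, shape=(None, 41, 4))    # decoder targets
self.y_lengths = tf.placeholder(tf.int32, shape=(None,))    # decoder sequence lengths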

I reduced my model to a single-layer encoder and a single-layer decoder to eliminate any mistakes I might be making with multi-layer architectures. The encoder is a bidirectional RNN.

At the start of training, my alignment_history has roughly random uniform weights. Its shape is (41, minibatch size, 500) (although I could transpose it from time-major to batch-major). alignment_history will have values between 0.001739 and 0.002241, which makes sense - randomly initialized attention should be around 1/500 = 0.002. Additionally, my model performs at chance (25% classification accuracy).

During training, my model converges to 100% classification accuracy on both training and validation data, as shown below.

[Screenshot: training and validation classification accuracy curves, both converging to 100%]

The model never sees the same training data twice, so I'm 99% confident that the model isn't memorizing the training data. However, after training, the values of alignment_history effectively haven't changed; the values now look randomly chosen from between 0.00185 and 0.00219.
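
For concreteness, this is roughly the check I'm doing (a sketch; alignments here is the stacked history fetched with sess.run as above):

import numpy as np

history = np.array(alignments[0])  # (decoder_steps, batch_size, encoder_steps)
print(history.min(), history.max())  # ~0.00185 to ~0.00219 after training
print(history.sum(axis=-1))          # each decoder step's weights still sum to ~1.0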

My code is relatively straightforward. I have a class encapsulating my model. One method instantiates a RNN cell:

@staticmethod
def _create_lstm_cell(cell_size):
    """
    Create an RNN cell. If lstm_or_gru is True (default), create a
    layer-normalized LSTM cell when layer_norm is True (default) or a
    vanilla LSTM cell otherwise. If lstm_or_gru is False, create a
    Gated Recurrent Unit (GRU) cell.
    """

    if tf.flags.FLAGS.lstm_or_gru:
        if tf.flags.FLAGS.layer_norm:
            return LayerNormBasicLSTMCell(cell_size)
        else:
            return BasicLSTMCell(cell_size)
    else:
        return GRUCell(cell_size)

I have one method for building the encoder:

def _define_encoder(self):
    """
    Construct an encoder RNN using a bidirectional layer.
    """

    with tf.variable_scope('define_encoder'):

        encoder_outputs, encoder_final_states = bidirectional_dynamic_rnn(
            cell_fw=self._create_lstm_cell(ENCODER_SINGLE_DIRECTION_SIZE),
            cell_bw=self._create_lstm_cell(ENCODER_SINGLE_DIRECTION_SIZE),
            inputs=self.x,
            dtype=tf.float32,
            sequence_length=self.x_lengths,
            time_major=False  # default
        )

        # concatenate forward and backwards encoder outputs
        encoder_outputs = tf.concat(encoder_outputs, axis=-1)

        # concatenate forward and backwards cell states
        new_c = tf.concat([encoder_final_states[0].c, encoder_final_states[1].c], axis=1)
        new_h = tf.concat([encoder_final_states[0].h, encoder_final_states[1].h], axis=1)
        encoder_final_states = (LSTMStateTuple(c=new_c, h=new_h),)

    return encoder_outputs, encoder_final_states

I similarly have another method for building the decoder:

def _define_decoder(self, encoder_outputs, encoder_final_states):
    """
    Construct a decoder complete with an attention mechanism. The encoder's
    final states will be used as the decoder's initial states.
    """

    with tf.variable_scope('define_decoder'):
        # instantiate attention mechanism
        attention_mechanism = BahdanauAttention(num_units=DECODER_SIZE,
                                                memory=encoder_outputs,
                                                normalize=True)

        # wrap LSTM cell with attention mechanism
        attention_cell = AttentionWrapper(cell=self._create_lstm_cell(cell_size=DECODER_SIZE),
                                          attention_mechanism=attention_mechanism,
                                          # output_attention=False,  # doesn't seem to affect alignments
                                          alignment_history=True,
                                          attention_layer_size=DECODER_SIZE)  # arbitrarily chosen

        # create initial attention state of zeros everywhere
        decoder_initial_state = attention_cell.zero_state(
            batch_size=tf.flags.FLAGS.batch_size,
            dtype=tf.float32).clone(cell_state=encoder_final_states[0])

        # TODO: switch this out at inference time
        training_helper = TrainingHelper(inputs=self.y,  # feed in ground truth
                                         sequence_length=self.y_lengths)  # feed in sequence lengths

        decoder = BasicDecoder(cell=attention_cell,
                               helper=training_helper,
                               initial_state=decoder_initial_state
                               )

        # run decoder over input sequence
        decoder_outputs, decoder_final_states, decoder_final_sequence_lengths = dynamic_decode(
            decoder=decoder,
            maximum_iterations=41,
            impute_finished=True)

        # decoder_outputs is a BasicDecoderOutput namedtuple; take rnn_output
        decoder_outputs = decoder_outputs[0]
        decoder_final_states = (decoder_final_states,)

    return decoder_outputs, decoder_final_states

I use both of these methods, and then project the output of the decoder to the same dimensionality as my labels.

def _add_inference(self):
    """
    Create a sequence-to-sequence model using a bidirectional encoder and an
    attention-wrapped decoder.

    The outputs of the decoder are projected down to the dimensionality of
    the labels, i.e. from DECODER_SIZE to 4.
    """

    with tf.variable_scope('add_inference'):
        encoder_outputs, encoder_final_states = self._define_encoder()
        decoder_outputs, decoder_final_states = self._define_decoder(encoder_outputs, encoder_final_states)

        weights = tf.Variable(tf.truncated_normal(shape=[DECODER_SIZE, 4]))
        bias = tf.Variable(tf.truncated_normal(shape=[4]))
        logits = tf.tensordot(decoder_outputs, weights, axes=[[2], [0]]) + bias  # 2nd dimension of decoder outputs, 0th dimension of weights

    return encoder_final_states, decoder_final_states, logits

@RylanSchaeffer
Author

Most of my code was written before the NMT tutorial was released, so I read its code and then stepped through it, but I can't find any glaring differences from my own code. I do have a few additional questions.

  1. I have two hypotheses: either I'm accessing my model's alignments incorrectly, or I'm making a much more significant mistake elsewhere. Just to eliminate the first possibility: is the correct way to access the decoder's alignments to set alignment_history=True in AttentionWrapper and then examine the values in decoder_final_states[0].alignment_history.stack()?

  2. How is the attention mechanism's num_units chosen? Is the attention mechanism's number of units required to match the number of units in the RNN cell as well as the number of units in the AttentionWrapper, or is that not necessary?

  3. I'm confused by the terminology regarding memory, queries and keys. Memory and keys are both described in English as "the set of source hidden states", but mathematically they're defined differently: memory is W_2\overline{h}_s for Bahdanau attention, while the keys are W_1h_t. My guess is that the tutorial means the query h_t is converted into a key using W_1, and that this key is then compared against the keys generated from the encoder's hidden states, i.e. W_2\overline{h}_s. Is this correct, or am I misunderstanding something? (My reading of the score function is sketched just below.)
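
For reference, here is how I'm reading the additive score, in the tutorial's notation (this is my interpretation, not something taken from the code; h_t is the decoder state / query, \overline{h}_s an encoder output, and W_1, W_2, v are the attention mechanism's learned parameters):

\mathrm{score}(h_t, \overline{h}_s) = v^\top \tanh(W_1 h_t + W_2 \overline{h}_s)
\alpha_{ts} = \mathrm{softmax}_s\big(\mathrm{score}(h_t, \overline{h}_s)\big)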

@RylanSchaeffer
Author

@lmthang , if you could help, I'd really appreciate it!
