there is no one-layer MLP in attention Layer #5

Open

hengchao0248 opened this issue Apr 17, 2017 · 4 comments
@hengchao0248

Hello, thank you for the excellent code for the CNN, LSTM and HAN models; I have learnt a lot from it.
One question, though:
In the paper I see a one-layer MLP inside the attention layer, but I cannot find it in your implemented AttLayer; the class contains only a single context vector. Could you explain this for me?
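For reference, the word-level attention in the Yang et al. HAN paper under discussion can be summarized as follows (my restatement of the paper's equations, in the paper's notation):

```latex
u_{it}      = \tanh(W_w h_{it} + b_w)                                           % the "one-layer MLP"
\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}  % softmax against the context vector u_w
s_i         = \sum_{t} \alpha_{it} h_{it}                                       % weighted sum of the GRU outputs h_{it}
```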

@richliao
Owner

I used a Keras TimeDistributed(Dense()) layer to implement this vertical interaction of neurons on the LSTM output before feeding it to the attention layer. However, I really doubt whether it's useful.
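For anyone following along, here is a minimal sketch of the wiring being discussed - not the repository's exact code; the sizes (100-unit GRUs, Dense(200), 100 words per sentence) and the name `l_dense` are assumptions taken from the comments below:

```python
# Sketch of the word-level encoder under discussion (assumed names and sizes).
from keras.layers import Input, Embedding, GRU, Bidirectional, TimeDistributed, Dense

MAX_SENT_LENGTH = 100   # words per sentence, as assumed in the comments below
VOCAB_SIZE = 20000      # placeholder vocabulary size
EMBEDDING_DIM = 100     # placeholder embedding size

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(sentence_input)

# Bidirectional GRU, 100 units per direction -> one 200-length vector per word (h_it).
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded)

# The per-word interaction mentioned above: the same Dense(200) applied to every
# timestep. Whether this duplicates the paper's one-layer MLP is what this issue is about.
l_dense = TimeDistributed(Dense(200))(l_lstm)

# l_dense, of shape (MAX_SENT_LENGTH, 200) per sample, is what the AttLayer then receives.
```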

@ni9elf

ni9elf commented May 13, 2017

Isn't line 174, `eij = K.tanh(K.dot(x, self.W))` in textClassifierHATT.py, implementing the one-layer MLP that produces u_it as the hidden representation of h_it before the attention weights are computed?
If so, why is another one-layer MLP used via TimeDistributed in the form of Dense()? Please elaborate on the use of Dense() - there is no equivalent of that layer described in the paper. Is it your own addition, or does it come from the paper?
Thanks.

@philippschaefer4

I'm stuck at the same point.

As I read the code, the "one-layer MLP" is split between the Dense layer and line 174 of the attention layer. The MLP in the article you mention on your blog consists of one input layer (the 200 nodes holding the 2*100 values that the bidirectional GRUs, with 100 units each, output per word), one hidden layer (the 200 nodes of the Dense layer) and one final layer. The final layer has ONE node, and I think that is the single value per word computed at the beginning of the attention layer in the code, isn't it? The attention layer of course receives all MAX_WORDS=100 words at once, each as a 200-length vector, i.e. as a 100x200 matrix. So self.W in line 174 would be a 200-length weight vector that is applied to all words equally. So there are independent weights of the MLP from input to hidden (by use of TimeDistributed) and a common weight vector from hidden to output.
Line 174 then collapses each of the 100 200-length word vectors down to a scalar, 100 scalars in total, with tanh as the activation function. The resulting values are the e's from the articles: the MLP is the function a(h) with e = a(h), where h is the "word vector in context" from the bidirectional GRUs.
The rest of the attention layer just applies exp (line 176) and the softmax normalization (line 177), which yields 100 percentages as the per-word "attentions" (alpha in the articles). Each word vector is then weighted by its word's attention (line 179), and the weighted vectors are summed into the "sentence vector" of length 200 (line 180).
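If that reading is right, the computation inside AttLayer can be sketched with Keras backend ops as follows. This is a sketch of the steps just described, not the repository's code; the line numbers in the comments refer to textClassifierHATT.py as cited above (line 174 there is `K.tanh(K.dot(x, self.W))`; the expand/squeeze here is only so the sketch also runs on the TensorFlow backend):

```python
# Sketch of the attention pooling walked through above.
# Assumed shapes: x is (batch, 100 words, 200 features); w is the shared 200-length vector (self.W).
from keras import backend as K

def attention_pool(x, w):
    eij = K.tanh(K.squeeze(K.dot(x, K.expand_dims(w)), axis=-1))  # ~line 174: one scalar e per word
    ai = K.exp(eij)                                               # ~line 176
    weights = ai / K.sum(ai, axis=1, keepdims=True)               # ~line 177: softmax over the 100 words (alpha)
    weighted = x * K.expand_dims(weights)                         # ~line 179: scale each word vector by its attention
    return K.sum(weighted, axis=1)                                # ~line 180: 200-length "sentence vector"
```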

Now the questions are:
First: Am I right - is this what happens here? Is there only ONE attention layer with a shared weight vector, or are the 100 TimeDistributed Dense layers connected to 100 implicitly time-distributed attention layers?
Second: If I am right, I can't see anywhere in the article(s) you refer to that the weights from hidden to output are shared across all words. Is your implementation simply different, or did I misunderstand the papers?
Third: It also seems that you don't weight the outputs of the bidirectional GRUs as the word-in-context vectors (as shown in the figure from the article, which is also on your blog), but rather the hidden layers of the MLP. Shouldn't the attention layer somehow have access to l_lstm so it can weight the original words-in-context?

@philippschaefer4

philippschaefer4 commented Jan 6, 2018

I just looked up the TimeDistributed layer wrapper again and realized that it means the same weights are also shared across the input-to-hidden connections of the MLP. I also seem to remember reading something about "shared weights" in the articles. So questions 1 and 2 are answered for me.

Only the third one is left: why use the hidden layer of the MLP as the word context and not the output of l_lstm?
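To make that remaining question concrete, here is one way to express the alternative it is pointing at - a hypothetical helper, not code from the repository, reusing the shapes assumed above:

```python
# Sketch of the variant behind question 3: score each word from the projected
# representation u, but take the weighted sum over the original GRU outputs h.
from keras import backend as K

def attend_over_h(h, u, w):
    # h: GRU outputs (l_lstm), shape (batch, T, 200) -- the "words in context" from the paper's figure
    # u: projected representations (e.g. the TimeDistributed Dense output), shape (batch, T, 200)
    # w: shared context/weight vector, shape (200,)
    scores = K.squeeze(K.dot(u, K.expand_dims(w)), axis=-1)  # score each word from u
    alphas = K.softmax(scores)                               # normalize to attention weights
    return K.sum(h * K.expand_dims(alphas), axis=1)          # ...but sum the ORIGINAL h, not u

# The AttLayer being discussed scores and sums the same tensor (the Dense output),
# which is exactly the difference the question asks about.
```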
