there is no one-layer MLP in attention Layer #5

Open

hengchao0248 opened this issue Apr 17, 2017 · 4 comments
@hengchao0248

Hello, thank you for the excellent code for the CNN, LSTM and HAN models; I have learnt a lot from it.
One question, though:
In the paper I see a one-layer MLP inside the attention layer, but I cannot find it in your implemented AttLayer; the class contains only a single context vector. Could you explain this for me?
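For reference, the word-level attention in the Yang et al. HAN paper under discussion can be summarized as follows (my restatement of the paper's equations, in the paper's notation):

```latex
u_{it}      = \tanh(W_w h_{it} + b_w)                                           % the "one-layer MLP"
\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}  % softmax against the context vector u_w
s_i         = \sum_{t} \alpha_{it} h_{it}                                       % weighted sum of the GRU outputs h_{it}
```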

@richliao
Owner

I used a Keras TimeDistributed(Dense()) layer to implement this vertical interaction of neurons on the LSTM output before feeding it to the attention layer. However, I really doubt whether it's useful.
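For anyone following along, here is a minimal sketch of the wiring being discussed - not the repository's exact code; the sizes (100-unit GRUs, Dense(200), 100 words per sentence) and the name `l_dense` are assumptions taken from the comments below:

```python
# Sketch of the word-level encoder under discussion (assumed names and sizes).
from keras.layers import Input, Embedding, GRU, Bidirectional, TimeDistributed, Dense

MAX_SENT_LENGTH = 100   # words per sentence, as assumed in the comments below
VOCAB_SIZE = 20000      # placeholder vocabulary size
EMBEDDING_DIM = 100     # placeholder embedding size

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(sentence_input)

# Bidirectional GRU, 100 units per direction -> one 200-length vector per word (h_it).
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded)

# The per-word interaction mentioned above: the same Dense(200) applied to every
# timestep. Whether this duplicates the paper's one-layer MLP is what this issue is about.
l_dense = TimeDistributed(Dense(200))(l_lstm)

# l_dense, of shape (MAX_SENT_LENGTH, 200) per sample, is what the AttLayer then receives.
```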

@ni9elf

ni9elf commented May 13, 2017

Isn't line 174, `eij = K.tanh(K.dot(x, self.W))` in textClassifierHATT.py, implementing the one-layer MLP that produces u_it as the hidden representation of h_it before the attention weights are computed?
If so, why is another one-layer MLP used via TimeDistributed in the form of Dense()? Please elaborate on the use of Dense() - there is no equivalent of that layer described in the paper. Is it your own addition, or does it come from the paper?
Thanks.

@philippschaefer4

I'm stuck at the same point.

As I read the code, the "one-layer MLP" is split between the Dense layer and line 174 of the attention layer. The MLP in the article you mention on your blog consists of one input layer (the 200 nodes holding the 2*100 values that the bidirectional GRUs, with 100 units each, output per word), one hidden layer (the 200 nodes of the Dense layer) and one final layer. The final layer has ONE node, and I think that is the single value per word computed at the beginning of the attention layer in the code, isn't it? The attention layer of course receives all MAX_WORDS=100 words at once, each as a 200-length vector, i.e. as a 100x200 matrix. So self.W in line 174 would be a 200-length weight vector that is applied to all words equally. So there are independent weights of the MLP from input to hidden (by use of TimeDistributed) and a common weight vector from hidden to output.
Line 174 then collapses each of the 100 200-length word vectors down to a scalar, 100 scalars in total, with tanh as the activation function. The resulting values are the e's from the articles: the MLP is the function a(h) with e = a(h), where h is the "word vector in context" from the bidirectional GRUs.
The rest of the attention layer just applies exp (line 176) and the softmax normalization (line 177), which yields 100 percentages as the per-word "attentions" (alpha in the articles). Each word vector is then weighted by its word's attention (line 179), and the weighted vectors are summed into the "sentence vector" of length 200 (line 180).
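If that reading is right, the computation inside AttLayer can be sketched with Keras backend ops as follows. This is a sketch of the steps just described, not the repository's code; the line numbers in the comments refer to textClassifierHATT.py as cited above (line 174 there is `K.tanh(K.dot(x, self.W))`; the expand/squeeze here is only so the sketch also runs on the TensorFlow backend):

```python
# Sketch of the attention pooling walked through above.
# Assumed shapes: x is (batch, 100 words, 200 features); w is the shared 200-length vector (self.W).
from keras import backend as K

def attention_pool(x, w):
    eij = K.tanh(K.squeeze(K.dot(x, K.expand_dims(w)), axis=-1))  # ~line 174: one scalar e per word
    ai = K.exp(eij)                                               # ~line 176
    weights = ai / K.sum(ai, axis=1, keepdims=True)               # ~line 177: softmax over the 100 words (alpha)
    weighted = x * K.expand_dims(weights)                         # ~line 179: scale each word vector by its attention
    return K.sum(weighted, axis=1)                                # ~line 180: 200-length "sentence vector"
```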

Now the questions are:
First: Am I right - is this what happens here? Is there only ONE attention layer with a shared weight vector, or are the 100 TimeDistributed Dense layers connected to 100 implicitly time-distributed attention layers?
Second: If I am right, I can't see anywhere in the article(s) you refer to that the weights from hidden to output are shared across all words. Is your implementation simply different, or did I misunderstand the papers?
Third: It also seems that you don't weight the outputs of the bidirectional GRUs as the word-in-context vectors (as shown in the figure from the article, which is also on your blog), but rather the hidden layers of the MLP. Shouldn't the attention layer somehow have access to l_lstm so it can weight the original words-in-context?

@philippschaefer4

philippschaefer4 commented Jan 6, 2018

I just looked up the TimeDistributed layer wrapper again and realized that it means the same weights are also shared across the input-to-hidden connections of the MLP. I also seem to remember reading something about "shared weights" in the articles. So questions 1 and 2 are answered for me.

Only the third one is left: why use the hidden layer of the MLP as the word context and not the output of l_lstm?
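To make that remaining question concrete, here is one way to express the alternative it is pointing at - a hypothetical helper, not code from the repository, reusing the shapes assumed above:

```python
# Sketch of the variant behind question 3: score each word from the projected
# representation u, but take the weighted sum over the original GRU outputs h.
from keras import backend as K

def attend_over_h(h, u, w):
    # h: GRU outputs (l_lstm), shape (batch, T, 200) -- the "words in context" from the paper's figure
    # u: projected representations (e.g. the TimeDistributed Dense output), shape (batch, T, 200)
    # w: shared context/weight vector, shape (200,)
    scores = K.squeeze(K.dot(u, K.expand_dims(w)), axis=-1)  # score each word from u
    alphas = K.softmax(scores)                               # normalize to attention weights
    return K.sum(h * K.expand_dims(alphas), axis=1)          # ...but sum the ORIGINAL h, not u

# The AttLayer being discussed scores and sums the same tensor (the Dense output),
# which is exactly the difference the question asks about.
```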
