
How did you solve the [unk] problem? #8

Closed
bhomass opened this issue Jun 10, 2019 · 12 comments

Comments

@bhomass

bhomass commented Jun 10, 2019

I tried running some randomly selected text with lots of domain-specific jargon. On my trained model, all the jargon words got translated to [unk], which actually seems reasonable based on my understanding of the models (4 and 5). However, on your demo site, your model was able to spit the jargon words back out. Can you suggest what the difference might be that allowed your model to work effectively on the OOV words?

@theamrzaki
Owner

Model 4 and model 5 are built upon the concept of the pointer-generator model, which simply means that the model trains a neural network to know when to generate a novel word and when to copy a word from the given text.

This is exactly what happens when the model is confronted with out-of-dictionary (jargon, unk) words: it chooses to copy them from the original text, as it can't generate them, having simply never seen them before.

So the pointer-generator model actually combines the two worlds of abstractive and extractive models in one model; you can learn more here

@theamrzaki
Owner

I truly hope this was helpful

@bhomass
Author

bhomass commented Jun 15, 2019

My question is much deeper than your surface-level answer. I know the basic idea of combining pointer and generator, and the claim that the pointer is there to take care of OOV words. Looking through the code, though, the pointer seems to point to OOV words in the article for training data, where the abstract (target) serves as the guidance. In test mode, there is no target, and the model is clueless as to when to use the OOV words in the extended vocabulary, because those words never had a vector representation and do not participate in the context. When applying a trained model to a different domain, the OOV problem is especially acute. For some reason, your trained model is able to generate those OOV words in the summary. In any case, I will run your code and model to track how it's behaving to answer my questions.

@theamrzaki
Owner

theamrzaki commented Jun 15, 2019

For the pointer-generator model, in training I train a neural net with an equation whose main inputs are

  1. Decoder inputs
  2. Attention inputs

so the model is trained to understand when to point to a word from the article in training, and to do the same in testing; since no summary is provided in testing, it copies the words from the given article.

It knows whether or not to copy a word from the output of the neural net that was trained.

The equation of that neural net (used in model 4) is from this paper:

$$p_{gen} = \sigma\left(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr}\right)$$
where

  1. $s_t$: the decoder state → decoder parameter
  2. $x_t$: the decoder input → decoder parameter
  3. $h_t^*$: the context vector → attention inputs

and the vectors $w_{h^*}$, $w_s$, $w_x$ and the scalar $b_{ptr}$ are learnable parameters.

$p_{gen}$ is the probability of generating the word from the vocabulary distribution ($P_{vocab}$) rather than copying it from the attention distribution (the sum of the attention weights on the word's occurrences); i.e. it weighs whether to generate a new word or copy a word from the sentence:

$$P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t$$
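To make the mixing concrete, here is a minimal NumPy sketch of those two equations (illustrative names, not the repo's actual TensorFlow code): it computes $p_{gen}$ from the three inputs above and then blends $P_{vocab}$ with the attention distribution over an extended vocabulary, where source OOV tokens hold temporary ids after the fixed vocab.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_distribution(h_star, s_t, x_t, attn, src_ids,
                       p_vocab, extended_size,
                       w_h, w_s, w_x, b_ptr):
    """Mix the generation and copy distributions (See et al., 2017 style).

    h_star : context vector at decode step t      (attention input)
    s_t    : decoder state at step t              (decoder input)
    x_t    : decoder input at step t              (decoder input)
    attn   : attention weights over source tokens, sums to 1
    src_ids: extended-vocab id of each source token
             (OOV tokens get temporary ids >= len(p_vocab))
    p_vocab: softmax over the fixed vocabulary
    """
    # p_gen = sigma(w_h . h* + w_s . s_t + w_x . x_t + b_ptr)
    p_gen = sigmoid(w_h @ h_star + w_s @ s_t + w_x @ x_t + b_ptr)

    # Generation part: OOV slots keep zero P_vocab mass.
    p_final = np.zeros(extended_size)
    p_final[:len(p_vocab)] = p_gen * p_vocab

    # Copy part: scatter attention mass onto each source token's id,
    # summing over repeated occurrences of the same word.
    for i, token_id in enumerate(src_ids):
        p_final[token_id] += (1.0 - p_gen) * attn[i]
    return p_final
```

Note that a source token outside the fixed vocabulary still receives probability through the scatter step; that is the copy path discussed below.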

There has been another implementation of the pointer-generator model, from another page. It shares the same concept and the same broad inputs (from the decoder and from the attention), but it is not what is implemented in the model 4 paper:

$$P(s_i = 1) = \sigma\left(v^{s} \cdot \left(W_h^s h_i + W_e^s E[o_{i-1}] + W_c^s c_i + b^s\right)\right)$$
where

  1. $h_i$: hidden state of the decoder (output of the decoder) → decoder parameter
  2. $E[o_{i-1}]$: embedding of the decoder output at the previous time step → decoder parameter
  3. $c_i$: attention-weighted context vector → attention inputs

$W_h^s$, $W_e^s$, $W_c^s$, $b^s$ and $v^s$ are the learnable parameters.

So simply by training this neural net, your model is able, for each word, to know whether to copy or to generate, according to the final output $p_{gen}$, which truly acts as a switch choosing between copying and generating.
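For comparison, a sketch of that alternative switch (assuming the formulation above; all names are illustrative). Structurally, the only difference from $p_{gen}$ is that the three projections produce a vector that is reduced to a scalar by $v^s$ before the sigmoid, rather than three dot products summed directly:

```python
import numpy as np

def switch_prob(h_i, e_prev, c_i, W_h, W_e, W_c, b_s, v_s):
    """Alternative generator/pointer switch: P(s_i = 1)."""
    # z = W_h h_i + W_e E[o_{i-1}] + W_c c_i + b_s   (a vector)
    z = W_h @ h_i + W_e @ e_prev + W_c @ c_i + b_s
    # reduce to a scalar with v_s, then squash to a probability
    return 1.0 / (1.0 + np.exp(-(v_s @ z)))
```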

@bhomass
Author

bhomass commented Jun 15, 2019

Hey, thanks for the thorough explanation. You are the most responsive author I have ever run into.

I have gone through both the theory and the code in detail. I am afraid I am still ahead of the comments you just made. Yes, pgen is responsible for deciding how much to rely on copying, and that decision is based on the contextual information drawn from the attention mechanism. This argument is correct when it comes to in-vocab words.

You may have missed my point that OOV words never get a vector representation and do NOT participate in context derivations, and therefore during test mode the model would be clueless as to when it should point to them. Is my point clear?

@bhomass
Author

bhomass commented Jun 16, 2019

OK, it appears the model knows to point to OOV words from the article without their vector representation. I imagine this is due to their position and the contextual support from neighboring words.

@theamrzaki
Owner

Yes, this comes from the attention distribution.

The model contains two parts:

  • one from the vocab representation (for creating a new word)

  • the other from the attention (which points to the position of the word), and this works even for words not found in the vocab representation

This can be seen from this equation, which contains the two parts:

$$P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t$$

When a word is out of vocab, $P_{vocab}(w) = 0$, so the attention part wins and the word is copied from the source document. The t subscript (time) denotes the position of the word, so this actually points to the exact word to be copied.
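To make the copy path concrete, here is a minimal sketch (illustrative names and numbers, not the repo's code) of how an OOV word is copied even though its encoder input is just [UNK]: the token's *position* keeps its own attention weight, and a temporary extended-vocab id at that position collects the copy probability.

```python
import numpy as np

VOCAB = {"[UNK]": 0, "the": 1, "model": 2, "uses": 3}

def extend_vocab(src_tokens):
    """Give each source OOV token a temporary id after the fixed vocab."""
    src_ids, oov_to_id = [], {}
    for tok in src_tokens:
        if tok in VOCAB:
            src_ids.append(VOCAB[tok])
        else:
            if tok not in oov_to_id:
                oov_to_id[tok] = len(VOCAB) + len(oov_to_id)
            src_ids.append(oov_to_id[tok])
    return src_ids, oov_to_id

# "transducer" is OOV: the encoder sees it as [UNK], but source
# position 3 in the attention distribution is still its own.
src = ["the", "model", "uses", "transducer"]
src_ids, oovs = extend_vocab(src)              # "transducer" -> id 4

attn = np.array([0.05, 0.05, 0.10, 0.80])      # attention at this step
p_vocab = np.array([0.10, 0.30, 0.40, 0.20])   # fixed-vocab softmax
p_gen = 0.2                                    # switch says: mostly copy

p_final = np.zeros(len(VOCAB) + len(oovs))
p_final[:len(VOCAB)] = p_gen * p_vocab
for i, tid in enumerate(src_ids):
    p_final[tid] += (1 - p_gen) * attn[i]

# id 4 ("transducer") receives (1 - 0.2) * 0.8 = 0.64 and wins the
# argmax purely from its position's attention weight.
print(p_final.argmax(), round(p_final[4], 2))  # -> 4 0.64
```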

@theamrzaki
Owner

I hope this made things clearer

@bhomass
Author

bhomass commented Jun 16, 2019

Hi theamrzaki, thanks for adding the comment. However, I don't think you got this one. t is the time step of the decoding process, so it points to the position of the output; i points to the position in the encoder sequence. It works out so that for any OOV word that shows up in the inferred summary, the attention mechanism has to give high weight to a particular input OOV word from the article (the right i). And that is exactly my question: given that the input OOV is substituted with [unk], it does not offer the semantic meaning of the original OOV word, so the model must leverage other information to accentuate that OOV word for the purpose of selecting it to point to.

And the "other" information, I am suggesting, would be the order of the OOV word and its neighboring, non-OOV, words.

@theamrzaki
Owner

You are correct that the t time step is for the decoder.

As for OOV words: when a word is OOV, the part of $P(w)$ that is responsible for creating a new word is zero, since $P_{vocab}(w)$ is zero, so all the focus comes from the other part, copying:

$$P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t$$

This is the way OOV words get copied, and it happens for each word (i.e. for each $P(w)$).
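As a worked example of that special case (numbers invented for illustration): suppose $p_{gen} = 0.3$ and the OOV word occupies a single source position with attention weight $a_i^t = 0.7$. Then

$$P(w) = 0.3 \cdot 0 + (1 - 0.3) \cdot 0.7 = 0.49,$$

while every in-vocab word has its $P_{vocab}$ mass scaled down by $p_{gen} = 0.3$, so the OOV word can easily win the argmax and be emitted in the summary.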

@bhomass
Author

bhomass commented Jun 16, 2019

It's been a good discussion. Thanks again.

@theamrzaki
Owner

I truly hope it has been useful, thank you
