
How did you solve the [unk] problem? #8

Closed
bhomass opened this issue Jun 10, 2019 · 12 comments

Comments

@bhomass

bhomass commented Jun 10, 2019

I tried running some randomly selected text with lots of domain-specific jargon. On my trained model, all the jargon words got translated to [unk], which actually seems reasonable based on my understanding of the models (4 and 5). However, on your demo site, your model was able to spit the jargon words back out. Can you suggest what the difference might be that allowed your model to work effectively on the OOV words?

@theamrzaki
Owner

Model 4 and model 5 are built upon the concept of the pointer-generator model, which simply means that the model trains a neural network to know when to generate a novel word and when to copy a word from the given text.

This is exactly what happens when the model is confronted with out-of-dictionary (jargon, unk) words: it chooses to copy them from the original text, as it can't generate them, having simply never seen them before.

So the pointer-generator model actually combines the two worlds of abstractive and extractive models in one model; you can learn more here

@theamrzaki
Owner

I truly hope this was helpful

@bhomass
Author

bhomass commented Jun 15, 2019

My question is much deeper than your surface-level answer. I know the basic idea of combining pointer and generator, and the claim that the pointer is there to take care of OOV words. Looking through the code, though, the pointer seems to point to OOV words in the article for training data, where the abstract (target) serves as the guidance. In test mode, there is no target, and the model is clueless as to when to use the OOV words in the extended vocabulary, because those words never had a vector representation and do not participate in the context. When applying a trained model to a different domain, the OOV problem is especially acute. For some reason, your trained model is able to generate those OOV words in the summary. In any case, I will run your code and model to track how it's behaving to answer my questions.

@theamrzaki
Owner

theamrzaki commented Jun 15, 2019

For the pointer-generator model, in training I train a neural net with an equation whose main inputs are

  1. Decoder inputs
  2. Attention inputs

so the model is trained to understand when to point to a word from the article in training, and to do the same in testing; since no summary is provided in testing, it copies the words from the given article.

It knows whether or not to copy a word from the output of the neural net that was trained.

The equation of that neural net (used in model 4) is from this paper:

$$p_{gen} = \sigma\left(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr}\right)$$
where

  1. $s_t$: the decoder state → decoder parameter
  2. $x_t$: the decoder input → decoder parameter
  3. $h_t^*$: the context vector → attention inputs

and the vectors $w_{h^*}$, $w_s$, $w_x$ and the scalar $b_{ptr}$ are learnable parameters.

$p_{gen}$ is the probability of generating the word from the vocabulary distribution ($P_{vocab}$) rather than copying it from the attention distribution (the sum of the attention weights on the word's occurrences); i.e. it weighs whether to generate a new word or copy a word from the sentence:

$$P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t$$
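To make the mixing concrete, here is a minimal NumPy sketch of those two equations (illustrative names, not the repo's actual TensorFlow code): it computes $p_{gen}$ from the three inputs above and then blends $P_{vocab}$ with the attention distribution over an extended vocabulary, where source OOV tokens hold temporary ids after the fixed vocab.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_distribution(h_star, s_t, x_t, attn, src_ids,
                       p_vocab, extended_size,
                       w_h, w_s, w_x, b_ptr):
    """Mix the generation and copy distributions (See et al., 2017 style).

    h_star : context vector at decode step t      (attention input)
    s_t    : decoder state at step t              (decoder input)
    x_t    : decoder input at step t              (decoder input)
    attn   : attention weights over source tokens, sums to 1
    src_ids: extended-vocab id of each source token
             (OOV tokens get temporary ids >= len(p_vocab))
    p_vocab: softmax over the fixed vocabulary
    """
    # p_gen = sigma(w_h . h* + w_s . s_t + w_x . x_t + b_ptr)
    p_gen = sigmoid(w_h @ h_star + w_s @ s_t + w_x @ x_t + b_ptr)

    # Generation part: OOV slots keep zero P_vocab mass.
    p_final = np.zeros(extended_size)
    p_final[:len(p_vocab)] = p_gen * p_vocab

    # Copy part: scatter attention mass onto each source token's id,
    # summing over repeated occurrences of the same word.
    for i, token_id in enumerate(src_ids):
        p_final[token_id] += (1.0 - p_gen) * attn[i]
    return p_final
```

Note that a source token outside the fixed vocabulary still receives probability through the scatter step; that is the copy path discussed below.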

There has been another implementation of the pointer-generator model, from another page. It shares the same concept and the same broad inputs (from the decoder and from the attention), but it is not what is implemented in the model 4 paper:

$$P(s_i = 1) = \sigma\left(v^{s} \cdot \left(W_h^s h_i + W_e^s E[o_{i-1}] + W_c^s c_i + b^s\right)\right)$$
where

  1. $h_i$: hidden state of the decoder (output of the decoder) → decoder parameter
  2. $E[o_{i-1}]$: embedding of the decoder output at the previous time step → decoder parameter
  3. $c_i$: attention-weighted context vector → attention inputs

$W_h^s$, $W_e^s$, $W_c^s$, $b^s$ and $v^s$ are the learnable parameters.

So simply by training this neural net, your model is able, for each word, to know whether to copy or to generate, according to the final output $p_{gen}$, which truly acts as a switch choosing between copying and generating.
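For comparison, a sketch of that alternative switch (assuming the formulation above; all names are illustrative). Structurally, the only difference from $p_{gen}$ is that the three projections produce a vector that is reduced to a scalar by $v^s$ before the sigmoid, rather than three dot products summed directly:

```python
import numpy as np

def switch_prob(h_i, e_prev, c_i, W_h, W_e, W_c, b_s, v_s):
    """Alternative generator/pointer switch: P(s_i = 1)."""
    # z = W_h h_i + W_e E[o_{i-1}] + W_c c_i + b_s   (a vector)
    z = W_h @ h_i + W_e @ e_prev + W_c @ c_i + b_s
    # reduce to a scalar with v_s, then squash to a probability
    return 1.0 / (1.0 + np.exp(-(v_s @ z)))
```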

@bhomass
Author

bhomass commented Jun 15, 2019

Hey, thanks for the thorough explanation. You are the most responsive author I have ever run into.

I have gone through both the theory and the code in detail. I am afraid I am still ahead of the comments you just made. Yes, pgen is responsible for deciding how much to rely on copying, and that decision is based on the contextual information drawn from the attention mechanism. This argument is correct when it comes to in-vocab words.

You may have missed my point that OOV words never get a vector representation and do NOT participate in context derivations, and therefore during test mode the model would be clueless as to when it should point to them. Is my point clear?

@bhomass
Author

bhomass commented Jun 16, 2019

OK, it appears the model knows to point to OOV words from the article without their vector representation. I imagine this is due to their position and the contextual support from neighboring words.

@theamrzaki
Owner

Yes, this comes from the attention distribution.

The model contains two parts:

  • one from the vocab representation (for creating a new word)

  • the other from the attention (which points to the position of the word), and this works even for words not found in the vocab representation

This can be seen from this equation, which contains the two parts:

$$P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t$$

When a word is out of vocab, $P_{vocab}(w) = 0$, so the attention part wins and the word is copied from the source document. The t subscript (time) denotes the position of the word, so this actually points to the exact word to be copied.
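To make the copy path concrete, here is a minimal sketch (illustrative names and numbers, not the repo's code) of how an OOV word is copied even though its encoder input is just [UNK]: the token's *position* keeps its own attention weight, and a temporary extended-vocab id at that position collects the copy probability.

```python
import numpy as np

VOCAB = {"[UNK]": 0, "the": 1, "model": 2, "uses": 3}

def extend_vocab(src_tokens):
    """Give each source OOV token a temporary id after the fixed vocab."""
    src_ids, oov_to_id = [], {}
    for tok in src_tokens:
        if tok in VOCAB:
            src_ids.append(VOCAB[tok])
        else:
            if tok not in oov_to_id:
                oov_to_id[tok] = len(VOCAB) + len(oov_to_id)
            src_ids.append(oov_to_id[tok])
    return src_ids, oov_to_id

# "transducer" is OOV: the encoder sees it as [UNK], but source
# position 3 in the attention distribution is still its own.
src = ["the", "model", "uses", "transducer"]
src_ids, oovs = extend_vocab(src)              # "transducer" -> id 4

attn = np.array([0.05, 0.05, 0.10, 0.80])      # attention at this step
p_vocab = np.array([0.10, 0.30, 0.40, 0.20])   # fixed-vocab softmax
p_gen = 0.2                                    # switch says: mostly copy

p_final = np.zeros(len(VOCAB) + len(oovs))
p_final[:len(VOCAB)] = p_gen * p_vocab
for i, tid in enumerate(src_ids):
    p_final[tid] += (1 - p_gen) * attn[i]

# id 4 ("transducer") receives (1 - 0.2) * 0.8 = 0.64 and wins the
# argmax purely from its position's attention weight.
print(p_final.argmax(), round(p_final[4], 2))  # -> 4 0.64
```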

@theamrzaki
Owner

I hope this made things clearer

@bhomass
Author

bhomass commented Jun 16, 2019

Hi theamrzaki, thanks for adding the comment. However, I don't think you got this one. t is the time step of the decoding process, so it points to the position of the output; i points to the position in the encoder sequence. It works out so that for any OOV word that shows up in the inferred summary, the attention mechanism has to give high weight to a particular input OOV word from the article (the right i). And that is exactly my question: given that the input OOV is substituted with [unk], it does not offer the semantic meaning of the original OOV word, so the model must leverage other information to accentuate that OOV word for the purpose of selecting it to point to.

And the "other" information, I am suggesting, would be the order of the OOV word and its neighboring, non-OOV, words.

@theamrzaki
Owner

You are correct that the t time step is for the decoder.

As for OOV words: when a word is OOV, the part of $P(w)$ that is responsible for creating a new word is zero, since $P_{vocab}(w)$ is zero, so all the focus comes from the other part, copying:

$$P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t$$

This is the way OOV words get copied, and it happens for each word (i.e. for each $P(w)$).
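As a worked example of that special case (numbers invented for illustration): suppose $p_{gen} = 0.3$ and the OOV word occupies a single source position with attention weight $a_i^t = 0.7$. Then

$$P(w) = 0.3 \cdot 0 + (1 - 0.3) \cdot 0.7 = 0.49,$$

while every in-vocab word has its $P_{vocab}$ mass scaled down by $p_{gen} = 0.3$, so the OOV word can easily win the argmax and be emitted in the summary.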

@bhomass
Author

bhomass commented Jun 16, 2019

It's been a good discussion. Thanks again.

@theamrzaki
Owner

I truly hope it has been useful, thank you
