[Enhancement] Vocab List in Dataloader #38

hzhwcmhf · 2019-01-11T07:48:11Z

To implement CopyNet (#8), the dataloader needs to change its behaviour.

In our view, there should be three vocab lists:

For model training: the smallest. It only includes words from the train set. Call it set $V.
For metrics: bigger. The model will be evaluated on this vocab list, which includes words from both the train set and the test set. Call it set $M. Almost all models can't generate words from $M-$V (words that only appear in the test set), because they haven't seen them during training. However, CopyNet can generate words from $M-$V through its copy mechanism, so these words must be taken into account when we implement metrics. For some models, $M-$V is represented by the UNK token. The dataloader has to translate the probability of that token into a uniform distribution over $M-$V.
The whole space of words, including words not seen anywhere in the data. Call it set $N. We don't care about the words in $N-$M and ignore them when evaluating models, as in [BUG] bug in trim_index of dataloader #37. $N-$M is the TRUE UNK.
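The three sets above can be illustrated with a minimal sketch. It is not the dataloader's actual API: `build_vocabs`, `word_log_prob`, and `perplexity` are hypothetical names, and the model is assumed to be a toy one that assigns a fixed probability to each word in $V plus a single UNK probability. Words that are in $M but not in $V share the UNK probability uniformly, and words outside $M (the TRUE UNK) are skipped.

```python
import math


def build_vocabs(train_tokens, test_tokens):
    """Return (V, M): V holds train-set words, M holds train-set and test-set words."""
    V = set(train_tokens)
    M = V | set(test_tokens)
    return V, M


def word_log_prob(word, probs, unk_prob, V, M):
    """Log-probability of one reference word under a toy model with an UNK token.

    probs    -- hypothetical mapping from each word in V to its probability
    unk_prob -- probability the model assigns to the UNK token
    Returns None for words outside M (the TRUE UNK), which metrics should ignore.
    """
    if word in V:
        return math.log(probs[word])
    if word in M:
        # Spread the UNK probability uniformly over the words in M but not in V.
        return math.log(unk_prob / len(M - V))
    return None


def perplexity(reference, probs, unk_prob, V, M):
    """Perplexity over the reference words that fall inside M."""
    logs = [word_log_prob(w, probs, unk_prob, V, M) for w in reference]
    logs = [lp for lp in logs if lp is not None]
    if not logs:
        raise ValueError("no reference word falls inside M")
    return math.exp(-sum(logs) / len(logs))


V, M = build_vocabs(["a", "b", "a"], ["a", "c", "d"])
probs = {"a": 0.5, "b": 0.3}  # assumed model output over V
unk_prob = 0.2                # assumed probability of the UNK token
ppl = perplexity(["a", "c", "zzz"], probs, unk_prob, V, M)
```

Here "c" is only in the test set, so it receives half of the UNK probability, while "zzz" is outside $M and does not contribute to the perplexity at all.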

Require:

hzhwcmhf · 2019-01-26T15:25:58Z

Done

hzhwcmhf added enhancement high priority not in progress, need attention labels Jan 11, 2019

hzhwcmhf added this to To do in Startup via automation Jan 11, 2019

hzhwcmhf changed the title ~~[Enhancement] Dataloader~~ [Enhancement] Vocab List in Dataloader Jan 11, 2019

This was referenced Jan 11, 2019

[Models] CopyNet #8

Closed

[Metric] perplexity considering UNK #26

Closed

hzhwcmhf added the pause label (Pause todo: will work on this, but not immediately) and removed the pause label Jan 17, 2019

hzhwcmhf mentioned this issue Jan 25, 2019

Invalid vocab #63

Merged

hzhwcmhf closed this as completed Jan 26, 2019

Startup automation moved this from To do to Done Jan 26, 2019
