[Enhancement] Vocab List in Dataloader #38

Closed
hzhwcmhf opened this issue Jan 11, 2019 · 1 comment
Labels
high priority not in progress, need attention todo will work on this, bot not immediately

Comments

@hzhwcmhf
Member

hzhwcmhf commented Jan 11, 2019

To implement CopyNet (#8), the dataloader needs to change its behaviour.

As we see it, there should be three vocab lists:

  • For model training: the smallest. It only includes words from the training set. Call it set $V.
  • For metrics: bigger. The model is evaluated on this vocab list, which includes words from both the training set and the test set. Call it set $M. Almost all models can't generate words from $M-$V, because they have never seen them. However, CopyNet can generate words from $M-$V through its copy mechanism, so these words must be taken into account when we implement metrics. Some models can only express $M-$V as the UNK token; the dataloader has to translate it into a uniform distribution over $M-$V.
  • The whole space of words, including those not seen anywhere in the data. Call it set $N. We don't care about the words in $N-$M and ignore them when evaluating models, as in [BUG] bug in trim_index of dataloader #37. $N-$M is the TRUE UNK.

Require:

  • Change the behavior of the dataloader and the metrics.
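
The three sets and the UNK translation described above can be sketched roughly as follows. This is only an illustration under assumptions, not the dataloader's actual code: the names `build_vocabs`, `word_region`, `expand_unk` and the `"<unk>"` token string are hypothetical, and the model output is assumed to be a plain dict of probabilities over $V plus UNK.

```python
UNK = "<unk>"  # hypothetical token string, chosen for illustration


def build_vocabs(train_tokens, test_tokens):
    """Return (v, m_only): v is $V (train words), m_only is $M-$V (test-only words)."""
    v = sorted(set(train_tokens))
    m_only = sorted(set(test_tokens) - set(v))
    return v, m_only


def word_region(word, v, m_only):
    """Say which region a word falls in: "V", "M-V", or "N-M" (the true UNK)."""
    if word in set(v):
        return "V"
    if word in set(m_only):
        return "M-V"
    return "N-M"


def expand_unk(probs_over_v, m_only):
    """Spread the UNK probability uniformly over $M-$V.

    probs_over_v maps every word of $V, plus UNK, to a probability.
    The result is a distribution over $M with no UNK entry.
    If $M-$V is empty, the UNK mass is simply dropped.
    """
    expanded = {w: p for w, p in probs_over_v.items() if w != UNK}
    unk_mass = probs_over_v.get(UNK, 0.0)
    if m_only:
        share = unk_mass / len(m_only)
        for w in m_only:
            expanded[w] = share
    return expanded


if __name__ == "__main__":
    v, m_only = build_vocabs(["a", "b", "a"], ["a", "c", "d"])
    model_probs = {"a": 0.5, "b": 0.3, UNK: 0.2}
    print(expand_unk(model_probs, m_only))
    print(word_region("z", v, m_only))
```

With this layout, a model that only emits UNK and a CopyNet-style model that copies a word from $M-$V can be scored by a metric over the same vocab list $M, while words in $N-$M are left out of the evaluation.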
@hzhwcmhf hzhwcmhf added enhancement high priority not in progress, need attention labels Jan 11, 2019
@hzhwcmhf hzhwcmhf added this to To do in Startup via automation Jan 11, 2019
@hzhwcmhf hzhwcmhf changed the title [Enhancement] Dataloader [Enhancement] Vocab List in Dataloader Jan 11, 2019
This was referenced Jan 11, 2019
@hzhwcmhf hzhwcmhf added pause Pause todo will work on this, bot not immediately and removed pause Pause labels Jan 17, 2019
@hzhwcmhf hzhwcmhf mentioned this issue Jan 25, 2019
@hzhwcmhf
Member Author

Done

Startup automation moved this from To do to Done Jan 26, 2019