Source/target pair text files; Icelandic parsing support; fixes #154
Thanks Villi. I'd be happy to merge, but there's one line that needs to go away, I think (as tokenizer is a module now, not a class).
vocab = text_encoder.SubwordTextEncoder(vocab_filepath)
return vocab

tokenizer = Tokenizer()
This line should be removed.
Yes, sorry, I missed this change in the final merge with upstream. Should be OK now.
Thanks!
My pleasure! :-)
I just found one small problem (source_vocab is used instead of target_vocab in wmt.tabbed_generator). I'll correct it in my next PR and release. It'd be great though to add an option to download the Icelandic parsing set instead of assuming it's there in a file. Is the set public?
This PR adds text data generators that read source and target sentences in pairs, separated by tabs (\t), from a single source file.
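For readers unfamiliar with the format, here is a minimal sketch of reading tab-separated source/target pairs from a single file. It is not the code from this PR: the function name tabbed_pairs is hypothetical, and skipping blank or tab-less lines is an assumption about how malformed input might be handled.

```python
import io


def tabbed_pairs(lines):
    """Yield (source, target) string pairs from tab-separated lines.

    Hypothetical helper for illustration; not the generator added in this PR.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines without a tab are skipped silently.
        if not line or "\t" not in line:
            continue
        # Split on the first tab only, so any later tabs stay in the target.
        source, target = line.split("\t", 1)
        yield source.strip(), target.strip()


# Usage: any iterable of lines works, including an open text file.
sample = io.StringIO("Hello world\tHalló heimur\n\nGood morning\tGóðan daginn\n")
pairs = list(tabbed_pairs(sample))
```

A real data generator would then encode each side with its own vocabulary, source text with the source vocabulary and target text with the target vocabulary.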
It also adds scaffolding to support parsing of Icelandic text, either from characters or from subword tokens. (More coming later on that front.)
A few smaller fixes are included, such as replacing magic constants to do with reserved token ids with more meaningful identifiers, and getting rid of annoying strings in decoder output trace messages.
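As an illustration of the magic-constant cleanup, the sketch below uses named identifiers for reserved token ids instead of bare integers. The names PAD_ID, EOS_ID and NUM_RESERVED_TOKENS, and the values assigned to them, are assumptions made for this example and may not match the identifiers used in the PR.

```python
# Hypothetical reserved ids; the values are assumptions for this sketch.
PAD_ID = 0               # padding
EOS_ID = 1               # end of sequence
NUM_RESERVED_TOKENS = 2  # ids below this value are reserved


def encode_with_eos(raw_ids):
    """Shift raw ids past the reserved range and append the EOS id."""
    return [i + NUM_RESERVED_TOKENS for i in raw_ids] + [EOS_ID]


def strip_padding(ids):
    """Drop padding ids from a decoded sequence."""
    return [i for i in ids if i != PAD_ID]


encoded = encode_with_eos([0, 3])
unpadded = strip_padding([4, 1, 0, 0])
```

Compared with writing `+ 2` or `!= 0` inline, the named constants make it clear which ids are reserved and why.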