Question about the Subword encoding and tokenization procedure #155
Comments
1. No. If get_or_generate_vocab is called without the
2. No.
3. The condition `if "en" in filepath` is true for all the currently present files (even non-English files contain "en"), so 350 KB are always used, multiplied by the number of files for a given language (a sketch of this check follows below).
4. Yes.
5. I don't know.
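A minimal sketch of the budgeted reading described in answer 3, assuming the 3.5e5/7e5 constants match what get_or_generate_vocab uses; `filepaths` and `str.split` are placeholders for t2t's actual corpus paths and its own tokenizer:

```python
# Sketch of the per-file byte budget discussed above; `filepaths` and the use
# of str.split are placeholders for t2t's actual corpus paths and tokenizer.
from collections import Counter

filepaths = ["tmp/train.en", "tmp/train.de"]  # hypothetical corpus files
token_counts = Counter()

for filepath in filepaths:
    with open(filepath) as source_file:
        # "en" occurs in every currently downloaded filepath, so the 3.5e5
        # branch is always taken in practice.
        file_byte_budget = 3.5e5 if "en" in filepath else 7e5
        for line in source_file:
            if file_byte_budget <= 0:
                break  # stop once ~350 KB of this file have been consumed
            line = line.strip()
            file_byte_budget -= len(line)
            token_counts.update(line.split())
```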
Noted. Thanks 🍡
It seems that the wmt_ende_32k experiment has already been changed to look for "vocab.endefr.32768" instead of "tokens.vocab.32768". Does that suggest that the vocab can now be shared across different tasks?
Yes, in 1.1.3 the Text2TextProblem class has a
What is the most convenient way to train a multilingual model for the WMT task, given the built-in language pairs?
@colmantse: The current default for the ende and enfr translation problems is to use the shared endefr vocabulary defined in tensor2tensor/data_generators/generator_utils.py (lines 323 to 329 at commit 8f3a7fd). So for training on any or all of those three languages, the most convenient way is to use the default. :-)
Ah okay, so I have to register a new problem like those in wmt.py, use the get_or_generate_vocab function, switch the language IDs, etc. Many thanks!
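A hedged sketch of that recipe; the class and property names follow the Text2TextProblem pattern mentioned earlier in the thread, but the exact signatures vary across t2t versions, and the sources list here is purely hypothetical:

```python
# Hypothetical custom translation problem; mirrors the pattern used by the
# problems in wmt.py, with approximate signatures.
from tensor2tensor.data_generators import generator_utils, problem
from tensor2tensor.utils import registry

# Hypothetical (url, filenames) source list for one extra language pair.
_MY_SOURCES = [("http://example.com/corpus-encs.tgz", ["train.en", "train.cs"])]

@registry.register_problem
class TranslateEncsCustom(problem.Text2TextProblem):

    @property
    def targeted_vocab_size(self):
        return 2**15  # 32768 subwords

    def generator(self, data_dir, tmp_dir, train):
        # Build (or reuse) a subword vocab restricted to the listed sources.
        symbolizer_vocab = generator_utils.get_or_generate_vocab(
            data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,
            sources=_MY_SOURCES)
        # Then yield {"inputs": ..., "targets": ...} examples encoded with
        # symbolizer_vocab, as the generators in wmt.py do.
        yield from ()  # placeholder for the real example loop
```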
What if I want to make a pure character model for a couple of language pairs?
For character-based models, there is no need to train a vocabulary; the default char vocabulary works for all Unicode positions, I believe. You just need to register a YOURLANG_CHR id in the SpaceID class.
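A hedged sketch of that char-level setup, assuming the `is_character_level` property and `SpaceID` constants work as the thread suggests; the integer id below is made up, and in practice you would add the constant to problem.SpaceID itself:

```python
# Hypothetical char-level problem; no vocab training is required because the
# default character encoder is used.
from tensor2tensor.data_generators import problem
from tensor2tensor.utils import registry

# In practice: add e.g. CS_CHR to the SpaceID class in problem.py.
problem.SpaceID.CS_CHR = 30  # hypothetical unused integer id

@registry.register_problem
class TranslateEncsChar(problem.Text2TextProblem):

    @property
    def is_character_level(self):
        return True  # skips subword vocab generation entirely

    @property
    def input_space_id(self):
        return problem.SpaceID.EN_CHR

    @property
    def target_space_id(self):
        return problem.SpaceID.CS_CHR
```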
Thanks, I read further into wmt.py. However, I haven't seen any sample of a multilingual problem: the input space and target space can each only register a single language. Furthermore, is it necessary for the model to learn about the input_space for a multilingual translation task? I thought Google only notified the network of the target space, letting the model infer the input space on its own? Thanks a lot!
The registered models are for one language pair only, but T2T allows you to combine them freely, e.g. by training one model on several registered problems at once.

BTW: The discussion has shifted from the original topic of this now-closed issue (subword encoding).
OK, opening up here: #235
TL;DR: Is a file budget of 3.5e5 (or 7e5) enough to read in enough tokens to build a reliable subword text tokenizer?
If I understand correctly, the steps followed to generate the training data are as follows:

1. Read up to a fixed byte budget (3.5e5, or 7e5 depending on the filepath check) from each training file.
2. Tokenize the text that was read and accumulate a token-count dictionary.
3. Build a subword text encoder of approximately the target vocabulary size from those token counts (see the sketch below).
4. Encode the full corpus with the resulting subword vocabulary.
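Step 3 is exposed through SubwordTextEncoder.build_to_target_size, which binary-searches the minimum token count; a sketch, with the search bounds taken from how generator_utils calls it and a placeholder `token_counts`:

```python
# Sketch of the vocab-building step (3): a binary search over the minimum
# token count, mirroring how generator_utils calls build_to_target_size.
from tensor2tensor.data_generators import text_encoder

token_counts = {"hello": 57, "world": 42}  # placeholder; built in step 2

vocab = text_encoder.SubwordTextEncoder.build_to_target_size(
    2**15,         # target vocabulary size, e.g. 32768
    token_counts,  # token -> count mapping from the sampled corpus
    1,             # min_val: lower bound for the minimum-count search
    1e3)           # max_val: upper bound for the minimum-count search
```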
When I ran this on my corpus (~5M lines in a single file), I noted that approximately 6,000-10,000 lines are read in and the "token count" dictionary contains about 20-30k entries.
Isn't this far too small a dictionary from which to learn a subword text encoder that can reliably represent and split words over an entire corpus?
I do understand that the subword text encoder will back off to characters, which doesn't hurt performance too badly, but shouldn't the dictionary contain a significantly larger number of words in order to train a good subword text encoder?
Note: I am trying to evaluate how the Transformer works in a multilingual, multiway scenario (5 sources and 5 targets). I have single files where all the source and target corpora have been merged, and each source sentence has a "<2tgt_lang>" token at the beginning. Thus only the top few lines (~6,000 in my case) are read in, and this covers only one of the 20 language pairs I want to learn. I realized that the simplest way to deal with this is to keep the corpora files (corresponding to each language pair) separate, so that the "token count" dictionary will include tokens from all the language pairs.
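A sketch of the preprocessing described in this note; the helper and file names are hypothetical and independent of t2t:

```python
# Hypothetical preprocessing: prepend a "<2tgt_lang>" token to every source
# sentence while keeping one file per language pair, so that the byte-budgeted
# vocab sampling sees tokens from every pair rather than just the first one.
def tag_source_file(src_path, tgt_lang, out_path):
    with open(src_path) as src, open(out_path, "w") as out:
        for line in src:
            out.write("<2%s> %s" % (tgt_lang, line))

for src_lang, tgt_lang in [("en", "de"), ("en", "fr")]:  # subset of the 20 pairs
    tag_source_file("train.%s%s.%s" % (src_lang, tgt_lang, src_lang),
                    tgt_lang,
                    "train.%s%s.tagged.%s" % (src_lang, tgt_lang, src_lang))
```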
Thanks in advance to anyone who helps me understand this.