
Question about the Subword encoding and tokenization procedure #155

Closed
prajdabre opened this issue Jul 14, 2017 · 12 comments

Comments

@prajdabre

TL;DR: Is a file budget of 3.5e5 (or 7e5) enough to read in enough tokens to build a reliable subword text tokenizer?

If I understand correctly, the steps followed to generate the training data are as follows:

  1. Download and uncompress all the datasets
  2. For each training text file (both source and target language files, assuming a shared vocabulary), read a certain number of lines (controlled by the "file budget", which is set to 3.5e5 in the case of English and 7e5 otherwise).
  3. Build a "token count" dictionary from the lines that were read in.
  4. Build a subword vocabulary (and hence a subword tokenizer) from the token-count dictionary (a rough sketch of this step follows the list).
  5. Merge all the datasets into a single collection (files ending with a .lang1 and a .lang2)
  6. Compile all the data into shards (10 by default) by processing the .lang1 and .lang2 files via the subword tokenizer.
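
If I read the code correctly, step 4 boils down to roughly the following (a sketch only; the real pipeline uses T2T's own tokenizer rather than a plain whitespace split, and the exact API may differ between versions):

from collections import Counter

from tensor2tensor.data_generators import text_encoder

# Sketch only: count tokens in the sampled text, then search for a subword
# vocabulary of roughly the requested size.  The sampled lines here are a
# stand-in for whatever was read in under the file budget.
sampled_lines = ["a small example sentence .", "another example line ."]
token_counts = Counter()
for line in sampled_lines:
  token_counts.update(line.strip().split())

subword_encoder = text_encoder.SubwordTextEncoder.build_to_target_size(
    2**15,          # target vocabulary size, e.g. 32768
    token_counts,
    1,              # lower bound of the count threshold to search over
    1000)           # upper bound of the count threshold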

When I ran this on my corpus (~5M lines in a single file), I noted that approximately 6000-10000 lines are read in and the "token count" dictionary contains about 20-30k entries.

Isn't this far too small a dictionary to learn a subword text encoder that can reliably represent and split words over the entire corpus?

I do understand that the subword text encoder will back off to characters, which doesn't hurt performance too badly, but shouldn't the dictionary contain a significantly larger number of words in order to train a good subword text encoder?

Note: I am trying to evaluate how the Transformer works in a multilingual, multiway scenario (5 sources and 5 targets). I have single files in which all the source and target corpora have been merged, and each source sentence has a "<2tgt_lang>" token at the beginning. Thus only the top few lines (~6000 in my case) are read in, and these cover only one of the 20 language pairs I want to learn. I realized that the simplest way to deal with this is to keep the corpora files (corresponding to each language pair) separate, so that the "token count" dictionary will include tokens from all the language pairs.
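
For reference, the tagging is nothing more than a token prepended to each source sentence (illustration only; the "<2xx>" format is just the convention I chose):

def tag_source(src_line, tgt_lang):
  # Prepend a target-language tag so a single model can serve all 20
  # directions, e.g. "<2de> How are you ?" for English-to-German.
  return "<2%s> %s" % (tgt_lang, src_line)

print(tag_source("How are you ?", "de"))  # -> <2de> How are you ?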

Thanks in advance to anyone who helps me understand this.

@martinpopel
Contributor

martinpopel commented Jul 18, 2017

For each training text file (both source and target language files, assuming shared vocabulary)

No. If get_or_generate_vocab is called without the sources parameter, all languages (listed in _DATA_FILE_URLS in generator_utils.py) are used. So e.g. wmt_ende_tokens_32k uses tokens.vocab.32768 trained not only on English and German, but also on French and Macedonian (so this vocabulary can be used in multi-language experiments).
In contrast, the author of setimes_mken_tokens_32k decided to use only English and Macedonian (and exclude German and French) when building tokens.vocab.32768.
The problem is that the filename is always the same (tokens.vocab.32768), but the content differs. So you should not run both ende and mken experiments in the same "workspace": the file is not re-generated if it already exists.
Of course, it would be better to fix this, e.g. by encoding at least the languages used into the filename (e.g. tokens.vocab-en-de-fr-mk.32k).

read a certain number of lines

No. file_byte_budget is a number of bytes (or perhaps Unicode code points in Python 3), not lines.

which is set to 3.5e5 in case of English and 7e5 otherwise

The condition if "en" in filepath is true for all the currently present files (even the non-English files have "en" somewhere in their path), so 350 KB is always used (multiplied by the number of files for a given language).
Perhaps the condition should be changed to if filepath.endswith("en").
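
Roughly, the relevant loop looks like this (a simplified sketch, not the exact code; the placeholder paths are only illustrative):

import collections

token_counts = collections.Counter()
training_files = ["corpus.de-en.en", "corpus.de-en.de"]  # illustrative paths

for filepath in training_files:
  # Current condition: a substring match, so any path containing "en"
  # (e.g. "corpus.de-en.de") gets the smaller 350 KB budget.
  file_byte_budget = 3.5e5 if "en" in filepath else 7e5
  # Proposed fix: decide by the language suffix instead.
  # file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
  with open(filepath) as source_file:
    for line in source_file:
      if file_byte_budget <= 0:
        break
      line = line.strip()
      file_byte_budget -= len(line)
      token_counts.update(line.split())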

the simplest way to deal with this is to keep the corpora files (corresponding to each language pair) separate and thus the "token count" dictionary will include tokens from all the language pairs.

Yes.

Is a file budget of 3.5e5 (or 7e5) enough to read in enough tokens to build a reliable subword text tokenizer?

I don't know.
For morphologically rich languages (or for CJK), I think 350 KB is not enough for a reasonable vocabulary, even if we need it only to find 32k wordpieces.

@prajdabre
Author

Noted. Thanks 🍡

@anglil

anglil commented Jul 28, 2017

It seems that the wmt_ende_32k experiment has already been changed to look for "vocab.endefr.32768" instead of "tokens.vocab.32768". Does that suggest that a workspace can now be shared across different tasks?

@lukaszkaiser
Contributor

Yes, in 1.1.3 the Text2TextProblem class has a vocab_name property that governs which vocabulary it creates and uses. That should make it easier to create new problems with different vocabularies.
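
For example, roughly (a sketch only; the data generator and other required methods are omitted, and the class and file names are illustrative):

from tensor2tensor.data_generators import problem
from tensor2tensor.utils import registry

@registry.register_problem
class TranslateMypairWmt32k(problem.Text2TextProblem):
  """Sketch: only the vocabulary-related properties are shown."""

  @property
  def vocab_name(self):
    # The generated vocabulary ends up as e.g. "vocab.mypair.32768"
    # in the data directory.
    return "vocab.mypair"

  @property
  def targeted_vocab_size(self):
    return 2**15  # 32768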

@colmantse

What is the most convenient way to train a multilingual model for the WMT task, given the built-in language pairs?

@martinpopel
Contributor

@colmantse: The current default for ende and enfr translation problems is to use vocab.endefr, that is a vocabulary trained on English, German and French (with French slightly over-represented). See


def get_or_generate_vocab(data_dir,
                          tmp_dir,
                          vocab_filename,
                          vocab_size,
                          sources=None):
  """Generate a vocabulary from the datasets in sources (_DATA_FILE_URLS)."""
  sources = sources or _DATA_FILE_URLS

So for training on any/all of those 3 languages, the most convenient way is to use the default. :-)
If more language pairs are to be trained in the multi-task model, one should explicitly list the sources as the last parameter to get_or_generate_vocab.
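
For example (a sketch only; the exact structure of each sources entry should be checked against _DATA_FILE_URLS, and the URL, file names and directories below are placeholders):

from tensor2tensor.data_generators import generator_utils

data_dir, tmp_dir = "/tmp/t2t_data", "/tmp/t2t_tmp"  # placeholders

# Each entry is assumed to mirror _DATA_FILE_URLS: a download URL plus the
# parallel files contained in the archive.
my_sources = [
    ("http://example.com/my-parallel-corpus.tgz",
     ("corpus.cs-en.en", "corpus.cs-en.cs")),
]

vocab = generator_utils.get_or_generate_vocab(
    data_dir, tmp_dir, "vocab.my_languages.32768", 2**15,
    sources=my_sources)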

@colmantse

Ah okay, so I have to register a new problem like those in wmt.py, use the get_or_generate_vocab function, switch the language IDs, etc.

Many thanks!

@colmantse

What if I want to make a pure character model for a couple of language pairs?

@martinpopel
Contributor

For character-based models there is no need to train a vocabulary; the default character vocabulary covers all Unicode code points, I believe. You just need to register a YOURLANG_CHR id in the SpaceID class.
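
SpaceID is just a plain class of integer constants, so registering a new id amounts to something like this (a sketch; the numeric value is purely illustrative and must not clash with the existing ids):

# Sketch only: SpaceID lives in tensor2tensor/data_generators/problem.py.
class SpaceID(object):
  # ... existing ids such as EN_CHR, CS_CHR ...
  MYLANG_CHR = 99  # hypothetical id for characters of your language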

@colmantse

Thanks, I read further into wmt.py. However, I haven't seen any sample of a multilingual problem. The input space and target space can only register a single language. Furthermore, is it necessary for the model to know the input_space for a multilingual translation task? I thought Google only notifies the network of the target space, letting the model learn the input space on its own?

Thanks a lot!

@property
def input_space_id(self):
  return problem.SpaceID.EN_CHR

@property
def target_space_id(self):
  return problem.SpaceID.CS_CHR

@martinpopel
Contributor

The registered models are for one language pair only. But T2T allows you to combine them freely, e.g.
--problems=translate_ende_wmt32k-translate_ende_wmt32k_rev-translate_enfr_wmt32k-translate_enfr_wmt32k_rev.

BTW: The discussion has shifted from the original topic of this now-closed issue (subword encoding).
It may be better (for other users) to keep the discussion tidy and open a new issue or chat for new questions.

@colmantse

colmantse commented Aug 18, 2017

OK, opening a new issue here: #235
