
Question about the Subword encoding and tokenization procedure #155

Closed
prajdabre opened this issue Jul 14, 2017 · 12 comments

Comments

@prajdabre

TL;DR: Is a file budget of 3.5e5 (or 7e5) enough to read in enough tokens to build a reliable subword text tokenizer?

If I understand correctly, the steps followed to generate the training data are as follows:

  1. Download and uncompress all the datasets
  2. For each training text file (both source and target language files, assuming a shared vocabulary), read a certain number of lines (controlled by the "file budget", which is set to 3.5e5 in the case of English and 7e5 otherwise).
  3. Build a "token count" dictionary from the lines that were read in.
  4. Build a subword vocabulary (and hence a subword tokenizer) from the token-count dictionary (a rough sketch of this step follows the list).
  5. Merge all the datasets into a single collection (files ending with a .lang1 and a .lang2)
  6. Compile all the data into shards (10 by default) by processing the .lang1 and .lang2 files via the subword tokenizer.
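
If I read the code correctly, step 4 boils down to roughly the following (a sketch only; the real pipeline uses T2T's own tokenizer rather than a plain whitespace split, and the exact API may differ between versions):

from collections import Counter

from tensor2tensor.data_generators import text_encoder

# Sketch only: count tokens in the sampled text, then search for a subword
# vocabulary of roughly the requested size.  The sampled lines here are a
# stand-in for whatever was read in under the file budget.
sampled_lines = ["a small example sentence .", "another example line ."]
token_counts = Counter()
for line in sampled_lines:
  token_counts.update(line.strip().split())

subword_encoder = text_encoder.SubwordTextEncoder.build_to_target_size(
    2**15,          # target vocabulary size, e.g. 32768
    token_counts,
    1,              # lower bound of the count threshold to search over
    1000)           # upper bound of the count threshold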

When I ran this on my corpus (~5M lines in a single file), I noted that approximately 6000-10000 lines are read in and the "token count" dictionary contains about 20-30k entries.

Isn't this far too small a dictionary to learn a subword text encoder that can reliably represent and split words over the entire corpus?

I do understand that the subword text encoder will back off to characters, which doesn't hurt performance too badly, but shouldn't the dictionary contain a significantly larger number of words in order to train a good subword text encoder?

Note: I am trying to evaluate how the Transformer works in a multilingual, multiway scenario (5 sources and 5 targets). I have single files in which all the source and target corpora have been merged, and each source sentence has a "<2tgt_lang>" token at the beginning. Thus only the top few lines (~6000 in my case) are read in, and these cover only one of the 20 language pairs I want to learn. I realized that the simplest way to deal with this is to keep the corpora files (corresponding to each language pair) separate, so that the "token count" dictionary will include tokens from all the language pairs.
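
For reference, the tagging is nothing more than a token prepended to each source sentence (illustration only; the "<2xx>" format is just the convention I chose):

def tag_source(src_line, tgt_lang):
  # Prepend a target-language tag so a single model can serve all 20
  # directions, e.g. "<2de> How are you ?" for English-to-German.
  return "<2%s> %s" % (tgt_lang, src_line)

print(tag_source("How are you ?", "de"))  # -> <2de> How are you ?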

Thanks in advance to anyone who helps me understand this.

@martinpopel
Contributor

martinpopel commented Jul 18, 2017

For each training text file (both source and target language files, assuming shared vocabulary)

No. If get_or_generate_vocab is called without the sources parameter, all languages (listed in _DATA_FILE_URLS in generator_utils.py) are used. So e.g. wmt_ende_tokens_32k uses tokens.vocab.32768 trained not only on English and German, but also on French and Macedonian (so this vocabulary can be used in multi-language experiments).
In contrast, the author of setimes_mken_tokens_32k decided to use only English and Macedonian (and exclude German and French) when building tokens.vocab.32768.
The problem is that the filename is always the same (tokens.vocab.32768), but the content differs. So you should not run both ende and mken experiments in the same "workspace": the file is not re-generated if it already exists.
Of course, it would be better to fix this, e.g. by encoding at least the languages used into the filename (e.g. tokens.vocab-en-de-fr-mk.32k).

read a certain number of lines

No. file_byte_budget is a number of bytes (or perhaps Unicode code points in Python 3), not lines.

which is set to 3.5e5 in case of English and 7e5 otherwise

The condition if "en" in filepath is true for all the currently present files (even the non-English files have "en" somewhere in their path), so 350 KB is always used (multiplied by the number of files for a given language).
Perhaps the condition should be changed to if filepath.endswith("en").
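
Roughly, the relevant loop looks like this (a simplified sketch, not the exact code; the placeholder paths are only illustrative):

import collections

token_counts = collections.Counter()
training_files = ["corpus.de-en.en", "corpus.de-en.de"]  # illustrative paths

for filepath in training_files:
  # Current condition: a substring match, so any path containing "en"
  # (e.g. "corpus.de-en.de") gets the smaller 350 KB budget.
  file_byte_budget = 3.5e5 if "en" in filepath else 7e5
  # Proposed fix: decide by the language suffix instead.
  # file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
  with open(filepath) as source_file:
    for line in source_file:
      if file_byte_budget <= 0:
        break
      line = line.strip()
      file_byte_budget -= len(line)
      token_counts.update(line.split())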

the simplest way to deal with this is to keep the corpora files (corresponding to each language pair) separate and thus the "token count" dictionary will include tokens from all the language pairs.

Yes.

Is a file budget of 3.5e5 (or 7e5) enough to read in enough tokens to build a reliable subword text tokenizer?

I don't know.
For morphologically rich languages (or for CJK), I think 350 KB is not enough for a reasonable vocabulary, even if we need it only to find 32k wordpieces.

@prajdabre
Author

Noted. Thanks 🍡

@anglil

anglil commented Jul 28, 2017

It seems that the wmt_ende_32k experiment has already been changed to look for "vocab.endefr.32768" instead of "tokens.vocab.32768". Does that suggest that a workspace can now be shared across different tasks?

@lukaszkaiser
Contributor

Yes, in 1.1.3 the Text2TextProblem class has a vocab_name property that governs which vocabulary it creates and uses. That should make it easier to create new problems with different vocabularies.
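
For example, roughly (a sketch only; the data generator and other required methods are omitted, and the class and file names are illustrative):

from tensor2tensor.data_generators import problem
from tensor2tensor.utils import registry

@registry.register_problem
class TranslateMypairWmt32k(problem.Text2TextProblem):
  """Sketch: only the vocabulary-related properties are shown."""

  @property
  def vocab_name(self):
    # The generated vocabulary ends up as e.g. "vocab.mypair.32768"
    # in the data directory.
    return "vocab.mypair"

  @property
  def targeted_vocab_size(self):
    return 2**15  # 32768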

@colmantse

What is the most convenient way to train a multilingual model for the WMT task, given the built-in language pairs?

@martinpopel
Contributor

@colmantse: The current default for ende and enfr translation problems is to use vocab.endefr, that is a vocabulary trained on English, German and French (with French slightly over-represented). See


def get_or_generate_vocab(data_dir,
                          tmp_dir,
                          vocab_filename,
                          vocab_size,
                          sources=None):
  """Generate a vocabulary from the datasets in sources (_DATA_FILE_URLS)."""
  sources = sources or _DATA_FILE_URLS

So for training on any/all of those 3 languages, the most convenient way is to use the default. :-)
If more language pairs are to be trained in the multi-task model, one should explicitly list the sources as the last parameter to get_or_generate_vocab.
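
For example (a sketch only; the exact structure of each sources entry should be checked against _DATA_FILE_URLS, and the URL, file names and directories below are placeholders):

from tensor2tensor.data_generators import generator_utils

data_dir, tmp_dir = "/tmp/t2t_data", "/tmp/t2t_tmp"  # placeholders

# Each entry is assumed to mirror _DATA_FILE_URLS: a download URL plus the
# parallel files contained in the archive.
my_sources = [
    ("http://example.com/my-parallel-corpus.tgz",
     ("corpus.cs-en.en", "corpus.cs-en.cs")),
]

vocab = generator_utils.get_or_generate_vocab(
    data_dir, tmp_dir, "vocab.my_languages.32768", 2**15,
    sources=my_sources)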

@colmantse

Ah okay, so I have to register a new problem like those in wmt.py, use the get_or_generate_vocab function, switch the language IDs, etc.

Many thanks!

@colmantse

What if I want to make a pure character model for a couple of language pairs?

@martinpopel
Contributor

For character-based models there is no need to train a vocabulary; the default character vocabulary covers all Unicode code points, I believe. You just need to register a YOURLANG_CHR id in the SpaceID class.
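
SpaceID is just a plain class of integer constants, so registering a new id amounts to something like this (a sketch; the numeric value is purely illustrative and must not clash with the existing ids):

# Sketch only: SpaceID lives in tensor2tensor/data_generators/problem.py.
class SpaceID(object):
  # ... existing ids such as EN_CHR, CS_CHR ...
  MYLANG_CHR = 99  # hypothetical id for characters of your language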

@colmantse

Thanks, I read further into wmt.py. However, I haven't seen any sample of a multilingual problem. The input space and target space can only register a single language. Furthermore, is it necessary for the model to know the input_space for a multilingual translation task? I thought Google only notifies the network of the target space, letting the model learn the input space on its own?

Thanks a lot!

@property
def input_space_id(self):
  return problem.SpaceID.EN_CHR

@property
def target_space_id(self):
  return problem.SpaceID.CS_CHR

@martinpopel
Contributor

The registered models are for one language pair only. But T2T allows you to combine them freely, e.g.
--problems=translate_ende_wmt32k-translate_ende_wmt32k_rev-translate_enfr_wmt32k-translate_enfr_wmt32k_rev.

BTW: The discussion has shifted from the original topic of this now-closed issue (subword encoding).
It may be better (for other users) to keep the discussion tidy and open a new issue or chat for new questions.

@colmantse

colmantse commented Aug 18, 2017

OK, opening a new issue here: #235
