
Conversation

@vthorsteinsson (Contributor) commented Jun 28, 2017

This PR updates the Tokenizer and the SubwordTextEncoder to be Unicode-based instead of UTF-8 based.

This means that the full Unicode set of punctuation and separator characters is recognized when tokenizing the text, and that wordpieces are always formed at bona fide Unicode character boundaries. The former improvement helps to clean up the vocabulary and make it more efficient, since Unicode punctuation such as proper double quotes and em dashes is no longer treated as part of words, while the latter avoids potential errors where wordpieces could start or end inside a UTF-8 multi-byte sequence encoding a single character.
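
As a rough illustration of the idea (a minimal sketch, not the actual tensor2tensor code; the function names are hypothetical), tokenization can split on Unicode general categories rather than on raw byte values:

```python
# Sketch: split text on Unicode punctuation (P*) and separator (Z*) categories
# instead of byte-level heuristics. Not the tensor2tensor implementation.
import unicodedata

def is_separator(ch):
    # First letter of the Unicode general category: P = punctuation, Z = separator.
    return unicodedata.category(ch)[0] in ("P", "Z")

def simple_unicode_tokenize(text):
    tokens, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

# Curly quotes and the em dash are recognized as punctuation, not word characters:
print(simple_unicode_tokenize(u"\u201cHello\u2014world\u201d"))  # ['Hello', 'world']
```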

The drawback is that it is not feasible to include the entire set of Unicode code points in the vocabulary (in contrast with including all code points 0..255 for UTF-8). This is mitigated by careful coding to make sure that all Unicode input appearing in the tokenizer training set is represented in the vocabulary; if Unicode code points that were not seen during training later appear during inference, they are mapped to the Unicode REPLACEMENT CHARACTER (U+FFFD). This should not impact performance in any meaningful way.
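
A hedged sketch of this fallback (the `alphabet` set and helper below are hypothetical, for illustration only):

```python
# Sketch: characters outside the alphabet collected at vocabulary-generation
# time are mapped to U+FFFD before encoding. Not the tensor2tensor code.
REPLACEMENT_CHARACTER = u"\uFFFD"

def normalize_to_alphabet(token, alphabet):
    return "".join(ch if ch in alphabet else REPLACEMENT_CHARACTER for ch in token)

alphabet = set("abcdefghijklmnopqrstuvwxyz")
print(normalize_to_alphabet(u"na\u00efve", alphabet))  # 'na\ufffdve' ('ï' is not in the alphabet)
```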

I believe this PR is especially beneficial for processing of text in non-ISO-Latin-1 languages that rely heavily on Unicode, such as Chinese, Hindi, Turkish, etc.

Secondarily, the PR contains a couple of performance enhancements, such as writing a vocabulary file only at the end of the binary search for an optimal vocabulary rather than for every intermediate result (which could previously leave a non-optimal vocabulary file as the last one written).
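
The general shape of that change, as a sketch with hypothetical helpers (`build_vocab`, `write_vocab`), is: binary-search the minimum token-count threshold and write the vocabulary file once, after the search converges:

```python
# Sketch only: binary search over a min-count threshold, deferring the file
# write until the search is done. Helper functions are assumed, not real APIs.
def search_and_write_vocab(token_counts, target_size, path,
                           build_vocab, write_vocab, lo=1, hi=1000):
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        vocab = build_vocab(token_counts, min_count=mid)
        if len(vocab) > target_size:
            lo = mid + 1          # vocabulary too large: raise the threshold
        else:
            best = vocab          # feasible: remember it, try a lower threshold
            hi = mid - 1
    if best is None:              # nothing fit under the target: keep the last attempt
        best = vocab
    write_vocab(best, path)       # single write at the end, not per iteration
    return best
```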

This PR has been tested and works (AFAIK :-) ) under both Python 2 and Python 3.

@vthorsteinsson (Contributor, PR author) commented:

I think I found a bug in this PR - please hold off on reviewing it until I've pushed an update.

@vthorsteinsson (Contributor, PR author) commented:

Solved the bug and did some more cleanup and optimization. The underscore that denotes word boundaries is now back as a trailing character on the last wordpiece of each word. (I'm still not convinced that this is necessarily better than having it as a separate id, but will investigate that further, outside this PR.) When generating the vocabulary, SubwordTextEncoder now first collects its alphabet, i.e. the set of Unicode characters that appear in the token stream. This alphabet is always included in the vocabulary as single-character tokens, so it acts as a catch-all even when encoding training or inference text that was not seen during vocabulary generation. (In the case of EN-DE, the alphabet has 224 characters.)
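
A minimal sketch of that alphabet-collection step (assuming a hypothetical `token_counts` mapping of token to count; not the actual SubwordTextEncoder code):

```python
# Sketch: gather every Unicode character seen in the token stream so each one
# can be guaranteed a single-character entry in the vocabulary.
from collections import Counter

def collect_alphabet(token_counts):
    alphabet = set()
    for token in token_counts:
        alphabet.update(token)   # add each character of the token
    return alphabet

token_counts = Counter({u"Stra\u00dfe": 3, u"quote\u201d": 1})
print(sorted(collect_alphabet(token_counts)))
```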

Please review and let me know what you think.

@lukaszkaiser (Contributor) left a comment


The code looks good, thanks Villi!

@lukaszkaiser (Contributor) commented:

This is wonderful, great to have it corrected. It would be useful to add more tests to make sure this functionality doesn't regress over time (e.g., from someone changing it back or forgetting something later). But I'll leave that for a future PR; merging now.

@lukaszkaiser lukaszkaiser merged commit af235c1 into tensorflow:master Jun 29, 2017
@vthorsteinsson vthorsteinsson deleted the fix-tokens branch June 29, 2017 17:30