
Conversation

@vthorsteinsson (Contributor) commented Jun 28, 2017

This PR updates the Tokenizer and the SubwordTextEncoder to be Unicode-based instead of UTF-8 based.

This means that the full Unicode set of punctuation and separator characters is recognized when tokenizing the text, and that wordpieces are always formed at bona fide Unicode character boundaries. The former improvement helps to clean up the vocabulary and make it more efficient, since Unicode punctuation such as proper double quotes and em dashes is no longer treated as part of words, while the latter avoids potential errors where wordpieces could start or end inside a UTF-8 multi-byte sequence encoding a single character.
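
As a rough illustration of the idea (a minimal sketch, not the actual tensor2tensor code; the function names are hypothetical), tokenization can split on Unicode general categories rather than on raw byte values:

```python
# Sketch: split text on Unicode punctuation (P*) and separator (Z*) categories
# instead of byte-level heuristics. Not the tensor2tensor implementation.
import unicodedata

def is_separator(ch):
    # First letter of the Unicode general category: P = punctuation, Z = separator.
    return unicodedata.category(ch)[0] in ("P", "Z")

def simple_unicode_tokenize(text):
    tokens, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

# Curly quotes and the em dash are recognized as punctuation, not word characters:
print(simple_unicode_tokenize(u"\u201cHello\u2014world\u201d"))  # ['Hello', 'world']
```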

The drawback is that it is not feasible to include the entire set of Unicode code points in the vocabulary (in contrast with including all code points 0..255 for UTF-8). This is mitigated by careful coding to make sure that all Unicode input appearing in the tokenizer training set is represented in the vocabulary; if Unicode code points that were not seen during training later appear during inference, they are mapped to the Unicode REPLACEMENT CHARACTER (U+FFFD). This should not impact performance in any meaningful way.
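
A hedged sketch of this fallback (the `alphabet` set and helper below are hypothetical, for illustration only):

```python
# Sketch: characters outside the alphabet collected at vocabulary-generation
# time are mapped to U+FFFD before encoding. Not the tensor2tensor code.
REPLACEMENT_CHARACTER = u"\uFFFD"

def normalize_to_alphabet(token, alphabet):
    return "".join(ch if ch in alphabet else REPLACEMENT_CHARACTER for ch in token)

alphabet = set("abcdefghijklmnopqrstuvwxyz")
print(normalize_to_alphabet(u"na\u00efve", alphabet))  # 'na\ufffdve' ('ï' is not in the alphabet)
```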

I believe this PR is especially beneficial for processing of text in non-ISO-Latin-1 languages that rely heavily on Unicode, such as Chinese, Hindi, Turkish, etc.

Secondarily, the PR contains a couple of performance enhancements, such as writing a vocabulary file only at the end of the binary search for an optimal vocabulary rather than for every intermediate result (which could previously leave a non-optimal vocabulary file as the last one written).
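
The general shape of that change, as a sketch with hypothetical helpers (`build_vocab`, `write_vocab`), is: binary-search the minimum token-count threshold and write the vocabulary file once, after the search converges:

```python
# Sketch only: binary search over a min-count threshold, deferring the file
# write until the search is done. Helper functions are assumed, not real APIs.
def search_and_write_vocab(token_counts, target_size, path,
                           build_vocab, write_vocab, lo=1, hi=1000):
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        vocab = build_vocab(token_counts, min_count=mid)
        if len(vocab) > target_size:
            lo = mid + 1          # vocabulary too large: raise the threshold
        else:
            best = vocab          # feasible: remember it, try a lower threshold
            hi = mid - 1
    if best is None:              # nothing fit under the target: keep the last attempt
        best = vocab
    write_vocab(best, path)       # single write at the end, not per iteration
    return best
```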

This PR has been tested and works (AFAIK :-) ) under both Python 2 and Python 3.

@vthorsteinsson (Contributor, PR author) commented:

I think I found a bug in this PR - please hold off on reviewing it until I've pushed an update.

@vthorsteinsson (Contributor, PR author) commented:

Solved the bug and did some more cleanup and optimization. The underscore that denotes word boundaries is now back as a trailing character on the last wordpiece of each word. (I'm still not convinced that this is necessarily better than having it as a separate id, but will investigate that further, outside this PR.) When generating the vocabulary, SubwordTextEncoder now first collects its alphabet, i.e. the set of Unicode characters that appear in the token stream. This alphabet is always included in the vocabulary as single-character tokens, so it acts as a catch-all even when encoding training or inference text that was not seen during vocabulary generation. (In the case of EN-DE, the alphabet has 224 characters.)
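
A minimal sketch of that alphabet-collection step (assuming a hypothetical `token_counts` mapping of token to count; not the actual SubwordTextEncoder code):

```python
# Sketch: gather every Unicode character seen in the token stream so each one
# can be guaranteed a single-character entry in the vocabulary.
from collections import Counter

def collect_alphabet(token_counts):
    alphabet = set()
    for token in token_counts:
        alphabet.update(token)   # add each character of the token
    return alphabet

token_counts = Counter({u"Stra\u00dfe": 3, u"quote\u201d": 1})
print(sorted(collect_alphabet(token_counts)))
```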

Please review and let me know what you think.

@lukaszkaiser (Contributor) left a comment


The code looks good, thanks Villi!

@lukaszkaiser (Contributor) commented:

This is wonderful, great to have it corrected. It would be useful to add more tests to make sure this functionality doesn't regress over time (e.g., from someone changing it back or forgetting something later). But I'll leave that for a future PR; merging now.

@lukaszkaiser lukaszkaiser merged commit af235c1 into tensorflow:master Jun 29, 2017
@vthorsteinsson vthorsteinsson deleted the fix-tokens branch June 29, 2017 17:30