Unicode in SubwordTextEncoder #66
Conversation
I think I found a bug in this PR - please hold off on reviewing it until I've pushed an update.
Solved the bug and did some more cleanup and optimization. The underscore that denotes word boundaries is now back as a trailing character on the last wordpiece of each word. (I'm still not convinced that this is necessarily better than having it as a separate id, but will investigate that further outside this PR.) When generating the vocabulary, … Please review and let me know what you think.
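For readers unfamiliar with the convention, here is a minimal, made-up illustration of trailing-underscore wordpieces and how they reassemble into text; this is not code from the PR, and the piece values and helper name are hypothetical.

```python
# Illustration only: wordpieces where the last piece of each word carries a
# trailing "_" to mark the word boundary.
pieces = ["hel", "lo_", "wor", "ld_"]

def pieces_to_text(pieces):
    """Rejoin wordpieces, turning each trailing '_' back into a space."""
    text = "".join(pieces)
    return text.replace("_", " ").rstrip()

print(pieces_to_text(pieces))  # -> "hello world"
```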
lukaszkaiser
left a comment
The code looks good, thanks Villi!
This is wonderful, great to have it corrected. It could be useful to add more tests to make sure this functionality doesn't degrade over time (e.g., someone changing it back or forgetting something later). But I'll leave that for a future PR, merging now.
This PR updates the Tokenizer and the SubwordTextEncoder to be Unicode-based instead of UTF-8-based.
This means that the full Unicode set of punctuation and separator characters is recognized when tokenizing the text, and that wordpieces are always formed at bona fide Unicode character boundaries. The former improvement helps to clean up the vocabulary and make it more efficient, since Unicode punctuation such as proper double quotes and em dashes is no longer treated as part of words, while the latter avoids potential errors when wordpieces could otherwise start or end inside a UTF-8 multi-byte sequence denoting a single character.
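To make the category-based splitting concrete, here is a rough, self-contained sketch of tokenizing at Unicode character boundaries using `unicodedata`; it is illustrative only and does not reproduce the PR's Tokenizer exactly.

```python
# -*- coding: utf-8 -*-
# Sketch: split text at transitions between alphanumeric and
# non-alphanumeric Unicode characters, so Unicode punctuation such as
# curly quotes and em dashes is never glued onto a word.
import unicodedata

def is_alnum(ch):
    # Unicode general categories: 'L*' = letters, 'N*' = numbers.
    return unicodedata.category(ch)[0] in ("L", "N")

def simple_tokenize(text):
    tokens, current, prev_alnum = [], [], None
    for ch in text:
        alnum = is_alnum(ch)
        if prev_alnum is not None and alnum != prev_alnum:
            tokens.append("".join(current))
            current = []
        current.append(ch)
        prev_alnum = alnum
    if current:
        tokens.append("".join(current))
    return tokens

print(simple_tokenize(u"“Hello”—world"))
# -> ['“', 'Hello', '”—', 'world']
```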
The drawback is that it is not feasible to include the entire set of Unicode code points in the vocabulary (in contrast with including all code points 0..255 for UTF-8). This is mitigated by careful coding to make sure that all Unicode input appearing in the tokenizer training set is represented in the vocabulary; if Unicode code points later appear during inference that were not seen during training, they are mapped to the Unicode REPLACEMENT CHARACTER (U+FFFD). This should not impact performance in any meaningful way.
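A minimal sketch of the fallback described above, with hypothetical names: characters outside the alphabet collected at training time are replaced with U+FFFD before lookup, so encoding never fails on unseen code points.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper (not the PR's actual API): map any character that was
# not seen while building the vocabulary to the replacement character.
REPLACEMENT_CHAR = u"\uFFFD"

def normalize_to_alphabet(text, alphabet):
    """Replace every character outside the training-time alphabet."""
    return u"".join(ch if ch in alphabet else REPLACEMENT_CHAR for ch in text)

alphabet = set(u"abcdefghijklmnopqrstuvwxyz _") | {REPLACEMENT_CHAR}
print(normalize_to_alphabet(u"héllo", alphabet))  # -> u"h\ufffdllo"
```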
I believe this PR is especially beneficial for processing text in non-ISO-Latin-1 languages that rely heavily on Unicode, such as Chinese, Hindi, Turkish, etc.
Secondarily, the PR contains a couple of performance enhancements, such as only writing a vocabulary file at the end of the binary search for an optimal vocabulary, rather than for every intermediate result (which could in fact lead to a suboptimal vocabulary file being the last one written).
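For illustration, a hypothetical outline of such a binary search that writes the vocabulary file only once, after the search converges; `build_from_token_counts` and `write_vocab_file` are assumed helpers, not the PR's actual API.

```python
def build_vocab(token_counts, target_size, store_filename,
                min_count=1, max_count=1000):
    """Binary-search the minimum token count that yields roughly target_size
    subwords, and write the vocabulary file a single time at the end."""
    best = None
    lo, hi = min_count, max_count
    while lo <= hi:
        mid = (lo + hi) // 2
        vocab = build_from_token_counts(token_counts, mid)  # assumed helper
        if len(vocab) > target_size:
            lo = mid + 1              # too many subwords: raise the cutoff
        else:
            best, hi = vocab, mid - 1  # remember the best fit so far
    if best is not None:
        write_vocab_file(best, store_filename)  # single write, at the end
    return best
```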
This PR has been tested and works (AFAIK :-)) under both Python 2 and Python 3.