This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Description
This may not be directly related to tensor2tensor, but I am curious to what extent NER could improve translation quality in general. Here are some examples where NER could apply.
- numbers, especially long ones, e.g., 100,000,000. To my understanding, such a number would be split into a series of tokens, each being a single digit. If it could be replaced by a special token such as _number, its length would be reduced to one token.
- dates, for example, 2 October 2018 would also become one token if converted properly.
The benefit of doing this is that sentences become shorter and therefore have a simpler structure. On the other hand, there are also downsides. For example, it relies on a good NER system. It may also cause trouble in post-processing, when the NEs have to be restored after translation into another language; one such difficulty could be keeping track of the order of the NEs relative to the source sentence.
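To make the idea concrete, here is a minimal sketch of what I have in mind. It is not anything tensor2tensor provides: the regular expressions are only a stand-in for a real NER system, and the names `mask_entities`, `restore_entities`, and the `_date` placeholder are hypothetical (only `_number` appears above). The restore step assumes the translation model copies the placeholders through unchanged.

```python
import re

# Hypothetical patterns standing in for a real NER system.
# The date alternative comes first so "2 October 2018" is not
# split into two separate numbers.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
ENTITY = re.compile(
    rf"(?P<date>\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b)"
    r"|(?P<number>\b\d+(?:,\d{3})*(?:\.\d+)?\b)"
)
PLACEHOLDER = re.compile(r"_(date|number)\b")


def mask_entities(sentence):
    """Replace each date/number with a placeholder; return text and entities."""
    entities = []

    def replace(match):
        kind = match.lastgroup  # "date" or "number"
        entities.append((kind, match.group(0)))
        return f"_{kind}"

    return ENTITY.sub(replace, sentence), entities


def restore_entities(translated, entities):
    """Put the original strings back into the translated sentence.

    Placeholders are matched by type, so a date and a number may swap
    positions. Two entities of the same type are restored in source
    order, which is exactly the ordering assumption that may not hold.
    """
    remaining = list(entities)

    def replace(match):
        kind = match.group(1)
        for i, (entity_kind, _) in enumerate(remaining):
            if entity_kind == kind:
                return remaining.pop(i)[1]
        return match.group(0)  # no entity left for this placeholder

    return PLACEHOLDER.sub(replace, translated)


masked, entities = mask_entities(
    "The fund raised 100,000,000 dollars on 2 October 2018."
)
# Stand-in for the translated output, with the placeholders reordered.
restored = restore_entities("On _date the fund raised _number dollars.", entities)
```

Here `masked` is "The fund raised _number dollars on _date.", and the restore step still works after the reordering because the two placeholders have different types. With two numbers in one sentence it would silently rely on source order, so indexed placeholders (e.g. a hypothetical `_number1`, `_number2`) might be needed, at the cost of a larger vocabulary.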
Any comments or suggestions?