Adds a tokenization postprocessor for manual token cleanup #1290
Description
Adds:
- a `tokenize_postprocessor` config element to `stanza.Pipeline`, which allows a callable to be passed that can override the default tokenization
- `stanza.models.tokenization.utils.reassemble_doc_from_tokens`, which allows a document to be reassembled from re-adjusted tokenizations
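As a rough illustration of the kind of callable this option accepts, here is a minimal sketch of a postprocessor that glues hyphenated words back together. The function names `token_text` and `merge_hyphenated` and the sample tokens are hypothetical and not part of this PR; the sketch assumes only the input format described here (a list of sentences, each a list of tokens, where a token is either a string or a `(text, bool)` tuple flagging an MWT).

```python
from typing import List, Tuple, Union

# A token is either a plain string or a (text, is_mwt) tuple.
Token = Union[str, Tuple[str, bool]]


def token_text(token: Token) -> str:
    """Return the surface text of a token, with or without an MWT flag."""
    return token[0] if isinstance(token, tuple) else token


def merge_hyphenated(sentences: List[List[Token]]) -> List[List[Token]]:
    """Hypothetical postprocessor: rejoin 'a', '-', 'b' into a single 'a-b' token.

    Note: a merged token is emitted as a plain string, so any MWT flag on
    the pieces being merged is dropped. Other tokens pass through untouched.
    """
    result: List[List[Token]] = []
    for sentence in sentences:
        merged: List[Token] = []
        i = 0
        while i < len(sentence):
            # A hyphen counts as "inner" only if it has a token on both sides.
            is_inner_hyphen = (
                token_text(sentence[i]) == "-"
                and len(merged) > 0
                and i + 1 < len(sentence)
            )
            if is_inner_hyphen:
                merged[-1] = token_text(merged[-1]) + "-" + token_text(sentence[i + 1])
                i += 2
            else:
                merged.append(sentence[i])
                i += 1
        result.append(merged)
    return result


# Illustrative input: one sentence, with "dai" flagged as an MWT.
sample = [["Notizie", ("dai", True), "paesi", "nord", "-", "europei", "."]]
cleaned = merge_hyphenated(sample)
# cleaned == [["Notizie", ("dai", True), "paesi", "nord-europei", "."]]
```

Under the behaviour described in this PR, a function like this would be supplied as the `tokenize_postprocessor` argument when constructing `stanza.Pipeline`.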
The single argument passed to `tokenize_postprocessor` is a list of lists of strings: one inner list per sentence, containing that sentence's word tokens. To mark an MWT, each element can optionally be a tuple whose second element is a Bool. For instance, if the word `dai` in a sentence is marked as an MWT, the callable is passed `('dai', True)` in place of the plain string.

The callable passed to `tokenize_postprocessor` must return a list of lists in the same format, adjusting word tokenization, sentence tokenization, and MWT designations as needed.

Unit test coverage
- `reassemble_doc_from_tokens` function tested against normal and OOV characters