Adds a tokenization postprocessor for manual token cleanup #1290

Jemoka · 2023-10-02T05:48:54Z

Description

Creates a tokenize_postprocessor config element for stanza.Pipeline, which accepts a callable that can override the default tokenization
Creates stanza.models.tokenization.utils.reassemble_doc_from_tokens, which reassembles a document from the adjusted tokenization

For instance:

import stanza

# do_stuff_to_draft stands in for any user-defined callable
nlp = stanza.Pipeline(lang="en", processors="tokenize",
                      tokenize_postprocessor=lambda draft: do_stuff_to_draft(draft))

The single argument passed to tokenize_postprocessor is a list of lists of strings: one inner list per sentence, each holding that sentence's word tokens:

[['Joe', 'Smith', 'lives', 'in', 'California', '.'], 
 ['Joe', "'s", 'favorite', 'food', 'is', 'pizza', '.'], 
 ['He', 'enjoys', 'going', 'to', 'the', 'beach', '.']]

To mark an MWT, an element can instead be a tuple whose second element is a bool. If the word dai in the following sentence is marked as an MWT, for instance, the callable will be passed:

[['Diglielo', ('dai', True), 'venire', 'a', 'mangiare', 'un', "po'", 'di', 'margarina']]

The callable passed to tokenize_postprocessor must return a list of lists in the same format, adjusting the word tokenization, sentence tokenization, and MWT designations as needed.
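As a rough illustration of the kind of callable described above, here is a minimal sketch that repairs one class of error: a sentence break placed after an abbreviation. The function name merge_after_abbreviation, the helper token_text, and the ABBREVIATIONS set are hypothetical and not part of this PR; the sketch only assumes the list-of-lists format shown above, with optional (text, bool) tuples for MWTs.

```python
# Hypothetical list of abbreviations that should not end a sentence.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs."}


def token_text(token):
    """Return the text of a token, whether it is a plain str or a (text, is_mwt) tuple."""
    return token[0] if isinstance(token, tuple) else token


def merge_after_abbreviation(draft):
    """Merge a sentence into the previous one when the previous one ends in an abbreviation.

    `draft` is a list of sentences, each a list of tokens. Tokens are returned
    unchanged, so any MWT tuples are preserved as-is.
    """
    merged = []
    for sentence in draft:
        if merged and merged[-1] and token_text(merged[-1][-1]) in ABBREVIATIONS:
            merged[-1].extend(sentence)
        else:
            merged.append(list(sentence))  # copy so the input draft is not mutated
    return merged


draft = [["I", "saw", "Dr."], ["Smith", "today", "."], ["He", "waved", "."]]
print(merge_after_abbreviation(draft))
# [['I', 'saw', 'Dr.', 'Smith', 'today', '.'], ['He', 'waved', '.']]
```

Such a function could then be supplied as tokenize_postprocessor=merge_after_abbreviation when constructing the pipeline. It leaves token text untouched and only moves sentence boundaries, which keeps the result in the same format as the input.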

Unit test coverage

Integration test of the postprocessor, including a type check of the argument passed to it
Unit test of the reassemble_doc_from_tokens function against normal and OOV characters
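For readers writing their own postprocessor, a format check along the following lines may be useful when testing it. This is only a sketch of the format described in this PR, not the test added here; is_valid_draft is a hypothetical name.

```python
def is_valid_draft(draft):
    """Check that `draft` is a list of lists whose elements are str or (str, bool) tuples."""
    if not isinstance(draft, list):
        return False
    for sentence in draft:
        if not isinstance(sentence, list):
            return False
        for token in sentence:
            if isinstance(token, str):
                continue  # plain word token
            if (isinstance(token, tuple) and len(token) == 2
                    and isinstance(token[0], str) and isinstance(token[1], bool)):
                continue  # (text, is_mwt) token
            return False
    return True


print(is_valid_draft([["Diglielo", ("dai", True), "venire"]]))  # True
print(is_valid_draft([["Joe", 42]]))  # False
```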

If a postprocessor is provided, the tokenizer passes the candidate tokenization to it before finalizing the document construction. This allows for a postprocessor which fixes certain known errors or simply adjusts the tokenization to better match the user's downstream preferences.

Jemoka added 4 commits October 1, 2023 22:39

Adds a tokenization postprocessor for manual token cleanup

f03fe90

Merge remote-tracking branch 'origin/dev'

341bbfc

moves postprocessor to a seperate function

799fdfc

added comments outlining the importance of spaces

388bcd8

AngledLuffa merged commit 456c40e into stanfordnlp:dev Oct 2, 2023

Jemoka mentioned this pull request Oct 24, 2023

manual MWT control in tokenization #1302

Merged
