
Adds a tokenization postprocessor for manual token cleanup #1290

Merged
merged 4 commits into from Oct 2, 2023

Conversation

Jemoka (Member) commented Oct 2, 2023

Description

  • adds a tokenize_postprocessor config element to stanza.Pipeline, which accepts a callable that can override the default tokenization
  • adds stanza.models.tokenization.utils.reassemble_doc_from_tokens, which reassembles a document from the adjusted tokenization

For instance:

import stanza
# do_stuff_to_draft is a placeholder for a user-defined callable
nlp = stanza.Pipeline(lang="en", processors="tokenize",
                      tokenize_postprocessor=do_stuff_to_draft)

The single argument passed to tokenize_postprocessor is a list of lists of strings: one inner list per sentence, one string per word token.

[['Joe', 'Smith', 'lives', 'in', 'California', '.'], 
 ['Joe', "'s", 'favorite', 'food', 'is', 'pizza', '.'], 
 ['He', 'enjoys', 'going', 'to', 'the', 'beach', '.']]

To mark a multi-word token (MWT), each element can optionally be a tuple whose second element is a bool. For instance, if the word dai in the following sentence is marked as an MWT, the postprocessor will be passed

[['Diglielo', ('dai', True), 'venire', 'a', 'mangiare', 'un', "po'", 'di', 'margarina']]
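The draft format described above can be checked with a small helper. This is only a sketch of the shape described in this PR (lists of sentences whose elements are either strings or (string, bool) tuples); the name is_valid_draft is hypothetical and not part of stanza.

```python
def is_valid_draft(draft):
    """Return True if draft is a list of sentences, where each sentence is a
    list whose elements are either str or (str, bool) tuples.

    Hypothetical helper for illustration; not part of stanza.
    """
    if not isinstance(draft, list):
        return False
    for sentence in draft:
        if not isinstance(sentence, list):
            return False
        for element in sentence:
            # Plain word token
            if isinstance(element, str):
                continue
            # Token with an explicit MWT flag
            if (isinstance(element, tuple) and len(element) == 2
                    and isinstance(element[0], str)
                    and isinstance(element[1], bool)):
                continue
            return False
    return True
```

A check like this can be useful at the top of a postprocessor while developing it, since both the input and the returned value are expected to follow the same format.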

The callable passed to tokenize_postprocessor must return a list of lists in the same format, adjusting word tokenizations, sentence boundaries, and MWT designations as needed.
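As an illustration of the contract above, here is a minimal sketch of a postprocessor that rejoins words split around a standalone hyphen. The function names merge_hyphenated and text_of are hypothetical, and the merging rule is an assumption chosen for the example; it leaves elements carrying an MWT flag untouched.

```python
def text_of(element):
    """Surface string of a draft element (a str or a (str, bool) tuple)."""
    return element[0] if isinstance(element, tuple) else element


def merge_hyphenated(draft):
    """Hypothetical postprocessor: join 'a', '-', 'b' into 'a-b'.

    Takes and returns a list of sentences in the draft format.
    Elements flagged as (text, bool) tuples are never merged.
    """
    result = []
    for sentence in draft:
        merged = []
        i = 0
        while i < len(sentence):
            element = sentence[i]
            can_merge = (
                text_of(element) == "-"
                and merged                      # there is a left neighbor
                and i + 1 < len(sentence)       # there is a right neighbor
                and not isinstance(merged[-1], tuple)
                and not isinstance(sentence[i + 1], tuple)
            )
            if can_merge:
                # Fold the hyphen and the following word into the previous token
                merged[-1] = merged[-1] + "-" + sentence[i + 1]
                i += 2
            else:
                merged.append(element)
                i += 1
        result.append(merged)
    return result


# Usage with the config element from this PR (requires stanza and its models):
#   nlp = stanza.Pipeline(lang="en", processors="tokenize",
#                         tokenize_postprocessor=merge_hyphenated)
```

Because the function only touches plain Python lists, it can be unit tested without loading a pipeline.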

Unit test coverage

  • integration test of postprocessor, and type check of the argument passed
  • unit test of the reassemble_doc_from_tokens function against normal & OOV characters

@AngledLuffa AngledLuffa merged commit 456c40e into stanfordnlp:dev Oct 2, 2023
AngledLuffa pushed a commit that referenced this pull request Oct 2, 2023
If a postprocessor is provided, the tokenizer passes the candidate tokenization to it before finalizing the document construction. This allows for a postprocessor which fixes certain known errors or simply adjusts the tokenization to better match the user's downstream preferences.