
How does TokenizeProcessor batching work on unpunctuated input? #1519

@liad-inon

Description


I get out-of-memory errors on unpunctuated text input, and I believe the cause may be the batch-splitting method in TokenizeProcessor. The docs say that batches are divided by "paragraphs", but I couldn't find how Stanza defines a paragraph. Is it by punctuation?

I need to confirm this before writing a batching method of my own.

My text input comes from the OpenAI Whisper model, so punctuation is sometimes missing from my input for less-supported languages.
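In case it helps frame the question: one workaround I'm considering is pre-chunking the text myself before it reaches the pipeline, so TokenizeProcessor never sees one enormous unpunctuated "paragraph". A minimal sketch of that idea follows; `chunk_text` and `max_words` are names I made up for illustration, not part of Stanza's API, and the word-count threshold is an arbitrary guess.

```python
# Sketch of a manual pre-batching workaround (assumed approach, not Stanza's
# own batching): split long unpunctuated text into fixed-size word windows
# and feed each window to the pipeline separately.

def chunk_text(text: str, max_words: int = 500) -> list[str]:
    """Split text on whitespace into chunks of at most max_words words.

    Whitespace-based splitting is a naive stand-in for real segmentation;
    it may cut mid-sentence, which is acceptable here only because the
    input has no reliable punctuation to begin with.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Usage (assumes stanza is installed and a model has been downloaded):
# import stanza
# nlp = stanza.Pipeline(lang="en", processors="tokenize")
# for chunk in chunk_text(long_unpunctuated_text):
#     doc = nlp(chunk)
```

But before committing to something like this, I'd like to understand what the built-in paragraph-based batching actually keys on.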
