I get out-of-memory errors on unpunctuated text input, and I believe the cause might be the batch-dividing method in the TokenizeProcessor. The docs claim that batches are divided based on "paragraphs", but I couldn't find how Stanza defines a paragraph. Is it by punctuation?
I need to confirm this before writing a batching method of my own.
My text input comes from the OpenAI Whisper model, so punctuation is sometimes missing for less-supported languages.
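To frame what I mean by "a batching method of my own", here is a minimal sketch of the kind of pre-chunking workaround I have in mind. This is not Stanza code; `chunk_text` and `rebatch` are hypothetical helper names, and the approach assumes that blank lines act as paragraph boundaries for the tokenizer (which is exactly what I'm trying to confirm).

```python
def chunk_text(text: str, max_words: int = 500) -> list[str]:
    """Split whitespace-separated text into chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def rebatch(text: str, max_words: int = 500) -> str:
    """Rejoin chunks with blank lines, so each chunk looks like a
    separate paragraph to a tokenizer that splits on blank lines
    (an assumption about Stanza's behavior, not a confirmed fact)."""
    return "\n\n".join(chunk_text(text, max_words))
```

The idea is that even a long run of unpunctuated Whisper output would then arrive at the tokenizer as bounded-size "paragraphs" instead of one huge block, which is why I want to know how Stanza's paragraph splitting actually works before relying on this.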