
How does TokenizeProcessor batching work on unpunctuated input? #1519

@liad-inon

Description


I get out-of-memory errors on unpunctuated text input, and I believe the cause may be the batch-splitting method in TokenizeProcessor. The docs say that batches are divided by "paragraphs", but I couldn't find how Stanza defines a paragraph. Is it by punctuation?

I need to confirm this before writing a batching method of my own.

My text input comes from the OpenAI Whisper model, so punctuation is sometimes missing from my input for less-supported languages.
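In case it helps frame the question: one workaround I'm considering is pre-chunking the text myself before it reaches the pipeline, so TokenizeProcessor never sees one enormous unpunctuated "paragraph". A minimal sketch of that idea follows; `chunk_text` and `max_words` are names I made up for illustration, not part of Stanza's API, and the word-count threshold is an arbitrary guess.

```python
# Sketch of a manual pre-batching workaround (assumed approach, not Stanza's
# own batching): split long unpunctuated text into fixed-size word windows
# and feed each window to the pipeline separately.

def chunk_text(text: str, max_words: int = 500) -> list[str]:
    """Split text on whitespace into chunks of at most max_words words.

    Whitespace-based splitting is a naive stand-in for real segmentation;
    it may cut mid-sentence, which is acceptable here only because the
    input has no reliable punctuation to begin with.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Usage (assumes stanza is installed and a model has been downloaded):
# import stanza
# nlp = stanza.Pipeline(lang="en", processors="tokenize")
# for chunk in chunk_text(long_unpunctuated_text):
#     doc = nlp(chunk)
```

But before committing to something like this, I'd like to understand what the built-in paragraph-based batching actually keys on.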
