
Batch Sizes not used anywhere? Out of mem... #1327

Open
andrePankraz opened this issue Jan 5, 2024 · 2 comments

@andrePankraz

Describe the bug
I am getting out-of-memory errors with processes growing to around 35 GB; Stanza could be tracked down as the cause.

To Reproduce
Steps to reproduce the behavior:

  1. Take e.g. stanza.MultilingualPipeline(), set up as:

     self.nlp = stanza.MultilingualPipeline(
         model_dir=f"{get_from_env('model_dir', 'MODELS_FOLDER', 'data/models/')}stanza",
         lang_id_config={
             "langid_clean_text": True,
             "langid_lang_subset": ["de", "en"],
         },
         lang_configs={
             "de": {"processors": "tokenize,mwt", "verbose": False},
             "en": {"processors": "tokenize", "verbose": False},
         },
         use_gpu=False,
     )
  2. Call self.nlp(lines) with several thousand lines (see the sketch after this list).
  3. The LangID processor clusters lines by length, builds a tensor, and calls the LSTM. If one cluster happens to contain some 100 lines (each of some complexity), we get the described out of memory error.
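
For reference, here is a minimal, self-contained sketch of steps 1 and 2; the model_dir / get_from_env indirection is dropped and the sample lines are invented, so the exact memory behaviour may differ:

import stanza

# Sketch of the setup from step 1 (model_dir indirection omitted)
nlp = stanza.MultilingualPipeline(
    lang_id_config={
        "langid_clean_text": True,
        "langid_lang_subset": ["de", "en"],
    },
    lang_configs={
        "de": {"processors": "tokenize,mwt", "verbose": False},
        "en": {"processors": "tokenize", "verbose": False},
    },
    use_gpu=False,
)

# Step 2: several thousand lines passed in a single call
lines = ["Dies ist Beispielzeile Nummer %d mit etwas mehr Text." % i for i in range(5000)]
docs = nlp(lines)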

Expected behavior
No out of memory ;) For instance, by actually using batching in the pipelines?!

The classes accept some batch-size initialization parameters, but don't seem to do anything with them (or I cannot see it).
E.g. MultilingualPipeline.__init__ has a parameter ld_batch_size=64, which isn't used anywhere in this class (e.g. for initializing sub-processors).
The processor LangIDBiLSTM also stores self.batch_size = batch_size with a default of 64 - but again, it doesn't seem to be used anywhere.

Do I have the wrong expectations? OK, I can batch myself, but that doesn't seem to be the intention of this wrapper (and it shouldn't be), or I could call the LSTM directly without all this wrapper stuff.
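
For completeness, a minimal sketch of batching it myself, assuming nlp and lines are the objects from the reproduction sketch above and that a list input returns a list of Documents; the chunk size of 64 just mirrors the (apparently unused) ld_batch_size default:

def process_in_batches(nlp, lines, batch_size=64):
    # Feed the pipeline one chunk at a time instead of all lines at once,
    # so the LangID LSTM never builds one huge tensor in a single forward pass.
    results = []
    for start in range(0, len(lines), batch_size):
        results.extend(nlp(lines[start:start + batch_size]))
    return results

docs = process_in_batches(nlp, lines)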

@AngledLuffa
Collaborator

Would you provide the complete stack trace please?

@AngledLuffa
Collaborator

Ultimately I would like to be able to recreate the problem, but the following doesn't OOM on a 3090 and comes nowhere near using up all my RAM:

import stanza

pipe = stanza.MultilingualPipeline(lang_id_config={ "langid_clean_text": True,
                                                    "langid_lang_subset": ["de", "en"] },
                                   lang_configs={ "de": {"processors": "tokenize,mwt", "verbose": False},
                                                  "en": {"processors": "tokenize", "verbose": False}})

text = "\n\n".join("This is a sample text %d" % i for i in range(10000))
# discarding the result each time
result = pipe(text)

text = "\n".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)

text = "   ".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)
