
Batch Sizes not used anywhere? Out of mem... #1327

Open
andrePankraz opened this issue Jan 5, 2024 · 2 comments

@andrePankraz

Describe the bug
I am getting out-of-memory errors with processes growing to around 35 GB; Stanza could be tracked down as the cause.

To Reproduce
Steps to reproduce the behavior:

  1. Take e.g. stanza.MultilingualPipeline(), set up as:

     self.nlp = stanza.MultilingualPipeline(
         model_dir=f"{get_from_env('model_dir', 'MODELS_FOLDER', 'data/models/')}stanza",
         lang_id_config={
             "langid_clean_text": True,
             "langid_lang_subset": ["de", "en"],
         },
         lang_configs={
             "de": {"processors": "tokenize,mwt", "verbose": False},
             "en": {"processors": "tokenize", "verbose": False},
         },
         use_gpu=False,
     )
  2. Call self.nlp(lines) with several thousand lines (see the sketch after this list).
  3. The LangID processor clusters lines by length, builds a tensor, and calls the LSTM. If one cluster happens to contain some 100 lines (each of some complexity), we get the described out of memory error.
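
For reference, here is a minimal, self-contained sketch of steps 1 and 2; the model_dir / get_from_env indirection is dropped and the sample lines are invented, so the exact memory behaviour may differ:

import stanza

# Sketch of the setup from step 1 (model_dir indirection omitted)
nlp = stanza.MultilingualPipeline(
    lang_id_config={
        "langid_clean_text": True,
        "langid_lang_subset": ["de", "en"],
    },
    lang_configs={
        "de": {"processors": "tokenize,mwt", "verbose": False},
        "en": {"processors": "tokenize", "verbose": False},
    },
    use_gpu=False,
)

# Step 2: several thousand lines passed in a single call
lines = ["Dies ist Beispielzeile Nummer %d mit etwas mehr Text." % i for i in range(5000)]
docs = nlp(lines)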

Expected behavior
No out of memory ;) For instance, by actually using batching in the pipelines?!

The classes accept some batch-size initialization parameters, but don't seem to do anything with them (or I cannot see it).
E.g. MultilingualPipeline.__init__ has a parameter ld_batch_size=64, which isn't used anywhere in this class (e.g. for initializing sub-processors).
The processor LangIDBiLSTM also stores self.batch_size = batch_size with a default of 64 - but again, it doesn't seem to be used anywhere.

Do I have the wrong expectations? OK, I can batch myself, but that doesn't seem to be the intention of this wrapper (and it shouldn't be), or I could call the LSTM directly without all this wrapper stuff.
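
For completeness, a minimal sketch of batching it myself, assuming nlp and lines are the objects from the reproduction sketch above and that a list input returns a list of Documents; the chunk size of 64 just mirrors the (apparently unused) ld_batch_size default:

def process_in_batches(nlp, lines, batch_size=64):
    # Feed the pipeline one chunk at a time instead of all lines at once,
    # so the LangID LSTM never builds one huge tensor in a single forward pass.
    results = []
    for start in range(0, len(lines), batch_size):
        results.extend(nlp(lines[start:start + batch_size]))
    return results

docs = process_in_batches(nlp, lines)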

@AngledLuffa
Collaborator

Would you provide the complete stack trace please?

@AngledLuffa
Collaborator

Ultimately I would like to be able to recreate the problem, but the following doesn't OOM on a 3090 and comes nowhere near using up all my RAM:

import stanza

pipe = stanza.MultilingualPipeline(lang_id_config={ "langid_clean_text": True,
                                                    "langid_lang_subset": ["de", "en"] },
                                   lang_configs={ "de": {"processors": "tokenize,mwt", "verbose": False},
                                                  "en": {"processors": "tokenize", "verbose": False}})

text = "\n\n".join("This is a sample text %d" % i for i in range(10000))
# discarding the result each time
result = pipe(text)

text = "\n".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)

text = "   ".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)
