The LangId processor clusters lines by length, creates a tensor, and calls the LSTM. If one cluster happens to be some 100 lines long (and each line has some complexity), we get the described out-of-memory error.
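As a rough illustration of this failure mode (hypothetical helper, not stanza's actual code): when lines are clustered by length, a pathological input can put every line into a single cluster, which then becomes one oversized batch for the LSTM:

```python
from collections import defaultdict

def cluster_by_length(lines):
    """Group lines by character length -- a sketch of what a
    length-based clustering step might do (hypothetical helper,
    not stanza's implementation)."""
    clusters = defaultdict(list)
    for line in lines:
        clusters[len(line)].append(line)
    return clusters

# 100 lines of identical length all land in one cluster, so a model
# that batches per cluster would see a single batch of 100 sequences.
lines = ["x" * 80 for _ in range(100)]
clusters = cluster_by_length(lines)
print(max(len(c) for c in clusters.values()))  # -> 100
```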
Expected behavior
No out of memory ;) For instance, by really using batching in the pipelines?!
The classes take some batch-initialization parameters, but don't seem to do anything with them (or I cannot see it).
E.g. MultilingualPipeline.__init__ has a parameter ld_batch_size=64, which isn't used anywhere in this class (e.g. for initializing sub-processors).
The processor LangIDBiLSTM also has self.batch_size = batch_size with a default of 64, but again, it doesn't seem to be used anywhere.
Do I have wrong expectations? OK, I can batch myself, but that doesn't seem to be the intention of this wrapper (and it shouldn't be), or I could call the LSTM directly without all this wrapper stuff.
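As a workaround until the pipeline honors its batch parameters, one could batch inputs manually before handing them to the pipeline. A minimal sketch in pure Python (the pipeline call itself is omitted; the batch size of 64 just mirrors the unused ld_batch_size default):

```python
def batched(items, batch_size=64):
    """Yield successive fixed-size slices of a list -- a minimal
    sketch of the manual batching described above."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = ["line %d" % i for i in range(150)]
sizes = [len(batch) for batch in batched(docs)]
print(sizes)  # -> [64, 64, 22]
```

Each batch could then be passed to the pipeline separately, bounding the size of any tensor the LSTM sees.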
Ultimately I would like to be able to recreate the problem, but the following doesn't OOM on a 3090 and comes nowhere near using up all my RAM:
import stanza

pipe = stanza.MultilingualPipeline(
    lang_id_config={"langid_clean_text": True,
                    "langid_lang_subset": ["de", "en"]},
    lang_configs={"de": {"processors": "tokenize,mwt", "verbose": False},
                  "en": {"processors": "tokenize", "verbose": False}})

# discarding the result each time
text = "\n\n".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)

text = "\n".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)

text = " ".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)
Describe the bug
I am getting out-of-memory errors with processes growing to 35 GB; stanza could be tracked down as the cause.
To Reproduce
Steps to reproduce the behavior:
self.nlp = stanza.MultilingualPipeline(
    model_dir=f"{get_from_env('model_dir', 'MODELS_FOLDER', 'data/models/')}stanza",
    lang_id_config={
        "langid_clean_text": True,
        "langid_lang_subset": ["de", "en"],
    },
    lang_configs={
        "de": {"processors": "tokenize,mwt", "verbose": False},
        "en": {"processors": "tokenize", "verbose": False},
    },
    use_gpu=False,
)