[QUESTION] How to load only the tokenizer in multilingual pipeline #1199

jplu · 2023-02-28T13:41:06Z

Hello,

I'm struggling to find out how to avoid loading all the processors in a multilingual context.

When I do:

from stanza.pipeline.multilingual import MultilingualPipeline

nlp = MultilingualPipeline()
text = "my piece of text in a random language."
doc = nlp(text)

I get the following processors for the detected language:

============================
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |
============================

But I only need to load and use the "tokenize" processor. How can I avoid loading all the other processors?

Note that I don't know the language of the input text in advance.

Thanks in advance for any help :)


AngledLuffa · 2023-02-28T15:33:30Z

You can use the lang_configs field to pass in a map from language to config:

lang_configs = { "en": {"processors": "tokenize"} }
nlp = MultilingualPipeline(lang_configs=lang_configs)

Unfortunately, this doesn't generalize well since you'd have to add the same thing for every language your pipeline will encounter. I'll add the ability to pass in a default map.
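For anyone who does know roughly which languages to expect, the per-language map can be built programmatically rather than typed out by hand. This is only a sketch: `EXPECTED_LANGS` and `build_lang_configs` are hypothetical names made up for illustration, not part of stanza, and the language list is an assumption.

```python
# Hypothetical list of language codes the pipeline is expected to see.
EXPECTED_LANGS = ["en", "fr", "de", "es"]


def build_lang_configs(langs, processors="tokenize"):
    """Return a plain dict mapping each language code to the same config."""
    # Each language gets its own dict so later per-language tweaks stay isolated.
    return {lang: {"processors": processors} for lang in langs}


lang_configs = build_lang_configs(EXPECTED_LANGS)

# The result would then be passed as in the snippet above:
#   MultilingualPipeline(lang_configs=lang_configs)
```

A language detected outside this list is not covered by the map, which is the limitation described above.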

AngledLuffa · 2023-02-28T15:34:21Z

sorry, had to edit mistake in the above code - don't rely on the email

jplu · 2023-02-28T15:44:57Z

That's basically the workaround I found as well but as you said, it must be specified for each lang. Would be nice indeed to have this ability to pass a default list of processors :)

Should I open a feature request instead?

AngledLuffa · 2023-02-28T15:55:53Z

That's fine. I'm doing it right now


AngledLuffa · 2023-02-28T16:41:47Z

If you install the dev branch, you can now pass in a defaultdict:

    from collections import defaultdict
    from stanza.pipeline.multilingual import MultilingualPipeline

    lang_configs = defaultdict(lambda: dict(processors="tokenize"))
    nlp = MultilingualPipeline(lang_configs=lang_configs)

Not sure it's the best solution, but passing in a defaultdict wouldn't have worked as expected previously
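For readers less familiar with `defaultdict`, here is a small standalone sketch of the standard-library behavior this relies on. It does not touch stanza at all, and the language codes are arbitrary examples. The point is that indexing with `[]` creates the default entry, while `.get()` and `in` do not.

```python
from collections import defaultdict

# Every language without an explicit entry gets a tokenize-only config.
lang_configs = defaultdict(lambda: {"processors": "tokenize"})

# Indexing an unseen key calls the factory and stores the result.
config_ja = lang_configs["ja"]

# .get() bypasses the factory and returns None for an unseen key.
missing = lang_configs.get("ko")

# Explicit entries can still override the default for specific languages.
lang_configs["en"] = {"processors": "tokenize,pos"}
```

So code consuming the map has to index it with `[]` for the default to apply. Whether that is what the commit changed inside stanza is not stated in this thread.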

jplu · 2023-02-28T17:13:55Z

Perfect! Looks good to me 👍 Thanks very much!

jplu added the question label Feb 28, 2023

AngledLuffa added a commit that referenced this issue Feb 28, 2023

Add the ability to pass in a defaultdict to MultilingualPipeline. This will allow for specifying only the tokenize processor as requested in #1199 for example

70fd2fd

jplu closed this as completed Feb 28, 2023
