[QUESTION] How to load only the tokenizer in multilingual pipeline #1199

Closed
jplu opened this issue Feb 28, 2023 · 6 comments

jplu commented Feb 28, 2023

Hello,

I am struggling to find a way to avoid loading all the processors in a multilingual context.

When I do:

from stanza.pipeline.multilingual import MultilingualPipeline

nlp = MultilingualPipeline()
text = "my piece of text in a random language."
doc = nlp(text)

I get the following processors for the detected language:

============================
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |
============================

But I only need to load and use the "tokenize" processor. How can I avoid loading all the other processors?

Note that I don't know the language of the input text in advance.

Thanks in advance for any help :)

@jplu jplu added the question label Feb 28, 2023
@AngledLuffa
Collaborator

AngledLuffa commented Feb 28, 2023

You can use the lang_configs field to pass in a map from language to config:

from stanza.pipeline.multilingual import MultilingualPipeline

lang_configs = {"en": {"processors": "tokenize"}}
nlp = MultilingualPipeline(lang_configs=lang_configs)

Unfortunately, this doesn't generalize well since you'd have to add the same thing for every language your pipeline will encounter. I'll add the ability to pass in a default map.
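Until a default is available, the per-language map can at least be generated rather than typed by hand. The following is only a sketch: `expected_langs` and `build_lang_configs` are hypothetical names made up for illustration, and the list of language codes is an assumption about what the pipeline might encounter. A language missing from the map would not be covered by it.

```python
# Hypothetical list of language codes the pipeline is expected to see.
expected_langs = ["en", "fr", "de", "es"]


def build_lang_configs(langs, processors="tokenize"):
    """Build a {language: config} map that requests the same processors for each language."""
    return {lang: {"processors": processors} for lang in langs}


lang_configs = build_lang_configs(expected_langs)

# The resulting plain dict has the same shape as the "en"-only example above
# and could be passed as MultilingualPipeline(lang_configs=lang_configs).
```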

@AngledLuffa
Collaborator

Sorry, I had to fix a mistake in the code above, so don't rely on the email notification.

@jplu
Author

jplu commented Feb 28, 2023

That's basically the workaround I found as well, but as you said, it must be specified for each language. It would indeed be nice to be able to pass a default list of processors :)

Should I open a feature request instead?

@AngledLuffa
Collaborator

That's fine. I'm doing it right now.

AngledLuffa added a commit that referenced this issue Feb 28, 2023
…s will allow for specifying only the tokenize processor as requested in #1199 for example
@AngledLuffa
Collaborator

If you install the dev branch, you can now pass in a defaultdict:

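    from collections import defaultdict
    # Any language missing from the map falls back to a tokenize-only config.
    lang_configs = defaultdict(lambda: dict(processors="tokenize"))
    nlp = MultilingualPipeline(lang_configs=lang_configs)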

I'm not sure it's the best solution, but passing in a defaultdict wouldn't have worked as expected previously.
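For readers unfamiliar with `defaultdict`, here is a small standalone sketch of the behavior being relied on. It does not touch stanza at all. The `"en"` override and the `"xx"` code are hypothetical values chosen for illustration. Note that only indexing with `[]` invokes the default factory, while `.get()` and `in` do not. How a consumer looks keys up therefore matters, although the thread does not say whether that was the earlier problem.

```python
from collections import defaultdict

# Every unseen language gets a fresh tokenize-only config on first lookup.
lang_configs = defaultdict(lambda: dict(processors="tokenize"))

# Hypothetical explicit override for one language.
lang_configs["en"] = dict(processors="tokenize,pos")

english = lang_configs["en"]   # explicit entry is returned unchanged
unknown = lang_configs["xx"]   # factory runs and the result is stored under "xx"

# A separate map to show that .get() and `in` bypass the default factory.
untouched = defaultdict(lambda: dict(processors="tokenize"))
via_get = untouched.get("fr")          # None: .get() does not call the factory
is_member = "fr" in untouched          # False: membership tests do not either
```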

@jplu
Author

jplu commented Feb 28, 2023

Perfect! Looks good to me 👍 Thanks very much!

@jplu jplu closed this as completed Feb 28, 2023