Can't load tokenizer for 'Helsinki-NLP/opus-mt-nds-en' #20

Closed
ulhaqi12 opened this issue Mar 10, 2021 · 2 comments


ulhaqi12 commented Mar 10, 2021

Hi,
I was trying to translate 19,203 sentences from German to English using the translate_stream method shown in the following example:
https://github.com/UKPLab/EasyNMT/blob/main/examples/translation_streaming.py

I set the chunk size to 32. After the first 3 chunks were translated and written to the output file successfully, it failed with a model error. Can you help me with this issue? The error output is pasted below.

  0%|▌                                                | 96/19203 [00:54<3:00:03,  1.77it/s]
Exception: Can't load tokenizer for 'Helsinki-NLP/opus-mt-nds-en'. Make sure that:

- 'Helsinki-NLP/opus-mt-nds-en' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'Helsinki-NLP/opus-mt-nds-en' is the correct path to a directory containing relevant tokenizer files


  1%|▋                                                 | 127/19203 [01:06<2:46:24,  1.91it/s]
Traceback (most recent call last):
  File "translate.py", line 12, in <module>
    for translation in model.translate_stream(sentences, chunk_size=32, target_lang='en'):
  File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 297, in translate_stream
    translated = self.translate(batch, show_progress_bar=False, **kwargs)
  File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 124, in translate
    translated_sentences = self.translate_sentences(splitted_sentences, target_lang=target_lang, source_lang=source_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
  File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 210, in translate_sentences
    raise e
  File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 205, in translate_sentences
    translated = self.translate_sentences(grouped_sentences, source_lang=lng, target_lang=target_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
  File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 222, in translate_sentences
    output.extend(self.translator.translate_sentences(sentences_sorted[start_idx:start_idx+batch_size], source_lang=source_lang, target_lang=target_lang, beam_size=beam_size, device=self.device, **kwargs))
  File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/models/OpusMT.py", line 46, in translate_sentences
    tokenizer, model = self.load_model(model_name)
  File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/models/OpusMT.py", line 28, in load_model
    tokenizer = MarianTokenizer.from_pretrained(model_name)
  File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'Helsinki-NLP/opus-mt-nds-en'. Make sure that:

- 'Helsinki-NLP/opus-mt-nds-en' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'Helsinki-NLP/opus-mt-nds-en' is the correct path to a directory containing relevant tokenizer files


@nreimers
Member

The language detection step identified a document as language 'nds', which is Low German.

However, there is no model that can translate from NDS to EN, hence the error.

Automatic language detection sadly does not work perfectly on noisy data. So if you know the source language, it is best to set it explicitly (translate(..., src_lang='de', ...)).

In that case, the language does not need to be detected: the correct model (opus-mt-de-en) is loaded directly and this error is avoided.

@ulhaqi12
Author

Oh, got it. It was solved when I specified the source language. By the way, the parameter is 'source_lang', not 'src_lang'.
Thank you for your help.
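
For readers landing on this thread, the fix discussed above could be sketched roughly as follows. This is a minimal illustration, not code from the EasyNMT repository: the helper name `translate_german_stream` is hypothetical, and it only assumes that `translate_stream` forwards `source_lang` to `translate`, as the traceback above suggests.

```python
def translate_german_stream(model, sentences, chunk_size=32):
    """Yield English translations while forcing the source language to German.

    `model` is expected to be an EasyNMT instance. Passing source_lang='de'
    skips automatic language detection, so a noisy sentence can no longer be
    misdetected as 'nds' and trigger a lookup of a non-existent model.
    """
    yield from model.translate_stream(
        sentences,
        chunk_size=chunk_size,
        source_lang="de",  # the keyword is 'source_lang', not 'src_lang'
        target_lang="en",
    )


# Usage sketch (assumes the easynmt package is installed; the model is
# downloaded on first use):
#
#   from easynmt import EasyNMT
#   model = EasyNMT("opus-mt")
#   for translation in translate_german_stream(model, sentences):
#       print(translation)
```

Setting the source language once for the whole stream only makes sense when every input really is German; mixed-language input would still need detection or per-sentence handling.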
