Do documents or large texts need special characters to tokenize before translation? #51

Closed
Wittline opened this issue Sep 16, 2021 · 2 comments

Comments

@Wittline

Hi @nreimers

Does the translate method split or tokenize the input text into sentences?

model.translate(text, source_lang='es', target_lang='en')

I'm working with long sentences or documents that contain no commas or other special characters, because I cleaned the text in earlier steps. So I am sending inputs like the example below, only much longer:

"amor edificio casa perro" --> expected output ---> "love building house dog"

As you can see, there are no commas or periods. Does your model need these characters to translate in chunks?

@nreimers
Member

Yes, sentence delimiters (. ! ?) are needed. Otherwise, you can chunk it yourself and provide a list of sentences.

Also, I don't know whether the models will work well; they have not been trained on such strange input text.
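
A minimal sketch of the manual chunking suggested above, assuming the cleaned text is whitespace-separated and that fixed-size word windows are an acceptable stand-in for sentences. The helper name chunk_words and the max_words parameter are hypothetical and not part of the library; the final commented line only illustrates passing the resulting list to model.translate and is not executed here.

```python
from typing import List


def chunk_words(text: str, max_words: int = 50) -> List[str]:
    """Split whitespace-separated text into chunks of at most max_words words.

    Hypothetical helper: the window size is an assumption, not a value
    recommended by the library.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in windows of max_words and rejoin each window.
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


chunks = chunk_words("amor edificio casa perro", max_words=2)
print(chunks)  # ['amor edificio', 'casa perro']

# Hypothetical usage with the model from this thread (not run here):
# translations = model.translate(chunks, source_lang='es', target_lang='en')
```

Because the windows ignore sentence boundaries, translation quality on such input is not guaranteed, as noted above.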

@Wittline
Author

Wittline commented Sep 16, 2021

I removed stopwords and special characters as well. I am working on a sentiment analysis report. I think some techniques for detecting the sentiment score do not need contextual polarity, which is why I am calculating the score as the average of the scores of all words present in each document.

I think this could work:

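translated_text = []
# Assumes text is an iterable of strings (for example, a list of chunks), not one long string.
for translation in model.translate_stream(text, show_progress_bar=False, chunk_size=16, source_lang='es', target_lang='en'):
    translated_text.append(translation)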
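
The word-averaged scoring described in this comment could look roughly like the sketch below. It assumes a word-level sentiment lexicon is available as a dictionary; toy_lexicon and its scores are made up for illustration, and average_word_score is a hypothetical helper. Skipping words that are missing from the lexicon is also an assumption.

```python
from typing import Dict, Optional


def average_word_score(document: str, lexicon: Dict[str, float]) -> Optional[float]:
    """Return the mean lexicon score of the words in a document.

    Words that are not in the lexicon are skipped (an assumption);
    returns None when no word in the document has a score.
    """
    scores = [lexicon[word] for word in document.lower().split() if word in lexicon]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Illustrative scores only, not taken from any real sentiment lexicon.
toy_lexicon = {"love": 1.0, "dog": 0.5, "building": 0.0, "house": 0.0}

print(average_word_score("love building house dog", toy_lexicon))  # 0.375
```

Since the score depends only on which words are present, word order and punctuation do not affect it, which is why the cleaned input is enough for this step even if it is awkward for translation.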
