Do documents or large texts need special characters to tokenize before translation? #51

Wittline · 2021-09-16T17:51:05Z

Does the translate method split (tokenize) the input text into sentences?

model.translate(text, source_lang='es', target_lang='en')

I'm working with long sentences or whole documents that contain no commas or other special characters, because I cleaned the text in earlier steps. So I am sending input like the example below, only much longer:

"amor edificio casa perro" --> expected output ---> "love building house dog"

As you can see, there are no commas or periods. Does your model need these characters to translate the text in chunks?

nreimers · 2021-09-16T18:09:57Z

Yes, sentence delimiters (. ! ?) are needed. Otherwise, you can chunk the text yourself and provide a list of sentences.

Also, I don't know whether the models will work well; they have not been trained on such strange input text.
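For text with no delimiters at all, "chunking it yourself" could be as simple as cutting the word sequence into fixed-size pieces. The sketch below is only an illustration: the helper name `chunk_words` and the `max_words` limit are assumptions, not part of the library, and a fixed word count says nothing about where a sensible translation unit actually ends.

```python
def chunk_words(text, max_words=50):
    """Split whitespace-separated text into chunks of at most max_words words.

    Hypothetical helper for illustration; it is not part of the translation library.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_words("amor edificio casa perro", max_words=2)
print(chunks)
```

The resulting list of strings is what would then be handed to the model in place of one long string, as suggested above.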

Wittline · 2021-09-16T18:21:52Z

I removed stopwords and special characters as well. I am working on a sentiment analysis report, and I think some techniques for computing a sentiment score do not need contextual polarity. That is why I am calculating each document's score as the average of the scores of all the words it contains.

I think this could work:

translated_text = []
for translation in model.translate_stream(text, show_progress_bar=False, chunk_size=16, source_lang='es', target_lang='en'):
    translated_text.append(translation)
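The word-averaged sentiment score described above might look roughly like this once the text is translated. Everything here is an assumption made for illustration: the function name `average_word_score`, the toy `lexicon` and its values, and the choice to skip words that have no entry in the lexicon.

```python
def average_word_score(document, lexicon):
    """Average the lexicon scores of the words in a document.

    Words missing from the lexicon are skipped (an assumption of this sketch).
    Returns 0.0 when no word in the document has a score.
    """
    scores = [lexicon[word] for word in document.lower().split() if word in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


# Toy lexicon with made-up scores, for illustration only.
lexicon = {"love": 1.0, "dog": 0.5, "hate": -1.0}
print(average_word_score("love building house dog", lexicon))
```

Because each word is scored independently, word order and sentence boundaries do not affect the result, which is why the missing punctuation matters for the translation step but not for the scoring step.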

Wittline closed this as completed Sep 16, 2021
