### Dealing with non-English inputs

Neural machine translation (NMT) is an increasingly common approach to machine translation tasks. Although translation is not the goal of this challenge, we might benefit from applying it to the data at hand: as we saw in <a href='https://www.kaggle.com/erelin6613/eda-elementary'>EDA: Elementary...</a>, nearly half of the inputs are non-English texts.

In this notebook we use the Marian model; for a deeper understanding of it, the paper <a href='https://arxiv.org/pdf/1805.12096.pdf'>Marian: Cost-effective High-Quality Neural Machine Translation in C++</a> is a good place to start. We will use models pretrained on <a href='http://opus.nlpl.eu/'>OPUS</a>, a massive collection of translations. It would be an understatement to say that the Language Technology Research Group at the University of Helsinki did a great job of bridging so many languages. If you are interested in their work, check out their blog at https://blogs.helsinki.fi/language-technology/.

In [None]:
!pip install -q transformers
!pip install -q mosestokenizer
!pip install -q translators

In [None]:
import numpy as np
import pandas as pd
import os
from tqdm.notebook import tqdm
from transformers import MarianMTModel, MarianTokenizer

In [None]:
root_dir = '../input/contradictory-my-dear-watson'
train_path = 'train.csv'
test_path = 'test.csv'

train_df = pd.read_csv(os.path.join(root_dir, train_path))
test_df = pd.read_csv(os.path.join(root_dir, test_path))

In [None]:
models = {k: f'Helsinki-NLP/opus-mt-{k}-en' for k in train_df.lang_abv.unique()}
models
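
The mapping above also gets an `en` entry, which the translation loop later skips, and nothing guarantees that an `opus-mt-{lang}-en` checkpoint exists for every language in the data. A minimal sketch of building the same mapping without the English entry follows; `build_model_names` and its `target` parameter are hypothetical names, not part of the original notebook, and the sketch does not check whether a checkpoint is actually published.

```python
def build_model_names(lang_codes, target='en'):
    """Map each source language code to an OPUS-MT checkpoint name.

    Hypothetical helper: it only applies the naming pattern used in the
    cell above and does not verify that the checkpoint exists.
    """
    return {
        code: f'Helsinki-NLP/opus-mt-{code}-{target}'
        for code in lang_codes
        if code != target  # texts already in the target language need no model
    }


# Example with a hand-written list of codes instead of train_df.lang_abv.unique()
print(build_model_names(['en', 'fr', 'de']))
```

Languages without a published checkpoint would still have to be handled when the model is loaded.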

In [None]:
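def translate_df(df, compare=True, fields=('premise', 'hypothesis')):
    """Translate the non-English rows of `df` to English with MarianMT, in place.

    If `compare` is True, print one Google vs Marian sample per language.
    """

    def translate_google(string, lang):
        # Imported lazily: only needed when compare=True.
        # Assumes a `translators` release that exposes `ts.google`.
        import translators as ts
        return ts.google(query_text=string,
                         from_language=lang,
                         to_language='en',
                         sleep_seconds=1)

    def translate_marian(tokenizer, model, string):
        # Calling the tokenizer directly stands in for
        # `prepare_translation_batch`, which recent transformers releases
        # no longer provide.
        batch = tokenizer([string], return_tensors='pt',
                          padding=True, truncation=True)
        gen = model.generate(**batch)
        translation = tokenizer.batch_decode(
            gen, skip_special_tokens=True)
        return translation[0]

    # Must not be named `compare`, or it would shadow the flag above.
    def print_comparison(subset, tokenizer, model):
        # Use the last row of the language subset as the sample.
        idx = subset.index[-1]
        original = subset[fields[0]][idx]
        g_translation = translate_google(original, subset.lang_abv[idx])
        trs = translate_marian(tokenizer, model, original)
        print(f'Original: {original}\n'
              f'Google translation: {g_translation}\n'
              f'Marian translation: {trs}')

    for k in models:
        if k == 'en':
            continue
        print('translating: ', k)

        try:
            tokenizer = MarianTokenizer.from_pretrained(models[k])
            model = MarianMTModel.from_pretrained(models[k])
        except Exception as e:
            # No usable model for this language: leave its rows untranslated.
            print(e)
            continue
        subset = df[df.lang_abv == k]

        # Skip the sample comparison when this frame has no rows for k.
        if compare and len(subset) > 0:
            print_comparison(subset, tokenizer, model)

        for idx in tqdm(subset.index):
            for f in fields:
                df.loc[idx, f] = translate_marian(
                    tokenizer, model, df.loc[idx, f])

    return df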
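
Translating one string per `generate` call, as above, can be slow on larger frames. Below is a rough sketch of batching the work instead, assuming the tokenizer can be called directly on a list of strings with `padding=True` (as in recent `transformers` releases). The names `chunked`, `translate_in_batches` and `make_marian_batch_fn` are hypothetical, and the batch size of 16 is an arbitrary choice.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(texts, translate_batch, batch_size=16):
    """Translate `texts` batch by batch, translating each distinct string once.

    `translate_batch` is any callable mapping a list of strings to a list
    of translations of the same length.
    """
    unique = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    translated = {}
    for batch in chunked(unique, batch_size):
        for src, tgt in zip(batch, translate_batch(batch)):
            translated[src] = tgt
    return [translated[t] for t in texts]


def make_marian_batch_fn(model_name):
    """Build a `translate_batch` callable backed by a MarianMT checkpoint.

    The import is local so the helpers above work without transformers.
    """
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(batch):
        encoded = tokenizer(batch, return_tensors='pt',
                            padding=True, truncation=True)
        generated = model.generate(**encoded)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch


# Stand-in translator that shows the data flow without downloading a model.
def fake_translate(batch):
    return [s.upper() for s in batch]


print(translate_in_batches(['a', 'b', 'a'], fake_translate, batch_size=2))
```

With a real checkpoint, the inner loop of `translate_df` could then become something like `df.loc[subset.index, f] = translate_in_batches(subset[f].tolist(), make_marian_batch_fn(models[k]))`; whether this is faster in practice depends on the hardware and on the text lengths.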

In [None]:
train_df = translate_df(train_df)
test_df = translate_df(test_df)

In [None]:
train_df

In [None]:
test_df

In [None]:
train_df.to_csv('train_df.csv', index=False)
test_df.to_csv('test_df.csv', index=False)

Was it useful? Probably not as much as one might expect, but it is certainly a fun application. I hope you enjoyed this little overview of Google vs MarianMT :)