We'll learn the embeddings from WMT and the translation task from multi30k, so that our results stay comparable with published ones (if we extracted 30k sentences from WMT instead, we couldn't compare with anybody). Besides, this is precisely what the authors of the article do.

First thing: tokenization.

In [1]:
import os
import nltk

multi30k_data_dir = '../data/multi30k'
wmt17_data_dir = '../data/wmt17'
generated_data_dir = '../data/generated'

# makedirs also creates missing parent directories and is a no-op if the directory exists
os.makedirs(generated_data_dir, exist_ok=True)

nltk.download('punkt')
files_to_tokenize = []

# Tokenizing multi30k
for file_name in os.listdir(multi30k_data_dir):
    input_file_path = '{}/{}'.format(multi30k_data_dir, file_name)
    output_file_path = '{}/{}.tok'.format(generated_data_dir, file_name)

    files_to_tokenize.append((input_file_path, output_file_path))

# Tokenizing WMT
wmt17_file_name_src = '{}/{}'.format(wmt17_data_dir, 'europarl-v7.de-en.en')
wmt17_file_name_trg = '{}/{}'.format(wmt17_data_dir, 'europarl-v7.de-en.de')
files_to_tokenize.append((wmt17_file_name_src, '%s/wmt17.en.tok' % generated_data_dir))
files_to_tokenize.append((wmt17_file_name_trg, '%s/wmt17.de.tok' % generated_data_dir))


# Tokenization
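for input_file_path, output_file_path in files_to_tokenize:
    print('Tokenizing', input_file_path)
    # Iterate over the file instead of calling str.splitlines(): splitlines() also
    # breaks on characters such as U+2028, which could misalign the parallel files.
    with open(input_file_path, 'r', encoding='utf-8') as file:
        lines = [line.rstrip('\n') for line in file]

    tokenized = [' '.join(nltk.word_tokenize(line)) for line in lines]

    # In text mode '\n' is already translated to the platform line separator,
    # so writing os.linesep would produce '\r\r\n' on Windows.
    with open(output_file_path, 'w', encoding='utf-8') as file:
        for line in tokenized:
            file.write(line + '\n')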

[nltk_data] Downloading package punkt to /home/universome/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Tokenizing ../data/multi30k/train.de
Tokenizing ../data/multi30k/train.en
Tokenizing ../data/multi30k/val.en
Tokenizing ../data/multi30k/test.en
Tokenizing ../data/multi30k/test.de
Tokenizing ../data/multi30k/val.de
Tokenizing ../data/wmt17/europarl-v7.de-en.en
Tokenizing ../data/wmt17/europarl-v7.de-en.de
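
Since both corpora are parallel, each tokenized file should end up with exactly as many lines as its counterpart in the other language. A minimal sketch of such a check follows; the helper names `count_lines` and `check_parallel` are hypothetical and not part of the original notebook.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file by iterating over it."""
    with open(path, 'r', encoding='utf-8') as file:
        return sum(1 for _ in file)


def check_parallel(src_path, trg_path):
    """Return the shared line count, or raise if the two sides differ."""
    n_src, n_trg = count_lines(src_path), count_lines(trg_path)
    if n_src != n_trg:
        raise ValueError('{} has {} lines, but {} has {}'.format(
            src_path, n_src, trg_path, n_trg))
    return n_src
```

It could be called as `check_parallel('../data/generated/wmt17.en.tok', '../data/generated/wmt17.de.tok')`. Counting by file iteration matters here: `str.splitlines()` also splits on characters such as U+2028, so the two ways of counting can disagree on raw web or parliamentary text.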


OK, we have the tokenized data. Now let's learn the BPE merges.

In [2]:
%%bash
subword_nmt="../ext-libs/subword-nmt"
src="../data/generated/wmt17.en.tok"
trg="../data/generated/wmt17.de.tok"
# num_bpes=8000
num_bpes_src=4000
num_bpes_trg=4000

# bpes="../data/generated/bpes"
bpes_src="../data/generated/bpes.en"
bpes_trg="../data/generated/bpes.de"
vocab_src="../data/generated/vocab.en"
vocab_trg="../data/generated/vocab.de"

# python ../ext-libs/subword-nmt/learn_joint_bpe_and_vocab.py \
#     --input "$src" "$trg" \
#     -s "$num_bpes" \
#     -o "$bpes" \
#     --write-vocabulary "$vocab_src" "$vocab_trg"
python "$subword_nmt/learn_bpe.py" -s "$num_bpes_src" < "$src" > "$bpes_src"
python "$subword_nmt/learn_bpe.py" -s "$num_bpes_trg" < "$trg" > "$bpes_trg"

# Apply BPE to all our tokenized files
for file in ../data/generated/*.tok
do
    # File names end in "<lang>.tok", so the two characters before ".tok" are the language code
    lang="${file: -6:2}"
    python "$subword_nmt/apply_bpe.py" -c "../data/generated/bpes.$lang" < "$file" > "$file.bpe"
done

# And finally, generate the vocabularies
python "$subword_nmt/get_vocab.py" < "$src.bpe" > "$vocab_src"
python "$subword_nmt/get_vocab.py" < "$trg.bpe" > "$vocab_trg"
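
To make the BPE step less of a black box, here is a toy sketch of how merge operations can be learned from word frequencies. It follows the general idea of byte pair encoding (repeatedly merge the most frequent adjacent symbol pair) and is not the actual code of `learn_bpe.py`; all function names are hypothetical.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every occurrence of the symbol pair into a single symbol."""
    # The lookarounds make sure we only match whole, space-delimited symbols
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(lambda match: merged, word): freq
            for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Return the list of learned merges and the segmented vocabulary."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    vocab = {' '.join(word) + ' </w>': freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


merges, vocab = learn_bpe({'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}, num_merges=3)
print(merges)
print(vocab)
```

The `num_merges` argument plays the role of the `-s` option above: more merges give longer subword units and a larger vocabulary.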

It is not a good idea to learn embeddings right here, but...

In [None]:
import fasttext

# Train skip-gram embeddings on the BPE-segmented corpora.
# Note: the output names end in `_cbow`, but the models trained here are skip-gram.
model_src = fasttext.skipgram('../data/generated/wmt17.en.tok.bpe',
                              '../trained_models/wmt17.en.tok.bpe_cbow',
                              dim=512, min_count=1, silent=0, thread=4)

model_trg = fasttext.skipgram('../data/generated/wmt17.de.tok.bpe',
                              '../trained_models/wmt17.de.tok.bpe_cbow',
                              dim=512, min_count=1, silent=0, thread=4)

In [None]:
# Let's remove .bin files which we do not use
# !rm ../trained_models/*.bin
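
If only the text vectors are kept, they can be read back without fastText. The sketch below assumes that training wrote a `.vec` file next to each `.bin` file and that it uses the usual text layout: a header line with the vocabulary size and the dimension, then one token per line followed by its vector components. The function name `load_vec` is hypothetical.

```python
def load_vec(path):
    """Read a text embedding file into a {token: list of floats} dict."""
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as file:
        # Header line: "<vocab_size> <dim>"
        vocab_size, dim = map(int, file.readline().split())
        for line in file:
            parts = line.rstrip().split(' ')
            if len(parts) <= dim:
                continue  # skip blank or malformed lines
            # The last `dim` fields are the vector, everything before is the token
            token = ' '.join(parts[:-dim])
            embeddings[token] = [float(value) for value in parts[-dim:]]
    return embeddings
```

Under the assumption above, the English vectors would be loaded with `load_vec('../trained_models/wmt17.en.tok.bpe_cbow.vec')`.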