We learn the embeddings on WMT data (Europarl v7 here) and train the translation task on Multi30k, so that our results stay comparable with published ones (if we sampled 30k sentences from WMT instead, we could not compare with anybody). Besides, this is exactly what the authors of the article do.

First thing: tokenization.

In [None]:
%%bash

mosesdecoder="../ext-libs/mosesdecoder"
multi30k_data_dir="../data/multi30k"
europarl_data_dir="../data/europarl-v7"
generated_data_dir="../data/generated"

threads=6

mkdir -p "$generated_data_dir"

# Tokenizing multi30k
for file in $(ls "$multi30k_data_dir")
do
    cat "$multi30k_data_dir/$file" | \
    $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl | \
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -threads $threads > \
    $generated_data_dir/$file.tok
done

# Tokenizing europarl
for file in $(ls "$europarl_data_dir")
do
    # The language code is the last two characters of the file name
    lang="${file: -2}"
    cat "$europarl_data_dir/$file" | \
    $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl | \
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -threads $threads > \
    $generated_data_dir/europarl.$lang.tok
done
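
A quick sanity check can be run after tokenization. The sketch below assumes that the Moses scripts emit exactly one output line per input line, so a raw file and its `.tok` counterpart should have the same number of lines. The helper names `count_lines` and `mismatched_pairs` are made up for this illustration.

```python
def count_lines(path):
    """Count the lines in a file (a final line without a newline still counts)."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def mismatched_pairs(pairs):
    """Return the (raw, tokenized) path pairs whose line counts differ."""
    return [(raw, tok) for raw, tok in pairs
            if count_lines(raw) != count_lines(tok)]
```

The `pairs` argument would be built from the files in `../data/multi30k` and their `.tok` counterparts in `../data/generated`. An empty result means the parallel data is still aligned line by line.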

Ok, we have the tokenized data. Now let's learn the BPEs.

In [None]:
%%bash

subword_nmt="../ext-libs/subword-nmt"
src="../data/generated/europarl.en.tok"
trg="../data/generated/europarl.fr.tok"
# num_bpes=8000
num_bpes_src=8000
num_bpes_trg=8000

# bpes="../data/generated/bpes"
bpes_src="../data/generated/bpes.en"
bpes_trg="../data/generated/bpes.fr"
vocab_src="../data/generated/vocab.en"
vocab_trg="../data/generated/vocab.fr"

# python ../ext-libs/subword-nmt/learn_joint_bpe_and_vocab.py \
#     --input "$src" "$trg" \
#     -s "$num_bpes" \
#     -o "$bpes" \
#     --write-vocabulary "$vocab_src" "$vocab_trg"
python "$subword_nmt/learn_bpe.py" -s $num_bpes_src < $src > $bpes_src
python "$subword_nmt/learn_bpe.py" -s $num_bpes_trg < $trg > $bpes_trg

# Let's apply bpe here for all our tokenized files
for file in $(ls ../data/generated/*.tok)
do
    # The language code sits right before the ".tok" suffix, e.g. "europarl.en.tok" -> "en"
    lang="${file: -6:2}"
    python "$subword_nmt/apply_bpe.py" -c "../data/generated/bpes.$lang" < "$file" > "$file.bpe"
done

# And finally, we should generate vocab
python "$subword_nmt/get_vocab.py" < "$src.bpe" > $vocab_src
python "$subword_nmt/get_vocab.py" < "$trg.bpe" > $vocab_trg
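
To make clear what learning BPEs means, here is a toy sketch of the merge procedure: repeatedly find the most frequent pair of adjacent symbols and fuse it into a new symbol. This is not the `subword-nmt` implementation, which is more optimized. The function names, the `</w>` end-of-word marker and the tiny vocabulary in the usage note are assumptions of the sketch.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs


def merge_pair(pair, vocab):
    """Fuse every occurrence of `pair` (as whole symbols) into one symbol."""
    merged = "".join(pair)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda _: merged, word): freq
            for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the list of learned merges and the vocabulary after applying them."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab
```

Words are given as space-separated symbols with their counts, for example `learn_bpe({'l o w </w>': 5, 'n e w e s t </w>': 6}, 10)`. In the cell above, `-s 8000` plays the role of `num_merges`.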

It is not a good idea to learn the embeddings here, but...

In [None]:
import fasttext

num_threads = 6

# Note: these are skip-gram models, even though the output prefix ends in "_cbow"
model_src = fasttext.skipgram('../data/generated/europarl.en.tok.bpe',
                              '../trained_models/europarl.en.tok.bpe_cbow',
                              dim=512, min_count=1, silent=0, thread=num_threads)

model_trg = fasttext.skipgram('../data/generated/europarl.fr.tok.bpe',
                              '../trained_models/europarl.fr.tok.bpe_cbow',
                              dim=512, min_count=1, silent=0, thread=num_threads)

In [None]:
# Let's remove .bin files which we do not use
!rm ../trained_models/*.bin
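
With the `.bin` files gone, the embeddings presumably live in the text `.vec` files. Below is a minimal sketch of reading them, assuming the usual fastText text layout: a header line with the vocabulary size and the dimension, then one token per line followed by its vector components. The names `parse_vec` and `load_vec` are made up for this illustration.

```python
def parse_vec(lines):
    """Parse fastText-style .vec lines into (dim, {token: [floats]})."""
    lines = iter(lines)
    # Header: "<vocabulary size> <dimension>"
    _count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def load_vec(path):
    """Read a .vec file from disk and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_vec(handle)
```

Assuming the training cell wrote `<output prefix>.vec`, the source-side embeddings would be loaded with `load_vec('../trained_models/europarl.en.tok.bpe_cbow.vec')`.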