Let's try training in an unsupervised manner.
For this we need the following:
* noise added to the input sentences (token swaps and token removal)
* pretrained word embeddings (learned from WMT below)
* back-translation
* adversarial losses:
    * FFN (TODO: try RNN/Transformer?) discriminator(s) to distinguish encodings
    * TODO: discriminator(s) to distinguish produced styles/translations?
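
The noising step from the list above could look roughly like the sketch below. It is only an illustration and not the implementation behind `UMTBatcher`: the function name `add_noise` and the parameters `drop_prob` and `max_shift` are hypothetical, and the concrete noise model (independent token drops plus a bounded local shuffle) is an assumption.

```python
import random


def add_noise(tokens, drop_prob=0.1, max_shift=3, rng=None):
    """Return a noised copy of `tokens`: random removals plus local swaps.

    Hypothetical helper; drop_prob and max_shift are illustrative defaults.
    """
    rng = rng or random.Random()

    # Token removal: drop each token independently with probability drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        # Avoid producing an empty sentence: keep one random token.
        kept = [rng.choice(tokens)]

    # Local swaps: perturb every position by a random offset in [0, max_shift]
    # and sort by the perturbed positions, so tokens only move a few slots.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=keys.__getitem__)

    return [kept[i] for i in order]


sentence = 'a man is riding a bike down the street'.split()
print(add_noise(sentence, rng=random.Random(42)))
```

With `drop_prob=0` and `max_shift=0` the sentence comes back unchanged, which is convenient for checking that the rest of the pipeline works before noise is switched on.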

In [1]:
import sys; sys.path += ['..', '../src']

OK, let's now learn word embeddings.

We'll learn embeddings from WMT and learn the translation task on multi30k, so that results are more comparable (if we extracted 30k sentences from WMT instead, we couldn't compare with anybody). Besides, this is precisely what the authors of the article do.

In [2]:
import os

multi30k_data_dir = '../data/multi30k'
wmt17_data_dir = '../data/wmt17'
generated_data_dir = '../data/generated'

# Create the output directory (and any missing parents) if it does not exist yet
os.makedirs(generated_data_dir, exist_ok=True)

First thing: tokenization.

In [3]:
import os
import nltk


nltk.download('punkt')
files_to_tokenize = []

# Tokenizing multi30k
for file_name in os.listdir(multi30k_data_dir):
    input_file_path = '{}/{}'.format(multi30k_data_dir, file_name)
    output_file_path = '{}/{}.tok'.format(generated_data_dir, file_name)

    files_to_tokenize.append((input_file_path, output_file_path))

# Tokenizing WMT
wmt17_file_name_src = '{}/{}'.format(wmt17_data_dir, 'europarl-v7.de-en.en')
wmt17_file_name_trg = '{}/{}'.format(wmt17_data_dir, 'europarl-v7.de-en.de')
files_to_tokenize.append((wmt17_file_name_src, '%s/wmt17.en.tok' % generated_data_dir))
files_to_tokenize.append((wmt17_file_name_trg, '%s/wmt17.de.tok' % generated_data_dir))


# Tokenization
for input_file_path, output_file_path in files_to_tokenize:
    print('Tokenizing', input_file_path)
    with open(input_file_path, 'r', encoding='utf-8') as file:
        lines = file.read().splitlines()
    
    tokenized = [' '.join(nltk.word_tokenize(line)) for line in lines]
    
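    with open(output_file_path, 'w', encoding='utf-8') as file:
        for line in tokenized:
            # Text mode already translates '\n' into the platform line ending,
            # so writing os.linesep here would produce '\r\r\n' on Windows
            file.write(line + '\n')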

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/universome/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Tokenizing ../data/multi30k/train.de
Tokenizing ../data/multi30k/train.en
Tokenizing ../data/multi30k/val.en
Tokenizing ../data/multi30k/test.en
Tokenizing ../data/multi30k/test.de
Tokenizing ../data/multi30k/val.de
Tokenizing ../data/wmt17/europarl-v7.de-en.en
Tokenizing ../data/wmt17/europarl-v7.de-en.de


OK, we have tokenized stuff. Let's now compute BPEs.

In [68]:
%%bash
# TODO: use WMT after dev is done
# src="../data/generated/wmt17.en.tok"
# trg="../data/generated/wmt17.de.tok"
src="../data/generated/train.en.tok"
trg="../data/generated/train.de.tok"
num_bpes=16000

bpes="../data/generated/bpes"
src_vocab="../data/generated/vocab.en"
trg_vocab="../data/generated/vocab.de"

python ../ext-libs/subword-nmt/learn_joint_bpe_and_vocab.py \
    --input "$src" "$trg" \
    -s "$num_bpes" \
    -o "$bpes" \
    --write-vocabulary "$src_vocab" "$trg_vocab"

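# Let's apply BPE to all our tokenized files
# (globbing directly instead of parsing `ls` output keeps paths with spaces intact)
for file in ../data/generated/*.tok
do
    # File names look like train.en.tok, so the language code is 6 characters from the end
    lang="${file: -6:2}"
    echo "For file $file we use lang: $lang."
    vocab="../data/generated/vocab.$lang"

    python ../ext-libs/subword-nmt/apply_bpe.py -c "$bpes" \
       --vocabulary "$vocab" < "$file" > "$file.bpe"
done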

For file ../data/generated/test.de.tok we use lang: de.
For file ../data/generated/test.en.tok we use lang: en.
For file ../data/generated/train.de.tok we use lang: de.
For file ../data/generated/train.en.tok we use lang: en.
For file ../data/generated/val.de.tok we use lang: de.
For file ../data/generated/val.en.tok we use lang: en.
For file ../data/generated/wmt17.de.tok we use lang: de.
For file ../data/generated/wmt17.en.tok we use lang: en.
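
For evaluation, the BPE segmentation usually has to be undone so that scores are computed on whole words. A minimal sketch, assuming the default `@@` separator that subword-nmt appends to every non-final subword unit (the helper name `undo_bpe` is hypothetical and not part of the library; the example sentence is made up):

```python
import re

# A non-final subword unit ends with "@@" and is followed by a space
# (or by the end of the line if the sentence was cut off).
_BPE_JOINT = re.compile(r'(@@ )|(@@ ?$)')


def undo_bpe(line):
    """Merge BPE subword units back into whole words."""
    return _BPE_JOINT.sub('', line)


print(undo_bpe('ein Mann f@@ ähr@@ t Fahr@@ rad'))
```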


Argh, we finally have BPE files for wmt17 and can generate embeddings for them. Let's do it!

In [1]:
import torch
import numpy as np
import fasttext
from tqdm import tqdm


# TODO: skipgram is more accurate (but slower to train)
# TODO: do not forget to use wmt17 after development is done
# model_src = fasttext.cbow('../data/generated/wmt17.en.tok.bpe', '../trained_models/wmt17.en.tok.bpe_cbow', dim=512)
# model_trg = fasttext.cbow('../data/generated/wmt17.de.tok.bpe', '../trained_models/wmt17.de.tok.bpe_cbow', dim=512)
# model_src = fasttext.cbow('../data/generated/train.en.tok.bpe', '../trained_models/wmt17.en.tok.bpe_cbow',
#                           dim=512, min_count=1, silent=0)
# model_trg = fasttext.cbow('../data/generated/train.de.tok.bpe', '../trained_models/wmt17.de.tok.bpe_cbow',
#                           dim=512, min_count=1, silent=0)


def load_embeddings(embeddings_path):
    embeddings = {}
    
    with open(embeddings_path, 'r', encoding='utf-8') as f:
        next(f) # Skipping first line, because it's header info
        for line in tqdm(f):
            values = line.rstrip().split(' ')
            word = values[0]
            embeddings[word] = np.asarray(values[1:], dtype='float32')
        
    return embeddings


def init_emb_matrix(emb_matrix, emb_dict, token2id):
    # Copy pretrained vectors into the rows of emb_matrix (in place);
    # tokens without a pretrained vector keep their random initialization
    for word, idx in token2id.items():
        if word not in emb_dict:
            print('Skipping ', word)
            continue
        emb_matrix[idx] = torch.FloatTensor(emb_dict[word])
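
`load_embeddings` skips the first line of the `.vec` file. In the fastText text format that line holds the number of words and the vector dimension, which can be used to validate the file before copying vectors into a model. A hedged sketch (the function `parse_vec` and the sample rows are made up for illustration; it works on any iterable of lines so it can be tried without a trained model):

```python
def parse_vec(lines):
    """Parse fastText-style .vec lines into ({word: [floats]}, dim).

    Hypothetical helper: assumes a '<n_words> <dim>' header followed by
    one '<word> <v1> ... <v_dim>' row per word, separated by single spaces.
    """
    lines = iter(lines)
    n_words, dim = (int(x) for x in next(lines).split())

    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            raise ValueError('Expected %d values, got %d for row: %r'
                             % (dim, len(parts) - 1, parts[0]))
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    if len(vectors) != n_words:
        raise ValueError('Header announces %d words, file contains %d'
                         % (n_words, len(vectors)))

    return vectors, dim


sample = ['2 3', 'the 0.1 0.2 0.3 ', 'cat@@ 1 2 3']
vectors, dim = parse_vec(sample)
print(dim, sorted(vectors))
```

Checking `dim` against the model size (512 in this notebook) before calling `init_emb_matrix` would catch a mismatch between the trained embeddings and the transformer early.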

In [2]:
# Let's remove .bin files which we do not use
# !rm ../trained_models/*.bin

Now we should initialize our transformer with the learned embeddings, initialize the discriminator, and add the adversarial loss.
Once that is done, all that's left is training the thing!
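
To make the adversarial part concrete: the discriminator is trained with binary cross-entropy on the true domain labels, while the encoder is trained on the same predictions with the labels flipped. The pure-Python sketch below only illustrates that idea; the function `bce` and the probabilities are made up, and 0 = source, 1 = target is the labeling assumed here.

```python
import math


def bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


# Hypothetical discriminator outputs: P(encoding comes from the target language)
preds_src = [0.2, 0.3]  # encodings of source sentences
preds_trg = [0.9, 0.6]  # encodings of target sentences

# Discriminator objective: predict the true domain
discr_loss = bce(preds_src, [0, 0]) + bce(preds_trg, [1, 1])
# Encoder ("generator") objective: make the discriminator predict the wrong domain
gen_loss = bce(preds_src, [1, 1]) + bce(preds_trg, [0, 0])

print(discr_loss, gen_loss)
```

A discriminator that cannot tell the languages apart outputs 0.5 everywhere, which gives a loss of ln 2 (about 0.693) per term for both objectives, so values near 0.69 at the start of training are what one would expect.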

In [2]:
import sys; sys.path += ['..', '../src']

######################################################################

import os

from src.vocab import Vocab
from src.transformer.models import Transformer
from src.models import FFN

DATA_PATH = '../data/generated'
max_len = 200 # TODO: Dostoevsky has much longer sentences

vocab_src = Vocab.from_file(os.path.join(DATA_PATH, 'vocab.en'))
vocab_trg = Vocab.from_file(os.path.join(DATA_PATH, 'vocab.de'))

transformer = Transformer(len(vocab_src), len(vocab_trg), max_len)
discriminator = FFN(512, 3, 1024)

# Initializing transformer encoder and decoder with embeddings
embeddings_src = load_embeddings('../trained_models/wmt17.en.tok.bpe_cbow.vec')
embeddings_trg = load_embeddings('../trained_models/wmt17.de.tok.bpe_cbow.vec')

init_emb_matrix(transformer.encoder.src_word_emb.weight.data, embeddings_src, vocab_src.token2id)
init_emb_matrix(transformer.decoder.tgt_word_emb.weight.data, embeddings_trg, vocab_trg.token2id)

train_src_path = os.path.join(DATA_PATH, 'train.en.tok.bpe')
train_trg_path = os.path.join(DATA_PATH, 'train.de.tok.bpe')
val_src_path = os.path.join(DATA_PATH, 'val.en.tok.bpe')
val_trg_path = os.path.join(DATA_PATH, 'val.de.tok.bpe')

def read_lines(path):
    # Read a file into a list of lines, closing the handle afterwards
    with open(path, 'r', encoding='utf-8') as f:
        return f.read().splitlines()

train_src = read_lines(train_src_path)
train_trg = read_lines(train_trg_path)
val_src = read_lines(val_src_path)
val_trg = read_lines(val_trg_path)

train_src = [s.split() for s in train_src]
train_trg = [s.split() for s in train_trg]
val_src = [s.split() for s in val_src]
val_trg = [s.split() for s in val_trg]

train_src_idx = [[vocab_src.token2id.get(t, vocab_src.unk) for t in s] for s in train_src]
train_trg_idx = [[vocab_trg.token2id.get(t, vocab_trg.unk) for t in s] for s in train_trg]

7416it [00:01, 5396.73it/s]
10502it [00:02, 4550.66it/s]


Skipping  __BOS__
Skipping  __EOS__
Skipping  __UNK__
Skipping  __PAD__
Skipping  __BOS__
Skipping  __EOS__
Skipping  __UNK__
Skipping  __PAD__


And now we should write the training procedure, including back-translation and noising.
That's not as easy as it may seem.
We should also write the loss functions and add training visualization.
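
The reconstruction and translation losses are cross-entropies over the vocabulary in which padding positions must not count. The sketch below shows this in pure Python for illustration only: `masked_cross_entropy` and `PAD = 0` are assumptions (the real index lives in `src.transformer.constants`), and averaging over non-PAD positions is meant to mirror what a mean-reduced `nn.CrossEntropyLoss` does when the PAD class weight is zero and all other weights are one.

```python
import math

PAD = 0  # hypothetical PAD index, used only in this sketch


def masked_cross_entropy(logits, targets, pad_id=PAD):
    """Mean cross-entropy over the positions whose target is not PAD.

    logits:  list of per-position score lists (one score per vocabulary entry)
    targets: list of gold token ids, same length as logits
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_id:
            continue  # padding contributes neither to the sum nor to the count
        m = max(row)
        # log-sum-exp with the max subtracted for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    # An all-PAD batch returns 0.0 here to keep the sketch simple
    return total / count if count else 0.0


# Teacher forcing: the decoder reads tokens [:-1] and is scored against tokens [1:]
sentence = [2, 5, 7, 3, PAD, PAD]  # made-up ids: BOS, two words, EOS, padding
decoder_input, gold = sentence[:-1], sentence[1:]

uniform_logits = [[0.0] * 8 for _ in gold]  # a model that knows nothing yet
print(masked_cross_entropy(uniform_logits, gold))
```

For a model that predicts a uniform distribution the loss equals the logarithm of the vocabulary size (8 in this toy example), regardless of how much padding the batch contains.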

In [29]:
import src
import importlib

importlib.reload(src.utils.umt_batcher)
importlib.reload(src.transformer.models)

<module 'src.transformer.models' from '../src/transformer/models.py'>

In [31]:
import os

from src.vocab import Vocab
from src.transformer.models import Transformer
from src.models import FFN

DATA_PATH = '../data/generated'
max_len = 200 # TODO: Dostoevsky has much longer sentences

transformer = Transformer(len(vocab_src), len(vocab_trg), max_len)
discriminator = FFN(512, 3, 1024)

init_emb_matrix(transformer.encoder.src_word_emb.weight.data, embeddings_src, vocab_src.token2id)
init_emb_matrix(transformer.decoder.tgt_word_emb.weight.data, embeddings_trg, vocab_trg.token2id)

#####################

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import Adam, RMSprop
from tqdm import tqdm

from src.utils.umt_batcher import UMTBatcher
import src.transformer.constants as constants

use_cuda = torch.cuda.is_available()

def reconstruction_criterion(vocab_size):
    ''' With PAD token zero weight '''
    weight = torch.ones(vocab_size)
    weight[constants.PAD] = 0

    return nn.CrossEntropyLoss(weight)


ae_criterion_src = reconstruction_criterion(len(vocab_src))
ae_criterion_trg = reconstruction_criterion(len(vocab_trg))
translation_criterion_src_to_trg = reconstruction_criterion(len(vocab_trg))
translation_criterion_trg_to_src = reconstruction_criterion(len(vocab_src))
adv_criterion = nn.BCELoss()

transformer_optimizer = Adam(transformer.get_trainable_parameters(), lr=3e-4, betas=(0.5, 0.999))
discriminator_optimizer = RMSprop(discriminator.parameters(), lr=5e-4)

training_data = UMTBatcher(train_src_idx, train_trg_idx, vocab_src, vocab_trg,
                           batch_size=32, shuffle=True)

losses = []

for batch in tqdm(training_data, mininterval=2, desc='  - (Training)   ', leave=False):
    src_noised, trg_noised, src, trg = batch
    
    # Resetting gradients
    transformer_optimizer.zero_grad()
    discriminator_optimizer.zero_grad()
    
    
    ### Training autoencoder ###
    transformer.train()
    # Computing translation for ~src->src and ~trg->trg autoencoding tasks
    print('Training discriminator')
    print('Computing predictions')
    preds_src, encodings_src = transformer(src_noised, src, return_encodings=True, use_src_embs_in_decoder=True)
    preds_trg, encodings_trg = transformer(trg_noised, trg, return_encodings=True, use_trg_embs_in_encoder=True)

    print('Computing losses')
    ae_loss_src = ae_criterion_src(preds_src, src[:, 1:].contiguous().view(-1))
    ae_loss_trg = ae_criterion_trg(preds_trg, trg[:, 1:].contiguous().view(-1))
    
    print('Computing gradients')
    ae_loss_src.backward(retain_graph=True)
    ae_loss_trg.backward(retain_graph=True)
    
    ### Training translator ###
    print('Training translator')
    transformer.eval()
    # Get translations for backtranslation
    print('Computing back-translations')
    bt_trg, *_ = transformer.translate_batch(src, beam_size=2, max_len=10)
    bt_src, *_ = transformer.translate_batch(trg, use_trg_embs_in_encoder=True, use_src_embs_in_decoder=True, beam_size=2, max_len=10)
    
    bt_trg = Variable(torch.LongTensor(bt_trg))
    bt_src = Variable(torch.LongTensor(bt_src))

    # We are given n-best translations. Let's pick the best one
    bt_trg = bt_trg[:,0,:]
    bt_src = bt_src[:,0,:]

    if use_cuda:
        bt_trg = bt_trg.cuda()
        bt_src = bt_src.cuda()
    
    # Computing predictions for back-translated sentences
    transformer.train()
    print('Computing predictions (translations of back-translations)')
    bt_src_preds = transformer(bt_trg, src, use_trg_embs_in_encoder=True, use_src_embs_in_decoder=True)
    bt_trg_preds = transformer(bt_src, trg)
    
    print('Computing losses')
    loss_bt_src = translation_criterion_trg_to_src(bt_src_preds, src[:, 1:].contiguous().view(-1))
    loss_bt_trg = translation_criterion_src_to_trg(bt_trg_preds, trg[:, 1:].contiguous().view(-1))
    
    print('Computing gradients')
    loss_bt_src.backward(retain_graph=True)
    loss_bt_trg.backward(retain_graph=True)
    
    print('Updating weights')
    transformer_optimizer.step()
    
    
    # Resetting gradients before adversarial update
    transformer_optimizer.zero_grad()

    
    ### Training discriminator ###
    print('Training discriminator')
    print('Computing predictions')
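    # Flatten predictions to shape (N,) so that they match the targets passed to BCELoss
    domains_preds_src = discriminator(encodings_src.view(-1, 512)).view(-1)
    domains_preds_trg = discriminator(encodings_trg.view(-1, 512)).view(-1)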
    
    # Generating targets for discriminator and generator
    true_domains_src = Variable(torch.Tensor([0] * len(domains_preds_src)))
    true_domains_trg = Variable(torch.Tensor([1] * len(domains_preds_trg)))
    fake_domains_src = Variable(torch.Tensor([1] * len(domains_preds_src)))
    fake_domains_trg = Variable(torch.Tensor([0] * len(domains_preds_trg)))

    if use_cuda:
        true_domains_src = true_domains_src.cuda()
        true_domains_trg = true_domains_trg.cuda()
        fake_domains_src = fake_domains_src.cuda()
        fake_domains_trg = fake_domains_trg.cuda()

    # True domains for discriminator loss
    print('Computing losses')
    discr_loss_src = adv_criterion(domains_preds_src, true_domains_src)
    discr_loss_trg = adv_criterion(domains_preds_trg, true_domains_trg)

    print('Computing gradients')
    discr_loss_src.backward(retain_graph=True)
    discr_loss_trg.backward(retain_graph=True)

    print('Updating parameters')
    discriminator_optimizer.step()

    transformer_optimizer.zero_grad()
    discriminator_optimizer.zero_grad()

    ### Training generator ###
    print('Training generator')
    print('Computing losses')
    # Faking domains for generator loss
    gen_loss_src = adv_criterion(domains_preds_src, fake_domains_src)
    gen_loss_trg = adv_criterion(domains_preds_trg, fake_domains_trg)

    print('Computing gradients')
    gen_loss_src.backward(retain_graph=True)
    gen_loss_trg.backward(retain_graph=True)

    print('Updating parameters')
    transformer_optimizer.step()


    ### Now, let's compute some statistics and visualize our stuff
    losses.append({
        'ae_loss_src': ae_loss_src.data[0],
        'ae_loss_trg': ae_loss_trg.data[0],
        'loss_bt_src': loss_bt_src.data[0],
        'loss_bt_trg': loss_bt_trg.data[0],
        'discr_loss_src': discr_loss_src.data[0],
        'discr_loss_trg': discr_loss_trg.data[0],
        'gen_loss_src': gen_loss_src.data[0],
        'gen_loss_trg': gen_loss_trg.data[0]  
    })

    print('Losses:', losses[-1])

Skipping  __BOS__
Skipping  __EOS__
Skipping  __UNK__
Skipping  __PAD__
Skipping  __BOS__
Skipping  __EOS__
Skipping  __UNK__
Skipping  __PAD__


  result = self.forward(*input, **kwargs)


Training discriminator
Computing predictions
Computing losses
Computing gradients
Training translator
Computing back-translations



  


Computing predictions (translations of back-translations)
Computing losses
Computing gradients
Updating weights
Training discriminator
Computing predictions
Computing losses
Computing gradients


  "Please ensure they have the same size.".format(target.size(), input.size()))
  "Please ensure they have the same size.".format(target.size(), input.size()))


Updating parameters
Training generator
Computing losses
Computing gradients
Updating parameters


  - (Training)   :   0%|          | 1/907 [01:06<16:41:19, 66.31s/it]

Losses: {'ae_loss_src': 8.957502365112305, 'ae_loss_trg': 9.309548377990723, 'loss_bt_src': 8.96019172668457, 'loss_bt_trg': 9.308553695678711, 'discr_loss_src': 0.6752368807792664, 'discr_loss_trg': 0.7115498781204224, 'gen_loss_src': 0.7113943099975586, 'gen_loss_trg': 0.6750862002372742}
Training discriminator
Computing predictions
Computing losses
Computing gradients
Training translator
Computing back-translations




Computing predictions (translations of back-translations)


KeyboardInterrupt: 
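
Per-step losses like the ones collected in `losses` are noisy, so it usually helps to smooth them before plotting. A small sketch, assuming the list-of-dicts layout used above (the helper `running_mean`, the window size and the numbers in the example are illustrative, not taken from a real run):

```python
def running_mean(losses, window=50):
    """Average every loss key over a sliding window of the last `window` steps."""
    smoothed = []
    for i in range(len(losses)):
        chunk = losses[max(0, i - window + 1): i + 1]
        smoothed.append({key: sum(step[key] for step in chunk) / len(chunk)
                         for key in chunk[-1]})
    return smoothed


# Made-up values, only to show the shape of the input and output
history = [{'ae_loss_src': 1.0}, {'ae_loss_src': 3.0}, {'ae_loss_src': 5.0}]
print(running_mean(history, window=2))
```

The smoothed list keeps one entry per training step, so each key can be plotted as a single curve.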