# Transformer-Based Machine Translation Using GluonNLP

In this notebook, we introduce the *Transformer* model for sequence-to-sequence transduction.
Introduced in the 2017 NIPS paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762), the model is both more accurate and cheaper to train than previous seq2seq models.
Because the model is fairly complex, we import some of the code to keep the notebook manageable and show only the important parts: we introduce the data format used by the model and load a pretrained model that achieves state-of-the-art results on the WMT 2014 English-German dataset.
Finally, we show how to evaluate the model and translate a few sentences ourselves with the `BeamSearchTranslator`.

## Preparation 

In [1]:
import warnings
warnings.filterwarnings('ignore')

import random
import numpy as np
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp
import nmt

ctx = mx.gpu(0)

### Pretrained Model

GluonNLP provides pretrained models for many standard architectures.
First, let's retrieve a model pretrained on the WMT2014 English-to-German task.

In [2]:
transformer_model, src_vocab, tgt_vocab = nmt.transformer.get_model(
    'transformer_en_de_512',
    dataset_name='WMT2014',
    pretrained=True,
    ctx=ctx)

In [3]:
print(src_vocab)
print(tgt_vocab)

Vocab(size=36794, unk="<unk>", reserved="['<eos>', '<bos>', '<eos>']")
Vocab(size=36794, unk="<unk>", reserved="['<eos>', '<bos>', '<eos>']")


In [4]:
print(transformer_model)

NMTModel(
  (encoder): TransformerEncoder(
    (dropout_layer): Dropout(p = 0.1, axes=())
    (layer_norm): LayerNorm(eps=1e-05, axis=-1, center=True, scale=True, in_channels=512)
    (transformer_cells): HybridSequential(
      (0): TransformerEncoderCell(
        (dropout_layer): Dropout(p = 0.1, axes=())
        (attention_cell): MultiHeadAttentionCell(
          (_base_cell): DotProductAttentionCell(
            (_dropout_layer): Dropout(p = 0.1, axes=())
          )
          (proj_query): Dense(512 -> 512, linear)
          (proj_key): Dense(512 -> 512, linear)
          (proj_value): Dense(512 -> 512, linear)
        )
        (proj): Dense(512 -> 512, linear)
        (ffn): PositionwiseFFN(
          (ffn_1): Dense(512 -> 2048, Activation(relu))
          (ffn_2): Dense(2048 -> 512, linear)
          (dropout_layer): Dropout(p = 0.1, axes=())
          (layer_norm): LayerNorm(eps=1e-05, axis=-1, center=True, scale=True, in_channels=512)
        )
        (layer_norm): LayerNorm
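
The attention cells in the printout build on scaled dot-product attention as defined in the Transformer paper: each query is compared with every key, the scores are divided by the square root of the key dimension, and the softmax of the scores weights the values. The following is a minimal pure-Python sketch of that formula for a single head. It is not GluonNLP's implementation, and the function names and toy inputs are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention on plain Python lists (hypothetical helper).

    queries: list of query vectors, keys/values: lists of equal length.
    Returns one context vector per query.
    """
    d_k = len(keys[0])
    contexts = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        context = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        contexts.append(context)
    return contexts


# Toy example: the query matches the first key more closely.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
context = scaled_dot_product_attention(queries, keys, values)
print(context)
```

In the model above, `MultiHeadAttentionCell` applies this operation in several heads after the `proj_query`, `proj_key`, and `proj_value` projections.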

### Dataset

The following shows how to process the dataset.
The processing steps include:

1. Clip the source and target sequences to a maximum length.
2. Split each input string into a list of tokens.
3. Map each token to its index in the vocabulary.
4. Append an EOS token to the source sentence, and add BOS and EOS tokens to the target sentence.
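
The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the toy vocabulary, its indices, and the `process_pair` helper are hypothetical, clipping is applied to the token list, and a non-positive maximum length is taken to mean "no clipping".

```python
# Hypothetical toy vocabulary; the real vocabularies come with the pretrained model.
toy_vocab = {'<unk>': 0, '<pad>': 1, '<bos>': 2, '<eos>': 3,
             'hello': 4, 'world': 5, 'hallo': 6, 'welt': 7}


def process_pair(src, tgt, vocab, src_max_len=-1, tgt_max_len=-1):
    """Turn a (source, target) string pair into lists of token indices."""
    # Split the strings into tokens.
    src_tokens = src.split()
    tgt_tokens = tgt.split()
    # Clip to the maximum length (assumption: values <= 0 disable clipping).
    if src_max_len > 0:
        src_tokens = src_tokens[:src_max_len]
    if tgt_max_len > 0:
        tgt_tokens = tgt_tokens[:tgt_max_len]
    # Map tokens to indices, falling back to the unknown token.
    unk = vocab['<unk>']
    src_ids = [vocab.get(tok, unk) for tok in src_tokens]
    tgt_ids = [vocab.get(tok, unk) for tok in tgt_tokens]
    # Source gets EOS appended; target gets BOS in front and EOS at the end.
    src_ids = src_ids + [vocab['<eos>']]
    tgt_ids = [vocab['<bos>']] + tgt_ids + [vocab['<eos>']]
    return src_ids, tgt_ids


src_ids, tgt_ids = process_pair('hello world', 'hallo welt', toy_vocab)
print(src_ids)
print(tgt_ids)
```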

Let's first look at the WMT 2014 corpus.

In [5]:
src_lang, tgt_lang = 'en', 'de'
data_test = nlp.data.WMT2014BPE('newstest2014', src_lang=src_lang, tgt_lang=tgt_lang)

In [6]:
data_test[0]

('Or@@ land@@ o Blo@@ om and Mir@@ anda Ker@@ r still love each other',
 'Or@@ land@@ o Blo@@ om und Mir@@ anda Ker@@ r lieben sich noch immer')

In [7]:
test_text = nlp.data.WMT2014('newstest2014', src_lang=src_lang, tgt_lang=tgt_lang)
test_text[0]

('Orlando Bloom and Miranda Kerr still love each other',
 'Orlando Bloom und Miranda Kerr lieben sich noch immer')
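
Comparing the two samples shows the byte-pair encoding (BPE) convention used in the first dataset: a token ending in `@@` continues into the next token. As a small sketch, the plain text can be recovered by removing every `@@ ` marker. The `merge_bpe` helper below is hypothetical and only handles this simple case.

```python
def merge_bpe(text, marker='@@ '):
    """Join BPE subword units back into words by dropping the marker."""
    return text.replace(marker, '')


bpe_source = 'Or@@ land@@ o Blo@@ om and Mir@@ anda Ker@@ r still love each other'
plain_source = merge_bpe(bpe_source)
print(plain_source)
```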

In [8]:
import dataprocessor

print(dataprocessor.TrainValDataTransform.__doc__)

Transform the machine translation dataset.

    Clip source and the target sentences to the maximum length. For the source sentence, append the
    EOS. For the target sentence, append BOS and EOS.

    Parameters
    ----------
    src_vocab : Vocab
    tgt_vocab : Vocab
    src_max_len : int
    tgt_max_len : int
    


In [9]:
transform_fn = dataprocessor.TrainValDataTransform(src_vocab, tgt_vocab, -1, -1)
dataset_processed = data_test.transform(transform_fn, lazy=False)
print(*dataset_processed[0], sep='\n')

[ 7300 21964 23833  1935 24004 11836  6698 11839  5565 25464 27950 22544
 16202 24272     3]
[    2  7300 21964 23833  1935 24004 29615  6698 11839  5565 25464 22297
 27121 23712 20558     3]


### Sampler and DataLoader

In [10]:
data_test_with_len = gluon.data.SimpleDataset([(ele[0], ele[1], len(
    ele[0]), len(ele[1]), i) for i, ele in enumerate(dataset_processed)])

We have now attached the source length, target length, and sample index to every
processed test sample in `data_test_with_len`. The next step is to construct the
sampler and the `DataLoader`. We start with the batchify function, which pads and
stacks sequences to form mini-batches.

In [11]:
test_batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(),
    nlp.data.batchify.Pad(),
    nlp.data.batchify.Stack(dtype='float32'),
    nlp.data.batchify.Stack(dtype='float32'),
    nlp.data.batchify.Stack())
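
To illustrate what padding does, here is a minimal pure-Python sketch: every sequence in a batch is extended to the length of the longest one so that the batch forms a rectangular array. The `pad_batch` helper and the pad value of 0 are assumptions made for this example; this is not the code behind `nlp.data.batchify.Pad`.

```python
def pad_batch(sequences, pad_val=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


batch = pad_batch([[7, 8, 3], [5, 3], [9, 4, 6, 3]])
for row in batch:
    print(row)
```

The length and index fields need no padding because they are scalars, which is why the cell above simply stacks them.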

We can then construct bucketing samplers, which generate batches by grouping
sequences with similar lengths.

In [12]:
bucket_scheme = nlp.data.ExpWidthBucket(bucket_len_step=1.2)

In [13]:
test_batch_sampler = nlp.data.FixedBucketSampler(
    lengths=dataset_processed.transform(lambda src, tgt: len(tgt)),
    use_average_length=True,
    bucket_scheme=bucket_scheme,
    batch_size=256)
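
The idea behind bucketing can be sketched in a few lines: each sequence goes into the smallest bucket whose key is at least its length, and batches are then drawn from within a single bucket, so little padding is wasted. The sketch below is a simplification with hypothetical names and hand-picked bucket keys and batch sizes; it does not reproduce how `FixedBucketSampler` chooses them.

```python
def make_batches(lengths, bucket_keys, batch_sizes):
    """Group sample indices into batches of similar length.

    bucket_keys must be sorted in ascending order, and the largest key is
    assumed to be at least as large as the longest sequence.
    """
    buckets = {key: [] for key in bucket_keys}
    for idx, length in enumerate(lengths):
        # Smallest bucket key that can hold this sequence.
        key = next(k for k in bucket_keys if length <= k)
        buckets[key].append(idx)
    batches = []
    for key, size in zip(bucket_keys, batch_sizes):
        members = buckets[key]
        # Cut each bucket into consecutive chunks of at most `size` samples.
        for start in range(0, len(members), size):
            batches.append(members[start:start + size])
    return batches


lengths = [3, 9, 4, 12, 10, 14, 5]
batches = make_batches(lengths, bucket_keys=[5, 10, 15], batch_sizes=[2, 2, 1])
print(batches)
```

Buckets holding longer sequences get smaller batch sizes here, so that each batch contains a roughly similar number of tokens.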

In [14]:
print(test_batch_sampler.stats())

FixedBucketSampler:
  sample_num=2737, batch_num=364
  key=[10, 14, 19, 25, 33, 42, 53, 66, 81, 100]
  cnt=[101, 243, 386, 484, 570, 451, 280, 172, 41, 9]
  batch_size=[25, 18, 13, 10, 8, 6, 5, 4, 3, 2]
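
The reported statistics can be cross-checked with a little arithmetic: the bucket counts add up to the number of samples, and each bucket contributes its count divided by its batch size, rounded up, to the number of batches. The snippet below copies `cnt` and `batch_size` from the output above.

```python
cnt = [101, 243, 386, 484, 570, 451, 280, 172, 41, 9]
batch_size = [25, 18, 13, 10, 8, 6, 5, 4, 3, 2]

# Every sample falls into exactly one bucket.
sample_num = sum(cnt)

# Ceiling division: a partially filled last batch still counts as a batch.
batch_num = sum((c + b - 1) // b for c, b in zip(cnt, batch_size))

print(sample_num, batch_num)  # prints: 2737 364
```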


Given the sampler, we can create an iterable `DataLoader`.

In [15]:
test_data_loader = gluon.data.DataLoader(
    data_test_with_len,
    batch_sampler=test_batch_sampler,
    batchify_fn=test_batchify_fn,
    num_workers=8)
len(test_data_loader)

364

## Pretrained Transformer

We loaded the pretrained state-of-the-art (SOTA) Transformer through the model API in GluonNLP at the start of this notebook.
The same API gives you easy access to the SOTA machine translation model for use in your own applications.

### Model evaluation

Next, we evaluate the pretrained model. For ease of illustration, we only report the loss on the test dataset. To obtain the SOTA results, set `demo=False`; the model then achieves a BLEU score of 27.09.

In [18]:
translator = nmt.translation.BeamSearchTranslator(
    model=transformer_model,
    beam_size=4,
    scorer=nlp.model.BeamSearchScorer(alpha=0.6, K=5),
    max_length=200)
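
The `alpha` and `K` arguments control length normalization during beam search. Assuming the scorer applies the commonly used length penalty `((K + length) / (K + 1)) ** alpha` and divides the accumulated log-probability by it, the effect can be sketched as below. The helper names and the toy log-probabilities are hypothetical; this is not the library's code.

```python
def length_penalty(length, alpha=0.6, K=5):
    """Length penalty; equals 1.0 for a hypothesis of length 1."""
    return ((K + length) / (K + 1)) ** alpha


def normalized_score(sum_log_prob, length, alpha=0.6, K=5):
    """Accumulated log-probability divided by the length penalty."""
    return sum_log_prob / length_penalty(length, alpha, K)


# A longer hypothesis with a lower raw log-probability can still rank higher
# once the score is normalized by length.
short_hyp = normalized_score(-4.0, length=4)
long_hyp = normalized_score(-5.0, length=10)
print(short_hyp, long_hyp)
```

Without such a normalization, beam search tends to prefer short translations, because every additional token can only lower the summed log-probability.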

In [19]:
test_loss_function = nmt.loss.SoftmaxCEMaskedLoss()
test_loss_function.hybridize()
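
The point of a masked loss is that padded positions must not contribute to the result. The following pure-Python sketch shows the idea for a single sequence: the negative log-likelihood is summed over the valid positions only and averaged over them. The function name, the input layout, and the averaging are assumptions made for illustration; the normalization in `SoftmaxCEMaskedLoss` may differ.

```python
import math


def masked_cross_entropy(log_probs, targets, valid_length):
    """Average negative log-likelihood over the first `valid_length` steps.

    log_probs: one list of log-probabilities (over the vocabulary) per step.
    targets: the target index at each step, padding included.
    """
    total = 0.0
    for step in range(valid_length):
        total += -log_probs[step][targets[step]]
    return total / valid_length


log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.3), math.log(0.3), math.log(0.4)],  # padded position, ignored
]
targets = [0, 1, 0]
loss = masked_cross_entropy(log_probs, targets, valid_length=2)
print(round(loss, 4))
```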

In [20]:
detokenizer = nlp.data.SacreMosesDetokenizer()

In [21]:
import utils
test_loss, _ = utils.evaluate(transformer_model, test_data_loader,
                              test_loss_function, translator, tgt_vocab,
                              detokenizer, ctx)

print('Pretrained model test loss %.2f' % (test_loss))

Pretrained model test loss 1.21


## Conclusion

- We showcased the Transformer, a powerful deep neural network approach to sequence transduction (seq2seq) tasks, and showed how to access a pretrained model in GluonNLP that matches SOTA results on the WMT 2014 English-German task.
- The GluonNLP Toolkit provides high-level APIs that can drastically simplify model development for NLP tasks that share the encoder-decoder structure.
- The low-level APIs in the GluonNLP Toolkit make it easy to customize these models.

The documentation can be found at http://gluon-nlp.mxnet.io/index.html

The code is available at https://github.com/dmlc/gluon-nlp

## References

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.