# Implementation

In the previous tutorial I described all the stages of a machine translation experiment, from data preparation through model training to evaluation. This time I am going to walk through the process once more, but I will implement every stage with the help of [AllenNLP](https://allennlp.org). It is a powerful NLP library that offers a set of abstractions on top of PyTorch and reduces the amount of repetitive code[^allen]. AllenNLP is so great that I can say without a doubt it is like [Keras](https://keras.io/) for PyTorch.

[^allen]: [Here](https://allennlp.org/tutorials) you can find a great tutorial on part-of-speech tagging with AllenNLP.

## Data preparation

As in the previous tutorial, I use **bilingual datasets** from [tatoeba.org](https://tatoeba.org/eng/downloads).
You can download data for many language pairs from http://www.manythings.org/anki/. If your pair is missing there (as Ukrainian-German is), you can use my script [get_dataset.py](https://github.com/tsdaemon/neural-experiments/blob/master/nmt/scripts/get_dataset.py), which downloads the raw data from Tatoeba and extracts a bilingual dataset as a `csv` file.

```python
import pandas as pd
import os

source_lang = 'ukr'
target_lang = 'deu'
data_dir = 'data/'

os.chdir('../')
corpus = pd.read_csv(os.path.join(data_dir, '{}-{}.csv'.format(source_lang, target_lang)), delimiter='\t')
corpus.head(3)
```

| Unnamed: 0 | textukr | textdeu |
|---|---|---|
| 0 | Він наказав мені негайно вийти з кімнати. | Er befahl mir, den Raum umgehend zu verlassen. |
| 1 | У всесвіті багато галактик. | Es gibt viele Galaxien im Universum. |
| 2 | У Всесвіті є багато галактик. | Es gibt viele Galaxien im Universum. |
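
The file is tab-separated despite its `.csv` extension, with one `text<lang>` column per language. As a minimal sketch of that layout, the same structure can be read with the standard library alone. The `read_pairs` helper and the inlined two-row sample are illustrative assumptions, not part of `get_dataset.py`:

```python
import csv
import io
from typing import List, TextIO, Tuple

# Two rows copied from the preview above, in the same tab-separated layout.
# The empty first header cell is the column pandas displays as "Unnamed: 0".
sample_tsv = (
    "\ttextukr\ttextdeu\n"
    "0\tУ всесвіті багато галактик.\tEs gibt viele Galaxien im Universum.\n"
    "1\tУ Всесвіті є багато галактик.\tEs gibt viele Galaxien im Universum.\n"
)


def read_pairs(handle: TextIO, source_lang: str, target_lang: str) -> List[Tuple[str, str]]:
    """Return (source sentence, target sentence) pairs from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["text" + source_lang], row["text" + target_lang]) for row in reader]


pairs = read_pairs(io.StringIO(sample_tsv), "ukr", "deu")
for source_sent, target_sent in pairs:
    print(source_sent, "->", target_sent)
```

Note that both Ukrainian sentences map to the same German sentence, so the target side of the corpus contains duplicates.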


AllenNLP code uses [MyPy](http://mypy-lang.org/), a static type checker for Python. It relies on type annotations, which let you declare type contracts for function arguments and return values. Like this:

```python
from typing import List

def tokenize(input: str) -> List[str]:
    return input.split(' ')
```

Although I have always loved Python for its dynamic typing, I find this semi-static approach very convenient for development, especially in large projects. Therefore I decided to adopt it in this tutorial.
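
One detail worth keeping in mind is that Python itself does not enforce these annotations at runtime; they are checked only when a tool such as `mypy` is run over the source. A small sketch of that behaviour follows. The `count_tokens` helper is a hypothetical addition for illustration, and the parameter is renamed to `text` to avoid shadowing the built-in `input`:

```python
from typing import List, get_type_hints


def tokenize(text: str) -> List[str]:
    return text.split(' ')


def count_tokens(tokens: List[str]) -> int:
    return len(tokens)


# The annotations are stored on the function object and can be inspected.
hints = get_type_hints(tokenize)
print(hints)

# Correct usage: a list of tokens goes in, the number of tokens comes out.
correct = count_tokens(tokenize('Es gibt viele Galaxien'))

# Incorrect usage: a raw sentence is passed instead of a token list.
# Python runs this without complaint and counts characters instead of tokens;
# a static checker is expected to report the mismatched argument type.
wrong = count_tokens('Es gibt viele Galaxien')

print(correct, wrong)
```

The second call is exactly the kind of silent mistake that static checking is meant to catch before the code runs.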

First I defined a custom word splitter for Ukrainian, `TokenizeUkWordSplitter`, which inherits from `WordSplitter`. AllenNLP's `WordTokenizer` offers an abstraction over sentence parsing that composes splitting a sentence into tokens, filtering stop words, and stemming. Here I only need a custom splitter for the Ukrainian language.

```python
from typing import List

from allennlp.data.tokenizers.token import Token
from allennlp.data.tokenizers.word_splitter import WordSplitter
from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

from tokenize_uk import tokenize_words

SOS_token = '<start>'
EOS_token = '<end>'
UNK_token = '<unk>'
PAD_token = '<pad>'

class TokenizeUkWordSplitter(WordSplitter):
    """
    Word splitter wrapper around tokenize_uk.tokenize_words
    """
    def split_words(self, sentence: str) -> List[Token]:
        return [Token(word) for word in tokenize_words(sentence)]

ukr_tokenizer = WordTokenizer(word_splitter=TokenizeUkWordSplitter(),
                              start_tokens=[SOS_token],
                              end_tokens=[EOS_token])

ukr_tokenizer.tokenize("Він наказав мені негайно вийти з кімнати.")
```

```
[<start>, Він, наказав, мені, негайно, вийти, з, кімнати, ., <end>]
```

```python
deu_tokenizer = WordTokenizer()
deu_tokenizer.tokenize("Er befahl mir, den Raum umgehend zu verlassen.")
```

```
[Er, befahl, mir, ,, den, Raum, umgehend, zu, verlassen, .]
```
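
To make the composition of splitter, filter, and stemmer concrete, here is a library-free sketch of the same idea. It is not AllenNLP's implementation: `simple_split`, `identity`, `compose_tokenizer`, and the regular expression are illustrative assumptions, and plain strings stand in for `Token` objects.

```python
import re
from typing import Callable, List, Sequence

Splitter = Callable[[str], List[str]]
Step = Callable[[List[str]], List[str]]


def simple_split(sentence: str) -> List[str]:
    # Runs of word characters become tokens; every other non-space
    # character (punctuation) becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", sentence)


def identity(tokens: List[str]) -> List[str]:
    # Placeholder for a stop-word filter or stemmer that changes nothing.
    return tokens


def compose_tokenizer(splitter: Splitter = simple_split,
                      word_filter: Step = identity,
                      stemmer: Step = identity,
                      start_tokens: Sequence[str] = (),
                      end_tokens: Sequence[str] = ()) -> Callable[[str], List[str]]:
    def tokenize(sentence: str) -> List[str]:
        # Split, then filter, then stem, then wrap with boundary tokens.
        tokens = stemmer(word_filter(splitter(sentence)))
        return list(start_tokens) + tokens + list(end_tokens)
    return tokenize


# Mirrors the two configurations above: boundary tokens on the source side only.
ukr_like = compose_tokenizer(start_tokens=["<start>"], end_tokens=["<end>"])
deu_like = compose_tokenizer()

print(ukr_like("У всесвіті багато галактик."))
print(deu_like("Er befahl mir, den Raum umgehend zu verlassen."))
```

This regular expression would split a word containing an apostrophe into several tokens, which is one reason to prefer a dedicated splitter for Ukrainian.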

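Next I implemented a custom `DatasetReader` for my data. It accepts the corpus as a pandas dataframe and yields one `Instance` per sentence pair, with a tokenized source field and a tokenized target field.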

```python
from typing import Iterator

from allennlp.data import Instance
from allennlp.data.fields import TextField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.token_indexers import SingleIdTokenIndexer

class TatoebaDatasetReader(DatasetReader):
    """
    DatasetReader for NMT data; accepts a pandas dataframe with source and target sentences.
    """
    def __init__(self,
                 source_lang: str,
                 target_lang: str,
                 source_tokenizer: WordTokenizer,
                 target_tokenizer: WordTokenizer) -> None:
        super().__init__(lazy=False)

        self.source_lang = source_lang
        self.target_lang = target_lang

        self.source_tokenizer = source_tokenizer
        self.target_tokenizer = target_tokenizer

        self.token_indexers = {"tokens": SingleIdTokenIndexer()}

    def pair_to_instance(self, source_tokens: List[Token], target_tokens: List[Token]) -> Instance:
        source_field = TextField(source_tokens, self.token_indexers)
        target_field = TextField(target_tokens, self.token_indexers)
        fields = {
            "source_field": source_field,
            "target_field": target_field
        }

        return Instance(fields)

    def _read(self, df: pd.DataFrame) -> Iterator[Instance]:
        # Column names follow the 'text<lang>' pattern of the corpus file.
        for _, row in df.iterrows():
            source_sent = row['text' + self.source_lang]
            target_sent = row['text' + self.target_lang]
            yield self.pair_to_instance(
                self.source_tokenizer.tokenize(source_sent),
                self.target_tokenizer.tokenize(target_sent)
            )

source_tokenizer = ukr_tokenizer
target_tokenizer = deu_tokenizer

reader = TatoebaDatasetReader(source_lang, target_lang, source_tokenizer, target_tokenizer)
reader.read(corpus)
```

2591it [00:01, 2295.99it/s]
18036it [00:06, 2601.50it/s]


[<allennlp.data.instance.Instance at 0x7fd50a56e9e8>,
 <allennlp.data.instance.Instance at 0x7fd50a572908>,
 <allennlp.data.instance.Instance at 0x7fd50a578898>,
 <allennlp.data.instance.Instance at 0x7fd50a57d860>,
 <allennlp.data.instance.Instance at 0x7fd50a503748>,
 <allennlp.data.instance.Instance at 0x7fd50a59cef0>,
 <allennlp.data.instance.Instance at 0x7fd50a593048>,
 <allennlp.data.instance.Instance at 0x7fd50a591f28>,
 <allennlp.data.instance.Instance at 0x7fd50a5a07f0>,
 <allennlp.data.instance.Instance at 0x7fd50a5a44a8>,
 <allennlp.data.instance.Instance at 0x7fd50a5a8470>,
 <allennlp.data.instance.Instance at 0x7fd50a5bb828>,
 <allennlp.data.instance.Instance at 0x7fd50a5ae908>,
 <allennlp.data.instance.Instance at 0x7fd50a59acf8>,
 <allennlp.data.instance.Instance at 0x7fd50a39d198>,
 <allennlp.data.instance.Instance at 0x7fd50a39d470>,
 <allennlp.data.instance.Instance at 0x7fd50a39dd30>,
 <allennlp.data.instance.Instance at 0x7fd50a39ddd8>,
 <allennlp.data.instance.Ins