# Neural Machine Translation

In this exercise you need to implement a Seq2seq architecture of your choice and train a neural German-Ukrainian translator.

This is mostly a copy of the [practical-pytorch](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb) tutorial.

In [65]:
import string
import re
import random
import time
import math
import os

import pandas as pd
import nltk
from tokenize_uk import tokenize_words
from tqdm import tqdm

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

Managing the **environment** (CPU or GPU) is not automatic in PyTorch: you need to explicitly place every tensor and model on the chosen device. Use this constant to switch GPU usage on or off.

In [68]:
USE_CUDA = False
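
In PyTorch 0.4 and later, a `torch.device` object is a common alternative to a boolean flag. The sketch below is one possible setup and not part of the original notebook; the names `device`, `example` and `layer` are assumptions made for illustration.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors and modules are placed on the chosen device with .to(device).
example = torch.zeros(2, 3).to(device)
layer = torch.nn.Linear(3, 1).to(device)

# Both operands live on the same device, so the forward pass works.
print(layer(example).shape)
```

With this pattern the same code runs on both CPU and GPU machines without editing a constant.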

## Load data

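We will use open-source sentence pairs from [tatoeba.org](https://tatoeba.org/eng/downloads). Download the [sentences archive](http://downloads.tatoeba.org/exports/sentences.tar.bz2) and the [links archive](https://downloads.tatoeba.org/exports/links.tar.bz2), and extract them into the data directory (`./../data/part3/` in the cells below).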

In [28]:
%env DATA_DIR = ./../data/part3/

env: DATA_DIR=./../data/part3/


In [33]:
!wget https://downloads.tatoeba.org/exports/sentences.tar.bz2 -P $DATA_DIR
!tar xvjC $DATA_DIR -f $DATA_DIR/sentences.tar.bz2

x sentences.csv


In [34]:
!wget https://downloads.tatoeba.org/exports/links.tar.bz2 -P $DATA_DIR
!tar xvjC $DATA_DIR -f $DATA_DIR/links.tar.bz2

--2018-07-01 13:36:49--  https://downloads.tatoeba.org/exports/links.tar.bz2
Resolving downloads.tatoeba.org (downloads.tatoeba.org)... 94.130.77.194
Connecting to downloads.tatoeba.org (downloads.tatoeba.org)|94.130.77.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 72852273 (69M) [application/octet-stream]
Saving to: ‘./../data/part3/links.tar.bz2’


2018-07-01 13:37:31 (1.66 MB/s) - ‘./../data/part3/links.tar.bz2’ saved [72852273/72852273]

x links.csv


In [36]:
data_dir = './../data/part3/'
sentences = pd.read_csv(os.path.join(data_dir, 'sentences.csv'), names=['id', 'lang', 'text'], header=None, delimiter='\t')
links = pd.read_csv(os.path.join(data_dir, 'links.csv'), names=['sent_id', 'tran_id'], header=None, delimiter='\t')

Choose the languages you want to train a translator for.

In [56]:
source_lang = 'ukr'
target_lang = 'deu'

source_sentences = sentences[sentences.lang == source_lang]
source_sentences = source_sentences.merge(links, left_on='id', right_on='sent_id')
target_sentences = sentences[sentences.lang == target_lang]

bilang_sentences = source_sentences.merge(target_sentences, left_on='tran_id', 
                                          right_on='id', 
                                          suffixes=[source_lang, target_lang])
bilang_sentences = bilang_sentences[['text'+source_lang, 'text'+target_lang]]

file_name = os.path.join(data_dir, '{source}-{target}.csv'.format(source=source_lang, target=target_lang)) 
bilang_sentences.to_csv(file_name, index=False, sep='\t')

bilang_sentences.head()

|   | textukr | textdeu |
|---|---|---|
| 0 | Він наказав мені негайно вийти з кімнати. | Er befahl mir, den Raum umgehend zu verlassen. |
| 1 | У всесвіті багато галактик. | Es gibt viele Galaxien im Universum. |
| 2 | У Всесвіті є багато галактик. | Es gibt viele Galaxien im Universum. |
| 3 | Вона приймає душ щоранку. | Sie nimmt jeden Morgen eine Dusche. |
| 4 | Вона приймає душ щоранку. | Sie duscht jeden Morgen. |
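
The two merges in the cell above can be hard to follow on the full export. Here is a small sketch on a toy table: the ids are invented for illustration and the sentences are borrowed from the sample rows. It shows how `links` connects one source sentence to each of its translations.

```python
import pandas as pd

# Toy stand-ins for sentences.csv and links.csv (ids are made up).
sentences = pd.DataFrame({
    "id": [1, 2, 3],
    "lang": ["ukr", "deu", "deu"],
    "text": ["Вона приймає душ щоранку.",
             "Sie nimmt jeden Morgen eine Dusche.",
             "Sie duscht jeden Morgen."],
})
links = pd.DataFrame({"sent_id": [1, 1, 2, 3], "tran_id": [2, 3, 1, 1]})

# Step 1: attach translation ids to every Ukrainian sentence.
source = sentences[sentences.lang == "ukr"].merge(links, left_on="id", right_on="sent_id")
# Step 2: look up the German text behind each translation id.
target = sentences[sentences.lang == "deu"]
pairs = source.merge(target, left_on="tran_id", right_on="id", suffixes=("ukr", "deu"))
pairs = pairs[["textukr", "textdeu"]]

print(pairs)
```

A sentence with several translations appears once per translation, which is why the same Ukrainian sentence can occur in consecutive rows.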


### Indexing words

We'll need a unique index per word to use as the inputs and targets of the networks later. To keep track of all this we will use a helper class called `Vocab`, which has word → index (`word2index`) and index → word (`index2word`) dictionaries, as well as a count of each word (`word2count`) that can be used later to replace rare words.

In [59]:
SOS_token = 0
EOS_token = 1
UNK_token = 2
PAD_token = 3

class Vocab:
    def __init__(self, tokenizer):
        self.index2word = {0: "SOS", 1: "EOS", 2: "UNK", 3: "PAD"}
        self.word2index = {v: k for k, v in self.index2word.items()}
        self.word2count = {}
        self.tokenizer = tokenizer

        self.n_words = 4
      
    def index_words(self, sentence):
        tokenized = self.tokenizer(sentence)
        for word in tokenized:
            self.index_word(word)
        return tokenized

    def index_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
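
The notebook does not show how `word2count` would be used to replace rare words. The following is a minimal sketch under assumptions: `MIN_COUNT` and `replace_rare` are hypothetical names, and a plain `collections.Counter` stands in for `Vocab.word2count`.

```python
from collections import Counter

MIN_COUNT = 2  # hypothetical threshold: words seen fewer times become UNK

def replace_rare(tokens, word2count, min_count=MIN_COUNT):
    """Return a copy of tokens with rare words replaced by the "UNK" symbol."""
    return [tok if word2count.get(tok, 0) >= min_count else "UNK" for tok in tokens]

# Tiny made-up corpus of already tokenized sentences.
corpus = [["sie", "duscht", "jeden", "morgen"],
          ["sie", "nimmt", "jeden", "morgen", "eine", "dusche"]]
word2count = Counter(tok for sent in corpus for tok in sent)

print(replace_rare(corpus[0], word2count))
```

Here "duscht" occurs only once in the toy corpus, so it is the only word of the first sentence mapped to `UNK`.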

In [42]:
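# Lowercase, trim, and remove non-letter characters.
# The character class keeps Latin, German (äöüß) and Ukrainian Cyrillic letters;
# an ASCII-only class would strip every Ukrainian word.
def normalize_string(s):
    s = s.lower().strip()
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zäöüßа-яґєії.!?]+", r" ", s)
    return s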

To read the data file we load it with pandas, one sentence pair per row. Define tokenizers for your languages to split sentences into words.

In [61]:
source_tokenizer = tokenize_words
target_tokenizer = nltk.tokenize.WordPunctTokenizer().tokenize

def read_langs(source_lang, source_tokenizer, target_lang, target_tokenizer, input_file):
    corpora = pd.read_csv(input_file, delimiter='\t')
    
    source_vocab = Vocab(source_tokenizer)
    target_vocab = Vocab(target_tokenizer)
    
    source_corpora = []
    target_corpora = []
    for i, row in tqdm(corpora.iterrows()):
        source_sent = row['text'+source_lang]
        target_sent = row['text'+target_lang]
        
        source_tokenized = source_vocab.index_words(source_sent)
        target_tokenized = target_vocab.index_words(target_sent)
        
        source_corpora.append(source_tokenized)
        target_corpora.append(target_tokenized)
    
    return source_vocab, target_vocab, list(zip(source_corpora, target_corpora))

source_vocab, target_vocab, corpora = read_langs(source_lang, source_tokenizer, target_lang, target_tokenizer, file_name)

14755it [00:01, 10853.61it/s]
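
To see what word-and-punctuation tokenization produces, here is a rough stand-in built on `re` alone. It assumes the pattern `\w+|[^\w\s]+`, which, as far as the nltk documentation describes it, is the pattern behind `WordPunctTokenizer`. `simple_tokenize` is a hypothetical helper, not part of `nltk` or `tokenize_uk`.

```python
import re

def simple_tokenize(sentence):
    """Split a sentence into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", sentence)

tokens = simple_tokenize("Er befahl mir, den Raum umgehend zu verlassen.")
print(tokens)
print(len(tokens))
```

Punctuation marks become separate tokens, so they count toward any limit on sentence length.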


### Filtering

Since there are a lot of example sentences and we want to train something quickly, we'll trim the data set to only relatively short sentences. Here the maximum length is 8 tokens (punctuation included) for both the source and the target sentence.

In [64]:
MAX_LENGTH = 8

# Keep only pairs where both sides have at most MAX_LENGTH tokens
corpora_filtered = [(source_sent, target_sent) for source_sent, target_sent in corpora
                    if len(source_sent) <= MAX_LENGTH and len(target_sent) <= MAX_LENGTH]

len(corpora_filtered)

12833

### Turning training data into Tensors
To train we need to turn the sentences into something the neural network can understand, which of course means numbers. Each sentence will be split into words and turned into a Tensor, where each word is replaced with its index (from the `Vocab` indexes made earlier). While creating these tensors we will also prepend the SOS token and append the EOS token to mark where the sentence starts and ends.

A Tensor is a multi-dimensional array of numbers, defined with some type e.g. FloatTensor or LongTensor. In this case we'll be using LongTensor to represent an array of integer indexes.

In [69]:
def indexes_from_sentence(vocab, sentence):
    return [vocab.word2index[word] for word in sentence]

def tensor_from_sentence(vocab, sentence):
    indexes = indexes_from_sentence(vocab, sentence)
    indexes.append(EOS_token)
    indexes.insert(0, SOS_token)
    tensor = torch.LongTensor(indexes)
    # Move the tensor to the GPU when requested (.cuda() returns a new tensor)
    if USE_CUDA:
        tensor = tensor.cuda()
    return tensor

def tensors_from_pair(source_sent, target_sent):
    source_tensor = tensor_from_sentence(source_vocab, source_sent)
    target_tensor = tensor_from_sentence(target_vocab, target_sent)
    
    return (source_tensor, target_tensor)

tensors = []
for source_sent, target_sent in corpora_filtered:
    tensors.append(tensors_from_pair(source_sent, target_sent))
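
`PAD_token` is defined above but not used in the cells shown here, since sentences are processed one at a time. If you later batch sentences of different lengths, one option is `torch.nn.utils.rnn.pad_sequence`. The sketch below uses made-up index lists and is only an illustration of that option.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

SOS_token, EOS_token, PAD_token = 0, 1, 3  # same values as in the notebook

# Two made-up index sequences of different lengths, already wrapped in SOS/EOS.
seqs = [torch.LongTensor([SOS_token, 5, 6, 7, EOS_token]),
        torch.LongTensor([SOS_token, 8, EOS_token])]

# Result has shape (max_len, batch); shorter sequences are filled with PAD_token.
batch = pad_sequence(seqs, padding_value=PAD_token)

print(batch.shape)
print(batch[:, 1])
```

The padded positions should then be excluded from the loss, for example through the `ignore_index` argument of `nn.NLLLoss`.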

## Encoder

The encoder of a seq2seq network is an RNN that outputs some value for every word from the input sentence. For every input word the encoder outputs a vector and a hidden state, and uses the hidden state for the next input word.

In [75]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=n_layers, bidirectional=False)
        
    def forward(self, word_inputs, hidden):
        # Note: we run this all at once (over the whole input sequence)
        seq_len = len(word_inputs)
        embedded = self.embedding(word_inputs).view(seq_len, 1, -1)
        output, hidden = self.lstm(embedded, hidden)
        return output, hidden

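    def init_hidden(self):
        # nn.LSTM expects its state as a tuple (h_0, c_0),
        # each of shape (n_layers, batch=1, hidden_size)
        h0 = torch.zeros(self.n_layers, 1, self.hidden_size)
        c0 = torch.zeros(self.n_layers, 1, self.hidden_size)
        if USE_CUDA:
            h0, c0 = h0.cuda(), c0.cuda()
        return (h0, c0)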
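
The relation between the per-word outputs and the hidden state can be checked on a tiny `nn.LSTM`. This is an illustrative sketch with arbitrary sizes, separate from `EncoderRNN`; the variable names are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=4, num_layers=1)

# Input layout is (seq_len, batch, features): 3 "words", batch of 1.
inputs = torch.randn(3, 1, 4)
h0 = torch.zeros(1, 1, 4)  # (num_layers, batch, hidden_size)
c0 = torch.zeros(1, 1, 4)

# The LSTM state is a tuple (h, c), not a single tensor.
outputs, (h_n, c_n) = lstm(inputs, (h0, c0))

print(outputs.shape)  # one vector per input word
print(h_n.shape)      # final hidden state, one slice per layer
# For a unidirectional LSTM the last output equals the top layer's final hidden state.
print(torch.allclose(outputs[-1], h_n[-1]))
```

This is why a decoder can be started from the encoder's final `(h, c)` state: it summarizes the whole input sequence.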

## Decoder

In [78]:
class DecoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(DecoderRNN, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=n_layers, bidirectional=False)
        
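    def forward(self, word_inputs, hidden):
        # Note: we run this one by one, so reshape the single word
        # embedding to (seq_len=1, batch=1, hidden_size) as nn.LSTM expects
        embedded = self.embedding(word_inputs).view(1, 1, -1)
        output, hidden = self.lstm(embedded, hidden)
        return output, hidden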

## Test

To make sure the encoder and decoder models are working (and working together), we'll do a quick test with fake word inputs:

In [81]:
encoder_test = EncoderRNN(10, 10, 2)
print(encoder_test)

encoder_hidden = encoder_test.init_hidden()
word_input = torch.LongTensor([1, 2, 3])
if USE_CUDA:
    encoder_test.cuda()
    word_input = word_input.cuda()


encoder_outputs, encoder_hidden = encoder_test(word_input, encoder_hidden)

EncoderRNN(
  (embedding): Embedding(10, 10)
  (lstm): LSTM(10, 10, num_layers=2)
)


In [84]:
encoder_outputs.shape

torch.Size([3, 1, 10])

In [85]:
decoder_test = DecoderRNN(10, 10, 2)
print(decoder_test)

word_inputs = torch.LongTensor([1, 2, 3])
# Start the decoder from the encoder's final (h, c) state
decoder_hidden = encoder_hidden

if USE_CUDA:
    decoder_test.cuda()
    word_inputs = word_inputs.cuda()

for i in range(3):
    # Feed one word at a time, shaped (seq_len=1, batch=1)
    decoder_output, decoder_hidden = decoder_test(word_inputs[i].view(1, 1), decoder_hidden)
    print(decoder_output.size(), decoder_hidden[0].size())

DecoderRNN(
  (embedding): Embedding(10, 10)
  (lstm): LSTM(10, 10, num_layers=2)
)


torch.Size([1, 1, 10]) torch.Size([2, 1, 10])
torch.Size([1, 1, 10]) torch.Size([2, 1, 10])
torch.Size([1, 1, 10]) torch.Size([2, 1, 10])