<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Moduels" data-toc-modified-id="Import-Moduels-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Modules</a></span></li><li><span><a href="#Set-Enviroment" data-toc-modified-id="Set-Enviroment-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Set Environment</a></span></li><li><span><a href="#GRU복습" data-toc-modified-id="GRU복습-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>GRU Review</a></span></li><li><span><a href="#Prepairing-Data" data-toc-modified-id="Prepairing-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Preparing Data</a></span></li><li><span><a href="#Building-the-Seq2Seq-Model" data-toc-modified-id="Building-the-Seq2Seq-Model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Building the Seq2Seq Model</a></span><ul class="toc-item"><li><span><a href="#Encoder" data-toc-modified-id="Encoder-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Encoder</a></span></li><li><span><a href="#Attention" data-toc-modified-id="Attention-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Attention</a></span></li><li><span><a href="#Decoder" data-toc-modified-id="Decoder-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Decoder</a></span></li><li><span><a href="#Seq2Seq" data-toc-modified-id="Seq2Seq-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Seq2Seq</a></span></li></ul></li><li><span><a href="#Train" data-toc-modified-id="Train-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Train</a></span></li><li><span><a href="#Reference" data-toc-modified-id="Reference-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

The implementation follows the [official PyTorch tutorial](https://tutorials.pytorch.kr/beginner/torchtext_translation_tutorial.html).

## Import Modules

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

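# NOTE: Field and BucketIterator belong to the legacy torchtext API
# (torchtext < 0.9; available as torchtext.legacy in 0.9-0.11, removed in 0.12).
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator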

import spacy
import numpy as np

import os
import random
import math
import time

## Set Environment

In [2]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    # benchmark mode auto-tunes cuDNN algorithms and can break reproducibility
    torch.backends.cudnn.benchmark = False

In [3]:
seed_everything(1234)

BATCH_SIZE = 128

device = "cuda" if torch.cuda.is_available() else "cpu" 

## GRU Review

In [11]:
rnn = nn.GRU(input_size = 10, hidden_size = 10, num_layers = 1, bidirectional=True)
input = torch.randn(5, 3, 10)  # seq_len, batch, input_size
h0 = torch.randn(1*2, 3, 10) # num_layers*num_direction, batch, hidden_size
output, hn = rnn(input, h0)

In [12]:
print(output.size())
print(hn.size())

torch.Size([5, 3, 20])
torch.Size([2, 3, 10])


In [13]:
torch.cat((hn[-2,:,:], hn[-1,:,:]), dim = 1).size()

torch.Size([3, 20])

## Preparing Data

In [4]:
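# Download the spaCy models first (run in the shell of your conda environment):
# python -m spacy download de
# python -m spacy download en
# NOTE: the 'de' / 'en' shortcut names only work in spaCy 2.x; spaCy 3 requires
# full package names such as 'de_core_news_sm' and 'en_core_web_sm'.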

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

In [5]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [6]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

In [7]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), fields = (SRC, TRG))

https://torchtext.readthedocs.io/en/latest/datasets.html#torchtext.datasets.Multi30k

**Parameters**

**`exts`** – A tuple containing the extension to path for each language.

**`fields`** – A tuple containing the fields that will be used for data in each language.

**`root`** – Root dataset storage directory. Default is ‘.data’.

**`train`** – The prefix of the train data. Default: ‘train’.

**`validation`** – The prefix of the validation data. Default: ‘val’.

**`test`** – The prefix of the test data. Default: ‘test’.

In [8]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

In [9]:
print(f'SRC vocab is {len(SRC.vocab)}')
print(f'TRG vocab is {len(TRG.vocab)}')

SRC vocab is 7855
TRG vocab is 5893


In [10]:
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

## Building the Seq2Seq Model

### Encoder

![image](https://user-images.githubusercontent.com/47301926/77835354-3152e680-718f-11ea-81e5-f9a3c63e34a6.png)

The encoder uses a bidirectional GRU.

Here, $x_0^\rightarrow = \text{<sos>}, x_1^\rightarrow = \text{guten}$ and $x_0^\leftarrow = \text{<eos>}, x_1^\leftarrow = \text{morgen}$.

The RNN returns `outputs` and `hidden`.

`outputs` is of size **[src len, batch size, hid dim * num directions]**.

`hidden` is of size **[n layers * num directions, batch size, hid dim]**, where `hidden[-2, :, :]` gives the top layer **forward RNN hidden state** after the final time-step and `hidden[-1, :, :]` gives the top layer **backward RNN hidden state** after the final time-step (i.e. after it has seen the first word in the sentence).
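The relationship between `outputs` and `hidden` described above can be checked on a toy bidirectional GRU. This is a minimal sketch with made-up sizes (`hid_dim = 4`, a random input), not part of the model: it only verifies that the final forward state is the forward half of the last time step, and the final backward state is the backward half of the first time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hid_dim = 4  # toy size chosen for this example only
rnn = nn.GRU(input_size=3, hidden_size=hid_dim, num_layers=1, bidirectional=True)
x = torch.randn(5, 2, 3)  # [src len, batch size, input size]

outputs, hidden = rnn(x)
# outputs = [src len, batch size, hid dim * 2]
# hidden  = [n layers * 2, batch size, hid dim]

# forward direction: its final state is the forward half of the LAST time step
forward_matches = torch.allclose(hidden[-2], outputs[-1, :, :hid_dim])
# backward direction: its final state is the backward half of the FIRST time step
backward_matches = torch.allclose(hidden[-1], outputs[0, :, hid_dim:])

print(outputs.shape, hidden.shape)
print(forward_matches, backward_matches)
```

Both checks should hold for a single-layer bidirectional GRU, which is why concatenating `hidden[-2]` and `hidden[-1]` summarizes the sentence read in both directions.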

In [24]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden

`outputs` holds the encoder hidden states for every time step, and `hidden` becomes the first hidden state of the decoder.

### Attention

![image](https://user-images.githubusercontent.com/47301926/77835366-4cbdf180-718f-11ea-8863-5aceb4a8cad4.png)


Graphically, this looks something like above. 

This is for calculating the very first attention vector, where $s_{t-1} = s_0 = z$. 

The green/teal blocks represent the hidden states from both the forward and backward RNNs, and the attention computation is all done within the pink block.

The attention layer takes in the **previous hidden state of the decoder**, $s_{t-1}$, and all of the stacked **forward and backward hidden states from the encoder**, $H$.

It outputs an attention vector, $a_t$, that is the length of the source sentence; each element is between 0 and 1 and the entire vector sums to 1.

Intuitively, $a_t$ represents which words in the source sentence we should pay the most attention to in order to correctly predict the next word to decode, $\hat{y}_{t+1}$.

We first calculate the energy, $E_t$, between the previous decoder hidden state and the encoder hidden states by concatenating them together and passing them through a linear layer (`attn`) and a $\tanh$ activation function. Since $H$ holds one state per source token while $s_{t-1}$ is a single tensor, $s_{t-1}$ is first repeated once per source token.

$$E_t = \tanh(\text{attn}(s_{t-1}, H))$$

This can be thought of as calculating how well each encoder hidden state "matches" the previous decoder hidden state.

$$\hat{a}_t = v E_t$$

We can think of $v$ as the weights for a weighted sum of the energy across all encoder hidden states.

Finally, we ensure the attention vector fits the constraints of having all elements between 0 and 1 and the vector summing to 1 by passing it through a $\text{softmax}$ layer.

$$a_t = \text{softmax}(\hat{a}_t)$$

This gives us the attention over the source sentence!
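The three equations above can be traced with plain tensors. The sketch below is illustrative only: `W_attn`, `b_attn` and `v` are hypothetical random parameters standing in for the learned `attn` and `v` layers, and all sizes are toy values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, src_len, enc_hid_dim, dec_hid_dim = 2, 5, 3, 4  # toy sizes

s_prev = torch.randn(batch_size, dec_hid_dim)           # s_{t-1}
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)   # encoder states, batch first

# hypothetical parameters standing in for the learned attn layer and v
W_attn = torch.randn(enc_hid_dim * 2 + dec_hid_dim, dec_hid_dim)
b_attn = torch.zeros(dec_hid_dim)
v = torch.randn(dec_hid_dim)

# repeat s_{t-1} for every source position so it can be concatenated with H
s_rep = s_prev.unsqueeze(1).expand(-1, src_len, -1)     # [batch, src len, dec hid dim]

# E_t = tanh(attn(s_{t-1}, H))
energy = torch.tanh(torch.cat((s_rep, H), dim=2) @ W_attn + b_attn)
# \hat{a}_t = v E_t, one score per source position
scores = energy @ v                                     # [batch, src len]
# a_t = softmax(\hat{a}_t), normalized over the source positions
a_t = F.softmax(scores, dim=1)

print(a_t.shape)
print(a_t.sum(dim=1))
```

Each row of `a_t` is a distribution over the source positions, so the rows sum to 1 up to floating-point error.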

In [20]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention= [batch size, src len]
        
        return F.softmax(attention, dim=1)

### Decoder

Next up is the decoder.

The decoder contains the attention layer, attention, which takes the previous hidden state, $s_{t-1}$, all of the encoder hidden states, $H$, and returns the attention vector, $a_t$.

We then use this attention vector to create a weighted source vector, $w_t$, denoted by weighted, which is a weighted sum of the encoder hidden states, $H$, using $a_t$ as the weights.

$$w_t = a_t H$$

The embedded input word, $d(y_t)$, the weighted source vector, $w_t$, and the previous decoder hidden state, $s_{t-1}$, are then all passed into the decoder RNN, with $d(y_t)$ and $w_t$ being concatenated together.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

We then pass $d(y_t)$, $w_t$ and $s_t$ through the linear layer, $f$, to make a prediction of the next word in the target sentence, $\hat{y}_{t+1}$. This is done by concatenating them all together.

$$\hat{y}_{t+1} = f(d(y_t), w_t, s_t)$$

The image below shows decoding the first word in an example translation.



![image](https://user-images.githubusercontent.com/47301926/77835378-55aec300-718f-11ea-8339-d6c448828832.png)
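The weighted source vector $w_t = a_t H$ is a batched matrix product. As a small sanity check with toy sizes and random values (assumptions for illustration only), `torch.bmm` gives the same result as an explicit weighted sum over the source positions:

```python
import torch

torch.manual_seed(0)

batch_size, src_len, enc_hid_dim = 2, 4, 3  # toy sizes

a = torch.softmax(torch.randn(batch_size, src_len), dim=1)  # a_t, rows sum to 1
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)       # encoder states, batch first

# [batch, 1, src len] x [batch, src len, enc hid dim * 2] -> [batch, 1, enc hid dim * 2]
weighted = torch.bmm(a.unsqueeze(1), H)

# the same quantity written as an explicit weighted sum over source positions
manual = (a.unsqueeze(2) * H).sum(dim=1)                    # [batch, enc hid dim * 2]

same = torch.allclose(weighted.squeeze(1), manual, atol=1e-6)
print(weighted.shape, same)
```

This is why the decoder unsqueezes `a` to `[batch size, 1, src len]` and permutes the encoder outputs to batch-first before calling `torch.bmm`.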

In [21]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0) #[batch, emb_dim]
        output = output.squeeze(0) #[batch, dec_hid_dim]
        weighted = weighted.squeeze(0) #[batch, enc_hid_dim*2]
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden.squeeze(0)

### Seq2Seq

This is the first model where the encoder RNN and decoder RNN do not need to have the same hidden dimensions; however, the encoder has to be bidirectional. This requirement can be removed by changing all occurrences of `enc_dim * 2` to `enc_dim * 2 if encoder_is_bidirectional else enc_dim`.

This seq2seq encapsulator is similar to the last two. 
The only difference is that the encoder returns both the **final hidden state** (which is the final hidden state from both the forward and backward encoder RNNs passed through a linear layer) to be used as the initial hidden state for the decoder, as well as **every hidden state** (which are the forward and backward hidden states stacked on top of each other). We also need to ensure that hidden and encoder_outputs are passed to the decoder.

Briefly going over all of the steps:

- the `outputs` tensor is created to hold all predictions, $\hat{Y}$
- the source sequence, $X$, is fed into the encoder to receive $z$ and $H$
- the initial decoder hidden state is set to be the context vector, $s_0 = z = h_T$
- we use a batch of `<sos>` tokens as the first input, $y_1$
- we then decode within a loop:
  - inserting the input token $y_t$, previous hidden state, $s_{t-1}$, and all encoder outputs, $H$, into the decoder
  - receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$
  - we then decide if we are going to teacher force or not, setting the next input as appropriate
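The decoding loop described above can be sketched without a trained model. In this toy version `toy_decoder_step` is a hypothetical stand-in that returns random scores, and the target batch is random; the point is only to show how the next input is chosen and that row 0 of `outputs` is never written.

```python
import random
import torch

random.seed(0)
torch.manual_seed(0)

trg_len, batch_size, vocab_size = 5, 3, 7  # toy sizes
trg = torch.randint(0, vocab_size, (trg_len, batch_size))  # toy target batch

def toy_decoder_step(input_tokens):
    # hypothetical stand-in for the decoder: random scores per vocabulary entry
    return torch.randn(input_tokens.shape[0], vocab_size)

outputs = torch.zeros(trg_len, batch_size, vocab_size)
input_tokens = trg[0]  # the first row plays the role of the <sos> tokens
used_teacher = []

for t in range(1, trg_len):
    output = toy_decoder_step(input_tokens)
    outputs[t] = output

    # with probability 0.5 feed the ground-truth token, otherwise the prediction
    teacher_force = random.random() < 0.5
    used_teacher.append(teacher_force)
    input_tokens = trg[t] if teacher_force else output.argmax(1)

print(used_teacher)
print(outputs[0].abs().sum())  # row 0 stays all zeros
```

Because the loop starts at `t = 1`, `outputs[0]` remains zero, which is why the loss later has to skip the first position of both the predictions and the targets.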

In [22]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs

## Train

In [25]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)

ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

In [33]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [34]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 20,518,917 trainable parameters


In [35]:
optimizer = optim.Adam(model.parameters())

In [36]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

**Variables** of the torchtext `Vocab` object (e.g. `TRG.vocab`):

**`freqs`** – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.

**`stoi`** – A collections.defaultdict instance mapping token strings to numerical identifiers.

**`itos`** – A list of token strings indexed by their numerical identifiers.
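To see what `ignore_index` does in the criterion, the following sketch compares the loss on a toy batch containing padding with the loss computed only on the non-pad positions. `PAD_IDX = 1` and the tensors are assumed values for illustration; in the notebook the index comes from `TRG.vocab.stoi[TRG.pad_token]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 1      # assumed pad index for this example only
vocab_size = 6   # toy vocabulary size

logits = torch.randn(4, vocab_size)                  # [tokens, vocab]
targets = torch.tensor([2, 5, PAD_IDX, PAD_IDX])     # last two positions are padding

# loss that skips every target equal to PAD_IDX
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss_with_pad = criterion(logits, targets)

# the same loss computed only on the two real tokens
loss_non_pad = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(loss_with_pad.item(), loss_non_pad.item())
```

The two values agree because padded positions contribute neither to the sum nor to the averaging denominator, so short sentences in a batch are not penalized for their padding.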

In [37]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [38]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [39]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [40]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 1m 5s
	Train Loss: 5.016 | Train PPL: 150.758
	 Val. Loss: 4.849 |  Val. PPL: 127.667
Epoch: 02 | Time: 1m 6s
	Train Loss: 4.170 | Train PPL:  64.690
	 Val. Loss: 4.709 |  Val. PPL: 110.940
Epoch: 03 | Time: 1m 4s
	Train Loss: 3.533 | Train PPL:  34.222
	 Val. Loss: 3.815 |  Val. PPL:  45.375
Epoch: 04 | Time: 1m 5s
	Train Loss: 2.967 | Train PPL:  19.439
	 Val. Loss: 3.465 |  Val. PPL:  31.982
Epoch: 05 | Time: 1m 5s
	Train Loss: 2.562 | Train PPL:  12.963
	 Val. Loss: 3.298 |  Val. PPL:  27.063
Epoch: 06 | Time: 1m 5s
	Train Loss: 2.267 | Train PPL:   9.654
	 Val. Loss: 3.267 |  Val. PPL:  26.243
Epoch: 07 | Time: 1m 5s
	Train Loss: 2.022 | Train PPL:   7.552
	 Val. Loss: 3.143 |  Val. PPL:  23.163
Epoch: 08 | Time: 1m 5s
	Train Loss: 1.797 | Train PPL:   6.031
	 Val. Loss: 3.246 |  Val. PPL:  25.684
Epoch: 09 | Time: 1m 5s
	Train Loss: 1.625 | Train PPL:   5.080
	 Val. Loss: 3.253 |  Val. PPL:  25.865
Epoch: 10 | Time: 1m 5s
	Train Loss: 1.516 | Train PPL:   4.553


## Reference

https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb