TRANSLATION WITH A SEQUENCE TO SEQUENCE NETWORK AND ATTENTION  
https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

In [None]:
A sequence is a sentence; a step is a word.

Note that the classes defined here (EncoderRNN, DecoderRNN, AttnDecoderRNN) process one step at a time, i.e., a sequence of length 1;
the input is, e.g., of shape [1, 1].

def train receives an input_tensor and a target_tensor, each of shape [len_seq, 1] and each representing one sentence.
In def train, the network is unrolled in a for loop over an input sequence of length len_seq (< MAX_LENGTH),  
i.e., one call to def train is equivalent to running the RNN on an input of shape [len_seq, 1].

input for EncoderRNN example
tensor([[123]])

output of EncoderRNN
of shape [1, 1,100]
encoder_output[0, 0] is of shape [100]

output of AttnDecoder
of shape [1, output_lang.n_words]

input_tensor for def train example
tensor([[ 118],
        [ 214],
        [2479],
        [   5],
        [   1]])
        
def trainIters  
each iteration calls def train once
training_pairs: n_iters sentence pairs chosen at random (with replacement)

seq-to-seq  
encoder: seq to vector
decoder: vector to seq

character-level RNN  
a character -> one-hot vector, 'b' to [0, 1, ...] with len 26  
word-level RNN  
a word -> one-hot vector, 'the' to [0, 0, 1, 0, ... ,0] with length equal to the vocabulary size (n_words in this example)
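
A minimal plain-Python sketch of the word-level one-hot encoding described above. The toy vocabulary and the helper name `one_hot` are assumptions for illustration only; in the notebook the vector length would be the language's `n_words`.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical toy vocabulary; indices 0 and 1 are reserved for SOS/EOS, as in Lang.
word2index = {"the": 2, "cat": 3, "sat": 4}
n_words = 5

the_vec = one_hot(word2index["the"], n_words)
print(the_vec)  # [0, 0, 1, 0, 0]
```

With a real vocabulary of several thousand words these vectors are mostly zeros, which is one reason to pass the integer index to an embedding layer instead of building the vector explicitly.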


Attention  
If only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence.  
Attention allows the decoder network to “focus” on a different part of the encoder’s outputs for every step of the decoder’s own outputs.  

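A single context vector is not enough, so the encoder's per-step outputs are used in addition.  
For a single-layer, unidirectional GRU, the output at each step equals the hidden state at that step, so "encoder outputs" and "encoder hidden states" refer to the same values here.

In this example, the encoder outputs are collected in encoder_outputs, of shape (MAX_LENGTH=10, hidden_size)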

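The basic idea of attention is that, at every time step at which the decoder predicts an output word, it refers once more to the entire input sentence seen by the encoder. However, it does not refer to all parts of the input sentence in equal proportion; it attends more closely to the input words that are related to the word to be predicted at that time step.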

attention_weights returned by AttnDecoderRNN in this example indicate which input positions are attended to most at each decoding step

Since English and French largely share the same word order, the attention matrix looks roughly like an identity matrix (weights concentrated near the diagonal)
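
To make the "most attended position per step" reading concrete, here is a small plain-Python sketch. The weight values and the helper name `most_attended` are hypothetical; they are not taken from a trained model.

```python
def most_attended(attention_rows):
    """For each decoder step (row), return the input position with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention_rows]


# Hypothetical weights: 3 decoder steps over 3 input positions, each row sums to 1.
attention = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]

positions = most_attended(attention)
print(positions)  # [0, 1, 2]
```

When the two languages align word by word, the most attended position advances with the decoder step, which is what a near-diagonal attention matrix shows.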

The KEY difference from the simple DecoderRNN  
the initial hidden state for the GRU is the same,  
but the input is processed through the attention layers before being passed to the GRU

In [1]:
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



In [2]:
SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: 'SOS', 1: 'EOS'}
        self.n_words = 2
        
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)
            
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words  # the new word gets index n if n words already exist in Lang
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

In [3]:
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
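
A short usage sketch of the normalization step. The two functions are repeated (under snake_case names, which are an assumption of this sketch) so that the snippet runs on its own; the sample sentences are made up.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    # Decompose accented characters (NFD) and drop the combining marks (category 'Mn').
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # collapse every other non-letter run into one space
    return s


french = normalize_string("  Elle est habillée en noir!  ")
english = normalize_string("I'm fine.")
print(french)   # elle est habillee en noir !
print(english)  # i m fine .
```

The second result shows why the prefix filter later in the notebook contains entries such as "i m ": the apostrophe is replaced by a space during normalization.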

In [4]:
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")
    
    lines = open(f"data/{lang1}-{lang2}.txt", encoding='utf-8').read().strip().split('\n')
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)
        
    return input_lang, output_lang, pairs

In [5]:
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filterPair(p):  # keep pairs where both sentences have fewer than MAX_LENGTH words and the English side starts with one of eng_prefixes
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH and p[1].startswith(eng_prefixes)
    
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

In [6]:
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print(f"Read {len(pairs)} sentence pairs")
    pairs = filterPairs(pairs)
    print(f"Trimmed to {len(pairs)} sentence pairs")
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))

Reading lines...
Read 135842 sentence pairs
Trimmed to 10599 sentence pairs
Counting words...
Counted words:
fra 4345
eng 2803
['elle est toujours habillee en noir .', 'she is always dressed in black .']


In [7]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

In [None]:
tensorsFromPair(random.choice(pairs))[0][ei], a single word index, is the input to the encoder at step ei

In [8]:
_input = tensorsFromPair(random.choice(pairs))

In [18]:
_input[0]

tensor([[ 118],
        [ 214],
        [2479],
        [   5],
        [   1]])

In [15]:
_input[0][1]

tensor([214])

In [10]:
input_lang.n_words

4345

In [12]:
_output = nn.Embedding(input_lang.n_words, 100)(_input[0])

In [16]:
_output[0]

tensor([[ 2.1126e+00,  1.2956e+00,  8.5368e-02, -2.4100e-01,  5.9298e-01,
         -1.1132e-01,  9.4729e-02,  2.5043e-01,  9.5987e-02,  3.7630e-01,
          9.9021e-01, -3.2076e-01,  4.0774e-02, -7.8760e-01,  1.0080e+00,
         -6.3093e-04, -1.5754e-01,  3.8873e-01, -2.3648e-01, -5.6157e-01,
         -1.8980e+00, -2.6584e-01, -2.2812e+00, -7.5095e-01, -3.3917e-01,
         -1.0389e+00, -2.2137e+00, -1.9303e-01,  8.5242e-01, -3.9417e-01,
         -2.0396e+00, -6.8679e-01,  4.1503e-01,  3.1417e-01, -9.4584e-01,
         -1.9354e+00,  1.2471e-02, -8.1016e-01,  4.4150e-01, -2.5884e-01,
          2.5792e+00,  3.6317e-01,  6.7826e-01,  3.1396e-02, -5.4542e-01,
          1.8786e-01,  2.6292e-01,  5.5097e-01,  1.0643e+00,  8.6758e-01,
          5.5881e-01,  4.9783e-01,  1.3995e+00,  1.7342e-01,  5.1705e-01,
         -7.0786e-01, -8.4849e-01,  6.6602e-01, -9.4551e-01, -5.4360e-01,
          8.8164e-02, -3.9118e-01, -3.3974e-01, -7.7146e-01,  1.6687e-01,
         -1.3690e+00,  6.2364e-02, -8.

In [37]:
_output[0].shape

torch.Size([1, 100])

In this example,  
nn.Embedding receives an integer index in [0, input_lang.n_words), e.g., tensor([214]),  
and returns an embedding vector of shape (1, 100).
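
Conceptually, an embedding layer is a lookup table with one row per index. The plain-Python sketch below illustrates that idea and shows that the lookup gives the same result as multiplying a one-hot vector by the table. All names and the tiny sizes are assumptions; the random values merely stand in for learned weights.

```python
import random


def make_embedding_table(num_embeddings, embedding_dim, seed=0):
    """Build a (num_embeddings, embedding_dim) table of random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(num_embeddings)]


def embed(table, index):
    """Embedding lookup: simply select the row for this index."""
    return table[index]


def one_hot_matmul(table, index):
    """Same result, computed as one_hot(index) @ table."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][j] for i in range(len(table)))
            for j in range(dim)]


table = make_embedding_table(num_embeddings=6, embedding_dim=4)
vec = embed(table, 3)
print(len(vec))                         # 4
print(vec == one_hot_matmul(table, 3))  # True
```

The lookup avoids ever building the one-hot vector, which matters when the vocabulary has thousands of entries.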

In [8]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        # input_size: input_lang.n_words
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        
    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        # from (1, hidden_size) to (1, 1, hidden_size)
        # len_seq is 1 since the input is a word
        output, hidden = self.gru(embedded, hidden)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [9]:
# the first input will be <SOS>, with the context vector as the initial hidden state

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        # output_size: output_lang.n_words
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(output_size, hidden_size)  # transform a word of the output language
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
        
    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))  # output: (1, 1, hidden_size), output[0]: (1, hidden_size)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [None]:
torch.bmm: batch-matrix-multiplication, bmm with (10, 3, 4) and (10, 4, 5) -> (10, 3, 5)
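
The shape rule above can be checked with a dependency-free sketch of a batched matrix multiply over nested lists. This is not torch.bmm itself; the helper names `bmm` and `shape` are defined here purely for illustration.

```python
def bmm(a, b):
    """Batched matrix multiply on nested lists: (B, n, m) x (B, m, p) -> (B, n, p)."""
    if len(a) != len(b):
        raise ValueError("batch sizes differ")
    out = []
    for mat_a, mat_b in zip(a, b):
        if len(mat_a[0]) != len(mat_b):
            raise ValueError("inner dimensions differ")
        cols_b = list(zip(*mat_b))  # columns of mat_b
        out.append([[sum(x * y for x, y in zip(row, col)) for col in cols_b]
                    for row in mat_a])
    return out


def shape(t):
    """Shape of a 3-level nested list."""
    return (len(t), len(t[0]), len(t[0][0]))


a = [[[1.0] * 4 for _ in range(3)] for _ in range(10)]  # (10, 3, 4), all ones
b = [[[2.0] * 5 for _ in range(4)] for _ in range(10)]  # (10, 4, 5), all twos
c = bmm(a, b)
print(shape(c))    # (10, 3, 5)
print(c[0][0][0])  # 8.0
```

Each of the 10 batch entries is multiplied independently; only the inner dimension (4 here) has to match.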

attn_weights (1, MAX_LENGTH)
encoder_outputs (MAX_LENGTH, hidden_size)
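Why MAX_LENGTH: self.attn is a Linear layer with a fixed output size, so encoder_outputs is allocated with MAX_LENGTH rows; rows beyond the actual input length stay zero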

bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0)) -> attn_applied
(1, 1, MAX_LENGTH) * (1, MAX_LENGTH, hidden_size) -> (1, 1, hidden_size)

encoder_outputs is not used directly;
it is multiplied by the attention weights in bmm and reduced to attn_applied before use

the encoder output for one word is a (1, hidden_size) vector
encoder_outputs (MAX_LENGTH, hidden_size) stacks these (1, hidden_size) vectors for up to MAX_LENGTH words
the MAX_LENGTH encoder outputs, each multiplied by its weight and summed, condense to one (1, hidden_size) vector, like a single-word encoder output

each weight element <-> each input word
[0, 0, 0.5, 0.5, 0] means 3rd 4th words are important at this decoding step
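
A small worked example of this weighted sum in plain Python, using the weights quoted above. The encoder output values and the helper name `apply_attention` are made up for illustration, with hidden_size = 2 to keep the arithmetic visible.

```python
def apply_attention(weights, encoder_outputs):
    """Weighted sum of encoder output rows: (L,) and (L, H) -> (H,)."""
    hidden_size = len(encoder_outputs[0])
    return [sum(w * row[j] for w, row in zip(weights, encoder_outputs))
            for j in range(hidden_size)]


# Hypothetical encoder outputs for 5 input words.
encoder_outputs = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 4.0],
    [6.0, 8.0],
    [9.0, 9.0],
]
weights = [0, 0, 0.5, 0.5, 0]

context = apply_attention(weights, encoder_outputs)
print(context)  # [4.0, 6.0]
```

The result is the average of the 3rd and 4th rows; the other words contribute nothing at this decoding step.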

the weights are computed from the embedded decoder input and the previous hidden state

attn_combine
combines the information in the input and the transformed encoder_outputs 

In [10]:
# tensor[0] just squeezes (1, 1, features) -> (1, features) before passing to Linear layers
# unsqueeze(0) restores the leading dimension before passing to the GRU

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length
        
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size *2, hidden_size)
        self.dropout = nn.Dropout(dropout_p)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        
    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)
        
        _attn_weights = self.attn(torch.cat([embedded[0], hidden[0]], 1))  # two (1, 1, hidden_size) tensors are squeezed, concatenated, then passed to the Linear layer
        attn_weights = F.softmax(_attn_weights, dim=1)  # softmax along axis 1
        attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))
        
        output = self.attn_combine(torch.cat([embedded[0], attn_applied[0]], 1)).unsqueeze(0)
        # attn_combine outputs (1, hidden_size), which is reshaped to (1, 1, hidden_size) for the GRU layer
        
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
        
# F.softmax yields probabilities (used for the attention weights); F.log_softmax yields log-probabilities, which nn.NLLLoss expects
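
On softmax versus log-softmax: the sketch below, in plain Python with hypothetical scores, shows that log-softmax is the logarithm of softmax and that a negative log-likelihood loss simply picks out the negated log-probability of the target class. The function names are defined here and are not the torch functions.

```python
import math


def softmax(xs):
    """Probabilities that sum to 1 (max subtracted for numerical stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    """Log-probabilities, computed without forming the probabilities first."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]


def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for 3 classes
probs = softmax(scores)
log_probs = log_softmax(scores)

print(round(sum(probs), 6))  # 1.0
print(all(math.isclose(math.log(p), lp) for p, lp in zip(probs, log_probs)))  # True
print(nll(log_probs, 0) < nll(log_probs, 2))  # True
```

This is why the decoder applies softmax to the attention scores (they are used as weights) but log-softmax to the word scores (they are fed to the loss).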

In [43]:
_output = AttnDecoderRNN(100, output_lang.n_words)(torch.tensor([[123]]), torch.zeros(1, 1, 100), torch.zeros(10, 100))

In [44]:
_output[0].shape

torch.Size([1, 2803])

In [48]:
topv, topi = _output[0].topk(1)

In [51]:
topi.squeeze().shape

torch.Size([])

In [45]:
_output[1].shape

torch.Size([1, 1, 100])

In [46]:
_output[2].shape

torch.Size([1, 10])

In [66]:
_output[2]

tensor([[0.0780, 0.1313, 0.0925, 0.0964, 0.1230, 0.0756, 0.0577, 0.1398, 0.1404,
         0.0654]], grad_fn=<SoftmaxBackward>)

In [65]:
_output[2].data

tensor([[0.0780, 0.1313, 0.0925, 0.0964, 0.1230, 0.0756, 0.0577, 0.1398, 0.1404,
         0.0654]])

In [67]:
_output[2].detach()

tensor([[0.0780, 0.1313, 0.0925, 0.0964, 0.1230, 0.0756, 0.0577, 0.1398, 0.1404,
         0.0654]])

In [68]:
outputs = torch.zeros(10, 10)

In [69]:
outputs[0] = _output[2]

Teacher forcing  
use the target as the next input, instead of the current output  
decoder_input = target_tensor[di]  
vs  
topv, topi = decoder_output.topk(1)  
decoder_input = topi.squeeze().detach()  

decoder_input will be the input for the next time step

Note that the first decoder input is always the SOS token
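
The two input policies can be compared with a plain-Python sketch. The token ids are hypothetical and the predictions are fixed for illustration; in a real run without teacher forcing, each prediction would depend on the inputs fed before it.

```python
SOS_TOKEN = 0
EOS_TOKEN = 1


def decoder_inputs(targets, predictions, use_teacher_forcing):
    """Return the token fed to the decoder at each step.

    targets[i] is the ground-truth token at step i; predictions[i] is the
    decoder's own argmax at step i (fixed, made-up values in this sketch).
    """
    inputs = [SOS_TOKEN]  # the first input is always SOS
    source = targets if use_teacher_forcing else predictions
    inputs.extend(source[:-1])  # step i + 1 is fed the token from step i
    return inputs


targets = [12, 7, 30, EOS_TOKEN]
predictions = [12, 9, 30, EOS_TOKEN]  # the model got step 1 wrong

forced = decoder_inputs(targets, predictions, use_teacher_forcing=True)
free = decoder_inputs(targets, predictions, use_teacher_forcing=False)
print(forced)  # [0, 12, 7, 30]
print(free)    # [0, 12, 9, 30]
```

With teacher forcing the wrong prediction (9) never reaches the decoder; without it, the error is fed back in at the next step.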

In [11]:
teacher_forcing_ratio = 0.5

def train(input_tensor, target_tensor, encoder, decoder, 
          encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()
    
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)
    
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
    
    loss = 0
    
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]
    
    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden
    
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False
    
    if use_teacher_forcing:
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # the target at this step becomes the input for the next step
    
    else:
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            topv, topi = decoder_output.topk(1)  # topi is the argmax index among output.n_words
            decoder_input = topi.squeeze().detach()  # size [1, 1] to []; detached from the graph
            
            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break
                
    loss.backward()  # backpropagate through the whole unrolled network at once
    
    encoder_optimizer.step()
    decoder_optimizer.step()
    
    return loss.item() / target_length

In [12]:
import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))  # elapsed, remaining
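
A quick check of the timing arithmetic, as a self-contained sketch. `as_minutes` mirrors asMinutes above; `remaining` is a hypothetical helper that isolates the estimate computed inside timeSince, and the numbers are made up.

```python
import math


def as_minutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def remaining(elapsed, fraction_done):
    """Estimated seconds left, assuming a constant rate of progress."""
    estimated_total = elapsed / fraction_done
    return estimated_total - elapsed


short = as_minutes(125)
left = as_minutes(remaining(120, 0.25))
print(short)  # 2m 5s
print(left)   # 6m 0s
```

If 120 seconds cover 25% of the iterations, the projected total is 480 seconds, leaving 360 seconds.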

In [18]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0
    plot_loss_total = 0
    
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs)) for _ in range(n_iters)]
    criterion = nn.NLLLoss()
    
    for iter in range(1, n_iters+1):
        training_pair = training_pairs[iter-1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]
        
        loss = train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        
        print_loss_total += loss
        plot_loss_total += loss
        
        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))
            print_loss_total = 0
        
        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
            
    showPlot(plot_losses)

In [14]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)  # every 0.2
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)

In [15]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
        
        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]
        
        decoder_input = torch.tensor([[SOS_token]], device=device)  # keep the input on the same device as the model
        decoder_hidden = encoder_hidden
        
        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)
        
        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])
            
            decoder_input = topi.squeeze().detach()
        
        return decoded_words, decoder_attentions[:di+1]

In [16]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

In [19]:
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

trainIters(encoder1, attn_decoder1, 75000, print_every=5000)

6m 42s (- 93m 50s) (5000 6%) 2.8340
13m 38s (- 88m 41s) (10000 13%) 2.2710
20m 20s (- 81m 20s) (15000 20%) 1.9862


KeyboardInterrupt: 

In [None]:
evaluateRandomly(encoder1, attn_decoder1)

In [None]:
def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()

In [None]:
def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("elle a cinq ans de moins que moi .")

evaluateAndShowAttention("elle est trop petit .")

evaluateAndShowAttention("je ne crains pas de mourir .")

evaluateAndShowAttention("c est un jeune directeur plein de talent .")

In [19]:
decoder_input = torch.tensor([[SOS_token]], device=device)

In [21]:
decoder_input.shape

torch.Size([1, 1])

In [6]:
re.sub(r"([.!?])", r" \1", ".!?")  # \1 refers to the matched group (the punctuation mark)

' . ! ?'

In [8]:
list(reversed([1,2,3,4,5]))

[5, 4, 3, 2, 1]

In [31]:
_output = EncoderRNN(input_lang.n_words, 100)(torch.tensor([[123]]), torch.zeros(1, 1,100))

In [36]:
_output[0].shape

torch.Size([1, 1, 100])