# Seq2Seq architecture

## What is Seq2Seq?

Seq2Seq (sequence-to-sequence) is a neural network architecture for problems that map one sequence to another, such as machine translation, text summarization, and speech recognition. In its classic form it consists of two recurrent neural networks (RNNs) connected to each other: the first, the encoder, reads the input sequence, and the second, the decoder, generates the output sequence.

![Encoder-decode-structure](https://miro.medium.com/max/828/1*YAPAHVYhsaEARYz45bO0Gg.jpeg)

### Encoder

The encoder compresses the information in the input sentence into a dense, fixed-size vector. This encoded information is stored in the final hidden state of the encoder RNN.

![](https://miro.medium.com/max/640/1*pQwlJ5c2XOLGg0_-KUJ3MQ.jpeg)
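
As a minimal sketch of this idea, the snippet below runs a tiny, untrained LSTM encoder over two sentences of different lengths and shows that both end up as a hidden state of the same size. The sizes and token ids are made up for illustration and are not the ones used in the implementation later on.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration
vocab_size, embedding_size, hidden_size = 20, 8, 16

embedding = nn.Embedding(vocab_size, embedding_size)
rnn = nn.LSTM(embedding_size, hidden_size)  # expects (seq_len, batch, features)

def encode(token_ids):
    """Run the encoder over one sentence and return its final hidden state."""
    x = torch.tensor(token_ids).unsqueeze(1)   # (seq_len, batch=1)
    _, (hidden, _) = rnn(embedding(x))         # per-step outputs are not needed
    return hidden                              # (num_layers=1, batch=1, hidden_size)

short = encode([3, 7, 2])
long = encode([3, 7, 2, 9, 11, 4, 5, 1])
print(short.shape, long.shape)  # same shape regardless of input length
```

Whatever the length of the input, the decoder only ever sees this fixed-size summary.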


### Decoder

The decoder decodes (translates) the output of the encoder: it takes the encoder's hidden state vector as its starting point and generates the output sequence one token at a time.

![](https://miro.medium.com/max/828/1*rkgcxYFzLZjz7o6hz3FsQw.jpeg)
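
The step-by-step generation can be sketched as a greedy decoding loop. This is an illustrative, untrained toy: the sizes, the `SOS`/`EOS` ids, and the `greedy_decode` helper are assumptions for the example, and the random initial states stand in for what a trained encoder would provide.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes and token ids, for illustration only
vocab_size, embedding_size, hidden_size = 12, 8, 16
SOS, EOS = 0, 1

embedding = nn.Embedding(vocab_size, embedding_size)
rnn = nn.LSTM(embedding_size, hidden_size)
fc = nn.Linear(hidden_size, vocab_size)  # hidden state -> scores over the vocabulary

def greedy_decode(hidden, cell, max_len=10):
    """Generate token ids one step at a time, starting from <sos>."""
    token = torch.tensor([SOS])                        # (batch=1,)
    generated = []
    with torch.no_grad():
        for _ in range(max_len):
            emb = embedding(token).unsqueeze(0)        # (1, 1, embedding_size)
            out, (hidden, cell) = rnn(emb, (hidden, cell))
            token = fc(out.squeeze(0)).argmax(dim=1)   # most likely next token
            if token.item() == EOS:
                break
            generated.append(token.item())
    return generated

# Stand-in for the encoder's final states: random here, learned in practice
h0 = torch.randn(1, 1, hidden_size)
c0 = torch.randn(1, 1, hidden_size)
tokens = greedy_decode(h0, c0)
print(tokens)
```

Each predicted token is fed back in as the next input, and generation stops at `<eos>` or after `max_len` steps.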

### Algorithm

![](https://miro.medium.com/max/828/1*bjSD5iFeP5vbSzQ0MuAf5w.jpeg)


Assume we have a source sentence $S^{(i)}$ and its target translation $T^{(i)}$. We want to maximize

$$\sum_i \log P(T^{(i)}|S^{(i)})$$

- Compute the hidden state $h_t$ for each word (time step) in the input sentence $S^{(i)}$ using the encoder RNN.
- Use the final hidden state as the initial hidden state of the decoder RNN.
- Compute the predicted distribution $\hat{y}_t$ for every time step in the target sequence.
- Multiply the probabilities assigned to the actual words of the target sequence to obtain $P(T^{(i)} \mid S^{(i)})$.
- Finally, the translated sentence is $$\arg\max_T P(T \mid S)$$ where $T$ ranges over candidate target sequences.
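
To make the objective concrete, here is a small worked example with made-up numbers: the probabilities a decoder might assign to the correct target word at three time steps. It only illustrates that multiplying the per-step probabilities is the same as summing their logs.

```python
import math

# Hypothetical probabilities assigned to the correct target word
# at each of three time steps (made-up numbers for illustration)
step_probs = [0.5, 0.25, 0.8]

# P(T | S) is the product of the per-step probabilities ...
sequence_prob = math.prod(step_probs)

# ... and log P(T | S) is the sum of their logs, which is what training maximizes
sequence_log_prob = sum(math.log(p) for p in step_probs)

print(sequence_prob)                # about 0.1
print(math.exp(sequence_log_prob))  # about 0.1 as well
```

Working with log-probabilities avoids numerical underflow when many small probabilities are multiplied over a long sequence.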

## Why do we need Seq2Seq?

## Applications

### Machine Translation

Machine translation is the task of translating a sentence from one language to another, for example from English to French. It is difficult because many words have multiple meanings: the word “bank” can mean a financial institution or a river bank, and the right translation depends on the context. Seq2Seq models are used to address this problem.

The encoder encodes the input sentence and passes the encoded information to the decoder. The decoder then generates the output sentence.

### Image captioning

In this problem, we encode the image into a vector and pass it to the decoder. The decoder generates the caption for the image.

![img-captioning](https://miro.medium.com/max/1400/1*6BFOIdSHlk24Z3DFEakvnQ.png)

## Implementation of Seq2Seq in Machine Translation

![Seq2Seq](https://miro.medium.com/max/828/1*ravhj1M9KFg0u77aDqM_MQ.png)

In [1]:
!pip install gdown -q



In [2]:
import os
import gdown

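# Download the English-Vietnamese training files when running on Google Colab
if os.path.exists('/content/'):
    # save into the folder that prepareData() reads from below
    os.makedirs('./train-en-vi', exist_ok=True)
    # fuzzy=True lets gdown extract the file id from a ".../file/d/<id>" share link
    gdown.download("https://drive.google.com/file/d/1ty8k-omlU3zvSUemx2gvBaEQWAWZAQ1C", output='./train-en-vi/train.en', fuzzy=True)
    gdown.download("https://drive.google.com/file/d/1mzDv83hvTlsLNSg7XNIIFLub36YVOf6u", output='./train-en-vi/train.vi', fuzzy=True)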

In [3]:
from __future__ import unicode_literals, print_function, division
import re
import string
import random
import unicodedata
from io import open
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [4]:
SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {}
        self.special_tokens = ["<sos>", "<eos>", "<unk>", "<pad>"]
        for token in self.special_tokens:
            self.word2index[token] = len(self.word2index)
            self.index2word[len(self.index2word)] = token
        self.n_words = len(self.index2word)  # Count SOS, EOS, UNK, and special tokens

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
            
    def __len__(self):
        return self.n_words
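
The core idea of this vocabulary class can be sketched independently of the class itself. The snippet below is a simplified stand-in (the `add_sentence` and `encode` helpers are hypothetical names, not part of the notebook) that shows how words receive indices after the four special tokens and how an unseen word falls back to `<unk>`.

```python
# A minimal word-to-index vocabulary with the same four special tokens
special_tokens = ["<sos>", "<eos>", "<unk>", "<pad>"]
word2index = {token: i for i, token in enumerate(special_tokens)}

def add_sentence(sentence):
    """Assign the next free index to every unseen word."""
    for word in sentence.split(" "):
        if word not in word2index:
            word2index[word] = len(word2index)

def encode(sentence):
    """Map words to ids, falling back to <unk> for out-of-vocabulary words."""
    unk = word2index["<unk>"]
    return [word2index.get(word, unk) for word in sentence.split(" ")]

add_sentence("I am a student .")
print(encode("I am a teacher ."))  # "teacher" was never added, so it maps to <unk>
```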

In [5]:
def normalizeString(s):
    # s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

def readLangs(filename, lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read both files and split them into lines
    from_lang = None
    with open(f'{filename}.{lang1}', 'r', encoding='utf8') as f:
        from_lang = f.read().strip().split('\n')
        
    to_lang = None
    with open(f'{filename}.{lang2}', 'r', encoding='utf8') as f:
        to_lang = f.read().strip().split('\n')
        
    # Pair up corresponding lines (normalization is currently disabled)
    pairs = list(zip(from_lang, to_lang))
    # pairs = [[normalizeString(s) for s in l] for l in pairs]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

In [6]:
MAX_LENGTH = 512

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

def prepareData(filename, lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(filename, lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('./train-en-vi/train', 'en', 'vi')

Reading lines...
Read 133317 sentence pairs
Trimmed to 133311 sentence pairs
Counting words...
Counted words:
en 54152
vi 25610


In [7]:
print(f'max length of input: {max([len(pair[0].split(" ")) for pair in pairs])}')
print(f'max length of output: {max([len(pair[1].split(" ")) for pair in pairs])}')

max length of input: 485
max length of output: 481


In [8]:
train_pairs, test_pairs = train_test_split(pairs, test_size=0.2, random_state=42)

class MTDataset(torch.utils.data.Dataset):
    def __init__(self, pairs, lang1, lang2, max_length=MAX_LENGTH):
        self.pairs = pairs
        self.lang1 = lang1
        self.lang2 = lang2
        # every sequence is padded or truncated to this length
        self.max_length = max_length
        
    def __len__(self):
        return len(self.pairs)
    
    def _encode(self, sentence, lang):
        # wrap the sentence in <sos>/<eos> and map words to ids (unknown words -> <unk>)
        tokens = ["<sos>"] + sentence.split(" ") + ["<eos>"]
        ids = [lang.word2index.get(word, lang.word2index["<unk>"]) for word in tokens]
        if len(ids) < self.max_length:
            ids += [lang.word2index["<pad>"]] * (self.max_length - len(ids))
        elif len(ids) > self.max_length:
            ids = ids[:self.max_length]
        return torch.tensor(ids, dtype=torch.long)
    
    def __getitem__(self, idx):
        source, target = self.pairs[idx]
        return self._encode(source, self.lang1), self._encode(target, self.lang2)
    
train_dataset = MTDataset(train_pairs, input_lang, output_lang)
test_dataset = MTDataset(test_pairs, input_lang, output_lang)
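
Padding every example to `MAX_LENGTH` is simple but can waste memory and compute when a batch contains only short sentences. One common alternative, sketched here as an assumption rather than as part of this notebook, is to pad each batch only to its own longest sequence with a custom `collate_fn`. The `collate_batch` name is hypothetical, and `PAD_IDX = 3` follows the special-token order defined in `Lang`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 3  # index of "<pad>" given the special-token order in Lang

def collate_batch(batch):
    """Pad a list of (source_ids, target_ids) pairs to the batch's own max length."""
    sources, targets = zip(*batch)
    # pad_sequence returns (max_len, batch_size) when batch_first=False (the default)
    src = pad_sequence(list(sources), padding_value=PAD_IDX)
    trg = pad_sequence(list(targets), padding_value=PAD_IDX)
    return src, trg

# Two toy examples of different lengths (ids are arbitrary)
batch = [
    (torch.tensor([0, 5, 6, 1]), torch.tensor([0, 9, 1])),
    (torch.tensor([0, 7, 1]), torch.tensor([0, 8, 4, 2, 1])),
]
src, trg = collate_batch(batch)
print(src.shape, trg.shape)
```

Such a function would be passed as `collate_fn=collate_batch` to a `DataLoader` whose dataset returns unpadded tensors. The result is already in the `(seq_len, batch_size)` layout that `nn.LSTM` expects by default.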

In [9]:
tmp = 12
for word in train_dataset[tmp][0]:
    print(input_lang.index2word[word.item()], end=' ')
print()

for word in train_dataset[tmp][1]:
    print(output_lang.index2word[word.item()], end=' ')

<sos> In fact , nothing even actually comes close to our ability to restore hearing . <eos> <pad> <pad> <pad> ... (output truncated; the remaining tokens shown are all <pad>)

In [10]:
class Encoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, dropout_p):
        """
            :param input_size: size of input vocabulary
            :param embedding_size: size of embedding layer
            :param hidden_size: size of hidden layer
            :param num_layers: number of layers
            :param dropout_p: dropout probability
        """
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.dropout = nn.Dropout(dropout_p)
        self.embedding = nn.Embedding(input_size, embedding_dim=embedding_size)
        self.rnn = nn.LSTM(
            embedding_size, 
            hidden_size, 
            num_layers,
            dropout=dropout_p
        )

    def forward(self, x):
        """
            :param x: input of shape (seq_length, batch_size)
            :return: hidden of shape (num_layers, batch_size, hidden_size)
            :return: cell of shape (num_layers, batch_size, hidden_size)
        """
        # embedding shape: (seq_length, batch_size, embedding_size)
        embedding = self.dropout(self.embedding(x))
        # the per-step outputs are discarded; only the final states go to the decoder.
        # nn.LSTM starts from zero states by default, so no explicit initial hidden state is needed
        _, (hidden, cell) = self.rnn(embedding)
        
        return hidden, cell

In [11]:
class Decoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size, num_layers, dropout_p):
        """
            :param input_size: size of target vocabulary
            :param embedding_size: size of embedding layer
            :param hidden_size: size of hidden layer (same as encoder)
            :param output_size: size of output layer (same as target vocabulary)
            :param num_layers: number of layers
            :param dropout_p: dropout probability
        """
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.output_size = output_size
        
        self.dropout = nn.Dropout(dropout_p)
        self.embedding = nn.Embedding(input_size, embedding_dim=embedding_size)
        
        self.rnn = nn.LSTM(
            embedding_size,
            hidden_size,
            num_layers,
            dropout=dropout_p
        )
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden, cell):
        """
            :param x: input of shape (batch_size)
            :param hidden: hidden state of shape (num_layers, batch_size, hidden_size)
            :param cell: cell state of shape (num_layers, batch_size, hidden_size)
        """
        
        # shape of x: (batch_size) => (1, batch_size) because we are processing one word at a time
        x = x.unsqueeze(0)
        
        embedding = self.dropout(self.embedding(x))
        # embedding shape: (1, batch_size, embedding_size)
        
        outputs, (hidden, cell) = self.rnn(embedding, (hidden, cell))
        # outputs shape: (1, batch_size, hidden_size)
        
        predictions = self.fc(outputs)
        # shape of predictions: (1, batch_size, output_size)
        
        predictions = predictions.squeeze(0)
        # shape of predictions: (batch_size, output_size)
        
        return predictions, hidden, cell

In [12]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_forcing_ratio=0.5):
        """
            :param source: source sentence of shape (src_len, batch_size)
            :param target: target sentence of shape (trg_len, batch_size)
            :param teacher_forcing_ratio: probability of using teacher forcing
        """
        batch_size = source.shape[1] # shape: (src_len, batch_size)
        target_length = target.shape[0]
        target_vocab_size = self.decoder.output_size
        
        # tensor to store decoder outputs one word at a time
        outputs = torch.zeros(target_length, batch_size, target_vocab_size).to(device)
        
        # last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(source)
        
        # Get start token <sos> for each batch
        x = target[0]
        
        for t in range(1, target_length):
            output, hidden, cell = self.decoder(x, hidden, cell)
            
            # output in shape (batch_size, target_vocab_size)
            outputs[t] = output 

            # get index of highest probability word in target vocabulary
            best_guess = output.argmax(1)
            
            # use teacher forcing
            x = target[t] if random.random() < teacher_forcing_ratio else best_guess
        
        return outputs
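
The teacher forcing decision in the loop above can be isolated in a few lines. This is an illustrative sketch with placeholder tokens: the `choose_next_input` helper is a hypothetical name, and the "predictions" are made up rather than produced by a model.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng):
    """Return the ground-truth token with probability teacher_forcing_ratio,
    otherwise the model's own prediction."""
    if rng.random() < teacher_forcing_ratio:
        return target_token
    return predicted_token

rng = random.Random(0)  # seeded so the run is repeatable
targets = ["y1", "y2", "y3", "y4"]        # ground-truth target tokens
predictions = ["y1", "g2", "y3", "g4"]    # made-up model guesses

for ratio in (0.0, 0.5, 1.0):
    inputs = [choose_next_input(t, p, ratio, rng) for t, p in zip(targets, predictions)]
    print(ratio, inputs)
```

With a ratio of 1.0 the decoder always sees the ground truth, which tends to stabilize early training. With 0.0 it always sees its own guesses, which matches what happens at inference time. A value such as 0.5 mixes the two.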

In [14]:
# translate_sentence (used below) is assumed to come from this local helper module, which is not shown here
from utils import *

num_epochs = 20
learning_rate = 0.001
batch_size = 8

load_model = False
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
input_size_encoder = len(input_lang)
input_size_decoder = len(output_lang)
output_size = len(output_lang)
encoder_embedding_size = 256
decoder_embedding_size = 256
hidden_size = 512
num_layers = 2
enc_dropout = 0.5
dec_dropout = 0.5

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
# evaluation data does not need to be shuffled
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

writer = SummaryWriter(f'runs/loss_plot')
step = 0

encoder = Encoder(
    input_size_encoder, 
    encoder_embedding_size, 
    hidden_size, 
    num_layers, 
    enc_dropout).to(device)

decoder = Decoder(
    input_size_decoder,
    decoder_embedding_size,
    hidden_size,
    output_size,
    num_layers,
    dec_dropout).to(device)

model = Seq2Seq(encoder, decoder).to(device)

pad_idx = output_lang.word2index['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# if load_model:
#     load_checkpoint(torch.load())
example = "I am a student so I go to school every day ."

for epoch in range(num_epochs):
    print(f'Epoch [{epoch}/{num_epochs}]')
    
    checkpoint = {'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict()}
    
    model.eval()
    translated_sentence = translate_sentence(model, example, input_lang, output_lang, device, max_length=50)
    print(f'Translated example sentence: \n {translated_sentence}')

    model.train()
    bar = tqdm(train_loader)
    for batch_idx, batch in enumerate(bar):
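        # the DataLoader yields (batch_size, seq_len); the model expects (seq_len, batch_size)
        inp_data = batch[0].transpose(0, 1).to(device)
        target = batch[1].transpose(0, 1).to(device)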
        
        output = model(inp_data, target)
        # output shape: (target_length, batch_size, target_vocab_size)
        
        # reshape output and target to calculate loss
        # ignoring the <sos> token
        output = output[1:].reshape(-1, output.shape[2])
        target = target[1:].reshape(-1)
        
        optimizer.zero_grad()
        loss = criterion(output, target)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
        optimizer.step()
        
        writer.add_scalar('Training loss', loss.item(), global_step=step)
        bar.set_description(f'Epoch [{epoch}/{num_epochs}]')
        bar.set_postfix(loss=loss.item())
        # iterating over the tqdm bar already advances it, so no manual bar.update() is needed
        step += 1

Epoch [0/20]
Translated example sentence: 
 ['Randers', '16', '16', 'Hopskin', 'thàm', 'Suarez', 'triết', 'nhiều', 'Arusha', 'khan', 'Perceptive', 'Perceptive', 'O`Toole', 'Motetema', 'Kết', 'Garmin', 'Garmin', 'ngào', 'leatherjacket', 'leatherjacket', 'leatherjacket', 'click', 'click', 'Miebach', 'Vardi', 'đoái', 'đoái', 'Leikei', 'rhinovirus', 'click', 'Bambir', 'nhiếc', 'nhiếc', 'tuệ', 'Cleveland', 'Cleveland', 'Đề', 'giới.Tại', 'Todagin', 'Thà', 'click', 'click', 'Vardi', 'Riley', 'mắng', 'mắng', 'biriani', 'biriani', 'ngao', 'Cornwall']


  0%|          | 0/13331 [00:00<?, ?it/s]

KeyboardInterrupt: 
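
The reshaping and the `ignore_index` argument in the training loop can be checked on a tiny example. The tensors below are random stand-ins with the same layout as the model output and target; the sizes and ids are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 3                      # index of "<pad>" in the vocabulary
trg_len, batch_size, vocab_size = 5, 2, 10

# Stand-ins for model outputs and targets, shaped like those in the training loop
output = torch.randn(trg_len, batch_size, vocab_size)   # unnormalized scores
target = torch.tensor([[0, 0],
                       [4, 5],
                       [6, 1],
                       [1, PAD_IDX],
                       [PAD_IDX, PAD_IDX]])              # (trg_len, batch_size)

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Drop the <sos> position, then flatten time and batch into one dimension
flat_output = output[1:].reshape(-1, vocab_size)   # ((trg_len - 1) * batch_size, vocab_size)
flat_target = target[1:].reshape(-1)               # ((trg_len - 1) * batch_size,)
loss = criterion(flat_output, flat_target)

# The same loss computed only over the non-pad positions
keep = flat_target != PAD_IDX
loss_no_pad = nn.CrossEntropyLoss()(flat_output[keep], flat_target[keep])
print(loss.item(), loss_no_pad.item())  # the two values agree
```

Because padded positions are excluded from the average, the amount of padding in a batch does not dilute the loss.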

In [None]:
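def indexesFromSentence(lang, sentence):
    # fall back to <unk> so that unseen words do not raise a KeyError
    unk = lang.word2index["<unk>"]
    return [lang.word2index.get(word, unk) for word in sentence.split(' ')]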

def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

In [None]:
import time
import math
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    # plt.subplots() already creates a figure, so a separate plt.figure() call is not needed
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

In [None]:
teacher_forcing_ratio = 0.5

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing
    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length

In [None]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

In [None]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            # same two-argument decoder call as in train(); no attention is used here
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]
    
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

In [None]:
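hidden_size = 256
# EncoderRNN and DecoderRNN are not defined in this notebook; these cells follow the
# PyTorch seq2seq translation tutorial listed in the references, where both classes are defined
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
decoder1 = DecoderRNN(hidden_size, output_lang.n_words).to(device)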

trainIters(encoder1, decoder1, 75000, print_every=5000)

In [None]:
evaluateRandomly(encoder1, decoder1)

## References

- https://medium0.com/@saikrishna4820/lstm-language-translation-18c076860b23
- https://towardsdatascience.com/what-is-an-encoder-decoder-model-86b3d57c5e1a
- https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html