#  Sequence to Sequence
In this notebook we will be teaching a neural network to translate from French to English.

This is made possible by the simple but powerful idea of the [sequence
to sequence network](https://arxiv.org/abs/1409.3215>), in which two
recurrent neural networks work together to transform one sequence to
another. An **encoder** network condenses an input sequence into a vector,
and a **decoder** network unfolds that vector into a new sequence.

![](imgs/seq2seq.png)

In [1]:
from __future__ import unicode_literals
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
%matplotlib inline

## Pre-processing data
The data for this project is a set of many thousands of English to
French translation pairs.

In [2]:
def download_dataset():
    ! wget https://download.pytorch.org/tutorial/data.zip
    ! unzip data.zip

In [3]:
# to download the dataset
#download_dataset()

We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called ``Lang`` which has word → index (``word2index``) and index → word
(``index2word``) dictionaries, as well as a count of each word
``word2count`` to use to later replace rare words.

In [4]:
SOS_token = 1
EOS_token = 2
class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {"PAD": 0, "SOS": 1, "EOS": 2, "UNK": 3}
        self.word2count = {}
        self.index2word = {0: "PAD", 1: "SOS", 2: "EOS", 3: "UNK"}
        self.n_words = 4  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

The files are all in Unicode, to simplify we will turn Unicode
characters to ASCII, make everything lowercase, and trim most
punctuation.




In [5]:
def unicodeToAscii(s):
    """Turn a Unicode string to plain ASCII
    
    https://stackoverflow.com/a/518232/2809427
    """
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalizeString(s):
    """Lowercase, trim, and remove non-letter characters"""
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

In [6]:
def readLangs(filename):
    # Read the file and split into lines
    lines = open(filename).read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    return pairs

In [7]:
# filtering some of the data
MAX_LENGTH = 15

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filterPair(p):
    return len(p[0].split(' ')) <= MAX_LENGTH and \
        len(p[1].split(' ')) <= MAX_LENGTH and \
        p[0].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

The full process for preparing the data is:

-  Read text file and split into lines, split lines into pairs
-  Normalize text, filter by length and content
-  Make word lists from sentences in pairs




In [8]:
pairs = readLangs("data/eng-fra.txt")
print("Read %s sentence pairs" % len(pairs))
pairs = filterPairs(pairs)
print("Trimmed to %s sentence pairs" % len(pairs))

Read 135842 sentence pairs
Trimmed to 12898 sentence pairs


In [9]:
def prepareData(data_filename):
    pairs = readLangs(data_filename)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    
    
    #randomize the data with a fixed seed for repeatability
    random.seed(4)
    random.shuffle(pairs)
    #choose the first 10 pairs for testing and the rest for training
    valid_pairs = pairs[0:300]
    train_pairs = pairs[300:len(pairs)]
    
    print("number of test pairs: %s" % len(valid_pairs))
    print("number of train pairs: %s" % len(train_pairs))
    
    input_lang = Lang("english")
    output_lang = Lang("french")
    
    print("Counting words...")
    cnt = 0
    for pair in pairs:
        input_lang.addSentence(pair[1])
        output_lang.addSentence(pair[0])
        
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs, train_pairs, valid_pairs

input_lang, output_lang, pairs, train_pairs, valid_pairs = prepareData("data/eng-fra.txt")
random.seed(4)
print(random.choice(pairs))

Read 135842 sentence pairs
Trimmed to 12898 sentence pairs
number of test pairs: 300
number of train pairs: 12598
Counting words...
Counted words:
english 5070
french 3331
['he is too drunk to drive home .', 'il est trop saoul pour conduire jusque chez lui .']


In [10]:
train_pairs[0]

['he is a tennis player .', 'c est un joueur de tennis .']

# Dataset

In [11]:
def encode_sentence(s, vocab2index, N=MAX_LENGTH + 2, padding_start=True):
    enc = np.zeros(N, dtype=np.int32)
    enc1 = np.array([SOS_token] + [vocab2index.get(w, vocab2index["UNK"]) for w in s.split()] + [EOS_token])
    l = min(N, len(enc1))
    if padding_start:
        enc[:l] = enc1[:l]
    else:
        enc[N-l:] = enc1[:l]
    return enc, l

In [12]:
train_pairs[0]

['he is a tennis player .', 'c est un joueur de tennis .']

In [13]:
encode_sentence(train_pairs[0][0], input_lang.word2index, padding_start=False)

(array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   1,   3,   3,  79,
        554,   3,  11,   2], dtype=int32), 8)

In [14]:
encode_sentence(train_pairs[0][1], output_lang.word2index)

(array([  1,   3,   3,   3,   3,   3, 499,  11,   2,   0,   0,   0,   0,
          0,   0,   0,   0], dtype=int32), 9)

In [15]:
class PairDataset(Dataset):
    def __init__(self, pairs, input_lang, output_lang):
        self.pairs = pairs
        self.input_word2index = input_lang.word2index
        self.output_word2index = output_lang.word2index
    
    def __len__(self):
        return len(self.pairs)
    
    def __getitem__(self, idx):
        x, n_x = encode_sentence(self.pairs[idx][1], self.input_word2index, padding_start=False)
        y, n_y = encode_sentence(self.pairs[idx][0], self.output_word2index)
        return x, y
    
train_ds = PairDataset(train_pairs, input_lang, output_lang)
valid_ds = PairDataset(valid_pairs, input_lang, output_lang)

In [16]:
train_ds[0]

(array([  0,   0,   0,   0,   0,   0,   0,   0,   1,  44,  45,  97, 553,
         16, 554,  11,   2], dtype=int32),
 array([  1,  90,  38,  39, 499, 500,  11,   2,   0,   0,   0,   0,   0,
          0,   0,   0,   0], dtype=int32))

In [17]:
batch_size=5
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

## The Seq2Seq Model

A Recurrent Neural Network, or RNN, is a network that operates on a
sequence and uses its own output as input for subsequent steps.

A `Sequence to Sequence network <https://arxiv.org/abs/1409.3215>`__, or
seq2seq network, or `Encoder Decoder
network <https://arxiv.org/pdf/1406.1078v3.pdf>`__, is a model
consisting of two RNNs called the encoder and decoder. The encoder reads
an input sequence and outputs a single vector, and the decoder reads
that vector to produce an output sequence.

### The Encoder

The encoder of a seq2seq network is a RNN that outputs some value for
every word from the input sentence. For every input word the encoder
outputs a vector and a hidden state, and uses the hidden state for the
next input word.

![](imgs/encoder-network.png)

In [18]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size, padding_idx=0)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, x):
        x = self.embedding(x)
        output, hidden = self.gru(x)
        return output, hidden

In [19]:
x, y = next(iter(train_dl))

In [20]:
x, y

(tensor([[   0,    0,    0,    0,    0,    0,    0,    1,   44,   45,   46, 3225,
            16,   20,  154,   11,    2],
         [   0,    0,    0,    0,    0,    1,   22,   23,   24,   25,  480,   79,
            97,  706, 1964,   11,    2],
         [   0,    0,    0,    0,    0,    0,    0,    0,    1,   22,   24,  747,
            16,  231, 3304,   11,    2],
         [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    1,   57,
            58,  113, 3665,   11,    2],
         [   0,    0,    0,    1,   12,  287, 1370,   40,   46, 1371,   16,  784,
            79,   99, 1372,   11,    2]], dtype=torch.int32),
 tensor([[   1,   90,   38, 1214,   76,   21,  851,   11,    2,    0,    0,    0,
             0,    0,    0,    0,    0],
         [   1,   17,   18,   23,  472,   15,  361, 1513,   11,    2,    0,    0,
             0,    0,    0,    0,    0],
         [   1,   17,   64,   96,  133,  208, 2368,   11,    2,    0,    0,    0,
             0,    0,    0,    0,   

In [21]:
input_size = input_lang.n_words
hidden_size = 100
encoder = EncoderRNN(input_size, hidden_size)

In [22]:
enc_outputs, enc_hidden = encoder(x.long())

In [23]:
enc_outputs.shape, enc_hidden.shape

(torch.Size([5, 17, 100]), torch.Size([1, 5, 100]))

The  Decoder
-----------

In [24]:
class DecoderRNN(nn.Module):
    def __init__(self, output_size, hidden_size):
        super(DecoderRNN, self).__init__()

        self.embedding = nn.Embedding(output_size, hidden_size, padding_idx=0)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input)
        output, hidden = self.gru(embedded, hidden)
        output = self.out(hidden[-1])
        return output, hidden

In [25]:
output_size = output_lang.n_words
hidden_size = 100

In [26]:
batch_size = y.size(0)
decoder_input = SOS_token*torch.ones(batch_size,1).long()
decoder_input.shape

torch.Size([5, 1])

In [27]:
decoder = DecoderRNN(output_size, hidden_size)

In [28]:
output, hidden = decoder(decoder_input, enc_hidden)

In [29]:
hidden.shape, output.shape

(torch.Size([1, 5, 100]), torch.Size([5, 3331]))

Training
========

In [30]:
def train_batch(x, y, encoder, decoder, encoder_optimizer, decoder_optimizer,
                teacher_forcing_ratio=0.5):

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    batch_size = y.size(0)
    target_length = y.size(1)

    enc_outputs, enc_hidden = encoder(x)

    loss = 0
    dec_input = y[:,0].unsqueeze(1) # allways SOS
    hidden = enc_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    for di in range(1, target_length):
        output, hidden = decoder(dec_input, hidden)
        yi =  y[:, di]
        if (yi>0).sum() > 0:
            # ignoring padding
            loss += F.cross_entropy(output, yi, ignore_index = 0, reduction="sum")/(yi>0).sum()
        if use_teacher_forcing:
            dec_input = y[:, di].unsqueeze(1)  # Teacher forcing: Feed the target as the next input
        else:                
            dec_input = output.argmax(dim=1).unsqueeze(1).detach()

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item()

In [31]:
def train(encoder, decoder, enc_optimizer, dec_optimizer, epochs = 10,
          teacher_forcing_ratio=0.5):
    for i in range(epochs):
        total_loss = 0
        total = 0
        encoder.train()
        decoder.train()
        for x, y in train_dl:
            x = x.long().cuda()
            y = y.long().cuda()
            loss = train_batch(x, y, encoder, decoder, enc_optimizer, dec_optimizer,
                               teacher_forcing_ratio)
            total_loss = loss*x.size(0)
            total += x.size(0)
        if i%10 == 0:
            print("train loss %.3f" % (total_loss / total))   

In [32]:
input_size = input_lang.n_words
output_size = output_lang.n_words
hidden_size = 300
encoder = EncoderRNN(input_size, hidden_size).cuda()
decoder = DecoderRNN(output_size, hidden_size).cuda()
enc_optimizer = optim.Adam(encoder.parameters(), lr=0.01)
dec_optimizer = optim.Adam(decoder.parameters(), lr=0.01) 

In [33]:
batch_size= 1000
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

In [34]:
train(encoder, decoder, enc_optimizer, dec_optimizer, epochs = 20)

train loss 2.418
train loss 1.320


In [35]:
enc_optimizer = optim.Adam(encoder.parameters(), lr=0.001)
dec_optimizer = optim.Adam(decoder.parameters(), lr=0.001) 
train(encoder, decoder, enc_optimizer, dec_optimizer, epochs = 40)

train loss 0.888
train loss 0.669
train loss 0.157
train loss 0.121


In [36]:
train(encoder, decoder, enc_optimizer, dec_optimizer, epochs = 300, teacher_forcing_ratio=0.0)

train loss 0.408
train loss 0.303
train loss 0.278
train loss 0.265
train loss 0.133
train loss 0.279
train loss 0.165
train loss 0.162
train loss 0.121
train loss 0.087
train loss 0.209
train loss 0.123
train loss 0.114
train loss 0.105
train loss 0.063
train loss 0.052
train loss 0.110
train loss 0.147
train loss 0.049
train loss 0.088
train loss 0.037
train loss 0.181
train loss 0.035
train loss 0.034
train loss 0.023
train loss 0.029
train loss 0.049
train loss 0.039
train loss 0.033
train loss 0.026


Evaluation
==========

Evaluation is mostly the same as training, but there are no targets so
we simply feed the decoder's predictions back to itself for each step.
Every time it predicts a word we add it to the output string, and if it
predicts the EOS token we stop there. We also store the decoder's
attention outputs for display later.




* `model.eval()` will notify all your layers that you are in eval mode, that way, batchnorm or dropout layers will work in eval mode instead of training mode.
* `torch.no_grad()` impacts the autograd engine and deactivate it. It will reduce memory usage and speed up computations but you won’t be able to backprop (which you don’t want in an eval script).

In [44]:
def decoding(x, y, encoder, decoder, max_length=MAX_LENGTH+2):
    decoder = decoder.eval()
    loss = 0
    with torch.no_grad():   
        batch_size = x.size(0)
        enc_outputs, hidden = encoder(x)
        dec_input = SOS_token*torch.ones(batch_size, 1).long().cuda()  # SOS
        decoded_words = []
        for di in range(1, max_length):
            output, hidden = decoder(dec_input, hidden)
            pred = output.argmax(dim=1)
            decoded_words.append(pred.cpu().numpy())
            dec_input = output.argmax(dim=1).unsqueeze(1).detach()
            yi =  y[:, di]
            if (yi>0).sum() > 0:
                # ignoring padding
                loss += F.cross_entropy(
                    output, yi, ignore_index = 0, reduction="sum")/(yi>0).sum()
        return loss.item()/batch_size, np.transpose(decoded_words)

In [45]:
batch_size=300
valid_dl_2 = DataLoader(valid_ds, batch_size=batch_size, shuffle=True)

x, y = next(iter(valid_dl_2)) 
x = x.long().cuda()
y = y.long().cuda()

loss, _ = decoding(x, y, encoder, decoder)
loss

0.17409254709879557

In [46]:
batch_size=5
train_dl_2 = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

x, y = next(iter(train_dl_2)) 
x = x.long().cuda()
y = y.long().cuda()

We can evaluate random sentences from the training set and print out the
input, target, and output to make some subjective quality judgements:




In [47]:
def print_results(x, y, encoder, decoder):
    _, decoded_words = decoding(x, y, encoder, decoder)
    for i in range(x.shape[0]):
        xi = x[i].cpu().numpy()
        yi = y[i].cpu().numpy()
        y_hat = decoded_words[i]
        x_sent = ' '.join([input_lang.index2word[t] for t in xi if t > 3])
        y_sent = ' '.join([output_lang.index2word[t] for t in yi if t > 3])
        y_hat_sent = ' '.join([output_lang.index2word[t] for t in y_hat if t > 3])
        print('>', x_sent)
        print('=', y_sent)
        print('<', y_hat_sent)
        print('')

In [48]:
print_results(x, y, encoder, decoder)

> nous ne sommes plus amies .
= we re not friends anymore .
< we re not friends anymore .

> je ne suis pas aveugle .
= i m not blind .
< i m not blind .

> elles courent dans le parc .
= they are running in the park .
< they are running in the park .

> tu es tres curieux .
= you re very curious .
< you re very curious .

> j ai du mal a dormir .
= i m having trouble sleeping .
< i m having trouble sleeping .



In [49]:
batch_size=10
valid_dl_2 = DataLoader(valid_ds, batch_size=batch_size, shuffle=True)

x, y = next(iter(valid_dl_2)) 
x = x.long().cuda()
y = y.long().cuda()

In [50]:
print_results(x, y, encoder, decoder)

> je travaille dans la recherche contre le sida .
= i am engaged in aids research .
< i am in charge of the opinion .

> je ne suis pas cette sorte de fille .
= i m not that kind of girl .
< i m not married to .

> t es une drole de fille .
= you re a funny girl .
< you re a funny gal .

> elle est pauvre mais elle est heureuse .
= she is poor but she is happy .
< she is poor but she looks happy .

> je suis furieux .
= i m furious .
< i m in .

> c est un menteur notoire .
= he s a notorious liar .
< he s a man of his word .

> je cherche mes cles . les as tu vues ?
= i m looking for my keys . have you seen them ?
< i m looking for my keys . have you seen them ?

> c est une beaute .
= she is a beauty .
< she s a cutie .

> nous n allons pas sortir .
= we re not going out .
< we re not going to .

> tu n es pas un tueur et moi non plus .
= you re not a killer and neither am i .
< you re not the only one i like this .



# Credits
The original notebook was written by Sean Robertson <https://github.com/spro/practical-pytorch>_