<a href="https://colab.research.google.com/github/CS7140/PA-10/blob/main/Q3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install d2l==0.15.1
!pip install ipython-autotime
%load_ext autotime

time: 1.44 ms


Rajesh Sakhamuru

12-9-2020
# Encoder-Decoder Seq2Seq for English to French Translation

The encoder and decoder do not have to be the same type of neural network. There are many potential uses where encoder-decoder architecture can be useful. One such example that shows that the neural network types do not have to be the same is when this encoder-decoder architecture is used for image captioning. The data being encoded is an image via a Convolutional Neural Network (CNN) but the decoder neural network would be a Recurrent Neural Network (RNN) which would use the state output of the CNN as input to generate a text caption for the image.

If the encoder and decoder differ in the number of layers or the number of hidden units, the decoder hidden state can be initialized by the encoder hidden state output with every decoder layer. This way, the alignment of the encoder state with the decoder state is not a problem because this method allows the decoder to learn which parts of the input sequence are relevant to each part of the output sequence.

Another solution would be to use an attention model which would develop a context vector which is specifically filtered for each output time-step. Attention models achieve similar results to feeding the hidden state to each decoder layer but much more effeciently and explicitly makes the selection of which parts input sequence are relevant to which parts of the output sequence.

Below I have my implementation of Seq2Seq which is an Encoder-Decoder neural network for translating English to French based on the D2L implementation.

In [2]:
from d2l import torch as d2l
import collections
import torch
import math
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

time: 574 ms


In [3]:
class Encoder(nn.Module):
    def __init__(self, vocabSize, embedSize, hiddenSize, numLayers, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.embed = nn.Embedding(vocabSize, embedSize)
        self.gru = nn.GRU(embedSize, hiddenSize, num_layers=numLayers, dropout=0.2)

    def forward(self, X, *args):
        embedded = self.embed(X)
        embedded = embedded.permute(1, 0, 2)
        output, hiddenState = self.gru(embedded)
        return output, hiddenState

time: 6.51 ms


In [4]:
class Decoder(nn.Module):
    def __init__(self, vocabSize, embedSize, hiddenSize, numLayers, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.embed = nn.Embedding(vocabSize, embedSize)
        self.gru = nn.GRU(embedSize + hiddenSize, hiddenSize, numLayers, dropout=0.2)
        self.dense = nn.Linear(hiddenSize, vocabSize)

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]

    def forward(self, X, state):
        embedded = self.embed(X)
        embedded = embedded.permute(1,0,2)
        context = state[-1].repeat(embedded.shape[0], 1, 1)
        X_context = torch.cat((embedded, context), 2)
        output, hiddenState = self.gru(X_context, state)
        output = self.dense(output)
        output = output.permute(1,0,2)
        return output, hiddenState

time: 20.8 ms


In [5]:
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)

time: 6.12 ms


In [6]:
class EncoderDecoderLoss(nn.CrossEntropyLoss):
    def sequenceMask(self, weights, validLen, value=0):
        maxlen = weights.size(1)
        mask = torch.arange((maxlen), dtype=torch.float32,
                            device=weights.device)[None, :] < validLen[:, None]
        weights[~mask] = value
        return weights

    def forward(self, pred, label, validLen):
        self.reduction='none'
        unweightedLoss = super(EncoderDecoderLoss, self).forward(pred.permute(0,2,1), label)

        weights = torch.ones_like(label)
        weights = self.sequenceMask(weights, validLen)
    
        weightedLoss = (unweightedLoss * weights)
        return weightedLoss.mean(dim=1)

time: 15.1 ms


In [7]:
def initializeWeights(model):
    if type(model) == nn.Linear:
            torch.nn.init.xavier_uniform_(model.weight)
    if type(model) == nn.GRU:
        for param in model._flat_weights_names:
            if "weight" in param:
                torch.nn.init.xavier_uniform_(model._parameters[param])

time: 5.59 ms


In [8]:
def gradClipping(model, theta):
    if isinstance(model, nn.Module):
        params = [p for p in model.parameters() if p.requires_grad]
    else:
        params = model.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm

time: 7.62 ms


In [9]:
def trainModel(model, dataIter, lr, numEpochs, vocabList, device):
    model.apply(initializeWeights)
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    lossFunc = EncoderDecoderLoss()
    model.train()

    for epoch in range(numEpochs):
        perplexity = [0.0] * 2
        for batch in dataIter:
            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
            bos = torch.tensor([vocabList['<bos>']]*Y.shape[0], device=device)
            bos = bos.reshape(-1,1)
            decInput = torch.cat([bos, Y[:,:-1]], 1)

            preds, _ = model(X, decInput, X_valid_len)
            loss = lossFunc(preds, Y, Y_valid_len)
            loss.sum().backward()
            gradClipping(model, 1)
            num_tokens = Y_valid_len.sum()
            optimizer.step()
            
            with torch.no_grad():
                perplexity = [a + float(b) for a, b in zip(perplexity, [loss.sum(), num_tokens])]
        if (epoch + 1) % 50 == 0:
                print('epoch:',epoch+1,'|| loss:',loss.sum().item())

time: 25.7 ms


In [10]:
def predictTranslation(model, eng, engVocab, freVocab, numSteps, device):
    src_tokens = engVocab[eng.lower().split(' ')] + [engVocab['<eos>']]
    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
    src_tokens = d2l.truncate_pad(src_tokens, numSteps, engVocab['<pad>'])

    # Add the batch axis
    enc_X = torch.unsqueeze(torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
    enc_outputs = model.encoder(enc_X, enc_valid_len)
    dec_state = model.decoder.init_state(enc_outputs, enc_valid_len)

    # Add the batch axis
    dec_X = torch.unsqueeze(torch.tensor([freVocab['<bos>']], dtype=torch.long, device=device), dim=0)
    output_seq = []
    for _ in range(numSteps):
        Y, dec_state = model.decoder(dec_X, dec_state)

        # We use the token with the highest prediction likelihood as the input
        # of the decoder at the next time step
        dec_X = Y.argmax(dim=2)
        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
        
        # Once the end-of-sequence token is predicted, the generation of
        # the output sequence is complete
        
        if pred == freVocab['<eos>']:
            break
        output_seq.append(pred)

    return ' '.join(freVocab.to_tokens(output_seq))

time: 26.2 ms


In [11]:
def translate(model, engPhrases, frePhrases, engVocab, freVocab, numSteps, device):
    for eng, fra in zip(engPhrases, frePhrases):
        translation = predictTranslation(model, eng, engVocab, freVocab, numSteps, device)
        print(eng, '=>', translation, '|| actual:', fra)

time: 3.07 ms


In [12]:
embedSize, hiddenSize, numLayers = 32, 32, 2
lr, numEpochs = 0.005, 350

trainIterator, engVocab, freVocab = d2l.load_data_nmt(batch_size=64, num_steps=10)

encoder = Encoder(len(engVocab), embedSize, hiddenSize, numLayers)
decoder = Decoder(len(freVocab), embedSize, hiddenSize, numLayers)
model = EncoderDecoder(encoder, decoder)

trainModel(model, trainIterator, lr, numEpochs, freVocab, device)

epoch: 50 || loss: 8.068324089050293
epoch: 100 || loss: 3.3441240787506104
epoch: 150 || loss: 3.24617075920105
epoch: 200 || loss: 2.3754284381866455
epoch: 250 || loss: 3.188391923904419
epoch: 300 || loss: 3.27647066116333
epoch: 350 || loss: 1.7688812017440796
time: 1min 54s


In [13]:
engPhrases = ['go .', "i lost .", 'i\'m home .', 'he\'s calm .']
frePhrases = ['va !', 'j\'ai perdu .', 'je suis chez moi .', 'il est calme .']
translate(model, engPhrases, frePhrases, engVocab, freVocab, numSteps=10, device=device)

go . => va ! || actual: va !
i lost . => j'ai perdu perdu . || actual: j'ai perdu .
i'm home . => je suis chez moi la <unk> ! || actual: je suis chez moi .
he's calm . => il est perdu bon . || actual: il est calme .
time: 39.6 ms
