# KAIST AI605 Assignment 3: Transformer

TA in charge: Jaehyeong Jo (harryjo97@kaist.ac.kr)

**Due date**:  Nov 22 (Mon) 11:00pm, 2021  


## Your Submission
If you are a KAIST student, you will submit your assignment via [KLMS](https://klms.kaist.ac.kr). If you are a NAVER student, you will submit via [Google Form](https://forms.gle/aGZZ86YpCdv2zEVt9). 

You need to submit both (1) a PDF of this notebook, and (2) a link to CoLab for execution (.ipynb file is also allowed).

Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 20 points. You can obtain up to 3 bonus points (i.e. max score is 23 points). For every late day, 2 points will be deducted from your grade (KAIST students only). You can use one of your no-penalty late days (7 days in total). Make sure to mention this in your submission. You will receive a grade of zero if you submit after 7 days.


## Environment
You will only use Python 3.7 and PyTorch 1.9, which is already available on Colab:

# **<<This is a Late Submission !!! (5 days)>>**

In [1]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.8.3
torch 1.7.0


## 1. Attention Layer

We will start by going over a few concepts that you learned in your high school statistics class. The variance of a random variable $X$, $\text{Var}(X)$, is defined as $\text{E}[(X-\mu)^2]$, where $\mu$ is the mean of $X$. Furthermore, given two independent random variables $X$ and $Y$ and a constant $a$,
$$ \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y),$$
$$ \text{Var}(aX) = a^2\text{Var}(X),$$
$$ \text{Var}(XY) = \text{E}(X^2)\text{E}(Y^2) - [\text{E}(X)]^2[\text{E}(Y)]^2.$$

> **Problem 1.1** *(3 points)* Suppose we are given two sets of $n$ random variables, $X_1 \dots X_n$ and $Y_1 \dots Y_n$, where all of these $2n$ variables are mutually independent and have a mean of $0$ and a variance of $1$. Prove that
$$\text{Var}\left(\sum_i^n X_i Y_i\right) = n.$$

> **Answer 1.1**

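Since $\text{E}(X_i) = \text{E}(Y_i) = 0$ and $\text{Var}(X_i) = \text{Var}(Y_i) = 1$, we have $\text{E}(X_i^2) = \text{Var}(X_i) + [\text{E}(X_i)]^2 = 1$, and likewise $\text{E}(Y_i^2) = 1$. The products $X_i Y_i$ are mutually independent because they are functions of disjoint sets of independent variables, so

\begin{align}
\text{Var}\Big(\sum_i^n X_i Y_i\Big) &= \sum_i^n\text{Var}(X_i Y_i) ~~ \text{(by indep.)} \\
&= \sum_i^n \Big( \text{E}(X_i^2)\text{E}(Y_i^2) - [\text{E}(X_i)]^2 [\text{E}(Y_i)]^2 \Big) \\
&= \sum_i^n (1\cdot 1 - 0 \cdot 0) \\
&= n
\end{align}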

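The identity above can be sanity-checked numerically. The following is a minimal Monte Carlo sketch using only the standard library; it assumes standard normal variables (any zero-mean, unit-variance distribution should behave the same way), and the function name `simulate_variance` is made up for this illustration. The printed estimates should land close to $n$ for each tested $n$, up to sampling noise.

```python
import random
import statistics

def simulate_variance(n, trials=20000, seed=0):
    """Estimate Var(sum_i X_i * Y_i) for independent standard normal X_i, Y_i."""
    rng = random.Random(seed)
    sums = []
    for _ in range(trials):
        total = 0.0
        for _ in range(n):
            # Each term is a product of two independent N(0, 1) draws.
            total += rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0)
        sums.append(total)
    return statistics.pvariance(sums)

for n in (1, 4, 16):
    # The estimate should be close to n.
    print(n, round(simulate_variance(n), 2))
```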
In Lectures 11 and 12, we discussed how attention is computed in the Transformer via the following equation,
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
> **Problem 1.2** *(3 points)*  Suppose $Q$ and $K$ are matrices of independent variables each of which has a mean of $0$ and a variance of $1$. Using what you learned from Problem 1.1., show that
$$\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = 1.$$

> **Answer 1.2**

Let $Q\in \mathbb{R}^{B\times d_k}$ and $K\in \mathbb{R}^{B\times d_k}$, and let $Q_i$ and $K_j$ denote the $i$-th row of $Q$ and the $j$-th row of $K$. The $(i,j)$ entry of $QK^\top$ is $Q_i K_j^\top = \sum_{k=1}^{d_k} Q_{ik} K_{jk}$. Since the entries of $Q_i$ and $K_j$ are mutually independent and have zero mean and unit variance, the following holds for all $i,j\in[B]$:

\begin{align}
\text{Var}\Big(\frac{Q_i K_j^\top}{\sqrt{d_k}}\Big) &= \frac{1}{d_k} \text{Var}\big(Q_i K_j^\top \big) \\
&= \frac{1}{d_k} \text{Var}\Big(\sum_{k=1}^{d_k} Q_{ik} K_{jk} \Big) \\
&= \frac{1}{d_k} \cdot d_k ~~~ \text{(Problem 1.1)} \\
&= 1
\end{align}


> **Problem 1.3** *(4 points)* What would happen if the assumption that the variance of $Q$ and $K$ is $1$ does not hold? Consider each case of it being higher and lower than $1$ and conjecture what it implies, respectively.

>**Answer 1.3**

Suppose every entry of $Q$ and $K$ has zero mean and variance $\sigma^2$. Repeating the argument of Problem 1.1 gives $\text{Var}(Q_i K_j^\top) = d_k \sigma^4$, so each scaled score has variance $\sigma^4$ instead of $1$. If $\sigma^2 > 1$, then $\text{Var}(Q_i K_j^\top) > d_k$ and scaling by $\sqrt{d_k}$ is not enough: the scores fed to the softmax are spread widely, so the softmax produces a sharp distribution that attends to only a few positions. On the other hand, if $\sigma^2 < 1$, then $\text{Var}(Q_i K_j^\top) < d_k$ and scaling by $\sqrt{d_k}$ shrinks the scores too much: the softmax produces a flat distribution that gives every position a similar attention weight.

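The conjecture can be probed with a small simulation. The sketch below assumes zero-mean Gaussian entries with standard deviation `sigma` for both the query and the keys, 32 keys, and $d_k = 64$; all of these values and the helper names are choices made for this illustration only. Under these assumptions each scaled score has variance $\sigma^4$, and the entropy of the softmax output measures how flat the attention distribution is: it should approach $\log 32$ for small `sigma` and drop as `sigma` grows.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_attention_entropy(sigma, d_k=64, n_keys=32, trials=200, seed=0):
    """Average entropy of softmax(q.k / sqrt(d_k)) when entries of q and k have std sigma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0.0, sigma) for _ in range(d_k)]
        scores = []
        for _ in range(n_keys):
            k = [rng.gauss(0.0, sigma) for _ in range(d_k)]
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        total += entropy(softmax(scores))
    return total / trials

for sigma in (0.5, 1.0, 2.0):
    # Compare each average entropy with the maximum possible value log(32).
    print(sigma, round(mean_attention_entropy(sigma), 3), "max:", round(math.log(32), 3))
```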
## 2. Transformer

In this section, you will implement Transformer for a few tasks that are simpler than machine translation. First, go through [Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html) and make sure you understand every block of the code. Then, you will reuse this code where appropriate to create models for the following three tasks. Note that we do not provide separate training or evaluation data, so it is your job to create these in a reasonable manner.


### Problem 2.1


> **Problem 2.1** *(4 points)* Create a model that takes a random set of input symbols from a vocabulary of digits (i.e. 0, 1, ... , 8, 9) as the input and generate back the same symbols. Instead of varying length, we fix the length to 32. Make sure to report that your model's accuracy (gives credit only if the entire output sequence is correct) goes above 90%. Note that a similar problem is also in Annotated Transformer, and copying code is allowed.

In [2]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline

In [3]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many 
    other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
        
    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask,
                            tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
    
    
class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

    
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
    

class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
    

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))
    

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
    
    
class Decoder(nn.Module):
    "Generic N layer decoder with masking."
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)
    
    
class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
 
    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)
    
def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0

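To make the effect of `subsequent_mask` concrete, here is a dependency-free sketch of the same lower-triangular pattern built with plain lists. The function name `causal_mask` is hypothetical and is not part of the notebook; entry `[i][j]` is `True` when position `i` is allowed to attend to position `j`, which matches the `np.triu(..., k=1) == 0` construction used above.

```python
def causal_mask(size):
    """Return a size x size boolean matrix; entry [i][j] is True when position i may attend to j."""
    return [[j <= i for j in range(size)] for i in range(size)]

# Print the mask for a length-4 sequence: 'x' marks allowed positions, '.' marks masked ones.
for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])
```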
In [4]:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
    

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))
    
    
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
    
    
class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        x = x + Variable(self.pe[:, :x.size(1)], 
                         requires_grad=False)
        return self.dropout(x)

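The table built inside `PositionalEncoding` can be reproduced without tensors to inspect individual values. The sketch below follows the same formula (sine on even columns, cosine on odd columns, frequencies spaced geometrically up to a wavelength scale of 10000) using only the standard library. The function name `positional_encoding` and the even-`d_model` assertion are assumptions added for this illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal position table as a list of rows, one row per position."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # Same frequency term as div_term in the notebook's PositionalEncoding.
            angle = pos * math.exp(i * -(math.log(10000.0) / d_model))
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=8)
print([round(v, 3) for v in pe[0]])  # position 0: sines are 0, cosines are 1
print([round(v, 3) for v in pe[1]])
```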
In [5]:
def make_model(src_vocab, tgt_vocab, N=6, 
               d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), 
                             c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))
    
    # This was important from their code. 
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

class Batch:
    "Object for holding a batch of data with mask during training."
    def __init__(self, src, trg=None, pad=0):
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if trg is not None:
            self.trg = trg[:, :-1]
            self.trg_y = trg[:, 1:]
            self.trg_mask = \
                self.make_std_mask(self.trg, pad)
            self.ntokens = (self.trg_y != pad).data.sum()
    
    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & Variable(
            subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
        return tgt_mask
    

class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0
        
    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
        
    def rate(self, step = None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
            min(step ** (-0.5), step * self.warmup ** (-1.5)))
        
def get_std_opt(model):
    return NoamOpt(model.src_embed[0].d_model, 2, 4000,
            torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

class LabelSmoothing(nn.Module):
    "Implement label smoothing."
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None
        
    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))
    
class SimpleLossCompute:
    "A simple loss compute and train function."
    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt
        
    def __call__(self, x, y, norm):
        seq_len = y.size(-1)
        x = self.generator(x)
        pred = x.data.max(dim=-1)[1]
        correct = ((pred == y).sum(dim=-1) == seq_len).sum().item()
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)), 
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data * norm, correct

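The schedule implemented by `NoamOpt.rate` is easier to read with a few concrete numbers. The sketch below re-states the formula as a standalone function (the name `noam_rate` is hypothetical) with the values used for training in this notebook: `model_size=512`, `factor=1`, and `warmup=400`. The rate grows linearly during warmup, peaks at `step == warmup`, and then decays proportionally to the inverse square root of the step.

```python
def noam_rate(step, model_size=512, factor=1.0, warmup=400):
    """Learning rate at a given step (step >= 1) under the warmup schedule."""
    return factor * model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))

for step in (1, 100, 400, 1600):
    print(step, f"{noam_rate(step):.6f}")
```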
In [6]:
def data_gen(V, batch, nbatches):
    "Generate random data for a src-tgt copy task."
    for i in range(nbatches):
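        # Symbols are drawn from 1..V-1 because 0 is reserved for padding.
        data = torch.from_numpy(np.random.randint(1, V, size=(batch, 33)))
        data[:, 0] = 1  # start symbol; the remaining 32 positions are the sequence to copy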
        src = Variable(data, requires_grad=False)
        tgt = Variable(data, requires_grad=False)
        yield Batch(src, tgt, 0)

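Each generated row has 33 entries because `Batch` splits the target into a decoder input (`trg[:, :-1]`) and the labels to predict (`trg[:, 1:]`), which leaves 32 predicted positions. A toy sketch of that shift on a single short sequence is shown below; the helper name `shift_for_teacher_forcing` is made up for this illustration.

```python
def shift_for_teacher_forcing(sequence):
    """Split one target sequence into decoder input and prediction labels."""
    decoder_input = sequence[:-1]   # start symbol plus all but the last token
    labels = sequence[1:]           # every token after the start symbol
    return decoder_input, labels

seq = [1, 7, 3, 9, 4]  # toy example: start symbol 1 followed by four tokens
decoder_input, labels = shift_for_teacher_forcing(seq)
print(decoder_input)  # [1, 7, 3, 9]
print(labels)         # [7, 3, 9, 4]
```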
In [7]:
def run_epoch(data_iter, model, loss_compute, epoch):
    "Standard Training and Logging Function"
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    acc = 0
    cnt = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg, 
                            batch.src_mask, batch.trg_mask)
        loss, correct = loss_compute(out, batch.trg_y, batch.ntokens)
        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        acc += correct
        cnt += batch.trg_y.shape[0]
        if (i+1) % 20 == 0:
            elapsed = time.time() - start
            print("Epoch: %d Step: %d Loss: %f Tokens per Sec: %f" %
                    (epoch, i+1, loss / batch.ntokens, tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens, acc / cnt

In [8]:
# Train the simple copy task.
V = 10
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
        torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

for epoch in range(100):
    model.train()
    run_epoch(data_gen(V, 30, 20), model, 
              SimpleLossCompute(model.generator, criterion, model_opt), epoch)
    model.eval()
    loss, acc = run_epoch(data_gen(V, 30, 5), model, 
                    SimpleLossCompute(model.generator, criterion, None), epoch)
    print('Validation Loss: {:.4f}, Validation Acc: {:.2f}%'.format(loss, acc * 100))



Epoch: 0 Step: 20 Loss: 2.292718 Tokens per Sec: 6041.841309
Validation Loss: 2.1984, Validation Acc: 0.00%
Epoch: 1 Step: 20 Loss: 2.213340 Tokens per Sec: 6070.613281
Validation Loss: 2.1488, Validation Acc: 0.00%
Epoch: 2 Step: 20 Loss: 2.113729 Tokens per Sec: 6122.917969
Validation Loss: 2.1139, Validation Acc: 0.00%
Epoch: 3 Step: 20 Loss: 2.154241 Tokens per Sec: 6077.809570
Validation Loss: 2.0846, Validation Acc: 0.00%
Epoch: 4 Step: 20 Loss: 2.106472 Tokens per Sec: 6212.707520
Validation Loss: 2.0769, Validation Acc: 0.00%
Epoch: 5 Step: 20 Loss: 2.050679 Tokens per Sec: 6183.625977
Validation Loss: 2.0213, Validation Acc: 0.00%
Epoch: 6 Step: 20 Loss: 2.006129 Tokens per Sec: 6050.239258
Validation Loss: 1.9740, Validation Acc: 0.00%
Epoch: 7 Step: 20 Loss: 1.999506 Tokens per Sec: 6232.646973
Validation Loss: 1.9442, Validation Acc: 0.00%
Epoch: 8 Step: 20 Loss: 1.708790 Tokens per Sec: 6243.300293
Validation Loss: 1.4965, Validation Acc: 0.00%
Epoch: 9 Step: 20 Loss: 1.72

### Problem 2.2



> **Problem 2.2** *(6 points)* Now, we will implement a bit more useful function, so-called spelling error correction. Your job is to create a model whose input is a word with spelling errors, and the output is the spelling-corrected word. Here, your vocabulary will be character instead of word. You can create your own training data by using an existing text corpus as the target and inject noise into it to use it as the input. You are free to use whichever text corpus you like. If you can't think of one, please use context data in SQuAD Dataset (see Assignment 2). Report accuracy in your own evaluation data (you will receive full credit as long as both the evaluation data and the accuracy are reasonable), and also show 5 examples where it succeeds at correcting spelling.



> **Answer 2.2** I used the SQuAD validation split. The training set consists of words from the 'context' fields, and the evaluation set consists of words from the 'question' fields. The evaluation accuracy for spelling correction is **21.3%**. See the code below for 5 examples of corrected spellings.

In [9]:
from datasets import load_dataset
from pprint import pprint

squad_dataset = load_dataset('squad')
pprint(squad_dataset['validation'][0])


{'answers': {'answer_start': [177, 177, 177],
             'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the '
            'champion of the National Football League (NFL) for the 2015 '
            'season. The American Football Conference (AFC) champion Denver '
            'Broncos defeated the National Football Conference (NFC) champion '
            'Carolina Panthers 24–10 to earn their third Super Bowl title. The '
            "game was played on February 7, 2016, at Levi's Stadium in the San "
            'Francisco Bay Area at Santa Clara, California. As this was the '
            '50th Super Bowl, the league emphasized the "golden anniversary" '
            'with various gold-themed initiatives, as well as temporarily '
            'suspending the tradition of naming each Super Bowl game with '
            'Roman numerals (under which the game would have been known as '
            '"Super

In [10]:
import copy
from tqdm import tqdm


words = list()
test_words = list()

for i in tqdm(range(len(squad_dataset['validation']))):
    context = squad_dataset['validation'][i]['context'].split(' ')
    question = squad_dataset['validation'][i]['question'].split(' ')
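    # zip() stops at the shorter sequence, so for each example only as many
    # context words as there are question words are collected (and vice versa).
    for word, test_word in zip(context, question):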
        if word not in words:
            words.append(word)
        if test_word not in test_words:
            test_words.append(test_word)

#answer = copy.deepcopy(words)
print(len(words), len(test_words))

100%|██████████| 10570/10570 [00:07<00:00, 1355.23it/s]

8467 15634





In [11]:
vocab = list([chr(i) for i in range(ord('a'),ord('z')+1)])
vocab += ['UNK', 'PAD']
char2id = {char: i for i, char in enumerate(vocab)} 
id2char = {i : char for i, char in enumerate(vocab)} 

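As an illustration of how this vocabulary is used, the sketch below encodes a word into a fixed-length id sequence and decodes it back. The helpers `encode` and `decode` are hypothetical names introduced here; they assume the same conventions as the notebook (characters outside `a` to `z` map to `UNK`, id 26, and short words are padded with `PAD`, id 27).

```python
import string

vocab = list(string.ascii_lowercase) + ['UNK', 'PAD']
char2id = {char: i for i, char in enumerate(vocab)}
id2char = {i: char for i, char in enumerate(vocab)}
UNK_ID, PAD_ID = char2id['UNK'], char2id['PAD']

def encode(word, length=32):
    """Lower-case a word, map characters to ids, then truncate or pad to `length`."""
    ids = [char2id.get(ch, UNK_ID) for ch in word.lower()][:length]
    return ids + [PAD_ID] * (length - len(ids))

def decode(ids):
    """Map ids back to characters, dropping padding."""
    return [id2char[i] for i in ids if i != PAD_ID]

ids = encode("Levi's", length=8)
print(ids)          # [11, 4, 21, 8, 26, 18, 27, 27]
print(decode(ids))  # ['l', 'e', 'v', 'i', 'UNK', 's']
```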
In [12]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import random


class SentenceBatch(Dataset):
    def __init__(self, src, target, pad=27):
        src = torch.Tensor(src)
        target = torch.Tensor(target)
        assert src.shape == target.shape
        
        self.src = torch.ones(src.size(0), src.size(1)+1)
        self.src[:, 1:] = src
        self.src = self.src.long()
        
        self.target = torch.ones(target.size(0), target.size(1)+1)
        self.target[:, 1:] = target
        self.target = self.target.long()
        
        self.src_mask = (self.src != pad).unsqueeze(-2)
        
        self.trg = self.target[:, :-1]
        self.trg_y = self.target[:, 1:]
        self.trg_mask = self.make_std_mask(self.trg, pad)
    
    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        src = self.src[idx]
        trg = self.trg[idx]
        trg_y = self.trg_y[idx]
        # .long() converts the boolean masks to int64 without the copy-construct warning
        src_mask = self.src_mask[idx].long()
        trg_mask = self.trg_mask[idx].long()

        return src, trg, trg_y, src_mask, trg_mask

    @staticmethod
    def make_std_mask(trg, pad):
        "Create a mask to hide padding and future words."
        trg_mask = (trg != pad).unsqueeze(-2)
        trg_mask = trg_mask & Variable(
            subsequent_mask(trg.size(-1)).type_as(trg_mask.data))
        return trg_mask

    
length = 32
train_src = []
train_trg = []

for word in tqdm(words):
    trg_id = [char2id[char] if char in char2id else 26 for char in word.lower()]
    wordlen = len(trg_id)
    src_id = trg_id[:]
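    if wordlen >= 2:
        # Overwrite one random position with a random letter; the new letter
        # may equal the original, in which case the word stays unchanged.
        rand1 = random.randint(0, wordlen-1)
        rand2 = random.randint(0, 25)
        src_id[rand1] = rand2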
    if wordlen < length:
        src_id += (length - wordlen) * [27] 
        trg_id += (length - wordlen) * [27]
    elif wordlen > length:
        src_id = src_id[:length]
        trg_id = trg_id[:length]
    train_src.append(src_id)
    train_trg.append(trg_id)
    
test_src = []
test_trg = []

for word in tqdm(test_words):
    trg_id = [char2id[char] if char in char2id else 26 for char in word.lower()]
    wordlen = len(trg_id)
    src_id = trg_id[:]
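    if wordlen >= 2:
        # Overwrite one random position with a random letter; the new letter
        # may equal the original, in which case the word stays unchanged.
        rand1 = random.randint(0, wordlen-1)
        rand2 = random.randint(0, 25)
        src_id[rand1] = rand2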
    if wordlen < length:
        src_id += (length - wordlen) * [27] 
        trg_id += (length - wordlen) * [27]
    elif wordlen > length:
        src_id = src_id[:length]
        trg_id = trg_id[:length]
    test_src.append(src_id)
    test_trg.append(trg_id)

100%|██████████| 8467/8467 [00:00<00:00, 298580.56it/s]
100%|██████████| 15634/15634 [00:00<00:00, 295495.71it/s]

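The noise injection above can leave a word unchanged when the random letter happens to match the original one. The following is a sketch of a variant that always changes exactly one character. It operates on strings rather than id lists, takes an explicit random generator for reproducibility, and the name `inject_typo` is hypothetical; it is an alternative for illustration, not the procedure used to produce the reported results.

```python
import random
import string

def inject_typo(word, rng):
    """Replace one character of `word` with a different lowercase letter."""
    if len(word) < 2:
        return word  # mirror the notebook: very short words are left unchanged
    pos = rng.randrange(len(word))
    # Exclude the current character so the output always differs from the input.
    choices = [c for c in string.ascii_lowercase if c != word[pos]]
    return word[:pos] + rng.choice(choices) + word[pos + 1:]

rng = random.Random(0)
for w in ["performed", "list", "station"]:
    print(w, "->", inject_typo(w, rng))
```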

In [13]:
trainset = SentenceBatch(train_src, train_trg)
testset = SentenceBatch(test_src, test_trg)

train_loader = DataLoader(trainset, batch_size=256, shuffle=True)
test_loader = DataLoader(testset, batch_size=256)

In [14]:
class SimpleLossCompute:
    "A simple loss compute and train function."
    def __init__(self, criterion, opt=None):
        self.criterion = criterion
        self.opt = opt
        
    def __call__(self, x, y, norm):
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)), 
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data * norm
    
    

In [16]:
V = 28
criterion = LabelSmoothing(size=V, padding_idx=27, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
        torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))
loss_compute = SimpleLossCompute(criterion, opt=model_opt)
model.cuda()

for epoch in range(30):
    model.train()
    total_loss = 0
    total_ntokens = 0
    correct = 0
    total = 0
    
    for i, batch in enumerate(train_loader):
        src, trg, trg_y, src_mask, trg_mask = batch
        src, trg, trg_y, src_mask, trg_mask = src.cuda(), trg.cuda(), trg_y.cuda(), src_mask.cuda(), trg_mask.cuda()
        out = model.generator(model.forward(src, trg, src_mask, trg_mask))
        
        pred = out.data.max(dim=-1)[1]
        correct += ((pred*src_mask.squeeze()[:,1:] == (trg_y*src_mask.squeeze()[:,1:])).sum(dim=-1) == 32).cpu().sum().item()
        loss = loss_compute(out, trg_y, trg_y.size(0)*trg_y.size(1))
        
        total_loss += loss 
        total_ntokens += trg_y.size(0)*trg_y.size(1)
        total += trg.size(0)
        
    print('Epoch: {:d}, Train Loss: {:.4f}, Train Acc: {:.2f}%'.format(epoch, total_loss/total_ntokens, correct/total*100))

    correct = 0
    total = 0
    model.eval()
    with torch.no_grad():
        for i, batch in enumerate(test_loader):
            src, trg, trg_y, src_mask, trg_mask = batch        
            src, trg, trg_y, src_mask, trg_mask = src.cuda(), trg.cuda(), trg_y.cuda(), src_mask.cuda(), trg_mask.cuda()
            out = model.generator(model.forward(src, trg, src_mask, trg_mask))
            
            pred = out.data.max(dim=-1)[1]
            correct += ((pred*src_mask.squeeze()[:,1:] == (trg_y*src_mask.squeeze()[:,1:])).sum(dim=-1) == 32).cpu().sum().item()
            total += trg.size(0)
    print('Epoch: {:d}, Test Acc: {:.2f}%'.format(epoch, correct/total*100))



Epoch: 0, Train Loss: 0.6295, Train Acc: 4.71%
Epoch: 0, Test Acc: 3.33%
Epoch: 1, Train Loss: 0.3621, Train Acc: 6.40%
Epoch: 1, Test Acc: 5.51%
Epoch: 2, Train Loss: 0.2315, Train Acc: 9.31%
Epoch: 2, Test Acc: 9.05%
Epoch: 3, Train Loss: 0.1734, Train Acc: 12.63%
Epoch: 3, Test Acc: 12.05%
Epoch: 4, Train Loss: 0.1554, Train Acc: 14.99%
Epoch: 4, Test Acc: 12.51%
Epoch: 5, Train Loss: 0.1491, Train Acc: 16.13%
Epoch: 5, Test Acc: 13.94%
Epoch: 6, Train Loss: 0.1416, Train Acc: 17.40%
Epoch: 6, Test Acc: 15.32%
Epoch: 7, Train Loss: 0.1414, Train Acc: 18.29%
Epoch: 7, Test Acc: 14.95%
Epoch: 8, Train Loss: 0.1413, Train Acc: 18.25%
Epoch: 8, Test Acc: 12.43%
Epoch: 9, Train Loss: 0.1467, Train Acc: 18.29%
Epoch: 9, Test Acc: 14.98%
Epoch: 10, Train Loss: 0.1424, Train Acc: 18.13%
Epoch: 10, Test Acc: 15.98%
Epoch: 11, Train Loss: 0.1490, Train Acc: 16.90%
Epoch: 11, Test Acc: 13.94%
Epoch: 12, Train Loss: 0.1487, Train Acc: 17.01%
Epoch: 12, Test Acc: 13.51%
Epoch: 13, Train Loss: 0.

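The accuracy reported above counts a word as correct only when every non-padding position matches. A plain-Python sketch of that metric is given below. It assumes the padding positions are identified from the target (the notebook uses the source mask, which coincides here because source and target have the same length), and the function name `exact_match_accuracy` is made up for this illustration.

```python
PAD_ID = 27  # padding id used in the notebook's vocabulary

def exact_match_accuracy(predictions, targets, pad_id=PAD_ID):
    """Fraction of sequences whose non-padding positions are all predicted correctly."""
    correct = 0
    for pred, target in zip(predictions, targets):
        # Positions where the target is padding are ignored.
        if all(p == t for p, t in zip(pred, target) if t != pad_id):
            correct += 1
    return correct / len(targets)

targets = [[2, 0, 19, 27, 27], [3, 14, 6, 27, 27]]      # "cat", "dog" padded to length 5
predictions = [[2, 0, 19, 5, 9], [3, 14, 25, 27, 27]]   # first matches, second has one wrong letter
print(exact_match_accuracy(predictions, targets))  # 0.5
```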
In [17]:
correct = 0
total = 0
cnt = 0
model.eval()
with torch.no_grad():
    for i, batch in enumerate(test_loader):
        src, trg, trg_y, src_mask, trg_mask = batch        
        src, trg, trg_y, src_mask, trg_mask = src.cuda(), trg.cuda(), trg_y.cuda(), src_mask.cuda(), trg_mask.cuda()
        out = model.generator(model.forward(src, trg, src_mask, trg_mask))

        pred = out.data.max(dim=-1)[1]
        correct = ((pred*src_mask.squeeze()[:,1:] == (trg_y*src_mask.squeeze()[:,1:])).sum(dim=-1) == 32)
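        # as_tuple=False returns an (N, 1) index tensor and avoids the deprecated nonzero() signature
        idx = torch.nonzero(correct, as_tuple=False)[-5:]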
        
        src = src[idx].squeeze()[:, 1:].cpu().numpy()
        pred = pred[idx].squeeze().cpu().numpy()
        src_mask = src_mask[idx].squeeze()[:, 1:].cpu().numpy()
        length = src_mask.sum(-1)
        
        for j in range(len(src)):
            print(f'Source {j}: ', [id2char[id_] for id_ in src[j]][:length[j]])
            print(f'Target {j}: ', [id2char[id_] for id_ in pred[j]][:length[j]])
            
        break

Source 0:  ['p', 'e', 'r', 'k', 'o', 'r', 'm', 'e', 'd']
Target 0:  ['p', 'e', 'r', 'f', 'o', 'r', 'm', 'e', 'd']
Source 1:  ['l', 'i', 's', 'r']
Target 1:  ['l', 'i', 's', 't']
Source 2:  ['s', 't', 'a', 't', 'i', 'o', 'v']
Target 2:  ['s', 't', 'a', 't', 'i', 'o', 'n']
Source 3:  ['UNK', 'UNK', 'e']
Target 3:  ['UNK', 'UNK', 'UNK']
Source 4:  ['c', 'a', 'm', 'r']
Target 4:  ['c', 'a', 'm', 'e']


