<a href="https://colab.research.google.com/github/siddharthchd/deepLearning_20/blob/main/transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
from torch import nn
import torch.nn.functional as f
import numpy as np

# Self-Attention

We have a set of $t$ inputs $\{x_{i}\}^{t}_{i=1}$, with each $x_{i} \in \mathbb{R}^{n}$

The $x$'s can be stacked as the columns of a matrix with $n$ rows and $t$ columns:

$X = \begin{bmatrix} x_{1}&x_{2}&\cdots&x_{t} \end{bmatrix} \in \mathbb{R}^{n\times t}$

The hidden representation is a linear combination of the column vectors $x_{i}$: $h = \alpha_{1}x_{1} + \alpha_{2}x_{2} + \cdots + \alpha_{t}x_{t} = Xa \in \mathbb{R}^{n}$, where $a = (\alpha_{1}, \ldots, \alpha_{t})^{T} \in \mathbb{R}^{t}$ collects the coefficients

"Hard" attention: $\lvert\lvert{a}\rvert\rvert_{0} = 1$

i.e., $a$ is a one-hot vector $\rightarrow$ multiplying $X$ by $a$ selects a single column, i.e., one element of the set

"Soft" attention: $\lvert\lvert{a}\rvert\rvert_{1} = 1$

i.e., the constraint is that the elements of $a$, the $\alpha$'s, sum to $1$

Where do the $a$'s come from?
$a = [soft]\arg\max_{\beta}(X^{T}x)\in\mathbb{R}^{t}$

$X^{T}x$ holds the scalar products of a given input vector $x$ with every vector in the set (the columns of $X$): its $i$-th element is $x_{i}^{T}x$. Applying the [soft] argmax to these scores gives $a$.

n.b.: $\beta$ is the parameter of the soft argmax (usually referred to as "softmax"); in energy terms it is the inverse of the temperature. The soft argmax is the $\exp$ of each argument divided by the sum of the $\exp$s. $\beta$ is there whenever you have a soft argmax; it is usually set to one, so you don't see it, but it's inside the $\exp$

argmax -> one-hot vector

soft argmax -> exponential of each score divided by the sum of all exponentials
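
A minimal sketch of the role of $\beta$, assuming a hypothetical helper `soft_argmax` (it is not part of the notebook's code): with $\beta = 1$ it matches the usual softmax, and with a large $\beta$ it approaches the one-hot result of argmax.

```python
import torch

def soft_argmax(scores, beta=1.0):
    # Hypothetical helper: exp of the beta-scaled scores divided by the sum of exps.
    # Subtracting the max first leaves the result unchanged but avoids overflow.
    e = torch.exp(beta * (scores - scores.max()))
    return e / e.sum()

scores = torch.tensor([1.0, 2.0, 3.0])

soft = soft_argmax(scores)               # beta = 1: the usual "softmax"
sharp = soft_argmax(scores, beta=50.0)   # large beta: nearly one-hot
hard = torch.nn.functional.one_hot(scores.argmax(), num_classes=3).float()

print(soft)
print(sharp)
print(hard)
```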

A set of $x$'s implies a set of $a$'s; stack the vectors $a$ as the columns of a matrix $A \in \mathbb{R}^{t\times t}$

Each $a$ has size $t$, one entry per row of $X^{T}$

A set of $a$'s implies a set of $h$'s: $H \in \mathbb{R}^{n\times t}$

Finally: $H = XA \in\mathbb{R}^{n\times t}$
Each column of $H$ is a linear combination of the columns of $X$, using the coefficients in the corresponding column of $A$

Overall: mix the elements of the set of $x$'s using coefficients computed with the soft argmax, where each element is scored by its dot product (an unnormalized cosine similarity) with a given $x$
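
The whole computation fits in a few lines. This is only a sketch under the column convention used above ($X \in \mathbb{R}^{n\times t}$, one column per element); the sizes and the random input are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

n, t = 4, 3                  # arbitrary sizes: t vectors of dimension n
X = torch.randn(n, t)        # one x per column
beta = 1.0

scores = X.T @ X             # (t, t): column j holds X^T x_j
A = torch.softmax(beta * scores, dim=0)   # soft argmax down each column
H = X @ A                    # (n, t): column j is the mix for x_j

print(A.sum(dim=0))          # every column of A sums to 1
print(H.shape)
```

The soft argmax is taken over `dim=0` because each column of `A` is one vector $a$; the code cells below store sequences row-wise instead and therefore normalise over the last dimension.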

# Key-value store

Conceptually, we are checking how well the query aligns with every key in the dataset (i.e., computing how well each stored entry matches the query)

We can retrieve the single best-matching element with argmax, or use soft argmax to get a probability distribution over the elements, which lets us retrieve items in order of decreasing similarity

Queries, keys, and values are rotations (learned linear transformations) of the input $x$: $q = W_q x$, $k = W_k x$, $v = W_v x$; the matrices $W_q, W_k, W_v$ are trainable parameters

Attention is completely based on orientation; the only nonlinearity is the soft argmax that produces the probability distribution

$q, k$ must have the same dimension; $v$ is the returned value/content associated with a given key

Given a set of $x$'s we get a set of queries, keys, and values, which we can stack into matrices $Q$, $K$, $V$; each matrix has $t$ columns, one vector of size $d$ per input

Next: check each query against all keys, i.e., compute $K^{T}q$ for every query $q$

This returns $t$ scores, which the soft argmax turns into a probability distribution over the $t$ stored elements
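
A sketch of one query going through the key-value store, again with one vector per column. The matrices `W_q`, `W_k`, `W_v` are random stand-ins for the trained parameters, and all sizes are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)

n, d, t = 6, 4, 5                 # input dim, projection dim, set size (arbitrary)
X = torch.randn(n, t)             # one x per column

# Random stand-ins for the learned rotations
W_q, W_k, W_v = (torch.randn(d, n) for _ in range(3))
Q, K, V = W_q @ X, W_k @ X, W_v @ X     # each (d, t)

q = Q[:, 0]                       # query derived from the first element
scores = K.T @ q                  # (t,): one score per key
a = torch.softmax(scores, dim=0)  # distribution over the t stored elements
h = V @ a                         # (d,): weighted mix of the values

print(a)
print(scores.argmax())            # index a hard (argmax) lookup would return
```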

## Transformer Model

### Multi-head attention module

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nn_Softargmax = nn.Softmax

In [None]:
# Multiple heads : allows for multiple properties per query

class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, num_heads, p, d_input = None):

        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        if d_input is None:
            d_xq = d_xk = d_xv = d_model
        else:
            d_xq, d_xk, d_xv = d_input

        # Make sure that the embedding dimension of model is a multiple of number of heads
        assert d_model % self.num_heads == 0
        
        self.d_k = d_model // self.num_heads

        # Matrices allowing to rotate current input
        # (These are still of dimension d_model. They will be split into number of heads)
        self.W_q = nn.Linear(d_xq, d_model, bias = False)
        self.W_k = nn.Linear(d_xk, d_model, bias = False)
        self.W_v = nn.Linear(d_xv, d_model, bias = False)

        # Outputs of all sub-layers need to be of dimension d_model
        self.W_h = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V):

        batch_size = Q.size(0)
        k_length = K.size(-2)

        # Scaling by sqrt(d_k) so that the soft(arg)max doesn't saturate
        Q = Q / np.sqrt(self.d_k)

        # Multiplication between one query and all keys
        scores = torch.matmul(Q, K.transpose(2, 3)) # (bs, n_heads, q_length, k_length)

        # Compute the mixing coefficients
        A = nn_Softargmax(dim = -1)(scores) # (bs, n_heads, q_length, k_length)

        # Get the weighted average of the values - multiply mixing coeff with V matrix
        H = torch.matmul(A, V)  # (bs, n_heads, q_length, dim_per_head)

        return H, A

    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (heads X depth)
        Return after transpose to put in shape (batch_size X num_heads X seq_length X d_k)
        """
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def group_heads(self, x, batch_size):
        """
        Combine the heads again to get (batch_size X seq_length X (num_heads times d_k))
        """
        return x.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)

    def forward(self, X_q, X_k, X_v):

        batch_size, seq_length, dim = X_q.size()

        # Apply W transformation (learned rotation of x input), then split into num_heads
        Q = self.split_heads(self.W_q(X_q), batch_size) # (bs, n_heads, q_length, dim_per_head)
        K = self.split_heads(self.W_k(X_k), batch_size) # (bs, n_heads, k_length, dim_per_head)
        V = self.split_heads(self.W_v(X_v), batch_size) # (bs, n_heads, v_length, dim_per_head)

        # Compute the scaled dot product between one query against all keys
        # i.e. calculate the attention weights for each of the heads
        H_cat, A = self.scaled_dot_product_attention(Q, K, V)

        # Put all the heads back together by concat
        H_cat = self.group_heads(H_cat, batch_size) # (bs, q_length, dim)

        # Final linear layer
        H = self.W_h(H_cat) # (bs, q_length, dim)

        return H, A
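
To see what `split_heads` and `group_heads` do to the tensor layout, here is a standalone sketch that repeats the same `view`/`transpose` calls on a plain tensor. The sizes are arbitrary, and the snippet does not use the class itself.

```python
import torch

batch_size, seq_length, num_heads, d_k = 2, 5, 4, 8
x = torch.randn(batch_size, seq_length, num_heads * d_k)

# Same operation as split_heads: (bs, seq, d_model) -> (bs, heads, seq, d_k)
heads = x.view(batch_size, -1, num_heads, d_k).transpose(1, 2)

# Same operation as group_heads: back to (bs, seq, heads * d_k)
merged = heads.transpose(1, 2).contiguous().view(batch_size, -1, num_heads * d_k)

print(heads.shape)
print(torch.equal(merged, x))                      # round trip is lossless
print(torch.equal(heads[:, 1], x[:, :, 8:16]))     # head 1 = features 8..15
```

Each head thus works on a contiguous slice of `d_k` features of the projected input, and grouping is the exact inverse of splitting.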

### Check how the self-attention mechanism works

In [None]:
temp_mha = MultiHeadAttention(d_model = 512, num_heads = 8, p = 0)

def print_out(Q, K, V):

    temp_out, temp_attn = temp_mha.scaled_dot_product_attention(Q, K, V)

    print('Attention weights are : ', temp_attn.squeeze())
    print('Output is : ', temp_out.squeeze())

In [None]:
test_K = torch.tensor(
    [[10, 0, 0],
     [ 0,10, 0],
     [ 0, 0,10],
     [ 0, 0,10]]
).float()[None, None]

test_V = torch.tensor(
    [[   1,0,0],
     [  10,0,0],
     [ 100,5,0],
     [1000,6,0]]
).float()[None, None]

test_Q = torch.tensor(
    [[0, 10, 0]]
).float()[None, None]

print_out(test_Q, test_K, test_V)

Attention weights are :  tensor([3.7266e-06, 9.9999e-01, 3.7266e-06, 3.7266e-06])
Output is :  tensor([1.0004e+01, 4.0993e-05, 0.0000e+00])
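
These numbers can be reproduced by hand. With `d_model = 512` and `num_heads = 8`, `d_k = 64`, so the query is divided by $8$ and the scores are $[0, 12.5, 0, 0]$. A sketch that recomputes the weights and the output without the class:

```python
import torch

d_k = 512 // 8                   # d_model // num_heads, as in temp_mha
K = torch.tensor([[10., 0., 0.], [0., 10., 0.], [0., 0., 10.], [0., 0., 10.]])
V = torch.tensor([[1., 0., 0.], [10., 0., 0.], [100., 5., 0.], [1000., 6., 0.]])
q = torch.tensor([0., 10., 0.])

scores = K @ (q / d_k ** 0.5)    # query scaled by sqrt(d_k), then dotted with every key
a = torch.softmax(scores, dim=0) # mixing coefficients
h = a @ V                        # weighted average of the rows of V

print(scores)
print(a)
print(h)
```

The query matches the second key, so almost all of the weight goes to the second value `[10, 0, 0]`; the small remainder spread over the other three values is what pushes the first output component slightly above $10$.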


## 1D convolution with `kernel_size = 1`

This is equivalent to an MLP with one hidden layer and ReLU activation applied to each and every element in the set.

In [None]:
# Element-wise feed-forward = 1D convolution with kernel size 1
# a linear layer maps a representation to some other representation (it is a transformation)
# a convolution maps one set to another set - which is what we are actually doing here
# apply the same linear transform to every element in a sequence

# conv hidden layer is applied to every component in the set - every element treated separately
# if you apply the same linear layer to every element in a sequence -> that's a convolution
# in practice, implementations generally use a linear layer

class CNN(nn.Module):

    def __init__(self, d_model, hidden_dim, p):

        super().__init__()
        self.k1convL1 = nn.Linear(d_model, hidden_dim)
        self.k1convL2 = nn.Linear(hidden_dim, d_model)
        self.activation = nn.ReLU()

    def forward(self, x):

        x = self.k1convL1(x)
        x = self.activation(x)
        x = self.k1convL2(x)

        return x
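
The claimed equivalence can be checked numerically. The sketch below copies the weights of an `nn.Linear` into an `nn.Conv1d` with `kernel_size=1` and compares the outputs; the sizes are arbitrary, and the transposes are needed only because `nn.Conv1d` expects the channel dimension before the sequence dimension.

```python
import torch
from torch import nn

torch.manual_seed(0)

d_model, hidden_dim = 8, 16
linear = nn.Linear(d_model, hidden_dim)
conv = nn.Conv1d(d_model, hidden_dim, kernel_size=1)

# Give both layers the same parameters: conv weight is (out, in, 1)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

x = torch.randn(2, 5, d_model)                        # (bs, seq_len, d_model)
out_linear = linear(x)                                # applied to every element
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)    # Conv1d wants (bs, channels, seq_len)

print(torch.allclose(out_linear, out_conv, atol=1e-6))
```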
    

## Transformer Encoder

In [None]:
# Components of encoder block : 
# 1 : self attention
# 2 : convolution - MLP applied to every element in the set

class EncodeLayer(nn.Module):

    def __init__(self, d_model, num_heads, conv_hidden_dim, p = 0.1):

        super().__init__()

        self.mha = MultiHeadAttention(d_model, num_heads, p)
        self.cnn = CNN(d_model, conv_hidden_dim, p)

        self.layernorm1 = nn.LayerNorm(normalized_shape = d_model, eps = 1e-6)
        self.layernorm2 = nn.LayerNorm(normalized_shape = d_model, eps = 1e-6)

    def forward(self, x):

        # Multi-head attention
        attention_output, _ = self.mha(x, x, x) # (batch_size, input_seq_len, d_model)

        # Layer norm after adding the residual connection
        out1 = self.layernorm1(x + attention_output)    # (batch_size, input_seq_len, d_model)

        # Feed forward
        cnn_output = self.cnn(out1) # (batch_size, input_seq_len, d_model)

        # Second layer norm after adding residual connection
        out2 = self.layernorm2(out1 + cnn_output)   # (batch_size, input_seq_len, d_model)

        return out2

## Positional Embeddings

see https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

Up to this point the model treats its input basically like a bag of words, but we also want to convey which position each item occupies. So far, the encoder, transformer, and attention are all permutation equivariant. For sentences, we need to account for the order of the words. We can add information about the position to the initial encoding of the words. In this way, the positional information is not built into the model itself; rather, it enriches the input to the model with information about each word's position. Since the Transformer architecture is equipped with residual connections, the information from the input of the model (containing positional embeddings) can propagate to further layers, which therefore remain aware of position.
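
Permutation equivariance can be checked directly: shuffling the input rows shuffles the output rows in the same way and changes nothing else. The sketch uses a stripped-down single-head `self_attention` without learned projections, which is an assumption made to keep the example short; it is not the `MultiHeadAttention` module above.

```python
import torch

torch.manual_seed(0)

def self_attention(x):
    # x: (seq_len, dim); no learned projections, just scaled dot-product mixing
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(6, 4)
perm = torch.randperm(6)

out = self_attention(x)
out_perm = self_attention(x[perm])

# Permuting the input permutes the output identically
print(torch.allclose(out[perm], out_perm, atol=1e-5))
```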

Some criteria for position-sensitive encoding:
- Should output a unique encoding for each time-step/word position in a sentence
- Distance between any two time-steps should be consistent across sentences with different lengths
- Model should generalize to longer sentences without any extra effort; values should be bounded
- Must be deterministic

Let $t$ be the desired position in an input sentence, $p_{t} \in \mathbb{R}^{d}$ its corresponding encoding, and $d$ the encoding dimension, which is the same as the word embedding dimension. The positional encoding is added to the word embedding $\psi(w_t)$:

$$\psi^{\prime}(w_t) = \psi(w_t)+p_t$$

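Sinusoidal positional embeddings (here $p$ is the position, $i$ indexes the sine/cosine pairs, and $E(p, \cdot)$ is the encoding written $p_t$ above):

$$\begin{aligned}
E(p, 2i)    &= \sin(p / 10000^{2i / d}) \\
E(p, 2i+1) &= \cos(p / 10000^{2i / d})
\end{aligned}$$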

- the positional embedding $p_t$ is a vector of sine/cosine pairs, one pair per frequency

- $p_{t+\phi}$ can be written as a linear function of $p_t$ for any fixed offset $\phi$: the sines and cosines implement a rotation

- analogy with binary counting: when incrementing a number, each bit flips at a rate that depends on its bit position -> the sinusoidal functions are the continuous version of these alternating bits
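
The linearity property follows from the angle-addition formulas: each sine/cosine pair at position $t+\phi$ is a 2x2 rotation of the pair at position $t$, and the rotation depends on $\phi$ only. A NumPy sketch, where `positional_encoding` and `offset_matrix` are hypothetical helper names introduced for this check:

```python
import numpy as np

def positional_encoding(pos, d):
    # E(pos, 2i) = sin(pos * w_i), E(pos, 2i+1) = cos(pos * w_i), w_i = 1 / 10000^(2i/d)
    w = 1.0 / np.power(10000, 2 * np.arange(d // 2) / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(pos * w)
    pe[1::2] = np.cos(pos * w)
    return pe

def offset_matrix(phi, d):
    # Block-diagonal matrix of 2x2 rotations; it depends on the offset phi only
    w = 1.0 / np.power(10000, 2 * np.arange(d // 2) / d)
    M = np.zeros((d, d))
    for i, w_i in enumerate(w):
        c, s = np.cos(phi * w_i), np.sin(phi * w_i)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

d, phi = 8, 3
M = offset_matrix(phi, d)

# The same matrix shifts every position by phi
for t in (0, 5, 42):
    print(np.allclose(M @ positional_encoding(t, d), positional_encoding(t + phi, d)))
```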


In [None]:
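def create_sinusoidal_embeddings(nb_p, dim, E):

    # theta[p, j] = p / 10000^(2 * (j // 2) / dim): each sin/cos pair shares one frequency
    theta = np.array([
            [p / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)]
            for p in range(nb_p)
    ])
    # Freeze the table before writing into it: recent PyTorch versions reject
    # in-place writes to a leaf tensor that still requires grad
    E.requires_grad = False
    E[:, 0 :: 2] = torch.FloatTensor(np.sin(theta[:, 0 :: 2]))
    E[:, 1 :: 2] = torch.FloatTensor(np.cos(theta[:, 1 :: 2]))
    E.detach_()
    # No explicit .to(device): E is filled in place and moves with the module on model.to(device)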

class Embeddings(nn.Module):

    def __init__(self, d_model, vocab_size, max_position_embeddings, p):

        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, d_model, padding_idx = 1) # a simple lookup table that stores embeddings of a fixed dictionary and size
        self.position_embeddings = nn.Embedding(max_position_embeddings, d_model)
        create_sinusoidal_embeddings(
            nb_p = max_position_embeddings,
            dim = d_model,
            E = self.position_embeddings.weight
        )

        self.LayerNorm = nn.LayerNorm(d_model, eps = 1e-12)

    def forward(self, input_ids):

        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype = torch.long, device = input_ids.device)  # (max_seq_length)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)

        # Get word embeddings for each input
        word_embeddings = self.word_embeddings(input_ids)   # (bs, max_seq_length, dim)

        # Get position embeddings for each position id
        position_embeddings = self.position_embeddings(position_ids)    # (bs, max_seq_length, dim)

        # Add them both
        embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)

        # Layer norm
        embeddings = self.LayerNorm(embeddings) # (bs, max_seq_length, dim)

        return embeddings

## Overall Encoder

### (Blocks of N Encoder Layers + Positional encoding + Input embedding)

In [10]:
class Encoder(nn.Module):

    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim, input_vocab_size, maximum_position_encoding, p = 0.1):

        super().__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        # Apply permutation-sensitive embeddings
        self.embedding = Embeddings(d_model, input_vocab_size, maximum_position_encoding, p)

        self.enc_layers = nn.ModuleList()

        for _ in range(num_layers):

            self.enc_layers.append(EncodeLayer(d_model, num_heads, ff_hidden_dim, p))

    def forward(self, x):

        x = self.embedding(x) # Transform to (batch_size, input_seq_length, d_model)

        # Stack multiple to make network 'more powerful'
        # append several encoders together
        for i in range(self.num_layers):

            x = self.enc_layers[i](x)

        return x    # (batch_size, input_seq_len, d_model)

In [11]:
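# Field, LabelField and BucketIterator belong to the legacy torchtext API;
# in torchtext 0.9 to 0.11 they live under torchtext.legacy, and later releases removed them
import torchtext.data as data
import torchtext.datasets as datasets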

In [12]:
max_len = 200

text = data.Field(sequential = True, fix_length = max_len, batch_first = True, lower = True, dtype = torch.long)
label = data.LabelField(sequential = False, dtype = torch.long)
datasets.IMDB.download('./')

ds_train, ds_test = datasets.IMDB.splits(text, label, path = './imdb/aclImdb')

print('Train : ', len(ds_train))
print('Test : ', len(ds_test))
print('train.fields : ', ds_train.fields)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:09<00:00, 8.98MB/s]


Train :  25000
Test :  25000
train.fields :  {'text': <torchtext.data.field.Field object at 0x7fbc13d3f278>, 'label': <torchtext.data.field.LabelField object at 0x7fbc13d3f2b0>}


In [13]:
ds_train, ds_valid = ds_train.split(0.9)

print('Train : ', len(ds_train))
print('Validation : ', len(ds_valid))
print('Test : ', len(ds_test))

Train :  22500
Validation :  2500
Test :  25000


In [14]:
num_words = 50_000
text.build_vocab(ds_train, max_size = num_words)
label.build_vocab(ds_train)
vocab = text.vocab

In [16]:
batch_size = 164
train_loader, valid_loader, test_loader = data.BucketIterator.splits((ds_train, ds_valid, ds_test), batch_size = batch_size, sort_key = lambda x : len(x.text), repeat = False)

In [17]:
class TransformerClassifier(nn.Module):

    def __init__(self, num_layers, d_model, num_heads, conv_hidden_dim, input_vocab_size, num_answers):

        super().__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, conv_hidden_dim, input_vocab_size, maximum_position_encoding = 10000)
        self.dense = nn.Linear(d_model, num_answers)

    def forward(self, x):

        x = self.encoder(x)

        # Max-pool over the sequence dimension: one d_model vector per review
        x, _ = torch.max(x, dim = 1)
        x = self.dense(x)

        return x
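
The classifier head reduces the encoder output from one vector per token to one vector per sequence by taking the maximum over the sequence dimension. A shape-only sketch with random data standing in for the encoder output (sizes are arbitrary):

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.randn(3, 7, 32)            # stand-in for encoder output (bs, seq_len, d_model)

# torch.max along a dim returns (values, indices); only the values are kept
pooled, _ = torch.max(x, dim=1)      # (bs, d_model)
logits = nn.Linear(32, 2)(pooled)    # (bs, num_answers)

print(pooled.shape, logits.shape)
```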

In [18]:
model = TransformerClassifier(num_layers = 1, d_model = 32, num_heads = 2, conv_hidden_dim = 128, input_vocab_size = 50002, num_answers = 2)
model.to(device)

TransformerClassifier(
  (encoder): Encoder(
    (embedding): Embeddings(
      (word_embeddings): Embedding(50002, 32, padding_idx=1)
      (position_embeddings): Embedding(10000, 32)
      (LayerNorm): LayerNorm((32,), eps=1e-12, elementwise_affine=True)
    )
    (enc_layers): ModuleList(
      (0): EncodeLayer(
        (mha): MultiHeadAttention(
          (W_q): Linear(in_features=32, out_features=32, bias=False)
          (W_k): Linear(in_features=32, out_features=32, bias=False)
          (W_v): Linear(in_features=32, out_features=32, bias=False)
          (W_h): Linear(in_features=32, out_features=32, bias=True)
        )
        (cnn): CNN(
          (k1convL1): Linear(in_features=32, out_features=128, bias=True)
          (k1convL2): Linear(in_features=128, out_features=32, bias=True)
          (activation): ReLU()
        )
        (layernorm1): LayerNorm((32,), eps=1e-06, elementwise_affine=True)
        (layernorm2): LayerNorm((32,), eps=1e-06, elementwise_affine=True)
    

In [19]:
optimizer = torch.optim.AdamW(model.parameters(), lr = 0.001)
epochs = 10
t_total = len(train_loader) * epochs

In [24]:
def train(train_loader, valid_loader):

    for epoch in range(epochs):

        train_iterator, valid_iterator = iter(train_loader), iter(valid_loader)
        nb_batches_train = len(train_loader)
        train_acc = 0
        model.train()
        losses = 0.0

        for batch in train_iterator:

            x = batch.text.to(device)
            y = batch.label.to(device)

            out = model(x)

            loss = f.cross_entropy(out, y)

            model.zero_grad()

            loss.backward()
            losses += loss.item()

            optimizer.step()

            train_acc += (out.argmax(1) == y).cpu().numpy().mean()

        print(f'Training loss at epoch {epoch} is {losses / nb_batches_train}')
        print(f'Training accuracy : {train_acc / nb_batches_train}')
        print('Evaluating on validation')
        evaluate(valid_loader)

In [25]:
def evaluate(data_loader):

    data_iterator = iter(data_loader)
    nb_batches = len(data_loader)
    model.eval()
    acc = 0

    # No gradients are needed for evaluation
    with torch.no_grad():

        for batch in data_iterator:

            x = batch.text.to(device)
            y = batch.label.to(device)

            out = model(x)
            acc += (out.argmax(1) == y).cpu().numpy().mean()

    print(f'Eval accuracy : {acc / nb_batches}')
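
Note that `evaluate` averages per-batch accuracies, which equals the true accuracy only when all batches have the same size. A sketch of an exact count, using a hypothetical `exact_accuracy` helper and hand-made `(logits, labels)` pairs in place of a real data loader:

```python
import torch

def exact_accuracy(batches):
    # batches: iterable of (logits, labels) pairs, a stand-in for a data loader
    correct, total = 0, 0
    for logits, labels in batches:
        correct += (logits.argmax(1) == labels).sum().item()
        total += labels.numel()
    return correct / total

# Two batches of unequal size: 4 examples (all correct) and 1 example (wrong)
batches = [
    (torch.tensor([[2., 0.], [0., 2.], [2., 0.], [0., 2.]]), torch.tensor([0, 1, 0, 1])),
    (torch.tensor([[2., 0.]]), torch.tensor([1])),
]

exact = exact_accuracy(batches)
mean_of_batch_means = sum(
    (logits.argmax(1) == labels).float().mean().item() for logits, labels in batches
) / len(batches)

print(exact)                  # 4 of 5 examples correct
print(mean_of_batch_means)    # (1.0 + 0.0) / 2
```

With many batches and only one short final batch the two numbers stay close, so the reported accuracies are still a reasonable approximation.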

In [26]:
train(train_loader, valid_loader)

Training loss at epoch 0 is 0.6815453564775162
Training accuracy : 0.5639967744786144
Evaluating on validation
Eval accuracy : 0.6300685975609756
Training loss at epoch 1 is 0.6111233126426089
Training accuracy : 0.6740732149169321
Evaluating on validation
Eval accuracy : 0.7048018292682927
Training loss at epoch 2 is 0.5283002438752548
Training accuracy : 0.7448248055850126
Evaluating on validation
Eval accuracy : 0.7245045731707315
Training loss at epoch 3 is 0.451469285764556
Training accuracy : 0.7948756185931424
Evaluating on validation
Eval accuracy : 0.7676448170731707
Training loss at epoch 4 is 0.37960064195204474
Training accuracy : 0.8336205372923293
Evaluating on validation
Eval accuracy : 0.7825457317073169
Training loss at epoch 5 is 0.32337034450492996
Training accuracy : 0.8649920466595967
Evaluating on validation
Eval accuracy : 0.7959222560975611
Training loss at epoch 6 is 0.2702807648026425
Training accuracy : 0.8940604012018385
Evaluating on validation
Eval accurac

In [27]:
evaluate(test_loader)

Eval accuracy : 0.804866978408346
