### topics

* self attention
* transformer
* pooling
* normalization
* residual connections
* positional encoding


## Self Attention 

**Attention** is a mechanism that determines where to look in the context when predicting parts of a sequence (a sequence over time, such as text).


**Self attention** is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence (from Lilian Weng).

* for each element (word), we determine a weighted representation depending on how well the element fits into its context
* we do it via word embeddings and a similarity measure
* dot product is a similarity measure

* example: movie recommendation
    * a 1st vector represents features of the movie
    * a 2nd vector represents interests of a user
* dot product is high if overlap is high

![movie.png](attachment:movie.png)
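The movie example can be sketched in a few lines. The feature order and all numbers below are made up for illustration; they are not taken from the figure.

```python
import torch

# hypothetical feature order: [action, romance, comedy]
movie = torch.tensor([0.9, 0.1, 0.3])         # mostly an action movie
action_fan = torch.tensor([1.0, 0.0, 0.2])    # user who likes action
romance_fan = torch.tensor([0.0, 1.0, 0.1])   # user who likes romance

score_action = torch.dot(movie, action_fan)    # large overlap -> high score
score_romance = torch.dot(movie, romance_fan)  # small overlap -> low score
print(score_action.item(), score_romance.item())
```

The dot product is larger for the user whose interests overlap with the movie's features.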

### Rationale behind Self Attention

* self attention: how well does a word fit into its context (sentence)?
* represent a word by its word embedding
* x is the sequence of word embeddings of a sentence
* take the dot product of x with itself
* result: a matrix that records the mutual similarity of all words

In [2]:
import torch
import torch.nn.functional as F

torch.manual_seed(10)
x=torch.rand(1,2,3)
x                     # 2 words, 3 dimensional word embedding

tensor([[[0.4581, 0.4829, 0.3125],
         [0.6150, 0.2139, 0.4118]]])

In [2]:
x.transpose(1,2)

tensor([[[0.4581, 0.6150],
         [0.4829, 0.2139],
         [0.3125, 0.4118]]])

In [3]:
torch.matmul(x,x.transpose(1,2))   

tensor([[[0.5406, 0.5137],
         [0.5137, 0.5936]]])

In [4]:
# same operation with batch matmul

raw_weights=torch.bmm(x,x.transpose(1,2))
raw_weights

tensor([[[0.5406, 0.5137],
         [0.5137, 0.5936]]])

* cell 0,0: similarity of first vector with itself (0.5406)
* cell 0,1: similarity of first vector with second vector (0.5137)
* cell 1,0: similarity of second vector with first vector   (0,1 and 1,0 are symmetric)
* cell 1,1:  similarity of second vector with itself

In [5]:
weights = F.softmax(raw_weights, dim=2)
weights

tensor([[[0.5067, 0.4933],
         [0.4800, 0.5200]]])

* we get a probability distribution per row
* 0,1 and 1,0 are no longer symmetric (softmax normalizes each row separately)

In [6]:
x

tensor([[[0.4581, 0.4829, 0.3125],
         [0.6150, 0.2139, 0.4118]]])

0.5355 = 0.5067 * 0.4581 + 0.4933 * 0.6150 is the first value of the first output vector

0.4933 weights the importance of all values of the second vector from the perspective of the first vector

as an effect: the more similar a vector is to the current one, the higher its weight, and the more it contributes to the output

In [None]:
y = torch.bmm(weights, x)   # or simply torch.matmul(weights, x)
y

result: contextualized input sequence

**that's the basic idea**
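The steps above can be collected into one small function. This is a minimal sketch; the name `basic_self_attention` is chosen here for illustration, and the input is the pair of word vectors printed earlier, typed in by hand.

```python
import torch
import torch.nn.functional as F

def basic_self_attention(x):
    """x has shape (batch, sequence length, embedding dimension)."""
    raw_weights = torch.bmm(x, x.transpose(1, 2))   # mutual dot products
    weights = F.softmax(raw_weights, dim=2)          # each row sums to 1
    return torch.bmm(weights, x)                     # weighted average of the inputs

# the two word vectors from the cells above
x = torch.tensor([[[0.4581, 0.4829, 0.3125],
                   [0.6150, 0.2139, 0.4118]]])
y = basic_self_attention(x)
print(y)
```

The output has the same shape as the input: every word vector is replaced by a weighted average of all word vectors in the sequence.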


### Learnable Self Attention (query, key, value)

* so far, everything is static - no room for learning
* in order to learn, we need weight matrices

$$\large A(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$


### query, key, value

Every input vector $𝐱_i$ is used in three different ways in the self attention operation:

* It is compared to every other vector to establish the weights for its own output $𝐲_i$
* It is compared to every other vector to establish the weights for the output of the j-th vector $𝐲_j$
* It is used as part of the weighted sum to compute each output vector once the weights have been established

These roles are called the query, the key and the value.

(see: http://peterbloem.nl/blog/transformers)

![selfatt2.png](attachment:selfatt2.png)

![selfatt3.png](attachment:selfatt3.png)

In [3]:
torch.manual_seed(10)

fdim=2     # feature dimension
seqdim=3

x=torch.rand(1,seqdim,fdim)
x

tensor([[[0.4581, 0.4829],
         [0.3125, 0.6150],
         [0.2139, 0.4118]]])

In [4]:
torch.manual_seed(0)

Wq = torch.randn(1,fdim,seqdim)
Wq

tensor([[[ 1.5410, -0.2934, -2.1788],
         [ 0.5684, -1.0845, -1.3986]]])

In [5]:
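# note: Wq has shape (fdim, seqdim), so this product combines the rows of x (the words)
# and Q gets fdim rows; the usual layout is x @ W with W of shape (fdim, d_k), one query per word
Q=torch.matmul(Wq,x)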
Q

tensor([[[ 0.1481, -0.3337],
         [-0.3777, -0.9685]]])

In [6]:
torch.manual_seed(1)

Wk = torch.rand(1,fdim,seqdim)
Wk

tensor([[[0.7576, 0.2793, 0.4031],
         [0.7347, 0.0293, 0.7999]]])

In [7]:
K=torch.matmul(Wk,x)
K

tensor([[[0.5206, 0.7036],
         [0.5168, 0.7022]]])

In [9]:
torch.manual_seed(2)

Wv = torch.rand(1,fdim,seqdim)
Wv

tensor([[[0.6147, 0.3810, 0.6371],
         [0.4745, 0.7136, 0.6190]]])

In [10]:
V = torch.matmul(Wv,x)
V

tensor([[[0.5370, 0.7935],
         [0.5728, 0.9229]]])

In [11]:
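# note: this is the last dimension of Wk (here seqdim = 3);
# d_k in the formula is the dimension of the key vectors, K.size(-1)
dk=len(Wk[0][0])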
dk

3

In [12]:
import numpy as np

QK=F.softmax(Q @ K.transpose(1,2)/np.sqrt(dk),dim=2)
QK

tensor([[[0.5000, 0.5000],
         [0.4996, 0.5004]]])

In [13]:
torch.matmul(QK,V)

tensor([[[0.5549, 0.8582],
         [0.5549, 0.8583]]])
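For comparison, here is a sketch of the same computation in the layout commonly used for the formula $A(Q,K,V)$: the weight matrices act on the feature dimension, so there is one query, key and value per word. The key dimension `dk = 4` and the seed are arbitrary choices for this illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seqdim, fdim, dk = 3, 2, 4          # dk: assumed query/key/value dimension
x = torch.rand(1, seqdim, fdim)     # 3 words, 2 dimensional embeddings

Wq = torch.randn(fdim, dk)          # weight matrices map features -> dk
Wk = torch.randn(fdim, dk)
Wv = torch.randn(fdim, dk)

Q, K, V = x @ Wq, x @ Wk, x @ Wv    # each has shape (1, seqdim, dk)

scores = Q @ K.transpose(1, 2) / dk ** 0.5    # (1, seqdim, seqdim)
weights = F.softmax(scores, dim=2)            # one distribution per word
out = weights @ V                             # (1, seqdim, dk)
print(out.shape)
```

With this layout the attention matrix is `seqdim x seqdim`, i.e. every word attends to every word.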

### Multi-head (self) Attention

* we split each embedding vector (and thus the whole input matrix) into n parts
* we get n new input matrices with reduced dimension
    * d = embedding dimension
    * dimension of heads thus d/n 
* each head gets Q,K,V weight matrices
* after applying attention in each head, we concatenate the outputs of all heads
* finally we apply another matrix multiplication to produce the right output dimension

for a nice visualization see: http://jalammar.github.io/illustrated-transformer/

![multi.png](attachment:multi.png)
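The steps listed above could look as follows in PyTorch. This is a sketch of the split-then-attend variant described in the list, not a reference implementation; the class name `MultiHeadSelfAttention` and the layer names are chosen here for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d, n_heads):
        super().__init__()
        assert d % n_heads == 0, "embedding dimension must be divisible by n_heads"
        self.d_head = d // n_heads
        # each head gets its own Q, K, V weight matrices of size d/n x d/n
        def make():
            return nn.ModuleList(
                [nn.Linear(self.d_head, self.d_head, bias=False) for _ in range(n_heads)])
        self.toqueries, self.tokeys, self.tovalues = make(), make(), make()
        self.unify = nn.Linear(d, d)   # final matrix multiplication

    def forward(self, x):
        # split every embedding vector into n parts: n tensors (batch, seq, d/n)
        chunks = x.split(self.d_head, dim=2)
        heads = []
        for chunk, q, k, v in zip(chunks, self.toqueries, self.tokeys, self.tovalues):
            Q, K, V = q(chunk), k(chunk), v(chunk)
            weights = F.softmax(Q @ K.transpose(1, 2) / self.d_head ** 0.5, dim=2)
            heads.append(weights @ V)
        # concatenate the heads and project back to dimension d
        return self.unify(torch.cat(heads, dim=2))

mha = MultiHeadSelfAttention(d=4, n_heads=2)
out = mha(torch.rand(1, 3, 4))
print(out.shape)
```

Many implementations instead project the full embedding once and reshape the result into heads; the loop here only keeps the correspondence to the list explicit.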

* with positional encoding added (see the Positional encoding section below), the first and the third vector are no longer identical
* the positional encoding can be learned by a linear function

### PyTorch attention

In [35]:
import torch
from torch import nn
import torch.nn.functional as F

x=torch.rand(1,3,3)
print("input x:\n",x)

class attention(nn.Module):
    def __init__(self,  fdim):
        super().__init__()

        self.tokeys = nn.Linear(fdim,fdim,bias=False)
        self.toqueries = nn.Linear(fdim,fdim,bias=False)
        self.tovalues = nn.Linear(fdim,fdim, bias=False)
                
       # print(list(self.tokeys.parameters()))
    
    def forward(self,x):
        
        Q=self.toqueries(x)
        K=self.tokeys(x)
        V=self.tovalues(x)
        
      #  print(Q.size(),K.size())
        
        QK = Q @ K.transpose(1,2)   # raw scores; not scaled by sqrt(d_k) as in the formula above
        
        QK_softmax = F.softmax(QK, dim=2)
        
        weighted_values = QK_softmax @ V
        return weighted_values
    
selfatt=attention(3)   

print("\nweighted x:\n",selfatt(x))
#print(list(selfatt.parameters()))


input x:
 tensor([[[0.7125, 0.0592, 0.6270],
         [0.5905, 0.6248, 0.4776],
         [0.0624, 0.2718, 0.4086]]])

weighted x:
 tensor([[[-0.1944,  0.0189, -0.0947],
         [-0.1949,  0.0231, -0.0969],
         [-0.1924,  0.0232, -0.0985]]], grad_fn=<UnsafeViewBackward>)


# Transformer

* powerful deep learning (DL) architecture based on (self) attention, ...
* very popular example (popularity is not a scientific category, though): BERT


### new properties

* multi head attention
* layer normalization
* residual connections (add & norm)
* positional encoding

![transformer.png](attachment:transformer.png)

see: Attention is all you need (https://arxiv.org/pdf/1706.03762.pdf)

In [44]:
# very basic transformer with pooling, layer normalization and a residual connection

class transformer(nn.Module):
    
    def __init__(self, k):
        super().__init__()

        self.attention = attention(k)

        self.norm = nn.LayerNorm(k)

        self.ff = nn.Linear(1, 1)
          
    def forward(self, x):
        
        attended = self.attention(x)
        
        x = self.norm(attended + x)           # residual connection from x

        x= F.avg_pool2d(x,kernel_size=(3,3))  # pooling
        
        return self.ff(x) 

In [45]:
trans=transformer(3)

trans(x)            

tensor([[[-0.4486]]], grad_fn=<AddBackward0>)
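The cell above condenses its input to a single value. As a contrast, here is a sketch of a block that keeps the sequence shape, with two "add & norm" steps and a position-wise feed-forward part, as in the figure from the paper. The class names and the hidden size factor `hidden_mult=4` are assumptions made for this illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self attention."""
    def __init__(self, k):
        super().__init__()
        self.toqueries = nn.Linear(k, k, bias=False)
        self.tokeys = nn.Linear(k, k, bias=False)
        self.tovalues = nn.Linear(k, k, bias=False)

    def forward(self, x):
        Q, K, V = self.toqueries(x), self.tokeys(x), self.tovalues(x)
        weights = F.softmax(Q @ K.transpose(1, 2) / x.size(-1) ** 0.5, dim=2)
        return weights @ V

class TransformerBlock(nn.Module):
    def __init__(self, k, hidden_mult=4):
        super().__init__()
        self.attention = SelfAttention(k)
        self.norm1 = nn.LayerNorm(k)
        self.norm2 = nn.LayerNorm(k)
        # feed-forward part, applied to every position separately
        self.ff = nn.Sequential(
            nn.Linear(k, hidden_mult * k), nn.ReLU(), nn.Linear(hidden_mult * k, k))

    def forward(self, x):
        x = self.norm1(self.attention(x) + x)   # residual connection + layer norm
        return self.norm2(self.ff(x) + x)       # residual connection + layer norm

block = TransformerBlock(3)
out = block(torch.rand(1, 3, 3))
print(out.shape)
```

Because input and output shapes agree, such blocks can be stacked; pooling would then be applied once, after the last block.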

### Positional encoding

* transformers do not directly have/use sequence information as RNNs do
* but in NLP word order might be crucial
* idea: add a positional encoding/embedding to the input


* one could use e.g. embeddings of integers (1,2, ...): concatenate it to the word embeddings
* one could use e.g. binary encoding of integers (e.g. 0,1,1 for 3)
* one could use some function:  (2 * p) + (3 * i) where p=position, i=embedding index
* but a **sinusoid** is used: it is added directly to the word embedding (see below)
* the same word within a sentence now gets different representations, since the position encoding alters the embedding

see: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/#the-intuition

In [47]:
import numpy as np

Input= np.array([[1,2,3,3],[3,2,1,5],[1,2,3,3]])

d_model = 4                     # Embedding dimension
max_sentence_length = 3 

positional_embeddings = np.zeros((max_sentence_length, d_model))


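# note: i already steps over the even indices 0, 2, ..., so the exponents below
# (2*i/d_model and 2*(i+1)/d_model) differ from the paper, where a sin/cos pair
# shares the exponent i/d_model
for position in range(max_sentence_length):
    for i in range(0, d_model, 2):
        positional_embeddings[position, i] = np.sin(
            position / (10000 ** ((2 * i) / d_model)))
        positional_embeddings[position, i + 1] = np.cos(
            position / (10000 ** ((2 * (i + 1)) / d_model)))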
        

print(Input,"\n\n",positional_embeddings)

print("\nCombined:\n",Input+positional_embeddings)

[[1 2 3 3]
 [3 2 1 5]
 [1 2 3 3]] 

 [[0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00]
 [8.41470985e-01 9.99950000e-01 9.99999998e-05 1.00000000e+00]
 [9.09297427e-01 9.99800007e-01 1.99999999e-04 1.00000000e+00]]

Combined:
 [[1.         3.         3.         4.        ]
 [3.84147098 2.99995    1.0001     6.        ]
 [1.90929743 2.99980001 3.0002     4.        ]]
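For reference, "Attention is all you need" defines the encoding as $PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$, so each sine/cosine pair shares one frequency. A vectorized sketch of that formula follows; the function name `sinusoidal_encoding` is chosen here, and an even `d_model` is assumed.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    pe = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, None]     # column vector of positions
    even_dims = np.arange(0, d_model, 2)[None, :]   # the indices 2i: 0, 2, 4, ...
    angles = positions / (10000 ** (even_dims / d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions, same frequency
    return pe

pe = sinusoidal_encoding(max_len=3, d_model=4)
print(pe)
```

Position 0 is always encoded as alternating 0 and 1, and the later dimensions change more slowly with the position than the first ones.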


### Pooling

* pooling: we reduce each n x m region of a matrix to a single value (e.g. its average or maximum) and, as an effect, get a smaller matrix
* used in CNNs
* used also to condense a matrix (even down to a single value)


in pytorch
* average: F.avg_pool2d(tensor, kernel_size=(n,m))
* maximum: F.max_pool2d(tensor, kernel_size=(n,m))

where n=rows, m=columns

In [48]:
x=torch.rand(1,4,4)
x

tensor([[[0.6260, 0.8184, 0.5762, 0.5772],
         [0.1086, 0.0214, 0.4732, 0.1426],
         [0.0451, 0.4872, 0.2618, 0.7914],
         [0.0903, 0.2658, 0.1414, 0.5247]]])

In [51]:
F.avg_pool2d(x,kernel_size=(2,2))

tensor([[[0.3936, 0.4423],
         [0.2221, 0.4298]]])

In [52]:
F.avg_pool2d(x,kernel_size=(2,4))

tensor([[[0.4180],
         [0.3260]]])

In [53]:
F.avg_pool2d(x,kernel_size=(4,4))

tensor([[[0.3720]]])

In [54]:
F.max_pool2d(x,kernel_size=(2,2))

tensor([[[0.8184, 0.5772],
         [0.4872, 0.7914]]])
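The pooling results can be checked by hand. The sketch below types in the matrix printed above and compares the pooled values with a manually computed block mean and with the mean of the whole matrix.

```python
import torch
import torch.nn.functional as F

# the 4 x 4 matrix from the cells above
x = torch.tensor([[[0.6260, 0.8184, 0.5762, 0.5772],
                   [0.1086, 0.0214, 0.4732, 0.1426],
                   [0.0451, 0.4872, 0.2618, 0.7914],
                   [0.0903, 0.2658, 0.1414, 0.5247]]])

avg = F.avg_pool2d(x, kernel_size=(2, 2))     # mean of each 2 x 2 block
mx = F.max_pool2d(x, kernel_size=(2, 2))      # maximum of each 2 x 2 block
total = F.avg_pool2d(x, kernel_size=(4, 4))   # a single value for the whole matrix

top_left_mean = x[0, :2, :2].mean()           # manual mean of the top-left block
print(avg[0, 0, 0].item(), top_left_mean.item())
print(total.item(), x.mean().item())
```

A kernel that covers the whole matrix therefore gives the same value as `x.mean()`.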

###  Normalization

(we might have a more detailed look as soon as we discuss CNNs)

* "normalization" means: produces some standard format, makes values comparable, share the same scale
* e.g. all normalized vectors have length 1

* Batch normalization is a method that normalizes activations in a network across the mini-batch

* Layer normalization normalizes input across the features instead of normalizing input features across the batch dimension as in batch normalization

* Instance normalization: similar to layer normalization, but each channel of each sample is normalized separately

see: https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8


* in PyTorch: nn.LayerNorm(k) normalizes over the last dimension, which must have size k
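What layer normalization computes can be reproduced by hand: every feature vector is shifted to mean 0 and scaled to variance 1. A minimal sketch with an arbitrary random input follows; a freshly created `nn.LayerNorm` has scale 1 and shift 0, so it should agree with the manual version.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(2, 3, 4)                 # (batch, sequence, features)

layer_norm = nn.LayerNorm(4)            # normalizes over the last dimension

# manual version: statistics are taken per feature vector, not per batch
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(layer_norm(x), manual, atol=1e-5))
```

Batch normalization would instead take the mean and variance of each feature across the batch dimension.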