# MAIC Week 5 - Papers With Code
## Attention is All You Need
Link to paper: https://arxiv.org/pdf/1706.03762.pdf <br/><br/><br/>

Start by importing the needed libraries.

In [None]:
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import math

If you don't have all the libraries installed, run the cell below.

In [None]:
!pip install tensorflow
!pip install matplotlib
!pip install numpy

Set up some parameters for our 'model' and generate a random sequence to play the part of embeddings for us.

In [None]:
embedding_dim = 4
seq_len = 8
d_model = 8
# Pretend these are embeddings of something like words
# Each column is a word embedding
sequence = np.random.rand(embedding_dim, seq_len)
for row in sequence:
    for element in row:
        print(round(element, 3), end=', ')
    print('\n')

# Sinusoidal Positional Embeddings
The positional embedding should have the same shape as the regular embedding.<br/>
For each index in our embeddings, the positional embedding is calculated as a sine or cosine (alternating, starting with sine), with the frequency set by the equation given in the paper.<br/>

The cell below shows a visualization of this idea, although the formula it uses is not the one from the paper.

In [None]:
# Pretend seq_len is 8 for visualization purposes
x = np.linspace(0, 8, 100)
# Use one-based indexing so the first curve isn't just a constant one (cos(0) = 1)
for i in range(1, 8+1):
    if i % 2 == 0:
        y = np.cos(i*x*math.pi/(8*2))
        plt.plot(x, y)
    else:
        y = np.sin(i*x*math.pi/(8*2))
        plt.plot(x, y)
plt.show()

We can generate the positional embeddings for a given sequence length as seen below (implemented as a TensorFlow layer).<br/>
In Attention Is All You Need, they scale their waves based on the following formulas:
$$
PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})
$$
$$
PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})
$$
Where $pos$ is the position in the sequence, $i$ indexes the sine/cosine pairs (so $2i$ and $2i+1$ are the even and odd dimensions), and $d_{model}$ is the size of the input and output token vectors, and thus of the positional encodings as well.<br/><br/>
To put it another way, $pos$ determines where along the sinusoid we read the value, and $i$ determines the frequency of that sinusoid. The same relationships can be seen in the graph above and in the implementation below.<br/><br/>
The positional embeddings are added to the original token embeddings. Because sines and cosines stay within $[-1, 1]$, this helps keep a balance in which neither the positions nor the learned token embeddings tower over the other. The paper's own stated motivation for sinusoids is the hypothesis that they make it easy to attend by relative position, since for any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$.<br/>
Another proposed solution is to concatenate the positional embeddings to the learned ones; however, this takes more memory and computational power.<br/><br/>
The idea behind the changes in frequency is that the low frequencies are good for capturing long-term dependencies and the high frequencies are good for pinpointing exact locations (i.e., moving a certain distance along a wave with a higher frequency produces a bigger change in value). A larger embedding dimension gives more frequencies and therefore more granularity.
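
To make the frequency claim concrete, here is a small sketch (not part of the original notebook; `wavelengths` is a hypothetical helper name) that uses the formula above to compute the wavelength of each sine/cosine pair, along with how much the sine value changes when the position moves from 0 to 1. It assumes the paper's base of 10000 and reuses the notebook's `d_model = 8`.

```python
import math

def wavelengths(d_model, base=10000.0):
    """Wavelength, in positions, of each sin/cos pair i = 0 .. d_model/2 - 1."""
    # sin(pos / base**(2i/d_model)) repeats every 2*pi*base**(2i/d_model) positions
    return [2 * math.pi * base ** (2 * i / d_model) for i in range(d_model // 2)]

d_model = 8  # same value as the notebook's d_model
for i, wl in enumerate(wavelengths(d_model)):
    # sin at pos=1 minus sin at pos=0 (which is 0) for this pair
    step_change = math.sin(2 * math.pi / wl)
    print(f"pair {i}: wavelength {wl:10.1f} positions, one-step change {step_change:.4f}")
```

The first pair has a wavelength of $2\pi$ positions, so a single step changes its value a lot; each later pair has a longer wavelength and moves less per step, which is why those dimensions are better suited to coarse, long-range position information.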

In [None]:
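class SinusoidalEmbedding(tf.keras.layers.Layer):
    def __init__(self, d_model):
        super(SinusoidalEmbedding, self).__init__()
        self.d_model = d_model

    def call(self, inputs):
        # inputs has shape (embedding_dim, seq_len): rows are dimensions, columns are positions
        n_dims, n_pos = inputs.shape[0], inputs.shape[1]
        pos = np.arange(n_pos)[np.newaxis, :]        # 0-indexed positions, as in the paper
        i = (np.arange(n_dims) // 2)[:, np.newaxis]  # pair index: dims 2i and 2i+1 share a frequency
        angles = pos / (10000 ** (2 * i / self.d_model))
        pos_embeddings = np.zeros(angles.shape)
        pos_embeddings[::2] = np.sin(angles[::2])    # even dimensions use sine
        pos_embeddings[1::2] = np.cos(angles[1::2])  # odd dimensions use cosine
        # Match the input dtype so the addition does not mix float32 and float64
        return inputs + tf.cast(pos_embeddings, inputs.dtype)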

# Attention
Now that the input has its embeddings, we can put everything through an attention mechanism.<br/>
The first step is to map the input embeddings to queries, keys, and values. Each of these is produced by a learned linear (dense / fully connected) layer applied to the input embeddings.<br/><br/>
The q, k, and v layers have no bias term.<br/>

The queries, keys, and values are matrices of the same shape as the input. We represent them as $Q$, $K$, and $V$.<br/>
They call their attention mechanism Scaled Dot-Product Attention and define it as shown below, where $d_{k}$ is the dimension of the queries and keys. (They also define the dimension of the values as $d_{v}$; however, it is usually the same as $d_{k}$.)
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$
They argue that the scaling is needed because, for large values of $d_{k}$, the dot products grow so large in magnitude that the softmax is pushed into regions with very small gradients.
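
The effect of the scaling can be checked numerically. The sketch below is an illustration under assumptions, not the paper's experiment: it draws random unit-variance queries and keys with NumPy (the `softmax` helper and the chosen `d_k` values are made up for the demo) and compares how peaked the softmax is with and without dividing by $\sqrt{d_{k}}$. A softmax whose largest weight is close to 1 is saturated, which is the regime where gradients become very small.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len = 8

for d_k in (4, 64, 512):
    # Unit-variance entries, so each dot product has variance of about d_k
    q = rng.standard_normal((seq_len, d_k))
    k = rng.standard_normal((seq_len, d_k))
    raw = q @ k.T                 # unscaled scores
    scaled = raw / np.sqrt(d_k)   # scores as in Scaled Dot-Product Attention
    # Average of the largest attention weight per row: near 1 means saturated
    peak_raw = softmax(raw).max(axis=-1).mean()
    peak_scaled = softmax(scaled).max(axis=-1).mean()
    print(f"d_k={d_k:4d}  peak weight unscaled={peak_raw:.3f}  scaled={peak_scaled:.3f}")
```

As $d_{k}$ grows, the unscaled scores spread out and the softmax approaches a one-hot distribution, while the scaled version keeps a similar spread regardless of $d_{k}$.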

In [None]:
class Attention(tf.keras.layers.Layer):
    def __init__(self):
        super(Attention, self).__init__()
    
    def build(self, input_shape):
        self.q_w = self.add_weight(
            shape=(input_shape[-1], input_shape[-1]),
            initializer="random_normal",
            trainable=True
        )
        self.k_w = self.add_weight(
            shape=(input_shape[-1], input_shape[-1]),
            initializer="random_normal",
            trainable=True
        )
        self.v_w = self.add_weight(
            shape=(input_shape[-1], input_shape[-1]),
            initializer="random_normal",
            trainable=True
        )

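    def call(self, inputs, q=None, k=None, v=None):
        # Compare against None: the truth value of a tensor is ambiguous
        if q is None:
            q = tf.matmul(inputs, self.q_w)
        if k is None:
            k = tf.matmul(inputs, self.k_w)
        if v is None:
            v = tf.matmul(inputs, self.v_w)
        # Scaled dot-product scores QK^T / sqrt(d_k), with d_k the last axis of q
        scores = tf.matmul(q, k, transpose_b=True) / math.sqrt(q.shape[-1])
        # Softmax over the keys (last axis), then use the weights to mix the values
        return tf.matmul(tf.nn.softmax(scores, axis=-1), v)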

Sample output can be seen below. The numbers are gibberish, but we can see that the shapes all fit what we expect.

In [None]:
embed_layer = SinusoidalEmbedding(d_model)
x = embed_layer(sequence)
print(x)
att_layer = Attention()
x = att_layer(x)  # attend over the position-encoded sequence, not the raw one
print(x)

# Multi Head Attention
Multi-head attention is the same thing as single-head attention, but done multiple times in parallel (multiple sets of q, k, and v). At the end, the outputs are all concatenated together and fed into a fully connected layer that outputs the same shape as the inputs (and thus the layers can be stacked).<br/><br/>
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_{1}, \ldots, \text{head}_{h})W^{O}
$$
$$
\text{head}_{i} = \text{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})
$$
The idea behind multiple heads is that each head can attend to a different place (e.g., in "The dog eats steak", when looking at "dog" we might want one head to attend to "eats" and another head to attend to "steak").<br/><br/>
They also suggest dividing $d_{model}$ by the number of heads to get the per-head dimension, so that the network size and computation cost stay roughly the same as for a single full-width head.
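
The per-head split can be sketched in plain NumPy. This is a minimal illustration under assumptions rather than the paper's or this notebook's implementation: the function name, the random weights, and the choice to pack all heads' projections into single `(d_model, d_model)` matrices (equivalent to `num_heads` separate `(d_model, d_k)` matrices) are all made up for the example. Unlike the Keras layer in the next cell, which keeps every head at full width, each head here works on `d_k = d_model // num_heads` features, so the parameter count stays at `4 * d_model**2` no matter how many heads are used.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads  # per-head width (d_model must divide evenly)

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, seq_len, seq_len)
    heads = softmax(scores) @ v                       # (num_heads, seq_len, d_k)
    # Concatenate the heads back to (seq_len, d_model), then apply W^O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (4, 8)
```

With `num_heads=1` this reduces to ordinary scaled dot-product attention followed by the output projection.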

In [None]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads

    def build(self, input_shape):
        self.heads = []
        for i in range(self.num_heads):
            self.heads.append(Attention())
        self.output_layer = tf.keras.layers.Dense(input_shape[-1])

    def call(self, inputs):
        # Run every head on the same input, then concatenate along the feature axis
        head_outputs = [head(inputs) for head in self.heads]
        return self.output_layer(tf.concat(head_outputs, -1))

Sample output can be seen below. The numbers are gibberish, but we can see that the shapes all fit what we expect.

In [None]:
embed_layer = SinusoidalEmbedding(d_model)
x = embed_layer(sequence)
mha = MultiHeadAttention(4)
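x = mha(x)
# Output shape matches input shape, so the layer can be applied again
# (this reuses the same weights; a real stack would use separate layers)
x = mha(x)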
print(x)

# Challenges
## Easy - 1 pt
  - As we discussed, most models add position embeddings to the regular embeddings. However, some concatenate the embeddings instead. This can work better but has much worse space complexity. Can you create a new embedding layer that concatenates the embeddings instead of adding them?
## Medium - 2 pt
  - Create a position-wise feed-forward network as described in the paper.
## Hard - 3 pt
  - Create the encoder and decoder as described in the paper.
    - (If you did the medium problem, you have everything you need, just smush it all together correctly)
## Super Hard - 6 pt
  - Implement the rest of a transformer as described in the paper and train it on some data (something like translation works well)
    - You'll need to add batch dimensions