Assumption: we have an input text of 6 words, and each word is represented by a 512 dimension vector.

In this case, the $sequence = 6$ and $d_{model} = 512$.

Therefore:

$\text{Input matrix} (sequence, d_{model}) = (6, 512)$

Let's assume that our words are:

$\text{A, B, C, D, E, F}$

Then, the illustration of the input matrix looks like this:

$$
\text{Input matrix} = \begin{bmatrix}
\text{A}, & \text{B}, & \text{C}, & \text{D}, & \text{E}, & \text{F} \\
\end{bmatrix}
$$


Let's assume that each word has the following 512 columns of numerical representation (basically acting like 512 dimensions):

$$

\text{word} = \begin{bmatrix}
W_{0},  & W_{1}, & W_{2}, & \dots & W_{511} \\
\end{bmatrix}
$$

Note that each $W_{i}$ here is a number.

With this background, the input matrix (let's call it $W$) in an expanded form should look like this:

$$
W = \begin{bmatrix}
\begin{bmatrix} \text{A}_0, & \text{A}_1, & \text{A}_2, & \dots, & \text{A}_{511} \end{bmatrix} \\
\begin{bmatrix} \text{B}_0, & \text{B}_1, & \text{B}_2, & \dots, & \text{B}_{511} \end{bmatrix} \\
\begin{bmatrix} \text{C}_0, & \text{C}_1, & \text{C}_2, & \dots, & \text{C}_{511} \end{bmatrix} \\
\begin{bmatrix} \text{D}_0, & \text{D}_1, & \text{D}_2, & \dots, & \text{D}_{511} \end{bmatrix} \\
\begin{bmatrix} \text{E}_0, & \text{E}_1, & \text{E}_2, & \dots, & \text{E}_{511} \end{bmatrix} \\
\begin{bmatrix} \text{F}_0, & \text{F}_1, & \text{F}_2, & \dots, & \text{F}_{511} \end{bmatrix}
\end{bmatrix}
$$

The shape, as we can clearly see, is $(6, 512)$.

Also, the transpose of this looks like this:

$$
W^\text{T} = \begin{bmatrix}
\begin{bmatrix} \text{A}_0, & \text{B}_0, & \text{C}_0, & \text{D}_0, & \text{E}_0, & \text{F}_0 \end{bmatrix} \\
\begin{bmatrix} \text{A}_1, & \text{B}_1, & \text{C}_1, & \text{D}_1, & \text{E}_1, & \text{F}_1 \end{bmatrix} \\
\begin{bmatrix} \text{A}_2, & \text{B}_2, & \text{C}_2, & \text{D}_2, & \text{E}_2, & \text{F}_2 \end{bmatrix} \\
\begin{bmatrix} \text{A}_3, & \text{B}_3, & \text{C}_3, & \text{D}_3, & \text{E}_3, & \text{F}_3 \end{bmatrix} \\
\vdots \\
\begin{bmatrix} \text{A}_{511}, & \text{B}_{511}, & \text{C}_{511}, & \text{D}_{511}, & \text{E}_{511}, & \text{F}_{511} \end{bmatrix}
\end{bmatrix}
$$

The shape of this is $(512, 6)$.

In [1]:
import torch
import torch.nn as nn

In [2]:
torch.manual_seed(40)
input_matrix = torch.randn(6, 512)
input_matrix.shape, input_matrix

(torch.Size([6, 512]),
 tensor([[-0.2367,  1.8109,  0.1966,  ..., -0.6320,  0.3352,  0.3928],
         [ 0.0783,  0.5694, -0.6083,  ...,  0.3377,  0.9911, -1.0636],
         [ 0.0525, -0.4094, -0.7481,  ...,  0.7475, -1.0518, -0.2402],
         [-0.7422,  0.5986,  1.1324,  ...,  0.6658, -1.9029,  0.3874],
         [ 0.3757, -0.0370,  0.0536,  ...,  1.3480, -0.3502,  1.4096],
         [ 0.3379,  0.6135,  0.3060,  ..., -0.1643, -1.4237,  1.1242]]))

In [3]:
input_matrix.T.shape, input_matrix.T

(torch.Size([512, 6]),
 tensor([[-0.2367,  0.0783,  0.0525, -0.7422,  0.3757,  0.3379],
         [ 1.8109,  0.5694, -0.4094,  0.5986, -0.0370,  0.6135],
         [ 0.1966, -0.6083, -0.7481,  1.1324,  0.0536,  0.3060],
         ...,
         [-0.6320,  0.3377,  0.7475,  0.6658,  1.3480, -0.1643],
         [ 0.3352,  0.9911, -1.0518, -1.9029, -0.3502, -1.4237],
         [ 0.3928, -1.0636, -0.2402,  0.3874,  1.4096,  1.1242]]))

In [4]:
transpose_dotproduct = input_matrix @ input_matrix.T
transpose_dotproduct.shape, transpose_dotproduct

(torch.Size([6, 6]),
 tensor([[503.2782,  30.9015, -25.3849,  13.0093, -29.7297,  20.1604],
         [ 30.9015, 536.9385, -23.3817, -70.6838,  -8.7096, -19.2283],
         [-25.3849, -23.3817, 484.9792,  24.6444, -29.3236,  -0.9737],
         [ 13.0093, -70.6838,  24.6444, 538.0654, -25.3541,  -8.6208],
         [-29.7297,  -8.7096, -29.3236, -25.3541, 453.2016,  -5.3324],
         [ 20.1604, -19.2283,  -0.9737,  -8.6208,  -5.3324, 539.1165]]))

In [5]:
A = input_matrix[0][:] # first row
A @ A.T # basically dot product of A and A^T

  A @ A.T # basically dot product of A and A^T


tensor(503.2783)

## Positional Encoding

Formulas for positional encoding in the original transformer paper are:

$$
PE(pos, 2i) = sin(pos /  10000^{2i/d_{model}}) \quad \leftarrow \text{(for even i)}
$$

$$
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_{model}}) \quad \leftarrow \text{(for odd i)}
$$

where $pos$ is the position of the word in the sentence, and $i$ is the dimension index (starting from 0 to $d_{model} - 1$).









## Multi-Head Attention

Formula for multi-head attention in the original transformer paper is:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)W^\text{O}
$$

where:

$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$

where $W_i^Q, W_i^K, W_i^V$ are the weights for the $i$-th head.

and:

$
W^\text{O} = \begin{bmatrix}
W_1^\text{O}, & W_2^\text{O}, & \dots, & W_h^\text{O}
\end{bmatrix}
$



In [6]:
import tiktoken

tokenizer = tiktoken.get_encoding('gpt2')

In [7]:
vocab_size = tokenizer.n_vocab # 50257
d_model = 512

embedding_matrix = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)

def tokenize_text(text):
    token_ids = tokenizer.encode(text)
    embeddings = embedding_matrix(torch.stack([torch.tensor(token_ids)], dim=0))
    return token_ids, embeddings

def decode_text(token_ids):
    return tokenizer.decode(token_ids)

In [8]:
input_text = 'This is a keyboard'

In [9]:
token_ids, input_embeddings = tokenize_text(input_text)

# batch size, sequence length (i.e., number of words), embedding dimension
print(input_embeddings.shape)
print(input_embeddings)
print(token_ids)


torch.Size([1, 4, 512])
tensor([[[ 0.8884,  3.1415, -0.7241,  ..., -0.8155, -0.0804,  1.9023],
         [ 0.6572, -1.5998,  1.3872,  ...,  0.3046,  0.6751, -1.5896],
         [ 0.5262,  0.3106, -0.4733,  ...,  0.5922,  1.3402,  1.1434],
         [-0.7888,  1.1974, -2.1260,  ..., -0.0547,  1.4225,  0.5274]]],
       grad_fn=<EmbeddingBackward0>)
[1212, 318, 257, 10586]


In [10]:
decode_text(token_ids)

'This is a keyboard'

Input embeddings -> DONE

### Let's do some positional encoding now

In [11]:
def positional_encoding(pos: int, i: int, d_model=d_model):
    if i % 2 == 0:
        return torch.sin(torch.tensor(pos) / torch.pow(torch.tensor(10000), (2*i) / d_model))
    else:
        return torch.cos(torch.tensor(pos) / torch.pow(torch.tensor(10000), (2*i) / d_model))

In [12]:
input_embeddings.shape

torch.Size([1, 4, 512])

In [55]:
def encoded_embeddings(input_embeddings):
    batch, seq_len, d_model = input_embeddings.shape
    positions = torch.arange(seq_len, dtype=input_embeddings.dtype).unsqueeze(1)
    dims = torch.arange(d_model, dtype=input_embeddings.dtype).unsqueeze(0)
    denominator = torch.pow(torch.tensor(10000), (2 * dims) / d_model)
    angles = positions / denominator
    encoded = torch.empty_like(angles)
    encoded[:, 0::2] = torch.sin(angles[:, 0::2])
    encoded[:, 1::2] = torch.cos(angles[:, 1::2])
    return encoded.unsqueeze(0).expand(batch, -1, -1) + input_embeddings

print(encoded_embeddings(input_embeddings))
print(input_embeddings)


tensor([[[ 0.8884,  5.1415, -0.7241,  ...,  1.1845, -0.0804,  3.9023],
         [ 2.3402, -0.4604,  2.9912,  ...,  2.3046,  0.6751,  0.4104],
         [ 2.3448, -0.3912,  1.4430,  ...,  2.5922,  1.3402,  3.1434],
         [-0.5066, -0.7416, -1.4404,  ...,  1.9453,  1.4225,  2.5274]]],
       grad_fn=<AddBackward0>)
tensor([[[ 0.8884,  4.1415, -0.7241,  ...,  0.1845, -0.0804,  2.9023],
         [ 1.4987, -1.0301,  2.1892,  ...,  1.3046,  0.6751, -0.5896],
         [ 1.4355, -0.0403,  0.4848,  ...,  1.5922,  1.3402,  2.1434],
         [-0.6477,  0.2279, -1.7832,  ...,  0.9453,  1.4225,  1.5274]]],
       grad_fn=<CopySlices>)
