 # Concise Summary of Transformers: Part 1 - Core Concepts



 ---

 This summary focuses on the fundamental building blocks and concepts of the Transformer architecture, tailored for a quick review.

 ---

 ## Table of Contents



 - [1. The Transformer Layer: Core Building Block](#1-the-transformer-layer-core-building-block)

 - [2. Deconstructing the Transformer Layer](#2-deconstructing-the-transformer-layer)

   - [2.1 Multi-Head Attention: Learning from Different Perspectives](#21-multi-head-attention-learning-from-different-perspectives)

   - [2.2 Scaled Dot-Product Self-Attention: The Engine of a Head](#22-scaled-dot-product-self-attention-the-engine-of-a-head)

   - [2.3 Feed-Forward Network (FFN): Adding Depth](#23-feed-forward-network-ffn-adding-depth)

   - [2.4 Add & Norm: Stabilizing and Enabling Deep Networks](#24-add--norm-stabilizing-and-enabling-deep-networks)

 - [3. Positional Encoding: Understanding Sequence Order](#3-positional-encoding-understanding-sequence-order)

 - [4. Transformers & Natural Language: Basic Preprocessing](#4-transformers--natural-language-basic-preprocessing)

   - [4.1 Word Embeddings: Representing Words as Vectors](#41-word-embeddings-representing-words-as-vectors)

   - [4.2 Tokenization: Breaking Down Text](#42-tokenization-breaking-down-text)

 - [Reference](#reference)

 ## <a id="1-the-transformer-layer-core-building-block"></a>1. The Transformer Layer: Core Building Block



 The Transformer architecture is built by stacking multiple identical **Transformer Layers**. Each layer processes a sequence of input vectors (tokens) and outputs a sequence of vectors of the same dimension. Its primary role is to refine the representation of each token by considering its context within the entire sequence.



 A Transformer layer typically consists of two main sub-layers:

 1.  A **Multi-Head Self-Attention** mechanism.

 2.  A **Position-wise Feed-Forward Network (FFN)**.



 Residual connections and layer normalization are applied around each of these sub-layers.


<br>
<div align="center">
    <img src="image/Figure_9.png" width="250px"/>
    <p><em>Figure 9: Architecture of one transformer layer, showcasing its main components.</em></p>
</div>
<br>

<br>

 ---

 <br>

 ## <a id="2-deconstructing-the-transformer-layer"></a>2. Deconstructing the Transformer Layer

 ### <a id="21-multi-head-attention-learning-from-different-perspectives"></a>2.1 Multi-Head Attention: Learning from Different Perspectives



 Instead of performing a single attention function, Transformers employ **Multi-Head Attention**. This allows the model to jointly attend to information from different representational subspaces at different positions.



 * **Mechanism:** The input queries, keys, and values are linearly projected multiple times (once for each "head") into different lower-dimensional spaces. Attention is then performed in parallel for each of these projected versions. The outputs of the heads are concatenated and linearly projected again to produce the final output.

 * **Benefit:** It enables the model to capture various types of relationships and nuances in the data that a single attention mechanism might miss by averaging them out.
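The mechanism above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `multi_head_attention`, the random weights, the sizes, and the choice of per-head dimension `D_h = D // H` are all assumptions made for the example. Each head uses the scaled dot-product attention described in Section 2.2.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (N, D). W_q, W_k, W_v: (H, D, D_h), one projection per head. W_o: (H * D_h, D)."""
    head_outputs = []
    for Wq_h, Wk_h, Wv_h in zip(W_q, W_k, W_v):
        # Project the input into this head's lower-dimensional subspace.
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        head_outputs.append(weights @ V)               # (N, D_h)
    concat = np.concatenate(head_outputs, axis=-1)     # (N, H * D_h)
    return concat @ W_o                                # final linear projection, (N, D)

rng = np.random.default_rng(0)
N, D, H = 5, 16, 4            # illustrative sizes (assumed)
D_h = D // H                  # assumed per-head dimension
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(H, D, D_h)) for _ in range(3))
W_o = rng.normal(size=(H * D_h, D))

Y = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(Y.shape)  # (5, 16): same shape as the input
```

The explicit loop over heads keeps the correspondence with the description visible; practical implementations usually batch all heads into a single tensor operation.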


<br>
<div align="center">
    <img src="image/Figure_8.png" width="500px"/>
    <p><em>Figure 8: Information flow for multi-head attention. The input is split, processed by several attention "heads" in parallel, and then their outputs are combined.</em></p>
</div>
<br>

 ---

 ### <a id="22-scaled-dot-product-self-attention-the-engine-of-a-head"></a>2.2 Scaled Dot-Product Self-Attention: The Engine of a Head



 Each head in Multi-Head Attention uses **Scaled Dot-Product Self-Attention**. This is where tokens in a sequence interact to compute attention scores.



 * **Queries, Keys, Values (Q, K, V):** For each input token, three vectors are typically derived through learnable linear transformations:

     * **Query (Q):** Represents the current token seeking information.

     * **Key (K):** Represents an input token advertising its information.

     * **Value (V):** Represents the actual content/features of an input token.

 * **Process:**

     1.  **Similarity Scores:** The dot product of a token's Query vector with the Key vector of every token in the sequence (including its own) is computed. This measures similarity.

     2.  **Scaling:** These scores are divided by the square root of the dimension of the key vectors ($\sqrt{D_k}$). Without this, dot products grow in magnitude with $D_k$ and push the softmax into saturated regions where gradients are very small.

     3.  **Softmax:** A softmax function is applied to each token's scaled scores to obtain attention weights that are non-negative and sum to 1. These weights determine how much focus that token places on every token in the sequence, itself included.

     4.  **Weighted Sum:** The final output for the token is a weighted sum of all Value vectors, using the computed attention weights.



 The formula is:

 $$ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{D_k}}\right)V $$
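The four steps and the formula can be written out as a short NumPy sketch. The function names, random inputs, and sizes below are assumptions chosen for illustration; `W_q`, `W_k`, `W_v` stand in for the learnable matrices $W^{(q)}, W^{(k)}, W^{(v)}$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (N, D_k), V: (N, D_v). Returns the output (N, D_v) and the weights (N, N)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1-2: similarity scores, then scaling
    weights = softmax(scores, axis=-1)   # step 3: each row becomes a distribution
    return weights @ V, weights          # step 4: weighted sum of the value vectors

rng = np.random.default_rng(0)
N, D, D_k = 4, 8, 3                      # illustrative sizes (assumed)
X = rng.normal(size=(N, D))              # one row per token embedding
W_q, W_k, W_v = (rng.normal(size=(D, D_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)           # (4, 3)
print(weights.sum(axis=-1))   # every row of weights sums to 1 (up to rounding)
```

Row `n` of `weights` holds the attention that token `n` pays to each token in the sequence.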


<br>
<div align="center">
    <img src="image/Figure_6.png" width="250px"/>
    <p><em>Figure 6: Information flow in a scaled dot-product self-attention mechanism, the core of an attention head.</em></p>
</div>
<br>



 The initial Q, K, V vectors are derived from the input token embeddings using separate learnable weight matrices ($W^{(q)}, W^{(k)}, W^{(v)}$), allowing the model to learn optimal projections for attention.



<br>
<div align="center">
    <img src="image/Figure_4.png" width="550px"/>
    <p><em>Figure 4: Calculation of QK<sup>T</sup> from input X and learnable weight matrices.</em></p>
</div>
<br>

<br>

 ---

 <br>

 ### <a id="23-feed-forward-network-ffn-adding-depth"></a>2.3 Feed-Forward Network (FFN): Adding Depth



 Following the multi-head attention sub-layer, each position's output is passed through a **Position-wise Feed-Forward Network (FFN)**.



 * **Mechanism:** This is typically a two-layer fully connected neural network with a non-linear activation function (e.g., ReLU) in between. Importantly, the *same* FFN (with the same weights) is applied independently to each token's representation in the sequence.

 * **Benefit:** It introduces additional non-linearity and allows the model to learn more complex transformations of each token's representation after contextual information has been aggregated by the attention mechanism.
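A minimal NumPy sketch of such a position-wise FFN is given below. The hidden width `D_ff`, the random weights, and the function name are assumptions for illustration only. The final check shows that applying the network to one token on its own gives the same result as that token's row in the full output, which is what "position-wise" means.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """X: (N, D) -> (N, D). The same weights are applied to every row (token)."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # first linear layer followed by ReLU
    return hidden @ W2 + b2                 # second linear layer, back to dimension D

rng = np.random.default_rng(0)
N, D, D_ff = 5, 8, 32                       # illustrative sizes (assumed)
X = rng.normal(size=(N, D))
W1, b1 = rng.normal(size=(D, D_ff)), np.zeros(D_ff)
W2, b2 = rng.normal(size=(D_ff, D)), np.zeros(D)

Y = position_wise_ffn(X, W1, b1, W2, b2)

# Processing token 2 alone matches its row in the full result:
# no information flows between positions inside the FFN.
y_single = position_wise_ffn(X[2:3], W1, b1, W2, b2)
print(Y.shape, np.allclose(Y[2:3], y_single))
```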



 ---

 ### <a id="24-add--norm-stabilizing-and-enabling-deep-networks"></a>2.4 Add & Norm: Stabilizing and Enabling Deep Networks



 Around each of the two main sub-layers (Multi-Head Attention and FFN) in a Transformer layer, two operations are applied:



 1.  **Residual Connection (Add):** The input to the sub-layer is added to the output of that sub-layer. This helps mitigate the vanishing gradient problem, allowing for much deeper networks to be trained effectively.

     * Output = `SubLayer(Input) + Input`

 2.  **Layer Normalization (Norm):** This operation normalizes the activations across the features for each token independently. It helps stabilize the learning process and reduces sensitivity to the initialization of weights.

     * Final Output = `LayerNorm(SubLayer(Input) + Input)`



 These "Add & Norm" steps are crucial for the successful training of deep Transformer models.
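The two operations can be sketched in NumPy as below. This is an illustration under stated assumptions: the learnable scale `gamma` and shift `beta`, the small constant `eps`, and the stand-in sub-layer (a single random linear map) are not taken from the text above.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature axis, separately for each token (row).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # LayerNorm(SubLayer(Input) + Input)
    return layer_norm(sublayer(x) + x, gamma, beta)

rng = np.random.default_rng(0)
N, D = 4, 8                                   # illustrative sizes (assumed)
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, D))                   # stand-in sub-layer weights (assumed)
gamma, beta = np.ones(D), np.zeros(D)         # assumed learnable scale and shift

Y = add_and_norm(X, lambda x: x @ W, gamma, beta)
print(Y.mean(axis=-1))   # close to 0 for every token
print(Y.std(axis=-1))    # close to 1 for every token
```

Any function that maps an `(N, D)` array to an `(N, D)` array, such as an attention or FFN sub-layer, can be passed as `sublayer`.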



 ---

 ## <a id="3-positional-encoding-understanding-sequence-order"></a>3. Positional Encoding: Understanding Sequence Order



 The self-attention mechanism does not inherently consider the order of tokens in a sequence. If the input tokens are shuffled, each token's output is unchanged and the outputs are simply shuffled in the same way (self-attention is permutation-equivariant), so the layer cannot tell one ordering from another. This is problematic for tasks like language understanding, where word order is critical.
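The effect of shuffling the input can be checked numerically. The sketch below uses a single attention head with random weights and no positional encoding (all sizes and names are assumptions for illustration) and compares the output for a shuffled input with the shuffled output for the original input.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, D, D_k = 5, 8, 4                      # illustrative sizes (assumed)
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, D_k)) for _ in range(3))

perm = rng.permutation(N)                # a random reordering of the tokens
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the input rows only shuffles the output rows in the same way.
print(np.allclose(out_perm, out[perm]))  # True
```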



 * **Solution:** **Positional Encodings** are added to the input embeddings at the bottom of the Transformer stack (before the first layer). These are vectors of the same dimension as the token embeddings.

 * **Purpose:** They inject information about the relative or absolute position of tokens in the sequence.

 * **Methods:**

     * **Sinusoidal Positional Encodings:** A common method uses sine and cosine functions of different frequencies across the embedding dimensions. This method has the advantage of potentially generalizing to sequence lengths not seen during training.

     * **Learned Positional Encodings:** Alternatively, the positional encodings can be learnable parameters, similar to token embeddings, where each position has a unique learned vector.
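A sketch of the sinusoidal variant follows. It assumes the widely used formulation in which even dimensions hold sines and odd dimensions hold cosines, with wavelengths set by a base constant of 10000; the function name and these choices are assumptions, since the text above does not fix a specific formula.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) array. d_model is assumed to be even."""
    positions = np.arange(num_positions)[:, None]    # (N, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, D/2): 0, 2, 4, ...
    angles = positions / base ** (dims / d_model)    # (N, D/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, d_model=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0 alternates sin(0) = 0 and cos(0) = 1
```

The encoding is used by adding it to the token embeddings, for example `X + pe[:X.shape[0]]` for an embedding matrix `X` of shape `(N, 16)`.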



<br>
<div align="center">
    <img src="image/Figure_10_b.png" width="350px"/>
    <p><em>Figure 10b: Heatmap illustrating sinusoidal positional encoding vectors, where each row is a position and each column is an embedding dimension.</em></p>
</div>
<br>

<br>

 ---

 <br>


 ## <a id="4-transformers--natural-language-basic-preprocessing"></a>4. Transformers & Natural Language: Basic Preprocessing



 While Transformers are general-purpose sequence processors, their initial success was in Natural Language Processing (NLP). Key preprocessing steps are needed to convert raw text into a format suitable for the model.

 ### <a id="41-word-embeddings-representing-words-as-vectors"></a>4.1 Word Embeddings: Representing Words as Vectors



 * **Challenge:** Neural networks operate on numbers, not raw text.

 * **Solution:** Words are mapped to dense vector representations called **word embeddings**. These embeddings capture semantic similarities, meaning words with similar meanings are closer in the vector space.

 * **Learning:** Methods like `word2vec` (e.g., CBOW, Skip-gram) learn these embeddings from large text corpora by predicting a word from its context or the context from a word. The embeddings can be pre-trained or learned as part of the Transformer model itself.
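In code, an embedding is a lookup into a matrix with one row per vocabulary entry. The sketch below uses a hypothetical four-word vocabulary and random vectors in place of learned ones, so the similarity value it prints carries no meaning; it only shows the mechanics of the lookup and of comparing two vectors.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}   # hypothetical toy vocabulary
D = 4                                              # embedding dimension (assumed)
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), D))               # random here; learned in practice

token_ids = np.array([vocab[w] for w in ["the", "cat", "sat"]])
embedded = E[token_ids]                            # (3, D): one row per input word

# Row lookup is equivalent to multiplying one-hot vectors by the embedding matrix.
one_hot = np.eye(len(vocab))[token_ids]
print(np.allclose(embedded, one_hot @ E))          # True

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# With trained embeddings, related words would tend to score higher.
sim = cosine_similarity(E[vocab["cat"]], E[vocab["mat"]])
print(sim)
```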



<br>
<div align="center">
    <img src="image/Figure_11_a.png" width="350px"/>
    <p><em>Figure 11a: The Continuous Bag of Words (CBOW) model for learning word embeddings.</em></p>
</div>
<br>


---


 ### <a id="42-tokenization-breaking-down-text"></a>4.2 Tokenization: Breaking Down Text



 * **Challenge:** Dealing with vast vocabularies, rare words, misspellings, and sub-word structures.

 * **Solution:** **Tokenization** breaks down text into smaller units called tokens, which are often sub-words or characters rather than full words. This helps manage vocabulary size and handle out-of-vocabulary words.

 * **Methods:** Algorithms like Byte Pair Encoding (BPE) start with individual characters and iteratively merge the most frequent pairs to form a vocabulary of tokens.
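The merge procedure can be sketched in plain Python. This is a toy version for illustration: the helper names are hypothetical, the corpus is a lowercase tongue twister chosen for the example, words are split on whitespace, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

def count_pairs(words):
    """words maps a tuple of symbols to how often that word occurs in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    # Start from individual characters, one tuple of symbols per distinct word.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

corpus = "peter piper picked a peck of pickled peppers"
merges, words = learn_bpe(corpus, num_merges=3)
print(merges[0])     # ('p', 'e'): the most frequent pair, occurring 5 times
print(list(words))   # each word as a tuple of the current tokens
```

After training, the ordered list of merges is applied to new text in the same order to tokenize it.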



<br>
<div align="center">
    <img src="image/Figure_12.png" width="450px"/>
    <p><em>Figure 12: Illustration of Byte Pair Encoding, where frequent character pairs like 'pe' are merged into single tokens.</em></p>
</div>
<br>



 ---


 ### <a id="reference"></a>Reference



 Bishop, C. M., & Bishop, H. (2024). *Deep Learning: Foundations and Concepts*. Springer. (Chapter 12: Transformers).

 ---


### Extra: how to make bag of words

Bag of words (BoW) represents each text as a vector of word counts; scikit-learn builds it with `sklearn.feature_extraction.text.CountVectorizer`.  
TF-IDF (Term Frequency-Inverse Document Frequency) additionally down-weights words that appear in many documents; it is available as `sklearn.feature_extraction.text.TfidfVectorizer`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example sentences
sentences = [
    "The cat sat on the mat",
    "The dog ran in the park",
    "A cat and a dog in the park"
]

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the sentences
X = vectorizer.fit_transform(sentences)

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Convert to DataFrame for better visualization
bow_df = pd.DataFrame(X.toarray(), columns=feature_names)


In [3]:
print("Vocabulary:", feature_names)
print("\nBag of Words representation:")
print(bow_df)


Vocabulary: ['and' 'cat' 'dog' 'in' 'mat' 'on' 'park' 'ran' 'sat' 'the']

Bag of Words representation:
   and  cat  dog  in  mat  on  park  ran  sat  the
0    0    1    0   0    1   1     0    0    1    2
1    0    0    1   1    0   0     1    1    0    2
2    1    1    1   1    0   0     1    0    0    1
