<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**Transformers**

&copy; Dr. Yves J. Hilpisch

<a href="https://tpq.io" target="_blank">https://tpq.io</a> | <a href="https://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Self-Attention

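The following is a concise **review** of the self-attention mechanism and its interpretation. The example is from ChatGPT; for simplicity, it applies attention across whole sentences (one TF-IDF vector per sentence) rather than across the tokens within a sentence.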

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Step 1: Example text (simple sentences)
texts = ["The cat sat on the mat",
         "The mat was sat on by the cat",
         "The cat is playing with a ball"]

In [None]:
# Step 2: Generate embeddings using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_embeddings = vectorizer.fit_transform(texts).toarray()

In [None]:
# Display the feature names and TF-IDF embeddings
print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Embeddings:\n", tfidf_embeddings)

In [None]:
# Step 3: Define a simple self-attention mechanism
def self_attention(embeddings):
    """
    Calculate attention scores for the input embeddings
    """

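    # Step 4: Compute the attention scores
    # Full self-attention uses softmax(Q @ K.T / sqrt(d_k)) @ V with learned
    # projections; for simplicity, the raw embeddings serve as queries (Q),
    # keys (K), and values (V), and the scaling factor is omitted
    attention_scores = np.dot(embeddings, embeddings.T)

    # Step 5: Apply softmax to normalize the attention scores (weights)
    def softmax(x):
        # Subtract the row maximum for numerical stability (result unchanged)
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)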

    attention_weights = softmax(attention_scores)

    # Step 6: Multiply the weights with the values (the embeddings themselves)
    # context_embeddings = attention_weights @ values
    context_embeddings = np.dot(attention_weights, embeddings)

    return attention_scores, attention_weights, context_embeddings

In [None]:
# Step 7: Calculate self-attention on the embeddings
attention_scores, attention_weights, context_embeddings = self_attention(tfidf_embeddings)

#### Explanation of the Code:

1. **Text Data (`texts`)**: We start with a small list of three simple sentences to illustrate how self-attention works on textual data.

2. **TF-IDF Vectorization**:
   - We use the `TfidfVectorizer` from `sklearn` to create embeddings for each sentence.
   - The `fit_transform` method generates TF-IDF scores for each word, resulting in numerical vectors representing the importance of words in each sentence.
   - `tfidf_embeddings` is a matrix in which each row is the vector representation of a sentence and each column corresponds to one vocabulary term.

3. **Self-Attention Function (`self_attention`)**:
   - **Attention Scores**: We compute the dot product of every pair of sentence embeddings to get the attention scores. These represent how strongly each sentence relates to every other sentence. Here, the sentence embeddings themselves serve as both the queries and the keys.
   - **Softmax**: We apply softmax to each row of the attention scores to convert them into probabilities (attention weights). This ensures that each sentence's weights over all sentences sum to 1.
   - **Contextualized Embeddings**: Finally, we compute the new embeddings by multiplying the attention weights with the original embeddings, which serve as the values. This generates "contextualized" representations of each sentence, influenced by the other sentences.

4. **Interpretation of Results**:
   - **Attention Scores**: These raw scores indicate how much one sentence relates to another. High scores indicate higher similarity.
   - **Attention Weights**: After applying softmax, these normalized scores tell us how much attention one sentence pays to another (normalized to probabilities).
   - **Contextualized Embeddings**: These embeddings represent the sentences after the attention mechanism. Each sentence is now a weighted sum of all the original sentences, capturing the context from the others.
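
The softmax and weighted-sum steps described above can be checked on a tiny hand-made example. This is a minimal sketch, not the notebook's own code: the `scores` and `values` arrays are made-up numbers, and `stable_softmax` is a hypothetical helper that subtracts the row maximum before exponentiating, a common precaution against overflow.

```python
import numpy as np

def stable_softmax(x):
    # Subtracting the row-wise maximum leaves the result unchanged
    # but keeps np.exp from overflowing for large scores
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Made-up similarity scores between three "sentences" (illustrative only)
scores = np.array([[1.0, 0.8, 0.2],
                   [0.8, 1.0, 0.1],
                   [0.2, 0.1, 1.0]])

# Made-up two-dimensional "embeddings" standing in for the values
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

weights = stable_softmax(scores)
context = weights @ values

print(weights.sum(axis=1))  # every row of weights sums to 1
print(context)              # every row is a weighted average of the rows of values
```

Because every row of `weights` is a probability distribution, each contextualized vector stays within the range spanned by the original value vectors.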

In [None]:
# Step 8: Interpret the results
print("\nAttention Scores (before softmax):\n", attention_scores)
print("\nAttention Weights (after softmax):\n", attention_weights)
print("\nContextualized Embeddings:\n", context_embeddings)

### Key Interpretations:
- **Attention Scores**: The diagonal entries (1.0) are each sentence's similarity with itself; they equal 1 because `TfidfVectorizer` L2-normalizes each row by default. The off-diagonal values (like 0.7968) measure how similar two different sentences are.
- **Attention Weights**: These are the softmax-normalized attention scores, indicating how much attention each sentence pays to every sentence, including itself. For example, sentence 1 pays about 44% attention to itself and about 36% to sentence 2.
- **Contextualized Embeddings**: Each sentence now incorporates information from the other sentences, with the weights applied based on attention. For example, sentence 1's new representation is a mix of roughly 44% of itself, 36% of sentence 2, and 21% of sentence 3 (rounded values).

This is a basic illustration, but it shows how self-attention works using simple text and embeddings from TF-IDF.
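
The diagonal of exactly 1.0 in the score matrix follows from `TfidfVectorizer`'s default setting `norm="l2"`, which scales every row to unit length. The sketch below reproduces that effect with plain NumPy; the `raw` matrix is a hypothetical set of term weights, not actual TF-IDF output.

```python
import numpy as np

# Hypothetical raw term-weight vectors for three documents (illustrative only)
raw = np.array([[2.0, 1.0, 0.0, 1.0],
                [1.0, 1.0, 1.0, 0.0],
                [0.0, 0.0, 3.0, 1.0]])

# L2-normalize each row, mirroring TfidfVectorizer's default norm="l2"
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# With unit-length rows, the dot-product matrix is the cosine similarity matrix
scores = unit @ unit.T

print(np.round(scores, 4))
```

With unit-length rows, the dot products used as attention scores are cosine similarities, so the diagonal is 1 and no off-diagonal entry can exceed it.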

## Transformer Architecture

_From ChatGPT_:

The relationship between **embeddings**, **self-attention**, and **transformers** is central to the functioning of modern natural language processing (NLP) architectures. Here’s a detailed breakdown of how they interrelate:

### 1. **Embeddings (e.g. Word2Vec)**

**Embeddings** are a technique to represent words or tokens in a continuous vector space. The core idea is to encode words into fixed-length dense vectors, where words with similar meanings are placed closer together in this vector space.

- **Word2Vec** is an early and widely used embedding technique that creates word vectors based on their context (Skip-gram or CBOW). The vectors capture semantic relationships between words. For instance, the vector difference between "king" and "queen" tends to be similar to that between "man" and "woman."
  
- In modern NLP models (including transformers), embeddings serve as the **input representations** of the words. The input to these models is typically a sequence of embeddings, where each embedding corresponds to a word or sub-word token.

However, **Word2Vec** is static, meaning each word has a single vector regardless of context. In contrast, newer models like **BERT** (a transformer model) produce **contextual embeddings** where the meaning of a word can vary depending on its usage.
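
The "king"/"queen" analogy can be illustrated with a few hand-made vectors. The values below are hypothetical and chosen so that the relationship holds exactly; trained Word2Vec vectors show it only approximately. The dictionary lookup also shows what "static" means: a word maps to the same vector wherever it appears.

```python
import numpy as np

# Hand-made toy vectors (hypothetical values, not trained Word2Vec output).
# The three dimensions loosely stand for: royalty, maleness, femaleness.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def cosine(a, b):
    # Cosine similarity: 1 means the vectors point in the same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

diff_royal = embeddings["king"] - embeddings["queen"]
diff_common = embeddings["man"] - embeddings["woman"]

# The two difference vectors point in the same direction in this toy setup
print(cosine(diff_royal, diff_common))
```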

### 2. **Self-Attention Mechanism**

The **self-attention** mechanism is the core building block of the transformer architecture. It lets the model weigh different parts of a sequence when computing the representation of a particular token.

- In self-attention, each token in a sequence interacts with every other token, learning **which parts of the sequence are important** to focus on. The model computes a weighted sum of all the tokens, where the weights are dynamically determined by how relevant each token is to the one being processed.

  This is done using three vectors for each token:
  - **Query (Q)**: What are we looking for in the sequence?
  - **Key (K)**: What information does each token have?
  - **Value (V)**: What is the actual information that each token holds?

The attention scores for a token are calculated by taking the dot product of its query vector with the key vectors of all tokens (in the original transformer, divided by the square root of the key dimension), followed by a softmax operation that normalizes the scores into weights. The token's new representation is the weighted sum of the value vectors, which allows the model to focus on the most relevant tokens in the sequence.
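
The query/key/value computation can be sketched in NumPy as `softmax(Q @ K.T / sqrt(d_k)) @ V`. This is a minimal sketch under stated assumptions: the sizes are arbitrary, the projection matrices `W_q`, `W_k`, and `W_v` are random stand-ins for learned weights, and the division by `sqrt(d_k)` follows the scaled dot-product attention of the original transformer paper.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8            # illustrative sizes
X = rng.normal(size=(seq_len, d_model))    # one embedding per token

# Randomly initialized projection matrices stand in for learned weights
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product scores: one row per query token, one column per key token
scores = Q @ K.T / np.sqrt(d_k)

# Row-wise softmax (row maximum subtracted for numerical stability)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Each output row is a weighted sum of the value vectors
output = weights @ V
print(weights.shape, output.shape)
```

Unlike the TF-IDF example earlier in this notebook, queries, keys, and values here are three different projections of the same input, which is what lets a trained model learn what to look for and what to pass on.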

**Multi-head attention** extends this concept by running several attention operations ("heads") in parallel, each with its own set of query, key, and value projections. Different heads can learn different aspects of the relationships between tokens, and their outputs are concatenated and projected back to the model dimension.
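
A rough NumPy sketch of multi-head attention follows. The function name `multi_head_attention`, the sizes, and the random weight matrices are all illustrative assumptions; splitting the model dimension evenly across heads is the common convention, not something this notebook prescribes.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads   # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Each head computes its own attention weights over the sequence
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)                  # (num_heads, seq_len, seq_len)
    heads = weights @ V                        # (num_heads, seq_len, d_head)

    # Concatenate the heads and mix them with an output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)
```

Each head produces its own `seq_len × seq_len` weight matrix, so two heads can attend to different tokens for the same query position.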

### 3. **Transformers**

Transformers are a deep learning architecture that relies heavily on the self-attention mechanism. Introduced by Vaswani et al. in 2017 with the paper *"Attention is All You Need,"* transformers have since become the foundation for most state-of-the-art NLP models like **BERT**, **GPT**, and **T5**.

The original transformer architecture consists of two parts (later models often use only one of them: BERT is encoder-only, GPT is decoder-only):
- **Encoder**: Processes the input sequence.
- **Decoder**: Generates the output sequence (for tasks like translation or text generation).

The key innovations of transformers compared to earlier models (like RNNs or LSTMs) include:

- **Self-attention**: As discussed, this allows the model to look at the entire sequence at once and decide which parts are relevant for processing each token.
  
- **Positional encoding**: Unlike RNNs, transformers have no inherent sense of token order, so they rely on positional encodings added to the embeddings to preserve information about the position of words in the sequence.

- **Layer normalization and feedforward layers**: After self-attention, each token representation is processed through additional layers for further transformation and refinement.

A transformer consists of multiple layers of self-attention and feedforward neural networks, with **multi-head attention** enabling the model to capture multiple relationships between tokens at different levels of abstraction.
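
The positional encoding mentioned above can be sketched as follows. This assumes the fixed sinusoidal variant from the original transformer paper; learned position embeddings are a common alternative. The function name `sinusoidal_positional_encoding` and the all-zero `token_embeddings` placeholder are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes an even d_model so that sine and cosine columns pair up
    positions = np.arange(seq_len)[:, np.newaxis]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)   # even dimensions
    encoding[:, 1::2] = np.cos(angles)   # odd dimensions
    return encoding

pe = sinusoidal_positional_encoding(seq_len=100, d_model=64)

# Placeholder for real token embeddings of the same shape
token_embeddings = np.zeros((100, 64))
inputs_with_position = token_embeddings + pe
print(pe.shape)
```

Every position receives a distinct pattern of values in [-1, 1], so two identical tokens at different positions no longer have identical input vectors.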

### How These Concepts Relate in the Transformer Pipeline:

1. **Embeddings as Input**: The transformer starts by converting input tokens into embeddings (e.g., word or sub-word embeddings). These embeddings are usually learned jointly with the rest of the model, although they can also be initialized from pre-trained vectors such as Word2Vec. Positional encodings are then added to the embeddings to introduce order information.
  
2. **Self-Attention Mechanism**: The input embeddings are passed through multiple self-attention layers, allowing each token to gather information from all other tokens in the sequence. Multi-head attention enables the model to learn different relationships between tokens.

3. **Contextual Embeddings as Output**: The output of a transformer layer is a **contextual embedding** for each token, meaning each token’s vector is updated based on its relationship with all the other tokens. These embeddings are used for downstream tasks like text classification, translation, or generating new text.
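
The three pipeline steps can be traced end to end in a few lines of NumPy. This is a simplified sketch under several assumptions: a single attention head, random stand-in weights instead of trained ones, a random stand-in positional signal, and a `layer_norm` helper without learned scale and shift parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, seq_len, d_model, d_ff = 50, 6, 8, 16   # illustrative sizes

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift parameters are omitted for brevity)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# 1. Embeddings as input: look up one vector per token ID
embedding_table = rng.normal(size=(vocab_size, d_model))
token_ids = rng.integers(0, vocab_size, size=seq_len)
x = embedding_table[token_ids]

#    Add a (random, stand-in) positional signal so that order matters
x = x + rng.normal(scale=0.1, size=(seq_len, d_model))

# 2. Self-attention (single head) with random stand-in weights
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
weights = softmax((x @ W_q) @ (x @ W_k).T / np.sqrt(d_model))
attn_out = weights @ (x @ W_v)
x = layer_norm(x + attn_out)              # residual connection + normalization

# 3. Position-wise feed-forward network, then contextual embeddings as output
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
ffn_out = np.maximum(0, x @ W1) @ W2      # ReLU between the two linear maps
contextual = layer_norm(x + ffn_out)

print(contextual.shape)   # one contextual vector per input token
```

The output has the same shape as the input embeddings, which is what allows several such layers to be stacked.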

### Conclusion

- **Embeddings** (like Word2Vec) provide a way to represent tokens in a continuous vector space. In transformers, they serve as the starting input representation.
- **Self-attention** enables the model to dynamically focus on the most relevant parts of the sequence when processing each token. Because it processes all tokens at once, it is a flexible and scalable alternative to the step-by-step processing of RNNs and LSTMs.
- **Transformers** integrate embeddings, self-attention, and multi-head attention into a robust architecture that processes sequences in parallel, allowing for efficient learning of complex relationships within data. This has made transformers the dominant model in NLP tasks and beyond.

## Transformer Implementation

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

In [None]:
# Define a simple Transformer Encoder layer class
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        """
        Initialize the Transformer encoder layer.
        - embed_dim: Dimension of the embedding space.
        - num_heads: Number of attention heads.
        - ff_dim: Hidden layer size in the feed-forward network.
        - rate: Dropout rate to prevent overfitting.
        """
        super().__init__()

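        # Define the multi-head attention layer
        # Note: key_dim is the size of each head's query/key projection;
        # embed_dim // num_heads is the conventional (smaller) choice
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)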

        # Define the feed-forward network: a two-layer MLP (Dense layers)
        self.ffn = tf.keras.Sequential([
            # First dense layer with ReLU activation
            layers.Dense(ff_dim, activation="relu"),
            # Second dense layer outputting the same dimensions as the input
            layers.Dense(embed_dim),
        ])

        # Define layer normalization to stabilize training
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)

        # Define dropout layers to prevent overfitting
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=False):
        """
        Forward pass for the Transformer encoder.
        - inputs: Input to the transformer encoder layer.
        - training: Whether the layer is in training mode
               (dropout applied) or inference mode.
        """
        # Apply multi-head attention to the inputs (self-attention):
        # the inputs serve as query and value; the key defaults to the value
        attn_output = self.attention(inputs, inputs)

        # Apply dropout during training
        attn_output = self.dropout1(attn_output, training=training)

        # Add and normalize (residual connection and layer normalization)
        out1 = self.layernorm1(inputs + attn_output)

        # Apply feed-forward network
        ffn_output = self.ffn(out1)

        # Apply dropout during training
        ffn_output = self.dropout2(ffn_output, training=training)

        # Add and normalize (residual connection and layer normalization)
        return self.layernorm2(out1 + ffn_output)

## Transformer Example

In [None]:
# Define a Transformer-based text classification model
def create_transformer_model(input_shape, embed_dim,
                             num_heads, ff_dim, num_classes):
    """
    Create a Transformer-based classification model.
    - input_shape: Shape of the input data
        (number of tokens in each sequence).
    - embed_dim: Dimension of the embedding.
    - num_heads: Number of attention heads in the Transformer encoder.
    - ff_dim: Feed-forward network dimension.
    - num_classes: Number of output classes for classification.
    """
    # Define the input layer. Expect sequences of integers (token IDs)
    inputs = layers.Input(shape=input_shape)

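    # Embed the input tokens using an embedding layer
    # (the vocabulary size is hard-coded to 5000, so token IDs must lie in
    # [0, 5000); no positional encoding is added in this simplified model)
    x = layers.Embedding(input_dim=5000, output_dim=embed_dim)(inputs)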

    # Pass the embeddings through the Transformer encoder layer
    x = TransformerEncoder(embed_dim, num_heads, ff_dim)(x)

    # Apply global average pooling to reduce the sequence to a
    # fixed size (averaging across tokens)
    x = layers.GlobalAveragePooling1D()(x)

    # Add a dense output layer with softmax activation for classification
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    # Create the Keras model
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    return model

## Transformer Application

### Training

In [None]:
# Define model parameters
embed_dim = 64  # Size of the token embeddings
num_heads = 4  # Number of attention heads
ff_dim = 128  # Hidden layer size in the feed-forward network
num_classes = 2  # Number of output classes (for binary classification)

In [None]:
# Create the model using the function defined above
model = create_transformer_model(input_shape=(100,),
            embed_dim=embed_dim, num_heads=num_heads,
            ff_dim=ff_dim, num_classes=num_classes)

In [None]:
# Compile the model with Adam optimizer,
# sparse categorical crossentropy loss, and accuracy metric
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [None]:
# Print the model summary to visualize the architecture
model.summary()

In [None]:
# Generate random input data (100 sequences of 100 token IDs) and random
# binary labels -> stand-ins for tokenized text and real class labels
X_train = np.random.randint(0, 5000, size=(100, 100))
y_train = np.random.randint(0, 2, size=(100,))

In [None]:
X_train[:10, :10]

In [None]:
y_train[:10]

In [None]:
# Train the model
%time model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=False)

In [None]:
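# Evaluate on the training data; with random labels this only measures
# how well the model has memorized the training set
model.evaluate(X_train, y_train)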

### Prediction

In [None]:
# Sample input (random token IDs standing in for a tokenized text)
sample_input = np.random.randint(0, 5000, size=(1, 100))

In [None]:
sample_input

In [None]:
# Predict the class
prediction = model.predict(sample_input)
predicted_class = np.argmax(prediction, axis=1)

print(f"Predicted class: {predicted_class[0]}")
