<center>
    <h1>Transformers</h1>
</center>

# Brief Recap of Transformers

*   Transformers are a type of neural network architecture introduced in 2017 by Vaswani et al. in the paper "Attention Is All You Need".

*   They revolutionized natural language processing (NLP) tasks by introducing the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element.

*   Transformers have several advantages over previous architectures:
    1. Parallelization: They can process entire sequences simultaneously, unlike recurrent neural networks (RNNs).
    2. Long-range dependencies: They can capture relationships between distant elements in a sequence more effectively.
    3. Scalability: Transformers can be trained on larger datasets and have led to the development of large language models.

*   Transformers have become the foundation for many state-of-the-art AI models, including GPT (Generative Pre-trained Transformer) series, BERT (Bidirectional Encoder Representations from Transformers), and their variants.

*   These models have significantly advanced the field of AI and continue to find new applications across various industries.

<center>
    <img src="static/img1.gif" alt="Transformers Example" style="width:50%;">
</center>

## Architecture of Transformers

*   The Transformer architecture consists of two main components: the encoder and the decoder. Here's an overview of the key elements:

    1. **Input Embedding**: Converts input tokens into continuous vector representations.

    2. **Positional Encoding**: Adds information about the position of each token in the sequence.

    3. **Multi-Head Attention**: The core component of Transformers, allowing the model to attend to different parts of the input sequence simultaneously.

    4. **Feed-Forward Networks**: Process the output of the attention layers.

    5. **Layer Normalization and Residual Connections**: Help stabilize training and allow for deeper networks.

    6. **Output Layer**: Produces the final output, often a probability distribution over possible tokens.

*   The encoder processes the input sequence, while the decoder generates the output sequence.

*   The attention mechanism in the decoder also attends to the encoder's output, allowing it to incorporate information from the input sequence when generating each output token.

<center>
    <img src="static/image2.webp" alt="Transformers Architecture" style="width:50%;">
</center>

## Applications of Transformers

Transformers have found wide-ranging applications across various domains:

1. **Natural Language Processing**:
   - Machine translation
   - Text summarization
   - Named entity recognition
   - Sentiment analysis
   - Question answering

2. **Computer Vision**:
   - Image classification
   - Object detection
   - Image generation

3. **Speech Recognition**: Converting audio signals to transcribed text.

4. **Multimodal Tasks**:
   - Image captioning
   - Visual question answering
   - Text-to-image generation (e.g., DALL-E)

5. **Biological Sequence Analysis**: Analyzing DNA and protein sequences.

6. **Time Series Prediction**: Forecasting in various domains, including finance and weather.

7. **Code Generation**: Writing computer code based on natural language requirements.

8. **Recommendation Systems**: Providing personalized recommendations.

9. **Music Generation**: Creating original musical compositions.

10. **Robotics**: Improving robot control and decision-making processes.

11. **Game Playing**: Evaluating chess board positions and other game-related tasks.


# Implementing Transformers with TensorFlow

```python
# importing necessary libraries
import tensorflow as tf
import numpy as np
```

## Implement Positional Encoding

The `positional_encoding` function generates positional encodings for input sequences, which help the transformer model understand the order of the tokens.

- **Inputs:**
  - position: An integer giving the number of positions to encode, i.e. the maximum sequence length. The function produces one encoding for each index from 0 to `position - 1`.
  - d_model: An integer representing the dimensionality of the model (e.g., embedding size).

- **Process:**
  1. The `get_angles` function computes the angle for each position `pos` and dimension index `i` as `pos / 10000^(2 * (i // 2) / d_model)`, so each pair of dimensions oscillates at a different frequency and every position gets a distinct encoding.
  2. The angle values are calculated for all positions and dimensions using NumPy operations.
  3. The sine function is applied to the even indices, while the cosine function is applied to the odd indices to create the final positional encodings.
  4. The output is reshaped to include a batch dimension and cast to a TensorFlow float32 tensor.

- **Output:**
  - The function returns a TensorFlow tensor containing the positional encodings, which can be added to the input embeddings to incorporate position information into the model.
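
For reference, the steps above correspond to the sinusoidal encoding defined in "Attention Is All You Need". Here $pos$ is the position index and $i$ indexes pairs of dimensions, so columns $2i$ and $2i+1$ share the same frequency:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$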

```python
def positional_encoding(position, d_model):
    """
    Generate sinusoidal positional encodings for positions 0 .. position - 1.

    Args:
        position (int): Number of positions to encode (the maximum sequence length).
        d_model (int): The dimension of the model.

    Returns:
        tf.Tensor: float32 tensor of shape (1, position, d_model).
    """
    def get_angles(pos, i, d_model):
        angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
        return pos * angle_rates

    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)

    # Apply sin to even indices in the array
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # Apply cos to odd indices in the array
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Add a leading batch dimension: (1, position, d_model)
    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)
```
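
As a quick sanity check, here is a NumPy-only sketch of the same computation. The name `sinusoidal_table` is illustrative and not part of the notebook, and the batch dimension is left out for brevity. At position 0 every angle is 0, so the even columns hold sin(0) = 0 and the odd columns hold cos(0) = 1.

```python
import numpy as np

def sinusoidal_table(num_positions, d_model):
    """NumPy-only sketch of the sinusoidal encoding; returns (num_positions, d_model)."""
    pos = np.arange(num_positions)[:, np.newaxis]   # (num_positions, 1)
    i = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)

    table = np.zeros((num_positions, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])        # even columns
    table[:, 1::2] = np.cos(angles[:, 1::2])        # odd columns
    return table

table = sinusoidal_table(num_positions=50, d_model=8)
print(table.shape)   # (50, 8)
print(table[0])      # position 0: zeros in even columns, ones in odd columns
```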

## Define the Scaled Dot Product Attention

The `scaled_dot_product_attention` function computes the attention output and attention weights using the scaled dot-product attention mechanism.

- **Inputs:**
  - q (query): A tensor of shape `(..., seq_len_q, depth)` representing the query vectors.
  - k (key): A tensor of shape `(..., seq_len_k, depth)` representing the key vectors.
  - v (value): A tensor of shape `(..., seq_len_v, depth_v)` representing the value vectors.
  - mask: A tensor used to prevent attention to certain positions, with a shape that can be broadcasted to `(..., seq_len_q, seq_len_k)`.

- **Process:**
  1. **Matrix Multiplication:** The query (`q`) is multiplied with the key (`k`) transposed to get the raw attention scores (`matmul_qk`), which have the shape `(..., seq_len_q, seq_len_k)`.
  2. **Scaling:** The attention scores are scaled by dividing by the square root of the depth of the key (`dk`), which helps stabilize gradients during training.
  3. **Masking:** If a mask is provided, it is multiplied by a very large negative value (`-1e9`) and added to the scaled attention scores. The mask holds 1 at positions to block and 0 elsewhere, so blocked positions receive an attention weight of effectively zero after softmax.
  4. **Softmax:** The softmax function is applied to the scaled scores along the last axis (`seq_len_k`) to obtain the attention weights, which sum to 1 for each query.
  5. **Output Calculation:** The attention weights are multiplied by the value (`v`) to generate the final output, which captures the relevant information from the values based on the attention weights.

- **Output:**
  - The function returns the attention output and the computed attention weights. The output shape is `(..., seq_len_q, depth_v)`, while the attention weights shape is `(..., seq_len_q, seq_len_k)`.
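
Written compactly, the steps above compute the following, where $M$ stands for the additive mask term (`mask * -1e9` in the code, or zero when no mask is given):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V
$$

The scaling has a simple motivation: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. Dividing by $\sqrt{d_k}$ brings the variance back to 1, which keeps the softmax away from regions where its gradients become very small.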

```python
def scaled_dot_product_attention(q, k, v, mask):
    """
    Calculate the attention weights.

    Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable to (..., seq_len_q, seq_len_k)

    Returns:
    output, attention_weights
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # Scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # Add the mask to the scaled tensor
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Softmax is normalized on the last axis (seq_len_k)
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights
```
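
To see the effect of the mask on a small example, the sketch below re-implements the same steps in plain NumPy. The names `softmax` and `attention_np` are illustrative helpers, not part of the notebook, and the mask convention (1 = blocked, 0 = visible) is assumed from the `mask * -1e9` term above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_np(q, k, v, mask=None):
    """NumPy sketch of scaled dot-product attention (mask: 1 = blocked, 0 = visible)."""
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))          # 2 queries, depth 4
k = rng.normal(size=(3, 4))          # 3 keys, depth 4
v = rng.normal(size=(3, 5))          # 3 values, depth_v 5
mask = np.array([[0.0, 0.0, 1.0]])   # block the third key for every query

output, weights = attention_np(q, k, v, mask)
print(output.shape, weights.shape)   # (2, 5) (2, 3)
print(weights.sum(axis=-1))          # each row sums to 1
print(weights[:, 2])                 # effectively 0 for the blocked key
```

The blocked key still takes part in the matrix multiplication, but its weight is so close to zero that its value vector contributes nothing to the output.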

## Implement Multi-Head Attention

The `MultiHeadAttention` class implements the multi-head attention mechanism used in transformer models, allowing the model to focus on different parts of the input sequence simultaneously.

- **Initialization (`__init__` method):**
  - **Parameters:**
    - d_model: The dimensionality of the model (embedding size).
    - num_heads: The number of attention heads.
  - The class asserts that `d_model` is divisible by `num_heads`, ensuring that each head has an equal share of the dimensionality.
  - depth: The dimensionality of each attention head, calculated as `d_model // num_heads`.
  - Four dense layers are defined for transforming the query (`wq`), key (`wk`), value (`wv`), and the output (`dense`).

- **Splitting Heads (`split_heads` method):**
  - This method reshapes the input tensor `x` to split the last dimension into multiple heads and permutes the dimensions to have the shape `(batch_size, num_heads, seq_len, depth)`. This allows parallel processing of different attention heads.

- **Call Method (`call` method):**
  - **Inputs:**
    - `v`, `k`, `q`: Value, key, and query tensors.
    - `mask`: An optional tensor to mask certain positions.
  - The method retrieves the batch size and applies the dense layers to the input tensors `q`, `k`, and `v`, transforming them into the model dimension.
  - The transformed tensors are split into multiple heads using the `split_heads` method.
  - The `scaled_dot_product_attention` function is called with the split queries, keys, values, and mask, returning the scaled attention output and attention weights.
  - The attention output is transposed and reshaped to combine the attention heads.
  - The final output is computed by passing the concatenated attention through the output dense layer.

- **Output:**
  - The method returns the attention output (shape: `(batch_size, seq_len_q, d_model)`) and the attention weights (shape: `(batch_size, num_heads, seq_len_q, seq_len_k)`), which indicate the importance of each key for each query in the multi-head attention mechanism.

```python
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)."""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights
```
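
The reshape and transpose in `split_heads`, and the reverse steps that recombine the heads in `call`, can be traced on a small array. The sketch below uses NumPy instead of TensorFlow and arbitrary illustrative sizes. It shows that splitting and recombining is lossless, and that head `h` simply sees the feature slice `h * depth : (h + 1) * depth` of every position.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 8, 4
depth = d_model // num_heads   # 2 features per head

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch_size, -1, num_heads, depth).transpose(0, 2, 1, 3)

# Combine: undo the transpose, then merge the last two axes back into d_model
combined = heads.transpose(0, 2, 1, 3).reshape(batch_size, -1, d_model)

print(heads.shape)                                      # (2, 4, 5, 2)
print(np.array_equal(combined, x))                      # True
print(np.array_equal(heads[0, 1, 3], x[0, 3, 2:4]))     # True: head 1 holds features 2..3
```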

## Implement Feed-Forward Network

The `point_wise_feed_forward_network` function creates a feed-forward neural network that processes each position in the input independently and identically, as used in transformer models.

- **Parameters:**
  - d_model: The dimensionality of the model (input/output size).
  - dff: The dimensionality of the feed-forward network's hidden layer (intermediate size).

- **Process:**
  - The function returns a `tf.keras.Sequential` model consisting of two dense layers:
    1. The first dense layer transforms the input from `d_model` to `dff` dimensions using the ReLU activation function. This layer captures non-linear relationships in the data.
    2. The second dense layer transforms the output back from `dff` to `d_model` dimensions, ensuring that the final output maintains the same shape as the original input.

- **Output:**
  - The resulting feed-forward network processes inputs of shape `(batch_size, seq_len, d_model)` and outputs tensors of the same shape `(batch_size, seq_len, d_model)`, with the intermediate representation shaped `(batch_size, seq_len, dff)`. This structure allows each position in the sequence to be processed independently while preserving the overall dimensionality of the model.

```python
def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])
```
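
"Point-wise" means the same two dense layers are applied to every position separately, with no mixing across the sequence. A minimal NumPy sketch of this property follows. The random weights merely stand in for the two `Dense` layers and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dff, seq_len = 4, 16, 3

# Randomly initialised weights stand in for the two Dense layers.
w1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
w2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(x @ w1 + b1, 0.0)   # Dense(dff, activation='relu')
    return hidden @ w2 + b2                 # Dense(d_model)

x = rng.normal(size=(1, seq_len, d_model))

whole = ffn(x)   # all positions in one call
per_position = np.stack([ffn(x[:, t]) for t in range(seq_len)], axis=1)

print(whole.shape)                        # (1, 3, 4)
print(np.allclose(whole, per_position))   # True
```

Running the network on the whole sequence gives the same result as running it on each position in turn, which is why the layer can be parallelised over positions.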

## Implement Encoder Layer

The `EncoderLayer` class implements a single layer of the encoder in a transformer model, combining multi-head attention and feed-forward neural networks with normalization and dropout to enhance learning.

- **Initialization (`__init__` method):**
  - **Parameters:**
    - d_model: The dimensionality of the model (embedding size).
    - num_heads: The number of attention heads for multi-head attention.
    - dff: The dimensionality of the feed-forward network's hidden layer.
    - rate: The dropout rate, with a default value of 0.1.
  - The class initializes:
    - mha: An instance of the `MultiHeadAttention` class for performing attention.
    - ffn: A point-wise feed-forward network created using the `point_wise_feed_forward_network` function.
    - Two layer normalization layers (`layernorm1` and `layernorm2`) to stabilize training.
    - Two dropout layers (`dropout1` and `dropout2`) to prevent overfitting.

- **Call Method (`call` method):**
  - **Inputs:**
    - `x`: The input tensor of shape `(batch_size, input_seq_len, d_model)`.
    - `training`: A boolean indicating whether the model is in training mode (used for dropout).
    - `mask`: An optional tensor to mask certain positions in the input.
  - The method performs the following steps:
    1. **Multi-Head Attention:** Computes the attention output using the input `x` as query, key, and value. The output shape is `(batch_size, input_seq_len, d_model)`.
    2. **Dropout:** Applies dropout to the attention output to reduce overfitting.
    3. **Add & Norm:** Adds the original input `x` to the attention output and applies layer normalization (`layernorm1`). This residual connection helps in training deep networks.
    4. **Feed-Forward Network:** Passes the normalized output through the feed-forward network (`ffn_output`), retaining the shape `(batch_size, input_seq_len, d_model)`.
    5. **Dropout:** Applies dropout to the feed-forward network output.
    6. **Add & Norm:** Adds the output of the feed-forward network to the result from the first normalization step and applies another layer normalization (`layernorm2`).

- **Output:**
  - The method returns the final output of the encoder layer, which has the same shape as the input: `(batch_size, input_seq_len, d_model)`. This output can then be fed into subsequent layers of the transformer model, maintaining the rich contextual information learned through attention and feed-forward processing.

```python
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2
```
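
The "Add & Norm" pattern used twice in this layer can be sketched in NumPy as follows. This is a simplified illustration: `layer_norm` is a hypothetical helper that omits the learned scale and offset of `tf.keras.layers.LayerNormalization` (which start at 1 and 0 by default), and random arrays stand in for the real sub-layer outputs.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalise over the last axis; the learned scale and offset are omitted here."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))              # (batch_size, seq_len, d_model)
sublayer_out = rng.normal(size=(2, 5, 8))   # stand-in for attention or FFN output

out = layer_norm(x + sublayer_out)          # residual connection, then normalisation

print(out.shape)                                        # (2, 5, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))   # True
```

Each position is normalised independently over its `d_model` features, so the statistics do not depend on the batch or on the other positions in the sequence.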

## Implement Decoder Layer

The `DecoderLayer` class implements a single layer of the decoder in a transformer model. It combines two multi-head attention mechanisms with a feed-forward neural network, incorporating layer normalization and dropout for stability and regularization.

- **Initialization (`__init__` method):**
  - **Parameters:**
    - d_model: The dimensionality of the model (embedding size).
    - num_heads: The number of attention heads for multi-head attention.
    - dff: The dimensionality of the feed-forward network's hidden layer.
    - rate: The dropout rate, defaulting to 0.1.
  - The class initializes:
    - mha1: The first multi-head attention mechanism, which is responsible for self-attention within the decoder.
    - mha2: The second multi-head attention mechanism, which performs cross-attention with the encoder's output.
    - ffn: A point-wise feed-forward network created using the `point_wise_feed_forward_network` function.
    - Three layer normalization layers (`layernorm1`, `layernorm2`, `layernorm3`) to stabilize training.
    - Three dropout layers (`dropout1`, `dropout2`, `dropout3`) to prevent overfitting.

- **Call Method (`call` method):**
  - **Inputs:**
    - `x`: The input tensor of shape `(batch_size, target_seq_len, d_model)` representing the decoder's current input.
    - `enc_output`: The encoder's output of shape `(batch_size, input_seq_len, d_model)`.
    - `training`: A boolean indicating whether the model is in training mode (used for dropout).
    - `look_ahead_mask`: A mask to prevent attention to future tokens in the target sequence.
    - `padding_mask`: A mask to ignore padding tokens in the encoder's output.
  - The method performs the following steps:
    1. **Self-Attention (First MHA):** Computes self-attention on `x` using `mha1`, applying the look-ahead mask to ensure that the model can only attend to the current and previous tokens. The output shape is `(batch_size, target_seq_len, d_model)`.
    2. **Dropout & Add & Norm:** Applies dropout to the attention output, then adds the original input `x` and applies layer normalization (`layernorm1`).
    3. **Cross-Attention (Second MHA):** Computes attention with `mha2`, taking the queries from the output of the first attention block (`out1`) and the keys and values from the encoder output (`enc_output`), with the padding mask applied. This layer allows the decoder to focus on relevant parts of the input sequence while generating the output.
    4. **Dropout & Add & Norm:** Applies dropout to the second attention output, adds the output from the first normalization step, and applies another layer normalization (`layernorm2`).
    5. **Feed-Forward Network:** Passes the output through the feed-forward network (`ffn_output`), retaining the shape `(batch_size, target_seq_len, d_model)`.
    6. **Dropout & Add & Norm:** Applies dropout to the feed-forward output, adds it to the output from the second normalization step, and applies a final layer normalization (`layernorm3`).

- **Output:**
  - The method returns:
    - `out3`: The final output of the decoder layer, shaped `(batch_size, target_seq_len, d_model)`, which can be used as input for subsequent decoder layers.
    - `attn_weights_block1`: The attention weights from the first multi-head attention layer (self-attention).
    - `attn_weights_block2`: The attention weights from the second multi-head attention layer (cross-attention), allowing for visualization of how the decoder attends to the encoder's outputs.

```python
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training,
             look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)

        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(
            enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2
```
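
The two masks passed to this layer are not constructed in the code above. One plausible way to build them is sketched below in NumPy. The helper names, the use of token id 0 for padding, and the convention that 1 marks a blocked position (consistent with the `mask * -1e9` term in the attention function) are all assumptions made for illustration.

```python
import numpy as np

def look_ahead_mask(size):
    """1 above the diagonal (future positions, blocked), 0 elsewhere."""
    return 1.0 - np.tril(np.ones((size, size)))

def padding_mask(token_ids, pad_id=0):
    """1 where the token is padding; shaped (batch, 1, 1, seq_len) for broadcasting."""
    mask = (token_ids == pad_id).astype(np.float64)
    return mask[:, np.newaxis, np.newaxis, :]

print(look_ahead_mask(3))
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]

ids = np.array([[7, 6, 0, 0]])    # one sequence with two padding tokens
print(padding_mask(ids).shape)    # (1, 1, 1, 4)
```

Row `t` of the look-ahead mask leaves only positions `0..t` visible. The two extra axes in the padding mask let it broadcast over the head and query dimensions of the attention logits.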

## Implement Encoder

The `Encoder` class implements the encoder part of the Transformer architecture. It consists of multiple layers of the `EncoderLayer`, with embedding and positional encoding applied to the input sequence. Here’s a breakdown of its components and functionality:

- **Initialization (`__init__` method):**

  - **Parameters:**
    - `num_layers`: Number of encoder layers to stack.
    - `d_model`: Dimensionality of the model (embedding size).
    - `num_heads`: Number of attention heads in multi-head attention.
    - `dff`: Dimensionality of the feed-forward network's hidden layer.
    - `input_vocab_size`: Size of the input vocabulary for the embedding layer.
    - `maximum_position_encoding`: The maximum length of the input sequence to define positional encoding.
    - `rate`: Dropout rate, defaulting to 0.1.
  
  - **Attributes:**
    - `self.embedding`: An embedding layer to convert input token indices into dense vectors of shape `(batch_size, input_seq_len, d_model)`.
    - `self.pos_encoding`: Positional encoding added to the input embeddings to retain the sequence order. This is computed using the `positional_encoding` function.
    - `self.enc_layers`: A list of `EncoderLayer` instances created based on `num_layers`.
    - `self.dropout`: A dropout layer applied to the output of the embedding and positional encoding.

- **Call Method (`call` method):**

  - **Inputs:**
    - `x`: The input tensor of shape `(batch_size, input_seq_len)` representing token indices.
    - `training`: A boolean indicating whether the model is in training mode (used for dropout).
    - `mask`: A mask to prevent attention to certain positions (e.g., padding tokens).

  - **Functionality:**
    1. **Get Sequence Length:** The method retrieves the sequence length from the input `x` using `tf.shape(x)[1]`.
    
    2. **Embedding and Positional Encoding:**
      - The input tokens are converted into embeddings: `x = self.embedding(x)` gives a shape of `(batch_size, input_seq_len, d_model)`.
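      - The embeddings are multiplied by the square root of `d_model`, as in the original paper. A common explanation is that this keeps the embedding values on a scale comparable to the positional encodings, whose entries lie in [-1, 1], so the position signal does not dominate the sum.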
      - Positional encoding is added: `x += self.pos_encoding[:, :seq_len, :]` allows the model to incorporate the position of each token in the sequence.
    
    3. **Dropout:** The dropout layer is applied to the combined embeddings and positional encodings: `x = self.dropout(x, training=training)`.

    4. **Pass Through Encoder Layers:** The input `x` is sequentially passed through each of the `EncoderLayer` instances: `x = self.enc_layers[i](x, training, mask)`. Each layer performs multi-head attention and feed-forward operations.

- **Output:**
  - The final output `x` has a shape of `(batch_size, input_seq_len, d_model)`, representing the encoded input sequence with learned features from the encoder layers.

```python
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]

        # Adding embedding and position encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)
```
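
The slice `self.pos_encoding[:, :seq_len, :]` has shape `(1, seq_len, d_model)` and relies on broadcasting to reach every sequence in the batch. The NumPy sketch below illustrates this. Random arrays stand in for the embedding output and the precomputed encoding table, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, d_model, max_positions = 2, 5, 8, 50

embedded = rng.normal(size=(batch_size, seq_len, d_model))    # stand-in for self.embedding(x)
pos_encoding = rng.normal(size=(1, max_positions, d_model))   # stand-in for the precomputed table

scaled = embedded * np.sqrt(d_model)        # scale the embeddings
x = scaled + pos_encoding[:, :seq_len, :]   # (1, seq_len, d_model) broadcasts over the batch

print(x.shape)                                    # (2, 5, 8)
# Every sequence in the batch receives the same positional offsets.
print(np.allclose(x[0] - scaled[0], x[1] - scaled[1]))   # True
```

Because the table is computed once for `maximum_position_encoding` positions, any input no longer than that can reuse it by slicing.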

## Implement Decoder

The `Decoder` class implements the decoder part of the Transformer architecture. It consists of multiple layers of the `DecoderLayer`, with embedding and positional encoding applied to the target sequence. Here's a breakdown of its components and functionality:

- **Initialization (`__init__` method):**

  - **Parameters:**
    - `num_layers`: Number of decoder layers to stack.
    - `d_model`: Dimensionality of the model (embedding size).
    - `num_heads`: Number of attention heads in multi-head attention.
    - `dff`: Dimensionality of the feed-forward network's hidden layer.
    - `target_vocab_size`: Size of the target vocabulary for the embedding layer.
    - `maximum_position_encoding`: The maximum length of the target sequence to define positional encoding.
    - `rate`: Dropout rate, defaulting to 0.1.

  - **Attributes:**
    - `self.embedding`: An embedding layer to convert target token indices into dense vectors of shape `(batch_size, target_seq_len, d_model)`.
    - `self.pos_encoding`: Positional encoding added to the input embeddings to retain the sequence order. This is computed using the `positional_encoding` function.
    - `self.dec_layers`: A list of `DecoderLayer` instances created based on `num_layers`.
    - `self.dropout`: A dropout layer applied to the output of the embedding and positional encoding.

- **Call Method (`call` method):**

  - **Inputs:**
    - `x`: The input tensor of shape `(batch_size, target_seq_len)` representing token indices from the target sequence.
    - `enc_output`: The output from the encoder, which contains the encoded representations of the input sequence.
    - `training`: A boolean indicating whether the model is in training mode (used for dropout).
    - `look_ahead_mask`: A mask to prevent attention to future tokens in the target sequence.
    - `padding_mask`: A mask to prevent attention to padding tokens in the encoder output.

  - **Functionality:**
    1. **Get Sequence Length:** The method retrieves the sequence length from the input `x` using `tf.shape(x)[1]`.
    
    2. **Embedding and Positional Encoding:**
      - The input tokens are converted into embeddings: `x = self.embedding(x)` gives a shape of `(batch_size, target_seq_len, d_model)`.
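      - The embeddings are multiplied by the square root of `d_model`, as in the original paper. A common explanation is that this keeps the embedding values on a scale comparable to the positional encodings, whose entries lie in [-1, 1], so the position signal does not dominate the sum.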
      - Positional encoding is added: `x += self.pos_encoding[:, :seq_len, :]` allows the model to incorporate the position of each token in the sequence.
    
    3. **Dropout:** The dropout layer is applied to the combined embeddings and positional encodings: `x = self.dropout(x, training=training)`.

    4. **Pass Through Decoder Layers:** The input `x` is sequentially passed through each of the `DecoderLayer` instances. During this process, both the encoder output (`enc_output`) and the target sequence's output from the previous layer are used:
      - Each decoder layer returns both the output of the layer and the attention weights from both multi-head attention blocks.
      - The attention weights for each layer are stored in the `attention_weights` dictionary:
        ```python
        attention_weights[f'decoder_layer{i+1}_block1'] = block1
        attention_weights[f'decoder_layer{i+1}_block2'] = block2
        ```

- **Output:**
  - The final output `x` has a shape of `(batch_size, target_seq_len, d_model)`, representing the decoded target sequence with learned features from the decoder layers.
  - The `attention_weights` dictionary contains the attention weights for each layer, which can be useful for visualization or understanding model behavior.

```python
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size, maximum_position_encoding, rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                                   look_ahead_mask, padding_mask)

            attention_weights[f'decoder_layer{i+1}_block1'] = block1
            attention_weights[f'decoder_layer{i+1}_block2'] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights
```
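
To make the naming scheme of the returned dictionary concrete, the short sketch below reproduces only the key construction for a two-layer decoder. The placeholder strings stand in for the real weight tensors and are purely illustrative.

```python
num_layers = 2
attention_weights = {}

for i in range(num_layers):
    # Placeholder strings stand in for the real weight tensors.
    attention_weights[f'decoder_layer{i+1}_block1'] = 'self-attention weights'
    attention_weights[f'decoder_layer{i+1}_block2'] = 'cross-attention weights'

for name in attention_weights:
    print(name)
# decoder_layer1_block1
# decoder_layer1_block2
# decoder_layer2_block1
# decoder_layer2_block2

# The cross-attention weights of the last layer are often the ones inspected.
last_cross = attention_weights[f'decoder_layer{num_layers}_block2']
```

Layer numbering starts at 1, and `block2` always refers to the cross-attention over the encoder output.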

## Implement Transformer

The `Transformer` class encapsulates the entire Transformer architecture, combining the encoder and decoder to facilitate tasks such as machine translation, text summarization, and other sequence-to-sequence applications. Here is a detailed breakdown of its components and functionality:

- Initialization (`__init__` Method)

  - **Parameters:**
    - `num_layers`: Number of encoder and decoder layers in the Transformer.
    - `d_model`: Dimensionality of the model (size of the embedding and output).
    - `num_heads`: Number of attention heads for the multi-head attention mechanism.
    - `dff`: Dimensionality of the feed-forward network’s hidden layer.
    - `input_vocab_size`: Size of the input vocabulary, which is used for embedding the input tokens.
    - `target_vocab_size`: Size of the target vocabulary for embedding the output tokens.
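    - `pe_input`: Maximum number of positions for which positional encodings are precomputed for the input sequence; input sequences must not be longer than this.
    - `pe_target`: Maximum number of positions for which positional encodings are precomputed for the target sequence.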
    - `rate`: Dropout rate (defaulting to 0.1).

  - **Attributes:**
    - `self.encoder`: An instance of the `Encoder` class, initialized with the parameters specified.
    - `self.decoder`: An instance of the `Decoder` class, initialized with the relevant parameters.
    - `self.final_layer`: A dense layer that converts the decoder output into logits for each token in the target vocabulary.

- Call Method (`call` Method)

  - **Inputs:**
    - `inp`: The input tensor of shape `(batch_size, inp_seq_len)` representing the source sequence.
    - `tar`: The target tensor of shape `(batch_size, tar_seq_len)` representing the target sequence (usually shifted right).
    - `training`: A boolean indicating whether the model is in training mode (used for dropout).
    - `enc_padding_mask`: A mask to prevent attention to padding tokens in the encoder input.
    - `look_ahead_mask`: A mask to prevent attention to future tokens in the decoder input.
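    - `dec_padding_mask`: A mask passed to the decoder as `padding_mask`. It is intended for the decoder's second (encoder-decoder) attention block, where it prevents attention to padding positions in the encoder output.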

  - **Functionality:**
    1. **Encoder Output:** The input sequence is passed through the encoder:
      ```python
      enc_output = self.encoder(inp, training, enc_padding_mask)
      ```
      This returns the encoded representation of the input, with the shape `(batch_size, inp_seq_len, d_model)`.

    2. **Decoder Output:** The target sequence is then passed through the decoder along with the encoder output:
      ```python
      dec_output, attention_weights = self.decoder(
          tar, enc_output, training, look_ahead_mask, dec_padding_mask)
      ```
      This returns the decoder output of shape `(batch_size, tar_seq_len, d_model)` and a dictionary of attention weights for each decoder layer.

    3. **Final Output:** The decoder output is then passed through the final dense layer to generate logits for each token in the target vocabulary:
      ```python
      final_output = self.final_layer(dec_output)
      ```
      The shape of `final_output` is `(batch_size, tar_seq_len, target_vocab_size)`.

  - **Output:** The method returns two values:
    - `final_output`: The logits for each token in the target vocabulary, which can be used for loss calculation or predictions.
    - `attention_weights`: A dictionary containing the attention weights from each decoder layer, useful for visualizations and understanding model behavior.
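
To make "logits ... used for predictions" concrete, the following sketch turns a small, made-up logits matrix into probabilities and greedy token predictions. The values and the helper `softmax` are illustrative assumptions, not output of the model above; in TensorFlow the same step would normally be done with tensor operations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sequence: 3 target positions, vocabulary of 4 tokens.
final_output = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 3.0, 0.0],
    [-0.5, 1.5, 1.0, 0.2],
]

probs = [softmax(step) for step in final_output]
# Greedy decoding: pick the highest-scoring token id at each position.
predicted_ids = [max(range(len(step)), key=step.__getitem__) for step in final_output]
print(predicted_ids)  # [0, 2, 1]
```

Softmax does not change which entry is largest, so taking the argmax of the logits gives the same prediction as taking the argmax of the probabilities.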

In [None]:
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                               target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask,
             look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights
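
The target passed as `tar` is "usually shifted right". A minimal sketch of that preparation, using plain lists instead of tensors, is shown below. The token ids and the special ids for start (1), end (2), and padding (0) are assumptions chosen for illustration.

```python
# Hypothetical token ids; 1 = start token, 2 = end token, 0 = padding (assumed ids).
tar = [
    [1, 37, 52, 9, 2],
    [1, 14, 8, 2, 0],
]

tar_inp = [seq[:-1] for seq in tar]   # what the decoder receives as input
tar_real = [seq[1:] for seq in tar]   # what it should predict at each position

print(tar_inp[0])   # [1, 37, 52, 9]
print(tar_real[0])  # [37, 52, 9, 2]
```

At position `i` the decoder sees `tar_inp[:i + 1]` and is trained to output `tar_real[i]`, which is why the look-ahead mask is needed to hide later tokens.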

# Let's Build a Real-World Project to Understand Transformers Better

# News Article Classification using a Transformer

## Problem Description

We aim to build a news article classification model using a Transformer architecture to categorize articles into predefined categories. This project will demonstrate the effectiveness of Transformer models in capturing contextual information from text data for multi-class classification of news articles.

## Dataset Description

- The AG News dataset consists of 127,600 news articles, split into 120,000 training and 7,600 testing samples.

- Each article is labeled as one of four categories: World, Sports, Business, or Sci/Tech.

- The dataset contains the title and description of each news article.

- Key features of the dataset:
  - 127,600 news articles (120,000 for training, 7,600 for testing)
  - Multi-class classification (4 categories)
  - Raw text data including both title and description
  - Balanced distribution across categories


- For more information about the AG News dataset, you can visit the following link: [AG News Dataset](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset)


In [18]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

## Loading the AG News dataset

We use `tfds.load()` to load the `ag_news_subset` dataset from TensorFlow Datasets. `split=['train', 'test']` requests the training and test splits, `with_info=True` returns the dataset metadata along with the data, and `as_supervised=True` ensures that we get `(input, label)` pairs.

In [19]:
# Load the AG News dataset
(train_data, test_data), info = tfds.load('ag_news_subset', split=['train', 'test'], with_info=True, as_supervised=True)

In [20]:
info

tfds.core.DatasetInfo(
    name='ag_news_subset',
    full_name='ag_news_subset/1.0.0',
    description="""
    AG is a collection of more than 1 million news articles. News articles have been
    gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of
    activity. ComeToMyHead is an academic news search engine which has been running
    since July, 2004. The dataset is provided by the academic comunity for research
    purposes in data mining (clustering, classification, etc), information retrieval
    (ranking, search, etc), xml, data compression, data streaming, and any other
    non-commercial activity. For more information, please refer to the link
    http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
    
    The AG's news topic classification dataset is constructed by Xiang Zhang
    (xiang.zhang@nyu.edu) from the dataset above. It is used as a text
    classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann
    LeCu

In [21]:
# Visualize the sample dataset
for text, label in train_data.take(5):
    print(f"Text: {text.numpy().decode('utf-8')}")
    print(f"Label: {label.numpy()}")
    print('-------------------------------------')

Text: AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.
Label: 3
-------------------------------------
Text: Reuters - Major League Baseball\Monday announced a decision on the appeal filed by Chicago Cubs\pitcher Kerry Wood regarding a suspension stemming from an\incident earlier this season.
Label: 1
-------------------------------------
Text: President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.
Label: 2
-------------------------------------
Text: Britain will run out of leading scientists unless science education is improved, says Professor Colin Pillinger.
Label: 3
-------------------------------------
Text: London, England (Sports Network) - England midfielder Steven Gerrard injured his groin late in Th

## Data Preprocessing


The following code sets up a `TextVectorization` layer from TensorFlow, which is used to preprocess the text data before feeding it into the Transformer model.

- Vocabulary Size (vocab_size): This parameter caps the vocabulary at 10,000 entries. In `int` output mode two of those entries are reserved (index 0 for padding and index 1 for out-of-vocabulary words), so the layer keeps the 9,998 most frequent words in the dataset. Any other word is mapped to the out-of-vocabulary token.

- Maximum Sequence Length (max_length): This value specifies the maximum number of tokens in each sequence. Sequences shorter than this length will be padded, and longer sequences will be truncated to fit this specified length.

- TextVectorization Layer: This layer transforms the raw text into a sequence of integers. Each integer corresponds to a token in the vocabulary, making it possible for the model to process the data numerically. With its default settings it lowercases the text and strips punctuation, splits the result on whitespace, and then converts each word into its token index.

In [22]:
# Preprocess the data
vocab_size = 10000
max_length = 100

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=max_length
)
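
To show what the layer does to a single string, here is a simplified pure-Python imitation of the standardize, tokenize, look up, and pad/truncate steps. It is only a sketch: the helpers `standardize` and `to_ids` and the toy vocabulary are hypothetical, and the real layer builds its vocabulary from the training data.

```python
import string


def standardize(text):
    """Lowercase and strip ASCII punctuation (approximating the layer's default)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def to_ids(text, vocab, max_length):
    """Map whitespace tokens to ids, then truncate or pad with 0 to max_length."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]  # 1 = OOV
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))  # 0 = padding


# Hypothetical toy vocabulary; ids 0 and 1 are reserved for padding and OOV.
toy_vocab = {"the": 2, "market": 3, "rose": 4}

print(to_ids("The market rose sharply!", toy_vocab, max_length=6))
# [2, 3, 4, 1, 0, 0]
```

The unknown word "sharply" becomes id 1, and the two trailing zeros are padding up to `max_length`.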

## Adapting the TextVectorization Layer to Training Data


The following code snippet demonstrates how to adapt the `TextVectorization` layer to the training data in order to build a vocabulary based on the dataset.

- Training Text Extraction:

  - train_text = train_data.map(lambda text, label: text) extracts only the text component from the training dataset. The use of lambda text, label: text ensures that labels are ignored during this operation, as the focus is on analyzing the text itself.

- Layer Adaptation:

  - The vectorize_layer.adapt(train_text) command tunes the TextVectorization layer to the vocabulary present in the training data. During this process, the layer scans through the dataset and builds a list of the most frequent words, limiting the vocabulary to the specified vocab_size.
  - This adaptation step ensures that the layer is familiar with the words and their relative frequencies in the training set, enabling it to accurately transform text into integer sequences based on the built vocabulary.

In [23]:
# Adapt the layer to the training data
train_text = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_text)
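
Conceptually, `adapt` counts token frequencies and keeps the most common ones. The sketch below mimics that idea on a three-sentence corpus. The function `build_vocab` is a hypothetical stand-in, and it ignores details of the real layer such as punctuation stripping and tie-breaking between equally frequent words.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Return token -> id, most frequent first; ids 0 and 1 are reserved."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Two slots go to padding and out-of-vocabulary, so keep max_tokens - 2 words.
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx for idx, tok in enumerate(kept, start=2)}


corpus = ["stocks rise as markets rally", "markets fall", "markets rally again"]
vocab = build_vocab(corpus, max_tokens=4)
print(vocab)  # {'markets': 2, 'rally': 3}
```

With `max_tokens=4` only two real words fit, which mirrors how `vocab_size` bounds the vocabulary in the notebook.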

## Text Vectorization Function


The following code defines a function named `vectorize_text` that converts the raw text into integer sequences using the previously adapted `TextVectorization` layer.

- Function Purpose:

  - The vectorize_text function takes two inputs: text and label. It processes the text using the vectorize_layer, converting the text into a sequence of integers that represent token indices from the vocabulary. The function then returns the transformed text along with its corresponding label.

- Input Parameters:

  - text: The raw text data that needs to be converted into numerical format.
  - label: The label associated with the text, which represents the category of the news article.

- Vectorization Process:

  - text = vectorize_layer(text) applies the TextVectorization layer to the input text, transforming it into a sequence of integers. This transformation ensures that each word in the text is represented by its corresponding token ID, making the data compatible with the Transformer model.

In [24]:
def vectorize_text(text, label):
    text = vectorize_layer(text)
    return text, label

## Creating Vectorized Datasets


The following code creates vectorized versions of the training and test datasets by applying the `vectorize_text` function to each sample in the datasets.

- Vectorized Training Dataset:

  - train_ds = train_data.map(vectorize_text) applies the vectorize_text function to each entry in the training dataset.
  - The map operation ensures that each text sample in the training dataset is converted into its corresponding integer sequence, as defined by the TextVectorization layer.
  - The labels remain unchanged, allowing the model to train on the vectorized text data paired with their respective categories.

- Vectorized Test Dataset:

  - test_ds = test_data.map(vectorize_text) performs a similar operation on the test dataset, ensuring that the text samples in the test set are also transformed into integer sequences.
  - This step guarantees that both the training and test datasets are in the same numerical format, enabling consistent performance evaluation.

In [25]:
# Create vectorized datasets
train_ds = train_data.map(vectorize_text)
test_ds = test_data.map(vectorize_text)

## Configuring the Dataset for Performance


The following code optimizes the performance of the training and test datasets by using caching, shuffling, batching, and prefetching techniques.

- AUTOTUNE:

  - AUTOTUNE = tf.data.AUTOTUNE is a special value that lets tf.data choose a setting at runtime. Here it is passed to prefetch(), so TensorFlow picks the prefetch buffer size dynamically during training and evaluation.

- Batch Size (BATCH_SIZE):

  - BATCH_SIZE = 32 sets the number of samples that will be processed together in each step of training. Batching helps improve computation speed by processing multiple samples simultaneously.

- Dataset Optimization Techniques:

  - Caching (cache()): train_ds.cache() keeps the vectorized samples in memory once they have been read during the first epoch, so later epochs skip both loading and re-vectorizing the text.

  - Shuffling (shuffle()): .shuffle(10000) draws samples at random from a buffer of 10,000 elements, so the training data is presented in a different order in each epoch. This prevents the model from learning patterns based on the order of the data. Because the buffer is smaller than the 120,000-sample training set, the shuffle is approximate rather than a uniform permutation of the whole dataset.

  - Batching (batch()): .batch(BATCH_SIZE) groups the dataset into batches of 32 samples. Batching improves training efficiency and helps utilize computational resources more effectively.

  - Prefetching (prefetch()): .prefetch(AUTOTUNE) allows the data loading and model training to overlap. While the model is training on the current batch, the next batch is already being prepared in the background, reducing training time and improving overall throughput.

In [26]:
# Configure the dataset for performance
AUTOTUNE = tf.data.AUTOTUNE
BATCH_SIZE = 32

train_ds = train_ds.cache().shuffle(10000).batch(BATCH_SIZE).prefetch(AUTOTUNE)
test_ds = test_ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)
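
The shuffle buffer and batching steps can be imitated with two small generators. This is a simplified approximation of the idea behind `shuffle(buffer_size)` and `batch(batch_size)`, not the tf.data implementation; both function names are hypothetical. The last two lines check the step counts that appear in the training and evaluation logs further down.

```python
import math
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order randomized within a sliding buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:  # drain what is left once the input is exhausted
        yield buffer.pop(rng.randrange(len(buffer)))


def batched(items, batch_size):
    """Group an iterable into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


batches = list(batched(buffered_shuffle(range(10), buffer_size=4), batch_size=3))
print(len(batches))             # 4 batches: 3 + 3 + 3 + 1
print(math.ceil(120_000 / 32))  # 3750 training steps per epoch
print(math.ceil(7_600 / 32))    # 238 evaluation steps
```

An element can only move as far as the buffer allows, which is why a small buffer gives a weaker shuffle than a buffer covering the whole dataset.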

## Defining the Transformer Model

The following code defines a custom Keras layer, `TransformerBlock`, which is the building block of the classifier. It applies multi-head self-attention followed by a feedforward network, with dropout, a residual connection, and layer normalization after each of the two sub-layers.

- Class Initialization (__init__ method):

  - The TransformerBlock class inherits from tf.keras.layers.Layer, enabling it to function as a layer within the Keras model.
  
  - Parameters:
    - embed_dim: The dimensionality of the embedding space, representing the size of the token embeddings.
    
    - num_heads: The number of attention heads used in the multi-head attention mechanism. This allows the model to focus on different parts of the input text simultaneously.
    
    - ff_dim: The dimensionality of the feedforward network hidden layer.
    
    - rate: The dropout rate, which helps prevent overfitting during training.

- Layer Components:

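  - Multi-Head Attention (self.att): tf.keras.layers.MultiHeadAttention computes self-attention, allowing the model to weigh the importance of different tokens in the input text relative to each other. Note that key_dim is the projection size of each individual head, so key_dim=embed_dim gives every head a full embed_dim-sized query/key projection rather than splitting embed_dim across the heads.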

  - Feedforward Network (self.ffn): A sequential model that consists of two dense layers:
    - The first layer applies a ReLU activation function to introduce non-linearity.
    - The second layer projects the output back to the embedding dimension.

  - Layer Normalization (self.layernorm1 and self.layernorm2): Normalizes the output of each sub-layer, stabilizing the learning process and helping to maintain gradients during backpropagation.

  - Dropout Layers (self.dropout1 and self.dropout2): Apply dropout regularization to reduce overfitting by randomly setting a fraction of the input units to zero during training.

- Forward Pass (call method): The call method defines how the layer processes its inputs during the forward pass:
  - Attention Output: Computes self-attention scores by applying the multi-head attention mechanism to the inputs.
  
  - Dropout on Attention Output: Applies dropout to the attention output to mitigate overfitting.
  
  - Layer Normalization and Residual Connection: Adds the original inputs to the attention output and normalizes the result.
  
  - Feedforward Network Output: Passes the normalized output through the feedforward network.
  
  - Final Dropout and Layer Normalization: Applies dropout to the feedforward network output, adds it to the previous result, and normalizes it again.

In [27]:
# Define the Transformer model
class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embed_dim),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

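    def call(self, inputs, training=False):
        # Self-attention: the sequence serves as both query and key/value.
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        # Residual connection followed by layer normalization.
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        # Second residual connection and normalization.
        return self.layernorm2(out1 + ffn_output)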
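
To make the first half of the forward pass tangible, here is a deliberately stripped-down version in plain Python: single-head scaled dot-product attention, a residual connection, and layer normalization. It is a sketch under strong simplifications. There are no learned query/key/value projections, no multiple heads, no dropout, and no learned scale or offset in the normalization, and the token values are made up.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head attention with identity projections: softmax(x x^T / sqrt(d)) x."""
    d = len(x[0])
    out = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        out.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)])
    return out


def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


# Three tokens with 4-dimensional embeddings (hypothetical values).
tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -1.0, 0.5], [0.5, 0.5, 0.0, 1.0]]

attn = self_attention(tokens)
# Residual connection, then normalization, applied per token.
out1 = [layer_norm([a + b for a, b in zip(tok, att)]) for tok, att in zip(tokens, attn)]
print(len(out1), len(out1[0]))  # 3 4
```

The output keeps the input shape, which is what allows the residual additions in `TransformerBlock.call` and makes such blocks stackable.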

## Defining the Transformer Model Architecture


The following code constructs a Transformer-based model for classifying news articles into predefined categories. It incorporates an embedding layer, a custom Transformer block, and several dense layers to produce the final classification outputs.

- Model Parameters:

  - embed_dim: Specifies the size of the embedding vector for each token. Here, it is set to 32.
  - num_heads: Indicates the number of attention heads in the multi-head attention mechanism, set to 2.
  - ff_dim: Defines the size of the hidden layer in the feedforward network within the Transformer block, also set to 32.

- Input Layer:

  - tf.keras.layers.Input(shape=(max_length,)): Creates an input layer that accepts sequences of integers with a length defined by max_length. This corresponds to the preprocessed tokenized input sequences.

- Embedding Layer:

  - tf.keras.layers.Embedding(vocab_size, embed_dim): Initializes an embedding layer that maps integer token indices to dense vectors of size embed_dim. The vocab_size parameter determines the size of the embedding matrix.

- Transformer Block:

  - transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim): Instantiates the previously defined TransformerBlock, which will process the embedded inputs.
  - x = transformer_block(x): Passes the embedded input through the Transformer block, allowing the model to learn contextual relationships among the tokens.

- Global Average Pooling Layer:

  - tf.keras.layers.GlobalAveragePooling1D()(x): Averages the Transformer block's output vectors over the sequence axis, reducing the shape from (batch_size, max_length, embed_dim) to (batch_size, embed_dim) so that each article is represented by a single fixed-size vector.

- Dropout Layers:

  - tf.keras.layers.Dropout(0.1)(x): Applies dropout with a rate of 0.1 after the global pooling layer to reduce overfitting.
  
  - Another dropout layer follows the dense layer to further mitigate the risk of overfitting.

- Dense Layers:

  - tf.keras.layers.Dense(20, activation="relu")(x): A fully connected dense layer with 20 units and ReLU activation, enabling the model to learn non-linear combinations of features.
  
  - outputs = tf.keras.layers.Dense(4, activation="softmax")(x): The final output layer, which has 4 units corresponding to the number of categories in the classification task, and uses the softmax activation function to produce probability distributions over the categories.

- Model Instantiation:

  - model = tf.keras.Model(inputs=inputs, outputs=outputs): Constructs the Keras model by specifying the input and output layers, encapsulating the entire architecture.

In [28]:
# Define the model
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = tf.keras.layers.Input(shape=(max_length,))
embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
x = tf.keras.layers.Dropout(0.1)(x)
x = tf.keras.layers.Dense(20, activation="relu")(x)
x = tf.keras.layers.Dropout(0.1)(x)
outputs = tf.keras.layers.Dense(4, activation="softmax")(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
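
As a back-of-the-envelope check on the size of this model, the trainable parameters can be counted by hand. The arithmetic below assumes the default Keras behavior (biases enabled everywhere, and the attention value projection sized like the key projection); `model.summary()` remains the authoritative source.

```python
vocab_size = 10000
embed_dim, num_heads, ff_dim = 32, 2, 32
key_dim = embed_dim  # as passed to MultiHeadAttention in TransformerBlock


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


embedding = vocab_size * embed_dim
# Query, key and value projections each map embed_dim -> num_heads * key_dim.
qkv = 3 * dense_params(embed_dim, num_heads * key_dim)
attn_out = dense_params(num_heads * key_dim, embed_dim)
ffn = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)
layer_norms = 2 * (2 * embed_dim)  # one scale and one offset per feature
block = qkv + attn_out + ffn + layer_norms
head = dense_params(embed_dim, 20) + dense_params(20, 4)

print(embedding, block, head, embedding + block + head)
# 320000 10656 744 331400
```

Under these assumptions almost all of the parameters sit in the embedding table; the Transformer block itself is small.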

## Compiling the Model

The following code compiles the previously defined Transformer-based model, setting the optimizer, loss function, and evaluation metrics that will be used during training and evaluation.

- Model Compilation: model.compile(...): This method configures the model for training by specifying the optimizer, loss function, and metrics.

- Optimizer: optimizer="adam": The Adam optimizer is chosen for its adaptive learning rate capabilities, which can help the model converge more quickly and efficiently during training. It combines the benefits of two other extensions of stochastic gradient descent, namely AdaGrad and RMSProp.

- Loss Function: loss="sparse_categorical_crossentropy": This loss function is appropriate for multi-class classification tasks where the target labels are provided as integers (not one-hot encoded). It computes the cross-entropy loss between the predicted probabilities (output from the softmax layer) and the true class labels, measuring how well the predicted distribution aligns with the actual distribution.

- Metrics: metrics=["accuracy"]: Accuracy is selected as the metric to evaluate the model's performance. It calculates the proportion of correctly classified samples out of the total samples, providing a straightforward measure of the model's predictive performance.

In [29]:
# Compile the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
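
The loss named here can be reproduced by hand for a tiny batch. The sketch below implements the textbook definition, the mean of `-log(p[true_class])`, with made-up probabilities; the Keras version additionally clips probabilities for numerical safety, which is omitted here.

```python
import math


def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean of -log(p[true_class]) over a batch of probability rows."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


labels = [3, 1]                  # integer class ids, not one-hot vectors
predictions = [
    [0.05, 0.05, 0.10, 0.80],    # confident and correct
    [0.25, 0.25, 0.25, 0.25],    # uniform guess
]
loss = sparse_categorical_crossentropy(labels, predictions)
print(round(loss, 4))  # 0.8047
```

The confident correct prediction contributes about 0.22 and the uniform guess contributes log(4), about 1.39, so the loss rewards placing probability mass on the true class.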

## Training the Model

The following code trains the compiled Transformer-based model on the training dataset while evaluating its performance on the validation dataset.

- Model Training: history = model.fit(...): This method starts the training process for the model, using the provided training and validation datasets over a specified number of epochs.

- Training Dataset: train_ds: The dataset used to train the model. This dataset consists of vectorized text inputs and their corresponding labels.

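- Validation Dataset: validation_data=test_ds: The dataset used to evaluate the model's performance after each epoch of training. This helps monitor the model's ability to generalize to unseen data. Note that the test split doubles as the validation set here, so the validation metrics and the final test metrics come from the same data; an independent estimate would require holding out part of the training split for validation.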

- Epochs: epochs=5: The number of complete passes through the training dataset. In this case, the model will train for 5 epochs. Each epoch consists of 3,750 steps (120,000 samples divided by the batch size of 32), and every step runs a forward pass and a backward pass on one batch, updating the weights based on the computed gradients.

In [30]:
# Train the model
history = model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=5
)

Epoch 1/5
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 39s 8ms/step - accuracy: 0.7614 - loss: 0.5917 - val_accuracy: 0.9014 - val_loss: 0.2921
Epoch 2/5
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.9148 - loss: 0.2568 - val_accuracy: 0.9004 - val_loss: 0.2982
Epoch 3/5
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.9247 - loss: 0.2276 - val_accuracy: 0.8996 - val_loss: 0.3086
Epoch 4/5
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.9301 - loss: 0.2061 - val_accuracy: 0.8961 - val_loss: 0.3302
Epoch 5/5
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.9335 - loss: 0.1893 - val_accuracy: 0.8939 - val_loss: 0.3729


## Evaluating the Model

The following code evaluates the performance of the trained Transformer-based model on the test dataset, providing metrics such as loss and accuracy.

In [31]:
# Evaluate the model
test_loss, test_accuracy = model.evaluate(test_ds)
print(f"Test accuracy: {test_accuracy:.3f}")

238/238 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.8960 - loss: 0.3684
Test accuracy: 0.894
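
The test accuracy of 0.894 matches the validation accuracy of the last epoch, because the same test split was used for both. The training log is also worth a second look: training accuracy keeps rising while validation loss gets worse, which is a typical sign of overfitting. The short check below uses the validation numbers copied from that log.

```python
# Validation metrics copied from the five-epoch training log above.
val_loss = [0.2921, 0.2982, 0.3086, 0.3302, 0.3729]
val_accuracy = [0.9014, 0.9004, 0.8996, 0.8961, 0.8939]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
# True if validation loss increased after every single epoch.
rising = all(later > earlier for earlier, later in zip(val_loss, val_loss[1:]))

print(best_epoch)  # 1
print(rising)      # True
```

One possible response, not applied in this notebook, is to stop training early, for example with the `tf.keras.callbacks.EarlyStopping` callback monitoring `val_loss`, or to train for fewer epochs.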


## Making Predictions


The following code demonstrates how to use the trained Transformer-based model to make predictions on a set of sample texts. It involves vectorizing the texts and then obtaining predictions from the model.

In [32]:
# Make predictions on some sample texts
sample_texts = [
    "The stock market reached new highs today as tech companies reported strong earnings.",
    "The national soccer team won the World Cup after a thrilling final match.",
    "Scientists discover a new species of deep-sea creature in the Pacific Ocean.",
    "Global leaders met to discuss climate change policies and renewable energy.",
]

# Vectorize the sample texts
sample_vectorized = vectorize_layer(sample_texts)

# Make predictions
predictions = model.predict(sample_vectorized)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 708ms/step


## Printing the Classification Results

The following code snippet prints the predicted class for each sample text based on the model's predictions. It interprets the model's output and displays the corresponding news category for each article.

In [33]:
# Print the results
classes = ['World', 'Sports', 'Business', 'Sci/Tech']
for text, pred in zip(sample_texts, predictions):
    predicted_class = classes[np.argmax(pred)]
    print(f"Text: {text}")
    print(f"Predicted class: {predicted_class}\n")


Text: The stock market reached new highs today as tech companies reported strong earnings.
Predicted class: Business

Text: The national soccer team won the World Cup after a thrilling final match.
Predicted class: Sports

Text: Scientists discover a new species of deep-sea creature in the Pacific Ocean.
Predicted class: Sci/Tech

Text: Global leaders met to discuss climate change policies and renewable energy.
Predicted class: Sci/Tech
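
Printing only the arg-max class hides how confident the model was. For borderline articles, such as the last sample, which could plausibly belong to World as well, it can help to look at the top two classes with their probabilities. The sketch below uses a made-up probability vector; the numbers are hypothetical and are not output of the trained model.

```python
classes = ['World', 'Sports', 'Business', 'Sci/Tech']


def top_k(probs, k=2):
    """Return the k most probable (class name, probability) pairs."""
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical softmax output for one article; not taken from the trained model.
pred = [0.38, 0.02, 0.15, 0.45]
print(top_k(pred))  # [('Sci/Tech', 0.45), ('World', 0.38)]
```

Applied to each row of `predictions`, the same helper would show whether a predicted class won clearly or only by a small margin.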

